Written by Pete Corey on Jul 27, 2016.

Today we’re going to be tackling the Molecule to Atom
kata. The goal of this kata is to parse a string representation of a molecule into
its component elements, or atoms.

As always, the first step is to get our project set up. We’ll be using
Mocha as our test runner and Babel for ES6 support.

.babelrc

```
+{
+ "presets": ["es2015"]
+}
```

.gitignore

```
+node_modules/
```

package.json

```
+{
+ "main": "index.js",
+ "scripts": {
+ "test": "mocha ./test --compilers js:babel-register"
+ },
+ "dependencies": {
+ "babel-preset-es2015": "^6.9.0",
+ "babel-register": "^6.9.0",
+ "chai": "^3.5.0",
+ "lodash": "^4.12.0",
+ "mocha": "^2.4.5"
+ }
+}
```

test/index.js

```
+import { expect } from "chai";
+
+describe("index", function() {
+
+ it("works");
+
+});
```

The best place to start is at the beginning. We’ll begin solving this
kata by writing the most basic test we can think of. We expect an empty
string to be parsed into an empty object:

```
expect(parseMolecule("")).to.deep.equal({});
```

After writing this test, it fails. `parseMolecule`

is not defined. We
can quickly remedy this by importing `parseMolecule`

into our
test file and then exporting it from our main module.

Lastly, we need `parseMolecule`

to return an empty object. Just like
that, our tests are green.

index.js

```
+export function parseMolecule(formula) {
+ return {};
+}
```

test/index.js

```
import { expect } from "chai";
+import { parseMolecule } from "../index";
-describe("index", function() {
+describe("Molecule to Atoms", function() {
- it("works");
+ it("it parses a molecule", () => {
+ expect(parseMolecule("")).to.deep.equal({});
+ });
```

Knowing this solution is going to get complex fairly quickly, we’d
like to leverage some abstractions to make it more testable and
maintainable.

The most obvious abstraction we can think of is a `molecule`

. We create
a `molecule`

by passing in a `formula`

. From there, we can call `parse`

to get an object-based representation of the molecule.

We can rewrite our `parseMolecule`

function to use `molecule`

in a clean
way.

index.js

```
+export function molecule(formula) {
+ function parse() {
+ return {};
+ }
+
+ return {
+ parse
+ };
+}
+
export function parseMolecule(formula) {
- return {};
+ return molecule(formula).parse();
}
```

Let’s add a new test case:

```
expect(parseMolecule("H")).to.deep.equal({ H: 1 });
```

This test fails. It’s expecting `{}`

to equal `{ H: 1 }`

.

The fastest way we can get ourselves to green is to return an object
that maps the provided `formula`

to `1`

.

```
return {
[formula]: 1
};
```

After making this change, our first test fails. We now need to
explicitly handle this base case:

```
if (!formula) {
return {};
}
```

And now we’re back to green again.

index.js

```
...
function parse() {
- return {};
+ if (!formula) {
+ return {};
+ }
+ return {
+ [formula]: 1
+ };
}
```

test/index.js

```
...
expect(parseMolecule("")).to.deep.equal({});
+ expect(parseMolecule("H")).to.deep.equal({ H: 1 });
});
```

Next we’ll move onto parsing a molecule with multiple types of elements.
Let’s try to parse `"HMg"`

:

```
expect(parseMolecule("HMg")).to.deep.equal({ H: 1, Mg: 1 });
```

As expected, this test fails. `parse`

is returning `{ HMg: 1 }`

, rather
than correctly splitting `"H"`

and `"Mg"`

.

To fix this, we’ll need to implement a new method on `molecule`

that
will split a formula into a number of parts. We’ll start things off by
splitting formulas into its component elements:

```
formula.match(/[A-Z][a-z]*/g) || [];
```

The `|| []`

default is required to keep our base case of parsing `""`

happy.

After rewriting `parse`

to use this new `parts`

method to assemble the
molecule object, our tests flip back to green.

index.js

```
export function molecule(formula) {
+ function parts() {
+ return formula.match(/[A-Z][a-z]*/g) || [];
+ }
+
function parse() {
- if (!formula) {
- return {};
- }
- return {
- [formula]: 1
- };
+ return parts().reduce((result, part) => {
+ result[part] = 1;
+ return result;
+ }, {});
}
...
return {
- parse
+ parse,
+ parts
};
```

test/index.js

```
import { expect } from "chai";
-import { parseMolecule } from "../index";
+import { parseMolecule, molecule } from "../index";
...
expect(parseMolecule("H")).to.deep.equal({ H: 1 });
+ expect(parseMolecule("HMg")).to.deep.equal({ H: 1, Mg: 1 });
+ });
+
+ describe("molecule", () => {
+ it("splits a formula into parts", () => {
+ expect(molecule("HMg").parts()).to.deep.equal(["H", "Mg"]);
+ });
});
```

Things are getting more complicated. Instead of just breaking apart our
`formula`

and looking for elements, we need to look for pairs of
elements and their corresponding `count`

, or ratio in the molecule.

A nice way to do this is using regular expression match groups and ES6
pattern matching. We can define our regex, and wrap the element and
optional count in match groups:

```
let regex = /([A-Z][a-z]*)(\d*)/g;
```

As we iterate over our regex matches, we can destructure the result
into its corresponding `element`

and `count`

pair.

Finally, we refactored `parts`

to return an object with `part`

and
`count`

keys to keep things explicit. Thinking ahead, I think this might
come in handy when we need to deal with nested expressions.

index.js

```
...
function parts() {
- return formula.match(/[A-Z][a-z]*/g) || [];
+ let parts, results = [];
+ let regex = /([A-Z][a-z]*)(\d*)/g;
+ while (parts = regex.exec(formula)) {
+ let [_, part, count] = parts;
+ results.push({
+ part,
+ count: parseInt(count) || 1
+ });
+ }
+ return results;
}
...
function parse() {
- return parts().reduce((result, part) => {
- result[part] = 1;
+ return parts().reduce((result, {part, count}) => {
+ result[part] = count;
return result;
```

test/index.js

```
...
expect(parseMolecule("HMg")).to.deep.equal({ H: 1, Mg: 1 });
+ expect(parseMolecule("H2Mg")).to.deep.equal({ H: 2, Mg: 1 });
});
...
it("splits a formula into parts", () => {
- expect(molecule("HMg").parts()).to.deep.equal(["H", "Mg"]);
+ expect(molecule("HMg").parts()).to.deep.equal([
+ { part: "H", count: 1 },
+ { part: "Mg", count: 1 },
+ ]);
});
```

One thing I noticed while looking over our solution is that it
did not support multiple instances of the same element in a molecule.
For example, with `"HMgH"`

, the last `"H"`

element would override the
previous.

I added a test to confirm my suspicions, and quickly fixed the problem.
Instead of overriding the element in the `result`

object with `count`

,
increment it (if it exists) by `count`

:

```
result[part] = ~~result[part] + count;
```

The `~~`

operator is rarely seen, but it’s an incredibly useful
operator. In this case, if `result[part]`

is undefined, it will be
converted to `0`

, preventing a `NaN`

result from our addition.

index.js

```
...
return parts().reduce((result, {part, count}) => {
- result[part] = count;
+ result[part] = ~~result[part] + count;
return result;
```

test/index.js

```
...
expect(parseMolecule("H2Mg")).to.deep.equal({ H: 2, Mg: 1 });
+ expect(parseMolecule("H2MgH")).to.deep.equal({ H: 3, Mg: 1 });
});
```

Next up on our feature todo list is adding support for nested
expressions. A simple nested expressions to get us going could be
`"[H]Mg"`

. Let’s set our expectations with a test:

```
expect(parseMolecule("[H]Mg")).to.deep.equal({ H: 1, Mg: 1 });
```

Now each “part” in our `parts`

method could be either an element, or a
nested expressions within square brackets. Let’s update our regular
expression to reflect that:

```
let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g;
```

Thanks to the magic of match groups and destructing, we can assign the
nested expression (if it exists) to `square`

:

```
let [_, __, square, part, count] = parts;
```

Finally, if `square`

exists, let’s recursively process the nested
expression and append its `parts`

onto our list of `results`

.

With those changes, our tests flip back to green. Beautiful.

index.js

```
...
let parts, results = [];
- let regex = /([A-Z][a-z]*)(\d*)/g;
+ let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g;
while (parts = regex.exec(formula)) {
- let [_, part, count] = parts;
- results.push({
- part,
- count: parseInt(count) || 1
- });
+ let [_, __, square, part, count] = parts;
+ if (square) {
+ let nested = molecule(square).parts();
+ results = results.concat(nested);
+ }
+ else {
+ results.push({
+ part,
+ count: parseInt(count) || 1
+ });
+ }
}
```

test/index.js

```
...
expect(parseMolecule("H2MgH")).to.deep.equal({ H: 3, Mg: 1 });
+ expect(parseMolecule("[H]Mg")).to.deep.equal({ H: 1, Mg: 1 });
});
```

Looking forward, the next feature we’ll want to support is using ratios,
or `count`

on nested expressions. This means that we’ll need some notion
of “multiplying” molecules by some `count`

.

I imagine calling `molcule(...).multiply(2)`

would return a new
`molecule`

. This means that `multiply`

will need access to our parsed
interpretation of the `formula`

.

We could have `multiply`

call `parts`

, multiply each element by `count`

,
serialize the result back into a formula and then create and return a
new `molecule`

, but that’s a very roundabout way of doing things.

Instead, let’s define an internal `_parts`

list that’s created every
time a `molecule`

is created. The `parts`

method will simply return
`_parts`

, and `multiply`

will be able to modify `_parts`

directly.

After doing our refactoring, all of our tests are still passing. We’re
safe.

index.js

```
export function molecule(formula) {
- function parts() {
- let parts, results = [];
- let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g;
- while (parts = regex.exec(formula)) {
- let [_, __, square, part, count] = parts;
- if (square) {
- let nested = molecule(square).parts();
- results = results.concat(nested);
- }
- else {
- results.push({
- part,
- count: parseInt(count) || 1
- });
- }
+ let _parts = [];
+
+ let matches;
+ let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g;
+ while (matches = regex.exec(formula)) {
+ let [_, __, square, part, count] = matches;
+ if (square) {
+ let nested = molecule(square).parts();
+ _parts = _parts.concat(nested);
+ }
+ else {
+ _parts.push({
+ part,
+ count: parseInt(count) || 1
+ });
}
- return results;
+ }
+
+ function parts() {
+ return _parts;
}
```

With that refactor out of the way, we can easily implement `multiply`

.
To give ourselves an explicit objective, let’s write a test first:

```
expect(molecule("H").multiply(2).parse()).to.deep.equal({ H: 2 });
```

Implementing `multiply`

is straight forward. We just loop over each of
the `_parts`

of the molecule, multiplying its `count`

by the provided
`count`

. Finally, we return `this`

, so calls to `molecule`

methods can
be chained together.

Next, let’s write a test for handling ratios on nested expressions:

```
expect(parseMolecule("[HO]2Mg")).to.deep.equal({ H: 2, O: 2, Mg: 1 });
```

This test fails, as expected, but it’s not difficult to get our suite
back to green. We just need to build our `nested`

molecule, `multiply`

is by its corresponding `count`

, and finally append its `parts`

onto our
results array.

index.js

```
...
let [_, __, square, part, count] = matches;
+ count = parseInt(count) || 1;
if (square) {
- let nested = molecule(square).parts();
- _parts = _parts.concat(nested);
+ let nested = molecule(square).multiply(count);
+ _parts = _parts.concat(nested.parts());
}
...
part,
- count: parseInt(count) || 1
+ count
});
...
+ function multiply(count) {
+ _parts.forEach((part) => {
+ part.count *= count;
+ });
+ return this;
+ }
+
function parts() {
...
parse,
- parts
+ parts,
+ multiply
};
```

test/index.js

```
...
expect(parseMolecule("[H]Mg")).to.deep.equal({ H: 1, Mg: 1 });
+ expect(parseMolecule("[HO]2Mg")).to.deep.equal({ H: 2, O: 2, Mg: 1 });
});
...
});
+ it("multiplies an object", () => {
+ expect(molecule("H").multiply(2).parse()).to.deep.equal({ H: 2 });
+ });
});
```

Unfortunately, there’s a big problem with our solution. Our regex
(`\[.*\]`

) is looking for opening and closing brackets and assuming that
everything within them are a nested expression.

This assumption breaks down when we have two nested expressions in the
same formula, like `"[H]O[H]"`

. Our regex will match on the first and
very last square brackets, returning `"H]O[H"`

as the nested expression.
We can write this as a test to see the failure in action:

```
expect(molecule("[H]2O[H]").parts()).to.deep.equal([
{ part: "H", count: 2 },
{ part: "O", count: 1 },
{ part: "H", count: 1 }
]);
```

Switching to a non-greedy regex (`\[.*?\]`

) would still fail, but this
time on nested subexpressions, like `"[H[O]]"`

.

We need a better way to parse out our formula parts.

Our new solution uses a regex to match on all opening brackets, closing
brackets, and elements with a trailing count:

```
let regex = /((\[)|(\])|([A-Z][a-z]*))(\d*)/g;
```

Every time we encounter an opening bracket, we push a new formula onto
our new `stack`

:

```
if (open) {
stack.push({
formula: ""
});
}
```

A non-empty stack means that we’re collecting the pieces of a nested
expression. That means we should just append any elements and counts we
find to that sub-expression’s formula:

```
else if (stack.length) {
stack[stack.length - 1].formula += part + count;
}
```

Finally, whenever we encounter a closing bracket, we want to create a
new `molecule`

out of the nested expression we just collected, and
appends its `parts`

onto our list of `parts`

:

```
else if (close) {
let nested = molecule(stack.pop().formula).multiply(count);
_parts = _parts.concat(nested.parts());
}
```

Now we’re properly processing sub-expressions and our tests return to
green.

index.js

```
...
let matches;
- let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g;
+ let stack = [];
+ let regex = /((\[)|(\])|([A-Z][a-z]*))(\d*)/g;
while (matches = regex.exec(formula)) {
- let [_, __, square, part, count] = matches;
+ let [_, __, open, close, part, count] = matches;
count = parseInt(count) || 1;
- if (square) {
- let nested = molecule(square).multiply(count);
+ if (open) {
+ stack.push({
+ formula: ""
+ });
+ }
+ else if (close) {
+ let nested = molecule(stack.pop().formula).multiply(count);
_parts = _parts.concat(nested.parts());
}
+ else if (stack.length) {
+ stack[stack.length - 1].formula += part + count;
+ }
else {
```

test/index.js

```
...
]);
+ expect(molecule("[H]2O[H]").parts()).to.deep.equal([
+ { part: "H", count: 2 },
+ { part: "O", count: 1 },
+ { part: "H", count: 1 }
+ ]);
});
```

Last up on our feature list is the ability to use parentheses and curly
brackets in addition to square brackets as sub-expression dividers.

We can add a test for this:

```
expect(parseMolecule("K4{ON(SO3)2}2")).to.deep.equal({K: 4, O: 14, N: 2, S: 4});
```

The most straight-forward way to accomplish this is to just replace all
`"{"`

and `"("`

characters with `"["`

, and `"}"`

and `")"`

characters
with `"]"`

in our `formula`

:

```
formula.replace(/[{(]/g, "[").replace(/[})]/g, "]");
```

Unfortunately, this test also exposed another bug in our solution. it
looks like counts on sub-expressions aren’t being applied to other
nested expressions.

We can fix this by giving our `stack`

a bit more state. In addition to
building up our current expressions’s `formula`

as we go, we also want
to keep track of any nested `molecules`

we encounter, so we can multiply
them by our `count`

.

After making this change, all of our tests pass.

index.js

```
...
+ formula = formula.replace(/[{(]/g, "[").replace(/[})]/g, "]");
+
let matches;
...
stack.push({
- formula: ""
+ formula: "",
+ molecules: []
});
...
else if (close) {
- let nested = molecule(stack.pop().formula).multiply(count);
- _parts = _parts.concat(nested.parts());
+ let popped = stack.pop();
+ popped.molecules.push(molecule(popped.formula));
+ popped.molecules.forEach((molecule) => {
+ molecule.multiply(count);
+ });
+ if (!stack.length) {
+ popped.molecules.forEach((molecule) => {
+ _parts = _parts.concat(molecule.parts());
+ });
+ }
+ else {
+ let last = stack[stack.length - 1];
+ last.molecules = last.molecules.concat(popped.molecules);
+ }
}
```

test/index.js

```
...
expect(parseMolecule("[HO]2Mg")).to.deep.equal({ H: 2, O: 2, Mg: 1 });
+ expect(parseMolecule("K4{ON(SO3)2}2")).to.deep.equal({K: 4, O: 14, N: 2, S: 4});
+ expect(parseMolecule("{[Co(NH3)4(OH)2]3Co}(SO4)3")).to.deep.equal({
+ Co: 4,
+ N: 12,
+ H: 42,
+ O: 18,
+ S: 3
+ });
});
```

# Final Thoughts

This was a doozy of a kata.

All in all, there are a lot of ideas flying around here and I fear I didn’t explain myself as well as I could have. Balancing trying to keep things as concise, while staying true to the intent of explaining every change is definitely a challenge.

The limitations of regular expressions are explored in this problem was well. Initially, a totally regex based solution seemed like a good approach, but the nested structure of our data made regexes difficult to use.

When using regexes, it’s always important to keep in mind that there are entire classes of problem where they’re simply not suitable.

Finally, my final solution isn’t as clean and clear as I would have liked. By the time I landed on a working solution, I feared that more refactor commits would have muddied the water. In future problems, I need to focus on refactoring sooner and more often.