This post is inspired by a Kattis problem: Stoichiometry.
Let’s face it. Every one of us who has learnt chemistry would’ve come across a problem like this before:
The task we’re given is to add coefficients to each chemical in order to balance the equation. This process is known as stoichiometry. It’s pretty straightforward to do this qualitatively by hand, but is it possible to come up with an algorithm that can do this automatically?
This post will explain how we can use principles of linear algebra to come up with an elegant approach to tackle this problem. I should make it clear here that this post will cover more mathematics than chemistry. In particular, this post will not explain how to come up with chemical equations based on chemical processes such as redox, etc., since that on its own is a hefty topic.
Understanding the Problem
the right form
In order to come up with an algorithm to tackle this problem, we must first convert the problem into a mathematical form, so that computers can understand it.
How can we encode a molecule or compound like in mathematical form? Notice that in the problem above, we have the following elements:
Each element has a particular count attached to it in the chemical formula. In , we have 1 hydrogen atom, 1 nitrogen atom and 3 oxygen atoms. Since each element stays as its own type during a chemical reaction (barring nuclear reactions), we have categorical data, as opposed to ordinal data.
There are several structures to represent categorical data in mathematics. Two of the common examples are:
- Vectors (or Arrays)
- Generating Functions
Vectors are perhaps the most intuitive, and there are many useful tools from linear algebra that we can apply, and hence we shall use them. In order to represent each molecule/compound as a vector, we must assign each element to a dimension (standard basis vector). In machine learning, this is known as One-Hot Encoding. We will make the following definitions:
With these definitions, we can represent as the following vector:
Giving names to each coefficient, our chemical equation then becomes a mathematical equation:
We can (and should) convert the equation above into a matrix equation:
Now we can give names to each matrix and vector, and the equation simplifies to:
diving deeper
Notice that with the equation above, the solutions are not unique, and hence the problem is not well-posed. We need to enforce additional constraints in order to ensure that there is only one solution.
Firstly, our coefficients have to be nonzero natural numbers:
This is still not enough. Even with this constraint, notice that if we have solutions , then its integer multiples are also valid solutions. Essentially, we want the smallest solution, and we assume that there is only one solution. If there are multiple solutions, we can still find them with certain techniques, which I’ll explain more later.
Secondly, we have to minimise the sum of coefficients (L1 norm). Minimising either or
will do, since the matrix equation ensures that the other is also minimised.
Now, we have a well-posed problem. What we have here is known as a Homogeneous Linear Diophantine Equation. Diophantine equations in general are known to be notoriously difficult to solve, but thankfully, the linear ones can be solved in polynomial time.
Approaches
brute force enumeration
A simple approach, of course, is to try out every single possible combination of coefficients until we get it right. This approach is known as brute force enumeration. This will eventually give us the correct answer, and as long as the coefficients are small, this will only take an inconsequential amount of time.
However, the time complexity of this approach is , where:
is the highest coefficient in the balanced equation.
is the number of terms on the left-hand side of the equation.
is the number of terms on the right-hand side of the equation.
We can get this expression by seeing that this approach assigns values into bins, where there are bins in total, and the radix (base) for each bin is
.
This means that the brute force approach does not scale well with the problem at all. It effectively takes exponential time to find a solution. We can employ clever search strategies to improve the average-case complexity, but a polynomial time solution would scale much better.
pseudocode
Considerations:
- We need a successor function, which will take in a vector of coefficients and return the next guess. There are multiple valid implementations.
- We need to impose a limit on the maximum allowed coefficient to ensure that our runtime terminates.
Procedure:
- Begin with initial guess for coefficient vector: Let
.
- Check if initial coefficient vector solves
. If yes, stop.
- If no, pass the coefficient vector through successor function to get next coefficient vector.
- Check if current coefficient vector solves
or if the highest coefficient exceeds the limit. If yes, stop.
- If no, return to step 3.
python implementation
def successor(
cur_vec: np.ndarray,
min_val: int,
max_val: int
):
vec_max = max(cur_vec)
idxs = np.where(cur_vec < vec_max)[0]
inc_idx = -1 if len(idxs) == 0 else idxs[-1]
cur_vec[inc_idx+1:].fill(min_val)
overflow = False
if inc_idx == -1:
if cur_vec[-1] < max_val:
cur_vec[-1] = vec_max + 1
else:
overflow = True
else:
cur_vec[inc_idx] += 1
return cur_vec, overflow
def bruteforce_stoichiometry(
U: np.ndarray,
V: np.ndarray
):
bins = U.shape[1] + V.shape[1]
coefs = np.ones(bins, dtype=int)
is_satisfied = all((U@coefs[:U.shape[1]] - V@coefs[U.shape[1]:]) == 0)
while not is_satisfied:
coefs, overflow = successor(coefs, 1, 10)
if overflow:
return False
is_satisfied = all((U@coefs[:U.shape[1]] - V@coefs[U.shape[1]:]) == 0)
return coefs
result
The brute force method could not find a solution to the above problem even after 1 minute, so I tested it on a simpler problem:
Invocation:
U = np.asarray([[1,4,0],[0,0,2]]).T
V = np.asarray([[1,0,2],[0,2,1]]).T
%timeit bruteforce_stoichiometry(U, V)
print(bruteforce_stoichiometry(U, V))
Output:
10000 loops, best of 5: 70.5 µs per loop
[1 2 1 2]
linear algebra
Finally, the cream of the crop. Let’s first focus on this matrix equation which we derived earlier:
This equation on its own is not very helpful, as it tells us how to find if we have
or vice versa, but doesn’t help us find either vector on its own. We need to manipulate the equation a little bit.
Let’s first assume that the columns of and
are independent. This means that none of the reactants can be converted into other reactants, and vice versa for the products. This could possibly limit the scope of possible problems this technique can be applied on.
Also, without loss of generality, we will assume that and
are rectangular matrices, and hence not invertible. Notice that since this is an equality, that means that:
lies in the column space of
. We do not lose information by left-multiplying both sides by
.
lies in the column space of
.We do not lose information by left-multiplying both sides by
.
We can left-multiply both sides of the equation by and
respectively to obtain 2 equations:
Since and
are full-rank square matrices, they are invertible. We can rearrange the equations:
Now we can make the following substitutions for simplicity:
We do not actually need to invert the matrices and
to solve for
and
. We can just use Cholesky Decomposition. With these substitutions, the equations become greatly simplified:
Now, we shall substitute these equations into each other:
Rearranging:
These equations are very helpful, as they show that:
is in the null space of
.
is in the null space of
.
Since we only need either or
(and not both), we can solve for the lower-dimensional vector first, and get the other via matrix multiplication. If there are no degenerate solutions (multiple solutions that satisfy the equation), there will only be one null-space vector. If there are multiple solutions, then we will have multiple basis vectors for the null space.
However, after solving for both vectors, we still have an issue; we have to normalise them to become integers.
Certain programming languages support performing computation on fractions, in which case it would be trivial to convert them into integers; just multiply all of them by the lowest common denominator. However, if the language you’re using doesn’t support fractions, you can find a solution as follows:
- Normalise the smallest coefficient to 1 by dividing all values by the smallest coefficient.
- Check if all coefficients are integers within machine precision. If yes, stop.
- If no, normalise the smallest coefficient to its current value + 1, and return to step 2.
This will eventually yield a valid solution. The time-complexity of this approach is , mainly incurred when solving for
and
. Indeed, it solves the problem in polynomial time.
pseudocode
- Convert each molecule/compound into a vector via one-hot encoding.
- Construct matrices
and
.
- Solve for
and
.
- Solve for the null space of
or
, depending on which matrix has lower rank.
- If language supports fractions, divide by lowest common denominator and terminate.
- If not, normalise the smallest coefficient to 1 by dividing all values by the smallest coefficient.
- Check if all coefficients are integers within machine precision. If yes, stop.
- If no, normalise the smallest coefficient to its current value + 1, and return to step 7.
python implementation
import numpy as np
from scipy.linalg import null_space
from math import isclose
def solve_stoichiometry(
U: np.ndarray,
V: np.ndarray
):
Su = np.linalg.lstsq(U.T@U, U.T@V, rcond=-1)[0]
Sv = np.linalg.lstsq(V.T@V, V.T@U, rcond=-1)[0]
if Su.shape[1] >= Su.shape[0]:
SuSv = Su@Sv
a = null_space(SuSv-np.identity(SuSv.shape[0]))
b = Sv@a
else:
SvSu = Sv@Su
b = null_space(SvSu-np.identity(SvSu.shape[0]))
a = Su@b
ab = np.concatenate([a.flatten(),b.flatten()])
min = ab[np.argmin(np.abs(ab))]
ab /= min
for multiplier in range(1, 101):
if all([isclose(val, np.round(val)) for val in ab*multiplier]):
return np.round(ab*multiplier).astype(np.int32)
result
Invocation:
U = np.asarray([[1,0,0,0],[0,1,1,3]]).T
V = np.asarray([[1,2,0,4],[0,0,1,2],[0,2,0,1]]).T
%timeit solve_stoichiometry(U, V)
print(solve_stoichiometry(U, V))
Output:
The slowest run took 6.37 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 212 µs per loop
[1 6 1 6 2]
Worst case was less than 1.35 ms.
solution
comparison
| Equation | Brute Force | Linear Algebra |
(Combustion of Methane) | 70.9 µs | 202 µs |
(Photosynthesis) | > 1 min | 200 µs |
(Low-Temperature Oxidation of Sulfur) | > 1 min | 212 µs |
(Production of Basic Beryllium Acetate) | > 1 min | 213 µs |
Conclusion
In the first comparison example (Combustion of Methane), we see that the brute force algorithm is faster than the linear algebra algorithm if the coefficients are small. Unfortunately, in the other cases, the brute force algorithm is unable to solve them even after 1 minute, proving the point that it scales horribly with the highest coefficient. The linear algebra algorithm, on the other hand, is fast and scales very well.
Interestingly, in the last comparison example (Production of Basic Beryllium Acetate), we also see that the linear algebra technique works even when certain chemical groups are assigned unique symbols (acetyl group in this example).
So there we have it; an approach that balances chemical equations in polynomial time, using linear algebra. Since I’m not a chemist myself, I’m not aware of any constraints that might arise from assuming that all one-hot encoded vectors are independent. If you have experience with this, do share what you know!
