This post is inspired by a Kattis problem: Stoichiometry.

Let’s face it. Every one of us who has learnt chemistry would’ve come across a problem like this before:

$\text{S} + \text{HNO}_3 \rightarrow \text{H}_2\text{SO}_4 + \text{NO}_2 + \text{H}_2\text{O}$

The task we’re given is to add coefficients to each chemical in order to balance the equation. This process is known as stoichiometry. It’s pretty straightforward to do this qualitatively by hand, but is it possible to come up with an algorithm that can do this automatically?

This post will explain how we can use principles of linear algebra to come up with an elegant approach to tackle this problem. I should make it clear here that this post will cover more mathematics than chemistry. In particular, this post will not explain how to come up with chemical equations based on chemical processes such as redox, etc., since that on its own is a hefty topic.

Understanding the Problem

the right form

In order to come up with an algorithm to tackle this problem, we must first convert the problem into a mathematical form, so that computers can understand it.

How can we encode a molecule or compound like $\text{HNO}_3$ in mathematical form? Notice that in the problem above, we have the following elements:

$\text{S,H,N,O}$

Each element has a particular count attached to it in the chemical formula. In $\text{HNO}_3$ , we have 1 hydrogen atom, 1 nitrogen atom and 3 oxygen atoms. Since each element stays as its own type during a chemical reaction (barring nuclear reactions), we have categorical data, as opposed to ordinal data.

There are several structures to represent categorical data in mathematics. Two of the common examples are:

Vectors (or Arrays)
Generating Functions

Vectors are perhaps the most intuitive, and there are many useful tools from linear algebra that we can apply, and hence we shall use them. In order to represent each molecule/compound as a vector, we must assign each element to a dimension (standard basis vector). In machine learning, this is known as One-Hot Encoding. We will make the following definitions:

$\text{S} = \textbf{e}_1\\ \text{H} = \textbf{e}_2\\ \text{N} = \textbf{e}_3\\ \text{O} = \textbf{e}_4$

With these definitions, we can represent $\text{HNO}_3$ as the following vector:

$\begin{pmatrix} 0\\1\\1\\3 \end{pmatrix}$

Giving names to each coefficient, our chemical equation then becomes a mathematical equation:

$a_1\begin{pmatrix} 1\\0\\0\\0 \end{pmatrix} + a_2\begin{pmatrix} 0\\1\\1\\3 \end{pmatrix} = b_1\begin{pmatrix} 1\\2\\0\\4 \end{pmatrix} + b_2\begin{pmatrix} 0\\0\\1\\2 \end{pmatrix} + b_3\begin{pmatrix} 0\\2\\0\\1 \end{pmatrix}$

We can (and should) convert the equation above into a matrix equation:

$\begin{bmatrix} 1&0\\0&1\\0&1\\0&3 \end{bmatrix} \begin{pmatrix} a_1\\a_2\end{pmatrix} = \begin{bmatrix} 1&0&0\\2&0&2\\0&1&0\\4&2&1 \end{bmatrix} \begin{pmatrix} b_1\\b_2\\b_3\end{pmatrix}$

Now we can give names to each matrix and vector, and the equation simplifies to:

$\textbf{Ua}=\textbf{Vb}$

diving deeper

Notice that with the equation above, the solutions are not unique, and hence the problem is not well-posed. We need to enforce additional constraints in order to ensure that there is only one solution.

Firstly, our coefficients have to be nonzero natural numbers:

$\textbf{a} \in\mathbb{N}_1^m,\textbf{b} \in\mathbb{N}_1^n$

This is still not enough. Even with this constraint, notice that if we have solutions $\left\{\textbf{a}^*, \textbf{b}^*\right\}$ , then its integer multiples are also valid solutions. Essentially, we want the smallest solution, and we assume that there is only one solution. If there are multiple solutions, we can still find them with certain techniques, which I’ll explain more later.

Secondly, we have to minimise the sum of coefficients (L1 norm). Minimising either $\textbf{a}$ or $\textbf{b}$ will do, since the matrix equation ensures that the other is also minimised.

$\textbf{a} = \text{argmin}\left\Vert\textbf{a}\right\Vert_1$

Now, we have a well-posed problem. What we have here is known as a Homogeneous Linear Diophantine Equation. Diophantine equations in general are known to be notoriously difficult to solve, but thankfully, the linear ones can be solved in polynomial time.

Approaches

brute force enumeration

A simple approach, of course, is to try out every single possible combination of coefficients until we get it right. This approach is known as brute force enumeration. This will eventually give us the correct answer, and as long as the coefficients are small, this will only take an inconsequential amount of time.

However, the time complexity of this approach is $O(k^{m+n})$ , where:

$k$ is the highest coefficient in the balanced equation.
$m$ is the number of terms on the left-hand side of the equation.
$n$ is the number of terms on the right-hand side of the equation.

We can get this expression by seeing that this approach assigns values into bins, where there are $m+n$ bins in total, and the radix (base) for each bin is $k$ .

This means that the brute force approach does not scale well with the problem at all. It effectively takes exponential time to find a solution. We can employ clever search strategies to improve the average-case complexity, but a polynomial time solution would scale much better.

pseudocode

Considerations:

We need a successor function, which will take in a vector of coefficients and return the next guess. There are multiple valid implementations.
We need to impose a limit on the maximum allowed coefficient to ensure that our runtime terminates.

Procedure:

Begin with initial guess for coefficient vector: Let $\textbf{a}=1,\textbf{b}=1$ .
Check if initial coefficient vector solves $\textbf{Ua}=\textbf{Vb}$ . If yes, stop.
If no, pass the coefficient vector through successor function to get next coefficient vector.
Check if current coefficient vector solves $\textbf{Ua}=\textbf{Vb}$ or if the highest coefficient exceeds the limit. If yes, stop.
If no, return to step 3.

python implementation

def successor(
  cur_vec: np.ndarray,
  min_val: int,
  max_val: int
):
  vec_max = max(cur_vec)
  idxs = np.where(cur_vec < vec_max)[0]
  inc_idx = -1 if len(idxs) == 0 else idxs[-1]
  cur_vec[inc_idx+1:].fill(min_val)
  overflow = False
  if inc_idx == -1:
    if cur_vec[-1] < max_val:
      cur_vec[-1] = vec_max + 1
    else:
      overflow = True
  else:
    cur_vec[inc_idx] += 1
  return cur_vec, overflow

def bruteforce_stoichiometry(
  U: np.ndarray,
  V: np.ndarray
):
  bins = U.shape[1] + V.shape[1]
  coefs = np.ones(bins, dtype=int)
  is_satisfied = all((U@coefs[:U.shape[1]] - V@coefs[U.shape[1]:]) == 0)
  while not is_satisfied:
    coefs, overflow = successor(coefs, 1, 10)
    if overflow:
      return False
    is_satisfied = all((U@coefs[:U.shape[1]] - V@coefs[U.shape[1]:]) == 0)
  return coefs

result

The brute force method could not find a solution to the above problem even after 1 minute, so I tested it on a simpler problem:

$\text{CH}_4 + \text{O}_2 \rightarrow \text{CO}_2 + \text{H}_2\text{O}$

Invocation:

U = np.asarray([[1,4,0],[0,0,2]]).T
V = np.asarray([[1,0,2],[0,2,1]]).T

%timeit bruteforce_stoichiometry(U, V)
print(bruteforce_stoichiometry(U, V))

Output:

10000 loops, best of 5: 70.5 µs per loop
[1 2 1 2]

linear algebra

Finally, the cream of the crop. Let’s first focus on this matrix equation which we derived earlier:

$\textbf{Ua}=\textbf{Vb}$

This equation on its own is not very helpful, as it tells us how to find $\textbf{a}$ if we have $\textbf{b}$ or vice versa, but doesn’t help us find either vector on its own. We need to manipulate the equation a little bit.

Let’s first assume that the columns of $\textbf{U}$ and $\textbf{V}$ are independent. This means that none of the reactants can be converted into other reactants, and vice versa for the products. This could possibly limit the scope of possible problems this technique can be applied on.

Also, without loss of generality, we will assume that $\textbf{U}$ and $\textbf{V}$ are rectangular matrices, and hence not invertible. Notice that since this is an equality, that means that:

$\textbf{Vb}$ lies in the column space of $\textbf{U}$ . We do not lose information by left-multiplying both sides by $\textbf{U}^T$ .
$\textbf{Ua}$ lies in the column space of $\textbf{V}$ .We do not lose information by left-multiplying both sides by $\textbf{V}^T$ .

We can left-multiply both sides of the equation by $\textbf{U}^T$ and $\textbf{V}^T$ respectively to obtain 2 equations:

$\textbf{U}^T\textbf{Ua}=\textbf{U}^T\textbf{Vb}$

$\textbf{V}^T\textbf{Ua}=\textbf{V}^T\textbf{Vb}$

Since $\textbf{U}^T\textbf{U}$ and $\textbf{V}^T\textbf{V}$ are full-rank square matrices, they are invertible. We can rearrange the equations:

$\textbf{a}=\left(\textbf{U}^T\textbf{U}\right)^{-1}\textbf{U}^T\textbf{Vb}$

$\left(\textbf{V}^T\textbf{V}\right)^{-1}\textbf{V}^T\textbf{Ua}=\textbf{b}$

Now we can make the following substitutions for simplicity:

$\textbf{S}_U=\left(\textbf{U}^T\textbf{U}\right)^{-1}\textbf{U}^T\textbf{V}$

$\textbf{S}_V=\left(\textbf{V}^T\textbf{V}\right)^{-1}\textbf{V}^T\textbf{U}$

We do not actually need to invert the matrices $\textbf{U}^T\textbf{U}$ and $\textbf{V}^T\textbf{V}$ to solve for $\textbf{S}_U$ and $\textbf{S}_V$ . We can just use Cholesky Decomposition. With these substitutions, the equations become greatly simplified:

$\textbf{a}=\textbf{S}_U\textbf{b}$

$\textbf{S}_V\textbf{a}=\textbf{b}$

Now, we shall substitute these equations into each other:

$\textbf{a}=\textbf{S}_U\textbf{S}_V\textbf{a}$

$\textbf{S}_V\textbf{S}_U\textbf{b}=\textbf{b}$

Rearranging:

$\left(\textbf{S}_U\textbf{S}_V-\textbf{I}\right)\textbf{a}=\textbf{0}$

$\left(\textbf{S}_V\textbf{S}_U-\textbf{I}\right)\textbf{b}=\textbf{0}$

These equations are very helpful, as they show that:

$\textbf{a}$ is in the null space of $\textbf{S}_U\textbf{S}_V-\textbf{I}$ .
$\textbf{b}$ is in the null space of $\textbf{S}_V\textbf{S}_U-\textbf{I}$ .

Since we only need either $\textbf{a}$ or $\textbf{b}$ (and not both), we can solve for the lower-dimensional vector first, and get the other via matrix multiplication. If there are no degenerate solutions (multiple solutions that satisfy the equation), there will only be one null-space vector. If there are multiple solutions, then we will have multiple basis vectors for the null space.

However, after solving for both vectors, we still have an issue; we have to normalise them to become integers.

Certain programming languages support performing computation on fractions, in which case it would be trivial to convert them into integers; just multiply all of them by the lowest common denominator. However, if the language you’re using doesn’t support fractions, you can find a solution as follows:

Normalise the smallest coefficient to 1 by dividing all values by the smallest coefficient.
Check if all coefficients are integers within machine precision. If yes, stop.
If no, normalise the smallest coefficient to its current value + 1, and return to step 2.

This will eventually yield a valid solution. The time-complexity of this approach is $O(m^3+n^3+k(m+n))$ , mainly incurred when solving for $\textbf{S}_U$ and $\textbf{S}_V$ . Indeed, it solves the problem in polynomial time.

pseudocode

Convert each molecule/compound into a vector via one-hot encoding.
Construct matrices $\textbf{U}$ and $\textbf{V}$ .
Solve for $\textbf{S}_U$ and $\textbf{S}_V$ .
Solve for the null space of $\textbf{S}_U\textbf{S}_V-\textbf{I}$ or $\textbf{S}_V\textbf{S}_U-\textbf{I}$ , depending on which matrix has lower rank.
If language supports fractions, divide by lowest common denominator and terminate.
If not, normalise the smallest coefficient to 1 by dividing all values by the smallest coefficient.
Check if all coefficients are integers within machine precision. If yes, stop.
If no, normalise the smallest coefficient to its current value + 1, and return to step 7.

python implementation

import numpy as np
from scipy.linalg import null_space
from math import isclose

def solve_stoichiometry(
  U: np.ndarray,
  V: np.ndarray
):
  Su = np.linalg.lstsq(U.T@U, U.T@V, rcond=-1)[0]
  Sv = np.linalg.lstsq(V.T@V, V.T@U, rcond=-1)[0]

  if Su.shape[1] >= Su.shape[0]:
    SuSv = Su@Sv
    a = null_space(SuSv-np.identity(SuSv.shape[0]))
    b = Sv@a
  else:
    SvSu = Sv@Su
    b = null_space(SvSu-np.identity(SvSu.shape[0]))
    a = Su@b
  ab = np.concatenate([a.flatten(),b.flatten()])
  min = ab[np.argmin(np.abs(ab))]
  ab /= min

  for multiplier in range(1, 101):
    if all([isclose(val, np.round(val)) for val in ab*multiplier]):
      return np.round(ab*multiplier).astype(np.int32)

result

Invocation:

U = np.asarray([[1,0,0,0],[0,1,1,3]]).T
V = np.asarray([[1,2,0,4],[0,0,1,2],[0,2,0,1]]).T

%timeit solve_stoichiometry(U, V)
print(solve_stoichiometry(U, V))

Output:

The slowest run took 6.37 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 212 µs per loop
[1 6 1 6 2]

Worst case was less than 1.35 ms.

solution

$\text{S} + 6\text{HNO}_3 \rightarrow \text{H}_2\text{SO}_4 + 6\text{NO}_2 + 2\text{H}_2\text{O}$

comparison

\text{CH}_4 + 2\text{O}_2 \rightarrow \text{CO}_2 + 2\text{H}_2\text{O}

6\text{H}_2\text{O} + 6\text{CO}_2 \rightarrow 6\text{O}_2 + \text{C}_6\text{H}_{12}\text{O}_6

Conclusion

In the first comparison example (Combustion of Methane), we see that the brute force algorithm is faster than the linear algebra algorithm if the coefficients are small. Unfortunately, in the other cases, the brute force algorithm is unable to solve them even after 1 minute, proving the point that it scales horribly with the highest coefficient. The linear algebra algorithm, on the other hand, is fast and scales very well.

Interestingly, in the last comparison example (Production of Basic Beryllium Acetate), we also see that the linear algebra technique works even when certain chemical groups are assigned unique symbols (acetyl group in this example).

So there we have it; an approach that balances chemical equations in polynomial time, using linear algebra. Since I’m not a chemist myself, I’m not aware of any constraints that might arise from assuming that all one-hot encoded vectors are independent. If you have experience with this, do share what you know!

Equation	Brute Force	Linear Algebra
$\text{CH}_4 + 2\text{O}_2 \rightarrow \text{CO}_2 + 2\text{H}_2\text{O}$ (Combustion of Methane)	70.9 µs	202 µs
$6\text{H}_2\text{O} + 6\text{CO}_2 \rightarrow 6\text{O}_2 + \text{C}_6\text{H}_{12}\text{O}_6$ (Photosynthesis)	> 1 min	200 µs
$\text{S} + 6\text{HNO}_3 \rightarrow \text{H}_2\text{SO}_4 + 6\text{NO}_2 + 2\text{H}_2\text{O}$ (Low-Temperature Oxidation of Sulfur)	> 1 min	212 µs
$2\text{Be}_2\text{CO}_3\text{(OH)}_2 + 6\text{AcOH} \rightarrow \text{Be}_4\text{O(AcO)}_6 + 5\text{H}_2\text{O}+ 2\text{CO}_2$ (Production of Basic Beryllium Acetate)	> 1 min	213 µs

Balancing Chemical Equations with Linear Algebra

Understanding the Problem

the right form

diving deeper

Approaches

brute force enumeration

pseudocode

python implementation

result

linear algebra

pseudocode

python implementation

result

solution

comparison

Conclusion

Leave a comment Cancel reply

Understanding the Problem

the right form

diving deeper

Approaches

brute force enumeration

pseudocode

python implementation

result

linear algebra

pseudocode

python implementation

result

solution

comparison

Conclusion

Share this:

Related

Leave a comment Cancel reply