## Blog: Attending to Mathematical Language with Transformers

**Mathematical expressions were generated, evaluated and used to train neural network models based on the transformer architecture. The expressions and their targets were analyzed as a character-level sequence transduction task in which the encoder and decoder are built on attention mechanisms. Three models were trained to understand and evaluate symbolic variables and expressions in mathematics: (1) the self-attentive and feed-forward transformer without recurrence or convolution, (2) the universal transformer with recurrence, and (3) the adaptive universal transformer with recurrence and adaptive computation time. The models respectively achieved test accuracies as high as 76.1%, 78.8% and 83.9% in evaluating the expressions to match the target values. For the cases inferred incorrectly, the results were very close to the targets. The models notably learned to add, subtract and multiply both positive and negative decimal numbers of variable digits assigned to symbolic variables.**

**Keywords:** attention, seq2seq, transformers, natural language understanding, neural networks

Arithmetic and algebra are important mathematical skills that should be acquired by adolescence [1]. We should therefore expect an artificially intelligent agent or system to at least master such problems without predetermined algorithms. Arithmetic is the study of numbers and of how operators such as addition (**+**), subtraction (**−**), multiplication (**∗**), and division (**÷**) act on them. Algebra at the basic level is the study of mathematical symbols and the rules governing how such symbols are manipulated. A mathematical expression is a phrase constructed from a finite arrangement of numbers, symbols and operators according to the rules of mathematics. Such rules are typically pre-programmed into computers and executed with ideally perfect accuracy. Here we describe neural network models trained to read mathematical phrases at the character level and evaluate the expressions for a result without any pre-programmed or hard-coded math rules.

Prior studies related to this work have used multilayer perceptrons [2], recurrent neural networks (RNN) [3,4], long short-term memory (LSTM) [5,6], Neural GPUs [7–9] and transformers [10,11]. These studies were mostly restricted to addition of integers with the same number of digits and did not involve symbolic variables or expressions. The one study that did involve mathematical expressions sought to discover efficient mathematical identities [3]. For the studies that considered multiplication, accuracy on the multiplication tasks was either not explicitly reported [5,11] or involved binary representations [7–9]. In this work, we report results for directly evaluating mathematical expressions involving addition, subtraction and multiplication of both positive and negative decimal numbers with variable digits assigned to symbolic variables. The end-to-end process described below does not include any curriculum training or intermediary non-decimal representations.

Example input: x=85,y=-523,x∗y

Example target: -44455

The training and test data are generated by assigning positive or negative decimal integers to symbolic variables and then describing the algebraic operation to perform. Such expressions are generated as input strings, as shown in the example above. We restrict the variable assignments to the range *x, y ∈ [-1000, 1000)* and the operations to the set *{+, -, ∗}*. To ensure that the model embraces symbolic variables, the order in which *x* and *y* appear in the expression is randomly chosen. For instance, an input string that contrasts with the example above might be *y=129,x=-531,x-y*. Each input string is accompanied by its target string, which is the evaluation of the mathematical expression. For this study, all targets considered are decimal integers represented at the character level. About 12 million unique samples were thus generated and randomly split into training and test sets at an approximate ratio of 9:1, respectively.
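The generation procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function names `make_sample` and `make_dataset` are hypothetical, the ASCII `*` stands in for the multiplication symbol, and operands are chosen independently so that both *a(op)a* and *a(op)b* expressions occur.

```python
import random

# Operators allowed in the expressions (ASCII "*" is assumed for multiplication).
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def make_sample(rng):
    """Return one (input_string, target_string) pair."""
    # Integers are drawn from the half-open range [-1000, 1000).
    values = {"x": rng.randrange(-1000, 1000), "y": rng.randrange(-1000, 1000)}
    # Randomize which variable is assigned first.
    first, second = rng.sample(["x", "y"], 2)
    # Pick each operand independently, so x+x and y*x are both possible.
    left, right = rng.choice("xy"), rng.choice("xy")
    op = rng.choice(sorted(OPS))
    source = f"{first}={values[first]},{second}={values[second]},{left}{op}{right}"
    target = str(OPS[op](values[left], values[right]))
    return source, target


def make_dataset(n, seed=0, test_fraction=0.1):
    """Generate n unique samples and split them roughly 9:1 into train and test."""
    rng = random.Random(seed)
    unique = {}
    while len(unique) < n:
        source, target = make_sample(rng)
        unique[source] = target  # duplicates collapse onto the same key
    pairs = sorted(unique.items())
    rng.shuffle(pairs)
    n_test = int(len(pairs) * test_fraction)
    return pairs[n_test:], pairs[:n_test]


if __name__ == "__main__":
    train, test = make_dataset(10, seed=0)
    for source, target in train[:3]:
        print(source, "->", target)
```

Keying the samples by input string is one simple way to guarantee uniqueness before splitting, so that no test input also appears in the training set.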

Each input is read and encoded in its entirety at the character level, and each output is likewise decoded at the character level. Only after training do the models come to interpret the meaning behind the character sequences. One can imagine that a different character mapping could be used to obfuscate the meaning assigned by mathematical practice while still allowing the models described here to be trained to capture the relationships between the individual characters (Table 1). Mapping such results back to the representations familiar in mathematical practice would yield the same results (Table 2).
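To make the character-level view concrete, the sketch below encodes a string as integer ids and applies a random one-to-one relabeling of the characters, in the spirit of the obfuscation argument above. The alphabet, the reserved ids for padding and end-of-sequence, and all function names are assumptions made for illustration; they are not taken from the study's code.

```python
import random

# Assumed character set covering inputs and targets (ASCII "*" for multiplication).
ALPHABET = "0123456789-+*=,xy"


def build_vocab(alphabet):
    """Map each character to an integer id; ids 0 and 1 are reserved (assumed) for PAD and EOS."""
    return {ch: i + 2 for i, ch in enumerate(alphabet)}


def encode(text, vocab):
    """Convert a string to a list of ids and append the EOS id."""
    return [vocab[ch] for ch in text] + [1]


def decode(ids, vocab):
    """Convert ids back to a string, skipping PAD and EOS."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids if i > 1)


def obfuscate(alphabet, seed=0):
    """Return a random one-to-one relabeling of the alphabet and its inverse."""
    shuffled = list(alphabet)
    random.Random(seed).shuffle(shuffled)
    forward = dict(zip(alphabet, shuffled))
    backward = {new: old for old, new in forward.items()}
    return forward, backward


def relabel(text, mapping):
    """Apply a character-to-character mapping to every character of a string."""
    return "".join(mapping[ch] for ch in text)


if __name__ == "__main__":
    vocab = build_vocab(ALPHABET)
    forward, backward = obfuscate(ALPHABET, seed=0)
    scrambled = relabel("x=85,y=-523,x*y", forward)
    print(scrambled, encode(scrambled, vocab))
    print(relabel(scrambled, backward))  # recovers the original string
```

Because the relabeling is a bijection, a model trained on scrambled strings faces a task with the same structure, and mapping its outputs back through the inverse recovers the familiar notation.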

The input-target pairs were first used to train a self-attentive and feed-forward transformer without recurrence or convolution, in a manner similar to the base model previously reported [10]. The attention mechanism is the scaled dot-product attention

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
$$

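where *d* is the dimension (number of columns) of the input queries *Q*, keys *K*, and values *V*. For multi-head attention with *h* heads that jointly attend to different representation subspaces at different positions, given a sequence of length *m* and the matrix *H*, the result is

$$
\mathrm{MultiHead}(H) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(H W_i^{Q}, H W_i^{K}, H W_i^{V})
$$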

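where the projections are learned parameter matrices

$$
W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d \times d/h}, \qquad W^{O} \in \mathbb{R}^{d \times d}
$$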
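The scaled dot-product and multi-head attention described above can be sketched in plain Python, with matrices represented as lists of rows. This is an illustrative toy rather than the Tensor2Tensor implementation: the helper names are hypothetical, there is no masking or batching, and each head scales by its own projected width.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d = len(q[0])  # number of columns of the queries
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head(h_matrix, heads, w_o):
    """Multi-head self-attention; heads is a list of (W_q, W_k, W_v), one triple per head."""
    outputs = [
        attention(matmul(h_matrix, w_q), matmul(h_matrix, w_k), matmul(h_matrix, w_v))[0]
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the head outputs position by position, then project with W_o.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(h_matrix))]
    return matmul(concat, w_o)


if __name__ == "__main__":
    out, weights = attention([[0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]])
    print(weights, out)  # a zero query attends uniformly to both keys
```

Each row of the attention weights sums to one, so every output position is a weighted average of the value rows; attention maps such as those referenced in the figures visualize these weights.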

The same hyperparameters were used as for the standard transformer above, except for the details that follow. The transformer used in this study is smaller. The encoder consisted of two identical layers, each with two sub-layers: a multi-head self-attention layer and a fully-connected feed-forward network layer. Layer normalization was used to preprocess the sub-layer inputs. The decoder consisted of two identical layers, each with three sub-layers: a multi-head self-attention layer, a multi-head attention layer over the output of the encoder stack, and a fully-connected feed-forward network layer. Each multi-head attention layer consisted of 4 heads. Each fully-connected feed-forward network consisted of 128 neurons. A dropout rate of 0.1 was used to postprocess the output of each sub-layer before it was added to the sub-layer input by the residual connection.
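The sub-layer wiring described above (layer normalization on the way in, dropout on the way out, then the residual addition) can be sketched for a single position vector as follows. The function names are hypothetical, the learned gain and bias of layer normalization are omitted for brevity, and inverted dropout is assumed.

```python
import math
import random


def layer_norm(vec, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (learned gain and bias omitted)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def dropout(vec, rate, rng, training):
    """Inverted dropout: zero units with probability `rate` and rescale the rest."""
    if not training or rate == 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def residual_sublayer(vec, fn, rate=0.1, rng=None, training=False):
    """Compute vec + dropout(fn(layer_norm(vec))) for one position."""
    rng = rng or random.Random(0)
    transformed = dropout(fn(layer_norm(vec)), rate, rng, training)
    return [x + y for x, y in zip(vec, transformed)]


if __name__ == "__main__":
    # Stand-in for a feed-forward or attention sub-layer: double every entry.
    print(residual_sublayer([1.0, 3.0], lambda v: [2.0 * x for x in v]))
```

With this arrangement the residual path carries the unnormalized input straight through, so a sub-layer that outputs zeros leaves the representation unchanged.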

The transformer model achieved an accuracy of 76.1% on the test set. When we analyze the performance by type of expression, however, we find that the model infers symmetric *a(op)a* expressions such as *x+x*, *y-y*, and *x∗x* with perfect accuracy. Slightly less accurate were the *a+b* addition tasks, such as *x+y* and *y+x*, which had 98% accuracy. The next most challenging tasks involved *a-b* subtraction, such as *x-y* and *y-x*, which had 49% accuracy. The model struggled most with *a∗b* multiplication tasks, such as *x∗y* or *y∗x*, which had only 9% accuracy. Note that this is a single model trained to perform the different types of tasks. A summary of the results is shown in Table 3.
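A per-type breakdown like the one above can be computed by classifying each input string by its final expression and tallying exact matches. The sketch below assumes inputs of the form shown earlier (two assignments followed by a three-character expression, with ASCII `*` for multiplication); the function names are hypothetical and the sample predictions are made up for illustration, not results from the study.

```python
from collections import defaultdict


def expression_type(source):
    """Classify an input such as 'y=129,x=-531,x-y' as 'a(op)a', 'a+b', 'a-b' or 'a*b'."""
    expr = source.split(",")[-1]
    left, op, right = expr[0], expr[1], expr[2]
    if left == right:
        return "a(op)a"
    return f"a{op}b"


def accuracy_by_type(sources, targets, predictions):
    """Return the exact-match accuracy for each expression type."""
    hits, totals = defaultdict(int), defaultdict(int)
    for source, target, prediction in zip(sources, targets, predictions):
        kind = expression_type(source)
        totals[kind] += 1
        hits[kind] += int(prediction == target)
    return {kind: hits[kind] / totals[kind] for kind in totals}


if __name__ == "__main__":
    # Made-up predictions, only to show the shape of the output.
    sources = ["x=1,y=2,x+x", "x=1,y=2,x-y", "y=3,x=4,y-x", "x=2,y=5,x*y"]
    targets = ["2", "-1", "-1", "10"]
    predictions = ["2", "-1", "1", "12"]
    print(accuracy_by_type(sources, targets, predictions))
```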

The results demonstrate that the transformer can learn to interpret and evaluate symbolic variables and expressions as represented by character strings, performing addition, subtraction and multiplication of both positive and negative decimal numbers. The transformer can correctly utilize the values assigned to symbolic variables for inference. Considering the example input string *y=568,x=-867,y∗y*, the model correctly ignores the value assigned to *x* and computes *322624* as the output. The attention visualizations for the encoder’s self-attention and the decoder’s attention on the final layer of the encoder, shown in Figs. 1 and 2, respectively, illustrate that the output characters attend almost exclusively to the characters representing the assignment to *y*. Furthermore, the order in which the *x* and *y* assignments occur in the string is handled well, since the accuracy is high despite our data including random variations as mentioned above.

For the cases inferred incorrectly, the results are very close to the targets. As an example, the value produced for the input sequence *y=-440,x=687,y∗x* is *-300280*, which is very close to the actual target value of *-302280* in terms of character match accuracy at each position. Only the thousands-place character is incorrect, which is representative of our general observation that the middle positions are the most difficult to infer correctly. Interestingly, the first and last positions of the output attend primarily to the first and last positions representing the assignment to *y*, whereas the output positions in between do not exhibit such selective attention (Fig. 3 of Appendix A). This confusion could be the reason for the faulty inference of the characters in the middle of the output.
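The position-wise comparison used in this example is easy to reproduce. The short sketch below, with hypothetical function names, compares a prediction and a target character by character and treats any position where one string has run out as a mismatch.

```python
from itertools import zip_longest


def char_matches(prediction, target):
    """Return one boolean per position; a missing character counts as a mismatch."""
    return [p == t for p, t in zip_longest(prediction, target)]


def char_accuracy(prediction, target):
    """Fraction of positions at which prediction and target agree."""
    matches = char_matches(prediction, target)
    return sum(matches) / len(matches) if matches else 1.0


if __name__ == "__main__":
    matches = char_matches("-300280", "-302280")
    print(matches)  # only index 3, the thousands digit, is False
    print(char_accuracy("-300280", "-302280"))  # 6 of 7 positions agree
```

Averaging `char_matches` over many test pairs, position by position, is one way to check the observation that errors concentrate in the middle of the output.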

To improve performance on non-symmetric subtraction and multiplication expressions, the transformer can be augmented with a recurrent inductive bias, as described by prior work on universal transformers [11]. Unlike the standard transformer, the universal transformer can be computationally universal given sufficient memory. At each recurrent step *t*, the universal transformer iterates to improve its representations for the input positions in parallel with a self-attention mechanism, followed by a recurrent transformation using a depth-wise separable convolution or a position-wise fully-connected layer. The universal transformer was thus reported to achieve state-of-the-art results on translation, natural language understanding, and learning-to-execute tasks similar to this study, outperforming both LSTM RNNs and the standard transformer given the same hyperparameters.

Using the same hyperparameters and dataset described above for the standard transformer, the universal transformer achieved better results on all types of asymmetric *a(op)b* expressions, as shown in Table 3. The overall accuracy on the tasks is 78.8%. The largest improvement occurred for the *a∗b* multiplication tasks, which more than doubled in accuracy. It therefore appears that the recurrent inductive bias as implemented in the universal transformer successfully addresses some of the shortcomings of the standard transformer model when using the same hyperparameters.

Since only the *a-b* and *a∗b* tasks leave room for further improvement, we next add adaptive computation time (ACT) [12] to the universal transformer [11] in order to devote more processing resources to symbols not interpreted well by the model. For a neural network *R* with parametric state transition model *S*, output weights *W_y*, output bias *b_y*, input sequence *x_t*, state sequence *s_t*, and intermediate update number *n*, ACT can be implemented by iterating through each step *t* of the sequence as follows

$$
s_t^n = \begin{cases} S(s_{t-1}, x_t^1) & \text{if } n = 1 \\ S(s_t^{n-1}, x_t^n) & \text{otherwise} \end{cases}
\qquad
y_t^n = W_y s_t^n + b_y,
\qquad
x_t^n = x_t + \delta_{n,1}
$$

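where *δ_{n,1}* is a binary indicator of whether the input step has been incremented at update *n*. An extra sigmoidal halting unit *h* and its associated weights *W_h* and bias *b_h* are added to the network to calculate the halting probability at intermediate steps up to the total number of updates *N(t)* according to

$$
h_t^n = \sigma(W_h s_t^n + b_h),
\qquad
p_t^n = \begin{cases} R(t) & \text{if } n = N(t) \\ h_t^n & \text{otherwise} \end{cases}
$$

$$
N(t) = \min\Big\{ n' : \sum_{n=1}^{n'} h_t^n \ge 1 - \varepsilon \Big\},
\qquad
R(t) = 1 - \sum_{n=1}^{N(t)-1} h_t^n
$$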

where a small threshold *ε*=0.01 is used to allow halting after a single update and an upper bound on updates *N(t)*≤24 is imposed. The state and output sequences *s_t* and *y_t* are calculated as

$$
s_t = \sum_{n=1}^{N(t)} p_t^n s_t^n,
\qquad
y_t = \sum_{n=1}^{N(t)} p_t^n y_t^n
$$
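The halting rule can be illustrated with a small function that turns a sequence of halting-unit outputs for one step *t* into the number of updates and the weights used to average the intermediate states. This sketch follows the ACT formulation of [12] as summarized above; the function names are hypothetical, and the halting values are supplied directly rather than computed by a network.

```python
def act_weights(halting_probs, epsilon=0.01, max_updates=24):
    """Return (N, weights) for one step, given halting-unit outputs h^1, h^2, ...

    Updates stop at the first n where the cumulative halting probability reaches
    1 - epsilon, or at the update limit. The final weight is the remainder, so
    the weights always sum to one.
    """
    if not halting_probs:
        raise ValueError("need at least one halting probability")
    limit = min(max_updates, len(halting_probs))
    weights, total = [], 0.0
    for n in range(1, limit + 1):
        h = halting_probs[n - 1]
        if total + h >= 1.0 - epsilon or n == limit:
            weights.append(1.0 - total)  # remainder R(t)
            return n, weights
        weights.append(h)
        total += h


def weighted_state(states, weights):
    """Average the intermediate state vectors with the ACT weights."""
    width = len(states[0])
    return [sum(w * s[i] for w, s in zip(weights, states)) for i in range(width)]


if __name__ == "__main__":
    n_updates, weights = act_weights([0.3, 0.3, 0.5, 0.9])
    print(n_updates, weights)  # halts at the third update
    print(weighted_state([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], weights))
```

A position whose halting unit saturates immediately is updated once, while a position with small halting values keeps being refined until the cumulative probability or the update limit is reached.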

As shown in Table 3, the adaptive universal transformer improves on the *a-b* tasks almost to perfection but performs much worse on the *a∗b* tasks, producing a higher overall accuracy of 83.9%. The adaptive universal transformer may have focused only on improving the *a-b* tasks because doing so is a more attainable way of improving overall performance than improving the *a∗b* tasks.

The mathematical language understanding demonstrated in this study is foundational for an artificially intelligent agent. The framework and findings discussed should also be transferable to natural language understanding. The symbolic variable assignment is analogous to the supporting facts in the bAbI story, question and answering tasks [13]. Symmetric *a(op)a* tasks utilize only one of the supporting facts, whereas asymmetric *a(op)b* tasks utilize two. The symbolic expressions and their evaluation studied here can thus be considered a simplified version of story, question and answering tasks that can be studied and analyzed more expediently and concretely. We expect that future studies will involve more types of symbolic expressions and variables, further elucidating how to address the shortcomings of existing models to the benefit of more complex natural language understanding problems. Source code to reproduce the results of this study is available online [14].

The transformer model has been shown to work well for a myriad of applications beyond what we typically consider as sequence transduction tasks, e.g. image processing [15]. More generally, transformers can be applied to problems involving tensors as inputs and tensors as outputs, which is the motivation behind the *Tensor2Tensor* library used in this study [16]. The attention mechanism of the transformer architecture can be interpreted as a global receptive field that can analyze more than the limited receptive fields, which are often referred to as filters, in convolutional neural networks. We therefore expect that the transformer can serve as a unified model to incorporate and improve upon previous work in churn prediction [17], information retrieval [18], and collaborative filtering [19]. The customer’s history can be the story or input sequence, and the question can be whether they churn or what item they would choose from recommendations provided.

**References**

[1] Carraher, D. W., Schliemann, A. D., Brizuela, B. M., & Earnest, D. (2006). Arithmetic and algebra in early mathematics education. *Journal for Research in Mathematics Education*, 87–115.

[2] Hoshen, Y., & Peleg, S. (2016, February). Visual Learning of Arithmetic Operation. In *AAAI* (pp. 3733–3739).

[3] Zaremba, W., Kurach, K., & Fergus, R. (2014). Learning to discover efficient mathematical identities. In *Advances in Neural Information Processing Systems* (pp. 1278–1286).

[4] Mickey, K. W., & McClelland, J. L. (2014, January). A neural network model of learning mathematical equivalence. In *Proceedings of the Annual Meeting of the Cognitive Science Society* (Vol. 36, No. 36).

[5] Zaremba, W., & Sutskever, I. (2014). Learning to execute. *arXiv preprint arXiv:1410.4615*.

[6] Kalchbrenner, N., Danihelka, I., & Graves, A. (2015). Grid long short-term memory. *arXiv preprint arXiv:1507.01526*.

[7] Kaiser, Ł., & Sutskever, I. (2015). Neural GPUs learn algorithms. *arXiv preprint arXiv:1511.08228*.

[8] Price, E., Zaremba, W., & Sutskever, I. (2016). Extensions and Limitations of the Neural GPU. *arXiv preprint arXiv:1611.00736*.

[9] Freivalds, K., & Liepins, R. (2017). Improving the Neural GPU Architecture for Algorithm Learning. *arXiv preprint arXiv:1702.08727*.

[10] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In *Advances in Neural Information Processing Systems* (pp. 5998–6008).

[11] Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., & Kaiser, Ł. (2018). Universal transformers. *arXiv preprint arXiv:1807.03819*.

[12] Graves, A. (2016). Adaptive computation time for recurrent neural networks. *arXiv preprint arXiv:1603.08983*.

[13] Weston, J., Bordes, A., Chopra, S., Rush, A. M., van Merriënboer, B., Joulin, A., & Mikolov, T. (2015). Towards AI-complete question answering: A set of prerequisite toy tasks. *arXiv preprint arXiv:1502.05698*.

[14] Tensor2Tensor (December 10, 2018), Retrieved from https://github.com/artitw/tensor2tensor

[15] Tensor2Tensor (November 18, 2018), Retrieved from https://github.com/tensorflow/tensor2tensor

[16] Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez, A. N., Gouws, S., … & Sepassi, R. (2018). Tensor2tensor for neural machine translation. *arXiv preprint arXiv:1803.07416*.

[17] Wangperawong, A., Brun, C., Laudy, O., & Pavasuthipaisit, R. (2016). Churn analysis using deep convolutional neural networks and autoencoders. *arXiv preprint arXiv:1604.05377*.

[18] Wangperawong, A., Kriangchaivech, K., Lanari, A., Lam, S., & Wangperawong, P. (2018). Comparing heterogeneous entities using artificial neural networks of trainable weighted structural components and machine-learned activation functions. *arXiv preprint arXiv:1801.03143*.

[19] Liu, X., & Wangperawong, A. (2018). A Collaborative Approach to Angel and Venture Capital Investment Recommendations. *arXiv preprint arXiv:1807.09967*.

**Appendix A**

*Source: Artificial Intelligence on Medium*