### Blog: Artificial Intelligence driven portfolio management: a view of the landscape

Recent years have seen rapid growth in the application of artificial intelligence and machine learning to finance. This post is academically lighter than the previous ones and offers a high-level overview of the state of the art in artificial intelligence research that may have an impact on financial domains such as trading and portfolio management. The key takeaway should be a list of issues specific to AI applications, with possible solutions coming from the machine learning research community. Most of the references are not from the financial domain, but we will describe use cases specific to our work. In particular, this post is based on recent publications, with particular attention to the most recent AI conferences we participated in (NeurIPS-18 and AAAI-19). In this regard, readers may refer to our workshop papers [1] and [2].

At Neuri, we conduct and implement cutting-edge research on machine learning to create a real advantage in financial investment. In this post, we focus on relevant topics such as deep reinforcement learning (RL), continual learning, data augmentation, and model interpretability. The objective of this post is not to advertise our services, as we apply this research to proprietary investments, but rather to contribute to AI advancements in the financial domain. This is also not an exhaustive list of topics relevant to financial applications. The area is too vast to be covered within a single blog post, so the intention here is to highlight some key areas.

### Deep Reinforcement Learning

Reinforcement learning consists of an agent interacting with an environment in order to learn, by trial and error, an optimal policy for sequential decision-making problems. Dynamic portfolio optimization remains one of the most challenging problems in the field of finance. It is a sequential decision-making process of continuously reallocating funds across a number of different financial investment products, with the objective of maximizing return while constraining risk. Classical approaches to this problem include dynamic programming and convex optimization, which require discrete actions and thus suffer from the “curse of dimensionality”. Efforts have been made to apply RL techniques to alleviate the dimensionality issue in the portfolio optimization problem. The main idea is to train an RL agent that is rewarded if its investment decisions increase the logarithmic rate of return and is penalized otherwise.
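To make the reward signal concrete, here is a minimal sketch of a one-period log-return reward. The function name, the example weights, and the price relatives (today's price divided by yesterday's) are illustrative assumptions, and transaction costs are ignored for brevity.

```python
import math

def log_return_reward(weights, price_relatives):
    """Log of the one-period portfolio growth factor (hypothetical helper).

    weights: portfolio weights chosen by the agent, summing to 1.
    price_relatives: p_t / p_{t-1} for each asset over the period.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    growth = sum(w * y for w, y in zip(weights, price_relatives))
    return math.log(growth)

# Illustrative numbers only: three assets, one trading period.
weights = [0.5, 0.3, 0.2]
gain_reward = log_return_reward(weights, [1.02, 0.99, 1.00])  # portfolio grew
loss_reward = log_return_reward(weights, [0.97, 0.99, 1.00])  # portfolio shrank
```

The reward is positive when the portfolio value grows over the period and negative when it shrinks, and summing it over time gives the log of the cumulative growth.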

In our recent work [3], we proposed a novel model-based RL agent that incorporates stock prediction and expert demonstrations. Using real historical financial market data, we simulated trading with practical constraints and demonstrated that our proposed model is robust, profitable and risk-sensitive compared to baseline trading strategies and model-free RL agents from prior work. This framework works in both on-policy and off-policy settings and could easily be extended using recent work on model-based RL (e.g., [4], [5] and [6]).

#### Multi-agent and Imperfect Information Games

Related to this area of research, at AAAI-19 we found the work on multi-agent and imperfect information games particularly insightful, given our view of markets as a set of heterogeneous agents competing in an environment with partially observable states. A problem with multi-agent settings in finance is that non-stationarity grows with the number of agents, even though we can cluster agents into macro groups (e.g. speculators, hedgers vs arbitrageurs, or long agents vs short-selling agents). We are seeing considerable progress in dealing with non-stationary environments in deep RL for multi-agent systems, for instance in [7] and [8].

A related research area is hierarchical RL (HRL), where the objective is to identify reusable skills and learn how to combine them. It comes with several issues of its own, such as reward scale differences or reward incompatibility between different tasks. We found some insightful ideas and a good literature review in an MIT/IBM paper at AAAI-19 on knowledge transfer in multi-agent systems [9]. A possible application of this scheme is in the trading environment, where we could have a “manager and worker” framework in which a macro (higher-level) agent controls worker agents that represent different asset classes. In this setting, the manager would make the strategic allocations while the workers would tactically re-allocate between assets in the same sub-groups.
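The manager-and-worker split can be sketched as a two-level weight calculation. This is only an illustration of how the two levels of decisions would combine, not a trained HRL agent; the asset classes, tickers, and weights below are hypothetical.

```python
def combine_allocations(manager_weights, worker_weights):
    """Combine strategic (per-class) and tactical (within-class) weights.

    manager_weights: {asset_class: weight}, summing to 1.
    worker_weights: {asset_class: {asset: weight}}, each inner dict summing to 1.
    Returns the final weight of every asset in the overall portfolio.
    """
    final = {}
    for asset_class, class_weight in manager_weights.items():
        for asset, within_weight in worker_weights[asset_class].items():
            final[asset] = class_weight * within_weight
    return final

# Hypothetical outputs of a manager agent and two worker agents.
manager = {"equities": 0.6, "bonds": 0.4}
workers = {
    "equities": {"AAA": 0.5, "BBB": 0.5},
    "bonds": {"GOV": 0.75, "CORP": 0.25},
}
portfolio = combine_allocations(manager, workers)
```

Because each level outputs weights that sum to one, the combined portfolio is fully invested by construction, and each worker only ever has to reason about the assets in its own sub-group.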

Another important issue is the incompatibility of reward scales in additive reward schemes, and on this topic we liked a Georgia Tech poster on composable RL [10]. Their proposal is to deal with the “curse of dimensionality” in RL by splitting problems into smaller modules (thus reducing the size of the state space). We should also remark that applying imperfect information games to finance is hard in general, as it involves a considerable number of adversaries/agents. We quote from [11]:

> As described for example in (Bewersdorff 2005, Chapter 43) for a Poker-like game, coalitions between players are possible in 3-player games, so it is difficult to even define “optimal play”.

Future research, especially in the field of finance, could look at ways to abstract or simplify the game (e.g. separating flows into hedgers/speculators/retail as three players). One option is counterfactual regret minimization (CFR), a family of iterative algorithms that is the most popular and, in practice, the fastest approach to approximately solving large imperfect information games. CFR iteratively traverses the game tree in order to converge to a Nash equilibrium. To deal with extremely large games, CFR typically uses domain-specific heuristics to simplify the target game in a process known as abstraction. This simplified game is solved with tabular CFR, and its solution is mapped back to the full game. Interested readers may refer to [12] and [13], and to the AAAI-19 invited talk by Tuomas Sandholm on new results for solving imperfect-information games.
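The building block of CFR is regret matching, introduced in the tutorial [12]. The sketch below applies it to one-shot rock-paper-scissors against a fixed, slightly biased opponent. It is not a full CFR implementation (there is no game tree and no counterfactual weighting); the opponent strategy and the use of exact expected payoffs instead of sampled actions are assumptions made to keep the example short and deterministic.

```python
# Row player's payoff: PAYOFF[a][b] for own action a and opponent action b,
# with actions indexed as 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]

def strategy_from_regrets(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)  # no regret yet: play uniformly

def regret_matching(opponent, iterations=1000):
    """Return the average strategy after repeated play against `opponent`."""
    regrets = [0.0, 0.0, 0.0]
    strategy_sum = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        strategy = strategy_from_regrets(regrets)
        # Expected payoff of each pure action against the opponent's mix.
        action_values = [
            sum(PAYOFF[a][b] * opponent[b] for b in range(3)) for a in range(3)
        ]
        expected = sum(s * v for s, v in zip(strategy, action_values))
        for a in range(3):
            regrets[a] += action_values[a] - expected
            strategy_sum[a] += strategy[a]
    return [s / iterations for s in strategy_sum]

# Hypothetical opponent that plays rock slightly too often.
average_strategy = regret_matching(opponent=[0.4, 0.3, 0.3])
```

Against an opponent that over-plays rock, the average strategy concentrates on paper, the best response. In self-play, where both sides update their regrets, it is the average strategies that approach equilibrium in two-player zero-sum games, which is the property CFR extends to sequential games.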

#### Risk-sensitive Reinforcement Learning

Another important consideration for trading applications is the ability to mitigate risk. Risk aversion applies to situations in which a probability can be assigned to each possible outcome. In the sequential decision-making setting, the return is a random variable due to the stochastic transitions and rewards of a given Markov Decision Process (MDP). This uncertainty becomes important when there is significant stochasticity in the MDP transitions, which may lead to significant variability in the return. The natural way of dealing with it, motivated by classical studies in the financial literature, is through risk measures of the return such as variance, value-at-risk (VaR), conditional value-at-risk (CVaR), or exponential utility. Such measures capture the variability of the return or quantify the effect of rare but potentially disastrous outcomes.

Given the multi-period (temporal) nature of the problem, risk in an MDP can be classified into two types: static risk measures and dynamic risk measures. In particular, risk-sensitive MDPs consider stochastic optimization problems in which the objective involves a risk measure of the random cost, in contrast to the typical expected cost objective. Such problems are important when the decision-maker wishes to manage the variability of the cost in addition to its expected outcome, and they are standard in various applications in finance and operations research. Risk-sensitive MDPs with dynamic risk measures and ambiguity-averse MDPs are closely connected. Like classical MDPs, risk-sensitive MDPs suffer from the “curse of dimensionality” and require full knowledge of the model.
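As a concrete illustration of two of these measures, the sketch below estimates VaR and CVaR from a sample of returns. It uses one common empirical convention (VaR as an order statistic of the losses, CVaR as the mean of losses at or beyond VaR); other conventions interpolate between order statistics. The function name and the return series are illustrative assumptions.

```python
import math

def empirical_var_cvar(returns, alpha=0.95):
    """Estimate VaR and CVaR of the loss distribution at level alpha.

    Losses are negative returns. VaR is taken as the ceil(alpha * n)-th
    smallest loss; CVaR is the mean of all losses at or beyond VaR.
    """
    losses = sorted(-r for r in returns)
    n = len(losses)
    # The small tolerance guards against floating-point error in alpha * n.
    k = max(1, math.ceil(alpha * n - 1e-12))
    var = losses[k - 1]
    tail = [loss for loss in losses if loss >= var]
    cvar = sum(tail) / len(tail)
    return var, cvar

# Ten hypothetical daily returns, including one large drawdown.
daily_returns = [0.02, 0.01, -0.01, 0.03, -0.05, 0.00, 0.015, -0.02, 0.005, -0.10]
var_80, cvar_80 = empirical_var_cvar(daily_returns, alpha=0.8)
```

CVaR is never smaller than VaR at the same level, because it averages over the tail rather than reporting only where the tail begins. This is why it reacts to the single large drawdown in the sample while VaR does not.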

Risk-sensitive reinforcement learning has been proposed to alleviate this curse and is able to produce good solutions in a model-free environment. Specifically, [14] considered dynamic quantile-based risk measures and proposed a simulation-based approximate dynamic programming algorithm. A simulation-based approximate fitted value iteration for dynamic risk measures was proposed in [15]. Meanwhile, [16] studied dynamic coherent risk measures and provided actor-critic algorithms with value function approximation. Linear function approximation for variance-related criteria in MDPs was developed in [17], together with temporal difference and least-squares temporal difference algorithms for this problem class. A parametric policy gradient technique for static-CVaR minimizing reinforcement learning was proposed in [18] and [19]. In [20], a cutting plane algorithm for time-consistent multistage linear stochastic programming problems was given. Moreover, [21] studied exponential-utility functions, [22] and [23] studied mean-variance models, [24] studied policy gradient for coherent static risk measures, and [25] and [26] studied dynamic coherent risk for systems with linear dynamics.

We didn’t see much on risk-sensitive RL at AAAI-19, though we expect this topic to receive more attention at the coming conferences this year. Some of the relevant references from AAAI-19 are [27]-[30].

### Continual Learning

Continual learning (CL) is the ability of a model to learn continually from a stream of data, building on what was learned previously and hence exhibiting positive transfer, while also remembering previously seen tasks. CL comes into play when we need to make our modeling efforts more efficient across a multitude of closely related learning objectives (such as learning to trade different assets, or training different agents for different trade sides). CL has implications for both supervised and unsupervised learning. For example, when the dataset is not properly shuffled or there is a drift in the input distribution, the model overfits the recently seen data and forgets the rest, a phenomenon referred to as catastrophic forgetting, which CL systems aim to address.

Continual learning is defined in practice through a series of desiderata. According to the NeurIPS 2018 Continual Learning workshop, a non-exhaustive list includes:

- **Online learning**: learning occurs at every moment, with no fixed tasks or datasets, and no clear boundaries between tasks.
- **Presence of forward and/or backward transfer**: the model should be able to transfer from previously seen data or tasks to new ones, as well as being able to improve old tasks using information obtained from the new tasks.
- **Catastrophic forgetting resistance**: new learning should not destroy performance on previously seen data.
- **Bounded system size**: the model capacity should be fixed, forcing the system to use its capacity intelligently, as well as gracefully forgetting information in order to ensure future reward maximization.
- **No direct access to previous experience**: while the model can remember a limited amount of experience, a CL algorithm should not have direct access to past tasks or be able to rewind the environment.
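The last two desiderata (bounded size, limited memory of past experience) are often approached with a small fixed-size replay memory. The sketch below uses reservoir sampling, a standard streaming technique, so that every observation seen so far has the same chance of being retained. The class name, capacity, and stream are assumptions for illustration and are not taken from the workshop material.

```python
import random

class ReservoirBuffer:
    """Fixed-capacity memory holding a uniform sample of a data stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of observations offered so far
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Keep the new item with probability capacity / seen, overwriting a
        # uniformly chosen slot (classic reservoir sampling, "Algorithm R").
        slot = self.rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = item

# Stream 10,000 hypothetical observations through a 100-slot memory.
memory = ReservoirBuffer(capacity=100)
for step in range(10_000):
    memory.add(step)
```

The buffer never grows beyond its capacity and needs no task boundaries, so samples replayed from it mix old and recent data without giving the learner direct access to the full history.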

On a related topic, a Google DeepMind AAAI-19 paper on multi-task learning [31] showed impressive performance in the domain of games.

In particular, the authors studied the problem of learning to master not one but multiple sequential-decision tasks at once. They proposed to automatically adapt the contribution of each task to the agent’s updates so that all tasks have a similar impact on the learning dynamics. We also found some work on transfer learning (e.g., [32] and [33]) worth investigating. Potential financial applications of these are:

- Treating an asset or a trade direction as a task (a distillation of task-specific experts into a single shared model).
- Continuous retraining for live trading agents (walk-forward training/testing and model updating).
- Transferring common understanding of asset price reactions to e.g. macroeconomic features (or others, such as risk or technical indicators).
- Specifying regimes as different tasks, with a single RL agent for each regime (for regime detection techniques, we refer readers to one of our latest posts).
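The adaptive rescaling idea behind [31] can be sketched for a single task as follows. This is a simplified illustration based on our reading of the PopArt update, not the authors’ code: a linear value head predicts normalized targets, running statistics of the targets are tracked, and the head’s weights are corrected whenever the statistics change so that its unnormalized predictions stay the same. The class name, step size `beta`, and the numbers are assumptions; in the multi-task setting, each task would keep its own statistics.

```python
import math

class PopArtHead:
    """Linear value head with adaptively normalized targets (single task)."""

    def __init__(self, weights, bias=0.0, beta=0.1):
        self.w = list(weights)
        self.b = bias
        self.beta = beta  # step size of the running statistics
        self.mu = 0.0     # running mean of the targets
        self.nu = 1.0     # running second moment of the targets

    @property
    def sigma(self):
        return math.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def normalized(self, features):
        return sum(w * h for w, h in zip(self.w, features)) + self.b

    def unnormalized(self, features):
        return self.sigma * self.normalized(features) + self.mu

    def update_stats(self, target):
        """Update mu and sigma, then rescale w and b to preserve outputs."""
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
        new_sigma = self.sigma
        self.w = [w * old_sigma / new_sigma for w in self.w]
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma

# Hypothetical head and features; the large target mimics a high-reward task.
head = PopArtHead(weights=[0.5, -0.2], bias=0.1)
features = [1.0, 2.0]
value_before = head.unnormalized(features)
head.update_stats(target=100.0)
value_after = head.unnormalized(features)
```

Since gradients are taken with respect to the normalized output, tasks with very different reward magnitudes produce updates of similar size, while the correction step keeps existing value estimates from jumping whenever the statistics move.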

### Data Augmentation

Ian Goodfellow gave an invited talk at AAAI-19 on adversarial learning. He reviewed generative adversarial nets (GANs), which take a collection of training data and learn a distribution that can generate similar samples. GANs solve the generative modeling problem, but also the domain translation problem. For instance, we can translate day-time video streams into a night-time setting without paired day-night examples. Among several examples, Ian showed a GAN variant called CycleGAN [34], which converts horses to zebras. GANs can also be used to provide learned reward functions, as in SPIRAL [35]. In particular, they can generate reward functions in the appropriate input domain (robot camera percepts) and provide a useful distance measure that a robot can learn from.

The use of GANs in finance is well motivated. As mentioned in our paper [3], financial markets have limited data. Consider the case where new portfolio weights are decided by the agent on a daily basis. In such a scenario, which is not uncommon, the daily training set for a particular asset over the past 10 years contains only around 2530 samples, since there are only about 253 trading days a year. Clearly, this is an extremely small dataset that may not be sufficient for training a robust RL agent. To mitigate this issue, one can augment the dataset by using recurrent GANs to generate synthetic time series. The generated data can then be checked via statistical tests (e.g. the Kolmogorov-Smirnov (KS) test) to see whether it is representative of the true underlying distribution. Another interesting direction, in our view, is to apply GANs in the semi-supervised way that Ian talked about. In particular, rather than only discriminating between real and fake and throwing away the discriminator after training, the discriminator can be used as a classifier. Furthermore, it can be trained to distinguish between real asset 1, real asset 2, real asset 3 and, finally, fake asset.
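The KS check on generated data can be sketched as follows. The two-sample KS statistic is implemented directly so the example has no dependencies; in practice a library routine such as `scipy.stats.ks_2samp` would also provide a p-value. The “real” and “synthetic” series are Gaussian stand-ins invented for illustration, not market data or GAN output.

```python
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both empirical CDFs past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

rng = random.Random(42)
n_days = 253 * 10  # about ten years of daily observations, as in the text

# Stand-in data: "real" daily returns and two candidate synthetic sets.
real = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
synthetic_good = [rng.gauss(0.0, 0.01) for _ in range(n_days)]  # same distribution
synthetic_bad = [rng.gauss(0.0, 0.03) for _ in range(n_days)]   # too volatile

d_good = ks_statistic(real, synthetic_good)
d_bad = ks_statistic(real, synthetic_bad)
```

A small statistic means the marginal distributions are close. Note that the KS test only compares marginals, so it says nothing about whether temporal dependence such as autocorrelation or volatility clustering has been reproduced.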

There was considerable attention on GANs at the most recent AAAI; see, for instance, [36] and [37].

### Model Interpretability

Deep neural networks have achieved near-human accuracy levels in various types of classification and prediction tasks including images, text, speech, and video data. However, the networks continue to be treated mostly as black-box function approximators, mapping a given input to a classification output. The next step in this human-machine evolutionary process — incorporating these networks into mission-critical processes such as medical diagnosis, planning, and control — requires a level of trust associated with the machine output.

We are particularly interested in this topic because we wish to understand the actions (e.g. portfolio weights or trade directions) produced by an RL trading agent. Specifically, we want to understand which parts of the deep neural network contribute most to a given decision, and why. We were happy to see a considerable amount of work on this topic at this year’s AAAI. For example, the authors of [39] introduce a novel perturbation manifold and its associated influence measure to quantify the effects of various perturbations on deep neural network classifiers. In [40], it is shown that the interpretation of deep learning predictions is extremely fragile in the following sense: two visually indistinguishable inputs with the same predicted label can be assigned very different interpretations. The authors systematically characterize the fragility of several widely-used feature-importance interpretation methods (saliency maps, relevance propagation, and DeepLIFT) on ImageNet and CIFAR-10. Quality verification of machine learning systems was also highlighted during the presentation of [41]. At Neuri, we pay considerable attention to algorithmic safety, given that our production trading systems make decisions with real money and any error might be costly.

We believe it is important to highlight that interpretability and explainability are sometimes confused. Most of the research, whether based on saliency maps, feature analysis, gradients-to-inputs mapping, or mutual information between inputs and features, aims to explain the current behavior of a model. Interpretability, on the other hand, is an anthropomorphic concept. We believe explainability is what we should focus on in our domain, as we are mainly interested in understanding why a certain decision was chosen in terms of the underlying characteristics (inner workings) of the models. For instance, saliency maps do exactly this, as they are nothing but a complicated way of showing attention (not a new concept, as it was originally formulated by Christof Koch and Shimon Ullman [38] in order to explain how the visual pathways in the brain work).
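A gradient-to-input explanation can be approximated without any deep learning framework by perturbing one input at a time. The sketch below is a finite-difference stand-in for a saliency map; the toy policy, its feature names, and its coefficients are hypothetical and chosen only so that the expected ranking is obvious.

```python
import math

def finite_difference_saliency(model, inputs, eps=1e-4):
    """Approximate |d output / d input_i| by bumping each input in turn."""
    base = model(inputs)
    scores = []
    for i in range(len(inputs)):
        bumped = list(inputs)
        bumped[i] += eps
        scores.append(abs(model(bumped) - base) / eps)
    return scores

def toy_policy(features):
    """Hypothetical position signal in [-1, 1] from three input features."""
    momentum, volatility, irrelevant = features
    return math.tanh(2.0 * momentum - 0.5 * volatility + 0.0 * irrelevant)

# One illustrative observation: [momentum, volatility, irrelevant feature].
saliency = finite_difference_saliency(toy_policy, [0.1, 0.2, 0.3])
```

The scores rank momentum above volatility and assign no importance to the unused feature. As [40] warns, such local explanations can change sharply under small input changes, so they are best checked across many nearby inputs rather than read off a single point.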

### The Key Takeaway

Artificial Intelligence and machine learning research is growing at an exponential rate, with new research coming out every day. As the length of this article is limited, we can’t expand our discussion to cover all other topics important for the financial domain. Interested readers may refer to additional recent publications we found useful, see references [41]-[46].

At Neuri, we focus on RL (deep, hierarchical, risk-aware and model-based), continual learning (multi-task learning and transfer learning), data augmentation and refinement, model interpretability & generalization, imperfect information games and adversarial learning, Bayesian networks, and scalability (quite an exhaustive list we know :)). If you are an outstanding research engineer or scientist interested in applying advanced machine learning techniques to trading and investments, consider joining us in Singapore either as a full-time employee or as an intern! Start a conversation by dropping an email to sakya@neuri.ai

### Acknowledgment

Special thanks to Dr. Sakyasingha Dasgupta for his valuable input and discussions.

### About the Authors

Pengqian Yu is a *Research Scientist at Neuri Pte. Ltd.*

Ilya Kulyatin is a *Research Engineer at Neuri Pte. Ltd.*

### References

[1] D. Teng and S. Dasgupta, *Continuous Time-series Forecasting with Deep and Shallow Stochastic Processes*, in NIPS Continual Learning 2018.

[2] D. Kimura, S. Chaudhury, R. Tachibana, and S. Dasgupta, *Internal Model from Observations for Reward Shaping*, in AAAI Reinforcement Learning in Games 2019.

[3] P. Yu, J. S. Lee, I. Kulyatin, Z. Shi, and S. Dasgupta, *Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization*, arXiv preprint arXiv:1901.08740.

[4] B. Kim, L. P. Kaelbling, and T. Lozano-Pérez, *Adversarial Actor-Critic Method for Task and Motion Planning Problems Using Planning Experience*, in AAAI 2019.

[5] B. Kartal, P. Hernandez-Leal, and M. E. Taylor, *Using Monte Carlo Tree Search as a Demonstrator within Asynchronous Deep RL*, in AAAI 2019.

[6] Y. Mizutani and Y. Tsuruoka, *Learning Task-Specific Representations of Environment Models in Deep Reinforcement Learning*, in AAAI 2019.

[7] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, and S. Russell, *Robust Multi-Agent Reinforcement Learning via Minimax Deep Deterministic Policy Gradient*, in AAAI 2019.

[8] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, *Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments*, in NIPS 2017.

[9] D. K. Kim, M. Liu, S. Omidshafiei, S. Lopez-Cot, M. Riemer, G. Tesauro, M. Campbell, S. Mourad, G. Habibi, and J. P. How, *Heterogeneous Knowledge Transfer via Hierarchical Teaching in Cooperative Multiagent Reinforcement Learning*, in AAAI 2019.

[10] C. Simpkins and C. Isbell, *Composable Modular Reinforcement Learning*, in AAAI 2019.

[11] F. Bonnet, T. W. Neller, and S. Viennot, *Towards Optimal Play of Three-Player Piglet and Pig*, in AAAI 2019.

[12] T. W. Neller and M. Lanctot, *An Introduction to Counterfactual Regret Minimization*, in Proceedings of Model AI Assignments, The Fourth Symposium on Educational Advances in Artificial Intelligence (EAAI-2013).

[13] N. Brown, A. Lerer, S. Gross, and T. Sandholm, *Deep Counterfactual Regret Minimization*, in AAAI 2019.

[14] D. R. Jiang and W. B. Powell, *Risk-Averse Approximate Dynamic Programming with Quantile-Based Risk Measures*, Mathematics of Operations Research 43.2 (2017): 554–579.

[15] P. Yu, W. B. Haskell, and H. Xu, *Approximate Value Iteration for Risk-aware Markov Decision Processes*, IEEE Transactions on Automatic Control 63.9 (2018): 3135–3142.

[16] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor, *Policy Gradient for Coherent Risk Measures*, Advances in Neural Information Processing Systems, pages 1468–1476, 2015.

[17] A. Tamar, D. Di Castro, and S. Mannor, *Policy Evaluation with Variance Related Risk Criteria in Markov Decision Processes*, arXiv preprint arXiv:1301.0104, 2013.

[18] A. Tamar, Y. Glassner, and S. Mannor, *Optimizing the CVaR via Sampling*, in AAAI 2015.

[19] Y. Chow and M. Ghavamzadeh, *Algorithms for CVaR Optimization in MDPs*, Advances in Neural Information Processing Systems, 2014.

[20] T. Asamov and A. Ruszczyński, *Time-consistent Approximations of Risk-averse Multistage Stochastic Optimization Problems*, Mathematical Programming, pages 1–35, 2014.

[21] V. S. Borkar, *A Sensitivity Formula for Risk-sensitive Cost and the Actor–critic Algorithm*, Systems & Control Letters 44.5 (2001): 339–346.

[22] A. Tamar, D. Di Castro, and S. Mannor, *Policy Gradients with Variance Related Risk Criteria*, Proceedings of the Twenty-ninth International Conference on Machine Learning, 2012.

[23] L. A. Prashanth and M. Ghavamzadeh, *Actor-critic Algorithms for Risk-sensitive MDPs*, Advances in Neural Information Processing Systems, 2013.

[24] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor, *Sequential Decision Making with Coherent Risk*, IEEE Transactions on Automatic Control 62.7 (2017): 3323–3338.

[25] M. Petrik and D. Subramanian, *An Approximate Solution Method for Large Risk-averse Markov Decision Processes*, Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2012.

[26] Y. Chow and M. Pavone, *A Framework for Time-consistent, Risk-averse Model Predictive Control: Theory and Algorithms*, American Control Conference (ACC), IEEE, 2014.

[27] C. Lyle, P. S. Castro, and M. G. Bellemare, *A Comparative Analysis of Expected and Distributional Reinforcement Learning*, in AAAI 2019.

[28] G. R. Jeyakumar and B. Ravindran, *Confidence-Based Aggregation of Multi-Step Returns for Reinforcement Learning*, in AAAI 2019.

[29] R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick, *End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks*, in AAAI 2019.

[30] S. Ma and J. Y. Yu, *State-Augmentation Transformations for Risk-Sensitive Reinforcement Learning*, in AAAI 2019.

[31] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt, *Multi-task Deep Reinforcement Learning with PopArt*, in AAAI 2019.

[32] M. Riemer, T. Klinger, D. Bouneffouf, and M. Franceschini, *Scalable Recollections for Continual Lifelong Learning*, in AAAI 2019.

[33] X. Wang, L. Li, W. Ye, M. Long, and J. Wang, *Transferable Attention for Domain Adaptation*, in AAAI 2019.

[34] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, *Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks*, Proceedings of the IEEE International Conference on Computer Vision, 2017.

[35] Y. Ganin, T. Kulkarni, I. Babuschkin, S. M. A. Eslami, and O. Vinyals, *Synthesizing Programs for Images using Reinforced Adversarial Learning*, arXiv preprint arXiv:1804.01118, 2018.

[36] K. M. Yoo, Y. Shin, and S. Lee, *Data Augmentation for Spoken Language Understanding via Joint Variational Generation*, in AAAI 2019.

[37] Q. Yu and W. Lam, *Data Augmentation based on Adversarial Autoencoder Handling Imbalance for Learning to Rank*, in AAAI 2019.

[38] C. Koch and S. Ullman, *Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry*, Human Neurobiology 4:219–227 (1985).

[39] H. Shu and H. Zhu, *Sensitivity Analysis of Deep Neural Networks*, in AAAI 2019.

[40] A. Ghorbani, A. Abid, and J. Zou, *Interpretation of Neural Networks is Fragile*, in AAAI 2019.

[41] S. Chakraborty and K. S. Meel, *On Testing of Uniform Samplers*, in AAAI 2019.

[42] K. Aggarwal, S. Joty, L. Fernandez-Luque, and J. Srivastava, *Adversarial Unsupervised Representation Learning for Activity Time-Series*, in AAAI 2019.

[43] S. Chandar, C. Sankar, E. Vorontsov, S. E. Kahou, and Y. Bengio, *Towards Non-saturating Recurrent Units for Modelling Long-term Dependencies*, in AAAI 2019.

[44] A. Senderovich, J. C. Beck, A. Gal, and M. Weidlich, *Congestion Graphs for Automated Time Predictions*, in AAAI 2019.

[45] Z. A. Liao, C. Sharma, J. Cussens, and P. van Beek, *Finding All Bayesian Network Structures within a Factor of Optimal*, in AAAI 2019.

[46] A. Vergari, A. Molina, R. Peharz, Z. Ghahramani, K. Kersting, and I. Valera, *Automatic Bayesian Density Analysis*, in AAAI 2019.
