May 30, 2019
Extracting predictive power, or "alpha", from financial data is notoriously difficult. When you do find it, it often decays quickly because other data scientists and quants are mining the same signal.
This short article will share insights collected from some leading data scientists in finance to help you improve your productivity and the accuracy of your machine learning modelling.
1. Asking the right questions of data in finance
It can be very tempting to build models to forecast prices or exact values, yet this approach is extremely difficult and the utility of knowing the exact values is limited.
For example, it is often more useful to forecast whether a stock will rise significantly over the following period, since this type of forecast can be acted on directly. The forecast also tends to be more reliable because the task is easier. This example highlights how important it is to ask the right questions.
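To make the reframing concrete, here is a minimal sketch of turning a price series into a "rises significantly or not" label. The function name, the 2-period horizon and the 3% threshold are illustrative assumptions, not values from any particular study, and the prices are made up.

```python
from typing import List, Optional


def forward_return_labels(
    prices: List[float], horizon: int, threshold: float
) -> List[Optional[int]]:
    """Label each period 1 if the price rises by more than `threshold`
    over the next `horizon` periods, else 0.

    The last `horizon` periods have no future price yet, so they get None.
    """
    labels: List[Optional[int]] = []
    for i, price in enumerate(prices):
        j = i + horizon
        if j >= len(prices):
            labels.append(None)  # future not observed yet
            continue
        forward_return = prices[j] / price - 1.0
        labels.append(1 if forward_return > threshold else 0)
    return labels


# Made-up prices, for illustration only.
prices = [100.0, 102.0, 99.0, 104.0, 110.0, 108.0]
labels = forward_return_labels(prices, horizon=2, threshold=0.03)
print(labels)
```

The label is built only from prices that come after each observation, so it is a target and must never be used as an input feature.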
If you would like to find out which methods can help you find the right questions to ask, you can check out our guide. In short though, when it comes to data science in finance, any step that benefits from accumulated experience can also benefit from machine learning.
2. Data cleaning and ingestion
The "garbage in, garbage out" principle is universally known, but it is especially true when it comes to machine learning in finance. Varying date formats, missing values and incompatible joining identifiers can prevent any useful signal from being extracted and, more importantly, can lead to false insights.
The best approach to address these issues is to create a data model of your various inputs and to centralise the data cleaning part of your workflow as much as possible, in order to avoid future incompatibilities or information leakage. Communication with the data engineers is key to making sure that the right inputs and features are being created for your model.
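As a rough sketch of what centralising the cleaning can look like, the snippet below puts date parsing, identifier normalisation and missing-value handling in one place. The field names (`id`, `date`, `value`), the list of date formats and the missing-value markers are all assumptions made for illustration; your own data model will differ.

```python
from datetime import date, datetime
from typing import Optional

# Assumed date formats; extend this tuple to match your own sources.
# Note that day/month order is ambiguous in general, so fix it per source.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

# Assumed markers that mean "no value" in the raw feeds.
MISSING_MARKERS = {"", "NA", "N/A", "null", "-"}


def parse_date(raw: str) -> date:
    """Try each known format in turn and fail loudly on anything else."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")


def clean_identifier(raw: str) -> str:
    """Normalise a joining key: trim whitespace and upper-case it."""
    return raw.strip().upper()


def parse_value(raw: str) -> Optional[float]:
    """Map missing-value markers to None rather than guessing a number."""
    text = raw.strip()
    if text in MISSING_MARKERS:
        return None
    return float(text)


def clean_record(record: dict) -> dict:
    """Apply every cleaning rule to one raw row."""
    return {
        "id": clean_identifier(record["id"]),
        "date": parse_date(record["date"]),
        "value": parse_value(record["value"]),
    }


# Made-up rows showing two date formats and one missing value.
raw_rows = [
    {"id": " abc ", "date": "2019-05-30", "value": "1.5"},
    {"id": "ABC", "date": "30/05/2019", "value": "N/A"},
]
cleaned = [clean_record(row) for row in raw_rows]
```

Because every input passes through the same functions, a new date format or missing-value marker only has to be handled once.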
3. Model selection and optimisation
Each machine learning model has its own strengths and weaknesses, which make it more or less suited to a given problem or data set. This is why it's important to quickly test multiple models on your data before settling on one and fine-tuning it.
Moreover, the best type of model and hyper-parameter range for your problem will evolve over time. This is why it's recommended to build a flexible framework that lets you iteratively train and test various combinations of models and hyper-parameters, so you keep an up-to-date understanding of what is driving performance.
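One possible shape for such a framework is a small search loop over models and hyper-parameters, scored with walk-forward (one-step-ahead) evaluation so that each forecast only sees past data. The two toy forecasters, the search space and the series below are assumptions chosen to keep the sketch self-contained; in practice you would plug in your own models and metric.

```python
from itertools import product
from statistics import mean


def moving_average(history, window):
    """Forecast the next value as the mean of the last `window` values."""
    return mean(history[-window:])


def exp_smoothing(history, alpha):
    """Forecast the next value with simple exponential smoothing."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical search space: model name -> (function, hyper-parameter grid).
SEARCH_SPACE = {
    "moving_average": (moving_average, {"window": [2, 3, 5]}),
    "exp_smoothing": (exp_smoothing, {"alpha": [0.2, 0.5, 0.8]}),
}


def walk_forward_mae(model, params, series, start):
    """Mean absolute error of one-step-ahead forecasts.

    At each step t the model only sees series[:t], never the future.
    """
    errors = [
        abs(model(series[:t], **params) - series[t])
        for t in range(start, len(series))
    ]
    return mean(errors)


def run_search(series, start):
    """Score every model/hyper-parameter combination, best first."""
    results = []
    for name, (model, grid) in SEARCH_SPACE.items():
        keys = list(grid)
        for combo in product(*(grid[k] for k in keys)):
            params = dict(zip(keys, combo))
            score = walk_forward_mae(model, params, series, start)
            results.append((score, name, params))
    return sorted(results, key=lambda r: r[0])


# Made-up series, for illustration only.
series = [1.0, 1.2, 1.1, 1.3, 1.4, 1.35, 1.5, 1.6, 1.55, 1.7]
results = run_search(series, start=5)
for score, name, params in results:
    print(f"{name:15s} {params} MAE={score:.4f}")
```

Adding a model means adding one entry to `SEARCH_SPACE`, so re-running the whole comparison as the data evolves stays cheap.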
When your models are not performing very well, building and optimising an ensemble model can help improve the forecasting accuracy.
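The simplest ensemble just averages the forecasts of several models. The sketch below assumes three models that each output a probability that a stock will rise; the numbers and the 0.5 cut-off are made up for illustration. Averaging tends to help most when the individual models make different kinds of errors.

```python
from statistics import mean


def average_ensemble(forecasts):
    """Combine per-model forecasts by averaging them observation by observation."""
    lengths = {len(f) for f in forecasts}
    if len(lengths) != 1:
        raise ValueError("All models must forecast the same observations")
    return [mean(values) for values in zip(*forecasts)]


# Made-up probabilities of "stock rises" from three hypothetical models.
model_a = [0.2, 0.7, 0.9]
model_b = [0.4, 0.5, 0.7]
model_c = [0.3, 0.9, 0.8]

combined = average_ensemble([model_a, model_b, model_c])
signals = [1 if p > 0.5 else 0 for p in combined]
print(combined, signals)
```

Weighted averages or a second model trained on the individual forecasts are natural next steps once this baseline is in place.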
4. Enriching data in finance
Once you have asked the right questions and trained some initial models, it can be worth exploring alternative data sets to see if and how they can improve your forecasting accuracy.
For example, would joining a geolocation data set of petrol stations to your customer data help you better understand the behaviour of your consumers?
Each data set may have little or no predictive power on its own, but joined they can be worth more than the sum of their parts. In one study, we joined IBES and PermId datasets to forecast earnings surprises.
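Mechanically, enriching a data set comes down to a join on a shared identifier, plus a check on how many rows failed to match. The sketch below uses made-up customer and petrol-station records keyed on a hypothetical `region` field; it does not reflect the actual schema of IBES, PermId or any other vendor data set.

```python
def left_join(left_rows, right_rows, key):
    """Attach matching right-hand fields to each left-hand row.

    Returns the joined rows and the left-hand keys that found no match.
    """
    index = {}
    for row in right_rows:
        if row[key] in index:
            raise ValueError(f"Duplicate key in right-hand data: {row[key]!r}")
        index[row[key]] = row

    joined, unmatched = [], []
    for row in left_rows:
        match = index.get(row[key])
        if match is None:
            unmatched.append(row[key])
            joined.append(dict(row))
        else:
            # Left-hand values win if both sides share a field name.
            joined.append({**match, **row})
    return joined, unmatched


# Made-up records, for illustration only.
customers = [
    {"customer": "c1", "region": "NORTH"},
    {"customer": "c2", "region": "SOUTH"},
    {"customer": "c3", "region": "WEST"},
]
stations = [
    {"region": "NORTH", "station_count": 12},
    {"region": "SOUTH", "station_count": 7},
]

enriched, unmatched = left_join(customers, stations, key="region")
print(enriched)
print("Unmatched regions:", unmatched)
```

Tracking the unmatched keys matters: a join that silently drops or leaves rows empty can look like a weak signal when the real problem is incompatible identifiers.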
If this post improved your understanding of data science in finance, why not share it with a friend or colleague!