Machine learning has become increasingly popular and accessible to a wider variety of users, and it keeps getting better at handling data, whether big or small. However, one thing remains true about the data used in analysis: most of the time, it is structured. Whether it is stored in Excel spreadsheets, relational databases or big data repositories, it comes in the form of rows and columns of numbers, categories or labels.
A lot of what is often called “unstructured data” has been mostly ignored by businesses because of the difficulty of dealing with it, the computational cost and the insufficient performance of models. What is unstructured data? Some define it as anything that does not conform to the rigid structure of rows and columns. Text, images, sound and video are often lumped together in this category. But, in fact, there is much structure in text, and with it come unexploited opportunities for machine learning.
Granted, text mining and natural language processing have been around for over half a century, but their applications have been limited. Information retrieval systems have existed since the 1950s, and their most important variant, the search engine, since the 1990s. Chatbots supporting customer care departments, machine translators, text-to-speech processors and other more sophisticated techniques have become popular more recently. All of those solutions have been rather localised and specific. Businesses have not harnessed the power of natural language processing in more general applications.
The missing piece of the puzzle has been access to tools and methods that make effective use of the structure of text data. An analyst might have historical sales volumes, share prices, demand forecasts and performance indicators - and also a large collection of texts. How can those texts be used? Can they offer any value?
Before we answer these questions, let’s first think about the amount of text data around. Companies have gigabytes of text in the form of articles, reports, documentation and so on. Customer feedback data, such as e-mails and phone call transcripts, lies mostly dormant. On top of that there are online sources, not limited to social media. Twitter, for example, has on average 6000 new tweets per second (up to 1000 of those are in English). Even if half of those tweets were spam, that would still leave a lot of data that can carry vital signals. Market changes, social events and even natural disasters have been predicted by analysing tweets. There is no doubt that using this information offers great benefits.
To use this information, the text first has to be preprocessed. There are various methods of cleaning and standardising text data (the most classical include stemming, stop word removal and part-of-speech tagging) and then transforming it into a vector representation. This can be done either classically (with the TF-IDF model) or using the word2vec family of neural networks. Once this has been achieved, vector representations of texts can be combined with other features to create powerful machine learning models.
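The classical route can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming the simplest TF-IDF variant (term frequency multiplied by the logarithm of the inverse document frequency, with no smoothing). The three example documents and the `order_values` column are made up to show how text vectors might sit next to an ordinary numeric feature.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up example documents, already split into tokens.
docs = [
    "delivery was late and the box was damaged".split(),
    "delivery was fast and the product works".split(),
    "the product stopped working after a week".split(),
]
vectors = tf_idf(docs)

# Turn the sparse dicts into fixed-length rows over a shared vocabulary.
vocabulary = sorted({term for tokens in docs for term in tokens})
dense = [[vec.get(term, 0.0) for term in vocabulary] for vec in vectors]

# Hypothetical numeric feature per document (e.g. order value), invented for illustration.
order_values = [120.0, 80.0, 45.0]
rows = [[value] + weights for value, weights in zip(order_values, dense)]
```

A word that occurs in every document, such as "the" here, receives a weight of zero, while a word that is specific to one document, such as "late", is weighted up. Each entry of `rows` is then an ordinary feature vector that a standard model could consume.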
Figure 1: Text Pre-Processing Pipeline
Figure 1 Definitions:
- Stop words removal - removing words that appear frequently in the text, but carry no information that may allow for distinguishing between documents. Examples of stop words are the, of, a.
- Stemming - reducing inflected forms of a word to its root or stem. For example, the stem of both fisher and fishing is fish.
- Part-of-speech tagging - marking each word with its part of speech, e.g. noun or verb.
- Dependency parsing - generating a syntax tree for a sentence.
- Vectorisation - mapping objects (here words) into vectors of numbers in some vector space.
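As a rough illustration of the first two definitions, the sketch below removes stop words and then applies a deliberately naive suffix-stripping stemmer. Both the stop word list and the suffix rules are assumptions made for this example. Production systems typically rely on an established algorithm such as the Porter stemmer, and part-of-speech tagging and dependency parsing need trained language models that are outside the scope of this sketch.

```python
import re

# A tiny illustrative stop word list; real lists are considerably longer.
STOP_WORDS = {"the", "of", "a", "an", "and", "is", "was", "to", "in"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "er", "es", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenise, remove stop words, then stem what is left."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


tokens = preprocess("The fisher was fishing in the lake")
# tokens == ["fish", "fish", "lake"]
```

Rules this crude over-stem easily: `naive_stem("river")` returns "riv", which is one reason real stemmers use more careful conditions.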
One particularly interesting application is sentiment analysis. Out of all our products, which ones are most liked by customers? What do they like about them? How do they compare with the competitors’ products? Such questions can be answered using sentiment scorers - models that assign a number to a text and thus measure how positive or negative it is. Combined with feature-level dependency parsing, which links each opinion to the product feature it describes, this becomes a very powerful tool.
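To make the idea of a sentiment scorer concrete, here is a minimal lexicon-based sketch. It is not a trained model: the word list, its weights and the one-word negation rule are all hypothetical choices made for illustration, and real scorers are usually learned from labelled data.

```python
import re

# Hypothetical, hand-picked polarity lexicon for illustration only.
LEXICON = {
    "great": 1.0,
    "love": 1.0,
    "good": 0.5,
    "slow": -0.5,
    "bad": -1.0,
    "broken": -1.0,
}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Average the polarity of known words, in the range -1.0 to 1.0.

    A word directly preceded by a negation has its polarity flipped.
    Texts without any known word score 0.0.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        if token in LEXICON:
            value = LEXICON[token]
            scores.append(-value if negate else value)
        # The negation only applies to the word right after it.
        negate = False
    return sum(scores) / len(scores) if scores else 0.0


print(sentiment_score("I love this phone, the camera is great"))  # 1.0
print(sentiment_score("The battery is not good"))                 # -0.5
print(sentiment_score("Great screen but slow delivery"))          # 0.25
```

Averaging scores like these per product would give a first, rough answer to "which products are most liked", although attributing each opinion to a specific product feature would still require the dependency parsing step.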
Text data does not have to be left out of analysis anymore - and it should not be! It offers huge benefits to any business that wants to unleash the power hidden in texts, and perhaps in images, videos and sounds too.