Using unsupervised machine learning to label FT articles

Photo by Pietro Jeng on Unsplash

Why article clustering is important

FT is one of the largest providers of financial news in the world. We publish hundreds of articles every day. One of the most challenging tasks is consistently categorising these articles.

While FT journalists tag the articles manually, it is hard to ensure that similar articles will have the same tag. Having consistent labels attached to articles is very important when we want to use them for machine learning models and analysis of customer reading trends.

Labeling problem

To make these classifications, the only data available to us is article text. …

Identifying signals using time-series analysis and unsupervised machine learning.

Photo by Luke Chesser on Unsplash

Understanding the preferences of Financial Times readers is crucial for improving user experience and maintaining engagement with our products. Accurate indicators of which topics are growing in importance can support journalists' work by helping them focus on the areas readers care about.

Trending topics prediction is a data science model built using machine learning and time-series analysis. We define article topics with an unsupervised machine learning algorithm and use time-series analysis to flag anomalies in the data.
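To make the anomaly-flagging idea concrete, here is a minimal sketch in base R. The teaser does not say which time-series method or language the FT model uses, so the rolling z-score rule, the function name `flag_anomalies`, the parameters `window` and `z_cutoff`, and the sample counts are all assumptions made for illustration.

```r
# Hypothetical helper (not the FT model): flag days on which a topic's
# count is unusually high compared with the preceding `window` days.
flag_anomalies <- function(counts, window = 7, z_cutoff = 3) {
  n <- length(counts)
  flags <- rep(FALSE, n)
  if (n <= window) return(flags)

  for (i in (window + 1):n) {
    history <- counts[(i - window):(i - 1)]
    s <- sd(history)
    # Skip flat or degenerate histories to avoid dividing by zero
    if (!is.na(s) && s > 0) {
      z <- (counts[i] - mean(history)) / s
      flags[i] <- z > z_cutoff
    }
  }
  flags
}

# Made-up daily counts for one topic, with a spike on day 9
daily_views <- c(10, 12, 11, 9, 10, 11, 12, 10, 40, 11)
anomalous_days <- which(flag_anomalies(daily_views))
print(anomalous_days)
```

A rolling window compares each day only with its recent past, so a slow drift in interest is less likely to be flagged than a sudden jump. A production model would probably also need to handle seasonality, which this sketch ignores.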

1. How can we help journalists write more relevant stories?


Over time, different topics arise reflecting the changing interests in society. The streams of data we…

Photo by Chris Liverani on Unsplash

This story explains how to implement the moving average trading algorithm with R. If you’re interested in setting up your automated trading pipeline, you should first read this article. This story is a purely technical guide focusing on programming and statistics, not financial advice.

Throughout this story, we will build an R function that takes historical stock data and an arbitrary threshold as inputs and, based on them, decides whether it is a good time to purchase a given stock. We will look at Apple stock. This article may require a certain level of statistical knowledge. …
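As a rough illustration of the kind of function described above, here is a minimal base R sketch. The teaser does not state the actual decision rule, so the rule used here (buy when the latest price falls more than `threshold` below the simple moving average), the function name `should_buy`, its parameters, and the sample prices are all assumptions, not the article's implementation.

```r
# Hypothetical helper: decide whether to buy using a simple moving average.
# `prices` is a numeric vector of historical closing prices, oldest first.
# `window` and `threshold` are illustrative parameters.
should_buy <- function(prices, window = 20, threshold = 0.02) {
  stopifnot(is.numeric(prices), window >= 1, length(prices) >= window)

  # Simple moving average over the most recent `window` observations
  moving_avg <- mean(tail(prices, window))
  latest <- tail(prices, 1)

  # Relative gap between the latest price and the moving average
  deviation <- (latest - moving_avg) / moving_avg

  # Assumed rule: buy when the price is more than `threshold` below the average
  list(buy = deviation < -threshold,
       moving_avg = moving_avg,
       deviation = deviation)
}

# Made-up prices for illustration only (not real Apple data)
prices <- c(100, 102, 101, 103, 104, 95)
result <- should_buy(prices, window = 5, threshold = 0.02)
print(result$buy)
```

Returning the moving average and the deviation alongside the decision makes it easier to check why a signal fired. As in the story itself, this is a programming example, not financial advice.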

Photo by Jason Briscoe on Unsplash

This article explains how to create a trading pipeline using R. The trading pipeline consists of four main elements:

  • Connecting with Google API and loading current holdings data
  • Connecting with Robinhood API and getting current stock prices
  • Getting historic market data using Yahoo API
  • Decision algorithm and executing an order

Photo by Annie Spratt on Unsplash

Documentation is an important part of being a data scientist. I propose using R Markdown and LaTeX to document data science models.

Good documentation has several desirable properties:

  • easily accessible
  • readable
  • consistent across projects
  • reproducible

Using LaTeX inside R Markdown lets users apply consistent LaTeX formatting across numerous projects, write professional mathematical formulas explaining a given model, reference figures and articles consistently, and dynamically produce graphs from the model's outputs.

LaTeX is a document preparation system originally intended for academics to introduce consistency across formatting of scientific publications. …

Adam Gajtkowski

Data Scientist. Holds degrees in economics, econometrics, and statistics. Employed in the news industry.
