Averisera Ltd - Machine Learning for Trend Detection

Introduction

One of the simplest things you can do to make sense out of a lump of data is to divide it into smaller lumps (called clusters), grouping similar datapoints together. This process is called clustering and is an example of unsupervised learning. After dividing your data into clusters, you can label them (e.g. “Category A”, “Category B”, etc.) and start thinking about which features are characteristic of which category. This, in turn, is called supervised learning, the second cornerstone of the insanely (as of 2016) fashionable intersection of statistics, computer science and artificial intelligence, known as machine learning. Below, we describe in simple terms a very popular clustering algorithm called $k$-means.

Algorithm

Let's try to keep mathematical notation to the minimum, but some is still necessary. The data being clusterised is a set $S$ of $N$ points, each described by $D$ coordinates. The $i$-th datapoint is denoted by $\vec{x}_i$ ($0 \le i < N$) and its $j$-th coefficient by $x_{i,j}$ ($0 \le j < D$). We determine how close or far away are two points $\vec{x}_i$ and $\vec{x}_{i'}$ by calculating their Euclidean distance \[ d(\vec{x}_i, \vec{x}_{i'}) := \sqrt{\sum_{j=0}^{D-1} (x_{i,j} - x_{i',j})^2} \ . \] Our aim is to divide the dataset $S$ into $K$ clusters, where usually $K \ll N$. The $k$-th cluster is denoted by $C_k$. An important characteristic of a cluster is the average location of all its points, called the centroid: \begin{equation} \label{eq:centroid} \vec{c}_k := \frac{1}{\left| C_k \right|} \sum_{\vec{x} \in C_k} \vec{x} \ , \end{equation} where $\left| C_k \right|$ is the number of points in cluster $C_k$. The centroid is, roughly speaking, the location of the cluster's “centre of mass”.

The aim of the $k$-means algorithm is to divide the dataset in such way that every point belonging to cluster $C_k$ lies closer to its centroid than to any other one. If this is true, we congratulate ourselves of a job well done and go home. If not, we reassign the points to the clusters with centroids nearest to them: \[ \vec{x}_i \rightarrow \argmin_{0 \le k < K} \ d(\vec{x}_i, \vec{c}_k) \ . \] The ties (points lying at equal minimal distance $d$ to more than one centroid) should be broken randomly. After reassigning the points, we recalculate the centroids according to Eq. \eqref{eq:centroid} and check how much they have moved from their previous positions. These two steps (reassign and recalculate) should be repeated until convergence is reached, i.e. the centroids stop moving or move very little.

In our description above we ignored one important question: how to initialise the centroid values? There are many different ways. Below we will describe just three of them.

Forgy

The Forgy algorithm chooses $K$ random points from the dataset and uses them as centroids. Obviously, if we select two points close to each other as centroids, this is not going to lead to a nice result, but this is not going to happen too often. It is a decent method if you want to solve standard clustering problems without much coding effort.

Random partition

This approach assigns each point randomly to one of the $K$ clusters and then calculates the centroids using Eq. \eqref{eq:centroid}. In a typical case, it will lead to all centroids being close to each other in the centre of the dataset. Not a good choice, usually.

$k$-means++

As the “++” suffix suggests, this method is a more serious affair. Like Forgy, it selects $K$ datapoints as initial centroids, but takes more care to space them wide apart. The first centroid is chosen entirely randomly (uniformly over $S$). After that, the $k$-th centroid is chosen using a probability distribution $P_k(\vec{x})$ with probabilities proportional to the square of the point's distance from the closest already chosen centroid: \[ P_k(\vec{x}) \sim \min_{0 \le l < k} d^2 (\vec{x}, \vec{c}_l) \ . \] This method tends to produce initial centroid locations more spread out than Forgy.

How many clusters?

The $k$-means algorithm described above assumes that the number of clusters $K$ is given in advance, and we just need to find their locations. However, we often don't know how many clusters there are in the data either. A statistically sound way to answer this question is using the so-called gap statistic, described below.

For a given number of clusters $K$, the $k$-means algorithm minimizes the within-cluster dispersion (also called within-cluster sum of squares), defined as \[ \sum_{k=0}^{K-1} \sum_{\vec{x} \in C_k} d^2 (\vec{x}, \vec{c}_k) \ . \] Its minimum over all possible partitions of the dataset $S$ into $K$ clusters is denoted by $W_K$ (this statistic is also a function of the dataset $S$, but we do not reflect that in the notation to keep it simpler). $W_K$ is a monotonically decreasing function of $K$, which means that the more clusters we have, the closer every datapoint lies to its centroid.

Traditionally, a non-rigorous method of determining the appropriate number of clusters was to plot $W_K$ and choose the value of $K$ over which $W_K$ flattens out at the optimal cluster number $\Kopt$, the so-called “elbow point”. An example of such a procedure is presented in the neighbouring graph.

The shape of the curve $W_K$ strongly suggests that the optimal cluster number $\Kopt=2$, but how can we tell this without looking at the plot? The gap statistic is a more refined and rigorous take on this simple heuristic method.

The fact that $W_K$ decreases monotonically by construction suggests that we can try to “detrend”. To this end, we compare $W_K$ with a “reference” value $W'_K$ calculated for a sample of size $N$ drawn from a known simple reference distribution $\mathcal{R}$, which does not to exhibit any clustering. The latter is usually assumed to be a uniform distribution over the minimal hypercube encapsulating the data. If we want to adjust it better to our dataset, we can orient the hypercube along the eigenvectors of its covariance matrix.

Just as $W_K$, $W'_K$ decreases monotonically with $K$. However, as we increase $K$ to bring it closer to the optimal value $\Kopt$, $W_K$ decreases faster than $W'_K$ because at least some of the datapoints $\vec{x} \in S$ will be assigned to their “correct” centroids, decreasing their distances from them by more than in the case of the reference distribution. The below graphs illustrate this effect. The within-cluster dispersion decreases with the increase of $K$ from 1 to 2 for clustered dataset $S$ and reference distribution $\mathcal{R}$ sample. Dashed lines link datapoints to their cluster's centroids (their colours correspond to cluster assignments). We obtain $\Delta W'_1 - \Delta W_1 \approx 0.84 > 0$, which is consistent with the fact that $\Kopt = 2$.

To remove the effects of distance scale (e.g. whether we measure the distance between datapoints in miles or millimetres), it is better to compare the changes in $\ln W_K$ and $\ln W'_K$, expecting $\Delta \ln W'_K - \Delta \ln W_K > 0$ when $K + 1$ is closer to $\Kopt$ than $K$. Another motivation for the logarithm is the fact that for data sampled from a mixture of Gaussian distributions, $-\ln W_K$ is proportional to log-likelihood.

Since $\ln W'_K$ is calculated from a random sample, it is a random variable itself - it has a non-zero variance. Hence, we need to make sure that we do not change our estimate of $\Kopt$ from $K$ to $K+1$ without a good reason. For this, we require not only that $\ln W_K$ decreases faster than $\ln W'_K$ for a single sample from $\mathcal{R}$, or that it is likely to decrease faster. We ask that it is unlikely not to decrease faster. Thus, we define the gap statistic announced at the beginning of this section as \[ G_K := E_{\mathcal{R}}[\ln W'_K] - \ln W_K \ , \] where $E_{\mathcal{R}}[\ln W'_K]$ is the expected value of $\ln W'_K$ under distribution $\mathcal{R}$. Starting from $K=1$, we accept $K+1$ over $K$ only if \[ G_K < G_{K+1} - \sqrt{\Var_{\mathcal{R}}[\ln W'_K]} \ , \] where $\Var_{\mathcal{R}} [\ln W'_K]$ is the variance of $\ln W'_K$ under $\mathcal{R}$. The optimal cluster number $\Kopt$ is assumed to be the first one for which this inequality does not hold.

In practice, we need to calculate the expected value and variance using Monte Carlo methods. We approximate $E_{\mathcal{R}}[\Delta W'_K]$ by an average $\bar{\ln W'_K}$ taken over $B$ samples from $\mathcal{R}$ (a typical value for $B$ is 100-300), and $\Var_{\mathcal{R}} [\ln W'_K]$ by the corresponding sample variance $\sigma^2(\ln W'_K)$, multiplied by $1 + 1 / B$ to account for the sampling error.

Timorous beasties

Although the $k$-means clustering gained popularity thanks to its simple and intuitive idea, it has its shortcomings. They are exposed by the below “mouse” example. The artificial dataset consists of three circular clusters of different sizes. Since the $k$-means method splits datapoints halfway between clusters, and hence favours clusters of similar sizes, it fails to find an acceptable solution.

The problem can be remedied if instead of looking at individual datapoints and their mutual distances, we focus on clusters. Each of them is characterised by an area and the density of datapoints it contains. Assigning datapoints to clusters should cause a minimum increase of the cluster area and a minimum decrease of its density. This criterion leads to a new clustering method, $k$-dense clustering (to be published soon).

Applications

Clustering methods have multiple applications beyond the classification problems. In this section we employ the $k$-means algorithm for trend detection in time series. Read on the technicalities below or go straight to testing the outcome in the interactive chart.

Drag points to change their positions or upload data from a CSV file:
Your data is not sent from your computer.

# of points
# of derivatives
Initialisation method
Include values in sample

Let us have a set of observations taken at sequential points in time, which we suspect of exhibiting temporal trends. We describe the set as a time series $X$, which is a sequence of $N$ value and time pairs, $X = \{ (x_i, t_i) \}$, $0 \le i < N$, with $t_{i+1} > t_i$. Loosely speaking, a trend in the time series is a region where subsequent slopes \[ s_i := \frac{x_{i+1} - x_i}{t_{i+1} - t_i} \] have similar values. We detect trends by associating them with clusters in a multidimensional sample \begin{equation} Y = \{ y_i := (\bar{t}_i, s_i, (Ds)_i, \dotsc, (D^d s)_i )\} \ ,\ i \le 0 < N - 1 \ , \end{equation} where $D$ is the discrete derivative operator defined as \[ (Ds)_i := \begin{cases} (s_{i} - s_{i-1})/(t_i - t_{i-1}) & i > 0 \\ (Ds)_1 & i = 0 \end{cases} \ , \qquad i \le 0 < N - 1 \ . \] and $\bar{t}_i := (t_{i+1} + t_i)/2$ is the mean time on the slope. Optionally, $y_i$ can also include the mean $\bar{x}_i := (x_{i+1} + x_i)/2$ as another coordinate, but that is usually not recommended, as it can confuse the algorithm in the case of a $\vee$- or $\wedge$-shaped time series. In most cases we want to find linear trends, so we set $d=0$.

Clustering slopes between points instead of points themselves has the advantage that even a 1-element cluster can be interpreted as a trend, for example in a time series [(0, 0), (1, 1), (2, 2), (10, 3), (11, 2), (12, 1)] the third slope is larger than the neighbouring ones and thus should be treated as a distinct, although very short trend.

Given a number of clusters $k$, we use the $k$-means algorithm to attribute each sample point to one of the clusters. The initial locations of the cluster centroids are selected using the $k$-means++ algorithm. We iterate the $k$-means update and assignment steps until the centroid locations do not change significantly. We begin with $k=1$ (all point in the same cluster) and increase it by 1 until the gap statistic test rejects $k + 1$ compared with $k$ (we use version `b' of the gap statistic, which employs the principal components analysis to construct the reference distribution). The number of trends in the series is equal to the maximum (i.e. last) not rejected value of $k$. The process of selecting the best value of $k$ is inherently random. To stabilise the results in borderline cases, we can run it an odd number of times (e.g. 3) and select the value of $k$ chosen most frequently as our result.

Points in sample $Y$ can be uniquely mapped to the slopes of the time series $X$. By partitioning $Y$ into $k$ clusters $\{C_j\}$, where $0 \le j < k$, the $k$-means algorithm assigns each slope $s_i$ to $j_i$-th cluster. Let $m_j$ be the median of all $i$ such that $i_j = j$. By construction, $m_j \neq m_{j'}$ for $j \neq j'$. Thus, we can uniquely sort the clusters $\{C_j\}$ by the medians $m_j$ in ascending order, and then re-label them so that $m_j < m_{j'}$ for $j < j'$.

It is important to remember that the results of $k$-means clustering depend on how we measure the distance along different dimensions, which may have different physical units. Similarly, whether a time series exhibits one or more trends depends on what we consider to be a significant difference in slope value. Our algorithm can solve the scaling problem in three different ways:

do not rescale anything,
before generating the sample $S$, the time series $X$ is rescaled and shifted to fit within a $[0, 1]^2$ box,
before clustering but after generating it, the sample $S$ is rescaled and shifted to fit within a $[0, 1]^{d+2}$ box.

Which option is the best depends on the data.

Machine learning

k-means clustering

Read on $k$-means, its implementation, drawbacks and possible solutions, and an example of its applications below.