## Collaborative Filtering Using Matrix Approximation

### Abstract

The objective of collaborative filtering is to predict the user ratings for an item based on the ratings of the same user and also on the ratings of other users. As the quantity of data grows, it becomes imperative for the collaborative filtering tasks to be agile and scalable. In this article, matrix approximation methods are used to build a collaborative filtering model. A database of item ratings by users is input to the model and output is the predicted ratings which are missing in the database. The article also presents an optimization technique using stochastic gradient descent which yields a low-complexity, high-throughput and a scalable model.

The Authors of this paper on collaborative filtering.

### Introduction

Recommender systems are ubiquitous in today’s online experience. Users recommend different items based on his choices in collaboration with choices of other users. For example, news articles, products, and movies are recommended to users based on user’s previous ratings and also induced by the ratings of other users. Collaborative filtering techniques estimate item ratings or preference scores for a user, based on the knowledge of the ratings of the same user as well as ratings of the other users. User $$X$$’s rating on an item $$i$$ is based on the ratings on item $$i$$ given by other users with similar ratings as that of $$X$$ on other items. Collaborative filtering algorithms learn and discover user preferences on items and form the basis of recommendation systems. As more users and items are added to the data, it is important to have collaborative filtering algorithms which are scalable and efficient. In this article, we build collaborative filtering models based on matrix approximation and optimized using stochastic gradient descent (SGD) approach. Our matrix approximation models have a run-time complexity of $$O(n)$$ per epoch where $$n$$ is the number of available ratings during training phase and $$O(1)$$ to predict rating for each test example. On the Movielens data set of 100,000 users for 1682 movies by 943 users, the best mean absolute error (MAE) and the best mean squared error (MSE) we obtained are $$0.6899$$ and $$0.8841$$ respectively.

### Collaborative Filtering Using Matrix Approximation

#### Background

The goal of collaborative filtering is to predict missing item ratings for users given a database of user ratings. Each user rating can be represented in a triplet $$\langle i, j, x_{ij} \rangle$$ where $$i$$ is the user, $$j$$ is the item and $$x_{ij}$$ is the rating of user $$i$$ for item $$j$$ . The database of user ratings can be represented as a matrix where each row, $$i$$, ($$1 \leq i \leq m$$, $$m$$ users in total), represents a single user, and each column $$j$$, ($$1 \leq j \leq n$$, $$n$$ items in total). Based on a provided set of ratings in the tuple form, the matrix representation contains ratings in positions found in the given data set, while other ratings in other positions are missing. The objective of the collaborative filtering taks is to predict the missing ratings.

#### Modeling

Ratings, which are usually integers in the range $$[1,5]$$, are a result of contributions from latent user’s properties and item’s properties. If the user $$i$$’s contribution is $$r_i$$ and item $$j$$’s contribution is $$c_j$$, then the additive approximation $$a_{ij}$$ for $$x_{ij}$$ can be written as, $$a_{ij} = r_i + c_j$$. Similarly, a multiplicative approximation can be written as $$a_{ij} = r_i \cdot c_j$$.

The optimal additive model can be arrived at by minimizing MSE below.

$$\sum_{\langle i,j \rangle in I} (r_i + c_j – x_{ij})^2$$

The additive model can be represented as shown in the equation below.

$$Az = b$$

where $$z=‘$$ are the values that need to be learned, $$b$$ is a vector of $$m+n$$ ratings and $$A$$ is a $$m+n$$ by $$m+n$$ is a sparse matrix with ones at positions $$i$$ and $$m+j$$ for $$\langle i,j \rangle \in I$$ where $$I$$ is the set for which $$x_{ij}$$ is non-zero. $$z$$ needs to be calculated by minimizing $$||Az – b||_2$$. Though, Moore-Penrose pseudo-inverse of A yields $$z = (A’A)^{-1}A’b$$, the inverse operation of a $$m+n$$ by $$m+n$$ matrix is an expensive operation in order of $$O((m+n)^3)$$ using Gauss-Jordan elimination.

### Optimization using SGD

SGD can be used to minimize an objective function, a $$L2$$ regularized loss function, $$E$$, in this case, with $$\mu$$ as the regularization strength is shown in the equation below.

$$E = \mu (\sum_i r_i^2 + \sum_j c_j^2) + \sum_{\langle i,j \rangle in I} | f(r_i, c_j) – x_{ij} | ^p$$

where, $$f(r_i, c_j)$$ is a prediction function which, for example, could be $$f(r_i, c_j) = r_i + c_j$$ or $$f(r_i, c_j) = r_i \cdot c_j$$ in additive and multiplicative models respectively.

In order to learn $$r_i$$ and $$c_j$$, we apply SGD as shown by the following equations

$$r_i := r_i – \lambda (\frac{\partial E}{\partial r_i})$$

$$c_j := c_j – \lambda (\frac{\partial E}{\partial c_j})$$

where $$lambda$$ is the learning rate. SGD approximates the true gradient at each training example. The algorithm sweeps through the training set and updates $$r_i$$ and $$c_j$$. Several passes (epochs) of the training set are performed till the values converge. Parameters such as $$\mu$$, $$lambda$$ and epochs need calibration. The advantages of SGD over matrix inversion or gradient descent are a) its inexpensive operations as gradient is approximated for a single example, b) ease of implementation and c) scalability as the convergence rate does not seem to increase with an increase in the size of the data set.

### Data For Collaborative Filtering

For implementing our collaborative filtering models, we used the MovieLens dataset. The dataset contains 100,000 ratings given by 943 users for 1682 movies. The matrix represented by 943 rows (users) and 1682 columns (movies) contains 1,586,126 elements(ratings) out of which 100,000 ratings are provided. The sparsity of the given information matrix is almost 94%. The task of the collaborative learning models is to predict the remaining ratings in the matrix using different matrix factorization models. Data is presented in $$\langle \mathrm{user}, \mathrm{movie}, \mathrm{rating} \rangle$$ format where each user has rated at least 20 movies. Figure below displays exploratory analysis results on the whole 100K data. Rating 4 was most frequent. And histograms of both items-per-user and users-per-item decreased exponentially. Average rating did not show increasing pattern with an increase of rated items. While some users give systematically higher ratings, some items receive systematically higher ratings. These observations justify adoption of collaboration filtering for prediction of missing ratings.

The provided data also contains 5 different training and test example subsets, named $$u_i.base$$ and $$u_i.test$$ respectively, $$i \in {1,2,3,4,5}$$, constructed from the above data of 100,000 ratings after a 80-20 split. The subsets are used for 5-fold cross-validation in our experiments.

Histograms on 100K dataset. (a) Histogram of 5 ratings; (b) Average rating vs. Number of rated items;
(c) Histogram of number of items per user; (c) Histogram of number of users per item.

## Collaborative Filtering Modeling Experiments

We will review multiple models of interest, then expand on the best model. For comparison we also look at some more trivial models based on arithmetical means.

Let $$x_{ij} in {0,1,2,3,4,5}$$ be the rating by user $$i$$ for movie $$j$$ if $$x_{ij} > 0$$ and define missing as $$x_{ij}=0$$ where $$i$$ and $$j$$ range $${1,2,…,m}$$ and $${1,2,…,n}$$ respectively. In practical terms the training and test data are originally represented as a table of $$K$$ examples, i.e., rows of triples $$\langle i,j,x_{ij}\rangle_k, k \in {1,2,…,K}$$. In this context an existing rating within a row $$k$$ can be represented element wise as $$x^k=x_{ij}>0$$ or vectorized as just x.

For the purpose of prediction and training we translate this in to two sparse matrices, $$R in {0,1}^{K\times m}$$ for the users and $$C in {0,1}^{K\times n}$$ for the items, where the respective elements $$a_{ki}^{(R)}$$ and $$a_{kj}^{(C)}$$ are defined as $$a_{ki}^{(R)} = I(x_{ij}> 0),\quad j \in {1,2,…,m}$$ and $$a_{kj}^{(C)} = I(x_{ij}>0), \quad i={1,2,…,n}$$. Here $$I(\cdot)$$ is the indicator function, equals to one if true and zero if false. Finally, a prediction, or approximation, of a rating $$x^{(k)}$$ will be denoted $$\tilde{x}^{(k)}$$ or vectorized as just $$\tilde{x}$$. If need be, the notation $$\tilde{x}_{ij}$$ will also be used for the prediction of $$x_{ij}$$.

#### Mean Models

All presented mean based model uses vectors $$r in Re^m$$ and $$c in Re^n$$, with elements $$r_i$$ and $$c_j$$, as learning parameters. Vector $$r$$ corresponds to all users and vector $$c$$ corresponds to all movies. Prediction is done as a single matrix multiplication of the following block matrix and block vector.

$$A = [R \quad C], \quad z = [r \quad c]^T \mathrm{where},; \tilde{x}=[R \quad C][r \quad c]^T=Az$$

Finally, let the following means be defined as

$$bar{x}_i=frac{\sum_{j=1}^{n}x_{ij}}{\sum_{k=1}^{K}a_{ki}^{(R)}} \mathrm{and} ; bar{x}_j=\frac{\sum_{i=1}^{m}x_{ij}}{\sum_{k=1}^{K}a_{kj}^{(C)}}$$

When user ratings for a specific movie are zero or are not provided by the user, we use the global mean of all ratings, give by

$$\bar{x}_i=\bar{y_j}=\frac{\sum_{k=1}^{K}x_k^{(C)}}{K}$$

for initialization and bootstrapping purposes.

##### User Mean Model

The user mean model just takes the average of all ratings a person has given and returns this as the prediction for unrated movies. This has no predictive power from the user’s point of view. This is represented by,

$$r_i=\bar{x}_{i.}, \quad c_j=0$$

##### Item Mean Model

The item mean model is a little more reasonable and averages all ratings a movie has. This can be used for a reasonable prediction of a rating. This is represented by,

$$r_i=0, \quad c_j=\bar{x}_{j.}$$

##### Bi-Mean Model

The bi-mean is a naive way of combining the user mean and items mean. It takes the average of both user and item means. Although, a co-efficient of 1/2 is used to find the combination both the means, different co-efficients can also be used. This is represented by,

$$r_i=\frac{1}{2}\bar{x}_{i.}, \quad c_j=\frac{1}{2}bar{x}_{.j}$$

##### Bias from Mean Model

The bias from mean model works by finding the user’s average rating $$r_i$$ as a mean on which $$c_j$$ can vary around. The interpretation of $$r_i$$ and $$c_i$$ is that each person tends to rate at generally high or generally low levels $$r_i$$ and the amount $$c_i$$ that deviates from this basis comes from all users who have rated the move. This model is represented by,

$$r_i=r_i, \quad c_j= \frac{\sum_{i=1}^{m}(x_{ij}-\bar{x}_{i.})}{\sum_{k=1}^{K}a_{kj}^C}$$

In words, to find the deviation $$c_j$$ of movie $$j$$, sum of all ratings less each user’s average rating can be calculated which can then be divided by the number of raters.

Our additive model is based on the formulation of the prediction function $$f(r_i, c_j)$$ as shown by the following equation.

$$f(r_i, c_j) = r_i + c_j$$

$$r_i$$ and $$c_j$$ are learned by minimizing the square-loss and employing regularized SGD as shown in the following equations where
$lambda$ is learning rate and $$mu$$ is regularization strength .

$$E = \mu (\sum_i r_{i}^2 + \sum_{j} c_{j}^2) + \sum_{ \langle i,j \rangle \in I} (r_{i} + c_{j} – x_{ij})^2$$

$$r_i := r_i – 2 \lambda \mu r_i – 2 \lambda (r_i + c_j – x_{ij})$$

$$c_j := c_j – 2 \lambda \mu c_j – 2 \lambda (r_i + c_j – x_{ij})$$

#### Multiplicative Model

Our multiplicative model is based on the formulation of the prediction function $$f(r_i, c_j)$$ as shown by the following equation.

$$f(r_i, c_j) = r_i cdot c_j$$

$r_i$ and $$c_j$$ are learned by minimizing the square-loss and employing regularized SGD as shown in the following equations.

$$E = mu (\sum_{i} r_{i}^2 + \sum_{j} c_{j}^2) + \sum_{ \langle i,j \rangle in I} (r_{i}\cdot c_{j} – x_{ij})^2$$

$$r_i := r_i – 2 \lambda \mu r_i – 2 \lambda (r_i + c_j – x_{ij})c_j$$

$$c_j := c_j – 2 \lambda \mu c_j – 2 \lambda (r_i + c_j – x_{ij})r_i$$

#### Combined Model

Our combined model is built by incorporating the additive and multplicative models simultaneously.

$$tilde{x}_{ij}=r_i^{(1)}c_j^{(1)} + r_i^{(2)}c_j^{(2)} + … + r_i^{(L)}c_j^{(L)}$$

where $$L in mathcal{N}_1$$ is a meta-parameter that needs to be learned. Given $$l=1,2,…,L$$ and $$i=1,…,m$$, we can, for purposes of implementation, enter all $$r_i^l in Re$$ into a vector $$r^l in Re^m$$ or matrix $$r in Re^{m times L}$$, likewise, we can enter $$c_j^{(l)} in Re$$ into a vector $$c^{(l)} in Re^n$$ or matrix $$c in Re^{n times L}$$. We thus have a way to simultaneously make predictions $$tilde{x} in Re^K$$ for all $$K$$ presented by the test examples $$\langle i,j\rangle_{k in {1,…,K}}$$ as follows.

$$\tilde{x} = \sum_{l=1}^L(Rr^l \bullet Cc^l)$$

where we define $$u \bullet v = diag(u)v in Re_q$$ be the notation for component wise multiplication for some $$u, v in Re^q$$.

textit{Stochastic Gradient Descent}. SGD is applied to learn $$r_i$$ and $$c_j$$. For the combination model, following are the SGD update rules for $$r_i^{(l)}$$ and $$c_j^{(l)}$$.

$$r_i^{(l)} := r_i^{(l)} – 2 \lambda \mu r_i^{(l)} – 2 \lambda (\tilde{x}_{ij} – x_{ij})c_j^{(l)}$$

$$c_j^{(l)} := c_j^{(l)} – 2 \lambda \mu c_j^{(l)} – 2 \lambda (\tilde{x}_{ij} – x_{ij})r_i^{(l)}$$

Learning rate $$\lambda$$ and regularization strength $$mu$$ are fixed meta-parameters that need to be optimized separately. Learning process is terminated when the change in mean squared error $$MSE_e$$ for epoch $$e$$ falls below a threshold $$delta$$ during training phase. Learning is continued if $$MSE_e – MSE_{e-1} > delta$$. Learning will also stop when maximum number of epochs, $$epoch_{max}$$ are completed.

### Experiment Setup

\textit{Performance Measure}. Mean Squared Error (MSE), Mean Absolute Error (MAE) are used as the performance measure. The lower the values the better the model. MSE is calculated as follows.

$$MSE = \frac{1}{K}\sum_{k=1}^K(\tilde{x}^{(k)} – x^{(k)})^2$$

MAE is calculated by the following equation.

$$MAE = \frac{1}{K}\sum_{k=1}^K|\tilde{x}^{(k)} – x^{(k)}|$$

textit{Meta-parameters}. $$\lambda, \mu, L, delta, text{\epoch}_{\max}$$ need to be calibrated. We conduct a three-dimensional grid search for $$lambda, mu$$ and $$L$$ while setting $$delta$$ and $$\text{\epoch}_{\max}$$ to computational tractable sizes by human intuition and by trial-and-error. The following grid-search windows are chosen for parameters,

$$lambda in {2^{-i}}_{i=5}^{11}, \quad \mu \in {2^{-i}}_{i=3}^{11}, \quad L \in {1,2,3}$$

The grid was initially smaller, but expanded until optimal border cases were uncovered. In addition to these values we used the following fixed meta-parameters,

$$delta = 2^{-15}, \quad \text{\epoch}_{\max} = 2^{10}$$

The above settings let the models converge until sufficiently small change in MSE is repeatedly observed under the predetermined epoch number.

textit{Cross Validation}. In order to get fair measurements we performed 5-fold cross-validation for all measurements. That is, we generate five sets of performance metrics, i.e., MSE, and average over the folds. This allows us to use more data, without overfitting. The training data was randomized then split evenly among 20% partitions yielding $$u1, u2, u3, u4$$ and $$u5$$ training and test data subsets. Training was done using, for example, $$u1.base$$ and the resultant model is used to test $$u1.test$$. This is repeated for each set of training and test data. Each fold was used exactly once for training (per meta-parameter triplet). The remaining four folds were used to train the actual model that did the prediction for this fold of test data.

### Results and Analysis

#### Trivial Mean Models

The performance of the mean models is reported in the table below. As can be seen, the performance increases as the models get more “clever” with the best one being the bias from mean model. The following table shows MSE of Mean Models.

Model Average MSE on Test Data
User Mean 1.0895
Item Mean 1.0498
Bi-Mean 0.9662
Bias From Mean 0.9264

#### Combination Model

We conducted a broad grid search for the three main meta-parameters of the combined model, $$\lambda, \mu and L$$. Model length $$L=2$$ performs unequivocally better than models based on $$L=1$$ and $$L=3$$ which have MSE of 0.9034 and 0.9033 respectively. From the MSE results of models based on $$L=2$$, which are shown in the table below, we see that the best performing meta-parameters are $$\lambda=2^{-10}$$ and $$\mu=2^{-10}$$. Furthermore, we can also see that the regularization strength has negligible sensitivity among $$\mu, \lambda \in {2^{-8}, 2^{-9}, 2^{-10}}$$ x $${2^{-8}, 2^{-9}, 2^{-10}}$$. The following table shows a subset of the combination model’s MSE grid search.

L=2 $$mu=2^{-8}$$ $$mu=2^{-9}$$ $$mu=2^{-10}$$ $$mu=2^{-11}$$
$$lambda=2^{-8}$$ 0.906 0.925 0.922 0.946
$$lambda=2^{-9}$$ 0.891 0.886 0.886 0.9077
$$lambda=2^{-10}$$ 0.885 0.885 0.884 0.8961
$$lambda=2^{-11}$$ 0.899 0.888 0.886 0.9072

The table below compares MSE and MAE of all the models (mean and combination models). From the results, it can be seen that the combination model performs significantly better than the mean models.

Model Average MSE on Test Data Average MAE on Test Data
User Mean 1.0895 0.8362
Item Mean 1.0498 0.8174
Bi-Mean 0.9662 0.7951
Bias From Mean 0.9264 0.7590
Combination Model 0.8841 0.6899

#### Computational Complexity

For our implementation of the combination model using SGD, the space complexity is $$O(K)$$ and run-time complexity of the training phase is $$O(nK)$$ where $$n$$ is the number of epochs consumed and $$K$$ is the number of training examples. The complexity of the test phase is constant time per example. The number of training examples is the dominant contributor to the complexity as number of epochs tend to stabilise at low 50s and do not increase with increase in dataset size. In the figure below, we see the learning progress over time, i.e, training MSE versus epochs. Of the 560 epochs used on the final model we see a significant decrease in MSE for the first 50 epochs. For $$L=2$$ and $$K=80,000$$ each epoch took about 70 ms. This bodes well for performance on a larger dataset, however, it is unclear if the full dataset $$K=10^8$$ of Netflix is manageable, with regard to memory, on a computer desktop. A parallel and distributed implementation of SGD would handle larger datasets with considerable ease as a single SGD updates can be divided into smaller updates based on smaller subsets with each update parallelized. Such an implementation can lead to better a performing model with high degree of scalability.

MSE vs Epochs for Combination Model

### Conclusion

Collaborative filtering with high accuracy and scalability are important both to the businesses to create user satisfaction and also to the user to have a richer experience. It is becoming imperative for recommender systems to produce high quality of predictions and at the same time be seamlessly scalable. With the inundation of “Big Data”, collaborative filtering algorithms would be under stress without efficient algorithms. In this article, we implemented collaborative filtering algorithms, ranging from trivial to higher-accuracy combination models. Our additive, multiplicative and combination models employ SGD technique and have a linear run-time and space complexity and can scale well to millions of data records provided convergence occurs in few epochs. Applying parallel and distributed strategies for storage and computation, we believe that the collaborative filtering system becomes all the more scalable and responsive.

## Master’s Thesis: A Novel Algorithmic Trading Framework Applying Evolution and Machine Learning for Portfolio Optimization

Master’s Thesis, Spring 2012

Authors: André Christoffer Andersen and Stian Mikelsen
Supervisors: Professor Alexei A. Gaivoronski and Helge Langseth

### Abstract

The goal of this thesis is to implement an automated trading system able to outperform the benchmark uniform buy-and-hold strategy. Performance is measured in term of multiple risk and return measures. A comprehensive signal processing framework is built to facilitate rapid development and testing of multiple candidate trading systems. A subset of the most promising trading systems, 14 in total, are presented in this thesis.

The trading systems are not only designed to function as decision support systems for human trader, but also to function as autonomous traders in their own right. They are built to be applicable in real life, any academic value is a welcomed side effect. Many of the trading systems are heavily influenced by machine learning methods, with a special focus on time series prediction. Support vector machines (SVM), artificial neural networks (ANN), and regression are the primary methods applied for low level machine learning. Evolutionary algorithms are applied as high level meta-parameter tuning. In addition to machine learning methods, classical portfolio optimization methods using modern portfolio theory are applied, some in combination with machine learning methods.

The trading systems are tested by means of simulation on data from two stock markets – the OBX index on Oslo stock exchange and Dow Jones Industrial Average (Dow). Results indicate that the Dow index is highly efficient. No trading system was able to outperform the benchmark when using transaction costs. Results on the OBX index indicate some inefficiencies with potential for exploitation. Many trading systems manage to generate higher return than the benchmark. Some achieve excess return with the same level of risk. The efficient portfolio trading system yields highest return given the same level of risk as the buy-and-hold portfolio. In general, results tend to support the efficient market hypothesis. Given the same risk level, experiments from this thesis shows that efficient portfolios can outperform uniform ones in inefficient markets as these are not efficient enough in and of themselves as in the case of OBX. No strategy was found that can substantially outperform an efficient portfolio.

Support Vector Machines are identified to be the best time series predictors among the machine learning techniques used. Evolution is found to be a powerful tool in identifying and training trading systems. Future work will focus on evolution as an integral part of the trading systems.

In financial markets there are many different types of traders, both human and algorithmic. They are often categorized by their trading time-frame. Alternatively it is possible to categorize based on what kind of strategies they apply.

In general there is one main distinctions to make. There is a difference between technical and fundamental trading. Fundamental trading includes factors outside the price of the security; typically a company’s accounts is used. Technical trading however is only based on the actual price development and its qualities. Technical traders base their strategy on the assumption that historical price data can in some way predict future prices. The degrees to which analysis one employs affect the time-frame, since fundamental analysis is inherently much more static than technical. Price update occur several times per second, while the company results are only published once a quarter. Fundamental data is also more qualitative and thus harder to analyze quickly, e.g., news. Analysis of technical data is purely quantitative, and can thus be readily handled by computers. Generally, the shorter time-frame, the more automated the traders are.

The longest time-frame is often called position trading. It generally consists of all positions held longer than a few months. This is a very reputable strategy, recommended by Warren Buffet himself, but with a limited possibility of return. Since one has plenty of time to do a thorough analysis of the asset, algorithmic methods have less advantage here, but can be used as supplementary guidance. This category also includes more active investors, who are active in a different sense than earlier described. They take an active part in management of the company they’re invested in, like private equity companies.

Positions held over days or weeks include strategies like swing trading and more short-term fundamental analysis strategies. This is generally the shortest time-frame one finds purely fundamental analysis traders. Changes in price on a lower scale than days are often considered noise to these traders. There is some technical analysis done over several days, but it is limited. This is because the primary goal of pure technical analysts is to speculate on price swings. A shorter time frame will give more than enough volatility to do this, due to the fractal nature of the stock market. A fractal is a mathematical concept where roughness exist at whatever resolution you look at. The graph of price a different time periods will tend to have roughly the same kind of volatility.

Intraday trading is trading confined to a single day, most day-traders sell their positions before close. This is mainly speculation and mostly done by technical analysis, though one might supplement with fundamental techniques. Day trading can be a job for human investors, but require much time and is quite close to a zero sum game due to the lack of the long term equity premium. Intraday trading is becoming the realm of machines as they can search through much more data. Most machines are however trading on more or less instant strategies where the aim is to hold an actual position as short as possible; these are called High-frequency traders.

High-frequency trading (HFT) is a poorly defined term. In this paper it is used in its semantic meaning of all trading done at a high frequency instead of meddling with all the contradictory definitions in the literature. With this meaning of the term, HFT will be a broad concept that includes all trading where the trading speed is used as a competitive advantage. In 2010 these systems trade on milli- and microseconds. All HFT is done by algorithms, as the speeds required for this trading type are beyond human capabilities.

Low-latency trading is a subclass of HFT, and can be thought of as a specific strategy of HFT. These are a naive set of simple algorithms which sole advantage is their ultra-low latency, direct market access. An example of a low-latency strategy is to survey the news. The instance good news hits the market, a low-latency trader will buy from slower sellers which have not had time to increase their bid price, and then sell to a new higher price. Most of these tactics revolve around quickly finding and exploiting arbitrage possibilities. There is tough competition in having the fastest algorithms and the lowest latency. This activity is very capital intensive due to the hardware requirements and it needs a certain initial capital. There are few amateurs in this area and most actors are large banks, funds or even specialized algorithmic trading firms.

## Signal Emitter Positioning Using Multilateration

In this post we solve the problem of spatial location of a signal emitter in Euclidian $$\mathbb{R}^3$$ space by means of time difference of arrival (TDOA) at multiple omnidirectional sensors.

## Introduction

Multilateration is a technique that uses multiple omnidirectional sensors in order to isolate the unknown position of a signal emitter in two- or three-dimensional Euclidian space. Two examples of sensors and signals could be microphones listening for sharp noises, or radio receivers listening for radio signals. In any case, these sensors are located at unique known positions where they listen for what is called a signal event. Such events are timestamped based on a synchronized or centralized clock common to all sensors. The signal from an emitter is registered by all sensors only once as the signal wave expands spherically in all directions with constant propagation speed. The time difference between when two sensors register the signal event is called the time difference of arrival (TDOA). Based on the TDOA and the location of each registration, i.e., sensor positions, we can deduce the location of the signal emitter.

## Model definitions

Let sensors $$i=1,\ldots,k$$ at positions $$\vec{p}_{i}$$ be sensors that output a timestamp $$t_{i}$$ for when they register a signal wave with propagation speed $$v$$ from a signal emitter at unknown position $$\vec{x}$$, signal start time $$t$$ and distance $$v\left(t_{i}-t\right)=\|\vec{x}-\vec{p}_{i}\|$$ away. If we define sensor $$c$$ to be the first sensor to register a signal event, then the signal wave has traveled for $$(t_{c}-t)$$ time units and a distance of $$v(t_{c}-t)=\|\vec{x}-\vec{p}_{c}\|$$ before reaching the first sensor. Furthermore, the TDOA between each sensor’s timestamp and the initial sensor’s timestamp is $$(t_{i}-t_{c})$$ yielding the the distance $$v\Delta t_{i}=v(t_{i}-t_{c})$$ traveled by the signal wave since initially being registered and finally reaching the $$i$$th sensor. From the definitions above we have the following important relationships.

\|\vec{x}-\vec{p}_{i}\|=v\left(t_{i}-t\right)\label{eq:main}

and the equivalent formulations

\begin{split}\|\vec{x}-\vec{p}_{i}\|=v(t_{i}-t_{c})+\|\vec{x}-\vec{p}_{c}\|\end{split}
\label{eq:mainSplit}

\|\vec{x}-\vec{p}_{i}\|=v(t_{i}-t_{c})+v(t_{c}-t)\label{eq:mainTime}

The following figure shows the various geometric distance relationships of the multilateration problem.

Distance relationships of the multilateration problem.

## Geometric Interpretation

In general the positions $$\vec{p}_{i},\vec{p}_{c}$$ and TDOA $$t_{i}-t_{c}$$ of a pair of sensors limits the signal emitter’s position to lay on one sheet of a circular two-sheeted hyperboloid with foci at $$\vec{p}_{i}$$ and $$\vec{p}_{c}$$. Given that $$c$$ is the first of the two sensors to register a signal event, then only the sheet at focus $$\vec{p}_{c}$$ is materialized. The associated directrix plane can be found by a known position vector and normal vector. Since the directrix is orthogonal to the vector pointing from sensor $$i$$ to sensor $$c$$, we know that $$p_{c}-p_{i}$$ itself or its unit vector
$\frac{\vec{p}_{c}-\vec{p}_{i}}{\|\vec{p}_{c}-\vec{p}_{i}\|}$
can serve as a normal vector. Next we need a position vector with endpoint on the directrix plane. One such vector can be defined by the point at which the directrix intersects the line segment between $$\vec{p}_{i}$$ and $$\vec{p}_{c}$$. This point lays a length $$v(t_{i}-t_{c})$$ away from $$p_{i}$$ towards $$\vec{p}_{c}$$, that is, it lays at the position vector
$\vec{p}_{i}+v(t_{i}-t_{c})\frac{\vec{p}_{c}-\vec{p}_{i}}{\|\vec{p}_{c}-\vec{p}_{i}\|}$

The following figure illustrates how TDOA can be used to constrain the multilateration problem solution to one sheet of a two-sheeted circular hyperboloid in $$\mathbb{R}^{3}$$.

Given sensor pair i and c the feasible region of emittor position x is constrained to one sheet of a two-sheeted circular hyperboloid.

With enough of these constraints, i.e., having enough sensors, we can isolate the signal emitter position to a single point as in the following figure in $$\mathbb{R}^{2}$$ space.

Given pairings between tre sensors i=1,2,3 with sensor c=4 we can generate three hyperboles which all intersect at emittor possition x.

## Reduction to Linear Form

The most efficient way to solve the multilateration problem is to reduce the problem to linear form. To determining the location of the signal emitter we can use three unique pairs $$i,j$$ of sensors in addtion to an initial discovery sensor $$c$$. This reduces the problem to a linear system of three equations, and requires at least five sensor.

\begin{alignedat}{2}\|\vec{x}-\vec{p}_{i}\| & = & v(t_{i}-t_{c})+\|\vec{x}-\vec{p}_{c}\|\\ \|\vec{x}-\vec{p}_{i}\|^{2} & = & \left(v(t_{i}-t_{c})+\|\vec{x}-\vec{p}_{c}\|\right)^{2}\\ \|\vec{x}-\vec{p}_{i}\|^{2} & = & v^{2}(t_{i}-t_{c})^{2}+2v(t_{i}-t_{c})\|\vec{x}-\vec{p}_{c}\|+\|\vec{x}-\vec{p}_{c}\|^{2} \end{alignedat}

Introducing another sensor $$j$$ we can eliminate the $$\|x-p_{c}\|$$ term.

{2}-2\|\vec{x}-\vec{p}_{c}\|\\ = v(t_{i}-t_{c})+\frac{\|\vec{x}-\vec{p}_{c}\|^{2}-\begin{split}\|\vec{x}-\vec{p}_{i}\|^{2}\end{split}}{v(t_{i}-t_{c})}
\label{eq:tempRes1}

-2\|\vec{x}-\vec{p}_{c}\|\\ = v(t_{j}-t_{c})+\frac{\|\vec{x}-\vec{p}_{c}\|^{2}-\begin{split}\|\vec{x}-\vec{p}_{j}\|^{2}\end{split}}{v(t_{j}-t_{c})}
\label{eq:tempRes2}

v(t_{j}-t_{c})+\frac{\|\vec{x}-\vec{p}_{c}\|^{2}-\begin{split}\|\vec{x}-\vec{p}_{j}\|^{2}\end{split}}{v(t_{j}-t_{c})}\\ = v(t_{i}-t_{c})+\frac{\|\vec{x}-\vec{p}_{c}\|^{2}-\begin{split}\|\vec{x}-\vec{p}_{i}\|^{2}\end{split}}{v(t_{i}-t_{c})}
\label{eq:tempRes3}

Now, expanding the definitions of the squared distances from each of the sensors and the emitter we get the following.

\begin{alignedat}{2}\begin{split}\|\vec{x}-\vec{p}_{i}\|^{2}\end{split}
& = & \left(\vec{x}-\vec{p}_{i}\right)^{T}\left(\vec{x}-\vec{p}_{i}\right)=\vec{x}^{T}\vec{x}-2\vec{p}_{i}^{T}x+\vec{p}_{i}^{T}\vec{p}_{i}\\
\|\vec{x}-\vec{p}_{j}\|^{2} & = & \vec{x}^{T}\vec{x}-2\vec{p}_{j}^{T}\vec{x}+\vec{p}_{j}^{T}\vec{p}_{j}\\
\|\vec{x}-\vec{p}_{c}\|^{2} & = & \vec{x}^{T}\vec{x}-2\vec{p}_{c}^{T}\vec{x}+\vec{p}_{c}^{T}\vec{p}_{c}
\end{alignedat}
\label{eq:xpDefRe}

\begin{alignedat}{2}v(t_{j}-t_{c})+\frac{2\left(\vec{p}_{j}^{T}-\vec{p}_{c}^{T}\right)\vec{x}+\vec{p}_{c}^{T}\vec{p}_{c}-\vec{p}_{j}^{T}\vec{p}_{j}}{v(t_{j}-t_{c})} \\ = v(t_{i}-t_{c})+\frac{2\left(\vec{p}_{i}^{T}-\vec{p}_{c}^{T}\right)\vec{x}+\vec{p}_{c}^{T}\vec{p}_{c}-\vec{p}_{i}^{T}\vec{p}_{i}}{v(t_{i}-t_{c})}\end{alignedat}
\label{eq:systemRaw}

Notice that the quadratic terms have fallen out and we are left with a very solvable linear system of equations

$\vec{a}_{ijc}^{T}\vec{x}=b_{ijc}$

where

\begin{alignedat}{2}\vec{a}_{ijc} & = & 2\left(v(t_{j}-t_{c})(\vec{p}_{i}-\vec{p}_{c})-v(t_{i}-t_{c})(\vec{p}_{j}-\vec{p}_{c})\right)\in\mathbb{R}^{3}\\ b_{ijc} & = & v(t_{i}-t_{c})\left(v^{2}(t_{j}-t_{c})^{2}-\vec{p}_{j}^{T}\vec{p}_{j}\right)\\ & + & \left(v(t_{i}-t_{c})-v(t_{j}-t_{c})\right)\vec{p}_{c}^{T}\vec{p}_{c}\\ & + & v(t_{j}-t_{c})\left(\vec{p}_{i}^{T}\vec{p}_{i}-v^{2}(t_{i}-t_{c})^{2}\right)\in\mathbb{R} \end{alignedat}

This can be expressed as just

$\textbf{A}x=b$
where each pair $$i,j$$ creates the matrix $$\textbf{A}$$ and vector $$b$$ defined by $$b_{ijc}\in b\in\mathbb{R}^{3}$$ and $$\vec{a}_{ijc}\in \textbf{A}\in\mathbb{R}^{3\times3}$$ such that $$\vec{a}_{ijc}^{T}$$ are the rows of $$\textbf{A}$$. This can be solved using LP or some other more or less direct method. We will be simply computing the iverse of $$\textbf{A}$$ such that
$\hat{x}=\textbf{A}^{-1}b=\textbf{A}\backslash b$

## Application

In order to aid intuition it might be helpful to translate the problem in to a real world interpretation. The problem analogy is that of submarine localization using passive acoustic sensors.

The submarine, i.e., the signal emitter, is placed somewhere in a six cubic kilometer body of water. The acoustic sensors are placed randomly in a one cubic kilometer body of water. Both bodies of water has the coordinate system origin in its center.

The following MATLAB code embodies this scenario: multilateration.m.