Although the Naive Bayes classification algorithm is unlikely to yield a competitive advantage in its standard form, its relative simplicity and intuitiveness make it a suitable candidate for an initial post on the application of Machine Learning to algorithmic trading. Further, the underlying principles of the algorithm are powerful, and with suitable extension and modification it has the potential to form the basis of a profitable trading system.
As an initial example we develop an algorithm to predict future returns. Let Y be a random variable representing the cumulative return to some horizon into the future, denoted $\tau$, and defined as
$$ Y_{T_0 + \tau} = \ln \left( S_{T_0 + \tau} \right) - \ln \left( S_{T_0} \right) $$
where $S$ denotes the price of the asset we are attempting to predict, $\ln$ denotes the natural logarithm, and $T_0$ indicates the current period. Thus, $Y$ is the variable we want to predict. The variable $Y$ is standardized by subtracting the mean and dividing by some measure of dispersion.
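As a concrete sketch of this construction, the target can be built from a price series as follows (the price array and one-period horizon are invented for illustration):

```python
import numpy as np

# Hypothetical daily closing prices (illustrative only)
prices = np.array([100.0, 101.5, 99.8, 102.3, 103.1, 101.9, 104.2, 105.0])

tau = 1  # prediction horizon, in periods

# Cumulative log return over the horizon: ln(S_{T0+tau}) - ln(S_{T0})
y = np.log(prices[tau:]) - np.log(prices[:-tau])

# Standardize: subtract the mean, divide by the standard deviation
y_std = (y - y.mean()) / y.std()
```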
Let $X = \langle X_1, X_2, \dots, X_k \rangle$ be a vector of real-valued stochastic signals used to predict $Y$, which have also been standardized. Bayes' Theorem allows us to relate $Y$ and $X$ and is defined as
$$ P(Y=y | X = x) = \frac { P(Y=y)P(X = x | Y = y) } { P(X = x) } $$
where $x = \langle x_1, x_2, \dots, x_k \rangle$ is an observed signal vector, $X=x$ is shorthand for $X_1 = x_1 \wedge X_2 = x_2 \wedge \dots \wedge X_k = x_k$, and $y$ is an observed value of $Y$. Since the denominator $P(X = x)$ is invariant across values of $Y$, the formula can be simplified to
$$ P(Y=y | X = x) \propto P(Y=y)P(X = x | Y = y) $$
However, there is an issue with this formulation. Given that $Y$ and $X$ are real-valued variables, it may be difficult to assign a probability to any single value of a signal. To address this problem we apply discretization. We define $C$ as a categorical variable which takes the value $c=1$ when $Y < -2\sigma$, the value $c=3$ when $Y > 2\sigma$, and the value $c=2$ otherwise. We apply the same discretization procedure to each signal $X_i$ to form a corresponding categorical variable $X_i^*$. Thus, states 1 and 3 indicate 2-sigma events in the left and right tails, respectively. The new formula reads
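A minimal sketch of this discretization, assuming the input series has already been standardized (so the thresholds are simply $\pm 2$); the sample values are made up:

```python
import numpy as np

def discretize(z):
    """Map a standardized series to states 1, 2, 3:
    1 -> left-tail 2-sigma event (z < -2),
    3 -> right-tail 2-sigma event (z > 2),
    2 -> otherwise."""
    c = np.full(z.shape, 2, dtype=int)
    c[z < -2.0] = 1
    c[z > 2.0] = 3
    return c

# Illustrative standardized observations
z = np.array([-2.5, -0.3, 0.1, 2.4, 1.9])
states = discretize(z)  # -> [1, 2, 2, 3, 2]
```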
$$ P(C=c | X^* = x^*) \propto P(C=c)P(X^* = x^* | C = c) $$
Assuming two predictor signals, the expression $P(C=3 | X_1^*=1 \wedge X_2^*=3)$ reads "the probability of $Y$ experiencing a positive 2-sigma event conditional on a negative 2-sigma event in signal $X_1$ and a positive 2-sigma event in $X_2$". Thus, the probabilistic expression maps to a simple English sentence, which I prefer over more cryptic, black-box algorithms.
The probabilities are estimated from a training set of examples generated using historical data. However, there is one further issue. We can't guarantee that an out-of-sample value of the input vector $x^*$ exists in our training set, which means that we may not be able to estimate $P(X^*=x^* | C = c)$. The so-called Naive Bayes assumption addresses this issue by assuming that the signals $X_1, X_2, \dots, X_k$ are conditionally independent of one another given a particular class $c$. This assumption allows us to express $P(X^*=x^* | C = c)$ in the following form
$$ P(X^*=x^* | C = c) = P\left( \bigwedge_{i=1}^{k} X_i^* = x_i^* \,\middle|\, C=c \right) = \prod_{i=1}^{k} P(X_i^*=x_i^* | C=c) $$
This means that $P(C=c | X^* = x^*)$ can now be expressed as
$$ P(C=c | X^* = x^*) \propto P(C=c) \prod_{i=1}^{k} P(X_i^*=x_i^* | C=c) $$
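Putting the pieces together, a minimal sketch of estimating these probabilities by counting over a discretized training set, and then scoring an observed signal vector, might look as follows. The `fit_naive_bayes` and `posterior` helpers, the Laplace smoothing term `alpha` (added so that unseen signal/class combinations do not produce zero probabilities), and the toy data are all illustrative assumptions, not part of the post:

```python
import numpy as np
from collections import Counter

def fit_naive_bayes(X, c, states=(1, 2, 3), alpha=1.0):
    """Estimate P(C=c) and P(X_i*=x | C=c) from discretized training data.
    X is an (n, k) array of signal states; c is an (n,) array of class states.
    alpha is a Laplace smoothing constant (an added assumption)."""
    n, k = X.shape
    prior = {s: (np.sum(c == s) + alpha) / (n + alpha * len(states))
             for s in states}
    cond = {}  # cond[(i, x, s)] approximates P(X_i* = x | C = s)
    for s in states:
        mask = c == s
        n_s = mask.sum()
        for i in range(k):
            counts = Counter(X[mask, i])
            for x in states:
                cond[(i, x, s)] = (counts[x] + alpha) / (n_s + alpha * len(states))
    return prior, cond

def posterior(x_star, prior, cond, states=(1, 2, 3)):
    """Score each class by the Naive Bayes product P(C=c) * prod_i P(X_i*|C=c),
    then normalize the scores so they sum to one."""
    scores = {s: prior[s] * np.prod([cond[(i, x, s)]
                                     for i, x in enumerate(x_star)])
              for s in states}
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()}

# Toy discretized training set: two signals, five examples (illustrative)
X_train = np.array([[1, 3], [2, 2], [3, 1], [1, 3], [2, 2]])
c_train = np.array([3, 2, 1, 3, 2])

prior, cond = fit_naive_bayes(X_train, c_train)
p = posterior((1, 3), prior, cond)  # P(C=c | X_1*=1, X_2*=3) for each c
```

In this toy data the pattern $(1, 3)$ always coincides with class 3, so the posterior concentrates on the positive 2-sigma state.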
"Hold on!", I hear you say. "In practice the signals may be highly correlated." Well, not if the signals are specifically constructed to be orthogonal, which can be achieved using Principal Component Analysis (PCA). I will address this issue in my next post!