Learning about Kalman filters

This is my weekend project about Kalman filters, which I read about while looking into SLAM (simultaneous localization and mapping), a family of algorithms used to figure out where a robot and the things around it are. You can do SLAM with Kalman filters, though in busy realtime environments you might prefer FastSLAM 2 or some other more scalable method. I chose to focus on Kalman filters as a starting point since they show up as components of those more advanced SLAM methods, but also in classic applications like radar, underscoring their ubiquity and generality. That's a good reason to invest some time in building an intuition for the math behind them, and a good excuse to make some fun visual demos/animations.

Up next:

  1. Visual demo environment

  2. Kalman filter with nonlinear estimation (radar demo)

  3. Kalman filters and SLAM

  4. Wrap up

Visual demo environment

Throughout this post I've included simple 2D animations to illustrate key ideas. In the demos, there's a "mouse" agent that can move and rotate on a bounded <canvas>. The point $(x, y) = (0, 0)$ is the top-left corner of the canvas, and orientation is expressed in radians, where $\theta = \pi$ means the mouse points upward.

Kalman filter with nonlinear estimation (radar demo)

State estimation

The Kalman filter is a loop that can quickly crush uncertainty about the state of a changing system, at least in theory. In order to use it, you need to write down two prediction functions:

  1. State prediction function:

    $$f(x, u): X \rightarrow X$$

    $x$ is your state vector (e.g., position of an object), $u$ is a control vector that your estimator knows about (say, velocity), and $x \in X$.

    This function lets us ask: given some previous state and control input, $(x_{k-1}, u)$, what would I predict the next state $x_k^-$ to be? We don't try to predict $u$ though, so you can't stash things in there that you'd need to measure.

  2. Measurement prediction function:

    $$h(x): X \rightarrow M$$

    The way to read this is: I've already predicted the state ($x$) and I'm about to measure it for real, so what should my measurement be if $x$ matched reality exactly?

    The slickest feature of $h$ is that the vector space for measurements, $M$, isn't required to be the same as the state vector space $X$. That means it's OK to make measurements that don't live in state space: all you need is a way to relate the two. Later that'll let us define a change-of-basis matrix to "pull" measurements into state space, which is very handy for updating our knowledge of state uncertainty (covariances).

The advantages of having these specific predictors are:

Kalman filters are theoretically optimal at reducing uncertainty when prediction functions are linear. I'm interested in nonlinear motion, so we'll look at an extended Kalman filter that trades away theoretical optimality for a little extra generality.

Aside: A downside of having two possibly-nonlinear prediction functions is that they can be complicated or expensive to compute. Sometimes a simpler alternative like the alpha-beta filter can be a better fit.

Writing down the prediction functions

Let's make the Kalman filter concrete by writing down the setup for a simple radar demo. We'll treat our mouse agent like a plane and have it move around in a circle in the sky. Then we'll point a radar gun at the mouse to estimate its position. Due to air turbulence, there's random jitter in both the mouse's true motion and in the radar sensor measurements.

The mouse's state includes position and orientation:

$$
x = \begin{bmatrix} p_x \\ p_y \\ \theta \end{bmatrix}
$$

For smooth circular motion, we'll fix a constant angular speed $\omega$ (in rad/s) and use $v = r\omega$ to get linear velocity:

$$
f(x_k; r, \omega, \Delta t_k) = \begin{bmatrix} p_{x,k} + (r\omega)\Delta t_k \cos\theta_k \\ p_{y,k} + (r\omega)\Delta t_k \sin\theta_k \\ \theta_k + \omega\Delta t_k \end{bmatrix}
$$

When making a sensor reading, we'll get an angle and distance from the stationary radar gun, so given the true position $x$ of the mouse we'd expect the following measurement:

$$
h(x) = \begin{bmatrix} d \\ \alpha \end{bmatrix} = \begin{bmatrix} \sqrt{(x_{g} - p_x)^2 + (y_{g} - p_y)^2} \\ \operatorname{atan2}(y_{g} - p_y, x_{g} - p_x) \end{bmatrix}
$$
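The demos in this post run on a canvas, but to keep the math checkable I'll sketch the pieces in Python with numpy. Here are the two predictors above as code; the radius, angular speed, and radar gun position are made-up stand-ins for whatever the actual demo uses:

```python
import numpy as np

# Assumed demo parameters; stand-ins for the real demo's values.
R_CIRCLE = 100.0            # circle radius r (px)
OMEGA = 0.5                 # angular speed omega (rad/s)
GUN = np.array([0.0, 0.0])  # radar gun position (x_g, y_g), placed at the canvas origin

def f(x, dt):
    """State prediction: advance the mouse along its circle by dt seconds."""
    px, py, theta = x
    v = R_CIRCLE * OMEGA    # linear speed from v = r * omega
    return np.array([px + v * dt * np.cos(theta),
                     py + v * dt * np.sin(theta),
                     theta + OMEGA * dt])

def h(x):
    """Measurement prediction: the (distance, angle) the radar gun should see."""
    px, py, _ = x
    dx, dy = GUN[0] - px, GUN[1] - py
    return np.array([np.hypot(dx, dy), np.arctan2(dy, dx)])
```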

Wiring it all up, the system looks like this. Try cranking the motion and radar noise up/down to see how the system reacts:

[Interactive demo: sliders adjust the motion noise and radar noise.]

Expressing and updating uncertainty

In this section I'm aiming to develop intuition for both how and why the Kalman filter evolves state estimates and its own internal uncertainty. I value this because I like knowing where I might hit limitations or encounter tradeoffs. To that end I'm going to focus on the details that matter the most for a clean conceptual understanding, including the most interesting (to me) derivations, while avoiding exhaustively deriving everything.

To express our uncertainty about the state of the system, we need to make a distinction between the "true" state of the system, $x$, and the current state estimate $\hat{x}$. We assume the true state evolves with noise $w \sim \mathcal{N}(0, Q)$:

$$
x_{k} = f(x_{k-1}) + w_{k-1}
$$

But naturally, we don't add any noise to our predictions:

$$
\hat{x}_k = f(\hat{x}_{k-1})
$$

The state uncertainty $P$ is the covariance of the difference between the "true" and estimated vectors (roughly, how aligned fluctuations in different components of the state estimate are with each other):

$$
P_k = \operatorname{Cov}(\delta x_k) = \operatorname{Cov}(x_k - \hat{x}_k)
$$
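To make the distinction concrete, here's a toy sketch reusing the f from earlier (the Q value is an arbitrary illustrative covariance): the true trajectory accumulates sampled noise while the estimate stays clean, and $\delta x_k$ is the gap between the two.

```python
rng = np.random.default_rng(0)
Q = np.diag([0.5, 0.5, 0.01])           # illustrative process-noise covariance

x_true = np.array([320.0, 240.0, 0.0])  # assumed starting pose
x_hat = x_true.copy()

for _ in range(100):
    w = rng.multivariate_normal(np.zeros(3), Q)  # w ~ N(0, Q)
    x_true = f(x_true, 0.1) + w                  # true state drifts with noise
    x_hat = f(x_hat, 0.1)                        # our prediction stays noise-free

delta_x = x_true - x_hat  # delta x_k: the error whose covariance is P
```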

We don't have a way to actually know the true state $x$, so in order to update our uncertainty after making a fresh state prediction $\hat{x}_k$, we need an approximation. We can do this by taking a first-order Taylor series expansion of $f(x)$ at the new state estimate (here, $F$ is a Jacobian):

$$
\texttt{around } \hat{x}_k: \quad f(x) \approx f(\hat{x}_k) + F_k(x - \hat{x}_k)
$$

Technically, the Taylor series expansion of ff can have higher-order derivatives (Hessian, etc.) that we're simply ignoring. Why? Well, those terms are inconvenient for doing closed-form math and a first-order approximation can be good enough. In real-world usage I imagine you'd need to keep tabs on this approximation error to make sure it doesn't get out of hand.

Thanks to the first-order approximation, there are no awkward power-of-$x$ terms, so we can plug the approximation back into $\delta x_k$ and express the new state error in terms of the old one. This sets us up nicely to update our state uncertainty $P_k$:

$$
\begin{aligned}
\delta x_k &= x_k - \hat{x}_k = f(x_{k-1}) + w_{k-1} - f(\hat{x}_{k-1}) \\
&\approx f(\hat{x}_{k-1}) + F_{k-1}(x_{k-1} - \hat{x}_{k-1}) + w_{k-1} - f(\hat{x}_{k-1}) \\
&= F_{k-1}(x_{k-1} - \hat{x}_{k-1}) + w_{k-1} \\
&= F_{k-1}\delta x_{k-1} + w_{k-1}
\end{aligned}
$$

Now we make the reasonable assumption that $\mathbb{E}[w_k] = 0$: we can always guarantee this by adding the true noise-mean to the prediction model, so no big deal. By linearity of expectation (and since $w_{k-1}$ is independent of $\delta x_{k-1}$, the cross terms vanish), we can solve for the new uncertainty $P_k$:

$$
\begin{aligned}
P_k &= \operatorname{Cov}(\delta x_k) \\
&= \mathbb{E}[\delta x_k \delta x_k^T] \\
&\approx \mathbb{E}[(F_{k-1}\delta x_{k-1} + w_{k-1})(F_{k-1}\delta x_{k-1} + w_{k-1})^T] \\
&= \mathbb{E}[(F_{k-1}\delta x_{k-1} + w_{k-1})(\delta x_{k-1}^T F_{k-1}^T + w_{k-1}^T)] \\
&= \mathbb{E}[F_{k-1}\delta x_{k-1} \delta x_{k-1}^T F_{k-1}^T] + \mathbb{E}[w_{k-1}w_{k-1}^T] \\
&= F_{k-1}\mathbb{E}[\delta x_{k-1} \delta x_{k-1}^T]F_{k-1}^T + \mathbb{E}[w_{k-1}w_{k-1}^T] \\
&= F_{k-1}P_{k-1}F_{k-1}^T + Q
\end{aligned}
$$

At this point we've fully covered a priori state prediction for time step $k$:

$$
\begin{aligned}
\hat{x}_k^- &= f(\hat{x}_{k-1}) \\
P_k^- &= F_{k-1}P_{k-1}F_{k-1}^T + Q
\end{aligned}
$$
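In code, the a priori step is only a couple of lines. F_jacobian here is the Jacobian of f, which we write out in the radar demo section below:

```python
def predict(x_hat, P, Q, dt):
    """A priori EKF step: push the estimate and its uncertainty forward."""
    F = F_jacobian(x_hat, dt)   # linearize f at the current estimate
    x_hat = f(x_hat, dt)        # x_k^- = f(x_{k-1})
    P = F @ P @ F.T + Q         # P_k^- = F P F^T + Q
    return x_hat, P
```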

If we take a measurement, $z_k$, we make an a posteriori correction that takes the measurement into account. The empirical measurement is modeled as the measurement we'd expect of the true state plus a noise term $r \sim \mathcal{N}(0, R)$. The correction tool we use is called a residual, $\hat{y}_k$:

$$
\begin{aligned}
z_k &= h(x_k) + r_k \\
\hat{y}_k &= z_k - h(\hat{x}_k^-)
\end{aligned}
$$

If we added the residual $\hat{y}_k$ to the a priori state estimate $\hat{x}_k^-$ without any scaling, it would be like saying: don't trust my state predictions at all, just clamp to the empirical $z_k$. On the other hand, if we ignored the residual, it'd say the opposite: don't trust the sensors at all, just clamp to the predicted $\hat{x}_k^-$.

Ideally, we'd have a Kalman scaling factor for $\hat{y}_k$ that's higher when we have high state uncertainty, and lower when we have high measurement uncertainty (the self-balancing see-saw):

$$
\begin{aligned}
K_k &= \frac{\textit{state uncertainty}}{\textit{measurement uncertainty}} \\
x_k^+ &= x_k^- + K_k\hat{y}_k
\end{aligned}
$$

To define the real Kalman scaling factor, we need a way to express measurement uncertainty, $S_k$, so let's tackle that first. We'll use the exact same tactic used to approximate state uncertainty: linearize $h$ with a first-order Taylor series expansion and take the covariance of the residual:

$$
\begin{aligned}
\hat{y}_k &= h(x_k) + r_k - h(\hat{x}_k^-) \\
&\approx h(\hat{x}_k^-) + H_k\delta x_k + r_k - h(\hat{x}_k^-) \\
&= H_k\delta x_k + r_k \\
S_k &= \operatorname{Cov}(\hat{y}_k) = H_k P_k^- H_k^T + R
\end{aligned}
$$

At this point I'm explicitly prioritizing an intuitive picture over a rigorous formal derivation. It's possible to get the same result by writing down the mean-squared error of a linear Kalman filter's estimate, taking the gradient, and solving for the critical point. For me, it was more helpful to think about $K_k$ like a fraction. If the "numerator" is state uncertainty, that's just $P_k^-$. If the "denominator" is measurement uncertainty, that's just an inverted $S_k$. To pull $S_k^{-1}$ from measurement space into state space, we use the change-of-basis Jacobian $H_k$ (note that $H_k$ maps $X \rightarrow M$ and $H_k^T$ maps $M \rightarrow X$):

$$
K_k = P_k^- H_k^T S_k^{-1}
$$

The fraction metaphor highlights an edge case: what happens if you divide by zero, i.e., $S_k$ isn't invertible? $K_k$ could blow up and your program may crash. I chose to address this by adding a tiny positive value to the diagonal of $R$ to force $\det(S_k) > 0$.

Measurement changes the state estimate $x_k^+$, but it should change the posterior state uncertainty too. To do that, we scale down uncertainty in components of $\hat{x}$ where measurement instills higher confidence, and scale up uncertainty about components where the new measurement muddies the waters (here, the $H_k$ factor is again needed to "pull" $P_k^-$ into measurement space where $K_k$ operates):

$$
P_k^+ = (I - K_kH_k)P_k^-
$$

I think of the $P_k^+$ update as "scale my uncertainty by 1 minus the scaling factor I needed for my measurement correction".
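Here's the whole a posteriori correction as a sketch, including the tiny diagonal bump on $R$ from the divide-by-zero discussion above (the eps value is my own illustrative choice):

```python
def update(x_hat, P, z, R, eps=1e-9):
    """A posteriori EKF step: fold one radar measurement into the estimate."""
    H = H_jacobian(x_hat)                  # linearize h at the estimate
    y = z - h(x_hat)                       # residual
    # (A real radar demo should also wrap the angle component of y to (-pi, pi].)
    R_safe = R + eps * np.eye(len(z))      # bump R's diagonal so S stays invertible
    S = H @ P @ H.T + R_safe               # S_k = H P H^T + R
    K = P @ H.T @ np.linalg.inv(S)         # K_k = P H^T S^{-1}
    x_hat = x_hat + K @ y                  # x_k^+ = x_k^- + K y
    P = (np.eye(len(x_hat)) - K @ H) @ P   # P_k^+ = (I - K H) P_k^-
    return x_hat, P
```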

Another way to think about the posterior uncertainty update is that the new $P_k^+ = \operatorname{Cov}(\delta x_k^+)$, where:

$$
\begin{aligned}
\delta x_k^+ &= (I - K_kH_k)\delta x_k^- - K_k r_k \\
\operatorname{Cov}(\delta x_k^+) &= (I - K_kH_k)P^-_k(I - K_kH_k)^T + K_kRK_k^T
\end{aligned}
$$

This is called the "Joseph form", which is used sometimes if you have a custom Kalman scaling factor KkK_k instead of the one that's optimal for linear filters. If you use the classic KkK_k, both versions of Pk+P_k^+ are in theory equivalent.

Radar demo

We've covered how to set up and update a Kalman filter. To make a demo we need the Jacobians for $f$ and $h$:

$$
\begin{aligned}
F_k &= \frac{\partial f}{\partial x_k} = \begin{bmatrix} 1 & 0 & -(r\omega)\Delta t_k \sin\theta_k \\ 0 & 1 & \ \ (r\omega)\Delta t_k \cos\theta_k \\ 0 & 0 & 1 \end{bmatrix} \\
H_k &= \frac{\partial h}{\partial x_k} = \begin{bmatrix} \frac{p_x - x_g}{d} & \frac{p_y - y_g}{d} & 0 \\ \frac{y_g - p_y}{d^2} & \frac{p_x - x_g}{d^2} & 0 \end{bmatrix}
\end{aligned}
$$

Here $d$ is just shorthand for $\sqrt{(x_{g} - p_x)^2 + (y_{g} - p_y)^2}$.
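Sketched in numpy against the same assumed parameters from earlier (R_CIRCLE, OMEGA, GUN):

```python
def F_jacobian(x, dt):
    """Jacobian of f with respect to the state, evaluated at x."""
    _, _, theta = x
    v = R_CIRCLE * OMEGA
    return np.array([[1.0, 0.0, -v * dt * np.sin(theta)],
                     [0.0, 1.0,  v * dt * np.cos(theta)],
                     [0.0, 0.0,  1.0]])

def H_jacobian(x):
    """Jacobian of h with respect to the state, evaluated at x."""
    px, py, _ = x
    dx, dy = GUN[0] - px, GUN[1] - py
    d2 = dx * dx + dy * dy
    d = np.sqrt(d2)
    return np.array([[(px - GUN[0]) / d,  (py - GUN[1]) / d,  0.0],
                     [(GUN[1] - py) / d2, (px - GUN[0]) / d2, 0.0]])
```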

We also need to define the initial system state. Let's position $x_0 = \hat{x}_0$ at the center of the canvas as before. Let's also set $P_0$ to 0 everywhere except for the diagonal entries corresponding to the position uncertainty, which we'll set to 64. As the system evolves below, you should see the position estimate improve. Since this is a simulation, we have the luxury of being able to calculate the true position error $\delta x_k$, which you'll see below. Try changing the noise parameters to see how the uncertainty spikes and settles:

[Interactive demo: sliders adjust the motion noise and radar noise.]

A cool feature here is that once the estimated uncertainty reaches a low level, you can ratchet up the radar noise really high and uncertainty tends to remain fixed. This is the see-saw in action, up-weighting the state prediction when the measurement becomes unreliable.
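Putting the pieces together, a minimal driver loop for this setup might look like the sketch below. The initial pose, R, dt, and the radar_readings stream are all illustrative assumptions:

```python
x_hat = np.array([320.0, 240.0, 0.0])  # assumed canvas center, heading 0
P = np.diag([64.0, 64.0, 0.0])         # position variance 64, heading known exactly
R = np.diag([4.0, 0.01])               # illustrative radar noise (distance, angle)

for z in radar_readings:               # hypothetical stream of (d, alpha) readings
    x_hat, P = predict(x_hat, P, Q, dt=0.1)
    x_hat, P = update(x_hat, P, z, R)
```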

Kalman filters and SLAM

A robot can maintain Kalman filters for all the objects in its environment and itself. One way I can see this working is:

This hand-waves away a fair bit of complexity related to sensor fusion and object registration. There are a few issues with using a Kalman filter for each object that I want to unpack first.

When we measure an object, we might get a $(d, \alpha)$ readout relative to the current state estimate of the robot's position $\hat{x}_k$. But that position is uncertain, and our level of uncertainty about it evolves. In the next time step $k+1$, we may have a new $\hat{x}_{k+1}$ and $P_{k+1}$ for our robot... because it moved. When this happens, you need to update the EKF state for each object to reflect the change in state uncertainty. If you have an extended Kalman filter but the true motion of your system is highly nonlinear, then all of those linear approximation errors could compound and poison state estimates for all objects.

While it seems plausible that you could find a neat, closed-form way to make all these updates, it also seems fairly computationally expensive. Quoting from this 2006 survey:

The observation update step requires that all landmarks and the joint covariance matrix be updated every time an observation is made. Naively, this means computation grows quadratically with the number of landmarks.

In addition to the computational cost, you're restricted to a unimodal Gaussian estimate of the robot's state. In some contexts that's perfectly fine. But we might not always want that. There's a famous "kidnapped robot problem" where a Roomba gets dropped into a different room that looks identical to the one it was in. In that case, we want a probability spike for the Roomba being in either one of the two rooms, instead of a diffuse blob that could mean it's anywhere.

Particle filters are mentioned as one of the tools used to address these problems. I'd be interested in covering them next.

Here are some more open-ended problems I'd like to pull the thread on:

Wrap up

The big idea about Kalman filters is that they fuse predictions and measurements with quick (in some cases optimal) reduction in uncertainty. Their structure provides resiliency to noisy measurements and bogus state predictions, letting us describe an evolving system coherently.

Having written and debugged an extended Kalman filter, this feels like a good stopping point. We took the time to build deep intuition for how and why it works, and we saw where things can go wrong:

This gives a decent feel for when Kalman filters are (or aren't) a good fit for a problem. That's a good foundation for chasing more advanced SLAM methods that build on the same ideas around managing uncertainty in tractable ways.