Learning about Kalman filters

This is my weekend project about Kalman filters, which I read about while looking into SLAM (simultaneous localization and mapping), a family of algorithms used to figure out where a robot and the things around it are. You can do SLAM with Kalman filters, though in busy realtime environments you might prefer FastSLAM 2 or some other more scalable method. I chose to focus on Kalman filters as a starting point since they show up as components of those more advanced SLAM methods, but also in classic applications like radar, underscoring their ubiquity and generality. That's a good reason to invest some time in building an intuition for the math behind them, and a good excuse to make some fun visual demos/animations.

Up next:

  1. Visual demo environment

  2. Kalman filter with nonlinear estimation (radar demo)

  3. Kalman filters and SLAM

  4. Wrap up

Visual demo environment

Throughout this post I've included simple 2D animations to illustrate key ideas. In the demos, there's a "mouse" agent that can move and rotate on a bounded <canvas>. The point $(x, y) = (0, 0)$ is the top-left corner of the canvas, and orientation is expressed in radians, where $\theta = \pi$ means the mouse points upward.

Kalman filter with nonlinear estimation (radar demo)

State estimation

The Kalman filter is a loop that can quickly crush uncertainty about the state of a changing system, at least in theory. In order to use it, you need to write down two prediction functions:

  1. State prediction function:

    $$f(x, u): X \rightarrow X$$

    $x$ is your state vector (e.g., position of an object), $u$ is a control vector that your estimator knows about (say, velocity), and $x \in X$.

    This function lets us ask: given some previous state and control input, $(x_{k-1}, u)$, what would I predict the next state $x_k^-$ to be? We don't try to predict $u$ though, so you can't stash things in there that you'd need to measure.

  2. Measurement prediction function:

    $$h(x): X \rightarrow M$$

    The way to read this is: I've already predicted the state ($x$) and I'm about to measure it for real, so what should my measurement be if $x$ matched reality exactly?

    The slickest feature of $h$ is that the vector space for measurements, $M$, isn't required to be the same as the state vector space $X$. That means it's OK to make measurements that don't live in state space: all you need is a way to relate the two. Later that'll let us define a change-of-basis matrix to "pull" measurements into state space, which is very handy for updating our knowledge of state uncertainty (covariances).

The advantages of having these specific predictors are:

Kalman filters are theoretically optimal at reducing uncertainty when prediction functions are linear. I'm interested in nonlinear motion, so we'll look at an extended Kalman filter that trades away theoretical optimality for a little extra generality.

Aside: A downside of having two possibly-nonlinear prediction functions is that they can be complicated or expensive to compute. Sometimes a simpler alternative like the alpha-beta filter can be a better fit.

Writing down the prediction functions

Let's make the Kalman filter concrete by writing down the setup for a simple radar demo. We'll treat our mouse agent like a plane and have it move around in a circle in the sky. Then we'll point a radar gun at the mouse to estimate its position. Due to air turbulence, there's random jitter in both the mouse's true motion and in the radar sensor measurements.

The mouse's state includes position and orientation:

$$
x = \begin{bmatrix} p_x \\ p_y \\ \theta \end{bmatrix}
$$

For smooth circular motion, we'll fix a constant angular speed $\omega$ (in rad/s) and use $v = r\omega$ to get linear velocity:

$$
f(x_k; r, \omega, \Delta t_k) = \begin{bmatrix} p_{x,k} + (r\omega)\Delta t_k \cos\theta_k \\ p_{y,k} + (r\omega)\Delta t_k \sin\theta_k \\ \theta_k + \omega\Delta t_k \end{bmatrix}
$$

When making a sensor reading, we'll get an angle and distance from the stationary radar gun, so given the true position $x$ of the mouse we'd expect the following measurement:

$$
h(x) = \begin{bmatrix} d \\ \alpha \end{bmatrix} = \begin{bmatrix} \sqrt{(x_{g} - p_x)^2 + (y_{g} - p_y)^2} \\ \operatorname{atan2}(y_{g} - p_y, x_{g} - p_x) \end{bmatrix}
$$
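The demos in this post run on a canvas, but to keep the math checkable I'll sketch the pieces in Python with numpy. Here are the two predictors above as code; the radius, angular speed, and radar gun position are made-up stand-ins for whatever the actual demo uses:

```python
import numpy as np

# Assumed demo parameters; stand-ins for the real demo's values.
R_CIRCLE = 100.0            # circle radius r (px)
OMEGA = 0.5                 # angular speed omega (rad/s)
GUN = np.array([0.0, 0.0])  # radar gun position (x_g, y_g), placed at the canvas origin

def f(x, dt):
    """State prediction: advance the mouse along its circle by dt seconds."""
    px, py, theta = x
    v = R_CIRCLE * OMEGA    # linear speed from v = r * omega
    return np.array([px + v * dt * np.cos(theta),
                     py + v * dt * np.sin(theta),
                     theta + OMEGA * dt])

def h(x):
    """Measurement prediction: the (distance, angle) the radar gun should see."""
    px, py, _ = x
    dx, dy = GUN[0] - px, GUN[1] - py
    return np.array([np.hypot(dx, dy), np.arctan2(dy, dx)])
```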

Wiring it all up, the system looks like this. Try cranking the motion and radar noise up/down to see how the system reacts:

[Interactive demo: sliders adjust the motion noise and radar noise.]

Expressing and updating uncertainty

In this section I'm aiming to develop intuition for both how and why the Kalman filter evolves state estimates and its own internal uncertainty. I value this because I like knowing where I might hit limitations or encounter tradeoffs. To that end I'm going to focus on the details that matter the most for a clean conceptual understanding, including the most interesting (to me) derivations, while avoiding exhaustively deriving everything.

To express our uncertainty about the state of the system, we need to make a distinction between the "true" state of the system, $x$, and the current state estimate $\hat{x}$. We assume the true state evolves with noise $w \sim \mathcal{N}(0, Q)$:

$$
x_{k} = f(x_{k-1}) + w_{k-1}
$$

But naturally, we don't add any noise to our predictions:

$$
\hat{x}_k = f(\hat{x}_{k-1})
$$

The state uncertainty $P$ is the covariance of the difference between the "true" and estimated vectors (roughly, how aligned fluctuations in different components of the state estimate are with each other):

$$
P_k = \operatorname{Cov}(\delta x_k) = \operatorname{Cov}(x_k - \hat{x}_k)
$$
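To make the distinction concrete, here's a toy sketch reusing the f from earlier (the Q value is an arbitrary illustrative covariance): the true trajectory accumulates sampled noise while the estimate stays clean, and $\delta x_k$ is the gap between the two.

```python
rng = np.random.default_rng(0)
Q = np.diag([0.5, 0.5, 0.01])           # illustrative process-noise covariance

x_true = np.array([320.0, 240.0, 0.0])  # assumed starting pose
x_hat = x_true.copy()

for _ in range(100):
    w = rng.multivariate_normal(np.zeros(3), Q)  # w ~ N(0, Q)
    x_true = f(x_true, 0.1) + w                  # true state drifts with noise
    x_hat = f(x_hat, 0.1)                        # our prediction stays noise-free

delta_x = x_true - x_hat  # delta x_k: the error whose covariance is P
```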

We don't have a way to actually know the true state $x$, so in order to update our uncertainty after making a fresh state prediction $\hat{x}_k$, we need an approximation. We can do this by taking a first-order Taylor series expansion of $f(x)$ at the new state estimate (here, $F$ is a Jacobian):

$$
\texttt{around } \hat{x}_k: \quad f(x) \approx f(\hat{x}_k) + F_k(x - \hat{x}_k)
$$

Technically, the Taylor series expansion of ff can have higher-order derivatives (Hessian, etc.) that we're simply ignoring. Why? Well, those terms are inconvenient for doing closed-form math and a first-order approximation can be good enough. In real-world usage I imagine you'd need to keep tabs on this approximation error to make sure it doesn't get out of hand.

Thanks to the first-order approximation, there are no awkward power-of-$x$ terms, so we can plug the approximation back into $\delta x_k$ and express the new state error in terms of the old one. This sets us up nicely to update our state uncertainty $P_k$:

$$
\begin{aligned}
\delta x_k &= x_k - \hat{x}_k = f(x_{k-1}) + w_{k-1} - f(\hat{x}_{k-1}) \\
&\approx f(\hat{x}_{k-1}) + F_{k-1}(x_{k-1} - \hat{x}_{k-1}) + w_{k-1} - f(\hat{x}_{k-1}) \\
&= F_{k-1}(x_{k-1} - \hat{x}_{k-1}) + w_{k-1} \\
&= F_{k-1}\delta x_{k-1} + w_{k-1}
\end{aligned}
$$

Now we make the reasonable assumption that $\mathbb{E}[w_k] = 0$: we can always guarantee this by adding the true noise-mean to the prediction model, so no big deal. By linearity of expectation (and since $w_{k-1}$ is independent of $\delta x_{k-1}$, the cross terms vanish), we can solve for the new uncertainty $P_k$:

$$
\begin{aligned}
P_k &= \operatorname{Cov}(\delta x_k) \\
&= \mathbb{E}[\delta x_k \delta x_k^T] \\
&\approx \mathbb{E}[(F_{k-1}\delta x_{k-1} + w_{k-1})(F_{k-1}\delta x_{k-1} + w_{k-1})^T] \\
&= \mathbb{E}[(F_{k-1}\delta x_{k-1} + w_{k-1})(\delta x_{k-1}^T F_{k-1}^T + w_{k-1}^T)] \\
&= \mathbb{E}[F_{k-1}\delta x_{k-1} \delta x_{k-1}^T F_{k-1}^T] + \mathbb{E}[w_{k-1}w_{k-1}^T] \\
&= F_{k-1}\mathbb{E}[\delta x_{k-1} \delta x_{k-1}^T]F_{k-1}^T + \mathbb{E}[w_{k-1}w_{k-1}^T] \\
&= F_{k-1}P_{k-1}F_{k-1}^T + Q
\end{aligned}
$$

At this point we've fully covered a priori state prediction for time step $k$:

$$
\begin{aligned}
\hat{x}_k^- &= f(\hat{x}_{k-1}) \\
P_k^- &= F_{k-1}P_{k-1}F_{k-1}^T + Q
\end{aligned}
$$
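In code, the a priori step is only a couple of lines. F_jacobian here is the Jacobian of f, which we write out in the radar demo section below:

```python
def predict(x_hat, P, Q, dt):
    """A priori EKF step: push the estimate and its uncertainty forward."""
    F = F_jacobian(x_hat, dt)   # linearize f at the current estimate
    x_hat = f(x_hat, dt)        # x_k^- = f(x_{k-1})
    P = F @ P @ F.T + Q         # P_k^- = F P F^T + Q
    return x_hat, P
```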

If we take a measurement, $z_k$, we make an a posteriori correction that takes the measurement into account. The empirical measurement is modeled as the measurement we'd expect of the true state plus a noise term $r \sim \mathcal{N}(0, R)$. The correction tool we use is called a residual, $\hat{y}_k$:

$$
\begin{aligned}
z_k &= h(x_k) + r_k \\
\hat{y}_k &= z_k - h(\hat{x}_k^-)
\end{aligned}
$$

If we added the residual $\hat{y}_k$ to the a priori state estimate $\hat{x}_k^-$ without any scaling, it would be like saying: don't trust my state predictions at all, just clamp to the empirical $z_k$. On the other hand, if we ignored the residual, it'd say the opposite: don't trust the sensors at all, just clamp to the predicted $\hat{x}_k^-$.

Ideally, we'd have a Kalman scaling factor for $\hat{y}_k$ that's higher when we have high state uncertainty, and lower when we have high measurement uncertainty (the self-balancing see-saw):

$$
\begin{aligned}
K_k &= \frac{\textit{state uncertainty}}{\textit{measurement uncertainty}} \\
x_k^+ &= x_k^- + K_k\hat{y}_k
\end{aligned}
$$

To define the real Kalman scaling factor, we need a way to express measurement uncertainty, $S_k$, so let's tackle that first. We'll use the exact same tactic used to approximate state uncertainty: linearize $h$ with a first-order Taylor series expansion and take the covariance of the residual:

$$
\begin{aligned}
\hat{y}_k &= h(x_k) + r_k - h(\hat{x}_k^-) \\
&\approx h(\hat{x}_k^-) + H_k\delta x_k + r_k - h(\hat{x}_k^-) \\
&= H_k\delta x_k + r_k \\
S_k &= \operatorname{Cov}(\hat{y}_k) = H_k P_k^- H_k^T + R
\end{aligned}
$$

At this point I'm explicitly prioritizing an intuitive picture over a rigorous formal derivation. It's possible to get the same result by writing down the mean-squared error of a linear Kalman filter's estimate, taking the gradient, and solving for the critical point. For me, it was more helpful to think about $K_k$ like a fraction. If the "numerator" is state uncertainty, that's just $P_k^-$. If the "denominator" is measurement uncertainty, that's just an inverted $S_k$. To pull $S_k^{-1}$ from measurement space into state space, we use the change-of-basis Jacobian $H_k$ (note that $H_k$ maps $X \rightarrow M$ and $H_k^T$ maps $M \rightarrow X$):

$$
K_k = P_k^- H_k^T S_k^{-1}
$$

The fraction metaphor highlights an edge case: what happens if you divide by zero, i.e., $S_k$ isn't invertible? $K_k$ could blow up and your program may crash. I chose to address this by adding a tiny positive value to the diagonal of $R$ to force $\det(S_k) > 0$.

Measurement changes the state estimate $x_k^+$, but it should change the posterior state uncertainty too. To do that, we scale down uncertainty in components of $\hat{x}$ where measurement instills higher confidence, and scale up uncertainty about components where the new measurement muddies the waters (here, the $H_k$ factor is again needed to "pull" $P_k^-$ into measurement space where $K_k$ operates):

$$
P_k^+ = (I - K_kH_k)P_k^-
$$

I think of the $P_k^+$ update as "scale my uncertainty by 1 minus the scaling factor I needed for my measurement correction".
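Here's the whole a posteriori correction as a sketch, including the tiny diagonal bump on $R$ from the divide-by-zero discussion above (the eps value is my own illustrative choice):

```python
def update(x_hat, P, z, R, eps=1e-9):
    """A posteriori EKF step: fold one radar measurement into the estimate."""
    H = H_jacobian(x_hat)                  # linearize h at the estimate
    y = z - h(x_hat)                       # residual
    # (A real radar demo should also wrap the angle component of y to (-pi, pi].)
    R_safe = R + eps * np.eye(len(z))      # bump R's diagonal so S stays invertible
    S = H @ P @ H.T + R_safe               # S_k = H P H^T + R
    K = P @ H.T @ np.linalg.inv(S)         # K_k = P H^T S^{-1}
    x_hat = x_hat + K @ y                  # x_k^+ = x_k^- + K y
    P = (np.eye(len(x_hat)) - K @ H) @ P   # P_k^+ = (I - K H) P_k^-
    return x_hat, P
```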

Another way to think about the posterior uncertainty update is that the new $P_k^+ = \operatorname{Cov}(\delta x_k^+)$, where:

$$
\begin{aligned}
\delta x_k^+ &= (I - K_kH_k)\delta x_k^- - K_k r_k \\
\operatorname{Cov}(\delta x_k^+) &= (I - K_kH_k)P^-_k(I - K_kH_k)^T + K_kRK_k^T
\end{aligned}
$$

This is called the "Joseph form", which is used sometimes if you have a custom Kalman scaling factor KkK_k instead of the one that's optimal for linear filters. If you use the classic KkK_k, both versions of Pk+P_k^+ are in theory equivalent.

Radar demo

We've covered how to set up and update a Kalman filter. To make a demo we need the Jacobians for $f$ and $h$:

$$
\begin{aligned}
F_k &= \frac{\partial f}{\partial x_k} = \begin{bmatrix} 1 & 0 & -(r\omega)\Delta t_k \sin\theta_k \\ 0 & 1 & \ \ (r\omega)\Delta t_k \cos\theta_k \\ 0 & 0 & 1 \end{bmatrix} \\
H_k &= \frac{\partial h}{\partial x_k} = \begin{bmatrix} \frac{p_x - x_g}{d} & \frac{p_y - y_g}{d} & 0 \\ \frac{y_g - p_y}{d^2} & \frac{p_x - x_g}{d^2} & 0 \end{bmatrix}
\end{aligned}
$$

Here $d$ is just shorthand for $\sqrt{(x_{g} - p_x)^2 + (y_{g} - p_y)^2}$.
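Sketched in numpy against the same assumed parameters from earlier (R_CIRCLE, OMEGA, GUN):

```python
def F_jacobian(x, dt):
    """Jacobian of f with respect to the state, evaluated at x."""
    _, _, theta = x
    v = R_CIRCLE * OMEGA
    return np.array([[1.0, 0.0, -v * dt * np.sin(theta)],
                     [0.0, 1.0,  v * dt * np.cos(theta)],
                     [0.0, 0.0,  1.0]])

def H_jacobian(x):
    """Jacobian of h with respect to the state, evaluated at x."""
    px, py, _ = x
    dx, dy = GUN[0] - px, GUN[1] - py
    d2 = dx * dx + dy * dy
    d = np.sqrt(d2)
    return np.array([[(px - GUN[0]) / d,  (py - GUN[1]) / d,  0.0],
                     [(GUN[1] - py) / d2, (px - GUN[0]) / d2, 0.0]])
```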

We also need to define the initial system state. Let's position $x_0 = \hat{x}_0$ at the center of the canvas as before. Let's also set $P_0$ to 0 everywhere except for the diagonal entries corresponding to the position uncertainty, which we'll set to 64. As the system evolves below, you should see the position estimate improve. Since this is a simulation, we have the luxury of being able to calculate the true position error $\delta x_k$, which you'll see below. Try changing the noise parameters to see how the uncertainty spikes and settles:

[Interactive demo: sliders adjust the motion noise and radar noise.]

A cool feature here is that once the estimated uncertainty reaches a low level, you can ratchet up the radar noise really high and uncertainty tends to remain fixed. This is the see-saw in action, up-weighting the state prediction when the measurement becomes unreliable.
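Putting the pieces together, a minimal driver loop for this setup might look like the sketch below. The initial pose, R, dt, and the radar_readings stream are all illustrative assumptions:

```python
x_hat = np.array([320.0, 240.0, 0.0])  # assumed canvas center, heading 0
P = np.diag([64.0, 64.0, 0.0])         # position variance 64, heading known exactly
R = np.diag([4.0, 0.01])               # illustrative radar noise (distance, angle)

for z in radar_readings:               # hypothetical stream of (d, alpha) readings
    x_hat, P = predict(x_hat, P, Q, dt=0.1)
    x_hat, P = update(x_hat, P, z, R)
```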

Kalman filters and SLAM

A robot can maintain Kalman filters for all the objects in its environment and itself. One way I can see this working is:

This hand-waves away a fair bit of complexity related to sensor fusion and object registration. There are a few issues with using a Kalman filter for each object that I want to unpack first.

When we measure an object, we might get a $(d, \alpha)$ readout relative to the current state estimate of the robot's position $\hat{x}_k$. But that position is uncertain, and our level of uncertainty about it evolves. In the next time step $k+1$, we may have a new $\hat{x}_{k+1}$ and $P_{k+1}$ for our robot... because it moved. When this happens, you need to update the EKF state for each object to reflect the change in state uncertainty. If you have an extended Kalman filter but the true motion of your system is highly nonlinear, then all of those linear approximation errors could compound and poison state estimates for all objects.

While it seems plausible that you could find a neat, closed-form way to make all these updates, it also seems fairly computationally expensive. Quoting from this 2006 survey:

The observation update step requires that all landmarks and the joint covariance matrix be updated every time an observation is made. Naively, this means computation grows quadratically with the number of landmarks.

In addition to the computational cost, you're restricted to a unimodal Gaussian estimate of the robot's state. In some contexts that's perfectly fine. But we might not always want that. There's a famous "kidnapped robot problem" where a Roomba gets dropped into a different room that looks identical to the one it was in. In that case, we want a probability spike for the Roomba being in either one of the two rooms, instead of a diffuse blob that could mean it's anywhere.

Particle filters are mentioned as one of the tools used to address these problems. I'd be interested in covering them next.

Here are some more open-ended problems I'd like to pull the thread on:

Wrap up

The big idea about Kalman filters is that they fuse predictions and measurements with quick (in some cases optimal) reduction in uncertainty. Their structure provides resiliency to noisy measurements and bogus state predictions, letting us describe an evolving system coherently.

Having written and debugged an extended Kalman filter, this feels like a good stopping point. We took the time to build deep intuition for how and why it works, and we saw where things can go wrong:

This gives a decent feel for when Kalman filters are (or aren't) a good fit for a problem. That's a good foundation for chasing more advanced SLAM methods that build on the same ideas around managing uncertainty in tractable ways.