Christoph Rüegg

Math.NET, distributed computing and how an electrical engineer sees the world of complex software

Linear Regression With Math.NET Numerics

Likely the most requested feature for Math.NET Numerics is support for some form of regression, or fitting data to a curve. I’ll show in this article how you can easily compute regressions manually using Math.NET, until we support it out of the box. We already have broad interpolation support, but interpolation is about fitting some curve exactly through a given set of data points and therefore an entirely different problem.

For a regression there are usually much more data points available than curve parameters, so we want to find the parameters that produce the lowest errors on the provided data points, according to some error metric.

Least Squares Linear Regression

If the curve is linear in its parameters, then we’re speaking of linear regression. The problem becomes much simpler and we can leverage the rich linear algebra toolset to find the best parameters, especially if we want to minimize the square of the errors (least squares metric).

In the general case such a curve would be in the form of a linear combination of arbitrary but known functions , scaled by the parameters . Note that none of the functions depends on any of the parameters.

If we have data points , then we can write the whole problem as an overdefined system of equations:

Or in matrix notation:

This is a standard least squares problem and can easily be solved using Math.NET Numerics’s linear algebra classes and the QR decomposition. In literature you’ll usually find algorithms explicitly computing some form of matrix inversion. While symbolically correct, using the QR decomposition instead is numerically more robust. This is a solved problem, after all.

var p = X.QR().Solve(y);

Some matrices of this form have well known names, for example the Vandermonde-Matrix for fitting to a polynomial.

Example: Fitting to a Line

A line can be parametrized by the height at and its slope :

This maps to the general case with parameters as follows:

And therefore the equation system

The complete code when using Math.NET Numerics would look like this:

// data points
var xdata = new double[] { 10, 20, 30 };
var ydata = new double[] { 15, 20, 25 };

// build matrices
var X = DenseMatrix.CreateFromColumns(
  new[] {new DenseVector(xdata.Length, 1), new DenseVector(xdata)});
var y = new DenseVector(ydata);

// solve
var p = X.QR().Solve(y);
var a = p[0];
var b = p[1];

Example: Fitting to an arbitrary linear function

The functions do not have to be linear in at all to work with linear regression, as long as the resulting function remains linear in the parameters . In fact, we can use arbitrary functions, as long as they are defined at all our data points . For example, let’s compute the regression to the following complicated function including the Digamma function , sometimes also known as Psi function:

The resulting equation system in Matrix form:

The complete code with Math.NET Numerics, but this time with F#:

// define our target functions
let f1 x = Math.Sqrt(Math.Exp(x))
let f2 x = SpecialFunctions.DiGamma(x*x)

// create data samples, with chosen parameters and with gaussian noise added
let fy (noise:IContinuousDistribution) x = 2.5*f1(x) - 4.0*f2(x) + noise.Sample()
let xdata = [ 1.0 .. 1.0 .. 10.0 ]
let ydata = xdata |> (fy (Normal.WithMeanVariance(0.0,2.0)))

// build matrix form
let X =
        xdata |> f1 |> vector
        xdata |> f2 |> vector
    |] |> DenseMatrix.CreateFromColumns
let y = vector ydata

// solve
let p = X.QR().Solve(y)
let (a,b) = (p.[0], p.[1])

Note that we use the Math.NET Numerics F# package here (e.g. for the vector function).

Example: Fitting to a Sine

Just like the digamma function we can also target a sine curve. However, to make it more interesting, we’re also looking for phase shift and frequency parameters:

Unfortunately the function now depends on parameters and which is not allowed in linear regression. Indeed, fitting to a fequency in a linear way is not trivial if possible at all, but for a fixed we can leverage the following trigonometric identity:

and therefore

However, note that because of the non-linear transformation on the and parameters, the result will no longer be strictly the least square error solution. While our result would be good enough for some scenarios, we’d either need to compensate or switch to non-linear regression if we need the actual least square error parameters.

The complete code in C# with Math.NET Numerics would look like this:

// data points: we compute y perfectly but then add strong random noise to it
var rnd = new Random(1);
var omega = 1.0d
var xdata = new double[] { -1, 0, 0.1, 0.2, 0.3, 0.4, 0.65, 1.0, 1.2, 2.1, 4.5, 5.0, 6.0 };
var ydata = xdata
  .Select(x => 5 + 2 * Math.Sin(omega*x + 0.2) + 2*(rnd.NextDouble()-0.5)).ToArray();

// build matrices
var X = Matrix.CreateFromColumns(new[] {
    new DenseVector(xdata.Length, 1),
    new DenseVector(xdata.Select(t => Math.Sin(omega*t)).ToArray()),
    new DenseVector(xdata.Select(t => Math.Cos(omega*t)).ToArray())});
var y = new DenseVector(ydata);

// solve
var p = X.QR().Solve(y);
var a = p[0];
var b = SpecialFunctions.Hypotenuse(p[1], p[2]);
var c = Math.Atan2(p[2], p[1]);

The following graph visualizes the resulting regressions. The curve we computed the values from, before adding the strong noise, is shown in black. The red dots show the actual data points with only small noise, the blue dots the points with much stronger noise added. The red and blue curves then show the actual computed regressions for each.