Christoph Rüegg

Loading native DLLs in F# Interactive

2013-10-05T18:00:00+02:00

F# Interactive (FSI) is a very convenient environment to execute pieces of F# code on the fly. You can even reference managed assemblies using the #r and #I preprocessor directives. However, if one of the referenced assemblies tries to use a native DLL via P/Invoke, you might end up with a DllNotFoundException, even if the native DLL is in the same folder as the managed assembly and that folder has been included with the #I directive. Note that it is not possible to reference native DLLs explicitly in .Net.

The reason is that finding and loading such DLLs in .Net works the same way as in all native applications on Windows and follows the standard search order. When launched from within Visual Studio, the working directory of the F# Interactive process is the path where it is installed, in my case C:\Program Files (x86)\Microsoft SDKs\F#\3.0\Framework\v4.0. Naturally it has no chance of finding a DLL in your script folder and fails.

There are several ways to tell Windows where to look for the DLL.

Note that the techniques described here are only needed when using F# Interactive or running F# scripts, not in normal compiled applications (where you'd just tell the IDE or build tool to copy the native DLLs to the output path).

Change the Working Directory

The simplest way is to set the process' working directory to the directory where the DLL is located. If it is in the same place as your script file, you can do this with a one-liner:

`System.Environment.CurrentDirectory <- __SOURCE_DIRECTORY__`

Set Path Environment Variable

Besides the working directory, Windows also searches the directories listed in the Path environment variable when looking for the DLL. Environment variables can be defined globally or on your user account, but also locally to a running process.

You can append the path to the process-local Path variable like this:

open System

Environment.SetEnvironmentVariable("Path",
    Environment.GetEnvironmentVariable("Path") + ";" + __SOURCE_DIRECTORY__)
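When a script is evaluated repeatedly in the same FSI session, the plain append above adds the same folder again on every run. A minimal sketch that skips the append if the folder is already listed could look like this. The helper name appendToProcessPath is made up for illustration, and the comparison assumes Windows-style, case-insensitive paths:

```fsharp
open System

/// Appends dir to the process-local Path variable unless it is already listed.
/// (Hypothetical helper, not part of any library.)
let appendToProcessPath (dir: string) =
    let current =
        match Environment.GetEnvironmentVariable "Path" with
        | null -> ""
        | value -> value
    let entries = current.Split([| ';' |], StringSplitOptions.RemoveEmptyEntries)
    // Ignore case and a trailing backslash when comparing folders
    let sameDir (a: string) (b: string) =
        String.Equals(a.TrimEnd('\\'), b.TrimEnd('\\'), StringComparison.OrdinalIgnoreCase)
    if not (entries |> Array.exists (sameDir dir)) then
        Environment.SetEnvironmentVariable("Path", String.Join(";", Array.append entries [| dir |]))

appendToProcessPath __SOURCE_DIRECTORY__
```

Like the original snippet, this only changes the variable for the running process; nothing is written to the user or machine settings.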

Tell Windows: SetDllDirectory

We can also tell Windows directly where to look by calling the SetDllDirectory function:

open System.Runtime.InteropServices

module Kernel =
    [<DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)>]
    extern bool SetDllDirectory(string lpPathName);

Kernel.SetDllDirectory(__SOURCE_DIRECTORY__)

However, note that every time this method is called it overrides what was set in the last call. So this can work well for a while - until someone else within your process starts calling it as well.

Load Explicitly: LoadLibrary

If a library module with the same name is already loaded into memory, it will be used directly without even starting to look for it in the file system. We can leverage this by explicitly loading a library.

For example, for the MKL extensions for Math.NET Numerics we could write:

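open System
open System.IO
open System.Runtime.InteropServices

module Kernel =
    [<DllImport("Kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)>]
    extern IntPtr LoadLibrary(string lpFileName);

// Load the dependency first, then the provider DLL itself; the returned handles are not needed here
Kernel.LoadLibrary(Path.Combine(__SOURCE_DIRECTORY__, "libiomp5md.dll")) |> ignore
Kernel.LoadLibrary(Path.Combine(__SOURCE_DIRECTORY__, "MathNet.Numerics.MKL.dll")) |> ignore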

If you were writing this in a long running process where the DLL is used in a well defined section only, then you'd better unload the library once no longer needed with the symmetric FreeLibrary routine. But for quick experiments in F# Interactive it's probably fine.

Example: Enable the MKL native provider in Math.NET Numerics

Math.NET Numerics provides a few code samples as a NuGet package. Ignoring for now that we really should add much more interesting, complete and applied examples (ideas and contributions are welcome!), these examples also do not currently leverage the Intel MKL native provider for faster linear algebra, so let's change at least that. First get a copy of them:

1. Create a new F# Console Application project
2. Open the NuGet Package Manager Console and run

Install-Package MathNet.Numerics.FSharp -Version 2.6.0
Install-Package MathNet.Numerics.FSharp.Sample -Version 2.6.0
Install-Package MathNet.Numerics.MKL.Win-x86 -Version 1.3.0

3. Open the Matrices.fsx file in Samples/MathNet.Numerics.FSharp
4. Fix the MathNet.Numerics reference line to v2.6.1, or whatever package NuGet installed

Now we would like to enable MKL and verify that it works. The native DLLs have already been copied to the project root directory by the NuGet package in step 2.

Add the following lines right after the original 3 open-lines at the beginning of the file:

open System
open System.IO
open MathNet.Numerics
open MathNet.Numerics.Algorithms.LinearAlgebra.Mkl

Control.LinearAlgebraProvider <- MklLinearAlgebraProvider()

let m = matrix [[1.; 2.]; [3.; 4.]]
let v = vector [4.;5.]
m.LU().Solve(v)

If you try to execute this line by line, the last line calling LU and Solve will try to use MKL and fail with a DllNotFoundException as expected. Let's try to use the environment variables approach by adding the following lines after the open-lines. Our script is two directories down from the project root directory, so we have to fix the path accordingly:

Environment.SetEnvironmentVariable("Path",
    Environment.GetEnvironmentVariable("Path") + ";" + Path.Combine(__SOURCE_DIRECTORY__, @"..\..\"))

And suddenly it works.

If it does not and you get a BadImageFormatException, you may have switched F# Interactive to run in 64-bit mode. In this case you should install the MKL.Win-x64 package instead of the x86 one.
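To find out which variant is needed before running into the exception, the bitness of the current process can be checked from within the script. This is only a small sketch: the x86 package name is the one used above, and the x64 name is assumed to follow the same pattern:

```fsharp
open System

// Package names follow the naming used in this post; the x64 name is assumed by analogy.
let mklPackage =
    if Environment.Is64BitProcess then "MathNet.Numerics.MKL.Win-x64"
    else "MathNet.Numerics.MKL.Win-x86"

printfn "This process is %d-bit, so the matching package is %s" (IntPtr.Size * 8) mklPackage
```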

PS: We're trying to simplify this in the upcoming v3 release so there is only one package for both platforms and it automatically loads the right one. Maybe we can also do something about this whole 'telling Windows where to find the DLLs' thing while we're at it...

Towards Math.NET Numerics Version 3

2013-09-15T10:52:00+02:00

Math.NET Numerics is well on its way towards the next major release, v3.0. A first preview alpha has already been pushed to the NuGet gallery, even though there's still a lot to do. If you'd like to understand a bit better where we currently are, where we're heading to, and why, then read on.

Why a new major release?

We apply the principles of semantic versioning, meaning that we are not supposed to break any parts of the public surface of the library, which is almost everything in our case, during minor releases (with the 3-part version format major.minor.patch). This makes sure you can easily upgrade within minor releases without second thoughts or breaking any of your code.

Nevertheless, sometimes there really is a good reason to change the design, because it is way too complicated to use, inconsistent, leads to bad performance or was just not very well thought out. Or we simply learned how to do it in a much better way. You may have noticed that some members have been declared as obsolete over the last couple of minor releases, with suggestions how to do it instead, even though the old implementation was kept intact. Over time all that old code became a pain to maintain, and using the library was much more complicated than needed. So I decided it's time to finally fix most of these issues and clean up.

We do move some cheese around in this release. Your code will break on a few occasions. But in all cases a fix should be easy if not trivial. Also, once there, we will again be bound by semantic versioning to keep the library stable over all future minor releases and thus likely for years to come. We may also keep providing patches for the old v2 branch for a while if needed. Nevertheless, I strongly recommend upgrading to v3 once available.

Feedback is welcome

A first preview (v3.0.0-alpha1) has already been published to NuGet and I plan to do at least two more preview releases before we reach the first v3.0 release. Please do have a look at it and give feedback - now is a unique opportunity for breaking changes.

Overview of what has been done so far

Namespace simplifications.
More functional design where appropriate. Make sure everything works fine and feels native in both C# and F#.
Use common short names if well known instead of very long full names (trigonometry).
Linear Algebra: Using the generic types is the recommended way now; make sure it works well. The IO classes for matrix/vector serialization become separate packages. Major refactoring of the iterative solvers. Filled some missing pieces, various simplifications, lots of other changes.
Distributions: Major cleanup. Direct exposure of distributions functions (pdf, cdf, etc). Parameter Estimation.
New distance functions

Overview of what is planned

Iterative solvers need more work. I'd also like to design them such that they can be iterated manually, in a simple way.
Integral transformations (FFT etc) need major refactoring. Backed by native provider if possible.
Consider bringing back filtering (FIR, IIR, moving average, etc.)
The current QR-decomposition-based curve fitting is inefficient for large data sets, but fixing it is actually not very complicated.
Investigate and fix an inconsistency in the Precision class.
Drop redundant null-checks

Details on what's new in version 3 so far

Dropping .Algorithms Namespaces

Did you ever have to open 10 different Math.NET Numerics namespaces to get all you need? This should get somewhat better in v3, as the static facades like Integrate, Interpolate, Fit or FindRoots for simple cases have been moved directly to the root namespace MathNet.Numerics and all the algorithms namespaces (for advanced uses) of the form MathNet.Numerics.X.Algorithms are now simply MathNet.Numerics.X.

Interpolation

In addition to the simplified namespaces, the last Differentiate overload, which returns the interpolated value together with the first and second derivatives at some point x, has been simplified: instead of two out-parameters in an unexpected order it now returns a tuple with reasonable ordering.

Integration

The design of the double-exponential transformation was rather weird. It has been simplified to a static class and is much simpler to use explicitly.

Probability Distributions

Although it was always possible to assign a custom random source (RNG) to a distribution for random number sampling, it was somewhat complicated and required two steps. Now all distribution constructors have an overload accepting a custom random source directly at construction, in a single step.

A few distributions now support maximum-likelihood parameter estimation and most distributions implement an inverse cumulative distribution function. Distribution functions like PDF, CDF and InvCDF are now exposed directly as static functions.

The inline documentation and parameter naming have been improved significantly. ChiSquare became ChiSquared, and the IDistribution interface became IUnivariateDistribution. Simpler, more composable random sampling in F# with the new Sample module.

New Distance functions

Standard routines for evaluating the Euclidean, Manhattan and Chebyshev distances between arrays or vectors, also for the common Sum of Absolute Difference (SAD), Mean-Absolute Error (MAE), Sum of Squared Difference (SSD) and Mean-Squared Error (MSE) metrics. Hamming distance. Leveraging providers where appropriate.
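For readers who want the definitions spelled out, the metrics listed above can be sketched on plain arrays as follows. The function names are illustrative only and are not claimed to match the library's API; the sketch also ignores providers entirely:

```fsharp
// Plain-array definitions of the metrics; Array.fold2 throws if the lengths differ.
let sumAbsoluteDifference (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + abs (x - y)) 0.0 a b
let sumSquaredDifference (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + (x - y) * (x - y)) 0.0 a b
let meanAbsoluteError (a: float[]) (b: float[]) = sumAbsoluteDifference a b / float a.Length
let meanSquaredError (a: float[]) (b: float[]) = sumSquaredDifference a b / float a.Length
let manhattan (a: float[]) (b: float[]) = sumAbsoluteDifference a b
let euclidean (a: float[]) (b: float[]) = sqrt (sumSquaredDifference a b)
// Largest absolute difference over all positions
let chebyshev (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> max acc (abs (x - y))) 0.0 a b
// Number of positions at which the two arrays differ
let hamming (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> if x <> y then acc + 1 else acc) 0 a b

let a = [| 1.0; 2.0; 3.0 |]
let b = [| 4.0; 0.0; 3.0 |]
printfn "SAD %g, SSD %g, Chebyshev %g, Hamming %d"
    (sumAbsoluteDifference a b) (sumSquaredDifference a b) (chebyshev a b) (hamming a b)
// prints "SAD 5, SSD 13, Chebyshev 3, Hamming 2"
```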

Fewer null checks and ArgumentNullExceptions

Likely as a side effect from my exposure to functional programming over the last year, I no longer follow the arguments why in C# every routine must explicitly check all arguments for null. I've already dropped a few of these checks, but there are still more than 2000 places where Math.NET Numerics throws an ArgumentNullException. Most of these will likely be gone. There is one case where it does make sense to keep them though: when a routine accepts an argument but does not use it immediately (and therefore does not cause an immediate NullReferenceException), a null reference sneaking in could be hard to debug, so we'll keep the check. But such cases are quite rare given the nature of the library.

IO Library

The IO library that used to be distributed as part of the core package is now a set of separate NuGet packages, e.g. MathNet.Numerics.Data.Text, and lives in a separate repository.

Favoring generic linear algebra types

Since the generic namespace was required all the time anyway and the recommended happy path is now to always use the generic types, everything from the .Generic namespace has been moved one namespace up. From now on you usually only need to open two namespaces when working with linear algebra, even if factorizations are needed. For example, when using the double type, you'd open MathNet.Numerics.LinearAlgebra and MathNet.Numerics.LinearAlgebra.Double.

Since typing is stronger in F#, all the init/create functions in the F# module now directly return generic types so you don't have to upcast manually all the time. Most routines have been generalized to work on generic types.

For cases where you want to implement generic algorithms but also need to create new dense or sparse matrices or vectors a new generic builder has been added. This should rarely be needed in user code though.

Missing scalar-matrix routines

A few missing scalar-matrix routines like adding or subtracting a scalar to a matrix or dividing a scalar by a matrix have been added, backed by providers where possible. There's now also a modulus routine.

Point-wise infix operators where supported (F#)

We've added point-wise .*, ./ and .% operators to matrices and vectors in the core library. This is not supported in all .Net languages yet, but works fine in F#, albeit without currying support. Of course in the other languages you can continue to use the normal methods as before.

Factorization and Iterative Solvers

Previously matrix factorization was only accessible by extension methods or explicit creation, which did not work very well when using generic types. The generic matrix type now provides methods to create them directly. As such, the actual implementations have been internalized as there is no longer any need for direct access.

The QR factorization is now thin by default, and factorizations no longer clone their results for no practical reason.

The iterative solver design has been significantly simplified and is now generic and shared where possible and accepts generic types everywhere. The namespaces are now much more flat as the very detailed structure did not add any value but meant you had to open a dozen namespaces.

Misc linear algebra improvements

Vectors now have a ConjugateDotProduct routine in addition to DotProduct.
Vectors now explicitly provide proper L1, L2 and infinity norms
Matrices/Vectors now have consistent enumerators, with a variant that skips zeros (useful if sparse).
Matrix/Vector creation routines have been simplified and usually no longer require explicit dimensions. New variants to create diagonal matrices, or such where all fields have the same value.
Matrices/Vectors expose whether storage is dense with a new IsDense property.
Providers have been moved to a Providers namespace and are fully generic again.

Misc

More robust complex Asin/Acos for large real numbers.
Trig functions: common short names instead of very long names.
Complex: common short names for Exp, Ln, Log10, Log.
Statistics: new single-pass MeanVariance method (as used often together).
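To illustrate what "single-pass" means for mean and variance, here is a minimal sketch based on Welford's update. Whether the library's MeanVariance method uses exactly this scheme is an assumption, and the function name below is chosen for illustration:

```fsharp
/// Single-pass mean and unbiased sample variance (Welford's update).
/// Returns nan for the mean of an empty sequence and for the variance of fewer than two samples.
let meanVariance (samples: seq<float>) =
    let mutable count = 0
    let mutable mean = 0.0
    let mutable m2 = 0.0   // running sum of squared deviations from the current mean
    for x in samples do
        count <- count + 1
        let delta = x - mean
        mean <- mean + delta / float count
        m2 <- m2 + delta * (x - mean)
    let variance = if count > 1 then m2 / float (count - 1) else nan
    (if count > 0 then mean else nan), variance

let sampleMean, sampleVariance = meanVariance [ 1.0; 2.0; 3.0; 4.0 ]
printfn "mean = %g, variance = %.4f" sampleMean sampleVariance
// prints "mean = 2.5, variance = 1.6667"
```

The data is traversed only once, which matters for lazily generated sequences that cannot be replayed cheaply.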

What's New in Math.NET Numerics 2.6

2013-07-26T00:00:00+02:00

Math.NET Numerics v2.6, released in July 2013, is focused on filling some gaps around the very basic numerical problems of fitting a curve to data and finding solutions of nonlinear equations. As usual you'll find a full listing of all changes in the release notes. However, I'd like to take the chance to highlight some important changes, show some code samples and explain the reasoning behind the changes.

A lot of high quality code contributions made this release possible. Just like last release, I've tried to attribute them directly in the release notes. Thanks again!

Please let me know if these "What's New" articles are useful in this format and whether I should continue to put them together for future releases. See also what's new in the previous version 2.5.

Linear Curve Fitting

Fitting a linear-parametric curve to a set of samples such that the squared errors are minimal has always been possible with the linear algebra toolkit, but it was somewhat complicated to do and required understanding of the algorithm. See Linear Regression with Math.NET Numerics for an introduction and some examples.

Note: if you need to have the curve go exactly through all your data points, use our Interpolation routines instead.

We now finally provide a shortcut with a few common functions to fit to data, but also a method to fit a linear combination of arbitrary functions. For fitting a simple line it uses an efficient direct algorithm:

var x = new [] { 1.0, 2.0, 3.0, 4.0, 5.0, 6.0 };
var y = new [] { 4.986, 2.347, 2.061, -2.995, -2.352, -5.782 };

C#: var p = Fit.Line(x, y);
    var offset = p[0];  // = 7.01013
    var slope = p[1];   // = -2.08551

F#: let offset, slope = Fit.line x y

Otherwise it usually applies an ordinary least squares regression to find the best parameters using a thin QR decomposition (leveraging a native provider like Intel MKL if enabled). This also works with arbitrary functions, like sine and cosine:

F#: let p = (x, y) ||> Fit.linear [(fun _ -> 1.0); (Math.Sin); (Math.Cos)]
C#: var p = Fit.LinearCombination(x, y, z => 1.0, Math.Sin, Math.Cos);

// p = [ -0.287, 4.02, -1.46 ], hence f: x -> -0.287 + 4.02*sin(x) - 1.46*cos(x)
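For the simple-line case mentioned earlier, a direct algorithm can be sketched with the textbook closed-form least-squares formulas. This is an illustration under that assumption, not necessarily how Fit.Line is implemented, and fitLine is a made-up name. The sample data uses -2.352 as the fifth y value:

```fsharp
/// Least-squares line y = offset + slope * x via the closed-form formulas.
let fitLine (x: float[]) (y: float[]) =
    let meanX = Array.average x
    let meanY = Array.average y
    // Sum of cross-deviations and sum of squared x-deviations
    let sxy = Array.fold2 (fun acc xi yi -> acc + (xi - meanX) * (yi - meanY)) 0.0 x y
    let sxx = x |> Array.sumBy (fun xi -> (xi - meanX) * (xi - meanX))
    let slope = sxy / sxx
    meanY - slope * meanX, slope

let x = [| 1.0; 2.0; 3.0; 4.0; 5.0; 6.0 |]
let y = [| 4.986; 2.347; 2.061; -2.995; -2.352; -5.782 |]
let offset, slope = fitLine x y
printfn "offset = %.5f, slope = %.5f" offset slope
// prints "offset = 7.01013, slope = -2.08551"
```

No matrix decomposition is needed here, which is why the line is worth treating as a special case.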

The intention is to add more special cases for common curves like the logistic function in the future. Like the line they may have more appropriate direct implementations. For now there is one other special case, for fitting to a polynomial. It returns the best parameters, in ascending order (coefficient for power k has index k), compatible with the Evaluate.Polynomial routine:

C#: var coeff = Fit.Polynomial(x, y, 2); // order 2
    Evaluate.Polynomial(1.2, coeff); // ...
F#: let coeff = Fit.polynomial 2 x y

In practice your x values are not always just real numbers. Maybe you need multi-dimensional fitting where the x values are actually arrays, or even full data structures. For such cases we provide a version that is generic in x and where you can provide a list of functions that accept such x directly without the need to convert to an intermediate double vector first:

C#: var p = Fit.LinearMultiDim(xarrays, y, f1, f2, f3, ...);
    var p = Fit.LinearGeneric(xstructs, y, f1, f2, f3, ...);
F#: let p = Fit.linear [f1; f2; f3; ...] xgeneric y

Often after evaluating the best fitting linear parameters you'd actually want to evaluate the function with those parameters. For this scenario we provide a shortcut as well: For each of these methods there is also a version with a "Func" suffix ("F" in F#) which, instead of the parameters, returns the composed function:

F#: let f = Fit.lineF x y
    [1.0..0.1..2.0] |> List.map f

C#: var f = Fit.LinearCombinationFunc(x, y, z => z*z, Math.Sin, SpecialFunctions.Gamma);
    Enumerable.Range(0,11).Select(x => f(x/10.0))

Root Finding

We now provide basic root finding algorithms. A root of a function x -> f(x) is a solution of the equation f(x)=0. Root-finding algorithms can thus help finding numerical real solutions of arbitrary equations, provided f is reasonably well-behaved and we already have an idea about an interval [a,b] where we expect a root. As usual, there is a facade class FindRoots for simple scenarios:

The routines usually expect a lower and upper boundary as parameters, and then optionally the accuracy we try to achieve and the maximum number of iterations.

C#: FindRoots.OfFunction(x => x*x - 4, -5, 5) // -2.00000000046908
C#: FindRoots.OfFunction(x => x*x - 4, -5, 5, accuracy: 1e-14) // -2 (exact)
C#: FindRoots.OfFunctionDerivative(x => x*x - 4, x => 2*x, -5, 5) // -2 (exact)

F#: FindRoots.ofFunction -5.0 5.0 (fun x -> x*x - 4.0)
F#: FindRoots.ofFunctionDerivative -5.0 5.0 (fun x -> x*x - 4.0) (fun x -> 2.0*x)

A NonConvergenceException is thrown if no root can be found by the algorithm.

In practice you'd often want to use a specific well-known algorithm. You'll find them in the RootFinding namespace. Each of these algorithms provides a FindRoot method with arguments similar to those above. However, the algorithms may sometimes fail to find a root or the function may not actually have a root within the provided interval. Failing to find a root is thus not exactly exceptional. That's why the algorithms also provide an exception-free TryFindRoot code path with the common Try-pattern as in TryParse.

Bisection

A simple and robust yet rather slow algorithm, implemented in the Bisection class.

Example: Find the real roots of the cubic polynomial 2x^3 + 4x^2 - 50x + 6:

Func<double, double> f = x => Evaluate.Polynomial(x, 6, -50, 4, 2);
Bisection.FindRoot(f, -6.5, -5.5, 1e-8, 100);  // -6.14665621970684
Bisection.FindRoot(f, -0.5, 0.5, 1e-8, 100);   // 0.121247371972135
Bisection.FindRoot(f, 3.5, 4.5, 1e-8, 100);    // 4.02540884774855

F#: f |> FindRoots.bisection 100 1e-8 3.5 4.5  // Some(4.0254..)

Note that the F# function returns a float option. Instead of throwing an exception it will simply return None if it fails.
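To make the idea concrete, here is a self-contained sketch of bisection that returns a float option in the same spirit. It is not the library's implementation: the names evalPolynomial and bisect are hypothetical, and the argument order merely mirrors the F# call shown above:

```fsharp
/// Evaluates a polynomial with coefficients in ascending order (Horner's scheme).
let evalPolynomial (coefficients: float list) (x: float) =
    List.foldBack (fun c acc -> c + x * acc) coefficients 0.0

/// Bisection on [lower, upper]; returns None when the interval does not bracket
/// a sign change or the iteration budget runs out.
let bisect maxIterations accuracy lower upper (f: float -> float) =
    let rec loop a b fa iteration =
        let mid = 0.5 * (a + b)
        let fmid = f mid
        if fmid = 0.0 || 0.5 * (b - a) < accuracy then Some mid
        elif iteration >= maxIterations then None
        elif sign fmid = sign fa then loop mid b fmid (iteration + 1)   // root is in [mid, b]
        else loop a mid fa (iteration + 1)                              // root is in [a, mid]
    let fl, fu = f lower, f upper
    if fl = 0.0 then Some lower
    elif fu = 0.0 then Some upper
    elif sign fl = sign fu then None
    else loop lower upper fl 0

// 2x^3 + 4x^2 - 50x + 6, coefficients in ascending order
let f = evalPolynomial [ 6.0; -50.0; 4.0; 2.0 ]
let root = f |> bisect 100 1e-8 3.5 4.5      // Some value near 4.0254
let noRoot = f |> bisect 100 1e-8 5.0 6.0    // None: no sign change in [5, 6]
```

Each step halves the interval, so the number of iterations grows only with the logarithm of the requested accuracy, but never faster than one bit per step.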

Brent's Method

We use Brent's method as the default algorithm, implemented in the Brent class. Brent's method is faster than bisection, but falls back to something close to bisection if the faster approaches (essentially the secant method and inverse quadratic interpolation) fail, and is therefore almost as reliable.

The same example as above, but using Brent's method:

Func<double, double> f = x => Evaluate.Polynomial(x, 6, -50, 4, 2);
Brent.FindRoot(f, -6.5, -5.5, 1e-8, 100);  // -6.14665621970684
Brent.FindRoot(f, -0.5, 0.5, 1e-8, 100);   // 0.121247371972135
Brent.FindRoot(f, 3.5, 4.5, 1e-8, 100);    // 4.02540884774855

F#: f |> FindRoots.brent 100 1e-8 3.5 4.5  // Some(4.0254..)

Note that there are better algorithms for finding all roots of a polynomial. We plan to add specific polynomial root finding algorithms later on.

Newton-Raphson

The Newton-Raphson method leverages the function's first derivative to converge much faster, but can also fail completely. The pure Newton-Raphson algorithm is implemented in the NewtonRaphson class. However, we also provide a modified algorithm that tries to recover (instead of just failing) when overshooting, converging too slowly or even when losing bracketing in the presence of a pole. This modified algorithm is available in the RobustNewtonRaphson class.

Example: Assume we want to find solutions of x+1/(x-2) == -2, hence x -> f(x) = 1/(x-2)+x+2 with a pole at x==2:

Func<double, double> f = x => 1/(x - 2) + x + 2;
Func<double, double> df = x => -1/(x*x - 4*x + 4) + 1;
RobustNewtonRaphson.FindRoot(f, df, -2, -1, 1e-14, 100, 20); // -1.73205080756888
RobustNewtonRaphson.FindRoot(f, df, 1, 1.99, 1e-14, 100, 20); // 1.73205080756888
RobustNewtonRaphson.FindRoot(f, df, -1.5, 1.99, 1e-14, 100, 20); // 1.73205080756888
RobustNewtonRaphson.FindRoot(f, df, 1, 6, 1e-14, 100, 20); // 1.73205080756888

F#: FindRoots.newtonRaphsonRobust 100 20 1e-14 1.0 6.0 f df
F#: (f, df) ||> FindRoots.newtonRaphsonRobust 100 20 1e-14 1.0 6.0
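To see why the robust variant is useful, consider a bare-bones Newton-Raphson iteration on the same function. The sketch below is a plain textbook version with no bracketing and no recovery, written for illustration only (the name newton and its parameters are not the library's API):

```fsharp
/// Plain Newton-Raphson from an initial guess; gives up with None instead of recovering.
let newton maxIterations accuracy (f: float -> float) (df: float -> float) (guess: float) =
    let rec loop x iteration =
        let fx = f x
        if abs fx < accuracy then Some x
        elif iteration >= maxIterations then None
        else
            let slope = df x
            // A flat tangent or a step onto the pole cannot be continued
            if slope = 0.0 || System.Double.IsNaN slope || System.Double.IsInfinity fx then None
            else loop (x - fx / slope) (iteration + 1)
    loop guess 0

let f x = 1.0 / (x - 2.0) + x + 2.0
let df x = 1.0 - 1.0 / ((x - 2.0) * (x - 2.0))

let positiveRoot = newton 100 1e-12 f df 1.8    // converges to about 1.7320508 (square root of 3)
let negativeRoot = newton 100 1e-12 f df -2.0   // converges to about -1.7320508
let flatStart = newton 100 1e-12 f df 1.0       // None: the derivative is zero at x = 1
let ontoPole = newton 100 1e-12 f df 1.5        // None: the first step lands exactly on the pole at x = 2
```

The last two calls show how sensitive the pure method is to the starting point, which is the kind of situation the modified algorithm tries to handle.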

Broyden's Method

The quasi-Newton method by Broyden, implemented in the Broyden class, may help you to find roots in multi-dimensional problems.

Linear Algebra

As usual there have been quite a few improvements around linear algebra, see the release notes for the complete list. If you've enabled our Intel MKL native linear algebra provider, then eigenvalue decompositions should be much faster now. Matrices now also support the new F# 3.1 array slicing syntax.

Note that we're phasing out the MathNet.Numerics.IO library and namespace and plan to drop it entirely in v3. We've already replaced it with two new separate NuGet packages and obsoleted all members of the old library. The new approach with separate libraries makes it possible to introduce specific dependencies, e.g. to read and write Excel files, without forcing these dependencies on all of Math.NET Numerics. We recommend switching over to the new packages as soon as possible.

Statistics

We've had a Pearson correlation coefficient routine for a while, but no Covariance routine. In addition to a new Spearman ranked correlation routine, this release finally also adds sample and population Covariance functions for arrays and IEnumerables.

`ArrayStatistics.Covariance(new[] {1.2, 1.3, 2.4}, new[] {2.2, 2.3, -4.5})`
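The difference between the sample and the population variant is only the divisor (n - 1 versus n). A small sketch with the same numbers makes this explicit; the covariance function below is written out for illustration and is not the library routine:

```fsharp
/// Covariance of two equally long samples; divides by n - 1 (sample) or n (population).
let covariance (population: bool) (xs: float[]) (ys: float[]) =
    let meanX = Array.average xs
    let meanY = Array.average ys
    let sum = Array.fold2 (fun acc x y -> acc + (x - meanX) * (y - meanY)) 0.0 xs ys
    let n = float xs.Length
    sum / (if population then n else n - 1.0)

let xs = [| 1.2; 1.3; 2.4 |]
let ys = [| 2.2; 2.3; -4.5 |]
printfn "sample: %.4f, population: %.4f" (covariance false xs ys) (covariance true xs ys)
// prints "sample: -2.5850, population: -1.7233"
```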

Test your C# or F# Library on Mono with Vagrant

2013-06-07T21:33:00+02:00

Most .Net libraries should also work on Linux or OS X thanks to Mono. But do they really? How do you verify and test that without installing Mono on your Windows development box or setting up a separate Linux box? Or how do you compile and test some local files without installing any .Net or Mono dev tools at all?

Vagrant comes in handy here. Vagrant may not be very well known among .Net developers on Windows yet, so let the Vagrant team introduce it in their own words: "Vagrant is a tool for building complete development environments. With an easy-to-use workflow and focus on automation, Vagrant lowers development environment setup time, increases development/production parity, and makes the "works on my machine" excuse a relic of the past." In my words, Vagrant lets you define a standardized development or test environment for your project that works exactly the same everywhere, no matter what OS you're on or how you've set it up. It uses virtual machines in the background, but you neither see nor care about them much.

Using Mono with Vagrant

The Math.NET Numerics project claims that it supports Mono, but I admit I verify that claim only sporadically. Up to now, that is. I've just enabled Vagrant on the project, so everyone can test it on Mono without any effort. This is what you do, assuming you have Vagrant installed and you have a local checkout of the repository: open git bash at the root of the checkout and run

`$ vagrant up`

This will download the box the first time it is used (~540 MB), create a virtual machine in VirtualBox and then start and provision it. After the first time this is quite fast, usually less than a minute.

Then we can enter the environment. We see that both mono and fsharp are available and up to date:

`$ vagrant ssh`

The trick is that the local directory is automatically available within the environment in the /vagrant path. We can compile our project using mono and xbuild right away:

~$ xbuild /vagrant/MathNet.Numerics.sln
# or just some projects:
~$ xbuild /vagrant/src/Numerics/Numerics.csproj
~$ xbuild /vagrant/src/FSharp/FSharp.fsproj
~$ xbuild /vagrant/src/UnitTests/UnitTests.csproj

And then run all the unit tests using NUnit:

~$ nunit-console /vagrant/out/tests/Net40/MathNet.Numerics.UnitTests.dll
~$ nunit-console /vagrant/out/tests/Net40/MathNet.Numerics.FSharp.UnitTests.dll

And indeed, 37 of 11906 tests are failing - apparently there are some differences around number formatting. Seems like we've got some work to do. In order to fix it, I open Visual Studio on Windows as usual, compile it there (or in the Vagrant environment using xbuild), and run the tests again both in Visual Studio using .Net and in the environment using Mono. No manual file transfer or copying is needed.

When you're done working for now, leave the environment with exit and either suspend (vagrant suspend) or shut down (vagrant halt) the virtual machine. In either case, you can bring it up again later with vagrant up, or remove it completely with vagrant destroy.

Installing Vagrant

I mentioned before that you need to have Vagrant installed for this to work:

Download and install Vagrant from here
Download and install VirtualBox from here
Not sure this is still needed, but to be on the safe side add the VirtualBox folder to your system PATH environment variable. In my case the folder is "C:\Program Files\Oracle\VirtualBox".

If you need more help there, have a look at Vagrant's getting started guide.

Enabling Vagrant in your own project

There are really only two things I did to enable this in Math.NET Numerics:

Add ".vagrant" to the .gitignore file (or the equivalent if you don't use git).
Add a Vagrantfile text file to the repository

The Vagrantfile can be generated automatically with the vagrant init command, but it may even be easier to just start from the file I currently use:

Vagrant.configure("2") do |config|

  config.vm.box = "wheezy64-mono3.0.10-fsharp3.0.27"
  config.vm.box_url = "https://dl.dropboxusercontent.com/s/uelesklqouaw1gl/wheezy64-mono3.0.10-fsharp3.0.27-virtualbox.box"

  config.vm.provider :virtualbox do |vb|
    vb.gui = false
    vb.customize ["modifyvm", :id, "--memory", "1024"]
    vb.customize ["modifyvm", :id, "--cpus", "2"]
  end
end

The file declares the base box to use and where it can be downloaded from if needed. This specific box by default uses only 384 MB RAM and 1 CPU, but we override this in our project to give it some more resources. You could also declare more folders to be kept in sync or network ports to be forwarded in this file.

This file is also the right place to specify additional provisioning if needed. How this works is described in the Vagrant docs. Note however that this specific box does not have Ruby installed and is not prepared to be provisioned with tools like Puppet or Chef. Simple provisioning by shell script should work though, and is good enough to install some more packages or change some settings if needed.

About the Base Box

Ideally we could just use an existing official Debian base box and provision mono and fsharp using Vagrant's provisioning mechanisms. Hopefully we'll get there in the end, e.g. by giving the Debian package maintainers a hand. Unfortunately, for now the Debian packages are out of date (although they do exist, which is good news), so we need to compile both mono and fsharp from source. Compiling them takes some time on a VM, which seems inappropriate to me at provisioning time. So I've created a new VM from scratch where I can compile the tools locally and extract cleaned-up Vagrant base boxes in a relatively straightforward way whenever a new version is released.

Current Specs:

OS: Debian 7 "Wheezy", 3.2.0-4-amd64 Linux Kernel (3.2.41-2+deb7u2)
VM: VirtualBox, Guest Tools for v4.2.12 (installed from VB instead of using the Debian packages)
Size: 542 MB when packed
Defaults to 384 MB Ram, 1 CPU. Dynamically expanding disk, max 40 GB.
Mono 3.0.10 and F# 3.0.27 compiled locally from tagged sources, installed to /usr
NUnit 2.6.1 official binary added manually to /usr/local, hence effectively hiding nunit-console from Mono in /usr/bin but not Mono's nunit-console2 or nunit-console4. Note that Mono comes with NUnit 2.4.8.

I've made the box available on Dropbox for now until we find a better place. To use it either copy the Vagrantfile above, or add the box manually using something like:

`$ vagrant box add wheezy64monofs https://dl.dropboxusercontent.com/s/uelesklqouaw1gl/wheezy64-mono3.0.10-fsharp3.0.27-virtualbox.box`

Known Issues

The F# tools seem to clear the shell at start which is a bit irritating. Might be related to git bash, but if I remember right this did not happen back when I used the Debian-provided VirtualBox Guest Tools. Considering going back to them.
I did not manage to fully turn off USB in VirtualBox; apparently it is needed by some HID device (maybe the mouse?).

I'm no expert on Vagrant or even on using Mono and F# on Linux myself, so any feedback and suggestions for improvements are very welcome. Let me know if this works for you or whether we need to go further.

Of course there are also ways to get F# and/or mono straight to your box without the need for Vagrant, see the F# Software Foundation website for instructions on how to get started on your platform.

What's New in Math.NET Numerics 2.5

2013-04-14T00:00:00+02:00

Math.NET Numerics v2.5, released in April 2013, is focused on statistics and linear algebra. As usual you'll find a full listing of all changes in the release notes. However, I'd like to take the chance to highlight some important changes, show some code samples and explain the reasoning behind the changes.

Please let me know if these "What's New" articles are useful in this format and whether I should continue to put them together for future releases.

Statistics

Order Statistics & Quantiles

Previously our order statistics and quantile functions were quite limited. With this release we finally have almost complete quantile support:

OrderStatistic
Median
LowerQuartile
UpperQuartile
InterquartileRange
FiveNumberSummary
Percentile
Quantile
QuantileCustom

All of them are implemented on top of the quantile function. We always default to approximately median-unbiased quantiles, usually denoted as type R-8, which do not assume samples to be normally distributed. If you need compatibility with another implementation, you can use QuantileCustom which accepts either a QuantileDefinition enum (we support all 9 R-types, SAS 1-5, Excel, Nist, Hydrology, etc.) or a 4-parameter definition as in Mathematica.

For the empirical inverse cumulative distribution, which is essentially an R1-type quantile, you can use the new Statistics.InverseCDF function.

More efficient ways to compute statistics

Previously there were two ways to estimate some statistics from a sample set: The Statistics class provided static extension methods to evaluate a single statistic from an enumerable, and DescriptiveStatistics to compute a whole set of standard statistics at once. This was unsatisfactory since it was not very efficient: the DescriptiveStatistics way actually required more than one pass internally (mostly because of the median) and it was not leveraging the fact that the sample set may already be sorted.

To fix the first issue, we've marked DescriptiveStatistics.Median as obsolete and will remove it in v3. Until then, the median computation is delayed until requested the first time. In normal cases where Median is not used it now only requires a single pass.

The second issue we attacked by introducing three new classes to compute a single statistic directly from the best fitting sample data format:

ArrayStatistics operates on arrays which are not assumed to be sorted.
SortedArrayStatistics operates on arrays which must be sorted in ascending order.
StreamingStatistics operates on a stream in a single pass, without keeping the full data in memory at any time. Can thus be used to stream over data larger than system memory.

ArrayStatistics implements Minimum, Maximum, Mean, Variance, StandardDeviation, PopulationVariance and PopulationStandardDeviation. In addition it implements all the order statistics/quantile functions mentioned above, but in an inplace way that reorders the data array (partial sorting) and because of that is marked with an Inplace-suffix to indicate the side effect. These inplace functions get slightly faster when calling them repeatedly, but will always be slower than the sorted array statistics. Nevertheless, since the sorting itself is quite expensive, all in all the (non-sorted) array statistics are still faster in practice if only few calls are needed.

Example: We want to compute the IQR of {3,1,2,4}.

var data = new double[] { 3.0, 1.0, 2.0, 4.0 };
var iqr = ArrayStatistics.InterquartileRangeInplace(data); // iqr = 2.16666666666667

This is equivalent to executing IQR(c(3,1,2,4), type=8) in R, with quantile definition type R-8.
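To make the R-8 definition concrete, here is a minimal Python sketch of the quantile and the resulting IQR. The function names `quantile_r8` and `interquartile_range` are made up for this illustration and are not part of Math.NET Numerics; the sketch also sorts a copy instead of reordering in place, so it shows the definition only, not the library's partial-sorting approach.

```python
import math

def quantile_r8(sorted_data, tau):
    """Type R-8 quantile of ascending-sorted data for 0 <= tau <= 1."""
    n = len(sorted_data)
    # 1-based fractional position as defined for R type 8
    h = (n + 1.0 / 3.0) * tau + 1.0 / 3.0
    # clamp to the valid index range [1, n]
    h = min(max(h, 1.0), float(n))
    lo = math.floor(h)
    hi = min(lo + 1, n)
    frac = h - lo
    # linear interpolation between the two neighbouring order statistics
    return sorted_data[lo - 1] + frac * (sorted_data[hi - 1] - sorted_data[lo - 1])

def interquartile_range(data):
    """IQR as upper quartile minus lower quartile (works on a sorted copy)."""
    s = sorted(data)
    return quantile_r8(s, 0.75) - quantile_r8(s, 0.25)

iqr = interquartile_range([3.0, 1.0, 2.0, 4.0])
print(iqr)  # approximately 2.1667, i.e. 13/6
```

For the four samples the quartile positions are 1.4167 and 3.5833, which interpolate to the quartiles 1.4167 and 3.5833 and thus to the IQR of 13/6 quoted above.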

SortedArrayStatistics expects data to be sorted in ascending order and implements Minimum, Maximum, and all the order statistics/quantile functions mentioned above. It leverages the ordering for very fast (constant time) order statistics. There's also no need to reorder the data, so unlike ArrayStatistics, this class never modifies the provided array and has no side effect. It does not re-implement any operations that cannot leverage the ordering, like Mean or Variance, so use the implementation from ArrayStatistics instead if needed.

var data = new[] { 1.0, 2.0, 3.0, 4.0 };
var iqr = SortedArrayStatistics.InterquartileRange(data); // iqr = 2.16666666666667

StreamingStatistics estimates statistics in a single pass without memorization and implements Minimum, Maximum, Mean, Variance, StandardDeviation, PopulationVariance and PopulationStandardDeviation. It does not implement any order statistics, since they require sorting and are thus not computable in a single pass without keeping the data in memory. No function of this class has any side effects on the data.
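As an illustration of how such single-pass estimates can work, the following Python sketch uses Welford's online update for mean and variance. This is an assumption chosen for the example: the article does not say which algorithm Math.NET Numerics uses internally, and the class name `StreamingStats` is hypothetical.

```python
import math

class StreamingStats:
    """Single-pass running statistics; keeps O(1) state, never the samples."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0            # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def push(self, x):
        # Welford's update: numerically stable, one sample at a time
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self):
        """Sample variance (divides by n - 1)."""
        return self._m2 / (self.count - 1) if self.count > 1 else math.nan

    @property
    def population_variance(self):
        """Population variance (divides by n)."""
        return self._m2 / self.count if self.count > 0 else math.nan

stats = StreamingStats()
for sample in iter([3.0, 1.0, 2.0, 4.0]):   # any iterator works, e.g. lines of a large file
    stats.push(sample)
print(stats.mean, stats.variance, stats.population_variance)
```

Because only a handful of numbers are kept as state, the same loop works for data that does not fit into memory. It also shows why order statistics are missing from such a class: a median cannot be updated from this kind of constant-size state.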

The Statistics class has been updated to leverage these new implementations internally, and implements all of the statistics mentioned above as extension methods on enumerables. No function of this class has any side effects on the data.

`var iqr = new[] { 3.0, 1.0, 2.0, 4.0 }.InterquartileRange(); // iqr = 2.16666666666667`

Note that this is generally slower than ArrayStatistics because it has to copy the array to make sure there are no side effects, and much slower than SortedArrayStatistics, which would run in constant time (assuming we sorted the data manually first).

Repeated Evaluation and Precomputed Functions

Most of the quantile functions accept a tau-parameter. Often you need to evaluate that function with not one but a whole range of values for that parameter, say for plotting. In such scenarios it is advantageous to first sort the data and then use the SortedArrayStatistics functions with constant time complexity. For convenience we also provide alternative implementations with the Func-suffix in the Statistics class that do exactly that: instead of accepting a tau-parameter themselves, they return a function that accepts tau:

var icdf = new[] { 3.0, 1.0, 2.0, 4.0 }.InverseCDFFunc();
var a = icdf(0.3); // 2
var b = new[] { 0.0, 0.1, 0.5, 0.9, 1.0 }.Select(icdf).ToArray(); // 1,1,2,4,4
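The idea behind the Func-suffixed variants (sort once, then answer every query in constant time) can be sketched in Python as a closure. The name `inverse_cdf_func` is hypothetical, and the sketch assumes the plain R-1 definition for the empirical inverse CDF:

```python
import math

def inverse_cdf_func(data):
    """Sort once, then return a function tau -> empirical inverse CDF (R-1 type)."""
    s = sorted(data)          # the only O(n log n) step, done a single time
    n = len(s)

    def icdf(tau):
        # R-1: the k-th smallest sample with k = ceil(n * tau), clamped to [1, n]
        k = max(1, math.ceil(n * tau))
        return s[min(k, n) - 1]

    return icdf

icdf = inverse_cdf_func([3.0, 1.0, 2.0, 4.0])
a = icdf(0.3)
b = [icdf(tau) for tau in (0.0, 0.1, 0.5, 0.9, 1.0)]
print(a, b)
```

Each call to the returned function is a single array lookup, which is what makes evaluating a whole range of tau values (say, for plotting) cheap.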

Linear Algebra

There have been quite a few bug fixes and performance improvements around linear algebra. See the release notes for details.

Matrix and Vector String Formatting

Previously the ToString method used to render the whole matrix to a string. When working with very large data sets this can be an expensive operation both on CPU and memory usage. It also makes it a pain to work with interactive consoles or REPL environments like F# Interactive that write the ToString of the resulting object to the console output.

Starting from v2.5, ToString methods no longer render the whole structure to a string for large data. Instead, ToString now only renders an excerpt of the data, together with a line about dimension, type and in case of sparse data a sparseness indicator. The intention is to give a good idea about the data in a visually useful way:

`DenseMatrix.CreateRandom(60,80,Normal.WithMeanVariance(2.0,1.0)).ToString();`

generates the following multi-line string:

DenseMatrix 60x80-Double
 1.68665      1.24322      3.36594      2.07444      4.13008 ...      3.01076
 1.10888       2.8856      2.31662      3.94124      3.56711 ...    -0.216804
0.843804      2.67243        1.097      2.34063     0.875953 ...        1.808
 3.87044      2.69509      2.79642    -0.354365      2.45302 ...      2.79665
 2.05722      3.39823      2.56256      1.88849      1.75259 ...       3.8987
  1.9874      2.97047      1.40584      1.97734      2.37733 ...        2.875
 2.06503      1.15681      3.85957      2.84836      1.25326 ...    -0.108938
     ...          ...          ...          ...          ... ...          ...
 4.39245      1.32734      2.83637      1.78257      2.44356 ...      2.58935

The output is not perfect yet, as we'd ideally align the decimal point and automatically choose the right width for each column. Hopefully we can fix that in a future version. How much data is shown by default can be adjusted in the Control class:

Control.MaxToStringColumns = 6;
Control.MaxToStringRows = 8;

Or you can use an override or alternative:

Matrix.ToString(maxRows, maxColumns, formatProvider)

// Just the top info line
Matrix.ToTypeString()

// Just the matrix data without the top info line
Matrix.ToMatrixString(maxRows, maxColumns, formatProvider)
Matrix.ToMatrixString(maxRows, maxColumns, padding, format, formatProvider)

Note that ToString has never been intended to serialize a matrix to a string in order to parse it back later. Please use one of our data libraries instead, e.g. the MathNet.Numerics.Data.Text package.

Creating a Matrix or Vector

Constructing a matrix or vector has become more consistent: Except for obsolete members which will be removed in v3, all constructors now directly use the provided data structure by reference, without any copying. This means that there are only constructors left that accept the actual inner data structure format. Usually you'd use the new static functions instead, which always either create a copy or construct the inner data structure directly from the provided data, without keeping a reference to it.

Some examples below. The C# way usually works in F# just the same as well, but we provide more idiomatic alternatives for most of them:

// Directly from an array in the internal column-major format (no copying)
C#: new DenseMatrix(2, 3, new[] { 1.0, 2.0, 10.0, 20.0, 100.0, 300.0 })
F#: DenseMatrix.raw 2 3 [| 1.0; 2.0; 10.0; 20.0; 100.0; 300.0 |]

// All-zero 3x4 (3 rows, 4 columns) matrix
C#: new DenseMatrix(3, 4)
F#: DenseMatrix(3, 4)
F#: DenseMatrix.zeroCreate 3 4

// 3x4, all cells the same fixed value
C#: DenseMatrix.Create(3, 4, (r, c) => 20.5)  // note: better way is planned
F#: DenseMatrix.create 3 4 20.5

// 3x4, random
C#: DenseMatrix.CreateRandom(3, 4, Normal.WithMeanVariance(2.0, 0.5));
F#: DenseMatrix.randomCreate 3 4 (Normal.WithMeanVariance(2.0, 0.5))

// 3x4, using an initializer function
C#: DenseMatrix.Create(3, 4, (r, c) => r/100.0 + c)
F#: DenseMatrix.init 3 4 (fun r c -> float r/100.0 + float c)

// From enumerables of enumerables (as rows or columns)
C#: DenseMatrix.OfColumns(2,3, new[] { new[] { 1.0, 4.0 }, new[] { 2.0, 5.0 }, new[] { 3.0, 6.0 } })
F#: DenseMatrix.ofRows 2 3 [{ 1.0 .. 3.0 }; { 4.0 .. 6.0 }]

// From F# lists of lists
F#: DenseMatrix.ofRowsList 2 3 [[1.0 .. 3.0]; [4.0 .. 6.0]]
F#: matrix [[1.0; 2.0; 3.0]
            [4.0; 5.0; 6.0]]

// By calling a function to construct each row (or analog for columns)
F#: DenseMatrix.initRow 10 10 (fun r -> vector [float r .. float (r+9)])

All of these work the same way also for sparse matrices, and similarly for vectors. Useful for sparse data is another way that accepts a list or sequence of indexed row column value tuples, where all other cells are assumed to be zero:

F#: SparseMatrix.ofListi 200 100 [(4,3,20.0); (18,9,3.0); (2,1,99.9)]
F#: DenseMatrix.ofSeqi 10 10 (seq {for i in 0..9 -> (i,i,float(i*i))})
C#: SparseMatrix.OfIndexed(10,10,Enumerable.Range(0,10).Select(i => Tuple.Create(i,i,(double)(i*i))))
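Two of the input formats above are easy to misread: the column-major array and the indexed tuples. The following Python sketch spells out how each maps to matrix cells. It uses plain nested lists, and both helper names are made up for the illustration; it does not mirror the library's internal storage classes.

```python
def from_column_major(rows, cols, data):
    """Interpret a flat array as a rows x cols matrix stored column by column."""
    if len(data) != rows * cols:
        raise ValueError("data length must equal rows * cols")
    # element (r, c) lives at flat index c * rows + r
    return [[data[c * rows + r] for c in range(cols)] for r in range(rows)]

def from_indexed(rows, cols, entries):
    """Build a matrix from (row, column, value) tuples; all other cells are zero."""
    m = [[0.0] * cols for _ in range(rows)]
    for r, c, value in entries:
        m[r][c] = value
    return m

m = from_column_major(2, 3, [1.0, 2.0, 10.0, 20.0, 100.0, 300.0])
print(m)   # [[1.0, 10.0, 100.0], [2.0, 20.0, 300.0]]

d = from_indexed(3, 3, ((i, i, float(i * i)) for i in range(3)))
print(d)   # [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 4.0]]
```

So the six values in the first constructor example fill the 2x3 matrix column by column: the first column is (1, 2), the second (10, 20) and the third (100, 300).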

Inplace Map

The F# modules have supported (outplace) combinators like map for quite some time. We've now implemented inplace map directly in the storage classes so it can operate efficiently also on sparse data, and is accessible from C# code without using the F# extensions. The F# map-related functions have been updated to leverage these routines:

matrix |> Matrix.map (fun x -> 2.0*x)
matrix |> Matrix.mapnz (fun x -> 2.0*x) // non-zero: may skip zero values, useful if sparse
matrix |> Matrix.mapi (fun i j x -> x + float i - float j) // indexed with row, column
matrix |> Matrix.mapinz (fun i j x -> x + float i - float j) // indexed, non-zero

Or the equivalent inplace versions which return unit:

matrix |> Matrix.mapInPlace (fun x -> 2.0*x)
matrix |> Matrix.mapnzInPlace (fun x -> 2.0*x)
matrix |> Matrix.mapiInPlace (fun i j x -> x + float i - float j)
matrix |> Matrix.mapinzInPlace (fun i j x -> x + float i - float j)

In C#:

matrix.MapInplace(x => 2*x);
matrix.MapIndexedInplace((i,j,x) => x+i-j, forceMapZeros:true);
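Why skipping zeros matters for sparse data can be seen in a toy model. The Python sketch below stores only the non-zero cells in a dictionary, which is an assumption made purely for illustration and not how Math.NET Numerics stores sparse matrices; `map_inplace` and its `force_map_zeros` flag are hypothetical names loosely modelled on the C# call above.

```python
def map_inplace(cells, rows, cols, f, force_map_zeros=False):
    """Apply f to a sparse matrix held as {(row, col): value} with zeros omitted."""
    if force_map_zeros:
        # visit every cell, needed whenever f(0) != 0; costs rows * cols calls
        keys = [(i, j) for i in range(rows) for j in range(cols)]
    else:
        # visit stored cells only; correct as long as f(0) == 0
        keys = list(cells)
    for key in keys:
        value = f(cells.get(key, 0.0))
        if value != 0.0:
            cells[key] = value
        else:
            cells.pop(key, None)   # keep the storage free of explicit zeros

sparse = {(0, 1): 3.0, (2, 2): -1.5}
map_inplace(sparse, 3, 3, lambda x: 2.0 * x)            # f(0) == 0: zeros can be skipped
print(sparse)                                           # {(0, 1): 6.0, (2, 2): -3.0}

shifted = {(0, 1): 6.0, (2, 2): -3.0}
map_inplace(shifted, 3, 3, lambda x: x + 1.0, force_map_zeros=True)
print(len(shifted))                                     # 9, the result is no longer sparse
```

The first call touches two cells instead of nine. The second shows the price of a function that does not map zero to zero: every cell has to be visited and the result fills up.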

F# Slice Setters

Speaking about F#, we've supported the slice getter syntax for a while as a nice way to get a sub-matrix. For example, to get the bottom right 2x2 sub-matrix of m we can do:

let m = DenseMatrix.init 3 4 (fun i j -> float (10 * i + j))
m |> printfn "%A"
m.[1..2,2..3] |> printfn "%A"

The same syntax now also works for setters, e.g. to overwrite the very same bottom right corner we can write:

m.[1..2,2..3] <- matrix [[0.1; 0.2]
                         [0.3; 0.4]]
printfn "%A" m

The 3 printfn statements generate the following output:

DenseMatrix 3x4-Double
           0            1            2            3
          10           11           12           13
          20           21           22           23
DenseMatrix 2x2-Double
          12           13
          22           23
DenseMatrix 3x4-Double
           0            1            2            3
          10           11          0.1          0.2
          20           21          0.3          0.4

Math.NET Numerics with Native Linear Algebra

2013-02-03T19:56:00+01:00

Linear algebra is one of those areas where performance can be essential, but also one where native optimizations can make a huge difference. That's why in Math.NET Numerics we implemented linear algebra on top of a provider abstraction where providers can be exchanged.

Out of the box Math.NET Numerics only includes a fully managed provider which is supported on all platforms, but unfortunately is also rather slow. This doesn't matter much for most problems, but if you're working with very large dense matrices it can be a deal breaker. That's why we've added some helper projects you can use to compile your own native provider, but that is still quite involved and requires some experience around C or C++. Not any more, kudos to @marcuscuda!

Starting with Math.NET Numerics v2.4 we distribute native providers as NuGet packages, beginning with one based on Intel MKL. Enabling native algorithms becomes almost as simple as adding a NuGet package to your project.

Enabling Native Linear Algebra

Since the native providers are optimized for and work on a specific platform only, you'll have to decide exactly what platform your project should target (x64/64-bit, or x86/32-bit), configure your project or compiler accordingly and then choose the matching native package. We might be able to simplify this in the future by distributing both versions together and automatically selecting the right one at runtime, but for now that's how it is.

PM> Install-Package MathNet.Numerics.MKL.Win-x64
PM> Install-Package MathNet.Numerics.MKL.Win-x86

We're also preparing analogous packages for Linux.

I mentioned it is almost as simple as adding a NuGet package. Almost, because you still need to make sure the libraries get copied to your bin directory (e.g. by choosing "Copy always" in VS) and to actually enable them using the Control class:

`Control.LinearAlgebraProvider = new MklLinearAlgebraProvider();`

Without this line the default provider would have been used instead. Usually the default provider is the fully managed one mentioned above. However, you can choose the new MKL provider as default from the outside by setting the MathNetNumericsLAProvider environment variable to MKL (Note: does not work on the portable build). This is especially useful for unit testing against multiple providers, but also e.g. to upgrade to a faster provider in cases where you cannot actually modify the source code. If needed you can prevent the environment variable from having any effect in your application by explicitly configuring a provider yourself, e.g. the managed provider:

`Control.LinearAlgebraProvider = new ManagedLinearAlgebraProvider();`

Speed Up

You can expect a very significant speed up with the MKL provider. For example, computing the product of two 1000x1000 dense double matrices I've just measured the following median times on my 4-core x64 Windows desktop, with no code change between them (except configuring the Control class):

Managed provider (parallelization enabled): 0.9s
MKL native provider: 0.085s

This might not be a very thorough analysis but shows that even in this simple case the native provider is 10 times faster than our managed provider. Admittedly our managed provider might have some room for improvement (we actually have a promising alternative implementation waiting to be fully tested and integrated), but even with the best generic & portable managed algorithm we're unlikely to ever beat a native provider.

Alternative Native Providers - Call for Help

We try to make alternative providers available the same way in the future, hopefully also including some more open/free ones. Unfortunately, for most of them even just getting them to compile for both x86 and x64 on Windows can be very tricky. We've worked on ACML, Goto (now OpenBLAS) and ATLAS in the past but ran into problems with each of them on Windows. Hence, please let us know if you have any experience you could share with us. Thanks.

Git Howto: Mirror a GitHub repo without pull refs

2013-01-26T15:44:00+01:00

GitHub recently started publishing all pull requests as special git refs. This is awesome, since it makes it trivial to check them out and work with them from your local repository, without having to add the submitter's repo as a remote all the time. It is also nicely done in that it does not affect normal clones in any way - unless you actually want to fetch them.

However, there is one case where it may have an undesired side effect: mirrors. For example, I routinely mirror the Math.NET Numerics mainline repository to a couple of other places, including Codeplex, Gitorious and Google Code. I want a mirror to exactly mirror the source repository, adding all new branches and tags automatically, but also removing those that have been deleted in the source. Git has excellent support for such exact mirroring. Unfortunately this mirroring mechanism includes all the pull refs as well, which may not be what you want. In Math.NET Numerics, some pull requests are actually based on an old (long removed) branch that included some corrupt objects. So in this case, including them in the mirror not only doubles the repository size, it also causes a corrupt git file system.

Luckily there is an easy way to skip them in the mirror, but to do that we must understand how git refs actually work:

Git Refs

In essence, a git ref is just a reference to a specific git commit. Refs can represent local branches and tags, but also remote branches. To keep things organized, they're structured hierarchically. You can find them in two places:

As a separate file for each ref in the .git/refs directory
In the .git/packed-refs file

In a normal local repository you'll typically end up with the following structure:

refs/heads/{branchname} - all your local branches
refs/remotes/{remotename}/{branchname} - all your fetched remote branches
refs/tags/{tagname} - all tags
refs/stash - your stash, if you use it

However, if you create a local mirror of a GitHub repo, i.e.

`$ git clone --mirror git://github.com/mathnet/mathnet-numerics.git`

Then you'll end up with exactly the same bare structure as the remote itself, but this time including GitHub's pull requests:

refs/heads/{branchname} - all your remote branches
refs/tags/{tagname} - all your remote tags
refs/pull/{id}/head|merge - all your remote GitHub pull requests

How exactly a remote's refs are mapped down to your local refs and why there is a difference between the structure of a normal clone and a bare mirror is defined in the refspec that is automatically added to your repo config.

In a normal clone, the fetch refspec typically looks like this:

`fetch = +refs/heads/*:refs/remotes/origin/*`

It essentially says that all remote refs within refs/heads should map to local refs in refs/remotes/origin. On the other hand, a mirror includes all refs, so its refspec looks like the following:

`fetch = +refs/*:refs/*`

Excluding Pull Refs

As far as I know there is no simple way to exclude some refs in a subdirectory of a refspec, but you can add multiple fetch refspecs to get the same effect. Simply replace the catch-all refspec above with two more specific specs to just include all heads and tags, but not the pulls, and all the remote pull refs will no longer make it into your bare mirror:

fetch = +refs/heads/*:refs/heads/*
fetch = +refs/tags/*:refs/tags/*
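To see what these refspecs do to a set of remote refs, here is a small Python simulation. It is only a sketch of the mapping rule for patterns of the form `+src/*:dst/*` (it ignores the meaning of the leading `+` and every other refspec feature), and the concrete ref names such as `refs/pull/42/head` are invented for the example.

```python
def apply_refspecs(remote_refs, refspecs):
    """Map remote refs to local refs for simple '+<src>/*:<dst>/*' fetch refspecs."""
    mapped = {}
    for spec in refspecs:
        src, dst = spec.lstrip("+").split(":")
        src_prefix = src.rstrip("*")      # e.g. 'refs/heads/'
        dst_prefix = dst.rstrip("*")      # e.g. 'refs/remotes/origin/'
        for ref in remote_refs:
            if ref.startswith(src_prefix):
                mapped[dst_prefix + ref[len(src_prefix):]] = ref
    return mapped

# hypothetical refs as a GitHub remote might advertise them
remote = ["refs/heads/master", "refs/tags/v2.5.0",
          "refs/pull/42/head", "refs/pull/42/merge"]

normal_clone = apply_refspecs(remote, ["+refs/heads/*:refs/remotes/origin/*"])
mirror_all = apply_refspecs(remote, ["+refs/*:refs/*"])
mirror_selective = apply_refspecs(remote, ["+refs/heads/*:refs/heads/*",
                                           "+refs/tags/*:refs/tags/*"])

print(sorted(normal_clone))      # ['refs/remotes/origin/master']
print(sorted(mirror_all))        # all four refs, including both pull refs
print(sorted(mirror_selective))  # ['refs/heads/master', 'refs/tags/v2.5.0']
```

The catch-all spec copies the pull refs along with everything else, while the two specific specs never match anything under refs/pull, so those refs simply have no local counterpart.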

Full Config Example

For completeness, I've attached the full config (see git config -e) I use myself for mirroring Math.NET Numerics below.

[core]
    repositoryformatversion = 0
    filemode = false
    bare = true
    symlinks = false
    ignorecase = true
    hideDotFiles = dotGitOnly
[remote "mathnet"]
    url = git@github.com:mathnet/mathnet-numerics.git
    fetch = +refs/heads/*:refs/heads/*
    fetch = +refs/tags/*:refs/tags/*
    mirror = true
[remote "mirrors"]
    url = https://git01.codeplex.com/mathnetnumerics
    url = https://code.google.com/p/mathnet-numerics/
    url = git@gitorious.org:mathnet-numerics/mainline.git
    mirror = true
    skipDefaultUpdate = true

To update the mirror, I then run the following commands:

$ git remote update
$ git push mirrors

Linear Regression with Math.NET Numerics

2012-09-09T20:14:00+02:00

Likely the most requested feature for Math.NET Numerics is support for some form of regression, or fitting data to a curve. I'll show in this article how you can easily compute regressions manually using Math.NET, until we support it out of the box. We already have broad interpolation support, but interpolation is about fitting some curve exactly through a given set of data points and therefore an entirely different problem.

For a regression there are usually many more data points available than curve parameters, so we want to find the parameters that produce the lowest errors on the provided data points, according to some error metric.

Least Squares Linear Regression

If the curve is linear in its parameters, then we're speaking of linear regression. The problem becomes much simpler and we can leverage the rich linear algebra toolset to find the best parameters, especially if we want to minimize the square of the errors (least squares metric).

In the general case such a curve would be in the form of a linear combination of $N$ arbitrary but known functions $f_i(x)$, scaled by the parameters $p_i$. Note that none of the functions $f_i$ depends on any of the $p_i$ parameters.

\[y : x \mapsto p_1 f_1(x) + p_2 f_2(x) + \cdots + p_N f_N(x)\]

If we have $M$ data points $(x_j,y_j)$, then we can write the whole problem as an overdetermined system of $M$ equations:

\[\begin{eqnarray} y_1 &=& p_1 f_1(x_1) + p_2 f_2(x_1) + \cdots + p_N f_N(x_1) \\ y_2 &=& p_1 f_1(x_2) + p_2 f_2(x_2) + \cdots + p_N f_N(x_2) \\ &\vdots& \\ y_M &=& p_1 f_1(x_M) + p_2 f_2(x_M) + \cdots + p_N f_N(x_M) \end{eqnarray}\]

Or in matrix notation:

\[\begin{eqnarray} \mathbf y &=& \mathbf X \mathbf p \\ \begin{bmatrix}y_1\\y_2\\ \vdots \\y_M\end{bmatrix} &=& \begin{bmatrix}f_1(x_1) & f_2(x_1) & \cdots & f_N(x_1)\\f_1(x_2) & f_2(x_2) & \cdots & f_N(x_2)\\ \vdots & \vdots & \ddots & \vdots\\f_1(x_M) & f_2(x_M) & \cdots & f_N(x_M)\end{bmatrix} \begin{bmatrix}p_1\\p_2\\ \vdots \\p_N\end{bmatrix} \end{eqnarray}\]

This is a standard least squares problem and can easily be solved using Math.NET Numerics's linear algebra classes and the QR decomposition. In literature you'll usually find algorithms explicitly computing some form of matrix inversion. While symbolically correct, using the QR decomposition instead is numerically more robust. This is a solved problem, after all.

`var p = X.QR().Solve(y);`

Some $\mathbf{X}$ matrices of this form have well known names, for example the Vandermonde-Matrix for fitting to a polynomial.
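Building $\mathbf{X}$ is mechanical once the basis functions are chosen: row $j$ holds $f_1(x_j), \dots, f_N(x_j)$. A minimal Python sketch, with the helper names `design_matrix` and `vandermonde` made up for this illustration:

```python
def design_matrix(xs, basis):
    """Row j contains f_1(x_j), ..., f_N(x_j) for the given basis functions."""
    return [[f(x) for f in basis] for x in xs]

def vandermonde(xs, degree):
    """Design matrix for a polynomial fit: basis functions 1, x, x^2, ..., x^degree."""
    basis = [lambda x, k=k: x ** k for k in range(degree + 1)]
    return design_matrix(xs, basis)

X = vandermonde([1.0, 2.0, 3.0], 2)
print(X)   # [[1.0, 1.0, 1.0], [1.0, 2.0, 4.0], [1.0, 3.0, 9.0]]
```

Any other model that is linear in its parameters only needs a different `basis` list; the rest of the procedure stays the same.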

Example: Fitting to a Line

A line can be parametrized by the height $a$ at $x=0$ and its slope $b$:

\[y : x \mapsto a + b x\]

This maps to the general case with $N=2$ parameters as follows:

\[p_1 = a, f_1 : x \mapsto 1 \\ p_2 = b, f_2 : x \mapsto x\]

And therefore the equation system

\[\begin{bmatrix}y_1\\y_2\\ \vdots \\y_M\end{bmatrix} = \begin{bmatrix}1 & x_1\\1 & x_2\\ \vdots & \vdots\\1 & x_M\end{bmatrix} \begin{bmatrix}a\\b\end{bmatrix}\]

The complete code when using Math.NET Numerics would look like this:

// data points
var xdata = new double[] { 10, 20, 30 };
var ydata = new double[] { 15, 20, 25 };

// build matrices
var X = DenseMatrix.CreateFromColumns(
  new[] {new DenseVector(xdata.Length, 1), new DenseVector(xdata)});
var y = new DenseVector(ydata);

// solve
var p = X.QR().Solve(y);
var a = p[0];
var b = p[1];
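For readers who want to see what happens behind the QR solve, the same line fit can be reproduced without any library. The Python sketch below builds a thin QR factorization with modified Gram-Schmidt and then back-substitutes. Gram-Schmidt is chosen here only because it is short; nothing in this article says it is what Math.NET Numerics does internally. The function name `qr_solve` is hypothetical, and the code assumes the columns of X are linearly independent.

```python
import math

def qr_solve(X, y):
    """Least-squares solution of X p = y via thin QR (modified Gram-Schmidt)."""
    m, n = len(X), len(X[0])
    columns = [[X[i][j] for i in range(m)] for j in range(n)]
    Q = []                                    # orthonormal columns q_1 .. q_n
    R = [[0.0] * n for _ in range(n)]         # upper triangular factor
    for j in range(n):
        v = columns[j][:]
        for k in range(j):
            # remove the component along q_k, using the already updated v
            R[k][j] = sum(Q[k][i] * v[i] for i in range(m))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(m)]
        R[j][j] = math.sqrt(sum(t * t for t in v))
        Q.append([t / R[j][j] for t in v])
    # solve R p = Q^T y by back substitution
    qty = [sum(Q[j][i] * y[i] for i in range(m)) for j in range(n)]
    p = [0.0] * n
    for j in reversed(range(n)):
        rest = sum(R[j][k] * p[k] for k in range(j + 1, n))
        p[j] = (qty[j] - rest) / R[j][j]
    return p

xdata = [10.0, 20.0, 30.0]
ydata = [15.0, 20.0, 25.0]
X = [[1.0, x] for x in xdata]     # columns: f_1(x) = 1, f_2(x) = x
a, b = qr_solve(X, ydata)
print(a, b)                       # close to 10.0 and 0.5
```

The three sample points lie exactly on the line y = 10 + 0.5x, so the fitted parameters reproduce that intercept and slope up to rounding.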

Example: Fitting to an arbitrary linear function

The functions $f_i(x)$ do not have to be linear in $x$ at all to work with linear regression, as long as the resulting function $y(x)$ remains linear in the parameters $p_i$. In fact, we can use arbitrary functions, as long as they are defined at all our data points $x_j$. For example, let's compute the regression to the following complicated function including the Digamma function $\psi(x)$, sometimes also known as Psi function:

\[y : x \mapsto a \sqrt{\exp x} + b \psi(x^2)\]

The resulting equation system in Matrix form:

\[\begin{bmatrix}y_1\\y_2\\ \vdots \\y_M\end{bmatrix} = \begin{bmatrix}\sqrt{\exp{x_1}} & \psi(x_1^2)\\\sqrt{\exp{x_2}} & \psi(x_2^2)\\ \vdots & \vdots\\\sqrt{\exp{x_M}} & \psi(x_M^2)\end{bmatrix} \begin{bmatrix}a\\b\end{bmatrix}\]

The complete code with Math.NET Numerics, but this time with F#:

// define our target functions
let f1 x = Math.Sqrt(Math.Exp(x))
let f2 x = SpecialFunctions.DiGamma(x*x)

// create data samples, with chosen parameters and with gaussian noise added
let fy (noise:IContinuousDistribution) x = 2.5*f1(x) - 4.0*f2(x) + noise.Sample()
let xdata = [ 1.0 .. 1.0 .. 10.0 ]
let ydata = xdata |> List.map (fy (Normal.WithMeanVariance(0.0,2.0)))

// build matrix form
let X =
    [|
        xdata |> List.map f1 |> vector
        xdata |> List.map f2 |> vector
    |] |> DenseMatrix.CreateFromColumns
let y = vector ydata

// solve
let p = X.QR().Solve(y)
let (a,b) = (p.[0], p.[1])

Note that we use the Math.NET Numerics F# package here (e.g. for the vector function).

Example: Fitting to a Sine

Just like the digamma function we can also target a sine curve. However, to make it more interesting, we're also looking for phase shift and frequency parameters:

\[y : x \mapsto a + b \sin(c + \omega x)\]

Unfortunately the function $f_2 : x \mapsto \sin(c + \omega x)$ now depends on parameters $c$ and $\omega$ which is not allowed in linear regression. Indeed, fitting to a frequency $\omega$ in a linear way is not trivial if possible at all, but for a fixed $\omega$ we can leverage the following trigonometric identity:

\[\begin{eqnarray} a+b\sin(c + \omega x) &=& a+u\sin{\omega x}+v\cos{\omega x} \\ b &=& \sqrt{u^2+v^2} \\ c &=& \operatorname{atan2}(v,u) \end{eqnarray}\]

and therefore

\[\begin{bmatrix}y_1\\y_2\\ \vdots \\y_M\end{bmatrix} = \begin{bmatrix}1 & \sin \omega x_1 & \cos \omega x_1\\1 & \sin \omega x_2 & \cos \omega x_2\\ \vdots & \vdots & \vdots\\1 & \sin \omega x_M & \cos \omega x_M\end{bmatrix} \begin{bmatrix}a\\u\\v\end{bmatrix}\]

However, note that because of the non-linear transformation on the $b$ and $c$ parameters, the result will no longer be strictly the least square error solution. While our result would be good enough for some scenarios, we'd either need to compensate or switch to non-linear regression if we need the actual least square error parameters.

The complete code in C# with Math.NET Numerics would look like this:

// data points: we compute y perfectly but then add strong random noise to it
var rnd = new Random(1);
var omega = 1.0d;
var xdata = new double[] { -1, 0, 0.1, 0.2, 0.3, 0.4, 0.65, 1.0, 1.2, 2.1, 4.5, 5.0, 6.0 };
var ydata = xdata
  .Select(x => 5 + 2 * Math.Sin(omega*x + 0.2) + 2*(rnd.NextDouble()-0.5)).ToArray();

// build matrices
var X = DenseMatrix.CreateFromColumns(new[] {
    new DenseVector(xdata.Length, 1),
    new DenseVector(xdata.Select(t => Math.Sin(omega*t)).ToArray()),
    new DenseVector(xdata.Select(t => Math.Cos(omega*t)).ToArray())});
var y = new DenseVector(ydata);

// solve
var p = X.QR().Solve(y);
var a = p[0];
var b = SpecialFunctions.Hypotenuse(p[1], p[2]);
var c = Math.Atan2(p[2], p[1]);
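The conversion from $(u, v)$ back to amplitude $b$ and phase $c$ in the last two lines can be checked numerically. The following Python sketch is independent of Math.NET Numerics. The helper name `amplitude_phase` and the sample values for `u` and `v` are chosen arbitrarily for illustration.

```python
import math

def amplitude_phase(u, v):
    """Convert u*sin(wx) + v*cos(wx) into b*sin(c + wx); returns (b, c)."""
    return math.hypot(u, v), math.atan2(v, u)

# Arbitrary illustrative coefficients, not fitted values.
u, v, omega = 1.5, -0.8, 1.0
b, c = amplitude_phase(u, v)

# Both forms of the model must agree at every x.
for x in (-1.0, 0.0, 0.4, 2.1, 6.0):
    lhs = b * math.sin(c + omega * x)
    rhs = u * math.sin(omega * x) + v * math.cos(omega * x)
    assert abs(lhs - rhs) < 1e-12
```

This works because $b\sin(c + \omega x) = b\cos c\,\sin\omega x + b\sin c\,\cos\omega x$, so $u = b\cos c$ and $v = b\sin c$.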

The following graph visualizes the resulting regressions. The curve we computed the $y$ values from, before adding the strong noise, is shown in black. The red dots show the actual data points with only small noise, the blue dots the points with much stronger noise added. The red and blue curves then show the actual computed regressions for each.

Lokad.Cloud Architecture Refresh

2011-08-04T13:29:00+02:00

In a recent post about new deployment and versioning approaches in Lokad.Cloud I mentioned that I'm also heavily refactoring the old cloud service framework and runtime. That refactoring was long due but also required to support these new approaches effectively.

In essence, developing Cloud Services still works as before. There is a framework library (Lokad.Cloud.Services.Framework) that provides base classes for a small set of service types that you can derive from. The following figure shows the dependencies of all involved components:

Previously the framework also contained the complete runtime with AppDomain isolation and all. This is no longer the case (since we want to use the new deployment approach). Instead, the framework now comes with service runners, which are lightweight classes that take already created service instances plus their settings and can be used to run cloud services directly in the current thread, without any isolation. This comes in handy for easier debugging and testing. In simple scenarios it might even be good enough for production or integration into another system.

However, in production scenarios you often do want proper isolation, more robustness and some deployment story. That's where the new AppHost comes in.

Introducing the Lokad.Cloud AppHost

The new deployment approach is currently implemented in a prototype, Lokad.Cloud AppHost. I'll introduce the AppHost in more detail in a later post. Important for now is that it comes with two assemblies, AppHost and AppHost.Framework. AppHost.Framework is essentially a set of contracts, while AppHost implements the actual runtime environment. Both are quite small and simple. The typical architecture anticipated in the prototype is as follows:

Your Context
Represents the whole environment where the AppHost is executed to the AppHost itself. This is why AppHost has no dependencies at all (except SharpZipLib, but that will likely be dropped soon). Thanks to this abstraction, AppHost is completely neutral to where and how it is executed. The context also provides a deployment reader and thus decides where and how application deployments are stored.
Your Worker Process
This would be the process where the whole application is executing, e.g. a Windows Azure WorkerRole, a Windows Service or even some CLI application. The worker builds the "host context", creates an AppHost Host instance using said context and then starts and stops the host on demand.
Your Entry Point
The entry point of your application that is hosted using the AppHost. The entry point class type is chosen in the deployment itself, and automatically created in one or more runtime cells (again as specified in the deployment), isolated by AppDomain and in its own thread.

Note that this figure does not mention Lokad.Cloud Services, Storage or Provisioning at all. Indeed, AppHost could be used to host all kinds of applications (e.g. even some business application based on Lokad.CQRS).

Hosting Lokad.Cloud Services in AppHost

That's all well and good, but the primary scenario is to run Lokad.Cloud Services. One of the design targets of Cloud Services has always been simple usage, achieved in part by tight integration of our storage and provisioning toolkits into the services framework (opinionated on infrastructure). Luckily this gives us the opportunity to provide complete AppHost Context and EntryPoint implementations. The complete services solution now looks like this:

Note that the services framework no longer depends on Provisioning, and does not depend on any AppHost infrastructure at all. Neither AppHost nor Provisioning thus leak into your cloud services implementations. The separation between AppContext and AppEntryPoint also reflects that they run in different places: AppContext is used directly in the host process, while AppEntryPoints run in the isolated runtime cell AppDomains. This becomes clear when we visualize the complete solution:

This looks quite complicated and like a lot of infrastructure just to support that little yellow box on the top right. But this is somewhat misleading, as all the components are very focused and most of them small and independent.

A closer look at what is actually deployed on the worker (e.g. your Azure WorkerRole) reveals that there is really nothing more than the AppContext, which opinionates the AppHost towards Lokad.Cloud Provisioning and Storage, plus the code that connects this context with the AppHost and runs it in the worker process:

Similarly, the actual (versioned) application deployments need to contain only the assemblies shown in the following figure. Obviously there are your cloud services, but also the entry point and the service framework:

All these parts are thus movable and "replaceable" per deployment. This brings up some nice opportunities, as you can patch and replace any of these assemblies in specific deployments without worrying about compatibility with the worker process (this used to be an issue in the past). You can change the scheduling, add new cloud service types or even replace the framework and entry point completely with your own code. Technically there's also no need to keep it separated into three assemblies, but the isolated EntryPoint helps keep some dependencies like AppHost out of your cloud services.

Overengineered?

I claim it is not. If you do want all of these:

Storage:
Robust storage (especially important for remote cloud storage) that is very easy to use
Provisioning:
Automatically scale your worker instances (cloud scenario) based on demand
Deployments:
Easily switch between deployments, fast, versioned including settings, in Git style.
Runtime:
Robust multi-cell cloud application hosting, self-healing to some degree.
Cloud Services:
Compute agents that are easy to implement.

then you do need all these components. You can either have them all in one huge monolithic assembly with your logic depending on all of them, or you can isolate them logically, keep them simple and focused and avoid unnecessary dependencies, as suggested in the presented architecture.

Cleaning up after migrating from Hg to Git

2011-07-30T11:35:00+02:00

There is a lot of guidance out there on how to migrate from Mercurial to Git, but it often leaves you with a repository in a bad state. Even more so if it originally was a Subversion repository, then migrated to Mercurial and now finally to Git.

The Lokad.Cloud repository was such a case. The committers and authors in the commit history were a complete mess, but that's not that much of an issue in practice. Worse is the fact that most text files were stored with CRLF line endings instead of LF internally. Git supports platform-native checkouts (CRLF on Windows, LF on Linux) quite nicely, but it only works well if text files are normalized to LF internally when committed. I strongly recommend doing that, as it will save you from a lot of trouble later on. Luckily it is also the default behavior for new repositories.

Migration: Fast-Export to Git

This is the usual procedure that properly converts branches and tags to the git equivalents:

git clone git://repo.or.cz/fast-export.git
mkdir git_repo && cd git_repo
git init
/path/to/hg-fast-export.sh -r /path/to/mercurial_repo
git checkout HEAD

Normalize the whole history to LF line-endings

This step is only needed if some or all of the commits use non-LF line endings internally. If the repo once lived in Subversion on Windows this is almost certainly the case, but not necessarily for pure Mercurial repositories. To find out whether this is an issue, remove your git index and then reset. If a lot of files are now listed as modified, you had better fix it as described here; if not, you can skip this step.

rm .git/index
git reset

I recommend doing this step on Linux as it didn't work well for me on Windows.

First we need to turn off any automated git end-of-line handling. Unfortunately this is controlled in multiple places (for historical reasons). First there is the core.autocrlf config we need to turn off:

`git config core.autocrlf false`

Then we need to get rid of all the .gitattributes files in your repository in case they specify any automatic eol handling. This is not necessary in most cases, yet the repository I was dealing with used to be a hybrid git/mercurial repo some time ago and thus already had a .gitattributes file. If there is one, delete it and commit. Afterwards your current working directory should be clean, since git no longer wants to fix your line endings on any touched text files.

But to make sure the .gitattributes files in previous commits don't mess with us, we need to drop them from all commits (single line):

git filter-branch --prune-empty --index-filter \
   'git rm --cached --ignore-unmatch .gitattributes' -- --all

After that we finally can go converting all the text files to LF line endings, with another history rewrite (single line):

git filter-branch -f --prune-empty --tree-filter \
   'git ls-files -z | xargs -0 dos2unix --skipbin' -- --all

What this does is, for every commit, convert all files that are not binary to LF endings using dos2unix. In my case there are some paths with spaces in them (don't ask..), so I switched over to NUL-character separation using the -z and -0 options.
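To make the per-file transformation concrete, here is a rough Python sketch of what happens to each file's content. The NUL-byte check is an assumed stand-in for binary detection (dos2unix uses its own rules), and the function names are invented for this example.

```python
def is_binary(data: bytes) -> bool:
    """Assumed heuristic: content containing a NUL byte is treated as binary."""
    return b"\x00" in data

def to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving binary content untouched."""
    if is_binary(data):
        return data
    return data.replace(b"\r\n", b"\n")

text = to_lf(b"first line\r\nsecond line\r\n")   # becomes LF-only
blob = to_lf(b"\x00\x01\r\n\x02")                # returned unchanged
```

Files that already use LF pass through unchanged, which is why rerunning the normalization is harmless.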

To ensure the normalization is enforced in future commits (especially from people forking your repository and then sending you pull requests), create a new .gitattributes file containing at least something like * text=auto. The config option core.autocrlf, however, is not only local but also deprecated. You can remove it completely using

`git config --unset core.autocrlf`

Clean up committers and authors

You can get a quick overview on how badly the authors are off using

`git shortlog -se`

Luckily, fixing them is not that difficult, with yet another history rewrite:

git filter-branch -f --env-filter '
if [ "$GIT_COMMITTER_NAME" = "bad user name" ]
then
export GIT_COMMITTER_NAME="correct user name"
export GIT_COMMITTER_EMAIL="correct email address"
fi
if [ "$GIT_AUTHOR_NAME" = "bad user name" ]
then
export GIT_AUTHOR_NAME="correct user name"
export GIT_AUTHOR_EMAIL="correct email address"
fi
' -- --all

Housekeeping

After all these rewrites it would be a good time to do some git maintenance, i.e.

`git fsck --full`

to check and verify your repository, drop no longer used blobs with

`git prune`

and then clean up and optimize your local repository using

`git gc --aggressive`

Lokad.Cloud Application Deployment and Versioning Refresh

2011-07-15T17:10:00+02:00

At the very beginning of the Lokad.Cloud project we decided not to rely on the Windows Azure management tools to deploy new versions of our application. Instead we implemented a dynamic worker role - initially deployed once using the Windows Azure tools - that provides a runtime environment that can load and unload applications on demand, without even recycling the Azure virtual machine. The applications are isolated in a separate AppDomain so we can unload them safely, plus for sandboxing.

Disclaimer: I'm a major contributor to the Lokad.Cloud open source project. Lokad.Cloud is a framework for distributed computing in Windows Azure, plus a set of independent toolkits like Lokad.Cloud.Storage for simpler and more reliable cloud storage access and Lokad.Cloud.Provisioning for dynamic worker auto-scaling. We use Lokad.Cloud at Lokad to deal with our massive and rapidly changing computation demands.

This approach worked out nicely for us, both for apps that don't change for months and apps where we often have multiple redeployments per hour. Yet over the last year we gained a lot of experience and discovered some dark spots in the current stable release. Most of them were "good enough" back then, but are starting to get in the way:

Non-atomic deployments
The deployment mainly consisted of an assembly blob (zip file containing all the assemblies), a config blob (IoC configuration to add application-specific registrations and config like additional connection strings) plus service settings distributed over multiple blobs. Since workers automatically discover new deployments themselves, deployments are not atomic. A worker could see new assemblies but old and possibly incompatible IoC configuration, causing undefined behavior until it catches up.
Lots of blobs to poll for changes
By design, all workers are completely self-contained and self-healing. A worker is never "contacted" in any way from outside, except via cloud storage (queues, blobs, tables). Since the deployments are non-atomic and in particular service settings are spread over multiple blobs, they all have to be polled for changes repeatedly on every worker. Polling for changed ETags is not expensive but causes latency and can add up if there are lots of services and worker instances. Even worse, some service settings also contained state (for simplicity) and thus changed quite often.
Replacing an application is easy, getting back not so
It is very easy to redeploy, but there's no way to get back to the previous state unless you have a backup ready or can rebuild it from sources. Upgrading becomes much safer if it is easy to get back, reducing the burden of shorter deployment cycles.
Growing demand for stronger runtime
E.g. to support multiple runtime "cores" or "cells" on each worker with independent scheduling and customizable assignment/affinity.

Hence I started refactoring the Lokad.Cloud service framework recently, including reworking the handling of app deployments (not released yet):

Concentrating service settings to single blob

Cloud services have been refactored so that they no longer have to manage their own settings. Instead all settings are now stored in a single blob. Settings include parameters like whether a service is disabled or the trigger interval for scheduled services. Settings generally change rarely (e.g. manually through the management console) so the new settings blob still changes only rarely and conflicts are no issue (can be handled with optimistic concurrency). This brings the number of blobs to poll for drastically down to three, reducing a lot of unnecessary storage I/O and thus latency.

Separating deployments from currently active deployment

Previously there were just three blobs (assemblies, config and settings) for the currently active deployment. Changes were almost immediately applied on all workers.

From now on we can have multiple deployments exist in parallel. The "currently active deployment" is given by a pointer to the chosen deployment. Deployments are now read-only. If we want to change anything in a deployment (e.g. change some settings) we essentially create a new deployment and update the active deployment pointer to point to that new deployment instead. Let's call that new pointer HEAD.

Only one blob to poll

Since deployments are read-only, the only blob we have to poll is HEAD. Since no other polling is needed, we can easily poll more often to get much more reactive workers. If we poll once every 15 seconds, we get transaction costs of around US$0.17 per month per active worker plus maybe US$0.10 for bandwidth (note that most of the time the HTTP packets will only have headers, no body payload). This is negligible compared to the worker instance cost.
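The transaction figure can be reproduced with a few lines of arithmetic. The price of US$0.01 per 10,000 storage transactions and the 30-day month are assumptions made for this sketch. The post itself states only the resulting total.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600        # assumed 30-day month
POLL_INTERVAL_S = 15                      # one HEAD poll every 15 seconds
PRICE_PER_10K_TRANSACTIONS = 0.01         # USD, assumed price point

polls_per_month = SECONDS_PER_MONTH // POLL_INTERVAL_S
cost = polls_per_month / 10_000 * PRICE_PER_10K_TRANSACTIONS

print(polls_per_month, round(cost, 2))    # 172800 0.17
```

Halving the interval doubles the cost, so even polling every few seconds stays well below a dollar per worker per month under these assumptions.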

Content-based storage for deployments

I've introduced content-based storage in a previous blog post. The general idea is to identify data by its hash (often SHA-1 or SHA-256), resulting in automatic deduplication and verifiable referential consistency. It is an ideal concept for versioning, which is why it is also broadly used in the popular git distributed version control system. It is also an ideal approach for managing our deployments. Like this:

Think of assemblies, config and settings as files, deployments as commits (pointing to one assemblies, config and settings blob each, shared if equal), and HEAD as head just like in git. All the arrows include the full hash of the target (as part of their name, shortened in the diagram).

The Index is just a redundant list of all deployments for easier management so we don't have to iterate through all available deployments all the time. In a similar way a History blob could be interesting to track the last few deployments and when they have been deployed. Note that HEAD and Index (and History) are the only mutable blobs, all the others are read-only, although they can be garbage collected.

The arrows between deployments will likely be dropped, as they don't seem to provide any value in practice.

Prepare deployment, then activate atomically

Both the creation of a deployment (based on assemblies, config and settings) and actually activating it (by changing HEAD to point to it) are now atomic operations. They still can happen at different times though, so you can prepare one or more deployments but activate them much later, if at all. The applications can be completely unrelated, so you can use this mechanism to quickly switch between different applications.
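The prepare-then-activate flow can be sketched with an in-memory dictionary standing in for the blob container. Everything here is illustrative: the function names, the JSON layout of a deployment and the sample payloads are assumptions, not the actual Lokad.Cloud format, and the sketch does not model concurrent writers.

```python
import hashlib
import json

blobs = {}       # name -> bytes; stands in for a blob container
HEAD = "HEAD"    # the only mutable blob in this sketch

def put(data: bytes) -> str:
    """Store immutable content under its SHA-1 hex digest and return that key."""
    key = hashlib.sha1(data).hexdigest()
    blobs.setdefault(key, data)
    return key

def create_deployment(assemblies: bytes, config: bytes, settings: bytes) -> str:
    """Prepare a deployment: a small document naming its three parts by hash."""
    doc = {
        "assemblies": put(assemblies),
        "config": put(config),
        "settings": put(settings),
    }
    return put(json.dumps(doc, sort_keys=True).encode("utf-8"))

def activate(deployment_key: str) -> None:
    """Activate a prepared deployment by overwriting the single HEAD blob."""
    blobs[HEAD] = deployment_key.encode("ascii")

v1 = create_deployment(b"app-v1.zip", b"<config/>", b'{"interval": 60}')
v2 = create_deployment(b"app-v1.zip", b"<config/>", b'{"interval": 30}')
activate(v2)
activate(v1)     # getting back is just pointing HEAD at the earlier deployment
```

Both deployments share the same assemblies and config blobs, so the second one adds only a new settings blob and a new deployment document.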

Note that it still takes a while until all workers have detected the change, so there will be a phase where different workers (on different VMs and servers) may have different applications running. There are ways to deal with this if it is an issue, see below.

Get back to the previous version

... is as trivial as looking up the previous version in the History or Index and changing HEAD to point to it.

Changing service settings

If you change some settings in the web console, for example disable a service or change a trigger interval, a new settings blob will be created, then a new deployment referring to the new settings, and in the end HEAD will be changed to point to the new deployment. If you decide to change it back, the new settings will already exist (with the same hash), so in effect only the HEAD blob will be changed back to point to the previous deployment (plus the History updated if available). Note that you won't see much of that in practice as the management classes will handle it automatically.

Handling changes in the runtime

If the runtime detects a changed HEAD it will immediately load the deployment blob. Since it knows its current deployment and since the blobs are named after their content hash and readonly, it can simply compare the names to detect what blob has been changed. If either assemblies or config has changed, the runtime will have to restart all the processes, but if only the settings changed then it's usually enough to just adapt the scheduling appropriately. Settings changes therefore still have a rather small impact in practice, despite switching to a completely different deployment in the storage.
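A minimal sketch of that comparison, assuming a deployment is represented as a mapping from part name to content hash. The representation, the function names and the returned labels are hypothetical.

```python
PARTS = ("assemblies", "config", "settings")

def changed_parts(current: dict, new: dict) -> set:
    """Compare blob names (content hashes) to find which parts differ."""
    return {part for part in PARTS if current[part] != new[part]}

def reaction(current: dict, new: dict) -> str:
    """Decide how the runtime reacts to a new deployment."""
    changed = changed_parts(current, new)
    if not changed:
        return "nothing"
    if changed == {"settings"}:
        return "reschedule"      # only the scheduling has to be adapted
    return "restart"             # assemblies or config changed

current = {"assemblies": "a1", "config": "c1", "settings": "s1"}
print(reaction(current, dict(current, settings="s2")))    # reschedule
print(reaction(current, dict(current, assemblies="a2")))  # restart
```

No blob content has to be downloaded for this decision, since equal names imply equal content.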

Forcing a deployment from an application

Sometimes you need to ensure that a message is processed by a specific deployment. For example, we sometimes deploy a new version and then want to do some computation on exactly that version. To achieve that we could either wait, or include the deployment hash in the message and make the application force the runtime to load exactly that deployment if it isn't matching already. For that and similar purposes I suggest providing some way for services to send commands to the local runtime, like commands to enforce loading of the head or some specific deployment as soon as possible.

Multi-Head Scenario

This is unrelated to deployments, but the new runtime will also support multiple processes ("cells") in parallel, isolated in separate AppDomains and threads and with independent scheduling. Services settings contain a new cell affinity parameter to control in what cells a service should be executed. This can be useful e.g. to create a cell for low latency queue services or to avoid blocking when some services can have long processing times but there are only few worker instances available.

Now technically it would be possible to load different applications or versions in different cells at the same time, with separate HEADs for each cell. This would bring the service interleaving approach to a new level. Not sure how useful it would be in practice though (plus it would require some work to decide which app can choose the number of worker instances), so I won't follow that idea any further for now.

Too complicated?

This all seems very complicated just to do deployments. How could we simplify it but still satisfy our requirements?

Single blob only:
Store everything (assemblies, config, settings) in a single blob and give the currently active blob a special name (like "current" or still "HEAD"). Way simpler. Disadvantage: the whole blob can get large and has to be touched by every single settings change. Unlikely to be an issue in most deployments though.
Drop the hashing:
SHA is built in, so this is not really a big simplification. Disadvantage: we'd lose deduplication.
Toggle instead of versioning:
Just provide two versions of each blob, which can be switched on demand (similar to Azure staging vs. production deployments). The staging blobs would be edited and when done switched somehow in an atomic way (e.g. HEAD blob pointing to version again). This may simplify management remarkably.
Outsource the versioning:
Use a proven version control system instead, like git. Technically even Subversion would work (I've tested Subversion on Azure in the past and it worked fine; for git we could even use one of the native git libraries). Disadvantage: checking a remote repository for changes is more expensive than a simple Azure blob storage ETag check (git beats svn here). In the worst case we could work around that by introducing a HEAD blob in Azure storage again, containing the current head revision/hash. We would then update that blob after every commit and poll it from the workers at much higher frequency. Advantage: Much more robust versioning, we could drop the zip files (simply version the assemblies directly), and we'd get push deployment for free.

Personally I like that last alternative using git the most.

Feedback

What do you think? Too complicated? Overengineered? Should I use a real git repository instead? Let me know. Thanks!

(Migrated Comments)

Rinat Abdullin, July 23, 2011

Once again, that's a fine post, just like the previous one on hashing. I loved rereading it.

Just a few thoughts.

How hard would it be to use Lokad.Cloud cell management with AppDomains without using actual services and message dispatch?
I think that, given versioning, a sandbox/production toggle is just overkill (it duplicates the logic). Swaps are just a way to shift focus from one version to another, while keeping the ability to roll back.
Hashing and separated blobs, as I believe, are a must for a simple implementation. They allow keeping stuff simple and decoupled. Besides, the complexity could be reduced by the tooling.
Git (versus self-implemented versioning) is just a way to deliver changes in my opinion (besides versioned settings). So it should not be that different for blob or git storage (we just poll for changes and pull the version specified by head). I'm wondering how well a private GitHub repo would work here...

In case of blob, head is stored in blob storage (human-editable JSON), pointing to the deployment blobs.
In case of full git, head is in git.
In mixed scenario, head is stored in blob storage, pointing to the git version/url.

Sorry for pushing that much of my rambling here :)

Content-Based Storage in the Cloud

2010-07-21T12:19:00+02:00

One derivative of the NoSQL movement that rediscovers non-relational storage approaches lately is a content-based value store. Such a store is similar to a Key-Value store but uses a cryptographic hash of the value as key.

An SHA-1 hash of the value is good enough to identify it

The SHA-1 hash function is deterministic, meaning that for every value there's exactly one key that can be computed using SHA-1, hence value implies key. We can always compute the unique key of a value.

The probability of a SHA-1 hash collision is extremely low. The most cited numbers show that you'd need about 10²⁴ values to reach a 50% chance of a collision. Even with a whopping 10¹⁸ distinct values the likelihood of at least one collision is still below 10⁻⁹. Hence, a key refers to a single value with extremely high probability. While the SHA-1 function is not strictly injective, it is approximately injective enough for almost all practical applications.
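These orders of magnitude can be checked with the usual birthday approximation, $p \approx 1 - \exp(-n^2 / 2^{b+1})$ for $n$ random values and a $b$-bit hash. The sketch below covers accidental collisions only, not deliberate attacks, and the function names are made up for this example.

```python
import math

def collision_probability(n: float, bits: int = 160) -> float:
    """Birthday approximation for at least one collision among n random hashes."""
    return -math.expm1(-(n * n) / 2.0 ** (bits + 1))

def values_for_probability(p: float, bits: int = 160) -> float:
    """Approximate number of values needed to reach collision probability p."""
    return math.sqrt(2.0 ** (bits + 1) * -math.log1p(-p))

n_half = values_for_probability(0.5)      # on the order of 10**24
p_1e18 = collision_probability(1e18)      # far below 10**-9
```

Using `expm1` and `log1p` keeps the tiny probabilities from being rounded away to zero.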

Note that this is different from common non-distributed Hash Tables where a very short hash function is used to directly jump fast to an inner data structure (bucket) containing all items sharing the same hash. The motivation for hashing in such hash tables is to be able to directly compute the position where an item is stored, avoiding long linear or binary searches.

Verifiable Consistency

A nice side effect of using SHA-1 is that given the key, the value retrieved from the storage can be verified (to detect data corruption or tampering) simply by recomputing its SHA-1 hash and comparing it to the key. You can even digitally sign a key and by that implicitly sign its value and all those referred to by it.
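The verification step is short enough to show in full. This sketch uses Python's standard `hashlib`. The names `key_of` and `verify` are chosen for this example.

```python
import hashlib

def key_of(value: bytes) -> str:
    """The key of a value is the hex form of its SHA-1 digest."""
    return hashlib.sha1(value).hexdigest()

def verify(key: str, value: bytes) -> bool:
    """Recompute the hash of a retrieved value and compare it to the key."""
    return key_of(value) == key

key = key_of(b"hello")
ok = verify(key, b"hello")        # an intact value matches its key
tampered = verify(key, b"hellp")  # a single changed byte is detected
```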

Keys are uniformly distributed

When using the common hex string format, the 160-bit keys are always 40 characters long and look something like this:

`d921970aadf03b3cf0e71becdaab3147ba71cdef`

We can safely treat them as if their characters were distributed uniformly (0-9, a-f). This brings some advantages especially when used in a distributed or cloud-like scenario, as simple prefix ranges (like 0-3, 4-7, 8-b, c-f) can be used for partitioning and distributed processing.
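Prefix partitioning of this kind takes only a few lines. The sketch below maps a key to one of the ranges by its first hex character. The function name and the restriction to partition counts that divide 16 are choices made for this example.

```python
def partition_of(key: str, partitions: int = 4) -> int:
    """Map a hex key to a partition by its first character.

    With 4 partitions the ranges are 0-3, 4-7, 8-b and c-f.
    """
    if partitions < 1 or 16 % partitions != 0:
        raise ValueError("partitions must divide 16")
    return int(key[0], 16) * partitions // 16

index = partition_of("d921970aadf03b3cf0e71becdaab3147ba71cdef")  # last range, c-f
```

Finer partitioning would simply look at more than one leading character.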

On the other hand this means that you can't use other indexed keys or ordering out of the box without further logic or storage on top of it.

The value of a key is fixed and can't be changed

The value associated with a key can never change. If you store an updated value, you'll get a new key for it and update the reference to this new key. This has severe consequences on where this storage scheme can be used efficiently. For example, a typical relational data model with cyclic relations wouldn't fit at all to such a content-based data store.

However, in practice in a cloud-like application this is often not that much of an issue. Even more so as soon as you realize that the existence of read-only stale yet still consistent data is not an issue either (see CQRS).

Again this fits very well with distributed and cloud computing, as it becomes trivial to aggressively cache values locally. If a value is found in the local cache it is guaranteed to be up to date (since values can't change), so you don't even have to check for timestamps or whether it has been changed remotely. Since in Azure the instances come with a lot of local storage, a simple MRU cache for a few GB can save you a lot of downloads and roundtrips if you use only a relatively small number of instances or have managed to create some weak affinity between jobs and Azure instances.
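A sketch of such a cache, with a plain dictionary standing in for the remote store. The class name and the download counter are illustrative, and eviction (the size bound of a real cache) is left out for brevity.

```python
class CachedStore:
    """Read-through cache in front of a content-based store."""

    def __init__(self, remote):
        self.remote = remote    # any mapping from key to bytes
        self.local = {}         # local cache; never needs invalidation
        self.downloads = 0

    def get(self, key: str) -> bytes:
        # A cached value is always current because values never change,
        # so no timestamp or ETag check is needed on a hit.
        if key not in self.local:
            self.local[key] = self.remote[key]
            self.downloads += 1
        return self.local[key]

store = CachedStore({"k1": b"payload"})
store.get("k1")
store.get("k1")    # served locally, no second download
```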

Example: Large Queue Messages

Azure Queues have content size limitations, that's why Lokad.Cloud implements logic to let messages transparently overflow to blob storage. To do that it needs a way to store a value in a blob that it can retrieve later by some identifier. This identifier is then packed to the actual message. There's no need to access it in any other way, so it's a perfect candidate for a content-based value store.

In my experience, in real life the probability that a message is processed on the same worker that originally put it there is often high or at least not negligible. In all these cases, a cached content-based value store would save you from having to download these blobs completely, but still work correctly otherwise.

Implicit Value-Deduplication

Since the same value leads to the same key, trying to store the same value twice means you get the same storage location and the value gets stored only once. The second attempt can even be aborted early by provoking a precondition violation, or skipped completely if the value is already in the local cache (depending on the deletion plan).
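
The early abort can be sketched as follows. `BlobContainer` is an in-memory stand-in with "create only if absent" semantics, not a real storage client, and all names are hypothetical.

```python
import hashlib

class PreconditionFailed(Exception):
    """Raised when a blob with the given name already exists."""

class BlobContainer:
    """In-memory stand-in for a container that supports create-if-absent."""

    def __init__(self):
        self._blobs = {}

    def create_if_absent(self, name, data):
        if name in self._blobs:
            raise PreconditionFailed(name)
        self._blobs[name] = data

    def __len__(self):
        return len(self._blobs)

def store_value(container, value):
    """Store a value under its content key; return (key, whether it was uploaded)."""
    key = hashlib.sha1(value).hexdigest()
    try:
        container.create_if_absent(key, value)
        return key, True
    except PreconditionFailed:
        # Same key means same value, so there is nothing left to do.
        return key, False
```

Storing the same bytes twice yields the same key both times, and only the first call actually writes anything.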

Example: Daily Backup Snapshots

I recently wrote a small service that periodically takes full snapshots of all tables and blobs of a set of Azure storage accounts to a separate account, keeps the last N snapshots each and removes the rest. Often only a small subset of blobs or table entities actually change in a day. Had I used content-based storage, I could have saved a lot of storage (and thus cost) by deduplication without having to implement complicated incremental or differential backups. Taking a snapshot would likely also have taken less time thanks to some saved uploads.

Trivial Distribution and Replication

Unlike classical relational databases and key-value stores, replication and distribution of data in such a content-based value store is trivial since there can't be any conflicts. This is why the caching mentioned above works so well. Replication simply means copying the values of all missing keys over to the target. A consequence is that for some scenarios there's no technical need for a single master database. A peer can synchronize with any other peer, resulting in full peer-to-peer support. Distributed hash tables (DHT) as used by most file sharing solutions including BitTorrent work similarly and turn out to be very efficient.
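
In code, replication really is just a set difference. This sketch models each store as a dict from key to value; the function names are my own.

```python
def replicate(source: dict, target: dict) -> int:
    """Copy every value whose key is missing in the target; return how many were copied."""
    missing = source.keys() - target.keys()
    for key in missing:
        # No conflict handling needed: an existing key always holds the same value.
        target[key] = source[key]
    return len(missing)

def synchronize(peer_a: dict, peer_b: dict) -> None:
    """Two-way sync between peers, no master required."""
    replicate(peer_a, peer_b)
    replicate(peer_b, peer_a)
```

After `synchronize`, both peers hold the union of their values, regardless of which one you consider the "source".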

History Consistency and Versioning

Since values can't change, they remain consistent with each other even when they become stale. That's why this approach is used by most of the popular distributed version control systems like Git and Mercurial as well.

The Git object model is nicely described in the git community book (the following two images are taken from there). In essence, all objects are stored just as described here. In addition to data blobs (i.e. source code files) there are also tree objects representing a folder simply by listing all the SHA-1 keys of its child elements, again stored by its hash:

If a file changes in git in a new revision, it will get a new hash. The folder/tree containing it will update that hash in its list, and in turn will itself get a new hash. Both the old and the new version are therefore still available completely and consistently simply by referring to the hash of the respective version of the tree.
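
The effect can be reproduced in a few lines. The sketch below hashes blobs and a flat tree the way Git does as far as I know (a `<type> <size>` header, a NUL byte, then the body); it only handles regular files with ASCII names in a single folder, which is a simplification.

```python
import hashlib

def git_object_hash(kind: str, body: bytes) -> str:
    """SHA-1 over '<kind> <size>\\0' followed by the object body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def blob_hash(content: bytes) -> str:
    return git_object_hash("blob", content)

def tree_hash(files: dict) -> str:
    """Hash a flat folder given {file name: content}, regular files only."""
    body = b""
    for name in sorted(files):
        raw_key = bytes.fromhex(blob_hash(files[name]))
        # Each entry lists mode, name and the raw 20-byte key of the child.
        body += b"100644 " + name.encode() + b"\0" + raw_key
    return git_object_hash("tree", body)

old_tree = tree_hash({"readme.txt": b"version 1\n"})
new_tree = tree_hash({"readme.txt": b"version 2\n"})
print(old_tree != new_tree)  # the tree key changes together with the file
```

Changing a single file changes its blob key, which changes the tree entry and therefore the tree key, while the old tree key still identifies the old state.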

Historical consistency can be useful for all kinds of applications. Note that this approach persists snapshots of values and content, not how they are changed. This is thus a dual counterpart to concepts like event sourcing where only the actions causing changes of the values are persisted but not the actual values.

Append-Only Storage or Value Scavenging

Unless you need an append-only storage, you need to be careful about deleting values in such a system. Since there is implicit deduplication, you can't just delete what you've just inserted since the same value could also be used in other places. There are several ways to approach this, depending on your scenario:

Garbage Collection: If there is a hierarchy where all values are referenced by another value, you can follow the tree from time to time and then remove all values you haven't seen. This is used by all the distributed version control systems. Be careful about race conditions though.
Reference Tracking: Use metadata to list all keys or items referring to a value. If you remove the last reference, remove the value as well. This can be combined with garbage collection. You can also use reference counters, but they are difficult to handle correctly in an unreliable world like a cloud environment where instantaneous VM shutdowns without prior notice are to be expected.
Time-Based: You "touch" a value (update a timestamp in the metadata) whenever it is used, and from time to time remove all items that haven't been used for a while. Note that this causes a lot of round trips (although they could be performed asynchronously in the background).
Limited Lifetime: Sometimes it's good enough to just define that a value can safely be removed after a day or a month.
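
The garbage collection variant from the list above is a plain mark and sweep. In this sketch the store is a dict and `references` is a caller-supplied function that extracts the child keys from a value; both are assumptions for illustration, and the race conditions mentioned above are ignored.

```python
def collect_garbage(store: dict, roots, references) -> set:
    """Remove all values not reachable from the root keys; return the removed keys.

    references(value) must return an iterable of the keys that value refers to.
    """
    reachable = set()
    pending = list(roots)
    while pending:  # mark phase: follow the references from the roots
        key = pending.pop()
        if key in reachable or key not in store:
            continue
        reachable.add(key)
        pending.extend(references(store[key]))
    garbage = set(store) - reachable
    for key in garbage:  # sweep phase: drop everything we have not seen
        del store[key]
    return garbage
```

In a real system anything inserted between the mark and the sweep phase would have to be protected, for example by only sweeping values older than some grace period.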

How to create 2048bit Certificate CSRs for Dell's iDRAC6

2010-07-20T11:21:00+02:00

In case you happen to manage a recent DELL server with a dedicated iDRAC remote management card and you'd like to secure it by using your own certificate, you'll have to request a certificate based on a CSR created directly in the iDRAC web interface.

Unfortunately these CSRs have only 1024 bit keys, which get refused by some public certificate authorities like StartCom (for security reasons they require at least 2048 bits). You can't choose the bit length in the iDRAC web interface, but luckily there is another way to make it generate 2048 or 4096 bit long keys for the CSR using racadm from Dell's System Management Tools:

View the current configuration (all on 1 line):

racadm.exe -r [iDRAC IP] -u [user] -p [password] getconfig -g cfgRacSecurity

Change the key length to 2048 bits (all on 1 line):

racadm.exe -r [iDRAC IP] -u [user] -p [password] config -g cfgRacSecurity -o cfgRacSecCsrKeySize 2048

(Migrated Comments)

Dan Orum, September 7, 2010

If you are using the Express version of the iDRAC card, you can't use the racadm.exe utility with an IP address remotely. Instead, you need to run the utility on the local server without specifying the -r parameter.

Christoph Ruegg, September 11, 2010

Indeed, thanks for the hint!

Git HowTo: revert a commit already pushed to a remote repository

2010-05-05T23:22:00+02:00

So you've just pushed your local branch to a remote branch, but then realized that one of the commits should not be there, or that there was some unacceptable typo in it. No problem, you can fix it. But you should do it rather fast before anyone fetches the bad commits, or you won't be very popular with them for a while ;)

First two alternatives that will keep the history intact:

Alternative: Correct the mistake in a new commit

Simply remove or fix the bad file in a new commit and push it to the remote repository. This is the most natural way to fix an error, always safe and totally non-destructive, and how you should do it 99% of the time. The bad commit remains there and accessible, but this is usually not a big deal, unless the file contains sensitive information.

Alternative: Revert the full commit

Sometimes you may want to undo a whole commit with all changes. Instead of going through all the changes manually, you can simply tell git to revert a commit, which does not even have to be the last one. Reverting a commit means creating a new commit that undoes all changes that were made in the bad commit. Just like above, the bad commit remains there, but it no longer affects the current master and any future commits on top of it.

`$ git revert dd61ab32`

About History Rewriting

People generally avoid history rewriting, for a good reason: it will fundamentally diverge your repository from anyone who cloned or forked it. People cannot just pull your rewritten history as usual. If they have local changes, they have to do some work to get in sync again; work which requires a bit more knowledge of how Git works to do it properly.

However, sometimes you do want to rewrite the history. Be it because of leaked sensitive information, to get rid of some very large files that should not have been there in the first place, or just because you want a clean history (I certainly do).

I usually also do a lot of very heavy history rewriting when converting some repository from Subversion or Mercurial over to Git, be it to enforce internal LF line endings, fixing committer names and email addresses or to completely delete some large folders from all revisions. I recently also had to rewrite a large git repository to get rid of some corruption in an early commit that started causing more and more problems.

Yes, you should avoid rewriting history which already passed into other forks if possible, but the world does not end if you do nevertheless. For example you can still cherry-pick commits between the histories, e.g. to fetch some pull requests on top of the old history.

In opensource projects, always contact the repository maintainer first before doing any history rewriting. There are maintainers that do not allow any rewriting in general and block any non-fastforward pushes. Others prefer doing such rewritings themselves.

Case 1: Delete the last commit

Deleting the last commit is the easiest case. Let's say we have a remote mathnet with branch master that currently points to commit dd61ab32. We want to remove the top commit. Translated to git terminology, we want to force the master branch of the mathnet remote repository to the parent of dd61ab32:

`$ git push mathnet +dd61ab32^:master`

Where git interprets x^ as the parent of x and + as a forced non-fastforward push. If you have the master branch checked out locally, you can also do it in two simpler steps: First reset the branch to the parent of the current commit, then force-push it to the remote.

$ git reset HEAD^ --hard
$ git push mathnet -f

Case 2: Delete the second last commit

Let's say the bad commit dd61ab32 is not the top commit, but a slightly older one, e.g. the second last one. We want to remove it, but keep all commits that followed it. In other words, we want to rewrite the history and force the result back to mathnet/master. The easiest way to rewrite history is to do an interactive rebase down to the parent of the offending commit:

`$ git rebase -i dd61ab32^`

This will open an editor and show a list of all commits since the commit we want to get rid of:

pick dd61ab32
pick dsadhj278
...

Simply remove the line with the offending commit, likely that will be the first line (vi: delete current line = dd). Save and close the editor (vi: press :wq and return). Resolve any conflicts if there are any, and your local branch should be fixed. Force it to the remote and you're done:

`$ git push mathnet -f`

Case 3: Fix a typo in one of the commits

This works almost exactly the same way as case 2, but instead of removing the line with the bad commit, simply replace its pick with edit and save/exit. Rebase will then stop at that commit, put the changes into the index and then let you change it as you like. Commit the change and continue the rebase (git will tell you how to keep the commit message and author if you want). Then push the changes as described above. The same way you can even split commits into smaller ones, or merge commits together.

Lost in Math.NET Codenames?

2010-04-26T20:27:00+02:00

Math.NET Numerics? Iridium? dnAnalytics? Yttrium? Huh? ...sounds familiar?

It looks like some of you got lost in all the Math.NET subprojects and codenames. Math.NET evolved over time, with projects splitting into separate new projects, the introduction of codenames and new projects replacing older ones with a slightly different focus and approach. Unfortunately this led to a mess (sorry for that!), so I'm trying to shed some light on it with the following small chart, depicting the Math.NET Project history:

It all started with MathLib which was a very verbose object oriented computer algebra approach, including all kinds of numeric routines to back the symbolics, including basic linear algebra. At the same time dnAnalytics was founded independently and unrelated to Math.NET, focusing entirely on numerics and statistics, leveraging highly optimized native libraries for better performance.

Soon it became obvious that it would make sense to refactor out the numerical aspects of MathLib to a separate project and to develop it independently, so Numerics was born, as well as several other non-numeric subprojects. Numerics became Iridium, and in 2009 Iridium and dnAnalytics finally decided to join forces and work together on the new Math.NET Numerics project, replacing both Iridium and dnAnalytics and entirely unrelated to the early Numerics 0.1-0.4 back in 2004.

Mostly thanks to Marcus and Jurgen, Math.NET Numerics is very well alive and active. Check out our source code repository and forums.

Connect from Azure to an SQL Server Named Instance

2009-12-23T17:52:00+01:00

In some situations you can't or don't want to move all your data completely to the cloud. Be it to connect to your existing infrastructure, a company policy, to remain multi-tenant or simply when migrating slowly step by step. Common to these cases is often the requirement to synchronize with or connect from Azure to some local or offsite SQL Server database. For synchronization you may want to try the Microsoft Sync Framework. This post is about the other option: connecting to an external named SQL Server instance.

Connecting to Named SQL Server Instances

In addition to its own storage options like SQL Azure and Azure Table Storage, Azure also allows you to connect to external SQL Servers over TCP/IP. However, there's a pitfall right now when using named SQL Server instances:

System.Data.SqlClient.SqlException:
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections.
(provider: SQL Network Interfaces, error: 26 - Error Locating Server/Instance Specified)

Provided your connection string is correct, this is likely to be an issue with how the SQL Server client locates your named instance.

Instance Resolution using SQL Server Browser

Since SQL Server 2005, the SQL Server Browser service is responsible for enumerating the available instances on a machine, and for resolving instance names to the actual named pipe or TCP port (for SQL Server 2000 it was the SQL Server Resolution Protocol).

In order to resolve the TCP port of a named instance, the client sends a UDP datagram to port 1434, to which the SQL Server Browser replies with another datagram listing the instance endpoint to which the client then connects. Thanks to this mechanism it is no longer required to have the server listen on the standard SQL Server TCP port 1433, so it can fully support multiple (named) instances. In fact, the default for named instances is to use a dynamic random TCP port.
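
For illustration, here is roughly what that exchange looks like on the wire. This is a sketch based on my understanding of the resolution protocol (request: byte 0x04 plus the null-terminated instance name; response: byte 0x05, a 2-byte little-endian length and a semicolon-separated list of names and values), so treat the message layout as an assumption rather than a reference. The function names are made up.

```python
import socket
import struct

BROWSER_PORT = 1434  # UDP port of the SQL Server Browser service

def build_instance_request(instance_name: str) -> bytes:
    """Build the lookup request for a single named instance (assumed layout)."""
    return b"\x04" + instance_name.encode("ascii") + b"\x00"

def parse_browser_response(datagram: bytes) -> dict:
    """Turn 'Name;Value;Name;Value;...;;' into a dict (assumed layout)."""
    if len(datagram) < 3 or datagram[0] != 0x05:
        raise ValueError("not a browser response")
    (size,) = struct.unpack_from("<H", datagram, 1)
    fields = datagram[3:3 + size].decode("ascii").split(";")
    info = {}
    for name, value in zip(fields[0::2], fields[1::2]):
        if name:  # skip the empty pair produced by the trailing ';;'
            info[name] = value
    return info

def resolve_tcp_port(host: str, instance_name: str, timeout: float = 2.0) -> int:
    """Ask the browser service which TCP port the named instance listens on."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_instance_request(instance_name), (host, BROWSER_PORT))
        datagram, _ = sock.recvfrom(4096)
    return int(parse_browser_response(datagram)["tcp"])
```

If the datagrams are dropped somewhere on the way, `resolve_tcp_port` simply runs into the socket timeout and the client never learns the port.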

Azure vs. SQL Server Browser

When connecting from Azure this resolution mechanism fails, simply because the UDP datagrams never reach their target (this may change in the future). So there's no way the client can find the actual (and probably random) TCP port to connect to, and it will throw the SqlException cited above.

Solution

To work around this issue, you can configure your named instance to listen on a static TCP port instead of randomly selecting a new dynamic one on every restart (related kb). You can then specify this static port directly in the connection string in your Azure worker role:

Data Source={domain/ip},{port};Network Library=DBMSSOCN;
Initial Catalog={dbname};User ID={user};Password={pw}

Note that in this case there's no need to specify the name of the instance in the connection string. The network library parameter tells the client to use TCP/IP instead of e.g. Named Pipes.

Azure: Cloud Service Models

2009-12-15T13:26:00+01:00

Since I joined Lokad this September I finally had the chance to dive into cloud computing. We chose Windows Azure as platform for our very computation intensive business, and built a neutral opensource framework on top of it: Lokad.Cloud.

Cloud Services

Lokad.Cloud is described as a .net object-to-cloud persistence mapper, but it's actually much more. This post shall concentrate on one aspect only: Its notion of Cloud Services as horizontally scalable workers.

In essence, cloud services are managed and executed as follows:

The Lokad.Cloud management infrastructure (for now essentially a web role) allows you to upload one or more assemblies containing a set of cloud services and optionally some configuration file.
Every Azure worker role instance loads all these services in an isolated AppDomain.
Each Azure worker then executes these services one at a time according to some scheduling algorithm and execution policy.

We provide specialized base classes to simplify implementing services processing items from a shared queue or for services which are to be called in regular intervals.

We treat all azure workers as equal and therefore execute every cloud service on each Azure worker from time to time. In other words, we map all cloud services to all Azure workers, forming a complete bipartite graph between cloud services and Azure workers as shown in the following figure.

This is a fundamental concept that yields a very simple design with a potential for ideal horizontal scaling, and is even resilient to failing azure workers as long as at least one worker remains intact.
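
A toy simulation of this mapping, just to show why a single surviving worker is enough. This is not how the Lokad.Cloud scheduler is implemented; the services are plain functions that process one item of a shared queue per call, and all names are made up.

```python
from collections import deque

def make_queue_service(name, queue, processed):
    """A service that handles one item from a shared queue per invocation."""
    def run_once(worker_id):
        if not queue:
            return False  # nothing to do right now
        processed.append((name, queue.popleft(), worker_id))
        return True
    return run_once

def run_workers(worker_ids, services):
    """Every worker executes every service in turn until no service has work left."""
    busy = True
    while busy:
        busy = False
        for worker_id in worker_ids:
            for service in services:
                busy = service(worker_id) or busy

processed = []
services = [
    make_queue_service("resize", deque([1, 2, 3]), processed),
    make_queue_service("notify", deque(["a", "b"]), processed),
]
run_workers(["worker-1"], services)  # only one worker left alive
print(len(processed))
```

Adding more worker ids only changes who processes which item, not whether all items get processed.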

Cloud Service Models and Deployments

The only object that is aware of this mapping is the service scheduler. Yet, from the management and diagnostics perspective it would be interesting to represent the cloud services as first class objects. I'm therefore introducing the notion of Cloud Service Models for Lokad.Cloud (not part of the current release, open whether it ever will be).

In Azure, web and worker roles are explicitly defined and configured in two xml files. Since the latest update of the Azure tools for Microsoft VisualStudio, they are referred to as Azure Service Model. Using the Azure management website one can upload an assembly plus the two xml files to create a unique Azure deployment. A deployment can be stopped or running, either in production or in staging mode.

The same concepts can also be applied to Cloud Services, on a slightly higher level of abstraction and orthogonal to the Azure terms.

A Cloud Service Model is a unique entity, associated with a set of assemblies, the cloud services defined in them and their configuration (if applicable). Using the Lokad.Cloud management tools an administrator can upload such a model and create a unique Cloud Service Deployment. A deployment can be stopped or running, and of course be removed when no longer needed. A failing or malfunctioning deployment can be diagnosed and dealt with directly in the management UI.

Note that the currently implemented option to upload a zip file containing assemblies and optional configuration is already very close to such a model, but is missing identity and other metadata.

In each Azure worker, our scheduler will load the current service model, load the services and schedule them accordingly. From time to time the scheduler will check whether the deployed service model has changed, and update if necessary.

Technically this design would also allow running multiple different deployments in parallel, e.g. by breaking the complete bipartite graph between Cloud Services and Azure workers into a non-complete bipartite one where Azure workers are assigned to a single Cloud Service Deployment:

Or by sharing the Azure workers between Cloud Service Deployments in one way or another (e.g. in parallel, or round robin):

Remember however that some of these scenarios violate the fundamental concept mentioned above. Hence, as usual, there's a tradeoff between flexibility and robustness.

Update

It seems there's a better way to differentiate between cloud service models and deployments:

Model: An identity, a set of (named) cloud services, their assemblies and optionally some configuration.
Deployment: An identity, a set of models and their mapping to (Azure) worker nodes.

I.e. only one deployment can run at a time, but there's an option to support configuring multiple models in a deployment. Also, there's a trivial empty deployment where no models are loaded at all.

Hence, the labels in the figures above should read "Cloud Service Model A" instead of "Cloud Service Deployment A", etc.
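
Expressed as data structures, the revised distinction could look like this. It is only a sketch of the idea, not part of Lokad.Cloud, and all type and member names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Tuple

@dataclass(frozen=True)
class CloudServiceModel:
    """An identity, a set of named services, their assemblies and optional configuration."""
    identity: str
    services: Tuple[str, ...]
    assemblies: Tuple[str, ...]
    configuration: Optional[str] = None

@dataclass
class CloudServiceDeployment:
    """An identity, a set of models and their mapping to worker nodes."""
    identity: str
    models: Dict[str, CloudServiceModel] = field(default_factory=dict)
    workers_by_model: Dict[str, FrozenSet[str]] = field(default_factory=dict)

    def assign(self, model: CloudServiceModel, workers) -> None:
        self.models[model.identity] = model
        self.workers_by_model[model.identity] = frozenset(workers)

    def models_for(self, worker: str) -> List[CloudServiceModel]:
        """All models a given worker is supposed to load."""
        return [self.models[model_id]
                for model_id, workers in self.workers_by_model.items()
                if worker in workers]

# The trivial empty deployment: no models are loaded at all.
EMPTY_DEPLOYMENT = CloudServiceDeployment("empty")
```

Mapping every model to every worker gives the complete bipartite graph again, while disjoint worker sets give the partitioned variant.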

dnAnalytics + Iridium = Math.NET Numerics

2009-08-03T10:44:00+02:00

You may have wondered why the Math.NET Iridium development stopped abruptly almost two months ago. Luckily this is not entirely true: in the last few weeks the .Net numerics library has progressed well, but at a different place:

Math.NET Iridium is being merged with dnAnalytics, resulting in a new project named Math.NET Numerics

What does that mean for existing Math.NET Iridium users?

Higher development momentum and larger user community (as a direct result of merging two projects).
Better algorithm and code quality by picking the best of each project and simply by having new highly skilled developers on board.
New opensource license model: MIT/X11. This is a very open license similar to the so-called New BSD License. This model is much less restrictive than the previous LGPL and is (to my knowledge) source-compatible with a wide range of licenses including all GPL-based licenses and the Microsoft opensource licenses, too.
Some API changes. This is unavoidable since we try to integrate the best of both dnAnalytics and Iridium. At the same time this is a good chance to throw out some old designs that have shown to be improvable and replace them with better approaches. However, we try hard to keep migration as smooth as possible.
In addition to the completely self-contained managed implementation, we'll profit from the dnAnalytics experience with parallelized and native optimizations (MKL, ACML, CUDA etc.) and will therefore provide optional wrappers around native libraries which provide significantly better performance when working with large data sets.
Again thanks to the dnAnalytics experience, you can expect better F# support, even though the library is still written in C#.
Although Iridium did support sparse linear algebra for a very short time, we had to remove it due to several issues. You can expect Math.NET Numerics to finally support sparse linear algebra in a clean way.

You'll find the new Math.NET Numerics discussion board and tracker at CodePlex and the current sources at Github (subversion mirror at google). The full portal website and wikis etc. will be available in a few weeks. Feel free to post your ideas, feedback or even fork the repository at github to contribute code to the project (note that we will completely reorganize the project structure until mid August).

We'll let you know here and on Twitter as soon as we reach a first milestone and have an api preview ready.

(Migrated Comments)

Joannes Vermorel, August 3, 2009

Congratulations! Sparse linear algebra is really a nice move (I am sorry I had not been able to push it forward at the time).

Alexey Zakharov, October 23, 2009

Good news! C# really needs such library in stable version.

Online API Reference

2009-04-17T18:08:00+02:00

We now finally provide an online api reference in an rdoc-like style, generated by docu (actually by my github fork of it). Note that docu is new and still under heavy development, so the quality is likely to improve over the next months (e.g. right now the class summaries are missing).

http://api.mathdotnet.com/

It is simple, but (other than the older NDoc & Sandcastle generated sites) loads very fast.

Iridium Statistics Accumulator: Better numerical stability

2009-01-07T21:12:00+01:00

The algorithm by which the Mean, Variance and Sigma are incrementally computed in the statistics accumulator (MathNet.Numerics.Statistics.Accumulator) was improved last week in Iridium revision 503 to provide better numerical stability when dealing with samples with a very large mean but only a small variance.

For example, the variance of normally distributed samples with a mean of 10^9 but a variance of only 1 can now be accurately estimated. The previous implementation was very unstable in that case.

The new algorithm continues to support removing samples from the accumulator (and updates the estimates accordingly).
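
To illustrate the idea, here is a sketch of one well-known stable scheme, Welford's online update, extended with the inverse step for removing a sample. I'm not claiming this is exactly what revision 503 does; the class below is a hypothetical stand-in, not the Iridium API.

```python
import math

class Accumulator:
    """Incremental mean and sample variance (Welford's update) with sample removal."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    def remove(self, x: float) -> None:
        """Undo a previous add(x); it is not checked that x was actually added."""
        if self.count == 0:
            raise ValueError("accumulator is empty")
        if self.count == 1:
            self.count, self.mean, self._m2 = 0, 0.0, 0.0
            return
        old_mean = self.mean
        self.count -= 1
        self.mean = old_mean - (x - old_mean) / self.count
        self._m2 -= (x - self.mean) * (x - old_mean)

    @property
    def variance(self) -> float:
        return self._m2 / (self.count - 1) if self.count > 1 else math.nan

    @property
    def sigma(self) -> float:
        return math.sqrt(self.variance)
```

The update only ever works with deviations from the running mean, so a huge mean does not swamp a small variance the way a naive "sum of squares minus square of sum" formula would.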