National Physical Laboratory

Automatic Differentiation

Figure 1: Finding estimates of parameters from data involves minimising a goodness-of-fit measure.

A wide variety of mathematical and scientific problems involve solving systems of nonlinear equations, or finding the maximum or minimum of a nonlinear function.

In data fitting, for example, the estimates of model parameters are found by minimising a goodness-of-fit measure (Figure 1). Finding the solution to these problems usually depends on being able to linearise the functions involved, and to achieve this we have to calculate their derivatives.

Theory

If f(x) is a function of a variable x, its derivative f′ is also a function of x and represents the slope of f at x. For a function of a number of variables f(x₁, ..., xₙ), the partial derivative of f with respect to xⱼ represents the slope of f in the direction of the jth axis. Using basic calculus rules, e.g.

(f + g)′(x) = f′(x) + g′(x),

the derivatives of complicated functions can be expressed in terms of the derivatives of their simpler, component functions.

However, even for only moderately complicated functions, the hand calculation of derivatives can lead to pages of tedious algebra with an increasing likelihood of a mistake entering into the computation. If the function evaluation is encoded in a software component, it is natural to ask if it is possible for the software to compute the derivatives automatically.

Until recently, the standard method of evaluating derivatives numerically was to use finite differences. The idea is simple: approximate the derivative by evaluating f at two nearby points. Typical of finite-difference formulae are the forward-difference formula

f′(x) ≈ (f(x + h) - f(x))/h

and the central-difference formula

f′(x) ≈ (f(x + h) - f(x - h))/(2h).

Here h is a "step" selected in an appropriate way. These formulae provide only approximate estimates. If h is chosen too small, the subtraction of nearly equal function values leads to cancellation errors in floating-point arithmetic. If h is too large, the truncation error of the approximation will be large.
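As a concrete illustration (the test function and step sizes are our own choices, not from the original tutorial), the following Python sketch applies both formulae to sin(x) at x = 1, where the exact derivative is cos(1). As h decreases, the error first falls and then rises again as cancellation takes over:

    import math

    def forward_diff(f, x, h):
        # Forward-difference formula: f'(x) ≈ (f(x + h) - f(x)) / h
        return (f(x + h) - f(x)) / h

    def central_diff(f, x, h):
        # Central-difference formula: f'(x) ≈ (f(x + h) - f(x - h)) / (2h)
        return (f(x + h) - f(x - h)) / (2 * h)

    # Differentiate sin at x = 1; the exact derivative is cos(1).
    x, exact = 1.0, math.cos(1.0)
    for h in (1e-1, 1e-5, 1e-9, 1e-13):
        fd = abs(forward_diff(math.sin, x, h) - exact)
        cd = abs(central_diff(math.sin, x, h) - exact)
        print(f"h = {h:.0e}  forward error = {fd:.1e}  central error = {cd:.1e}")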

The Complex Step Method

Because of the accuracy difficulties associated with finite differences, attention is now being paid to the complex step method, which is similar to finite differences but uses complex arithmetic. Recall that a complex number z is of the form z = x + iy, where x and y are real and i = √(-1). All the arithmetic operations for real numbers also apply to complex numbers. Most real-valued functions f(x) occurring in science also have complex-valued counterparts, and can be written in the form f(z) = Re f(z) + i Im f(z), where the real-valued functions Re f and Im f are known as the real and imaginary parts. Taking z = x + ih, where x is real and h is real and small, the derivative is approximated by

f′(x) ≈ Im f(x + ih)/h.

A Taylor expansion of f(x + ih) shows the error in this approximation to be of order h². Moreover, unlike a finite-difference formula, the approximation involves no subtraction of nearly equal function values, so h can be chosen to be very small without the risk of cancellation errors. The complex step method is very straightforward to implement in languages that support complex arithmetic.
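As an illustration (the test function, evaluation point and step size are our own choices), a minimal Python sketch of the method is:

    import cmath, math

    def complex_step(f, x, h=1e-200):
        # f'(x) ≈ Im f(x + ih) / h; no subtraction of nearly equal
        # quantities occurs, so h can be made extremely small.
        return f(complex(x, h)).imag / h

    # Try f(x) = sin(1/x), whose derivative is -(1/x^2)cos(1/x).
    x = 0.1
    exact = -(1.0 / x**2) * math.cos(1.0 / x)
    approx = complex_step(lambda z: cmath.sin(1.0 / z), x)
    print(abs(approx - exact))   # agrees to machine precision

Even with h = 10⁻²⁰⁰ the result agrees with the analytical derivative to machine precision, because the derivative is read off from the imaginary part rather than from a difference of function values.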

Automatic Differentiation

The term automatic differentiation (AD) generally applies to techniques that produce, from a function evaluation software component, a computational scheme, also implemented in software, that calculates the derivatives. The execution of any program, no matter how complex, is built up from a sequence of elementary arithmetic operations (e.g. add, multiply) and elementary function evaluations (e.g. sin, exp). This implies that any function evaluated by a program can be split into atomic operations involving no more than two variables at a time. In automatic differentiation, the rules of calculus are applied to these atomic operations and the results are combined according to the algorithmic specification of the function.
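As a worked illustration (the example function is our own choice, not from the original tutorial), the following Python fragment splits f(x1, x2) = x1·x2 + sin(x1) into atomic operations and applies one calculus rule per operation, carrying the derivative with respect to x1 alongside each intermediate value:

    import math

    # f(x1, x2) = x1*x2 + sin(x1), split into atomic operations and
    # differentiated with respect to x1 (seed: dx1 = 1, dx2 = 0).
    x1, x2 = 2.0, 3.0
    dx1, dx2 = 1.0, 0.0

    v1 = x1 * x2                  # product rule:
    dv1 = dx1 * x2 + x1 * dx2     #   (x1*x2)' = x1'*x2 + x1*x2'

    v2 = math.sin(x1)             # chain rule:
    dv2 = math.cos(x1) * dx1      #   sin(x1)' = cos(x1)*x1'

    f = v1 + v2                   # sum rule:
    df = dv1 + dv2                #   (v1 + v2)' = v1' + v2'

    print(f, df)                  # df = x2 + cos(x1) = 3 + cos(2)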

Implementing Automatic Differentiation In Software

The AD approach can be implemented in two ways. In forward automatic differentiation (FAD), the derivative information is accumulated in the same order as the function evaluation. In reverse automatic differentiation (RAD), the algorithm first performs a forward sweep for the function evaluation, storing, at the same time, the information required for the derivative computations. Then, in a second sweep, the algorithm steps in the reverse direction to that of the function evaluation and uses the information saved from the forward sweep to calculate the gradient. For problems with many input variables and a single output, such as the gradient calculations arising in optimisation, RAD can be considerably more efficient than FAD, since one reverse sweep delivers the partial derivatives with respect to all the inputs at once.
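The reverse idea can be sketched in a few lines of Python (the tape representation and the names here are our own illustration, not a standard API). The forward sweep records, for each atomic operation, its inputs and the local partial derivatives; the reverse sweep then propagates adjoints backwards along the tape:

    import math

    tape = []   # one entry per value: list of (parent_index, local_partial)

    def var(value):
        tape.append([])                    # inputs have no parents
        return len(tape) - 1, value

    def mul(a, b):
        (ia, va), (ib, vb) = a, b
        tape.append([(ia, vb), (ib, va)])  # d(va*vb)/dva = vb, d/dvb = va
        return len(tape) - 1, va * vb

    def sin(a):
        ia, va = a
        tape.append([(ia, math.cos(va))])  # d sin(va)/dva = cos(va)
        return len(tape) - 1, math.sin(va)

    def add(a, b):
        (ia, va), (ib, vb) = a, b
        tape.append([(ia, 1.0), (ib, 1.0)])
        return len(tape) - 1, va + vb

    def gradient(output_index):
        # Reverse sweep: propagate adjoints from the output back to the inputs.
        adj = [0.0] * len(tape)
        adj[output_index] = 1.0
        for i in reversed(range(len(tape))):
            for parent, partial in tape[i]:
                adj[parent] += adj[i] * partial
        return adj

    # f(x1, x2) = x1*x2 + sin(x1): one reverse sweep gives both partials.
    x1, x2 = var(2.0), var(3.0)
    y = add(mul(x1, x2), sin(x1))
    adj = gradient(y[0])
    print(adj[x1[0]], adj[x2[0]])   # x2 + cos(x1), and x1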

In terms of software engineering, AD can be implemented using operator overloading or source-to-source transformation. In the operator-overloading approach (for languages, such as Fortran 90, that support it), the basic arithmetic operations and intrinsic functions are redefined to calculate the corresponding derivatives in addition to the function values so that, at the end of the computation, the derivative information is also available. In source-to-source transformation, the source code for the function evaluation is analysed and extended to produce source code that evaluates the derivatives. Generally, implementing source-to-source transformation requires much more effort than the operator-overloading approach.
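The tutorial mentions Fortran 90; the same idea can be sketched in Python, where the arithmetic operators of a small "dual number" class are overloaded so that each value carries its derivative with it (an illustrative sketch under our own naming, not a production implementation):

    import math

    class Dual:
        # A value together with its derivative with respect to the chosen input.
        def __init__(self, val, der=0.0):
            self.val, self.der = val, der

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val + other.val, self.der + other.der)

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val * other.val,
                        self.der * other.val + self.val * other.der)  # product rule

    def dsin(x):
        return Dual(math.sin(x.val), math.cos(x.val) * x.der)  # chain rule

    # f(x1, x2) = x1*x2 + sin(x1), differentiated with respect to x1.
    x1 = Dual(2.0, 1.0)   # seed derivative 1: the differentiation variable
    x2 = Dual(3.0)        # seed derivative 0
    y = x1 * x2 + dsin(x1)
    print(y.val, y.der)   # y.der = x2 + cos(x1)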

Comparing Approaches

Figure 2 illustrates the error in computing the derivative f′(x) = -(1/x²)cos(1/x) of the function f(x) = sin(1/x) using the complex step (CS), FAD and RAD methods.

Figure 2: Error in computing the derivative of f(x) = sin(1/x) for CS, FAD and RAD. The band at zero indicates that the computed derivatives agree with the evaluated analytical derivative to machine precision.

Further Reading

This tutorial first appeared in Counting on IT Issue 14.
