Statistical Analysis with Python
You will not need to know much at all about the theory of statistics to complete this tutorial. If you desire, you could peruse:
The function linear_least_squares performs what is sometimes called "multiple linear regression", which is presented in this large-print, 15 page PDF document: Introduction to Applied Statistics: Multiple Linear Regression.
But you may want to skip both of the above, and just read the first couple of sections in the Linear Regression entry in the Wikipedia.
IMO, in statistical studies 90% of the labor is in obtaining the data (e.g., building and maintaining a mesonet), 9% in archiving and formatting the data (e.g., the reanalysis data) and 1% in passing the data through statistical analysis software and interpreting the output. Here we will do part of the 9%: formatting the data using Python. And the 1%: using a Python function to find regression coefficients from one data set, making a prediction with those coefficients, and comparing the predictions with observations.
Both numpy and linearAlgebra (which is part of Numeric) have a linear regression function:
In [1]: import numpy
In [2]: help(numpy.linalg.lstsq)
lstsq(a, b, rcond=-1)
returns x,resids,rank,s
where x minimizes 2-norm(|b - Ax|)
resids is the sum square residuals
rank is the rank of A
s is the rank of the singular values of A in descending order
If b is a matrix then x is also a matrix with corresponding columns.
If the rank of A is less than the number of columns of A or greater than
the number of rows, then residuals will be returned as an empty array
otherwise resids = sum((b-dot(A,x)**2).
Singular values less than s[0]*rcond are treated as zero.(Note notational inconsistency: is a the same as A? Hey, but how much did you pay for it?) We will adopt another notational convention rather than A (a) and b. Instead of A, we will call the matrix with rows consisting of m records of simultaneous observations of n input variables to be X. The column vector of m observations of an output variable, which are hoped to be significantly dependent on the input variables, will be called y. The n coefficients we seek will be the vector (a 1-d array in Python) c. Having found the coefficients, from an analysis of "training data", we hope to either make predictions or otherwise test the skill of the model with an entirely different set of observations, with so-called "verification data". The vector of predictions is denoted yp, which is desired to be close to y in the verification data. The function linear_least_squares chooses c that minimizes the difference between y and yp in the training data.
Here is our forecast model, with notation appropriate for a single record of observations, using n=3 as an example:
ypi= c1 Xi1 + c2 Xi2 + c3 Xi3 + c4
Actually, we seek n+1 coefficients, because of the use of the last term in the above.
The task is Statistical Forecast of Norman Temperature. The task is easily completed by modifying the scripts for the example nowcast project.