Statistical Analysis with Python

This page provides a simple demonstration of fitting an output variable to multiple input variables using the linear_least_squares function in the Python LinearAlgebra module.

This page is obsolete.

The latest page is at itmetr.net.

last modified: 05:04 PM MDT, Sun 15 Jun 2008

You will not need to know much at all about the theory of statistics to complete this tutorial. If you desire, you could peruse:

In my opinion, 90% of the labor in a statistical study goes into obtaining the data (e.g., building and maintaining a mesonet), 9% into archiving and formatting the data (e.g., the reanalysis data), and 1% into passing the data through statistical analysis software and interpreting the output. Here we will do part of the 9%, formatting the data using Python, and the 1%: using a Python function to find regression coefficients from one data set, making predictions with those coefficients, and comparing the predictions with observations.

Here is the online documentation for the "workhorse" of the statistical analysis, available from a Python interpreter:

>>> import LinearAlgebra
>>> help(LinearAlgebra.linear_least_squares)
linear_least_squares(a, b, rcond=1e-10)
    solveLinearLeastSquares(a,b) returns x,resids,rank,s
    where x minimizes 2-norm(|b - Ax|)
          resids is the sum square residuals
          rank is the rank of A
          s is an rank of the singual values of A in desending order

    If b is a matrix then x is also a matrix with corresponding columns.
    If the rank of A is less than the number of columns of A or greater than
    the numer of rows, then residuals will be returned as an empty array
    otherwise resids = sum((b-dot(A,x)**2).
    Singular values less than s[0]*rcond are treated as zero.

(Note the spelling errors and notational inconsistencies. Hey, but how much did you pay for it?)

We will adopt a different notation from the A (a) and b of the help text. Instead of A, we use X for the matrix whose m rows are records of simultaneous observations of n input variables. The column vector of m observations of an output variable, which we hope depends significantly on the input variables, is called y. The n coefficients we seek form the vector c (a 1-d array in Python).

Having found the coefficients from an analysis of "training data", we hope either to make predictions or otherwise to test the skill of the model on an entirely different set of observations, the so-called "verification data". The vector of predictions is denoted yp, which we want to be close to y in the verification data. The function linear_least_squares chooses the c that minimizes the sum of squared differences between y and yp in the training data.
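
As a rough illustration of the notation above, here is a minimal sketch of a fit. It is not the original script: it assumes NumPy is installed and swaps in numpy.linalg.lstsq, which returns the same four values (x, resids, rank, s) as linear_least_squares from the older LinearAlgebra module. The training data and the coefficients c_true are made up so that the result can be checked.

```python
import numpy as np

# Hypothetical training data: m = 5 records of n = 3 input variables.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [3.0, 2.0, 2.0]])

# Synthetic output built from known (made-up) coefficients,
# so we can confirm that the fit recovers them.
c_true = np.array([2.0, -1.0, 0.5])
y = np.dot(X, c_true)

# Same return values as described in the help text: x, resids, rank, s.
c, resids, rank, s = np.linalg.lstsq(X, y, rcond=None)

# Predictions for the training data itself.
yp = np.dot(X, c)

print("coefficients c:", c)
print("sum of squared residuals:", resids)
print("rank of X:", rank)
print("singular values of X:", s)
```

Because y was constructed exactly from c_true, the fitted c should agree with c_true to within rounding, and the residual sum should be essentially zero. With real observations neither will be the case.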

Here is our forecast model, with notation appropriate for a single record of observations, using n=3 as an example:

yp_i = c_1 X_{i1} + c_2 X_{i2} + c_3 X_{i3} + c_4

Actually, we seek n+1 coefficients, because the last term in the model above, c_4, is a constant that multiplies no input variable.
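
The forecast model and the comparison with observations can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (add_constant_column, predict, rms_error) are hypothetical, the coefficients are invented rather than fitted, and the verification records are made up. One common way to handle the constant term, used here, is to append a 1 to every record so that the last coefficient multiplies it.

```python
import math

def add_constant_column(X):
    """Append 1.0 to every record so the last coefficient acts as the constant term."""
    return [list(row) + [1.0] for row in X]

def predict(X, c):
    """Return one prediction per record: yp_i = c_1*X_i1 + ... + c_n*X_in + c_(n+1)."""
    return [sum(cj * xij for cj, xij in zip(c, row))
            for row in add_constant_column(X)]

def rms_error(yp, y):
    """Root-mean-square difference between predictions yp and observations y."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(yp, y)) / len(y))

# Hypothetical coefficients for n = 3 inputs (n + 1 = 4 values); not fitted from real data.
c = [2.0, -1.0, 0.5, 10.0]

# Hypothetical verification data: two records of three input variables,
# with the corresponding observed outputs.
X_verif = [[1.0, 2.0, 4.0],
           [3.0, 0.0, 2.0]]
y_verif = [12.5, 16.5]

yp = predict(X_verif, c)
print("predictions:", yp)
print("RMS error:", rms_error(yp, y_verif))
```

The same augmented matrix, with its column of ones, is what would be passed to the least-squares routine when fitting, so that the n+1 coefficients come out of a single call.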

The task for Fall 2006 is Statistical Forecast of Norman Temperature. It is easily completed by modifying the scripts for the example nowcast project.
