Introduction to MATLAB for Functional Data Analysis

Author: Giles Hooker

Last Modified: 24/01/07

This file represents Lecture 2 in the BTRY 694: Functional Data Analysis. It is intended to provide an introduction to Matlab at the same time as illustrating some data-exploration techniques for Functional Data Analysis.

To begin with, note that Matlab has a text editing window (this one) in which you can write commands and programs. As will all programming languages, programs can be commented -- Matlab will ignore anything after a '%' on the line. You should always comment liberally; for your own sake as well as those who read your code.

Matlab will also publish commented files to HTML; use '%%' at a line to create a heading. Any comments following a heading will also make use of LaTeX comments, so that I can write

y = f(x) + \epsilon

and have it come out nicely in HTML! See Help (press F1) on Cell Mode and Publishing.

Two useful keys to know are F9, which will enter highlighted commands into the MATLAB window, and F5 which will run all commands in the file.

Contents

Loading, Generating and Saving Data

% For help on any function, you can look in the official help (press F1) or
% you can type 'help <function name>' at the prompt. For example

help load

% First Load a data set.

load 'handwriting.mat'

% MATLAB stores data in a binary .mat format. You can force it to load a
% tab delimited file by the option

load -ASCII 'handwritingx.dat'

% Alternatively, you can use a user interface to make sure you are getting
% the file correct:

% uiload

% Load enters variables into MATLAB memory, to view your workspace type

who

% or for more details

whos

% If you have generated data and want to save it

save 'handwriting.mat'

% If you want to only save specific variables

save 'handwritingx.mat' fdarrayx

% To save as a text file

save -ASCII 'fdarrayx' handwritingx

% Some tips for generating data:

% To create a sequence

x = 1:10  % Integers between 0 and 10

% Note that most commands result in MATLAB outputting the result of that
% command. This can be supressed by:

x = 1:10;

% A colon on the end of the line never hurts, we can always look at a
% variable by typing

x

% Sequences can be incremented in non-unit intervals

x = 0:2.5:10
x = linspace(1,10,4)

% Matlab treats variables as arrays and indexes these as follows

x(1)

x(3:4)

% I can also build data myself

x = [1 2; 3 4]

% Note here that the ';' indicates a new row in the array. To index
% multidimensional arrays, I can take

x(1,2)

% I can also take individual rows or columns.

x(:,1)
x(1,:)

% This notation can be extended to multi-dimensional arrays.
% I can also look at the size of x

size(x)
numel(x)

% And if I want to get rid of some entries

x(:,1) = [];

% I can generate random data

x = rand(10,2)    % uniform
x = randn(10,3)   % normal

% The function 'random' will do many more.

help random

% A couple of other useful functions

A = ones(3,3)

B = zeros(3,3)

C = eye(3,3)

% Matlab also allows multi-dimensional arrays

D = zeros(2,2,2)

% A standard convention:

E = ones([2 2 2])

% When the final index in an array is a singleton, Matlab will reduce the
% dimension:

D(:,:,1)

% However, this does not work for other dimensions

D(:,2,:)

% To reduce this to a two-dimensional array use

squeeze(D(:,2,:))

% See also permute as a generalization of traspose.

% Matlab allows vectors of integers as indices:

f = 1:10;
f(1:2:9)
 LOAD Load workspace variables from disk.
    LOAD FILENAME retrieves all variables from a file given a full pathname
    or a MATLABPATH relative partial pathname (see PARTIALPATH).  If        
    FILENAME has no extension LOAD looks for FILENAME.mat and, if found,
    LOAD treats the file as a binary "MAT-file".  If FILENAME.mat is not
    found, or if FILENAME has an extension other than .mat it is treated as
    an ASCII file.
 
    LOAD, by itself, uses the binary "MAT-file" named 'matlab.mat'.  It is
    an error if 'matlab.mat' is not found.
 
    LOAD FILENAME X loads only X.
    LOAD FILENAME X Y Z ... loads just the specified variables.  The
    wildcard '*' loads variables that match a pattern (MAT-file only).
 
    LOAD FILENAME -REGEXP PAT1 PAT2 can be used to load all variables
    matching the specified patterns using regular expressions. For more
    information on using regular expressions, type "doc regexp" at the
    command prompt.
 
    LOAD -ASCII FILENAME or LOAD -MAT FILENAME forces LOAD to treat the
    file as either an ASCII file or a MAT-file regardless of file
    extension.  With -ASCII, LOAD will error if the file is not numeric
    text.  With -MAT, LOAD will error if the file is not a MAT-file
    generated by SAVE -MAT.
 
    If FILENAME is a MAT-file, requested variables from FILENAME are
    created in the workspace. If FILENAME is not a MAT-file, a double
    precision array is created with name based on FILENAME.  Leading
    underscores or digits in FILENAME are replaced with X.  Other non-alpha
    chars in FILENAME are replaced with underscores.
 
    S = LOAD(...) returns the contents of FILENAME in variable S.  If
    FILENAME is a MAT-file, S is a struct containing fields matching the
    variables retrieved.  If FILENAME is an ASCII file, S is a double
    precision array.
 
    Use the functional form of LOAD, such as LOAD('filename'), when the
    filename is stored in a string, when an output argument is requested,
    or if FILENAME contains spaces.
 
    Examples for pattern matching:
        load fname a*                % Load variables starting with "a"
        load fname -regexp ^b\d{3}$  % Load variables starting with "b" and
                                     %    followed by 3 digits
        load fname -regexp \d        % Load variables containing any digits
 
    See also SAVE, WHOS, UILOAD, SPCONVERT, PARTIALPATH, IOFUN, FILEFORMATS.

    Overloaded functions or methods (ones with the same name in other directories)
       help COM/load.m
       help ccsdebug/load.m

    Reference page in Help browser
       doc load


Your variables are:

A             d2xmean       fda_x_cor     fid           x             ydata         
B             d2ydata       fda_xy_cor    handwritingx  xbundle       ydata_s       
C             d2ymean       fda_y_cor     i             xcov          ymean         
D             dx            fdarray       j             xdata         ymean_s       
E             dxdata        fdarrayx      n             xdata_s       yvar          
G             dxmean        fdarrayy      prod          xmean         
H             dy            fdatime       prodcell      xmean2        
N             dydata        fdatime_d     sub_samp      xmean_s       
ans           dymean        fdatime_d2    vx            xvar          
d2xdata       f             fdatime_s     vy            ycov          

  Name               Size                           Bytes  Class

  A                500x500                        2000000  double array
  B                500x500                        2000000  double array
  C                500x500                        2000000  double array
  D                500x500                        2000000  double array
  E                  2x2x2                             64  double array
  G                  3x3                               72  double array
  H                  5x2                               80  double array
  N                  1x1                                8  double array
  ans                1x3                               24  double array
  d2xdata         1399x20                          223840  double array
  d2xmean         1399x1                            11192  double array
  d2ydata         1399x20                          223840  double array
  d2ymean         1399x1                            11192  double array
  dx               100x1                              800  double array
  dxdata          1400x20                          224000  double array
  dxmean          1400x1                            11200  double array
  dy               100x1                              800  double array
  dydata          1400x20                          224000  double array
  dymean          1400x1                            11200  double array
  f                  1x10                              80  double array
  fda_x_cor        100x100                          80000  double array
  fda_xy_cor       100x100                          80000  double array
  fda_y_cor        100x100                          80000  double array
  fdarray         1401x20x2                        448320  double array
  fdarrayx        1401x20                          224160  double array
  fdarrayy        1401x20                          224160  double array
  fdatime         1401x1                            11208  double array
  fdatime_d       1400x1                            11200  double array
  fdatime_d2      1399x1                            11192  double array
  fdatime_s        100x1                              800  double array
  fid                1x1                                8  double array
  handwritingx    1401x20                          224160  double array
  i                  1x1                                8  double array
  j                  1x1                                8  double array
  n                  1x1                                8  double array
  prod               1x1                              640  struct array
  prodcell           2x2                          4000128  cell array
  sub_samp           1x100                            800  double array
  vx               100x100                          80000  double array
  vy               100x100                          80000  double array
  x                 10x3                              240  double array
  xbundle            1x1                           482396  struct array
  xcov             100x100                          80000  double array
  xdata           1401x20                          224160  double array
  xdata_s          100x20                           16000  double array
  xmean           1401x1                            11208  double array
  xmean2          1401x1                            11208  double array
  xmean_s          100x1                              800  double array
  xvar            1401x1                            11208  double array
  ycov             100x100                          80000  double array
  ydata           1401x20                          224160  double array
  ydata_s          100x20                           16000  double array
  ymean           1401x1                            11208  double array
  ymean_s          100x1                              800  double array
  yvar            1401x1                            11208  double array

Grand total is 1959830 elements using 15679788 bytes


x =

     1     2     3     4     5     6     7     8     9    10


x =

     1     2     3     4     5     6     7     8     9    10


x =

         0    2.5000    5.0000    7.5000   10.0000


x =

     1     4     7    10


ans =

     1


ans =

     7    10


x =

     1     2
     3     4


ans =

     2


ans =

     1
     3


ans =

     1     2


ans =

     2     2


ans =

     4


x =

    0.2088    0.1735
    0.9047    0.8445
    0.4223    0.5731
    0.0281    0.1457
    0.4724    0.0029
    0.7427    0.5766
    0.8527    0.9781
    0.4275    0.7549
    0.6733    0.6361
    0.4305    0.6449


x =

    0.4694    1.0184   -0.4650
   -0.9036   -1.5804    0.3710
    0.0359   -0.0787    0.7283
   -0.6275   -0.6817    2.1122
    0.5354   -1.0246   -1.3573
    0.5529   -1.2344   -1.0226
   -0.2037    0.2888    1.0378
   -2.0543   -0.4293   -0.3898
    0.1326    0.0558   -1.3813
    1.5929   -0.3679    0.3155

 RANDOM Generate random arrays from a specified distribution.
    The appropriate syntax depends on the number of parameters in the
    distribution you are using:
 
    R = RANDOM(NAME,A) returns an array of random numbers from the named
    distribution that requires a single parameter array A.
    R = RANDOM(NAME,A,B) returns an array of random numbers from the named
    distribution that requires two parameter arrays A and B.
    R = RANDOM(NAME,A,B,C) returns an array of random numbers from the named
    distribution that requires three parameter arrays A, B, and C.
 
    The size of R is the common size of A, B, and C if all are arrays.  If
    any parameters are scalars, the size of R is the size of the other
    parameter(s).
 
    R = RANDOM(NAME,A,M,N,...) or R = RANDOM(NAME,A,[M,N,...]) returns an
    M-by-N-by-... array of random numbers.
    R = RANDOM(NAME,A,B,M,N,...) or R = RANDOM(NAME,A,B,[M,N,...]) returns
    an M-by-N-by-... array of random numbers.
    R = RANDOM(NAME,A,B,C,M,N,...) or R = RANDOM(NAME,A,B,C,[M,N,...])
    returns an M-by-N-by-... array of random numbers.
 
    NAME can be one of:
 
       'beta'  or 'Beta',
       'bino'  or 'Binomial',
       'chi2'  or 'Chisquare',
       'exp'   or 'Exponential',
       'ev'    or 'Extreme Value',
       'f'     or 'F',
       'gam'   or 'Gamma',
       'gev'   or 'Generalized Extreme Value',
       'gp'    or 'Generalized Pareto',
       'geo'   or 'Geometric',
       'hyge'  or 'Hypergeometric',
       'logn'  or 'Lognormal',
       'nbin'  or 'Negative Binomial',
       'ncf'   or 'Noncentral F',
       'nct'   or 'Noncentral t',
       'ncx2'  or 'Noncentral Chi-square',
       'norm'  or 'Normal',
       'poiss' or 'Poisson',
       'rayl'  or 'Rayleigh',
       't'     or 'T',
       'unif'  or 'Uniform',
       'unid'  or 'Discrete Uniform',
       'wbl'   or 'Weibull'.
 
    Partial matches are allowed and case is ignored.
 
    RANDOM calls many specialized routines that do the calculations.
 
    See also CDF, ICDF, MLE, PDF.

    Reference page in Help browser
       doc random


A =

     1     1     1
     1     1     1
     1     1     1


B =

     0     0     0
     0     0     0
     0     0     0


C =

     1     0     0
     0     1     0
     0     0     1


D(:,:,1) =

     0     0
     0     0


D(:,:,2) =

     0     0
     0     0


E(:,:,1) =

     1     1
     1     1


E(:,:,2) =

     1     1
     1     1


ans =

     0     0
     0     0


ans(:,:,1) =

     0
     0


ans(:,:,2) =

     0
     0


ans =

     0     0
     0     0


ans =

     1     3     5     7     9

Operations and Arithmetic

% Matlab performs addition and subtraction component-wise on arrays.

G = A + B + C

% It also performs scalar multiplication and division this way

G/pi

% But '*', '/', '^' call matrix operations

G*C

% Component operations can be called with '.*' etc

G.^2

% Reshaping

H = reshape(1:10,5,2)

% Transpose

H'
G =

     2     1     1
     1     2     1
     1     1     2


ans =

    0.6366    0.3183    0.3183
    0.3183    0.6366    0.3183
    0.3183    0.3183    0.6366


ans =

     2     1     1
     1     2     1
     1     1     2


ans =

     4     1     1
     1     4     1
     1     1     4


H =

     1     6
     2     7
     3     8
     4     9
     5    10


ans =

     1     2     3     4     5
     6     7     8     9    10

Exploring Functional Data

size(fdarray)

% Separate the dimensions

xdata = fdarray(:,:,1);
ydata = fdarray(:,:,2);

% Time:

fdatime   = linspace(0, 2.3, 1401)';

% plot a line

plot(fdatime,xdata(:,1))

% take a closer look

figure(2)
plot(fdatime(1:100),xdata(1:100,1),'.')


% what about all the replications?

figure(3)
plot(fdatime(1:100),xdata(1:100,:))

% How to I plot both sets through time?

figure(4)
subplot(2,1,1)
plot(fdatime,xdata)
subplot(2,1,2)
plot(fdatime,ydata)

% Alternatively:

figure(5)
plot(xdata,ydata)

% I can add labels

xlabel('x position','fontsize',20)
ylabel('y position','fontsize',20)
title('Handwriting Data','fontsize',20)

% Now lets look at some pointwise statistics:

xmean = mean(xdata);     % Mean

size(xmean)

% Ooops!

xmean = mean(xdata,2);
ymean = mean(ydata,2);

figure(6)
subplot(2,1,1)
plot(fdatime,xmean,'LineWidth',2)
hold on
plot(fdatime,ymean,'r','LineWidth',2)
hold off

% OK, now lets look at variance

xvar = var(xdata,[],2);
yvar = var(ydata,[],2);

subplot(2,1,2)
plot(fdatime,xvar,'c--','LineWidth',2);
hold on
plot(fdatime,yvar,'g--','LineWidth',2);
hold off


% I wonder if they have any relationship

figure(8)
subplot(1,1,1)
plot(xmean,xvar)
hold on
plot(ymean,yvar,'r')
hold off

% OK, what about more sophisticated analyses?

sub_samp = 1:8:800;    % Reduce for the sake of computing time and visualization

xdata_s = xdata(sub_samp,:);
ydata_s = ydata(sub_samp,:);

fdatime_s = fdatime(sub_samp,:);

fda_x_cor = corr(xdata_s');
fda_y_cor = corr(ydata_s');
fda_xy_cor = corr(xdata_s',ydata_s');


% Now lets look at the correlation process:

figure(9)
surf(fdatime_s,fdatime_s,fda_x_cor)

% Alternatively, a contour plot is sometimes more helpful:

figure(10)
contourf(fdatime_s,fdatime_s,fda_x_cor)

% I'll make sense of this by looking at the x-process as well

figure(11)
subplot(2,1,1)
contourf(fdatime_s,fdatime_s,fda_x_cor)
subplot(2,1,2)
plot(fdatime_s,xdata_s)

% I need to register the second plot to have the same axis

axis([fdatime_s(1) fdatime_s(100) -0.04 0.01])

% Now do the same with y

figure(12)
subplot(2,1,1)
contourf(fdatime_s,fdatime_s,fda_y_cor)
subplot(2,1,2)
plot(fdatime_s,ydata_s)
axis([fdatime_s(1) fdatime_s(100) -0.04 0.04])

% Now lets look at cross correlation

figure(13)
subplot(2,2,1)
plot(xdata_s,fdatime_s)
axis([-0.04 0.02 fdatime_s(1) fdatime_s(100)])

subplot(2,2,4)
plot(fdatime_s,ydata_s)
axis([fdatime_s(1) fdatime_s(100) -0.04 0.04])

subplot(2,2,2)
contourf(fdatime_s,fdatime_s,fda_xy_cor);

subplot(2,2,3)
plot(xdata_s,ydata_s)

% I like that graph so much that I think I'll save it

print -dpng 'handwriting_crosscorr'


% A bit of multivariate analysis will formalize what's going on here.

% First I want to look at covariance

xcov = cov(xdata_s');
ycov = cov(ydata_s');

% Eigen-analysis of correlation matrices

% x first

[vx,dx] = eig(xcov);

dx = diag(dx);

plot(dx)

dx(91:100)/sum(dx)

xmean_s = xmean(sub_samp);

figure(14)
subplot(1,1,1)
plot(fdatime_s,xmean(sub_samp))
hold on
plot(fdatime_s,xmean_s+sqrt(dx(100))*vx(:,100),'r')
plot(fdatime_s,xmean_s-sqrt(dx(100))*vx(:,100),'r--')

plot(fdatime_s,xmean_s+sqrt(dx(99))*vx(:,99),'c')
plot(fdatime_s,xmean_s-sqrt(dx(99))*vx(:,99),'c--')
hold off

% And I can do the same thing with y

[vy,dy] = eig(ycov);

dy = diag(dy);

figure(15)
plot(dy)

dy(91:100)/sum(dy)

ymean_s = ymean(sub_samp);

figure(16)
plot(fdatime_s,ymean(sub_samp))
hold on
plot(fdatime_s,ymean_s+sqrt(dy(100))*vy(:,100),'r')
plot(fdatime_s,ymean_s-sqrt(dy(100))*vy(:,100),'r--')

plot(fdatime_s,ymean_s+sqrt(dy(99))*vy(:,99),'c')
plot(fdatime_s,ymean_s-sqrt(dy(99))*vy(:,99),'c--')
hold off

% And, of course, I can put them together

figure(17)
plot(xmean_s,ymean_s)
hold on
plot(xmean_s+sqrt(dx(100))*vx(:,100),ymean_s+sqrt(dy(100))*vy(:,100),'r')
plot(xmean_s+sqrt(dx(100))*vx(:,100),ymean_s-sqrt(dy(100))*vy(:,100),'r--')
hold off

% Now lets look at some derivatives.

% First consider a difference

dxdata = diff(xdata,1)/0.0016;
dydata = diff(ydata,1)/0.0016;

fdatime_d = fdatime(1:1400);

% What does this look like?

figure(18)
plot(fdatime_d,dydata)

% Wow; is it really that noisy?

figure(19)
plot(fdatime(1:100),dydata(1:100,1))

% Alright, perhaps we can smooth this out by averaging:

dxmean = mean(dxdata,2);
dymean = mean(dydata,2);

figure(20)
plot(fdatime_d,dxmean)

% Now, looking back at that variance:

figure(21)
plot(dxmean,xvar(1:1400),'.')
hold on
plot(dymean,yvar(1:1400),'r.')

% What about acceleration?

d2xdata = diff(dxdata,1)/0.0016;
d2ydata = diff(dydata,1)/0.0016;

fdatime_d2 = fdatime(1:1399);

figure(22)
plot(fdatime_d2,d2ydata)

% Aaaaugh!

figure(23)
plot(fdatime(1:100),d2ydata(1:100,1))

% So the lesson is that differencing can be disasterous numerically; and
% even worse statistically!

d2xmean = mean(d2xdata,2);
d2ymean = mean(d2ydata,2);

figure(24)
plot(fdatime_d2,d2ymean);
ans =

        1401          20           2


ans =

     1    20


ans =

    0.0054
    0.0122
    0.0149
    0.0172
    0.0345
    0.0421
    0.0961
    0.1400
    0.2612
    0.3611


ans =

    0.0077
    0.0119
    0.0126
    0.0165
    0.0323
    0.0416
    0.0642
    0.1277
    0.2481
    0.4263

More Matlab -- loops

Matlab provides the standard loop options.

% I could, for example obtain the pointwise mean in the following way:

[N,n] = size(xdata)

xmean2 = zeros(N,1);

for i = 1:N
   j = 1;
   while j <= n
    xmean2(i) = j/(j+1)*xmean2(i) + 1/(j+1)*xdata(i,j);
    j = j + 1;
   end
end

% see also if, switch statements.
N =

        1401


n =

    20

More Matlab -- functions

A = rand(500);
B = rand(500);

C = mat_prod(A,B);

% This takes a while -- reasons to beware of loops; Matlab is very good at
% vector operations.

D = A*B;

Other Data Structures

% Structs and Cells are like lists in R.

% Structs contain named fields, for example:

prod.A = randn(2,2);
prod.B = randn(2,3);
prod.fn = @mat_prod;
prod.P = prod.fn(prod.A,prod.B);

prod

% These are frequently used to reduce the number of intputs to a function

xbundle.xdata = xdata;
xbundle.xmean = xmean;
xbundle.dxdata = dxdata;
xbundle.dxmean = dxmean;
xbundle.time = fdatime;

figure(25)
plot(xbundle.xdata);

% They can also be used to define other classes of objects. the FDA package
% will make extensive (often implicit) use of these.


% Cell arrays allow arbitrary objects to be stored in indexed arrays. The
% main piece of notation to be aware of here is that '( )' now refers to
% the cell location, whereas '{ }' refers to cell contents.

prodcell = cell(2,2);

prodcell{1,1} = A;
prodcell(1,2) = {B};
prodcell(2,:) = {prod.P};

prodcell

prodcell(2,:) = {@mat_prod,prod.P};

prodcell

% These can be usefull for storing data with unequal lengths or collections
% of other objects that don't fit easily into arrays.
prod = 

     A: [2x2 double]
     B: [2x3 double]
    fn: @mat_prod
     P: [2x3 double]


prodcell = 

    [500x500 double]    [500x500 double]
    [  2x3   double]    [  2x3   double]


prodcell = 

    [500x500 double]    [500x500 double]
    @mat_prod           [  2x3   double]