Introduction to MATLAB for Functional Data Analysis
Author: Giles Hooker
Last Modified: 24/01/07
This file represents Lecture 2 in the BTRY 694: Functional Data Analysis. It is intended to provide an introduction to Matlab at the same time as illustrating some data-exploration techniques for Functional Data Analysis.
To begin with, note that Matlab has a text editing window (this one) in which you can write commands and programs. As will all programming languages, programs can be commented -- Matlab will ignore anything after a '%' on the line. You should always comment liberally; for your own sake as well as those who read your code.
Matlab will also publish commented files to HTML; use '%%' at a line to create a heading. Any comments following a heading will also make use of LaTeX comments, so that I can write
y = f(x) + \epsilon
and have it come out nicely in HTML! See Help (press F1) on Cell Mode and Publishing.
Two useful keys to know are F9, which will enter highlighted commands into the MATLAB window, and F5 which will run all commands in the file.
Contents
Loading, Generating and Saving Data
% For help on any function, you can look in the official help (press F1) or % you can type 'help <function name>' at the prompt. For example help load % First Load a data set. load 'handwriting.mat' % MATLAB stores data in a binary .mat format. You can force it to load a % tab delimited file by the option load -ASCII 'handwritingx.dat' % Alternatively, you can use a user interface to make sure you are getting % the file correct: % uiload % Load enters variables into MATLAB memory, to view your workspace type who % or for more details whos % If you have generated data and want to save it save 'handwriting.mat' % If you want to only save specific variables save 'handwritingx.mat' fdarrayx % To save as a text file save -ASCII 'fdarrayx' handwritingx % Some tips for generating data: % To create a sequence x = 1:10 % Integers between 0 and 10 % Note that most commands result in MATLAB outputting the result of that % command. This can be supressed by: x = 1:10; % A colon on the end of the line never hurts, we can always look at a % variable by typing x % Sequences can be incremented in non-unit intervals x = 0:2.5:10 x = linspace(1,10,4) % Matlab treats variables as arrays and indexes these as follows x(1) x(3:4) % I can also build data myself x = [1 2; 3 4] % Note here that the ';' indicates a new row in the array. To index % multidimensional arrays, I can take x(1,2) % I can also take individual rows or columns. x(:,1) x(1,:) % This notation can be extended to multi-dimensional arrays. % I can also look at the size of x size(x) numel(x) % And if I want to get rid of some entries x(:,1) = []; % I can generate random data x = rand(10,2) % uniform x = randn(10,3) % normal % The function 'random' will do many more. help random % A couple of other useful functions A = ones(3,3) B = zeros(3,3) C = eye(3,3) % Matlab also allows multi-dimensional arrays D = zeros(2,2,2) % A standard convention: E = ones([2 2 2]) % When the final index in an array is a singleton, Matlab will reduce the % dimension: D(:,:,1) % However, this does not work for other dimensions D(:,2,:) % To reduce this to a two-dimensional array use squeeze(D(:,2,:)) % See also permute as a generalization of traspose. % Matlab allows vectors of integers as indices: f = 1:10; f(1:2:9)
LOAD Load workspace variables from disk.
LOAD FILENAME retrieves all variables from a file given a full pathname
or a MATLABPATH relative partial pathname (see PARTIALPATH). If
FILENAME has no extension LOAD looks for FILENAME.mat and, if found,
LOAD treats the file as a binary "MAT-file". If FILENAME.mat is not
found, or if FILENAME has an extension other than .mat it is treated as
an ASCII file.
LOAD, by itself, uses the binary "MAT-file" named 'matlab.mat'. It is
an error if 'matlab.mat' is not found.
LOAD FILENAME X loads only X.
LOAD FILENAME X Y Z ... loads just the specified variables. The
wildcard '*' loads variables that match a pattern (MAT-file only).
LOAD FILENAME -REGEXP PAT1 PAT2 can be used to load all variables
matching the specified patterns using regular expressions. For more
information on using regular expressions, type "doc regexp" at the
command prompt.
LOAD -ASCII FILENAME or LOAD -MAT FILENAME forces LOAD to treat the
file as either an ASCII file or a MAT-file regardless of file
extension. With -ASCII, LOAD will error if the file is not numeric
text. With -MAT, LOAD will error if the file is not a MAT-file
generated by SAVE -MAT.
If FILENAME is a MAT-file, requested variables from FILENAME are
created in the workspace. If FILENAME is not a MAT-file, a double
precision array is created with name based on FILENAME. Leading
underscores or digits in FILENAME are replaced with X. Other non-alpha
chars in FILENAME are replaced with underscores.
S = LOAD(...) returns the contents of FILENAME in variable S. If
FILENAME is a MAT-file, S is a struct containing fields matching the
variables retrieved. If FILENAME is an ASCII file, S is a double
precision array.
Use the functional form of LOAD, such as LOAD('filename'), when the
filename is stored in a string, when an output argument is requested,
or if FILENAME contains spaces.
Examples for pattern matching:
load fname a* % Load variables starting with "a"
load fname -regexp ^b\d{3}$ % Load variables starting with "b" and
% followed by 3 digits
load fname -regexp \d % Load variables containing any digits
See also SAVE, WHOS, UILOAD, SPCONVERT, PARTIALPATH, IOFUN, FILEFORMATS.
Overloaded functions or methods (ones with the same name in other directories)
help COM/load.m
help ccsdebug/load.m
Reference page in Help browser
doc load
Your variables are:
A d2xmean fda_x_cor fid x ydata
B d2ydata fda_xy_cor handwritingx xbundle ydata_s
C d2ymean fda_y_cor i xcov ymean
D dx fdarray j xdata ymean_s
E dxdata fdarrayx n xdata_s yvar
G dxmean fdarrayy prod xmean
H dy fdatime prodcell xmean2
N dydata fdatime_d sub_samp xmean_s
ans dymean fdatime_d2 vx xvar
d2xdata f fdatime_s vy ycov
Name Size Bytes Class
A 500x500 2000000 double array
B 500x500 2000000 double array
C 500x500 2000000 double array
D 500x500 2000000 double array
E 2x2x2 64 double array
G 3x3 72 double array
H 5x2 80 double array
N 1x1 8 double array
ans 1x3 24 double array
d2xdata 1399x20 223840 double array
d2xmean 1399x1 11192 double array
d2ydata 1399x20 223840 double array
d2ymean 1399x1 11192 double array
dx 100x1 800 double array
dxdata 1400x20 224000 double array
dxmean 1400x1 11200 double array
dy 100x1 800 double array
dydata 1400x20 224000 double array
dymean 1400x1 11200 double array
f 1x10 80 double array
fda_x_cor 100x100 80000 double array
fda_xy_cor 100x100 80000 double array
fda_y_cor 100x100 80000 double array
fdarray 1401x20x2 448320 double array
fdarrayx 1401x20 224160 double array
fdarrayy 1401x20 224160 double array
fdatime 1401x1 11208 double array
fdatime_d 1400x1 11200 double array
fdatime_d2 1399x1 11192 double array
fdatime_s 100x1 800 double array
fid 1x1 8 double array
handwritingx 1401x20 224160 double array
i 1x1 8 double array
j 1x1 8 double array
n 1x1 8 double array
prod 1x1 640 struct array
prodcell 2x2 4000128 cell array
sub_samp 1x100 800 double array
vx 100x100 80000 double array
vy 100x100 80000 double array
x 10x3 240 double array
xbundle 1x1 482396 struct array
xcov 100x100 80000 double array
xdata 1401x20 224160 double array
xdata_s 100x20 16000 double array
xmean 1401x1 11208 double array
xmean2 1401x1 11208 double array
xmean_s 100x1 800 double array
xvar 1401x1 11208 double array
ycov 100x100 80000 double array
ydata 1401x20 224160 double array
ydata_s 100x20 16000 double array
ymean 1401x1 11208 double array
ymean_s 100x1 800 double array
yvar 1401x1 11208 double array
Grand total is 1959830 elements using 15679788 bytes
x =
1 2 3 4 5 6 7 8 9 10
x =
1 2 3 4 5 6 7 8 9 10
x =
0 2.5000 5.0000 7.5000 10.0000
x =
1 4 7 10
ans =
1
ans =
7 10
x =
1 2
3 4
ans =
2
ans =
1
3
ans =
1 2
ans =
2 2
ans =
4
x =
0.2088 0.1735
0.9047 0.8445
0.4223 0.5731
0.0281 0.1457
0.4724 0.0029
0.7427 0.5766
0.8527 0.9781
0.4275 0.7549
0.6733 0.6361
0.4305 0.6449
x =
0.4694 1.0184 -0.4650
-0.9036 -1.5804 0.3710
0.0359 -0.0787 0.7283
-0.6275 -0.6817 2.1122
0.5354 -1.0246 -1.3573
0.5529 -1.2344 -1.0226
-0.2037 0.2888 1.0378
-2.0543 -0.4293 -0.3898
0.1326 0.0558 -1.3813
1.5929 -0.3679 0.3155
RANDOM Generate random arrays from a specified distribution.
The appropriate syntax depends on the number of parameters in the
distribution you are using:
R = RANDOM(NAME,A) returns an array of random numbers from the named
distribution that requires a single parameter array A.
R = RANDOM(NAME,A,B) returns an array of random numbers from the named
distribution that requires two parameter arrays A and B.
R = RANDOM(NAME,A,B,C) returns an array of random numbers from the named
distribution that requires three parameter arrays A, B, and C.
The size of R is the common size of A, B, and C if all are arrays. If
any parameters are scalars, the size of R is the size of the other
parameter(s).
R = RANDOM(NAME,A,M,N,...) or R = RANDOM(NAME,A,[M,N,...]) returns an
M-by-N-by-... array of random numbers.
R = RANDOM(NAME,A,B,M,N,...) or R = RANDOM(NAME,A,B,[M,N,...]) returns
an M-by-N-by-... array of random numbers.
R = RANDOM(NAME,A,B,C,M,N,...) or R = RANDOM(NAME,A,B,C,[M,N,...])
returns an M-by-N-by-... array of random numbers.
NAME can be one of:
'beta' or 'Beta',
'bino' or 'Binomial',
'chi2' or 'Chisquare',
'exp' or 'Exponential',
'ev' or 'Extreme Value',
'f' or 'F',
'gam' or 'Gamma',
'gev' or 'Generalized Extreme Value',
'gp' or 'Generalized Pareto',
'geo' or 'Geometric',
'hyge' or 'Hypergeometric',
'logn' or 'Lognormal',
'nbin' or 'Negative Binomial',
'ncf' or 'Noncentral F',
'nct' or 'Noncentral t',
'ncx2' or 'Noncentral Chi-square',
'norm' or 'Normal',
'poiss' or 'Poisson',
'rayl' or 'Rayleigh',
't' or 'T',
'unif' or 'Uniform',
'unid' or 'Discrete Uniform',
'wbl' or 'Weibull'.
Partial matches are allowed and case is ignored.
RANDOM calls many specialized routines that do the calculations.
See also CDF, ICDF, MLE, PDF.
Reference page in Help browser
doc random
A =
1 1 1
1 1 1
1 1 1
B =
0 0 0
0 0 0
0 0 0
C =
1 0 0
0 1 0
0 0 1
D(:,:,1) =
0 0
0 0
D(:,:,2) =
0 0
0 0
E(:,:,1) =
1 1
1 1
E(:,:,2) =
1 1
1 1
ans =
0 0
0 0
ans(:,:,1) =
0
0
ans(:,:,2) =
0
0
ans =
0 0
0 0
ans =
1 3 5 7 9
Operations and Arithmetic
% Matlab performs addition and subtraction component-wise on arrays. G = A + B + C % It also performs scalar multiplication and division this way G/pi % But '*', '/', '^' call matrix operations G*C % Component operations can be called with '.*' etc G.^2 % Reshaping H = reshape(1:10,5,2) % Transpose H'
G =
2 1 1
1 2 1
1 1 2
ans =
0.6366 0.3183 0.3183
0.3183 0.6366 0.3183
0.3183 0.3183 0.6366
ans =
2 1 1
1 2 1
1 1 2
ans =
4 1 1
1 4 1
1 1 4
H =
1 6
2 7
3 8
4 9
5 10
ans =
1 2 3 4 5
6 7 8 9 10
Exploring Functional Data
size(fdarray) % Separate the dimensions xdata = fdarray(:,:,1); ydata = fdarray(:,:,2); % Time: fdatime = linspace(0, 2.3, 1401)'; % plot a line plot(fdatime,xdata(:,1)) % take a closer look figure(2) plot(fdatime(1:100),xdata(1:100,1),'.') % what about all the replications? figure(3) plot(fdatime(1:100),xdata(1:100,:)) % How to I plot both sets through time? figure(4) subplot(2,1,1) plot(fdatime,xdata) subplot(2,1,2) plot(fdatime,ydata) % Alternatively: figure(5) plot(xdata,ydata) % I can add labels xlabel('x position','fontsize',20) ylabel('y position','fontsize',20) title('Handwriting Data','fontsize',20) % Now lets look at some pointwise statistics: xmean = mean(xdata); % Mean size(xmean) % Ooops! xmean = mean(xdata,2); ymean = mean(ydata,2); figure(6) subplot(2,1,1) plot(fdatime,xmean,'LineWidth',2) hold on plot(fdatime,ymean,'r','LineWidth',2) hold off % OK, now lets look at variance xvar = var(xdata,[],2); yvar = var(ydata,[],2); subplot(2,1,2) plot(fdatime,xvar,'c--','LineWidth',2); hold on plot(fdatime,yvar,'g--','LineWidth',2); hold off % I wonder if they have any relationship figure(8) subplot(1,1,1) plot(xmean,xvar) hold on plot(ymean,yvar,'r') hold off % OK, what about more sophisticated analyses? sub_samp = 1:8:800; % Reduce for the sake of computing time and visualization xdata_s = xdata(sub_samp,:); ydata_s = ydata(sub_samp,:); fdatime_s = fdatime(sub_samp,:); fda_x_cor = corr(xdata_s'); fda_y_cor = corr(ydata_s'); fda_xy_cor = corr(xdata_s',ydata_s'); % Now lets look at the correlation process: figure(9) surf(fdatime_s,fdatime_s,fda_x_cor) % Alternatively, a contour plot is sometimes more helpful: figure(10) contourf(fdatime_s,fdatime_s,fda_x_cor) % I'll make sense of this by looking at the x-process as well figure(11) subplot(2,1,1) contourf(fdatime_s,fdatime_s,fda_x_cor) subplot(2,1,2) plot(fdatime_s,xdata_s) % I need to register the second plot to have the same axis axis([fdatime_s(1) fdatime_s(100) -0.04 0.01]) % Now do the same with y figure(12) subplot(2,1,1) contourf(fdatime_s,fdatime_s,fda_y_cor) subplot(2,1,2) plot(fdatime_s,ydata_s) axis([fdatime_s(1) fdatime_s(100) -0.04 0.04]) % Now lets look at cross correlation figure(13) subplot(2,2,1) plot(xdata_s,fdatime_s) axis([-0.04 0.02 fdatime_s(1) fdatime_s(100)]) subplot(2,2,4) plot(fdatime_s,ydata_s) axis([fdatime_s(1) fdatime_s(100) -0.04 0.04]) subplot(2,2,2) contourf(fdatime_s,fdatime_s,fda_xy_cor); subplot(2,2,3) plot(xdata_s,ydata_s) % I like that graph so much that I think I'll save it print -dpng 'handwriting_crosscorr' % A bit of multivariate analysis will formalize what's going on here. % First I want to look at covariance xcov = cov(xdata_s'); ycov = cov(ydata_s'); % Eigen-analysis of correlation matrices % x first [vx,dx] = eig(xcov); dx = diag(dx); plot(dx) dx(91:100)/sum(dx) xmean_s = xmean(sub_samp); figure(14) subplot(1,1,1) plot(fdatime_s,xmean(sub_samp)) hold on plot(fdatime_s,xmean_s+sqrt(dx(100))*vx(:,100),'r') plot(fdatime_s,xmean_s-sqrt(dx(100))*vx(:,100),'r--') plot(fdatime_s,xmean_s+sqrt(dx(99))*vx(:,99),'c') plot(fdatime_s,xmean_s-sqrt(dx(99))*vx(:,99),'c--') hold off % And I can do the same thing with y [vy,dy] = eig(ycov); dy = diag(dy); figure(15) plot(dy) dy(91:100)/sum(dy) ymean_s = ymean(sub_samp); figure(16) plot(fdatime_s,ymean(sub_samp)) hold on plot(fdatime_s,ymean_s+sqrt(dy(100))*vy(:,100),'r') plot(fdatime_s,ymean_s-sqrt(dy(100))*vy(:,100),'r--') plot(fdatime_s,ymean_s+sqrt(dy(99))*vy(:,99),'c') plot(fdatime_s,ymean_s-sqrt(dy(99))*vy(:,99),'c--') hold off % And, of course, I can put them together figure(17) plot(xmean_s,ymean_s) hold on plot(xmean_s+sqrt(dx(100))*vx(:,100),ymean_s+sqrt(dy(100))*vy(:,100),'r') plot(xmean_s+sqrt(dx(100))*vx(:,100),ymean_s-sqrt(dy(100))*vy(:,100),'r--') hold off % Now lets look at some derivatives. % First consider a difference dxdata = diff(xdata,1)/0.0016; dydata = diff(ydata,1)/0.0016; fdatime_d = fdatime(1:1400); % What does this look like? figure(18) plot(fdatime_d,dydata) % Wow; is it really that noisy? figure(19) plot(fdatime(1:100),dydata(1:100,1)) % Alright, perhaps we can smooth this out by averaging: dxmean = mean(dxdata,2); dymean = mean(dydata,2); figure(20) plot(fdatime_d,dxmean) % Now, looking back at that variance: figure(21) plot(dxmean,xvar(1:1400),'.') hold on plot(dymean,yvar(1:1400),'r.') % What about acceleration? d2xdata = diff(dxdata,1)/0.0016; d2ydata = diff(dydata,1)/0.0016; fdatime_d2 = fdatime(1:1399); figure(22) plot(fdatime_d2,d2ydata) % Aaaaugh! figure(23) plot(fdatime(1:100),d2ydata(1:100,1)) % So the lesson is that differencing can be disasterous numerically; and % even worse statistically! d2xmean = mean(d2xdata,2); d2ymean = mean(d2ydata,2); figure(24) plot(fdatime_d2,d2ymean);
ans =
1401 20 2
ans =
1 20
ans =
0.0054
0.0122
0.0149
0.0172
0.0345
0.0421
0.0961
0.1400
0.2612
0.3611
ans =
0.0077
0.0119
0.0126
0.0165
0.0323
0.0416
0.0642
0.1277
0.2481
0.4263
More Matlab -- loops
Matlab provides the standard loop options.
% I could, for example obtain the pointwise mean in the following way: [N,n] = size(xdata) xmean2 = zeros(N,1); for i = 1:N j = 1; while j <= n xmean2(i) = j/(j+1)*xmean2(i) + 1/(j+1)*xdata(i,j); j = j + 1; end end % see also if, switch statements.
N =
1401
n =
20
More Matlab -- functions
A = rand(500); B = rand(500); C = mat_prod(A,B); % This takes a while -- reasons to beware of loops; Matlab is very good at % vector operations. D = A*B;
Other Data Structures
% Structs and Cells are like lists in R. % Structs contain named fields, for example: prod.A = randn(2,2); prod.B = randn(2,3); prod.fn = @mat_prod; prod.P = prod.fn(prod.A,prod.B); prod % These are frequently used to reduce the number of intputs to a function xbundle.xdata = xdata; xbundle.xmean = xmean; xbundle.dxdata = dxdata; xbundle.dxmean = dxmean; xbundle.time = fdatime; figure(25) plot(xbundle.xdata); % They can also be used to define other classes of objects. the FDA package % will make extensive (often implicit) use of these. % Cell arrays allow arbitrary objects to be stored in indexed arrays. The % main piece of notation to be aware of here is that '( )' now refers to % the cell location, whereas '{ }' refers to cell contents. prodcell = cell(2,2); prodcell{1,1} = A; prodcell(1,2) = {B}; prodcell(2,:) = {prod.P}; prodcell prodcell(2,:) = {@mat_prod,prod.P}; prodcell % These can be usefull for storing data with unequal lengths or collections % of other objects that don't fit easily into arrays.
prod =
A: [2x2 double]
B: [2x3 double]
fn: @mat_prod
P: [2x3 double]
prodcell =
[500x500 double] [500x500 double]
[ 2x3 double] [ 2x3 double]
prodcell =
[500x500 double] [500x500 double]
@mat_prod [ 2x3 double]