Tuesday, Thursday 11:40 - 12:55, Warren 145
Office Hours: 1186 Comstock, Wednesday 1 - 3.
A syllabus is now available.
Slides from lecture 1.
09/03/07 Homework 1 is now out and will be due September 18.
09/10/07 The Class Project/Competition has now started. Data is listed in the Data section below. The initial best MSE on the test set is 0.9403.
09/17/07 My office hours have moved one hour earlier, Wednesday 1 - 3. If you cannot attend these hours, please feel free to send me an e-mail to arrange an appointment.
09/20/07 Homework 2 is now out and will be due October 4.
10/11/07 Homework 3 is now out and will be due October 30.
10/30/07 Homework 4 is now out and will be due November 29.
New Netflix performance - 0.6273.
Boston Housing Data and documentation.
HP Spam Data; there is a training set, test set and a documentation file.
California Housing Data and documentation.
train_ratings_all.dat The ratings that the users in the training data set gave to each of the 99 movies.
train_dates_all.dat The dates on which each of the ratings above was made.
train_ratings_nomiss.dat The training-set user ratings for the first 14 movies -- i.e., where there are no missing values.
train_dates_nomiss.dat The corresponding dates for train_ratings_nomiss.dat.
train_y_date.dat The dates on which the training set users rated "Miss Congeniality".
train_y_rating.dat The ratings that the users in the training set gave to "Miss Congeniality".
test_ratings_all.dat The ratings that the users in the test data set gave to each of the 99 movies.
test_dates_all.dat The dates on which each of the ratings above was made.
test_ratings_nomiss.dat The test-set user ratings for the first 14 movies -- i.e., where there are no missing values.
test_dates_nomiss.dat The corresponding dates for test_ratings_nomiss.dat.
test_y_date.dat The dates on which the test set users rated "Miss Congeniality".
movie_titles.txt Names and release dates for the 99 movies, given in the same order as the columns in the data above.
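As a rough illustration of how the competition score (MSE) relates to the files above, here is a minimal Python sketch. It assumes the .dat files are whitespace-delimited numeric text with one row per user, which the listing above does not confirm. The function names are hypothetical and not part of the course materials. Because no file of test-set "Miss Congeniality" ratings is listed, the sketch scores a constant-mean guess on a held-out part of the training ratings.

```python
from statistics import mean


def read_numeric_rows(path):
    """Read a numeric text file into a list of rows of floats.

    Assumption: values are whitespace-delimited, one row per user.
    Blank lines are skipped.
    """
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                rows.append([float(value) for value in fields])
    return rows


def mse(observed, predicted):
    """Mean squared error between observed and predicted ratings."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return mean((o - p) ** 2 for o, p in zip(observed, predicted))


def constant_baseline_mse(train_y, holdout_y):
    """Score the simplest predictor: always guess the mean training rating."""
    guess = mean(train_y)
    return mse(holdout_y, [guess] * len(holdout_y))


# Hypothetical usage, if train_y_rating.dat holds one rating per line:
#   y = [row[0] for row in read_numeric_rows("train_y_rating.dat")]
#   cut = int(0.8 * len(y))
#   print(constant_baseline_mse(y[:cut], y[cut:]))
```

A model built on the other movie ratings would be scored the same way, by passing its held-out predictions to mse in place of the constant guess.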
R-project website provides software and documentation for R.
Rintro: a very basic "getting you started in R" tutorial. The data used in this tutorial is the Boston Housing Data.
Fox, Introduction to Statistical Computing in R: an online introduction to the basics of R.
Venables and Ripley, 2004, Modern Applied Statistics with S-plus, 4th Edition, Springer. Provides a good reference for the R and S-plus languages.
Friedman, Data Mining and Statistics: What's the Connection? Jerry Friedman on why Data Mining is Statistics.
Breiman, Statistical Modeling: The Two Cultures. Leo Breiman on why Statistics should be Data Mining.
This is a very provocative article, with some very interesting discussion.
Hand, Classifier Technology and the Illusion of Progress. David Hand claiming that data mining has not really paid off. Again a provocative article with some interesting discussion.
Data Set Selection is a marvelous satirical commentary on the practice of machine learning research. Awarded a special prize at NIPS 2003. The Journal of Machine Learning Gossip continues the tradition.
Introduction to Data Mining, a very quick overview put out by twocrows.
Vapnik, 2000, The Nature of Statistical Learning Theory, Springer. On the opposite end -- an extremely technical account of particular aspects of machine learning.
Lee, 2004, Bayesian Nonparametrics via Neural Networks, SIAM. A statistical perspective on one of the most commonly used tools.
Burges, 1998, A Tutorial on Support Vector Machines for Pattern Recognition, Kluwer. A nicely accessible introduction to SVMs from a CS point of view.
Also see texts used in the classes below.
A very cute presentation on the use of image data. Thanks to Haim for finding this!
The Netflix Prize is a current opportunity for the highly inventive.
Readings and Resources
R Statistical Software
Perspectives on Statistics and Machine Learning
On Machine Learning/Data Mining
Some Cute Applications
Other Classes at Cornell
A computer science list of courses (somewhat out of date).
CS 478 -- Machine Learning
CS 578 -- Empirical Machine Learning
MATH 774 -- Topics in Statistical Learning Theory
ORIE 474 -- Statistical Data Mining; a masters level course.
ORIE 674 -- this class is a trial replacement.
Note that CS also has several seminar series in machine learning and its applications.