From Statistics for Engineering



Video timing


00:00  to  08:46   Announcements and recap of last week

08:46  to  12:08   Overview of today's class

12:08  to  21:05   Covariance

21:05  to  28:05   Correlation and formula for covariance

28:05  to  31:50   Nonparametric modelling

31:50  to  41:35   Definitions and why we minimize the sum of squares

41:35  to  62:13   Finding b_{0} and b_{1}: grid search; analytically; example

62:13  to  86:12   Model analysis: breakdown of variance

86:12  to  111:04   Model analysis: standard error, R^{2} and their interpretation

111:04  to  132:20   Confidence intervals for b_{0} and b_{1}

132:20  to  136:40   Interpreting software output
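The covariance, correlation, and analytical least-squares formulas covered in this video can be sketched in a few lines of R (a minimal sketch on synthetic data; all variable names are illustrative):

```r
# Sketch of covariance, correlation and the analytical least-squares
# solution; synthetic data for illustration only
set.seed(13)
x <- rnorm(20, mean=10, sd=2)
y <- 3*x + 5 + rnorm(20, sd=1)

# Covariance and correlation from their definitions
n <- length(x)
cov.xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
r.xy <- cov.xy / (sd(x) * sd(y))

# Analytical least-squares estimates: b1 = cov(x, y)/var(x)
b1 <- cov.xy / var(x)
b0 <- mean(y) - b1 * mean(x)

# These agree with R's built-in cov(), cor() and lm()
model <- lm(y ~ x)
```

Comparing `b0` and `b1` against `coef(model)` is a quick way to confirm the hand calculation.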



Video timing


00:00  to  06:45   Announcements and midterm results

06:45  to  11:15   Recap of last week's class on least squares

11:15  to  18:05   Prediction error for a new prediction of \(y\)

18:05  to  54:24   Analysis of a linear model. Checking assumptions:
 normally distributed errors
 nonconstant error variance
 lack of independence in the data
 nonlinearity in the assumed model
 summary of various checks to perform

54:24  to  85:00   Multiple linear regression (MLR)

85:00  to  91:10   Interpreting MLR model coefficients

91:10  to  107:40   Including and interpreting integer variables

107:40  to  134:45   Detecting outliers, discrepancy, leverage, and influential observations

134:45  to  141:00   Dealing with testing data
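The multiple linear regression and integer-variable material above can be sketched as follows (a minimal sketch on synthetic data; the variable names and coefficient values are illustrative, not from the class):

```r
# Multiple linear regression with a coded (integer/factor) variable;
# synthetic data for illustration only
set.seed(7)
n <- 30
x1 <- rnorm(n, mean=50, sd=5)
x2 <- rnorm(n, mean=10, sd=2)
batch <- factor(rep(c("A", "B"), length.out=n))
y <- 2*x1 - 3*x2 + 5*(batch == "B") + rnorm(n, sd=1)

mlr <- lm(y ~ x1 + x2 + batch)
summary(mlr)   # each slope: effect of that variable, holding the others constant
confint(mlr)   # intervals for every coefficient, including the batch shift
```

The coefficient on `batchB` is the estimated shift in \(y\) when moving from batch A to batch B, with `x1` and `x2` held constant.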



Video timing


00:00  to  40:00   Review of questions from assignment 5


Course notes
 (PDF) Course notes
 Please print pages from Chapter 4.
 The full PDF is provided so that hyperlinks for cross-references will work as expected.
Projector overheads
Audio recordings of 2011 classes
Date
 Material covered
 Audio file

07 February 2011
 Covariance, correlation, nonparametric modelling
 Class 14

09 February 2011
 Notation, calculating model parameters
 Class 15

10 February 2011
 Analysis of variance, \(R^2\) and \(S_E\)
 Class 16

16 February 2011
 Assumptions required to derive confidence intervals
 Class 17

17 February 2011
 Calculating prediction intervals for \(y\), assessing assumptions
 Class 18

28 February 2011
 Assessing assumptions and modifying data to better meet assumptions
 Class 19

02 March 2011
 Multiple linear regression
 Class 20

03 March 2011
 Integer variables; outliers: discrepancy, leverage and influence
 Class 21

Thanks to the various students responsible for recording and making these files available.
R code for this section
 Code used in class to evaluate a least-squares model:
bio <- read.csv('http://openmv.net/file/bioreactor-yields.csv')
summary(bio)
# Plot data
plot(bio)
# For convenience:
y <- bio$yield
x <- bio$temperature
# Build linear model: interpret coeff, CI and SE
model <- lm(y ~ x)
summary(model)
confint(model)
# Visualize predictions
plot(x, y)
abline(model)
# Residuals normal?
library(car)
qqPlot(model)
# Structure in residuals?
plot(resid(model))
abline(h=0, col="blue")
# Residuals in time order?
# You might have to rearrange the row order, then
# plot using the same code as the previous plot
# Look at the autocorrelation to check for lack of
# independence (can be inaccurate for small datasets)
acf(resid(model))
# Predictions vs residuals
plot(predict(model), resid(model))
abline(h=0, col="blue")
# xdata vs residuals
plot(x, resid(model))
abline(h=0, col="blue")
identify(x, resid(model))
# Predictions vs y
plot(y, predict(model))
abline(a=0, b=1, col="blue")
identify(y, predict(model))
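 The prediction-interval material (Class 18) can be sketched with R's built-in `predict()`; this is a minimal sketch on synthetic data, not the class dataset:
# Confidence interval (mean response) vs prediction interval (single new y);
# synthetic data for illustration only
set.seed(3)
x <- 1:20
y <- 0.5*x + rnorm(20, sd=0.5)
model <- lm(y ~ x)
new.x <- data.frame(x=10)
ci <- predict(model, new.x, interval="confidence")
pi <- predict(model, new.x, interval="prediction")
# The prediction interval is always wider: it includes the new
# observation's own noise as well as uncertainty in the fitted line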
 R code used to test for influence, discrepancy and leverage:
# Create some data
N = 25
set.seed(41)
x <- rnorm(N, sd=4, mean=10)
y <- x*4 - 6 + log(x) + rnorm(N, sd=3)
# Discrepant point (model A)
x[N+1] = 11.4
y[N+1] = 72.6
# Influential point (model B)
x[N+1] = 25
y[N+1] = 42
# High leverage point (model C)
x[N+1] = 25
y[N+1] = 92.6
# Build the model 3 times: run one of the three
# blocks above (A, B or C), then refit
model <- lm(y ~ x)
summary(model)
plot(x, y)
abline(model, col="blue")
identify(x, y)
# Leverage: hatvalues
plot(hatvalues(model), ylim=c(0,1))
N <- length(x)
avg.hat <- 2/N   # average hat value is K/N, with K = 2 coefficients here
abline(h=2*avg.hat, col="darkgreen")
abline(h=3*avg.hat, col="red")
identify(hatvalues(model))
# Discrepancy: rstudent
plot(rstudent(model))
abline(h=c(-2, 2), col="red")
# Influence: Cook's D
plot(cooks.distance(model))
K <- length(model$coefficients)
cutoff <- 4/(N-K)
abline(h=cutoff, col="red")
identify(cooks.distance(model))
build <- seq(1, N)
remove <- c(26)
model.update <- lm(y ~ x, subset=build[-remove])   # refit without the flagged point
# Improved?
library(car)
influenceIndexPlot(model.update)