This website is out of date. The new site is at http://learnche.mcmaster.ca/4C3

Least squares modelling (2011)

From Statistics for Engineering

Video material (part 1)
Download video: Link (plays in Google Chrome) [1.2 GB]

Video timing

00:00 to 08:46 Announcements and recap of last week
08:46 to 12:08 Overview of today's class
12:08 to 21:05 Covariance
21:05 to 28:05 Correlation and formula for covariance
28:05 to 31:50 Nonparametric modelling
31:50 to 41:35 Definitions and why we minimize the sum of squares
41:35 to 62:13 Finding b0 and b1: grid search; analytically; example
62:13 to 86:12 Model analysis: breakdown of variance
86:12 to 111:04 Model analysis: standard error, R2 and their interpretation
111:04 to 132:20 Confidence intervals for b0 and b1
132:20 to 136:40 Interpreting software output

Video material (part 2)
Download video: Link (plays in Google Chrome) [1.1 GB]

Video timing

00:00 to 06:45 Announcements and midterm results
06:45 to 11:15 Recap of last week's class on least squares
11:15 to 18:05 Prediction error for a new y-prediction
18:05 to 54:24 Analysis of a linear model. Checking assumptions:
  • normally distributed errors
  • non-constant error variance
  • lack of independence in the data
  • non-linearity in the assumed model
  • summary of various checks to perform
54:24 to 85:00 Multiple linear regression (MLR)
85:00 to 91:10 Interpreting MLR model coefficients
91:10 to 107:40 Including and interpreting integer variables
107:40 to 134:45 Detecting outliers: discrepancy, leverage and influential observations
134:45 to 141:00 Dealing with testing data
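The "prediction error for a new y-prediction" topic above distinguishes a prediction interval for a new observation from a confidence interval for the mean response. A minimal sketch on simulated data (the data and numbers here are assumptions, not from the course):

```r
# Simulated data: y roughly linear in x
set.seed(1)
x <- runif(20, 300, 400)              # e.g. a temperature range
y <- 0.5 * x - 80 + rnorm(20, sd = 5)
model <- lm(y ~ x)

# 95% interval at a new x value. The prediction interval is wider
# than the confidence interval, because it also includes the new
# observation's own error variance.
x.new <- data.frame(x = 350)
pi <- predict(model, x.new, interval = "prediction", level = 0.95)
ci <- predict(model, x.new, interval = "confidence", level = 0.95)
print(pi)
print(ci)
```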

Video material (part 3)
Download video: Link (plays in Google Chrome) [380 MB]

Video timing

00:00 to 40:00 Review of questions from assignment 5


Course notes

Projector overheads

Class dates:
  • 7 February: covered slides 1 to 19
  • 9 February: covered slides 20 to 30
  • 10 February: covered slides 31 to 41
  • 16 February: covered slides 42 to 59
  • 17 February: covered slides 60 to 73
  • 28 February: covered slides 67 to 84
  • 2 March: covered slides 85 to 98
  • 3 March: covered slides 98 to 112
Enrichment topics: not covered

Audio recordings of 2011 classes

Date Material covered Audio file
07 February 2011 Covariance, correlation, non-parametric modelling Class 14
09 February 2011 Notation, calculating model parameters Class 15
10 February 2011 Analysis of variance, \(R^2\) and \(S_E\) Class 16
16 February 2011 Assumptions required to derive confidence intervals Class 17
17 February 2011 Calculating prediction intervals for \(y\), assessing assumptions Class 18
28 February 2011 Assessing assumptions and modifying data to better meet assumptions Class 19
02 March 2011 Multiple linear regression Class 20
03 March 2011 Integer variables; outliers: discrepancy, leverage and influence Class 21

Thanks to the various students responsible for recording these classes and making the files available.

R code for this section

bio <- read.csv('http://openmv.net/file/bioreactor-yields.csv')
summary(bio)
 
# Plot data
plot(bio)
 
# For convenience:
y <- bio$yield
x <- bio$temperature
 
# Build linear model: interpret coeff, CI and SE
model <- lm(y ~ x)
summary(model)
confint(model)
 
# Visualize predictions
plot(x, y)
abline(model)
 
# Residuals normal?
library(car)
qqPlot(model)
 
# Structure in residuals?
plot(resid(model))
abline(h=0, col="blue")
 
# Residuals in time order?
# You might have to rearrange the rows into time order first,
# then plot using the same code as the previous plot
 
# Look at the autocorrelation to check for lack of
# independence (can be inaccurate for small datasets)
acf(resid(model))
 
# Predictions vs residuals
plot(predict(model), resid(model))
abline(h=0, col="blue")
 
# x-data vs residuals
plot(x, resid(model))
abline(h=0, col="blue")
identify(x, resid(model))
 
# Predictions-vs-y
plot(y, predict(model))
abline(a=0, b=1, col="blue")
identify(y, predict(model))
 
# Create some data to demonstrate outlier diagnostics
N = 25
set.seed(41)
x <- rnorm(N, sd=4, mean=10)
y <- x*4 - 6 + log(x) + rnorm(N, sd=3)
 
# Discrepant point (model A)
x[N+1] = 11.4
y[N+1] = 72.6
 
# Influential point (model B)
x[N+1] = 25
y[N+1] = 42
 
# High leverage point (model C)
x[N+1] = 25
y[N+1] = 92.6
 
# Build the model 3 times: run only ONE of cases A, B or C above
# each time (comment out the other two), then refit below
 
model <- lm(y~x)
summary(model)
 
plot(x, y)
abline(model, col="blue")
identify(x, y)
 
# Leverage: hatvalues
plot(hatvalues(model), ylim=c(0,1))
N <- length(x)
avg.hat <- 2/N   # average hat value is K/N; K = 2 parameters here
abline(h=2*avg.hat, col="darkgreen")
abline(h=3*avg.hat, col="red")
identify(hatvalues(model))
 
# Discrepancy: rstudent
plot(rstudent(model))
abline(h=c(-2, 2), col="red")
 
# Influence: Cook's D
plot(cooks.distance(model))
K <- length(model$coefficients)
cutoff <- 4/(N-K)
abline(h=cutoff, col="red")
identify(cooks.distance(model))
 
build <- seq(1,N)
remove <- -c(26)
model.update <- update(model, subset=build[remove])
 
# Improved?
library(car)
influenceIndexPlot(model.update)
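The multiple linear regression and integer-variable topics from the part 2 video are not covered in the code above. A minimal sketch on simulated data (all variable names and coefficient values here are made up for illustration):

```r
# Simulated process data with two continuous inputs and one
# integer (yes/no) input
set.seed(13)
N <- 30
temperature <- runif(N, 350, 400)
speed       <- runif(N, 3000, 4000)
baffles     <- sample(c(0, 1), N, replace = TRUE)  # integer variable
yield <- 0.10 * temperature - 0.005 * speed - 3 * baffles + rnorm(N, sd = 1)

# MLR: each coefficient is the effect of a unit change in that
# variable, holding the other variables constant. The coefficient
# on the integer variable is the average shift in yield when
# baffles are present.
model.mlr <- lm(yield ~ temperature + speed + baffles)
summary(model.mlr)
confint(model.mlr)
```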