This website is out of date. The new site is at http://learnche.mcmaster.ca/4C3

Univariate data analysis (2011)

From Statistics for Engineering

Video material (part 1)
Download video: Link (plays in Google Chrome) [1.1Gb]

Video timing

00:00 to 08:55 Quiz and announcements
08:55 to 26:19 Variability: why it costs us
26:19 to 37:52 Histograms and frequency distributions
37:52 to 56:55 Nomenclature: mean, median, MAD, etc
56:55 to 60:40 Binomial and uniform distributions
60:40 to 75:05 Central limit theorem and independence
75:05 to 97:45 Normal distribution and exercises
97:45 to 110:30 Testing for normality
110:30 to 114:10 Software topics; wrapping up

Video material (part 2)
Download video: Link (plays in Google Chrome) [1.2Gb]

Video timing

00:00 to 05:00 Announcements
05:00 to 13:30 Recap and moving onto the Central Limit Theorem (CLT)
13:30 to 28:21 Using the CLT to estimate error of the mean
28:21 to 37:08 t-distribution and in-class example
37:08 to 55:10 Confidence intervals: calculation and their interpretation
55:10 to 82:26 Differences and similarity: using only reference data
82:26 to 116:38 Differences and similarity: using only experimental data, why we require independence
116:38 to 126:42 Paired tests

Course notes

Projector overheads

Class dates:
10 January: slides 1 to 20 were covered
12 January: slides 20 to 35 were covered
13 January: slides 35 to 45 were covered
17 January: slides 46 to 56 were covered
19 January: slides 57 to 70 were covered
20 January: slides 71 to 84 were covered

Audio recordings of 2011 classes

Date | Material covered (approximate: may differ somewhat) | Audio file
10 January 2011 | About variability, histograms and frequency distributions | No audio available
12 January 2011 | Samples, population, robust methods, central limit theorem, independence | Class 4
13 January 2011 | The normal distribution; testing for normality with the q-q plot | Class 5
17 January 2011 | The \(t\)-distribution and confidence interval for the mean with given variance | Class 6
19 January 2011 | Confidence interval with unknown variance; tests for differences/similarity with a reference set | Class 7
20 January 2011 | Tests for differences without a reference set | Class 8
24 January 2011 | Continued with tests without a reference set; paired tests | Class 9

Thanks to the various students responsible for recording these files and making them available.

Code used in class

Code used to illustrate how the q-q plot is constructed:

N <- 10
 
# What are the quantiles from the theoretical normal distribution?
index <- seq(1, N)
P <- (index - 0.5) / N
theoretical.quantity <- qnorm(P)
 
# Our sampled data:
yields <- c(86.2, 85.7, 71.9, 95.3, 77.1, 71.4, 68.9, 78.9, 86.9, 78.4)
mean.yield <- mean(yields)       # 80.07
sd.yield <- sd(yields)           # 8.35
 
# What are the quantiles for the sampled data?
yields.z <- (yields - mean.yield)/sd.yield
yields.z
 
yields.z.sorted <- sort(yields.z)
 
# Compare the values in text:
yields.z.sorted 
theoretical.quantity  
 
# Compare them graphically:
plot(theoretical.quantity, yields.z.sorted, asp=1)
abline(a=0, b=1)
 
# Built-in R function to do all the above for you:
qqnorm(yields)
qqline(yields)
 
# A better function: see http://connectmv.com/tutorials/r-tutorial/extending-r-with-packages/
library(car)
qqPlot(yields)
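
The manual construction above uses the probabilities (i - 0.5)/N, while the built-in qqnorm() obtains its probabilities from ppoints(), which uses a slightly different formula for small samples. The following is a minimal sketch, not part of the original class code, that compares the two sets of plotting positions for N = 10. The names P.manual, P.builtin, P.half and max.diff are chosen here for illustration only.

```r
# Sketch: compare the plotting positions from the manual q-q construction
# with those that qqnorm() obtains from ppoints().
N <- 10
index <- seq(1, N)

# Manual construction from the class code: (i - 0.5) / N
P.manual <- (index - 0.5) / N

# ppoints() uses (i - a) / (N + 1 - 2a), with a = 3/8 when N <= 10
P.builtin <- ppoints(N)

# Largest difference between the two sets of probabilities
max.diff <- max(abs(P.manual - P.builtin))
max.diff

# Setting a = 0.5 reproduces the manual positions
P.half <- ppoints(N, a = 0.5)
all.equal(P.manual, P.half)
```

The small difference in probabilities means the points drawn by qqnorm(yields) will not sit at exactly the same theoretical quantiles as those in the manual plot, although the overall shape of the plot is the same.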

Code used to illustrate the central limit theorem's reduction in variance:

# Show the 3 plots side by side
layout(matrix(c(1,2,3), 1, 3))
 
# Sample the population:
N <- 100
x <- rnorm(N, mean=80, sd=5)
mean(x)
sd(x)
 
# Plot the raw data
x.range <- range(x)
plot(x, ylim=x.range, main='Raw data')
 
# Subgroups of 2
subsize <- 2
x.2 <- numeric(N/subsize)
for (i in 1:(N/subsize))
{
    x.2[i] <- mean(x[((i-1)*subsize+1):(i*subsize)])
}
plot(x.2, ylim=x.range, main='Subgroups of 2')
 
# Subgroups of 4
subsize <- 4
x.4 <- numeric(N/subsize)
for (i in 1:(N/subsize))
{
    x.4[i] <- mean(x[((i-1)*subsize+1):(i*subsize)])
}
plot(x.4, ylim=x.range, main='Subgroups of 4')
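
The plots above show the reduction in spread visually. As a numerical check, the sketch below compares the standard deviation of the subgroup averages with the value sigma/sqrt(n) expected for independent samples. It is an illustration under stated assumptions rather than part of the class code: the larger sample size, the seed, and the helper name subgroup.means are all choices made here.

```r
# Sketch: numerical check that averaging n independent values reduces the
# standard deviation by a factor of sqrt(n).
set.seed(42)      # arbitrary seed, used only to make the run repeatable
N <- 10000        # larger than in the class code, to reduce sampling noise
sigma <- 5
x <- rnorm(N, mean=80, sd=sigma)

# Hypothetical helper: average consecutive, non-overlapping subgroups.
# matrix() fills column by column, so each column holds one subgroup.
# Assumes length(x) is a multiple of subsize.
subgroup.means <- function(x, subsize) {
    colMeans(matrix(x, nrow=subsize))
}

for (subsize in c(1, 2, 4)) {
    observed <- sd(subgroup.means(x, subsize))
    expected <- sigma / sqrt(subsize)
    cat("n =", subsize,
        " observed sd =", round(observed, 3),
        " expected sd =", round(expected, 3), "\n")
}
```

The observed and expected values should agree closely but not exactly, since the observed standard deviation is itself an estimate from a finite sample.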

Code used to illustrate unpaired and paired tests:

#d.data <- c(11,18,16,20,12,8,26,12,17,14)
#m.data <- c(25,27,30,33,16,28,27,12,32,16)
 
d.data <- c(11,26,18,16,20,12,8,26,12,17,14)
m.data <- c(25,3,27,30,33,16,28,27,12,32,16)
 
d.n <- length(d.data)
m.n <- length(m.data)
d.mean <- mean(d.data)
m.mean <- mean(m.data)
d.sd <- sd(d.data)
m.sd <- sd(m.data)
 
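# Unpaired difference
# -------------------
# Degrees of freedom for the pooled variance estimate
DOF <- m.n - 1 + d.n - 1
var.pooled <- ((m.n-1)*(m.sd)^2 + (d.n-1)*(d.sd)^2) / DOF

sample.diff <- m.mean - d.mean
denom.sd <- sqrt(var.pooled * (1/m.n + 1/d.n))
# Although named z, this statistic follows the t-distribution with DOF degrees of freedom
z <- sample.diff / denom.sd

# Cumulative probability of a value below z
pt(z, df=DOF)
# 95% confidence interval for the difference between the means
ct <- qt(0.975, df=DOF)
CI.LB <- sample.diff - ct * denom.sd
CI.UB <- sample.diff + ct * denom.sd
c(CI.LB, CI.UB)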
 
 
# Paired difference
# -------------------
diffs <- m.data - d.data

diffs.mean <- mean(diffs)
diffs.sd <- sd(diffs)
c(diffs.mean, diffs.sd)
diffs.N <- length(diffs)
t.crit <- qt(0.975, df=diffs.N-1)
t.crit
LB <- diffs.mean - t.crit * diffs.sd / sqrt(diffs.N)
UB <- diffs.mean + t.crit * diffs.sd / sqrt(diffs.N)
c(LB, UB)
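
The hand calculations above can be cross-checked against R's built-in t.test() function. This is a minimal sketch, not part of the original class code. It repeats the data so that it runs on its own, and the names unpaired and paired are chosen here for illustration. Setting var.equal=TRUE requests the pooled-variance form used in the unpaired calculation; by default t.test() uses the Welch form, which does not pool the variances.

```r
# Sketch: cross-check the unpaired and paired confidence intervals
# with the built-in t.test() function.
d.data <- c(11,26,18,16,20,12,8,26,12,17,14)
m.data <- c(25,3,27,30,33,16,28,27,12,32,16)

# Unpaired test with pooled variance, matching the hand calculation
unpaired <- t.test(m.data, d.data, var.equal=TRUE, conf.level=0.95)
unpaired$parameter    # degrees of freedom
unpaired$conf.int     # confidence interval for the difference in means

# Paired test: works on the element-by-element differences m.data - d.data
paired <- t.test(m.data, d.data, paired=TRUE, conf.level=0.95)
paired$parameter
paired$conf.int
```

The two intervals differ because the paired test analyzes the within-pair differences, while the unpaired test treats the two groups as independent samples.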