This website is out of date. The new site is at http://learnche.mcmaster.ca/4C3

# Univariate data analysis (2011)

Video material (part 1)

Video timing

• 00:00 to 08:55: Quiz and announcements
• 08:55 to 26:19: Variability: why it costs us
• 26:19 to 37:52: Histograms and frequency distributions
• 37:52 to 56:55: Nomenclature: mean, median, MAD, etc.
• 56:55 to 60:40: Binomial and uniform distributions
• 60:40 to 75:05: Central limit theorem and independence
• 75:05 to 97:45: Normal distribution and exercises
• 97:45 to 110:30: Testing for normality
• 110:30 to 114:10: Software topics; wrapping up

Video material (part 2)

Video timing

• 00:00 to 05:00: Announcements
• 05:00 to 13:30: Recap and moving on to the Central Limit Theorem (CLT)
• 13:30 to 28:21: Using the CLT to estimate error of the mean
• 28:21 to 37:08: t-distribution and in-class example
• 37:08 to 55:10: Confidence intervals: calculation and their interpretation
• 55:10 to 82:26: Differences and similarity: using only reference data
• 82:26 to 116:38: Differences and similarity: using only experimental data, why we require independence
• 116:38 to 126:42: Paired tests

##  Course notes

• (PDF) Course notes
• Please print pages from Chapter 2.
• The full PDF is provided so that hyperlinks that cross-reference other sections will work as expected.

Class dates:

• 10 January: slides 1 to 20 were covered
• 12 January: slides 20 to 35 were covered
• 13 January: slides 35 to 45 were covered
• 17 January: slides 46 to 56 were covered
• 19 January: slides 57 to 70 were covered
• 20 January: slides 71 to 84 were covered

##  Audio recordings of 2011 classes

| Date | Material covered (approximate: may differ somewhat) | Audio file |
|---|---|---|
| 10 January 2011 | About variability, histograms and frequency distributions | No audio available |
| 12 January 2011 | Samples, population, robust methods, central limit theorem, independence | Class 4 |
| 13 January 2011 | The normal distribution; testing for normality with the q-q plot | Class 5 |
| 17 January 2011 | The $$t$$-distribution and confidence interval for the mean with given variance | Class 6 |
| 19 January 2011 | Confidence interval with unknown variance; tests for differences/similarity with a reference set | Class 7 |
| 20 January 2011 | Tests for differences without a reference set | Class 8 |
| 24 January 2011 | Continued with tests without a reference set; paired tests | Class 9 |

Thanks to the various students responsible for recording these classes and making the files available.

##  Code used in class

Code used to illustrate how the q-q plot is constructed:

N <- 10

# What are the quantiles from the theoretical normal distribution?
index <- seq(1, N)
P <- (index - 0.5) / N
theoretical.quantity <- qnorm(P)

# Our sampled data:
yields <- c(86.2, 85.7, 71.9, 95.3, 77.1, 71.4, 68.9, 78.9, 86.9, 78.4)
mean.yield <- mean(yields)       # 80.07
sd.yield <- sd(yields)           # 8.35

# What are the quantiles for the sampled data?
yields.z <- (yields - mean.yield)/sd.yield
yields.z

yields.z.sorted <- sort(yields.z)

# Compare the values in text:
yields.z.sorted
theoretical.quantity

# Compare them graphically:
plot(theoretical.quantity, yields.z.sorted, asp=1)
abline(a=0, b=1)

# Built-in R function to do all the above for you:
qqnorm(yields)
qqline(yields)

# A better function: see http://connectmv.com/tutorials/r-tutorial/extending-r-with-packages/
library(car)
qqPlot(yields)
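
One detail worth checking when comparing the manual construction with `qqnorm`: the class code uses plotting positions `(index - 0.5) / N`, while `qqnorm` takes its positions from `ppoints`, whose default offset is `a = 3/8` when `N <= 10` and `a = 1/2` otherwise. The sketch below, which only assumes base R, shows that the two sets of theoretical quantiles differ slightly for `N = 10`. The names `P.class` and `P.default` are introduced here purely for illustration.

```r
N <- 10
index <- seq(1, N)

# Plotting positions used in the class code: (i - 0.5) / N
P.class <- (index - 0.5) / N

# Default plotting positions used by qqnorm(); ppoints() uses a = 3/8 when N <= 10
P.default <- ppoints(N)

# The class formula corresponds to ppoints() with a = 1/2
all.equal(P.class, ppoints(N, a = 0.5))

# Compare the positions and the resulting theoretical quantiles side by side
round(cbind(P.class, P.default,
            q.class = qnorm(P.class),
            q.default = qnorm(P.default)), 3)
```

The differences are largest in the tails, so the manual plot and the `qqnorm` plot should not be expected to overlay exactly for small samples.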

Code used to illustrate the central limit theorem's reduction in variance:

# Show the 3 plots side by side
layout(matrix(c(1,2,3), 1, 3))

# Sample the population:
N <- 100
x <- rnorm(N, mean=80, sd=5)
mean(x)
sd(x)

# Plot the raw data
x.range <- range(x)
plot(x, ylim=x.range, main='Raw data')

# Subgroups of 2
subsize <- 2
x.2 <- numeric(N/subsize)
for (i in 1:(N/subsize)) {
    x.2[i] <- mean(x[((i-1)*subsize+1):(i*subsize)])
}
plot(x.2, ylim=x.range, main='Subgroups of 2')

# Subgroups of 4
subsize <- 4
x.4 <- numeric(N/subsize)
for (i in 1:(N/subsize)) {
    x.4[i] <- mean(x[((i-1)*subsize+1):(i*subsize)])
}
plot(x.4, ylim=x.range, main='Subgroups of 4')
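
The narrowing seen in the three plots can also be checked numerically. For independent samples with standard deviation σ, the mean of a subgroup of size n has standard deviation σ/√n. The following is a minimal sketch of that comparison; the seed, the helper `subgroup.means`, and the variable `sigma` are assumptions added for illustration and were not part of the class code.

```r
set.seed(42)   # assumed seed, used only so the run is reproducible
N <- 100
sigma <- 5
x <- rnorm(N, mean=80, sd=sigma)

# Average consecutive, non-overlapping subgroups.
# matrix() fills column by column, so each column holds one subgroup.
# Assumes length(x) is divisible by subsize.
subgroup.means <- function(x, subsize) {
    colMeans(matrix(x, nrow=subsize))
}

sizes <- c(1, 2, 4)
observed <- sapply(sizes, function(n) sd(subgroup.means(x, n)))
expected <- sigma / sqrt(sizes)

data.frame(subgroup.size=sizes, observed.sd=observed, theoretical.sd=expected)
```

With only 100 raw values the observed standard deviations will scatter around the theoretical ones rather than match them; increasing `N` should tighten the agreement.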

Code used to illustrate unpaired and paired tests:

#d.data <- c(11,18,16,20,12,8,26,12,17,14)
#m.data <- c(25,27,30,33,16,28,27,12,32,16)

d.data <- c(11,26,18,16,20,12,8,26,12,17,14)
m.data <- c(25,3,27,30,33,16,28,27,12,32,16)

d.n <- length(d.data)
m.n <- length(m.data)
d.mean <- mean(d.data)
m.mean <- mean(m.data)
d.sd <- sd(d.data)
m.sd <- sd(m.data)

# Unpaired difference
# -------------------
DOF <- m.n - 1 + d.n - 1
var.pooled <- ((m.n-1)*(m.sd)^2 + (d.n-1)*(d.sd)^2) / DOF

sample.diff <- m.mean - d.mean
denom.sd <-  sqrt(var.pooled * (1/m.n + 1/d.n))
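z <- sample.diff / denom.sd      # t-statistic for the difference in means

# Cumulative probability of observing a t-value at or below z
pt(z, df=DOF)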
ct <- qt(0.975, df=DOF)
CI.LB <- sample.diff - ct * denom.sd
CI.UB <- sample.diff + ct * denom.sd
c(CI.LB, CI.UB)

# Paired difference
# -------------------
diffs <- m.data - d.data

diffs.mean <- mean(diffs)
diffs.sd <- sd(diffs)
c(diffs.mean, diffs.sd)
diffs.N <- length(diffs)
t.crit <- qt(0.975, df=diffs.N-1)
t.crit
LB <- diffs.mean  - t.crit * diffs.sd / sqrt(diffs.N)
UB <- diffs.mean  + t.crit * diffs.sd / sqrt(diffs.N)
c(LB, UB)
c(LB, UB)
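
As a cross-check on the hand calculations above, the same intervals can be obtained from R's built-in `t.test`. This is a sketch rather than part of the class code: `var.equal=TRUE` is chosen because it corresponds to the pooled-variance calculation, and `paired=TRUE` because it corresponds to working with the element-wise differences. The names `unpaired` and `paired` are introduced here for illustration.

```r
d.data <- c(11,26,18,16,20,12,8,26,12,17,14)
m.data <- c(25,3,27,30,33,16,28,27,12,32,16)

# Unpaired test with pooled variance (matches the manual DOF and var.pooled steps)
unpaired <- t.test(m.data, d.data, var.equal=TRUE, conf.level=0.95)
unpaired$statistic    # t-statistic, called z in the manual code
unpaired$parameter    # degrees of freedom
unpaired$conf.int     # confidence interval for the difference in means

# Paired test (matches the manual calculation on m.data - d.data)
paired <- t.test(m.data, d.data, paired=TRUE, conf.level=0.95)
paired$conf.int
```

If the manual and built-in intervals disagree, the usual culprits are the order of the two samples (which flips the sign of the difference) or leaving `var.equal` at its default of `FALSE`, which gives the Welch interval instead of the pooled one.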