This website is out of date. The new site is at http://learnche.org/4C3

# Design and analysis of experiments (2012)

### From Statistics for Engineering

Class date(s):
| 8, 15, 22, 29 March 2012 | ||||

| |||||

| |||||

| |||||

| |||||

| |||||

| |||||

| |||||

| |||||

| |||||

| |||||

## Contents |

## [edit] Materials

- Audio: 08 March and 15 March and 22 March and 29 March
- Video from:
- Course notes (print chapter 5)
- Slides:

## [edit] Software code (8 March 2012)

You can use R to find the DoE model:

T <- c(-1, +1, -1, +1) # centered and scaled temperature S <- c(-1, -1, +1, +1) # centered and scaled substrate concentration y <- c(69, 60, 64, 53) # conversion is the response, y mod <- lm(y ~ T + S + T*S) summary(mod) Call: lm(formula = y ~ T + S + T * S) Residuals: ALL 4 residuals are 0: no residual degrees of freedom! Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 61.5 NA NA NA T -5.0 NA NA NA S -3.0 NA NA NA T:S -0.5 NA NA NA Residual standard error: NaN on 0 degrees of freedom Multiple R-squared: 1, Adjusted R-squared: NaN F-statistic: NaN on 3 and 0 DF, p-value: NA

You could also use MATLAB if you prefer:

% Set up the X matrix: n = 4; temperature = [-1, +1, -1, +1]; substrate = [-1, -1, +1, +1]; X = [ ones(n, 1), temperature', substrate', (temperature .* substrate)']; % or you can type it in directly (but this is error prone) X = [+1 -1 -1 +1; ... +1 +1 -1 -1; ... +1 -1 +1 -1; ... +1 +1 +1 +1]; % Conversion is your y-variable y = [69, 60, 64, 53]'; % Regression coefficients = b = [Intercept, b_T, b_S, b_{TS}] b = inv(X'*X)*X'*y

## [edit] 3-factor example (8 March 2012)

Could not pull and update the repository; please report this problem to site administrator.

The data are from a plastics molding factory which must treat its waste before discharge. The \(y\) variable represents the average amount of pollutant discharged [lb per day], while the 3 factors that were varied were:

- \(C\): the chemical compound added [A or B]
- \(T\): the treatment temperature [72°F or 100°F]
- \(S\): the stirring speed [200 rpm or 400 rpm]
- \(y\): the amount of pollutant discharged [lb per day]

Experiment Order \(C\) \(T\) [°F] \(S\) [rpm] \(y\) [lb] 1 5 A 72 200 5 2 6 B 72 200 30 3 1 A 100 200 6 4 4 B 100 200 33 5 2 A 72 400 4 6 7 B 72 400 3 7 3 A 100 400 5 8 8 B 100 400 4

We showed the cube plot for this system in the class on 10 March. From the cube plot we could already see the main factors, and even the **CS** interaction was noticeable.

**C effect**: There are 4 estimates of \(C = \displaystyle \frac{(+25) + (+27) + (-1) + (-1)}{4} = \frac{50}{4} = \bf{12.5}\)**T effect**: There are 4 estimates of \(T = \displaystyle \frac{(+1) + (+3) + (+1) + (+1)}{4} = \frac{6}{4} = \bf{1.5}\)**S effect**: There are 4 estimates of \(S = \displaystyle \frac{(-27) + (-1) + (-29) + (-1)}{4} = \frac{-58}{4} = \bf{-14.5}\)**CT interaction**: There are 2 estimates of \(CT\). Recall that interactions are calculated as the half difference going from high to low. Consider the change in \(C\) when- \(T_\text{high}\) (at \(S\) high) = 4 - 5 = -1
- \(T_\text{low}\) (at \(S\) high) = 3 - 4 = -1
- First estimate = [(-1) - (-1)]/2 = 0
- \(T_\text{high}\) (at \(S\) low) = 33 - 6 = +27
- \(T_\text{low}\) (at \(S\) low) = 30 - 5 = +25
- Second estimate = [(+27) - (+25)]/2 = +1
- Average
**CT**interaction = (0 + 1)/2 =**0.5** - You can interchange \(C\) and \(T\) and still get the same result.

**CS interaction**: There are 2 estimates of \(CS\). Consider the change in \(C\) when- \(S_\text{high}\) (at \(T\) high) = 4 - 5 = -1
- \(S_\text{low}\) (at \(T\) high) = 33 - 6 = +27
- First estimate = [(-1) - (+27)]/2 = -14
- \(S_\text{high}\) (at \(T\) low) = 3 - 4 = -1
- \(S_\text{low}\) (at \(T\) low) = 30 - 5 = +25
- Second estimate = [(-1) - (+25)]/2 = -13
- Average
**CS**interaction = (-13 - 14)/2 =**-13.5** - You can interchange \(C\) and \(S\) and still get the same result.

**ST interaction**: There are 2 estimates of \(ST\): (-1 + 0)/2 =**-0.5**, calculate in the same way as above.**CTS interaction**: There is only a single estimate of \(CTS\):- \(CT\) effect at high \(S\) = 0
- \(CT\) effect at low \(S\) = +1
- \(CTS\) interaction = [(0) - (+1)] / 2 =
**-0.5** - You can calculate this also by considering the \(CS\) effect at the two levels of \(T\)
- Or, you can calculate this by considering the \(ST\) effect at the two levels of \(C\).
- All 3 approaches give the same result.

Next, use computer software to verify that

The \(\mathbf{X}\) matrix and \(\mathbf{y}\) vector used to calculate the least squares model:

# Create the design matrix in a quick way in R C <- T <- S <- c(-1, +1) design <- expand.grid(C=C, T=T, S=S) design C T S 1 -1 -1 -1 2 1 -1 -1 3 -1 1 -1 4 1 1 -1 5 -1 -1 1 6 1 -1 1 7 -1 1 1 8 1 1 1 C <- design$C T <- design$T S <- design$S y <- c(5, 30, 6, 33, 4, 3, 5, 4) # Full factorial model (error prone) mod.full <- lm(y ~ C + T + S + C*T + C*S + S*T + C*T*S) # or the much more powerful notation will expand all terms up to the 3rd order interactions mod.full <- lm( y ~ (C + T + S)^3 ) summary(mod.full) Call: lm(formula = y ~ C + T + S + C * T + C * S + S * T + C * T * S) Residuals: ALL 8 residuals are 0: no residual degrees of freedom! Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.25 NA NA NA C 6.25 NA NA NA T 0.75 NA NA NA S -7.25 NA NA NA C:T 0.25 NA NA NA C:S -6.75 NA NA NA T:S -0.25 NA NA NA C:T:S -0.25 NA NA NA # Guide to estimating significant coefficients, ignoring the intercept (1st coefficient) coeff.full <- coef(mod.full)[2:length(coef(mod.full))] # Pareto plot of the absolute coefficients library(lattice) barchart(sort(abs(coeff.full)), xlab="Magnitude of effect", groups=coeff.full>0, col=c("lightblue", "orange")) # Another alternative: qqPlot identifying the 4 largest coefficients) library(car) qqPlot(coeff.full, id.n=length(coeff.full), envelope=FALSE) # Build a partial model with only the significant effects mod.partial <- lm(y ~ C + S + C*S) summary(mod.partial) # notice the standard errors are all the same; why? # and check confidence intervals confint(mod.partial) # Try to add back coefficients that were possibly significant and recheck confidence intervals # The temperature, T, was the next highest coefficient: mod.revised <- update(mod.partial, . ~ . + T) summary(mod.revised) # we see it is now significant at the 90 to 95% level

## [edit] Example from class (15 March 2012)

Could not pull and update the repository; please report this problem to site administrator.

Your group is developing a new product, but have been struggling to get the product's stability, measured in days, to the level required. You are aiming for a stability value of 50 days or more. Four factors have been considered:

**A**= monomer concentration: 30% or 50%**B**= acid concentration: low or high**C**= catalyst level: 2% or 3%**D**= temperature: 393K or 423K

These eight experiments have been run so far:

Experiment | Order | A | B | C | D | Stability |
---|---|---|---|---|---|---|

1 | 5 | \(-\) | \(-\) | \(-\) | \(-\) | 40 |

2 | 6 | \(+\) | \(-\) | \(-\) | \(+\) | 27 |

3 | 1 | \(-\) | \(+\) | \(-\) | \(+\) | 35 |

4 | 4 | \(+\) | \(+\) | \(-\) | \(-\) | 21 |

5 | 2 | \(-\) | \(-\) | \(+\) | \(+\) | 39 |

6 | 7 | \(+\) | \(-\) | \(+\) | \(-\) | 27 |

7 | 3 | \(-\) | \(+\) | \(+\) | \(-\) | 27 |

8 | 8 | \(+\) | \(+\) | \(+\) | \(+\) | 20 |

- How was the experimented generated?
- What is the defining relationship?
- What will be aliased with
**A**; with**D**and with**BC**? - Describe the aliasing structure (resolution)?
- What is the model's intercept; main effect for
**A**; and for the**AD**interaction? - If the least squares model is: \(y = 29.5 -5.75x_A -3.75 x_B -1.25 x_C + 0.75 x_D + 0.50 x_A x_B + 1.0 x_A x_C - 1.0 x_A x_D\) what is the predicted stability when operating at:
- monomer concentration of 25%
- low acid concentration
- 1.5% catalyst level
- a temperature of 408 K

## [edit] Other class examples (15 March 2012)

A <- B <- C<- D <- c(-1, 1) d <- expand.grid(A=A, B=B, C=C, D=D) y <- c(45,71,48,65,68,60,80,65,43,100,45,104,75,86,70,96) A <- d$A B <- d$B C <- d$C D <- d$D mod.full <- lm(y ~ (A + B + C + D)^4 ) coeff.full <- coef(mod.full)[2:length(coef(mod.full))] # Pareto plot of the absolute coefficients library(lattice) barchart(sort(abs(coeff.full)), xlab="Magnitude of effect", groups=coeff.full>0, col=c("lightblue", "orange")) # Another alternative: qqPlot identifying the 4 largest coefficients) library(car) qqPlot(coeff.full, id.n=length(coeff.full), envelope=FALSE) # Build a partial model with only the significant effects mod.partial <- lm(y ~ A + C + D + A*C + A*D) summary(mod.partial) # and check confidence intervals confint(mod.partial) # Try to add back coefficients that were possibly significant and recheck confidence intervals # The temperature, T, was the next highest coefficient: mod.revised <- update(mod.partial, . ~ . + B + A*B*D) summary(mod.revised) confint(mod.revised) #--------------- A <- B <- C<- D <- c(-1, 1) g <- expand.grid(A=A, B=B, C=C) y = c(40,27, 35, 21, 39, 27, 27, 20) A <- g$A B <- g$B C <- g$C D <- A*B*C D # [1] -1 1 1 -1 1 -1 -1 1 mod.doe <- lm(y ~ A +B+ C+ D+A*B+A*C+A*D) summary(mod.doe)

## [edit] General contour plots (22 March 2012)

Try running your own set of response surface optimizations for these processes:

- A simple optimum: \(y = 83 +9.4x_A +7.1x_B -6 x_A x_B - 7.4 x_A^2 -3.7 x_B^2\)
- A stationary ridge: \(y = 83 + 10.0x_A + 5.6x_B - 7.6 x_A x_B - 6.9 x_A^2 - 2.0 x_B^2\)
- A rising ridge: \(y = 83 + 8.8x_A + 8.2x_B - 7.6 x_A x_B - 7.0 x_A^2 - 2.4 x_B^2\)
- A saddle point: \(y = 83 + 11.1x_A + 4.1x_B - 9.4 x_A x_B - 6.5 x_A^2 - 0.4 x_B^2\)

Binary variables should be treated exactly like continuous variables in all respects during the optimization. The only difference is that it makes sense to visualize and use them at the two levels.

bound <- 4.2 N <- 150 # resolution of surface (higher values give smoother plots) # The lower and upper bounds, in coded units, over which the surface is visualized A <- seq(-bound, bound ,length=N) B <- seq(-bound, bound, length=N) grd <- expand.grid(A=A, B=B) # A simple optimum: grd$y <- 83 + 9.4*grd$A + 7.1*grd$B - 6*grd$A*grd$B - 7.4*grd$A**2 - 3.7*grd$B**2 # A stationary ridge: grd$y = 83 + 10.0*grd$A + 5.6*grd$B - 7.6*grd$A*grd$B - 6.9*grd$A**2 - 2.0*grd$B**2 # A rising ridge: grd$y = 83 + 8.8*grd$A + 8.2*grd$B - 7.6*grd$A*grd$B - 7.0*grd$A**2 - 2.4*grd$B**2 # A saddle point: grd$y = 83 + 11.1*grd$A + 4.1*grd$B + 9.4*grd$A*grd$B - 6.5*grd$A**2 - 0.4*grd$B**2 library(lattice) contourplot(y ~ A * B, data = grd, cuts = 30, region = TRUE, col.regions = terrain.colors, xlab = "A", ylab = "B", main = "Predicted response to factors A and B, in coded units" )

## [edit] Contour plots for response surface methods (22 March 2012)

# Initial set of 4 runs T <- c(-1, +1, -1, +1, 0) S <- c(-1, -1, +1, +1, 0) y <- c(193, 310, 468, 571, 407) # After first step T <- c(-1, +1, -1, +1, 0, 1) S <- c(-1, -1, +1, +1, 0, 2.43) y <- c(193, 310, 468, 571, 407, 669) # Second factorial around point 6 as baseline, with points 8, 9, 10, 11 T <- c(-1, +1, -1, +1, 0) S <- c(-1, -1, +1, +1, 0) y <- c(694, 725, 620, 642, 688) # The rest of the code is common to all factorial models # ------------------------------------------------------ mod <- lm(y ~ T + S + T*S) summary(mod) N <- 50 # resolution of surface (higher values give smoother plots) # The lower and upper bounds, in coded units, over which we want # to visualize the surface bound <- 4 T_plot <- seq(-bound, bound ,length=N) S_plot <- seq(-bound, bound, length=N) grd <- expand.grid(T=T_plot, S=S_plot) # Predict directly from least squares model grd$y <- predict(mod, grd) library(lattice) contourplot(y ~ T * S, data = grd, cuts = 10, region = TRUE, col.regions = terrain.colors, xlab = "Temperature", ylab = "Substrate", main = "Predicted response to factors T and S, in coded units" ) trellis.focus("panel", 1, 1, highlight=FALSE) lpoints(T, S, pch="O", col="red", cex=2) trellis.unfocus() # Add point 13 and 14 (you don't need to do all the CCD experiments) T <- c(-1, +1, -1, +1, 0, 0, +sqrt(2)) S <- c(-1, -1, +1, +1, 0, -sqrt(2), 0) y <- c(694, 725, 620, 642, 688, 720, 699) # Add all CCD points: then add 15, 16 T <- c(-1, +1, -1, +1, 0, 0, +sqrt(2), 0, -sqrt(2)) S <- c(-1, -1, +1, +1, 0, -sqrt(2), 0, sqrt(2), 0) y <- c(694, 725, 620, 642, 688, 720, 699, 610, 663) # Add quadratic terms to the model mod.quad <- lm(y ~ T + S + T*S + I(S^2) + I(T^2)) summary(mod.quad) # Predict directly from least squares model grd$y <- predict(mod.quad, grd) library(lattice) contourplot(y ~ T * S, data = grd, cuts = 20, region = TRUE, col.regions = terrain.colors, xlab = "Temperature", ylab = "Substrate", main = "Predicted response to factors T and S, in coded units" ) trellis.focus("panel", 1, 1, highlight=FALSE) lpoints(T, S, pch="O", col="red", cex=2) trellis.unfocus()