Assignment 6 - 2011

 

Assignment objectives

=
========

.. rubric:: Assignment objectives


 * To become more comfortable using R to fit, interpret and manipulate least squares models.
 * The questions in this assignment are typical of the exploratory/learning type questions that will be in the take-home midterm.

Question 1 [1.5]

=
====

Use the mature `cheddar cheese data set `_ for this question.


 * 1) .	Choose any :math:`x`-variable, either ``Acetic`` acid concentration (already log-transformed), ``H2S`` concentration (already log-transformed), or ``Lactic`` acid concentration (in original units) and use this to predict the ``Taste`` variable in the data set.  The ``Taste`` is a subjective measurement, presumably measured by a panel of tasters.

Prove that you get the same linear model coefficients, :math:`R^2`, :math:`S_E` and confidence intervals whether or not you first mean center the :math:`x` and :math:`y` variables.
 * 1) .	What is the level of correlation between each of the :math:`x`-variables. Also show a scatterplot matrix to learn what this level of correlation looks like visually.

*	Report your correlations as a :math:`3 \times 3` matrix, where there should be 1.0's on the diagonal, and values between :math:`-1` and :math:`+1` on the off-diagonals.


 * 1) .	Build a linear regression that uses all three :math:`x`-variables to predict :math:`y`.

-	Report the slope coefficient and confidence interval for each :math:`x`-variable -	Report the model's standard error. Has it decreased from the model in part 1? -	Report the model's :math:`R^2` value. Has it decreased?

Question 2 [2.5]

=
===

In this question we will revisit the `bioreactor yield `_ data set and fit a linear model with all :math:`x`-variables to predict the yield.


 * 1) .	Provide the interpretation for each coefficient in the model, and also comment on its confidence interval when interpreting it.


 * 1) .	Compare the 3 slope coefficient values you just calculated, to those from the last assignment:

-	:math:`\hat{y} = 102.5 - 0.69T`, where :math:`T` is tank temperature -	:math:`\hat{y} = -20.3 + 0.016S`, where :math:`S` is impeller speed -	:math:`\hat{y} = 54.9 - 16.7B`, where :math:`B` is 1 if baffles are present and :math:`B=0` with no baffles Explain why your coefficients do not match.
 * 1) .	Are the residuals from the multiple linear regression model normally distributed?


 * 1) .	In this part we are investigating the variance-covariance matrices used to calculate the linear model.

-	First center the :math:`x`-variables and the :math:`y`-variable that you used in the model. *Note*: feel free to use MATLAB, or any other tool to answer this question. If you are using R, then you will benefit from `this page in the R tutorial `_. Also, read the help for the ``model.matrix(...)`` function to get the :math:`\mathbf{X}`-matrix. Then read the help for the ``sweep(...)`` function, or more simply use the ``scale(...)`` function to do the mean-centering.

-	Show your calculated :math:`\mathbf{X}^T\mathbf{X}` and :math:`\mathbf{X}^T\mathbf{y}` variance-covariance matrices from the centered data.

-	Explain why the interpretation of covariances in :math:`\mathbf{X}^T\mathbf{y}` match the results from the full MLR model you calculated in part 1 of this question. -	Calculate :math:`\mathbf{b} =\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}` and show that it agrees with the estimates that R calculated (even though R fits an intercept term, while your :math:`\mathbf{b}` does not).
 * 1) .	What would be the predicted yield for an experiment run without baffles, at 4000 rpm impeller speed, run at a reactor temperature of 90 °C?

Question 3 [3]

=
=

In this question we will use the `LDPE data `_ which is data from a high-fidelity simulation of a low-density polyethylene reactor. LDPE reactors are very long, thin tubes. In this particular case the tube is divided in 2 zones, since the feed enters at the start of the tube, and some point further down the tube (start of the second zone). There is a temperature profile along the tube, with a certain maximum temperature somewhere along the length. The maximum temperature in zone 1, ``Tmax1`` is reached some fraction ``z1`` along the length; similarly in zone 2 with the ``Tmax2`` and ``z2`` variables.

We will build a linear model to predict the ``SCB`` variable, the short chain branching (per 1000 carbon atoms) which is an important quality variable for this product. Note that the last 4 rows of data are known to be from abnormal process operation, when the process started to experience a problem. However, we will pretend we didn't know that when building the model, so keep them in for now.

Use this code to get you started (make sure you understand what it is doing):: LDPE <- read.csv('http://datasets.connectmv.com/file/ldpe.csv') subdata <- data.frame(cbind(LDPE$Tmax1, LDPE$Tmax2, LDPE$z1, LDPE$z2, LDPE$SCB)) colnames(subdata) <- c("Tmax1", "Tmax2", "z1", "z2", "SCB") Using bullet points, describe the nature of relationships between the 5 variables, and particularly the relationship to the :math:`y`-variable.
 * 1) .	Use only the following subset of :math:`x`-variables: ``Tmax1``, ``Tmax2``, ``z1`` and ``z2`` and the :math:`y` variable = ``SCB``. Show the relationship between these 5 variables in a scatter plot matrix.
 * 1) .	Let's start with a linear model between ``z2`` and ``SCB``. We will call this the ``z2`` model.  Let's examine its residuals:

-	Are the residuals normally distributed? -	What is the standard error of this model? -	Are there any time-based trends in the residuals (the rows in the data are already in time-order)? -	Use any other relevant plots of the predicted values, the residuals, the :math:`x`-variable, as described in class, and diagnose the problem with this linear model. -	What can be done to fix the problem? (You don't need to implement the fix yet).

..	Non-constant error variance
 * 1) .	Show a plot of the hat-values (leverage) from the ``z2`` model.

-	Add suitable horizontal cut-off lines to your hat-value plot. -	Identify on your plot the observations that have large leverage on the model -	Remove the high-leverage outliers and refit the model. Call this the ``z2.updated`` model -	Show the updated hat-values and verify whether the problem has mostly gone away *Note*: see the R tutorial on how to rebuild a model by removing points


 * 1) .	Use the ``influenceIndexPlot(...)`` function in the ``car`` library on both the ``z2`` model and the ``z2.updated`` model. Interpret what each plot is showing for the two models.  You may ignore the *Bonferroni p-values*  subplot.