Assignment 2 - 2011 - Solution

 

.. rubric:: Assignment objectives

- A review of basic probability, histograms and sample statistics.
- Collect data from multiple sources, consolidate it, and analyze it.
- Deal with issues that are prevalent in real data sets.
- Improve your skills with R (if you are using R for the course).


**Notes**:

- I would normally expect you to spend between 3 and 5 hours outside of class on assignments. This assignment should take about that long. Answer with bullet points, not in full paragraphs.
- **Numbers in bold** next to the question are the grading points. Read more about the `assignment grading system `_.
- 600-level students must complete all the questions; 400-level students may attempt the 600-level question for extra credit. Also, 600-level students must read the paper by PJ Rousseeuw, "`Tutorial to Robust Statistics `_".

Question 1 [1]
==============

Recall from class that :math:`\mu = \mathcal{E}(x) = \frac{1}{N}\sum{x}` and :math:`\mathcal{V}\left\{x\right\} = \mathcal{E}\left\{ (x - \mu )^2\right\} = \sigma^2 = \frac{1}{N}\sum{(x-\mu)^2}`.

#. What is the expected value of a throw of a fair 12-sided die?
#. What is the expected variance of a throw of a fair 12-sided die?
#. Simulate 10,000 throws of this die in R, MATLAB, or Python and see if your answers match those above. Record the average value from the 10,000 throws.
#. Repeat the simulation for the average value of the die a total of 10 times. Calculate and report the mean and standard deviation of these 10 simulations and *comment* on the results.

Solution

The objective of this question is to recall basic probability rules.


#. Let :math:`X` represent a discrete random variable for the event of throwing a fair die. Let :math:`x_{i}` for :math:`i=1,\ldots,12` represent the numerical or realized values of the outcome of the random event given by :math:`X`. Now we can define the expected value of :math:`X` as,

.. math:: \mathcal{E}(X)=\sum_{i=1}^{12}x_{i}P(x_{i})

where the probability of obtaining a value of :math:`1,\ldots,12` is :math:`P(x_{i})=1/N=1/12 \;\forall\; i=1,\ldots,12`. So, we have,

.. math:: \mathcal{E}(X)=\frac{1}{N}\sum_{i=1}^{12}x_{i}=\frac{1}{12}\left(1+2+\cdots+12\right)=\bf{6.5}


#. Continuing the notation from the above question we can derive the expected variance as,

.. math::

    \mathcal{V}(X)&=\mathcal{E}\left\{[X-\mathcal{E}(X)]^{2}\right\}\\
                  &=\mathcal{E}(X^{2})-[\mathcal{E}(X)]^{2}

where :math:`\mathcal{E}(X^{2})=\sum_{i}x_{i}^{2}P(x_{i})`. So we can now calculate :math:`\mathcal{V}(X)` as,

.. math::

    \mathcal{V}(X)&=\sum_{i=1}^{12}x_{i}^{2}P(x_{i})-\left[\sum_{i=1}^{12}x_{i}P(x_{i})\right]^{2}\\
                  &=\frac{1}{12}(1^{2}+2^{2}+\cdots+12^{2}) - [6.5]^{2}\approx \bf{11.9167}


#. Simulating 10,000 throws corresponds to 10,000 independent random events, each with mutually exclusive outcomes in the set :math:`\mathcal{S}=\{1,2,\ldots,12\}`. The sample mean and variance from my sample were:

.. math::

    \overline{x} &= 6.4925\\
    s^2 &= 11.77915

.. twocolumncode::
    :code1: ../che4c3/Assignments/Assignment-2/code/q1c.R
    :language1: s
    :header1: R code
    :code2: ../che4c3/Assignments/Assignment-2/code/q1c.m
    :language2: matlab
    :header2: MATLAB code


#. Repeating the above simulation 10 times (i.e., 10 independent experiments) produces 10 different estimates of :math:`\mu` and :math:`\sigma^2`. Note, everyone's answer should be slightly different, and different each time you run the simulation.

.. twocolumncode::
    :code1: ../che4c3/Assignments/Assignment-2/code/q1d.R
    :language1: s
    :header1: R code
    :code2: ../che4c3/Assignments/Assignment-2/code/q1d.m
    :language2: matlab
    :header2: MATLAB code

Note that each :math:`\overline{x} \sim \mathcal{N}\left(\mu, \sigma^2/n \right)`, where :math:`n = 10000`. We know what :math:`\sigma^2` is in this case: it is our theoretical value of **11.92**, calculated earlier, and for :math:`n=10000` samples, our :math:`\overline{x} \sim \mathcal{N}\left(6.5, 0.00119167\right)`.

Calculating the average of those 10 means, let's call that :math:`\overline{\overline{x}}`, shows values around 6.5, the theoretical mean.

Calculating the variance of those 10 means shows numbers that are around 0.00119167, as expected.
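
For readers working in Python rather than with the R and MATLAB files referenced above, the whole of this question could be sketched roughly as follows. This is only an illustrative sketch, not the course's reference code: the function names ``exact_moments`` and ``simulate_mean`` and the fixed seed are assumptions introduced here so the run is reproducible.

```python
import random
import statistics
from fractions import Fraction


def exact_moments(sides=12):
    """Exact mean and variance of one throw of a fair die."""
    p = Fraction(1, sides)                       # P(x_i) = 1/N for every face
    mean = sum(x * p for x in range(1, sides + 1))
    mean_sq = sum(x * x * p for x in range(1, sides + 1))
    return mean, mean_sq - mean ** 2             # V(X) = E(X^2) - [E(X)]^2


def simulate_mean(rng, throws=10_000, sides=12):
    """Average value from `throws` simulated throws of the die."""
    return statistics.fmean(rng.randint(1, sides) for _ in range(throws))


mu, sigma2 = exact_moments()                     # exact fractions, no rounding
rng = random.Random(2011)                        # arbitrary seed (assumption)
means = [simulate_mean(rng) for _ in range(10)]  # 10 independent experiments

print("exact mean and variance:", float(mu), float(sigma2))
print("mean of the 10 averages:", statistics.fmean(means))
print("variance of the 10 averages:", statistics.variance(means))
print("theoretical sigma^2/n:", float(sigma2) / 10_000)
```

With only 10 averages, the variance estimate is itself quite noisy, so it should be of the same order of magnitude as :math:`\sigma^2/n` rather than match it closely.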

Question 2 [1.5]
================

In class last week I mentioned an example of independence. I said that if I take a student's grade for each question in an exam, then the average of those :math:`N` grades will be normally distributed, even if the grades for the individual questions are not. For example: if there are 10 questions, and your grades for the questions were 100%, 95%, 26%, 78%, ... 87%, then your average will be as if it came from a normal distribution.


#. This example was faulty: what was wrong with my reasoning?
#. 600-level students: However, when I look at the average grades from any exam, without fail they are always normally distributed. What's going on here?

Solution


#. Unfortunately, I chose my example in class too hastily, without thinking about the details. The grades within one student's exam are not independent, because that student (as long as they are not receiving external help) will likely do well in all questions, or poorly in all questions, or well only in the section(s) they have studied. So each student's grades for the individual questions will be related.


#. **600-level** students: The central limit theorem tells us that if we take samples from *any distribution with finite variance* (each question in the exam has a different distribution, but has finite variance), then the average of those values (the average grade of each student) will be normally distributed, as long as we took our samples independently (which we did not have for the grades example).

So we are only breaking the independence assumption of the central limit theorem. That means we should take a look at why we've assumed independence between two sampled values. To do this, first let's look at the case when we do have independence, and for simplicity, let's assume every question in the exam also had a normal distribution with the same mean, :math:`\mu`, and the same variance, :math:`\sigma^2` (really restrictive, but you will see why in a minute). We know that this case leads to:

.. math::

    \overline{x}_j \sim \mathcal{N}(\mu, \sigma^2/N)

which says the average grade for student :math:`j`, call it :math:`\overline{x}_j`, comes from a normal distribution with that mean :math:`\mu`, and with variance :math:`\sigma^2/N`, where :math:`N` is the total number of questions. This is the usual formula we have seen in class; but where did this formula come from? Recall that:

.. math::

    \overline{x}_j = \frac{1}{N}x_{j,1} + \frac{1}{N}x_{j,2} + \ldots + \frac{1}{N}x_{j,N}

where each student, :math:`j`, obtained a grade for each question :math:`n = 1, 2, \ldots, N`. Let's call that grade :math:`x_{j,n}`, and recall that we have assumed :math:`x_{j,n} \sim \mathcal{N}(\mu, \sigma^2)`. The mean and variance of :math:`\overline{x}_j`, *crucially assuming independence* between each :math:`x_{j,n}` value, can then be found from:

.. math::

    \mathcal{E}(\overline{x}_j) &= \mathcal{E}\left(\frac{1}{N}x_{j,1} + \frac{1}{N}x_{j,2} + \ldots + \frac{1}{N}x_{j,N} \right) \\
        &= \frac{1}{N}\mathcal{E}(x_{j,1}) + \frac{1}{N}\mathcal{E}(x_{j,2}) + \ldots + \frac{1}{N}\mathcal{E}(x_{j,N}) \\
        &= \frac{1}{N}\mu + \frac{1}{N}\mu + \ldots + \frac{1}{N}\mu \\
        &= \mu \qquad\text{(this is expected)}\\
    \mathcal{V}(\overline{x}_j) &= \mathcal{V}\left(\frac{1}{N}x_{j,1} + \frac{1}{N}x_{j,2} + \ldots + \frac{1}{N}x_{j,N} \right) \\
        &= \frac{1}{N^2}\mathcal{V}(x_{j,1}) + \frac{1}{N^2}\mathcal{V}(x_{j,2}) + \ldots + \frac{1}{N^2}\mathcal{V}(x_{j,N}) \qquad\text{(this is why we require independence)}\\
        &= \frac{N}{N^2}\sigma^2 \\
        &= \frac{\sigma^2}{N}

This also explains where the :math:`\sigma^2/N` term, used in the :math:`t`-distribution, comes from. The above derivation relies on two properties you should be familiar with (see a good stats textbook, e.g. Box, Hunter and Hunter):

.. math::

    \mathcal{V}(x + y) &= \mathcal{V}(x) + \mathcal{V}(y) + 2 \text{Cov}(x, y)\\
    \mathcal{V}(x + y) &= \mathcal{V}(x) + \mathcal{V}(y) + 2\mathcal{E}\big[(x - \mathcal{E}[x])(y - \mathcal{E}[y])\big]\\
    \mathcal{V}(ax) &= a^2\mathcal{V}(x)

and independence implies that :math:`\text{Cov}(x, y) = 0`.

So relaxing our assumption of independent :math:`x_{j,n}` values shows that we cannot combine the variances in an easy way, but we do see that the correct variance will be a larger number (if the student grades within an exam are positively correlated - the usual case), or a smaller number (if the grades are negatively correlated within each student's exam).

Also, relaxing the assumption that each question has the same variance, we just replace :math:`\sigma^2` with :math:`\sigma^2_n` in the formula for :math:`\mathcal{V}(\overline{x}_j)`. Relaxing the assumption of equal means, :math:`\mu`, for each question requires we use :math:`\mu_n` instead of :math:`\mu`. Note that :math:`\sigma^2_n` and :math:`\mu_n` can come from *any distribution*, not just the normal distribution. But the central limit theorem tells us the average grade for a student, :math:`\overline{x}_j`, will be as if it came from a normal distribution. However, because we do not have independence, and we don't know the individual :math:`\sigma^2_n` and :math:`\mu_n` values for each question, we cannot *estimate the parameters* of that normal distribution.

So, to conclude: it is correct that the average grades from the exam for every student will be as if they came from a normal distribution, only we can't calculate (estimate) that distribution's parameters. I always find my course grades to be normally distributed when examining the qq-plot.
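
To make the effect of the covariance term concrete, the two-variable identity above extends to :math:`N` terms as :math:`\mathcal{V}\left(\sum_n a_n x_n\right) = \sum_n \sum_m a_n a_m \text{Cov}(x_n, x_m)`. The sketch below evaluates that double sum under a deliberately simple assumption that is not part of the original solution: every question has the same variance and every pair of questions shares one common correlation ``rho``. The numbers used (a variance of 400 and 10 questions) are hypothetical.

```python
def variance_of_mean(sigma2, rho, n_questions):
    """V(x-bar) for equally weighted grades with common variance sigma2
    and a common correlation rho between every pair of questions."""
    weight = 1.0 / n_questions
    total = 0.0
    for n in range(n_questions):
        for m in range(n_questions):
            # Cov(x_n, x_m): the variance on the diagonal, rho*sigma2 elsewhere
            cov = sigma2 if n == m else rho * sigma2
            total += weight * weight * cov
    return total


sigma2, N = 400.0, 10   # hypothetical: sd of 20 grade points, 10 questions
for rho in (0.0, 0.3, -0.05):
    print(rho, variance_of_mean(sigma2, rho, N))
```

With ``rho = 0`` the result reduces to :math:`\sigma^2/N`; a positive correlation gives a larger value and a negative one a smaller value, in line with the argument above.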

Question 3 [1]
==============

Write a few *bullet point* notes on the purpose of feedback control, and its effect on variability of process quality.

Solution

- Purpose is to keep the process close to a desired set point (or mean).
- Sometimes used to maintain the process variability within a desired tolerance limit (or standard deviation).
- Lowers the variability of the process outputs (i.e., narrows the distribution) by actually introducing *greater* variability into the process, to counteract external variation in the process inputs. For example, variation from the raw materials, or ambient conditions such as seasonal temperature, are process inputs.
- Feedback control allows us to move the process operation closer to targets, with less likelihood of deviation outside these limits. (In the next section on process monitoring we will learn how to track and quantify this.)
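
The trade-off in the third bullet can be illustrated with a toy simulation. Everything in this sketch is an assumption made for illustration only: the process model (output = manipulated variable + disturbance), the slowly drifting AR(1) disturbance, the controller gain and the seed are hypothetical and do not come from the course material.

```python
import random
import statistics


def simulate(gain, steps=5000, phi=0.95, seed=1):
    """Return (outputs, controller settings) for a toy process y = u + d."""
    rng = random.Random(seed)
    d = 0.0          # slowly drifting input disturbance, AR(1)
    u = 0.0          # manipulated variable set by the controller
    outputs, settings = [], []
    for _ in range(steps):
        d = phi * d + rng.gauss(0.0, 1.0)
        y = u + d                # measured process output (set point is 0)
        outputs.append(y)
        settings.append(u)
        u = u - gain * y         # feedback: move u against the deviation
    return outputs, settings


y_open, u_open = simulate(gain=0.0)      # no feedback: u stays at 0
y_closed, u_closed = simulate(gain=0.5)  # simple integral-type feedback

print("output variance without control:", statistics.pvariance(y_open))
print("output variance with control:   ", statistics.pvariance(y_closed))
print("variance of controller settings:", statistics.pvariance(u_closed))
```

Both runs use the same seed, so they see the same disturbance sequence. In this toy setup the controlled output should vary much less than the uncontrolled one, while the manipulated variable, which was constant without control, now varies considerably.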

Question 4 [1.5]
================

The ammonia concentration in your wastewater treatment plant is measured every 6 hours. The data for one year are available from the `dataset website `_.


#. Use a visualization plot to hypothesize from which distribution the data might come. Which distribution do you think is most likely?
#. Confirm your answer using a suitable plot.
#. Estimate location and spread statistics assuming the data are from a normal distribution.
#. What if I told you that these measured values are not independent? How does it affect your answer?
#. What is the probability of having an ammonia concentration greater than 40 mg/L when:

   - you may use only the data (do not use *any* estimated statistics)
   - you use the estimated statistics for the distribution?

**Note**: Answer this entire question using computer software to calculate values from the normal distribution. But also make sure you can answer the last part of the question by hand, if given the mean and variance, and using the `table of normal distributions `_. Print out the table and bring it with you to the midterm and final exam. The computer answer should agree with your hand-calculated value.

Solution


#. To visualize/hypothesize which distribution the data might come from, use a histogram, a plot of the estimated frequency density, or simply a comparison of a histogram and normal PDF. We show a combined histogram and normal PDF as follows:

.. figure:: ../che4c3/Assignments/Assignment-2/images/Q4-histogram-npdf.png
    :alt: code/q4.R
    :scale: 60
    :width: 500px
    :align: center


#. An appropriate distribution appears to be the normal distribution; however, the right-hand (upper) tail in the plot shown below is slightly heavier than would be found in a normal distribution, with points falling outside the given limits. This may have a small effect on our results, by giving an estimated standard deviation that is larger than it would have been for a truly normal distribution.

.. figure:: ../che4c3/Assignments/Assignment-2/images/Q4-qqplot.png
    :alt: code/q4.R
    :scale: 60
    :width: 500px
    :align: center


#. Assuming the data are normal, we can calculate the distribution's parameters as :math:`\bar{x} = \hat{\mu} = 36.1` and :math:`s= \hat{\sigma} = 8.52`.


#. The fact that the *data* are not independent is not an issue. To calculate estimates of the distribution's parameters (the mean and standard deviation) we do not need to assume independence. One way to see this: if I randomly reorder the data, I will still get the same value for the mean and standard deviation. The assumption of independence is required for the central limit theorem, but we have not used that theorem here.


#. The probability of having an ammonia concentration greater than 40 mg/L:

   - when using only the data: 34.4% (see code below)
   - when using the estimated parameters of the distribution: 32.3% (see code below)

We should use the :math:`t`-distribution to answer the last part, but at this stage we had not yet looked at the :math:`t`-distribution. However, the large number of observations (1440) means the :math:`t`-distribution is no different than the normal distribution. But note that the :math:`t`-distribution requires the assumption that the data are normally distributed, and independent. We are better off *using the raw data* to estimate the probability in this case, without making these restrictive assumptions.
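
As a cross-check on the hand calculation requested in the question, the two routes can be sketched in Python. This is a hedged illustration rather than the course code: ``fraction_above`` is a hypothetical helper that would be applied to the ammonia measurements (which are not reproduced here), and the normal-distribution part uses the rounded estimates quoted in the solution, so the last digit may differ slightly from the value reported there.

```python
from statistics import NormalDist


def fraction_above(values, limit):
    """Empirical estimate: share of observations strictly above `limit`."""
    return sum(v > limit for v in values) / len(values)


# Rounded estimates reported in the solution: mean 36.1, standard deviation 8.52
fitted = NormalDist(mu=36.1, sigma=8.52)
p_normal = 1.0 - fitted.cdf(40.0)        # P(x > 40) under the fitted normal

# By hand: standardize first, then look up the area in the normal table
z = (40.0 - 36.1) / 8.52
p_table = 1.0 - NormalDist().cdf(z)

print("z =", round(z, 3))
print("P(x > 40), fitted normal:", round(p_normal, 3))
print("P(x > 40), via z-value:  ", round(p_table, 3))
```

The two normal-based numbers agree because standardizing and then using the standard normal table is the same calculation as evaluating the fitted distribution directly.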

.. twocolumncode::
    :code1: ../che4c3/Assignments/Assignment-2/code/q4.R
    :language1: s
    :header1: R code
    :code2: ../che4c3/Assignments/Assignment-2/code/q4.m
    :language2: matlab
    :header2: MATLAB code

Question 5 [1]
==============

One of the questions we posed at the start of the course was: `given the yields `_ from a batch bioreactor system for the last 3 years (300 data points; we run a new batch about every 3 to 4 days).


#. What sort of distribution do the yield data have?
#. A recorded yield value was less than 60%; what are the chances of that occurring? Express your answer as: *there's a 1 in* :math:`X` *chance* of it occurring.
#. Which assumptions do you have to make for the second part of this question?

Solution


#. Assume the 300 data points represent an entire population. Plot a ``qqPlot(...)`` using the ``car`` package (you could also try using a normal probability plot, e.g. the ``probplot`` function from the ``e1071`` package):

.. figure:: ../che4c3/Assignments/Assignment-2/images/Q5-qqplot.png
    :alt: code/q5.R
    :scale: 60
    :width: 500px
    :align: center

Since the points in the above plot agree with the reference line, we can conclude the data follow a normal distribution.


#. We need to find the probability that the yield, :math:`Y`, is less than or equal to 60, stated as :math:`P(Y\le 60)`. If we assume :math:`Y \sim \mathcal{N}(\mu,\sigma^{2})` then we first need to find the :math:`z`-value bound corresponding to 60, and then find the probability of finding values below, or equal to that bound.

.. math::

    z_\text{bound} = \frac{y-\mu}{\sigma} = \frac{60-80.353}{6.597} = -3.085

In this data set of 300 numbers there are zero entries below this limit. But using the distribution's fit, we can calculate the probability as ``pnorm(-3.085)``, which is :math:`\approx 0.001`. This is equivalent to saying that there is a *1 in 1000 chance* of achieving a yield less than 60\%.


#. We only had to assume the data are normally distributed - we did not need the data to be independent - in order to use the estimated parameters from the distribution to calculate the probability.

.. twocolumncode::
    :code1: ../che4c3/Assignments/Assignment-2/code/q5.R
    :language1: s
    :header1: R code
    :code2: ../che4c3/Assignments/Assignment-2/code/q5.m
    :language2: matlab
    :header2: MATLAB code
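
The :math:`z`-value calculation in the second part can also be checked with a few lines of Python. This is a small sketch alongside the R and MATLAB files referenced above, using the mean and standard deviation quoted in the solution; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

mean_yield, sd_yield = 80.353, 6.597       # estimates quoted in the solution
z_bound = (60.0 - mean_yield) / sd_yield   # standardized bound for a 60% yield
p_below = NormalDist().cdf(z_bound)        # same quantity as pnorm(z_bound) in R

print("z =", round(z_bound, 3))
print("P(Y <= 60) =", round(p_below, 4))
print("about 1 in", round(1.0 / p_below))
```

The result should be close to the 1 in 1000 figure quoted above, which rounds the probability to 0.001.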

Question 6 [1]
==============

Use the section on `Historical data `_ from Environment Canada's website and use the ``Customized Search`` option to obtain data for the ``HAMILTON A`` station from 1990 to 2000. Use the settings as ``Year=1990``, and ``Data interval=Monthly`` and request the data for 1990, then click ``Next year`` to go to 1991 and so on.

- For each year from 1990 to 2000, record the total snowfall and the average of the ``Mean temp`` column over the 12 months (the sums and averages are reported at the bottom of the table).

.. Snow:    131.2, 128.0, 130.7, 190.6, 263.4, 138.0, 207.3, 161.5, 78.8, 166.5, 170.9

.. MeanTemp:  8.6,   8.6,   6.9,   7.1,   7.1,   7.7,   6.9,   7.3,  9.8,   8.8,   7.6

- Plot these 2 variables against time.
- Perform a test to verify whether the 11 total snowfall values are possibly from a normal distribution.
- Now retrieve the long-term averages for these data `from a different section of their website `_ (use the same location, ``HAMILTON A``, and check that the data range is 1961 to 1990). Superimpose the long-term average as a horizontal line on your previous plot.

**Note**: the purpose of this exercise is more for you to become comfortable with web-based data retrieval, which is common in most companies.

Solution


- The snow data are :math:`[131.2, 128.0, 130.7, 190.6, 263.4, 138.0, 207.3, 161.5, 78.8, 166.5, 170.9]`.

  The mean temperature data are :math:`[8.6,  8.6,   6.9,   7.1,   7.1,   7.7,   6.9,   7.3,  9.8,   8.8,   7.6]`.
- A plot against time for these two, with the long-term average (1961 to 1990 values) superimposed:

.. image:: ../che4c3/Assignments/Assignment-2/images/q1-snowfall-data.png
    :alt: ../../Assignments/Assignment-2/code/weather-data.R
    :scale: 15
    :align: center

.. image:: ../che4c3/Assignments/Assignment-2/images/q1-temperature-data.png
    :alt: ../../Assignments/Assignment-2/code/weather-data.R
    :scale: 15
    :align: center

where the 1961 to 1990 average snowfall was 152.4 cm per year, and the average temperature was 7.6 Celsius.


- The qq-plot for the snowfall data shows that these 11 values could quite possibly come from a normal distribution.

.. image:: ../che4c3/Assignments/Assignment-2/images/q1-qqplot-snow.png
    :alt: ../../Assignments/Assignment-2/code/weather-data.R
    :scale: 47
    :width: 750px
    :align: center
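
The coordinates behind a q-q plot like the one above can be computed without any plotting library. The sketch below uses the 11 snowfall totals listed in this solution. The plotting positions :math:`(i - 0.5)/n` are one common convention and are an assumption here, since the plotting routine used for the figure may use a slightly different one. The correlation of the points is offered only as a rough numerical companion to the plot, not as a formal normality test.

```python
from statistics import NormalDist, fmean

snow = [131.2, 128.0, 130.7, 190.6, 263.4, 138.0,
        207.3, 161.5, 78.8, 166.5, 170.9]          # cm per year, 1990 to 2000


def qq_points(values):
    """Pair each sorted value with its theoretical standard-normal quantile."""
    n = len(values)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5)/n: an assumed, commonly used convention
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, sorted(values)))


def correlation(pairs):
    """Pearson correlation between the two coordinates of the points."""
    xs, ys = zip(*pairs)
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


points = qq_points(snow)
print("q-q correlation:", round(correlation(points), 3))
print("mean snowfall 1990 to 2000:", round(fmean(snow), 1), "cm")
```

A correlation close to 1 means the points lie near a straight line, which is what the plot shows; with only 11 values this can only suggest, not prove, normality. The printed mean can be compared against the 152.4 cm long-term average quoted above.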