

bob bundy
2013-04-25 05:08:57

Variance formulas.

Sorry, this has turned out to be quite a long post.  Hope you manage to stay awake to the end.

The definition formula is

$$\text{variance} = \frac{\sum (x - \bar{x})^2}{n}$$

where $\bar{x}$ is the mean of the n values.  That's the one you used.

It is possible to do some algebra on that to get an alternative version that gives the same result:

$$\text{variance} = \frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2$$
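For anyone who wants to see that algebra, here is a sketch.  It writes $\bar{x}$ for the mean, $\sum x / n$; the notation is a choice made here for illustration.

$$
\begin{aligned}
\frac{\sum (x - \bar{x})^2}{n}
&= \frac{\sum x^2 - 2\bar{x}\sum x + n\bar{x}^2}{n} \\
&= \frac{\sum x^2}{n} - 2\bar{x}^2 + \bar{x}^2 \\
&= \frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2
\end{aligned}
$$

The middle step uses $\sum x = n\bar{x}$.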

This formula is handy if you are having to calculate the variance on paper as you don't have to work out the mean before you start doing the sum of squares total, so it saves time.  It is also used by calculators that have statistical functions as it only requires three memories to do the calculations: one for the running count, n; one for the running total of x; and one for the running total of x squared.  When you have entered all the data (which the calculator doesn't have to remember) there are built in functions to calculate the mean and variance.
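The three-memory idea can be sketched in a few lines of Python.  This is only an illustration: the function names and the data are made up here, and a real calculator's firmware may well do it differently.

```python
import math

def running_variance(values):
    """Mean and variance (dividing by n) from three running totals only."""
    n = 0           # running count
    total = 0.0     # running total of x
    total_sq = 0.0  # running total of x squared
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, variance

def definition_variance(values):
    """Variance straight from the definition: mean squared deviation."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

# Made-up values, purely for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean, var = running_variance(data)

# Both routes should agree (up to floating-point rounding).
assert math.isclose(var, definition_variance(data))
print(mean, var)
```

One caution: when the values are large and the spread is small, subtracting two nearly equal numbers in the last step can lose precision, so the definition formula is the safer choice on a computer.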

Now if you have a set of data and just want to get the variance, use either of the above.

But what happens if you just take a sample of values, calculate the sample mean and variance as usual; but now want to estimate the 'population' statistics?

eg.  You have sampled the weight of bags of sugar coming off a production line, and now want to say what the mean and variance are for the whole production.  Can you use the sample statistics?

The answer is YES and NO.

What do I mean by that?  Well, imagine you keep taking samples and computing the mean and variance for each sample.  The sample means will be clustered around the population mean (this can be proved by something called 'expectation algebra' but I'd rather not go into the proof here.)  So taking any particular sample mean won't give you the true population mean but it is said to be an unbiased estimator for it.  By which I mean, there's no systematic bias in taking one value; it may be too high; it may be too low; but averaged over many samples the errors cancel out.  So you may use the sample mean as an estimate for the population mean.

However, the same is not true for the sample variance.  If you repeat the sampling many times and compute the variance each time, you again get a set of results that are clustered about a fixed value; but that value is not the population variance.  The value you get will tend to be too low.  I like to think of it like this: you've only taken a few values from the population so there's less of a spread in the results than if you took the whole population.  This leads to a bias if you take the sample variance as the population variance.  But the bias is by a predictable amount!

Expectation algebra shows that the mean of the sample variances (let's call the sample variance s^2, and the population variance σ^2) is given by this formula:

$$E[s^2] = \frac{n-1}{n}\,\sigma^2$$

As you can see this leads to a variance that is too low, but only by a tiny amount when n is very large.  So if you take a big sample you could use the sample variance for the population variance and it probably wouldn't matter; but if n is small, it would because you'd be using a variance that is too small.
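If you would rather see the bias than take it on trust, a small simulation does the job.  The sketch below is an illustration only: the 'population' is invented (10,000 normally distributed weights around 1000 with sd 10), the samples are drawn with replacement, and the function names are made up for this example.

```python
import random

def variance_n(values):
    """Variance dividing by n (the definition formula)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def average_sample_variance(population, n, trials, rng):
    """Average of the divide-by-n variance over many random samples of size n."""
    total = 0.0
    for _ in range(trials):
        # Draw n values at random, with replacement.
        sample = [rng.choice(population) for _ in range(n)]
        total += variance_n(sample)
    return total / trials

rng = random.Random(0)  # fixed seed so the run is repeatable

# Invented population, purely for illustration.
population = [rng.gauss(1000, 10) for _ in range(10_000)]
pop_var = variance_n(population)

n = 5
avg = average_sample_variance(population, n, 20_000, rng)
ratio = avg / pop_var

# The ratio should come out close to (n - 1) / n.
print(ratio, (n - 1) / n)
```

With n = 5 the ratio should land near 0.8; try a bigger n and it creeps up towards 1, which is the 'tiny amount when n is very large' point.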

But you can easily unbias it.  If you multiply the sample variance by

$$\frac{n}{n-1}$$

you unbias it by just the right factor.  So you could calculate the sample variance, and then adjust it by this multiplier.  But, as the last step in calculating a variance is to divide by n, you can save some steps by dividing by (n-1) in the first place:

$$\text{estimate of population variance} = \frac{\sum (x - \bar{x})^2}{n-1}$$

This formula is often called (incorrectly) the sample variance formula.  It isn't.  It is the formula for estimating the population variance from a set of sample data.  Calculators and math packages will probably have it labelled as s^2 but, hopefully, you can see this is not quite correct.

So, back to your original post.  You said

Bobbym works at Pizza Hut. He wants to calculate the standard deviation of his weekly earnings.
Here is how much Bobbym earned this week :

Now it is arguable that what you meant was "he wants to use this sample to calculate the standard deviation for all of his earnings", in which case you would divide the sum of the squared deviations from the mean by 6.  But it isn't what you said.  In any case, the estimator formula is only valid if you take a random sample across all his earnings.  Taking values from just one week isn't random because sales may have been poor at that time of the year leading to poor earnings.  Or maybe this was an early week in his employment when he was keen and hard working, before he became cynical and disgruntled and ended up getting the sack.  There is lots of potential for introducing bias if you just take 7 days, one after the other.

My Conclusion: you were right to divide by 7.

But, recommendation: Don't round off early in the calculation; maintain all the figures until the end and then round off.  You were lucky to get 74 after all that rounding and one calculation error.

Bob

bob bundy
2013-04-24 19:58:37

OK, but it will have to be later.  I'm part way through setting up an arch in my garden and only came in for a coffee break.  I'll have a go this evening for you.  (My time now is about 11am, BST.)

Bob

mathaholic
2013-04-24 19:56:36

Yup.

bob bundy wrote:

I've been waiting for a response from you.

Wolfram says this is an area that is often confused.  I'm not confused.  So I'm happy to have a go at explaining this if you want.

Bob

bobbym wrote:

Okay, we will divide by 7.

bob bundy
2013-04-24 19:47:29

hi julianthemath

jtm wrote:

I've been waiting for a response from you.

Wolfram says this is an area that is often confused.  I'm not confused.  So I'm happy to have a go at explaining this if you want.

Bob

mathaholic
2013-04-24 18:37:34

bobbym
2013-04-23 05:55:55

Hi Bob;

Okay, we will divide by 7.

bob bundy
2013-04-23 04:37:13

hi bobbym,

Arhh, I see what you mean (pun not intended).  He makes lots of approximations and adds in 169 twice.  He was lucky to get 74 at the end.

But that doesn't change my opinion on what to divide by.

At

http://www.mathsisfun.com/data/standard-deviation.html

we are encouraged to think there are two formulas for variance.

At

http://www.mathsisfun.com/data/standard … mulas.html

the reason for this is explained more fully.

http://en.wikipedia.org/wiki/Variance

As I understand it, there is only one formula for calculating variance and that involves dividing by n.  (There is a simplification that is useful if you are doing it without the help of an electronic device that stores all the values.)

I think the formula with (n-1) is only for the purpose of getting an unbiased estimate of a population variance if all you know is the sample variance.  This arises because, with a sample (which by its nature has fewer values than the whole population) there is less variation in the values, resulting in a variance that is expected to be lower than the true population variance by a factor (n-1)/n.

To remove this bias, if you want to estimate the population variance, you take the sample variance, s^2, and multiply it by an unbiasing factor of n/(n-1).  Since the last step in calculating s^2 would have been to divide by n, you might as well save yourself the effort and cancel out the ns altogether and just divide by (n-1).

Most statistical packages will have both options (divide by n and divide by n-1) and the user has to know which one to use when.
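As one example of a package offering both, Python's standard `statistics` module has `pvariance` (divide by n) and `variance` (divide by n-1).  The data below are made up just to show the two side by side.

```python
import math
import statistics

# Made-up values, purely for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

divide_by_n = statistics.pvariance(data)         # sum of squared deviations / n
divide_by_n_minus_1 = statistics.variance(data)  # sum of squared deviations / (n - 1)

# The two differ by exactly the unbiasing factor n / (n - 1).
assert math.isclose(divide_by_n_minus_1, divide_by_n * n / (n - 1))

print(divide_by_n, divide_by_n_minus_1)
```

The matching standard deviation functions are `statistics.pstdev` and `statistics.stdev`.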

If julianthemath wanted to estimate the population variance he would have to start with a random sample.  Choosing seven days in order is hardly random.  I conclude he wanted the variance of just those values => divide by 7.

Bob

bobbym
2013-04-23 02:35:24

Hi All;

Yes, we could discuss what sd is appropriate until they rehire me but what I was after was this:

9025+5929+9+11236+12321+169+25 = 38714

14954+11245+12490+194 = 38883

The little fellow seems to have found a way to pair off 7 numbers into 4 distinct pairs? Now to mention that in post #1276 seems picayune but in post #2 it was okay.
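The two totals above can be checked mechanically.  The pairing used below is an assumption: it is one grouping of the seven squared deviations that reproduces the four partial sums, and it shows 169 being used twice.

```python
# The seven squared deviations from the first sum in the post.
squared_deviations = [9025, 5929, 9, 11236, 12321, 169, 25]
correct_total = sum(squared_deviations)

# The four partial sums from the second line of the post.
partial_sums = [14954, 11245, 12490, 194]
wrong_total = sum(partial_sums)

# Assumed pairing that reproduces those partial sums; 169 appears in two pairs.
pairs = [(9025, 5929), (9, 11236), (12321, 169), (169, 25)]
assert [a + b for a, b in pairs] == partial_sums

# The overshoot is exactly the value that was counted twice.
overshoot = wrong_total - correct_total
print(correct_total, wrong_total, overshoot)
```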

bob bundy
2013-04-22 22:45:09

hi julianthemath and bobbym

See below for my calculations.  There are 7 values so once the sum of the squared deviations has been determined this should be divided by 7.

So I agree with julianthemath.

The value of 80 comes from sum/6, which gives an unbiased estimate of a population variance from a sample of size n; its square root is then used as the estimate of the sd.
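To see where 74 and 80 both come from, here is a quick check.  It assumes the seven squared deviations quoted elsewhere in this thread (which were computed from a rounded mean), so the results are approximate.

```python
import math

# Squared deviations as quoted in the thread (based on a rounded mean).
squared_deviations = [9025, 5929, 9, 11236, 12321, 169, 25]
n = len(squared_deviations)
total = sum(squared_deviations)

sd_divide_by_n = math.sqrt(total / n)          # treats the 7 values as the whole population
sd_divide_by_n_minus_1 = math.sqrt(total / (n - 1))  # estimate for a wider population

print(round(sd_divide_by_n, 1), round(sd_divide_by_n_minus_1, 1))
```

Dividing by 7 gives roughly 74 and dividing by 6 gives roughly 80, which accounts for the two answers in the thread.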

julianthemath wrote:

He wants to calculate the standard deviation of his weekly earnings.

Maybe he should have said "... of his earnings for one week", but, from what bobbym has told us, he didn't work there any longer than that because he got fired!

So, in this case, we know the whole population => dividing by 7 is appropriate.

http://www.mathsisfun.com/data/standard … mulas.html

Bob

Mrwhy
2013-04-22 20:55:34

Do you realise that if one of those daily numbers was a misprint then its error is exaggerated by taking its square?
Mean absolute deviation is better.

Indeed in experimental science (all data!) mean of the cube root of the deviation gives a more reliable answer as it gives more weight to the numbers whose deviation is smallest (more carefully measured?)

bobbym
2013-04-22 20:47:33

Hi;

Without the senseless rounding I am getting

80.32582458486246

mathaholic
2013-04-22 14:39:42

Just go ahead. Use Mathematica to round off the standard deviation.

bobbym
2013-04-22 11:35:36

If you want I can do the calculation by hand.

mathaholic
2013-04-22 11:20:15

Oh. Mathematica again?

bobbym
2013-04-22 11:11:24

I used a program to get it and rounded it.