WEBVTT Kind: captions; language: en-us

NOTE Confidence: 87% (HIGH)

00:00:00.000 --> 00:00:07.600
In this video, we will talk about indices of dispersion. Indices of dispersion are quantities that

00:00:07.600 --> 00:00:10.550
tell us how variable our data are.

NOTE Confidence: 89% (HIGH)

00:00:10.550 --> 00:00:18.400
To illustrate them, we will use the previous example of the five people who measure the height of

00:00:18.400 --> 00:00:24.299
one person using a measuring tape, producing five estimates for it.

NOTE Confidence: 91% (HIGH)

00:00:24.299 --> 00:00:32.100
So these are the values we've seen before, as five measurements of one person's height. The

00:00:32.100 --> 00:00:39.500
question is, how can we express the variability we see, the fact that these aren't the same? Can we

00:00:39.500 --> 00:00:47.500
have a quantitative assessment of how different they are? This can be useful in answering two

00:00:47.500 --> 00:00:51.500
different questions. One question is:

NOTE Confidence: 91% (HIGH)

00:00:51.500 --> 00:01:00.000
If we were to ask more people to measure the same person's height, where could we expect the new

00:01:00.000 --> 00:01:07.150
individual measurements to lie? So what kinds of additional values can we expect?

NOTE Confidence: 74% (MEDIUM)

00:01:07.150 --> 00:01:14.800
The other question is: if we were to ask a completely new set of five people to measure this

00:01:14.800 --> 00:01:21.600
person's height and calculated the mean for that height from the new values,

NOTE Confidence: 91% (HIGH)

00:01:21.600 --> 00:01:29.400
how far would you expect that to be from the current mean, from our current estimate, based on the

00:01:29.400 --> 00:01:32.900
first sample of five people measuring it?

NOTE Confidence: 91% (HIGH)

00:01:32.900 --> 00:01:41.600
So the first question has to do with individual values and the second question has to do with

00:01:41.600 --> 00:01:49.400
indices of central tendency from new samples, new sets of values. And these questions are very

00:01:49.400 --> 00:01:58.600
important because they tell us how different the answers to our questions of interest can be, as we

00:01:58.600 --> 00:02:02.300
will see in more detail later in the course.

NOTE Confidence: 91% (HIGH)

00:02:02.800 --> 00:02:12.750
So let us remind ourselves of this frequency display, where we have each value obtained, associated

00:02:12.750 --> 00:02:15.900
with the number of times it was reported.

NOTE Confidence: 91% (HIGH)

00:02:15.900 --> 00:02:27.400
So, the value 162 was reported twice, and that's why this line is up to two. The value of 162.5

00:02:27.400 --> 00:02:33.700
was reported once, and that's why we have the value here, and so on.

NOTE Confidence: 89% (HIGH)

00:02:35.000 --> 00:02:44.700
As you may recall, these data have a mode of 162 centimeters, because this was the value most often

00:02:44.700 --> 00:02:53.800
reported. The median is 162 because there is an equal number of measurements below and above this

00:02:53.800 --> 00:03:04.700
value. And the mean value of these measurements is 162.3, with one decimal digit, with some

00:03:04.700 --> 00:03:06.300
rounding.

NOTE Confidence: 91% (HIGH)

00:03:06.300 --> 00:03:15.350
So the first thing we can do about these values is to simply note how far apart they are at the extremes.

00:03:15.350 --> 00:03:23.800
So how far the smallest value is from the largest, and that is called the range.

NOTE Confidence: 82% (HIGH)

00:03:23.800 --> 00:03:31.200
The range of these values is simply the difference between the largest and the smallest. And in this

00:03:31.200 --> 00:03:39.200
case we have a range of 1.8 centimeters.
This is how far apart two measurements can be

00:03:39.200 --> 00:03:42.600
at the extreme from this sample.

NOTE Confidence: 89% (HIGH)

00:03:43.400 --> 00:03:51.500
The second quantity we will consider is called the interquartile range, and I will demonstrate what

00:03:51.500 --> 00:03:59.000
this means and how it is derived. The first thing we have to do, as we did for the median, is to

00:03:59.000 --> 00:04:05.000
sort our measurements in order from the smallest to the largest.

NOTE Confidence: 85% (HIGH)

00:04:05.200 --> 00:04:17.200
Once we have our values in order, then we can assume that they form a continuum from 0% to 100%.

00:04:17.200 --> 00:04:26.500
So the beginning is at 0 and the last one is at 100%. These

00:04:26.500 --> 00:04:34.800
are the endpoints and they define the range of our measurements. Then, as we already know, the 50%

00:04:34.800 --> 00:04:35.900
point, which is

NOTE Confidence: 91% (HIGH)

00:04:35.900 --> 00:04:40.250
the measurement in the middle, is called the median.

NOTE Confidence: 85% (HIGH)

00:04:40.250 --> 00:04:51.500
Then we can also divide each half into half and, in this way, obtain the first quarter or first

00:04:51.500 --> 00:05:00.000
quartile, as it is called, the second quarter, the third quarter and the fourth quarter. Essentially,

00:05:00.000 --> 00:05:10.000
we are dividing our measurements into four intervals, such that each one has one quarter of the

00:05:10.000 --> 00:05:10.950
available values.

NOTE Confidence: 90% (HIGH)

00:05:13.200 --> 00:05:24.600
Therefore, between this and this point, we have half our values. This is our central 50%.

NOTE Confidence: 79% (HIGH)

00:05:24.900 --> 00:05:36.500
The interquartile range is simply the range that is spanned by this central 50%.
So, if you throw out

00:05:36.500 --> 00:05:45.000
the 1/4 smallest measurements and the 1/4 largest measurements, and you only retain the central

00:05:45.000 --> 00:05:53.000
half, the range of that half is the interquartile range. And in this case, it is equal to 0.5

00:05:53.000 --> 00:05:54.900
centimeters.

NOTE Confidence: 89% (HIGH)

00:05:56.300 --> 00:06:05.800
Now, I did this with only five values in order to have a lot of space to illustrate, but in usual

00:06:05.800 --> 00:06:13.000
practice, we don't have five values. We have many more. And in those cases, it makes more sense to

00:06:13.000 --> 00:06:20.400
speak of 1/2 or 1/4 of the measurements. So, for illustration, let us consider we have a whole

00:06:20.400 --> 00:06:22.299
bunch of measurements,

NOTE Confidence: 87% (HIGH)

00:06:22.299 --> 00:06:30.400
which you can then sort, such that the first one is the smallest and the last one is the largest.

NOTE Confidence: 77% (HIGH)

00:06:30.400 --> 00:06:34.150
And these define the range.

NOTE Confidence: 91% (HIGH)

00:06:34.150 --> 00:06:43.200
The difference between the smallest and largest defines the range. We can divide these into two

00:06:43.200 --> 00:06:51.500
halves so that the half smallest values are here and the half largest are here. And in the middle is

00:06:51.500 --> 00:06:58.150
the median. In this case, the largest value of this half is the same as the smallest of this half.

00:06:58.150 --> 00:07:00.700
And this is our median.

NOTE Confidence: 85% (HIGH)

00:07:00.900 --> 00:07:09.600
And then we can divide each half into two halves so that we have four quarters or four quartiles.

00:07:09.600 --> 00:07:19.500
This is one fourth of our data, all the smallest values, another fourth of our data, another fourth

00:07:19.500 --> 00:07:25.400
of our data, and the final fourth of our data, the largest values. Remember, these are all

00:07:25.400 --> 00:07:29.700
sorted in order from smallest to largest.
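
NOTE
A minimal sketch, not part of the lecture, of the range and the interquartile range for the five height measurements. The five values are an assumption, reconstructed from the figures quoted in the video. The quantile helper is hypothetical and interpolates linearly between sorted values, which is only one of several quartile conventions.
```python
# Assumed data: five height measurements in cm, reconstructed from the lecture.
heights = sorted([162.0, 163.5, 161.7, 162.0, 162.5])
def quantile(sorted_values, p):
    """Value at fraction p (0 to 1) of a sorted list, with linear interpolation."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])
# Range: largest value minus smallest value.
value_range = heights[-1] - heights[0]
# Quartile points: 25%, 50% (the median), and 75% of the sorted continuum.
q1, median, q3 = (quantile(heights, p) for p in (0.25, 0.50, 0.75))
# Interquartile range: the span of the central half of the data.
iqr = q3 - q1
print(round(value_range, 2), q1, median, q3, round(iqr, 2))
```
With five sorted values, the 25%, 50% and 75% points fall exactly on the second, third and fourth measurements, so under these assumed values the sketch gives a range of 1.8 cm and an interquartile range of 0.5 cm, in line with the numbers quoted in the video.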

NOTE Confidence: 84% (HIGH)

00:07:29.700 --> 00:07:32.150
and therefore,

NOTE Confidence: 91% (HIGH)

00:07:32.150 --> 00:07:40.300
this is called the first quartile, which has 25% of our data. This is the second quartile with

00:07:40.300 --> 00:07:49.900
another 25%, third quartile another 25%, fourth quartile another 25%. Therefore, if we join

00:07:49.900 --> 00:07:58.800
the two halves, that's the middle half. We join two quartiles. We have the middle half of the data,

00:07:58.800 --> 00:08:01.700
and this plus this is

NOTE Confidence: 88% (HIGH)

00:08:01.700 --> 00:08:10.200
another half of the data. These are the more extreme data, the smallest and the largest values.

NOTE Confidence: 90% (HIGH)

00:08:11.800 --> 00:08:20.000
So the range within the central half is the interquartile range.

NOTE Confidence: 89% (HIGH)

00:08:20.000 --> 00:08:27.150
And in this case it happens to be 1.5.

NOTE Confidence: 82% (HIGH)

00:08:27.150 --> 00:08:36.000
So the point is that we divide our data in four parts, such that each part has an equal number of

00:08:36.000 --> 00:08:37.299
measurements.

NOTE Confidence: 91% (HIGH)

00:08:37.299 --> 00:08:46.600
And then we define three points. The first quartile is the end point of the first quarter. The

00:08:46.600 --> 00:08:54.000
second quartile is the median. The third quartile is the end of the third quarter, and then the end

00:08:54.000 --> 00:08:58.400
of the last, the fourth quarter, is the maximum.

NOTE Confidence: 90% (HIGH)

00:08:59.300 --> 00:09:10.300
There is, however, a different way to think about variability in the data. And this is, again, based

00:09:10.300 --> 00:09:19.600
on deviations. Deviations, if you recall, are the distances of measurements from the mean.
So, if we

00:09:19.600 --> 00:09:29.100
go back to our graphical display of our five measurements, we can also add the actual mean, and its

00:09:29.100 --> 00:09:30.500
distances from

NOTE Confidence: 87% (HIGH)

00:09:30.500 --> 00:09:38.400
the specific measurements. And as you may recall, two of the measurements are above the

00:09:38.400 --> 00:09:45.100
mean, and so their distances are positive. Three are below the mean, and their distances are

00:09:45.100 --> 00:09:50.500
negative. We've gone through that when we were talking about the mean. These are the actual

00:09:50.500 --> 00:09:53.599
distances and this is our mean.

NOTE Confidence: 91% (HIGH)

00:09:53.599 --> 00:10:03.650
So the idea here is: if there is a lot of variability in the data, these distances will be large.

NOTE Confidence: 91% (HIGH)

00:10:03.650 --> 00:10:07.400
So the deviations will be big numbers.

NOTE Confidence: 91% (HIGH)

00:10:07.400 --> 00:10:15.600
If all the data are close to one another, so there is little variability or little dispersion, then

00:10:15.600 --> 00:10:24.200
these deviations will be small numbers. So the idea is to try and quantify the variability by using

00:10:24.200 --> 00:10:27.700
the information from the deviations.

NOTE Confidence: 91% (HIGH)

00:10:28.100 --> 00:10:34.150
The simplest approach would be to use the mean deviation.

NOTE Confidence: 75% (MEDIUM)

00:10:34.150 --> 00:10:42.000
And the mean deviation is the sum of all the deviations divided by how many they are.

NOTE Confidence: 74% (MEDIUM)

00:10:42.000 --> 00:10:49.800
And if you recall our discussion of the mean, you will already realize that this is not going to

00:10:49.800 --> 00:10:59.000
work, because the sum of the deviations is defined to be 0. That's how the mean was derived. So if

00:10:59.000 --> 00:11:06.300
you just add up all those deviations you get 0, so this idea is not going to work.
Obviously the

00:11:06.300 --> 00:11:12.000
problem is that we have some positive deviations and some negative deviations.

NOTE Confidence: 91% (HIGH)

00:11:12.000 --> 00:11:15.200
So the easy way to deal with that

NOTE Confidence: 91% (HIGH)

00:11:15.200 --> 00:11:24.350
is to just throw out the minus signs and end up with what we call the mean absolute deviation.

00:11:24.350 --> 00:11:33.300
Absolute relates to the notion of an absolute value, which is this symbol around the deviation. And

00:11:33.300 --> 00:11:40.000
this symbol, the absolute value, means use the value without the sign, which means if there is a minus

00:11:40.000 --> 00:11:45.500
sign, throw it out. If there's a plus sign, just keep the number as it is.

NOTE Confidence: 91% (HIGH)

00:11:45.500 --> 00:11:52.600
So, by using the absolute value of the deviation, this means we only have positive numbers to add

00:11:52.600 --> 00:12:01.349
up. So this is more like what we had in mind when we were thinking about the mean deviation. This is

00:12:01.349 --> 00:12:09.300
the mean of the distances from the mean, all of them considered as positive numbers.

NOTE Confidence: 91% (HIGH)

00:12:10.100 --> 00:12:18.100
Another trick that mathematicians like to use when they want to get rid of negative numbers is to

00:12:18.100 --> 00:12:24.700
multiply the numbers by themselves. And in statistics, this turns out to be very important because

00:12:24.700 --> 00:12:32.700
it leads us to very useful connections between observations and probabilities that we will discuss

00:12:32.700 --> 00:12:34.700
later in the course.

NOTE Confidence: 84% (HIGH)

00:12:34.700 --> 00:12:42.000
As you probably remember, if you multiply a positive number by another positive number, the product

00:12:42.000 --> 00:12:49.200
is always positive. But also, if you multiply a negative number by another negative number, minus times

00:12:49.200 --> 00:12:56.550
minus gives plus. So it's again a positive number.
So every time you multiply a number by itself,

00:12:56.550 --> 00:13:03.599
the result is going to be positive, and the multiplication of a number by itself is called a square,

00:13:03.599 --> 00:13:04.200
and

NOTE Confidence: 78% (HIGH)

00:13:04.200 --> 00:13:06.900
we use the symbol two

NOTE Confidence: 82% (HIGH)

00:13:07.100 --> 00:13:14.800
as a superscript to mean that this number, the distance, is multiplied by itself.

NOTE Confidence: 91% (HIGH)

00:13:14.800 --> 00:13:24.600
So using this trick, the sum of the squared deviations divided by how many deviations we have is

00:13:24.600 --> 00:13:32.400
called the variance, and the variance is an extremely useful quantity in statistics. Again, what

00:13:32.400 --> 00:13:39.800
this means is we have each distance from the mean, each deviation, multiplied by itself. And these

00:13:39.800 --> 00:13:43.349
squared deviations are then all added up.

NOTE Confidence: 70% (MEDIUM)

00:13:43.349 --> 00:13:50.800
And the sum, the result of this addition, is divided by the number of observations, or by the number of

00:13:50.800 --> 00:13:52.900
how many of these things we have.

NOTE Confidence: 89% (HIGH)

00:13:52.900 --> 00:14:02.500
And this is the variance. One difficulty with the variance serving as an index of variability that

00:14:02.500 --> 00:14:10.400
we can use for expectation arises exactly because of the square. So in this example, all the

00:14:10.400 --> 00:14:13.099
observations were in centimeters.

NOTE Confidence: 83% (HIGH)

00:14:13.099 --> 00:14:22.500
And what we are interested in finding out is by how many centimeters we can expect new measurements or

00:14:22.500 --> 00:14:29.099
new sample means to differ from the current ones.

NOTE Confidence: 91% (HIGH)

00:14:29.099 --> 00:14:38.800
But if you multiply a deviation by itself, that would be centimeters times centimeters, which is

00:14:38.800 --> 00:14:47.800
centimeters squared. Now, square centimeters is a unit of area.
It's not a unit of length or height.

NOTE Confidence: 91% (HIGH)

00:14:47.800 --> 00:14:55.900
And it sounds very strange to express variability in height measurements by using square

00:14:55.900 --> 00:15:03.000
centimeters. It actually doesn't make any sense when you try to think of it intuitively. So the

00:15:03.000 --> 00:15:12.700
solution to this problem is actually very easy, and it is to kind of take back this square. And

00:15:12.700 --> 00:15:17.350
how you can take back a square is by using the square root.

NOTE Confidence: 91% (HIGH)

00:15:17.350 --> 00:15:26.400
So if you apply the square root on this number, then this kind of undoes the square. It certainly

00:15:26.400 --> 00:15:33.800
undoes it on the units. So instead of having square centimeters, you have centimeters again.

NOTE Confidence: 91% (HIGH)

00:15:33.800 --> 00:15:37.800
And this is called the standard deviation.

NOTE Confidence: 91% (HIGH)

00:15:39.000 --> 00:15:41.800
So, this is

NOTE Confidence: 89% (HIGH)

00:15:41.800 --> 00:15:54.900
the square root of the mean squared distance from the mean. That's what this means. And we're going

00:15:54.900 --> 00:16:03.300
to now look at this in more detail, go into it step by step, make sure we understand it. It looks

00:16:03.300 --> 00:16:09.500
a bit scary, maybe, because it looks like a lot of math, but it doesn't have anything we're

00:16:09.500 --> 00:16:12.150
unfamiliar with. So we're just going

NOTE Confidence: 86% (HIGH)

00:16:12.150 --> 00:16:16.250
to break it down into parts and see how it works.

NOTE Confidence: 91% (HIGH)

00:16:16.250 --> 00:16:19.200
So, the standard deviation.

NOTE Confidence: 79% (HIGH)

00:16:19.400 --> 00:16:26.600
To calculate the standard deviation, we start with the values of our observations. So if we have a

00:16:26.600 --> 00:16:30.650
variable x, we start with our x_i.

NOTE Confidence: 78% (HIGH)

00:16:30.650 --> 00:16:37.800
And in this case, i goes from 1 to 5.
Meaning we have five measurements. And these are the actual

00:16:37.800 --> 00:16:45.800
values. These are the actual measurements. The first step is to calculate the deviation, that is, the

00:16:45.800 --> 00:16:53.800
distance from the mean. So from each of these measurements, we subtract the mean.

NOTE Confidence: 89% (HIGH)

00:16:54.100 --> 00:17:04.800
And these are the differences. And then, for each one of these differences, we multiply this by

00:17:04.800 --> 00:17:15.900
itself. This is what this square means. So 1.16 times 1.16 is this number. Minus 0.34

00:17:15.900 --> 00:17:24.150
times minus 0.34 is this number. Minus 0.64 times minus

NOTE Confidence: 91% (HIGH)

00:17:24.150 --> 00:17:33.200
0.64 is this number, and so on. So these are the squared deviations, or the squared distances from the

00:17:33.200 --> 00:17:37.550
mean, and all we have to do is just add them up.

NOTE Confidence: 86% (HIGH)

00:17:37.550 --> 00:17:42.449
So if we just sum these five numbers,

NOTE Confidence: 74% (MEDIUM)

00:17:42.449 --> 00:17:52.050
we obtain this value. This is just their sum, and then we divide by how many they are: five.

00:17:52.050 --> 00:18:00.900
So divide this number by 5 and we get this number. So this is the mean squared deviation.

NOTE Confidence: 90% (HIGH)

00:18:01.800 --> 00:18:06.800
And then, if we apply this square root

NOTE Confidence: 91% (HIGH)

00:18:06.800 --> 00:18:14.300
to this number, just enter this number into your calculator and request the square root, and you get

00:18:14.300 --> 00:18:18.500
this number, and this is the standard deviation.

NOTE Confidence: 91% (HIGH)

00:18:18.500 --> 00:18:29.100
In fact, we use this symbol to mean the population standard deviation. As we used the Greek letter

00:18:29.100 --> 00:18:38.300
μ to indicate the population mean, we use the Greek letter sigma, the lowercase sigma, to indicate

00:18:38.300 --> 00:18:43.900
the population standard deviation.
However, now we only have a sample,

NOTE Confidence: 82% (HIGH)

00:18:43.900 --> 00:18:53.400
so the correct thing to calculate is not exactly that. It's almost like that. And the difference is

00:18:53.400 --> 00:19:01.700
we have to subtract 1 from how many measures we have. So instead of dividing by the number of

00:19:01.700 --> 00:19:11.000
observations, we divide by one less than that. And then this result, this number, is the sample

00:19:11.000 --> 00:19:13.750
standard deviation, which is

NOTE Confidence: 68% (MEDIUM)

00:19:13.750 --> 00:19:22.800
symbolized with the Latin letter s, and this is really what we've been aiming at. This is the standard

00:19:22.800 --> 00:19:31.700
deviation that we can use as an index of variability, or an index of dispersion, for our set of

00:19:31.700 --> 00:19:33.500
measurements.

NOTE Confidence: 84% (HIGH)

00:19:36.200 --> 00:19:45.600
So this formula, which does look a bit scary, is now a lot less scary because you can understand each

00:19:45.600 --> 00:19:53.000
step, and the fact that it actually doesn't hide anything complicated. It just shows in a shorthand

00:19:53.000 --> 00:19:59.550
form that our index of variability is the square root of

NOTE Confidence: 79% (HIGH)

00:19:59.550 --> 00:20:09.300
the sum of the squared distances of observations from the mean, divided by the number of

00:20:09.300 --> 00:20:12.000
observations minus 1.

NOTE Confidence: 91% (HIGH)

00:20:12.000 --> 00:20:19.200
And you'll probably never have to calculate this by hand outside of a statistics course, because

00:20:19.200 --> 00:20:22.100
that's what we have computers for.

NOTE Confidence: 82% (HIGH)

00:20:22.100 --> 00:20:30.200
Let us now look at some examples to understand how the different indices of dispersion behave

00:20:30.200 --> 00:20:35.350
when we actually have differences in the dispersion of data.
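
NOTE
A minimal sketch, not part of the lecture, of the step-by-step standard deviation calculation described in the video. The five values are an assumption, reconstructed from the deviations quoted in the video (1.16, -0.34, -0.64), and the variable names are hypothetical.
```python
import math
import statistics
# Assumed data: five height measurements in cm, reconstructed from the lecture.
heights = [161.7, 162.0, 162.0, 162.5, 163.5]
n = len(heights)
mean = sum(heights) / n
# Step 1: deviations, the signed distances of each measurement from the mean.
deviations = [x - mean for x in heights]
# Step 2: square each deviation so that every term is non-negative.
squared_deviations = [d * d for d in deviations]
# Step 3: add up the squared deviations.
sum_of_squares = sum(squared_deviations)
# Step 4a: divide by n and take the square root (population formula, sigma).
population_sd = math.sqrt(sum_of_squares / n)
# Step 4b: divide by n - 1 and take the square root (sample formula, s).
sample_sd = math.sqrt(sum_of_squares / (n - 1))
# Cross-check against the standard library, which uses the same two formulas.
library_population_sd = statistics.pstdev(heights)
library_sample_sd = statistics.stdev(heights)
print(round(mean, 2), round(population_sd, 2), round(sample_sd, 2))
```
Under these assumed values the deviations add up to zero, as the lecture notes, and the sample standard deviation rounds to 0.71 cm, the figure quoted later in the video for the original data.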

NOTE Confidence: 86% (HIGH)

00:20:35.350 --> 00:20:45.000
These are the data that we already have seen several times. These are actual measurements, actual

00:20:45.000 --> 00:20:54.300
five observations for one person's height. And I have created some variations on this for this

00:20:54.300 --> 00:20:57.800
illustration. So one variation is

NOTE Confidence: 91% (HIGH)

00:20:57.800 --> 00:21:05.700
just to have some values that are closely spaced together, more closely than the actual

00:21:05.700 --> 00:21:14.100
observations. So, the dispersion here is obviously smaller than here. They're all closer. So, it

00:21:14.100 --> 00:21:22.200
looks like, if these were five people measuring somebody's height, they were more in agreement about

00:21:22.200 --> 00:21:28.850
this person's height than these five people. And here is the opposite situation.

NOTE Confidence: 74% (MEDIUM)

00:21:28.850 --> 00:21:35.500
If these were five people measuring one person's height, they're not really agreeing very much with

00:21:35.500 --> 00:21:44.600
one another. So here, this is a set with higher dispersion, and here we have a hypothetical

00:21:44.600 --> 00:21:53.450
situation in which four of the measures are exactly the same as our actual data, but the fifth one

00:21:53.450 --> 00:21:58.350
is taken to the extreme, representing a possible

NOTE Confidence: 91% (HIGH)

00:21:58.350 --> 00:22:01.450
mistake, like a slip of the tape.

NOTE Confidence: 86% (HIGH)

00:22:01.450 --> 00:22:07.650
What are the indices of central tendency for these data sets?

NOTE Confidence: 91% (HIGH)

00:22:07.650 --> 00:22:17.800
Well, the mode is the same because I have kept the two values at 162 in each case. The median is

00:22:17.800 --> 00:22:27.050
also the same because I've always kept one measurement less than 162 in each of these hypothetical

00:22:27.050 --> 00:22:37.600
manipulations. The mean is about the same, but not exactly the same.
It's 162.3 in the original

00:22:37.600 --> 00:22:38.100
data,

NOTE Confidence: 73% (MEDIUM)

00:22:38.100 --> 00:22:45.150
point 2 in these data, point 5 here, and point 6 here, so the mean is slightly affected.

NOTE Confidence: 85% (HIGH)

00:22:45.150 --> 00:22:52.750
So the interesting question now is what happens to the dispersion, to our indices of dispersion.

NOTE Confidence: 83% (HIGH)

00:22:52.750 --> 00:22:59.949
The first and simplest approach is to calculate the ranges, as you can see with these green lines here.

00:22:59.949 --> 00:23:05.949
So the range of the original data was 1.8 centimeters.

NOTE Confidence: 91% (HIGH)

00:23:05.949 --> 00:23:16.700
The range of the second set, the more condensed set, is 1. The range of the expanded set is 3.8, and

00:23:16.700 --> 00:23:25.900
the range of the set with one outlier is 3.3. So the range simply expresses the difference between

00:23:25.900 --> 00:23:31.550
the smallest and the largest value. What about the interquartile range?

NOTE Confidence: 82% (HIGH)

00:23:31.550 --> 00:23:38.900
The interquartile range is not affected by where the smallest and the largest values are, because

00:23:38.900 --> 00:23:45.000
it's not affected by the smallest one quarter of the measures and the largest one quarter of the measures.

00:23:45.000 --> 00:23:51.900
This means that the interquartile range is much less sensitive to these manipulations. Of course, if

00:23:51.900 --> 00:23:58.800
the manipulations truly affect the nature of the entire data set, we do see it in the interquartile

00:23:58.800 --> 00:23:59.600
range.

NOTE Confidence: 90% (HIGH)

00:23:59.600 --> 00:24:07.000
So here, where the data set is truly more condensed, we have a smaller interquartile range than the

00:24:07.000 --> 00:24:16.000
original data, and here we have everything more expanded.
We have a larger interquartile range. In the

00:24:16.000 --> 00:24:23.500
last case, where we just have a mistaken value, an outlier, the interquartile range is not affected.

NOTE Confidence: 71% (MEDIUM)

00:24:25.400 --> 00:24:33.100
Let us move on to the deviation-based indices. These are the deviations, so the distances from

00:24:33.100 --> 00:24:44.150
the mean, and these are the actual numbers. The mean absolute deviation is 0.53 in the original data,

00:24:44.150 --> 00:24:52.699
0.33 in the condensed data, 0.99 in the expanded data, and

NOTE Confidence: 91% (HIGH)

00:24:52.699 --> 00:24:56.550
0.94 in the data with an outlier.

NOTE Confidence: 81% (HIGH)

00:24:56.550 --> 00:24:59.100
So, you see that

NOTE Confidence: 81% (HIGH)

00:24:59.100 --> 00:25:07.750
the mean absolute deviation is double in this set than in that, and it's 3/5 in this than in that.

00:25:07.750 --> 00:25:13.800
Basically, the difference between these two is that this triples.

NOTE Confidence: 88% (HIGH)

00:25:14.100 --> 00:25:24.000
When you look at the standard deviation, the standard deviation of the original is 0.71, the

00:25:24.000 --> 00:25:33.300
condensed 0.42, the expanded one 1.41, and the one with the outlier 1.35.

NOTE Confidence: 71% (MEDIUM)

00:25:33.300 --> 00:25:39.200
So again, we see that this is about twice as much as that.

NOTE Confidence: 80% (HIGH)

00:25:41.500 --> 00:25:52.500
And this is more than three times this one. The reason that expansion costs more than is gained

00:25:52.500 --> 00:26:04.050
by condensation is that the standard deviation is made to be more influenced by larger deviations.

NOTE Confidence: 91% (HIGH)

00:26:04.050 --> 00:26:12.199
In the mean absolute deviation, all the numbers just enter as the actual distances they are.

NOTE Confidence: 85% (HIGH)

00:26:12.199 --> 00:26:21.800
But in the standard deviation, the numbers enter the formula after they are multiplied by themselves. This is very

00:26:21.800 --> 00:26:30.600
important. It means that a small deviation will be multiplied by a small number, that is, itself. A

00:26:30.600 --> 00:26:35.500
large deviation will be multiplied by a large number, itself.

NOTE Confidence: 74% (MEDIUM)

00:26:35.500 --> 00:26:44.400
So the large deviations will go into the sum having been multiplied by larger numbers than the small

00:26:44.400 --> 00:26:51.400
deviations. So they will end up influencing the standard deviation more than they would influence

00:26:51.400 --> 00:26:59.300
the mean absolute deviation. This means that the standard deviation is more sensitive to values far

00:26:59.300 --> 00:27:05.250
from the mean, which is actually a good thing when you're interested in guessing

NOTE Confidence: 71% (MEDIUM)

00:27:05.250 --> 00:27:08.800
where new values might actually lie.

NOTE Confidence: 91% (HIGH)

00:27:08.900 --> 00:27:18.700
So to sum up our indices of dispersion: the first and simplest one was the range, which only gives us

00:27:18.700 --> 00:27:26.800
an indication of possible values we can expect. Because it only depends on two observations, the

00:27:26.800 --> 00:27:34.150
smallest and the largest, and because the smallest and the largest are those that will be most likely

00:27:34.150 --> 00:27:38.150
to be the result of something problematic, the range is actually

NOTE Confidence: 67% (MEDIUM)

00:27:38.150 --> 00:27:44.300
sensitive to outliers, which means sensitive to any kind of problems with our measurement.

NOTE Confidence: 77% (HIGH)

00:27:44.300 --> 00:27:52.400
So although it can be very informative about where possible values could range.
It's not really very

00:27:52.400 --> 00:28:00.100
useful when we want to guess something more precisely, where we can really expect new values or new

00:28:00.100 --> 00:28:03.200
sets to be observed.

NOTE Confidence: 87% (HIGH)

00:28:03.500 --> 00:28:12.300
The second one, the interquartile range, gives us a good indication of the central data range, of

00:28:12.300 --> 00:28:14.900
half of our data.

NOTE Confidence: 91% (HIGH)

00:28:16.100 --> 00:28:21.650
This means that it doesn't tell us anything about the other half.

NOTE Confidence: 91% (HIGH)

00:28:21.650 --> 00:28:28.700
This could be a bad thing if you really want to know the whole range of your data, but it is a good

00:28:28.700 --> 00:28:36.600
thing if you don't want your index to be affected by potential problems. So you can use the

00:28:36.600 --> 00:28:46.600
interquartile range to guess, about a new data point, that it will be within this range half of the

00:28:46.600 --> 00:28:47.699
time.

NOTE Confidence: 91% (HIGH)

00:28:47.699 --> 00:28:55.500
If you have a good estimate for the range of half of your data that is robust and not sensitive to

00:28:55.500 --> 00:29:02.800
what happens with extreme data, then this can be a pretty good guess about new values, and you can

00:29:02.800 --> 00:29:09.100
expect that half of any additional data will lie within this range.

NOTE Confidence: 90% (HIGH)

00:29:09.100 --> 00:29:17.000
The mean absolute deviation is not in fact very often used in practice, because it doesn't really

00:29:17.000 --> 00:29:25.900
have any obvious advantages over the other options. What is frequently used, when certain assumptions

00:29:25.900 --> 00:29:33.000
are met, is the standard deviation. The standard deviation, as we said, gives weight to large

00:29:33.000 --> 00:29:34.450
deviations

NOTE Confidence: 88% (HIGH)

00:29:34.450 --> 00:29:43.400
and can be very useful for predicting where new values will be.
But, like the mean, it is sensitive to

00:29:43.400 --> 00:29:50.800
certain assumptions about our data, about which we will say more later in the course.
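
NOTE
A rough sketch, not part of the lecture, comparing how the mean absolute deviation and the sample standard deviation react to a single outlier. Both data sets are assumptions reconstructed from figures quoted in the video: the outlier set replaces the largest value with 165.0, which matches the quoted range of 3.3. The function names are hypothetical.
```python
import math
def mean_absolute_deviation(values):
    """Mean of the distances from the mean, all taken as positive numbers."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)
def sample_standard_deviation(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    m = sum(values) / len(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))
# Assumed data in cm, reconstructed from the lecture.
original = [161.7, 162.0, 162.0, 162.5, 163.5]
with_outlier = [161.7, 162.0, 162.0, 162.5, 165.0]
mad_original = mean_absolute_deviation(original)
mad_outlier = mean_absolute_deviation(with_outlier)
sd_original = sample_standard_deviation(original)
sd_outlier = sample_standard_deviation(with_outlier)
# How much each index grows when one value is pushed to the extreme.
mad_growth = mad_outlier / mad_original
sd_growth = sd_outlier / sd_original
print(round(mad_original, 2), round(mad_outlier, 2), round(sd_original, 2), round(sd_outlier, 2))
```
Under these assumed values the mean absolute deviation rises from about 0.53 to 0.94 and the standard deviation from about 0.71 to 1.35, so the standard deviation grows by a somewhat larger factor, which is the sensitivity to large deviations described in the video.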