WEBVTT Kind: captions; language: en-us NOTE Treffsikkerhet: 88% (H?Y) 00:00:00.000 --> 00:00:07.600 In this video we will introduce the normal distribution based on what we've learned from our coin 00:00:07.600 --> 00:00:09.500 tossing experiments. NOTE Treffsikkerhet: 90% (H?Y) 00:00:09.500 --> 00:00:19.600 Let us recall the probabilities for two coin tosses: the probabilities for getting heads 0 times, so 00:00:19.600 --> 00:00:31.000 two tails; once, so one of each; or twice, twice heads. And these were 1 divided by 4 per pattern, because there are 00:00:31.000 --> 00:00:40.050 four possible patterns: 25% to get two heads, 25% to get two tails, and 50% to get NOTE Treffsikkerhet: 78% (H?Y) 00:00:40.050 --> 00:00:49.000 one of each, because there are two ways out of 4 to get once heads and once tails. And as you will 00:00:49.000 --> 00:00:57.300 recall, the total of these probabilities must be 100 percent or one, NOTE Treffsikkerhet: 91% (H?Y) 00:00:57.300 --> 00:01:02.000 because these are all the possible outcomes. NOTE Treffsikkerhet: 81% (H?Y) 00:01:03.099 --> 00:01:10.400 When we think about four tosses, these are the probabilities of getting the different numbers of 00:01:10.400 --> 00:01:17.700 heads indicated, from zero times heads, all tails in four tosses, to 4 times heads, going through 00:01:17.700 --> 00:01:26.000 once, or twice, so twice each, or three times heads. And the actual numbers are these. These are the 00:01:26.000 --> 00:01:33.750 probabilities here in percent. And again, of course, they must sum up to 1 or NOTE Treffsikkerhet: 84% (H?Y) 00:01:33.750 --> 00:01:40.100 a hundred percent because these are all the possible patterns and one of these must happen. NOTE Treffsikkerhet: 85% (H?Y) 00:01:40.400 --> 00:01:50.700 For ten tosses, this is what the expected frequencies look like. What about lots more tosses? NOTE Treffsikkerhet: 91% (H?Y) 00:01:51.200 --> 00:02:02.200 Let's see what happens with 100 tosses. 
So these are the expected frequencies for getting from 30 00:02:02.200 --> 00:02:10.300 through 70 times heads. We're not plotting the probabilities for getting fewer or more because 00:02:10.300 --> 00:02:18.900 they are very, very low. So, if you toss a coin a hundred times, then you are expected to get NOTE Treffsikkerhet: 91% (H?Y) 00:02:18.900 --> 00:02:23.700 at least about 30 times heads NOTE Treffsikkerhet: 90% (H?Y) 00:02:23.700 --> 00:02:28.149 and at least about 30 times tails. NOTE Treffsikkerhet: 86% (H?Y) 00:02:28.149 --> 00:02:36.700 Larger ratios are very, very unlikely. What about a thousand tosses? What are the expected 00:02:36.700 --> 00:02:38.800 frequencies for that? NOTE Treffsikkerhet: 91% (H?Y) 00:02:38.800 --> 00:02:47.250 Okay, we can see a nice shape being formed here by all of these bars and we see that the 00:02:47.250 --> 00:02:51.550 probabilities here range from about NOTE Treffsikkerhet: 91% (H?Y) 00:02:51.550 --> 00:02:53.900 440 NOTE Treffsikkerhet: 88% (H?Y) 00:02:53.900 --> 00:03:03.550 to 560. Of course, you can calculate them for fewer or more times heads, but they're exceedingly low 00:03:03.550 --> 00:03:05.600 and can be ignored. NOTE Treffsikkerhet: 91% (H?Y) 00:03:05.600 --> 00:03:15.250 So, most of the outcomes will contain at least 450 of each. NOTE Treffsikkerhet: 84% (H?Y) 00:03:15.250 --> 00:03:32.750 What about 10,000 coin tosses? Okay. So now we are likely to get between 4850 and 5150. NOTE Treffsikkerhet: 89% (H?Y) 00:03:32.750 --> 00:03:42.600 And you see that all of these little bars that correspond to individual counts are so close that 00:03:42.600 --> 00:03:48.200 they have merged into one another. So NOTE Treffsikkerhet: 91% (H?Y) 00:03:48.400 --> 00:03:59.500 they have formed a smooth curve. Imagine joining the tops of all these bars NOTE Treffsikkerhet: 91% (H?Y) 00:03:59.500 --> 00:04:09.000 in each case. In the case of ten tosses. 
We see that this is kind of a jagged mountain because you 00:04:09.000 --> 00:04:16.250 can see each individual bar. It's a bit smoother with a hundred tosses. If you do the same thing, 00:04:16.250 --> 00:04:19.800 it's pretty smooth with a thousand. NOTE Treffsikkerhet: 90% (H?Y) 00:04:19.800 --> 00:04:29.000 And there are no visible corners of any sort by 10,000, and indeed the bars are merging into each 00:04:29.000 --> 00:04:29.900 other. NOTE Treffsikkerhet: 81% (H?Y) 00:04:29.900 --> 00:04:42.300 We imagine starting from ten thousand tosses. Imagine now that we go up to tossing forever. NOTE Treffsikkerhet: 84% (H?Y) 00:04:42.300 --> 00:04:51.500 So this means that the number of tosses goes to infinity, uncountable, and the little bars in here 00:04:51.500 --> 00:05:00.300 become so thin that they actually have no width, but there are so many of them that they completely merge 00:05:00.300 --> 00:05:06.400 into one another, into one big area, one big blob. NOTE Treffsikkerhet: 91% (H?Y) 00:05:06.400 --> 00:05:16.500 The shape of the curve joining the tops of all these infinitely many individual bars, for an uncountable 00:05:16.500 --> 00:05:20.900 number of tosses, is called the normal distribution. NOTE Treffsikkerhet: 91% (H?Y) 00:05:20.900 --> 00:05:23.900 And this is what it looks like. NOTE Treffsikkerhet: 91% (H?Y) 00:05:24.800 --> 00:05:33.200 The reason that we're introducing this as an extension of the number of tosses is because we want 00:05:33.200 --> 00:05:39.900 to retain a very important lesson from tossing coins. And that lesson is that NOTE Treffsikkerhet: 88% (H?Y) 00:05:39.900 --> 00:05:50.300 the area under this curve is equal to 1 because that's the total probability. So we don't have 00:05:50.300 --> 00:05:58.000 individual bars here to add their heights because there's an infinite number of them and each one is 00:05:58.000 --> 00:05:59.800 infinitely thin. 
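NOTE The coin-toss probabilities described so far can be computed directly. This is a minimal sketch in Python, assuming a fair coin; the function names heads_probability and probability_between are made up for illustration and are not part of the video.
```python
from math import comb
#
def heads_probability(n, k):
    """Probability of exactly k heads in n tosses of a fair coin: C(n, k) / 2**n."""
    return comb(n, k) / 2**n
#
def probability_between(n, low, high):
    """Probability that the number of heads in n fair tosses lies in [low, high]."""
    # Add up the pattern counts as exact integers, then divide once by 2**n.
    return sum(comb(n, k) for k in range(low, high + 1)) / 2**n
#
# Ranges quoted in the video for 100, 1000 and 10,000 tosses.
for n, low, high in [(100, 30, 70), (1000, 440, 560), (10000, 4850, 5150)]:
    print(n, low, high, probability_between(n, low, high))
```
For two tosses, heads_probability(2, 1) gives 0.5 and heads_probability(2, 0) gives 0.25, as in the recap at the start. For 100 tosses, the probability of getting between 30 and 70 heads comes out above 0.999, which is why the bars outside that range are not worth plotting.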
NOTE Treffsikkerhet: 91% (H?Y) 00:05:59.800 --> 00:06:07.600 So, you can't work with individual bars, but what you can do, once they are infinite, is you 00:06:07.600 --> 00:06:15.200 stop thinking about bars and adding them, and think about areas. So this total area under here is the 00:06:15.200 --> 00:06:22.500 total probability of everything that can happen, which has to be 100% or one. NOTE Treffsikkerhet: 90% (H?Y) 00:06:22.500 --> 00:06:30.600 So this shaded area here under this curve has an area of one. NOTE Treffsikkerhet: 91% (H?Y) 00:06:31.300 --> 00:06:39.300 And obviously this curve is symmetric. So whatever happens this way. Exactly the same thing 00:06:39.300 --> 00:06:46.950 happens this way. Okay. So this means if I cut it in exactly half. So this is the top of the curve. 00:06:46.950 --> 00:06:53.900 If I cut it in half and if the whole thing is equal to 1, then the left half will have to be equal 00:06:53.900 --> 00:06:55.650 to 1/2. NOTE Treffsikkerhet: 88% (H?Y) 00:06:55.650 --> 00:07:08.600 So 50% probability, this shaded area here is equal to 0.5 or 50% and of course, this is also 50% 00:07:08.600 --> 00:07:11.000 because it's the other half. NOTE Treffsikkerhet: 91% (H?Y) 00:07:11.900 --> 00:07:22.000 Okay. So what about different points in the curve. The important thing about the normal distribution 00:07:22.000 --> 00:07:28.300 is that we can calculate the probability for any given point. NOTE Treffsikkerhet: 85% (H?Y) 00:07:28.300 --> 00:07:33.400 So, at the point that I have indicated with one. NOTE Treffsikkerhet: 85% (H?Y) 00:07:33.400 --> 00:07:47.100 What is this area? Well, it is about 84 percent. So the area from far left all the way up to 1 on 00:07:47.100 --> 00:07:56.300 the normal distribution is equal to 84 percent. And because the whole thing is 1, this means that 00:07:56.300 --> 00:07:57.900 what's left NOTE Treffsikkerhet: 77% (H?Y) 00:07:57.900 --> 00:08:10.900 must be 100 minus 84 percent which is 16 percent. 
So 84 plus 16 is 100. NOTE Treffsikkerhet: 91% (H?Y) 00:08:11.500 --> 00:08:22.500 So if this is 16 from 1 to the end forever, and because this is a symmetric curve, then this also 00:08:22.500 --> 00:08:34.049 should be 16 from minus one to the left end. So, 16 plus 16 means that the probability that is 00:08:34.049 --> 00:08:38.150 outside of the range from minus 1 to 1, NOTE Treffsikkerhet: 88% (H?Y) 00:08:38.150 --> 00:08:43.750 the sum of these two areas, is about 32%. NOTE Treffsikkerhet: 91% (H?Y) 00:08:43.750 --> 00:08:52.400 And if that's 32, then what's left to complete the 100 has to be NOTE Treffsikkerhet: 68% (MEDIUM) 00:08:52.400 --> 00:08:54.450 68. NOTE Treffsikkerhet: 83% (H?Y) 00:08:54.450 --> 00:09:02.200 So, there is a 68% probability between minus 1 and 1. NOTE Treffsikkerhet: 91% (H?Y) 00:09:03.000 --> 00:09:10.100 And a 32 percent probability outside of this range. NOTE Treffsikkerhet: 90% (H?Y) 00:09:10.100 --> 00:09:19.800 What about the range from minus 2 to 2? It turns out this is about 95%. NOTE Treffsikkerhet: 77% (H?Y) 00:09:19.800 --> 00:09:30.200 So what's left to be outside of this range? This probability is five percent. Five plus 95 equals 00:09:30.200 --> 00:09:43.100 100. So a probability of 0.05 is associated with the area outside of the range minus 2 to 2. NOTE Treffsikkerhet: 84% (H?Y) 00:09:43.800 --> 00:09:50.150 Therefore, one side only would be half of that. NOTE Treffsikkerhet: 78% (H?Y) 00:09:50.150 --> 00:09:59.100 What is half of five percent? That is about two and a half percent. That is beyond two on the right 00:09:59.100 --> 00:10:01.700 side, on the positive side. NOTE Treffsikkerhet: 87% (H?Y) 00:10:02.200 --> 00:10:16.200 And what about the range between -2.5 and +2.5? This probability here is about 99%, and therefore 00:10:16.200 --> 00:10:24.400 outside of this range, so from 2.5 and above and minus 2.5 and below, the sum of 00:10:24.400 --> 00:10:29.800 these probabilities is about 1%. 
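NOTE The areas quoted above can be checked numerically. This is a sketch in Python that uses the error function from the standard library to get the area under the standard normal curve; the names area_left_of and area_within are made up for illustration.
```python
from math import erf, sqrt
#
def area_left_of(z):
    """Area under the standard normal curve from the far left up to z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
#
def area_within(z):
    """Area between -z and +z: the left area at z minus the left tail at -z."""
    return area_left_of(z) - area_left_of(-z)
#
# Print, in percent: area up to z, area within -z..z, and area outside that range.
for z in (1, 2, 2.5):
    inside = 100 * area_within(z)
    print(z, round(100 * area_left_of(z), 1), round(inside, 1), round(100 - inside, 1))
```
The computed values are a little less round than the ones to memorize: the range from -2 to 2 holds about 95.4% rather than exactly 95%, and the range from -2.5 to 2.5 holds about 98.8% rather than exactly 99%, which is why the video says "about".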
NOTE Treffsikkerhet: 87% (H?Y) 00:10:30.400 --> 00:10:39.300 And if we go all the way from minus 3 to 3, this is very hard to see, but there is a point right 00:10:39.300 --> 00:10:51.600 here. And so, almost everything is included: the probability within 3, from minus 3 to 3, is 99.7%. NOTE Treffsikkerhet: 91% (H?Y) 00:10:53.400 --> 00:11:00.200 So, we have some proportions that are based on the normal distribution. NOTE Treffsikkerhet: 81% (H?Y) 00:11:00.200 --> 00:11:08.900 And I realize you are still wondering why we're doing this. But this will become clear very soon. 00:11:08.900 --> 00:11:16.350 This is very, very important. I need you to learn these sets of numbers. NOTE Treffsikkerhet: 91% (H?Y) 00:11:16.350 --> 00:11:26.400 Actually, you only need to learn one column. This is the easiest one because they're all related. So 00:11:26.400 --> 00:11:31.100 as we saw, if you think about one, NOTE Treffsikkerhet: 88% (H?Y) 00:11:31.700 --> 00:11:43.000 there is 68% from minus 1 to plus 1, which leaves 32% to be outside this range, and therefore each 00:11:43.000 --> 00:11:52.800 half, so each tail, the right tail above one or the left tail below minus 1, has to be 16 percent. NOTE Treffsikkerhet: 89% (H?Y) 00:11:52.800 --> 00:12:02.700 So if you remember this 32, then you know that inside has to be what's left from a hundred and each 00:12:02.700 --> 00:12:05.000 tail has to be half of it. NOTE Treffsikkerhet: 71% (MEDIUM) 00:12:05.200 --> 00:12:07.200 And, NOTE Treffsikkerhet: 68% (MEDIUM) 00:12:07.200 --> 00:12:09.550 for two, NOTE Treffsikkerhet: 83% (H?Y) 00:12:09.550 --> 00:12:20.300 we have 95 percent within the range minus two to two, and five percent left for outside the range minus 00:12:20.300 --> 00:12:25.250 two to two. Therefore each tail would be 2.5%. NOTE Treffsikkerhet: 80% (H?Y) 00:12:25.250 --> 00:12:33.100 Again, if you remember the 5, then this is what's left for a hundred and this is half. 
And the 00:12:33.100 --> 00:12:39.200 third, the third set of numbers to remember, corresponds to two and a half. NOTE Treffsikkerhet: 73% (MEDIUM) 00:12:39.300 --> 00:12:47.600 And that's because it results in the easy number of one percent outside the range, 99 percent within 00:12:47.600 --> 00:12:54.100 the range from minus two and a half to plus two and a half, and therefore each tail must be about 00:12:54.100 --> 00:12:56.600 half of a percent. NOTE Treffsikkerhet: 87% (H?Y) 00:12:57.100 --> 00:13:04.200 Okay. Now, why is this important? Why are you supposed to learn these three numbers? You should 00:13:04.200 --> 00:13:10.300 really memorize these numbers, not all the others, but these three you should memorize, because 00:13:10.300 --> 00:13:16.200 they are the basis for all comparisons and conclusions in statistics that will follow. NOTE Treffsikkerhet: 91% (H?Y) 00:13:17.200 --> 00:13:24.000 You can actually check out a lot more different values and how they participate in the probability 00:13:24.000 --> 00:13:27.200 area, if you go to this link. NOTE Treffsikkerhet: 87% (H?Y) 00:13:27.300 --> 00:13:37.600 Now, why are we so concerned, so interested in this curve? I mean, it looks nice, but it's not 00:13:37.600 --> 00:13:46.500 obvious why it's so important. The mathematical formula for this, which you should not try to learn, 00:13:46.500 --> 00:13:54.000 I don't know it, and you should not really hear about it, is this one. Yes, it's very scary. That's 00:13:54.000 --> 00:13:57.349 why you shouldn't learn it. Also, there's no use for it NOTE Treffsikkerhet: 69% (MEDIUM) 00:13:57.349 --> 00:14:04.900 and there's nothing you can do with it. The only reason I'm displaying it is to show you 00:14:04.900 --> 00:14:10.550 that it includes these two symbols, which should now be familiar to you. NOTE Treffsikkerhet: 91% (H?Y) 00:14:10.550 --> 00:14:15.400 It's no coincidence that these symbols are there. 
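NOTE The formula on the slide does not appear in the captions. Assuming it is the usual density of a normal distribution, it would read as follows, where the two symbols are the Greek letters mu and sigma:
```latex
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```
Setting mu to 0 and sigma to 1 in this expression gives the standard curve whose areas were discussed above.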
NOTE Treffsikkerhet: 79% (H?Y) 00:14:15.900 --> 00:14:27.800 This particular curve is the one that we obtain for the values of mu (μ) equals zero and sigma (σ) equals 1. NOTE Treffsikkerhet: 90% (H?Y) 00:14:28.000 --> 00:14:36.700 And if you recall from our indices of central tendency and dispersion, we use this letter to refer 00:14:36.700 --> 00:14:45.100 to the mean of a population and this letter for referring to the standard deviation of a population. NOTE Treffsikkerhet: 91% (H?Y) 00:14:45.400 --> 00:14:48.900 So what this curve NOTE Treffsikkerhet: 91% (H?Y) 00:14:48.900 --> 00:14:51.599 can represent NOTE Treffsikkerhet: 91% (H?Y) 00:14:51.599 --> 00:15:03.800 is probabilities or proportions that relate to a set of measurements, a population of values that 00:15:03.800 --> 00:15:06.400 has a mean of zero NOTE Treffsikkerhet: 88% (H?Y) 00:15:06.400 --> 00:15:10.400 and a standard deviation of one. NOTE Treffsikkerhet: 91% (H?Y) 00:15:11.200 --> 00:15:16.550 The reason we are so interested in this curve NOTE Treffsikkerhet: 91% (H?Y) 00:15:16.550 --> 00:15:22.300 is that a lot of kinds of actual measurements we can make NOTE Treffsikkerhet: 80% (H?Y) 00:15:22.300 --> 00:15:32.650 do have this shape or can be transformed to approximate this shape. And when you have real data NOTE Treffsikkerhet: 77% (H?Y) 00:15:32.650 --> 00:15:38.800 and you plot their histogram and the histogram looks like this, NOTE Treffsikkerhet: 91% (H?Y) 00:15:38.800 --> 00:15:46.150 then this means that you can use your knowledge of the probabilities from the normal distribution NOTE Treffsikkerhet: 91% (H?Y) 00:15:46.150 --> 00:15:53.050 to make inferences about proportions in your population of measurements. NOTE Treffsikkerhet: 91% (H?Y) 00:15:53.050 --> 00:15:57.250 Let's make this a little bit more concrete. NOTE Treffsikkerhet: 87% (H?Y) 00:15:57.250 --> 00:16:10.700 Think of this axis as the number of standard deviations away from the mean. 
So 0 is 0 standard 00:16:10.700 --> 00:16:18.400 deviations away from the mean. So it is the mean, okay? If you're 0 away from μ, you are at μ. So 00:16:18.400 --> 00:16:20.500 this is the mean right here. NOTE Treffsikkerhet: 86% (H?Y) 00:16:20.500 --> 00:16:29.000 1 means one standard deviation above the mean, minus one means one standard deviation below the 00:16:29.000 --> 00:16:30.099 mean. NOTE Treffsikkerhet: 64% (MEDIUM) 00:16:30.099 --> 00:16:38.200 Two means two standard deviations above the mean. Minus 2 means two standard deviations below 00:16:38.200 --> 00:16:39.349 the mean. NOTE Treffsikkerhet: 91% (H?Y) 00:16:39.349 --> 00:16:49.600 Therefore, if we apply the numbers we mentioned before to the situation where this is not a 00:16:49.600 --> 00:16:56.900 mathematical abstraction, but the histogram of an actual population of lots and lots of 00:16:56.900 --> 00:17:02.850 observations, then this means that we can reasonably expect, NOTE Treffsikkerhet: 89% (H?Y) 00:17:02.850 --> 00:17:14.400 in fact, it has to be the case, that 68% of our observations will lie within one standard deviation 00:17:14.400 --> 00:17:18.650 away from the mean. So between minus 1 and 1. NOTE Treffsikkerhet: 91% (H?Y) 00:17:18.650 --> 00:17:21.550 Let me say this again. NOTE Treffsikkerhet: 72% (MEDIUM) 00:17:21.550 --> 00:17:32.450 In a variable that is approximately normally distributed, about 68 percent of the data will be 00:17:32.450 --> 00:17:37.300 within one standard deviation from the mean. NOTE Treffsikkerhet: 77% (H?Y) 00:17:38.600 --> 00:17:48.800 This means that about 32 percent of the data will be outside. This means they will be farther from 00:17:48.800 --> 00:17:52.050 the mean than one standard deviation. NOTE Treffsikkerhet: 90% (H?Y) 00:17:52.050 --> 00:18:02.150 So now you may begin to understand why we care about calculating a standard deviation. 
When I said 00:18:02.150 --> 00:18:09.400 in talking about indices of dispersion that we prefer to use the standard deviation when we can 00:18:09.400 --> 00:18:16.500 instead of the other possible indices of dispersion and especially instead of the mean absolute 00:18:16.500 --> 00:18:20.400 difference which seemed to be pretty similar superficially. NOTE Treffsikkerhet: 91% (H?Y) 00:18:20.400 --> 00:18:29.300 So the reason is if you use the standard deviation, then you can know a lot about proportions of 00:18:29.300 --> 00:18:36.699 your data that are expected to be within or outside any given range. NOTE Treffsikkerhet: 91% (H?Y) 00:18:36.699 --> 00:18:45.500 So, if your variable is approximately normally distributed about one-third, or thirty two percent of 00:18:45.500 --> 00:18:51.700 your data will be farther from the mean than one standard deviation. So what is the standard 00:18:51.700 --> 00:19:00.100 deviation then? It's the distance from the mean that contains about two-thirds of your data. That's 00:19:00.100 --> 00:19:03.000 why the standard deviation is so important. NOTE Treffsikkerhet: 87% (H?Y) 00:19:03.000 --> 00:19:12.500 It's the range within which you find about 68% of your data and that doesn't depend on anything 00:19:12.500 --> 00:19:19.300 about the nature of your data, about the mean or about the standard deviation itself. It only 00:19:19.300 --> 00:19:20.650 depends NOTE Treffsikkerhet: 91% (H?Y) 00:19:20.650 --> 00:19:27.400 on the measurements being normally distributed. It only depends on the histogram looking like this. 00:19:27.400 --> 00:19:33.700 If it does then you know that 68% of your data will be within one standard deviation away from the 00:19:33.700 --> 00:19:34.600 mean. NOTE Treffsikkerhet: 91% (H?Y) 00:19:34.600 --> 00:19:37.800 And therefore, if you apply the other numbers. 
NOTE Treffsikkerhet: 68% (MEDIUM) 00:19:37.800 --> 00:19:41.199 Ninety-five percent of your data NOTE Treffsikkerhet: 69% (MEDIUM) 00:19:41.199 --> 00:19:48.900 will be within two standard deviations from the mean, so no farther than 2 standard deviations 00:19:48.900 --> 00:19:50.100 away. NOTE Treffsikkerhet: 78% (H?Y) 00:19:50.100 --> 00:19:58.200 Between -2 and 2 standard deviations away from the mean, you have 95% of the data. And this means 00:19:58.200 --> 00:20:05.600 that five percent of the data are expected to be farther than 2 standard deviations away from the 00:20:05.600 --> 00:20:15.300 mean. So these proportions, these numbers that you have to learn, are so general that you can use 00:20:15.300 --> 00:20:20.550 them in very, very many situations, as we'll see later in the course, and that's why they're NOTE Treffsikkerhet: 74% (MEDIUM) 00:20:20.550 --> 00:20:27.900 important to know. So whenever you can reasonably justify calculation of a mean and a standard 00:20:27.900 --> 00:20:29.250 deviation, NOTE Treffsikkerhet: 91% (H?Y) 00:20:29.250 --> 00:20:37.300 whenever your data are approximately normally distributed, there is an intimate, mathematically based 00:20:37.300 --> 00:20:42.750 connection between standard deviations and proportions. NOTE Treffsikkerhet: 75% (MEDIUM) 00:20:42.750 --> 00:20:48.950 So 68% of your data are within one standard deviation away from the mean. NOTE Treffsikkerhet: 67% (MEDIUM) 00:20:48.950 --> 00:20:58.500 And 32% are outside, are farther from the mean than one standard deviation. 95% of the data are 00:20:58.500 --> 00:21:05.950 within two standard deviations from the mean and 5% are farther than that. NOTE Treffsikkerhet: 85% (H?Y) 00:21:05.950 --> 00:21:15.450 And about 99% of your data are going to be within two and a half standard deviations away from the mean. 00:21:15.450 --> 00:21:22.700 And about one percent of the data will be farther than two and a half standard deviations away from the 00:21:22.700 --> 00:21:23.750 mean. 
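NOTE The three sets of numbers are tied together by simple arithmetic, so remembering the "outside" percentage is enough to recover the rest. A small sketch in Python; the name from_outside and the dictionary OUTSIDE_PERCENT are made up for illustration, and the percentages are the rounded ones from the video.
```python
# Rounded percentage of data farther from the mean than k standard deviations.
OUTSIDE_PERCENT = {1: 32, 2: 5, 2.5: 1}
#
def from_outside(outside_percent):
    """Return (percent inside the range, percent in each single tail)."""
    inside = 100 - outside_percent   # what is left to complete 100
    each_tail = outside_percent / 2  # the curve is symmetric, so the tails are equal
    return inside, each_tail
#
for k, outside in OUTSIDE_PERCENT.items():
    inside, tail = from_outside(outside)
    print(k, "SD:", inside, "% inside,", outside, "% outside,", tail, "% per tail")
```
For one standard deviation this gives 68% inside and 16% per tail, for two it gives 95% and 2.5%, and for two and a half it gives 99% and 0.5%.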
NOTE Treffsikkerhet: 78% (H?Y) 00:21:23.750 --> 00:21:31.400 And you can find and see, you can find the values and see the areas that correspond to proportions 00:21:31.400 --> 00:21:38.750 for any number of standard deviations, using the application online you can find at this link. NOTE Treffsikkerhet: 91% (H?Y) 00:21:38.750 --> 00:21:44.750 And when it comes to a sample, instead of having the whole population. NOTE Treffsikkerhet: 90% (H?Y) 00:21:44.750 --> 00:21:55.700 Well, we just change the symbol and we can still reason in the same way. So if you're talking about 00:21:55.700 --> 00:22:04.600 a sample of data that you have collected, and this sample is normally distributed, then 68% of your 00:22:04.600 --> 00:22:10.750 measurements, can be expected to be within one sample standard deviation. So the 00:22:10.750 --> 00:22:13.050 standard deviation of your sample. NOTE Treffsikkerhet: 65% (MEDIUM) 00:22:13.050 --> 00:22:21.100 68% will be within one standard deviation. 32 percent will be beyond. 95 percent within two standard 00:22:21.100 --> 00:22:29.200 deviations and so on. So this is also true of samples. It's true of any set of values that are 00:22:29.200 --> 00:22:35.600 approximately normally distributed and that's why the normal distribution is so important.
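NOTE The claim about samples can be tried out on simulated data. This is a rough sketch in Python, assuming a made-up sample of 20,000 values drawn from a normal distribution with mean 170 and standard deviation 10; these numbers and the name fraction_within are invented for illustration and do not come from the video.
```python
import random
import statistics
#
def fraction_within(values, k):
    """Fraction of values no farther than k sample standard deviations from the sample mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)
#
# Hypothetical sample; the fixed seed only makes the run repeatable.
rng = random.Random(0)
sample = [rng.gauss(170.0, 10.0) for _ in range(20000)]
#
for k in (1, 2, 2.5):
    print(k, round(100 * fraction_within(sample, k), 1))
```
With a sample this large the printed percentages should land close to 68, 95 and 99. With a small sample, or with data that are not approximately normal, they can differ noticeably.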