WEBVTT Kind: captions; language: en-us NOTE Confidence: 91% (HIGH) 00:00:00.000 --> 00:00:09.200 In this video we'll talk about diagnostic performance. Evaluation of diagnostic performance is based 00:00:09.200 --> 00:00:16.400 on two-by-two contingency matrices that you are already familiar with, and we will start with a familiar 00:00:16.400 --> 00:00:24.900 example: the language screening example in which we have teachers completing a rating checklist 00:00:24.900 --> 00:00:29.950 pitted against a clinical evaluation using a specific NOTE Confidence: 87% (HIGH) 00:00:29.950 --> 00:00:37.500 standardised assessment test used to detect developmental language disorder. Each child is 00:00:37.500 --> 00:00:44.800 classified as either language impaired or non-impaired by each of the two instruments, so we have two 00:00:44.800 --> 00:00:53.000 categorical variables: a checklist classification as language impaired or non-impaired, and a clinical 00:00:53.000 --> 00:01:00.000 classification as language impaired or non-impaired. And these are the numbers of children NOTE Confidence: 78% (HIGH) 00:01:00.000 --> 00:01:08.400 that were classified in each of the two categories according to teacher ratings down the rows and 00:01:08.400 --> 00:01:14.300 according to the clinical evaluation across the columns. NOTE Confidence: 89% (HIGH) 00:01:14.300 --> 00:01:25.000 For example, there were 13 children who were rated as impaired on the basis of both the clinical and 00:01:25.000 --> 00:01:32.900 the teacher scale, and 85 children who were classified as non-impaired according to both 00:01:32.900 --> 00:01:38.750 instruments, out of a total of 149 children, NOTE Confidence: 79% (HIGH) 00:01:38.750 --> 00:01:48.200 and the sum of the agreements divided by the grand total indicated an agreement of 66 percent between 00:01:48.200 --> 00:01:50.400 the two instruments.
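As a rough illustration of the table just described, the following Python sketch stores the four cell counts and recomputes the agreement. Only 13, 85 and 149 are stated at this point in the video; the off-diagonal cells are derived here from the marginal totals given later (45 children flagged by teachers, 32 clinically impaired), and the variable names are illustrative assumptions.

```python
# Cell counts of the 2x2 table, keyed as (teacher rating, clinical evaluation).
# 13 and 85 are stated in the video; the off-diagonal cells are derived from
# the marginal totals mentioned later (45 flagged by teachers, 32 clinically impaired).
table = {
    ("impaired", "impaired"): 13,
    ("impaired", "non-impaired"): 45 - 13,
    ("non-impaired", "impaired"): 32 - 13,
    ("non-impaired", "non-impaired"): 85,
}

total = sum(table.values())
# Agreements are the cells where both instruments give the same label.
agreements = sum(n for (teacher, clinical), n in table.items() if teacher == clinical)
agreement = agreements / total

print(f"total={total}, agreements={agreements}, agreement={agreement:.0%}")
```

The derived cells sum back to the stated grand total of 149, which is a useful sanity check on the reconstruction.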
NOTE Confidence: 91% (HIGH) 00:01:50.400 --> 00:01:58.550 We have seen in a previous video that this amounts to a non-significant difference in proportions 00:01:58.550 --> 00:02:06.700 for this contingency matrix, which leads us to not reject the null hypothesis that the two variables 00:02:06.700 --> 00:02:15.700 are unrelated. Importantly, for that kind of analysis we treat these two kinds of classifications as 00:02:15.700 --> 00:02:18.250 just two variables. NOTE Confidence: 91% (HIGH) 00:02:18.250 --> 00:02:27.200 However, in certain instances, possibly including this one, the two different variables do not stand on 00:02:27.200 --> 00:02:37.400 equal footing. Instead, one of the two variables can be treated as a valid diagnosis, can be treated as 00:02:37.400 --> 00:02:46.800 the truth, in other words. So our reference variable is what is considered correct, and then we are 00:02:46.800 --> 00:02:48.400 essentially judging NOTE Confidence: 91% (HIGH) 00:02:48.400 --> 00:02:59.300 or evaluating the adequacy of the other instrument. So in this sense this would be the valid diagnosis, so we 00:02:59.300 --> 00:03:11.700 have 32 children who are language impaired and 117 children who are not language impaired. NOTE Confidence: 91% (HIGH) 00:03:11.700 --> 00:03:19.600 Against this background there are teacher ratings from these checklists, and children from the two 00:03:19.600 --> 00:03:27.200 categories, impaired children and non-impaired children, are classified based on these teacher 00:03:27.200 --> 00:03:31.550 ratings, resulting in these numbers. NOTE Confidence: 84% (HIGH) 00:03:31.550 --> 00:03:37.900 So now instead of talking about agreement we talk about accuracy. NOTE Confidence: 90% (HIGH) 00:03:37.900 --> 00:03:46.300 How accurate is this teacher rating instrument in classifying the children correctly into the 00:03:46.300 --> 00:03:49.600 impaired and non-impaired categories?
NOTE Confidence: 91% (HIGH) 00:03:49.600 --> 00:03:59.800 Accuracy is the total number of correctly classified children divided by the total; it's the same as 00:03:59.800 --> 00:04:06.800 what we called agreement before, when the two variables were seen more symmetrically. NOTE Confidence: 91% (HIGH) 00:04:06.800 --> 00:04:16.399 Now there's a clear asymmetry, so in this sense we can use detection terminology. If there is a valid 00:04:16.399 --> 00:04:26.200 classification taken to be true, we can evaluate any other instrument attempting to replicate that 00:04:26.200 --> 00:04:28.150 classification. NOTE Confidence: 91% (HIGH) 00:04:28.150 --> 00:04:36.100 So the terminology that is used in the detection literature is that when a case is correctly 00:04:36.100 --> 00:04:44.750 classified, so a case of language impairment is detected correctly, that is called a hit, so these are 00:04:44.750 --> 00:04:49.549 the hits: how many hits were counted NOTE Confidence: 91% (HIGH) 00:04:49.549 --> 00:04:59.100 using this rating instrument. When a case of language impairment is not detected it is missed, so 00:04:59.100 --> 00:05:01.500 these are the misses. NOTE Confidence: 83% (HIGH) 00:05:01.700 --> 00:05:09.600 When a non-case, so someone who is not language impaired, is misclassified as language impaired, that's 00:05:09.600 --> 00:05:14.300 called a false alarm or false positive. NOTE Confidence: 91% (HIGH) 00:05:14.300 --> 00:05:22.900 And when someone who's not impaired is correctly judged to be not impaired, that is called a correct 00:05:22.900 --> 00:05:28.850 rejection, so not detected as impaired and that is correct. NOTE Confidence: 91% (HIGH) 00:05:28.850 --> 00:05:38.100 So how is the diagnostic performance of the instrument evaluated based on the hits, misses, false 00:05:38.100 --> 00:05:44.750 alarms and correct rejections?
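The four detection terms can be summarised as a small lookup on two yes/no questions: is the child truly a case, and did the instrument flag them? The sketch below is one possible way to express that in Python; the function name `outcome` and the five-item sample are hypothetical and serve only as illustration.

```python
from collections import Counter

def outcome(is_case: bool, flagged: bool) -> str:
    """Name the detection outcome, given the reference truth and the instrument's call."""
    if is_case:
        return "hit" if flagged else "miss"
    return "false alarm" if flagged else "correct rejection"

# Hypothetical mini-sample of (is_case, flagged) pairs, for illustration only.
sample = [(True, True), (True, False), (False, True), (False, False), (False, False)]
tally = Counter(outcome(is_case, flagged) for is_case, flagged in sample)
print(tally)
```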
The first important index NOTE Confidence: 91% (HIGH) 00:05:44.750 --> 00:05:52.250 is the proportion of children who are impaired who are correctly detected. NOTE Confidence: 91% (HIGH) 00:05:52.250 --> 00:06:01.600 So how many of the impaired children are actually detected by this screening instrument, this 00:06:01.600 --> 00:06:03.000 checklist? NOTE Confidence: 87% (HIGH) 00:06:03.000 --> 00:06:14.350 We divide 13 by 32 and derive what is called the sensitivity of the instrument, and that's 41 percent. 00:06:14.350 --> 00:06:24.400 The teacher rating checklist only detects 41% of language impaired children. That is not a very 00:06:24.400 --> 00:06:30.450 high proportion, so sensitivity is relatively low for this instrument. NOTE Confidence: 91% (HIGH) 00:06:30.450 --> 00:06:40.400 The other index we can calculate is related to the children who are not impaired: of those children 00:06:40.400 --> 00:06:49.850 who are not impaired, what proportion are correctly classified as not impaired? So divide 85 by 117; 00:06:49.850 --> 00:06:58.150 this is called specificity: how specific is the detection provided by this instrument? NOTE Confidence: 90% (HIGH) 00:06:58.150 --> 00:07:07.200 If it is specific it will always reject those who aren't impaired; if it is nonspecific it will 00:07:07.200 --> 00:07:15.700 detect children who aren't impaired. In this case specificity is 73%. NOTE Confidence: 91% (HIGH) 00:07:17.200 --> 00:07:28.200 In other words, sensitivity concerns the correct classification of cases, where cases here means language 00:07:28.200 --> 00:07:37.000 impairment, and specificity concerns the correct classification of non-cases, those who are not 00:07:37.000 --> 00:07:42.500 language impaired.
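The two column-wise indices can be recomputed directly from the counts. The following Python sketch assumes the cell counts from the language screening example; the misses and false alarms are derived from the column totals (32 and 117), and the function names are illustrative.

```python
# Counts from the language screening example; misses and false alarms are
# derived from the column totals (32 impaired, 117 non-impaired).
hits, misses = 13, 32 - 13
correct_rejections, false_alarms = 85, 117 - 85

def sensitivity(hits: int, misses: int) -> float:
    """Proportion of true cases that the instrument detects."""
    return hits / (hits + misses)

def specificity(correct_rejections: int, false_alarms: int) -> float:
    """Proportion of non-cases that the instrument correctly rejects."""
    return correct_rejections / (correct_rejections + false_alarms)

sens = sensitivity(hits, misses)
spec = specificity(correct_rejections, false_alarms)
print(f"sensitivity={sens:.0%}, specificity={spec:.0%}")
```

Both denominators are column totals of the table, which is why these two indices describe the instrument relative to the reference diagnosis rather than relative to its own output.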
But this is not all the information we need. NOTE Confidence: 86% (HIGH) 00:07:42.500 --> 00:07:53.700 If we were to start using this instrument, these ratings, as a screening instead of the clinical 00:07:53.700 --> 00:08:00.900 evaluation, our reference variable which is considered to be true, then we wouldn't be seeing these 00:08:00.900 --> 00:08:07.950 numbers anymore. Instead we would be seeing the report of our new instrument. NOTE Confidence: 90% (HIGH) 00:08:07.950 --> 00:08:17.100 So we have to evaluate our instrument also on the basis of the marginal sums of the rows NOTE Confidence: 91% (HIGH) 00:08:17.100 --> 00:08:23.400 in order to see how dependable the outcomes are. NOTE Confidence: 86% (HIGH) 00:08:23.500 --> 00:08:33.400 Turning to the rows, we see that of the 45 children reported as language impaired by the teacher 00:08:33.400 --> 00:08:39.400 rating scale only 13 are actually impaired. NOTE Confidence: 83% (HIGH) 00:08:39.500 --> 00:08:47.900 If we divide these numbers we see that this is less than one-third, it's 29%. This is called the 00:08:47.900 --> 00:08:58.000 positive predictive value. Sometimes it's called precision. The precision of this screening is low, it is 00:08:58.000 --> 00:09:00.000 29 percent. NOTE Confidence: 89% (HIGH) 00:09:00.000 --> 00:09:08.800 This means that a child that is classified as language impaired by the instrument has only a 29 00:09:08.800 --> 00:09:12.800 percent probability of actually being impaired. NOTE Confidence: 86% (HIGH) 00:09:12.800 --> 00:09:16.400 Moving on to the second row, NOTE Confidence: 91% (HIGH) 00:09:16.400 --> 00:09:25.700 out of 104 children classified as non-impaired, only 85 were in fact non-impaired, so the negative 00:09:25.700 --> 00:09:29.600 predictive value of this test is 82 percent.
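The row-wise indices follow the same pattern, only with the row totals as denominators. This Python sketch assumes the counts given for the teacher rating rows (45 flagged as impaired, 104 flagged as non-impaired); the function names are illustrative.

```python
# Row totals from the video: 45 children flagged as impaired by the checklist,
# 104 flagged as non-impaired. False alarms and misses are derived from them.
hits, false_alarms = 13, 45 - 13
correct_rejections, misses = 85, 104 - 85

def positive_predictive_value(hits: int, false_alarms: int) -> float:
    """Proportion of flagged children who are truly cases (also called precision)."""
    return hits / (hits + false_alarms)

def negative_predictive_value(correct_rejections: int, misses: int) -> float:
    """Proportion of non-flagged children who are truly non-cases."""
    return correct_rejections / (correct_rejections + misses)

ppv = positive_predictive_value(hits, false_alarms)
npv = negative_predictive_value(correct_rejections, misses)
print(f"PPV={ppv:.0%}, NPV={npv:.0%}")
```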
NOTE Confidence: 91% (HIGH) 00:09:30.300 --> 00:09:38.700 The positive predictive value indicates the trustworthiness of case identification: if this 00:09:38.700 --> 00:09:41.800 instrument identifies a case, NOTE Confidence: 85% (HIGH) 00:09:41.800 --> 00:09:49.200 what is the proportion of those identified that are actual cases? The negative predictive value is the 00:09:49.200 --> 00:09:55.800 opposite: if the instrument identifies someone as a non-case, what is the probability that they 00:09:55.800 --> 00:09:58.800 actually aren't a case? NOTE Confidence: 83% (HIGH) 00:09:59.600 --> 00:10:09.500 These four indices express all the information that there is about a screening or diagnostic test, 00:10:09.500 --> 00:10:16.800 assuming that there is a reference classification that can be considered to be true. NOTE Confidence: 91% (HIGH) 00:10:16.800 --> 00:10:27.250 These indices cannot be interpreted in isolation; none of the indices that we encountered can be seen 00:10:27.250 --> 00:10:35.250 in isolation from the others, because there is a very important variable we haven't considered yet, 00:10:35.250 --> 00:10:42.400 which is the prevalence of the condition. The prevalence of the condition is the proportion of the 00:10:42.400 --> 00:10:46.100 population that actually has the condition NOTE Confidence: 57% (MEDIUM) 00:10:46.100 --> 00:10:48.349 we're trying to detect, NOTE Confidence: 88% (HIGH) 00:10:48.349 --> 00:10:55.650 and in order to be able to evaluate a test this is a very important parameter. NOTE Confidence: 91% (HIGH) 00:10:55.650 --> 00:11:04.700 Let's see why. Let's start with the first and easiest case, accuracy, which is a generally useless 00:11:04.700 --> 00:11:09.500 index for evaluating diagnostic performance.
NOTE Confidence: 91% (HIGH) 00:11:11.000 --> 00:11:19.400 Cervical cancer is fortunately a relatively rare condition. I looked up some numbers and it looks 00:11:19.400 --> 00:11:27.700 like it occurs in about one in 500 women in the United States. Imagine you make a test that you call 00:11:27.700 --> 00:11:35.600 a screening test for cervical cancer, but it's a completely bogus test and always says no, you don't 00:11:35.600 --> 00:11:41.650 have cancer. Because only 1 in 500 have cervical cancer, NOTE Confidence: 85% (HIGH) 00:11:41.650 --> 00:11:51.000 your test will actually be correct most of the time; it will be accurate in 499 out of 500 cases, 00:11:51.000 --> 00:11:59.700 which means it will have an excellent accuracy of 99.8%. Everyone will know your test is useless 00:11:59.700 --> 00:12:03.600 because its sensitivity is 0%; NOTE Confidence: 91% (HIGH) 00:12:03.600 --> 00:12:12.600 it doesn't detect any cases. So accuracy is a completely useless index, that's why we don't use it in 00:12:12.600 --> 00:12:20.300 diagnostic performance and we refer to the other values, but those other values also need some care 00:12:20.300 --> 00:12:28.400 in their interpretation. Let's stay on the topic of cervical cancer and look at Pap smears, which are 00:12:28.400 --> 00:12:31.599 the usual screening procedure. NOTE Confidence: 90% (HIGH) 00:12:31.599 --> 00:12:39.400 According to the literature the sensitivity of this procedure is somewhere between 50 and 75 percent, 00:12:39.400 --> 00:12:43.900 and the specificity is about 95%. NOTE Confidence: 79% (HIGH) 00:12:44.300 --> 00:12:55.500 The prevalence, 1 in 500, is 0.2%. Let us use the worst-case scenario first, where sensitivity is only 00:12:55.500 --> 00:12:57.700 50%.
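Before working through the Pap smear numbers, the accuracy problem of the always-negative test can be made concrete with a minimal Python sketch. It assumes a cohort of exactly 500 women containing one case, matching the 1-in-500 figure; the variable names are illustrative.

```python
population = 500   # illustrative cohort matching the 1-in-500 figure
cases = 1

# A test that always answers "no cancer" never flags anyone,
# so it produces no hits and no false alarms.
hits, false_alarms = 0, 0
misses = cases
correct_rejections = population - cases

accuracy = (hits + correct_rejections) / population
sensitivity = hits / (hits + misses)
print(f"accuracy={accuracy:.1%}, sensitivity={sensitivity:.0%}")
```

The high accuracy comes entirely from the large number of non-cases, which is why accuracy says little about whether a test detects anything.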
NOTE Confidence: 86% (HIGH) 00:12:57.700 --> 00:13:10.000 Out of 10,000 women with a prevalence of 0.2%, 20 will have cervical cancer, and the screening 00:13:10.000 --> 00:13:15.349 procedure at fifty percent sensitivity will detect 10 of them, NOTE Confidence: 86% (HIGH) 00:13:15.349 --> 00:13:24.800 so 10 will go undetected, they're misses, and the specificity of 95% means that 95% of these will indeed 00:13:24.800 --> 00:13:34.200 be correctly rejected. So these are the numbers for these indices, and what this means is that the 00:13:34.200 --> 00:13:36.750 positive predictive value NOTE Confidence: 80% (HIGH) 00:13:36.750 --> 00:13:47.500 is 2%. There are 509 out of 10,000 women who are classified as positive by 00:13:47.500 --> 00:13:56.300 the screening test, and of these only 10 actually have the condition. So the positive predictive value 00:13:56.300 --> 00:13:58.400 is 2 percent. NOTE Confidence: 90% (HIGH) 00:13:59.700 --> 00:14:07.200 The negative predictive value on the other hand is very high, it's 99.9 percent. NOTE Confidence: 91% (HIGH) 00:14:07.400 --> 00:14:18.100 What if we use the better scenario of 75 percent sensitivity? Then out of the 20 women that have the 00:14:18.100 --> 00:14:23.200 condition in the population of 10,000, 15 are detected. NOTE Confidence: 86% (HIGH) 00:14:23.200 --> 00:14:32.600 In this case the positive predictive value is 15 out of 514, which is almost three percent, and the 00:14:32.600 --> 00:14:37.350 negative predictive value is 99.95 percent. NOTE Confidence: 89% (HIGH) 00:14:37.350 --> 00:14:45.800 These numbers indicate that if you get a negative result you get a slight increase in your relief: 00:14:45.800 --> 00:14:54.000 you had a one in 500 chance, now it's less than one in a thousand.
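The Pap smear arithmetic above can be reproduced by turning prevalence, sensitivity and specificity into expected counts for 10,000 women. The following Python sketch assumes the figures quoted in the video (prevalence 0.2%, specificity 95%, sensitivity 50% or 75%); the function name `expected_counts` is hypothetical.

```python
def expected_counts(n, prevalence, sensitivity, specificity):
    """Expected hits, misses, false alarms and correct rejections among n people."""
    cases = n * prevalence
    non_cases = n - cases
    hits = cases * sensitivity
    correct_rejections = non_cases * specificity
    return hits, cases - hits, non_cases - correct_rejections, correct_rejections

results = {}
for sens in (0.50, 0.75):  # worst-case and best-case sensitivity from the video
    hits, misses, false_alarms, correct_rejections = expected_counts(
        10_000, 0.002, sens, 0.95
    )
    ppv = hits / (hits + false_alarms)
    npv = correct_rejections / (correct_rejections + misses)
    results[sens] = (ppv, npv)
    print(f"sensitivity={sens:.0%}: PPV={ppv:.1%}, NPV={npv:.2%}")
```

In both scenarios the roughly 499 expected false alarms among the 9,980 healthy women dominate the positives, which is what keeps the positive predictive value so low.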
If you get a positive result the 00:14:54.000 --> 00:15:01.200 doctor says don't worry and orders a second test and, if necessary, a follow-up biopsy, because the 00:15:01.200 --> 00:15:06.750 probability that there's actually something wrong, even in the best-case scenario, NOTE Confidence: 85% (HIGH) 00:15:06.750 --> 00:15:14.500 is only three percent. So you are vastly more likely to not have cervical cancer with a positive 00:15:14.500 --> 00:15:23.100 result on your screening test than to actually have it. And that's why doctors are not very worried 00:15:23.100 --> 00:15:33.100 by positive results. Let us look at a different example, COVID-19 antibody testing. Antibody tests tell 00:15:33.100 --> 00:15:36.650 whether you had the disease so that you now have NOTE Confidence: 91% (HIGH) 00:15:36.650 --> 00:15:39.400 antibodies in your body. NOTE Confidence: 91% (HIGH) 00:15:39.400 --> 00:15:48.000 The sensitivity varies greatly because it's very low in the beginning and increases a few weeks 00:15:48.000 --> 00:15:56.400 after you've had the symptoms, reaching its maximum a couple of weeks after you've had symptoms. These 00:15:56.400 --> 00:16:01.900 tests tend to have quite high specificity, at 98%. NOTE Confidence: 90% (HIGH) 00:16:01.900 --> 00:16:12.300 The prevalence in this case is the proportion of people who had the disease and now have antibodies 00:16:12.300 --> 00:16:15.650 for COVID-19 in their bloodstream. NOTE Confidence: 70% (MEDIUM) 00:16:15.650 --> 00:16:23.200 So at a relatively early stage in the pandemic, where only one percent of people have had the 00:16:23.200 --> 00:16:32.500 disease, and under the assumption of relatively late testing following symptoms, so a high sensitivity 00:16:32.500 --> 00:16:40.900 of 90%, these are the relevant numbers for a hypothetical 10,000 people who've been tested.
NOTE Confidence: 91% (HIGH) 00:16:40.900 --> 00:16:49.750 If only one percent of the population have antibodies, this means that picking ten thousand people 00:16:49.750 --> 00:16:58.600 will result in 100 of them having antibodies, and a sensitivity of 90% means that 90 of them will be 00:16:58.600 --> 00:17:01.800 correctly detected; these are our hits. NOTE Confidence: 91% (HIGH) 00:17:01.800 --> 00:17:10.800 Specificity of 98% means that 98% of those not having antibodies will correctly be rejected 00:17:10.800 --> 00:17:11.949 here. NOTE Confidence: 91% (HIGH) 00:17:11.949 --> 00:17:25.800 The positive predictive value of this test is 31.3%, so of those reported to have antibodies, and 00:17:25.800 --> 00:17:29.800 therefore considered immune, NOTE Confidence: 82% (HIGH) 00:17:29.800 --> 00:17:42.050 only one-third will actually be immune. The rest will be false alarms, because the specificity is not 00:17:42.050 --> 00:17:53.000 perfect, it's only 98 percent, but the low prevalence makes this number higher than this number. NOTE Confidence: 91% (HIGH) 00:17:53.700 --> 00:17:57.900 So in the early stages of the pandemic NOTE Confidence: 89% (HIGH) 00:17:57.900 --> 00:18:07.600 a positive antibody test only gives you a 31 percent chance, less than a one-in-three chance, of actually being 00:18:07.600 --> 00:18:08.800 immune. NOTE Confidence: 91% (HIGH) 00:18:08.800 --> 00:18:14.000 The negative predictive value on the other hand is quite high, NOTE Confidence: 89% (HIGH) 00:18:14.000 --> 00:18:23.300 but then you were quite likely to not have antibodies before anyway, because we're talking about 00:18:23.300 --> 00:18:25.300 early stages.
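The same result can be obtained without picking a sample size, by weighting sensitivity and specificity with the prevalence directly. The Python sketch below is one way to write this, assuming the early-pandemic figures from the video (prevalence 1%, sensitivity 90%, specificity 98%); the function names `ppv` and `npv` are illustrative.

```python
def ppv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Probability of truly having the condition given a positive result."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

def npv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Probability of truly not having the condition given a negative result."""
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    return true_neg / (true_neg + false_neg)

# Early-stage antibody testing scenario from the video.
early_ppv = ppv(0.01, 0.90, 0.98)
early_npv = npv(0.01, 0.90, 0.98)
print(f"PPV={early_ppv:.1%}, NPV={early_npv:.1%}")
```

With 10,000 people this corresponds to 90 hits against 198 false alarms, so the false alarms outnumber the hits by about two to one.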
NOTE Confidence: 71% (MEDIUM) 00:18:26.100 --> 00:18:34.600 In later stages of the pandemic, assuming that 50 percent of the population has actually contracted 00:18:34.600 --> 00:18:42.800 the disease and that enough weeks have passed after the symptoms that the sensitivity we can work with 00:18:42.800 --> 00:18:51.200 is still high, if you tested 10,000 people in the population now you'd expect half of them to have 00:18:51.200 --> 00:18:52.800 antibodies, NOTE Confidence: 89% (HIGH) 00:18:52.800 --> 00:19:01.750 detecting 90% of them, and correctly rejecting 98% of those not having contracted it yet. NOTE Confidence: 80% (HIGH) 00:19:01.750 --> 00:19:10.600 So the positive predictive value is 97.8%, and the negative predictive value is 90.7 percent; these 00:19:10.600 --> 00:19:16.200 are both high numbers that can be trusted with much higher confidence. NOTE Confidence: 85% (HIGH) 00:19:17.000 --> 00:19:25.400 What about the rapid antigen tests? These are tests that are meant to show if you have just contracted 00:19:25.400 --> 00:19:33.700 the disease, if you now have the virus, not if you had the sickness in the past. These aren't the 00:19:33.700 --> 00:19:41.100 PCR tests that are considered to be extremely accurate; these are the rapid tests that everyone can 00:19:41.100 --> 00:19:48.000 take, that very quickly indicate whether you've contracted the virus or not. NOTE Confidence: 90% (HIGH) 00:19:48.000 --> 00:20:01.200 The best among these tests have a sensitivity around 95% and a very high specificity, about 99.5%. The 00:20:01.200 --> 00:20:09.699 prevalence that is relevant for this calculation is not known, because it depends on who gets tested.
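Since the relevant prevalence depends on who gets tested, it can help to see how the predictive values move as prevalence changes while the test itself stays the same. The Python sketch below reuses the antibody test figures from the video (sensitivity 90%, specificity 98%); the 1% and 50% prevalences are the two scenarios discussed, while the 10% point and the function name `predictive_values` are assumptions added for illustration.

```python
SENSITIVITY, SPECIFICITY = 0.90, 0.98  # antibody test figures from the video

def predictive_values(prevalence, sensitivity=SENSITIVITY, specificity=SPECIFICITY):
    """Return (PPV, NPV) for a given prevalence, as population proportions."""
    true_pos = prevalence * sensitivity
    false_neg = prevalence - true_pos
    true_neg = (1 - prevalence) * specificity
    false_pos = (1 - prevalence) - true_neg
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)

# 1% and 50% are the two scenarios from the video; 10% is an extra illustrative point.
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(prevalence)
    print(f"prevalence={prevalence:.0%}: PPV={ppv:.1%}, NPV={npv:.1%}")

late_ppv, late_npv = predictive_values(0.50)
```

As prevalence rises, the positive predictive value increases and the negative predictive value decreases, even though sensitivity and specificity are held fixed.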
NOTE Confidence: 79% (HIGH) 00:20:09.699 --> 00:20:18.150 So if we assume that five percent of those who get tested actually have contracted the virus, so if 00:20:18.150 --> 00:20:27.400 only people who've been somehow exposed to the virus or have symptoms or have some reason to suspect 00:20:27.400 --> 00:20:35.200 that they've been exposed to the virus, that they've contracted it, get tested, such that 5% of 00:20:35.200 --> 00:20:39.900 those taking the test actually have the virus, then we can work NOTE Confidence: 82% (HIGH) 00:20:39.900 --> 00:20:45.200 with these numbers. Five percent of 10,000 means 500 people NOTE Confidence: 84% (HIGH) 00:20:45.200 --> 00:20:56.600 with the virus out of the 10,000; the sensitivity of 95% will mean that we detect 475 of them, NOTE Confidence: 91% (HIGH) 00:20:56.600 --> 00:21:09.300 and so the positive predictive value of the test in this population will be 475 divided by 522, which is very 00:21:09.300 --> 00:21:19.800 high, it's 91%, and of course it helps that the specificity is so high that only a few false positives 00:21:19.800 --> 00:21:24.750 here can trickle up from the negative side. NOTE Confidence: 91% (HIGH) 00:21:24.750 --> 00:21:29.600 And the negative predictive value is also very high. NOTE Confidence: 91% (HIGH) 00:21:29.900 --> 00:21:35.800 What if you were to test the whole population? NOTE Confidence: 90% (HIGH) 00:21:35.800 --> 00:21:44.400 The number of active cases at a given time point over the whole population is relatively small as 00:21:44.400 --> 00:21:52.450 long as the pandemic is under control, so looking at the number of active cases we can estimate a 00:21:52.450 --> 00:22:03.000 prevalence of around 0.05% at any given time.
Now, with the same test characteristics, NOTE Confidence: 85% (HIGH) 00:22:03.800 --> 00:22:12.700 if we test a million people indiscriminately, without any symptoms or any reason to suspect that they 00:22:12.700 --> 00:22:17.949 contracted the virus, and just go out in a population with this prevalence, NOTE Confidence: 83% (HIGH) 00:22:17.949 --> 00:22:24.150 we get a positive predictive value of only 8.7%, NOTE Confidence: 91% (HIGH) 00:22:24.150 --> 00:22:32.600 because even with such a high specificity the low prevalence means that there are enough false 00:22:32.600 --> 00:22:37.700 positives that they are actually many more than the true positives. NOTE Confidence: 91% (HIGH) 00:22:37.900 --> 00:22:49.200 The negative predictive value is very, very high, but if you get a positive rapid antigen test in 00:22:49.200 --> 00:22:57.800 widespread population testing, your chances of actually having contracted the virus are fairly small. NOTE Confidence: 88% (HIGH) 00:22:57.800 --> 00:23:04.700 So this is an argument why widespread testing over whole populations with these kinds of tests is 00:23:04.700 --> 00:23:07.000 not advised. NOTE Confidence: 83% (HIGH) 00:23:08.500 --> 00:23:18.600 So these four indices, sensitivity, specificity, positive and negative predictive value, are the 00:23:18.600 --> 00:23:26.199 important numbers you need to keep in mind whenever you hear about a diagnostic or screening test 00:23:26.199 --> 00:23:33.700 for any condition, whether they're medical or educational or anything else.
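The contrast between targeted and population-wide rapid antigen testing can be checked with a short Python sketch. It assumes the figures quoted in the video (sensitivity 95%, specificity 99.5%, prevalence 5% among 10,000 targeted tests versus 0.05% among a million indiscriminate tests); the function name `screen` is hypothetical, and the expected counts are left unrounded.

```python
def screen(n_tested, prevalence, sensitivity=0.95, specificity=0.995):
    """Expected true positives, false positives and PPV for one testing scenario."""
    infected = n_tested * prevalence
    true_pos = infected * sensitivity
    false_pos = (n_tested - infected) * (1 - specificity)
    return true_pos, false_pos, true_pos / (true_pos + false_pos)

targeted = screen(10_000, 0.05)       # people with symptoms or known exposure
mass = screen(1_000_000, 0.0005)      # indiscriminate population-wide testing

for label, (true_pos, false_pos, ppv) in (("targeted", targeted), ("mass", mass)):
    print(f"{label}: true positives={true_pos:.1f}, "
          f"false positives={false_pos:.1f}, PPV={ppv:.1%}")
```

Both scenarios yield the same expected number of true positives, about 475, but the population-wide scenario adds roughly a hundred times as many false positives, which is what drives the positive predictive value down.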