## November 19, 2003

The car was 51.7% blue and 48.3% green. I will GO TO THE GRAVE without admitting that I am wrong here, even though I have horrible instincts when it comes to guessing about probability.

This seems like a trick question to me. If the witness has an 80% chance of correctly identifying either color and a 20% chance of failure, then the chance that the cab was blue is 80%, right? The last paragraph seems like a mistake, in that the witness said it was blue.

So what's the so-called right answer?

No mistakes. It won't be any fun to answer until we have a few more "opinions" on this one. We should send this to Overdeck when we are ready for the definitive tutoring.

Let's consider two extremes.
#1 Your reliability as a witness is 100% perfect. Therefore the odds that the cab was blue are 100%.
#2 You are virtually blind and a liar to boot. Therefore the odds that the cab actually was blue are 15%.

You would expect a linear relationship, yes? So interpolate between the two extremes. The answer is 15 + (80x85)/100 = 83%

It's easier to see this relationship geometrically. Graph it with the X-axis as your reliability, and the Y-axis as your likelihood of being correct. Locate the two end conditions that you know to be true and draw a straight line.

I agree that what makes probability questions so fascinating is exactly that people have such a hard time grasping the fundamentals of probability theory. Still, I think that once again you have posted a question that is at best incomplete and at worst deceptive.

I tend to think of probability as a guide for making future decisions given that we have incomplete and unknown information. It doesn't mean that any particular decision will be correct. But, in the long run, if we make more choices that have high probability of success we will be better off.

This problem, however, isn't speaking to future occurrences. It's asking about the outcome of a specific event. And so the question is flawed. The cab is with 100% certainty either blue or green. No amount of rolling of dice is going to change that. The question, then, is really what is the probability that our investigative methods will uncover the truth.

Here are a few ways of analyzing the problem.

I. The Legal Solution

The probability doesn't matter. There was an accident and an eye witness. The witness is more likely than not to have correctly identified the cab. If the jury finds the witness credible, this is certainly enough to win a judgment for the plaintiff in a civil case and you might have a good shot at winning a criminal case as well.

The cab was blue.

II. The Trick Question Solution

Without violating the assumptions of the question, I declare that the Blue cab company has never been involved in an accident. It has prided itself on being the number one cab company for customer service. And part of that customer service is not being involved in accidents. Each cab driver must go through an intensive interview process where they demonstrate real-world handling of driving situations on a whiteboard. An elite set of interviewers makes sure that each and every driver hired raises the bar. Without any physical evidence to the contrary (a scrape of blue paint on the street or a blue cab in a repair shop), the witness was mistaken.

The cab was green.

III. The "Correct" Solution

I think this is the question you were asking.

Let us suppose that green and blue cabs are equally likely to get into an accident. Let us further suppose that both companies run similar shifts and work in the same parts of town. In that case, the chance that the cab involved in some hypothetical future accident belongs to a given company is approximately proportional to the number of that company's cabs on the road. P(B) = 0.15, P(G) = 0.85.

The witness identified the cab as blue. This is possible if either the cab was green and he got it wrong, or the cab was blue and he got it right. We have then P(WB) = 0.2*P(G) + 0.8*P(B) = 0.29.

The probability that the witness identifies the cab as blue and that the cab is in fact blue is P(WC) = 0.8*P(B) = 0.12.

And so finally, this means that the probability that the cab was blue given the witness identified it as blue is P(B|WB) = P(WC) / P(WB) = 0.41.

If this particular witness sees a random accident and identifies the cab as blue, there is only a 41% chance that the cab was in fact blue.
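
The arithmetic in solution III can be checked with a few lines of Python. This is only a sketch under the same assumptions as above (15% blue, 85% green, and a witness who is right 80% of the time regardless of color); the variable names are made up for illustration.

```python
# Assumptions taken from the problem as used in solution III.
p_blue = 0.15            # P(B): share of blue cabs
p_green = 0.85           # P(G): share of green cabs
p_correct = 0.80         # witness names the right color
p_wrong = 1 - p_correct  # witness names the wrong color

# P(WB): witness says "blue", either correctly or by mistaking a green cab.
p_says_blue = p_wrong * p_green + p_correct * p_blue

# P(WC): the cab is blue and the witness says "blue".
p_blue_and_says_blue = p_correct * p_blue

# P(B|WB): the cab is blue given that the witness says "blue".
p_blue_given_says_blue = p_blue_and_says_blue / p_says_blue

print(round(p_says_blue, 2))             # 0.29
print(round(p_blue_and_says_blue, 2))    # 0.12
print(round(p_blue_given_says_blue, 2))  # 0.41
```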

This is pretty bad. It means that the odds of a green cab being involved in an accident are so much higher than a blue cab being involved in an accident (simply because of the relative sizes of the fleets) that it is more probable that the witness mistook a green cab for blue than he correctly recognized a blue cab.

But we aren't necessarily any closer to understanding which cab actually was involved in the accident we are investigating....

Let's not get too worked up, folks. The question was: What is the probability that the cab involved in the accident was Blue rather than Green? It doesn't ask for facts. No one is going to court. And the example comes from a well-known paper on how people perceive probability (it isn't my example, other than I put it on my blog).

So 4 reader comments, 4 different answers. And the title of the problem and intro pretty much tip you off to the issue.

"The cab is with 100% certainty either blue or green."

This statement is obviously wrong. I'm going to have to go with 80% odds that the car was blue.

Witness says Blue. So either he is correct, or he is wrong and the car was green.

1. He is correct, the car is blue. This happens with probability 0.15*0.8=0.12

2. He is wrong, the car is green. This happens with probability 0.2*0.85=0.17

Witness says blue, so it's either 1 or 2 with probability 1. He is correct and the car is blue with probability 0.12/(0.12+0.17) = 12/29 or approx. 41.3%

Which is basically what Robert labels the "correct solution" above, I guess.

Mats & Bob get the "correct" answer here - it's another example of conditional probability rather than simple probability and people ignoring the base rate. Eben points out a funny message board that talks about these problems -- and similar to what I'm prone to doing -- the guy explaining "base rate" problems gets schooled by someone else for falling into the same problem. It's worth a read:

Yeah, my mistake was that I imagined 100 cars in accidents, 15 of which were blue, and divided that by the number that the witness would've guessed were blue, 29. Seemed reasonable at the time. It was those 3 pesky blue cars in the 15 that the witness would've called green that got me. Drats! Foiled again by base rates!
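
The 100-car picture can be written out as a quick count. This is just a sketch, assuming 100 accidents split 15/85 and a witness who is right 80% of the time for either color; the names are invented for the example.

```python
from fractions import Fraction

accidents = 100
blue = 15                    # blue cabs among the 100 (assumed base rate)
green = accidents - blue     # 85 green cabs

accuracy = Fraction(4, 5)    # witness assumed right 80% of the time

blue_called_blue = blue * accuracy            # 12 blue cabs called blue
blue_called_green = blue - blue_called_blue   # the 3 blue cabs called green
green_called_blue = green * (1 - accuracy)    # 17 green cabs called blue

called_blue = blue_called_blue + green_called_blue  # 29 cabs called blue

# Dividing ALL blue cabs by the 29 "blue" calls gives the mistaken answer.
mistaken = blue / called_blue                 # 15/29, about 51.7%

# Only blue cabs that were actually called blue belong in the numerator.
correct = blue_called_blue / called_blue      # 12/29, about 41.4%

print(mistaken, correct)
```

The only difference between the two ratios is whether the 3 blue cabs the witness would have called green are counted in the numerator.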

Here's a question for you. If there were 100 policemen in the city and 75 of them fell for the base rate fallacy 90% of the time, and the other 25 only fell for it 10% of the time... oh never mind.

Ooh. I'm glad I got that one right. I was tossing and turning in bed all last night drawing imaginary Venn diagrams in my head. I still feel a little uneasy about the solution, even though I think I sort of get it.

So what are the lessons that we learn from this problem? Clearly a test with a high degree of success is not sufficient to identify a rare case. Nevertheless, the witness did add value. Absent the witness, the odds of a blue cab being involved in the accident were only 15%. With the witness's identification, the odds jump to 41%.
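
To see how much the base rate matters, here is a rough sketch that repeats the same calculation for a few different fleet shares. The 80% accuracy is from the problem; the 1% and 50% priors are hypothetical values added only for comparison, and `posterior_blue` is a made-up helper name.

```python
def posterior_blue(prior_blue, accuracy=0.8):
    """P(cab is blue | witness says blue), by Bayes' rule."""
    says_blue_and_blue = accuracy * prior_blue
    says_blue_and_green = (1 - accuracy) * (1 - prior_blue)
    return says_blue_and_blue / (says_blue_and_blue + says_blue_and_green)

# 0.15 is the problem's base rate; 0.01 and 0.50 are hypothetical.
for prior in (0.01, 0.15, 0.50):
    print(f"prior {prior:.2f} -> posterior {posterior_blue(prior):.2f}")
# prior 0.01 -> posterior 0.04
# prior 0.15 -> posterior 0.41
# prior 0.50 -> posterior 0.80
```

In every row the witness raises the odds above the prior, but the 80% figure only shows up as the answer when the two colors are equally common.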

So, Josh, if you were investigating the hit and run report, would the witness's statement be enough to compel you to ask for a search warrant against the Blue Cab company?

If your goal is to solve the case, from the start, you'd have to consider looking at both; if you understand the base rate, you'll know you need to look at Green as well as Blue.

Maybe a better criminal investigation example of this problem is the OJ case. I don't remember the details, but the prosecution team made efforts to show that DNA evidence showed OJ had to be the killer -- the odds were in the one to 170 million range.

The defense worked to cast doubt on probability theory and to whittle the odds down to 1 in 1 million range -- saying 4-5 people in Los Angeles would match the DNA evidence. Now I have no idea if this would have been effective, but if conditional probability was better understood and the DNA evidence seen in context -- by this I mean:
- the odds your wife will be murdered are n/m
- the odds your wife is battered by her husband are y/e
- the odds that your wife's blood is on the socks of the husband are u/w
- the conditional probability of all 3 is . . .

Actually, that probably is what they did, the jurors got confused by statistics, and the rest is history.

More than just calculating the probabilities above, I would like to add my point of view of the psychological "fallacies" in this field. I don't believe they are real. We are too good at intuitively processing information, and hence we don't need and don't have (unless we've studied lots of math + statistics + prob. at school) language tools to deal with them.

In this Taxi example, the townspeople who know their city probably recognize most cars as being green, simply because their perception expects cars to be green. They only think they see a blue car if they really can see an actually blue car clearly in broad light. Otherwise, they "see" it as green.

Hence, the laboratory experiment telling that a witness confuses colors in X% of the cases will not be representative for a citizen who has been living in a town with mostly green cars. Her or his perception will be pre-conditioned with the base rate. If she or he says it is a blue car in the town, it will hence be more probable that he or she is right than if he or she says he or she sees a green car in the town. The laboratory experiment will be useless!

http://blogofpandora.blogspot.com/2003_08_01_blogofpandora_archive.html#106192715588179871

(link may fail in Internet Explorer 6.0; if so, shrinking the browser window will mysteriously fix it)

Hmm I can't get past 83%. First, there is an assumption that the witness is correct 80% of the time, so the car is at least 80% blue. Furthermore, if the witness was incorrect, then he or she cannot be trusted. The only information we have to act upon, then, is that, of the remaining unknown 20%, 15% is blue. 15% of 20 percentage points is 3, and 80 + 3 = 83%.

It seems I may be insane, but this is logical to me.

I am not a statistician, so I could be wrong, but here is my take on this:

One of our assumptions is that the witness is 80% accurate at distinguishing which colour of car was involved in the accident.

Later, we deduce from the assumptions that there is a 41% chance that the car involved in this accident is blue.

The witness declared the car was blue.

Can we not conclude that there is a 41% chance that the witness is correct in his statement that the car was blue, and therefore the situation is reduced to absurdity?

It seems to me that the core contradiction comes from a misunderstanding of the test. When we judge the test is 80% accurate, we have done so from empirical data. One of the rules governing the collection of such data is that it should be done using a representative sample of the population.

As such, when we set out to judge the test, the base rate should be included in the judgement. e.g. the witness should be asked to judge situations in which 15% blue cars and 85% green cars were involved.

Once we have such data in our grasp, the question becomes:

IF witness states car was blue AND witness accurate THEN the car was blue

Clearly the probability of the witness stating the car was blue is 1, and the probability the witness was accurate is 0.8, so the probability of the conjunction of the two is 0.8.

Is this a fair judgement of the problem?

I am writing because, although two people arrived at what I think
is the correct answer, their explanations are inadequate.

These types of problems trip up even people (like me) who have
studied probability theory systematically, if they don't do
this type of exercise regularly.

You need some notation and to know some basic formulas.

The notation:

P(A) means the probability of A. P(AB) means the probability
of A and B: of both happening. P(A|B) means the probability
of A given B. Called a conditional probability, it means the
probability of A happening under the assumption that B also
happens.

The formulas:

These, as well as the notation above, come out of any beginning
text in probability theory. The proofs are not hard, but if you
want to know why they are true, look it up.

P(AB) = P(A)P(B)..............................(1)

if and only if A and B are independent events:
if one being true has no bearing on the other being true, like
two successive coin flips. If the events are not independent,
the more general formula is

P(AB) = P(A)P(B|A)............................(2)

in other words, for both to be true, A must happen and B must
also happen under the condition that A happens.

P(A OR B) = P(A) + P(B).......................(3)

if and only if A and B are disjoint (either one or the other
happens: P(AB) = 0).

P(A|B) = P(AB)/P(B)...........................(4)

All these have natural extensions to any number of events; I've
given the special cases of just two events because that's all
we need for this problem.

Now to this exercise:

Definitions:

P(B) = probability that car is blue.
P(G) = probability that car is green.
P(S) = probability that witness identifies car as blue.
P(C) = probability that witness is correct.
P(I) = probability that witness is incorrect.

We need to find P(B|S): the probability that the car is blue GIVEN
that the witness identifies it as blue. Now we just have to apply
our formulas.

Calculation:

P(B) = 0.15 (given)
P(G) = 0.85 (given)
P(C) = 0.8 (given)
P(I) = 0.2 (right or wrong are only choices)

P(S) = P(B) P(C) + P(G) P(I) = 0.29

using (1) and (3). Now using (4),

P(B|S) = P(BS)/P(S)

and we have to calculate the numerator of the right hand side.
Using (2), this is

P(BS) = P(B) P(S|B)

and we know that P(S|B) is just P(C) = 0.8. Putting this together, we get

P(B|S) = P(B) P(S|B)/P(S) = 0.41

which is what we wanted to find.

It is easy to go wrong by equating P(BS) with P(B) P(S); this is wrong
because B and S are not independent: the witness identifying the car
as blue is influenced by whether the car is actually blue! That's why
we need to use the more general formula (2).
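
A small exact check of that point, as a sketch only: it uses the
numbers given in the problem and Python's fractions module, and the
variable names are invented to mirror the definitions above.

```python
from fractions import Fraction

p_b = Fraction(15, 100)      # P(B), given
p_g = Fraction(85, 100)      # P(G), given
p_c = Fraction(8, 10)        # P(C), given
p_i = 1 - p_c                # P(I)

p_s = p_b * p_c + p_g * p_i  # P(S) = 29/100
p_bs = p_b * p_c             # P(BS) = P(B) P(S|B), formula (2)
naive = p_b * p_s            # P(B) P(S), formula (1) misapplied

print(p_bs, naive)           # 3/25 87/2000
print(p_bs / p_s)            # 12/29
print(naive / p_s)           # 3/20, which is just P(B) again
```

With the independence formula the witness drops out entirely and the
answer collapses back to the 15% base rate.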

Why did I take so much space to explain this? Because I wanted to show
how you can solve this by using nothing more than the axioms and simple
theorems of discrete probability theory, without any extra assumptions about
linear relationships, intuition, or arbitrary arithmetic procedures that
have no formal justification. In fact, trying to apply intuition to these
types of problems will lead you astray very quickly: that's why the
formal theory is useful.
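
As a rough illustration, the sketch below compares the straight-line
interpolation suggested earlier in the comments with formula (4) at a
few witness reliabilities. The 15% base rate is from the problem; the
function names and the reliabilities other than 0.8 are assumptions
added for the comparison.

```python
def bayes_posterior(reliability, prior_blue=0.15):
    """P(blue | witness says blue), computed from formula (4)."""
    p_s = prior_blue * reliability + (1 - prior_blue) * (1 - reliability)
    return prior_blue * reliability / p_s

def linear_guess(reliability, prior_blue=0.15):
    """The straight-line interpolation proposed earlier in the thread."""
    return prior_blue + reliability * (1 - prior_blue)

# Reliabilities other than 0.8 are hypothetical, for comparison only.
for r in (0.5, 0.8, 0.9, 1.0):
    print(f"reliability {r:.1f}: "
          f"formula {bayes_posterior(r):.3f}, "
          f"straight line {linear_guess(r):.3f}")
```

At reliability 0.8 the formula gives about 0.41 while the straight
line gives 0.83; the two agree at reliability 1. A witness who is
right half the time adds no information, and the formula returns the
15% base rate there.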
