Answer by Marc Ettlinger, Ph.D., linguistics, U.C. Berkeley:
English speakers don't actually differentiate "th" and "f" all that well. Indeed, in certain speech perception tests, native English speakers can perform as poorly as random guessing in distinguishing "th" and "f" because it's one of the most difficult contrasts in English.
That should be clear when you look at the spectrograms:
While "s" and "sh" have pretty clear differences in the amount of energy in the mid- to upper part of the spectrum, "th" and "f" are barely distinguishable, corroborating what we find in perception tests. The nature of some of these tests gives us insight into how this contrast is perceived.
First of all, people have done tests juxtaposing purely auditory stimuli with auditory-plus-visual stimuli. You'll notice that although this is one of the most difficult perceptual distinctions in language, it is also among the easiest visual distinctions. They're made with the lips, which we can see, but in different positions. Indeed, seeing the lips accounts for about a 20-30 percent difference in performance, all other things being equal.
Second, people are particularly bad at this contrast when any noise is present or when they have any hearing loss. This is because the acoustic differences are primarily in the upper part of the speech spectrum (see figure above), and the upper part of the spectrum is where noise and hearing loss are particularly problematic. So, your typical elderly person with mild hearing loss will perform around guessing level for out-of-context "th" and "f," too.
Given those perceptual challenges, we English speakers clearly use an appreciable amount of context in differentiating these sounds.
Luckily (but not coincidentally), the English language facilitates that. The functional load for this contrast is relatively low compared to other contrasts, meaning that, aside from pairs like "thin" and "fin," there aren't too many words that critically rely on differentiating these sounds.
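One crude way to gauge how much work a contrast does in a language (often called its functional load) is to count the minimal pairs that depend on it: word pairs that differ in that one sound and nothing else. Here is a minimal sketch of that counting in Python. The word list, the rough transcriptions and the function name are all made up for illustration, so the counts show only how the method works and say nothing about real English. Proper measures also take word frequency into account.

```python
from itertools import combinations

# Hypothetical toy lexicon: spelling -> rough phoneme sequence.
# Hand-picked for illustration only; not a sample of real English.
TOY_LEXICON = {
    "thin": ("th", "i", "n"),
    "fin": ("f", "i", "n"),
    "sin": ("s", "i", "n"),
    "shin": ("sh", "i", "n"),
    "sip": ("s", "i", "p"),
    "ship": ("sh", "i", "p"),
    "tin": ("t", "i", "n"),
}


def minimal_pairs(lexicon, a, b):
    """Return word pairs whose transcriptions differ in exactly one
    position, with phoneme `a` in one word and `b` in the other."""
    pairs = []
    # Sort so the output order is stable regardless of dict order.
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        if len(p1) != len(p2):
            continue  # different lengths cannot be a one-segment swap
        diffs = [(x, y) for x, y in zip(p1, p2) if x != y]
        if len(diffs) == 1 and set(diffs[0]) == {a, b}:
            pairs.append((w1, w2))
    return pairs


if __name__ == "__main__":
    print("th/f:", minimal_pairs(TOY_LEXICON, "th", "f"))
    print("s/sh:", minimal_pairs(TOY_LEXICON, "s", "sh"))
```

Run over a real pronunciation dictionary instead of this toy list, the same count would give a first rough sense of how many words actually hinge on hearing "th" versus "f."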
So, the answer to your question of how? Not all that well. And when we do, it's often due to context or visual cues. Otherwise, it's that small difference in the upper part of the speech spectrum, around 8 kHz, that serves as the differentiator.