An intriguing article in Science two months ago suggests that advances in speech processing ‘may soon place speech and writing on a more equal footing, with broad implications for many aspects of society’. It reminds us that most of humanity’s approximately 50,000 years¹ with language were dominated by the spoken word, and that the balance was upset only a few thousand years ago by the invention of writing. But was it?
This leads to Oard’s key observation: writing has been hugely successful because it provides these advantages, but with today’s (and tomorrow’s) speech recognition technologies these advantages are no longer exclusive to writing. Why?
Digital storage is a great equalizer with regard to permanence: The same infrastructure that can reliably store digital text can equally well store digital speech. (…) Commercial “media management” systems can now reliably find specific content in the well-articulated speech of news announcers, and laboratory systems can handle much of the substantial variation in speaking styles that have made automatic transcription of interviews, meetings, and telephone conversations difficult. (Oard 2008:1787)
And thus, argues Oard, the comeback of the spoken word is upon us: ‘We now stand at the threshold of a new era, one in which the spoken word can again rise to prominence.’
Another conduit unlocked
These are exciting developments, not least for information retrievalists or for those of us doing conversation analysis of, say, well-behaved English telephone conversations. But one must not read too much into them. The rhetorical A-B-A structure of Oard’s argument (50,000 years of speech, a few millennia of writing, and now the return of speech!) suggests a radical turn where there is none.
Looking back at the invention of writing, perhaps the most crucial change it brought about was that information could now be stored reliably and effectively in some other medium than human memory. The recent developments in speech processing are just a simple variation on that theme; another modality has yielded to the advantages of permanence (storage) and findability (retrieval). Oard’s article, titled Unlocking the Potential of the Spoken Word, is thus primarily about unlocking it for information retrievalists, and as such it is a prime example of the conduit metaphor in action (Reddy 1979). Briefly, this metaphor suggests that words are simply vehicles for transporting ideas. One of its problems is that it trivializes the part played by the listener, who in fact faces the highly creative task of recreating ideas from the multi-modal signals uttered by the speaker.
Language is more than a series of tubes!
The actual potential of the spoken word is a lot more wide-ranging and interesting than suggested by this all-too-common metaphor (a more timely metaphor for this way of thinking may be ‘language as a series of tubes’). To be fair, Oard does mention in passing that ‘to this day, people find spoken expression and its visual correlates (…) to be a fluid and compelling way of communicating’ (p. 1787). It’s easy to see why, from an information retrieval perspective, that seems the less important stuff: embellishments which may provide some fluidity but are otherwise immaterial to the goal of getting a message across (again in the conduit-metaphor paradigm). But try to analyse five minutes of conversational discourse and suddenly both of the underlying assumptions (that the extra stuff is mere embellishment, and that the spoken word merely functions to transport information) will be a lot less obvious.
Consider the cartoon that came with the article (above). A man presents something by combining a pointing gesture with eye-gaze and a particular facial expression. Another man passes judgment with his body language as much as with some words (“It’ll never catch on…”). There’s not much that speech processing can do to unlock the potential of the spoken word here. ‘But this is a cartoon! It’s meant to use few words and lots of imagery!’, I hear you protest. Sure, but then in reality, discourse throughout these 50,000 years has been a lot more like this cartoon than like a neat text with a high information density, ready to be data-mined. Words have always come to us in richly contextualized multi-modal speech events in which speaker and listener jointly construct meaning, relying on such things as common ground, social relationships, imagery, gestures and facial expressions.³ To me, there lies the true potential of the spoken word.
- Clark, Herbert H. 1996. Using Language. Cambridge: Cambridge University Press.
- Enfield, Nick J., and Stephen C. Levinson. 2006. Roots of human sociality: Culture, cognition, and human interaction. Oxford: Berg.
- Lieberman, Philip. 2007. The Evolution of Human Speech: Its Anatomical and Neural Bases. Current Anthropology 48, no. 1 (February 1): 39-66. doi:10.1086/509092.
- McNeill, David, ed. 2000. Language and Gesture. Cambridge: Cambridge University Press.
- Oard, Douglas W. 2008. Unlocking the Potential of the Spoken Word. Science 321, no. 5897 (September 26): 1787-1788.
- Reddy, M. J. 1979. The conduit metaphor – a case of frame conflict in our language about language. In Metaphor and Thought, ed. A. Ortony, 284-297. Cambridge: Cambridge University Press.
- Tannen, Deborah. 1989. Talking Voices: Repetition, Dialogue, and Imagery in Conversational Discourse. Studies in Interactional Sociolinguistics 6. Cambridge: Cambridge University Press.
1. Oard refers to Lieberman 2007 for this date.
2. Oard mentions a third property, ‘contextualization’. I have trouble understanding this one; he briefly mentions the invention of ‘ways of writing that conveyed the needed context to a reader’ (p. 1787), but it seems to me that multi-modal speech (richly contextualized as it is) is not at all at a disadvantage to writing on that point.
3. See e.g. Tannen 1989, Clark 1996, Enfield & Levinson 2006, McNeill 2000, to mention just a few random works from a huge literature.