There is a minor industry in speech science and NLP devoted to detecting and removing disfluencies. In some of our recent work (e.g., Liesenfeld et al. 2023) we show that treating talk as sanitised text can adversely impact voice user interfaces. However, this is still a minority position. Googlers Dan Walker and Dan Liebling represent the mainstream view well in this blog post:
People don’t write in the same way that they speak. Written language is controlled and deliberate, whereas transcripts of spontaneous speech (like interviews) are hard to read because speech is disorganized and less fluent. One aspect that makes speech transcripts particularly difficult to read is disfluency, which includes self-corrections, repetitions, and filled pauses (e.g., words like “umm”, and “you know”). Following is an example of a spoken sentence with disfluencies from the LDC CALLHOME corpus:
“But that’s it’s not, it’s not, it’s, uh, it’s a word play on what you just said.”
It takes some time to understand this sentence — the listener must filter out the extraneous words and resolve all of the nots. Removing the disfluencies makes the sentence much easier to read and understand:
“But it’s a word play on what you just said.”
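For concreteness, the most superficial part of this cleanup can be sketched in a few lines. What follows is a purely illustrative rule-based sketch (the `strip_filled_pauses` helper is my own invention, not the learned classifier the blog post describes), and it only handles filled pauses; the self-repairs (“it’s not, it’s not”) would require syntactic context that a pattern like this cannot supply:

```python
import re

# Illustrative inventory of filled pauses; real systems learn these
# distinctions from annotated data rather than from a fixed list.
FILLED_PAUSES = re.compile(r"\s*\b(?:uh|um+|er|you know)\b\s*,?", re.IGNORECASE)

def strip_filled_pauses(utterance: str) -> str:
    """Remove filled pauses only; repetitions and self-repairs are left
    untouched, since spotting them reliably needs more than patterns."""
    cleaned = FILLED_PAUSES.sub(" ", utterance)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

example = "But that's it's not, it's not, it's, uh, it's a word play on what you just said."
print(strip_filled_pauses(example))
# "But that's it's not, it's not, it's, it's a word play on what you just said."
```

Even this trivial pass exposes an asymmetry: fillers can be matched and dropped, while repairs demand an analysis of what was being repaired, which is presumably why the blog post reaches for machine learning in the first place.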
Fair enough, you might say. Everyone understands there are use cases for identifying and sometimes removing these items, for instance (possibly) when subtitling or transcribing spoken material for written consumption. And surely, in this example, the sanitised version seems “much easier to read and understand” than the original.
Easier for whom and relative to what?
Hold on. Easier to read for whom? Easier to understand relative to what? It never hurts to go back to the source. Here is a more precise transcript of the interaction from the CALLHOME corpus. The target utterance appears at line 11:
What happens at line 11 cannot be understood without the immediate prior context. Technically (that is, using the analytical tools of conversation analysis) we can describe it as a case of ‘disfluency’ or ‘hesitation’ deployed to do the interactional work of showing an orientation to inappropriateness (Lerner 2013) — in this case of a lame pun with some sexual innuendo. The pun (“kind of a switcheroo” as B says) is a juvenile word play that exchanges “come and visit him” for “visit and come on him” (15). After an initial laugh (5), B puts considerable work into drawing attention to what crossed his mind while at the same time casting doubt on its tellability: “anyway- never mind” and “I don’t want to say” (7, 9). It is quite remarkable to see so many evasive moves. All this forms the backdrop to the turn in focus:
but that’s it’s not- it’s not- its- uh- it’s a word play on what you just said (11)
When something is produced after so much evasion and in such a belaboured, disfluent, hesitant way, you can bet the delivery is meaningful in itself. The hemming and hawing is the point. It contributes to putting up a smokescreen of ambiguous commitment to what might become (we already sense at that point) something problematic. The deflationary “kind of switcheroo” (13) further aims to defuse a delicate situation. Only after A’s second request to deliver the goods does B produce the word play. And then the whole thing falls flat, as seen, among other things, in A’s performative laughter particles, B’s explanation (any pun that needs an explanation is dead on arrival), A’s non-committal “yeah well okay”, the subdued laughter by both, and B’s self-deprecating “I just” (16-19).
An infrastructure for collaborative indiscretion
When we live through episodes like this in everyday life, we get all this in a split second. The slipperiness of jokes and puns, the inescapable social accountability that always hovers over anything we say, and the degree to which we depend on others for realizing indiscretions. We get it when others do it, and we do it ourselves. As I said, the hemming and hawing is the point. Disfluencies are a key interactional tool that we use to navigate interactionally delicate episodes (Jefferson 1974). Gene Lerner (2013) has described hesitations in this kind of context as an infrastructure for collaborative indiscretion. The point: there is a great deal of order and regularity even to things like hesitations and disfluencies.
Let’s back up a bit. First we have an original utterance, warts and all, situated in an actual interaction, formulated in a way that displays self-consciousness, saturated with accountability. Then we have, in the Googlers’ version, an abbreviated, regularized, decontextualized version that is emptied of all significance, all of the wrinkles ironed out. The original relates to the sanitised form approximately as a living, fluttering butterfly relates to a pinned and preserved specimen. The latter may be easier to classify, POS-tag, vectorize — which is probably what most NLPers mean when they say “easier to read and understand”. But it is not the same.
(And note, too, that even the cleaned-up version is not going to lead to better understanding of what’s actually going on. After all, the hesitations and so on only served to foreshadow that a word play had crossed the speaker’s mind, which is only revealed after the whole back and forth. Good luck to your coreference resolution, sentiment analysis, or stance detection algos!)
Why this matters
How we say something has implications for what it means, how we want it to be taken up, how we expect to be held accountable for it (Jefferson 1974, Clift 2016). People in interaction frequently mobilize disfluencies to stall for time, to display uncertainty, to foreshadow disagreement, to find an ally to co-produce an indiscretion, and a great many other things. Perhaps there are contexts or applications where it may be useful to detect or even hide disfluencies, but erasing them wholesale should raise red flags. And yet that appears to be the sole purpose of Walker and Liebling’s work. As they write,
we created machine learning (ML) algorithms that identify disfluencies in human speech. Once those are identified, we can remove the extra words to make transcripts more readable. This also improves the performance of natural language processing (NLP) algorithms that work on transcripts of human speech.
If we ‘clean up’ transcripts of talk to look more like sanitised text data, NLP algorithms trained on text data also perform better on the cleaned-up transcripts. I bet they do! And again, for some purposes, this may be useful. But it so happens that for this particular case — which, remember, I didn’t pick, they did — the act of cleaning up actually conceals what happened and why. Something essential was lost in the process. Not just the disfluencies, but our power to understand what people do when they wield disfluencies.
Scaling up and losing touch
Let’s think ahead. When feeding only ‘sanitised’ transcripts like this to NLP algorithms, you are deciding, before any analysis, that disfluencies and the like don’t matter for whatever you want to study or classify or understand. This is a big choice to make. How sure can you be of this for, say, sentiment analysis or emotion detection? Why assume that everything relevant will be in the ‘content’ words, when human interaction is famous for its flexibility and metalinguistic prowess?
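To make that concrete, here is a toy sketch (the cue inventory and the `hesitation_profile` name are my own illustrative assumptions, not a validated feature set): count a few hesitation cues that a delicacy- or uncertainty-aware classifier could in principle use, and compare the original turn with its sanitised counterpart.

```python
import re

# Illustrative hesitation cues: filled pauses and cut-offs (a word
# broken off with "-", following common transcription conventions).
# Overlap is possible ("uh-" counts as both); a real feature set
# would be far richer.
CUES = {
    "filled_pause": re.compile(r"\b(?:uh|um+|er)\b", re.IGNORECASE),
    "cut_off": re.compile(r"\w-(?=\s|$)"),
}

def hesitation_profile(utterance: str) -> dict:
    return {name: len(pattern.findall(utterance)) for name, pattern in CUES.items()}

original = "but that's it's not- it's not- its- uh- it's a word play on what you just said"
sanitised = "but it's a word play on what you just said"

print(hesitation_profile(original))   # {'filled_pause': 1, 'cut_off': 4}
print(hesitation_profile(sanitised))  # {'filled_pause': 0, 'cut_off': 0}
```

If counts like these carry signal for stance, delicacy, or uncertainty, the cleaned transcript zeroes them out before any downstream model ever sees them.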
As a side effect, you may also be enabling ML algorithms to pick up and reproduce, say, lame jokes without the hedging and disfluency they are sometimes produced with (as in this case). It doesn’t take a lot of imagination to see how scaling this up might lead to serious problems (see Birhane et al. 2023 on the not so innocent nature of scaling). The case Walker and Liebling picked happened to be a relatively tame pun. Racism, sexism, gaslighting, and all forms of subtle and not so subtle verbal abuse — these occur in real data, and the way they are produced and responded to is immensely important for a deeper understanding of human interaction.
By removing disfluencies and turning situated talk into sanitised text, you’re removing all public evidence of the very resources people mobilize to manage social accountability and navigate episodes of interactional delicateness. You’re sabotaging your own ability to understand how ethical norms and values are socially enforced in interaction. You’re obscuring how stance and epistemics actually work. You’re forcing rich, ambiguous, human interaction into a straitjacket of tokenizers and transformers. You are, fundamentally, dehumanizing human interaction.
Talk, warts and all
Linguists and computer scientists have long been conditioned to separate competence from performance, and to regard the latter as essentially disposable. If pristine competence is the supreme goal, only to be reached by excavating it from under the rubble of performance, no wonder that we work hard to remove all evidence of the human in our texts and transcripts (Dingemanse & Enfield 2023).
However, even though the competence/performance distinction has loomed large in NLP, and likely forms part of the cultural backdrop to unexamined choices like this (the standard ‘stopword removal’ procedure is another example), it’s not the only game in town and never has been. A century ago, anthropologist Bronislaw Malinowski wrote:
Indeed behaviour is a fact, a relevant fact, and one that can be recorded. And foolish indeed and short-sighted would be the [wo]man of science who would pass by a whole class of phenomena, ready to be garnered, and leave them to waste, even though [s]he did not see at the moment to what theoretical use they might be put! (Malinowski 1922: 20)
If we take this whole class of phenomena to include human interactive behaviour, recorded and represented as faithfully as possible, then it should be clear today that not only are there ample theoretical uses for it, but also practical ones. The theoretical uses include forming a sophisticated understanding of how people exchange information and build social relations through situated talk, a critical prerequisite to any serious work on human language technology. The practical uses include building on such insights to make language technologies that do not sanitise and dumb down what we say, but that instead harness our linguistic abilities — including our formidable and sophisticated abilities to delay, hesitate, backtrack, and repair. As conversational agents and voice-driven interfaces grow increasingly ubiquitous, now is the time to move beyond text-bound conceptions of language, and to start taking talk seriously.
- Birhane, A., V. U. Prabhu, S. Han, V. Boddeti, and S. Luccioni. 2023. ‘Into the LAION’s Den: Investigating Hate in Multimodal Datasets’. In Thirty-Seventh Conference on Neural Information Processing Systems, Datasets and Benchmarks Track. https://openreview.net/forum?id=6URyQ9QhYv
- Clift, Rebecca. 2016. Conversation Analysis. Cambridge: Cambridge University Press.
- Dingemanse, Mark, and N. J. Enfield. 2023. ‘Interactive Repair and the Foundations of Language’. Trends in Cognitive Sciences. https://doi.org/10.1016/j.tics.2023.09.003.
- Jefferson, Gail. 1974. ‘Error Correction as an Interactional Resource’. Language in Society 3 (2): 181–99.
- Lerner, Gene. 2013. ‘On the Place of Hesitating in Delicate Formulations: A Turn-Constructional Infrastructure for Collaborative Indiscretion’. In Conversational Repair and Human Understanding, edited by Makoto Hayashi, Geoffrey Raymond, and Jack Sidnell, 95–134. Studies in Interactional Sociolinguistics 30. Cambridge: Cambridge University Press.
- Liesenfeld, Andreas, Alianda Lopez, and Mark Dingemanse. 2023. ‘The Timing Bottleneck: Why Timing and Overlap Are Mission-Critical for Conversational User Interfaces, Speech Recognition and Dialogue Systems’. In Proceedings of the 24th Annual SIGdial Meeting on Discourse and Dialogue. Prague. https://aclanthology.org/2023.sigdial-1.45/
- Malinowski, Bronislaw. 1922. Argonauts Of The Western Pacific. London: Routledge & Kegan Paul.
- Walker, Dan, and Dan Liebling. 2022. ‘Identifying Disfluencies in Natural Speech’. Google Research Blog. 30 June 2022. https://blog.research.google/2022/06/identifying-disfluencies-in-natural.html.