📣New! From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology — very happy to see this position paper with Andreas Liesenfeld accepted to ACL 2022. This paper is one of multiple coming out of our @NWO_SSH Vidi project ‘Elementary Particles of Conversation’ and presents a broad-ranging overview of our approach, which combines comparison, computation and conversation.
More NLP work on diverse languages is sorely needed. Here we identify a type of data that will be critical to the field’s future and yet remains largely untapped: linguistically diverse conversational corpora. There’s more of it than you might think! Large conversational corpora are still rare & ᴡᴇɪʀᴅ*, but granularity matters: even an hour of conversation easily means 1000s of turns with fine details on timing, joint action, incremental planning, & other aspects of interactional infrastructure (*Henrich et al. 2010).

We argue for a move from monologic text to interactive, dialogical, incremental talk. One simple reason this matters: the text corpora that feed most language models & inform many theories woefully underrepresent the very stuff that streamlines & scaffolds human interaction. Text is atemporal, depersonalized, concatenated, monologic — it yields readily to our transformers, tokenizers, taggers, and classifiers. Talk is temporal, personal, sequentially contingent, dialogical. As Wittgenstein would say, it’s a whole different ball game.
Take turn-taking. Building on prior work, we find that across unrelated languages people seem to aim for rapid transitions on the order of 0–200 ms, resulting in plenty of small gaps and overlaps — one big reason most voice UIs today feel stilted and out of sync. This calls for incremental architectures (as folks like David Schlangen, Gabriel Skantze and Karola Pitsch have long pointed out). Here, cross-linguistically diverse conversational corpora can help to enable local calibration & to identify features that may inform the projection of transition-relevance places (TRPs).
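To make the timing point concrete, here is a minimal sketch (ours, not from the paper) of how floor-transfer offsets can be computed from timed turns. It assumes turns come as (speaker, start, end) tuples in seconds; negative offsets are overlaps, positive ones gaps:

```python
# Minimal sketch: floor-transfer offsets from timed turns.
# Assumes each turn is (speaker, start, end) in seconds.
# Negative offsets are overlaps; positive offsets are gaps.

def floor_transfer_offsets(turns):
    """Offsets between a turn's end and the next speaker's start."""
    offsets = []
    for prev, nxt in zip(turns, turns[1:]):
        if prev[0] != nxt[0]:                 # only count speaker changes
            offsets.append(nxt[1] - prev[2])  # next start minus previous end
    return offsets

turns = [
    ("A", 0.00, 1.20),
    ("B", 1.35, 2.10),  # starts 150 ms after A ends: a small gap
    ("A", 2.05, 3.00),  # starts 50 ms before B ends: a small overlap
]
print([round(o * 1000) for o in floor_transfer_offsets(turns)])  # [150, -50]
```

On real corpora, these offsets are what the 0–200 ms generalization is about: most of the probability mass sits right around that window.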
Turns come in sequences. It’s alluring to see exchanges as slot-filling exercises (e.g. Q→A), but most conversations are way more open-ended and fluid. Promisingly, some broad activity type distinctions can be made visible in language-agnostic ways & are worth examining. This bottom-up view invites us to think about languages less in terms of tokens with transition probabilities, and more as tools for flexible coordination games. Look closely at just a minute of quotidian conversation in any language (as #EMCA does) and you cannot unsee this. What’s more, even seemingly similar patterns can harbour diversity. While English & Korean both use a minimal continuer form mhm/응, we find that response tokens are about twice as frequent in the latter (and more often overlapped), with implications for parsers & interaction design.
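As a toy illustration of the kind of comparison involved, here’s a sketch in Python; the turns and counts below are made up (not the paper’s data), only the computation matters:

```python
# Toy illustration: response-token rate and overlap share per corpus.
# The Turn records below are invented; only the computation matters.

from dataclasses import dataclass

@dataclass
class Turn:
    text: str
    is_response_token: bool  # a minimal continuer like "mhm" / "응"
    overlapped: bool         # produced in overlap with another turn

def profile(turns):
    continuers = [t for t in turns if t.is_response_token]
    rate = len(continuers) / len(turns)
    overlap_share = sum(t.overlapped for t in continuers) / max(len(continuers), 1)
    return rate, overlap_share

english = [Turn("mhm", True, False), Turn("right you are", False, False),
           Turn("so anyway", False, False), Turn("who knew", False, False)]
korean = [Turn("응", True, True), Turn("응", True, False),
          Turn("그래서", False, False), Turn("응", True, True)]

print(profile(english))  # (0.25, 0.0)
print(profile(korean))   # (0.75, 0.666...)
```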

Finally, we touch on J.R. Firth — not his NLP-famous dictum on distributional semantics, but his lesser-known thoughts on conversation, which, according to him, holds “the key to a better understanding of what language really is and how it works” (1935, p. 71). As Firth observed, talk is more orderly and ritualized than most people think. We rightly celebrate the amazing creativity of language, but tend to overlook the extent to which it is scaffolded by recurrent turn formats — which, we find, may make up ~20% of turns at talk.
Do recurrent turn formats follow the kind of rank/frequency distribution we know from tokenised words? We find that across 22 languages, it seems they do — further evidence they serve as interactional tools (making them a prime example of Zipf’s notion of tools-for-jobs).
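For those who want to eyeball this on their own data, a minimal sketch, assuming turns have already been normalized to canonical format strings (the toy corpus is illustrative only); a roughly linear log-log relation between rank and frequency is the Zipfian signature:

```python
# Minimal sketch: rank/frequency of recurrent turn formats.
# Assumes turns are already normalized to a canonical format string;
# the toy corpus below is illustrative only.

from collections import Counter
import math

turns = ["mhm", "yeah", "mhm", "what?", "mhm", "yeah",
         "oh okay", "mhm", "yeah", "no way"]

for rank, (form, freq) in enumerate(Counter(turns).most_common(), start=1):
    # Roughly linear log(rank) vs log(freq) suggests a Zipfian distribution.
    print(f"{rank}  {form:<8} {freq}  ({math.log(rank):.2f}, {math.log(freq):.2f})")
```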

We ignore these turn formats at our own peril. Text erases them; tokenisation obscures them; dialog managers stumble over them; ASR misses them — and yet we can’t have a conversation for even thirty seconds without them. Firth was not wrong in saying they’re key.
Implications
So, implications! How can linguistically diverse conversational corpora help us do better language science and build more inclusive language technologies? We present three principles: aim for ecological validity, represent interactional infrastructure, and design for diversity.
Ecological validity is a hard sell because incentives in #nlproc and #ML —size, speed, SOTA-chasing— work against a move from text to talk. However, terabytes of text cannot replace the intricacies of interpersonal interaction. Data curation is key, we say with Anna Rogers. Pivoting from text to talk means taking conversational infrastructure seriously, also at the level of data structures & representations. Flattened text is radically different from the texture of turn-organized, time-bound, tailor-made talk — it takes two to tango.
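What might taking infrastructure seriously look like at the level of data structures? A minimal sketch, with field names that are ours rather than a proposed standard:

```python
# Minimal sketch: a turn-level record vs. flattened text.
# Field names are illustrative, not a proposed standard.

from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    start: float  # onset in seconds
    end: float    # offset in seconds
    text: str

dialogue = [
    Turn("A", 0.00, 1.20, "did you see the game"),
    Turn("B", 1.15, 1.40, "mhm"),  # an overlapping continuer
    Turn("A", 1.30, 2.80, "unbelievable ending"),
]

# Flattening to text erases exactly what the records above preserve:
flat = " ".join(t.text for t in dialogue)
print(flat)  # speakership, timing and overlap are all gone
```

The flattened string at the end is what most current pipelines see, and it is lossy in precisely the ways that matter for interaction.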
A user study (Hoegen et al. 2019) provides a fascinating view of what happens when interactional infrastructure is overlooked. People run into overlap when talking with a conversational agent; the paper proposes this may be solved by filtering out “stop words and interjections”. This seems pretty much the wrong way round to us. Filtering out interjections to avoid overlap is like removing all pedestrian crossings to give free rein to self-driving cars. It’s robbing people of self-determination & agency just because technology can’t cope.
Our 3rd recommendation is to design for diversity. As the case studies show, we cannot assume that well-studied languages tell us everything we need to know. Extending the empirical and conceptual foundations of #nlproc and language technologies will be critical for progress.
TL;DR
Voice user interfaces are ubiquitous, yet still feel stilted; text-based LMs have many applications, yet can’t sustain meaningful interaction; and crosslinguistic data & perspectives are in short supply. Our paper sits right at the intersection of these challenges. If you’re going to be at ACL, you’ll find our talk on Underline, but here’s a public version of the 12-minute pre-recorded talk with corrected captions for accessibility:
Originally tweeted by @dingemansemark@scholar.social (@DingemanseMark) on March 23, 2022.