‘From text to talk’, ACL 2022 paper

(this post originated as a twitter thread)

📣New! From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology — very happy to see this position paper w/ @a_liesenfeld accepted to #acl2022nlp — Preprint 📜: http://doi.org/10.31219/osf.io/m43zh

Screenshot of cover page of article. Abstract: "Informal social interaction is the primordial home of human language. Linguistically diverse conversational corpora are an important and largely untapped resource for computational linguistics and language technology. Through the efforts of a worldwide language documentation movement, such corpora are increasingly becoming available. We show how interactional data from 63 languages (26 families) harbours insights about turn-taking, timing, sequential structure and social action, with implications for language technology, natural language understanding, and the design of conversational interfaces. Harnessing linguistically diverse conversational corpora will provide the empirical foundations for flexible, localizable, humane language technologies of the future."

This paper is one of multiple coming out of our @NWO_SSH Vidi project 'Elementary Particles of Conversation' and presents a broad-ranging overview of our approach, which combines comparison, computation and conversation

More NLP work on diverse languages is direly needed. In this #acl2022nlp position paper we identify a type of data that will be critical to the field's future and yet remains largely untapped: linguistically diverse conversational corpora. There's more of it than you might think!

World map showing the location of 63 spoken languages included in the curated collection considered in the paper: 1 Arapaho 2 Cora 3 English 4 Otomi 5 Ulwa 6 Kichwa 7 Siona 8 Tehuelche 9 Br. Portuguese 10 Kakabe 11 Minderico 12 Spanish 13 Siwu 14 Catalan 15 French 16 Dutch 17 Akpes 18 Hausa 19 Danish 20 Zaar 21 Baa 22 German 23 Italian 24 Sakun 25 Czech 26 Croatian 27 Limassa 28 ǂAkhoe 29 Saami 30 Laal 31 Polish 32 N|uu 33 Hungarian 34 Juba Creole 35 Arabic 36 Siputhi 37 Farsi 38 Chitkuli 39 Gutob 40 Nganasan 41 Yakkha 42 Anal 43 Zauzou 44 Kerinci 45 Duoxu 46 S. Qiang 47 Nasal 48 Sambas 49 Kelabit 50 Mandarin 51 Totoli 52 Kula 53 Jejueo 54 Korean 55 Pagu 56 Ambel 57 Gunwinggu 58 Japanese 59 Wooi 60 Yali 61 Heyo 62 Yélî Dnye 63 Vamale.

Large conversational corpora are still rare & ᴡᴇɪʀᴅ* but granularity matters: even an hour of conversation easily means 1000s of turns with fine details on timing, joint action, incremental planning, & other aspects of interactional infrastructure (*Henrich et al. 2010)

Language resources (corpora) and their size in relation to global language diversity. >7000 languages, >180 with some form of corpus resources, ~70 with conversational corpora of casual talk.

We argue for a move from monologic text to interactive, dialogical, incremental talk. One simple reason this matters: the text corpora that feed most language models & inform many theories woefully underrepresent the very stuff that streamlines & scaffolds human interaction

Diagram showing the words and expressions most distinctive of talk (compared to text): interjections like huh, hm, mhm, wow, um, yeah, etc.

Text is atemporal, depersonalized, concatenated, monologic — it yields readily to our transformers, tokenizers, taggers, and classifiers. Talk is temporal, personal, sequentially contingent, dialogical. As Wittgenstein would say, it's a whole different ball game

Take turn-taking. Building on prior work, we find that across unrelated languages people seem to aim for rapid transitions on the order of 0–200 ms, resulting in plenty of small gaps and overlaps — one big reason most voice UIs today feel stilted and out of sync

The timing of turn transitions in dyadic interactions in 24 languages around the world, replicating earlier findings and extending the evidence for the interplay of universals and cultural variation in turn-taking (n = number of turn transitions per corpus). Positive values represent gaps between turns; negative values represent overlaps. Across languages, the mean transition time is 59 ms, and 46% of turns are produced in (slight) terminal overlap with a prior turn
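
For concreteness, here's a minimal sketch of how such transition offsets can be computed from turn-annotated data. The (speaker, start, end) tuples are a toy format of my own, not the actual corpus representation:

```python
# Minimal sketch: computing turn-transition offsets from turn-annotated
# data. Each turn is a (speaker, start_ms, end_ms) tuple sorted by start
# time — a toy format, not the actual corpus representation.
from statistics import mean

turns = [
    ("A", 0, 1200),
    ("B", 1150, 2400),   # starts 50 ms before A finishes: overlap (-50)
    ("A", 2600, 3100),   # starts 200 ms after B finishes: gap (+200)
]

def transition_offsets(turns):
    """Offsets at speaker change: positive = gap, negative = overlap."""
    return [curr[1] - prev[2]
            for prev, curr in zip(turns, turns[1:])
            if curr[0] != prev[0]]

offsets = transition_offsets(turns)
print(f"mean offset: {mean(offsets):.0f} ms")   # 75 ms
print(f"share in overlap: {sum(o < 0 for o in offsets) / len(offsets):.0%}")
```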

This calls for incremental architectures (as folks like @davidschlangen @GabrielSkantze @KarolaPitsch have long pointed out). Here, cross-linguistically diverse conversational corpora can help to enable local calibration & to identify features that may inform TRP projection
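
To make the 'local calibration' point concrete, here's a hypothetical sketch of deriving a per-corpus end-of-turn silence threshold from observed gaps rather than using one global default. The percentile choice and the 200 ms fallback are illustrative assumptions, not recommendations from the paper:

```python
# Hypothetical sketch: calibrating an end-of-turn silence threshold per
# corpus from observed gap durations, instead of one global default.
def calibrate_threshold(offsets_ms, percentile=0.75):
    """Threshold (ms) covering ~75% of observed human gap durations."""
    gaps = sorted(o for o in offsets_ms if o > 0)  # ignore overlaps
    if not gaps:
        return 200  # arbitrary fallback value
    return gaps[min(len(gaps) - 1, int(len(gaps) * percentile))]

# offsets as computed by transition_offsets() in the earlier sketch
print(calibrate_threshold([250, 400, 120, 600, 80, 350]))  # 400
print(calibrate_threshold([30, 90, 150, 10, 60, 200]))     # 150
```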

Turns come in sequences. It's alluring to see exchanges as slot-filling exercises (e.g. Q→A), but most conversations are way more open-ended and fluid. Promisingly, some broad activity type distinctions can be made visible in language-agnostic ways & are worth examining

Two types of conversational activity in 6 unrelated languages, showing the viability of identifying broad activity types using ebbs and flows in the amount of talk contributed (time in ms). Panel A: a 'piano roll' display of turns by two participants as they unfold over time. Tellings ('chunks') are characterized by highly skewed relative contributions, with one participant serving as teller and the other taking on a recipient role (roles may switch, as in the Japanese example). Panel B: in 'chat' segments, turns and speaking time are distributed more evenly. Panel C: shifts from one state to another are interactionally managed by participants.
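
A rough sketch of the language-agnostic idea behind Panels A and B: classify a stretch of talk as 'chunk' or 'chat' by how skewed the speaking time is within a window. The windowing logic and the 0.8 cutoff are illustrative assumptions:

```python
# Rough sketch of the chunk/chat distinction: within a time window, how
# skewed is the distribution of speaking time across participants?
def window_state(turns, t0, t1, cutoff=0.8):
    """turns: (speaker, start_ms, end_ms). Label the window [t0, t1)."""
    talk = {}
    for spk, s, e in turns:
        overlap = max(0, min(e, t1) - max(s, t0))
        talk[spk] = talk.get(spk, 0) + overlap
    total = sum(talk.values())
    if total == 0:
        return "silence"
    return "chunk" if max(talk.values()) / total > cutoff else "chat"

telling = [("A", 0, 9000), ("B", 9100, 9400), ("A", 9600, 19000)]
print(window_state(telling, 0, 20000))  # 'chunk': A holds the floor
```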

This bottom-up view invites us to think about languages less in terms of tokens with transition probabilities, and more as tools for flexible coordination games. Look closely at just a minute of quotidian conversation in any language (as #EMCA does) and you cannot unsee this

Even seemingly similar patterns can harbour diversity. While English & Korean both use a minimal continuer form mhm/응, we find that response tokens are about twice as frequent in the latter (and more often overlapped), with implications for parsers & interaction design
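
A toy sketch of this kind of comparison — relative frequency of minimal continuers per 1,000 turns. The mini-corpora and continuer sets here are invented for illustration, not the paper's actual counts:

```python
# Toy sketch: rate of minimal continuers per 1,000 turns in two corpora.
# Mini-corpora and continuer token sets are invented for illustration.
def continuer_rate(turns, continuers):
    hits = sum(1 for t in turns if t.strip().lower() in continuers)
    return 1000 * hits / len(turns)

english = ["mhm", "so I went there", "yeah", "mhm", "really?"]
korean = ["응", "응", "그래서 갔어", "응", "응"]
print(continuer_rate(english, {"mhm", "uh huh"}))  # 400.0
print(continuer_rate(korean, {"응", "어"}))          # 800.0
```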

Finally, we touch on J.R. Firth — not his NLP-famous dictum on distributional semantics, but his lesser known thoughts on conversation, which according to him holds "the key to a better understanding of what language really is and how it works" (1935, p. 71)

Quote from Firth (1935): "Neither linguists nor psychologists have begun the study of conversation; but it is here we shall find the key to a better understanding of what language really is and how it works"

As Firth observed, talk is more orderly and ritualized than most people think. We rightly celebrate the amazing creativity of language, but tend to overlook the extent to which it is scaffolded by recurrent turn formats — which, we find, may make up ~20% of turns at talk

A look at conversational data shows that many turns are not one-offs: at least 28% of the utterances in our sample (436 367 out of 1 532 915 across 63 languages) occur more than once, and over 21% (329 548) occur more than 20 times. Many of these recurring turn formats are interjections and other pragmatic devices that help manage the flow of interaction and calibrate understanding
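
The recurrence measure itself is simple to sketch. The utterance list below is invented, but the two shares it computes correspond to the 'more than once' and 'more than 20 times' figures above:

```python
# Sketch of the recurrence measure: share of utterances whose form occurs
# more than once, and more than `high` times. Toy data, for illustration.
from collections import Counter

def recurrence_shares(utterances, high=20):
    counts = Counter(utterances)
    n = len(utterances)
    recurring = sum(c for c in counts.values() if c > 1)
    frequent = sum(c for c in counts.values() if c > high)
    return recurring / n, frequent / n

toy = ["mhm"] * 25 + ["yeah"] * 3 + ["I saw her yesterday", "huh?"]
print(recurrence_shares(toy))  # (0.93..., 0.83...)
```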

Do recurrent turn formats follow the kind of rank/frequency distribution we know from tokenised words? We find that across 22 languages, it seems they do — further evidence they serve as interactional tools (making them a prime example of Zipf's notion of tools-for-jobs)
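
For those who want to try this on their own data, here's a self-contained sketch of the rank/frequency check: fit the slope of log frequency against log rank; a slope near −1 is the classic Zipfian signature. The toy data is constructed to be Zipfian:

```python
# Rank/frequency check: least-squares slope of log(freq) against
# log(rank). Toy data is built to be Zipfian (frequency ~ 100/rank).
import math
from collections import Counter

def zipf_slope(items):
    freqs = sorted(Counter(items).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

toy = [f"format{r}" for r in range(1, 51) for _ in range(round(100 / r))]
print(f"slope = {zipf_slope(toy):.2f}")  # close to -1
```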

We ignore these turn formats at our own peril. Text erases them; tokenisation obscures them; dialog managers stumble over them; ASR misses them — and yet we can't have a conversation for even thirty seconds without them. Firth was not wrong in saying they're 🔑


I've been slow-threading my way through some of our empirical results and will be adding a bunch more tweets on the implications. If you're hopping on, this is work with the amazing @a_liesenfeld, preprinted at http://doi.org/10.31219/osf.io/m43zh & to be presented at #acl2022nlp soon

So, implications! How can linguistically diverse conversational corpora help us do better language science and build more inclusive language technologies? We present three principles: aim for ecological validity, represent interactional infrastructure, design for diversity

Ecological validity is a hard sell because incentives in #nlproc and #ML —size, speed, SOTA-chasing— work against a move from text to talk. However, terabytes of text cannot replace the intricacies of interpersonal interaction. Data curation is key, we say with @annargrs

Pivoting from text to talk means taking conversational infrastructure seriously, also at the level of data structures & representations. Flattened text is radically different from the texture of turn-organized, time-bound, tailor-made talk — it takes two to tango
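
As a minimal illustration of what taking the texture of talk seriously could mean at the level of data structures: keep turns as time-bound, speaker-attributed units rather than a flattened string. The field names here are mine, not a format proposed in the paper:

```python
# Illustrative only: turns as time-bound, speaker-attributed units
# instead of a flattened string. Field names are invented.
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    start_ms: int
    end_ms: int
    utterance: str  # interjections included, not filtered out

flattened = "did you see her yeah and then mhm"  # what text-only NLP sees
structured = [
    Turn("A", 0, 900, "did you see her"),
    Turn("B", 950, 1200, "yeah"),
    Turn("A", 1800, 2600, "and then"),
    Turn("B", 1900, 2100, "mhm"),  # continuer, in overlap with A's turn
]
```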


A user study (Hoegen et al. 2019) provides a fascinating view of what happens when interactional infrastructure is overlooked. People run into overlap when talking with a conversational agent; the paper proposes this may be solved by filtering out "stop words and interjections"

This seems pretty much the wrong way round to us. Filtering out interjections to avoid overlap is like removing all pedestrian crossings to give free rein to self-driving cars. It's robbing people of self-determination & agency just because technology can't cope

Our 3rd recommendation is to design for diversity. As the case studies show, we cannot assume that well-studied languages tell us everything we need to know. Extending the empirical and conceptual foundations of #nlproc and language technologies will be critical for progress

In the rush for better language technology we should avoid being driven into the arms of only the resourceful few. To escape their reign, use linguistically diverse data and anticipate a combination of universal and language-specific design principles. This not only ensures broad empirical coverage and enables new discoveries; it also benefits diversity and inclusion, enabling language technology development that serves the needs of diverse communities and makes technology more inclusive, more humane and more convivial for a larger range of possible users (Munn, 2018; Voinea, 2018). Localizing user interface elements is only a first step; diversity in how and when basic interactional structures are deployed must ultimately be reflected in the design of conversational user interfaces.

Voice user interfaces are ubiquitous, yet still feel stilted; text-based LMs have many applications, yet can't sustain meaningful interaction; and crosslinguistic data & perspectives are in short supply. Our #acl2022nlp paper sits right at the intersection of these challenges

Still of video showing opening page of paper, which is available here: https://osf.io/m43zh

Cleaning up the YouTube autocaptions for our #acl2022nlp preview, I found it really uncanny how accurate they are at *not ever transcribing* interjections like "m-hm" and "huh?" — a neat illustration of our point that ASR often misses these words

Revisiting this thread to record the official link to the paper in the ACL Anthology (for those of you who like official page numbers):

https://aclanthology.org/2022.acl-long.385/

If you're going to be at ACL you'll find our talk on Underline, but here's a public version of the 12min pre-recorded talk with corrected captions for accessibility — w/ @a_liesenfeld #ACL2022 #nlproc #ACL2022nlp

Sweet: this line from our paper's conclusions was highlighted by @thamar_solorio as a key take-away message at the #acl2022nlp Next Big Ideas plenary session. Here's to more room for linguistic agency and diversity in NLP

By the way, one of the more puzzling #acl2022nlp reviewer comments we got was precisely about that line (among others), and featured a serious charge that @a_liesenfeld and I now often lob at each other: 🚨 "figurative language in evidence" 🚨

Originally tweeted by @dingemansemark@scholar.social (@DingemanseMark) on March 23, 2022.
