The paper also devotes some attention to the importance of linguistic diversity in computer science and NLP, a key theme in the new language diversity track at #acl2022nlp, where another paper by Blasi and colleagues stood out. (The relevance of cross-linguistically diverse corpora for NLP was also a focus in this ACL paper of ours, where we argue such data is crucial for diversity-aware modelling of dialogue and conversational AI.)
I do have a nitpick about Blasi &al’s backchannel claim. They note many languages have minimal forms (citing a study of ours that provides evidence on this for 32 languages) and add, “However, listeners of Ruruuli … repeat whole words said by the speaker” — seeming to imply they rarely produce such minimal forms and (tend to) repeat words instead. Or at least I’m guessing that would be most people’s reading of this claim.
The source given for this idea is Zellers 2021. However, this actually paints a very different picture: in fact, ~87% of relevant utterances (1325 out of 1517) do consist of minimal forms like the ‘nonlexical’ hmm and the ‘short lexical’ eeh ‘yes’, against <9% featuring repetition, as seen in this table from Zellers:
I don’t think anyone has done the relevant comparison for other languages yet, but it seems safe to say that Ruruuli/Lunyala does in fact mostly use “the minimal mm-hmm”, and that repetition, while certainly worthy of more research, is one of the minority strategies for backchanneling in the language.
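For readers who want to check the proportion quoted from Zellers 2021, here is a minimal sketch. It uses only the two counts given above; the variable names are mine, not Zellers's.

```python
# Counts as quoted in the text above (from the table in Zellers 2021).
minimal_forms = 1325        # utterances consisting of minimal forms
relevant_utterances = 1517  # all relevant utterances

# Share of relevant utterances that are minimal forms.
share_minimal = minimal_forms / relevant_utterances
print(f"minimal forms: {share_minimal:.1%}")
```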
Despite this shortcoming, the relevance of cross-linguistic diversity in this domain can be supported by a different observation: the relative frequency and points of occurrence of ‘backchannels’ do seem to differ across languages — as shown in our ACL paper for English versus Korean. And the work on repetition is fascinating in itself — it is certainly possible that repetition is used in a wider range of interactional practices in some languages, with possible effects on transmission & lg structure as suggested in work by Sonja Gipper.
A serendipitous wormhole into #EMCA history. I picked up Sudnow’s piano course online and am diligently working through the lessons. Guess what he says some time into the audio-recorded version of his 1988 Chicago weekend seminar (see lines 7-11):
[Chicago, 1988. Audio recording of David Sudnow’s weekend seminar]
We learn too quickly and cannot afford to contaminate a movement by making a mistake.
People who type a lot have had this experience. You type a word and you make a mistake.
I have been involved, uh of late, in: a great deal of correspondence in connection with uh a deceased friend’s archives of scholarly work and what should be done with that and his name is Harvey. And about two months ago or three months ago when the correspondence started I made a mistake when I ( ) taped his name once and I wrote H A S R V E Y, >jst a mistake<.
I must’ve written his name uh two hundred times in the last few months in connection with all the letters and the various things they were doing. Every single time I do that I get H A S R V E Y and I have to go back and correct the S. I put it in the one time and my hands learned a new way of spelling Harvey. I call ‘m Harvey but my hands call ‘m Hasrvey.
And they learned it that one time. Right then and there, the old Harvey got replaced and a new Harvey, spelled H A S R V E Y got put in. So we learn very fast.
Folks who know #EMCA history will notice this is right at the height of the activity of the Harvey Sacks Memorial Association, when Sudnow, Jefferson, Schegloff, and others were exchanging letters on Sacks’ Nachlass, intellectual priority in CA, and so on.
We have here a rare first-person record of the activity that Gail Jefferson obliquely referred to in her acknowledgement to the posthumously published Sacks lectures (“With thanks to David Sudnow who kick-started the editing process when it had stalled”), and much more explicitly in a 1988 letter (paraphrased in Button et al. 2022).
Historical interest aside, I like how the telling demonstrates Sudnow’s gift for first-person observation — a powerful combination of ethnomethodology and phenomenology that is also on display in his books, Pilgrim in the Microworld and Ways of the Hand #EMCA
Ten years ago, fresh out of my PhD, I completed three papers. One I submitted to a regular journal; it came out in 2012. One was for a special issue; it took until 2017 to appear. One was for an edited volume; the volume is yet to appear.
These may be extreme cases, but I think they reflect quite well the relative risks for early career researchers (in linguistics & perhaps more widely) of submitting to regular journals vs special issues vs edited volumes.
Avoiding the latter is not always possible; in linguistics, handbooks still have an audience. If I could advise my 2012 self, I’d say: 1. always preprint your work; 2. privilege online-first & open access venues; 3. use #RightsRetention statements to keep control over your work.
A natural experiment
Anyway, these three papers also provide an interesting natural experiment on the role of availability for reach and impact. The first, Advances in the cross-linguistic study of ideophones, now has >400 cites according to Google Scholar, improbably making it one of the most cited papers in its journal. This paper has done amazingly well.
The second, Expressiveness and system integration, has >50 cites and was scooped by a paper on Japanese that I wrote with Kimi Akita. We wrote that second paper two years after the first, but it appeared one year before it, if you still follow the chronology. As linguistics papers go, I don’t think it has done all that bad, especially considering that its impact was stunted by being in editorial purgatory for 4 years.
The third, “The language of perception in Siwu”, has only been seen by three people and cited by one of them (not me). I am not sure if or when it will see the light of day.
Too much going on at #acl2022nlp for live-tweeting, but I’ll do a wee thread on 3 papers I found thought-provoking: one on robustness probing by @jmderiu et al.; one on underclaiming by @sleepinyourhat; and one on bots for psychotherapy by Das et al.
Deriu et al. stress-test automated metrics for evaluating conversational dialogue systems. They use Blenderbot to identify local maxima in trained metrics and so identify blatantly nonsensical response types that reliably lead to high scores https://aclanthology.org/2022.acl-short.85/
As they write, "there are no known remedies to this problem". My conjecture (also see Goodhart's law): any automated metric will be affected by this as long as we're training on form alone. It's a thought-provoking paper, go read it
Next! Bowman https://aclanthology.org/2022.acl-long.516 acknowledges the harms of hype but focuses on the inverse: overclaiming the scope of work on limitations (='underclaiming'). I think his argument underestimates the enormous asymmetry of these cases and therefore may overclaim the harms?
I did wonder whether @sleepinyourhat is playing 4D chess here by writing a paper that's likely to attract citations from work that may have an incentive to overclaim the harms of underclaiming 🤯😂 #acl2022nlp
Okay because @KLM has decided to cancel my flight and delay the next one, some quick notes from the liminality of Dublin Airport on a few more #acl2022nlp papers I found interesting, revealing, or thought-provoking
Ung et al. (Facebook AI Research) train chatbots to say sorry in nicer ways, though without addressing the underlying problems that make them say offensive things in the 1st place. I thought this was both interesting and revealing of FB's priorities. Paper: https://aclanthology.org/2022.acl-long.447
Room for improvement: throughout, Ung et al remove "stop words" — but as conversation analysts can tell you, turn prefaces like uh, um, well, etc. often signal interactionally delicate matters, i.e. precisely the stuff they're hoping to track here 😬
Further, feedback is seen as strictly individual, whereas in normal human interaction it (also) reinforces *social* norms. Consider: those offended may not always have the social capital, privilege or energy to speak out ➡️ FB's bots will blithely continue to offend them 🤷
DALL-E, a new image generation system by OpenAI, does impressive visualizations of biased datasets. I like how the first example that OpenAI used to present DALL-E to the world is a meme-like koala dunking a baseball leading into an array of old white men — representing at one blow the past and future of representation and generation.
It’s easy to be impressed by cherry-picked examples of DALL•E 2 output, but if the training data is web-scraped image+text data (of course it is) the ethical questions and consequences should command much more of our attention, as argued here by Abeba Birhane and Vinay Uday Prabhu.
Suave imagery makes it easy to miss what #dalle2 really excels at: automating bias. Consider what DALL•E 2 produces for the prompt “a data scientist creating artificial general intelligence”:
When the male bias was pointed out to AI lead developer Boris Power, he countered that “it generates a woman if you ask for a woman”. Ah yes, what more could we ask for? The irony is so thicc on this one that we should be happy to have ample #dalle2 generated techbros to roll eyes at. It inspired me to make a meme. Feel free to use this meme to express your utter delight at the dexterousness of DALL-E, cream of the crop of image generation!
The systematic erasure of human labour
It is not surprising that glamour magazines like Cosmopolitan, self-appointed suppliers of suave imagery, are the first to fall for the gimmicks of image generation. As its editor Karen Cheng found out after thousands of tries, it generates a woman if you ask for “a female astronaut with an athletic feminine body walking with swagger” (Figure 3).
I also love this triptych because of the evidence of human curation in the editor’s tweet (“after thousands of options, none felt quite right…”) — and the glib erasure of exactly that curation in the subtitle of the magazine cover: “and it only took 20 seconds to make”.
The erasure of human labour holds for just about every stage of the processing-to-production pipeline of today’s image generation models: from data collection to output curation. Believing in the magic of AI can only happen because of this systematic erasure.
More NLP work on diverse languages is direly needed. Here we identify a type of data that will be critical to the field’s future and yet remains largely untapped: linguistically diverse conversational corpora. There’s more of it than you might think! Large conversational corpora are still rare & ᴡᴇɪʀᴅ* but granularity matters: even an hour of conversation easily means 1000s of turns with fine details on timing, joint action, incremental planning, & other aspects of interactional infrastructure (*Henrich et al. 2010).
We argue for a move from monologic text to interactive, dialogical, incremental talk. One simple reason this matters: the text corpora that feed most language models & inform many theories woefully underrepresent the very stuff that streamlines & scaffolds human interaction. Text is atemporal, depersonalized, concatenated, monologic — it yields readily to our transformers, tokenizers, taggers, and classifiers. Talk is temporal, personal, sequentially contingent, dialogical. As Wittgenstein would say, it’s a whole different ball game.
Take turn-taking. Building on prior work, we find that across unrelated lgs people seem to aim for rapid transitions on the order of 0 to 200 ms, resulting in plenty of small gaps and overlaps: one big reason most voice UIs today feel stilted and out of sync. This calls for incremental architectures (as folks like David Schlangen, Gabriel Skantze and Karola Pitsch have long pointed out). Here, cross-linguistically diverse conversational corpora can help to enable local calibration & to identify features that may inform TRP projection.
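To make the timing measure concrete, here is a minimal sketch of how transition offsets could be computed from a time-aligned transcript. The `Turn` record, the function name and the toy data are all illustrative assumptions, not the pipeline used in the paper. Positive offsets are gaps and negative ones are overlaps.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    # Hypothetical record for one turn in a time-aligned transcript.
    speaker: str
    start_ms: int
    end_ms: int


def transition_offsets(turns):
    """Offset in ms at each speaker change: next turn's start minus previous turn's end."""
    ordered = sorted(turns, key=lambda t: t.start_ms)
    return [
        nxt.start_ms - prev.end_ms
        for prev, nxt in zip(ordered, ordered[1:])
        if nxt.speaker != prev.speaker  # only count transitions between speakers
    ]


# Toy data, made up for illustration only.
toy_turns = [
    Turn("A", 0, 1200),
    Turn("B", 1350, 2100),
    Turn("A", 2000, 3000),
    Turn("B", 3900, 4500),
]

offsets = transition_offsets(toy_turns)
# Transitions that fall inside the 0 to 200 ms window mentioned above.
rapid = [o for o in offsets if 0 <= o <= 200]
print(offsets, rapid)
```

A real corpus would need extra care with simultaneous starts and with backchannels that do not take the floor; this sketch ignores both.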
Turns come in sequences. It’s alluring to see exchanges as slot filling exercises (e.g. Q→A), but most conversations are way more open-ended and fluid. Promisingly, some broad activity type distinctions can be made visible in language-agnostic ways & are worth examining. This bottom-up view invites us to think about languages less in terms of tokens with transition probabilities, and more as tools for flexible coordination games. Look closely at just a minute of quotidian conversation in any language (as #EMCA does) and you cannot unsee this. What’s more, even seemingly similar patterns can harbour diversity. While English & Korean both use a minimal continuer form mhm/응, we find that response tokens are about twice as frequent in the latter (and more often overlapped), with implications for parsers & interaction design.
Finally, we touch on J.R. Firth — not his NLP-famous dictum on distributional semantics, but his lesser known thoughts on conversation, which according to him holds “the key to a better understanding of what language really is and how it works” (1935, p. 71). As Firth observed, talk is more orderly and ritualized than most people think. We rightly celebrate the amazing creativity of language, but tend to overlook the extent to which it is scaffolded by recurrent turn formats — which, we find, may make up ~20% of turns at talk.
Do recurrent turn formats follow the kind of rank/frequency distribution we know from tokenised words? We find that across 22 languages, it seems they do — further evidence they serve as interactional tools (making them a prime example of Zipf’s notion of tools-for-jobs).
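As a rough illustration of what a rank/frequency check looks like, here is a minimal sketch. It assumes whole turns have already been normalised to strings. The toy counts are invented, and the least-squares fit on log-log axes is a common quick diagnostic rather than the analysis from the paper.

```python
import math
from collections import Counter


def rank_frequency(turn_formats):
    """Return (rank, frequency) pairs, most frequent format first."""
    freqs = sorted(Counter(turn_formats).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(rank_freq):
    """Ordinary least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Invented toy counts of standalone turns, for illustration only.
toy_formats = ["mhm"] * 12 + ["yeah"] * 6 + ["okay"] * 4 + ["oh"] * 3 + ["right"] * 2

rf = rank_frequency(toy_formats)
slope = loglog_slope(rf)
print(rf, round(slope, 2))
```

A slope in the neighbourhood of -1 is the classic Zipf-like signature. With real data, a proper comparison of candidate distributions would be more convincing than a single fitted slope.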
We ignore these turn formats at our own peril. Text erases them; tokenisation obscures them; dialog managers stumble over them; ASR misses them — and yet we can’t have a conversation for even thirty seconds without them. Firth was not wrong in saying they’re key.
So, implications! How can linguistically diverse conversational corpora help us do better language science and build more inclusive language technologies? We present three principles: aim for ecological validity, represent interactional infrastructure, design for diversity
Ecological validity is a hard sell because incentives in #nlproc and #ML —size, speed, SOTA-chasing— work against a move from text to talk. However, terabytes of text cannot replace the intricacies of interpersonal interaction. Data curation is key, we say with Anna Rogers. Pivoting from text to talk means taking conversational infrastructure seriously, also at the level of data structures & representations. Flattened text is radically different from the texture of turn-organized, time-bound, tailor-made talk — it takes two to tango.
A user study (Hoegen et al. 2019) provides a fascinating view of what happens when interactional infrastructure is overlooked. People run into overlap when talking with a conversational agent; the paper proposes this may be solved by filtering out “stop words and interjections”. This seems pretty much the wrong way round to us. Filtering out interjections to avoid overlap is like removing all pedestrian crossings to give free rein to self-driving cars. It’s robbing people of self-determination & agency just because technology can’t cope.
Our 3rd recommendation is to design for diversity. As the case studies show, we cannot assume that well-studied languages tell us everything we need to know. Extending the empirical and conceptual foundations of #nlproc and language technologies will be critical for progress.
Voice user interfaces are ubiquitous, yet still feel stilted; text-based LMs have many applications, yet can’t sustain meaningful interaction; and crosslinguistic data & perspectives are in short supply. Our paper sits right at the intersection of these challenges. If you’re going to be at ACL you’ll find our talk on Underline, but here’s a public version of the 12min pre-recorded talk with corrected captions for accessibility:
In this paper we consider the awesome flexibility of communicative repair in human interaction and take a peek under the hood. We ask: what elementary building blocks make this possible?
We find that several of the building blocks are found across species —from gibbons apparently self-correcting to chimps & bonobos showing persistence and elaboration— and introduce a conceptual framework that we hope will foster further comparative work
I've been interested in this topic ever since observing (in http://doi.org/10.1371/journal.pone.0136100) that ways of dealing with communicative trouble pattern within & across species in interesting ways. This year serendipity struck and we were able to get to it with an interdisciplinary team
It was great to work on this with @rapha_heesen, @MarlenFroehlich, Christine Sievers and @mariekewoe. Between us, we represent (at least) psychology, anthropology, primatology, philosophy, psychobiology and the language sciences, which made things all the more fun and interesting
Anyway, while we still seem to have a joint focus of attention, let me just drop this link here again, which (as you can read in the paper) may be a form of persistence if not elaboration http://doi.org/10.31234/osf.io/35hzt — go check it out!
One thing we found is that outside primates, research on sequentially organized social interaction is still rare — most work focuses on acoustics, song structure & ethograms rather than on contingency, sequence & interactional achievement. Lots of opportunities for exciting work!
As we point out in the paper, sequential analysis allows us to unify work on persistence & elaboration in great apes w/ work on repair in humans; and to identify possible continuities or bridging contexts, such as the freeze-look described by @elycorman and @njenfield
One risk of introducing a 'framework' is that it may be interpreted as proposing a simple matrix of ready-to-use labels for reified phenomena. Our goal here is different: we seek to make visible a space of possibilities with room for diversity & gradience
Primates 🦧🧑 are cool, but if there is one thing that I hope our paper will help contribute to it would be a broader interactive turn in communicative ethology across species 🐳🐠🐘🐦🦇 : from signals and their properties to sequential exchanges as an interactional achievement.
Always plot your data. We're working with conversational corpora and looking at timing data. Here's a density plot of the timing of turn-taking for three corpora of Japanese and Spanish. At least 3 of the distributions look off (non-normal). But why?
Plotting turn duration against offset provides a clue: in the weird looking ones, there’s a crazy number of turns whose negative offset is equal to their duration — something that can happen if consecutive turns in the data share the exact same end time (very unlikely in actual data).
Plotting the actual timing of the turns as a piano roll shows what’s up: the ways in which turns are segmented and overlap are highly improbable. Imagine a conversation that goes like this! (in red are data points on the diagonal lines above)
Fortunately some of the corpora we have for these languages don’t show this — so we’re using those. If we hadn’t plotted the data in a few different ways it would have been pretty hard to spot, with consequences down the line. So: always plot your data.
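The pattern described above can also be checked programmatically once you know what to look for. Here is a minimal sketch, assuming turns are available as `(start_ms, end_ms)` pairs in transcript order. The function name and the toy data are illustrative, not taken from our actual pipeline.

```python
def turns_sharing_end_time(turns):
    """Indices of turns whose negative offset equals their own duration.

    With offset = start - previous end and duration = end - start,
    offset == -duration holds exactly when a turn ends at the same
    time as the turn before it.
    """
    flagged = []
    for i in range(1, len(turns)):
        _, prev_end = turns[i - 1]
        start, end = turns[i]
        offset = start - prev_end
        duration = end - start
        if duration > 0 and offset == -duration:
            flagged.append(i)
    return flagged


# Invented toy data: turns 1 and 3 end exactly when the preceding turn ends.
toy_turns = [(0, 2000), (1500, 2000), (2100, 3000), (2600, 3000), (3200, 4000)]

flagged = turns_sharing_end_time(toy_turns)
share = len(flagged) / (len(toy_turns) - 1)
print(flagged, share)
```

A high share of flagged turns in a corpus suggests a segmentation artefact rather than real conversational timing. A check like this complements the plots; it does not replace them, since you only know to write it after seeing the diagonal.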
Worth reading: a group of young medical researchers argues against the marketing contest that, in their view, narrative CVs can degenerate into. It is the latest contribution to the Recognition & Rewards (Erkennen & Waarderen) debate. But nothing is what it seems. On evidence-based CVs, quality & quantification.
First this: the letter names the risk that narrative CVs lead to a kind of competition between stories. That can certainly happen when the conventions of the genre have not yet crystallized, as I already wrote in 2019, when NWO introduced it. Nobody wants a pretty-stories contest; on that we agree. In that respect I also agree, by the way, with what may be the most important point of the first letter, led by Raymond Poot: to measure is to know. You just have to know what it is you are measuring. That is also what this piece is about.
The medical researchers (both these younger colleagues and the seniors led by @raymondpoot in the opening salvo) seem to object mainly to the term “narrative CV”. Appearances are against it, of course: are we now going to sit around the campfire telling each other tall tales? Surely not! According to the letter writers, a scientist in the new system has to write something about her background & achievements “in a distinctive way” and “without using quantitative measures”. Fact check: ❌ Quantification in the narrative CV is fine, even desirable!
Let's pull up the NWO call for a moment: here is the PDF. I'm pasting the relevant part (§3.4.1, sections 1 and 2) below.
If the term “narrative CV” is not to your liking, you can also call it an evidence-based CV: instead of context-free lists & numbers, the aim is to see arguments for the excellence of the candidate & her work, backed up by qualitative and quantitative evidence of impact.
Because look: both quantitative and qualitative indicators are explicitly allowed. You would not have gathered that from the letters by Raymond Poot & fellow medical researchers. The crucial difference is that indicators must clearly relate to specific items: “All types of quality indicators may be mentioned, as long as they relate to only one output item.”
What's good about this, 1: Where previously the complete publication list could be dumped in (which mainly benefits prolific publishers), this format asks for a motivated selection of 10 items: the n-best method that is common at Ivy League universities. Nothing wrong with that!
What's good about this, 2: Where previously you could show off with journal-level metrics like the IF (statistically speaking no more than a dressed-up halo effect), you now have to provide hard evidence for the impact & importance of your work.
What's good about this, 3: Where previously you could flaunt a high h-index (not corrected for head starts due to age, co-authorship, self-citations & other biases), you now get to show which of your papers really are that brilliant & original.
That quantification is no longer allowed is nonsense
I think these are also 3 ways in which an evidence-based CV offers more opportunities precisely for the ‘vulnerable groups’ the letter mentions. (And also: 3 ways in which the head start of the traditionally privileged is somewhat evened out. Isn't that part of the pain, too?)
In short, the idea that quantification is no longer allowed is nonsense. You just can't get away any more with the most indirect numbers (which mostly say something about privilege, connections and co-authors); instead you now have to provide hard evidence for the impact & importance of your work.
I do have to say: the misunderstandings in the letters don't come entirely out of nowhere. “Narrative CV” is not a great term and there is apparently a lack of strong examples of responsible & nuanced quantification at the article level. Work to be done for Recognition & Rewards and NWO!
Finally: all the letter writers (from @raymondpoot and colleagues to @DeJongeAkademie, @RadboudYA and others, to the young medical researchers) agree that the depletion of funding is the real death blow for top-level science in our country: more investment in fundamental research is crucial
Addition, 11 May 2022:
Well, my argument in this thread, or at least the term ‘evidence-based CV’, seems to have found a hearing at NWO. Where on my CV shall I put that? 😃
This Lingbuzz preprint by Baroni is a nice read if you’re interested in linguistically oriented deep net analysis. I did feel it’s a bit hampered by the near-exclusive equation of linguistic theory with generative/Chomskyan approaches. (I know it makes a point of claiming a “very broad notion of theoretical linguistics”, but it doesn’t really demonstrate this, and throughout the implicit notion of theory is near-exclusively aligned with GG and its associated concerns of competence, poverty of the stimulus, et cetera).
For instance, it notes (citing Lappin) that theoretical linguistics “played no role” in deep learning for NLP, but while this may hold for generative grammar (GG), linguistic theorizing was much broader than that right at the start of connectionism and RNNs, e.g. in Elman 1991.
In fact, just look at the bibliography of Elman’s classic RNN work and tell us again how exactly theoretical linguistics “played no role”: Bates & MacWhinney, Chomsky, Fillmore, Fodor, Givon, Hopper & Thompson, Lakoff, Langacker, they’re all there. Elman’s bibliography is a virtual Who’s Who of big tent linguistics at the start of the 1990s. The only way to give any content to Lappin’s claim (and by extension, Baroni’s generalization) is to give the notion of “theoretical linguistics” the narrowest conceivable reading.
However, Baroni’s point may generalize: perhaps modern-day usage-based, functional, and cognitive approaches to ling theory aren’t drawing as heavily on current NLP/ML/DL work as they could either. Might a lack of reciprocity play a role? After all, the well known ahistoricism and lack of interdisciplinary engagement of NLP today does not exactly invite productive exchange. (Though some of us try.)
The theory=Chomsky equation also makes its appearance at the end, where Baroni muses about incorporating storage, retrieval, gating and attention in theories of language. Outside the confines of Chomskyan linguistics folks have long been working on precisely such things. One might think work by Joan Bybee, Maryellen MacDonald, Morten Christiansen, and others might merit a mention!
In sum, Baroni’s piece provides an informative if partial review of recent work and includes bold proposals (e.g., deep nets as algorithmic linguistic theories), worth reading if you’re interested in a particular kind of linguistics. Consider pairing it with this well-aged bottle of Elman 1991!
Bybee, J. L. (2010). Language, Usage, and Cognition. Cambridge: Cambridge University Press.
Christiansen, M. H., & Chater, N. (2017). Towards an integrated science of language. Nature Human Behaviour, 1, 0163. doi: 10.1038/s41562-017-0163
Elman, J. L. (1991). Distributed Representations, Simple Recurrent Networks, And Grammatical Structure. Machine Learning, 7, 195–225. doi: 10.1023/A:1022699029236
Lappin, S. (2021). Deep learning and linguistic representation. Boca Raton: CRC Press.
MacDonald, M. C., & Christiansen, M. H. (2002). Reassessing working memory: Comment on Just and Carpenter (1992) and Waters and Caplan (1996). Psychological Review, 109(1), 35–54. doi: 10.1037/0033-295X.109.1.35