Always plot your data

Always plot your data. We're working with conversational corpora and looking at timing data. Here's a density plot of the timing of turn-taking for three corpora of Japanese and Spanish. At least 3 of the distributions look off (non-normal). But why?

Plotting turn duration against offset provides a clue: in the weird looking ones, there’s a crazy number of turns whose negative offset is equal to their duration — something that can happen if consecutive turns in the data share the exact same end time (very unlikely in actual data).
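The check behind that plot is easy to script. Here is a minimal sketch in Python, not the code we actually used. It assumes each turn has a start and end time in milliseconds, and that a turn's offset is its start time minus the end time of the preceding turn (ordered by start time). The `Turn` class, the function names and the toy data are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    start: int  # milliseconds (assumed unit)
    end: int    # milliseconds (assumed unit)


def turn_offsets(turns):
    """Return (turn, offset, duration) for every turn after the first.

    Assumption: offset = start of this turn minus end of the previous
    turn, with turns ordered by start time. Negative offset = overlap.
    """
    ordered = sorted(turns, key=lambda t: t.start)
    rows = []
    for prev, cur in zip(ordered, ordered[1:]):
        offset = cur.start - prev.end
        duration = cur.end - cur.start
        rows.append((cur, offset, duration))
    return rows


def suspicious_turns(turns, tolerance=0):
    """Turns whose negative offset equals their duration.

    -offset == duration holds exactly when this turn ends at the same
    moment as the previous one, which is what puts it on the diagonal.
    """
    return [
        cur
        for cur, offset, duration in turn_offsets(turns)
        if duration > 0 and abs(offset + duration) <= tolerance
    ]


# Hypothetical toy data: the third turn ends exactly when the second does.
turns = [
    Turn("A", 0, 1000),
    Turn("B", 1200, 2000),
    Turn("A", 1500, 2000),
    Turn("B", 2100, 2600),
]
flagged = suspicious_turns(turns)
print(flagged)
```

A handful of flagged turns in a corpus would be unremarkable. A large share of them suggests something about the segmentation rather than about the conversation.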

Plotting the actual timing of the turns as a piano roll shows what’s up: the way turns are segmented and overlap is highly improbable. Imagine a conversation that goes like this! (The data points on the diagonal lines above are shown in red.)

Fortunately some of the corpora we have for these languages don’t show this — so we’re using those. If we hadn’t plotted the data in a few different ways it would have been pretty hard to spot, with consequences down the line. So: always plot your data.

Originally tweeted by (@DingemanseMark) on November 6, 2021.


Via Language Log, a nice tutorial titled Interactive Visualization for Computational Linguistics [PDF, 13,1 Mb] by Christopher Collins, Gerald Penn, and Sheelagh Carpendale. Includes not only lots of wonderful visualizations, but also a lot of background information on Gestalt perception, visualizations as ‘external cognition’, preattentive processing, info on a case study (slide 196ff.), and ample examples of different kinds of visualization software. See also InfoVis:Wiki — Linguistic Visualization.

Wordle now does Extended Latin and diacritics

Great news for those who are into visual corpus linguistics but don’t work on SAE languages: since July, Wordle handles alphabets in the Extended Latin ranges; and today its maker, Jonathan Feinberg, added support for combining diacritics. That means that you can now feed Wordle texts from languages that use tone marks and other diacritics in their orthographies. Like Siwu.

Wordle based on some ten minutes of spontaneous conversation in Siwu.

The Wordle above displays the most common words in some ten minutes of spontaneous conversation in Siwu, one of the fruits of my last fieldtrip. The conversation has four participants. Nothing groundbreaking about this particular Wordle, it’s just a nice word cloud starring: Continue reading

More visualizations


A visualization of the previous two posts on Many Eyes and Siwu ne

Because recursivity is a Good Thing, here is a visualization of the previous two posts on visualizing linguistic data with Many Eyes. The astute reader will note that the strange loop is not perfect since I didn’t use Many Eyes for the visualization — that is because nothing can do a simple visualization as beautifully as Wordle.

Unfortunately, Wordle doesn’t seem to handle Unicode outside the basic Latin range very well (probably because of the fancy fonts), otherwise I would’ve fed it some Siwu text, too. (I think Wordle could be made to work with SIL’s freely available Unicode fonts.)

Queensland grammar scandal at a glance

As an added bonus, here are the 75 most common content words from the recent discussion of the Queensland grammar scandal (sampled from three verbose posts at Language Log and from matjjin-nehen, including comments). It won’t help the debate, but it does give you the brouhaha at a glance. Another different something function. Grammatical grammar. Australian grammar errors indeed.


The 75 most common content words about the Queensland grammar brouhaha in the linguablogosphere

(Link to Wordle found in Cornelis Puschmann‘s feed.)

Many Eyes on Siwu ne

Lots of readers looked at the challenge I posted last week (my blog statistics say more than 450 views for the post alone, so that’s many eyes indeed). A few of you were even daring enough to come up with a story on the various functions of Siwu ne. The challenge was probably a bit too difficult (involving an untranslated text in an as yet undescribed Niger-Congo language), which makes those few attempts all the more heroic. So what did they see?


ne and its right periphery; “ne, …” accounts for almost half of the tokens

Brett was the first to bite the bullet, providing some statistics on the use of ne. He noted that “it occurs sentence initially 158 times (out of 1161) and sentence terminally 83 times. (…) It often seems to bracket a whole clause and it can even be doubled. The ne kama ne string is quite common.” Ray Girvan didn’t trust the visualization and inspected the raw text instead. He discovered that the text contains some dialogues “as well as a complete song/poem with multiple uses of “ne” in a question”; that the construct Si …. ne occurs frequently in it; and that the text probably consisted of several different text types. On came Jason with a number of rather detailed observations: Continue reading

Visual corpus linguistics with Many Eyes

I recently came across Many Eyes, a nifty data visualisation tool by IBM’s Visual Communication Lab. It has lots of options to handle tabular data, but (more interesting to linguists) it can also handle free text. The two visualization options it currently offers for text are a tag cloud and a so-called ‘word tree’. The former visualizes simple token frequency, the latter displays the occurrences of a given word (or phrase) in a branching view. It is the latter that I find the most exciting feature, because it allows for rapid visual exploration of linguistic patterns in a text.
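For readers who like to see the counting behind such views, here is a rough sketch in Python. It is not how Many Eyes works internally (I have no idea what happens under the hood). It only illustrates the two ideas: token frequencies for a tag cloud, and the contexts to the right of a chosen word, which is roughly one level of a word tree. The function names and the English toy text are invented for the example, and the tokenizer is a deliberately naive whitespace split.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Naive tokenizer: split on whitespace, lowercase, strip punctuation."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [tok for tok in tokens if tok]


def right_contexts(tokens, target, width=1):
    """Count the sequences of up to `width` tokens that follow `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            context = tuple(tokens[i + 1 : i + 1 + width])
            counts[context] += 1
    return counts


# Toy text, made up for illustration only.
text = "The cat sat on the mat. The cat ran. On the mat the dog sat."
tokens = tokenize(text)

frequencies = Counter(tokens)                   # what a tag cloud is sized by
branches = right_contexts(tokens, "the", width=1)  # first level of a word tree

print(frequencies.most_common(3))
print(branches.most_common())
```

Raising `width` gives longer branches. A real word tree would merge contexts that share a prefix, which this flat count does not attempt.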

Take for instance the Siwu locative marker i. Before today I vaguely knew where it usually occurs (before an NP and after a VP, more or less). Now I know (1) that it also occurs sentence initially, as in I Ɔtuka ame, … {LOC Lolobi inside} ‘In Lolobi country, …’; (2) that it often precedes a deictic, as in …i mmɔ {LOC there} ‘over there’; and (3) that one can have nested occurrences, as in ma-sɛ ma-a-su kaku i ngbe-gɔ i ɔturi ɔ-kpi mmɔ {they-HAB they-FUT-take funeral LOC here-REL LOC person he-died there} ‘they usually will hold the funeral there where (‘in the place in which’) the person died’. The next step is to look more carefully into these particular constructions and improve my grammatical analysis. I might conclude, for example, that the distal deictic mmɔ is more nouny than I had taken it to be.

Of course, I would have discovered these facts eventually after carefully analyzing enough Siwu texts — but the point is that right now, finding and comparing these patterns took only five minutes of playing around with the word tree above. Cool, isn’t it? Let’s call it visual corpus linguistics. Continue reading