Visual corpus linguistics with Many Eyes

I recently came across Many Eyes, a nifty data visualization tool by IBM’s Visual Communication Lab. It has lots of options for handling tabular data, but (more interesting to linguists) it can also handle free text. The two visualization options it currently offers for text are a tag cloud and a so-called ‘word tree’. The former visualizes simple token frequency; the latter displays the occurrences of a given word (or phrase) in a branching view. It is the word tree that I find most exciting, because it allows for rapid visual exploration of linguistic patterns in a text.
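For readers who want to see what these two views boil down to, here is a minimal Python sketch. It is not how Many Eyes itself is implemented; it only illustrates the idea under some assumptions: the text is already tokenized with spaces (punctuation counted as separate tokens), and the names `tokenize`, `build_word_tree` and `print_tree` as well as the toy English sample are made up for illustration. The tag cloud corresponds to a plain token count, the word tree to a nested count of what follows a chosen word.

```python
from collections import Counter

def tokenize(text):
    """Split on whitespace; assumes punctuation is already space-separated."""
    return text.split()

def build_word_tree(tokens, root, depth=3):
    """Nested counts of the contexts following `root`, up to `depth` tokens."""
    tree = {"count": 0, "children": {}}
    for i, tok in enumerate(tokens):
        if tok != root:
            continue
        tree["count"] += 1
        node = tree
        # Walk down one branch per following token, counting as we go.
        for nxt in tokens[i + 1 : i + 1 + depth]:
            child = node["children"].setdefault(nxt, {"count": 0, "children": {}})
            child["count"] += 1
            node = child
    return tree

def print_tree(node, label, indent=0):
    """Print the tree with the most frequent branches first."""
    print("  " * indent + f"{label} ({node['count']})")
    ranked = sorted(node["children"].items(), key=lambda kv: -kv[1]["count"])
    for word, child in ranked:
        print_tree(child, word, indent + 1)

# Toy sample, not taken from the Siwu data.
sample = "the cat sat . the cat ran . the dog sat ."
tokens = tokenize(sample)

freq = Counter(tokens)                      # what a tag cloud is based on
tree = build_word_tree(tokens, "the", 2)    # what a word tree is based on
print_tree(tree, "the")
```

Branches that share a prefix are merged and counted together, which is why frequent continuations stand out in the tree.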

Take for instance the Siwu locative marker i. Before today I vaguely knew where it usually occurs (before an NP and after a VP, more or less). Now I know (1) that it also occurs sentence-initially, as in I Ɔtuka ame, … {LOC Lolobi inside} ‘In Lolobi country, …’; (2) that it often precedes a deictic, as in …i mmɔ {LOC there} ‘over there’; and (3) that one can have nested occurrences, as in ma-sɛ ma-a-su kaku i ngbe-gɔ i ɔturi ɔ-kpi mmɔ {they-HAB they-FUT-take funeral LOC here-REL LOC person he-died there} ‘they usually will hold the funeral there where (‘in the place in which’) the person died’. The next step is to look more carefully into these particular constructions and improve my grammatical analysis. I might conclude, for example, that the distal deictic mmɔ is more nouny than I had taken it to be.

Of course, I would have discovered these facts eventually after carefully analyzing enough Siwu texts — but the point is that right now, finding and comparing these patterns took only five minutes of playing around with the word tree above. Cool, isn’t it? Let’s call it visual corpus linguistics.

Some of the best options, by the way, are not available in the embedded view above. Take a look at the full version to order alphabetically or by frequency, or try the start/end radio buttons, which let you view preceding and following words. And if you don’t happen to read Siwu, why don’t you check out Shakespeare’s All’s Well That Ends Well?

Collaborative data mining

Wait, let me retract that. It doesn’t even matter that you don’t know any Siwu: the power of frequency-based visualization is that you should be able to spot salient grammatical patterns anyway.[1] It would be interesting to test this. Can you, based on this dataset of 21,891 Siwu words and punctuation marks,[2] work out what kind of functions the word ne might have?
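If you would rather count than click, the same kind of question can be approached outside the tool. The sketch below tallies which tokens occur immediately to the left and right of a target word. It assumes space-separated tokens with punctuation as separate tokens; the function name `neighbour_counts`, the `<start>`/`<end>` placeholders, and the toy sequence (dummy letters around ne, not real Siwu) are all made up for illustration.

```python
from collections import Counter

def neighbour_counts(tokens, target):
    """Count the tokens directly before and directly after each `target`."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # Placeholders mark a target at the very start or end of the data.
        left[tokens[i - 1] if i > 0 else "<start>"] += 1
        right[tokens[i + 1] if i + 1 < len(tokens) else "<end>"] += 1
    return left, right

# Toy sequence with dummy letters, not real Siwu.
tokens = "ne x y , ne z . ne , w ne .".split()
left, right = neighbour_counts(tokens, "ne")

print(left.most_common())   # what precedes the target
print(right.most_common())  # what follows the target
```

A high count for “.” on the left suggests sentence-initial use, and a high count for “,” or “.” on the right suggests clause-final or sentence-final use.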

Perhaps I’m making it too difficult by not providing the meaning of the Siwu text, but I do think that this type of collaborative pattern hunting is an interesting addition to the toolkit for linguistic analysis. In the future, we will use visualizations like this to get a feeling for the relative frequencies of different constructions in which an item occurs, and to quickly test each other’s hypotheses. The philosophy behind Many Eyes is interesting in this respect. As the About page says: “Many Eyes is a bet on the power of human visual intelligence to find patterns. Our goal is to ‘democratize’ visualization and to enable a new social kind of data analysis.”

I am quite sure that there are specialized corpus linguistic applications out there that have far more sophisticated data-munching and searching capabilities. But the simplicity of Many Eyes may be its secret power. It is totally hassle-free: feed it a text and you can instantly play around with different types of visualizations. The intuitive interface makes it possible to quickly traverse the data in search of patterns, or do some quick testing of constructional hypotheses. In fact, you do not even need to upload your own dataset, because you can create visualizations based on any existing dataset (all the data is publicly available). And did I mention it is free?

Limitations

Despite all the goodness, Many Eyes is not perfect, although to be fair, it wasn’t exactly made for linguistic analysis to begin with. After some playing around, you will inevitably hit upon the limitations; I was a bit disappointed, for instance, to see that the tag cloud chokes on Unicode characters outside the basic Latin range. The following few relatively simple improvements would hugely enhance the linguistic uses of Many Eyes:

  • tag cloud: proper handling of Unicode characters outside the basic Latin range.
  • visualization over multiple data sets. (Right now, a workaround is to make a new dataset combining other data sets. However, this is not straightforward.)
  • word tree: positional wildcards to enable searching for patterns like Siwu “i * ame” (LOC X inside) or English “I * know” (capturing ‘I don’t know’, ‘I damn well know’, etc.). This would involve a branching tree between two focus words, very interesting visually. It should be possible to limit the search to N intervening words, or to search for an arbitrary number of intervening words (respecting sentence boundaries, I guess).
  • word tree: show target word in full context (i.e. show the words left and right of it). Note that this is not trivial, as there is the problem of how to connect left and right parts of sentences on potential regroupings. The sort by occurrence option would no longer work in any case. Perhaps highlighting the relevant connector lines on mouseover would be a good solution.
  • word tree: some way to do partial word searches, e.g. “*kɛlɛ” to find all occurrences of the verb kɛlɛ ‘go’ regardless of subject agreement and tense/aspect prefixes. This would involve non-trivial ordering and visualization problems (how to visualize the difference between a preceding word and a preceding part of a word? Perhaps this would add too much clutter).
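The search side of the last few wishes (though not the visualization side) is easy enough to sketch in Python. This is only an illustration under stated assumptions: tokens are space-separated with punctuation as separate tokens, “.”, “?” and “!” are taken to mark sentence boundaries, a wildcard must match at least one word, and the function names `find_gapped` and `find_suffix` and the toy English examples are hypothetical.

```python
from collections import Counter

SENTENCE_END = {".", "?", "!"}  # assumed sentence-boundary tokens

def find_gapped(tokens, first, last, max_gap=3):
    """Find `first * last` with 1..max_gap intervening tokens.

    Matches never cross a sentence boundary, and only the shortest
    match for each occurrence of `first` is returned.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        for j in range(i + 2, min(i + 2 + max_gap, len(tokens))):
            if tokens[j - 1] in SENTENCE_END:
                break  # an intervening token ends the sentence
            if tokens[j] == last:
                hits.append(tokens[i : j + 1])
                break
    return hits

def find_suffix(tokens, suffix):
    """Count all tokens ending in `suffix`, i.e. a search like "*suffix"."""
    return Counter(t for t in tokens if t.endswith(suffix))

# Toy English examples, not taken from the Siwu data.
tokens = "I do not know . I damn well know . I think you know".split()
hits = find_gapped(tokens, "I", "know", max_gap=2)
for hit in hits:
    print(" ".join(hit))

forms = find_suffix("re-do un-do do done".split(), "do")
print(forms)
```

For text with combining diacritics it is probably wise to normalize both the corpus and the search string to the same Unicode form first (for instance with `unicodedata.normalize("NFC", ...)`), so that visually identical words also compare as equal.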

Until these are implemented (either in Many Eyes or in a specialized linguistic cousin), those of us working on radically isolating languages will probably be happier with the tool than those working on the polysynthetic end of the continuum. Be that as it may, Many Eyes remains an exceptionally useful application, offering a glimpse of the bright future of visual corpus linguistics.[3] Thank you, VCL people!

(Hat tip: Ethan Zuckerman.)

Notes

  1. Only to some extent of course, and given enough patience.
  2. That’s a hint. Try searching for “ne ,”, “ne .”, and “,” for example. And remember to switch views with the ‘start’ and ‘end’ radio buttons to see both the left and right periphery.
  3. The term Visual Corpus Linguistics seems to be new to the Google corpus. Interesting. Unfortunately, its acronym VCL has already been taken by a certain IBM research lab. Ah well, let’s hope the token frequency of the former will soon come to overshadow that of the latter.

10 thoughts on “Visual corpus linguistics with Many Eyes”

  1. OK, I’ll play. It occurs sentence initially 158 times (out of 1161) and sentence terminally 83 times (assuming that in Siwu a period marks sentence boundaries). It often seems to bracket a whole clause and it can even be doubled. The ne kama ne string is quite common.

    I’m running out of time, so I’ll guess it has a grammatical function. Perhaps a negation (though I may be influenced by the spelling) or a question marker.

    I find it frustrating not to be able to compare two things at the same time.

  2. Yes indeed, sentence boundaries are marked by periods. Questions are marked by a question mark, by the way. Nice try; I’ll withhold my comments until later.

    To all Eyes:
    Do note that you can save snapshots of a particular state of the word tree by adding a comment below the visualization window (I appended one as an example).

  3. I agree about the problem of the lack of positional wildcards. As I said at Language Log, some patterns are far easier to see in the raw text, such as the frequent occurrence of a “Si … ne” construct as sentences and opening clauses.

  4. We need a cloud view too. :(

    Some raw-ish observations:

    – “ne” is the most common word in the sample data. It is more common than the most common punctuation marks, “.” and “,”. It is the most common word following a period (“. ne” gives 158 hits; the runner-up “. si” has 115) and the most common word preceding a period (“ne .”: 84; “ame .”: 72).

    – “ne” usually appears before or after a “.” or “,”:
    — 42.9%: immediately before “,”
    — 13.6%: sentence-initial (“ne” after “.” “?” or “!”)
    — 7.8%: sentence-final (“ne” before “.” “?” or “!”)
    — 4.1%: immediately after “,”
    — 31.5%: other

    – There are 9 hits for “ne , ne” and one for “ne ne”. (“Si maɔsɛgu ne ne, ɔrɔ̃go kpakpa gɔ mpia…”)

    – “ne kama ne” appears 31 times, almost always after punctuation.

    – The most common word after a sentence ending in “ne .” is “ne”.

    But I found myself wanting to browse the source text, and as soon as I did that, I noticed two things:

    – Wow, that tool hides a lot about the nature of the sample!

    – Right away I wanted to write Python scripts to analyze the data in ways the tool doesn’t support.

    Doing this, I found out that in the file there’s 1 instance of “ne-oo” and 1 instance of “ne- – – oo!” No idea whether those are really instances of “ne” or something else entirely.

  5. “Wow, that tool hides a lot about the nature of the sample!”
    Yep: stuff that could be useful, like possible relation to proper names and what appears to be dialogue, as well as a complete song/poem with multiple uses of “ne” in a question.

  6. I forgot to add a silly guess. I don’t know anything about languages, beyond English and a little Spanish, so this is probably pretty dumb :-) but: I’m guessing it performs some function that is not served by any English word. Maybe it indicates past tense.

    For a while I was thinking preposition, but it seems really inconvenient for a language to prefer that prepositions appear at the extremes of sentences. What do I know, though.

    Oh, and my guess is that ne kama ne serves as a conjunctive adverb (like furthermore or however in English).

    Don’t know what to make of that ne ne, though.

  7. Brett, Ray, Jason: beautiful observations. Expect a writeup incorporating your findings in a few days. Re: the missing tag cloud — I contacted the Many Eyes people on this issue, and they’ve said they are working to fix the Unicode problem.

  8. Another thing I noticed from the raw text: the “Si … ne” construct is only common in the first half, and absent in the second half (whose title indicates it to be a formal book). Identifying the text type of the first half – for instance, is it personal narrative that would have many “I …” sentences? – could be enlightening in identifying the nature of “Si … ne”.
