Wordle now does Extended Latin and diacritics

Great news for those who are into visual corpus linguistics but don’t work on SAE languages: since July, Wordle handles alphabets in the Extended Latin ranges; and today its maker, Jonathan Feinberg, added support for combining diacritics. That means that you can now feed Wordle texts from languages that use tone marks and other diacritics in their orthographies. Like Siwu.
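To see why combining diacritics need special handling, here is a minimal Python sketch of a tokenizer that keeps combining marks attached to their base letters. It is not Wordle's actual code; the function name `tokenize` is hypothetical, and the sample string is made up for illustration (the grave accent on the last word is added only to exercise a combining mark).

```python
import unicodedata

def tokenize(text):
    """Split text into words, keeping combining marks attached to base letters."""
    # Decompose first, so that every diacritic becomes a separate combining character.
    text = unicodedata.normalize("NFD", text)
    words, current = [], []
    for ch in text:
        # Letters (categories L*) and combining marks (M*) both belong to a word.
        if unicodedata.category(ch)[0] in ("L", "M"):
            current.append(ch)
        elif current:
            # Recompose where a precomposed form exists, then close the word.
            words.append(unicodedata.normalize("NFC", "".join(current)))
            current = []
    if current:
        words.append(unicodedata.normalize("NFC", "".join(current)))
    return words

# Illustrative input only: the last word carries a combining grave accent (U+0300).
sample = "kp\u00e9\u00e9p, su\u00e3; mm\u0254\u0300"
print(tokenize(sample))
```

A tokenizer that treats only letters as word characters would split the last word at the combining accent, which is the kind of breakage that support for combining diacritics avoids.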

Wordle based on some ten minutes of spontaneous conversation in Siwu.

The Wordle above displays the most common words in some ten minutes of spontaneous conversation in Siwu, one of the fruits of my last fieldtrip. The conversation has four participants. Nothing groundbreaking about this particular Wordle, it’s just a nice word cloud starring: Continue reading

Now serving you from ideophone.org

The Ideophone has found a new home at https://ideophone.org/. Links to the old pages should still work, but I would like to ask readers and fellow bloggers to update their bookmarks and blogrolls.

The move was planned to take place in September, but it had to be carried out prematurely because my provider was itself migrating its servers and I didn’t want to go with them. Being in the field for five more weeks, I had no quick way of fixing it. The ever so helpful Lieuwe of ON2IT Security came to the rescue and carried out a swift and smooth migration. Lieuwe, I owe you!

Readers, thanks for understanding, and welcome back!

Zotero Sync Preview

Exciting news for Zotero users: synchronization has arrived. After some months of closed beta-testing, a public Sync Preview version was released recently. This means that Zotero users can now automatically synchronize their libraries across computers and even across platforms.

Although there are still some minor wrinkles, the sync functionality works perfectly fine and there are some exciting new features, including the possibility to import thousands of Endnote styles.1 With the import functionality comes a handy style manager, another step towards an elegant, shared, and open source solution to citation styling. That’s two killer features in one release — impressive work by the Zotero folks.

Also note the following:

Before Zotero 1.5 ships, we will add functionality to allow users to synchronize attachments to their own servers or other storage space (and we’ll also provide a hosted storage solution for all Zotero users). [forum post by Sean Takats]

Do keep in mind that the current preview is a preliminary version intended for public testing; do not expect it to be bug-free. Always make a backup copy of your full Zotero folder and try the Sync Preview in a new profile (step-by-step instructions on the sync preview page). Easier yet, download Firefox 3 Portable and try out Zotero Sync Preview 1.5 on a copy of your library without risking data loss or profile mixups. If your workflow is fine without synchronization, my advice is to avoid the growing pains of the preview version and wait until the release of the official 1.5 version, which should follow within a few months.

Not sure what Zotero is? Check the website or read my review of it.

  1. This doesn’t seem to work for all Endnote styles yet; some failures have been reported in the Zotero forums, probably due to parsing problems, in some cases caused by sloppy coding in the .ens files.

More visualizations


A visualization of the previous two posts on Many Eyes and Siwu ne

Because recursivity is a Good Thing, here is a visualization of the previous two posts on visualizing linguistic data with Many Eyes. The astute reader will note that the strange loop is not perfect since I didn’t use Many Eyes for the visualization — that is because nothing can do a simple visualization as beautifully as Wordle.

Unfortunately, Wordle doesn’t seem to handle Unicode outside the basic Latin range very well (probably because of the fancy fonts), otherwise I would’ve fed it some Siwu text, too. (I think Wordle could be made to work with SIL’s freely available Unicode fonts.)

Queensland grammar scandal at a glance

As an added bonus, here are the 75 most common content words from the recent discussion of the Queensland grammar scandal (sampled from three verbose posts at Language Log and from matjjin-nehen, including comments). It won’t help the debate, but it does give you the brouhaha at a glance. Another different something function. Grammatical grammar. Australian grammar errors indeed.
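A list of most common content words like this one can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not how Wordle does it: the stopword list is a tiny illustrative stand-in, and the function name `top_content_words` and the sample sentence are made up.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}

def top_content_words(text, n=75, stopwords=STOPWORDS):
    """Return the n most frequent words that are not in the stopword list."""
    # \w+ matches runs of letters, digits and underscores (Unicode-aware in Python 3).
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

sample = "The grammar of the grammar guide is a guide to grammar errors"
print(top_content_words(sample, n=3))
```

The result is a list of (word, count) pairs, which is exactly what a word cloud scales its font sizes by.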


The 75 most common content words about the Queensland grammar brouhaha in the linguablogosphere

(Link to Wordle found in Cornelis Puschmann’s del.icio.us feed.)

Many Eyes on Siwu ne

Lots of readers looked at the challenge I posted last week (my blog statistics say more than 450 views for the post alone, so that’s many eyes indeed). A few of you were even daring enough to come up with a story on the various functions of Siwu ne. The challenge was probably a bit too difficult (involving an untranslated text in an as yet undescribed Niger-Congo language), which makes those few attempts all the more heroic. So what did they see?


ne and its right periphery; “ne, …” accounts for almost half of the tokens

Brett was the first to bite the bullet, providing some statistics on the use of ne. He noted that “it occurs sentence initially 158 times (out of 1161) and sentence terminally 83 times. (…) It often seems to bracket a whole clause and it can even be doubled. The ne kama ne string is quite common.” Ray Girvan didn’t trust the visualization and inspected the raw text instead. He discovered that the text contains some dialogues “as well as a complete song/poem with multiple uses of “ne” in a question”; that the construct Si …. ne occurs frequently in it; and that the text probably consisted of several different text types. On came Jason with a number of rather detailed observations: Continue reading

Visual corpus linguistics with Many Eyes

I recently came across Many Eyes, a nifty data visualisation tool by IBM’s Visual Communication Lab. It has lots of options to handle tabular data, but (more interesting to linguists) it can also handle free text. The two visualization options it currently offers for text are a tag cloud and a so-called ‘word tree’. The former visualizes simple token frequency; the latter displays the occurrences of a given word (or phrase) in a branching view. It is the latter that I find the most exciting feature, because it allows for rapid visual exploration of linguistic patterns in a text.
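The idea behind a word tree can be sketched in Python as follows. This is only an illustration of the branching view, not the Many Eyes implementation; the names `word_tree` and `show`, the `depth` parameter, and the English sample text are all made up for the example.

```python
import re
from collections import defaultdict

def word_tree(text, target, depth=2):
    """Build a nested dict of the words that follow `target`, up to `depth` words deep."""
    def make_node():
        return {"count": 0, "children": defaultdict(make_node)}

    target = target.lower()
    root = make_node()
    # Treat sentence punctuation as a boundary so branches do not cross sentences.
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"\w+", sentence)
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            root["count"] += 1
            node = root
            # Walk down the tree along the words that follow this occurrence.
            for nxt in tokens[i + 1 : i + 1 + depth]:
                node = node["children"][nxt]
                node["count"] += 1
    return root

def show(node, indent=0):
    """Print the branches, most frequent first."""
    ordered = sorted(node["children"].items(), key=lambda kv: -kv[1]["count"])
    for word, child in ordered:
        print("  " * indent + f"{word} ({child['count']})")
        show(child, indent + 1)

sample = "The cat sat. The cat ran. The dog sat."
show(word_tree(sample, "the"))
```

Each branch shows a continuation of the target word together with how often it occurs, so frequent right-hand contexts stand out at a glance. A tree of left-hand contexts would work the same way with the token slices reversed.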

Take for instance the Siwu locative marker i. Before today I vaguely knew where it usually occurs (before an NP and after a VP, more or less). Now I know (1) that it also occurs sentence initially, as in I Ɔtuka ame, … {LOC Lolobi inside} ‘In Lolobi country, …’; (2) that it often precedes a deictic, as in …i mmɔ {LOC there} ‘over there’; and (3) that one can have nested occurrences, as in ma-sɛ ma-a-su kaku i ngbe-gɔ i ɔturi ɔ-kpi mmɔ {they-HAB they-FUT-take funeral LOC here-REL LOC person he-died there} ‘they usually will hold the funeral there where (‘in the place in which’) the person died’. The next step is to look more carefully into these particular constructions and improve my grammatical analysis. I might conclude, for example, that the distal deictic mmɔ is more nouny than I had taken it to be.

Of course, I would have discovered these facts eventually after carefully analyzing enough Siwu texts — but the point is that right now, finding and comparing these patterns took only five minutes of playing around with the word tree above. Cool, isn’t it? Let’s call it visual corpus linguistics. Continue reading

Done well: WALS Online

Note: An updated version of this review has been published in eLanguage on July 15th, 2008.

A common dashboard sticker in Ghanaian taxis has it that “If it must be done, it must be done well”, where ‘done well’ cleverly doubles as a brand name. This is largely irrelevant except by way of introducing WALS Online, the web version of the World Atlas of Language Structures, which really has been done well.

The massive 2005 volume and the somewhat bumpy interface of the interactive maps on the accompanying CD-ROM have been transformed into a slick web interface with all sorts of clever stuff going on behind the scenes. At a time when an increasing number of print sources are thrown online simply in the form of scans or huge PDF files, it is refreshing to see what true adaptation to the medium of hypertext can bring us. One consequence of this is that WALS Online, rather than a reprint or a second edition of WALS 2005, has become a separate publication, edited by the same authors but published by the Max Planck Digital Library.

Features and languages

WALS Online is a website consisting of five main parts. The first part, Features, functions as an index to the 142 maps and chapters of the original edition. The opening page of each feature is merely a configuration screen from which one can navigate to the chapter text or map, change the indicators used on the map, or select another feature for combined display. The chapter text is beautifully laid out, with an eye for good web typography. A minor issue is that after using the atlas for some time, the configuration screen starts to feel like an unnecessary barrier between the index and the texts and maps. It might have been better to make the content more directly accessible from the main index of features. Continue reading

The etymology of Zotero

If you’ve read yesterday’s post (Zotero, an Endnote alternative) or come across Zotero elsewhere, you may have been wondering about its name. I believe most Anglophones pronounce the word [ˌzɔˈtɛɹoʊ] (zoh-TER-o), but the term itself actually derives from the Albanian verb zotëro-j [zɔtərɔj] ‘master, acquire’.1 The final -j marks the 1st person indicative (the regular citation form for Albanian verbs); in the imperative, we would get the bare verb root zotëro [zɔtərɔ]. Such subtleties did not figure in the initial baptismal act though, as we learn from the following transcript of a podcast featuring the people behind Zotero:

The web being what it is, I just quickly googled and found an English-Albanian dictionary and typed a bunch of our keywords that we associated with the project and when I typed in ‘learning’, uhm… one of the variants was ‘to learn something extremely well, that is to master or acquire a skill in learning’ was “zotëroj” [pronounced [ˌzɒˈtuəɹʏdʲ] by DC, MD] (laughs), which we have shortened, we took off the -j at the end which is more of a ‘y’-sound and uh we took off the umlaut …
(Dan Cohen, Library Geeks Podcast 5, 22:48–25:15)

It’s that simple. And for good reason: essentially, what you need in branding is a name that sticks but at the same time is not too common; if it makes some sense (as ‘Zotero’ does), that’s even better. The main reason for choosing an Albanian word was thus quite simply to minimize namespace competition. It could have been any other language: in the podcast, Cohen mentions Maori; Hawaiian is another popular one (wikiwiki), and Bantu languages do well too (cf. Ubuntu, a trendy Linux distribution).

Will It Brand?2

Well, not really any other language of course: a quick glance over the newest web 2.0 names shows that the preferred languages for this kind of stuff seem to be those with simple phonotactics, a preference for open syllables, a basic five-vowel system, and not-too-outlandish consonant inventories. So at least in the Zotero case, Siwu is out of luck with suã ‘learn’ (nasal vowel penalty); as is Tamashek with əlmæd ‘learn, acquire’ (muddy vowels and a voiced coda, tsk); as is Ibibio with kpéép ‘learn, acquire’ (a labio-velar stop, for Pete’s sake!); readers are no doubt able to come up with better examples.

Fortunately, these need not be fatal problems. Dan Cohen’s account shows that if it doesn’t fit, we can always make it fit; just chop off needless morphology and diacritics and you’re good to go. Now Albanian, hitherto an obscure 6-million-speaker language making up its own branch of Indo-European, enjoys celebrity status as the language that endowed the Next-Generation Research Tool with a worthy name. Come to think of it, who would not like to sacrifice some orthographic blunt for publicity’s sake? Suddenly all those woefully inadequate orthographies we linguists have been cursing at are beginning to make sense!3 Next time the underspecified orthography drives you nuts again, find a product in need of a name and monetize your despair. I’ve heard naming consultants easily make twice as much as linguists.

P.S. A great resource on naming is Nancy Friedman’s Away With Words, which I found via the posting on Web 2.0 names referenced above.

  1. See this article in the GMU Gazette.
  2. For those not aware of the covert reference here, check out the hilarious Will It Blend? viral marketing campaign.
  3. Did you know that Maa (Eastern Nilotic, East Africa) has nine contrastive vowels but is usually written using Swahili’s five-vowel orthography? Did you know Siwu (Kwa, Ghana) and Nafaanra (Senufo, Ghana) have three distinctive tones, none of which are marked orthographically?

Zotero, an Endnote alternative

I wasn’t planning to make this a software weblog, but I’ll make an exception for Zotero because I think fellow researchers will find it an interesting tool. Zotero [ˌzɔˈtɛɹoʊ] is a free piece of software that lives in your browser, helping you to ‘collect, manage and cite your research sources’ in all sorts of beautiful ways. It bills itself as The Next-Generation Research Tool, and in this post I’ll try to explain why I think that’s true. The background to this posting is that I made the move from Endnote to Zotero two months ago — and I have never since considered going back.

It all started when I upgraded from Endnote 7 to Endnote X to get Unicode support.1 Endnote X included Endnote Web, a web-based implementation that looked interesting. I had some difficulty getting the two to work together, and when I finally did, there were drawbacks that made me look for an alternative. A Google search led me to Zotero, which was a breeze to install. I could simply import my Endnote library and start a test drive. Within minutes I was totally hooked. The Zotero interface offered everything I had been missing in Endnote and then some. What makes Zotero so good?

Seamless integration with online research

First of all, Zotero answers the needs of researchers in the digital age. The rise of online repositories like JSTOR, ProQuest, SpringerLink, and Google Scholar has caused a shift in our research habits; we spend more time browsing virtual libraries, and less time hanging around in physical ones.2 Zotero seamlessly integrates with this online experience by automating the wearisome labour of saving references and by offering many ways to manage and enrich the data thus collected. All from within the web browser. Continue reading

  1. Some Unicode support has been in place from v. 8 on, though without RTL abilities.
  2. We all like to stress how we still appreciate the feel of paper in our hands, and the smell of books in a well-stocked library. The point here is merely that as more and more of these offline sources become available for online searching, our research habits (though not necessarily our reading habits) are bound to be affected by this.