Duplicate detection is one of the things any serious reference manager should offer. Zotero users have been clamouring for it since the early days. There are basically two ways to implement it: as a preflight check, warning users when they are about to add a potential duplicate; and as an after-the-fact scan, which enables users to weed duplicate items from their library.
The most recent version of Zotero takes the second route: a post hoc duplicate detection mechanism. Though definitely better than nothing, and with an elegant merging solution, the interface is still far from perfect and yields a lot of false positives, making it somewhat difficult to use. Besides, it is slow, because it tries to compare everything with everything, which amounts to a huge number of operations even in moderately sized libraries. What seems to me to have been overlooked is that prevention is better than cure, and that a quick check before adding new items to the library would help users a lot.
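To put rough numbers on that, here is a small back-of-the-envelope sketch. It assumes the scan really is a naive all-pairs comparison, as described above; I have not checked this against Zotero's actual implementation, and the function names are made up for illustration.

```javascript
// Assumption: a naive after-the-fact scan compares every item with every
// other item once, which gives n(n-1)/2 comparisons for n items.
function pairwiseComparisons(n) {
  return (n * (n - 1)) / 2;
}

// A preflight check only compares the one incoming item against each of the
// n items already in the library.
function preflightComparisons(n) {
  return n;
}

for (const n of [100, 1000, 5000]) {
  console.log(
    `${n} items: ${pairwiseComparisons(n)} comparisons for a full scan, ` +
      `${preflightComparisons(n)} for a check on add`
  );
}
```

For a library of 5000 items that is about 12.5 million comparisons for a full scan, against 5000 for a check at the moment of adding.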
It appears that the Zotero developers have too much on their plate to think about such niceties, but as Avram Lyon has pointed out, this could be done as a Zotero plugin, and anyone interested in a little challenge could wade in. (I think you need to be literate in XUL and JS, and probably SQLite too. A sample Zotero plugin here. Also see this thread for some helpful comments by Zotero developer Dan Stillman on the best way to implement it.)
What kind of form should a user interface for this take? I would propose something like the following:
Do you really want to add this item? It looks like it already exists in your library.
Smith, Joe. 2010. How to avoid duplicate entries. Ms., Amsterdam.
[Button A: Cancel and go to similar item.] [Button B: Add anyway.]
The interface should be fast and reliable. I propose the following basic workflow: upon adding a new item, check a small number of strategically chosen fields and assign a duplicate score according to some simple rules, similar to spam rating systems. If the duplicate score exceeds x, bring up the interface I propose above. (The variable x and the weight of individual rules could be made customizable, but there is no need for that in a first version; as they say, release early, release often.) My proposal of fields to use, ranked by descending weight:
- author last name
- title, not case-sensitive (only first n words?)
- page numbers
DOI is hit or miss: a matching DOI is a very good sign, but not all items will have a DOI. Author last name + Title + Year should probably receive a combined weight that is the same as or higher than that of DOI. Given the importance of the first four, perhaps 5 and 6 have little added value.
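The scoring idea can be sketched in a few lines of plain JavaScript. This is only an illustration under stated assumptions, not a working Zotero plugin: it does not touch the Zotero API, the item objects and their field names (`doi`, `authorLastName`, `title`, `year`, `pages`) are hypothetical, and the weights, the threshold x and the number of title words n are placeholder values. The only constraint taken from the text is that author + title + year together weigh at least as much as a DOI match.

```javascript
// Hypothetical weights: author + title + year (3 + 2 + 1) equals the DOI weight.
const WEIGHTS = { doi: 6, authorLastName: 3, title: 2, year: 1, pages: 1 };
const THRESHOLD = 5;   // the "x" from the text; the value is an assumption
const TITLE_WORDS = 5; // the "n" from the text; the value is an assumption

// Lowercase the title, drop punctuation, and keep only the first n words.
function normalizeTitle(title, n = TITLE_WORDS) {
  return (title || "")
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "")
    .split(/\s+/)
    .filter(Boolean)
    .slice(0, n)
    .join(" ");
}

// Case-insensitive comparison; an empty or missing value never counts as a match.
function sameText(a, b) {
  return Boolean(a) && Boolean(b) && a.trim().toLowerCase() === b.trim().toLowerCase();
}

// Add up the weights of all fields on which the two items agree.
function duplicateScore(candidate, existing) {
  let score = 0;
  if (sameText(candidate.doi, existing.doi)) score += WEIGHTS.doi;
  if (sameText(candidate.authorLastName, existing.authorLastName)) {
    score += WEIGHTS.authorLastName;
  }
  const candidateTitle = normalizeTitle(candidate.title);
  if (candidateTitle && candidateTitle === normalizeTitle(existing.title)) {
    score += WEIGHTS.title;
  }
  if (sameText(String(candidate.year ?? ""), String(existing.year ?? ""))) {
    score += WEIGHTS.year;
  }
  if (sameText(candidate.pages, existing.pages)) score += WEIGHTS.pages;
  return score;
}

// Return the library items whose score exceeds the threshold, best match first.
function findLikelyDuplicates(candidate, library, threshold = THRESHOLD) {
  return library
    .map((item) => ({ item, score: duplicateScore(candidate, item) }))
    .filter((result) => result.score > threshold)
    .sort((a, b) => b.score - a.score);
}

// Usage: the incoming item differs from the first entry only in case and punctuation.
const library = [
  { authorLastName: "Smith", title: "How to avoid duplicate entries", year: 2010 },
  { authorLastName: "Smith", title: "Something else entirely", year: 2010 },
];
const incoming = {
  authorLastName: "smith",
  title: "How To Avoid Duplicate Entries.",
  year: "2010",
};
const hits = findLikelyDuplicates(incoming, library);
console.log(hits.map((hit) => `${hit.item.title} (score ${hit.score})`));
```

With these placeholder weights, the first library entry matches on author, title and year and gets flagged, while the second matches only on author and year and stays below the threshold. A real plugin would run this check once per newly added item and show the dialog proposed above when there is at least one hit.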