How Academia.edu promotes poor metadata and plays to our vanity

A while back some low-quality citations started showing up on Google Scholar. They had titles like “CHAPTER 2 draft — email” and it was hard to find actual bibliographic metadata. Google Scholar seemed to have scraped random PDFs uploaded on Academia.edu and decided it was worth counting the citations in them even in the absence of proper metadata. I shared this on Twitter and promptly forgot about it.

Then I got an email from someone asking me to say a bit more about my concerns with poor metadata. I decided to write it up in a blog post. I’m afraid it turned into a bit of a rant about how Academia.edu seems built not so much for sharing scientific information as for playing to our vanity. Sorry about that. Let’s start with the poor metadata issue, which turns out to be rather pervasive.

Academia.edu has a massive metadata problem

  • Academia.edu doesn’t record any metadata except title and author for the bulk of papers, and doesn’t expose any metadata using standard formats like RDF/unAPI.

Ever tried to figure out how to cite a paper on Academia.edu? This is hard because most of the metadata is missing. Reference managers like Zotero or Mendeley cannot detect and save papers for citing. For those hoping to cite works uploaded there, this makes life more difficult than it needs to be. For users of Academia.edu, this hurts citability. Yes, I know Academia.edu touts a 69% citation advantage. See here for a discussion of some concerns about that study.1 My point here is simply that whoever wants to cite papers found on Academia.edu currently has to get the metadata from elsewhere.

Google Scholar has started to index the heap of PDFs on Academia.edu, but has to resort to scraping only the most superficial info available, usually from the PDF, because there is no metadata. This is the number one reason for the junk citations in Google Scholar that started this story. It means people’s work is misrepresented and it becomes harder to figure out how to cite it. That is bad news, because Google Scholar is very widely used. For users of Academia.edu, this hurts both the findability and the citability of their work. For users of Google Scholar, this adds more noise to an already noisy system.
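
To make “exposing metadata” concrete: Google Scholar’s inclusion guidelines ask sites to embed bibliographic <meta> tags such as citation_title and citation_author in each paper’s landing page, and reference managers like Zotero can read the same tags. Here is a minimal Python sketch of generating them. The Paper record and the scholar_meta_tags function are hypothetical names of my own, not anything Academia.edu provides, and the example paper and its DOI are made up.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Optional


@dataclass
class Paper:
    # Hypothetical record type, for illustration only.
    title: str
    authors: List[str]
    publication_date: str  # e.g. "2016/08/26"
    journal: Optional[str] = None
    doi: Optional[str] = None
    pdf_url: Optional[str] = None


def scholar_meta_tags(paper: Paper) -> str:
    """Render citation_* <meta> tags for a landing page, one tag per line."""
    pairs = [("citation_title", paper.title)]
    # One citation_author tag per author, in the order given on the paper.
    pairs += [("citation_author", author) for author in paper.authors]
    pairs.append(("citation_publication_date", paper.publication_date))
    optional = [
        ("citation_journal_title", paper.journal),
        ("citation_doi", paper.doi),
        ("citation_pdf_url", paper.pdf_url),
    ]
    # Only emit optional tags that actually have a value.
    pairs += [(name, value) for name, value in optional if value]
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in pairs
    )


# Made-up example record.
example = Paper(
    title="An example paper",
    authors=["A. Author", "B. Writer"],
    publication_date="2016/08/26",
    doi="10.1234/example.5678",
)
print(scholar_meta_tags(example))
```

With tags like these in the page head, a crawler no longer has to guess the title and authors from the PDF.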

  • Academia.edu is built for single-authored papers, and its handling of multi-authored papers is surprisingly poor.

The default way of scraping author names leads to many errors and they can only be fixed manually. Take the paper Academia.edu staff published on ‘discoverability’: the authors are all jumbled up. Only the original uploader owns the item and can add or fix bibliographic metadata, and for other authors, it’s hard to see who the owner is. There is no system for duplicate detection and resolution. It is too easy for multiple authors to upload the same paper with slight differences in bibliographic metadata. It is too hard to clean up the mess and make sure there is only one good version of record. This affects people’s profiles and has undesirable knock-on effects for the points above.
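
Duplicate detection does not need to be sophisticated to catch the common case of co-authors uploading the same paper with slightly different titles. A minimal sketch, assuming each record is a plain dictionary with a "title" and an optional "doi" (the field names and the dedupe_key and group_duplicates helpers are my own invention, not Academia.edu’s data model): use the DOI as the key when there is one, and otherwise fall back to a normalised title.

```python
import re
import unicodedata
from collections import defaultdict


def dedupe_key(record: dict) -> str:
    """Return a key that is equal for records likely to be the same paper."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        # DOIs are case-insensitive, so lowercase before comparing.
        return "doi:" + doi
    # No DOI: strip accents, case, punctuation and extra whitespace from the title.
    title = unicodedata.normalize("NFKD", record.get("title") or "")
    title = "".join(ch for ch in title if not unicodedata.combining(ch))
    title = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    return "title:" + title


def group_duplicates(records: list) -> list:
    """Group records by key and return only groups with more than one member."""
    groups = defaultdict(list)
    for record in records:
        groups[dedupe_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


# Made-up uploads of the same paper by two co-authors, plus an unrelated one.
uploads = [
    {"title": "Naïve Metadata: A Study", "uploader": "author 1"},
    {"title": "naive metadata - a study ", "uploader": "author 2"},
    {"title": "Something else entirely", "uploader": "author 3"},
]
for group in group_duplicates(uploads):
    print([r["uploader"] for r in group])
```

This sketch will not match an upload that has a DOI against one that lacks it, which is one more reason to record identifiers for every paper in the first place.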

  • The process of adding papers is geared towards enriching Academia.edu content rather than towards promoting the sharing of correct and complete scientific information.

After Academia.edu gets your PDF (and that’s a requirement for adding a paper), there are very few opportunities for providing metadata, and the primary upload interface cares more about fluff like ‘research interests’ than about getting the basic bibliographic metadata right. There is no way to import via DOI or PMID (which would prevent many errors), or even to record these identifiers. This is a fatal lack of concern for interoperability, and quite a surprising one. Essentially, a user interface should make it easy for people to get things right, and hard to get them wrong. The current user interface for adding papers does the exact opposite (see annotated screenshot).

[Annotated screenshot of the interface for adding papers]

  • It is surprisingly (and needlessly) hard to add crucial bibliographical information like journal, DOI and URL.

More details can only be added after importing papers, which simply means most users won’t do it. As far as I can see, the only way to do it is to go back to your publication list, hover over the Edit button, and find other fields to edit. Even here, there appears to be no place for identifiers like DOI or PMID. Page numbers and so on are hidden in an “Other” field. Any user interface designer will tell you that stuff buried this deeply might as well be left out: only a negligible number of users will find and use it.2 Anyway, if you’ve succeeded in adding some of this metadata, congratulations on completing a futile exercise. The information you painstakingly entered is nowhere exposed and so cannot be reused or exported, except, again, manually. Quite remarkable in the age of APIs and interoperability.
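
Importing via DOI is not exotic. The doi.org resolver supports content negotiation: ask for a DOI’s URL with an Accept header of application/vnd.citationstyles.csl+json and you get structured bibliographic metadata back for DOIs registered with agencies such as Crossref or DataCite. Below is a minimal sketch using only the Python standard library. The function names are my own, the DOI pattern follows the one Crossref has suggested for modern DOIs (I am assuming it covers the DOIs you care about), and the DOI in the example is made up, so the sketch builds the request without sending it.

```python
import json
import re
import urllib.parse
import urllib.request

# Pattern for modern DOIs; an assumption, older DOIs may not match it.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip common prefixes and validate; raise ValueError if it is not a DOI."""
    doi = raw.strip()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi, flags=re.IGNORECASE)
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a recognisable DOI: {raw!r}")
    return doi.lower()  # DOIs are case-insensitive


def build_metadata_request(raw_doi: str) -> urllib.request.Request:
    """Build (but do not send) a content-negotiation request for CSL JSON."""
    doi = normalize_doi(raw_doi)
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(
        url, headers={"Accept": "application/vnd.citationstyles.csl+json"}
    )


def fetch_metadata(raw_doi: str) -> dict:
    """Send the request and parse the JSON reply. Needs network access."""
    with urllib.request.urlopen(build_metadata_request(raw_doi), timeout=10) as resp:
        return json.load(resp)


# Made-up DOI: we only build the request here, we do not send it.
request = build_metadata_request("https://doi.org/10.1234/Example.5678")
print(request.full_url)
```

An upload form could call something like fetch_metadata once, prefill title, authors, journal and year, and store the DOI alongside the paper, which is exactly the kind of interface that makes it easy to get things right.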

How Academia.edu nudges us towards narcissism

My conclusion from these points: Academia.edu seems not to care about promoting the curation and sharing of correct and high-quality metadata of scientific publications. One might counter that this is not the goal of the network, and that the content of the papers is what’s most important anyway. But peer-reviewed publications are still the main vehicle for advancing scientific results, and citations are still the main currency of cumulative science. So getting bibliographic metadata right is key to promoting science as a cumulative enterprise. Nor should this be hard in the era of DOIs and PMIDs, making it all the more surprising that Academia.edu doesn’t care about them.

If there really is a problem, why do relatively few people complain and why are so many users seemingly happy about the service? There are several reasons. Not every academic has access to a personal website or an open academic repository, and Academia.edu presents itself as one of the easiest options to make one’s work visible online (never mind the fact that it doesn’t actually make it easily citable, and laces it with ads to boot). It may be a way to keep up with colleagues. Also, I’ve heard people are happy with its “sessions” as a way to get interactive feedback on a paper. But there’s one important reason that I haven’t seen commented upon often: Academia.edu plays to our vanity. Many elements of its design are built to satisfy and amplify our craving for external validation.

Judging from the navigation menu, “analytics” is one of the most important elements of Academia.edu. Upload papers, tag them with research interests, and they generate paper views. Follow people and they’ll follow you back, generating profile views. Tomorrow your paper may be in the top 5%! Next week you might be crowned as the 1%! Look, your paper was just read by someone from Vienna! Your work is being read in 27 countries! You’re being followed by someone you barely know! All those things are nicely presented in spiffy graphs, evidently a part of Academia.edu that a lot of design resources have been devoted to.3

And note some of the design here is cynical. The only two time windows offered are 30 days and 60 days, inviting you to come back at least this often to keep up with the stats (yes, you can download a CSV for more, but once again that is one of those power user features that will rarely be used). Views are promoted over actual downloads while bounce rates (basically, how many people are gone after a quick glance, usually the majority) are concealed. The most important metadata for papers (again, just taking the design as a measure of what Academia.edu promotes as important) is this mostly meaningless view count. Not where it was published, not how to cite it, certainly not where to find it off Academia.edu: just how many people had a look.

Academia.edu doesn’t take academia seriously

Does this mean everybody on Academia.edu is a narcissist? Of course not. My point is not about users; it is about the design of the service. User interface design is not innocent: as a recent Medium essay noted, technology hijacks our minds, constraining our options and nudging us in ways that often elude our awareness. Not everybody on Academia.edu is a narcissist, but many aspects of its design make it easy to become one. (The emails! Don’t get me started about the emails. By default, Academia.edu will send you an email whenever someone stumbles upon your profile or one of your papers. Just look at this Twitter feed to see how creepy people find that feature. You might even spot a few who have come to like it, Stockholm syndrome-style.)

On balance, I feel Academia.edu doesn’t really take us seriously as academics. It takes our work to make a profit (for instance by putting advertisements around it), totally botches the metadata and tries to appease us by offering stats and social rankings that promote constant comparison. And nothing in its design suggests a regard for getting even the most basic bibliographic information about our scientific work right, even though that would be one way to turn page views into citations. This is one of the reasons the only paper I’ve uploaded there for years has been one pointing people to where they can find all my papers freely and without hassle.

To end on a slightly more optimistic note: at least the poor metadata problem can be solved. As far as I can see, nothing in Academia.edu’s business model turns on proliferating poor and incomplete metadata. The citation advantage it likes to claim could significantly increase if it started exposing metadata in ways that are compatible with widely used tools like Google Scholar and Zotero. It still won’t be a service I’m keen on using, but I do hold out hope it will become better at promoting cumulative science rather than cynically playing to our vanity.


  1. The most important criticism is that the advantage doesn’t hold for just Academia.edu (if at all) but for findable and freely available PDFs more generally, including those from OA repositories indexed by Google Scholar.
  2. This is reminiscent of the backward way in which Facebook designs its privacy UIs, but confusingly, there seems to be little reason for Academia.edu to make it so hard to get things right.
  3. Digression: this is also what Frontiers gets absolutely right, which makes it highly attractive to a generation of researchers craving immediate external validation.

3 thoughts on “How Academia.edu promotes poor metadata and plays to our vanity”

  1. Hey Mark,

    It is possible to create papers without uploading a PDF, so that is not a requirement. I’ve done so for most of my publications. And I simply put a citation into the info field, however that will mostly help human readers, not Google.

  2. Ah there’s another clever way in which their design nudges you to give them your PDFs. The only way to add a paper is the big “Upload” button, and the only way to avoid actually uploading is to first click that button, then find the greyed out small type saying “No file to upload?”. As linguists we can see this frames the business of adding papers as being primarily that of “uploading PDFs”. As Trump would say, “they don’t get sarcasm?”.

    And yes, having a citation in the info field means you’ve found a workaround to the limitations which will make Academia’s task of providing clean and correct metadata only more difficult eventually.
