Wednesday, 13 February 2008

Which are worse? PubMed metadata or CrossRef metadata?

Tough call, eh? :-)

Given a DOI, CrossRef's OpenURL won't give you any author except the first author, and won't give you the end page of an article. In other words, it won't give you enough information to create a citation for a paper.

And PubMed? PubMed will give you all the authors, though it truncates them all to two initials. If you want the full journal name, you will have to capitalise it yourself, which isn't trivial to do automatically (e.g. "Journal of the american chemical society"); if you want the journal abbreviation, you will have to insert the periods yourself, which again isn't quite trivial (it's not just a question of sticking a period after everything in sight).
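
To give a flavour of the capitalisation problem, here is a minimal sketch in Python. The list of small words and the function name are my own assumptions, not anything PubMed provides, and the approach is deliberately naive: it will mangle titles containing acronyms or unusual casing, which is exactly why this isn't trivial.

```python
# Hypothetical helper: naive title-casing of a journal name.
# SMALL_WORDS is an assumed, incomplete list of words left in lower case.
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the"}

def capitalise_journal(name):
    """Capitalise each word except small words that are not first."""
    result = []
    for i, word in enumerate(name.split()):
        lower = word.lower()
        if i > 0 and lower in SMALL_WORDS:
            result.append(lower)
        else:
            # Upper-case the first letter only; acronyms are NOT handled.
            result.append(lower[:1].upper() + lower[1:])
    return " ".join(result)

print(capitalise_journal("Journal of the american chemical society"))
```

A real solution would probably need a lookup table of known journal titles rather than rules like these.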

My favourite thing of all about PubMed is that even where a paper is in PubMed and where the paper has a DOI, the DOI mightn't be in PubMed (e.g. doi://10.1016/j.jmb.2003.08.006 and PMID 14499606). Nice. This last one means that to get PubMed metadata relating to that DOI, you need to first look up CrossRef, and then use that metadata (i.e. journal, year, volume and startpage) to look up PubMed.
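
The second half of that two-step lookup might look something like the sketch below, which only builds the search URL and doesn't fetch anything. It assumes the NCBI E-utilities esearch endpoint and the PubMed field tags [ta] (journal abbreviation), [dp] (publication date), [vi] (volume) and [pg] (pagination). The function name is made up, and the volume and page in the example call are placeholders rather than the real values for the paper mentioned above.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; check the NCBI docs).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(journal, year, volume, startpage):
    """Build an esearch URL from the metadata CrossRef returns for a DOI."""
    term = " AND ".join([
        f"{journal}[ta]",    # journal title abbreviation
        f"{year}[dp]",       # date of publication
        f"{volume}[vi]",     # volume
        f"{startpage}[pg]",  # first page
    ])
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term})

# Placeholder values for illustration only.
print(pubmed_search_url("J Mol Biol", 2003, 1, 1))
```

If the search comes back with exactly one PMID, you can then fetch the full PubMed record for it; zero or several hits means the metadata needs a closer look.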

So, which do I think is worse? CrossRef - but only because it doesn't give enough information to create a citation.

9 comments:

Geoffrey Bilder said...

Just to note, we (CrossRef) actually often have the additional author metadata, it just isn't currently returned via the OpenURL interface.

And this is a request that we are getting fairly often, so we are working on changing it. I can't give you an ETA yet, but when it is changed, we'll post to the CrossTech blog.

baoilleach said...

That's great to hear. And don't forget to consider the endpage - most journals also require the endpage in citations. With these in place, a lot of useful tools could be built, e.g. for validating citations in papers - this would benefit both journals and authors.

Chuck Koscher said...

CrossRef's openurl resolver now supports a third non-standard parameter which will return a more verbose XML response.

format=unixref

The help information at www.crossref.org/openurl has also been updated

baoilleach said...

I haven't tested unixref in general, but for the OpenURL example it seems that the endpage is now included. Great!

For the record, it still only has the first author (there should be two), and the journal abbreviation given is identical to the full journal title (I don't know whether this is a bug, or expected behaviour).

Chuck Koscher said...

Okay... now the disclaimer.

This format returns exactly what the publisher deposited with CrossRef. So if they only gave us one author instead of a full listing, it is not a 'bug', it's a bad practice.

Likewise for journal title abbreviations. However, some publishers don't actually advocate one abbreviation over another and as such won't deposit one with CrossRef.

David Bradley said...

I reckon my concept of a PaperID created at the author end using a standard algo and then deposited on a distributed OA/OS system at the time of first creating the paper could circumvent all the problems with DOI, openURL, CrossRef etc.

No doubt it will bring with it its own issues, but I haven't thought of any yet. I just need to find someone with the programming skills and a current insider's view of being a journal author to help make it a reality.

Dave Bradley

baoilleach said...

@chuck: The journal abbreviation isn't really important, but is there any chance you could start hassling publishers to give CrossRef the full author lists? Without that information, the metadata is still somewhat hobbled. There isn't even any way to know whether there is just a single author, or whether the other authors have been omitted.

Sigh...and I was sure that you guys, at least, must have all the data.

Tom.Pasley said...

Hi Noel,
I'm just wondering what you're using to extract the data from PubMed and CrossRef xml?

cheers,

Tom

baoilleach said...

In order of specificity, I use Python, ElementTree and code like that in this file.
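
As a rough illustration of the idea (not the code in that file), here is a small sketch that pulls author names out of a PubMed-style record with ElementTree. The XML fragment is hand-written and heavily simplified, the names in it are invented, and the function name is just for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment in the style of PubMed XML.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <Article>
      <AuthorList>
        <Author><LastName>Smith</LastName><Initials>JA</Initials></Author>
        <Author><LastName>Jones</LastName><Initials>B</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

def extract_authors(xml_text):
    """Return a list of 'LastName Initials' strings from the record."""
    root = ET.fromstring(xml_text)
    authors = []
    for author in root.iter("Author"):
        last = author.findtext("LastName", default="")
        initials = author.findtext("Initials", default="")
        authors.append(f"{last} {initials}".strip())
    return authors

print(extract_authors(SAMPLE))
```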