Friday, 11 January 2008

DOI or DOH? Proposal for a RESTful unique identifier for papers

Update (18/Jan/08): This proposal has been implemented by Andrew Perry.

When DOIs first became widespread for papers, I was a big fan. Just by adding "http://dx.doi.org/" to the start of the DOI I could ensure that a link would always lead the reader to the correct web page of the publisher. As a unique identifier, the DOI could be used to connect disparate resources relating to papers; e.g. comments on papers in blog posts and Table of Contents pages of journals.

But...do we really need DOIs? At least in their current form? Let's consider the following paper (discussed in a previous blog post):
EL Willighagen, NM O'Boyle, H Gopalakrishnan, D Jiao, R Guha, C Steinbeck and DJ Wild. Userscripts for the Life Sciences. BMC Bioinformatics 2007, 8, 487.

What unique identifiers could we use? Well, there's the DOI:
doi://10.1186/1471-2105-8-487
Then there's the PubMed ID:
PMID 18154664
Instead of these, I propose OpenRef:
openref://BMC Bioinformatics/2007/8/487
Spot the difference. Neither the DOI nor the PMID can be derived from the paper itself. Similarly, it's not possible to figure out from the DOI or the PMID what the paper is (without access to the web, at least). Furthermore, the openref is available for all papers published, whether or not the publishers have assigned them a DOI (in association with CrossRef). Needless to say, not all papers are in PubMed and so don't have PMIDs.
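To show how little is needed to derive an openref from the citation itself, here is a minimal Python sketch. The function name make_openref and its arguments are my own illustration, not part of any existing library, and the optional volume anticipates journals that don't use volumes.

```python
def make_openref(journal, year, page, volume=None):
    """Build an openref string from the fields of a printed citation.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    parts = [journal.strip(), str(year)]
    if volume is not None:
        # Journals without volumes simply skip this level.
        parts.append(str(volume))
    # The final term is the start page (or article number).
    parts.append(str(page))
    return "openref://" + "/".join(parts)


print(make_openref("BMC Bioinformatics", 2007, 487, volume=8))
# openref://BMC Bioinformatics/2007/8/487
```

Everything passed to the function can be read straight off the paper, which is the whole point.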

So, is it too late for OpenRef? Certainly not. Any publisher could implement it on their own server with an hour or two's work. Similarly, CrossRef could do it (though it would only work for those papers which have DOIs). Other Web 2.0 sites that manipulate information on publications could use it also; e.g. CiteULike and Connotea.

This would mean that you could instantly access information on a particular paper using a web browser and going to
http://dx.openref.org/BMC Bioinformatics/2007/8/487
or
http://www.biomedcentral/openref/BMC Bioinformatics/2007/8/487
instead of having to know the DOI or search on a publisher's web site.
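One practical wrinkle: journal names contain spaces, which have to be percent-encoded in a real URL (browsers usually do this for you). A small Python sketch of how a client might build such a link follows. The function openref_url is hypothetical, and dx.openref.org is only the example host used above, not a live service that I know of.

```python
from urllib.parse import quote


def openref_url(base, journal, year, volume, page):
    """Return a resolver URL for a citation, percent-encoding each segment.

    Hypothetical helper for illustration; `base` is whatever server
    happens to implement the scheme.
    """
    segments = [journal, str(year), str(volume), str(page)]
    # safe="" so that a "/" inside a segment would also be escaped.
    path = "/".join(quote(segment, safe="") for segment in segments)
    return base.rstrip("/") + "/" + path


print(openref_url("http://dx.openref.org", "BMC Bioinformatics", 2007, 8, 487))
# http://dx.openref.org/BMC%20Bioinformatics/2007/8/487
```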

Notes:
(1) For journals that don't use volumes, the openref would be of the form openref://Journal Name/Year/Page
(2) There are certain parallels for chemists between DOIs vs. openref and CAS numbers vs. InChI.
(3) The term RESTful is used in the sense of "RESTful web services" (an excellent book).

28 comments:

Rich Apodaca said...

I feel your pain, and this is an interesting proposal.

How would openref encode citations to purely electronic journals such as Beilstein Journal of Organic Chemistry with only two numerical biblio elements and no page number?

baoilleach said...

There's no pain involved, just an itch :-)

BMC Bioinformatics is actually a web-only journal, and as you can see, there's no problem. In fact, there *are* three numerical biblio elements for the Beilstein J of O C, also.

Jim said...

Why not just use HTTP URLs?

Web browsers won't resolve either DOIs or OpenRef directly. Connotea needs special screen scraping sauce to work with DOIs.

I'm not convinced that guessing URLs from publisher and date metadata would be more helpful than just Googling the title and authors.

baoilleach said...

@jim: Although the example I gave was for looking up a paper on the web, a unique identifier for papers has several other uses; e.g. for accessing information on a paper through an API, or linking information from distinct resources.

Andrew Perry said...

The idea of OpenRef tickled my fancy ... needs more work, but it's less of a proposal now.

alf said...

OpenURL!

baoilleach said...

@andrew: To repeat, "wow".

@alf: I've just looked at OpenURL again, and haven't been able to figure out exactly what problem it's trying to solve. It does seem to be a similar idea though.

alf said...

OpenURL is designed to solve exactly the problem you're talking about. Basically you pass a set of key/value pairs to an OpenURL resolver, which looks them up and directs you to an appropriate copy of the thing you're requesting.

In the case of a journal article, you'd give it the citation details and it would tell you where to find the full-text of the article (ideally somewhere you have access to it; for example, if you used your university's OpenURL resolver it would direct you to the full-text through your university's proxy server).

Here's an example OpenURL:
http://www.crossref.org/openurl/?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.jtitle=BMC%20Bioinformatics&rft.date=2007&rft.volume=8&rft.spage=487
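For anyone who wants to generate these rather than type them, a minimal Python sketch of assembling the same key/value pairs with the standard library is below. The helper name journal_openurl is made up for illustration; the keys are just the ones in the example URL above, and a real resolver may expect more.

```python
from urllib.parse import quote, urlencode


def journal_openurl(resolver, jtitle, date, volume, spage):
    """Assemble an OpenURL query for a journal article (illustrative only)."""
    pairs = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.jtitle": jtitle,
        "rft.date": str(date),
        "rft.volume": str(volume),
        "rft.spage": str(spage),
    }
    # quote (not quote_plus) gives %20 for spaces; ":" and "/" are left
    # readable to match the example URL above.
    return resolver + "?" + urlencode(pairs, quote_via=quote, safe=":/")


print(journal_openurl("http://www.crossref.org/openurl/",
                      "BMC Bioinformatics", 2007, 8, 487))
```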

alf said...

Actually I now notice your emphasis on an *identifier* rather than just a link. I don't know if you can condense citation details into an identifier without using key/value pairs though, really.

baoilleach said...

Yes - I think you understand me, although it's something of a quibble.

Regarding non-use of key-value pairs, I think I have proven that it is possible. The OpenRef, like a URL, can be considered a hierarchy, with each successive layer narrowing down the possibilities, until the final number (the startpage or article) uniquely identifies the resource.
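To illustrate the hierarchy, here is a rough Python sketch that splits an OpenRef back into its levels by position alone, with no key names. The OpenRef tuple and parse_openref are hypothetical names, and the sketch assumes that journal names never contain a "/".

```python
from typing import NamedTuple, Optional


class OpenRef(NamedTuple):
    journal: str
    year: int
    volume: Optional[str]  # None for journals that don't use volumes
    locator: str           # start page or article number


def parse_openref(ref: str) -> OpenRef:
    """Split an openref:// string into its hierarchical levels."""
    prefix = "openref://"
    if not ref.startswith(prefix):
        raise ValueError("not an OpenRef: " + ref)
    parts = ref[len(prefix):].split("/")
    if len(parts) == 4:
        journal, year, volume, locator = parts
    elif len(parts) == 3:
        journal, year, locator = parts
        volume = None
    else:
        raise ValueError("expected 3 or 4 levels, got %d" % len(parts))
    return OpenRef(journal, int(year), volume, locator)


print(parse_openref("openref://BMC Bioinformatics/2007/8/487"))
```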

alf said...

I guess it depends on what you want to use the identifier for. If it's just for "access[ing] information on a particular paper" then you should use OpenURL, I think.

Mr. Gunn said...

I really like the idea you could just type the URL from a citation, without having to remember keynames or DOIs. There's no point to hashing the citation data, because it'll be unique anyways, and human readable URLs are so much nicer.

As y'all already know, I also like having things distributed, so every publisher/library/service could run its own resolver and not be dependent on CrossRef, no matter how dependable they are at the moment.

alf said...

I couldn't let that go without a response: anyone can run an OpenURL resolver, that's the whole point. The CrossRef example was just one resolver that everyone has access to.

Ross said...

To jump into the fray here, I'm not sure what your proposal solves.

One of the big problems with DOIs is that it takes the user to the publisher copy of the article, which in many cases, is probably not appropriate for the user in question (their institution may provide access to that journal via a third party journal aggregator rather than the publisher). This proposal does nothing to solve that.

I'm also not sure how your hierarchical approach would actually scale to real world serials issue identification scenarios. What about supplemental issues? Issues for Oct/Nov?

OpenURL is obviously not as simple or as elegant as this, but it's already widely deployed along the chain (publishers, databases, libraries, etc.) and is already trying to solve this problem (OpenURL has exposed so many inefficiencies and variations between publications and publishers that I can't say it's "solved it"). It's also easy (although so far basically unexploited) to XSLT from BibTeX, EndNote/RefWorks, Zotero, etc. to OpenURL, making "legacy" citations "compatible". It would be impossible to reverse-engineer OpenRef from bibliographic citations, because you'd need way too much prior knowledge about the publication pattern of the journal/conference/whatever.

baoilleach said...

First of all, can we leave to one side the discussion of whether OpenRef would actually work? Without a specific counterexample, you won't convince me. :-)

Let's say I develop a web application where people can enter comments on papers. The webpage for each paper will be at something like: http://papercomments.org/openref/Nature/2005/21/121.

So, you need a URL, so why not use the OpenRef? To appreciate why this is a good idea, you need to appreciate REST. Connotea have not used OpenRef or even OpenURL. You can find info on a particular paper at a URL like:
http://www.connotea.org/article/81efca2dd741699bd0cf940f6124d16c
Nice, eh?

On the other hand, OpenURL can describe a paper's metadata. This is a different problem. If you want to pass around the metadata on a paper, OpenRef could not of course help you as you say.

alf said...

I actually tried something quite similar for Amazon (with minimal success) a while ago. One big problem is that the intermediate server could disappear at any time, and then all the links break...

Mr. Gunn said...

In some ways, I sympathize with how Connotea/DOI identifies things, because chapter/volume/pagenumber is a bit of an anachronism for online documents, but as long as paper's still around, you've got to have something that translates across, and let's face it, putting bibliographic stuff in a weird encoded form in a span tag is a bit of a hack, isn't it? Shouldn't you be able to hand-write a URL containing all the info you need that will still resolve properly for whoever clicks on it or whichever resolver fulfills the request? If you want scientists, and not librarians, to actually use the identifiers when writing about the paper, you probably should.

This is just from an author's viewpoint, and if there are overriding technical problems with doing this, could someone please explain them to the slow guy?

"One of the big problems with DOIs is that it takes the user to the publisher copy of the article, which in many cases, is probably not appropriate for the user in question (their institution may provide access to that journal via a third party journal aggregator"

Third-party journal aggregators are also a bit of an anachronism, aren't they? I don't know if it'd be a good idea to design for that, but couldn't the resolver direct the browser to whichever URL it's told is the correct one for the requested resource, so if you set your resolver to direct people to the publisher's site, they'd go there, and if you set it to direct people to an aggregator, that's where it would send people?

This means using the resolver provided by your library could send you one place whereas a resolver run by a non-library group might send you to a different place the document can be found (where you may or may not have permissions), but I think that's rather to be expected, at least under the current viewing permissions paradigm we have. Portable identifiers like OpenID would allow the resolver to send you to the document you have viewing rights to based on your identity, instead of location, even if it does seem silly to make someone access the exact same document from Site X instead of Site Y.

My viewpoint in all this isn't from an IT perspective, an archive maintainer's perspective, nor a publisher's perspective, but simply from the perspective of someone who's sat down and tried to write a document referring to academic papers. I guess some of my ideas are wishful thinking, but it seems like a human-readable and writable URL that uniquely identifies a paper isn't unfeasible today. If it is, I'd love to know why. My impression is that human-comprehensibility just hasn't been considered an important feature because no one's actually sat down and tried to write the darn things.

BTW, alf, I know anyone can run an OpenURL resolver, and that's what I like about it, but the KEV/XML markup is what I find cumbersome and painful.

Ross said...

I guess I still am not getting it. I don't really see what's going to resolve the journal name to actual journal (without reinventing DOI or OpenURL).

Also, in the case of Connotea/CiteULike/etc., aren't the bibliographic elements mutable? Can't I change the journal title, issue, etc.? What happens to the openref in that case?

mr. gunn, I think you're referring to COinS, which, in fact, are a hack to place OpenURL Context Objects in HTML. Believe me, most librarians don't know what this is or how to make one, either.

The problem here is that scholarly publishing is a mess, and it's very, very easy to make a simple identifier that works for your needs that simply doesn't scale in the slightest.

It's a shame that initiatives like hCite have gone absolutely nowhere, although I still don't think it would satisfy the desire to make something "simple".

I can't really defend OpenURL. It's an ugly spec. It's hard to look at. It's hard to write. However, you only need to look at the complexity of creating a well-formatted academic bibliography to see why it has to be as complicated as it is.

If there's any ambiguity, it's not an identifier (btw, OpenURL isn't an identifier, although it can contain them).

Andrew Perry said...

I've got to agree with Ross .. it would be hard to make OpenRef globally applicable for referencing every journal ever published, while still keeping it RESTful. The (sometimes strange) variations in journal identifiers are presumably why an OpenURL is a little ugly and hard to guess from a traditional textual citation; it's the price paid for being more widely applicable.

In addition to the Beilstein J. Org. Chem. mentioned by rich apodaca above, there are many other cases where a simple Journal/Year/[Volume/]Page type url cannot easily be applied. One is the Hindawi Open Access journals which use an Article ID and no page numbers ... it also doesn't cover pre-press electronic versions which don't yet have a Volume, Page and sometimes Year assigned.

We could propose ways to make OpenRef cover these cases ... in which case I suspect it would begin to lose its human-predictable, human-readable appeal, and probably begin to resemble OpenURL.

(oh, and sorry if I contributed to confusing matters alf ... since the quick hack OpenRef 'implementation' I wrote is really just a PubMed frontend, it's not really treating OpenRef URLs like true *identifiers*).

baoilleach said...

I fear that I haven't expressed myself strongly enough so here goes:
Every citable journal article ever published can be uniquely identified by an OpenRef.

You're right about the pre-press/in press publications, though. Regarding article number vs. page number, this is a red herring; the final term in the OpenRef is either the article number or start page, whichever is appropriate (journals use either one or the other), just like if you were referencing the paper in a journal that doesn't use endpage numbers.

Ross said...

I still don't entirely understand how you disambiguate between journals that have the same name. If the convention would be to contrive some variation between them, who makes this decision?

baoilleach said...

But there are no journals that share the same name. If there were, our existing method for citing papers wouldn't work.

All I am doing with OpenRef is taking the normal system for citing papers and removing redundant information.

Andrew Perry said...

(comments above deleted since I botched up the URL .. twice)

Ah, I understand now how first page number vs. article ID works.

Funny thing is, for the Hindawi journals PubMed maps the article ID to the [PG] "page number" field already, so an OpenRef URL like:

http://openref.pansapiens.com/openref/Comp Funct Genomics/2007/58721

using the article ID as the final term actually works as expected in my hacked up implementation !

Someone, please find me another counter example to prove why OpenRef is not going to work ... :) !

baoilleach said...

For the record, I note that the OpenRef in the previous comment is not valid, as the full journal name is required. Since the OpenRef is a unique identifier, and since there are several possible abbreviations for one journal, it rules out the use of abbreviations as the canonical form.

Of course, you can configure your server to do whatever you want; for example, you could accept common journal abbreviations such as PNAS and Proc Natl Acad Sci USA (and indeed, Proc. Natl. Acad. Sci. U.S.A.), but an OpenRef server is not required to do this.
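As a rough sketch of what such an optional server-side convenience might look like in Python: a lookup table maps the abbreviations a server chooses to accept onto the canonical full name, and anything else is passed through untouched. The table and the function canonical_journal are purely illustrative, not part of any OpenRef implementation.

```python
PNAS_FULL = ("Proceedings of the National Academy of Sciences "
             "of the United States of America")

# Illustrative table only: a real server would decide for itself which
# abbreviations (if any) it is willing to accept.
ABBREVIATIONS = {
    "pnas": PNAS_FULL,
    "proc natl acad sci usa": PNAS_FULL,
}


def canonical_journal(name):
    """Map an accepted abbreviation to the full journal name.

    Dots are dropped and case and spacing are normalised before the
    lookup; names that are not in the table are returned as given
    (apart from surrounding whitespace).
    """
    key = " ".join(name.replace(".", "").lower().split())
    return ABBREVIATIONS.get(key, name.strip())


print(canonical_journal("Proc. Natl. Acad. Sci. U.S.A."))
```

The canonical OpenRef itself still uses only the full name; the table just saves the user some typing.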

Andrew Perry said...

Good point. The OpenRef above uses the journal abbreviation from PubMed, since my 'OpenRef resolver' is really only a PubMed frontend. Turns out that the full journal name works too ( http://openref.pansapiens.com/openref/Comparative and Functional Genomics/2007/58721 ), but this is not the case for all journals .. eg the full journal name used by PubMed is not always 'guessable' ... for instance the PubMed full name for Current Biology is in fact Current biology : CB. Not sure why this is ... there doesn't seem to be another "Current Biology" such that they might be trying to avoid a name clash. The problem is that even if every journal name is unique (which I doubt), or could be made unique in the database (eg. Biochemistry vs. Biochemistry (Tokyo)), the 'guessability' will be lost in some instances. And if the OpenRef isn't easily guessable, it begins to lose its appeal.

baoilleach said...

Andrew, you're a tough guy to convince :-)

A couple of points:
(1) Let's separate problems with PubMed from problems with OpenRef. Problems with PubMed merit a blog post of their own :-)

(2) You doubt that every journal name is unique. However, the current citation system (where you list references at the end of your paper) relies on the fact that journal names are unique...how else could you find the paper?

[Actually, not quite true, but almost true. It relies on the reader uniquely mapping a journal abbreviation to one particular journal.]

(3) It would be nice to have a specific counterexample that combined non-uniqueness of journal names with confusion over what name to use. In the absence of a counterexample, let's assume that the full journal name is always obvious if you have the paper in your hand. For example, the Journal of Biochemistry. On the website, the abbreviation is given as "J Biochem (Tokyo)" for the most recent paper. However, if you look at the PDF of the paper, it's given as "J. Biochem." In other words, given the PDF, you would assume the full name of the journal is "Journal of Biochemistry", which appears to be correct. It's a good example, as there is some confusion, but there doesn't seem to be any doubt over the name of the journal, just the correct abbreviation to use (another vote against using abbreviations).

Rod Page said...

Robert Cameron addressed the issue of journal article identifiers some years back in two papers that seem to have been almost completely overlooked: Towards Universal Serial Item Names and Scholar-Friendly DOI Suffixes with JACC: Journal Article Citation Convention. I've discussed some of this work on my blog.

One thing that makes DOIs attractive is the infrastructure CrossRef has in place to support them. Using their services I can take a DOI and retrieve metadata about the corresponding publication, or use their OpenURL resolver to see whether a publication has a DOI. They are also building a citation database that adds forward linking to publications. Identifiers by themselves aren't much use, it's the services built upon them that matter.

baoilleach said...

Thanks for the excellent references, Rod. Indeed, I see that I am basically making the same points that Robert Cameron did several years ago. He is more rigorous in his approach, though.

Now that I am not the only one, I don't feel like such a crackpot after all!

I will discuss your other comments in a further blog post.