Monday 3 June 2019

Building a PubChem query

Evan Bolton was visiting recently and I got to watch over his shoulder as he built up a PubChem query on the website step-by-step. I helpfully added comments like "wait a second - how did you bring that up?" and "you can do that?", not to mention "ah - so that's how you do that".

The general idea is that any search you can do, you can combine with other searches with Boolean operators. It's a little bit confusing that the page that you use for combining them is not necessarily the same page you are using for doing the search, but once you're happy with that, it makes sense.

As an example, I am interested in those entries in the peptide dataset I uploaded to PubChem that do not have a biological annotation. This is a little bit complicated in that the biological annotations are a feature of PubChem Compound, but the datasets are all in PubChem Substance, so we are going to have to link from one to the other.

Here is a step-by-step guide which you may find useful to work through:

  1. First of all let's find all molecules in PubChem Compound that have a biological annotation. Go to https://pubchem.ncbi.nlm.nih.gov, click "Browse Data", select classification "PubChem/PubChem Compound TOC". In the tree that appears at the bottom of the page click on the number next to "Biologic Description" to do the search. (May have to reload the page if it gives an error first time.)
  2. Next, let's find all molecules in PubChem Compound that came from my dataset. Go to https://pubchem.ncbi.nlm.nih.gov/, click "Explore Data Sources" towards the bottom right, type "NextMove" into the "Search Sources" box, and then "8,699 Live Substances" to do the search.
  3. The previous search was of PubChem Substance, which is where datasets live. We want to find the corresponding molecules from PubChem Compound, so on the right under "Find related data" select "PubChem Compound", and then "PubChem Same Compound" in the Option box that appears. Click "Find Items" to do the search.
  4. Finally, we need to get the intersection between the search described in 3 and all those not in 1. For this go to https://www.ncbi.nlm.nih.gov/pccompound and click "Advanced". You will see under "History" the searches you have done. In my case, the relevant searches are #1 (step 1 above) and #4 (step 3 above), so I type into the builder box "#4", change "AND" to "NOT" on the next line and enter #1. Finally, I clicked "Search" to do the search. (261 results)