Annotation Sources in ConceptMetab

KEGG Pathways

We considered 85 human metabolic pathways from the summer 2011 freeze of KEGG (1), and parsed the corresponding XML files to directly associate compounds with pathways. Taking only those pathways with 5 or more compounds (a restriction used for all annotation sources), we obtained 74 KEGG Pathways.
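
The parsing and size filter described above can be illustrated with a minimal Python sketch. It assumes KGML-style XML in which compounds appear as entry elements with type="compound". The sample document and the names compounds_in_pathway and filter_by_size are hypothetical; this is not the code used to build ConceptMetab.

```python
import xml.etree.ElementTree as ET

MIN_COMPOUNDS = 5  # size threshold stated in the text

# Hypothetical, heavily trimmed KGML-like document, for illustration only.
SAMPLE_KGML = """
<pathway name="path:hsa00000" title="Toy pathway">
  <entry id="1" name="cpd:C00001" type="compound"/>
  <entry id="2" name="cpd:C00002 cpd:C00003" type="compound"/>
  <entry id="3" name="hsa:1234" type="gene"/>
</pathway>
"""


def compounds_in_pathway(kgml_text):
    """Collect compound IDs from entries whose type is 'compound' (assumed layout)."""
    root = ET.fromstring(kgml_text.strip())
    compounds = set()
    for entry in root.iter("entry"):
        if entry.get("type") == "compound":
            # An entry name may list several space-separated IDs such as 'cpd:C00002'.
            for name in entry.get("name", "").split():
                compounds.add(name.split(":", 1)[-1])
    return compounds


def filter_by_size(concepts, min_size=MIN_COMPOUNDS):
    """Keep only concepts (name -> set of compounds) with at least min_size compounds."""
    return {name: cpds for name, cpds in concepts.items() if len(cpds) >= min_size}
```

The same filter_by_size step would apply unchanged to the compound sets built from the other annotation sources, since the text applies the 5-compound restriction to all of them.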

Gene Ontology

We used version 2.9.0 of the org.Hs.eg.db R package (2) to map Gene Ontology (GO) terms to Entrez Gene IDs, and version 2.9.0 of the GO.db R package (3) to map GO IDs to their names. Next, we used information parsed from KEGG Pathway XML files to map genes to reactions and reactions to compounds. We split the GO terms into their three top-level categories: GO Biological Process (GOBP) with 3,385 compound sets, GO Cellular Component (GOCC) with 300 compound sets, and GO Molecular Function (GOMF) with 839 compound sets. Five GO terms were supersets of all other GO terms in their respective top-level categories (metabolic process and organic substance metabolic process in GOBP, cell and cell part in GOCC, and catalytic activity in GOMF), and we removed them.
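
The chain of mappings (GO term to genes, genes to reactions, reactions to compounds) and the removal of all-encompassing terms can be sketched as follows. This is a minimal Python illustration under stated assumptions: the dictionaries, identifiers, and function names (compose, drop_universal_supersets) are hypothetical, and the superset check is meant to be run separately within each top-level category.

```python
def compose(go_to_genes, gene_to_reactions, reaction_to_compounds):
    """Chain GO term -> genes -> reactions -> compounds into GO term -> compounds."""
    go_to_compounds = {}
    for term, genes in go_to_genes.items():
        compounds = set()
        for gene in genes:
            for reaction in gene_to_reactions.get(gene, ()):
                compounds |= reaction_to_compounds.get(reaction, set())
        go_to_compounds[term] = compounds
    return go_to_compounds


def drop_universal_supersets(concepts):
    """Remove every concept whose compound set contains all other concepts' sets."""
    universal = {
        name
        for name, cpds in concepts.items()
        if all(other <= cpds for other in concepts.values())
    }
    return {name: cpds for name, cpds in concepts.items() if name not in universal}


# Toy identifiers, not real GO, Entrez, or KEGG IDs.
go_to_genes = {"GO:A": {"g1", "g2"}, "GO:B": {"g1"}, "GO:C": {"g2"}}
gene_to_reactions = {"g1": {"R1"}, "g2": {"R2"}}
reaction_to_compounds = {"R1": {"C1", "C2"}, "R2": {"C3"}}

compound_sets = compose(go_to_genes, gene_to_reactions, reaction_to_compounds)
kept = drop_universal_supersets(compound_sets)
```

In this toy example GO:A covers every compound reached by GO:B and GO:C, so it is the only term dropped. Such a term overlaps every other concept in its category and so carries no discriminating information.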

Enzymes

We used the org.Hs.eg.db R package (2) to map enzymes to genes. We then reused the gene-to-reaction and reaction-to-compound mappings from KEGG. There are 176 compound sets relating enzymes to compounds.

Medical Subject Headings

We leveraged the database developed for Metab2MeSH (4), which uses Fisher's exact test to annotate PubChem compounds with concepts defined in MeSH, the National Library of Medicine's controlled vocabulary for biology and medicine, which is used to manually index articles for MEDLINE/PubMed. We considered the following top-level MeSH categories: Anatomy; Diseases; Organisms; Phenomena and Processes; Psychiatry and Psychology; and Technology, Industry, and Agriculture. MeSH has a tree structure, and a MeSH term may appear in more than one branch of the tree. To resolve the ambiguity of which top-level category such a term belongs to, we assigned it to the highest-priority category in which it appears, using this order: Diseases; Phenomena and Processes; Psychiatry and Psychology; Anatomy; Organisms; Technology, Industry, and Agriculture. See the table below for the number of concepts in each top-level MeSH category.
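
The priority rule for terms that fall under several top-level categories can be written as a short Python sketch. The category labels follow the summary table, and the function name assign_category is a hypothetical choice for illustration, not part of Metab2MeSH or ConceptMetab.

```python
# Priority order as stated in the text, highest priority first.
PRIORITY = [
    "Diseases",
    "Phenomena and Processes",
    "Psychiatry and Psychology",
    "Anatomy",
    "Organisms",
    "Technology, Industry, and Agriculture",
]
RANK = {category: rank for rank, category in enumerate(PRIORITY)}


def assign_category(candidate_categories):
    """Return the highest-priority considered category, or None if none applies."""
    eligible = [c for c in candidate_categories if c in RANK]
    if not eligible:
        return None  # term lies only in categories that were not considered
    return min(eligible, key=RANK.__getitem__)
```

For example, a term found under both Anatomy and Diseases would be assigned to Diseases, so each MeSH term contributes to exactly one concept type.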

Table: Summary of concept types, their sizes, mean concept sizes, and background set sizes.

Concept Type                                # Concepts  Mean Concept Size  # Compounds
Enzyme                                             175                 11          874
GO Biological Process                             3712                 56         1220
GO Cellular Component                              346                117         1213
GO Molecular Function                              864                 48         1226
KEGG Pathway                                        74                 42         2427
MeSH Anatomy                                      1506                357        37706
MeSH Diseases                                     4089                182        33074
MeSH Organisms                                    3011                150        48688
MeSH Phenomena and Processes                      1443                404        43016
MeSH Psychiatry and Psychology                     519                180         9188
MeSH Technology, Industry, and Agriculture         330                280        15721

References

  1. Kanehisa,M. et al. (2011) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Research, 40, D109–D114.
  2. Carlson,M. org.Hs.eg.db: Genome wide annotation for Human. R package version 2.9.0.
  3. Carlson,M. GO.db: A set of annotation maps describing the entire Gene Ontology. R package version 2.9.0.
  4. Sartor,M.A. et al. (2012) Metab2MeSH: annotating compounds with medical subject headings. Bioinformatics, 28, 1408–1410.