Annotation Sources in ConceptMetab
KEGG Pathways
We considered 85 human metabolic pathways from the summer 2011 freeze of
KEGG (1) and parsed the corresponding XML files to associate compounds
directly with pathways. Retaining only pathways with 5 or more compounds
(a restriction applied to all annotation sources), we obtained 74 KEGG Pathways.
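The parsing and size filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a KGML-like layout in which compounds appear as `entry` elements with `type="compound"` and a `cpd:`-prefixed `name`, and the sample XML, function names, and `MIN_COMPOUNDS` constant are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical KGML-style snippet; real pathway files hold many more entries.
SAMPLE_XML = """\
<pathway name="path:hsa00010" title="Glycolysis / Gluconeogenesis">
  <entry id="1" name="cpd:C00022" type="compound"/>
  <entry id="2" name="cpd:C00031" type="compound"/>
  <entry id="3" name="hsa:226 hsa:229" type="gene" reaction="rn:R01070"/>
  <entry id="4" name="cpd:C00022" type="compound"/>
</pathway>
"""

MIN_COMPOUNDS = 5  # size threshold stated in the text


def compounds_in_pathway(xml_text):
    """Return (pathway title, set of compound IDs) for one pathway file."""
    root = ET.fromstring(xml_text)
    compounds = set()
    for entry in root.iter("entry"):
        if entry.get("type") == "compound":
            # An entry name may list several space-separated IDs.
            for name in entry.get("name", "").split():
                compounds.add(name.split(":", 1)[-1])
    return root.get("title"), compounds


def filter_by_size(pathways, min_size=MIN_COMPOUNDS):
    """Keep only pathways annotated with at least min_size compounds."""
    return {title: c for title, c in pathways.items() if len(c) >= min_size}


title, cpds = compounds_in_pathway(SAMPLE_XML)
kept = filter_by_size({title: cpds})  # the toy pathway is too small to keep
```

Collecting compounds into a set removes duplicates when the same compound is drawn more than once in a pathway map, so the size filter counts distinct compounds.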
Gene Ontology
We used version 2.9.0 of the org.Hs.eg.db R package (2) to map Gene Ontology
(GO) terms to Entrez Gene IDs, and version 2.9.0 of the GO.db R package (3)
to map GO IDs to their term names. Next, we used information parsed from the KEGG
Pathway XML files to map genes to reactions and reactions to compounds. We
split the GO terms into their three top-level categories: GO Biological
Process (GOBP) having 3,385 compound sets, GO Cellular Component (GOCC) having
300 compound sets, and GO Molecular Function (GOMF) having 839 compound sets.
Five GO terms were supersets of all other GO terms in their respective
top-level categories (metabolic process and organic substance metabolic
process in GOBP, cell and cell part in GOCC, and catalytic activity in
GOMF), so we removed them.
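The chained mapping (GO term to genes, genes to reactions, reactions to compounds) and the removal of all-encompassing terms can be sketched as below. This is a hedged illustration under assumed data structures: the dictionaries, identifiers, and function names are hypothetical stand-ins for the mappings drawn from org.Hs.eg.db and the KEGG XML files.

```python
def compose(go_to_genes, gene_to_reactions, reaction_to_compounds):
    """Chain GO term -> genes -> reactions -> compounds into compound sets."""
    go_to_compounds = {}
    for term, genes in go_to_genes.items():
        compounds = set()
        for gene in genes:
            for reaction in gene_to_reactions.get(gene, ()):
                compounds.update(reaction_to_compounds.get(reaction, ()))
        if compounds:  # terms that reach no compound yield no compound set
            go_to_compounds[term] = compounds
    return go_to_compounds


def drop_universal_supersets(concepts):
    """Remove terms whose compound set contains every other term's set."""
    if len(concepts) < 2:
        return dict(concepts)
    universal = {
        term
        for term, cpds in concepts.items()
        if all(cpds >= other for name, other in concepts.items() if name != term)
    }
    return {t: c for t, c in concepts.items() if t not in universal}


# Toy mappings with made-up identifiers.
go_to_genes = {"GO:A": ["g1", "g2"], "GO:B": ["g1"], "GO:C": ["g2"], "GO:D": ["g3"]}
gene_to_reactions = {"g1": ["r1"], "g2": ["r2"]}
reaction_to_compounds = {"r1": {"c1", "c2"}, "r2": {"c3"}}

concepts = compose(go_to_genes, gene_to_reactions, reaction_to_compounds)
filtered = drop_universal_supersets(concepts)
```

Because the removed terms are described as supersets within their respective top-level categories, a filter like `drop_universal_supersets` would be applied separately to the GOBP, GOCC, and GOMF compound sets.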
Enzymes
We used the org.Hs.eg.db R package (2) to map enzymes to genes. We then reused the gene-to-reaction-to-compound mapping derived from the KEGG Pathway XML files. This yielded 176 compound sets relating enzymes to compounds.
Medical Subject Headings
We leveraged the database developed for Metab2MeSH (4), which uses Fisher's Exact Test to annotate PubChem compounds to concepts defined in MeSH, the National Library of Medicine's controlled vocabulary for biology and medicine used to manually index articles for MEDLINE/PubMed. We elected to consider the top-level MeSH categories for Anatomy, Diseases, Organisms, Phenomena and Processes, Psychiatry and Psychology, and Technology, Industry, and Agriculture. MeSH has a tree structure, and a MeSH term may appear in more than one branch of the tree. To resolve the ambiguity of which top-level category such a term belongs to, we assigned each term to the first category in which it appears, in the following order of priority: Diseases; Phenomena and Processes; Psychiatry and Psychology; Anatomy; Organisms; Technology, Industry, and Agriculture. See the table below for the number of concepts in each top-level MeSH category.
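The priority rule for terms that fall under several top-level categories can be sketched as follows. This is an illustrative sketch only: the function name and the example inputs are hypothetical, and the category labels follow the spelling used in the summary table.

```python
# Priority order stated in the text, highest priority first.
PRIORITY = [
    "Diseases",
    "Phenomena and Processes",
    "Psychiatry and Psychology",
    "Anatomy",
    "Organisms",
    "Technology, Industry, and Agriculture",
]
RANK = {category: i for i, category in enumerate(PRIORITY)}


def assign_category(candidate_categories):
    """Return the highest-priority considered category, or None if none apply."""
    considered = [c for c in candidate_categories if c in RANK]
    return min(considered, key=RANK.get) if considered else None


# A term found under both Anatomy and Diseases is assigned to Diseases.
example = assign_category(["Anatomy", "Diseases"])
```

Categories outside the six considered ones are ignored here, so a term that appears only in other branches of the tree receives no assignment.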
Table: Summary of concept types, their sizes, mean concept sizes, and background set sizes.
| Concept Type | # Concepts | Mean Concept Size | # Compounds |
|---|---|---|---|
| Enzyme | 175 | 11 | 874 |
| GO Biological Process | 3712 | 56 | 1220 |
| GO Cellular Component | 346 | 117 | 1213 |
| GO Molecular Function | 864 | 48 | 1226 |
| KEGG Pathway | 74 | 42 | 2427 |
| MeSH Anatomy | 1506 | 357 | 37706 |
| MeSH Diseases | 4089 | 182 | 33074 |
| MeSH Organisms | 3011 | 150 | 48688 |
| MeSH Phenomena and Processes | 1443 | 404 | 43016 |
| MeSH Psychiatry and Psychology | 519 | 180 | 9188 |
| MeSH Technology, Industry, and Agriculture | 330 | 280 | 15721 |
References
1. Kanehisa,M. et al. (2011) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Research, 40, D109–D114.
2. Carlson,M. org.Hs.eg.db: Genome wide annotation for Human. R package version 2.9.0.
3. Carlson,M. GO.db: A set of annotation maps describing the entire Gene Ontology. R package version 2.9.0.
4. Sartor,M.A. et al. (2012) Metab2MeSH: annotating compounds with medical subject headings. Bioinformatics, 28, 1408–1410.