Seeq Entities

The Seeq API’s knowledge graph is built on a set of core entities, which are the nodes of the graph. This document describes each core entity of Seeq, how it is identified, and how it maps to external identification schemes.

Genes

Seeq includes more than 60,000 human genes, of which more than 20,000 are protein-coding. The rest are non-protein-coding genes such as pseudogenes, rRNA, and miRNA.

Seeq internally identifies genes by their Entrez ID, and additionally recognizes the corresponding HGNC and Ensembl identifiers as well as canonical and synonymous gene symbols. For example, the Entrez ID (and therefore the Seeq id) of IDH1 is 3417; its other identifiers are HGNC:5382 and ENSG00000138413.

Since Entrez genes are a superset of both HGNC and Ensembl, not all Seeq genes necessarily have an HGNC or Ensembl ID.

$ curl https://api.seeq.bio/genes/3417/info
{
  "gene": {
    "entrez_id": "3417",
    "name": "isocitrate dehydrogenase (NADP(+)) 1",
    "canonical_symbol": "IDH1",
    "hgnc_id": "HGNC:5382",
    "ensembl_id": "ENSG00000138413",
    "biotype": "protein_coding"
  },
  "genome_region": {
    "chromosome": "2",
    "start": 208236227,
    "end": 208266074,
    "assembly": "hg38"
  },
  "synonyms": [
    "HEL-216",
    "IDH",
    "IDCD",
    "IDP",
    "PICD",
    "IDPC",
    "HEL-S-26"
  ]
}

Gene Symbols and Synonyms

Even though gene ids are ideal machine-consumable identifiers, you and your colleagues probably refer to genes by their symbols. The trouble with gene symbols is that they evolve continuously and don’t unambiguously specify a gene. For example, the same symbol might refer to two distinct genes, or an old symbol may be replaced with a new one while the old symbol remains in wide use.

For your convenience, Seeq recognizes all current and previous gene symbols and their relationships. This includes canonical, previous, and alias symbols in HGNC as well as gene symbols and synonyms in Entrez. However, all internal manifestations of a gene in the knowledge graph are coded with the unique identifier of the gene.

For example, Seeq gene 2064 is what is commonly known as HER2, and its canonical symbol is ERBB2.
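As a rough illustration of why symbols are resolved to ids before anything else happens, the following sketch builds a small symbol index from records shaped like the gene info response shown earlier. The function names (build_symbol_index, resolve_symbol) and the case-insensitive matching are assumptions made for this example, not part of the Seeq API, and the ERBB2 record lists only the one alias mentioned in the text.

```python
from collections import defaultdict

# Records shaped like the /genes/<id>/info response shown earlier.
# The ERBB2 synonym list is deliberately limited to "HER2", the only
# alias mentioned in this document.
GENES = [
    {"entrez_id": "3417", "canonical_symbol": "IDH1",
     "synonyms": ["HEL-216", "IDH", "IDCD", "IDP", "PICD", "IDPC", "HEL-S-26"]},
    {"entrez_id": "2064", "canonical_symbol": "ERBB2", "synonyms": ["HER2"]},
]


def build_symbol_index(genes):
    """Map each upper-cased symbol to the set of Entrez IDs it may refer to."""
    index = defaultdict(set)
    for gene in genes:
        for symbol in [gene["canonical_symbol"], *gene["synonyms"]]:
            index[symbol.upper()].add(gene["entrez_id"])
    return index


def resolve_symbol(index, symbol):
    """Return the sorted Entrez IDs for a symbol.

    An empty list means the symbol is unknown; more than one ID means
    the symbol is ambiguous and the caller has to pick a gene.
    """
    return sorted(index.get(symbol.strip().upper(), ()))


INDEX = build_symbol_index(GENES)
her2_ids = resolve_symbol(INDEX, "HER2")    # alias of ERBB2
erbb2_ids = resolve_symbol(INDEX, "erbb2")  # canonical symbol, any case
```

Returning a list rather than a single id keeps the ambiguity visible: a symbol shared by two genes yields two ids instead of silently picking one.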

mRNA Transcripts and Proteins

Seeq recognizes both Ensembl and RefSeq transcripts of each gene, including a complete gene model for each transcript. For protein-coding genes, Seeq also recognizes Ensembl protein identifiers and Pfam functional domains.

Seeq relies on the cDNA sequence (joined exons) and the coding sequence (CDS, protein-coding genes only) of each gene, as provided by Ensembl, to calculate the molecular consequence of user-provided variants.

Variants

Unlike genes, diseases, and drugs, variants of the human genome are not a pre-determined, enumerable list of entities. Variants can be specified in a number of different ways with different semantics. Seeq recognizes all these identification systems and can convert variants from one system to another, even for those variants it has not encountered before.

Genomic Coordinates

The genomic specification of a variant is a “CPRA” tuple (chrom, pos, ref, alt) specifying, for example, that a variant is located at position 208,248,389 of chromosome 2, and modifies the reference allele from a G to the alternative allele A. This is how variants are encountered when analyzing raw sequencing data, in the VCF format.

This is the unambiguous way to describe variants, and it is the form Seeq uses in URL parameters. In Seeq, a variant described by its CPRA is called a GVariant: a variant specified at the genomic level.

$ curl https://api.seeq.bio/variants/g.2:208248389:G:A/
{
  "cpra": "g.2:208248389:G:A",
  "entrez_id": "3417",
  "gene_symbol": "IDH1",
  "gvariant": {
    "chrom": "2",
    "chrom_ref": "G",
    "chrom_alt": "A",
    "chrom_pos": [ 208248389, 208248389 ]
  },
  "nt_change": "c.394C>T",
  "aa_change": "R132C",
  "eal_cds_pos": [ 394, 394 ],
  "aa_pos": [ 132, 132 ],
  "mol_conseqs": [ "coding_sequence_variant", "missense_variant" ]
}
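The CPRA string used in the URL above can be taken apart and reassembled with a few lines of code. The sketch below is a minimal example, assuming the "g.<chrom>:<pos>:<ref>:<alt>" layout inferred from the single example in this document; the accepted chromosome names, the restriction to explicit A/C/G/T alleles, and the helper names are assumptions for illustration rather than the documented Seeq grammar.

```python
import re
from typing import NamedTuple


class GVariant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Pattern inferred from the example "g.2:208248389:G:A" (an assumption,
# not an official grammar). Alleles must be spelled out as A/C/G/T bases.
_CPRA = re.compile(r"^g\.([0-9]{1,2}|X|Y|MT?):([0-9]+):([ACGT]+):([ACGT]+)$")


def parse_cpra(text):
    """Parse a CPRA string into a GVariant, or raise ValueError."""
    match = _CPRA.match(text.strip())
    if match is None:
        raise ValueError(f"not a CPRA string: {text!r}")
    chrom, pos, ref, alt = match.groups()
    return GVariant(chrom, int(pos), ref, alt)


def format_cpra(variant):
    """Render a GVariant back into its CPRA string."""
    return f"g.{variant.chrom}:{variant.pos}:{variant.ref}:{variant.alt}"


def variant_url(variant, base="https://api.seeq.bio"):
    """Build the variant URL in the shape used by the curl example."""
    return f"{base}/variants/{format_cpra(variant)}/"


idh1_variant = parse_cpra("g.2:208248389:G:A")
idh1_url = variant_url(idh1_variant)
```

Parsing into a typed tuple first, and formatting from it, means a malformed string fails early instead of producing a request for a variant that cannot exist.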

Protein Coordinates

Even though CPRA is the unambiguous way to describe variants, it is hard for humans to read, which has led to widespread adoption of protein-level specification of variants through the HGVS nomenclature. For example, the CPRA described above is commonly known as IDH1 R132C, meaning that the molecular consequence of this GVariant is to change the 132nd residue of the protein product of the IDH1 gene from an arginine (R) to a cysteine (C). In Seeq, protein specifications of variants are called PVariants.

Note

A common source of ambiguity in genomic analyses is alternative splicing and the different ways in which a single variant can affect different transcripts of the same gene. In converting a GVariant to its corresponding PVariant, Seeq uses the Ensembl canonical transcript of the gene.

This dual identification system is necessary because the correspondence between genomic and protein-level specifications is not one-to-one: distinct GVariants may have the same molecular consequence at the protein level, and hence map to identical PVariants.

Seeq is able to convert arbitrary GVariants to their corresponding PVariants, and likewise recognize different GVariants that correspond to the same PVariant.
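To make the many-to-one relationship concrete, the following sketch works through the IDH1 example by hand. It is not Seeq's implementation. It assumes that IDH1 lies on the minus strand (consistent with genomic G>A appearing as coding c.394C>T above) and that the reference codon for residue 132 is CGT; the first base follows from c.394C>T, while the other two bases are an assumption not stated in this document. The codon table contains only the codons needed here, and the second, multi-base change is hypothetical.

```python
# Partial standard genetic code: only the codons used in this example.
CODON_TO_AA = {"CGT": "R", "TGT": "C", "TGC": "C", "CAT": "H"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_coding_strand(allele, minus_strand):
    """Reverse-complement a genomic allele when the gene is on the minus strand."""
    if not minus_strand:
        return allele
    return "".join(COMPLEMENT[base] for base in reversed(allele))


def cds_pos_to_codon(cds_pos):
    """Return (1-based residue number, 0-based offset inside the codon)."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3


def apply_substitution(codon, offset, alt_bases):
    """Replace len(alt_bases) bases of the codon starting at offset."""
    return codon[:offset] + alt_bases + codon[offset + len(alt_bases):]


def protein_change(ref_codon, alt_codon, residue):
    """Format a protein-level change such as R132C."""
    return f"{CODON_TO_AA[ref_codon]}{residue}{CODON_TO_AA[alt_codon]}"


REF_CODON = "CGT"  # assumed reference codon for IDH1 residue 132
residue, offset = cds_pos_to_codon(394)  # residue 132, first base of the codon

# Genomic G>A on the minus strand is C>T on the coding strand.
coding_alt = to_coding_strand("A", minus_strand=True)
single_base = protein_change(
    REF_CODON, apply_substitution(REF_CODON, offset, coding_alt), residue)

# A hypothetical multi-base change of the whole codon (CGT>TGC).
multi_base = protein_change(
    REF_CODON, apply_substitution(REF_CODON, 0, "TGC"), residue)

same_pvariant = single_base == multi_base  # both are R132C
```

Because TGT and TGC both encode cysteine, the two different nucleotide changes collapse to the same protein-level description, which is why a GVariant identifies a change more precisely than a PVariant does.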

Note

When a GVariant lands on a genomic region that belongs to two distinct genes on opposite strands of a chromosome, its protein consequence is ambiguous and depends on the specific gene in question. In such cases, you need to specify the gene id to disambiguate between the two distinct PVariants.

Database Identifiers

Variants are also commonly referred to by their identifiers in large databases of known variants, like dbSNP or ClinVar. Naturally, this is limited to variants that have been previously observed, characterized, and registered in these databases.

For example, the variant with CPRA (2, 208248389, G, A), at the protein level IDH1 R132C, is variant rs121913499 in dbSNP and variation 375891 in ClinVar.

Seeq recognizes these databases and cross-references known and novel variants against them. Even for a novel variant, Seeq is able to identify known genomic variants that have the same protein consequence as the observed, novel variant.

Diseases

Seeq recognizes over 200,000 disease concepts from multiple disease ontologies: MedGen, Mondo, HPO, and Disease Ontology. The primary identifier of diseases within Seeq is MedGen id. This is motivated by the high coverage and widespread inter-operability of MedGen ids thanks to their roots in UMLS.

$ curl https://api.seeq.bio/diseases/C0023470/info/
{
  "disease": { "cui": "C0023470", "name": "Myeloid leukemia" },
  "xrefs": {
    "mondo": [ "MONDO:0004643" ],
    "hpo": [ "HP:0012324" ],
    "disease_ontology": [ "DOID:8692" ]
  },
  "subtypes": [
    { "cui": "C0023467", "name": "Acute myeloid leukemia" },
    { "cui": "C0023476", "name": "Philadelphia-positive myelogenous leukemia" },
    { "cui": "C1292771", "name": "Chronic Myelogenous Leukemia, BCR-ABL1 Positive" },
    { "cui": "C1292772", "name": "Atypical chronic myeloid leukemia" }
  ],
  "supertypes": [
    { "cui": "C2703042", "name": "Bone marrow cancer" },
    { "cui": "C2939461", "name": "Myeloid neoplasm" },
    { "cui": "C0023418", "name": "Leukemia" }
  ],
  "summary": {
    "source": "NCI",
    "text": "A clonal proliferation of myeloid cells and their precursors in the bone marrow ..."
  }
}

MedGen and UMLS

UMLS is a 30+ year long project, led by the NLM, with the stated goal to “harmonize vocabularies for the purpose of computer system interoperability” specifically in the context of EMR and medical research. As such, UMLS is ubiquitously used.

UMLS is a metathesaurus: it has grown to include hundreds of vocabularies. For example, each HGNC gene, each HPO phenotype, and each drug in DrugBank gets its own concept identifier (CUI) in UMLS. The scope of UMLS is also extremely broad. It covers everything from entities of interest in Seeq to medical billing, visual features like “round face”, and behaviors like “gambling”. As of 2021, the English-language UMLS contains over a million concept ids.

MedGen is a project by the NCBI with a specific focus on conditions, phenotypes, and medical genetics, hence the name. MedGen rarely introduces new identifiers and delegates to UMLS CUIs for the most part. This makes MedGen ids an ideal subset of UMLS for our purposes.

Disease Hierarchy Graph

In all ontologies, diseases form a hierarchy through their is-a relationships. For example, acute myeloid leukemia (AML) is a subtype of myeloid leukemia which itself is a subtype of leukemia.

Through these relationships, diseases form a directed acyclic graph (DAG), which Seeq incorporates in its graph search algorithms. The disease hierarchy matters as you explore the knowledge graph. For example, when you ask the Seeq API for genes associated with leukemia, the output includes genes associated with any hierarchical descendant of leukemia (e.g., genes associated with AML).
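The expansion over descendants can be sketched as a breadth-first traversal of the is-a edges. This is a minimal illustration rather than Seeq's search algorithm: the edges are the ones visible in the Myeloid leukemia response shown earlier, the function names are made up for this example, and the gene associations passed in the usage lines are placeholders, not real data.

```python
from collections import deque

# is-a edges (parent CUI -> child CUIs) taken from the Myeloid leukemia
# response shown earlier in this document.
CHILDREN = {
    "C0023418": ["C0023470"],  # Leukemia -> Myeloid leukemia
    "C0023470": ["C0023467", "C0023476", "C1292771", "C1292772"],
}


def descendants(children, root):
    """Return the root plus every concept reachable through is-a edges."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            # In a DAG a concept can have several parents, so skip
            # anything that was already reached by another path.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def genes_for_disease(children, associations, root):
    """Union the gene sets of the root and all of its descendants."""
    genes = set()
    for cui in descendants(children, root):
        genes |= associations.get(cui, set())
    return genes


# Placeholder associations, for illustration only.
EXAMPLE_ASSOCIATIONS = {"C0023467": {"GENE_A"}, "C0023418": {"GENE_B"}}
leukemia_genes = genes_for_disease(CHILDREN, EXAMPLE_ASSOCIATIONS, "C0023418")
```

Querying Leukemia (C0023418) here picks up the placeholder gene attached to acute myeloid leukemia (C0023467) two levels down, which mirrors the behaviour described above.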

Drugs

Seeq identifies drugs by their ChEMBL ids. For each drug, Seeq includes a number of human-readable names from various sources. Furthermore, Seeq annotates each drug with its cross-references in the FDA Orange Book and Purple Book.

$ curl https://api.seeq.bio/drugs/CHEMBL4535757/info
{
  "drug": {
    "chembl_id": "CHEMBL4535757",
    "name": "Sotorasib",
    "synonyms": [],
    "fda_approved": true
  },
  "xrefs": {
    "drugbank": [],
    "fda": [
      {
        "appl_no": "214665",
        "product_no": "001",
        "trade_name": "Lumakras",
        "applicant": "Amgen Inc",
        "dosage_form": "tablet",
        "strength": "120MG",
        "route": "oral",
        "type": "orange",
        "approval_date": "May 28, 2021"
      }
    ]
  }
}
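A response like the one above can be summarized by grouping its FDA cross-references on the "type" field. The sketch below uses a trimmed copy of that response. The function name is an assumption for this example, and the only "type" value visible in this document is "orange", so the code groups on whatever values are present instead of assuming a fixed set.

```python
from collections import defaultdict

# Trimmed copy of the /drugs/CHEMBL4535757/info response shown above.
RESPONSE = {
    "drug": {"chembl_id": "CHEMBL4535757", "name": "Sotorasib"},
    "xrefs": {
        "drugbank": [],
        "fda": [
            {"appl_no": "214665", "product_no": "001",
             "trade_name": "Lumakras", "type": "orange"},
        ],
    },
}


def trade_names_by_book(response):
    """Group FDA trade names by the value of each entry's "type" field."""
    grouped = defaultdict(list)
    for entry in response.get("xrefs", {}).get("fda", []):
        grouped[entry["type"]].append(entry["trade_name"])
    return dict(grouped)


books = trade_names_by_book(RESPONSE)  # {'orange': ['Lumakras']}
```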

Warning

Seeq drugs correspond to ChEMBL drugs (not ChEMBL molecules) and FDA product ingredients (not FDA products).