Introduction

This site contains:

A guide to using GA4GH Phenotools
API documentation
An explanation of the HPOA output format
A tutorial for the demo app
A description of the legacy Excel format used in pyphetools (we are updating this dataset to use the new JSON format for storing data about a cohort)

Getting Started

Add the crate to your project:

[dependencies]
ga4ghphetools = { git = 'https://github.com/P2GX/ga4ghphetools.git', tag = '0.4.136' }

Note that while ga4ghphetools is still in development, it is available from GitHub; go to the Releases page, choose the latest version, and use its tag in the dependency line above.

API documentation

👉 API Documentation

phenotype.hpoa output

The phenotype.hpoa file is the core download of disease annotation data for the HPO project. Internally, the HPO project uses one so-called "small file" for each disease; information from these files is processed to make the phenotype.hpoa file, which is offered for download. The small files are not offered for download at this time.

GA4GH Phenotools has functionality to export a cohort of phenopackets to the small file format. For convenience, we explain this format here and describe the assumptions the code makes to generate the files.

Format

#diseaseID	diseaseName	phenotypeID	phenotypeName	frequency	publication	evidence	biocuration
OMIM:605275	Noonan syndrome 2	HP:0011636	Abnormal coronary artery origin	1/6	PMID:30368668	PCS	HPO:probinson[2021-05-21]
OMIM:605275	Noonan syndrome 2	HP:0001928	Abnormality of coagulation	0/4	PMID:30368668	PCS	ORCID:0000-0002-0736-9199[2024-10-02]
OMIM:605275	Noonan syndrome 2	HP:0000766	Abnormal sternum morphology		PMID:29469822	PCS	HPO:skoehler[2019-04-18]; HPO:probinson[2021-05-21]
OMIM:605275	Noonan syndrome 2	HP:0006721	Acute lymphoblastic leukemia	3/20	PMID:29469822	PCS	ORCID:0000-0002-0736-9199[2024-04-01]
OMIM:605275	Noonan syndrome 2	HP:0030674	Antenatal onset	11/18	PMID:29469822	PCS	ORCID:0000-0002-0736-9199[2024-04-01]
OMIM:605275	Noonan syndrome 2	HP:0001134	Anterior polar cataract	1/20	PMID:29469822	PCS	ORCID:0000-0002-0736-9199[2024-04-01]
OMIM:605275	Noonan syndrome 2	HP:0000463	Anteverted nares	3/20	PMID:29469822	PCS	ORCID:0000-0002-0736-9199[2024-04-01]
OMIM:605275	Noonan syndrome 2	HP:0002804	Arthrogryposis multiplex congenita	2/20	PMID:29469822	PCS	HPO:skoehler[2019-04-18]; HPO:probinson[2021-05-21]
OMIM:605275	Noonan syndrome 2	HP:0001631	Atrial septal defect	2/3	PMID:30368668	PCS	HPO:probinson[2021-05-21]

Note that each line of this file specifies the frequency of one HPO term in the cohort of individuals described in the publication identified by the PMID. It is thus possible to have multiple lines for the same HPO term with data from different PMIDs.
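As a rough illustration of the format above, the following sketch splits one data line into its eight tab-separated fields and parses the frequency cell. The names HpoaRow, parse_frequency, and parse_row are hypothetical and are not part of the ga4ghphetools API; the sketch only handles the "n/m" form (or an empty cell) seen in the example, and it assumes the caller skips the header line that starts with "#".

```rust
/// One data row of the tab-separated small-file format (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct HpoaRow {
    pub disease_id: String,
    pub disease_name: String,
    pub phenotype_id: String,
    pub phenotype_name: String,
    /// (numerator, denominator), or None if the frequency cell is empty.
    pub frequency: Option<(u32, u32)>,
    pub publication: String,
    pub evidence: String,
    pub biocuration: String,
}

/// Parse a frequency cell such as "1/6"; an empty cell yields Ok(None).
pub fn parse_frequency(cell: &str) -> Result<Option<(u32, u32)>, String> {
    let cell = cell.trim();
    if cell.is_empty() {
        return Ok(None);
    }
    let (num_str, den_str) = cell
        .split_once('/')
        .ok_or_else(|| format!("expected 'n/m' but got '{cell}'"))?;
    let num = num_str
        .parse::<u32>()
        .map_err(|e| format!("bad numerator '{num_str}': {e}"))?;
    let den = den_str
        .parse::<u32>()
        .map_err(|e| format!("bad denominator '{den_str}': {e}"))?;
    // Assumption: a cohort frequency cannot exceed 1 and needs a non-zero denominator.
    if den == 0 || num > den {
        return Err(format!("implausible frequency '{cell}'"));
    }
    Ok(Some((num, den)))
}

/// Split one data line on tabs and build a row; header lines must be skipped by the caller.
pub fn parse_row(line: &str) -> Result<HpoaRow, String> {
    let fields: Vec<&str> = line.split('\t').collect();
    if fields.len() != 8 {
        return Err(format!(
            "expected 8 tab-separated fields but got {}",
            fields.len()
        ));
    }
    Ok(HpoaRow {
        disease_id: fields[0].to_string(),
        disease_name: fields[1].to_string(),
        phenotype_id: fields[2].to_string(),
        phenotype_name: fields[3].to_string(),
        frequency: parse_frequency(fields[4])?,
        publication: fields[5].to_string(),
        evidence: fields[6].to_string(),
        biocuration: fields[7].to_string(),
    })
}
```

Keeping the frequency as a numerator/denominator pair rather than a float preserves the cohort size, which matters when lines for the same HPO term from different PMIDs are compared.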

Legacy template

The pyphetools project was our first attempt to curate GA4GH Phenopackets at scale (2021-2024). For this, we designed an Excel template for curation. GA4GH Phenotools contains code that enables us to import data from these Excel files and then serialize the contents to the new JSON format. We plan to update the entire phenopacket store to use the new JSON serialization format and delete the legacy Excel files. At that point, we will also remove the legacy-related code from the current library. We present information about the structure of the Excel files here in order to help with this process.

A format for cohort descriptions in Excel

The schema of the template consists of two header rows that specify the nature of the data. There is a fixed set of columns that capture basic demographic data together with the disease, the source publication, and the variants. The second half of the template should be used to record information about HPO terms curated from the publications.

Note that the format specifier "CURIE" stands for "compact uniform resource identifier": the entry should consist of a prefix (e.g., PMID), a colon, and the identifier (e.g., 3021034), which together give "PMID:3021034". "str" refers to arbitrary text (a string). "optional" means the cell can be left empty.
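The CURIE convention described above can be checked with a few lines of code. This is a minimal sketch under the assumption that a valid cell is simply prefix, colon, identifier with no whitespace; the function name split_curie is hypothetical, and the library's own validation may be stricter (for example, by restricting the allowed prefixes).

```rust
/// Split a CURIE such as "PMID:3021034" into (prefix, identifier).
/// Returns None if either part is missing or contains whitespace.
pub fn split_curie(cell: &str) -> Option<(&str, &str)> {
    // Split at the first colon; anything without a colon is not a CURIE.
    let (prefix, id) = cell.trim().split_once(':')?;
    let ok = !prefix.is_empty()
        && !id.is_empty()
        && !prefix.contains(char::is_whitespace)
        && !id.contains(char::is_whitespace);
    ok.then_some((prefix, id))
}
```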

Fixed columns

The first (leftmost) 11 or 12 columns specify basic demographic data together with the disease, the source publication, and the variants. The first two rows are used to specify the datatypes and should not be changed. The following tables show the first two rows together with one example row with data extracted from a publication (the columns are split across several tables for better legibility).

PMID	title	individual_id	Comment	disease_id	disease_label
CURIE	str	str	optional	CURIE	str
PMID:33087723	Early-onset autoimmunity associated with SOCS1 haploinsufficiency	A1		OMIM:603597	Autoinflammatory syndrome, familial, with or without immunodeficiency

PMID (CURIE): The PubMed identifier of the publication being curated.
title (str): The title of the publication being curated.
individual_id (str): The identifier of the individual being described in the original publication. This field is required. Please add ‘individual’ if the original article does not provide an identifier (if needed, individual 1, individual 2,...).
comment (str): This field is provided to record additional information that will not be used for creating phenopackets but may be helpful for future reference. It can be left empty.
disease_id (CURIE): The disease identifier (e.g., OMIM:154700 or MONDO:0007947).
disease_label (str): The name of the disease (e.g., Marfan syndrome).

HGNC_id	gene_symbol	transcript	allele_1	allele_2	variant.comment
CURIE	str	str	str	str	str
HGNC:19383	SOCS1	NM_003745.2	c.368C>G	na	p.P123R

HGNC_id (CURIE): The gene identifier assigned by the HUGO Gene Nomenclature Committee (HGNC).
gene_symbol (str): Gene symbol, e.g., SOCS1.
transcript (str): The identifier of the transcript. NCBI RefSeq or ENSEMBL identifiers are preferred.
allele_1 (str): A string representing the first pathogenic allele (variant) according to HGVS nomenclature.
allele_2 (str): This field should not be used for monoallelic diseases (e.g., autosomal dominant, XLR). The column can either be omitted or filled with "na" to denote "not applicable". If the column is present and a cell is left empty, this will be flagged as an error. For biallelic diseases (autosomal recessive), specify the second allele (which will be the same as the first for homozygous genotypes).
variant.comment (str): This field is provided to record additional information that will not be used for creating phenopackets but may be helpful for future reference.
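The rule for allele_2 described above (column omitted or "na" is acceptable, an empty cell in a present column is an error) can be sketched as follows. The enum Allele2 and the function check_allele_2 are hypothetical names used for illustration only; the sketch does not validate the HGVS string itself.

```rust
/// Interpretation of an allele_2 cell (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Allele2 {
    /// Column omitted or cell set to "na" (monoallelic disease).
    NotApplicable,
    /// An HGVS string for the second allele (not validated here).
    Hgvs(String),
}

/// `cell` is None when the allele_2 column is absent from the template,
/// and Some(text) when the column is present.
pub fn check_allele_2(cell: Option<&str>) -> Result<Allele2, String> {
    match cell.map(str::trim) {
        None => Ok(Allele2::NotApplicable),
        Some("na") => Ok(Allele2::NotApplicable),
        // A present but empty cell is flagged as an error.
        Some("") => Err("allele_2 is present but empty; use \"na\" instead".to_string()),
        Some(hgvs) => Ok(Allele2::Hgvs(hgvs.to_string())),
    }
}
```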

age_of_onset	age_at_last_encounter	sex	HPO
age	age	M:F:O:U	na
Infantile onset	P21Y	F	na

age_of_onset: The age of onset of disease, recorded using the ISO 8601 convention or an HPO Onset term.
sex (M:F:O:U): One of M (male), F (female), O (other), or U (unknown).
HPO (str): The column marks the end of the data columns and should contain "na".

HPO Term Columns

All of the following columns denote HPO terms. The first row has the HPO term label. Be sure to use the same label as is shown on the HPO webpage and do not change the capitalization. The second row has the corresponding HPO id. The following table shows several examples; the individual_id column from above is repeated for ease of exposition.

individual_id	Hepatitis	Pancreatitis	Lymphadenopathy	Splenomegaly
	HP:0012115	HP:0001733	HP:0002716	HP:0001744
A	observed	excluded		P4Y2M
B	P3Y	na	observed	excluded

Each table cell can contain either

observed: The phenotypic abnormality denoted by the HPO term was present
excluded: The phenotypic abnormality denoted by the HPO term was investigated and ruled out.
An ISO 8601 string denoting the age of onset.
na or empty (blank): Information not available or phenotypic feature not measured.

In this example, individual A was observed to have hepatitis (but age of onset is unknown or not available), pancreatitis was ruled out, no information is available about lymphadenopathy, and splenomegaly was first observed at age 4 years and 2 months.

Individual B was found to have hepatitis first observed at age 3 years, no information was available about pancreatitis, lymphadenopathy was observed (but age of onset is unknown or not available), and splenomegaly was ruled out.
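The four kinds of cell content listed above map naturally onto an enum. The following is a minimal sketch, not the library's implementation: HpoCell, is_iso8601_age, and parse_hpo_cell are hypothetical names, and the age check only accepts the year/month/day forms used in this document (such as "P3Y" or "P4Y2M").

```rust
/// Interpretation of one cell in an HPO term column (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum HpoCell {
    Observed,
    Excluded,
    /// Observed, with the given ISO 8601 age of onset.
    OnsetAge(String),
    NotAvailable,
}

/// Accepts strings like "P3Y", "P4Y2M", or "P71Y6M2D": a 'P' followed by
/// number+unit groups whose units appear in the order Y, M, D.
fn is_iso8601_age(s: &str) -> bool {
    let Some(rest) = s.strip_prefix('P') else {
        return false;
    };
    let mut allowed: &[char] = &['Y', 'M', 'D'];
    let mut digits = 0usize;
    let mut components = 0usize;
    for c in rest.chars() {
        if c.is_ascii_digit() {
            digits += 1;
        } else if digits > 0 {
            match allowed.iter().position(|&u| u == c) {
                Some(i) => {
                    // Later groups may only use units that follow this one.
                    allowed = &allowed[i + 1..];
                    digits = 0;
                    components += 1;
                }
                None => return false,
            }
        } else {
            return false;
        }
    }
    digits == 0 && components > 0
}

pub fn parse_hpo_cell(cell: &str) -> Result<HpoCell, String> {
    match cell.trim() {
        "observed" => Ok(HpoCell::Observed),
        "excluded" => Ok(HpoCell::Excluded),
        "" | "na" => Ok(HpoCell::NotAvailable),
        age if is_iso8601_age(age) => Ok(HpoCell::OnsetAge(age.to_string())),
        other => Err(format!("unrecognized cell value '{other}'")),
    }
}
```

Applied to the row for individual A, the four cells would yield Observed, Excluded, NotAvailable, and OnsetAge("P4Y2M"), matching the interpretation given in the text.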

The file should contain at least the following information; see explanations below.

row_type	id	age	sex	allele	Tall stature	Abnormal sternum morphology	Potassium
header1	str	ISO8601	str	NM_000138.5	simple	option	threshold
header2					HP:0000098	HP:0000767; HP:0000768	3.5-5.2 mEq/L: High->Hyperkalemia(HP:0002153); Low->Hypokalemia(HP:0002900)
individual	patient A	P6Y	male	c.8326C>T	+	Pectus carinatum	n/a
individual	patient B	P9Y	female	c.7988G>C	-	Pectus excavatum	5.8

row_type

Each row must begin with one of the words "header1", "header2" or "individual". There should be one row for each individual in the cohort.

id

This is a cohort-specific identifier that must be anonymized.

age

This is the age of the individual at the time of the medical encounter at which the phenotypic features were recorded. The format of the column is recorded in the header1 line. Valid options are ISO8601, for strings such as "P4Y" (four years of age) or "P71Y6M2D" (71 years, 6 months, and 2 days), and Years, for numbers such as 5 (5 years of age) or 7.5 (7 years and 6 months).
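To make the relationship between the two age formats concrete, here is a small sketch that converts a Years value into the equivalent ISO 8601 string. The function name years_to_iso8601 is hypothetical, and rounding to the nearest whole month is an assumption made for this example; the text above does not say how the library treats fractional years.

```rust
/// Convert a decimal number of years (e.g., "7.5") to an ISO 8601 period
/// (e.g., "P7Y6M"). Assumption: the value is rounded to the nearest month.
pub fn years_to_iso8601(cell: &str) -> Result<String, String> {
    let years = cell
        .trim()
        .parse::<f64>()
        .map_err(|e| format!("not a number: '{cell}' ({e})"))?;
    if !years.is_finite() || years < 0.0 {
        return Err(format!("age must be a non-negative number, got '{cell}'"));
    }
    let total_months = (years * 12.0).round() as u32;
    let (y, m) = (total_months / 12, total_months % 12);
    Ok(match (y, m) {
        (0, 0) => "P0Y".to_string(),
        (y, 0) => format!("P{y}Y"),
        (0, m) => format!("P{m}M"),
        (y, m) => format!("P{y}Y{m}M"),
    })
}
```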

sex

Use male, female, other, or unknown.

HPO columns

The remaining columns contain information about HPO terms observed in the individuals. There are three types of column, identified in the header1 row as "simple", "option", or "threshold".

For a simple column, the top row contains the label of the term, the header1 row contains the word "simple", and the header2 row contains the HPO id; in the example table, we see Tall stature; HP:0000098. If the feature is observed in an individual, use "+"; if the feature was explicitly excluded, use "-". If the feature was not measured or no information is available, use "n/a".
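The "+", "-", and "n/a" convention for simple columns can be sketched as below. Both function names are hypothetical. The tally function additionally counts how many individuals had the feature among those for whom it was assessed; pairing the counts this way mirrors the n/m frequency shown in the small-file format, but whether the library derives frequencies exactly like this is an assumption.

```rust
/// Interpret one cell of a "simple" HPO column:
/// Some(true) = observed, Some(false) = excluded, None = not measured.
pub fn parse_simple_cell(cell: &str) -> Result<Option<bool>, String> {
    match cell.trim() {
        "+" => Ok(Some(true)),
        "-" => Ok(Some(false)),
        "n/a" => Ok(None),
        other => Err(format!("expected '+', '-' or 'n/a' but got '{other}'")),
    }
}

/// Count (observed, measured) over all cells of one simple column.
pub fn tally_simple_column(cells: &[&str]) -> Result<(u32, u32), String> {
    let mut observed = 0;
    let mut measured = 0;
    for cell in cells {
        // "n/a" cells contribute to neither count.
        if let Some(present) = parse_simple_cell(cell)? {
            measured += 1;
            if present {
                observed += 1;
            }
        }
    }
    Ok((observed, measured))
}
```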

Demo

Example program

To run the example program (whose code is available at bin/main.rs), enter

cargo run --bin rpt -- --rpt /.../.../phenopacket-store/notebooks/CD28/input/CD28_IMD123_individuals.xlsx --json /.../.../hp.json

Adjust the paths to an Excel legacy template and the hp.json file as needed.

For faster performance, enter

cargo run --features="cli" --release --bin rpt -- --template ./../phenopacket-store/notebooks/CD28/input/CD28_IMD123_individuals.xlsx --json ./../../data/hpo/hp.json

To build the binary demo (with clap), enter

cargo build --release --features cli

The binary is then in ./target/release/rpt. To run it through cargo, enter

cargo run --features cli --bin rpt

To include private items in the generated documentation, enter

cargo doc --document-private-items --open

Output

GA4GH Phenotools can be used to output data about cohorts as Excel or text files.

Excel

To compare two cohorts starting with their cohort files, we can enter

cargo run --features excel_export -- compare \
  --cohort1 cohort_a.json \
  --cohort2 cohort_b.json \
  --output comparison.xlsx \
  --hpo hp.json \
  --threshold 5

Note that we need to compile with the excel_export feature activated.

API Reference

API Documentation
