.. _measurement-stat:
====================
Compare measurements
====================
****************
Example analysis
****************
We will analyze 69 individuals reported by
`Xu et al. (2019) `_.
The reported data includes the *CYP21A2* mutations, the lab measurement,
and other clinical signs and symptoms encoded as HPO terms.
TODO: wordsmith
Load cohort
===========
For the purpose of this analysis, we will load the :class:`~gpsea.model.Cohort`
from a `JSON file `_.
The cohort was prepared from phenopackets as described in :ref:`create-a-cohort` section,
and then serialized as a JSON file following the instructions in :ref:`cohort-persistence` section.
..
Prepare the JSON file by running the tests in `tests/tests/test_generate_doc_cohorts.py`.
>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort_json = 'docs/cohort-data/CYP21A2.0.1.20.json'
>>> with open(fpath_cohort_json) as fh:
... cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
69
Configure analysis
==================
*MANE* transcript of *CYP21A2*.
>>> tx_id = 'NM_000500.9'
Genotype predicate
------------------
For now, just Missense vs rest.
Nonsense variants + big SVs,
Missense should *NOT* be severe.
TODO - create real predicate.
>>> from gpsea.model import VariantEffect
>>> from gpsea.analysis.predicate import variant_effect
>>> is_missense = variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id=tx_id)
>>> is_missense.description
'MISSENSE_VARIANT on NM_000500.9'
Assuming AR inheritance, we compare missense vs. rest:
>>> from gpsea.analysis.clf import biallelic_classifier
>>> gt_clf = biallelic_classifier(
... a_predicate=is_missense,
... b_predicate=~is_missense,
... a_label="Missense", b_label="Other",
... partitions=({0,}, {1, 2}),
... )
>>> gt_clf.class_labels
('Missense/Missense', 'Missense/Other OR Other/Other')
Phenotype score
---------------
We investigate an association between a measurement value and a genotype group.
We use the measurement of `Testosterone [Mass/volume] in Serum or Plasma `_
(`LOINC:2986-8`).
>>> from gpsea.analysis.pscore import MeasurementPhenotypeScorer
>>> testosterone = 'LOINC:2986-8'
>>> pheno_scorer = MeasurementPhenotypeScorer.from_measurement_id(
... term_id=testosterone,
... label="Testosterone [Mass/volume] in Serum or Plasma",
... )
>>> pheno_scorer.description
'Value of Testosterone [Mass/volume] in Serum or Plasma [LOINC:2986-8]'
Statistical test
----------------
We will use T test to test for differences between scores
of the different genotype groups
>>> from gpsea.analysis.pscore.stats import TTestStatistic
>>> score_statistic = TTestStatistic()
Final analysis
--------------
We will put the final analysis together into :class:`~gpsea.analysis.pscore.PhenotypeScoreAnalysis`.
>>> from gpsea.analysis.pscore import PhenotypeScoreAnalysis
>>> score_analysis = PhenotypeScoreAnalysis(
... score_statistic=score_statistic,
... )
Analysis
========
We execute the analysis by running
>>> result = score_analysis.compare_genotype_vs_phenotype_score(
... cohort=cohort,
... gt_clf=gt_clf,
... pheno_scorer=pheno_scorer,
... )
>>> result.pval
0.741216622359659
Show data frame with scores
>>> scores = result.data.sort_index()
>>> scores.head() # doctest: +NORMALIZE_WHITESPACE
genotype phenotype
patient_id
individual 10[PMID_30968594_individual_10] 1 614.0
individual 11[PMID_30968594_individual_11] 1 630.0
individual 12[PMID_30968594_individual_12] 1 NaN
individual 13[PMID_30968594_individual_13] 1 303.0
individual 14[PMID_30968594_individual_14] 1 664.0
Prepare genotype category legend:
>>> gt_id_to_name = {c.category.cat_id: c.category.name for c in gt_clf.get_categorizations()}
>>> gt_id_to_name
{0: 'Missense/Missense', 1: 'Missense/Other OR Other/Other'}
TODO: wordsmith & finish!