.. _survival: ================= Survival analysis ================= .. doctest:: :hide: >>> from gpsea import _overwrite **************** Example analysis **************** We will analyze the time until end stage renal disease in 207 individuals with mutations in *UMOD*. Specifically, we will test for difference between the onset of the end stage renal disease in the individuals with mutation in exon 3 of *UMOD* vs. individuals with other *UMOD* mutation. Load cohort =========== For the purpose of this analysis, we will load the :class:`~gpsea.model.Cohort` from a `JSON file `_. The cohort was prepared from phenopackets as described in :ref:`create-a-cohort` section, and then serialized as a JSON file following the instructions in :ref:`cohort-persistence` section. .. Prepare the JSON file by running the tests in `tests/tests/test_generate_doc_cohorts.py`. >>> import json >>> from gpsea.io import GpseaJSONDecoder >>> fpath_cohort_json = 'docs/cohort-data/UMOD.0.1.20.json' >>> with open(fpath_cohort_json) as fh: ... cohort = json.load(fh, cls=GpseaJSONDecoder) >>> len(cohort) 207 Configure analysis ================== *MANE* transcript of *UMOD*. >>> tx_id = 'NM_003361.4' Genotype classifier ------------------- One allele of exon 3 vs. one allele of elsewhere. >>> from gpsea.analysis.predicate import exon >>> is_in_exon3 = exon(exon=3, tx_id=tx_id) >>> is_in_exon3.description 'overlaps with exon 3 of NM_003361.4' Monoallelic classifier to compare one allele of *UMOD* exon 3 variant versus one allele of other *UMOD* variant: >>> from gpsea.analysis.clf import monoallelic_classifier >>> gt_clf = monoallelic_classifier( ... a_predicate=is_in_exon3, ... b_predicate=~is_in_exon3, ... a_label="Exon 3", b_label="Other", ... ) >>> gt_clf.class_labels ('Exon 3', 'Other') Survival endpoint ----------------- The endpoint of our study is defined as development of end stage renal disease. In the *UMOD* cohort, this is encoded with `Stage 5 chronic kidney disease `_ (`HP:0003774`) HPO term. We need to leverage the HPO hierarchy when computing the onset of an HPO term. Let's load HPO: >>> import hpotk >>> store = hpotk.configure_ontology_store() >>> hpo = store.load_minimal_hpo(release='v2024-07-01') and now we can create an :class:`~gpsea.analysis.temporal.Endpoint` to compute the time until an individual develops end stage renal disease: >>> from gpsea.analysis.temporal.endpoint import hpo_onset >>> term_id = "HP:0003774" # Stage 5 chronic kidney disease >>> endpoint = hpo_onset(hpo=hpo, term_id=term_id) >>> endpoint.description 'Compute time until onset of Stage 5 chronic kidney disease' Statistical test ---------------- We will use Log rank test to compare the age until the endpoint between the genotype groups: >>> from gpsea.analysis.temporal.stats import LogRankTest >>> survival_statistic = LogRankTest() Final analysis -------------- We will put the final analysis together into a :class:`~gpsea.analysis.temporal.SurvivalAnalysis`. >>> from gpsea.analysis.temporal import SurvivalAnalysis >>> survival_analysis = SurvivalAnalysis( ... statistic=survival_statistic, ... ) Analysis ======== We execute the analysis by running >>> result = survival_analysis.compare_genotype_vs_survival( ... cohort=cohort, ... gt_clf=gt_clf, ... endpoint=endpoint, ... ) >>> result.pval 0.062004258300... Kaplan-Meier curves ------------------- We can plot Kaplan-Meier curves: >>> from gpsea.model import Age >>> import matplotlib as mpl >>> import matplotlib.pyplot as plt >>> fig, ax = plt.subplots(figsize=(6, 4), dpi=120) >>> result.plot_kaplan_meier_curves( ... ax=ax, ... ) >>> _ = ax.xaxis.set( ... # Show X axis in years ... ... major_formatter=mpl.ticker.FuncFormatter(lambda x, pos: f"{x / Age.DAYS_IN_YEAR:.0f}"), ... # ... with a tick for every decade ... major_locator=mpl.ticker.MultipleLocator(10 * Age.DAYS_IN_YEAR), ... ) >>> _ = ax.set( ... xlabel=endpoint.name + " [years]", ... ylabel="Event-free proportion", ... ) >>> _ = ax.grid(axis="y") .. image:: report/umod_km_curves.png :alt: UMOD Kaplan-Meier curves :align: center :width: 600px .. doctest:: survival :hide: >>> if _overwrite: fig.savefig('docs/user-guide/analyses/report/umod_km_curves.png') Raw data -------- The `result` includes the survival values for all cohort members: >>> survivals = result.data.sort_index() >>> survivals.head() # doctest: +NORMALIZE_WHITESPACE genotype phenotype patient_id AII.1[PMID_22034507_AII_1] 0 Survival(value=18262.5, is_censored=True) AII.2[PMID_22034507_AII_2] 0 None AII.3[PMID_22034507_AII_3] 0 Survival(value=16436.25, is_censored=True) AII.5[PMID_22034507_AII_5] 0 Survival(value=22280.25, is_censored=False) AIII.4[PMID_22034507_AIII_4] 0 Survival(value=19723.5, is_censored=False) Each line corresponeds to an individual and the dataframe is indexed by the individual's identifier/label. The `genotype` column contains the genotype class code, and `phenotype` column includes a :class:`~gpsea.analysis.temporal.Survival` value or `None` if computing the survival was impossible (see :func:`~gpsea.analysis.temporal.endpoint.hpo_onset` for details). The `Survival` reports the number of days until attaining the endpoint, here defined as end stage renal disease (`is_censored=False`), or until the individual dropped out of the analysis (`is_censored=True`). Troubleshooting =============== Sometimes the survival analysis fails and an :class:`~gpsea.analysis.AnalysisException` is raised. For instance, the current Logrank test implementation reports a p value of `NaN` if the survival is the same for all individuals. This is unlikely an expected outcome, therefore GPSEA raises an :class:`~gpsea.analysis.AnalysisException` to force the user to troubleshoot. To help with troubleshooting, the data computed prior detecting the error is included in the exception's :attr:`~gpsea.analysis.AnalysisException.data` attribute. In survival analysis, the data should include the identifiers, genotype classes, and survivals of the tested individuals. Let's show this on an example. We will create a toy cohort of 10 individuals with onset of `Lynch syndrome I `_ (`OMIM:120435`) at 40 years. >>> from gpsea.model import Cohort, Patient, Disease, Age >>> onset = Age.from_iso8601_period("P40Y") >>> individuals = [ ... Patient.from_raw_parts( ... labels=label, ... diseases=( ... Disease.from_raw_parts( ... term_id="OMIM:120435", ... name="Lynch syndrome I", ... is_observed=True, ... onset=onset, ... ), ... ), ... ) ... for label in "ABCDEFGHIJ" # 10 individuals ... ] >>> cohort = Cohort.from_patients(individuals) We will assign them into genotype classes on random, ... >>> from gpsea.analysis.clf import random_classifier >>> gt_clf = random_classifier(seed=123) >>> gt_clf.description 'Classify the individual into random classes' ... using the Lynch syndrome I diagnosis as the endpoint ... >>> from gpsea.analysis.temporal.endpoint import disease_onset >>> endpoint = disease_onset(disease_id="OMIM:120435") >>> endpoint.description 'Compute time until OMIM:120435 onset' ... and we will use Logrank test for differences in survival. >>> from gpsea.analysis.temporal.stats import LogRankTest >>> survival_statistic = LogRankTest() We put together the survival analysis ... >>> from gpsea.analysis.temporal import SurvivalAnalysis >>> survival_analysis = SurvivalAnalysis( ... statistic=survival_statistic, ... ) ... which we expect to fail with an :class:`~gpsea.analysis.AnalysisException`: >>> result = survival_analysis.compare_genotype_vs_survival( ... cohort=cohort, ... gt_clf=gt_clf, ... endpoint=endpoint, ... ) Traceback (most recent call last): ... gpsea.analysis._base.AnalysisException: The survival values did not meet the expectation of the statistical test! The genotype classes and survival values can be retrieved from the exception: >>> from gpsea.analysis import AnalysisException >>> try: ... result = survival_analysis.compare_genotype_vs_survival( ... cohort=cohort, ... gt_clf=gt_clf, ... endpoint=endpoint, ... ) ... except AnalysisException as ae: ... genotypes = ae.data["genotype"] ... survivals = ae.data["survival"] and the values can come in handy in troubleshooting: >>> genotypes[:3] (0, 0, 0) >>> survivals[:3] (Survival(value=14610.0, is_censored=False), Survival(value=14610.0, is_censored=False), Survival(value=14610.0, is_censored=False))