Skip to content

mecfs_bio.build_system.task.gene_manhattan_plot_task

Task to produce an interactive gene-level Manhattan plot as an HTML file.

Supports two source types:

  • :class:MagmaGeneSource: read a MAGMA gene-level analysis output directory (the .genes.out file produced by :class:MagmaGeneAnalysisTask) and join a gene thesaurus to translate Ensembl IDs into human-readable gene names.
  • :class:GenePValueTableSource: read an arbitrary table of (gene_ensembl_id, p_value) rows and look up chromosomal locations and human-readable gene names from a gene-locations reference (such as MAGMA_ENSEMBL_GENE_LOCATION_REFERENCE_DATA_BUILD_37_RAW). Intended for rare-variant test output or any other gene-level result table.

The plot uses Plotly's WebGL Scattergl renderer for performance with 20k-30k gene points and exposes hover text containing the gene name, Ensembl ID, chromosome, genomic midpoint position (labelled Position (hg19) or Position (hg38) according to the source's declared genome_build), and -log10(p).

Classes:

Functions:

Attributes:

GeneIdKind module-attribute

GeneIdKind = Literal['ensembl_id', 'gene_name']

logger module-attribute

logger = get_logger()

GeneManhattanPlotTask

Bases: Task

Create an interactive HTML gene-level Manhattan plot.

Backed by Plotly's WebGL renderer (Scattergl) so that hover stays responsive at gene-scale point counts (~20k-30k).

Methods:

Attributes:

colors class-attribute instance-attribute

colors: tuple[str, str] = ('#1f77b4', '#ff7f0e')

deps property

deps: list[Task]

meta instance-attribute

meta: Meta

plotly_js_mode class-attribute instance-attribute

plotly_js_mode: bool | PlotlyWriteMode = 'cdn'

point_size class-attribute instance-attribute

point_size: int = 5

sig_line_color class-attribute instance-attribute

sig_line_color: str = 'red'

sig_threshold class-attribute instance-attribute

sig_threshold: float | None = None

source instance-attribute

source: GeneManhattanSource

title class-attribute instance-attribute

title: str | None = None

create classmethod

create(
    asset_id: str,
    source: GeneManhattanSource,
    sig_threshold: float | None = None,
    title: str | None = None,
) -> GeneManhattanPlotTask
Source code in mecfs_bio/build_system/task/gene_manhattan_plot_task.py
@classmethod
def create(
    cls,
    asset_id: str,
    source: GeneManhattanSource,
    sig_threshold: float | None = None,
    title: str | None = None,
) -> "GeneManhattanPlotTask":
    meta = GWASPlotFileMeta(
        trait=source.trait,
        project=source.project,
        extension=".html",
        id=AssetId(asset_id),
    )
    return cls(
        meta=meta,
        source=source,
        sig_threshold=sig_threshold,
        title=title,
    )

execute

execute(scratch_dir: Path, fetch: Fetch, wf: WF) -> Asset
Source code in mecfs_bio/build_system/task/gene_manhattan_plot_task.py
def execute(self, scratch_dir: Path, fetch: Fetch, wf: WF) -> Asset:
    df = self.source.load_df(fetch=fetch)
    fig = build_manhattan_plot(
        df=df,
        sig_threshold=self.sig_threshold,
        point_size=self.point_size,
        colors=self.colors,
        sig_line_color=self.sig_line_color,
        title=self.title,
        genome_build=self.source.genome_build,
    )
    out_path = scratch_dir / "gene_manhattan.html"
    fig.write_html(out_path, include_plotlyjs=self.plotly_js_mode)
    return FileAsset(out_path)

GeneManhattanSource

Bases: ABC

A source that yields rows of (chrom, pos, ensembl_id, gene_name, p) for a Manhattan plot.

Methods:

  • load_df

    Materialize a pandas DataFrame with columns chrom, pos, ensembl_id, gene_name, p_value.

Attributes:

  • deps (list[Task]) –
  • genome_build (GenomeBuild) –

    Genome build of the chromosomal positions exposed by load_df.

  • project (str) –

    The project label inherited from the primary input task's metadata.

  • trait (str) –

    The trait label inherited from the primary input task's metadata.

deps abstractmethod property

deps: list[Task]

genome_build abstractmethod property

genome_build: GenomeBuild

Genome build of the chromosomal positions exposed by load_df.

Drives the hover-text position label (pos_hg19 vs pos_hg38).

project abstractmethod property

project: str

The project label inherited from the primary input task's metadata.

trait abstractmethod property

trait: str

The trait label inherited from the primary input task's metadata.

load_df abstractmethod

load_df(fetch: Fetch) -> pd.DataFrame

Materialize a pandas DataFrame with columns chrom, pos, ensembl_id, gene_name, p_value.

Source code in mecfs_bio/build_system/task/gene_manhattan_plot_task.py
@abstractmethod
def load_df(self, fetch: Fetch) -> pd.DataFrame:
    """Materialize a pandas DataFrame with columns ``chrom``, ``pos``, ``ensembl_id``, ``gene_name``, ``p_value``."""
    pass

GenePValueTableSource

Bases: GeneManhattanSource

Load a Manhattan-plot table from an arbitrary (gene, p-value) table.

Chromosomal positions and the complementary gene identifier (Ensembl ID or human-readable gene name) are looked up from gene_locations_task (e.g. the MAGMA Ensembl gene-locations reference) by inner join. Genes missing from the locations file are dropped because they cannot be placed on the x-axis.

gene_id_kind declares which identifier the input table uses in gene_col. The locations reference must contain a matching column: Ensembl IDs ("ensembl_id") join on the reference's Ensembl-ID column, gene symbols ("gene_name") join on the reference's gene-name column.

Methods:

Attributes:

deps property

deps: list[Task]

gene_col instance-attribute

gene_col: str

gene_id_kind class-attribute instance-attribute

gene_id_kind: GeneIdKind = 'ensembl_id'

gene_locations_task instance-attribute

gene_locations_task: Task

genome_build instance-attribute

genome_build: GenomeBuild

p_col instance-attribute

p_col: str

project property

project: str

table_task instance-attribute

table_task: Task

trait property

trait: str

load_df

load_df(fetch: Fetch) -> pd.DataFrame
Source code in mecfs_bio/build_system/task/gene_manhattan_plot_task.py
def load_df(self, fetch: Fetch) -> pd.DataFrame:
    join_col = _ENSEMBL_ID if self.gene_id_kind == "ensembl_id" else _GENE_NAME

    table_asset = fetch(self.table_task.asset_id)
    p_df = (
        scan_dataframe_asset(table_asset, meta=self.table_task.meta)
        .collect()
        .to_pandas()
    )
    p_df = pd.DataFrame(
        {
            join_col: p_df[self.gene_col].astype(str),
            _P: p_df[self.p_col].astype(float),
        }
    )

    loc_asset = fetch(self.gene_locations_task.asset_id)
    loc_df = (
        scan_dataframe_asset(loc_asset, meta=self.gene_locations_task.meta)
        .collect()
        .to_pandas()
    )
    loc_df = pd.DataFrame(
        {
            _ENSEMBL_ID: loc_df[_MAGMA_LOC_ENSEMBL_COL].astype(str),
            _CHROM: loc_df[GENE_INFO_CHROM_COL].astype(str),
            _POS: (
                loc_df[GENE_INFO_START_COL].astype(float)
                + loc_df[GENE_INFO_END_COL].astype(float)
            )
            / 2.0,
            _GENE_NAME: loc_df[GENE_INFO_NAME_COL].astype(str),
        }
    ).drop_duplicates(subset=[join_col])

    merged = p_df.merge(loc_df, on=join_col, how="inner")
    missing = len(p_df) - len(merged)
    if missing > 0:
        logger.warning(
            "Dropped genes missing from the locations reference",
            num_dropped=missing,
            num_kept=len(merged),
            gene_id_kind=self.gene_id_kind,
        )
    return merged

MagmaGeneSource

Bases: GeneManhattanSource

Load a Manhattan-plot table from a :class:MagmaGeneAnalysisTask.

Chromosomal positions come from the MAGMA output itself. Human-readable gene names are joined in from gene_thesaurus_task by Ensembl ID. When a gene is missing from the thesaurus, the Ensembl ID is used as the display name.

Methods:

Attributes:

deps property

deps: list[Task]

gene_thesaurus_task instance-attribute

gene_thesaurus_task: Task

genome_build instance-attribute

genome_build: GenomeBuild

magma_task instance-attribute

magma_task: Task

project property

project: str

trait property

trait: str

load_df

load_df(fetch: Fetch) -> pd.DataFrame
Source code in mecfs_bio/build_system/task/gene_manhattan_plot_task.py
def load_df(self, fetch: Fetch) -> pd.DataFrame:
    magma_asset = fetch(self.magma_task.asset_id)
    assert isinstance(magma_asset, DirectoryAsset)
    magma_path = magma_asset.path / (GENE_ANALYSIS_OUTPUT_STEM_NAME + ".genes.out")
    magma_df = (
        scan_dataframe(
            path=magma_path,
            spec=DataFrameReadSpec(
                DataFrameWhiteSpaceSepTextFormat(comment_code="#")
            ),
        )
        .collect()
        .to_pandas()
    )
    magma_df = pd.DataFrame(
        {
            _ENSEMBL_ID: magma_df[_MAGMA_GENE_COL].astype(str),
            _CHROM: magma_df[_MAGMA_CHR_COL].astype(str),
            _POS: (
                magma_df[_MAGMA_START_COL].astype(float)
                + magma_df[_MAGMA_STOP_COL].astype(float)
            )
            / 2.0,
            _P: magma_df[_MAGMA_P_COL].astype(float),
        }
    )

    thesaurus_asset = fetch(self.gene_thesaurus_task.asset_id)
    thesaurus_df = (
        scan_dataframe_asset(thesaurus_asset, meta=self.gene_thesaurus_task.meta)
        .collect()
        .to_pandas()
    )
    thesaurus_df = (
        thesaurus_df[[_THESAURUS_ENSEMBL_COL, _THESAURUS_NAME_COL]]
        .rename(
            columns={
                _THESAURUS_ENSEMBL_COL: _ENSEMBL_ID,
                _THESAURUS_NAME_COL: _GENE_NAME,
            }
        )
        .drop_duplicates(subset=[_ENSEMBL_ID])
    )
    merged = magma_df.merge(thesaurus_df, on=_ENSEMBL_ID, how="left")
    merged[_GENE_NAME] = merged[_GENE_NAME].fillna(merged[_ENSEMBL_ID])
    return merged

build_manhattan_plot

build_manhattan_plot(
    df: DataFrame,
    sig_threshold: float | None,
    point_size: int,
    colors: tuple[str, str],
    sig_line_color: str,
    title: str | None,
    genome_build: GenomeBuild,
) -> go.Figure

Construct a Plotly figure containing a gene-level Manhattan plot.

Genes with non-positive or null p-values are dropped (-log10 is undefined). If sig_threshold is None, a Bonferroni-corrected threshold 0.05 / N_genes is used and a dashed horizontal line is drawn at the corresponding -log10(p).

genome_build selects the hover label for the gene's midpoint position (Position (hg19) for build 37, Position (hg38) for build 38). Positions in df are assumed to already be in the declared build.

Source code in mecfs_bio/build_system/task/gene_manhattan_plot_task.py
def build_manhattan_plot(
    df: pd.DataFrame,
    sig_threshold: float | None,
    point_size: int,
    colors: tuple[str, str],
    sig_line_color: str,
    title: str | None,
    genome_build: GenomeBuild,
) -> go.Figure:
    """Construct a Plotly figure containing a gene-level Manhattan plot.

    Genes with non-positive or null p-values are dropped (``-log10`` is
    undefined). If ``sig_threshold`` is ``None``, a Bonferroni-corrected
    threshold ``0.05 / N_genes`` is used and a dashed horizontal line is drawn
    at the corresponding ``-log10(p)``.

    ``genome_build`` selects the hover label for the gene's midpoint position
    (``Position (hg19)`` for build 37, ``Position (hg38)`` for build 38).
    Positions in ``df`` are assumed to already be in the declared build.
    """
    df = df.dropna(subset=[_P]).copy()
    df = df[df[_P] > 0]
    df[_CHROM] = df[_CHROM].astype(str)
    if len(df) == 0:
        raise ValueError(
            "No plottable rows: all gene p-values were null or non-positive."
        )

    chroms = sorted(df[_CHROM].unique().tolist(), key=_chrom_sort_key)

    chrom_max_series = df.groupby(_CHROM)[_POS].max().astype(float)
    chrom_max_pos: dict[str, float] = {
        str(chrom): float(value) for chrom, value in chrom_max_series.items()
    }
    chrom_offsets: dict[str, float] = {}
    chrom_centers: dict[str, float] = {}
    running_offset = 0.0
    for chrom in chroms:
        chrom_offsets[chrom] = running_offset
        chrom_centers[chrom] = running_offset + chrom_max_pos[chrom] / 2.0
        running_offset += chrom_max_pos[chrom]

    df = df.assign(
        _x=df[_POS] + df[_CHROM].map(chrom_offsets),
        _mlog10p=-np.log10(df[_P]),
    )

    if sig_threshold is None:
        sig_threshold = 0.05 / len(df)
    sig_y = float(-np.log10(sig_threshold))

    pos_label = f"position (hg{genome_build})"
    fig = go.Figure()
    for idx, chrom in enumerate(chroms):
        chrom_df = df[df[_CHROM] == chrom]
        color = colors[idx % 2]
        fig.add_trace(
            go.Scattergl(
                x=chrom_df["_x"],
                y=chrom_df["_mlog10p"],
                mode="markers",
                marker=dict(size=point_size, color=color),
                name=f"chr{chrom}",
                customdata=list(
                    zip(
                        chrom_df[_GENE_NAME].astype(str),
                        chrom_df[_ENSEMBL_ID].astype(str),
                        chrom_df[_CHROM].astype(str),
                        chrom_df[_POS].astype(float),
                        strict=True,
                    )
                ),
                hovertemplate=(
                    "<b>%{customdata[0]}</b><br>"
                    f"{pos_label}:" + " chr%{customdata[2]} %{customdata[3]:,.0f}<br>"
                    # "Ensembl: %{customdata[1]}<br>"
                    # "Chromosome: %{customdata[2]}<br>"
                    # f"{pos_label}: " + "%{customdata[3]:,.0f}<br>"
                    "-log<sub>10</sub>(p): %{y:.3f}<br>"
                    "<extra></extra>"
                ),
                showlegend=False,
            )
        )

    fig.add_hline(
        y=sig_y,
        line=dict(color=sig_line_color, dash="dash"),
        annotation_text=f"p = {sig_threshold:.2e}",
        annotation_position="top left",
    )

    fig.update_layout(
        title=title,
        xaxis=dict(
            tickmode="array",
            tickvals=[chrom_centers[c] for c in chroms],
            ticktext=chroms,
            title="Chromosome",
            showgrid=False,
            zeroline=False,
        ),
        yaxis=dict(
            title="-log<sub>10</sub>(p)",
            zeroline=False,
        ),
        plot_bgcolor="white",
        hovermode="closest",
    )
    return fig