mecfs_bio.build_system.task.gene_manhattan_plot_task

Task to produce an interactive gene-level Manhattan plot as an HTML file.

Supports two source types:

MagmaGeneSource: reads a MAGMA gene-level analysis output directory (the .genes.out file produced by MagmaGeneAnalysisTask) and joins a gene thesaurus to translate Ensembl IDs into human-readable gene names.
GenePValueTableSource: reads an arbitrary table of (gene_ensembl_id, p_value) rows and looks up chromosomal locations and human-readable gene names from a gene-locations reference (such as MAGMA_ENSEMBL_GENE_LOCATION_REFERENCE_DATA_BUILD_37_RAW). Intended for rare-variant test output or any other gene-level result table.

The plot uses Plotly's WebGL Scattergl renderer for performance with 20k-30k gene points and exposes hover text containing the gene name, chromosome, genomic midpoint position (labelled Position (hg19) or Position (hg38) according to the source's declared genome_build), and -log10(p). The Ensembl ID is also attached to each point as custom data, although the hover template in the build_manhattan_plot source leaves its line commented out.

Classes:

GeneManhattanData –

The genes to plot plus the multiple-testing count for the significance line.
GeneManhattanPlotTask –

Create an interactive HTML gene-level Manhattan plot.
GeneManhattanSource –

A source that yields rows of (chrom, pos, ensembl_id, gene_name, p) for a Manhattan plot.
GenePValueTableSource –

Load a Manhattan-plot table from an arbitrary (gene, p-value) table.
MagmaGeneSource –

Load a Manhattan-plot table from a MagmaGeneAnalysisTask.

Functions:

build_manhattan_plot –

Construct a Plotly figure containing a gene-level Manhattan plot.

Attributes:

GeneIdKind –
logger –

GeneIdKind `module-attribute`

GeneIdKind = Literal['ensembl_id', 'gene_name']

logger `module-attribute`

logger = get_logger()

GeneManhattanData

The genes to plot plus the multiple-testing count for the significance line.

df holds the rows to plot, after any max_p_value filtering.

num_genes_for_correction is the number of genes with a valid (positive, non-null) p-value before max_p_value filtering. It drives the default Bonferroni threshold so that the significance line stays invariant to the purely visual max_p_value filter.
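
As a minimal sketch of that ordering, the count is taken first and the filter applied afterwards. Plain Python lists stand in for the DataFrame column here, and count_then_filter is a hypothetical helper, not part of this module:

```python
import math


def count_then_filter(p_values, max_p_value):
    """Return (kept p-values, count of valid p-values before filtering)."""
    # "Valid" means non-null, not NaN, and strictly positive.
    valid = [
        p for p in p_values
        if p is not None and not math.isnan(p) and p > 0
    ]
    num_genes_for_correction = len(valid)
    if max_p_value is None:
        kept = list(p_values)
    else:
        # Strict "<": values at or above the cutoff are dropped.
        kept = [p for p in p_values if p is not None and p < max_p_value]
    return kept, num_genes_for_correction


p_values = [0.5, 0.04, 0.0001, None, 0.2]
kept, n = count_then_filter(p_values, max_p_value=0.1)
# The threshold divides by the 4 valid genes, not the 2 that survive the filter.
bonferroni = 0.05 / n
```

Changing max_p_value in this sketch changes kept but leaves n, and therefore the threshold, untouched.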

Attributes:

df (DataFrame) –
num_genes_for_correction (int) –

df `instance-attribute`

df: DataFrame

num_genes_for_correction `instance-attribute`

num_genes_for_correction: int

GeneManhattanPlotTask

Bases: Task

Create an interactive HTML gene-level Manhattan plot.

Backed by Plotly's WebGL renderer (Scattergl) so that hover stays responsive at gene-scale point counts (~20k-30k).

Methods:

create –
execute –

Attributes:

colors (tuple[str, str]) –
deps (list[Task]) –
hla_marker_symbol (str | None) –
meta (Meta) –
plotly_js_mode (bool | PlotlyWriteMode) –
point_size (int) –
sig_line_color (str) –
sig_threshold (float | None) –
source (GeneManhattanSource) –
title (str | None) –

colors `class-attribute` `instance-attribute`

colors: tuple[str, str] = ('#1f77b4', '#ff7f0e')

deps `property`

deps: list[Task]

hla_marker_symbol `class-attribute` `instance-attribute`

hla_marker_symbol: str | None = None

meta `instance-attribute`

meta: Meta

plotly_js_mode `class-attribute` `instance-attribute`

plotly_js_mode: bool | PlotlyWriteMode = 'cdn'

point_size `class-attribute` `instance-attribute`

point_size: int = 5

sig_line_color `class-attribute` `instance-attribute`

sig_line_color: str = 'red'

sig_threshold `class-attribute` `instance-attribute`

sig_threshold: float | None = None

source `instance-attribute`

source: GeneManhattanSource

title `class-attribute` `instance-attribute`

title: str | None = None

create `classmethod`

create(
    asset_id: str,
    source: GeneManhattanSource,
    sig_threshold: float | None = None,
    title: str | None = None,
) -> GeneManhattanPlotTask

Source code in mecfs_bio/build_system/task/gene_manhattan_plot_task.py

@classmethod
def create(
    cls,
    asset_id: str,
    source: GeneManhattanSource,
    sig_threshold: float | None = None,
    title: str | None = None,
) -> "GeneManhattanPlotTask":
    meta = GWASPlotFileMeta(
        trait=source.trait,
        project=source.project,
        extension=".html",
        id=AssetId(asset_id),
    )
    return cls(
        meta=meta,
        source=source,
        sig_threshold=sig_threshold,
        title=title,
    )

execute

execute(scratch_dir: Path, fetch: Fetch, wf: WF) -> Asset

Source code in mecfs_bio/build_system/task/gene_manhattan_plot_task.py

def execute(self, scratch_dir: Path, fetch: Fetch, wf: WF) -> Asset:
    data = self.source.load_df(fetch=fetch)
    max_p_value = self.source.max_p_value
    y_axis_start = (
        float(-np.log10(max_p_value)) if max_p_value is not None else None
    )
    hla_interval = (
        extended_mhc_interval(self.source.genome_build)
        if self.hla_marker_symbol is not None
        else None
    )
    fig = build_manhattan_plot(
        df=data.df,
        sig_threshold=self.sig_threshold,
        point_size=self.point_size,
        colors=self.colors,
        sig_line_color=self.sig_line_color,
        title=self.title,
        genome_build=self.source.genome_build,
        num_genes_for_correction=data.num_genes_for_correction,
        y_axis_start=y_axis_start,
        hla_interval=hla_interval,
        hla_marker_symbol=self.hla_marker_symbol,
    )
    out_path = scratch_dir / "gene_manhattan.html"
    fig.write_html(out_path, include_plotlyjs=self.plotly_js_mode)
    return FileAsset(out_path)

GeneManhattanSource

Bases: ABC

A source that yields rows of (chrom, pos, ensembl_id, gene_name, p) for a Manhattan plot.

Methods:

load_df –

Load the full table, then apply the optional max_p_value filter.

Attributes:

deps (list[Task]) –
genome_build (GenomeBuild) –

Genome build of the chromosomal positions exposed by load_df.
max_p_value (float | None) –

Drop genes whose p-value is at or above this before plotting.
project (str) –

The project label inherited from the primary input task's metadata.
trait (str) –

The trait label inherited from the primary input task's metadata.

deps `abstractmethod` `property`

deps: list[Task]

genome_build `abstractmethod` `property`

genome_build: GenomeBuild

Genome build of the chromosomal positions exposed by load_df.

Drives the hover-text position label (Position (hg19) vs Position (hg38)).

max_p_value `abstractmethod` `property`

max_p_value: float | None

Drop genes whose p-value is at or above this before plotting.

None disables filtering. Filtering is purely a visual simplification: it does not affect the Bonferroni significance threshold, which is based on the gene count before filtering.

project `abstractmethod` `property`

project: str

The project label inherited from the primary input task's metadata.

trait `abstractmethod` `property`

trait: str

The trait label inherited from the primary input task's metadata.

load_df

load_df(fetch: Fetch) -> GeneManhattanData

Load the full table, then apply the optional max_p_value filter.

The multiple-testing count is taken before filtering so that the significance threshold is unaffected by max_p_value.

Source code in mecfs_bio/build_system/task/gene_manhattan_plot_task.py

def load_df(self, fetch: Fetch) -> GeneManhattanData:
    """Load the full table, then apply the optional max_p_value filter.

    The multiple-testing count is taken before filtering so that the
    significance threshold is unaffected by max_p_value.
    """
    df = self._load_full_df(fetch)
    valid_p = df[_P].notna() & (df[_P] > 0)
    num_genes_for_correction = int(valid_p.sum())
    if self.max_p_value is not None:
        num_before = len(df)
        df = df[df[_P] < self.max_p_value]
        logger.info(
            "Filtered genes by maximum p-value",
            max_p_value=self.max_p_value,
            num_dropped=num_before - len(df),
            num_kept=len(df),
        )
    return GeneManhattanData(
        df=df, num_genes_for_correction=num_genes_for_correction
    )

GenePValueTableSource

Bases: GeneManhattanSource

Load a Manhattan-plot table from an arbitrary (gene, p-value) table.

Chromosomal positions and the complementary gene identifier (Ensembl ID or human-readable gene name) are looked up from gene_locations_task (e.g. the MAGMA Ensembl gene-locations reference) by inner join. Genes missing from the locations file are dropped because they cannot be placed on the x-axis.

gene_id_kind declares which identifier the input table uses in gene_col. The locations reference must contain a matching column: Ensembl IDs ("ensembl_id") join on the reference's Ensembl-ID column, gene symbols ("gene_name") join on the reference's gene-name column.

max_p_value, when not None, drops genes whose p-value is at or above it before plotting, keeping the figure free of the many uninformative high-p-value points. The default of 0.1 retains only the nominally interesting tail. Filtering does not affect the significance threshold.
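
The inner-join behaviour described above can be sketched with plain dictionaries. This is only an illustration under assumed names: join_locations and the keys ensembl_id, gene_name, chrom and pos are hypothetical and need not match the column names the module uses internally.

```python
def join_locations(rows, locations, gene_id_kind="ensembl_id"):
    """Inner-join (gene_id, p) rows against a locations reference.

    rows: iterable of (gene_id, p_value) tuples.
    locations: iterable of dicts with keys ensembl_id, gene_name, chrom, pos.
    gene_id_kind: which key of the reference the gene_id should match.
    """
    index = {loc[gene_id_kind]: loc for loc in locations}
    joined = []
    for gene_id, p in rows:
        loc = index.get(gene_id)
        if loc is None:
            # No known position: the gene cannot be placed on the x-axis.
            continue
        joined.append({**loc, "p": p})
    return joined


locations = [
    {"ensembl_id": "ENSG_A", "gene_name": "GENE_A", "chrom": "1", "pos": 1500.0},
    {"ensembl_id": "ENSG_B", "gene_name": "GENE_B", "chrom": "2", "pos": 900.0},
]
by_id = join_locations([("ENSG_A", 0.01), ("ENSG_MISSING", 0.02)], locations)
by_name = join_locations([("GENE_B", 0.03)], locations, gene_id_kind="gene_name")
```

Either way the joined row carries both identifiers, which is what lets the plot show a gene name even when the input table only has Ensembl IDs (and vice versa).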

Attributes:

deps (list[Task]) –
gene_col (str) –
gene_id_kind (GeneIdKind) –
gene_locations_task (Task) –
genome_build (GenomeBuild) –
max_p_value (float | None) –
p_col (str) –
project (str) –
table_task (Task) –
trait (str) –

deps `property`

deps: list[Task]

gene_col `instance-attribute`

gene_col: str

gene_id_kind `class-attribute` `instance-attribute`

gene_id_kind: GeneIdKind = 'ensembl_id'

gene_locations_task `instance-attribute`

gene_locations_task: Task

genome_build `instance-attribute`

genome_build: GenomeBuild

max_p_value `class-attribute` `instance-attribute`

max_p_value: float | None = 0.1

p_col `instance-attribute`

p_col: str

project `property`

project: str

table_task `instance-attribute`

table_task: Task

trait `property`

trait: str

MagmaGeneSource

Bases: GeneManhattanSource

Load a Manhattan-plot table from a MagmaGeneAnalysisTask.

Chromosomal positions come from the MAGMA output itself. Human-readable gene names are joined in from gene_thesaurus_task by Ensembl ID. When a gene is missing from the thesaurus, the Ensembl ID is used as the display name.

max_p_value, when not None, drops genes whose p-value is at or above it before plotting, keeping the figure free of the many uninformative high-p-value points. The default of 0.1 retains only the nominally interesting tail. Filtering does not affect the significance threshold.
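
The display-name fallback amounts to a lookup with a default. A minimal sketch, assuming the thesaurus has already been reduced to a dict from Ensembl ID to gene name (display_names is a hypothetical helper, not part of this module):

```python
def display_names(ensembl_ids, thesaurus):
    """Map Ensembl IDs to gene names, falling back to the ID itself."""
    return [thesaurus.get(ensembl_id, ensembl_id) for ensembl_id in ensembl_ids]


thesaurus = {"ENSG_A": "GENE_A"}
names = display_names(["ENSG_A", "ENSG_UNKNOWN"], thesaurus)
```

No gene is dropped at this step; an unmapped gene is simply labelled with its Ensembl ID.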

Attributes:

deps (list[Task]) –
gene_thesaurus_task (Task) –
genome_build (GenomeBuild) –
magma_task (Task) –
max_p_value (float | None) –
project (str) –
trait (str) –

deps `property`

deps: list[Task]

gene_thesaurus_task `instance-attribute`

gene_thesaurus_task: Task

genome_build `instance-attribute`

genome_build: GenomeBuild

magma_task `instance-attribute`

magma_task: Task

max_p_value `class-attribute` `instance-attribute`

max_p_value: float | None = 0.1

project `property`

project: str

trait `property`

trait: str

build_manhattan_plot

build_manhattan_plot(
    df: DataFrame,
    sig_threshold: float | None,
    point_size: int,
    colors: tuple[str, str],
    sig_line_color: str,
    title: str | None,
    genome_build: GenomeBuild,
    num_genes_for_correction: int | None = None,
    y_axis_start: float | None = None,
    hla_interval: GenomicInterval | None = None,
    hla_marker_symbol: str | None = "diamond",
    plot_area_height_px: float = 700.0,
) -> go.Figure

Construct a Plotly figure containing a gene-level Manhattan plot.

Genes with non-positive or null p-values are dropped (-log10 is undefined). If sig_threshold is None, a Bonferroni-corrected threshold 0.05 / num_genes_for_correction is used and a dashed horizontal line is drawn at the corresponding -log10(p). num_genes_for_correction should be the number of genes tested, counted before any p-value filtering of df, so that the threshold is invariant to such filtering; it falls back to the number of plotted rows when not supplied.
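
The threshold resolution can be sketched in isolation as follows. resolve_sig_line is a hypothetical stand-alone helper that mirrors the rule described above; the gene counts in the usage lines are made-up examples.

```python
import math


def resolve_sig_line(sig_threshold, num_genes_for_correction, num_plotted_rows):
    """Return (p-value threshold, y position of the significance line)."""
    if sig_threshold is None:
        # Prefer the pre-filter gene count; fall back to the plotted rows.
        n = (
            num_genes_for_correction
            if num_genes_for_correction is not None
            else num_plotted_rows
        )
        sig_threshold = 0.05 / n
    return sig_threshold, -math.log10(sig_threshold)


# Bonferroni over 20,000 tested genes, even though only 1,500 are plotted.
threshold, sig_y = resolve_sig_line(None, 20_000, 1_500)
# An explicit threshold bypasses the correction entirely.
explicit_threshold, explicit_y = resolve_sig_line(1e-5, 20_000, 1_500)
```

Relying on the fallback for a p-value-filtered table gives a more lenient threshold than the full set of tested genes would, which is why passing num_genes_for_correction is the safer choice.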

genome_build selects the hover label for the gene's midpoint position (Position (hg19) for build 37, Position (hg38) for build 38). Positions in df are assumed to already be in the declared build.

y_axis_start, when not None, anchors the lower bound of the -log10(p) axis (drawn vertically). It is intended to be -log10(max_p_value) so that a p-value-filtered plot uses its full vertical extent instead of leaving empty space below the lowest surviving point. One marker diameter is subtracted so points sitting right at the cutoff clear the x-axis instead of being sliced; the upper bound stays data-driven. plot_area_height_px is the assumed rendered plotting-area height in pixels, used only to convert the marker's pixel diameter into that data-unit padding (the docs embed iframe is ~775px tall).
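
A worked version of that padding arithmetic follows. y_axis_range is a hypothetical stand-alone function that follows the formulas in the source listing for this function; the input values are illustrative only.

```python
def y_axis_range(
    y_axis_start,
    max_mlog10p,
    sig_y,
    point_size=5,
    plot_area_height_px=700.0,
):
    """Return [lower, upper] bounds for the -log10(p) axis."""
    # The top of the axis must fit both the data and the significance line.
    y_top = max(max_mlog10p, sig_y)
    top_pad = 0.05 * max(y_top - y_axis_start, 1.0)
    # Convert one marker diameter (pixels) into data units.
    visible_span = (y_top + top_pad) - y_axis_start
    bottom_pad = point_size / plot_area_height_px * visible_span
    return [y_axis_start - bottom_pad, y_top + top_pad]


# max_p_value = 0.1 gives y_axis_start = -log10(0.1) = 1.0
lower, upper = y_axis_range(y_axis_start=1.0, max_mlog10p=8.0, sig_y=5.6)
```

With these inputs the top padding is 0.05 * 7 = 0.35, the visible span is 7.35, and the bottom padding is 5 / 700 * 7.35 = 0.0525, so the axis runs from about 0.9475 to 8.35.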

hla_interval, when not None, marks genes falling inside it (matched on chromosome and midpoint position) with hla_marker_symbol instead of the default circle, so that extended-HLA/MHC-region genes stand out. Those genes keep their chromosome's color; only the symbol changes. hla_marker_symbol must be a valid Plotly symbol whenever hla_interval is given.
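
In the source listing for this function the position test is left-inclusive: a gene is marked when start <= pos < end. A minimal sketch of the per-chromosome symbol choice, using a plain (chrom, start, end) tuple as a hypothetical stand-in for GenomicInterval and placeholder coordinates rather than real MHC boundaries:

```python
def marker_symbols(chrom, positions, interval, hla_marker_symbol="diamond"):
    """Return a single symbol, or one symbol per gene, for one chromosome."""
    interval_chrom, start, end = interval
    if str(chrom) != str(interval_chrom):
        # Other chromosomes use one shared symbol for the whole trace.
        return "circle"
    return [
        hla_marker_symbol if start <= pos < end else "circle"
        for pos in positions
    ]


interval = ("6", 100, 200)  # placeholder coordinates, not real MHC boundaries
on_chrom = marker_symbols("6", [50, 100, 199, 200], interval)
off_chrom = marker_symbols("1", [150], interval)
```

Only the symbol varies per point; the trace color is chosen separately, which is why marked genes keep their chromosome's color.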

Source code in mecfs_bio/build_system/task/gene_manhattan_plot_task.py

def build_manhattan_plot(
    df: pd.DataFrame,
    sig_threshold: float | None,
    point_size: int,
    colors: tuple[str, str],
    sig_line_color: str,
    title: str | None,
    genome_build: GenomeBuild,
    num_genes_for_correction: int | None = None,
    y_axis_start: float | None = None,
    hla_interval: GenomicInterval | None = None,
    hla_marker_symbol: str | None = "diamond",
    plot_area_height_px: float = 700.0,
) -> go.Figure:
    """Construct a Plotly figure containing a gene-level Manhattan plot.

    Genes with non-positive or null p-values are dropped (-log10 is undefined).
    If sig_threshold is None, a Bonferroni-corrected threshold
    0.05 / num_genes_for_correction is used and a dashed horizontal line is
    drawn at the corresponding -log10(p). num_genes_for_correction should be the
    number of genes tested, counted before any p-value filtering of df, so that
    the threshold is invariant to such filtering; it falls back to the number of
    plotted rows when not supplied.

    genome_build selects the hover label for the gene's midpoint position
    (Position (hg19) for build 37, Position (hg38) for build 38). Positions in
    df are assumed to already be in the declared build.

    y_axis_start, when not None, anchors the lower bound of the -log10(p) axis
    (drawn vertically). It is intended to be -log10(max_p_value) so that a
    p-value-filtered plot uses its full vertical extent instead of leaving empty
    space below the lowest surviving point. One marker diameter is subtracted so
    points sitting right at the cutoff clear the x-axis instead of being sliced;
    the upper bound stays data-driven. plot_area_height_px is the assumed
    rendered plotting-area height in pixels, used only to convert the marker's
    pixel diameter into that data-unit padding (the docs embed iframe is
    ~775px tall).

    hla_interval, when not None, marks genes falling inside it (matched on
    chromosome and midpoint position) with hla_marker_symbol instead of the
    default circle, so that extended-HLA/MHC-region genes stand out. Those genes
    keep their chromosome's color; only the symbol changes. hla_marker_symbol
    must be a valid Plotly symbol whenever hla_interval is given.
    """
    assert hla_interval is None or hla_marker_symbol is not None, (
        "hla_marker_symbol is required when hla_interval is given"
    )
    df = df.dropna(subset=[_P]).copy()
    df = df[df[_P] > 0]
    df[_CHROM] = df[_CHROM].astype(str)
    if len(df) == 0:
        raise ValueError(
            "No plottable rows: all gene p-values were null or non-positive."
        )

    chroms = sorted(df[_CHROM].unique().tolist(), key=_chrom_sort_key)

    chrom_max_series = df.groupby(_CHROM)[_POS].max().astype(float)
    chrom_max_pos: dict[str, float] = {
        str(chrom): float(value) for chrom, value in chrom_max_series.items()
    }
    chrom_offsets: dict[str, float] = {}
    chrom_centers: dict[str, float] = {}
    running_offset = 0.0
    for chrom in chroms:
        chrom_offsets[chrom] = running_offset
        chrom_centers[chrom] = running_offset + chrom_max_pos[chrom] / 2.0
        running_offset += chrom_max_pos[chrom]

    df = df.assign(
        _x=df[_POS] + df[_CHROM].map(chrom_offsets),
        _mlog10p=-np.log10(df[_P]),
    )

    if sig_threshold is None:
        n_correction = (
            num_genes_for_correction
            if num_genes_for_correction is not None
            else len(df)
        )
        sig_threshold = 0.05 / n_correction
    sig_y = float(-np.log10(sig_threshold))

    pos_label = f"position (hg{genome_build})"
    fig = go.Figure()
    for idx, chrom in enumerate(chroms):
        chrom_df = df[df[_CHROM] == chrom]
        color = colors[idx % 2]
        if hla_interval is not None and chrom == str(hla_interval.chrom):
            assert hla_marker_symbol is not None  # guaranteed by the top assert
            in_hla = chrom_df[_POS].between(
                hla_interval.start, hla_interval.end, inclusive="left"
            )
            symbol: str | list[str] = np.where(
                in_hla.to_numpy(), hla_marker_symbol, "circle"
            ).tolist()
        else:
            symbol = "circle"
        fig.add_trace(
            go.Scattergl(
                x=chrom_df["_x"],
                y=chrom_df["_mlog10p"],
                mode="markers",
                marker=dict(size=point_size, color=color, symbol=symbol),
                name=f"chr{chrom}",
                customdata=list(
                    zip(
                        chrom_df[_GENE_NAME].astype(str),
                        chrom_df[_ENSEMBL_ID].astype(str),
                        chrom_df[_CHROM].astype(str),
                        chrom_df[_POS].astype(float),
                        strict=True,
                    )
                ),
                hovertemplate=(
                    "<b>%{customdata[0]}</b><br>"
                    f"{pos_label}:" + " chr%{customdata[2]} %{customdata[3]:,.0f}<br>"
                    # "Ensembl: %{customdata[1]}<br>"
                    # "Chromosome: %{customdata[2]}<br>"
                    # f"{pos_label}: " + "%{customdata[3]:,.0f}<br>"
                    "-log<sub>10</sub>(p): %{y:.3f}<br>"
                    "<extra></extra>"
                ),
                showlegend=False,
            )
        )

    fig.add_hline(
        y=sig_y,
        line=dict(color=sig_line_color, dash="dash"),
        annotation_text=f"p = {sig_threshold:.2e}",
        annotation_position="top left",
    )

    yaxis: dict[str, object] = dict(title="-log<sub>10</sub>(p)", zeroline=False)
    if y_axis_start is not None:
        y_top = max(float(df["_mlog10p"].max()), sig_y)
        top_pad = 0.05 * max(y_top - y_axis_start, 1.0)
        # Drop the lower bound by one marker diameter so points sitting right at
        # the cutoff clear the x-axis. point_size is a pixel diameter, so convert
        # it to data units via the visible span and the assumed plot-area height.
        visible_span = (y_top + top_pad) - y_axis_start
        bottom_pad = point_size / plot_area_height_px * visible_span
        yaxis["range"] = [y_axis_start - bottom_pad, y_top + top_pad]

    fig.update_layout(
        title=title,
        xaxis=dict(
            tickmode="array",
            tickvals=[chrom_centers[c] for c in chroms],
            ticktext=chroms,
            title="Chromosome",
            showgrid=False,
            zeroline=False,
        ),
        yaxis=yaxis,
        plot_bgcolor="white",
        hovermode="closest",
        showlegend=False,
    )
    return fig
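
The x-axis layout used in the listing above places chromosomes end to end, offsetting each one by the summed extents of those before it. A stand-alone sketch of that step follows; chromosome_layout is a hypothetical helper, and the chromosome order is passed in explicitly because the module's _chrom_sort_key is not shown here.

```python
def chromosome_layout(chrom_max_pos, chroms):
    """Return (offsets, centers) for chromosomes laid end to end.

    chrom_max_pos: dict of chromosome label -> largest gene position on it.
    chroms: chromosome labels in plotting order.
    """
    offsets = {}
    centers = {}
    running_offset = 0.0
    for chrom in chroms:
        offsets[chrom] = running_offset
        # Tick labels sit at the middle of each chromosome's span.
        centers[chrom] = running_offset + chrom_max_pos[chrom] / 2.0
        running_offset += chrom_max_pos[chrom]
    return offsets, centers


offsets, centers = chromosome_layout({"1": 100.0, "2": 50.0}, ["1", "2"])
# A gene at position 10 on chromosome 2 is drawn at x = offsets["2"] + 10.
x_for_gene = offsets["2"] + 10.0
```

Because each span is sized by the largest observed gene position rather than the true chromosome length, the layout depends on which genes are present in df.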
