mecfs_bio.build_system.task.fetch_gget_info_task

Task that uses gget to annotate gene lists with information from genetics databases.

Classes:

  • FetchGGetInfoTask

    Task that uses gget (https://github.com/pachterlab/gget) to retrieve database information about a list of genes from a dataframe.

Attributes:

PRIMARY_GENE_NAME module-attribute

PRIMARY_GENE_NAME = 'primary_gene_name'

PROTEIN_NAMES_COL module-attribute

PROTEIN_NAMES_COL = 'protein_names'

SUBCELLULAR_LOCALISATION_COL module-attribute

SUBCELLULAR_LOCALISATION_COL = 'subcellular_localisation'

UNIPROT_DESCRIPTION module-attribute

UNIPROT_DESCRIPTION = 'uniprot_description'

UNIPROT_ID_COL module-attribute

UNIPROT_ID_COL = 'uniprot_id'

logger module-attribute

logger = getLogger()

FetchGGetInfoTask

Bases: Task

Task that uses gget (https://github.com/pachterlab/gget) to retrieve database information about a list of genes from a dataframe. Useful for analyzing GWAS results.

Listen to an interview with the primary developer of gget here: https://podcasts.apple.com/nz/podcast/99-laura-luebbert-gget-hunting-viruses-and/id1534473511?i=1000664104787

gget sometimes returns dataframes with inconsistent formatting: for example, a single column may hold lists in some rows and singleton values in others. This file therefore also contains functionality to munge gget's output into a more consistent format.
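The kind of munging described above can be sketched as follows. This is a minimal illustration and not the module's own `_preprocess_columns` helper, whose body is not shown on this page. The function names `normalise_cell` and `normalise_column`, the `"; "` separator, and the sample values are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Any


def normalise_cell(value: Any, sep: str = "; ") -> str | None:
    """Collapse a cell that may be a list, a scalar, or missing into one string.

    Hypothetical helper: the real module may use a different rule.
    """
    if value is None:
        return None
    # NaN is the only float value that is not equal to itself.
    if isinstance(value, float) and value != value:
        return None
    if isinstance(value, (list, tuple)):
        # Drop missing entries, then join what is left.
        parts = [str(item) for item in value if item is not None]
        return sep.join(parts) if parts else None
    return str(value)


def normalise_column(values: list[Any], sep: str = "; ") -> list[str | None]:
    """Apply normalise_cell to every cell of a column."""
    return [normalise_cell(value, sep) for value in values]


# Illustrative column mixing lists, a singleton, and missing values.
mixed = [["Nucleus", "Cytoplasm"], "Membrane", None, float("nan"), []]
cleaned = normalise_column(mixed)
```

With pandas, the same per-cell function could be applied to a column through `Series.map`. Joining into a single string is one possible design choice; it keeps every cell the same type, which makes the column straightforward to write to CSV.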

Methods:

  • create

  • execute

Attributes:

deps property

deps: list[Task]

ensembl_id_col instance-attribute

ensembl_id_col: str

genes_to_use class-attribute instance-attribute

genes_to_use: int | None = None

meta property

meta: Meta

out_format class-attribute instance-attribute

out_format: OutFormat = CSVOutFormat(',')

post_pipe class-attribute instance-attribute

post_pipe: DataProcessingPipe = IdentityPipe()

source_df_task instance-attribute

source_df_task: Task

source_id property

source_id: AssetId

source_meta property

source_meta: Meta

create classmethod

create(
    asset_id: str,
    source_df_task: Task,
    ensembl_id_col: str,
    genes_to_use: int | None = None,
    post_pipe: DataProcessingPipe = IdentityPipe(),
    out_format: OutFormat = CSVOutFormat(","),
)
Source code in mecfs_bio/build_system/task/fetch_gget_info_task.py
@classmethod
def create(
    cls,
    asset_id: str,
    source_df_task: Task,
    ensembl_id_col: str,
    genes_to_use: int | None = None,
    post_pipe: DataProcessingPipe = IdentityPipe(),
    out_format: OutFormat = CSVOutFormat(","),
):
    source_meta = source_df_task.meta
    meta: Meta
    extension, read_spec = get_extension_and_read_spec_from_format(out_format)
    if isinstance(source_meta, ResultTableMeta):
        meta = ResultTableMeta(
            id=asset_id,
            trait=source_meta.trait,
            project=source_meta.project,
            extension=extension,
            read_spec=read_spec,
        )
    elif isinstance(source_meta, ReferenceFileMeta):
        meta = ReferenceFileMeta(
            id=AssetId(asset_id),
            group=source_meta.group,
            sub_group=source_meta.sub_group,
            extension=extension,
            read_spec=read_spec,
            sub_folder=source_meta.sub_folder,
        )
    else:
        raise ValueError("unknown source meta type")

    return cls(
        source_df_task=source_df_task,
        ensembl_id_col=ensembl_id_col,
        meta=meta,
        genes_to_use=genes_to_use,
        post_pipe=post_pipe,
        out_format=out_format,
    )

execute

execute(scratch_dir: Path, fetch: Fetch, wf: WF) -> Asset
Source code in mecfs_bio/build_system/task/fetch_gget_info_task.py
def execute(self, scratch_dir: Path, fetch: Fetch, wf: WF) -> Asset:
    source_asset = fetch(self.source_id)
    df = (
        scan_dataframe_asset(source_asset, meta=self.source_meta)
        .collect()
        .to_pandas()
    )
    genes = [item for item in df[self.ensembl_id_col].tolist() if item is not None]
    if self.genes_to_use is not None:
        genes = genes[: self.genes_to_use]
    logger.debug(f"Using gget to retrieve info on {len(genes)} genes")
    logger.debug(f"Genes are: {genes}")
    gget_result = gget.info(genes)
    if isinstance(gget_result, pd.DataFrame):
        gene_info = gget_result
    else:
        gene_info = _dummy_gget_result
    result_df = pd.merge(
        df, gene_info, left_on=self.ensembl_id_col, right_index=True, how="left"
    )
    out_path = scratch_dir / f"{self.source_id}.csv"
    result_df = (
        self.post_pipe.process(narwhals.from_native(result_df).lazy())
        .collect()
        .to_pandas()
    )
    result_df = _preprocess_columns(result_df)
    if isinstance(self.out_format, CSVOutFormat):
        result_df.to_csv(out_path, index=False, sep=self.out_format.sep)
    elif isinstance(self.out_format, ParquetOutFormat):
        result_df.to_parquet(
            out_path,
        )
    else:
        raise ValueError(f"Unsupported output format: {self.out_format}")
    return FileAsset(out_path)
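The join step in `execute` attaches annotations to the source table with a left merge that matches a column on the left against the index on the right. The sketch below reproduces only that step with a hand-built stand-in for the gget result, so it runs without network access. The column names `ensembl_id` and `p_value` and all values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical source table: one row per gene, including a row with no ID.
source = pd.DataFrame(
    {
        "ensembl_id": ["ENSG00000141510", None, "ENSG00000012048"],
        "p_value": [1e-8, 3e-6, 5e-7],
    }
)

# Stand-in for the annotation frame. As in execute, it is indexed by
# Ensembl ID, and here it only covers one of the requested genes.
annotations = pd.DataFrame(
    {"primary_gene_name": ["TP53"]},
    index=["ENSG00000141510"],
)

# Same call shape as in execute: left column against right index.
merged = pd.merge(
    source,
    annotations,
    left_on="ensembl_id",
    right_index=True,
    how="left",
)
```

Because the merge uses `how="left"`, every source row is kept. Rows whose ID is missing or has no annotation end up with missing values in the annotation columns rather than being dropped.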