mecfs_bio.build_system.task.fetch_gget_info_task

Task that uses gget to annotate gene lists with information from genetics databases.

Classes:

FetchGGetInfoTask –

Task that uses gget (https://github.com/pachterlab/gget) to retrieve database information about the genes listed in a dataframe.

Attributes:

PRIMARY_GENE_NAME –
PROTEIN_NAMES_COL –
SUBCELLULAR_LOCALISATION_COL –
UNIPROT_DESCRIPTION –
UNIPROT_ID_COL –
logger –

PRIMARY_GENE_NAME `module-attribute`

PRIMARY_GENE_NAME = 'primary_gene_name'

PROTEIN_NAMES_COL `module-attribute`

PROTEIN_NAMES_COL = 'protein_names'

SUBCELLULAR_LOCALISATION_COL `module-attribute`

SUBCELLULAR_LOCALISATION_COL = 'subcellular_localisation'

UNIPROT_DESCRIPTION `module-attribute`

UNIPROT_DESCRIPTION = 'uniprot_description'

UNIPROT_ID_COL `module-attribute`

UNIPROT_ID_COL = 'uniprot_id'

logger `module-attribute`

logger = getLogger()

FetchGGetInfoTask

Bases: Task

Task that uses gget (https://github.com/pachterlab/gget) to retrieve database information about the genes listed in a dataframe. Useful for analyzing GWAS results.

Listen to an interview with the primary developer of gget here: https://podcasts.apple.com/nz/podcast/99-laura-luebbert-gget-hunting-viruses-and/id1534473511?i=1000664104787

gget sometimes returns dataframes with inconsistent formatting: for example, a column may hold lists in some rows and single values in others. This module therefore also contains functions that munge the output of gget into a more consistent format.
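The kind of clean-up described above can be sketched in a few lines. This is a minimal illustration, not the module's own `_preprocess_columns` (whose implementation is not shown on this page): the helper names `to_list` and `to_joined_string`, the `;` separator, and the placeholder identifiers are all assumptions made for the example.

```python
import math
from typing import Any


def to_list(value: Any) -> list:
    """Coerce a cell that may be a list, a single value, or missing into a list."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        # Missing values often show up as NaN in dataframes.
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def to_joined_string(value: Any, sep: str = ";") -> str:
    """Flatten a cell into one delimited string, which is safe to write to CSV."""
    return sep.join(str(item) for item in to_list(value))


# A column that is partly lists, partly single values, and partly missing.
# The identifiers are placeholders, not real database accessions.
mixed_column = [["ID_A", "ID_B"], "ID_C", None, float("nan")]
normalized = [to_joined_string(cell) for cell in mixed_column]
```

Applied to a dataframe column, a helper like this could be passed to `Series.map`, so that every cell ends up with the same type before the table is written out.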

Methods:

create –
execute –

Attributes:

deps (list[Task]) –
ensembl_id_col (str) –
genes_to_use (int | None) –
meta (Meta) –
out_format (OutFormat) –
post_pipe (DataProcessingPipe) –
source_df_task (Task) –
source_id (AssetId) –
source_meta (Meta) –

deps `property`

deps: list[Task]

ensembl_id_col `instance-attribute`

ensembl_id_col: str

genes_to_use `class-attribute` `instance-attribute`

genes_to_use: int | None = None

meta `property`

meta: Meta

out_format `class-attribute` `instance-attribute`

out_format: OutFormat = CSVOutFormat(',')

post_pipe `class-attribute` `instance-attribute`

post_pipe: DataProcessingPipe = IdentityPipe()

source_df_task `instance-attribute`

source_df_task: Task

source_id `property`

source_id: AssetId

source_meta `property`

source_meta: Meta

create `classmethod`

create(
    asset_id: str,
    source_df_task: Task,
    ensembl_id_col: str,
    genes_to_use: int | None = None,
    post_pipe: DataProcessingPipe = IdentityPipe(),
    out_format: OutFormat = CSVOutFormat(","),
)

Source code in mecfs_bio/build_system/task/fetch_gget_info_task.py

@classmethod
def create(
    cls,
    asset_id: str,
    source_df_task: Task,
    ensembl_id_col: str,
    genes_to_use: int | None = None,
    post_pipe: DataProcessingPipe = IdentityPipe(),
    out_format: OutFormat = CSVOutFormat(","),
):
    source_meta = source_df_task.meta
    meta: Meta
    extension, read_spec = get_extension_and_read_spec_from_format(out_format)
    if isinstance(source_meta, ResultTableMeta):
        meta = ResultTableMeta(
            id=asset_id,
            trait=source_meta.trait,
            project=source_meta.project,
            extension=extension,
            read_spec=read_spec,
        )
    elif isinstance(source_meta, ReferenceFileMeta):
        meta = ReferenceFileMeta(
            id=AssetId(asset_id),
            group=source_meta.group,
            sub_group=source_meta.sub_group,
            extension=extension,
            read_spec=read_spec,
            sub_folder=source_meta.sub_folder,
        )
    else:
        raise ValueError("unknown source meta type")

    return cls(
        source_df_task=source_df_task,
        ensembl_id_col=ensembl_id_col,
        meta=meta,
        genes_to_use=genes_to_use,
        post_pipe=post_pipe,
        out_format=out_format,
    )
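
The branching in `create` builds output metadata of the same kind as the source task's metadata and rejects anything else. The pattern can be sketched with simplified stand-ins; `TableMeta`, `RefMeta`, and `derive_meta` below are hypothetical names for this illustration and carry fewer fields than the real `ResultTableMeta` and `ReferenceFileMeta`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMeta:
    """Hypothetical stand-in for ResultTableMeta."""

    id: str
    trait: str
    project: str
    extension: str


@dataclass(frozen=True)
class RefMeta:
    """Hypothetical stand-in for ReferenceFileMeta."""

    id: str
    group: str
    sub_group: str
    extension: str


def derive_meta(source_meta: object, asset_id: str, extension: str):
    """Copy the descriptive fields of the source meta onto a new asset id."""
    if isinstance(source_meta, TableMeta):
        return TableMeta(
            id=asset_id,
            trait=source_meta.trait,
            project=source_meta.project,
            extension=extension,
        )
    if isinstance(source_meta, RefMeta):
        return RefMeta(
            id=asset_id,
            group=source_meta.group,
            sub_group=source_meta.sub_group,
            extension=extension,
        )
    # Mirrors the ValueError raised by create for unsupported meta types.
    raise ValueError("unknown source meta type")


source = TableMeta(id="gene_hits", trait="trait_x", project="project_y", extension=".parquet")
derived = derive_meta(source, asset_id="gene_hits_annotated", extension=".csv")
```

The derived metadata keeps the source's grouping fields and changes only the asset id and the file extension, which is why the annotated table is filed alongside the table it was built from.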

execute

execute(scratch_dir: Path, fetch: Fetch, wf: WF) -> Asset

Source code in mecfs_bio/build_system/task/fetch_gget_info_task.py

def execute(self, scratch_dir: Path, fetch: Fetch, wf: WF) -> Asset:
    source_asset = fetch(self.source_id)
    df = (
        scan_dataframe_asset(source_asset, meta=self.source_meta)
        .collect()
        .to_pandas()
    )
    genes = [item for item in df[self.ensembl_id_col].tolist() if item is not None]
    if self.genes_to_use is not None:
        genes = genes[: self.genes_to_use]
    logger.debug(f"Using gget to retrieve info on {len(genes)} genes")
    logger.debug(f"Genes are: {genes}")
    gget_result = gget.info(genes)
    if isinstance(gget_result, pd.DataFrame):
        gene_info = gget_result
    else:
        gene_info = _dummy_gget_result
    result_df = pd.merge(
        df, gene_info, left_on=self.ensembl_id_col, right_index=True, how="left"
    )
    out_path = scratch_dir / f"{self.source_id}.csv"
    result_df = (
        self.post_pipe.process(narwhals.from_native(result_df).lazy())
        .collect()
        .to_pandas()
    )
    result_df = _preprocess_columns(result_df)
    if isinstance(self.out_format, CSVOutFormat):
        result_df.to_csv(out_path, index=False, sep=self.out_format.sep)
    elif isinstance(self.out_format, ParquetOutFormat):
        result_df.to_parquet(
            out_path,
        )
    else:
        raise ValueError(f"Unsupported output format: {self.out_format}")
    return FileAsset(out_path)
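
The merge step in `execute` joins the source table's ID column against the index of the annotation table, using a left join so that every source row is kept. A small sketch of that join with made-up data follows; the column name `ensembl_id`, the placeholder IDs, and the one-row `annotations` frame are assumptions for illustration, not real gget output.

```python
import pandas as pd

# Source table: one row per gene hit. The last row has no Ensembl ID.
source = pd.DataFrame(
    {
        "ensembl_id": ["ENSG_A", "ENSG_B", None],
        "p_value": [1e-8, 3e-6, 0.02],
    }
)

# Stand-in for the annotation table: indexed by Ensembl ID, and covering
# only one of the requested genes.
annotations = pd.DataFrame(
    {"primary_gene_name": ["GENE_A"]},
    index=["ENSG_A"],
)

# Same call shape as in execute: the key column on the left, the index on
# the right, and how="left" so unmatched source rows are kept.
merged = pd.merge(
    source,
    annotations,
    left_on="ensembl_id",
    right_index=True,
    how="left",
)
```

Rows without a matching annotation, including rows whose ID is missing, stay in the result with empty annotation columns, so the output always has as many rows as the source table.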
