mecfs_bio.build_system.task.gwaslab.gwaslab_create_sumstats_task
Task to read a DataFrame from disk and process it using the GWASLab pipeline.
Classes:
- GWASLabColumnSpecifiers
- GWASLabCreateSumstatsTask – Task that processes a DataFrame of GWAS summary statistics using the GWASLab pipeline.
- GWASLabVCFRef
- GwasLabTransformSpec
- HarmonizationOptions – Options for the call to GWASLab's harmonize function.
Functions:
Attributes:
ValidGwaslabFormat
module-attribute
GWASLabColumnSpecifiers
Methods:
Attributes:
- OR (str | None)
- beta (str | None)
- chi_sq (str | None)
- chrom (str | None)
- ea (str | None)
- eaf (str | None)
- info (str | None)
- mlog10p (str | None)
- n (str | None)
- ncase (str | None)
- ncontrol (str | None)
- nea (str | None)
- neaf (str | None)
- neff (str | None)
- or_95l (str | None)
- or_95u (str | None)
- p (str | None)
- pos (str | None)
- rsid (str | None)
- se (str | None)
- snpid (str | None)
get_selection_pipe
GWASLabCreateSumstatsTask
Bases: Task
Task that processes a DataFrame of GWAS summary statistics using the GWASLab pipeline. See: https://cloufield.github.io/gwaslab/SumstatsObject/
Methods:
- execute
Attributes:
- basic_check (bool)
- deps (list[Task])
- drop_col_list (Sequence[str])
- exclude_hla (bool)
- exclude_sexchr (bool)
- filter_hapmap3 (bool)
- filter_indels (bool)
- filter_palindromic (bool)
- fmt (GwaslabKnownFormat | GWASLabColumnSpecifiers)
- genome_build (GenomeBuildMode)
- harmonize_options (HarmonizationOptions | None)
- liftover_to (GenomeBuild | None)
- meta (Meta)
- pre_pipe (DataProcessingPipe)
fmt
class-attribute
instance-attribute
harmonize_options
class-attribute
instance-attribute
execute
Source code in mecfs_bio/build_system/task/gwaslab/gwaslab_create_sumstats_task.py
GWASLabVCFRef
Attributes:
- extra_downloads (Sequence[str])
- name (str)
- ref_alt_freq (str)
GwasLabTransformSpec
Attributes:
- basic_check (bool)
- exclude_hla (bool)
- exclude_sexchr (bool)
- filter_hapmap3 (bool)
- filter_indels (bool)
- filter_palindromic (bool)
- genome_build (GenomeBuildMode)
- harmonize_options (HarmonizationOptions | None)
- liftover_to (GenomeBuild | None)
harmonize_options
class-attribute
instance-attribute
HarmonizationOptions
Options for the call to GWASLab's harmonize function.
gwaslab's harmonization function changes the status codes in the STATUS column. These status codes are described here: https://cloufield.github.io/gwaslab/StatusCode/
Below I explain some points that were initially not clear to me from the gwaslab documentation.
Two reference files are used in harmonization:
- A VCF file (ref_infer). This is basically a table of genetic variants.
In some cases, this table is in dbSNP VCF format. In this case, each row describes a given genetic variant. Sometimes this description includes allele frequency.
In other cases (such as when using 1000 Genomes reference data), this table is in genotype VCF format. Here the rows of the VCF file correspond to variants, and the columns correspond to individuals (from the 1000 Genomes Project, for example). For each individual and each variant, the table tells us whether that individual carries that variant. Variant frequency information can be calculated from this individual-level genotype data.
- A FASTA file (ref_seq). This is a consensus human genome sequence. Here is an example of some rows from the hg19 FASTA file:
    TAAGTTTTGTCTGGTAATAAAGGTATATTTTCAAAAGAGAGGTAAATAGA
    TCCACATACTGTGGAGGGAATAAAATACTTTTTGAAAAACAAACAACAAG
    TTGGATTTTTAGACACATAGAAATTGAATATGTACATTTATAAATATTTT
    TGGATTGAACTATTTCAAAATTATACCATAAAATAACTTGTAAAAATGTA
    GGCAAAATGTATATAATTATGGCATGAGGTATGCAACTTTAGGCAAGGAA
    GCAAAAGCAGAAACCATGAAAAAAGTCTAAATTTTACCATATTGAATTTA
    AATTTTCAAAAACAAAAATAAAGACAAAGTGGGAAAAATATGTATGCTTC
    ATGTGTGACAAGCCACTGATACCTATTAAATATGAAGAATATTATAAATC
    ATATCAATAACCACAACATTCAAGCTGTCAGTTTGAATAGACaatgtaaa
    tgacaaaactacatactcaacaagataacagcaaaccagcttcgacagca
    cgttaaaggggtcatacaacataatcgagtagaatttatctctgagatgc
    aagaatggttcaaaatatggaaaccaataaatgtgatatgccacactaac
    agaataaaaaataaaaatcatattatcatctcaatagatgcagaaaaagc
    attaacaaaagtaaacattctttcataataagacatcagataaaacaaat
    taggaatagaaggaatgtaccgcaacacaataaaggccatatataacaag
    cccacagctaacatcataatagtaaaatcatcacactggtaaaaaaaatg
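To make the genotype-VCF case above concrete, here is a minimal sketch of how an ALT allele frequency could be computed from per-individual genotype calls. It is not taken from gwaslab; the function name `alt_allele_frequency` and the example genotypes are hypothetical, and it assumes a biallelic variant with VCF-style GT strings such as "0|1".

```python
def alt_allele_frequency(genotypes):
    """Return the ALT allele frequency for one biallelic variant.

    `genotypes` holds one VCF GT string per individual, e.g. "0|1" or "1/1".
    Allele "0" is REF and "1" is ALT; "." marks a missing call and is skipped.
    Returns None when no allele was called at all.
    """
    alt_count = 0
    called = 0
    for gt in genotypes:
        # Treat phased ("|") and unphased ("/") separators the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            called += 1
            if allele == "1":
                alt_count += 1
    if called == 0:
        return None
    return alt_count / called


# Hypothetical genotype calls for four individuals at a single variant.
example_genotypes = ["0|0", "0|1", "1|1", "./."]
example_frequency = alt_allele_frequency(example_genotypes)
```

With these made-up calls, three of the six called alleles are ALT, so the frequency is 0.5; the individual with a missing call contributes nothing to either count.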
gwaslab uses these two reference files to harmonize summary statistics. These two reference files each affect a different digit of the gwaslab STATUS code column.
- Digit 7 of the status code is determined by whether gwaslab can find the variant in the reference VCF (ref_infer); see https://github.com/Cloufield/gwaslab/blob/d639b67c5264b1ac7ec89e284e638f2c8454ac48/src/gwaslab/hm/hm_harmonize_sumstats.py#L1521-L1530 A value of 7 or 8 here means either that the variant is palindromic and the reference VCF could not be used to disambiguate its strand, or that the variant was not found in the reference VCF.
- Digit 6 of the status code is instead determined by whether gwaslab can find the variant in the reference genome FASTA file (ref_seq); see https://github.com/Cloufield/gwaslab/blob/d639b67c5264b1ac7ec89e284e638f2c8454ac48/src/gwaslab/hm/hm_harmonize_sumstats.py#L968-L975 A value of 8 means that the variant could not be found in the FASTA file.
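Why palindromic variants need a reference at all can be illustrated with a short sketch. This is not gwaslab's implementation; the helper names `is_palindromic` and `on_other_strand` are hypothetical and only cover single-nucleotide alleles.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(ea, nea):
    """True if the two SNP alleles are complements of each other (A/T or C/G)."""
    ea, nea = ea.upper(), nea.upper()
    if ea not in COMPLEMENT or nea not in COMPLEMENT:
        # Indels and non-ACGT alleles are outside the scope of this sketch.
        return False
    return COMPLEMENT[ea] == nea


def on_other_strand(ea, nea):
    """Return the allele pair as it would be reported on the opposite strand."""
    return COMPLEMENT[ea.upper()], COMPLEMENT[nea.upper()]
```

For an A/G SNP, `on_other_strand("A", "G")` gives `("T", "C")`, a pair that cannot be confused with the original, so a strand mismatch is visible from the alleles alone. For an A/T SNP it gives `("T", "A")`, the same two letters swapped, so a strand flip looks identical to an effect-allele swap. Extra information, such as allele frequencies from the reference VCF, is then needed to tell the two apart.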
Set drop_missing_from_ref_seq to drop variants based on digit 6. Set drop_missing_from_ref_infer_or_ambiguous to drop variants based on digit 7.
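The effect of these two options can be sketched as a filter on the STATUS string. This is an illustration only, not the task's actual code: the helper names are hypothetical, the example codes are made up, and it assumes STATUS is a 7-character string in which digit N is simply the N-th character.

```python
def status_digit(status, digit):
    """Return the 1-based `digit` of a STATUS code as a single character."""
    return str(status)[digit - 1]


def is_missing_from_ref_seq(status):
    # Digit 6 == 8: variant could not be found in the FASTA file (ref_seq).
    return status_digit(status, 6) == "8"


def is_missing_or_ambiguous_in_ref_infer(status):
    # Digit 7 in {7, 8}: strand-ambiguous palindromic variant, or variant
    # not found in the reference VCF (ref_infer).
    return status_digit(status, 7) in {"7", "8"}


def keep_variant(
    status,
    drop_missing_from_ref_seq=False,
    drop_missing_from_ref_infer_or_ambiguous=False,
):
    """Decide whether a variant survives the two optional drop filters."""
    if drop_missing_from_ref_seq and is_missing_from_ref_seq(status):
        return False
    if (
        drop_missing_from_ref_infer_or_ambiguous
        and is_missing_or_ambiguous_in_ref_infer(status)
    ):
        return False
    return True


# Made-up STATUS codes, chosen only to exercise digits 6 and 7.
example_statuses = ["1960000", "1960080", "1960007"]
kept = [s for s in example_statuses if keep_variant(s, True, True)]
```

With both options enabled, only the first made-up code survives: the second has 8 in digit 6 and the third has 7 in digit 7.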
Attributes:
- check_ref_files (bool)
- cores (int)
- drop_missing_from_ref_infer_or_ambiguous (bool)
- drop_missing_from_ref_seq (bool)
- ref_infer (GWASLabVCFRef)
- ref_seq (str)
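To show how these attributes fit together, the sketch below uses stand-in dataclasses that mirror only the attribute names and types listed above. The classes `SketchVCFRef` and `SketchHarmonizationOptions` are hypothetical and are not the real `GWASLabVCFRef` or `HarmonizationOptions`; the real constructors, defaults, and accepted values may differ, and every value shown is a placeholder.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SketchVCFRef:
    """Stand-in mirroring the attributes listed for GWASLabVCFRef."""
    name: str                      # identifier of the reference VCF
    ref_alt_freq: str              # VCF INFO field holding the ALT allele frequency
    extra_downloads: Sequence[str]


@dataclass(frozen=True)
class SketchHarmonizationOptions:
    """Stand-in mirroring the attributes listed for HarmonizationOptions."""
    ref_infer: SketchVCFRef
    ref_seq: str
    cores: int
    check_ref_files: bool
    drop_missing_from_ref_seq: bool
    drop_missing_from_ref_infer_or_ambiguous: bool


# Placeholder values; substitute whatever references your setup actually uses.
options = SketchHarmonizationOptions(
    ref_infer=SketchVCFRef(
        name="my_reference_vcf",
        ref_alt_freq="AF",
        extra_downloads=(),
    ),
    ref_seq="my_reference_fasta",
    cores=4,
    check_ref_files=True,
    drop_missing_from_ref_seq=True,                  # drop on digit 6
    drop_missing_from_ref_infer_or_ambiguous=True,   # drop on digit 7
)
```

The point of the sketch is the pairing: ref_seq goes with drop_missing_from_ref_seq (digit 6), and ref_infer goes with drop_missing_from_ref_infer_or_ambiguous (digit 7).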