Configuring the workflow#
Before running the RNA-Seq-Pop, we need to select which analyses we want to run (configuration). This is done by editing the file config.yaml
in the config
directory. The config file contains a number of options and parameters, which we describe below.
The configuration file looks like the following:
metadata: config/samples.tsv
dataset: 'Ag_Bouake'
fastq:
auto: True
paired: True
metadata
Path to the sample metadata file, usually located in the config
directory.
dataset
Name for the dataset, which will be used to name some output files.
fastq
auto: If true, fastq files are in resources/reads/ and following the {sampleID_1.fastq.gz}
pattern. If False, there are fq1 and fq2 columns pointing to the reads files for each sample in the sample metadata file. If reads are stored in the resources/reads
directory, the user must include this prefix in the fq1 and fq2 paths. The fq1 and fq2 columns are ignored if auto is True.
paired: If True, the reads are paired-end. If False, the reads are single-end, and there is only an fq1 column in the sample metadata file.
contigs: ['2L', '2R', '3L', '3R', 'X']
reference:
genome:
"resources/reference/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa"
transcriptome:
"resources/reference/Anopheles-gambiae-PEST_TRANSCRIPTS_AgamP4.12.fa"
gff:
"resources/reference/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3"
snpeffdb:
"Anopheles_gambiae"
genes2transcripts:
"resources/exampleGene2TranscriptMap.tsv"
contigs
A list of contigs you wish to analyse. Must match the reference genome files.
Reference files
genome: Path to the genome (.fa)
transcriptome: Path to the transcriptome (.fa)
gff: Path to the genome feature file (.gff3)
genes2transcripts: Path to the gene to transcript map (.tsv)
contrasts:
- 'Kisumu_gambiaeCont'
- 'gambiaeCont_gambiaePM'
QualityControl:
cutadapt:
activate: False
adaptors:
fastqc:
activate: True
mosdepth:
activate: True
multiqc:
activate: True
contrasts
A list of the pairwise contrasts you wish to run, from values in the treatment column of the sample metadata file. The format is control_case
, or susceptible_resistant
.
QualityControl
cutadapt: Trim reads with cutadapt
adaptors: adaptors to be trimmed
fastqc: activate: Run fastqc on read data
mosdepth: activate: Run mosdepth on read data
multiqc: activate: Run multiqc on read data
DifferentialExpression:
activate: True # Activate differential expression analyses
venn:
activate: True
padj_threshold: 0.05
groups: ['gambiaeCont_gambiaePM', 'Kisumu_gambiaeCont']
progressiveGenes:
activate: True
padj_threshold: 0.05
fc_threshold: 1
groups: "Kisumu_gambiaeCont_gambiaePM"
GSEA: # Activate fgsea Gene set enrichment analysis
activate: True
gaf: "resources/reference/VectorBase-50_AgambiaePEST_GO.gaf"
Differential Expression
activate: Run differential expression analyses
Venn
activate: Generate venn diagrams of DE expressed genes
padj_threshold: threshold to call significance
groups: string or list of strings to indicate the DE comparisons we wish to make a venn for.
ProgressiveGenes
activate: Run progressiveGenes analysis. Will find genes that are consistently up or downregulated across two comparisons.
padj_threshold: threshold to call significance
fc_threshold: threshold to call overexpression
groups: String or list of strings, which indicate the 3 populations and 2 DE comparisons to be compared. In this example - which genes upregulated in Kisumu v gambiaeCont, are also upregulated in gambiaeCont v gambiaePM.
gsea
activate: run hypergeometric test enrichment analysis.
gaf: gaf file with go annotations.
VariantAnalysis:
activate: True
ploidy: 10
chunks: 9
pca: # Run PCA on the genotype data?
activate: True
missingness: 1
summaryStatistics: # Estimate Population Genetic Summary Statistics such as Dxy, Pi
activate: True
missingness: 1
selection: # Calculate Fst and PBS per gene and in windows
activate: True
missingness: 1
pbs:
activate: True
contrasts:
- 'gambiaePM_gambiaeCont_Kisumu'
VariantAnalysis
activate : Perform genome alignment and variant calling.
ploidy : the ploidy of our samples to call variants at. If you have pooled data, the ploidy should be n_samples per pool * organism ploidy.
chunks : The number of chunks to split each chromosome into to parallelise variant calling.
pca
activate: Run PCA on the SNP data
missingness: a filter - the proportion of samples that have data at a given allele. 1 means a genomic position can have no missing calls across all samples.
summaryStatistics
activate: Run genetic diversity analyses.
missingness: a filter - the proportion of samples that have data at a given allele. 1 means a genomic position can have no missing calls across all samples.
selection
activate: Run Fst analysis
missingness: a filter - the proportion of samples that have data at a given allele. 1 means a genomic position can have no missing calls across all samples.
pbs
activate: Run population branch statistic analysis. Requires 3 suitable treatment groups, two closely related and an outgroup. For resistance, do survivors_unexposed_susceptible.
contrasts: List of the 3 groups as strings separated by an underscore.
ancestry:
activate: True
missingness: 0.5 # proportion between 0 and 1
gambcolu: "resources/gamb_vs_colu.zarr" # path to gambcolu AIMs
arab: "resources/gambcolu_vs_arab.zarr" # path to arab AIMs
karyotype:
activate: True
inversions:
- "2La"
- "2Rb"
ancestry (An. gambiae only)
activate: If True, run AIM analysis.
missingness: a filter - the proportion of samples that have data at a given allele. 1 means a genomic position can have no missing calls across all samples.
gambcolu: path to gamb_colu aims
arab: path to gambcolu_vs_arabiensis aims
karyotype (An. gambiae only)
activate: If True, run karyotyping module.
inversions: A list of inversions in compKaryo to analyse.
miscellaneous:
VariantsOfInterest:
activate: True
path: "resources/exampleMutations.tsv"
GeneFamiliesHeatmap:
activate: True
eggnog: resources/Anogam_long.pep_eggnog_diamond.emapper.annotations.GO
pfam: resources/Anogam_long.pep_Pfamscan.seqs
sweeps:
activate: True
padj_threshold: 0.05
fc_threshold: 1.5
VariantsOfInterest
activate: If True, run variants of interest analysis
path: Path to variants of interest data
GeneFamiliesHeatmap
activate: Run heatmaps on gene families using go terms and pfam domains.
eggnog: path to eggnog file with go annotations for your organism (produced by eggnog-mapper).
pfam: path to pfam domains for your organism.
sweeps (An. gambiae only)
activate: Run analysis to determine if DE genes lie under known selective sweep loci.
padj_threshold: threshold to call significance
fc_threshold: threshold to call overexpression
If you have any issues configuring the pipeline, please watch the video walkthrough first, and raise an issue on github or email me.