Configuring the workflow

Configuring the workflow#

Before running the RNA-Seq-Pop, we need to select which analyses we want to run (configuration). This is done by editing the file config.yaml in the config directory. The config file contains a number of options and parameters, which we describe below.

The configuration file looks like the following:

metadata: config/samples.tsv                    

dataset: 'Ag_Bouake'                            

fastq:
      auto: True   
      paired: True                                       

metadata
Path to the sample metadata file, usually located in the config directory.
dataset
Name for the dataset, which will be used to name some output files.
fastq
auto: If true, fastq files are in resources/reads/ and following the {sampleID_1.fastq.gz} pattern. If False, there are fq1 and fq2 columns pointing to the reads files for each sample in the sample metadata file. If reads are stored in the resources/reads directory, the user must include this prefix in the fq1 and fq2 paths. The fq1 and fq2 columns are ignored if auto is True.
paired: If True, the reads are paired-end. If False, the reads are single-end, and there is only an fq1 column in the sample metadata file.

contigs: ['2L', '2R', '3L', '3R', 'X']

reference:
  genome:
        "resources/reference/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa"       
  transcriptome:
        "resources/reference/Anopheles-gambiae-PEST_TRANSCRIPTS_AgamP4.12.fa"      
  gff:
        "resources/reference/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3"  
  snpeffdb:
        "Anopheles_gambiae"                                                       
  genes2transcripts:                                                                
        "resources/exampleGene2TranscriptMap.tsv"                                 

contigs
A list of contigs you wish to analyse. Must match the reference genome files.

Reference files
genome: Path to the genome (.fa)
transcriptome: Path to the transcriptome (.fa)
gff: Path to the genome feature file (.gff3)
genes2transcripts: Path to the gene to transcript map (.tsv)

contrasts:
      - 'Kisumu_gambiaeCont'
      - 'gambiaeCont_gambiaePM'


QualityControl:
      cutadapt:
            activate: False
            adaptors:
      fastqc:                               
            activate: True
      mosdepth:
            activate: True
      multiqc:
            activate: True

contrasts
A list of the pairwise contrasts you wish to run, from values in the treatment column of the sample metadata file. The format is control_case, or susceptible_resistant.

QualityControl
cutadapt: Trim reads with cutadapt
adaptors: adaptors to be trimmed

fastqc: activate: Run fastqc on read data
mosdepth: activate: Run mosdepth on read data
multiqc: activate: Run multiqc on read data

DifferentialExpression:
      activate: True                          # Activate differential expression analyses

      venn:
            activate: True
            padj_threshold: 0.05
            groups: ['gambiaeCont_gambiaePM', 'Kisumu_gambiaeCont']

      progressiveGenes:                       
            activate: True                    
            padj_threshold: 0.05
            fc_threshold: 1
            groups: "Kisumu_gambiaeCont_gambiaePM"

      GSEA:                                 # Activate fgsea Gene set enrichment analysis 
            activate: True
            gaf: "resources/reference/VectorBase-50_AgambiaePEST_GO.gaf"

Differential Expression
activate: Run differential expression analyses

Venn
activate: Generate venn diagrams of DE expressed genes
padj_threshold: threshold to call significance
groups: string or list of strings to indicate the DE comparisons we wish to make a venn for.

ProgressiveGenes
activate: Run progressiveGenes analysis. Will find genes that are consistently up or downregulated across two comparisons.
padj_threshold: threshold to call significance
fc_threshold: threshold to call overexpression
groups: String or list of strings, which indicate the 3 populations and 2 DE comparisons to be compared. In this example - which genes upregulated in Kisumu v gambiaeCont, are also upregulated in gambiaeCont v gambiaePM.

gsea
activate: run hypergeometric test enrichment analysis.
gaf: gaf file with go annotations.

VariantAnalysis:
      activate: True
      ploidy: 10                          
      chunks: 9                           

      pca:                                # Run PCA on the genotype data?
            activate: True
            missingness: 1 
      
      summaryStatistics:                  # Estimate Population Genetic Summary Statistics such as Dxy, Pi
            activate: True
            missingness: 1
      
      selection:                          # Calculate Fst and PBS per gene and in windows
            activate: True
            missingness: 1

            pbs:
                  activate: True          
                  contrasts: 
                        - 'gambiaePM_gambiaeCont_Kisumu'

VariantAnalysis
activate : Perform genome alignment and variant calling.
ploidy : the ploidy of our samples to call variants at. If you have pooled data, the ploidy should be n_samples per pool * organism ploidy.
chunks : The number of chunks to split each chromosome into to parallelise variant calling.

pca
activate: Run PCA on the SNP data
missingness: a filter - the proportion of samples that have data at a given allele. 1 means a genomic position can have no missing calls across all samples.

summaryStatistics
activate: Run genetic diversity analyses.
missingness: a filter - the proportion of samples that have data at a given allele. 1 means a genomic position can have no missing calls across all samples.

selection
activate: Run Fst analysis
missingness: a filter - the proportion of samples that have data at a given allele. 1 means a genomic position can have no missing calls across all samples.

pbs
activate: Run population branch statistic analysis. Requires 3 suitable treatment groups, two closely related and an outgroup. For resistance, do survivors_unexposed_susceptible.
contrasts: List of the 3 groups as strings separated by an underscore.

      ancestry:                                
            activate: True
            missingness: 0.5                                # proportion between 0 and 1 
            gambcolu: "resources/gamb_vs_colu.zarr"         # path to gambcolu AIMs
            arab: "resources/gambcolu_vs_arab.zarr"         # path to arab AIMs

      karyotype:                            
            activate: True
            inversions:
                  - "2La"
                  - "2Rb"

ancestry (An. gambiae only)
activate: If True, run AIM analysis.
missingness: a filter - the proportion of samples that have data at a given allele. 1 means a genomic position can have no missing calls across all samples.
gambcolu: path to gamb_colu aims
arab: path to gambcolu_vs_arabiensis aims

karyotype (An. gambiae only)
activate: If True, run karyotyping module.
inversions: A list of inversions in compKaryo to analyse.

miscellaneous:
      VariantsOfInterest:             
            activate: True                             
            path: "resources/exampleMutations.tsv"         
      
      GeneFamiliesHeatmap:
            activate: True
            eggnog: resources/Anogam_long.pep_eggnog_diamond.emapper.annotations.GO
            pfam: resources/Anogam_long.pep_Pfamscan.seqs

      sweeps:
            activate: True
            padj_threshold: 0.05
            fc_threshold: 1.5

VariantsOfInterest
activate: If True, run variants of interest analysis
path: Path to variants of interest data

GeneFamiliesHeatmap
activate: Run heatmaps on gene families using go terms and pfam domains.
eggnog: path to eggnog file with go annotations for your organism (produced by eggnog-mapper).
pfam: path to pfam domains for your organism.

sweeps (An. gambiae only)
activate: Run analysis to determine if DE genes lie under known selective sweep loci.
padj_threshold: threshold to call significance
fc_threshold: threshold to call overexpression

If you have any issues configuring the pipeline, please watch the video walkthrough first, and raise an issue on github or email me.