Configuring the workflow

Before running AmpSeeker, we need to configure the workflow, that is, select which analyses to run. This is done by editing the file config.yaml in the config directory. The options and parameters in this file are described below.

If you have any issues configuring the pipeline, please watch the video walkthrough first. If that does not resolve the problem, raise an issue on GitHub or email me.

Configuration Parameters

dataset: lab-strains
panel: ag-vampir

dataset: A name for your dataset. This will be used to name output files and directories.
panel: The name of the amplicon panel used. Currently, the workflow has special support for the “Ag-vampIR” panel (Anopheles gambiae vector amplicon marker panel for Insecticide Resistance). If you are not using Ag-vampIR, the panel name can be anything.

Cohort Analysis Configuration

cohort-columns:
  - location
  - taxon

cohort-columns: List of metadata columns used to group samples for analyses. These columns will be used to color samples in plots and to perform comparisons between groups. Common options include location, taxon, country, and any other categorical variables in your metadata.
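Because every entry in cohort-columns has to exist as a column in the metadata file, it can be worth checking this before launching the workflow. The following is a minimal sketch rather than part of AmpSeeker: the helper name missing_cohort_columns is hypothetical, and it assumes the metadata is a tab-separated file whose first row is the header.

```python
import csv
import io

def missing_cohort_columns(metadata_handle, cohort_columns):
    """Return the cohort columns that are absent from the metadata header."""
    reader = csv.reader(metadata_handle, delimiter="\t")
    header = next(reader, [])  # first row is assumed to be the header
    return [col for col in cohort_columns if col not in header]

# Illustrative in-memory metadata; in practice, pass open("config/metadata.tsv").
metadata_text = "sample_id\tlocation\ttaxon\ns1\tsiteA\tgambiae\n"
missing = missing_cohort_columns(
    io.StringIO(metadata_text), ["location", "taxon", "country"]
)
print(missing)  # ['country']
```

An empty list means every configured cohort column was found in the header.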

Input Files Configuration

targets: config/ag-vampir.bed
metadata: config/metadata.tsv

targets: Path to a BED file containing the genomic coordinates of amplicon targets. This file should have five columns and no header: chromosome, start, end, amplicon_id, and target_label. See input_data.ipynb for more details.
metadata: Path to a TSV file containing sample metadata. At minimum, this file should have a “sample_id” column. Additional columns can be used for cohort analysis. The metadata file is only required when working directly from FASTQ files (from-bcl: False).
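The two file layouts described above can be checked with a short script before running the pipeline. This is a hedged sketch, not AmpSeeker code: the function names check_targets and check_metadata are hypothetical, and it assumes the BED file is tab-separated with integer start and end coordinates.

```python
import csv
import io

def check_targets(handle):
    """Return a list of problems found in a header-less, 5-column BED file."""
    problems = []
    for n, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            problems.append(f"line {n}: expected 5 columns, found {len(row)}")
        elif not (row[1].isdigit() and row[2].isdigit()):
            problems.append(f"line {n}: start and end must be integers")
    return problems

def check_metadata(handle):
    """Return a list of problems found in the metadata TSV header."""
    reader = csv.DictReader(handle, delimiter="\t")
    if "sample_id" not in (reader.fieldnames or []):
        return ["metadata has no 'sample_id' column"]
    return []

# Illustrative in-memory data; in practice, pass open file handles instead.
bed_text = "2L\t100\t250\tamp_1\ttarget_a\n2L\t400\t550\tamp_2\n"
meta_text = "sample_id\tlocation\ttaxon\ns1\tsiteA\tgambiae\n"

bed_problems = check_targets(io.StringIO(bed_text))
meta_problems = check_metadata(io.StringIO(meta_text))
print(bed_problems)   # the second BED line has only 4 columns
print(meta_problems)  # no problems expected
```

Both helpers return a list of human-readable messages, so an empty list means the file matched the layout assumed here.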

Illumina Directory Configuration

illumina-dir: resources/250110_M05658_0028_000000000-LTBV4

illumina-dir: Path to the Illumina MiSeq run directory containing BCL files. This is only required if converting from BCL to FASTQ (from-bcl: True). If you already have FASTQ files, this can be left empty.

Reference Genome Configuration

reference-fasta: resources/reference/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa
reference-gff3: resources/reference/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3
reference-snpeffdb: Anopheles_gambiae
custom-snpeffdb: False

reference-fasta: Path to the reference genome FASTA file (can be gzipped with .fa.gz extension).
reference-gff3: Path to the genome annotation file in GFF3 format.
reference-snpeffdb: Name of the SnpEff database to use for variant annotation. This should match a database name available in SnpEff (e.g., “Anopheles_gambiae”).
custom-snpeffdb: Whether to build a custom SnpEff database. Set to True if the reference genome is not available in the standard SnpEff databases.

Input File Type Configuration

from-bcl: True
fastq:
  auto: True

from-bcl: Whether to convert BCL files to FASTQ files. If True, the pipeline will use the illumina-dir path to find BCL files and convert them to FASTQ format.
fastq.auto: If True, the pipeline expects FASTQ files to be in the resources/reads/ directory with the naming pattern {sample_id}_1.fastq.gz and {sample_id}_2.fastq.gz. If False, the pipeline expects the metadata file to have columns fq1 and fq2 specifying the paths to the FASTQ files.
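To make the fastq.auto naming pattern concrete, the sketch below builds the two paths expected for a sample and lists any that are missing on disk. The helper names expected_fastqs and missing_fastqs are hypothetical and not part of AmpSeeker; only the directory and file name pattern come from the description above.

```python
from pathlib import Path

def expected_fastqs(sample_id, reads_dir="resources/reads"):
    """Return the read 1 and read 2 paths implied by the naming pattern."""
    reads = Path(reads_dir)
    return (
        reads / f"{sample_id}_1.fastq.gz",
        reads / f"{sample_id}_2.fastq.gz",
    )

def missing_fastqs(sample_ids, reads_dir="resources/reads"):
    """Return every expected FASTQ path that does not exist as a file."""
    return [
        path
        for sample_id in sample_ids
        for path in expected_fastqs(sample_id, reads_dir)
        if not path.is_file()
    ]

# "sample1" is a placeholder sample_id used only for illustration.
for path in expected_fastqs("sample1"):
    print(path.as_posix())
```

Calling missing_fastqs with the sample_id values from the metadata file gives a quick list of files to locate or rename before starting a run.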

Quality Control Configuration

quality-control:
  sample-total-reads-threshold: 250
  amplicon-total-reads-threshold: 1000

  fastp: True
  coverage: True
  stats: True
  multiqc: True

quality-control.sample-total-reads-threshold: Minimum number of reads required for a sample to pass QC. Samples with fewer reads are removed at the quality control stage.
quality-control.amplicon-total-reads-threshold: Minimum number of reads required for an amplicon to be considered for analysis.
quality-control.fastp: Whether to run the fastp tool for read quality control and trimming.
quality-control.coverage: Whether to calculate and report coverage statistics for each sample.
quality-control.stats: Whether to generate alignment and variant calling statistics.
quality-control.multiqc: Whether to generate a MultiQC report aggregating various QC metrics.
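The two read thresholds act as minimum counts. The sketch below illustrates that reading with made-up numbers; it is not the pipeline's own implementation. The function name filter_by_min_reads is hypothetical, and it assumes that a count exactly equal to the threshold passes and that amplicon counts are totals across samples.

```python
def filter_by_min_reads(read_counts, threshold):
    """Keep entries whose read count is at least `threshold`."""
    return {name: n for name, n in read_counts.items() if n >= threshold}

# Made-up read counts, for illustration only.
sample_reads = {"s1": 1200, "s2": 249, "s3": 250}
amplicon_reads = {"amp_1": 5000, "amp_2": 999}

# Thresholds taken from the example configuration above.
kept_samples = filter_by_min_reads(sample_reads, threshold=250)
kept_amplicons = filter_by_min_reads(amplicon_reads, threshold=1000)

print(sorted(kept_samples))    # ['s1', 's3']
print(sorted(kept_amplicons))  # ['amp_1']
```

With these values, s2 falls one read short of the sample threshold and amp_2 one read short of the amplicon threshold, so both are dropped.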

Analysis Configuration

analysis:  
  sample-map: False # needs lat and longs in metadata/sample_sheet
  population-structure: True
  genetic-diversity: True
  allele-frequencies: True

analysis.sample-map: Whether to generate a geographic map of sample collection locations. Requires latitude and longitude columns in the metadata file.
analysis.population-structure: Whether to perform population structure analysis using principal component analysis (PCA).
analysis.genetic-diversity: Whether to calculate genetic diversity metrics (such as nucleotide diversity).
analysis.allele-frequencies: Whether to calculate and visualize allele frequencies across samples and groups.

Jupyter Book Configuration

build-jupyter-book: True

build-jupyter-book: Whether to compile all analysis notebooks into a Jupyter Book for convenient browsing of results. The book will be available at results/ampseeker-results/_build/html/index.html.

Special Analyses for Ag-vampIR Panel

When using the Ag-vampIR panel (panel: ag-vampir), the workflow automatically enables additional analyses:

Species Identification: Identifies Anopheles species using amplicon sequencing data (results in results/notebooks/ag-vampir/species-id.ipynb).
Kdr Analysis: Analyzes knockdown resistance mutations in the voltage-gated sodium channel (results in results/notebooks/ag-vampir/kdr-analysis.ipynb).

Example Configuration

Here’s a complete example configuration for reference:

# Dataset and panel information
dataset: lab-strains
panel: ag-vampir
cohort-columns:
  - location
  - taxon
  - country
targets: config/ag-vampir.bed
metadata: config/metadata.tsv

# Illumina directory (if using BCL files)
illumina-dir: resources/250110_M05658_0028_000000000-LTBV4

# Reference genome information
reference-name: AgamP4
reference-fasta: resources/reference/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa
reference-gff3: resources/reference/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3
reference-snpeffdb: Anopheles_gambiae
custom-snpeffdb: False

# Input file type options
from-bcl: True
fastq:
  auto: True

# Quality control options
quality-control:
  sample-total-reads-threshold: 250
  amplicon-total-reads-threshold: 1000
  fastp: True
  coverage: True
  stats: True
  multiqc: True

# Analysis options
analysis:  
  sample-map: False
  population-structure: True
  genetic-diversity: True
  allele-frequencies: True

# Build Jupyter book
build-jupyter-book: True
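Some options depend on each other: illumina-dir is needed when from-bcl is True, and the metadata file is needed when from-bcl is False. A minimal sketch of a consistency check for these two rules follows. It is not part of AmpSeeker, the function name check_input_settings is hypothetical, and the configuration is written as a Python dictionary standing in for the parsed config.yaml.

```python
def check_input_settings(cfg):
    """Return a list of problems with the from-bcl related settings."""
    problems = []
    if cfg.get("from-bcl"):
        # Converting from BCL: the Illumina run directory must be given.
        if not cfg.get("illumina-dir"):
            problems.append("from-bcl is True, so illumina-dir must be set")
    else:
        # Working from FASTQ files: the metadata file must be given.
        if not cfg.get("metadata"):
            problems.append("from-bcl is False, so metadata must be set")
    return problems

# Subset of the example configuration, as a parsed dictionary.
example = {
    "from-bcl": True,
    "illumina-dir": "resources/250110_M05658_0028_000000000-LTBV4",
    "metadata": "config/metadata.tsv",
    "fastq": {"auto": True},
}
# Same settings, but with the run directory left empty.
broken = {**example, "illumina-dir": ""}

print(check_input_settings(example))
print(check_input_settings(broken))
```

The first call returns an empty list and the second reports the missing illumina-dir. To run the check on a real file, the dictionary could be loaded with a YAML parser instead of being written by hand.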