Configuring the workflow
Before running AmpSeeker, we need to select which analyses to run. This is done by editing the file config.yaml in the config directory. The config file contains a number of options and parameters, described below.
If you have any issues configuring the pipeline, please watch the video walkthrough first. If that does not resolve them, raise an issue on GitHub or email me.
Configuration Parameters
dataset: lab-strains
panel: ag-vampir
dataset: A name for your dataset. This will be used to name output files and directories.
panel: The name of the amplicon panel used. Currently, the workflow has special support for the “Ag-vampIR” panel (Anopheles gambiae vector amplicon marker panel for Insecticide Resistance). If you are not using Ag-vampIR, the panel name can be anything.
Cohort Analysis Configuration
cohort-columns:
- location
- taxon
cohort-columns: List of metadata columns used to group samples for analyses. These columns will be used to color samples in plots and to perform comparisons between groups. Common options include location, taxon, country, and any other categorical variables in your metadata.
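Because cohort-columns refers to columns in the metadata file, it can help to check that every listed column is actually present before running the workflow. The sketch below is not part of AmpSeeker: the function name and the example metadata content are hypothetical, and it assumes a tab-separated file with a single header row.

```python
import csv
import io


def missing_cohort_columns(metadata_tsv: str, cohort_columns: list[str]) -> list[str]:
    """Return the cohort columns that are absent from the metadata header."""
    reader = csv.reader(io.StringIO(metadata_tsv), delimiter="\t")
    header = next(reader, [])  # first row is assumed to be the header
    return [col for col in cohort_columns if col not in header]


# Hypothetical metadata content, for illustration only.
example_metadata = "sample_id\tlocation\ttaxon\nS1\tSiteA\tgambiae\n"

# "country" is requested as a cohort column but is not in the header.
print(missing_cohort_columns(example_metadata, ["location", "taxon", "country"]))
```

For a real file, the text passed to the function would simply be the contents of the path given in the metadata option.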
Input Files Configuration
targets: config/ag-vampir.bed
metadata: config/metadata.tsv
targets: Path to a BED file containing the genomic coordinates of amplicon targets. This file should have 5 columns (chromosome, start, end, amplicon_id, and target_label) and no header. See input_data.ipynb for more details.
metadata: Path to a TSV file containing sample metadata. At minimum, this file should have a “sample_id” column. Additional columns can be used for cohort analysis. The metadata file is only required if working directly from FASTQ files (from-bcl: False).
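The 5-column, header-free layout of the targets file can be checked with a few lines of Python. This is a minimal sketch rather than anything AmpSeeker itself runs: the function name is hypothetical, the rows in the usage example are made up, and it assumes the columns are tab-separated.

```python
def check_bed_lines(lines):
    """Return (line_number, message) tuples for rows that break the 5-column layout."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            problems.append((number, f"expected 5 columns, found {len(fields)}"))
            continue
        start, end = fields[1], fields[2]
        # A header row (e.g. "chromosome start end ...") is caught here,
        # because its start/end fields are not integers.
        if not (start.isdigit() and end.isdigit()):
            problems.append((number, "start and end must be non-negative integers"))
    return problems


# Made-up rows for illustration; the second one is missing target_label.
rows = [
    "2L\t100\t250\tamp_1\ttarget_a\n",
    "2L\t400\t550\tamp_2\n",
]
print(check_bed_lines(rows))
```

To check a real file, pass an open file handle, for example `check_bed_lines(open("config/ag-vampir.bed"))`.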
Illumina Directory Configuration
illumina-dir: resources/250110_M05658_0028_000000000-LTBV4
illumina-dir: Path to the Illumina MiSeq run directory containing BCL files. This is only required if converting from BCL to FASTQ (from-bcl: True). If you already have FASTQ files, this can be left empty.
Reference Genome Configuration
reference-fasta: resources/reference/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa
reference-gff3: resources/reference/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3
reference-snpeffdb: Anopheles_gambiae
custom-snpeffdb: False
reference-fasta: Path to the reference genome FASTA file (can be gzipped with .fa.gz extension).
reference-gff3: Path to the genome annotation file in GFF3 format.
reference-snpeffdb: Name of the SnpEff database to use for variant annotation. This should match a database name available in SnpEff (e.g., “Anopheles_gambiae”).
custom-snpeffdb: Whether to build a custom SnpEff database. Set to True if the reference genome is not available in the standard SnpEff databases.
Input File Type Configuration
from-bcl: True
fastq:
  auto: True
from-bcl: Whether to convert BCL files to FASTQ files. If True, the pipeline will use the illumina-dir path to find BCL files and convert them to FASTQ format.
fastq.auto: If True, the pipeline expects FASTQ files to be in the resources/reads/ directory with the naming pattern {sample_id}_1.fastq.gz and {sample_id}_2.fastq.gz. If False, the pipeline expects the metadata file to have columns fq1 and fq2 specifying the paths to the FASTQ files.
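The two fastq.auto modes can be summarised as a small lookup. The sketch below only illustrates the rule described above and is not taken from the AmpSeeker source: the function name and the `reads_dir` parameter are hypothetical, and each sample is represented as a dictionary keyed by metadata column name.

```python
from pathlib import PurePosixPath


def fastq_paths(sample_row: dict, auto: bool, reads_dir: str = "resources/reads") -> tuple[str, str]:
    """Return the (read 1, read 2) FASTQ paths expected for one sample."""
    if auto:
        # auto: True -> files follow the {sample_id}_1/_2.fastq.gz naming pattern
        sample_id = sample_row["sample_id"]
        base = PurePosixPath(reads_dir)
        return (
            str(base / f"{sample_id}_1.fastq.gz"),
            str(base / f"{sample_id}_2.fastq.gz"),
        )
    # auto: False -> paths are read from the fq1 and fq2 metadata columns
    return (sample_row["fq1"], sample_row["fq2"])


# Hypothetical sample rows, for illustration only.
print(fastq_paths({"sample_id": "S1"}, auto=True))
print(fastq_paths({"sample_id": "S1", "fq1": "data/a_R1.fq.gz", "fq2": "data/a_R2.fq.gz"}, auto=False))
```

Running a check like this over every row of the metadata file, and testing whether the returned paths exist, is a quick way to spot misnamed read files before starting the pipeline.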
Quality Control Configuration
quality-control:
  sample-total-reads-threshold: 250
  amplicon-total-reads-threshold: 1000
  fastp: True
  coverage: True
  stats: True
  multiqc: True
quality-control.sample-total-reads-threshold: Minimum number of reads required for a sample to pass QC. Samples with fewer reads are removed at the quality control stage.
quality-control.amplicon-total-reads-threshold: Minimum number of reads required for an amplicon to be considered for analysis.
quality-control.fastp: Whether to run the fastp tool for read quality control and trimming.
quality-control.coverage: Whether to calculate and report coverage statistics for each sample.
quality-control.stats: Whether to generate alignment and variant calling statistics.
quality-control.multiqc: Whether to generate a MultiQC report aggregating various QC metrics.
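To see what effect a given sample-total-reads-threshold would have, the filter can be mimicked on a table of per-sample read counts. This is a rough sketch under stated assumptions: the function name and the counts are invented, and how AmpSeeker actually counts reads per sample is not specified here. The comparison follows the description above, where samples with fewer reads than the threshold are removed.

```python
def split_by_read_threshold(read_counts: dict[str, int], threshold: int) -> tuple[list[str], list[str]]:
    """Split sample IDs into (passed, failed) lists for a minimum read count."""
    passed = sorted(name for name, count in read_counts.items() if count >= threshold)
    failed = sorted(name for name, count in read_counts.items() if count < threshold)
    return passed, failed


# Invented read counts, for illustration only.
counts = {"S1": 1200, "S2": 249, "S3": 250}
passed, failed = split_by_read_threshold(counts, threshold=250)
print("passed:", passed)  # S1 and S3 meet the minimum of 250 reads
print("failed:", failed)  # S2 has fewer than 250 reads
```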
Analysis Configuration
analysis:
  sample-map: False # needs lat and longs in metadata/sample_sheet
  population-structure: True
  genetic-diversity: True
  allele-frequencies: True
analysis.sample-map: Whether to generate a geographic map of sample collection locations. Requires latitude and longitude columns in the metadata file.
analysis.population-structure: Whether to perform population structure analysis using principal component analysis (PCA).
analysis.genetic-diversity: Whether to calculate genetic diversity metrics (such as nucleotide diversity).
analysis.allele-frequencies: Whether to calculate and visualize allele frequencies across samples and groups.
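Before switching analysis.sample-map to True, it may be worth confirming that every sample has usable coordinates. The sketch below is only an illustration: the function name is hypothetical, and the column names "latitude" and "longitude" are assumptions, so adjust them to match the column names your metadata file actually uses.

```python
def samples_with_bad_coordinates(rows, lat_col="latitude", lon_col="longitude"):
    """Return sample IDs whose coordinates are missing, non-numeric, or out of range."""
    bad = []
    for row in rows:
        try:
            lat = float(row[lat_col])
            lon = float(row[lon_col])
        except (KeyError, TypeError, ValueError):
            bad.append(row.get("sample_id"))  # missing or non-numeric value
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            bad.append(row.get("sample_id"))  # outside valid geographic range
    return bad


# Invented metadata rows, for illustration only.
rows = [
    {"sample_id": "S1", "latitude": "5.6", "longitude": "-0.2"},
    {"sample_id": "S2", "latitude": "", "longitude": "10"},
    {"sample_id": "S3", "latitude": "95", "longitude": "10"},
]
print(samples_with_bad_coordinates(rows))
```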
Jupyter Book Configuration
build-jupyter-book: True
build-jupyter-book: Whether to compile all analysis notebooks into a Jupyter Book for convenient browsing of results. The book will be available at results/ampseeker-results/_build/html/index.html.
Special Analyses for Ag-vampIR Panel
When using the Ag-vampIR panel (panel: ag-vampir), the workflow automatically enables additional analyses:
Species Identification: Identifies Anopheles species using amplicon sequencing data (results in results/notebooks/ag-vampir/species-id.ipynb).
Kdr Analysis: Analyzes knockdown resistance mutations in the voltage-gated sodium channel (results in results/notebooks/ag-vampir/kdr-analysis.ipynb).
Example Configuration
Here’s a complete example configuration for reference:
# Dataset and panel information
dataset: lab-strains
panel: ag-vampir
cohort-columns:
- location
- taxon
- country
targets: config/ag-vampir.bed
metadata: config/metadata.tsv
# Illumina directory (if using BCL files)
illumina-dir: resources/250110_M05658_0028_000000000-LTBV4
# Reference genome information
reference-name: AgamP4
reference-fasta: resources/reference/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa
reference-gff3: resources/reference/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3
reference-snpeffdb: Anopheles_gambiae
custom-snpeffdb: False
# Input file type options
from-bcl: True
fastq:
  auto: True
# Quality control options
quality-control:
  sample-total-reads-threshold: 250
  amplicon-total-reads-threshold: 1000
  fastp: True
  coverage: True
  stats: True
  multiqc: True
# Analysis options
analysis:
  sample-map: False
  population-structure: True
  genetic-diversity: True
  allele-frequencies: True
# Build Jupyter book
build-jupyter-book: True
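Two of the options described on this page depend on from-bcl: illumina-dir is needed when from-bcl is True, and the metadata file is needed when from-bcl is False. A quick consistency check along those lines might look like the sketch below. It is not part of AmpSeeker: the function name is hypothetical, and the config is assumed to have already been loaded into a Python dictionary (for example with a YAML parser).

```python
def config_problems(config: dict) -> list[str]:
    """Return human-readable messages for inconsistent input settings."""
    problems = []
    if config.get("from-bcl"):
        # Converting from BCL needs the Illumina run directory.
        if not config.get("illumina-dir"):
            problems.append("from-bcl is True but illumina-dir is empty")
    elif not config.get("metadata"):
        # Working directly from FASTQ files needs the metadata file.
        problems.append("from-bcl is False but no metadata file is set")
    return problems


# A subset of the example configuration above, written as a dictionary.
example_config = {
    "from-bcl": True,
    "illumina-dir": "resources/250110_M05658_0028_000000000-LTBV4",
    "metadata": "config/metadata.tsv",
}
print(config_problems(example_config))  # no problems expected for this config
```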