Input files#

AmpSeeker requires only:

  • Sample metadata file .tsv or SampleSheet.csv in Illumina run folder containing metadata

  • Bed file of amplicon target SNP coordinates .bed

  • Paired-end Amplicon-Seq data .fastq.gz

  • Reference genome files

Sample metadata file

If working directly from the Illumina BCL folder, a SampleSheet (SampleSheet.csv) must be placed in the Illumina run folder. An example sample sheet is located at resources/exampleSampleSheet.csv.

If working directly from fastq files, the user must provide either a SampleSheet or a sample metadata file in .tsv format which should be placed in the config/ folder and pointed to in the users config.yaml. The default name is metadata.tsv. The file only requires one column, sample_id, and a column with which to group the data for downstream analyses (for example, by location). Any extra metadata columns can be included (e.g location, taxon, strain, country, year etc) and used to split data and colour the results figures, with configurable options in the config.yaml file (cohort-columns).

metadata.tsv:

sample_id

location

taxon

latitude

longitude

ContTia1

Tiassale

coluzzii

7.1

0

ContTia2

Tiassale

coluzzii

7.1

0

ContTia4

Tiassale

coluzzii

7.1

0

MalaTia1

Tiassale

coluzzii

7.1

0

MalaTia2

Tiassale

coluzzii

7.1

0

MalaTia4

Tiassale

coluzzii

7.1

0

Latitude and Longitudes are used to plot sample collection locations. If sample maps are not required, these columns can be omitted.


Bed file of amplicon targets

Users should also provide a file in bed format, with at least 5 columns (chromosome, start, end, amplicon_id, target_label), and optionally reference and alternative alleles. The file should have no header. The bed file contains the coordinates of the amplicon targets and should be placed in the config/ folder and pointed to in the users config.yaml. The default name is amplicon_targets.bed.

2L

209535

209536

Agam_1

AIM1

A

G

2L

927246

927247

Agam_2

AIM2

C

A

2L

1274352

1274353

Agam_3

AIM3

G

A

2L

1418209

1418210

Agam_4

AIM4

T

C

2L

1571928

1571929

Agam_5

AIM5

T

C

The columns are:

  1. Chromosome: Genomic chromosome or contig name (e.g., “2L”)

  2. Start position: 0-based start coordinate of the target

  3. End position: End coordinate of the target (typically start+1 for SNPs)

  4. Amplicon ID: Identifier for the amplicon (e.g., “Agam_1”)

  5. Target label: Descriptive name for the target (e.g., “AIM1”)

  6. Reference allele: (Optional) The reference allele at this position

  7. Alternative allele(s): (Optional) The alternative allele(s) at this position

The target_label column values should be unique for each row of the bed file. The amplicon_id column values can be repeated for multiple rows, to indicate that those SNPs are located on the same amplicon.


Illumina run folders or Paired-end Amplicon-Sequencing fastq reads

Users can either provide the path to an Illumina MiSeq run folder, or provide paired-end fastq reads. If converting BCL files to fastq within the workflow, a SampleSheet.csv must be placed in the Illumina run folder. The SampleSheet must have the CreateFastqForIndexReads parameter set to 1, please see the exampleSampleSheet.csv in the resources/ directory.

SampleSheet.csv Structure#

The SampleSheet.csv is a structured file with several sections:

  • [Header]: Contains experiment metadata

  • [Reads]: Defines read lengths (typically 151 for paired-end reads)

  • [Settings]: Contains critical parameters including CreateFastqForIndexReads=1

  • [Data]: Sample information including IDs, indexes, and metadata

The CreateFastqForIndexReads=1 setting in the [Settings] section is essential as it instructs the BCL conversion to generate FASTQ files for index reads, which are used for demultiplexing samples.

Example [Data] section format:

sample_ID

sample_name

index

index2

well

plate_name

taxon

location

country

latitude

longitude

GH_01

GH_01

ATCACGTT

CCTATCCT

A1

3

Obuasi

Ghana

GH_02

GH_02

CGATGTTT

CCTATCCT

A2

3

Obuasi

Ghana

Required columns for the [Data] section:

  • sample_ID: Unique sample identifier, used to name output files

  • sample_name: Name displayed in reports (often the same as sample_ID)

  • index: Forward index sequence for demultiplexing

  • index2: Reverse index sequence for demultiplexing (for dual indexing)

Optional metadata columns can be added (e.g., well, plate_name, taxon, location, country, latitude, longitude) and will be incorporated into analysis results.

If providing fastq files, two gzipped fastq files for each sample are required, one for each pair of paired-end reads. Reads can be already trimmed or AmpSeeker can trim them, using the fastp module (configurable in the config.yaml).

Two options are available for specifying the location of the fastq files. Either the metadata contains two columns fq1 and fq2 with the paths to the fastq files, or the fastq files are placed in the following directory (resources/reads/) with the following naming pattern:

ampseeker_dir/resources/reads/

Reads should be named as `{sample_id}_1.fastq.gz`, `{sample_id}_2.fastq.gz`.

If providing fastq paths in the metadata file, they can be named anything.


Reference genome files

AmpSeeker uses bwa and samtools mpileup for alignment and variant calling. For variant calling, genome alignment is performed with bwa, which requires a fasta file containing the genome sequence. All input .fa files can be gzipped .fa.gz.

The user provides the path to the reference files in the configuration file (config.yaml).

  1. Genome chromosomes reference file (.fa/.fa.gz). Contains the DNA sequence for the genome in fasta format.

  2. Genome feature file (.gff3 format).

Ensure that contigs in the reference genome match the contigs in the bed file.

Note - Genome reference files from VectorBase now have prefixes before each contig name, such as ‘AgamP4_2L’. Either the bed file can be updated to match these, or the names in the reference files can be modified