Input files

AmpSeeker requires only:

A sample metadata file (.tsv), or a SampleSheet.csv containing the metadata in the Illumina run folder
A bed file of amplicon target SNP coordinates (.bed)
Paired-end Illumina Amplicon-Seq data (.fastq.gz) OR single-end Nanopore amplicon data (.fastq.gz)
Reference genome files

Sample metadata file

If working directly from the Illumina BCL folder, a SampleSheet (SampleSheet.csv) must be placed in the Illumina run folder. An example sample sheet is located at resources/exampleSampleSheet.csv.

If working directly from fastq files, the user must provide either a SampleSheet or a sample metadata file in .tsv format, which should be placed in the config/ folder and pointed to in the user's config.yaml. The default name is metadata.tsv. The file requires only two columns: sample_id, and a column with which to group the data for downstream analyses (for example, by location). Any extra metadata columns can be included (e.g., location, taxon, strain, country, year) and used to split the data and colour the results figures, as configured with the cohort-columns option in the config.yaml file.

metadata.tsv:

sample_id	location	taxon	latitude
ContTia1	Tiassale	coluzzii	7.1
ContTia2	Tiassale	coluzzii	7.1
ContTia4	Tiassale	coluzzii	7.1
MalaTia1	Tiassale	coluzzii	7.1
MalaTia2	Tiassale	coluzzii	7.1
MalaTia4	Tiassale	coluzzii	7.1

The latitude and longitude columns are used to plot sample collection locations. If sample maps are not required, these columns can be omitted.
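The metadata requirements above can be checked before a run with a few lines of Python. This is a minimal sketch rather than part of AmpSeeker: the function name check_metadata is hypothetical, the grouping column is assumed to be location, and the check that sample_id values are unique is an assumption that the text above does not state explicitly.

```python
import csv
import io


def check_metadata(handle, cohort_column="location"):
    """Return a list of problems found in a tab-separated metadata file."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = reader.fieldnames or []
    problems = []

    # sample_id and one grouping column are the only required columns.
    for required in ("sample_id", cohort_column):
        if required not in columns:
            problems.append(f"missing column: {required}")
    if problems:
        return problems

    seen = set()
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        sample_id = (row["sample_id"] or "").strip()
        if not sample_id:
            problems.append(f"line {line_number}: empty sample_id")
        elif sample_id in seen:
            # Assumption: sample IDs should be unique within the file.
            problems.append(f"line {line_number}: duplicate sample_id {sample_id}")
        seen.add(sample_id)
        if not (row[cohort_column] or "").strip():
            problems.append(f"line {line_number}: empty {cohort_column}")
    return problems


example = (
    "sample_id\tlocation\ttaxon\tlatitude\n"
    "ContTia1\tTiassale\tcoluzzii\t7.1\n"
    "ContTia2\tTiassale\tcoluzzii\t7.1\n"
)
print(check_metadata(io.StringIO(example)))  # prints []
```

For a real file, pass an open handle instead, for example open("config/metadata.tsv", newline="").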

Bed file of amplicon targets

Users should also provide a file in bed format with at least 5 columns (chromosome, start, end, amplicon_id, target_label) and, optionally, the reference and alternate alleles. The file should have no header. The bed file contains the coordinates of the amplicon targets and should be placed in the config/ folder and pointed to in the user's config.yaml. The default name is amplicon_targets.bed.

2L	209535	209536	Agam_1	AIM1	A	G
2L	927246	927247	Agam_2	AIM2	C	A
2L	1274352	1274353	Agam_3	AIM3	G	A
2L	1418209	1418210	Agam_4	AIM4	T	C
2L	1571928	1571929	Agam_5	AIM5	T	C

The columns are:

Chromosome: Genomic chromosome or contig name (e.g., “2L”)
Start position: 0-based start coordinate of the target
End position: End coordinate of the target (typically start+1 for SNPs)
Amplicon ID: Identifier for the amplicon (e.g., “Agam_1”)
Target label: Descriptive name for the target (e.g., “AIM1”)
Reference allele: (Optional) The reference allele at this position
Alternate allele(s): (Optional) The alternate allele(s) at this position

The target_label column values should be unique for each row of the bed file. The amplicon_id column values can be repeated for multiple rows, to indicate that those SNPs are located on the same amplicon.
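The bed file rules described above (at least 5 tab-separated columns, no header, unique target_label values) can be checked with a short script. This is an illustrative sketch, not AmpSeeker code: check_bed is a hypothetical helper, and the requirement that start is smaller than end follows the general bed convention rather than anything AmpSeeker is known to enforce.

```python
def check_bed(lines):
    """Return a list of problems found in amplicon-target bed lines."""
    problems = []
    seen_labels = set()
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            problems.append(
                f"line {line_number}: expected at least 5 columns, found {len(fields)}"
            )
            continue
        chrom, start, end, amplicon_id, target_label = fields[:5]

        # A header row would also be caught here, because its coordinates
        # are not integers.
        if not (start.isdigit() and end.isdigit()):
            problems.append(
                f"line {line_number}: start and end must be non-negative integers"
            )
        elif int(start) >= int(end):
            problems.append(f"line {line_number}: start must be smaller than end")

        # target_label must be unique; amplicon_id may repeat across rows.
        if target_label in seen_labels:
            problems.append(f"line {line_number}: duplicate target_label {target_label}")
        seen_labels.add(target_label)
    return problems


example = [
    "2L\t209535\t209536\tAgam_1\tAIM1\tA\tG\n",
    "2L\t927246\t927247\tAgam_2\tAIM2\tC\tA\n",
]
print(check_bed(example))  # prints []
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example check_bed(open("config/amplicon_targets.bed")).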

Sequencing Data Input

Illumina Data

Illumina run folders or Paired-end Amplicon-Sequencing fastq reads

Users can either provide the path to an Illumina MiSeq run folder, or provide paired-end fastq reads. If converting BCL files to fastq within the workflow, a SampleSheet.csv must be placed in the Illumina run folder. The SampleSheet must have the CreateFastqForIndexReads parameter set to 1; see the exampleSampleSheet.csv in the resources/ directory for an example.

SampleSheet.csv Structure

The SampleSheet.csv is a structured file with several sections:

[Header]: Contains experiment metadata
[Reads]: Defines read lengths (typically 151 for paired-end reads)
[Settings]: Contains critical parameters including CreateFastqForIndexReads=1
[Data]: Sample information including IDs, indexes, and metadata

The CreateFastqForIndexReads=1 setting in the [Settings] section is essential as it instructs the BCL conversion to generate FASTQ files for index reads, which are used for demultiplexing samples.

Example [Data] section format:

sample_ID	sample_name	index	index2	well	plate_name	taxon	location	country	latitude	longitude
GH_01	GH_01	ATCACGTT	CCTATCCT	A1	3		Obuasi	Ghana
GH_02	GH_02	CGATGTTT	CCTATCCT	A2	3		Obuasi	Ghana

Required columns for the [Data] section:

sample_ID: Unique sample identifier, used to name output files
sample_name: Name displayed in reports (often the same as sample_ID)
index: Forward index sequence for demultiplexing
index2: Reverse index sequence for demultiplexing (for dual indexing)

Optional metadata columns can be added (e.g., well, plate_name, taxon, location, country, latitude, longitude) and will be incorporated into analysis results.
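The section layout and required settings described above can be verified with a small parser. The sketch below is an assumption-laden illustration, not the parser AmpSeeker or the BCL conversion tool uses: the helper names are hypothetical, the example sheet is made up for the demonstration (including the Experiment Name entry), and the file is assumed to be comma-separated with section names in square brackets in the first cell.

```python
import csv
import io

# Required [Data] columns, spelled as in the list above.
REQUIRED_DATA_COLUMNS = ("sample_ID", "sample_name", "index", "index2")


def read_sample_sheet(handle):
    """Split a SampleSheet into {section name: list of rows}."""
    sections = {}
    current = None
    for row in csv.reader(handle):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip blank lines
        if cells[0].startswith("[") and cells[0].endswith("]"):
            current = cells[0][1:-1]  # "[Data]" -> "Data"
            sections[current] = []
        elif current is not None:
            sections[current].append(cells)
    return sections


def index_fastqs_enabled(sections):
    """True if [Settings] contains CreateFastqForIndexReads set to 1."""
    for row in sections.get("Settings", []):
        if len(row) >= 2 and row[0] == "CreateFastqForIndexReads":
            return row[1] == "1"
    return False


def missing_data_columns(sections):
    """Return required columns absent from the [Data] header row."""
    rows = sections.get("Data", [])
    header = rows[0] if rows else []
    return [name for name in REQUIRED_DATA_COLUMNS if name not in header]


# Hypothetical sheet for illustration only.
example = """[Header]
Experiment Name,example_run
[Reads]
151
151
[Settings]
CreateFastqForIndexReads,1
[Data]
sample_ID,sample_name,index,index2,location,country
GH_01,GH_01,ATCACGTT,CCTATCCT,Obuasi,Ghana
GH_02,GH_02,CGATGTTT,CCTATCCT,Obuasi,Ghana
"""

sections = read_sample_sheet(io.StringIO(example))
print(index_fastqs_enabled(sections))  # prints True
print(missing_data_columns(sections))  # prints []
```

Column names are matched exactly, so a sheet that spells a header differently would be reported as missing that column.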

FASTQ File Specifications for Illumina

If providing fastq files, two gzipped fastq files are required for each sample, one for each read of the pair. Reads can be trimmed already, or AmpSeeker can trim them with fastp (configurable in the config.yaml).

Two options are available for specifying the location of the fastq files. Either the metadata file contains two columns, fq1 and fq2, with the paths to the fastq files, or the fastq files are placed in the resources/reads/ directory:

ampseeker_dir/resources/reads/

Reads in this directory should be named {sample_id}_1.fastq.gz and {sample_id}_2.fastq.gz.

If providing fastq paths in the metadata file, they can be named anything.
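The two ways of locating paired-end reads can be expressed as a small lookup. This is a sketch under stated assumptions, not AmpSeeker's implementation: illumina_fastq_paths is a hypothetical name, and giving the fq1 and fq2 columns precedence over the default directory is an assumption about how the two options interact.

```python
from pathlib import Path


def illumina_fastq_paths(row, reads_dir="resources/reads"):
    """Return (read 1, read 2) paths for one metadata row given as a dict."""
    fq1, fq2 = row.get("fq1"), row.get("fq2")
    if fq1 and fq2:
        # Explicit paths from the metadata file; any file name is allowed.
        return Path(fq1), Path(fq2)
    # Otherwise fall back to the {sample_id}_1 / {sample_id}_2 naming pattern.
    sample_id = row["sample_id"]
    base = Path(reads_dir)
    return base / f"{sample_id}_1.fastq.gz", base / f"{sample_id}_2.fastq.gz"


read1, read2 = illumina_fastq_paths({"sample_id": "ContTia1"})
print(read1.as_posix())  # prints resources/reads/ContTia1_1.fastq.gz
print(read2.as_posix())  # prints resources/reads/ContTia1_2.fastq.gz
```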

Nanopore Data

FASTQ File Specifications for Nanopore

For Nanopore data, provide a single gzipped fastq file for each sample. Either the metadata file contains a column, fq1, with the paths to the fastq files, or the files are placed in resources/reads/ with the naming pattern {sample_id}.fastq.gz.

Nanopore metadata.tsv format:

sample_id	fq1	location	taxon
ContTia1	reads/ContTia1.fq.gz	Tiassale	coluzzii
ContTia2	reads/ContTia2.fq.gz	Tiassale	coluzzii

Or with automatic file detection:

ampseeker_dir/resources/reads/
ContTia1.fastq.gz
ContTia2.fastq.gz
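When relying on automatic file detection, it can help to confirm that every sample in the metadata has a matching file before starting a run. The following is a minimal sketch with a hypothetical helper, missing_nanopore_reads, which only checks for the {sample_id}.fastq.gz pattern described above; it is not how AmpSeeker itself discovers files.

```python
import tempfile
from pathlib import Path


def missing_nanopore_reads(sample_ids, reads_dir="resources/reads"):
    """Return the sample IDs with no {sample_id}.fastq.gz file in reads_dir."""
    base = Path(reads_dir)
    return [
        sample_id
        for sample_id in sample_ids
        if not (base / f"{sample_id}.fastq.gz").is_file()
    ]


# Demonstration in a temporary directory holding one of two expected files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ContTia1.fastq.gz").touch()
    print(missing_nanopore_reads(["ContTia1", "ContTia2"], tmp))  # prints ['ContTia2']
```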

Reference genome files

For both platforms, genome alignment requires a fasta file containing the genome sequence. All input .fa files can be gzipped (.fa.gz).

Reference genomes can be downloaded from VectorBase with the resources/reference/download-vectorbase-reference.sh script. Run it from the root AmpSeeker directory.

The user provides the paths to the following reference files in the configuration file (config.yaml):

Genome chromosomes reference file (.fa/.fa.gz). Contains the DNA sequence for the genome in fasta format.
Genome feature file (.gff3 format).

Ensure that contigs in the reference genome match the contigs in the bed file.

Note: Genome reference files from VectorBase now have a prefix before each contig name, such as ‘AgamP4_2L’. Either the bed file can be updated to match these names, or the names in the reference files can be modified. The downloader script can strip these prefixes automatically.
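A quick way to check that the bed file and the reference agree is to compare their contig names directly. The sketch below uses hypothetical helper names and treats the prefix (for example AgamP4_) as a user-supplied string; it is independent of the downloader script and only illustrates the comparison.

```python
import gzip


def fasta_contigs(path):
    """Return contig names (first word of each '>' header) from a .fa or .fa.gz file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    names = []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                words = line[1:].split()
                if words:
                    names.append(words[0])
    return names


def bed_contigs(path):
    """Return the set of contig names in the first column of a bed file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.strip():
                names.add(line.split("\t")[0].strip())
    return names


def unmatched_contigs(bed_names, fasta_names, prefix=""):
    """Return bed contigs absent from the reference after removing an optional prefix."""
    reference = {
        name[len(prefix):] if prefix and name.startswith(prefix) else name
        for name in fasta_names
    }
    return sorted(set(bed_names) - reference)


reference_names = ["AgamP4_2L", "AgamP4_3R"]
print(unmatched_contigs({"2L", "3R"}, reference_names))                    # prints ['2L', '3R']
print(unmatched_contigs({"2L", "3R"}, reference_names, prefix="AgamP4_"))  # prints []
```

An empty list means every contig in the bed file was found in the reference. With real files, the two name collections would come from bed_contigs and fasta_contigs, called with the paths set in config.yaml.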