Input files
AmpSeeker requires only:
Sample metadata file (.tsv), or a SampleSheet.csv in the Illumina run folder containing the metadata
Bed file of amplicon target SNP coordinates (.bed)
Paired-end Amplicon-Seq data (.fastq.gz)
Reference genome files
Sample metadata file
If working directly from the Illumina BCL folder, a SampleSheet (SampleSheet.csv) must be placed in the Illumina run folder. An example sample sheet is located at resources/exampleSampleSheet.csv.
If working directly from fastq files, the user must provide either a SampleSheet or a sample metadata file in .tsv format, which should be placed in the config/ folder and pointed to in the user's config.yaml. The default name is metadata.tsv. The file requires only two columns: sample_id, and a column with which to group the data for downstream analyses (for example, by location). Any extra metadata columns can be included (e.g., location, taxon, strain, country, year) and used to split data and colour the results figures, with configurable options in the config.yaml file (cohort-columns).
metadata.tsv:
sample_id | location | taxon | latitude | longitude |
---|---|---|---|---|
ContTia1 | Tiassale | coluzzii | 7.1 | 0 |
ContTia2 | Tiassale | coluzzii | 7.1 | 0 |
ContTia4 | Tiassale | coluzzii | 7.1 | 0 |
MalaTia1 | Tiassale | coluzzii | 7.1 | 0 |
MalaTia2 | Tiassale | coluzzii | 7.1 | 0 |
MalaTia4 | Tiassale | coluzzii | 7.1 | 0 |
The latitude and longitude columns are used to plot sample collection locations. If sample maps are not required, these columns can be omitted.
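As a minimal sketch of the requirements above, the following checks a metadata table for the sample_id column and a grouping column before a run. The function name `check_metadata` and the `group_column` parameter are hypothetical and not part of AmpSeeker, and the uniqueness check on sample_id is an assumption rather than a documented rule for metadata.tsv.

```python
import csv


def check_metadata(lines, group_column="location"):
    """Return a list of problems found in a tab-separated metadata table.

    `lines` is any iterable of text lines, for example an open file handle.
    `group_column` is a hypothetical stand-in for whichever column is
    chosen for grouping in config.yaml.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    columns = reader.fieldnames or []
    problems = []
    for required in ("sample_id", group_column):
        if required not in columns:
            problems.append(f"missing column: {required}")
    if problems:
        return problems

    seen = set()
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        sample_id = (row["sample_id"] or "").strip()
        if not sample_id:
            problems.append(f"row {row_number}: empty sample_id")
        elif sample_id in seen:
            # Assumption: sample identifiers should not repeat.
            problems.append(f"row {row_number}: duplicate sample_id {sample_id}")
        seen.add(sample_id)
        if not (row[group_column] or "").strip():
            problems.append(f"row {row_number}: empty {group_column}")
    return problems
```

It could be called as `check_metadata(open("config/metadata.tsv", newline=""))`, with an empty list meaning no problems were found.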
Bed file of amplicon targets
Users should also provide a file in bed format, with at least 5 columns (chromosome, start, end, amplicon_id, target_label), and optionally reference and alternative alleles. The file should have no header. The bed file contains the coordinates of the amplicon targets and should be placed in the config/ folder and pointed to in the user's config.yaml. The default name is amplicon_targets.bed.
2L | 209535 | 209536 | Agam_1 | AIM1 | A | G |
---|---|---|---|---|---|---|
2L | 927246 | 927247 | Agam_2 | AIM2 | C | A |
2L | 1274352 | 1274353 | Agam_3 | AIM3 | G | A |
2L | 1418209 | 1418210 | Agam_4 | AIM4 | T | C |
2L | 1571928 | 1571929 | Agam_5 | AIM5 | T | C |
The columns are:
Chromosome: Genomic chromosome or contig name (e.g., “2L”)
Start position: 0-based start coordinate of the target
End position: End coordinate of the target (typically start+1 for SNPs)
Amplicon ID: Identifier for the amplicon (e.g., “Agam_1”)
Target label: Descriptive name for the target (e.g., “AIM1”)
Reference allele: (Optional) The reference allele at this position
Alternative allele(s): (Optional) The alternative allele(s) at this position
The target_label column values should be unique for each row of the bed file. The amplicon_id column values can be repeated for multiple rows, to indicate that those SNPs are located on the same amplicon.
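The rules above (at least 5 columns, no header, unique target_label values, repeatable amplicon_id values) can be sketched as a small check. This is an illustration only: `check_bed` is a hypothetical helper, not an AmpSeeker function, and it assumes the file is tab-separated.

```python
def check_bed(lines):
    """Return a list of problems found in an amplicon target bed table.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    problems = []
    labels = set()
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) < 5:
            problems.append(
                f"line {line_number}: expected at least 5 columns, found {len(fields)}"
            )
            continue
        chrom, start, end, amplicon_id, target_label = fields[:5]
        # A header row would fail here, because its coordinates are not numbers.
        if not (start.isdigit() and end.isdigit()):
            problems.append(
                f"line {line_number}: start and end must be non-negative integers"
            )
        elif int(start) >= int(end):
            problems.append(f"line {line_number}: start must be smaller than end")
        # target_label must be unique; amplicon_id is allowed to repeat.
        if target_label in labels:
            problems.append(
                f"line {line_number}: duplicate target_label {target_label}"
            )
        labels.add(target_label)
    return problems
```

Repeated amplicon_id values are deliberately not reported, since several SNPs can sit on the same amplicon.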
Illumina run folders or Paired-end Amplicon-Sequencing fastq reads
Users can either provide the path to an Illumina MiSeq run folder, or provide paired-end fastq reads. If converting BCL files to fastq within the workflow, a SampleSheet.csv must be placed in the Illumina run folder. The SampleSheet must have the CreateFastqForIndexReads parameter set to 1; see the exampleSampleSheet.csv in the resources/ directory.
SampleSheet.csv Structure
The SampleSheet.csv is a structured file with several sections:
[Header]: Contains experiment metadata
[Reads]: Defines read lengths (typically 151 for paired-end reads)
[Settings]: Contains critical parameters including CreateFastqForIndexReads=1
[Data]: Sample information including IDs, indexes, and metadata
The CreateFastqForIndexReads=1 setting in the [Settings] section is essential as it instructs the BCL conversion to generate FASTQ files for index reads, which are used for demultiplexing samples.
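Because a missing setting is easy to overlook, a quick pre-flight check can help. The sketch below reads the [Settings] section and reports whether CreateFastqForIndexReads is 1. The helper names are hypothetical, and the parser assumes the setting appears either as `CreateFastqForIndexReads,1` (comma-separated cells) or as `CreateFastqForIndexReads=1`; resources/exampleSampleSheet.csv remains the authoritative reference for the layout.

```python
import csv


def read_settings(lines):
    """Collect key/value pairs from the [Settings] section of a SampleSheet.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    settings = {}
    in_settings = False
    for row in csv.reader(lines):
        cells = [cell.strip() for cell in row]
        if not cells or not cells[0]:
            continue  # skip blank rows and rows that only contain padding commas
        if cells[0].startswith("["):
            # A new section starts; remember whether it is [Settings].
            in_settings = cells[0].lower() == "[settings]"
            continue
        if in_settings:
            if "=" in cells[0]:
                key, value = cells[0].split("=", 1)
            else:
                key = cells[0]
                value = cells[1] if len(cells) > 1 else ""
            settings[key.strip()] = value.strip()
    return settings


def index_reads_enabled(lines):
    """Return True when CreateFastqForIndexReads is set to 1."""
    return read_settings(lines).get("CreateFastqForIndexReads") == "1"
```

For example, `index_reads_enabled(open("SampleSheet.csv", newline=""))` returns False when the setting is absent or set to another value.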
Example [Data] section format:
sample_ID | sample_name | index | index2 | well | plate_name | taxon | location | country | latitude | longitude |
---|---|---|---|---|---|---|---|---|---|---|
GH_01 | GH_01 | ATCACGTT | CCTATCCT | A1 | 3 | Obuasi | Ghana | | | |
GH_02 | GH_02 | CGATGTTT | CCTATCCT | A2 | 3 | Obuasi | Ghana | | | |
Required columns for the [Data] section:
sample_ID: Unique sample identifier, used to name output files
sample_name: Name displayed in reports (often the same as sample_ID)
index: Forward index sequence for demultiplexing
index2: Reverse index sequence for demultiplexing (for dual indexing)
Optional metadata columns can be added (e.g., well, plate_name, taxon, location, country, latitude, longitude) and will be incorporated into analysis results.
If providing fastq files, two gzipped fastq files are required for each sample, one for each read of the pair. Reads may already be trimmed, or AmpSeeker can trim them using the fastp module (configurable in the config.yaml).
Two options are available for specifying the location of the fastq files. Either the metadata contains two columns, fq1 and fq2, with the paths to the fastq files, or the fastq files are placed in the resources/reads/ directory with the following naming pattern:
ampseeker_dir/resources/reads/
Reads should be named as `{sample_id}_1.fastq.gz`, `{sample_id}_2.fastq.gz`.
If providing fastq paths in the metadata file, they can be named anything.
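The two options can be sketched as a single lookup: use fq1 and fq2 when a metadata row provides them, otherwise fall back to the naming pattern under resources/reads/. The function names below are hypothetical and only illustrate the rule; they are not how AmpSeeker itself resolves paths.

```python
from pathlib import Path


def fastq_paths(row, reads_dir="resources/reads"):
    """Return the (read 1, read 2) paths for one metadata row (a dict)."""
    fq1, fq2 = row.get("fq1"), row.get("fq2")
    if fq1 and fq2:
        # Explicit paths in the metadata can have any name.
        return Path(fq1), Path(fq2)
    sample_id = row["sample_id"]
    base = Path(reads_dir)
    return base / f"{sample_id}_1.fastq.gz", base / f"{sample_id}_2.fastq.gz"


def missing_fastqs(rows, reads_dir="resources/reads"):
    """List the expected fastq files that do not exist on disk."""
    return [
        str(path)
        for row in rows
        for path in fastq_paths(row, reads_dir)
        if not path.exists()
    ]
```

Running `missing_fastqs` over the metadata rows (for example from `csv.DictReader`) before starting the workflow gives an early list of files to fix.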
Reference genome files
AmpSeeker uses bwa for alignment and samtools mpileup for variant calling. bwa requires a fasta file containing the genome sequence. All input .fa files can be gzipped (.fa.gz).
The user provides the path to the reference files in the configuration file (config.yaml).
Genome chromosomes reference file (.fa/.fa.gz). Contains the DNA sequence for the genome in fasta format.
Genome feature file (.gff3 format).
Ensure that contigs in the reference genome match the contigs in the bed file.
Note - Genome reference files from VectorBase now have prefixes before each contig name, such as ‘AgamP4_2L’. Either the bed file can be updated to match these, or the names in the reference files can be modified.
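A mismatch between contig names can be detected before running the workflow. The sketch below lists bed contigs that are absent from the fasta headers and shows one way to add a prefix to the bed file. All function names are hypothetical, the bed file is assumed to be tab-separated, and the prefix ‘AgamP4_’ is taken from the example above.

```python
import gzip


def open_text(path):
    """Open a plain or gzipped text file for reading."""
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, "rt")


def contig_names(fasta_lines):
    """Return the first word of every '>' header line in a fasta file."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].split()
    }


def unmatched_contigs(bed_lines, fasta_lines):
    """Return bed contig names (column 1) that are not in the fasta."""
    bed = {line.split("\t")[0].strip() for line in bed_lines if line.strip()}
    return sorted(bed - contig_names(fasta_lines))


def add_prefix(bed_lines, prefix):
    """Yield bed lines with `prefix` added in front of the contig name."""
    for line in bed_lines:
        yield prefix + line if line.strip() else line
```

With real files, the line iterables can come from `open_text(path)`. Writing the output of `add_prefix(bed_lines, "AgamP4_")` to a new bed file leaves the reference files untouched.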