Input files

AmpSeeker requires only:

  • Sample metadata file (.tsv, .csv, or .xlsx)

  • Bed file of amplicon SNP coordinates (.bed)

  • Paired-end Amplicon-Seq data (.fastq.gz)

  • Reference genome files

Sample metadata file

The user must provide a sample metadata file in .tsv, .csv, or Excel format. It should be placed in the config/ folder and pointed to in the user's config.yaml. The default name is metadata.tsv. The file requires only two columns: sampleID, and a column with which to group the data for downstream analyses (for example, by location).

metadata.tsv:

|sampleID| location |  taxon  | latitude| longitude | 
|--------|----------|---------|-------- | --------- |
|ContTia1|Tiassale  |coluzzii |   7.1   |     0     |
|ContTia2|Tiassale  |coluzzii |   7.1   |     0     |
|ContTia4|Tiassale  |coluzzii |   7.1   |     0     |
|MalaTia1|Tiassale  |coluzzii |   7.1   |     0     |
|MalaTia2|Tiassale  |coluzzii |   7.1   |     0     |
|MalaTia4|Tiassale  |coluzzii |   7.1   |     0     |

The latitude and longitude columns are used to plot sample collection locations. If sample maps are not required, these columns can be omitted. Extra metadata columns can be added and used to colour the PCA plots, with configurable options in the config.yaml file.
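As a rough pre-flight check, the column requirements above can be sketched in Python. This is an illustrative helper, not part of AmpSeeker: the function name `check_metadata` and its `group_column` parameter are hypothetical, the duplicate-sampleID check is an assumption (the documentation does not state that sampleIDs must be unique), and only delimited text files (.tsv or .csv) are handled, not Excel files.

```python
import csv
import io


def check_metadata(handle, group_column="location", delimiter="\t"):
    """Return a list of problems found in a metadata table (empty if none)."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    columns = reader.fieldnames or []

    # sampleID and one grouping column are the only required columns.
    problems = [
        f"missing column: {name}"
        for name in ("sampleID", group_column)
        if name not in columns
    ]
    if problems:
        return problems

    seen = set()
    for row in reader:
        sample = row["sampleID"]
        # Assumption: each sample should appear once in the metadata.
        if sample in seen:
            problems.append(f"duplicate sampleID: {sample}")
        seen.add(sample)
        if not row[group_column]:
            problems.append(f"empty {group_column} for sample {sample}")
    return problems


example = (
    "sampleID\tlocation\ttaxon\n"
    "ContTia1\tTiassale\tcoluzzii\n"
    "ContTia2\tTiassale\tcoluzzii\n"
)
print(check_metadata(io.StringIO(example)))
```

For a real file, the same function could be called on `open("config/metadata.tsv", newline="")`, passing `delimiter=","` for a .csv file.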


Bed file of amplicon targets

Users should also provide a file in bed format, with 5 columns (chromosome, start, end, target_id, target_label) and no header. The bed file should contain the coordinates of the amplicon targets. It should be placed in the config/ folder and pointed to in the user's config.yaml. The default name is amplicon_targets.bed.


|2L|  927246|  927247| amplicon_id_1| snp_label_1|
|2L| 1274352| 1274353| amplicon_id_2| snp_label_2|
|2L| 1418209| 1418210| amplicon_id_3| snp_label_3|
|2L| 1571928| 1571929| amplicon_id_4| snp_label_4|
|2L| 1776347| 1776348| amplicon_id_5| snp_label_5|
|2L| 1947573| 1947574| amplicon_id_6| snp_label_6|
|2L| 1947578| 1947579| amplicon_id_6| snp_label_7|

The values in the fifth column (the SNP label) should be unique for each row of the bed file. The values in the fourth column (the amplicon ID) can be repeated across multiple rows, to indicate that those SNPs are located on the same amplicon.
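The bed file rules described above (five columns, no header, unique labels in the last column) can be checked with a short script. The sketch below is illustrative only: `check_bed` is a hypothetical helper, and it assumes the file is tab-separated and that each start coordinate is smaller than its end coordinate, as in the example rows.

```python
import io


def check_bed(handle):
    """Return a list of problems found in a 5-column bed file (empty if none)."""
    problems = []
    labels = set()
    for number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            problems.append(
                f"line {number}: expected 5 columns, found {len(fields)}"
            )
            continue
        chrom, start, end, amplicon_id, label = fields
        # Coordinates must be integers, with start before end.
        if not (start.isdigit() and end.isdigit() and int(start) < int(end)):
            problems.append(
                f"line {number}: start and end must be integers with start < end"
            )
        # Labels must be unique; amplicon IDs are allowed to repeat.
        if label in labels:
            problems.append(f"line {number}: duplicate label {label}")
        labels.add(label)
    return problems


bed_example = (
    "2L\t1947573\t1947574\tamplicon_id_6\tsnp_label_6\n"
    "2L\t1947578\t1947579\tamplicon_id_6\tsnp_label_7\n"
)
print(check_bed(io.StringIO(bed_example)))
```

The two example rows share an amplicon ID but have distinct labels, so no problems are reported for them.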


Illumina run folders or Paired-end Amplicon-Sequencing fastq reads

Users can either provide the path to an Illumina MiSeq run folder, or provide paired-end fastq reads. If converting BCL files to fastq within the workflow, a SampleSheet.csv must be placed in the Illumina run folder, with sampleIDs that match the sample metadata.tsv file. The SampleSheet must have the CreateFastqForIndexReads parameter set to 1; please see the example SampleSheet in the resources/ directory.

If providing fastq files, two gzipped fastq files are required for each sample, one for each read of the pair. Reads can be trimmed in advance, or AmpSeeker can trim them using the fastp module.

Two options are available for specifying the location of the fastq files. Either the metadata contains two columns, fq1 and fq2, with the paths to the fastq files, or the fastq files are placed in the following directory and follow the naming pattern given below:

ampseeker_dir/resources/reads/

Reads should be named as `{sampleID}_1.fastq.gz`, `{sampleID}_2.fastq.gz`.

If providing fastq paths in the metadata file, they can be named anything.
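When relying on the default location rather than fq1 and fq2 columns, it can help to confirm that every sample has both read files before starting a run. The following is a minimal sketch under that assumption; `expected_fastqs` and `missing_fastqs` are hypothetical helper names, not AmpSeeker functions.

```python
from pathlib import Path


def expected_fastqs(sample_ids, reads_dir="resources/reads"):
    """Map each sampleID to its expected (read 1, read 2) fastq paths."""
    reads_dir = Path(reads_dir)
    return {
        sample: (
            reads_dir / f"{sample}_1.fastq.gz",
            reads_dir / f"{sample}_2.fastq.gz",
        )
        for sample in sample_ids
    }


def missing_fastqs(sample_ids, reads_dir="resources/reads"):
    """Return the expected fastq paths that do not exist on disk."""
    missing = []
    for pair in expected_fastqs(sample_ids, reads_dir).values():
        missing.extend(str(path) for path in pair if not path.is_file())
    return missing


# Run from the AmpSeeker directory; an empty list means all files were found.
print(missing_fastqs(["ContTia1", "ContTia2"]))
```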


Reference genome files

AmpSeeker uses bwa for alignment and samtools mpileup for variant calling. Alignment with bwa requires a fasta file containing the genome sequence. All input .fa files can be gzipped (.fa.gz).

The user provides the path to the reference files in the configuration file (config.yaml).

  1. Genome chromosomes reference file (.fa/.fa.gz). Contains the DNA sequence for the genome in fasta format.

  2. Genome feature file (.gff3 format).

Ensure that contigs in the reference genome match the contigs in the bed file.

Note: Genome reference files from VectorBase now have prefixes before each contig name, such as ‘AgamP4_2L’. Either the bed file can be updated to match these, or the names in the reference files can be modified.
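One way to check that contig names agree, and to update the bed file when the reference carries a prefix, is sketched below. The helper names are hypothetical and the bed file is assumed to be tab-separated; the 'AgamP4_' prefix is taken from the example above and should be replaced with whatever prefix the reference actually uses.

```python
import gzip
import io


def open_text(path):
    """Open a plain or gzipped text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def fasta_contigs(handle):
    """Collect contig names: the first word of each '>' header line."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                names.add(header[0])
    return names


def bed_contigs(handle):
    """Collect the contig names found in the first column of a bed file."""
    return {line.split("\t")[0] for line in handle if line.strip()}


def add_prefix(bed_lines, prefix):
    """Prefix the contig name, which is the first field of each bed line."""
    return [prefix + line if line.strip() else line for line in bed_lines]


fasta = ">AgamP4_2L some description\nACGT\n>AgamP4_2R\nACGT\n"
bed = "2L\t927246\t927247\tamplicon_id_1\tsnp_label_1\n"

# Contigs used in the bed file that are absent from the reference.
unmatched = bed_contigs(io.StringIO(bed)) - fasta_contigs(io.StringIO(fasta))
print(unmatched)

prefixed = add_prefix(bed.splitlines(), "AgamP4_")
print(prefixed)
```

For real files, `open_text` can be used to pass the reference and bed files to `fasta_contigs` and `bed_contigs` in place of the in-memory examples.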