Input files

RNA-Seq-Pop requires only:

  • Sample metadata file (.tsv)

  • Single-end or paired-end RNA-Seq data (.fastq.gz)

  • Reference genome files

Sample metadata file

The user must provide a tab-separated sample metadata file, which should be placed in the config/ folder and pointed to in the user's config.yaml. The default name is samples.tsv. An example metadata file is provided at config/examplesamples.tsv, and shown below:

samples.tsv:

|sampleID|treatment|species  |strain  |
|--------|---------|---------|--------|
|ContTia1|ContTia  |coluzzii |Tiassale|
|ContTia2|ContTia  |coluzzii |Tiassale|
|ContTia4|ContTia  |coluzzii |Tiassale|
|MalaTia1|MalaTia  |coluzzii |Tiassale|
|MalaTia2|MalaTia  |coluzzii |Tiassale|
|MalaTia4|MalaTia  |coluzzii |Tiassale|

In the config.yaml, we will use the treatment column to specify our comparative groups for analysis.

If the strain information is not relevant to your study organism, please use the same values as for species. The strain column is used to define smaller groups within the data for principal components analysis (PCA), and is useful when analysing datasets with multiple strains.
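
As a quick sanity check before running the workflow, the metadata table can be parsed and grouped by treatment. The following is a minimal sketch, not part of RNA-Seq-Pop itself: the function names are hypothetical, and the rule that sampleID values must be unique is an assumption made for illustration.

```python
import csv
import io

# Columns shown in the example samples.tsv above.
REQUIRED_COLUMNS = ["sampleID", "treatment", "species", "strain"]


def read_metadata(handle):
    """Parse a tab-separated metadata table and check the expected columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    ids = [row["sampleID"] for row in rows]
    # Assumption: each sample appears exactly once in the table.
    if len(ids) != len(set(ids)):
        raise ValueError("sampleID values must be unique")
    return rows


def group_by_treatment(rows):
    """Map each treatment to the list of sample IDs that belong to it."""
    groups = {}
    for row in rows:
        groups.setdefault(row["treatment"], []).append(row["sampleID"])
    return groups


# A shortened copy of the example table, held in memory for the demonstration.
example = (
    "sampleID\ttreatment\tspecies\tstrain\n"
    "ContTia1\tContTia\tcoluzzii\tTiassale\n"
    "ContTia2\tContTia\tcoluzzii\tTiassale\n"
    "MalaTia1\tMalaTia\tcoluzzii\tTiassale\n"
    "MalaTia2\tMalaTia\tcoluzzii\tTiassale\n"
)

rows = read_metadata(io.StringIO(example))
print(group_by_treatment(rows))
```

To check a real file, pass an open file handle instead of the in-memory string, for example `open("config/samples.tsv", newline="")`.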


Single or Paired-end RNA-Sequencing fastq reads

Each sample requires one gzipped fastq file for single-end data, or two for paired-end data. Reads may already be trimmed; otherwise, RNA-Seq-Pop can trim them using the cutadapt module.

The read location may be specified in two ways:

  1. Reads can be named as {sampleID}_1.fastq.gz, {sampleID}_2.fastq.gz and stored in resources/reads/. In the config.yaml, this option is fastq['auto'] == True, meaning snakemake will look in this folder for files that follow this naming pattern. For single-end reads, only the {sampleID}_1.fastq.gz file is required.

  2. The user can add “fq1” and “fq2” columns to the samples.tsv metadata file, containing the path to each fastq file relative to the root rna-seq-pop directory. This allows the fastq files to be stored in any accessible location and to have arbitrary names. In the config.yaml, this option is fastq['auto'] == False. For single-end reads, only the “fq1” column is required.
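
The two options above can be sketched as a single lookup function. This is an illustration of the logic as described here, not the workflow's actual code: `fastq_paths` and its parameters are hypothetical names, and the paths in the second example are placeholders.

```python
from pathlib import PurePosixPath


def fastq_paths(row, auto=True, paired=True, reads_dir="resources/reads"):
    """Return the expected fastq path(s) for one row of the metadata table.

    auto=True  : build paths from the sampleID naming pattern.
    auto=False : read paths from the fq1 / fq2 metadata columns.
    """
    if auto:
        names = [f"{row['sampleID']}_1.fastq.gz"]
        if paired:
            names.append(f"{row['sampleID']}_2.fastq.gz")
        return [str(PurePosixPath(reads_dir) / name) for name in names]

    columns = ["fq1", "fq2"] if paired else ["fq1"]
    missing = [c for c in columns if not row.get(c)]
    if missing:
        raise ValueError(f"{row['sampleID']}: missing metadata columns {missing}")
    return [row[c] for c in columns]


# Option 1: automatic naming in resources/reads/.
print(fastq_paths({"sampleID": "ContTia1"}))

# Option 2: explicit paths from the metadata table (placeholder paths).
row = {"sampleID": "ContTia1", "fq1": "data/a.fastq.gz", "fq2": "data/b.fastq.gz"}
print(fastq_paths(row, auto=False))
```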


Reference genome files

RNA-Seq-Pop uses kallisto to quantify transcript abundance for differential expression analysis. Kallisto takes a reference transcriptome as input, so the user must provide the transcriptome sequence as a fasta file. For variant calling, reads are aligned to the genome with hisat2, which requires a fasta file containing the genome sequence. All input .fa files can be gzipped (.fa.gz).

The user provides the path to the reference files in the configuration file (config.yaml).

  1. Genome chromosomes reference file (.fa/.fa.gz). Contains the DNA sequence for the genome in fasta format.

  2. Transcriptome reference file (.fa/.fa.gz). Contains the DNA sequence for each transcript in fasta format.

  3. Genome feature file (.gff3 format).

  4. Gene to transcript mapping file (.tsv). An example is provided in the GitHub repo (resources/exampleGene2TranscriptMap.tsv). The file should contain four columns: GeneID, TranscriptID, GeneName, and GeneDescription. It is needed to connect transcripts to their parent genes and to add gene annotations to the results. Files for Anopheles gambiae, Anopheles funestus and Aedes aegypti are provided in the GitHub repo.

  5. SnpEff database name (if performing variant calling).
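
To illustrate how the gene to transcript mapping file connects transcripts to their parent genes, here is a minimal sketch that builds a lookup table from it. It assumes the file has a header row with the four column names listed above; the function name and the gene and transcript IDs in the example are placeholders, not real annotations.

```python
import csv
import io

MAP_COLUMNS = ["GeneID", "TranscriptID", "GeneName", "GeneDescription"]


def transcript_to_gene(handle):
    """Build a TranscriptID -> GeneID lookup from a tab-separated mapping table."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in MAP_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return {row["TranscriptID"]: row["GeneID"] for row in reader}


# Placeholder rows: two transcripts of one gene, one transcript of another.
example_map = (
    "GeneID\tTranscriptID\tGeneName\tGeneDescription\n"
    "GENE0001\tGENE0001-RA\tgeneA\tplaceholder description\n"
    "GENE0001\tGENE0001-RB\tgeneA\tplaceholder description\n"
    "GENE0002\tGENE0002-RA\tgeneB\tplaceholder description\n"
)

lookup = transcript_to_gene(io.StringIO(example_map))
print(lookup["GENE0001-RB"])
```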


Contigs

The user must provide a list of contigs that they wish to analyse in the configuration file (config.yaml). These must match entries in the reference files (.fa, gff3).

For example, in An. gambiae, we refer to the chromosomes as [“2L”, “2R”, “3L”, “3R”, “X”], whereas in Aedes aegypti, they are simply [“1”, “2”, “3”].

We specify contigs explicitly because in some reference assemblies we have a few large full chromosomes, plus hundreds of small contigs, which we may not necessarily wish to analyse.
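
Because the contig names must match the reference files, it can help to compare the list against the fasta headers before starting a run. The sketch below is one way to do this, under the assumption that the sequence name is the first whitespace-delimited token after ">" on each header line; all function names are hypothetical, and the toy fasta stands in for a real genome file.

```python
import gzip
import io


def open_fasta(path):
    """Open a plain or gzipped fasta file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_sequence_names(handle):
    """Collect sequence names from fasta header lines."""
    names = []
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare ">" with no name
                names.append(fields[0])
    return names


def missing_contigs(contigs, handle):
    """Return the requested contigs that are absent from the fasta."""
    present = set(fasta_sequence_names(handle))
    return [c for c in contigs if c not in present]


# Toy fasta: five chromosomes plus one small unplaced contig.
toy_fasta = (
    ">2L chromosome\nACGT\n"
    ">2R chromosome\nACGT\n"
    ">3L chromosome\nACGT\n"
    ">3R chromosome\nACGT\n"
    ">X chromosome\nACGT\n"
    ">UNKN unplaced\nACGT\n"
)

contigs = ["2L", "2R", "3L", "3R", "X"]
print(missing_contigs(contigs, io.StringIO(toy_fasta)))
```

For a real reference, pass `open_fasta(path)` as the handle. An empty list means every requested contig was found.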