Reads, QC and trimming

Last updated on 2025-10-21 | Edit this page

Estimated time: 150 minutes

This episode starts a series of labs where you will trace an outbreak of Mycobacterium tuberculosis, from downloading the reads to making assumptions about the transmissions between patients.

The first part, here, describes how to retrieve all necessary data from public repositories - the raw sequenced data of our isolates and a reference genome. It also introduces for loops which we will use for the rest of our analysis.

Overview

Questions

Trace an outbreak of Mycobacterium, from reads to phylogenetic tree.
How can I organize my file system for a new bioinformatics project?
How and where can data be downloaded?
How can I describe the quality of my data?
How can I get rid of sequence data that doesn’t meet my quality standards?

Objectives

How do you use genomics and phylogenetics to trace a microbial outbreak?
Create a file system for a bioinformatics project.
Download files necessary for further analysis.
Use ‘for’ loops to automate operations on multiple files
Interpret a FastQC plot summarizing per-base quality across all reads.
Clean FastQC reads for further analysis.
Use for loops to automate operations on multiple files.

Introduction

This part of the course is largely inspired from another tutorial developed by Anita Schürch. We will use (part of) the data described in Bryant et al., 2013.

Background of the data

Epidemiological contact tracing through interviewing of patients can identify potential chains of patients that transmitted an infectious disease to each other. Contact tracing was performed following the stone-in-the-pond principle, which interviews and tests potential contacts in concentric circles around a potential bash case.

Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis. It mostly affects the lungs. An infection with M. tuberculosis is often asymptomatic (latent infection). Only in about 10% of the cases the latent infection progresses to an active infection during a patients lifetime, which, if untreated, leads to death in about half of the cases. The symptoms of an active TB infection include cough, fever, night sweats, weight loss etc. An active TB infection can spread. Once exposed, people often have latent TB. To identify people with latent TB, a skin test can be applied.

Here we have 7 tuberculosis patients with active TB, that form three separate clusters of potential transmission as determined by epidemiological interviews. Patients were asked if they have been in direct contact with each other, or if they visited the same localities. From all patients, a bacterial isolate was grown, DNA isolated, and whole-genome sequenced on an Illumina sequencer.

The three clusters consist of:

Cluster 1
- Patient A1 (2011) - sample ERR029207
- Patient A2 (2011) - sample ERR029206
- Patient A3 (2008) - sample ERR026478
Cluster 2
- Patient B1 (2001) - sample ERR026474
- Patient B2 (2012) - sample ERR026473
Cluster 3
- Patient C1 (2011) - sample ERR026481
- Patient C2 (2016) - sample ERR026482

The second goal of this lab is to affirm or dispute the inferred transmissions by comparing the bacterial genomes with each other. We will do that by identifying SNPs between the genomes.

Getting your project started

Project organization is one of the most important parts of a sequencing project, and yet is often overlooked amidst the excitement of getting a first look at new data. Of course, while it’s best to get yourself organized before you even begin your analyses,it’s never too late to start, either.

Genomics projects can quickly accumulate hundreds of files across tens of folders. Every computational analysis you perform over the course of your project is going to create many files, which can especially become a problem when you’ll inevitably want to run some of those analyses again. For instance, you might have made significant headway into your project, but then have to remember the PCR conditions you used to create your sequencing library months prior.

Other questions might arise along the way:

What were your best alignment results?
Which folder were they in: Analysis1, AnalysisRedone, or AnalysisRedone2?
Which quality cutoff did you use?
What version of a given program did you implement your analysis in?

In this exercise we will setup a file system for the project we will be working on during this workshop.

Challenge

We will start by creating a sub directory that we can use for the rest of the labs. First navigate to your own directory on Uppmax and start an interactive session. Then confirm that you are in the correct directory using the pwd command.

You should see the output:

OUTPUT

/proj/g2020004/nobackup/3MK013/<username>

Show me the solution

BASH

interactive ...
cd /proj/g2020004/nobackup/3MK013/<username>
pwd

Instructor Note

BASH

interactive -A uppmax2025-3-4 -M snowy -t 04:00:00
cd /proj/g2020004/nobackup/3MK013/<username>
pwd

Challenge

Exercise

Use the mkdir command to make the following directories:

molepi
molepi/docs
molepi/data
molepi/results

Show me the solution

BASH

mkdir molepi molepi/docs molepi/data molepi/results

Use ls -R to verify that you have created these directories. The -R option for ls stands for recursive. This option causes ls to return the contents of each subdirectory within the directory iteratively.

BASH

ls -R molepi

You should see the following output:

OUTPUT

molepi/:
data  docs  results

molepi/data:

molepi/docs:

molepi/results:

Selection of a reference genome

Reference sequences (including many pathogen genomes) are available at NCBI’s refseq database

A reference genome is a genome that was previously sequenced and is closely related to the isolates we would like to analyse. The selection of a closely related reference genome is not trivial and will warrant an analysis in itself. However, for simplicity, here we will work with the M. tuberculosis reference genome H37Rv.

Challenge

Download reference genomes from NCBI

Download the M.tuberculosis reference genome from the NCBI ftp site.

First, we switch to the data subfolder to store all our data

Show me the solution

BASH

cd molepi/data

Challenge

Download reference genomes from NCBI (continued)

The reference genome will be downloaded programmatically from NCBI with wget. wget is a computer program that retrieves content from web servers. Its name derives from World Wide Web and get.

Find the location of the gzipped fna file for that reference genome via NCBI’s genome, by finding the FTP site and copying the link to the *_genomic.fna.gz file and using that as an argument to wget.

Give me a hint

BASH

wget -h

Show me the solution

Search for Mycobacterium tuberculosis on the Datasets/Genome website. Identify the “reference genome” (at the top), click on the accession number, and find the FTP tab. There is many files, among which GCF_000195955.2_ASM19595v2_genomic.fna.gz. Right-click and copy the link.

Instructor Note

BASH

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_genomic.fna.gz

Challenge

Download reference genomes from NCBI (continued)

This file is compressed as indicated by the extension of .gz. It means that this file has been compressed using the gzip command.

Extract the file by typing

BASH

gunzip GCF_000195955.2_ASM19595v2_genomic.fna.gz

Make sure that is was extracted

BASH

ls

OUTPUT

GCF_000195955.2_ASM19595v2_genomic.fna

Challenge: What is the size of the genome?

Find out how many nucleotides the genome has. There is a software called SeqKit (and a corresponding module at Uppmax) that can help.

BASH

module load bioinfo-tools SeqKit
seqkit -h

Get assembly statistics from FASTA and FASTQ files.

Show me the solution

The genome has 4’411’532 bp. The stat command of seqkit is helpful.

OUTPUT

file                                    format  type  num_seqs    sum_len    min_len    avg_len    max_len
GCF_000195955.2_ASM19595v2_genomic.fna  FASTA   DNA          1  4,411,532  4,411,532  4,411,532  4,411,532

Instructor Note

BASH

seqkit stats GCF_000195955.2_ASM19595v2_genomic.fna

Loops

Loops are key to productivity improvements through automation as they allow us to execute commands repeatedly. Similar to wildcards and tab completion, using loops also reduces the amount of typing (and typing mistakes). Our next task is to download our data (see the introduction to this episode) from the short read archive (SRA) at the European Nucleotide Archive (ENA). There are many repositories for public data. Some model organisms or fields have specific databases, and there are ones for particular types of data. Two of the most comprehensive are the National Center for Biotechnology Information (NCBI) and European Nucleotide Archive (EMBL-EBI). In this lesson we’re working with the ENA, but the general process is the same for any database.

We can do this one by one but given that each download takes about one to two hours, this could keep us up all night. Instead of downloading one by one we can apply a loop. Let’s see what that looks like and then we’ll discuss what we’re doing with each line of our loop.

BASH

for sample in ERR029207 ERR029206 ERR026478 ERR026474 ERR026473 ERR026481 ERR026482
do
  echo -e "$sample"
  wget --spider ftp://ftp.sra.ebi.ac.uk/vol1/fastq/"${sample:0:6}"/"${sample}"/"${sample}"_*.fastq.gz
done

When the shell sees the keyword for, it knows to repeat a command (or group of commands) once for each item in a list. Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable, and the commands inside the loop are executed, before moving on to the next item in the list.

Inside the loop, we call for the variable’s value by putting $ in front of it. The $ tells the shell interpreter to treat the variable as a variable name and substitute its value in its place, rather than treat it as text or an external command.

In this example, the list is seven accession numbers of reads belonging to the genomes we are interested in. Each time the loop iterates, it will assign a sample name to the variable sample and echo (print) the value of the variable. In the second line, it will run the wget command.

The first time through the loop, $sample is ERR029207. The FTP site at EBI is constructed so that fastq files are stored in a vol1/fastq subfolder. Read files are not directly in that subfolder, but grouped in subfolders starting with the six first characters of the accession number (ERR029 in the first case), then in the accession number subfolder. To get to the right place, the file path is constructed with the root path (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/), to which the first 6 characters of the accession number is added through that special variable operation ("${sample:0:6}") to get to the right subfolder. Finally, the correct accession subfolder ("${sample}"), the file name prefix ("${sample}"_*.fastq.gz") and the suffix (_*.fastq.gz) are added. The * ensures that we get reads from both ends.

Use {} to wrap the variable so that .fastq.gz will not be interpreted as part of the variable name. In addition, quoting the shell variables is a good practice AND necessary if your variables have spaces in them.

For the second iteration, $sample becomes ERR029206.

We added a --spider option to wget just for this exercise, to retrieve only the names of the files and not the actual file, to spare some time downloading files. The data is present in the data/fastq subfolder of our group folder:

BASH

ls /proj/g2020004/nobackup/3MK013/data/fastq

OUTPUT

ERR026473_1.fastq.gz  ERR026474_2.fastq.gz  ERR026481_1.fastq.gz  ERR026482_2.fastq.gz	ERR029207_1.fastq.gz
ERR026473_2.fastq.gz  ERR026478_1.fastq.gz  ERR026481_2.fastq.gz  ERR029206_1.fastq.gz	ERR029207_2.fastq.gz
ERR026474_1.fastq.gz  ERR026478_2.fastq.gz  ERR026482_1.fastq.gz  ERR029206_2.fastq.gz

For more, check Bash Pitfalls

Callout

Follow the Prompt

The shell prompt changes from $ to > and back again as we were typing in our loop. The second prompt, >, is different to remind us that we haven’t finished typing a complete command yet. A semicolon, ;, can be used to separate two commands written on a single line.

Callout

Same Symbols, Different Meanings

Here we see > being used a shell prompt, whereas > is also used to redirect output. Similarly, $ is used as a shell prompt, but, as we saw earlier, it is also used to ask the shell to get the value of a variable.

If the shell prints > or $ then it expects you to type something, and the symbol is a prompt.

If you type > or $ yourself, it is an instruction from you that the shell to redirect output or get the value of a variable.

We have called the variable in this loop sample in order to make its purpose clearer to human readers. The shell itself doesn’t care what the variable is called; if we wrote this loop as:

BASH

for x in ERR01 ERR02 ERR03
do
  echo ftp://ftp.sra.ebi.ac.uk/"${x}".fastq.gz
done

or:

BASH

for pizza in ERR01 ERR02 ERR03
do
  echo ftp://ftp.sra.ebi.ac.uk/"${pizza}".fastq.gz
done

it would work exactly the same way. Don’t do this. Programs are only useful if people can understand them, so meaningless names (like x) or misleading names (like pizza) increase the odds that the program won’t do what its readers think it does.

Callout

Multipart commands

The for loop is interpreted as a multipart command. If you press the up arrow on your keyboard to recall the command, it may be shown like so (depends on your bash version):

BASH

for sample in ERR01 ERR02 ERR03; do echo ftp://ftp.sra.ebi.ac.uk/"${sample}".fastq.gz ; done

When you check your history later, it will help your remember what you did!

Link the read files

Now link the files from the data folder, to perform quality control.

Challenge

All the read files start with ERR and have a fastq.gz extension. It means they are in fastq format and compressed. Softlink the files in your molepi/data folder, using a nested loop. The data/fastq folder is three levels up. For each accession number there are two fastq.gz files, one for each end (1 or 2).

Give me a hint

Softlinks are done with ln -s. Use the same loop backbone as above, and nest a for loop over the two ends of the reads (1 and 2).

Show me the solution

BASH

cd /proj/g2020004/nobackup/3MK013/<username>/molepi/data
for sample in <...>
do
  for <loop over 1 2>
  do
    <softlink> ../../../data/fastq/<build the sample with $ and double quotes>.fastq.gz .
  done
done

Instructor Note

BASH

cd /proj/g2020004/nobackup/3MK013/<username>/molepi/data
for sample in ERR029207 ERR029206 ERR026478 ERR026474 ERR026473 ERR026481 ERR026482
do
  for end in 1 2
  do
    ln -s ../../../data/fastq/"${sample}"_"${end}".fastq.gz .
  done
done

Bioinformatics workflows

When working with high-throughput sequencing data, the raw reads you get off of the sequencer will need to pass through a number of different tools in order to generate your final desired output. The execution of this set of tools in a specified order is commonly referred to as a workflow or a pipeline.

An example of the workflow we will be using is provided below with a brief description of each step.

Schematic workflow of the bioinformatics workflow

Quality control - Assessing quality using FastQC
Quality control - Trimming and/or filtering reads (if necessary)
Align reads to reference genome - Snippy
Variant calling - Snippy
Clustering variants - iTOL
Assembly - SKESA
Annotation (extra material)
Pangenome analysis (extra material)
Clustering presence and absence of genes (extra material)
Comparison of clustering methods (extra material)
Data visualization - Microreact

These workflows in bioinformatics adopt a plug-and-play approach in that the output of one tool can be easily used as input to another tool without any extensive configuration. Having standards for data formats is what makes this feasible. Standards ensure that data is stored in a way that is generally accepted and agreed upon within the community. The tools that are used to analyze data at different stages of the workflow are therefore built under the assumption that the data will be provided in a specific format.

Quality Control

The first step in the variant calling workflow is to take the FASTQ files received from the sequencing facility and assess the quality of the sequence reads.

You’ve already seen that part, but repetition doesn’t hurt. If you feel that it’s anyway unnecessary, move on to the next part.

Details on the FASTQ format

Although it looks complicated (and it is), it’s easy to understand the fastq format with a little decoding. Some rules about the format include…

Line	Description
1	Always begins with ‘@’ and then information about the read
2	The actual DNA sequence
3	Always begins with a ‘+’ and sometimes the same info in line 1
4	Has a string of characters which represent the quality scores; must have same number of characters as line 2

We can view the first complete read in one of the files our dataset by using head to look at the first four lines.

Challenge

You will use zcat to extract text from a gzipped file.

OUTPUT

@ERR029206.1 IL31_5505:8:1:6233:1087#4/1
TCNAGTCAGCACACACATGCGAAAGAATCCACCGACTAGGGTCAGCGGGGTTTGCAGTTGGTCGCGGACGTAACCG
+
<='=ABBBBCBBBBBBBBC>BBCBBBBBB@BBBBBB=BCBB@@B@@BB@@,@@@@@@@@@@@@@@@@@@@@@@@@@

Give me a hint

BASH

zcat -h
head -h

Show me the solution

BASH

zcat ERR029206_1.fastq.gz | head -n4

One of the nucleotides in this read is unknown (N).

Line 4 shows the quality for each nucleotide in the read. Quality is interpreted as the probability of an incorrect base call (e.g. 1 in 10) or, equivalently, the base call accuracy (eg 90%). To make it possible to line up each individual nucleotide with its quality score, the numerical score is converted into a code where each individual character represents the numerical quality score for an individual nucleotide. For example, in the line above, the quality score line is:

OUTPUT

<='=ABBBBCBBBBBBBBC>BBCBBBBBB@BBBBBB=BCBB@@B@@BB@@,@@@@@@@@@@@@@@@@@@@@@@@@@

The < character and each of the = characters represent the encoded quality for an individual nucleotide. The numerical value assigned to each of these characters depends on the sequencing platform that generated the reads. The sequencing machine used to generate our data uses the standard Sanger quality PHRED score encoding, using by Illumina version 1.8 onwards. Each character is assigned a quality score between 0 and 40 as shown in the chart below.

OUTPUT

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
                  |         |         |         |         |
Quality score:    0........10........20........30........40

Each quality score represents the probability that the corresponding nucleotide call is incorrect. This quality score is logarithmically based, so a quality score of 10 reflects a base call accuracy of 90%, but a quality score of 20 reflects a base call accuracy of 99%. These probability values are the results from the base calling algorithm and dependent on how much signal was captured for the base incorporation.

Looking back at our read:

OUTPUT

@ERR029206.1 IL31_5505:8:1:6233:1087#4/1
TCNAGTCAGCACACACATGCGAAAGAATCCACCGACTAGGGTCAGCGGGGTTTGCAGTTGGTCGCGGACGTAACCG
+
<='=ABBBBCBBBBBBBBC>BBCBBBBBB@BBBBBB=BCBB@@B@@BB@@,@@@@@@@@@@@@@@@@@@@@@@@@@

We can now see that the quality of an N (') is 6, which corresponds to a over 25% error probability:

$10^{-\frac{6}{10}} \cong 0.25$

Quality Encodings Vary

Although we’ve used a particular quality encoding system to demonstrate interpretation of read quality, different sequencing machines use different encoding systems. This means that, depending on which sequencer you use to generate your data, a # may not be an indicator of a poor quality base call.

Callout

This mainly relates to older Solexa/Illumina data, but it’s essential that you know which sequencing platform was used to generate your data, so that you can tell your quality control program which encoding to use. If you choose the wrong encoding, you run the risk of throwing away good reads or (even worse) not throwing away bad reads!

Assessing Quality using FastQC

In real life, you won’t be assessing the quality of your reads by visually inspecting your FASTQ files. Rather, you’ll be using a software program to assess read quality and filter out poor quality reads. We’ll first use a program called FastQC to visualize the quality of our reads.Later in our workflow, we’ll use another program to filter out poor quality reads.

FastQC has a number of features which can give you a quick impression of any problems your data may have, so you can take these issues into consideration before moving forward with your analyses. Rather than looking at quality scores for each individual read, FastQC looks at quality collectively across all reads within a sample. The image below shows a FastQC-generated plot that indicates a very high quality sample:

A FastQC output showing reads with good quality

The x-axis displays the base position in the read, and the y-axis shows quality scores. In this example, the sample contains reads that are 40 bp long. For each position, there is a box-and-whisker plot showing the distribution of quality scores for all reads at that position. The horizontal red line indicates the median quality score and the yellow box shows the 2nd to 3rd quartile range. This means that 50% of reads have a quality score that falls within the range of the yellow box at that position. The whiskers show the range to the 1st and 4th quartile.

For each position in this sample, the quality values do not drop much lower than 32. This is a high quality score. The plot background is also color-coded to identify good (green), acceptable (yellow), and bad (red) quality scores.

Now let’s take a look at a quality plot on the other end of the spectrum.

A FastQC output showing reads with bad quality

Here, we see positions within the read in which the boxes span a much wider range. Also, quality scores drop quite low into the “bad” range, particularly on the tail end of the reads. The FastQC tool produces several other diagnostic plots to assess sample quality, in addition to the one plotted above.

Running FastQC

Navigate to your FASTQ dataset:

BASH

cd /proj/g2020004/nobackup/3MK013/<username>/molepi/data

Challenge

Exercise

We softlinked the samples. How many FASTQ files are in this dataset? Why is this? What is the difference between the two files that start with the same name? Discuss with your neighbour.

Show me the solution

There are 14 FASTQ files. Each sample has paired-end reads.

FastQC can accept multiple file names as input, including files that are compressed (.gz) so we can use the *.fastq.gz wildcard to run FastQC on both the compressed FASTQ files for sample ERR029206 in this directory.

BASH

module load FastQC
fastqc ERR029206*.fastq.gz

You will see an automatically updating output message telling you the progress of the analysis. It will start like this:

OUTPUT

Started analysis of ERR029206_1.fastq.gz
Approx 5% complete for ERR029206_1.fastq.gz
Approx 10% complete for ERR029206_1.fastq.gz
Approx 15% complete for ERR029206_1.fastq.gz
Approx 20% complete for ERR029206_1.fastq.gz
Approx 25% complete for ERR029206_1.fastq.gz
Approx 30% complete for ERR029206_1.fastq.gz

In total, it should take a couple minutes minutes for FastQC to run on our two FASTQ files. When the analysis completes, your prompt will return. So your screen will look something like this:

OUTPUT

Approx 85% complete for ERR029206_2.fastq.gz
Approx 90% complete for ERR029206_2.fastq.gz
Approx 95% complete for ERR029206_2.fastq.gz
Analysis complete for ERR029206_2.fastq.gz
data $

The FastQC program has created several new files within our /data/ directory.

BASH

ls

OUTPUT

...
[...]
ERR029206_1.fastq.gz
ERR029206_1_fastqc.html
ERR029206_1_fastqc.zip
ERR029206_2.fastq.gz
ERR029206_2_fastqc.html
ERR029206_2_fastqc.zip
[...]

For each input gzipped FASTQ file, FastQC has created a .zip file and a .html file. The .zip file extension indicates that this is actually a compressed set of multiple output files. We’ll be working with these output files soon. The .html file is a stable webpage displaying the summary report for each of our samples.

We want to keep our data files and our results files separate, so we will move these output files into a new directory within our results/ directory.

BASH

mkdir ../results/fastqc_untrimmed_reads
mv *.zip ../results/fastqc_untrimmed_reads/
mv *.html ../results/fastqc_untrimmed_reads/

Now we can navigate into this results directory and do some closer inspection of our output files.

BASH

cd ../results/fastqc_untrimmed_reads/

Viewing HTML files

If we are working on our local computers, we may display each of these HTML files as a webpage. Otherwise, copy the files to your local computer using scp and use your browser:

BASH

firefox ERR029206_1_fastqc.html

You may need to replace the command firefox with the name of another web navigator if you have one.

Discussion

Exercise

Discuss your results with a neighbor. Do the sample looks good in terms of per base sequence quality?

Discussion

Exercise

What are the read lengths of the different samples?

Cleaning Reads

It’s very common to have some reads within a sample, or some positions (near the beginning or end of reads) across all reads that are low quality and should be discarded. We will use a program called seqtk to filter poor quality reads and trim poor quality bases from our samples.

Seqtk Options

Seqtk is a program written in C and aims to be a Swiss Army Knife for sequencing reads. You don’t need to learn C to use Seqtk, but the fact that it’s a C program helps explain the syntax that is used to run Seqtk. Seqtk takes as input files either FASTQ files or gzipped FASTQ files and outputs FASTQ or FASTA files. The basic command to run Seqtk starts like this:

BASH

module load seqtk
seqtk

That’s just the basic command, however. Seqtk has a variety of options and parameters. We will need to specify what options we want to use for our analysis. Here are some of the options:

option	meaning
`seq`	common transformation of FASTA/Q
`comp`	get the nucleotide composition of FASTA/Q
`trimfq`	trim FASTQ using the Phred algorithm

In addition to these options, there are a number if trimming options available:

BASH

seqtk trimfq

step	meaning
`-q`	error rate threshold (disabled by -b/-e) [0.05]
`-l`	maximally trim down to INT bp (disabled by -b/-e) [30]
`-b`	trim INT bp from left (non-zero to disable -q/-l) [0]
`-e`	trim INT bp from right (non-zero to disable -q/-l) [0].

We will use only a few of these options in our analysis. It is important to understand the steps you are using to clean your data.

A complete command for trimming with seqtk will look something like this:

BASH

seqtk trimfq -q 0.01 ERR01_1.fastq.gz > ERR01_1_trim.fastq

Trimming

Now we will run seqtk trimfq on our data. To begin, make sure you are still in your data directory:

BASH

pwd

We are going to run seqtk on one sample giving it an error rate threshold of 0.01 which indicates the base call accuracy. We request that, after trimming, the chances that a base is called incorrectly are only 1 in 10000.

BASH

seqtk trimfq -q 0.01 ERR029206_1.fastq.gz > ERR029206_1_trim.fastq

Notice that we needed to redirect the output to a file. If we don’t do that, the trimmed fastq data will be displayed in the console.

Challenge

Exercise

Use seqtk fqchk to compare the untrimmed and trimmed reads of ERR029206_1 in terms of number of sequenced bases, percentage of A,G,C,T and N and average quality. What do you notice? Discuss with your neighbor.

Show me the solution

BASH

seqtk fqchk ERR029206_1.fastq.gz | head -n 3

OUTPUT

min_len: 76; max_len: 76; avg_len: 76.00; 36 distinct quality values
POS #bases  %A  %C  %G  %T  %N  avgQ    errQ    %low    %high
ALL 315791552   18.0    32.1    32.2    17.7    0.0 30.6    21.4    5.4 94.6

BASH

seqtk fqchk ERR029206_1_trim.fastq | head -n 3

OUTPUT

min_len: 30; max_len: 76; avg_len: 69.12; 36 distinct quality values
POS #bases  %A  %C  %G  %T  %N  avgQ    errQ    %low    %high
ALL 287211920   18.2    32.0    31.9    17.8    0.0 31.7    28.8    1.3 98.7

We’ve just successfully trimmed one of our FASTQ files! However, there is some bad news. seqtk can only operate on one sample at a time and we have more than one sample. The good news is that we can use a for loop to iterate through our sample files quickly! This will take a few minutes.

Challenge

Write a loop which trims all .fastq.gz files and redirect the output into _trim.fastq files.

Give me a hint

Create a new variable, built from the input file name (ending in .fastq.gz).

BASH

outfile="${infile}"_trim.fastq

infile is the first variable in our loop and takes the value of each of the FASTQ files in our directory. outfile is the second variable in our loop and is defined by adding _trim.fastq to the end of our input file name. Use {} to wrap the variable so that _trim.fastq will not be interpreted as part of the variable name. In addition, quoting the shell variables is a good practice AND necessary if your variables have spaces in them.

For more, check Bash Pitfalls. There are no spaces before or after the =.

Show me the solution

BASH

for <loop over infiles ending with .fastq.gz>
do
  outfile="${infile}"_trim.fastq
  seqtk trimfq <error prob at most 0.01> <infile> > <outfile>
done

Instructor Note

BASH

for infile in *.fastq.gz
do
  outfile="${infile}"_trim.fastq
  seqtk trimfq -q 0.01 "${infile}" >  "${outfile}" 
done

Go ahead and run the for loop. It should take a few minutes for seqtk to run for each of our fourteen input files. Once it’s done running, take a look at your directory contents.

BASH

ls ERR029206*

OUTPUT

...
ERR029206_1.fastq.gz_trim.fastq  ERR029206_2.fastq.gz_trim.fastq

We’ve now completed the trimming and filtering steps of our quality control process! Before we move on, let’s move our trimmed FASTQ files to a new subdirectory within our data/ directory.

BASH

mkdir trimmed_fastq
mv *fastq.gz_trim* trimmed_fastq
cd trimmed_fastq
ls ERR029206*

OUTPUT

...
ERR029206_1.fastq.gz_trim.fastq
ERR029206_2.fastq.gz_trim.fastq

Challenge

Again, use seqtk fqchk to compare the untrimmed and trimmed reads of both samples. Note the number of bases ‘#bases’ of the trimmed and untrimmed reads. Calculate the theoretical coverage of the genomes before and after trimming, assuming that all our genomes do have the same size as our reference genome (4411532 bases).

Hint: Sum up forward and reverse reads!

Show me the solution

BASH

seqtk fqchk ERR029206_1.fastq.gz_trim.fastq | head -n 3

OUTPUT

min_len: 30; max_len: 76; avg_len: 69.12; 36 distinct quality values
POS #bases  %A  %C  %G  %T  %N  avgQ    errQ    %low    %high
ALL 287211920   18.2    32.0    31.9    17.8    0.0 31.7    28.8    1.3 98.7

BASH

seqtk fqchk ERR029206_2.fastq.gz_trim.fastq | head -n 3

OUTPUT

min_len: 30; max_len: 76; avg_len: 63.79; 36 distinct quality values
POS #bases  %A  %C  %G  %T  %N  avgQ    errQ    %low    %high
ALL 265048852   18.1    32.1    32.1    17.7    0.0 31.2    25.7    2.5 97.5

Coverage = #bases (forward + reverse) / genome size In this case:

$125.18 = \dfrac{(287211920 + 265048852)}{4411532}$

Key Points

wget is a computer program to get data from the internet
for loops let you perform the same set of operations on multiple files with a single command
Sequencing data is large
In bioinformatic workflows the output of one tool is the input of the other.
FastQC is used to judge the quality of sequencing reads.
Data cleaning is an essential step in a genomics pipeline.
We will work towards confirming or disputing transmission in TB cases
After this practical training you will have some familiarity with working on the command line