NCBI
- NCBI website is very, very comprehensive
- The Entrez cross-referencing system helps retrieve linked pieces of data of different types.
BLAST
- Using BLAST is a blast!
- Choice of database is a crucial trade-off between efficiency and sensitivity
Multiple sequence alignment
Phylogenetics
- The simplest of the trees are distance-based.
- UPGMA works by clustering the two closest leaves and recalculating the distance matrix.
- Neighbor-joining is distance-based and fast, but not necessarily very accurate
- Maximum-likelihood is slower, but more accurate
- Random genetic drift has a large influence on the probability of fixation of alleles in small populations, even for non-neutral alleles.
- Random genetic drift has very little influence on the probability of fixation of alleles in large populations.
- Slightly deleterious mutations can become fixed in the population through random genetic drift, if the population is small enough and the selective value is not too large.
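The UPGMA key point above (cluster the two closest leaves, then recalculate the distance matrix) can be made concrete with a small sketch. This is an illustration only, not the implementation of any particular phylogenetics package; the function name `upgma`, the `frozenset`-keyed distance dictionary, and the toy distances are assumptions made for this example. The tree is returned as a Newick string.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves with UPGMA and return the tree as a Newick string.

    names: list of leaf labels
    dist:  dict mapping frozenset({a, b}) -> distance between leaves a and b
    """
    # Active clusters: Newick fragment -> (number of leaves, node height)
    clusters = {name: (1, 0.0) for name in names}
    d = dict(dist)  # work on a copy so the caller's matrix is untouched
    while len(clusters) > 1:
        # Step 1: find the two closest clusters
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (na, ha), (nb, hb) = clusters[a], clusters[b]
        height = d[frozenset((a, b))] / 2  # new node sits halfway between them
        merged = f"({a}:{height - ha:g},{b}:{height - hb:g})"
        # Step 2: recalculate distances as a size-weighted average
        del clusters[a], clusters[b]
        for c in clusters:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        clusters[merged] = (na + nb, height)
    (tree,) = clusters
    return tree + ";"


# Toy distance matrix (made-up numbers, for illustration only)
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
toy_dist = {frozenset(k): v for k, v in pairs.items()}
print(upgma(["A", "B", "C", "D"], toy_dist))
```

Because every merge places the new node at half the cluster distance, all leaves end up equidistant from the root, which is the molecular-clock assumption built into UPGMA.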
Reads, QC and trimming
- wget is a computer program to get data from the internet
- for loops let you perform the same set of operations on multiple files with a single command
- Sequencing data is large
- In bioinformatic workflows the output of one tool is the input of the next.
- FastQC is used to judge the quality of sequencing reads.
- Data cleaning is an essential step in a genomics pipeline.
- We will work towards confirming or disputing transmission in TB cases
- After this practical training you will have some familiarity with working on the command line
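To illustrate what read quality and data cleaning mean in practice, the sketch below parses FASTQ records and trims low-quality bases from the 3' end of each read. It is a simplified illustration, not how FastQC or any specific trimming tool works internally. It assumes the common Phred+33 quality encoding; the function names and the quality threshold of 20 are choices made for this example.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def trim_3prime(seq, qual, min_q=20):
    """Remove bases from the 3' end until one reaches min_q (example threshold)."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def read_fastq(lines):
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.strip(), seq.strip(), qual.strip()


# Two made-up reads; 'I' encodes Phred 40 and '#' encodes Phred 2
example = [
    "@read1", "ACGTACGT", "+", "IIIIII##",
    "@read2", "TTGACCAA", "+", "IIIIIIII",
]
for header, seq, qual in read_fastq(example):
    trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
    print(header, trimmed_seq, len(seq) - len(trimmed_seq), "bases trimmed")
```

The loop over records mirrors the idea of a for loop over files: the same operation is applied to every item with a single command.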
Sequence assembly
- Assembly is a process which aligns and merges fragments from a longer DNA sequence in order to reconstruct the original sequence.
- k-mers are short fragments of DNA of length k
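As a small illustration of the k-mer definition, the sketch below lists every overlapping substring of length k in a sequence and counts how often each occurs. The function name `kmers` and the example sequence are made up for this illustration; real assemblers use far more memory-efficient representations.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# A sequence of length L contains L - k + 1 k-mers
sequence = "ATGGCGTGCA"  # made-up example sequence
counts = Counter(kmers(sequence, 3))
for kmer, n in counts.items():
    print(kmer, n)
```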
Mapping
- Single nucleotide polymorphisms can be identified by mapping reads to a reference genome
- Parameters for the analysis have to be selected based on expected outcomes for this organism
- Concatenation of SNPs helps to reduce analysis volume
- Phylogenetic trees can be written with a bracket syntax in Newick format
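The point about concatenating SNPs can be sketched as follows: given sequences aligned to the same reference, only the columns where at least two samples differ are kept, which shrinks the data that the tree-building step has to process. The function name `variable_sites` and the toy alignment are assumptions for this illustration, and the sketch ignores gaps, ambiguous bases, and quality filters that a real pipeline would need to handle.

```python
def variable_sites(alignment):
    """Keep only the alignment columns in which the samples differ.

    alignment: dict mapping sample name -> aligned sequence (equal lengths)
    Returns (positions, snps), where positions are 0-based column indices
    and snps maps each sample name to its concatenated SNP string.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    positions = [i for i in range(length) if len({s[i] for s in seqs}) > 1]
    snps = {name: "".join(seq[i] for i in positions)
            for name, seq in alignment.items()}
    return positions, snps


# Toy alignment (made-up sequences)
toy = {"ref": "ACGTACGT", "s1": "ACGTACGA", "s2": "ATGTACGA"}
positions, snps = variable_sites(toy)
print(positions)
print(snps)
```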
Molecular epidemiology
- SNP phylogeny and metadata can convey different messages
- Human interpretation is often needed to weigh the different information sources.
- The low mutation rate of M. tuberculosis does not allow confident inference of transmission, but it does allow transmission to be excluded.
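The asymmetry in the last point can be illustrated with pairwise SNP distances: a large distance between two isolates argues against direct transmission, while a small distance leaves it open, because the low mutation rate means even unrelated cases may differ by few SNPs. In the sketch below the function names, the sample labels, and the cutoff passed as `max_snps` are hypothetical placeholders, not a recommended threshold; any real cutoff would have to come from the literature for the organism and be weighed against the metadata.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count positions at which two equal-length SNP strings differ."""
    if len(a) != len(b):
        raise ValueError("SNP strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def flag_pairs(snps, max_snps):
    """Label each sample pair by comparing its distance to max_snps.

    max_snps is a hypothetical cutoff supplied by the caller.
    A small distance is reported as 'not excluded', never as 'confirmed'.
    """
    result = {}
    for (n1, s1), (n2, s2) in combinations(snps.items(), 2):
        d = snp_distance(s1, s2)
        label = "transmission unlikely" if d > max_snps else "not excluded"
        result[(n1, n2)] = (d, label)
    return result


# Made-up concatenated SNP strings for three hypothetical patients
toy_snps = {"P1": "CTAG", "P2": "CTAG", "P3": "TAGC"}
for pair, (d, label) in flag_pairs(toy_snps, max_snps=2).items():
    print(pair, d, label)
```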