NCBI
Last updated on 2025-06-24
Estimated time: 240 minutes
Exercise 1: Global Cross-database NCBI Search
Overview
Questions
Exercise 1
- How do you efficiently and reliably navigate NCBI’s website?
Exercise 2
- How do you search sequences?
- How do you use the right BLAST flavor?
Objectives
Exercise 1
- Familiarize yourself with the differences between NCBI databases
- Browse and search the databases, using cross-links
Exercise 2
- Use and understand BLAST alignment tools
- Characterize sequences
- Understand genomic context and synteny
This part is meant to encourage you to explore NCBI resources. The questions are examples of real-life questions that you might ask yourself later. There is not necessarily exactly one solution to each question.
Start by going to NCBI’s web site, main search page: https://www.ncbi.nlm.nih.gov/search/. There, you have the list of most of NCBI’s databases, sorted by category. Take some time to explore the following sections: Literature, Genomes, Genes and Proteins.
Task 1.1: Find publications (this is a warm up!)
You want to find the following articles:
- an article about Escherichia coli O104:H4 published by Matthew K Waldor in the New England Journal of Medicine; and
- a paper that has Escherichia coli in the title, whose last author is L. Wang, and was published in PLoS ONE in 2014.
NOTE: you do not know the exact title, list of authors, PMID etc. Use your search skills (solution provided below). There may be more than one result.
Challenge 1.1.1
Find the two articles above using NCBI’s literature database.
Use the advanced search from PubMed, by clicking on “Advanced” (tutorial https://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.PubMed_Quick_Start), to build an exact search. Use the different fields to build a search that returns only a single match. In particular, look at the Search field tags.
Use search terms with search tags (“search term”[searchTag]) and combine them with Boolean operators (“AND”).
Useful search tags: Author [au], Date of publication [dp], Journal [ta], Last author [lastau], Title [ti].
- Waldor[au] AND “Escherichia coli O104:H4” AND NEJM[ta]
- “Wang L”[lastau] AND “Escherichia coli”[ti] AND 2014[dp] AND “PLOS One”[ta].
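Field-tagged queries like these can also be assembled programmatically. The sketch below is one possible way to do it in Python and is not part of the exercise: `pubmed_term` and `esearch_url` are hypothetical helper names, and the URL targets the `esearch` endpoint of NCBI's E-utilities. The code only builds the query string and the URL; it does not contact NCBI. Note that it uses plain straight quotes around phrases.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_term(*clauses):
    """Join (text, tag) pairs with AND; tag may be None for an untagged term."""
    parts = []
    for text, tag in clauses:
        # Quote multi-word terms so they are searched as a phrase.
        term = f'"{text}"' if " " in text else text
        parts.append(f"{term}[{tag}]" if tag else term)
    return " AND ".join(parts)

def esearch_url(term, db="pubmed"):
    """Build (but do not fetch) an E-utilities esearch URL for the term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})

query = pubmed_term(
    ("Wang L", "lastau"),
    ("Escherichia coli", "ti"),
    ("2014", "dp"),
    ("PLOS One", "ta"),
)
print(query)
print(esearch_url(query))
```

Changing the `db` argument (for example to `gene` or `protein`) points the same kind of query at another Entrez database, although the available field tags differ between databases.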
Challenge 1.1.2
Now try to find the most recent paper citing the first article above (by Rasko et al.). Compare the use of
- NCBI (“Cited by” link) and
- Google Scholar (https://scholar.google.com) (the language settings of Google Scholar can be adjusted in the menu under Settings) and
- Web of Knowledge (https://www.webofknowledge.com; Access from UU: https://www.ub.uu.se/ > Databases A-Z > W > Web of Science Core Collection).
Then:
- How did the three search tools compare?
- Which one made it easier to work with dates?
- Which one returned the most results?
- Which one has more complete information?
Task 1.2: Find Gene records
You want to find sequences for the subunit A of the Shiga toxin in E. coli O157:H7 Sakai. Try to search in the three following databases: Gene, Nucleotide and Protein.
Challenge 1.2.1
What do the three different databases contain? What information do you get from them?
NCBI is not as smart as Google, and copy/pasting might fail. Try spelling out Escherichia coli and removing parts of the search text.
If you get too many hits, go to the field “Search details”, and refine your search, using the [Organism] field, for example. You don’t need to get exactly one hit or to get the result you want on top. By default, the search will include everything that has the species name anywhere in the record.
The Nucleotide database contains mostly genome records, so the Shiga toxin gene might be anywhere on these contigs/genomes.
Challenge 1.2.2
What do the five top results from the Protein database show?
Task 1.3: Understand Gene records
Click on a Gene search result from Task 1.2 and skim through the entire page.
Challenge 1.3.1
Who published the sequence? When? Is there a paper?
There is often a “Bibliography” section. There is also often a “PubMed” link under “Related Information” in the right menu.
Challenge 1.3.2
What is known about the gene? Does the record include taxonomy information?
The taxonomic information is under the “Summary” section, under “Organism” and “Lineage”.
Click on the Organism link and go to the Taxonomy Browser.
Challenge 1.3.3
What can you learn about species taxonomy there? What other useful information is given on this page?
Back to the Gene page, have a look at the “Related information” section in the right column.
Challenge 1.3.4
Where can you go from there?
All the information in that section (“Related information”) comes through the Entrez system. In GenBank records, look for the /db_xref (database cross-reference) qualifiers, which provide the Entrez links.
Task 1.4: Understanding Protein records and sequences
Click on the Protein link for this gene. Look at the content of one of the records.
Challenge 1.4.1
What can you learn here? What information do you find here? What is the name of this format? Are there any known protein domains in this protein? Can you identify a link to these?
You can change the format by clicking on the top left corner.
Look also for “Related information” in the right menu and for the /db_xref links in the main file.
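To see how /db_xref qualifiers can be turned into links, here is a minimal Python sketch. The record fragment is hand-written in GenBank flat-file style for illustration and is not copied from a real record (only the GeneID is the one used in this exercise). The helper names `find_db_xrefs` and `xref_url` are hypothetical, and only two target databases are mapped.

```python
import re

# Hand-written fragment in GenBank flat-file style, for illustration only.
# It is NOT copied from a real record; the coordinates are made up.
FRAGMENT = '''
     source          1..100
                     /organism="Escherichia coli"
                     /db_xref="taxon:562"
     CDS             1..100
                     /db_xref="GeneID:916678"
'''

# A qualifier looks like /db_xref="DATABASE:IDENTIFIER".
DB_XREF = re.compile(r'/db_xref="([^":]+):([^"]+)"')

# URL patterns for two commonly cross-referenced NCBI databases.
URL_PATTERNS = {
    "GeneID": "https://www.ncbi.nlm.nih.gov/gene/{}",
    "taxon": "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id={}",
}

def find_db_xrefs(text):
    """Return a list of (database, identifier) pairs found in the text."""
    return DB_XREF.findall(text)

def xref_url(database, identifier):
    """Return a web link for a cross-reference, or None if unmapped."""
    pattern = URL_PATTERNS.get(database)
    return pattern.format(identifier) if pattern else None

for database, identifier in find_db_xrefs(FRAGMENT):
    print(database, identifier, xref_url(database, identifier))
```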
Challenge 1.4.2
Can you find only the sequence information in FASTA format? How is the format organized? What is included in the header? Discuss the advantages of the two formats. Can you find a definition of both formats on the NCBI website or elsewhere?
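As a complement to reading the format definitions, the following Python sketch shows how little is needed to read FASTA: a header line starting with `>` followed by one or more sequence lines. The function name `parse_fasta` and the two example records are made up for illustration; real headers from NCBI carry an accession and a description.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated; line breaks carry no meaning.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Made-up example: two records, the first wrapped over two lines.
EXAMPLE = """>seq1 hypothetical example protein
MKCILFKWVL
CLLLGF
>seq2
ACGT
"""

for header, sequence in parse_fasta(EXAMPLE):
    identifier = header.split()[0]  # first word of the header
    print(identifier, len(sequence))
```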
If you are stuck, here is the link to
- one Gene record: https://www.ncbi.nlm.nih.gov/gene/916678
- one Protein record: https://www.ncbi.nlm.nih.gov/protein/NP_311001.1
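The same Protein record can be requested in either format through the `efetch` endpoint of NCBI's E-utilities. The sketch below only builds the URLs and does not download anything; `efetch_url` is a hypothetical helper name, and `fasta` and `gp` (GenPept flat file) are the return types assumed here for the protein database.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(db, identifier, rettype):
    """Build (but do not fetch) an efetch URL returning plain text."""
    params = {"db": db, "id": identifier, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

# Same record, two formats: FASTA and the full GenPept flat file.
print(efetch_url("protein", "NP_311001.1", "fasta"))
print(efetch_url("protein", "NP_311001.1", "gp"))
```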
Challenge 1.4.3
Can you find a 3D protein structure for this protein?
Not all proteins have 3D structures. If your protein doesn’t have one, try the following Shiga toxin:
In the right menu, under “Related Information”, there may be a “Related Structures” link.
Challenge 1.4.4
Can you display the genome record on which the gene is encoded?
If you lost the gene page, here is another one:
There are several ways to display the GenBank file.
- In the “Genomic regions, transcripts, and products” section, there is a link to the GenBank record.
- There is another one in the “Related sequences” section (column “Accession and Version”).
Exercise 2: Sequence alignments
In this exercise, you will use the sequence alignment tool BLAST to align and retrieve sequences.
Caution
NOTE: the NCBI BLAST service can be under very heavy load (especially during daytime in the US), so don’t be surprised if a single BLAST search takes more than 10 minutes. As an alternative, you can use the BLAST service at the EBI website (http://www.ebi.ac.uk). Note that the EBI doesn’t have the same level of tool integration as NCBI, and some parts of the questions might be more challenging to answer with the EBI tools.
In general, try to reduce the load on the NCBI server and consider smaller databases than the standard nr/nt ones, for example refseq_select or refseq_protein: under the “Choose Search Set” heading on the BLAST search page, you can choose the right “Database”.
Task 2.1: Find nucleotide blast hits
Find the NCBI blast main page.
Challenge 2.1.1
- What is BLAST?
- What different kinds of BLAST searches are there?
- What is blastn? blastx? blastp? tblastn?
See the different flavors of BLAST: https://blast.ncbi.nlm.nih.gov/doc/blast-quick-start-guide/
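The five classic BLAST programs differ in the type of query, the type of database, and what gets translated before comparison. The small Python lookup below summarizes this; the table content reflects the standard BLAST programs, while the function name `programs_for` is just a hypothetical helper for this sketch.

```python
# program: (query type, database type, what is translated in six frames)
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide", "nothing"),
    "blastp": ("protein", "protein", "nothing"),
    "blastx": ("nucleotide", "protein", "query"),
    "tblastn": ("protein", "nucleotide", "database"),
    "tblastx": ("nucleotide", "nucleotide", "query and database"),
}

def programs_for(query_type, database_type):
    """List the BLAST programs that accept this query/database combination."""
    return [
        name
        for name, (query, database, _) in BLAST_PROGRAMS.items()
        if query == query_type and database == database_type
    ]

for query_type in ("nucleotide", "protein"):
    for database_type in ("nucleotide", "protein"):
        print(query_type, "vs", database_type, "->", programs_for(query_type, database_type))
```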
Find sequence hits for the following sequence, but first consider the options available on the BLAST page. One option that might come in very handy is the “open results in another window” option, which will allow you to modify your query and rerun it quickly without losing the results:
>a
GATTACTTCAGCCAAAAGGAACACCTGTATATGAAGTGTATATTATTTAAATGGGTACTG
TGCCTGTTACTGGGTTTTTCTTCGGTATCCTATTCCCGGGAGTTTACGATAGACTTTTCG
ACCCAACAAAGTTATGTCTCTTCGTTAAATAGTATACGGACAGAGATATCGACCCCTCTT
GAACATATATCTCAGGGGACCACATCGGTGTCTGTTATTAACCACACCCCACCGGGCAGT
TATTTTGCTGTGGATATACGAGGGCTTGATGTCTATCAGGCGCGTTTTGACCATCTTCGT
CTGATTATTGAGCAAAATAATTTATATGTGGCCGGGTTCGTTAATACGGCAACAAATACT
TTCTACCGTTTTTCAGATTTTACACATATATCAGTGCCCGGTGTGACAACGGTTTCCATG
ACAACGGACAGCAGTTATACCACTCTGCAACGTGTCGCAGCGCTGGAACGTTCCGGAATG
CAAATCAGTCGTCACTCACTGGTTTCATCATATCTGGCGTTAATGGAGTTCAGTGGTAAT
ACAATGACCAGAGATGCATCCAGAGCAGTTCTGCGTTTTGTCACTGTCACAGCAGAAGCC
TTACGCTTCAGGCAGATACAGAGAGAATTTCGTCAGGCACTGTCTGAAACTGCTCCTGTG
TATACGATGACGCCGGGAGACGTGGACCTCACTCTGAACTGGGGGCGAATCAGCAATGTG
CTTCCGGAGTATCGGGGAGAGGATGGTGTCAGAGTGGGGAGAATATCCTTTAATAATATA
TCAGCGATACTGGGGACTGTGGCCGTTATACTGAATTGCCATCATCAGGGGGCGCGTTCT
GTTCGCGCCGTGAATGAAGAGAGTCAACCAGAATGTCAGATAACTGGCGACAGGCCCGTT
ATAAAAATAAACAATACATTATGGGAAAGTAATACAGCTGCAGCGTTTCTGAACAGAAAG
TCACAGTTTTTATATACAACGGGTAAATAAAGGAGTTAAGCATGAAGAAGATGTTTATGG
Challenge 2.1.2
- Is it a gene?
- If yes, what does it encode?
- What is the closest organism? What can you infer about its taxonomic distribution?
What does a protein-coding gene need to produce proteins?
Now go back to the blast main page.
Challenge 2.1.3
- Did you choose the right blast algorithm to answer the questions above?
- Can you do better?
What if you try another flavor of the blast suite?
What if you aligned the gene to the genome while keeping the codon frame?
Repeat the procedure for this other sequence:
>b
GATTACTTCAGCCAAAAGGAACACCTGTATATGAAGTGTATATTATTTAAATGGGTACTG
TGCCTGTTACTGGGTTTTTCTTCGGTATCCTATTCCCGGGAGTTTACGATAGACTTTTCG
ACCCAACAAAGTTATGTCTCTTCGTTAAATAGTATACGGACAGAGATATCGACCCCTCTT
GAACATATATCTCAGGGGACCACATCGGTGTCTGTTATTAACCACACCCCACCGGGCAGT
TATTTTGCTGTGGATATACGAGGGCTTGATGTCTATCAGGCGCGTTTTGACCATCTTCGT
CTGATTATTGAGCAAAATAATTTATATGTGGCCGGGTTCGTTAATACGGCAACAAATACT
TTCTACCGTTTTTCAGATTTTACACATATATCAGTGCCCGGTGGACAACGGTTTCCATGA
CAACGGACAGCAGTTATACCACTCTGCAACGTGTCGCAGCGCTGGAACGTTCCGGAATGC
AAATCAGTCGTCACTCACTGGTTTCATCATATCTGGCGTTAATGGAGTTCAGTGGTAATA
CAATGACCAGAGATGCATCCAGAGCAGTTCTGCGTTTTGACTGTCACTGTCACAGCAGAA
GCCTTACGCTTCAGGCAGATACAGAGAGAATTTCGTCAGGCACTGTCTGAAACTGCTCCT
GTGTATACGATGACGCCGGGAGACGTGGACCTCTTTTTTTACTCTGAACTGGGGGCGAAT
CAGCAATGTGCTTCCGGAGTATCGGGGAGAGGATGGTGTCAGAGTGGGGAGAATATCCTT
TAATAATATATCAGCGATACTGGGGACTGTGGCCGTTATACTGAATTGCCATCATCAGGG
GGCGCGTTCTGTTCGCGCCGTGAATGAAGAGAGTCAACCAGAATGTCAGATAACTGGCGA
CAGGCCCGTTATAAAAATAAACAATACATTATGGGAAAGTAATACAGCTGCAGCGTTTCT
GAACAGAAAGTCACAGTTTTTATATACAACGGGTAAATAAAGGAGTTAAGCATGAAGAAG
ATGTTTATGG
Challenge 2.1.4
How does sequence b compare to the first sequence? Is it also a gene?
If you get stuck, try an ORF finder, e.g. StarORF (http://star.mit.edu/orf/index.html).
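The idea behind an open reading frame (ORF) search can be sketched in a few lines of Python. This is a simplified illustration, not how StarORF is implemented: it assumes ATG as the only start codon and the three standard stop codons, scans both strands, and the two toy sequences are made up to show how a single-base deletion can destroy a reading frame.

```python
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orfs_in_strand(seq):
    """Yield (start, end) of ATG...stop ORFs in the 3 frames; end is exclusive."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                yield (start, i + 3)
                start = None

def longest_orf(seq):
    """Return (length, strand, start, end) of the longest ORF, or None."""
    seq = seq.upper()
    best = None
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for start, end in orfs_in_strand(strand_seq):
            if best is None or end - start > best[0]:
                best = (end - start, strand, start, end)
    return best

# Made-up toy sequences: the second lacks one A, which shifts the frame.
intact = "CCATGAAATTTGGGTAACC"
shifted = "CCATGAATTTGGGTAACC"
print(longest_orf(intact))
print(longest_orf(shifted))
```

Coordinates on the minus strand refer to the reverse-complemented sequence. A complete ORF is a hint that a sequence may encode a protein, not proof of it.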
Challenge 2.1.5
- Can you perform a pairwise alignment between both sequences?
- What are the differences?
Check the “Align two or more sequences” option on the BLAST search page.
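To get a feeling for what a pairwise alignment computes, here is a minimal global alignment (Needleman-Wunsch) in Python. It is a teaching sketch and not what BLAST does: BLAST uses a heuristic local alignment, and the scores used here (match +1, mismatch -1, gap -2) are arbitrary assumptions. The toy sequences are made up.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (aligned_a, aligned_b, score) for a simple global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

def differences(aligned_a, aligned_b):
    """List (column, char_a, char_b) for every column that differs."""
    return [(k, x, y) for k, (x, y) in enumerate(zip(aligned_a, aligned_b)) if x != y]

top, bottom, total = global_align("ACGTTACG", "ACGTACG")
print(top)
print(bottom)
print(total, differences(top, bottom))
```

A gap (`-`) in one row marks an insertion or deletion, and a column with two different letters marks a substitution.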
Task 2.2: Find protein blast hits
Now find sequences with similarity to this protein:
>c
MMNRSIYRQFVRYTIPTVAAMLVNGLYQVVDGIFIGHYVGAEGLAGINVAWPVIGTILGIGMLVGVGTGA
LASIKQGEGHPDSAKRILATGLMSLLLLAPIVAVILWFMADDFLRWQGAEGRVFELGLQYLQVLIVGCLF
TLGSIAMPFLLRNDHSPNLATLLMVIGALTNIALDYLFIAWLQWELTGAAIATTLAQVVVTLLGLAYFFS
ARANMRLTRRCLRLEWHSLPKIFAIGVSSFFMYAYGSTMVALHNTLMIEYGNAVMVGAYAVMGYIVTIYY
LTVEGIANGMQPLASYHFGARNYDHIRKLLRLAMGIAVLGGMAFVLVLNLFPEQAIAIFNSSDSELIAGA
QMGIKLHLFAMYLDGFLVVSAAYYQSTNRGGKAMFITVGNMSIMLPFLFILPQFFGLTGVWIALPISNIV
LSTVVGIMLWRDVNKMGKPTQVSYA
Challenge 2.2.1
- What is this gene coding for?
- Where did it likely come from?
Challenge 2.2.2
- Can you align nucleotide sequences (sequence a, Task 2.1) against a protein database?
- If so, how?
Challenge 2.2.3
- Can you align proteins to a nucleotide database?
- If so, how? If not, why not?
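Comparing nucleotides with proteins relies on translating the nucleotide sequence in all six reading frames. The Python sketch below illustrates that translation step only, assuming the standard codon table; the function names are hypothetical, and the test fragment is a 30-nucleotide stretch taken from the first line of sequence a (Task 2.1), starting at its first ATG.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate from the first base; '*' is a stop, 'X' an unknown codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frames(seq):
    """Return translations keyed by frame name: +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames

fragment = "ATGAAGTGTATATTATTTAAATGGGTACTG"
for name, protein in six_frames(fragment).items():
    print(name, protein)
```

Only one of the six frames usually gives a long translation without stop codons (`*`), which is why translated searches are sensitive to frameshifts.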
Task 2.3: Find gene and genome hits
This task should be performed at the NCBI page (not at EBI).
What is the following sequence?
>d
GTGTCTAAAGAAAAATTTGAACGTACAAAACCGCACGTTAACGTTGGTACTATCGGCCACGTTGACCACG
GTAAAACTACTCTGACCGCTGCAATCACCACCGTACTGGCTAAAACCTACGGCGGTGCTGCTCGTGCATT
CGACCAGATCGATAACGCGCCGGAAGAAAAAGCTCGTGGTATCACCATCAACACTTCTCACGTTGAATAT
GACACCCCGACCCGTCACTACGCACACGTAGACTGCCCGGGGCACGCCGACTATGTTAAAAACATGATCA
CCGGTGCTGCTCAGATGGACGGCGCGATCCTGGTAGTTGCTGCGACTGACGGCCCGATGCCGCAGACTCG
TGAGCACATCCTGCTGGGTCGTCAGGTAGGCGTTCCGTACATCATCGTGTTCCTGAACAAATGCGACATG
GTTGATGACGAAGAGCTGCTGGAACTGGTTGAAATGGAAGTTCGTGAACTTCTGTCTCAGTACGACTTCC
CGGGCGACGACACTCCGATCGTTCGTGGTTCTGCTCTGAAAGCGCTGGAAGGCGACGCAGAGTGGGAAGC
GAAAATCCTGGAACTGGCTGGCTTCCTGGATTCTTACATTCCGGAACCAGAGCGTGCGATTGACAAGCCG
TTCCTGCTGCCGATCGAAGACGTGTTCTCCATCTCCGGTCGTGGTACCGTTGTTACCGGTCGTGTAGAAC
GCGGTATCATCAAAGTTGGTGAAGAAGTTGAAATCGTTGGTATCAAAGAGACTCAGAAGTCTACCTGTAC
TGGCGTTGAAATGTTCCGCAAACTGCTGGACGAAGGCCGTGCTGGTGAGAACGTAGGTGTTCTGCTGCGT
GGTATCAAACGTGAAGAAATCGAACGTGGTCAGGTACTGGCTAAGCCGGGCACCATCAAACCGCACACCA
AGTTCGAATCTGAAGTGTACATTCTGTCCAAAGATGAAGGCGGCCGTCATACTCCGTTCTTCAAAGGCTA
CCGTCCGCAGTTCTACTTCCGTACTACTGACGTGACTGGTACCATCGAACTGCCGGAAGGCGTAGAGATG
GTAATGCCGGGCGACAACATCAAAATGGTTGTTACCCTGATCCACCCGATCGCGATGGACGACGGTCTGC
GTTTCGCAATCCGTGAAGGCGGCCGTACCGTTGGTGCGGGCGTTGTTGCTAAAGTTCTGGGCTAA
Challenge 2.3.1
- What organism is it likely from?
- Is it a gene?
- If yes, which one?
- What does it code for?
Preferably use blastn to identify the origin of the gene. Explore the alignments of the first hit, and try the “Graphics” button associated with the results.
Challenge 2.3.2
- Where is it located on the genome (precision at ~10kb is good enough)?
- Are there several copies in the genome?
Mark down the name of the gene (generally three letters in lower case and one in uppercase). From the protein record, there is a link to the genome record, which has a “graphics” interface.
Not all genomes are equally well annotated, and you might have to search for a genome that is well annotated. One good example is Escherichia fergusonii ATCC 35469.
Challenge 2.3.3
- How far is this gene (and possible copies) from the origin of replication?
- How is it oriented with respect to replication?
What genes are reliable markers for the origin of replication? Can you find a paper about that? Now can you find that gene in the genome above?
What does the dnaA gene do?
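Once you have a coordinate for the gene and one for an origin marker, distance and orientation follow from simple arithmetic on a circular chromosome. The Python sketch below makes several simplifying assumptions that you should check for a real genome: replication proceeds in both directions from the origin, the terminus lies exactly opposite the origin, and a gene on the + strand is transcribed toward increasing coordinates. All numbers in the example are made up.

```python
def circular_distance(position, origin, genome_length):
    """Shortest distance between two positions on a circular genome."""
    d = abs(position - origin) % genome_length
    return min(d, genome_length - d)

def replichore(position, origin, genome_length):
    """'right' if the position is within half a genome after the origin
    (in the direction of increasing coordinates), otherwise 'left'.
    Assumes the terminus is exactly opposite the origin."""
    offset = (position - origin) % genome_length
    return "right" if offset <= genome_length / 2 else "left"

def orientation(strand, position, origin, genome_length):
    """Is the gene transcribed with ('co-directional') or against
    ('head-on') the replication fork passing through it?"""
    side = replichore(position, origin, genome_length)
    # Right replichore: the fork moves toward increasing coordinates.
    # Left replichore: the fork moves toward decreasing coordinates.
    same = (side == "right" and strand == "+") or (side == "left" and strand == "-")
    return "co-directional" if same else "head-on"

# Made-up example: a 1000-bp circular genome with the origin at position 900.
for position, strand in ((100, "+"), (700, "+"), (700, "-")):
    print(
        position,
        strand,
        circular_distance(position, 900, 1000),
        replichore(position, 900, 1000),
        orientation(strand, position, 900, 1000),
    )
```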
Task 2.4 (optional): Explore the synteny functions of the STRING database
We are interested in genes that are normally located next to each other, and their taxonomic breadth.
Open the STRING database (https://string-db.org/). Search for the Shiga toxin-encoding genes in bacteria.
Challenge 2.4.1
- Can you find the Shiga toxin-coding genes in Escherichia coli K-12?
- Do you find the same strain as above?
Challenge 2.4.2
Can you tell on what chromosomal feature these genes are located?
What genes are located next to these genes?
What is a lysogenic phage?
Feel free to explore the different functions of the database. What do the different links in the network mean? What happens if you remove the least reliable interaction sources (i.e. text mining and databases)?
Now modify the settings to keep only Neighborhood as edges. Go back to the Viewers tab. Search for the other gene discussed above, tufA, do the same as for stx1, and compare the results.
Challenge 2.4.3
- How many genes are co-localized with stx1?
- What are the functions of these?
Challenge 2.4.4
- How many genes are co-localized with tufA?
- What are the functions of the gene?
- Does the function correlate with the taxonomic breadth of the gene?
For both genes, select the Neighborhood viewer.
Challenge 2.4.5
- How taxonomically widespread are the stx1 and tufA genes, respectively?
- How conserved is the gene order for these two genes?
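One simple way to put a number on gene-order conservation is to compare the sets of neighbouring genes around a focal gene in two genomes. The Python sketch below is an illustration of that idea only and is not how STRING computes its Neighborhood score. The gene names and gene orders are placeholders, and the chromosome is treated as linear (no wrap-around).

```python
def neighbours(gene_order, focal, window=2):
    """Return the set of genes within `window` positions of the focal gene."""
    i = gene_order.index(focal)
    low = max(0, i - window)
    return set(gene_order[low:i] + gene_order[i + 1:i + 1 + window])

def neighbourhood_conservation(order_a, order_b, focal, window=2):
    """Jaccard index of the focal gene's neighbour sets in two genomes."""
    set_a = neighbours(order_a, focal, window)
    set_b = neighbours(order_b, focal, window)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Placeholder gene orders for two hypothetical genomes.
genome_1 = ["g1", "g2", "g3", "focal", "g4", "g5", "g6"]
genome_2 = ["g9", "g2", "g3", "focal", "g4", "g8", "g1"]

print(sorted(neighbours(genome_1, "focal")))
print(sorted(neighbours(genome_2, "focal")))
print(neighbourhood_conservation(genome_1, genome_2, "focal"))
```

A value of 1 means that the focal gene has identical neighbours in both genomes, and 0 means that it shares none.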
Key Points
- The NCBI website is very, very comprehensive.
- The Entrez cross-referencing system helps retrieve and link pieces of data of different types.