EST-based
Databases of pre-clustered ESTs
A shortcut for obtaining either a consensus sequence (TIGR) or a set of ESTs (UniGene) derived from a gene of interest.
- STACKdb (limited access, tissue-specific splice forms) [1]
- UniGene (no consensus sequence) [2]
- TIGR [3]
Search of EST databases using BLAST
- Depending on the level of homology we can use:
- blastn program, cDNA sequence as query, EST DB from the same species (== novel splice forms discovery in the same species)
- tblastn program, protein sequence as a query, EST DB from the same (==paralogue discovery) or other species (== cloning any homologs)
If possible, use protein sequences from a related species (e.g. a zebrafish protein when looking for a homolog in salmon). For a large number of proteins, however, homology can be detected even between human and C. elegans.
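If you run the search locally rather than on the NCBI web page, the choice of program described above can be sketched as follows. This is only a sketch, assuming a local NCBI BLAST+ installation and an already formatted nucleotide EST database; the function name, the file name pig_query.fasta, the database name pig_est and the E-value cutoff are made up for illustration. The code only builds the command line and does not run BLAST.

```python
def blast_command(query_kind, query_fasta, est_db, evalue="1e-5"):
    """Build a BLAST+ command line for searching a nucleotide EST database.

    query_kind is "nucleotide" (cDNA query, blastn) or "protein"
    (protein query, tblastn), following the rule of thumb in the text.
    """
    programs = {"nucleotide": "blastn", "protein": "tblastn"}
    if query_kind not in programs:
        raise ValueError("query_kind must be 'nucleotide' or 'protein'")
    return [
        programs[query_kind],
        "-query", query_fasta,   # FASTA file with the query sequence
        "-db", est_db,           # name of the local EST database (assumed to exist)
        "-evalue", evalue,       # illustrative cutoff, adjust to your needs
        "-outfmt", "6",          # tabular output, one hit per line
    ]


# Protein query against an EST database: tblastn
print(" ".join(blast_command("protein", "pig_query.fasta", "pig_est")))
```

The returned list can be passed to subprocess.run once the database actually exists on your machine.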
- Restrict the BLAST output by species, e.g. search only porcine ESTs to simplify the output
- On the BLAST output page, select reasonable hits by ticking the checkbox to the left of each hit in the alignment section.
- Retrieve all checked results as a FASTA file (e.g. pig_Xgene_ESTs_date_round1.fasta)
- Check how many sensible hits you got, e.g. using grep on Unix/Linux:
grep -c '>' pig_Xgene_ESTs_date_round1.fasta
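The same count can be done in Python, which also makes it easy to spot ESTs that were saved twice. A minimal sketch; the function name and the three toy records are invented for illustration.

```python
def fasta_ids(fasta_text):
    """Return the identifier (first word after '>') of every FASTA record."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


# Toy input: three records, one of them a duplicate
example = """>est1 pig liver
ACGT
>est2 pig brain
GGCC
>est1 pig liver
ACGT
"""

ids = fasta_ids(example)
print(len(ids), "records,", len(set(ids)), "unique identifiers")
```

For a real file, read its contents first, e.g. with open("pig_Xgene_ESTs_date_round1.fasta").read().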
- Assemble all your EST sequences using phrap (on the Unix command line):
phrap pig_Xgene_ESTs_date_round1.fasta
You should get the file pig_Xgene_ESTs_date_round1.fasta.contigs
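To get a quick overview of the assembly you can list the contigs by length. The sketch below assumes the .contigs file is ordinary FASTA; the contig names and sequences in the example are invented, and the function names are not part of phrap.

```python
def read_fasta(fasta_text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def contigs_by_length(fasta_text):
    """Return (name, length) pairs, longest contig first."""
    seqs = read_fasta(fasta_text)
    return sorted(((n, len(s)) for n, s in seqs.items()),
                  key=lambda item: item[1], reverse=True)


# Toy stand-in for the contents of a .contigs file
example_contigs = """>Contig1
ACGTACGTAC
GTAC
>Contig2
ACGT
"""

for name, length in contigs_by_length(example_contigs):
    print(name, length)
```

The longest contig is not necessarily the right one, so it still has to be compared with your starting sequence.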
If you do not have phrap you may use:
- CAP3
- ESSEM (Est’s aSSEmbly using Malig) from the Technical University of Catalonia.
You may download the ESTs of human SYNGR4 from UniGene (http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=221005&TAXID=9606&SEARCH=), save them as a FASTA file, and then feed them to CAP3 or ESSEM to see how it works. Use the suggested assembly sequence:
>assembly: gnl|UG|Hs# -> gnl|UG|Hs# (R)
TTTTTTTTTTTTTTTGTTTTTAGAAACCCTTCTGGAGGGAGGATTCTCTCTTTATTGATTTGGATAAGGATATTTAGTTG
TCAGGCATCATAGCAAGCCGGGGGGACTTTGGAGCGGTCAGACAGGGGGACAGGGCAGAGCTAGCATAACTCAGGCTGTT
GGGGCCAGTGGTGGGCATGTTCACAGGGCTGTTGGCAGAGGGCAAGGGGAGGGTGGTCAGCACCATGCCACCCTCATCCA
GGAAGCGCTTGTAAGGGACTGGAGCATCATTTCGGAGGTCCTGGAATGCCAGGTAGGCCTGGAATATCCAGACAAGGATG
GAGAAGAAGGTGAAGGCGATGGCTGCCTGGCACTGCTGCTCCCCAGGAGGAACTCTTTGGGCGGCGAATGCTGCCATTGG
TTGGCCAGGAAGCAGAAACCCATGAACCAGACAACTGCCCAGAGAACAGCCAGGATGAAGTCCAGGAGCTGGAAGGCTGT
CTTGAAGCGGGTGCCGGCAATGCGGGTCTCCTGTGTGTCCAGGACGAGGAAGGCCAGCCACGCTGAGGAAGGCCAGGAAG
CCGGCTCCCACGGCAAAGCTGCAGGCCACGCTGTTGCTGTTGAGAATGCAGTGGAGCTGCGGAGACTCCATCTTGTTCTG
GTAGCCGTCGGTCAGCAGGGAGGAGAAGACGATCAGGGAGAAGACCCCTGCCTCCCCCACACTCTCCTTCTGCCACCAAA
CC
- Mask possible repeats using the RepeatMasker server. EST libraries are notorious for containing unspliced ESTs and contaminants.
- Use the masked consensus sequence (MCS) from the step above in the next round of BLAST searching:
blastn program, MCS as query, EST DB from the same species
Check how many sensible hits you got.
- Repeat the cycle of EST assembly and repeat masking, comparing the new EST contigs with the contigs from the previous round, until you get no new hits in the EST database.
- After every assembly step, make sure that the contig you use contains the sequence of interest (i.e. compare it with the initial cDNA or protein sequence)
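The last two steps can be sketched in Python. This is only an illustration under stated assumptions: search_round is a hypothetical callable standing in for one full cycle of assembly, masking and BLAST, and shares_kmer is a crude exact-match check that works only for a cDNA seed (for a protein seed you would compare with tblastn instead). All names and toy data are invented.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def shares_kmer(contig, seed, k=20):
    """True if contig and seed cDNA share an exact k-mer on either strand."""
    contig = contig.upper()
    seed = seed.upper()
    seed_kmers = {seed[i:i + k] for i in range(len(seed) - k + 1)}
    for strand in (contig, reverse_complement(contig)):
        for i in range(len(strand) - k + 1):
            if strand[i:i + k] in seed_kmers:
                return True
    return False


def collect_until_stable(first_hits, search_round, max_rounds=10):
    """Repeat search rounds until a round returns no EST not seen before.

    search_round takes the set of EST identifiers collected so far and
    returns the identifiers hit in the next round (hypothetical callable).
    """
    seen = set(first_hits)
    for _ in range(max_rounds):
        new_hits = set(search_round(seen)) - seen
        if not new_hits:
            break
        seen |= new_hits
    return seen


# Toy stand-in for a search round: each EST "finds" its listed neighbours
neighbours = {"est1": ["est2"], "est2": ["est3"], "est3": []}


def toy_round(seen):
    return {hit for est in seen for hit in neighbours.get(est, [])}


print(sorted(collect_until_stable({"est1"}, toy_round)))
```

The max_rounds limit is a safety net in case repeats or chimeric ESTs keep pulling in unrelated sequences; in that case inspect the new hits by hand rather than raising the limit.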
Genome annotation using EST assemblies
- PASA http://www.tigr.org/tdb/e2k1/ath1/pasa_annot_updates/pasa_annot_updates.shtml
Importing human, mouse, and zebrafish EST trace files
Trace files, and even experiment files, are available for a significant subset of human, mouse, and zebrafish ESTs. For reliable gene cloning we need them because:
- sequences in GenBank are usually shorter than the original trace files
- a sequencing error cannot be detected in a plain text/FASTA file without looking at the trace file
To get them, search for the relevant trace files using Sanger’s Trace server:
http://trace.ensembl.org/cgi-bin/tracesearch
or NCBI: http://www.ncbi.nlm.nih.gov/blast/mmtrace.shtml
After the BLAST search, the trace files can be retrieved as a compressed tar archive in SCF or RCF format. RCF is an encoded and shrunk form of SCF. If you plan to download a large number of trace files, obtain and compile the rcf2scf program so that you can download RCF and speed up transfer times.
Genome-based
- based on homology
- de novo
This will be covered in the genome annotation guide.