My office mate (thanks, Aaron) just finished tinkering with building a BLAST database that includes taxon IDs. He gave me the commands, which I think are worth posting as a reference:
${BLAST_BIN}/makeblastdb -parse_seqids -in bacteriaGenomes.fa -dbtype nucl -title bacteriaGenomes -out bacteriaGenomes -max_file_sz 10GB -taxid_map taxMap.txt
And taxMap.txt looks like:
NC_000907.1 71421
NC_000908.2 243273
NC_000964.3 224308
NC_000909.1 243232
NC_000911.1 1148
NC_000912.1 272634
NC_000913.2 511145
NC_000914.2 394
NC_000915.1 85962
NC_000916.1 187420
.
.
.
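Before handing the map to makeblastdb, it can be worth checking that every line really is a sequence ID followed by a numeric taxon ID. The snippet below is only a rough sketch of such a check; the function name parse_tax_map is made up for illustration and is not part of BLAST.

```python
def parse_tax_map(lines):
    """Parse '<sequence id> <taxid>' lines into a dict, skipping blank lines."""
    mapping = {}
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore empty lines
        if len(fields) != 2 or not fields[1].isdigit():
            raise ValueError(
                f"line {line_no}: expected '<sequence id> <taxid>', got {line!r}"
            )
        mapping[fields[0]] = int(fields[1])
    return mapping


# First three lines of the taxMap.txt shown above.
sample = """NC_000907.1 71421
NC_000908.2 243273
NC_000964.3 224308
"""
tax_map = parse_tax_map(sample.splitlines())
print(tax_map["NC_000907.1"])  # 71421
```

For the real file, something like parse_tax_map(open("taxMap.txt")) should work the same way, since iterating over a file yields its lines.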
Fasta headers from input 'bacteriaGenomes.fa' look like:
>gi|45357563|ref|NC_005791.1| Methanococcus maripaludis S2 chromosome, complete genome
>gi|90421528|ref|NC_007925.1| Rhodopseudomonas palustris BisB18, complete genome
>gi|333977506|ref|NC_015573.1| Desulfotomaculum kuznetsovii DSM 6115 chromosome, complete genome
>gi|221230948|ref|NC_011900.1| Streptococcus pneumoniae ATCC 700669, complete genome
>gi|225350699|ref|NC_012226.1| Brachyspira hyodysenteriae WA1 plasmid pBHWA1, complete sequence
>gi|225618950|ref|NC_012225.1| Brachyspira hyodysenteriae WA1 chromosome, complete genome
>gi|307594149|ref|NC_014537.1| Vulcanisaeta distributa DSM 14429 chromosome, complete genome
>gi|254295496|ref|NC_012983.1| Hirschia baltica ATCC 49814 plasmid pHbal01, complete sequence
>gi|254292376|ref|NC_012982.1| Hirschia baltica ATCC 49814, complete genome
>gi|83816857|ref|NC_007678.1| Salinibacter ruber DSM 13855 plasmid pSR35, complete sequence
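Since the map is keyed on the accession.version string (the field after "ref|" in these headers), a quick way to spot sequences that would end up without a taxon ID is to pull that field out of each header and compare it against the map. This is a minimal sketch that assumes every header follows the gi|...|ref|...| pattern shown above; the helper names are hypothetical.

```python
import re

# Matches headers like ">gi|45357563|ref|NC_005791.1| Methanococcus ..."
REF_ID = re.compile(r"^>gi\|\d+\|ref\|([^|]+)\|")


def accession_from_header(header):
    """Return the accession.version from a FASTA header, or None if it does not match."""
    match = REF_ID.match(header)
    return match.group(1) if match else None


def missing_from_map(headers, mapped_ids):
    """List accessions that appear in the headers but not in the taxid map."""
    missing = []
    for header in headers:
        accession = accession_from_header(header)
        if accession is not None and accession not in mapped_ids:
            missing.append(accession)
    return missing


headers = [
    ">gi|45357563|ref|NC_005791.1| Methanococcus maripaludis S2 chromosome, complete genome",
    ">gi|90421528|ref|NC_007925.1| Rhodopseudomonas palustris BisB18, complete genome",
]
# Pretend only the first accession is present in the map.
print(missing_from_map(headers, {"NC_005791.1"}))  # ['NC_007925.1']
```

On the real input, the headers would be the lines of bacteriaGenomes.fa that start with ">", and mapped_ids would be the first column of taxMap.txt.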
In this example, the FASTA sequences came from the all.fna.tar.gz archive on NCBI's bacterial genomes FTP site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/).
The taxMap.txt is a massaging of the columns found in 'summary.txt', which is also on the FTP site.
Thursday, June 2, 2011