Bioinformatics tips, tricks and reminders

Thursday, June 2, 2011

taxon id in blast db's

My office mate (thanks Aaron) just finished tinkering with making a blast db that includes the taxon id's. He gave me the commands which I think might be useful to post as a reference:

${BLAST_BIN}/makeblastdb -parse_seqids -in bacteriaGenomes.fa -dbtype nucl -title bacteriaGenomes -out bacteriaGenomes -max_file_sz 10GB -taxid_map taxMap.txt

And taxMap.txt looks like:

NC_000907.1 71421
NC_000908.2 243273
NC_000964.3 224308
NC_000909.1 243232
NC_000911.1 1148
NC_000912.1 272634
NC_000913.2 511145
NC_000914.2 394
NC_000915.1 85962
NC_000916.1 187420
.
.
.

Fasta headers from input 'bacteriaGenomes.fa' look like:

>gi|45357563|ref|NC_005791.1| Methanococcus maripaludis S2 chromosome, complete genome
>gi|90421528|ref|NC_007925.1| Rhodopseudomonas palustris BisB18, complete genome
>gi|333977506|ref|NC_015573.1| Desulfotomaculum kuznetsovii DSM 6115 chromosome, complete genome
>gi|221230948|ref|NC_011900.1| Streptococcus pneumoniae ATCC 700669, complete genome
>gi|225350699|ref|NC_012226.1| Brachyspira hyodysenteriae WA1 plasmid pBHWA1, complete sequence
>gi|225618950|ref|NC_012225.1| Brachyspira hyodysenteriae WA1 chromosome, complete genome
>gi|307594149|ref|NC_014537.1| Vulcanisaeta distributa DSM 14429 chromosome, complete genome
>gi|254295496|ref|NC_012983.1| Hirschia baltica ATCC 49814 plasmid pHbal01, complete sequence
>gi|254292376|ref|NC_012982.1| Hirschia baltica ATCC 49814, complete genome
>gi|83816857|ref|NC_007678.1| Salinibacter ruber DSM 13855 plasmid pSR35, complete sequence

In this example, the fasta sequences were downloaded from NCBI's bacterial genomes ftp (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) - all.fna.tar.gz.

The taxMap.txt is a masaging of the columns found in the 'summary.txt', also found
on the ftp site.

Tuesday, May 31, 2011

minimus2-blat

I'm currently playing around with putting together an assembly pipeline that will merge the contigs generated from assembly of same input data with different parameters.
My current assembler of choice for working with 454 data is Newbler.
It can only handle input reads as long as 2000 bases. Shredding the initial contigs is an option but I have been told that Newbler has difficulty putting these contigs back together again.
Minimus2, part of the AMOS package, is made just for that. The latest addition to the minimus tools is minimus2-blat, which is possibly the fastest of the three minimus2 tools.

Here's how to install it:
mkdir AMOS
cd AMOS
cvs -d:pserver:anonymous@amos.cvs.sourceforge.net:/cvsroot/amos login
cvs -z3 -d:pserver:anonymous@amos.cvs.sourceforge.net:/cvsroot/amos co -P AMOS
./bootstrap
./configure
make
make install

Wednesday, March 23, 2011

Split fastq mates

Here is code in perl to split a single paired interleaved FASTQ file into a pair of FASTQ files:


#!/usr/local/bin/perl -w
# Daniel Brami
# Util to split interlaced FASTQ files into pairs

use strict;

# Standard lib
use IO::File;
use File::Basename;

my $INPUT=shift;

if (!(defined ($INPUT)) || ($INPUT =~ '^\-')){
die "Usage: $0 <interleaved paired FASTQ file>\n";
}

my ($name,$path,$suffix) = fileparse($INPUT, qw/fastq FASTQ txt TXT/);

my $FH_IN = new IO::File($INPUT, "r") or die "could not open $INPUT: $!\n";
my $FH_OUT1 = new IO::File($name."split1.$suffix", "w") or die "could not open $name.split1$suffix for writing: $!\n";
my $FH_OUT2 = new IO::File($name."split2.$suffix", "w") or die "could not open $name.split2$suffix for writing: $!\n";

my ($recs1, $recs2)= (0,0);
my $flipflop = 1;
my $counter = 0;
my ($line, $TXT);
while($line = $FH_IN->getline()){
$TXT .= $line;
$counter++;
if($counter == 4){
 if($flipflop == 1){
  print $FH_OUT1 $TXT;
  ++$recs1;
 }else{
  print $FH_OUT2 $TXT;     
  ++$recs2;
 }
 $counter = 0;
 $TXT = '';
 $flipflop *= -1;
}
}
$FH_IN->close();
$FH_OUT1->close();
$FH_OUT2->close();

print STDERR "Processed $recs1 records for pair file 1 and $recs2 records for pair file 2.\n";
if($recs1 != $recs2){
print STDERR "The number of processed records does not match - check input data!";
}exit;

Quality trimming fastq files using CLC Bio cell

copy the mandatory license.properties file to your working directory
/usr/local/packages/clc-ngs-cell/quality_trim -o -p -c -l -m -r -i &
Optional: Split the resulting interleaved paired file into two components:
~/workspace/meta_scripts/splitMatedFastq.pl

Monday, July 19, 2010

modify a genbank file for mapping

Insert between the following tags at the tail of your Genbank file the nucleotide FASTA sequence only:

ORIGIN

(sequence goes here)

Thursday, June 10, 2010

CLC mapping

At ths point, the most current version of the CLC command-line utils is located here:

/usr/local/packages/clc-ngs-cell/

with the mapper for 454 / Sanger reads being clc_ref_assemble_long.

As usual, you need to copy the license.properties file from that folder to your working directory to perform any sort of analysis.

Here are some of the pitfalls:

- making a soft link to the mapping executable will throw a missing license error. forget the soft link, you need to use the fully qualified path of the executable.

- mapping to a Genbank file didn't work. However mapping to a FASTA file completed with blazing quickness.

Once the mapping has producea CAS output, you need to process it to get it in table format:

/usr/local/packages/clc-ngs-cell/assembly_table -n -s input.cas > output.table

Thursday, May 13, 2010

SVN stuff

This is how you create a repository (usually done by IT):

svnadmin create (repo_name)

This is how you add content to a repository:

svn import (files | folder) (repo_name) -m "initial import"

This is how you checkout the latest version of something existing in SVN into the current working directory:

svn checkout (repo_name/path)

This is how you check something into it to update the svn with a more current version:

from within the root folder when doing svn checkout (ie: /workspace/genomeQC):

svn commit -m "update to ver 1.5"