Best bioinfo one-liners?
8
25
Entering edit mode
8.9 years ago

Whereas an infinity of efficient tools exists out there, it is sometimes still quicker for achieving simple tasks to execute a one linux command. I'm starting by sharing 3 I use quite often.

## 1 get the sequences length distribution form a fastq file using awk
zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'

##2 Reverse complement a sequence (I use that a lot when I need to design primers)
echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'

##3 split a multifasta file into single ones with csplit:
csplit -z -q -n 4 -f sequence_ sequences.fasta /\>/ {*}

I may be wrong, but I've not found such a list in Biostars.

So, what comes to your mind? I hope this post will yield some gold nuggets ;-)

linux • 7.9k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode
ADD REPLY
6
Entering edit mode
8.9 years ago

linearize fasta:

cat file.fasta | awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}'
ADD COMMENT
1
Entering edit mode

Just a bit shorter? :)

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa
ADD REPLY
0
Entering edit mode

Wow, a very compact way to handle multiline entries. Excellent!

ADD REPLY
0
Entering edit mode

Even shorter! :)

awk 'NR==1 {print ; next} {printf (/^>/) ? "\n"$0"\n" : $1}' file.fas
ADD REPLY
0
Entering edit mode

@Frederic: doesn't work.

ADD REPLY
0
Entering edit mode

@Pierre: hum, it works well on my config (GNU Awk 4.1.1). Maybe a problem with older Awk versions (Awk < 4.0)?

ADD REPLY
5
Entering edit mode
8.9 years ago
## fastq2fasta
zcat file.fastq.gz | paste - - - - | perl -ane 'print ">$F[0]\n$F[2]\n";' | gzip -c > file.fasta.gz

## bam2bed
samtools view file.bam | perl -F'\t' -ane '$strand=($F[1]&16)?"-":"+";$length=1;$tmp=$F[5];$tmp =~ s/(\d+)[MD]/$length+=$1/eg;print "$F[2]\t$F[3]\t".($F[3]+$length)."\t$F[0]\t0\t$strand\n";' > file.bed
##bam2wig
samtools mpileup -BQ0 file.sorted.bam | perl -pe '($c, $start, undef, $depth) = split;if ($c ne $lastC || $start != $lastStart+1) {print "fixedStep chrom=$c start=$start step=1 span=1\n";}$_ = $depth."\n";($lastC, $lastStart) = ($c, $start);' | gzip -c > file.wig.gz
ADD COMMENT
0
Entering edit mode

+1 Especially for the 1st one which avoids the installation of a specific tool (e.g. fastx-toolkit).

ADD REPLY
5
Entering edit mode
8.9 years ago
venu 7.1k

Here is a list I often use during my work.

My work always starts with a list of ids for which I need to extract fasta sequences..samtools - always a better choice

cut -c 2- ids.txt | xargs -n 1 samtools faidx file.fasta > out.fasta
ADD COMMENT
1
Entering edit mode

Nice! And thanks for the link with your whole list.

ADD REPLY
1
Entering edit mode

I still cannot help but find the existence of a list of 'favourite one liners' a bit paradoxical, in my opinion.

If they are used routinely for your day-to-day tasks, maybe it is time to 'graduate' them to an utility script, that may be easier to maintain and/or might benefit the others?

ADD REPLY
1
Entering edit mode

One-liners are like old-time Lego(tm) bricks, you can build lots of fun and different stuff with'em.

If you pack them into tools, they will become like current-day Legos: you attach a head, two arms and two legs to a body and voilá, "constructed" a robot.

ADD REPLY
5
Entering edit mode
8.9 years ago
george.ry ★ 1.2k

Reproducible subsampling of a FASTQ file:

cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq

srand() is the seed for the random number generator - keeps the subsampling the same when the script is run multiple times. 0.01 is the % of reads to output.

Deinterleaving a FASTQ:

cat file.fq | paste - - - - - - - - | tee >(cut -f1-4 | tr '\t' '\n' > out1.fq) | cut -f5-8 | tr '\t' '\n' > out2.fq

Archiving / unarchiving files:

Not exactly a one-liner, but I find this a very useful way to make files immutable, and most people probably won't have come across it. An immutable file cannot be modified or written to, deleted, renamed, or moved - even by the superuser. When I receive data or download it, this is the first thing I do, and it's saved a lot of heartache over the years.

sudo chattr +i file.fq #To archive a file
sudo chattr -i file.fq #To unarchive it again
ADD COMMENT
4
Entering edit mode
8.9 years ago
iraun 6.2k

There are quite a lot useful one liners. This is just a drop in the ocean :).

## Number of reads in a fastq file
cat file.fq | echo $((`wc -l`/4))

## Single line fasta file to multi-line fasta of 60 characteres each line
awk -v FS= '`/^>/{print;next}`{for (i=0;i<=NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file

## Sequence length of every entry in a multifasta file
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa
ADD COMMENT
1
Entering edit mode

Single line fasta file to multi-line fasta of 60 characteres each line

How about using fold -w 60 file?

ADD REPLY
0
Entering edit mode

Yes, that one also works :). I didn't remember fold command!

ADD REPLY
0
Entering edit mode

wouldn't this one wrap long header lines as well?

ADD REPLY
0
Entering edit mode

Yes, good point h.mon, fold command will also split the sequence names if their size is bigger than 60.

ADD REPLY
0
Entering edit mode

Why don't you just use word count for "Sequence length of fasta file"?

grep -v ">" file.fa | wc -c
ADD REPLY
2
Entering edit mode

Ah, got it... for every entry

ADD REPLY
0
Entering edit mode

Yep :). Of course if you want the total length the most straightforward command is your command.

ADD REPLY
0
Entering edit mode

As long as you don't forget the "

cf What Are The Most Common Stupid Mistakes In Bioinformatics? ;-)

ADD REPLY
2
Entering edit mode
8.9 years ago
h.mon 35k

I've used several "one-liners" from Scriptome.

ADD COMMENT
0
Entering edit mode

Although the idea of this post was to gather one-liners of GNU/Linux tools mainly, this is good to know. One other project of this type is Biopieces.

ADD REPLY
1
Entering edit mode
8.9 years ago
Harshal ▴ 60

Running fastqc for all the fastq files in multiple sample folders in parallel mode.

for i in Sample_*/*.fastq.gz ; do echo echo fastqc $i\|qsub -cwd; done # will create commands
for i in Sample_*/*.fastq.gz ; do echo echo fastqc $i\|qsub -cwd; done|sh #launches jobs on cluster
ADD COMMENT
1
Entering edit mode

Most of these are better done with xargs:

ls Sample_*/*.fastq.gz | xargs -i echo fastqc {} \| qsub -cwd
ADD REPLY
1
Entering edit mode
8.9 years ago

Parallelize time-consuming processes on Unix systems

Using mpileup for a whole genome can take forever. So, handling each chromosome separately and parallely running them on several cores will speed up your pipeline. Using xargs you can easily realize it.

Example usage of xargs (-P is the number of parallel processes started - don't use more than the number of cores you have available):

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} yourFile.bam | bcftools view -vcg - > tmp.{}.vcf"

To merge the results afterwards, you might want to do something like this:

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].bcf >> yourFile.vcf");'
ADD COMMENT

Login before adding your answer.

Traffic: 2065 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6