Question

Can't grep qseqid (column 1) from blastx outfmt6

0

Entering edit mode

7.0 years ago

ladypurrsia ▴ 60

I have just completed a blastx on some fasta sequences against the uniprot protein database. I have obtained the BLAST tabular output format 6. This is my 1st time running blastx. Here is an example of my output.

$head blastx_uniprotsprot_spl_m40a001.outfmt6 

NS500162:92:H2L22BGXX:1:11112:5797:6378    APEX1_DANRE  42.857  42  22  1   125 6   166 207 0.36    30.4
NS500162:92:H2L22BGXX:1:11112:8897:6378     LIPA2_SYNY3 79.592  49  10  0   150 4   226 274 2.39e-18    78.2
NS500162:92:H2L22BGXX:1:11112:9991:6378 SERC_METBF  68.000  50  16  0   150 1   66  115 1.23e-17    76.6

A few specifics on my data:

My sample: M40A, has a total of 10,913,966 nucleotide sequences in fasta format and: wc -l M40A_blastx_uniprot.txt = 34,809,040 (raw output of blastx)

Thinking I have a lot of duplicate reads aligning to multiple places in the database, I decided to pull out all unique sequences:

awk '{print$1}' M40A_blastx_uniprot.txt| sort | uniq | wc -l  = 33,969,282

The next thing I did was to see if I could grep one of these sequences from my fasta file - which produced zero results!

$grep -A 1 'NS500162:92:H2L22BGXX:1:11112:5797:6378' ../m40c.r1.fa

$grep

Randomly choosing a bunch of these sequence IDs (from blastx column 1 output) and trying to grep the original sequences from my fasta file gives me the same results each time (zero results).

I'm beginning to think my results are wrong, but I thought I would ask anyone here if this is "normal" for a blastx outfmt6 file? My questions are this:

1) How can I have 33 million unique sequence IDs from my blastx output file when I've put in 10 million fasta reads? Is this a standard output - to have this many duplicate hits? For my data, this means each sequence has > than 3 hits.

2) How can I pull out the original sequences by the seqid given in blastx results when grep gives me zero results? Is there a separate script that I am missing that allows for this? Looking into this answer at other websites, other researchers just use standard grep.

Any insight into this rabbit hole I'm suddenly in would be greatly appreciated!

Many Thank-you's!

blastx fasta outfmt6 • 2.2k views

ADD COMMENT • link updated 7.0 years ago by Joe 21k • written 7.0 years ago by ladypurrsia ▴ 60

1

Entering edit mode

BLAST doesn't give you a single exact match for a given query. It is a local aligner, so it will have found subregions of your queries that match other things. For some of your queries it might only find one, thus you end up with results > the number of query sequences. We would need to see some example headers and content from your fasta file too to be able to say why the grep is failing.

ADD REPLY • link 7.0 years ago by Joe 21k

0

Entering edit mode

jrj.healey:

Thank you very much for your explanation. This makes cogent sense! As for my fasta file: Here is my header for my fa file. I am not exactly sure what you mean by content. I will say that the sequences have been quality filtered with Q > 20 and they range from 20bp - 150pb (due to quality filtering; before filtering they are 150bp).

$head m40a.r1.fa
>NS500162:172:HG5CJBGXX:1:11101:25222:1061 1:N:0:14
CGAGTNCGAGTTAAAGATGCACTTGGCTGCCCGGCTATTTGATAAAGGGATTATCACCTCTGGGCAGGGCGCTGAGATTGCCGGTGTGTCTAAGCAGACCTTCATAGAACTCTTGGGCCGGTACGAAGTCTCCGTTTTCCAGTACGGAAT
>NS500162:172:HG5CJBGXX:1:11101:23881:1061 1:N:0:14
GCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAACGATTGTGTTGCATAGCGCTGGCCATCCACGT
>NS500162:172:HG5CJBGXX:1:11101:22290:1064 1:N:0:14
AACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGCGCAGCTGCGCCCGCAGAAGACTATTCCTGGCC

ADD REPLY • link 7.0 years ago by ladypurrsia ▴ 60

0

Entering edit mode

I think you may need to upload the whole input file somewhere. The grep command you provided, and that sample data, will obviously not work, because the strings don't match after the first colon, your sample data has 172 and the original command looks for grep -A 1 'NS500162:92... but I would do you the service of presuming that the files with the substring 92 just appear later in the file and not that you're just trying to grep for the wrong thing.

Have you tried searching manually in the file? E.g. open the file with nano and look for particular strings with Ctrl+W. This would tell us whether your grep is malformed in some way, or your fasta file is maybe not what you think it is.

ADD REPLY • link 7.0 years ago by Joe 21k

0

Entering edit mode

your grep is malformed in some way

Unlikely. See this response: C: Can't grep qseqid (column 1) from blastx outfmt6

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Entering edit mode

Yeah it wasn't high on my list of likely culprits. It looked OK by eye anyway.

ADD REPLY • link 7.0 years ago by Joe 21k

0

Entering edit mode

How can I pull out the original sequences by the seqid given in blastx results when grep gives me zero results?

See (remove -exclude): C: How to remove sequences from a fasta file based on ID list?

Why is grep not working? Have you tried the command without the quotes?

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Entering edit mode

Genomax2:

Thank you for your reply. I have also tried the following with grep (all have given me zero results):

$grep -A 1 -e NS500162:92:H2L22BGXX:1:11112:5797:6378 m40a.r1.fa
$grep -A 1 NS500162:92:H2L22BGXX:1:11112:5797:6378 m40a.r1.fa
$grep -A 1 -Ff seqIds.txt m40a.r1.fa > m40a.blastx.grep.fa

By the way, this third grep command is incredibly memory intensive (>10gb) to run on 30million + sequence IDs in a file; where seqIDs.txt is all of my seqIDs from my blastx results (all of column 1) ex:

$head seqIDs.txt

NS500162:92:H2L22BGXX:1:11101:16892:1095
NS500162:92:H2L22BGXX:1:11101:11433:1095
NS500162:92:H2L22BGXX:1:11101:8653:1097
NS500162:92:H2L22BGXX:1:11101:2748:1097

I then switched to seqtk seqtk which I have used numerous times in the past (it's super fast) to pull out fasta/fastq seqs from a list of sequence IDs with this command: $seqtk subseq m40a.r1.fa seqIds.txt which only gave me a blank file, as well.

My fasta file looks like this:

$head m40a.r1.fa
>NS500162:172:HG5CJBGXX:1:11101:25222:1061 1:N:0:14
CGAGTNCGAGTTAAAGATGCACTTGGCTGCCCGGCTATTTGATAAAGGGATTATCACCTCTGGGCAGGGCGCTGAGATTGCCGGTGTGTCTAAGCAGACCTTCATAGAACTCTTGGGCCGGTACGAAGTCTCCGTTTTCCAGTACGGAAT
>NS500162:172:HG5CJBGXX:1:11101:23881:1061 1:N:0:14
GCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAACGATTGTGTTGCATAGCGCTGGCCATCCACGT
>NS500162:172:HG5CJBGXX:1:11101:22290:1064 1:N:0:14
AACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGCGCAGCTGCGCCCGCAGAAGACTATTCCTGGCC

ADD REPLY • link 7.0 years ago by ladypurrsia ▴ 60

0

Entering edit mode

I am able to get results instantly using your example data and the same grep command. Not sure why this is not working for you.

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Entering edit mode

Me either :( Yes - I have tested the grep command and get results, all the same. For example, if I were to grep the first seq ID from my .fa file, out of that same .fa file, I get a result. If I were to grep the 1st seqID from the blastx result, from the blastx file, I get a result. It's when I try to grep the first seqID from my blastx result (from column 1) against my .fa file, I get nothing. Quite the paradox.

ADD REPLY • link 7.0 years ago by ladypurrsia ▴ 60

score 1 · Answer 1 · 2017-04-18

I'm still not exactly sure where the problem lies. I suspect your multifasta might not be what you think it is.

For example, with your blast file:

NS500162:92:H2L22BGXX:1:11112:5797:6378    APEX1_DANRE  42.857  42  22  1   125 6   166 207 0.36    30.4
NS500162:92:H2L22BGXX:1:11112:8897:6378     LIPA2_SYNY3 79.592  49  10  0   150 4   226 274 2.39e-18    78.2
NS500162:92:H2L22BGXX:1:11112:9991:6378 SERC_METBF  68.000  50  16  0   150 1   66  115 1.23e-17    76.6

And this fasta file (note I had to totally make up the last sequence to match the grep command you gave):

>NS500162:172:HG5CJBGXX:1:11101:25222:1061 1:N:0:14
CGAGTNCGAGTTAAAGATGCACTTGGCTGCCCGGCTATTTGATAAAGGGATTATCACCTCTGGGCAGGGCGCTGAGATTGCCGGTGTGTCTAAGCAGACCTTCATAGAACTCTTGGGCCGGTACGAAGTCTCCGTTTTCCAGTACGGAAT
>NS500162:172:HG5CJBGXX:1:11101:23881:1061 1:N:0:14
GCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAACGATTGTGTTGCATAGCGCTGGCCATCCACGT
>NS500162:172:HG5CJBGXX:1:11101:22290:1064 1:N:0:14
AACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGCGCAGCTGCGCCCGCAGAAGACTATTCCTGGCC
>NS500162:92:H2L22BGXX:1:11112:5797:6378 1:N:0:14
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGCATCAGCTAGCATCGATCGACGATCAGCATCGA

Then I can do:

var=$(cat test.blastout | awk '{print$1}') && cat seqs.fasta | grep -E -A 1 "${var}"

Where $var correctly evaluates to:

[ NS500162:92:H2L22BGXX:1:11112:5797:6378, NS500162:92:H2L22BGXX:1:11112:8897:6378, NS500162:92:H2L22BGXX:1:11112:9991:6378 ]

and get:

>NS500162:92:H2L22BGXX:1:11112:5797:6378 1:N:0:14
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGCATCAGCTAGCATCGATCGACGATCAGCATCGA

(I realise this may not answer the OPs question if the commands simply aren't working but I figured it was a bit too much formatted code for a comment)