Question

Extraction of Certain Lines Which Start and End with a Particular Description

10.7 years ago

Abdul Rawoof ▴ 60

Hello everyone,

I am trying to extract the lines that start with ">>" and end with "Complete" from my input file.

INPUT FILE:

  Read Sequence:ENSG00000110092|ENST00000227507 (3192 nt)
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Performing Scan: hsa-miR-193b-5p
   vs ENSG00000110092|ENST00000227507
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Score for this Scan:
  No Hits Found above Threshold
  Complete

  Read Sequence:ENSG00000169429|ENST00000307407 (1252 nt)
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Performing Scan: hsa-miR-491-5p
   vs ENSG00000169429|ENST00000307407
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
     Forward:    Score: 125.000000  Q:3 to 23  R:1060 to 1083 Align Len (23) (78.26%) (86.96%)

  Scores for this hit:
   **>hsa-miR-491-5p**
      ENSG00000169429|ENST00000307407    125.00    -19.70    0.00    3 23    1060 1083    23    78.26%    86.96%

  Score for this Scan:
  Seq1,Seq2,Tot Score,Tot Energy,Max Score,Max Energy,Strand,Len1,Len2,Positions
  **>>hsa-miR-491-5p**
      ENSG00000169429|ENST00000307407    125.00    -19.70    125.00    -19.70    9000    22    1252     1059
  Complete

  Read Sequence:ENSG00000109320|ENST00000226574 (708 nt)
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Performing Scan: hsa-miR-193b-5p
   vs ENSG00000109320|ENST00000226574
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Score for this Scan:
  No Hits Found above Threshold
  Complete

##Perl Script ##############

chomp($input=<STDIN>);
open(IN,$input) or die "Can not open the file";
@cont=<IN>;
foreach $line(@cont)
{
  if($line=~/>>/)
    {   chomp($line);
         print "$line\n";
     }
 }

It's giving me output like this (3.txt):

 >>hsa-miR-491-5p
  >>hsa-miR-491-5p
  >>hsa-miR-491-5p
  >>hsa-miR-639
  >>hsa-miR-639

But I want the output like this:

>>hsa-miR-491-5p
      ENSG00000169429|ENST00000307407    125.00    -19.70    125.00    -19.70    9000    22    1252     1059
>>hsa-miR-491-5p
      ENSG00000169429|ENST00000307407    125.00    -18.30    125.00    -18.30    9000    22    1252     1028
>>hsa-miR-491-5p
      ENSG00000169429|ENST00000307407    125.00    -14.70    125.00    -14.70    9000    22    1252     1059

I have also tried the following script:

print"Enter input file:\n";
chomp($filename=<STDIN>);
unless(open(FH,$filename))
{print"Cannot open the file..\n"; exit; }
open(OUT,">targets.txt") or die "can't help it";
 @cont=<FH>;
close FH;
$flag=0; $seq=""; @anno=();
foreach (@cont)
{  if($_=~/^Complete/)
 { last;}    
elsif($_=~/^\s+>>/)
 {$flag=1;}
elsif($flag==1)
 {  $seq.=$_;}
else { push (@anno,$_);}
}

print OUT$seq,"\n";

But it finds the first hit and then runs straight to the last "Complete", printing everything in between. Could anyone suggest what I should do to get the output I want? Any help would be appreciated. Thanks.

perl fasta • 3.4k views

ADD COMMENT • link updated 8.7 years ago by Biostar 20 • written 10.7 years ago by Abdul Rawoof ▴ 60

Just a minor adjustment in the script's regex below (added \*\*) was needed to make the script produce your desired output. Hope this helps!

ADD REPLY • link 10.6 years ago by Kenosis ★ 1.3k

score 1 · Answer 1 · 2013-08-06

It appears that you want to extract and print the text between ">>" and "Complete" when those delimiters appear in the lines of your dataset. Is this correct? However, the last two lines of your desired output are not in the contents of your input file. Regardless, if the mentioned extraction is what you want, the following Perl script will do it:

use strict;
use warnings;

$/ = '';

while (<>) {
    print "$1$2 $1\n" if />>(\S+)\*\*(.+?)\s+Complete/s;
}

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Output from your dataset:

hsa-miR-491-5p
      ENSG00000169429|ENST00000307407    125.00    -19.70    125.00    -19.70    9000    22    1252     1059 hsa-miR-491-5p

The first line after the pragmas sets file reading to paragraph mode, so paragraphs are read a 'chunk' at a time, as blank lines separate the data you want to process. The regex will capture the text between the delimiters you've mentioned--and that text is printed if found.
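For readers who would rather see the same idea outside Perl, here is a rough Python sketch of paragraph-style reading: split the text on blank lines, then apply one regex per chunk. The function name `extract_hits` is made up for illustration, and the pattern tolerates but does not require the `**` markers, which is my assumption rather than something the script above does.

```python
import re

# Hypothetical helper, not part of the Perl script above.
# Matches a '>>name' header line (with or without a trailing '**'),
# then captures everything up to the whitespace before 'Complete'.
HIT = re.compile(r">>(\S+?)(?:\*\*)?\s*\n(.+?)\s+Complete", re.S)


def extract_hits(text):
    """Return (name, data) pairs for chunks holding a '>>' header and 'Complete'."""
    hits = []
    # Blank lines separate chunks, loosely mirroring Perl's paragraph mode.
    for chunk in re.split(r"\n\s*\n", text):
        match = HIT.search(chunk)
        if match:
            hits.append((match.group(1), match.group(2).strip()))
    return hits
```

Chunks that contain no ">>" header, such as the "No Hits Found above Threshold" records, simply produce no match and are skipped.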

Hope this helps!

score 0 · Answer 2 · 2013-08-01

My Perl's not brilliant, but the problem with your first approach is that you only print a line when it has >> in it. What you want is to set a flag when you see that, and turn the flag off when you see Complete (similar to your second approach). I get the output you want on the sample input you gave us with the following Python code:

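import sys

inputFile = sys.argv[1]  # assumption: input file path is given on the command line
displayFlag = False
with open(inputFile, 'r') as readIn:
    for line in readIn:
        # Ignore indentation and any '**' emphasis markers around the text
        text = line.strip().strip('*')
        if text == 'Complete':
            displayFlag = False  # end of the block
        elif displayFlag:
            print(line, end='')  # line already ends with a newline
        elif text[:2] == '>>':
            displayFlag = True
            print(text)

Basically it turns the flag off when the line contains only Complete, prints the line while the flag is on, and sets the flag when the line's text starts with '>>' (leading whitespace is ignored). I assume it's pretty straightforward to translate this into Perl.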