Biostar Beta. Not for public use.
check if two columns match in 2 annotation files and print those lines to a new output file
0
Entering edit mode
2.1 years ago
United States

I have 2 gff3 files. i want to extract genes with same gene lengths in both the files. If they are same, i want to write them to a third file

example: file 1

C20775336 maker gene 1895 2166 . - . ID=gene1;Name=maker-C20
C20775336 maker gene 895 1166 . - . ID=mRNA1;Parent=gene1;N
C20775336 maker gene 1895 1962 . - . Parent=mRNA1
C20775336 maker gene 2795 2962 . - 2 ID=CDS1;Parent=mRNA1

file 2

C20775336 maker gene 895 1166 . - . ID=gene1;Name=maker-C20
C20775936 maker gene 1895 2166 . - . ID=mRNA1;Parent=gene1;N
C20775336 maker gene 2795 2962 . - . Parent=mRNA1

output file

C20775336 maker gene 895 1166 . - . ID=gene1;Name=maker-C20

C20775336 maker gene 2795 2962 . - . Parent=mRNA1

so if columns 4 and 5 in first file is equal to column 4 and 5 in second file, print that line from second file to a new output file.

I tried to work on awk, to achieve this, but i could not figure out.

Thanks in advance

sequence • 17k views
ADD COMMENTlink
0
Entering edit mode

Joining Multiple Fields Using Unix Join (from SO) gives very nice solutions using awk and join. If I have to combine multiple files in unix I use what Michael Mrozek suggested - combine fields using awk and then join files according one key field with join.

ADD REPLYlink
0
Entering edit mode

I have same query also, but if i don't care about rest of first column, only first column should match and print number of match found in first file and number of match found in second file.

thanks

ADD REPLYlink
0
Entering edit mode

For quicker response you should ask this general programming question on SO.

ADD REPLYlink
0
Entering edit mode

Sorry...I did not get about SO

ADD REPLYlink
0
Entering edit mode
ADD REPLYlink
4
Entering edit mode
16 months ago
venu 6.2k
Germany
Is this what you need?

​awk 'FNR==NR{a[$4,$5]=$0;next}{if(b=a[$4,$5]){print b}}' file1 file2
ADD COMMENTlink
0
Entering edit mode

+1 for a pure Awk solution (and a creative usage of FNR). Would you care to explain what the command does for people who are not familiar with Awk?

ADD REPLYlink
0
Entering edit mode

for some reason, the above command does not work for me, i get a error message "bash: ​awk: command not found.."

ADD REPLYlink
0
Entering edit mode

I tried to work on awk. This statement from your question letting us to assume you have used awk on your machine initially, if so the above oneliner should work. Type awk -Vin terminal.

ADD REPLYlink
0
Entering edit mode

@Frederic Mahe Two variables FNRand NRstores the number of records in the current file and total number of records respectively. $0 is to hold entire line. Rest is as simple as any other comparison, matching column 4 and 5 from file1 (a) with file2 (b), if matched, print entire line from file2 (b).

ADD REPLYlink
0
Entering edit mode

Hi, what if I need the inverse of this? I tried awk 'FNR==NR{a[$4,$5]=$0;next}{if(b!=a[$4,$5]){print b}}' file1 file2 but all I got was an empty file.

ADD REPLYlink
0
Entering edit mode
13 months ago
mark.ziemann ♦ 1.2k
Australia/Mebourne/Geelong/Deakin

Try this:

##First sort each of the input files on the 1st column

sort -k 1b,1 file2 > file2s

sort -k 1b,1 file1 > file1s

##Now they can be joined on the 1st field, followed by awk 1 line

join -1 1 -2 1 file1s file2s \

| awk '{OFS="\t"} $5-$4==$13-$12' > outputfile

ADD COMMENTlink
0
Entering edit mode

Thank you, your solution works perfectly. However it joins the lines from both the files . Is there a way to skip that.

for example , it did some thing like this on my origina file

scaffold1732 baker gene 85677 87592 . + . ID=gene9;Name=maker-gnlscaffold1732-augustus-gene-0.6 maker gene 85677 87592 . + . ID=gene9;Name=maker-gnl|Hsal_3.3|scaffold1732-augustus-gene-0.6

instead can i get output lines like this

scaffold1732 baker gene 85677 87592 . + . ID=gene9;Name=maker-

ADD REPLYlink
0
Entering edit mode

With join you can specify wanted output (simple examples).

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1