Question

How to compare two files and grep one upstream & and downstream genes

0

Entering edit mode

4.7 years ago

Kumar ▴ 170

Hi, I have two tab delimited files. I am looking to compare these two files and need to grep one upstream & and one downstream genes from file 2 based on the locations in column 2 and 3 in file 1. Please suggest any script.

File 1:

Node1 3223 3245
Node2 4556 4599

File 2:

Node3 3199 3210 ID=r4567
Node4 3260 3290 ID=r4587
Node5 3310 3340 ID=r6754
Node6 4410 4490 ID=r0987
Node7 4491 4550 ID=r2345
Node8 4610 4660 ID=6789
Node9 4670 4710 ID=3245

OUTPUT:

Node1 3199 3210 ID=r4567 Node4 3260 3290 ID=r4587
Node2 4491 4550 ID=r2345 Node8 4610 4660 ID=6789

alignment Assembly • 1.6k views

ADD COMMENT • link updated 4.6 years ago by Kevin Blighe 87k • written 4.7 years ago by Kumar ▴ 170

1

Entering edit mode

If you want the closest feature up- or downstream, bedtools closest could be useful.

ADD REPLY • link 4.7 years ago by alex.zaccaron ▴ 410

0

Entering edit mode

Hi, I am using bedtools, however it is showing error- I have used sort command for sorting files.

sort -k1,1 -k2,2n in.bed > in.sorted.bed

ERROR: Sort order was unspecified, and file xac.bed is not sorted lexicographically. Please rerun with the -g option for a genome file. See documentation for details.

ADD REPLY • link 4.7 years ago by Kumar ▴ 170

0

Entering edit mode

Did you sort both files using the above command? Are both files in bed format (chr, start, stop, name, score)? Could you post the first few lines of both files?

ADD REPLY • link 4.7 years ago by shawn.w.foley ★ 1.3k

0

Entering edit mode

Yes, I sorted both files, Please see some lines of my both files.

file1.bed

NODE_103_length_15107_cov_6.54451 3214 3273

NODE_104_length_1587_cov_15.5242 1455 1420

NODE_10_length_119210_cov_21.822 49628 49660

NODE_10_length_119210_cov_21.822 49631 49663

file2.bed

NODE_1000_length_1519_cov_1.91598 197 568

NODE_1000_length_1519_cov_1.91598 525 620

NODE_1000_length_1519_cov_1.91598 738 1169

NODE_1000_length_1519_cov_1.91598 1227 1502

NODE_1001_length_1515_cov_1.35616 306 728

NODE_1001_length_1515_cov_1.35616 967 1476

NODE_1002_length_1512_cov_1.62938 33 650

NODE_1002_length_1512_cov_1.62938 671 1036

ADD REPLY • link 4.7 years ago by Kumar ▴ 170

0

Entering edit mode

Even though the file names end in .bed, those aren't in .bed format. You need to reformat them:

https://genome.ucsc.edu/FAQ/FAQformat.html#format1

ADD REPLY • link 4.7 years ago by shawn.w.foley ★ 1.3k

0

Entering edit mode

I have prepared my files in .bed format and sorted by following, still I am not able to get answers, Please see my error below:

sort -k1,1 -k2,2n in.bed > in.sorted.bed

File 1:
chr start stop name
NODE_55_length_30858_cov_27.421 19901 19951 bg1
NODE_55_length_30858_cov_27.421 19900 19930 bg2
NODE_55_length_30858_cov_27.421 17578 17613 bg3

File 2:
chr start stop name
NODE_1_length_424572_cov_13.6788 99 174 ID=LGE0059
NODE_1_length_424572_cov_13.6788 179 254 ID=LGE0059

ERROR: Sort order was unspecified, and file sorted-gff.bed is not sorted lexicographically. Please rerun with the -g option for a genome file. See documentation for details.

ADD REPLY • link 4.7 years ago by Kumar ▴ 170

0

Entering edit mode

I'm not sure I follow what you're asking, it looks like you might have had issues formatting the code in your question.

Are you interested in printing nodes where File 2 has a greater value than File 1? Are you interested in printing File 2 for only nodes where File 1 passes a criteria? Are the same number of nodes in each file?

ADD REPLY • link 4.7 years ago by shawn.w.foley ★ 1.3k

0

Entering edit mode

Hi, I improved my query now, please consider now.

ADD REPLY • link 4.7 years ago by Kumar ▴ 170

score 0 · Answer 1 · 2019-09-28

This is possible if you manipulate the input files to create a 'dudd' chromosome so that BEDTools can match:

cat file1
Node1   3223    3245
Node2   4556    4599

cat file2
Node3   3199    3210    ID=r4567
Node4   3260    3290    ID=r4587
Node5   3310    3340    ID=r6754
Node6   4410    4490    ID=r0987
Node7   4491    4550    ID=r2345
Node8   4610    4660    ID=6789
Node9   4670    4710    ID=3245

/Programs/bedtools2/bin/bedtools closest \
  -a <(awk '{print "chr0\t"$2"\t"$3"\t"$0}' file1) \
  -b <(awk '{print "chr0\t"$2"\t"$3"\t"$0}' file2) | \
  cut -f4,5,6,10,11,12,13
Node1   3223    3245    Node3   3199    3210    ID=r4567
Node2   4556    4599    Node7   4491    4550    ID=r2345

...or:

/Programs/bedtools2/bin/bedtools closest  \
  -a <(awk '{print "chr0\t"$2"\t"$3"\t"$0}' file1)  \
  -b <(awk '{print "chr0\t"$2"\t"$3"\t"$0}' file2) | \
  awk '{print $4"\t"$5"\t"$6"\t"$13"\t"$10"\t"$11"\t"$12"\t"$13}'
Node1   3223    3245    ID=r4567    Node3   3199    3210    ID=r4567
Node2   4556    4599    ID=r2345    Node7   4491    4550    ID=r2345

Kevin