Merging Overlaping regions
3
0
Entering edit mode
5.5 years ago
hosin • 0

Hello I have two files and each file has one column, for example:

File 1:600rows, Like:

chr1:100109327-100162088
chr1:103728175-103796914
chr2:103908343-104007674
chr5:104660686-104731626
chr11:216500833-216575408
chr12:218693122-218856909
chr13:220346086-220550182
chr17:21377148-21496334
chr17:21808582-21902643
chr18:67771953-67842433
chr19:49677672-49745965
chr25:2341943-2453481
chr26:25210373-26275666

File 2:400rows,Like:

chr1:100119327-101162999
chr1:103728175-103796914
chr2:103908343-104777674
chr5:104660686-104731626
chr11:216500833-216575408
chr12:218693122-218856909
chr13:220356089-220555555
chr17:21377148-21496334
chr17:21808582-21902643
chr18:67772354-67942544
chr19:49877683-49945922
chr25:2341943-2453481
chr26:55210373-56275666

I want to merging overlapping them,Of course some regions are same, so we just need to one of them, and some regions doesn't overlap, we need both of them. Please help me.Thanks

genome • 1.7k views
ADD COMMENT
1
Entering edit mode
5.5 years ago
h.mon 35k

Concatenate the files, convert the stream to bed and use bedtools merge:

cat file1.txt file2.txt | sed "s/[-:]/\t/g" | sortBed | mergeBed

Note: bedtools sort complains the 7th line of file2.txt (chr13:220356089-22055555) is illegal, as the end coordinate is less than the start coordinate.

edit: the code bellow works for your updated files. It is ugly, but works.

cat file1.txt file2.txt \
  | awk -vFS="\t" -vOFS="\t" '{print $1,$2";"$3}' \
  | awk -vFS="\t" -vOFS="\t" '{gsub(/[:-]/,"\t",$1)}1' \
  | sortBed | mergeBed -c 4 -o collapse -delim ";" \
  | perl -lna -F'/\t/' -e '@tmp = split( /;/, $F[3] ); 
                           %genes   = map { $_, 1 } ( split( / /, ($tmp[0]." ".$tmp[2]) ) );
                           @genes = keys %genes;
                           %animals   = map { $_, 1 } ( split( / /, ($tmp[1]." ".$tmp[3]) ) ); 
                           @animals = keys %animals;
                           print "$F[0]\t$F[1]\t$F[2]\t@genes\t@animals";'
ADD COMMENT
0
Entering edit mode

I would be thankful for your time , Your suggestion is Ok. If I have 3 columns in each file? Actually I want to merge based on first column(regions), for example:

File 1:

Regions                   Genes                 Animals
chr1:100109327-100162088  BTG3 TRNAC-GCA ROBO1  FKD18 SUM3  
chr1:103728175-103796914  CD80 TRNAK-UUU LSAMP  AL108 BDP358 MFW157 MFW160 MFW163 
chr2:103908343-104007674  FGF12                                                                MFW163  BAL5
chr5:104660686-104731950  ITGB2 COL6A2 FTCD SPATC1L                      BAL3 BAL4 GHE2

file 2:

Regions                   Genes                                Animals
chr1:100119327-101162999  BTG3 TRNAC-GCA ROBO1 TRNAS-GGA       FKD19 
chr1:103728175-103796914  CD80 TRNAK-UUU LSAMP                 AL108 BDP358 MFW157 MFW163 BAL2
chr2:103908343-104777674  FGF12                                MFW163  BAL1
chr5:104660686-104731626 COL6A2 FTCD SPATC1L                   BAL3 BAL4

result:

Regions                   Genes                                 Animals
chr1:100109327-101162999  BTG3 TRNAC-GCA ROBO1  TRNAS-GGA       FKD18 SUM3 FKD19  
chr1:103728175-103796914  CD80 TRNAK-UUU LSAMP                  AL108 BDP358 MFW157 MFW160 MFW163  BAL2
chr2:103908343-104007674  FGF12                                  MFW163  BAL1  BAL5
chr5:104660686-104731950  ITGB2 COL6A2 FTCD SPATC1L              BAL3 BAL4 GHE2
ADD REPLY
0
Entering edit mode

The full format of your data should have been described from the start.

The simplest solution I can think of is to 1) merge your 2nd and 3rd columns into a new one with perl, python or awk; 2) convert the stream to bed with sed (as above); 3) sortBed; 4) mergeBed -c 4 -o collapse. This way, you will have the same result as above, but with a 4th column consisting of all genes and animals. Now, use perl, python or awk again to unroll this 4th column into two columns.

I am off my computer, so I can't provide code examples right now.

edit: I have updated my answer to cover your extended question. Honestly, you will be hard pressed to find more elegant code. I am amazed at myself for the combination of awk and perl in the same stream of commands.

ADD REPLY
0
Entering edit mode

Dear h.mon Hello I'm very grateful for your time. I've tried the last code but it does not work. It has these errors:

/home/software/bin/sortBed: line 2: 18121 Bus error (core dumped) ${0%/*}/bedtools sort "$@"

/home/software/bin/mergeBed: line 2: 18120 Bus error (core dumped) ${0%/*}/bedtools merge "$@"

ADD REPLY
0
Entering edit mode

Try to replace sortBed (BEDtools) with sort-bed (BEDOPS) or sort -k1,1 -k2,2n -k3,3n (GNU coreutils).

I just noticed my suggestion create bed files with spaces on the fourth column, which is not a good idea. It worked with the small example you provided, but may cause problems. It would be a good idea to replace those spaces with other character, then replace back for the final output.

What is the output of:

cat file1.txt file2.txt \
  | awk -vFS="\t" -vOFS="\t" '{print $1,$2";"$3}' \
  | awk -vFS="\t" -vOFS="\t" '{gsub(/[:-]/,"\t",$1)}1' \
  | sed -n '18119,18122p;18123q'
ADD REPLY
0
Entering edit mode

Hi, I did, But there is error, again :

The first run:replace sortBed with sort-bed

cat ss661pp.txt ss415pp.txt   \
  | awk -vFS="\t" -vOFS="\t" '{print $1,$2";"$3}' \
  | awk -vFS="\t" -vOFS="\t" '{gsub(/[:-]/,"\t",$1)}1' \
  | sort-bed | mergeBed -c 4 -o collapse -delim ";" \
  | perl -lna -F'/\t/' -e '@tmp = split( /;/, $F[3] ); 
                               %genes   = map { $_, 1 } ( split( / /, ($tmp[0]." ".$tmp[2]) ) );
                               @genes = keys %genes;
                               %animals   = map { $_, 1 } ( split( / /, ($tmp[1]." ".$tmp[3]) ) ); 
                               @animals = keys %animals;
                               print "$F[0]\t$F[1]\t$F[2]\t@genes\t@animals";'>Output
-bash: sort-bed: command not found
/home/software/bin/mergeBed: line 2: 30789 Bus error               (core dumped) ${0%/*}/bedtools merge "$@"
  

Second run: replace sortBed with sort -k1,1 -k2,2n -k3,3n

cat ss661pp.txt ss415pp.txt \
  | awk -vFS="\t" -vOFS="\t" '{print $1,$2";"$3}' \
  | awk -vFS="\t" -vOFS="\t" '{gsub(/[:-]/,"\t",$1)}1' \
  | sort -k1,1 -k2,2n -k3,3n | mergeBed -c 4 -o collapse -delim ";" \
  | perl -lna -F'/\t/' -e '@tmp = split( /;/, $F[3] ); 
                           %genes   = map { $_, 1 } ( split( / /, ($tmp[0]." ".$tmp[2]) ) );
                           @genes = keys %genes;
                           %animals   = map { $_, 1 } ( split( / /, ($tmp[1]." ".$tmp[3]) ) ); 
                           @animals = keys %animals;
                           print "$F[0]\t$F[1]\t$F[2]\t@genes\t@animals";' >Output
/home/software/bin/mergeBed: line 2: 30764 Bus error               (core dumped) ${0%/*}/bedtools merge "$@"
  

Third run: Output is empty

cat ss661pp.txt ss415pp.txt   | awk -vFS="\t" -vOFS="\t" '{print $1,$2";"$3}'   | awk -vFS="\t" -vOFS="\t" '{gsub(/[:-]/,"\t",$1)}1'   | sed -n '18119,18122p;18123q' >Output
ADD REPLY
0
Entering edit mode

Sorry, I am out of ideas and I don't have access to your data to troubleshoot it. All I can say is it worked for the example you provided. It is possible (but unlikely) that there is a bug in BEDtools - I think bad data being fed into BEDtools is a more likely cause.

ADD REPLY
1
Entering edit mode
5.5 years ago

Via BEDOPS:

$ cat file1.txt file2.bed | sed "s/[:-]/\t/g' | sort-bed - | bedops --merge - > answer.bed

You can insert the following awk step to fix coordinate problems:

$ cat file1.txt file2.bed | sed "s/[:-]/\t/g" | awk -vOFS="\t" -vFS="\t" '{ if ($2==$3) { $2--; } else if ($2 > $3) { t = $3; $3 = $2; $2 = t; } print }' | sort-bed - | bedops --merge - > answer.bed

If you're using Mac OS X, use Homebrew to install GNU versions of sed and awk:

$ brew install gnu-sed
$ brew install gawk

Then replace sed and awk with gsed and gawk.


I'm removing --check-sort from the command. I think I may have used that option incorrectly.

Here's the answer I get from the test input, if it is useful to you:

$ cat /tmp/A.txt /tmp/B.txt | gsed "s/[:-]/\t/g" | gawk -vOFS="\t" -vFS="\t" '{ if ($2==$3) { $2--; } else if ($2 > $3) { t = $3; $3 = $2; $2 = t; } print; }' | sort-bed - | bedops --merge -
chr1    100109327   101162999
chr1    103728175   103796914
chr11   216500833   216575408
chr12   218693122   218856909
chr13   220346086   220555555
chr17   21377148    21496334
chr17   21808582    21902643
chr18   67771953    67942544
chr19   49677672    49745965
chr19   49877683    49945922
chr2    103908343   104777674
chr25   2341943 2453481
chr26   25210373    26275666
chr26   55210373    56275666
chr5    104660686   104731626

Replace gsed/gawk with sed/awk depending on local conditions etc.

Hope this helps!

ADD COMMENT
1
Entering edit mode

With OP original example data, using sort-bed --check-sort results in:

Bed file not properly sorted by first column.
See row: 5
  

But with only sort-bed, results are identical to sortBed and sort -k1,1 -k2,2n -k3,3n. What is the cause of the error message?

ADD REPLY
0
Entering edit mode

I don't know. Are you using awk to fix potential coordinate issues? Are there Windows carriage return characters in the file? What comes out of sort-bed on line 5, after awk and before any set operations? What comes out of cat -te on that output?

ADD REPLY
0
Entering edit mode

Thanks — I think I need to improve the documentation a bit. The --check-sort option is for checking if the input is sorted or not, before sorting. Mea culpa.

As this input would not be sorted, the error message is correct but my use of that option here is incorrect. After removing that option, the merge proceeds correctly.

ADD REPLY
0
Entering edit mode

Ah, thanks for the reply. I was working on the answers to your questions when I saw your post.

ADD REPLY
0
Entering edit mode
5.5 years ago
afli ▴ 190

Hi, I can provide you with a very basic solution:

awk -F":" 'BEGIN{OFS="\t"}{print $1,$2}' file1.txt|awk -F"-"  'BEGIN{OFS="\t"}{print $1,$2}' > tmp1
awk -F":" 'BEGIN{OFS="\t"}{print $1,$2}' file2.txt|awk -F"-"  'BEGIN{OFS="\t"}{print $1,$2}' > tmp2
cat tmp1 tmp2 |sort -k1V -k2n|uniq |mergeBed -i - -c 1 -o count|awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' > final_merge.txt

There may be a mistake in chr13 220356089 22055555, I delete it, please check the similar line by yourself.

Aifu,

ADD COMMENT

Login before adding your answer.

Traffic: 2596 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6