Question

Extracting genes from one file using coordinate from second file

0

Entering edit mode

6.4 years ago

bk11 ★ 2.3k

Hi Alex Reynolds,

I am sorry to ask this question once again. Because I got some error from previous command line.

My first file looks like: First.txt

deletion    chr10:1726501-1755000   28500   0.586226    9.73037E-05 715.754 0.00171548  3546.87 0.114216    1.17241

Second file looks like: Second.txt

Chr10   NC_029525.1 gene    1672245 1676954 -   LOC107318572
Chr10   NC_029525.1 gene    1677076 1682931 -   C10H15orf39
Chr10   NC_029525.1 gene    1690899 1710413 -   PPCDC
Chr10   NC_029525.1 gene    1710723 1714472 -   LOC107318577
Chr10   NC_029525.1 gene    1714558 1714977 -   LOC107318579
Chr10   NC_029525.1 gene    1717116 1719122 +   RPP25
Chr10   NC_029525.1 gene    1721742 1725395 +   LOC107318578
Chr10   NC_029525.1 gene    1725935 1728167 +   FAM219B
Chr10   NC_029525.1 gene    1728336 1731151 -   MPI
Chr10   NC_029525.1 gene    1731194 1739576 +   LOC107318570
Chr10   NC_029525.1 gene    1739821 1743801 +   ULK3
Chr10   NC_029525.1 gene    1744568 1747749 -   CPLX3
Chr10   NC_029525.1 gene    1752411 1759515 -   CSK

I want to get result like this: Asnwer.txt

Chr10   NC_029525.1 gene    1725935 1728167 +   FAM219B
Chr10   NC_029525.1 gene    1728336 1731151 -   MPI
Chr10   NC_029525.1 gene    1731194 1739576 +   LOC107318570
Chr10   NC_029525.1 gene    1739821 1743801 +   ULK3
Chr10   NC_029525.1 gene    1744568 1747749 -   CPLX3
Chr10   NC_029525.1 gene    1752411 1759515 -   CSK

awk bash bedoops • 1.6k views

ADD COMMENT • link updated 6.4 years ago by Alex Reynolds 35k • written 6.4 years ago by bk11 ★ 2.3k

score 1 · Answer 1 · 2017-11-14

1

Entering edit mode

6.4 years ago

Alex Reynolds 35k

Preprocess the first file into a sorted BED file.

You can split the second field of each line by a regular expression pattern and write the output to a three-column, unsorted BED file, sort it, and write the sorted output to a BED file:

$ awk -v OFS="\t" '{ n=split($2,a,/[:-]/); print a[1], a[2], a[3]; }' First.txt | sort-bed - > First.bed

Preprocess the second file into a sorted BED file.

This step requires decapitalizing the chromosome name (so that you can do set operations later), shuffling columns into UCSC BED order, and sorting:

$ awk -v OFS="\t" '{ if ($1~/^Chr/) { $1=tolower(substr($1,1,1))substr($1,2); } print $1, $4, $5, $2, $3, $6, $7; }' Second.txt | sort-bed - > Second.bed

Double-check that you have BED files by running head First.bed and head Second.bed. These should look like sorted BED files. If not, review the steps before proceeding.

Now you can do set operations, filtering the Second.bed file based on overlaps with elements in First.bed:

$ bedops --element-of 1 Second.bed First.bed > Filtered.bed

Re-capitalize the chromosome name in Filtered.bed and re-order columns to match your desired format:

$ awk -v OFS="\t" '{ $1=toupper(substr($1,1,1))substr($1,2); print $1, . . . }' Filtered.bed > Filtered.txt

I'll let you fill in . . . so that you get a little experience with moving columns around with awk.

ADD COMMENT • link 6.4 years ago by Alex Reynolds 35k

0

Entering edit mode

Thank you Alex for quick reply. I should say it worked partially because it is missing some genes list from chromosomes like chrZ, chrM, LGE64 and LGE22C19W28_E50C23.

ADD REPLY • link 6.4 years ago by bk11 ★ 2.3k

0

Entering edit mode

Yep, that's a bug. See the revised preprocessing step for the second file.

ADD REPLY • link 6.4 years ago by Alex Reynolds 35k

0

Entering edit mode

I am able to get chrZ but chromosomes chrW, chrM, LGE64 and LGE22C19W28_E50C23 are still missing.

ADD REPLY • link 6.4 years ago by bk11 ★ 2.3k

0

Entering edit mode

I added a fix to test for the chromosome name in the second preprocess step. If the name starts with Chr then the first letter is made lowercase. Otherwise, the name is left unmodified.

You'll need to investigate further on your own, from this point out, as you may need to preprocess your first file, as well, depending on how you have named chromosomes in that file.

ADD REPLY • link 6.4 years ago by Alex Reynolds 35k