Get chromosome name from RefSeq format for C.elegans
2
0
Entering edit mode
8.7 years ago
rioualen ▴ 710

Hello,

I'm currently processing ChIP-seq data from C.elegans.

I've used data from GEO which I processed this way:

sra -> fastq (fastq-dump) -> sam (bwa) -> bam (samtools view) -> bed (bedtools bamtobed)

However I noticed chromosome names were in RefSeq format:

NC_003282.8    9767072    9767122
NC_003284.9    7940872    7940922
NC_003284.9    112717     112746 

I would like to get "standard" chromosome names (eg chr1, chr2...). Does one of the tools I used so far allows that? If not, is there a way to convert bedfiles, or at least some kind of conversion table?

Thanks a lot!

CR

ChIP-Seq Refseq • 2.0k views
ADD COMMENT
1
Entering edit mode
8.7 years ago
rioualen ▴ 710

OK, so I found the correspondence there: http://www.ncbi.nlm.nih.gov/assembly/GCF_000002985.6 and I wrote it to a 2-column tab-delimited file.

I wrote this python script:

out = open('test_converted.bed','w')
with open('test.bed') as bedfile:
    for line1 in bedfile:
        RefSeq_id_1 = line1.split()[0]
        other_info = '\t'.join(line1.split()[1:len(line1.split())])
        with open('RefSeq_to_std.tab') as conversion_table:
            for line2 in conversion_table:
                RefSeq_id_2 = line2.split()[0]
                chr_id = line2.split()[1]
                if RefSeq_id_1 == RefSeq_id_2:
                    towrite = chr_id + '\t' + other_info + '\n'
                    out.write(towrite)

However I'm still wondering if there's a more efficient way to do this, as I've got pretty big datasets... Either with python, shell language or R? (I'm using the 3 of them in my workflow).

Thanks again!

ADD COMMENT
1
Entering edit mode
8.7 years ago
PoGibas 5.1k

I am not an expert, but it seems that "standard" chromosome names for C. elegans are: chrI/II/III/IV/V/M/X.

I strongly recommend not to create your own names.

NC_003282 = chrIV
NC_003284 = chrX

For more efficient solution use sed or awk one-liner.

awk '{gsub("NC_003282","chrIV"); gsub("NC_003284","chrX"); print}' INPUT.bed
ADD COMMENT
0
Entering edit mode

1) Indeed ;)

2) Thanks! That's exactly what I was looking for, but I'm not very familiar with awk.

ADD REPLY

Login before adding your answer.

Traffic: 2694 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6