Position of N nucleotides in reference genome
1
0
Entering edit mode
8.4 years ago

Title says it all.

Is there a resource that has the positions for every "N" nucleotide in the reference genome (hg38)?

If not, is there a method to quickly generate these positions.

I was thinking using bedtools nuc with a bed file that windows the genome, but that seems too much work.

Thanks!

reference-genome nucleotide • 2.6k views
ADD COMMENT
5
Entering edit mode
8.4 years ago

On Linux, you could do the following to get sequences, extract them to fasta, and write output to BED format:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
$ for i in `seq 1 22` X Y; do \
    ./twoBitToFa -seq=chr$i hg38.2bit hg38.2bit.chr$i.fa; \
  done
$ for i in `seq 1 22` X Y; do \
    awk -vchr="chr$i" ' \
        BEGIN { \
            p = 0; \
        } \
        { \
            if ($0 !~ /^>/) { \
                n = split($0, chars, ""); \
                for (i = 1; i <= n; i++) { \
                    if (chars[i] == "N") { \
                        printf("%s\t%d\t%d\n", chr, p, p + 1); \
                    } \
                    p++; \
                } \
            } \
        }' hg38.2bit.chr$i.fa \
        > hg38.2bit.chr$i.N.bed; \
  done

The work in the for loops can be parallelized, if you need a speed-up, by splitting out work to per-chromosome tasks.

If you need contiguous regions of Ns in the output BED, you can use BEDOPS bedops --merge:

$ for i in `seq 1 22` X Y; do \
    bedops --merge hg38.2bit.chr$i.N.bed > hg38.2bit.chr$i.N.contiguous.bed; \
  done

Be prepared for large BED files, unless you do a bedops --merge at the end.

Since these are three-column BED files, using the Starch format to archive BED files would be pretty useful here - there's a lot of redundancy to be removed. You could pipe the output of awk to starch to make an archive:

$ for i in `seq 1 22` X Y; do \
    awk -vchr="chr$i" ' \
        BEGIN { \
            p = 0; \
        } \
        { \
            if ($0 !~ /^>/) { \
                n = split($0, chars, ""); \
                for (i = 1; i <= n; i++) { \
                    if (chars[i] == "N") { \
                        printf("%s\t%d\t%d\n", chr, p, p + 1); \
                    } \
                    p++; \
                } \
            } \
        }' hg38.2bit.chr$i.fa \
        | starch - \
        > hg38.2bit.chr$i.N.starch; \
  done
ADD COMMENT
1
Entering edit mode

I would vote you up and say it's the solution but I keep getting a "Unable to submit vote" error

ADD REPLY
1
Entering edit mode

Don't worry, I get that a lot.

ADD REPLY

Login before adding your answer.

Traffic: 2021 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6