How to extract required lines of a particular chromosome from a GTF file?
3.4 years ago
Biologist ▴ 290

I have a GTF file that starts like this:

GL000008.2      StringTie       transcript      19957   20594   .       +       .       transcript_id "MSTRG.3.1"; gene_id "MSTRG.3"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
GL000008.2      StringTie       exon    19957   20594   .       +       .       transcript_id "MSTRG.3.1"; gene_id "MSTRG.3"; exon_number "1";
GL000008.2      StringTie       transcript      29530   29803   .       +       .       transcript_id "MSTRG.1.1"; gene_id "MSTRG.1"; xloc "XLOC_000002"; class_code "u"; tss_id "TSS2";
GL000008.2      StringTie       exon    29530   29803   .       +       .       transcript_id "MSTRG.1.1"; gene_id "MSTRG.1"; exon_number "1";
GL000008.2      StringTie       transcript      83390   131159  .       +       .       transcript_id "MSTRG.39.2"; gene_id "MSTRG.39"; xloc "XLOC_000003"; class_code "u"; tss_id "TSS3";
GL000008.2      StringTie       exon    83390   83545   .       +       .       transcript_id "MSTRG.39.2"; gene_id "MSTRG.39"; exon_number "1";
GL000008.2      StringTie       exon    85567   85625   .       +       .       transcript_id "MSTRG.39.2"; gene_id "MSTRG.39"; exon_number "2";
GL000008.2      StringTie       exon    88636   88695   .       +       .       transcript_id "MSTRG.39.2"; gene_id "MSTRG.39"; exon_number "3";
GL000008.2      StringTie       exon    129985  131159  .       +       .       transcript_id "MSTRG.39.2"; gene_id "MSTRG.39"; exon_number "4";
GL000008.2      StringTie       transcript      83390   133972  .       +       .       transcript_id "MSTRG.39.1"; gene_id "MSTRG.39"; xloc "XLOC_000003"; class_code "u"; tss_id "TSS3";
GL000008.2      StringTie       exon    83390   83545   .       +       .       transcript_id "MSTRG.39.1"; gene_id "MSTRG.39"; exon_number "1";
GL000008.2      StringTie       exon    85567   85625   .       +       .       transcript_id "MSTRG.39.1"; gene_id "MSTRG.39"; exon_number "2";
GL000008.2      StringTie       exon    129985  130583  .       +       .       transcript_id "MSTRG.39.1"; gene_id "MSTRG.39"; exon_number "3";
GL000008.2      StringTie       exon    133918  133972  .       +       .       transcript_id "MSTRG.39.1"; gene_id "MSTRG.39"; exon_number "4";

I want to extract lines 1-1000, 1001-2000, and 2001-3000 of chromosome 3 (chr3) into separate GTF files. How can I do that on the command line?

RNA-Seq gtf linux command • 1.8k views
3.4 years ago
Ram 43k

You can grep for chr3 (or any such identifier) or use awk to match the query identifier to just the first column, then pipe that output to split -l 1000 to split the output every 1000 lines. I'd recommend using the -d flag in split to get numeric suffixes.
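A minimal sketch of that pipeline, using the awk variant so that only the first column is compared. The file name toy.gtf and the output prefix chr3_part_ are made up for illustration; the toy input is generated only so the example runs on its own, and in practice the real GTF would take its place. The -d flag is a GNU coreutils option and may be missing from other split implementations.

```shell
# Toy input for illustration only: 2500 chr3 records plus one chr30 record
# that a plain substring match (grep chr3 without -w) would also pick up.
in=toy.gtf
awk 'BEGIN { OFS = "\t"
  for (i = 1; i <= 2500; i++)
    print "chr3", "StringTie", "exon", i, i + 10, ".", "+", ".", "gene_id \"G" i "\";"
  print "chr30", "StringTie", "exon", 1, 11, ".", "+", ".", "gene_id \"X\";"
}' > "$in"

# Keep records whose first column is exactly chr3, then cut the stream
# into 1000-line files with numeric suffixes (-d): chr3_part_00, _01, ...
awk -F'\t' '$1 == "chr3"' "$in" | split -d -l 1000 - chr3_part_

# Show how many lines landed in each chunk.
wc -l chr3_part_*
```

With this toy input, chr3_part_00 and chr3_part_01 hold 1000 lines each and chr3_part_02 holds the remaining 500; the chr30 record is dropped.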

I actually did the grep like this: grep -w chr3 sample.annotated.gtf > chr3.gtf. Could you please show me an example command?

Try running split -d -l 10 chr3.gtf chr3_chunks. (the trailing dot is part of the output prefix). Then, with the help of the manual (man split), customize it to your liking, for example -l 1000 for 1000-line chunks.
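One possible customization, sketched under a few assumptions: toy_chr3.gtf is a made-up stand-in for chr3.gtf (generated here only so the example runs without touching a real file), and -d and --additional-suffix are GNU coreutils split options that may not exist elsewhere. Limiting the input with head yields exactly the three requested ranges, and sed shows how a single range could be pulled out directly.

```shell
# Toy stand-in for chr3.gtf (3500 records), for illustration only.
awk 'BEGIN { OFS = "\t"
  for (i = 1; i <= 3500; i++)
    print "chr3", "StringTie", "exon", i, i + 10, ".", "+", ".", "gene_id \"G" i "\";"
}' > toy_chr3.gtf

# Take only the first 3000 lines and write them as three 1000-line files:
# chr3_chunks.00.gtf, chr3_chunks.01.gtf, chr3_chunks.02.gtf
head -n 3000 toy_chr3.gtf | split -d -l 1000 --additional-suffix=.gtf - chr3_chunks.

# A single range can also be extracted on its own, e.g. lines 1001-2000;
# the trailing 2000q stops reading once the range has been printed.
sed -n '1001,2000p;2000q' toy_chr3.gtf > chr3_1001-2000.gtf
```

Here chr3_chunks.01.gtf and chr3_1001-2000.gtf end up with identical content, so either approach works; split is more convenient when every chunk is wanted, sed when only one range is.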
