How can I select my ChIP-seq target genes in my RNA-seq data?
7.2 years ago
LuisNagano ▴ 90

Hi, I need some help! I want to find the target genes from my ChIP-seq analysis (HOMER output, gene symbols) in my list of differentially expressed genes from RNA-seq (Cuffdiff output). That way I can identify the "XLOC_" gene IDs that Cufflinks assigned to my target genes, i.e. pull out each complete line of the Cuffdiff file that contains a target gene, and then plot a heatmap of these genes using the RNA-seq data. How can I do this in a simple way?

->Annotated genes (HOMER output)

  • SULT1B1
  • LHFPL
  • ZMYND8
  • RAD23B ...

->Cuffdiff table file (RNA-seq)

  • gene_id gene symbol locus
  • XLOC_000001 DDX11L1 chr1:11868-31109
  • XLOC_000002 MIR1302-2 chr1:11868-31109
  • XLOC_000003 OR4G4P chr1:52472-53312 ...
RNA-Seq ChIP-Seq Cufflinks Cuffdiff • 2.5k views

Hello LuisNagano!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=74116

This is typically not recommended as it runs the risk of annoying people in both communities.


What do you mean by "in a simple way"? Do you know how to use R, for instance?

In R this is straightforward: your HOMER output appears to be gene symbols, which I also recognize in the second column of your Cuffdiff table.

So import both files into R and use, for example, the %in% operator.

7.2 years ago

It looks like both your ChIP-seq target genes (HOMER output) and the gene symbols in the Cuffdiff table are in HGNC notation (http://www.genenames.org/), so they should match exactly.

To select all lines from the Cuffdiff output whose second column contains a gene name from the HOMER list, you can run:

awk 'NR==FNR{HOMER_LIST[$1]=$1}(NR!=FNR&&HOMER_LIST[$2]){print $0}' homer_output_file cuffdiff_output_file

NR==FNR is true only while awk reads the first file, so all gene names from that file are stored in memory in the array HOMER_LIST. NR!=FNR is true while awk reads the second file; the && additionally tests whether the second-column value $2 of each row can be found in HOMER_LIST. As a result, awk prints to stdout every complete line $0 of the Cuffdiff file whose gene name appears in the target gene file.
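To try the idea without touching real data, here is a minimal, self-contained sketch. It is a variant of the command above, not a replacement for it: it uses awk's `in` operator for the lookup and also keeps the header line. The file names, the temporary directory and the header text are made up for illustration, and it assumes the Cuffdiff table is tab-separated with the gene symbol in column 2 and a header on the first line.

```shell
#!/bin/sh
# Toy inputs in a temporary directory (hypothetical file names;
# the table rows are borrowed from the question, the header is assumed).
workdir=$(mktemp -d)

# HOMER target list: one gene symbol per line.
printf '%s\n' SULT1B1 OR4G4P DDX11L1 > "$workdir/homer_targets.txt"

# Cuffdiff-like table: tab-separated, gene symbol in column 2.
printf '%s\t%s\t%s\n' \
    gene_id gene_symbol locus \
    XLOC_000001 DDX11L1 chr1:11868-31109 \
    XLOC_000002 MIR1302-2 chr1:11868-31109 \
    XLOC_000003 OR4G4P chr1:52472-53312 > "$workdir/cuffdiff_table.txt"

# First file: remember every target symbol as an array key.
# Second file: print the header line and every row whose symbol is a key.
matches=$(awk -F'\t' '
    NR == FNR { targets[$1]; next }
    FNR == 1 || ($2 in targets)
' "$workdir/homer_targets.txt" "$workdir/cuffdiff_table.txt")

printf '%s\n' "$matches"

rm -rf "$workdir"
```

On this toy data the output should be the header plus the XLOC_000001 and XLOC_000003 rows; SULT1B1 has no row in the toy table, so it is simply absent. Testing with `$2 in targets` rather than `targets[$2]` avoids creating empty array entries for every non-matching symbol.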


Thanks Petr, works very well!


If this answer was helpful, it is appropriate to upvote it; if it resolved your question completely, you can 'accept' the answer, thereby marking your question as solved.


How do I mark it as solved?


It seems you (or someone else) already did that. Marking as solved is done by accepting the answer.
