counting number of genes located on positive or negative strands
2
0
Entering edit mode
5.5 years ago
arsilan324 ▴ 90

Hi all,

Maybe this is a very basic question but I must tell that I am not a bioinformatics person. I just need to count total number of positive and negative genes located on chromosome in .genes or .gff3 file.

Can you please comment how can I do that? Thanks in advance

gff3 htseq • 1.5k views
ADD COMMENT
5
Entering edit mode
5.5 years ago

Via BEDOPS, extract gene features from your GFF file and convert to BED:

$ awk '$3=="gene"' annotations.gff | gff2bed - > genes.bed

Then filter on strand and pass to wc -l to count the number of lines:

$ awk '$6=="+"' genes.bed | wc -l
*number of forward-strand genes*

And:

$ awk '$6=="-"' genes.bed | wc -l
*number of reverse-strand genes*

Converting to BED is not strictly required, but it makes sense if you're doing counting operations, which are essentially set operations. BED is a better format for set operations than GFF.

ADD COMMENT
2
Entering edit mode

Why bother converting the gff3 file to bed when the strand information is already present in the gff3 file?

awk '$3=="gene" && $7=="+"' annotations.gff | wc -l
awk '$3=="gene" && $7=="-"' annotations.gff | wc -l

Also, note, if you are using GFF3 files from NCBI the feature type in column 3 is not always gene. For example, pseudogenes have pseudogene in column 3. You can change the awk command to ($3=="gene"||$3=="pseudogene") will fix that.

ADD REPLY
0
Entering edit mode

Thank you to both of you! It worked.. Thanks!!!

ADD REPLY
3
Entering edit mode
5.5 years ago
n,n ▴ 360

Here is a bash solution as well:

#First lets define a couple of variables to act as counters for each strand ( + & - )

forward=0
reverse=0

#Now we create a loop with your lines of interest from the gff3 file as the changing variable

for line in $(cat file.gff3 | grep '.*gene.*[[:space:]]+[[:space:]]')
do
  forward=$(expr "$forward" + 1) #this will add 1 to the counter for each line found by grep
done

#Depending on your grep version you might need to scape the "+" character, BSD grep doesn't have to

#Same process but for the reverse strand

 for line in $(cat file.gff3 | grep '.*gene.*[[:space:]]-[[:space:]]')
 do
   reverse=$(expr "$reverse" + 1)
 done

#Finally we print the results

echo -e "\nGenes present in the forward (positive) strand: $forward"
echo -e "Genes present in the reverse (negative) strand: $reverse\n"

Hope this helps.

ADD COMMENT

Login before adding your answer.

Traffic: 2044 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6