Invalid reference '6L' even though naming of BAM files, fasta files, and gtf files are the same?
0
0
Entering edit mode
6.8 years ago
jjrin ▴ 40

Hello I have been trying to run a program called Slide that identifies novel isoforms de novo. I gave it an input of a gtf file (required but not what I'm trying to look through) and a bam file with an index and a directory of all of the chromosome individual fasta files. However I keep getting this error: ValueError: invalid reference 6L, even though I am completely sure that none of the chromosomes are called 6L, they are all properly named "chr6L" in both the BAM and gtf and fasta files. So I am unsure of what the problem is here. Thanks for your help!

Here is the entire error message:

Traceback (most recent call last):
  File "slide.py", line 1901, in <module>
    gene_linear_models = get_linear_models( genes.values(), samfile, read_type, frag_len, total_num_reads, choose_frag_len_dist, if_use_GC_correction, gene_name_specified, mode=mode )
  File "slide.py", line 1262, in get_linear_models
    bin_built = build_bins( gene, samfile, read_type, new_exons )
  File "slide.py", line 605, in build_bins
    for rl, read  in process_reads( gene, samfile, read_type ):
  File "slide.py", line 369, in process_reads
    for read in samfile.fetch( gene.boundaries.chr, gene.boundaries.start, gene.boundaries.stop ):
  File "pysam/calignmentfile.pyx", line 897, in pysam.calignmentfile.AlignmentFile.fetch (pysam/calignmentfile.c:10997)
  File "pysam/calignmentfile.pyx", line 819, in pysam.calignmentfile.AlignmentFile.parse_region (pysam/calignmentfile.c:10455)
ValueError: invalid reference `6L`
RNA-Seq Assembly • 1.6k views
ADD COMMENT
0
Entering edit mode

Where did you get this program from?

It's possible that it is splitting the identifiers that have 'chr' in the name, and then it can't find the reference.

ADD REPLY
0
Entering edit mode

https://www.ncbi.nlm.nih.gov/pubmed/22135461

This is the paper the program is based from, I found it from the "A survey of best practices for RNA-seq data analysis" paper in isoform discovery. I found some code where it possibly splices out "chr" but I'm unsure of what to adjust.

class GeneBoundaries( dict ): Exon = namedtuple('Exon', ['chr', 'strand', 'start', 'stop'])

@staticmethod
def _parse_gtf_line( line  ):
    data = re.split( "\s+", line.strip() )
    # the type of element - ie exon, transcript, etc.
    type = data[2]

    # parse the meta data, and grab the gene name
    meta_data = dict( zip( data[8::2], ( i[:-1] for i in data[9::2] ) ) )
    try: gene_name = meta_data[ 'gene_name' ]
    except KeyError: gene_name = meta_data[ 'gene_id' ]

    # fix the chromosome if necessary
    chr = data[0]
    if chr.startswith("chr") and not ifchr:
        chr = data[0][3:]
    elif not chr.startswith("chr") and ifchr:
        chr = "chr" + chr
    return gene_name, type, GenomicInterval( chr, data[6], int(data[3]), int(data[4]) )
ADD REPLY
1
Entering edit mode

data is a list of items, split by spaces from each line in the file. data[0] is the chromosome id.

chr = data[0]
    if chr.startswith("chr") and not ifchr:
        chr = data[0][3:]
    elif not chr.startswith("chr") and ifchr:
        chr = "chr" + chr

This part says if the chromosome id starts with 'chr', and 'ifchr' is False, take the 4th (python starts at 0) character in the chromosome id to the end of the chrom id. [3:]. if it does not start with chrom and ifchr is True, add 'chr' to the id.

The problem is I don't know what the variable ifchr is, but in your case it is evaluating to false, and it is trimming 'chr' off your ids.

ADD REPLY
0
Entering edit mode

Here is the code that creates the variable "ifchr"

### handle the required arguments ###

# load the bam file (RNA-Seq reads)
samfile = pysam.Samfile( sys.argv[2], "rb" )

# make sure if the chromosome names in the bam file start with "chr" or not
for read in samfile.fetch():
    samchr = samfile.getrname( read.tid )
    break
global ifchr
ifchr = samchr.startswith("chr")

I think what is causing this problem is that some of the chromosome names in my reference fasta file is "Scaffold100" and there are over 400 thousand variants of this, this is likely what's causing "ifchr" to evaluate to false. But at the end of the fasta files there are indeed chromosomes named chr6L, etc. I believe that the bam files uses chr6L rather than 6L so there's no need to trim them. Would it be best just to remove these lines and keep them as is or is there another problem with them?

ADD REPLY
1
Entering edit mode

Since you don't know everything that's going on in the code, leave that alone, you don't know if variables/functions are called elsewhere or not. You should double check, and change if necessary all the ids to be identical across files, and not to violate the if/else statements.

I'm assuming this program should work out of the box for the type of files you're giving it. So, also double check how your files were created, and if there are and paramters or flags that you can pass to the program to alleviate issues.

ADD REPLY

Login before adding your answer.

Traffic: 1490 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6