Get part of sequence from genome, given a start and stop position with Java.
17 months ago
Netherlands

I've got VCF-like files with start, stop, REF and ALT columns. I need to check that the REF allele of each variant is the same as the base(s) at that position in the genome, to confirm they're from the same build. I also need the nucleotides surrounding the given position. Also, some of the REF columns are empty, which means these are not valid VCF files.

I've got a FASTA file which has the genome for chromosome 1, and I was wondering if there's a library available to get a part of the genome in nucleotides, given a start and stop position. For example, if the genome is AACCGGTT, a start position of 1 and a stop position of 4 should return AACC. I could write such a parser myself, but I'd rather use a library which has the edge cases covered.
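To pin down the coordinate convention in that example, here is a minimal sketch that assumes the sequence is already in memory as a String and that start and stop are both 1-based and inclusive. The class and method names are made up for illustration; this is not a FASTA parser.

```java
final class SubsequenceDemo {

    /** Returns the bases from start to stop, both 1-based and inclusive. */
    static String subsequence(String sequence, int start, int stop) {
        if (start < 1 || stop < start || stop > sequence.length()) {
            throw new IllegalArgumentException(
                "Invalid range " + start + "-" + stop
                + " for a sequence of length " + sequence.length());
        }
        // String.substring is 0-based with an exclusive end,
        // so only the start needs to shift by one.
        return sequence.substring(start - 1, stop);
    }

    public static void main(String[] args) {
        System.out.println(subsequence("AACCGGTT", 1, 4)); // prints AACC
    }
}
```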

I'd rather have something locally than use the API of NCBI, which also makes this possible.

Hi, you can use bedtools getfasta.

samtools faidx, pyfaidx and bedtools getfasta can all retrieve part of a FASTA sequence given a start and stop. While they are not Java libraries, they may be an option to consider.

@Pierre has his jvarkit, which may have something that will work (if you must use Java): http://lindenb.github.io/jvarkit/

If it's anything like BioPython and you absolutely must use Java, there's no doubt something in BioJava which you could use.

I know less than nothing about Java specifically though, so I can't offer any practical code for this.

16 months ago
France/Nantes/Institut du Thorax - INSE…

Use the htsjdk library and its IndexedFastaSequenceFile class (the FASTA file needs a .fai index next to it, e.g. one made with samtools faidx): https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/reference/IndexedFastaSequenceFile.html

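(...)
// fastaFile is a java.io.File pointing at the indexed FASTA
IndexedFastaSequenceFile faidx = new IndexedFastaSequenceFile(fastaFile);
// start and stop are 1-based and inclusive
String sub = faidx.getSubsequenceAt("chr1", 10, 20).getBaseString();
(...)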
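For the REF check and the surrounding nucleotides asked about in the question, something along these lines might work. It is only a sketch: it assumes the chromosome sequence is available as a plain String (with htsjdk you would probably fetch just the window you need via getSubsequenceAt instead), the class and method names are hypothetical, and treating an empty REF as "cannot be checked" is my assumption, not a rule from the VCF format.

```java
final class RefCheckDemo {

    /**
     * True when ref equals the genome bases starting at the 1-based
     * position start. The comparison ignores case because FASTA files
     * often use lowercase for soft-masked regions.
     */
    static boolean refMatches(String chromosome, int start, String ref) {
        int begin = start - 1;            // convert to a 0-based offset
        int end = begin + ref.length();   // exclusive end
        // Assumption: an empty REF or an out-of-range position counts as a mismatch.
        if (ref.isEmpty() || begin < 0 || end > chromosome.length()) {
            return false;
        }
        return chromosome.regionMatches(true, begin, ref, 0, ref.length());
    }

    /**
     * Returns the bases from start - flank to stop + flank
     * (1-based, inclusive), clamped to the ends of the chromosome.
     */
    static String withFlanks(String chromosome, int start, int stop, int flank) {
        if (start < 1 || stop < start || stop > chromosome.length() || flank < 0) {
            throw new IllegalArgumentException("Invalid range or flank size");
        }
        int begin = Math.max(1, start - flank);
        int end = Math.min(chromosome.length(), stop + flank);
        return chromosome.substring(begin - 1, end);
    }

    public static void main(String[] args) {
        String chromosome = "AACCGGTT";
        System.out.println(refMatches(chromosome, 3, "CC"));  // prints true
        System.out.println(withFlanks(chromosome, 3, 4, 2));  // prints AACCGG
    }
}
```

If many REF alleles fail to match, that would suggest the variants were called against a different build; a handful of mismatches is more likely a coordinate offset (0-based vs 1-based) worth checking first.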

Yes, thank you! I was just looking at this library, but couldn't find the right function.