Question

How to change sequence ID or identifier for a set of sequence

0

Entering edit mode

7.2 years ago

sanjeet00001992 • 0

How to change sequence ID or identifier for a set of sequence

I have a list of sequence which id’s I want to change. I also have their tab separated coordinate file having their old ids and new id’s (my id’s).

For eg. I have a sequence like

>Aquca_005_00546.1
KHMALQFAMNAMDELMKMCQMNEPLWIPNNSGTKEMLNMEEHAKMFPWLTNFKQQHSQVRTEATRDSAVVIMNSITLTDAFLDVNKWMDIFPSIISRAKTVQIISSGIAGHASGSLHLMYAELQVQSPLVPTREAHFLRYCQQNAEEGTWAIVDFPIDSFHDSLQYSFPRYRRRPSGCLIQDMPNGYSRVTWVEHAEVEDKPVHQIFNHFVNSGTAFGAQRWLAVLQQQC
>Aquca_014_00016.1
DGWKVLTFENGVEISKRTSASFHIFRSRWLLKSVSPQQFITVANAIDAAKQWDSDLVEAKYIKDLEDNLSIIRLRFGDGSKPLFKNREFIVYERRETMADGTLVVAVASLPKEIAAGLHPKGNNTIRGLLLQSGWVVEELGDDENSCMVTYVVQLDPAGWLPKFFVNRLNTKLVMIIDNLEKL

I want to change their original id’s with my id’s

Original id            
Aquca_005_00546.1   
Aquca_014_00016.1   

My id
RaAc00546A
RaAc00016E

So my sequence should look like

>RaAc00546A
KHMALQFAMNAMDELMKMCQMNEPLWIPNNSGTKEMLNMEEHAKMFPWLTNFKQQHSQVRTEATRDSAVVIMNSITLTDAFLDVNKWMDIFPSIISRAKTVQIISSGIAGHASGSLHLMYAELQVQSPLVPTREAHFLRYCQQNAEEGTWAIVDFPIDSFHDSLQYSFPRYRRRPSGCLIQDMPNGYSRVTWVEHAEVEDKPVHQIFNHFVNSGTAFGAQRWLAVLQQQC
> RaAc00016E 
DGWKVLTFENGVEISKRTSASFHIFRSRWLLKSVSPQQFITVANAIDAAKQWDSDLVEAKYIKDLEDNLSIIRLRFGDGSKPLFKNREFIVYERRETMADGTLVVAVASLPKEIAAGLHPKGNNTIRGLLLQSGWVVEELGDDENSCMVTYVVQLDPAGWLPKFFVNRLNTKLVMIIDNLEKL

How I should do this

programming perl python command sequence • 5.8k views

ADD COMMENT • link updated 19 months ago by H Nakibapher Jones • 0 • written 7.2 years ago by sanjeet00001992 • 0

0

Entering edit mode

This may also also be useful

sed -e '/Aquca_005_00546.1/RaAc00546A/' filename.fa > filenameedit.fa

ADD REPLY • link 7.2 years ago by mks002 ▴ 220

score 3 · Answer 1 · 2017-01-31

3

Entering edit mode

7.2 years ago

jonasmst ▴ 410

Here's a python script that does what (I think) you want. Let me know if it works or not.

EDIT: Bad formatting.

# Dictionary with strings to replace and what to replace them with
replace_strings = {}
with open("replace_ids.tsv", "r") as id_file:
    # Read file line-by-line
    for line in id_file.readlines():
        # Split line on TAB
        ids = line.strip().split("\t")
        # Fist entry is the original ID
        original_id = ids[0]
        # Second entry is your ID
        my_id = ids[1]
        # Add both to our dictionary of strings to replace
        replace_strings[original_id] = my_id

# Read file with sequences, called "sequence.txt"
with open("sequence.txt", "r+") as infile:
    # Read each line of file into a list
    content = infile.readlines()
    # Keep a list of the lines with the replaced strings
    new_content = []
    # Loop lines in the file content
    for line in content:
        new_line = line
        # Find and replace any original_id with your own ids in the line of content and add it to our list of replaced lines
        for original_id, my_id in replace_strings.items():
            new_line = new_line.replace(original_id, my_id)
        new_content.append(new_line)

    # Write replaced content to a new file called "outfile.txt"
    with open("outfile.txt", "w") as outfile:
        for line in new_content:
            outfile.write(line)

EDIT2:

replace_ids.tsv is your tab-separated file of IDs. I've assumed that the original ID is at the first position and that your own ID is at the second position, e.g.: orignal_id TAB your_id

sequence.txt is the name of the file containing the sequence(s).

outfile.txt is the file with the replaced IDs.

ADD COMMENT • link 7.2 years ago by jonasmst ▴ 410

1

Entering edit mode

Looks like the code gets the job done, and nicely commented too. But I can't resist to offer you some suggestions:

Avoid .readlines(). It loads the entire file in memory which may be problematic for huge files. Just loop over the file:

Example:

for line in id_file:
    ids = line.strip().split('\t')
    replace_strings[ids[0]] = ids[1]

Similar, try to process the sequence file while looping without keeping it in memory. Read a record, write a record with corrected identifier
Use the sys module to get arguments from the command line to the script, executing the script as python myscript.py file1 file2, in the script you will then have sys.argv[1] be file1 and sys.argv[2] file2.

Probably for this particular case those suggestions are pedantic and unnecessary, but for larger files those issues become very real.

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Entering edit mode

Definitely valid concerns, thanks for the suggestions

ADD REPLY • link 7.2 years ago by jonasmst ▴ 410

0

Entering edit mode

Yes sir it works. Thank you very much.

ADD REPLY • link 7.2 years ago by sanjeet00001992 • 0

0

Entering edit mode

I also found this script very useful, but I have a question. When we get to the outfile, how does it know to write the new lines while keeping the rest of the lines from the original file? i.e. How does it know to write new_content (which has our new ids) AND write the sequences following those ids? Even in the second part using infile, I don't understand how it does not lose all the other lines following the original_ids since we are keeping a list of only replaced lines (which are the ids)? If someone could explain this concept, I would appreciate it!

ADD REPLY • link 6.8 years ago by DNAngel ▴ 250

0

Entering edit mode

Hi, what we're doing here is essentially reading the input file, replacing some IDs (if there's an ID to replace, otherwise we just copy the line), then write it to a new file:

The content variable contains all the data in the infile, i.e. both IDs and the sequence strings. Thus when we loop content, we'll first read a line with an ID, then a line with a sequence, then a line with an ID, then a line with a sequence, etc. We store this line in the new_line variable (new_line = line). We then try to replace the original IDs with our own IDs in new_line using a for-loop and the replace() function. Here, if new_line contains and ID, we'll replace the old ID with our own ID. However, if new_line happens to be a line with a sequence (not an ID), nothing will change, the line will remain as it was read, containing only the sequence. After this, we add new_line to our new_content variable, which is written to file.

To answer your question, we're not only looking at lines containing IDs, we're looking at those including sequences as well, but we're not altering those. Early morning write-up, hope this makes sense.

ADD REPLY • link 6.8 years ago by jonasmst ▴ 410