Biostar Beta. Not for public use.
How to change sequence ID or identifier for a set of sequence
0
Entering edit mode
4.0 years ago
@sanjeet0000199219691

How to change sequence ID or identifier for a set of sequence

I have a list of sequence which id’s I want to change. I also have their tab separated coordinate file having their old ids and new id’s (my id’s).

For eg. I have a sequence like

>Aquca_005_00546.1
KHMALQFAMNAMDELMKMCQMNEPLWIPNNSGTKEMLNMEEHAKMFPWLTNFKQQHSQVRTEATRDSAVVIMNSITLTDAFLDVNKWMDIFPSIISRAKTVQIISSGIAGHASGSLHLMYAELQVQSPLVPTREAHFLRYCQQNAEEGTWAIVDFPIDSFHDSLQYSFPRYRRRPSGCLIQDMPNGYSRVTWVEHAEVEDKPVHQIFNHFVNSGTAFGAQRWLAVLQQQC
>Aquca_014_00016.1


I want to change their original id’s with my id’s

Original id
Aquca_005_00546.1
Aquca_014_00016.1

My id
RaAc00546A
RaAc00016E


So my sequence should look like

>RaAc00546A
KHMALQFAMNAMDELMKMCQMNEPLWIPNNSGTKEMLNMEEHAKMFPWLTNFKQQHSQVRTEATRDSAVVIMNSITLTDAFLDVNKWMDIFPSIISRAKTVQIISSGIAGHASGSLHLMYAELQVQSPLVPTREAHFLRYCQQNAEEGTWAIVDFPIDSFHDSLQYSFPRYRRRPSGCLIQDMPNGYSRVTWVEHAEVEDKPVHQIFNHFVNSGTAFGAQRWLAVLQQQC
> RaAc00016E


How I should do this

programming perl python command sequence • 2.3k views
0
Entering edit mode

This may also also be useful

sed -e '/Aquca_005_00546.1/RaAc00546A/' filename.fa > filenameedit.fa

3
Entering edit mode
4.0 years ago
jonasmst • 280
@jonasmst20580

Here's a python script that does what (I think) you want. Let me know if it works or not.

# Dictionary with strings to replace and what to replace them with
replace_strings = {}
with open("replace_ids.tsv", "r") as id_file:
# Split line on TAB
ids = line.strip().split("\t")
# Fist entry is the original ID
original_id = ids[0]
# Second entry is your ID
my_id = ids[1]
# Add both to our dictionary of strings to replace
replace_strings[original_id] = my_id

# Read file with sequences, called "sequence.txt"
with open("sequence.txt", "r+") as infile:
# Read each line of file into a list
# Keep a list of the lines with the replaced strings
new_content = []
# Loop lines in the file content
for line in content:
new_line = line
# Find and replace any original_id with your own ids in the line of content and add it to our list of replaced lines
for original_id, my_id in replace_strings.items():
new_line = new_line.replace(original_id, my_id)
new_content.append(new_line)

# Write replaced content to a new file called "outfile.txt"
with open("outfile.txt", "w") as outfile:
for line in new_content:
outfile.write(line)


EDIT2:

replace_ids.tsv is your tab-separated file of IDs. I've assumed that the original ID is at the first position and that your own ID is at the second position, e.g.: orignal_id TAB your_id

sequence.txt is the name of the file containing the sequence(s).

outfile.txt is the file with the replaced IDs.

1
Entering edit mode

Looks like the code gets the job done, and nicely commented too. But I can't resist to offer you some suggestions:

1. Avoid .readlines(). It loads the entire file in memory which may be problematic for huge files. Just loop over the file:

Example:

for line in id_file:
ids = line.strip().split('\t')
replace_strings[ids[0]] = ids[1]

1. Similar, try to process the sequence file while looping without keeping it in memory. Read a record, write a record with corrected identifier

2. Use the sys module to get arguments from the command line to the script, executing the script as python myscript.py file1 file2, in the script you will then have sys.argv[1] be file1 and sys.argv[2] file2.

Probably for this particular case those suggestions are pedantic and unnecessary, but for larger files those issues become very real.

0
Entering edit mode

Definitely valid concerns, thanks for the suggestions

0
Entering edit mode

Yes sir it works. Thank you very much.

0
Entering edit mode

I also found this script very useful, but I have a question. When we get to the outfile, how does it know to write the new lines while keeping the rest of the lines from the original file? i.e. How does it know to write new_content (which has our new ids) AND write the sequences following those ids? Even in the second part using infile, I don't understand how it does not lose all the other lines following the original_ids since we are keeping a list of only replaced lines (which are the ids)? If someone could explain this concept, I would appreciate it!

0
Entering edit mode

Hi, what we're doing here is essentially reading the input file, replacing some IDs (if there's an ID to replace, otherwise we just copy the line), then write it to a new file:

The content variable contains all the data in the infile, i.e. both IDs and the sequence strings. Thus when we loop content, we'll first read a line with an ID, then a line with a sequence, then a line with an ID, then a line with a sequence, etc. We store this line in the new_line variable (new_line = line). We then try to replace the original IDs with our own IDs in new_line using a for-loop and the replace() function. Here, if new_line contains and ID, we'll replace the old ID with our own ID. However, if new_line happens to be a line with a sequence (not an ID), nothing will change, the line will remain as it was read, containing only the sequence. After this, we add new_line to our new_content variable, which is written to file.

To answer your question, we're not only looking at lines containing IDs, we're looking at those including sequences as well, but we're not altering those. Early morning write-up, hope this makes sense.