Renaming sequencing libraries
18 months ago
bionix ▴ 10

Hi,

I have many fastq files like:

XG-31313_PS33_lib631817_10106_3_2.fastq.gz
XG-31313_PS34_lib631818_10106_3_1.fastq.gz
XG-31313_PS34_lib631818_10106_3_2.fastq.gz
XG-31313_PS34_lib631818_10107_2_1.fastq.gz
XG-31313_PS34_lib631818_10107_2_2.fastq.gz
XG-31313_PS35_lib631819_10106_3_1.fastq.gz
XG-31313_PS35_lib631819_10106_3_2.fastq.gz
XG-31313_PS36_lib631820_10106_3_1.fastq.gz
XG-31313_PS36_lib631820_10106_3_2.fastq.gz
XG-31313_PS36_lib631820_10107_2_1.fastq.gz
XG-31313_PS36_lib631820_10107_2_2.fastq.gz

I want to remove everything before PS and everything between PS## and the final _1/_2.fastq.gz. Please note that a simple approach wouldn't work as deletion of the middle parts (after PS##) would create duplicate files so the corresponding forward and reverse reads should be concatenated first.

So, XG-31313_PS33_lib631817_10106_3_2.fastq.gz should become PS33_2.fastq.gz, XG-31313_PS34_lib631818_10107_2_1.fastq.gz should become PS34_1.fastq.gz and so on. I tried

for f in *.fastq.gz; do mv "$f" "${f#*_}"; done

to remove everything before PS, but couldn't manage the characters in between. Could you please help me?

Regards, PS

fastq • 1.8k views

Be sure not to touch the original files: make a new directory and symlink the files into it. Then test commands on these links until they work properly. You do not want to test on original data.
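
A minimal sketch of that symlink sandbox, run here on empty placeholder files in scratch directories so it can be tried safely. The `raw` and `sandbox` locations are assumptions for illustration; point `raw` at the real read directory when adapting it.

```shell
set -eu

# Demo setup: empty placeholder files standing in for the real reads.
raw=$(mktemp -d)       # hypothetical location of the original files
touch "$raw/XG-31313_PS33_lib631817_10106_3_2.fastq.gz" \
      "$raw/XG-31313_PS34_lib631818_10106_3_1.fastq.gz"

# Sandbox: one symlink per original file, same names, separate directory.
sandbox=$(mktemp -d)   # hypothetical working directory
for f in "$raw"/*.fastq.gz; do
    ln -s "$f" "$sandbox/$(basename "$f")"
done

# Renaming a link changes only the link; the original keeps its name.
mv "$sandbox/XG-31313_PS33_lib631817_10106_3_2.fastq.gz" "$sandbox/PS33_2.fastq.gz"
ls -l "$sandbox"
```

If a test rename goes wrong, the sandbox can simply be deleted and rebuilt; nothing in `raw` is modified.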


Assuming the list of files is in a file named test (remove the echo before mv when you are ready to execute):

$ while read -r i; do new=$(echo "$i" | awk -F "_" '{print $2"_"$6}'); echo mv "$i" "$new"; done < test
mv XG-31313_PS33_lib631817_10106_3_2.fastq.gz PS33_2.fastq.gz
mv XG-31313_PS34_lib631818_10106_3_1.fastq.gz PS34_1.fastq.gz
mv XG-31313_PS34_lib631818_10106_3_2.fastq.gz PS34_2.fastq.gz
mv XG-31313_PS34_lib631818_10107_2_1.fastq.gz PS34_1.fastq.gz
mv XG-31313_PS34_lib631818_10107_2_2.fastq.gz PS34_2.fastq.gz
mv XG-31313_PS35_lib631819_10106_3_1.fastq.gz PS35_1.fastq.gz
mv XG-31313_PS35_lib631819_10106_3_2.fastq.gz PS35_2.fastq.gz
mv XG-31313_PS36_lib631820_10106_3_1.fastq.gz PS36_1.fastq.gz
mv XG-31313_PS36_lib631820_10106_3_2.fastq.gz PS36_2.fastq.gz
mv XG-31313_PS36_lib631820_10107_2_1.fastq.gz PS36_1.fastq.gz
mv XG-31313_PS36_lib631820_10107_2_2.fastq.gz PS36_2.fastq.gz

NOTE 1: If you want to be super careful, mv can be replaced by cp so the original files will remain intact.

NOTE 2: It appears that several files end up with the same name if we simply remove the parts OP asked to remove. So I am moving my example to a comment. It may still help someone else whose file names are not going to overlap.
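
One way to spot the overlap before running any mv is to build the target names and print those that occur more than once. This is only a sketch on four of the names from the question; the `names` variable is a stand-in for the real file list.

```shell
set -eu

# A few of the names from the question; in practice this list could come from: ls *.fastq.gz
names='XG-31313_PS33_lib631817_10106_3_2.fastq.gz
XG-31313_PS34_lib631818_10106_3_1.fastq.gz
XG-31313_PS34_lib631818_10107_2_1.fastq.gz
XG-31313_PS35_lib631819_10106_3_1.fastq.gz'

# Build each target name from fields 2 and 6, then keep only targets seen more than once.
collisions=$(printf '%s\n' "$names" | awk -F '_' '{print $2"_"$6}' | sort | uniq -d)
echo "$collisions"
```

An empty result would mean every target name is unique for the files listed.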


Sorry GenoMax, but the commands are dangerous here.


That is the main reason I have an echo in my example. OP needs to understand what is happening before they execute the commands.

In any case because of the duplicate file name issue that you pointed out below, I am moving my post to a comment.


Please note that a simple approach wouldn't work as deletion of the middle parts (after PS##) would create duplicate files so the corresponding forward and reverse reads should be concatenated first.

bionix - Concatenating paired-end data files end to end may make them unusable in most programs. This is not how programs expect paired-end data to be presented in files. There is a particular format called "interleaved" fastq, in which each forward read is immediately followed by its mate. That would be the proper way of putting paired-end data into a single file.
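
For illustration, a sketch of building an interleaved file from a forward/reverse pair with standard tools only. It assumes strict 4-line fastq records with no tab characters, and that both files list the reads in the same order; the toy reads and the `work` directory are made up for the demo.

```shell
set -eu

work=$(mktemp -d)   # hypothetical scratch directory
# Two toy read pairs (made-up sequences) standing in for real forward/reverse files.
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTTT\n+\nIIII\n' | gzip > "$work/PS33_1.fastq.gz"
printf '@r1/2\nGGCC\n+\nIIII\n@r2/2\nAAAA\n+\nIIII\n' | gzip > "$work/PS33_2.fastq.gz"

# Fold each 4-line record onto a single tab-separated line.
gzip -dc "$work/PS33_1.fastq.gz" | paste - - - - > "$work/r1.tsv"
gzip -dc "$work/PS33_2.fastq.gz" | paste - - - - > "$work/r2.tsv"

# Put mates side by side, then unfold back to one field per line.
paste "$work/r1.tsv" "$work/r2.tsv" | tr '\t' '\n' | gzip > "$work/PS33_interleaved.fastq.gz"

gzip -dc "$work/PS33_interleaved.fastq.gz"
```

Dedicated read-handling tools can also produce interleaved files and are likely more robust than this sketch on real data.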

18 months ago

Listen to ATpoint's advice: don't work directly on the original files before ensuring everything is safe.

I'd recommend using brename, again =]

At first glance, I thought it was a simple task. But ... check the report below:

$ brename -p '.+_(PS\d+).+(_[12]).+' -r '$1$2.fastq.gz' -d
[INFO] main options:
[INFO]   ignore case: false
[INFO]   search pattern: .+_(PS\d+).+(_[12]).+
[INFO]   include filters: .
[INFO]   search paths: ./
[INFO] 
[INFO] checking: [ ok ] 'XG-31313_PS33_lib631817_10106_3_2.fastq.gz' -> 'PS33_2.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS34_lib631818_10106_3_1.fastq.gz' -> 'PS34_1.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS34_lib631818_10106_3_2.fastq.gz' -> 'PS34_2.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS34_lib631818_10107_2_1.fastq.gz' -> 'PS34_1.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS34_lib631818_10107_2_2.fastq.gz' -> 'PS34_2.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS35_lib631819_10106_3_1.fastq.gz' -> 'PS35_1.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS35_lib631819_10106_3_2.fastq.gz' -> 'PS35_2.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS36_lib631820_10106_3_1.fastq.gz' -> 'PS36_1.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS36_lib631820_10106_3_2.fastq.gz' -> 'PS36_2.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS36_lib631820_10107_2_1.fastq.gz' -> 'PS36_1.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS36_lib631820_10107_2_2.fastq.gz' -> 'PS36_2.fastq.gz'
[ERRO] 4 potential error(s) detected, please check

See the files again:

file1     XG-31313_PS34_lib631818_10106_3_1.fastq.gz   ->  PS34_1.fastq.gz    It's OK
file2     XG-31313_PS34_lib631818_10106_3_2.fastq.gz   ->  PS34_2.fastq.gz    It's OK
file3     XG-31313_PS34_lib631818_10107_2_1.fastq.gz   ->  PS34_1.fastq.gz    Danger!!!! It overwrites the new PS34_1.fastq.gz (original file1)
file4     XG-31313_PS34_lib631818_10107_2_2.fastq.gz   ->  PS34_2.fastq.gz    Danger!!!! It overwrites the new PS34_2.fastq.gz (original file2)

The consequence is that you'll lose file1 and file2.

So you need to concatenate file1 with file3, and file2 with file4, first!


A safe answer is (csvtk and rush are needed):

# check files
ls *.fastq.gz \
    | csvtk mutate -Ht \
    | csvtk replace -Ht -p '.+_(PS\d+).+(_[12]).+' -r '$1$2.fastq.gz' \
    | csvtk fold -Ht -f 1 -v 2 -s ' '

PS33_2.fastq.gz XG-31313_PS33_lib631817_10106_3_2.fastq.gz
PS34_1.fastq.gz XG-31313_PS34_lib631818_10106_3_1.fastq.gz XG-31313_PS34_lib631818_10107_2_1.fastq.gz
PS34_2.fastq.gz XG-31313_PS34_lib631818_10106_3_2.fastq.gz XG-31313_PS34_lib631818_10107_2_2.fastq.gz
PS35_1.fastq.gz XG-31313_PS35_lib631819_10106_3_1.fastq.gz
PS35_2.fastq.gz XG-31313_PS35_lib631819_10106_3_2.fastq.gz
PS36_1.fastq.gz XG-31313_PS36_lib631820_10106_3_1.fastq.gz XG-31313_PS36_lib631820_10107_2_1.fastq.gz
PS36_2.fastq.gz XG-31313_PS36_lib631820_10106_3_2.fastq.gz XG-31313_PS36_lib631820_10107_2_2.fastq.gz

# ready to go
ls *.fastq.gz \
    | csvtk mutate -Ht \
    | csvtk replace -Ht -p '.+_(PS\d+).+(_[12]).+' -r '$1$2.fastq.gz' \
    | csvtk fold -Ht -f 1 -v 2 -s ' ' \
    | rush -j 1 -d "\t" 'cat {2} > {1}' --dry-run

cat XG-31313_PS33_lib631817_10106_3_2.fastq.gz > PS33_2.fastq.gz
cat XG-31313_PS34_lib631818_10106_3_1.fastq.gz XG-31313_PS34_lib631818_10107_2_1.fastq.gz > PS34_1.fastq.gz
cat XG-31313_PS34_lib631818_10106_3_2.fastq.gz XG-31313_PS34_lib631818_10107_2_2.fastq.gz > PS34_2.fastq.gz
cat XG-31313_PS35_lib631819_10106_3_1.fastq.gz > PS35_1.fastq.gz
cat XG-31313_PS35_lib631819_10106_3_2.fastq.gz > PS35_2.fastq.gz
cat XG-31313_PS36_lib631820_10106_3_1.fastq.gz XG-31313_PS36_lib631820_10107_2_1.fastq.gz > PS36_1.fastq.gz
cat XG-31313_PS36_lib631820_10106_3_2.fastq.gz XG-31313_PS36_lib631820_10107_2_2.fastq.gz > PS36_2.fastq.gz

# remove --dry-run to apply the renaming.
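
If csvtk and rush are not available, roughly the same grouping can be sketched with plain shell by appending every input to its target in a separate output directory. This is not the pipeline above, only an approximation of it: it takes the target name from underscore-separated fields 2 and 6 rather than the regular expression, assumes the output directory starts empty, and runs here on tiny made-up gzip files in scratch directories (`raw` and `out` are hypothetical).

```shell
set -eu

# Demo setup: tiny gzip files standing in for real reads (contents are made up).
raw=$(mktemp -d)
out=$(mktemp -d)     # merged files go to a separate, initially empty directory
for n in XG-31313_PS34_lib631818_10106_3_1 XG-31313_PS34_lib631818_10107_2_1 \
         XG-31313_PS35_lib631819_10106_3_1; do
    printf '@%s\nACGT\n+\nIIII\n' "$n" | gzip > "$raw/$n.fastq.gz"
done

# The glob expands in sorted order, so inputs sharing a target are appended
# in a stable order. Concatenated gzip streams still form a valid gzip file.
for f in "$raw"/*.fastq.gz; do
    new=$(basename "$f" | cut -d'_' -f2,6)     # e.g. PS34_1.fastq.gz
    echo "cat $(basename "$f") >> $new"
    cat "$f" >> "$out/$new"
done

# Count the reads that ended up in the merged PS34 forward file.
gzip -dc "$out/PS34_1.fastq.gz" | grep -c '^@XG'
```

Because the _1 and _2 inputs sort the same way, the merged forward and reverse files keep their reads in matching order. Whether the inputs should be merged at all depends on them really belonging to the same sample.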

So you need to concatenate file1 and file2 first!

One can't assume that XG-31313_PS36_lib631820_10106_3_1.fastq.gz and XG-31313_PS36_lib631820_10107_2_1.fastq.gz are from the same sample.

It is only fair to warn that there will be multiple files with identical names if one simply takes out the parts as originally requested by OP.


Yes, GenoMax you are right! I should have mentioned it earlier in my original post. Sorry for the confusion. I have edited the post now.


@shenwei356, thank you very much for the solution, but when I tried it, it threw an error: rush: invalid option -- 'j'

18 months ago
iraun 6.2k

Try:

for f in *.fastq.gz; do
    # keep fields 2 and 6, e.g. PS33_2.fastq.gz (mv overwrites if two inputs map to the same name)
    newname=$(basename "$f" | cut -d'_' -f2,6)
    mv "$f" "$newname"
done

Sorry iraun but the command is dangerous here. Hope the OP did not try this.


Well, the OP did not ask whether this was a recommended practice or not. He/she asked a specific question about how to rename files in a loop, and my answer addresses that specific question.

18 months ago

I agree with the poster who said not to rename the original files (as the mv commands do) but rather to create symbolic links with short names that point to the original files. Do this in a directory for all your raw reads, and from that point forward you can work with the short, meaningful names at each step in the workflow. To minimize confusion, the output from each step should be in a separate directory.
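
A rough command-line sketch of short-named links for the file names in this thread. To avoid the name clash discussed above, it keeps the fourth and fifth underscore-separated fields (10106_3, 10107_2) in the short name, whatever they encode. The `raw` and `links` directories and the placeholder files are assumptions for the demo.

```shell
set -eu

raw=$(mktemp -d)       # hypothetical raw-read directory
links=$(mktemp -d)     # hypothetical directory for the short-named links
touch "$raw/XG-31313_PS34_lib631818_10106_3_1.fastq.gz" \
      "$raw/XG-31313_PS34_lib631818_10107_2_1.fastq.gz"

for f in "$raw"/*.fastq.gz; do
    short=$(basename "$f" | cut -d'_' -f2,4,5,6)   # e.g. PS34_10106_3_1.fastq.gz
    # ln -s without -f refuses to replace an existing link, so a name clash stops the loop.
    ln -s "$f" "$links/$short"
done
ls "$links"
```

Each later workflow step can then read from `links` and write to its own output directory.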

The BIRCH system has a graphical interface that automates all steps in pre-processing and de novo assembly of genomes and transcriptomes. A point-and-click example for creating symbolic links with short names is found in the tutorial Pre-processing of RNA sequencing reads. A screenshot from the renaming step shows the original long filenames and the links (l) with the short names, in this case indicating 18 and 24 hr timepoints.
