Question

Renaming sequencing libraries

0

18 months ago

bionix ▴ 10

Hi,

I have many fastq files like:

XG-31313_PS33_lib631817_10106_3_2.fastq.gz
XG-31313_PS34_lib631818_10106_3_1.fastq.gz
XG-31313_PS34_lib631818_10106_3_2.fastq.gz
XG-31313_PS34_lib631818_10107_2_1.fastq.gz
XG-31313_PS34_lib631818_10107_2_2.fastq.gz
XG-31313_PS35_lib631819_10106_3_1.fastq.gz
XG-31313_PS35_lib631819_10106_3_2.fastq.gz
XG-31313_PS36_lib631820_10106_3_1.fastq.gz
XG-31313_PS36_lib631820_10106_3_2.fastq.gz
XG-31313_PS36_lib631820_10107_2_1.fastq.gz
XG-31313_PS36_lib631820_10107_2_2.fastq.gz

I want to remove everything before PS and everything after PS## up to the _1/_2.fastq.gz suffix. Please note that a simple approach wouldn't work as deletion of the middle parts (after PS##) would create duplicate files so the corresponding forward and reverse reads should be concatenated first.

So, XG-31313_PS33_lib631817_10106_3_2.fastq.gz should become PS33_2.fastq.gz, XG-31313_PS34_lib631818_10107_2_1.fastq.gz should become PS34_1.fastq.gz and so on. I tried

for f in *.fastq.gz; do mv "$f" "${f#*_}"; done #"${f#*_}"

to remove everything before PS, but couldn't manage the characters in between. Could you please help me?

Regards, PS

fastq • 1.8k views

ADD COMMENT • link updated 18 months ago by GenoMax 141k • written 18 months ago by bionix ▴ 10

2

Be sure not to touch the original files: make a new directory and symlink the files into it. Then test commands on these links until they work properly. You do not want to test on the original data.
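
A minimal sketch of that setup, assuming the raw files live in one directory. The scratch directory, the `raw` and `links` names, and the empty placeholder files are made up for illustration so the example can be run without touching real data.

```shell
#!/bin/sh
# Demo setup: a scratch directory with empty placeholder "raw" files (hypothetical).
work=$(mktemp -d)
mkdir "$work/raw" "$work/links"
for n in XG-31313_PS33_lib631817_10106_3_2 XG-31313_PS34_lib631818_10106_3_1; do
    : > "$work/raw/$n.fastq.gz"
done

# Symlink every raw file into the test directory; the originals stay where they are.
for f in "$work"/raw/*.fastq.gz; do
    ln -s "$f" "$work/links/$(basename "$f")"
done

ls -l "$work/links"
```

Renaming a link with mv leaves its target untouched, but redirecting output into a link (for example `> link`) follows it and overwrites the original, so results should always be written under new names.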

ADD REPLY • link 18 months ago by ATpoint 82k

0

Assuming the list of files is in a file named test (remove the echo before mv when ready to execute):

$ for i in $(cat test); do new=$(echo "${i}" | awk -F "_" '{print $2"_"$6}'); echo mv "${i}" "${new}"; done
mv XG-31313_PS33_lib631817_10106_3_2.fastq.gz PS33_2.fastq.gz
mv XG-31313_PS34_lib631818_10106_3_1.fastq.gz PS34_1.fastq.gz
mv XG-31313_PS34_lib631818_10106_3_2.fastq.gz PS34_2.fastq.gz
mv XG-31313_PS34_lib631818_10107_2_1.fastq.gz PS34_1.fastq.gz
mv XG-31313_PS34_lib631818_10107_2_2.fastq.gz PS34_2.fastq.gz
mv XG-31313_PS35_lib631819_10106_3_1.fastq.gz PS35_1.fastq.gz
mv XG-31313_PS35_lib631819_10106_3_2.fastq.gz PS35_2.fastq.gz
mv XG-31313_PS36_lib631820_10106_3_1.fastq.gz PS36_1.fastq.gz
mv XG-31313_PS36_lib631820_10106_3_2.fastq.gz PS36_2.fastq.gz
mv XG-31313_PS36_lib631820_10107_2_1.fastq.gz PS36_1.fastq.gz
mv XG-31313_PS36_lib631820_10107_2_2.fastq.gz PS36_2.fastq.gz

NOTE 1: If you want to be super careful, mv can be replaced by cp so the original files remain intact.

NOTE 2: It appears that several files would end up with the same name if we simply remove the parts the OP asked to remove. So I am moving my example to a comment. It may still help someone else whose file names are not going to overlap.
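
One way to spot such overlaps before renaming anything is to list the target names that occur more than once. This is only a sketch: it reuses the awk fields from the command above on a few names copied from the question, and the `names` and `dupes` variables are made up for illustration.

```shell
#!/bin/sh
# File names copied from the question; only the names are needed for this check.
names='XG-31313_PS33_lib631817_10106_3_2.fastq.gz
XG-31313_PS34_lib631818_10106_3_1.fastq.gz
XG-31313_PS34_lib631818_10106_3_2.fastq.gz
XG-31313_PS34_lib631818_10107_2_1.fastq.gz
XG-31313_PS34_lib631818_10107_2_2.fastq.gz'

# Derive each target name from fields 2 and 6, then keep the ones seen more than once.
dupes=$(printf '%s\n' "$names" | awk -F "_" '{print $2"_"$6}' | sort | uniq -d)

if [ -n "$dupes" ]; then
    echo "Colliding target names, do not rename:"
    echo "$dupes"
fi
```

On a real directory the list of names would come from `ls *.fastq.gz` instead of the hard-coded variable. An empty result means the rename cannot overwrite anything.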

ADD REPLY • link 18 months ago by GenoMax 141k

1

Sorry GenoMax, but the commands are dangerous here.

ADD REPLY • link 18 months ago by shenwei356 8.4k

0

That is the main reason I have an echo in my example. The OP needs to understand what is happening before they execute the commands.

In any case because of the duplicate file name issue that you pointed out below, I am moving my post to a comment.

ADD REPLY • link 18 months ago by GenoMax 141k

0

Please note that a simple approach wouldn't work as deletion of the middle parts (after PS##) would create duplicate files so the corresponding forward and reverse reads should be concatenated first.

bionix - Concatenating paired-end data files end to end may make them unusable in most programs. This is not how programs expect paired-end data to be present in files. There is a particular format called "interleaved" fastq, in which each forward read is immediately followed by its reverse mate. This would be the proper way of combining paired-end data in one file.
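
To illustrate what an interleaved file looks like, here is a rough sketch using only paste and tr. It assumes uncompressed 4-line FASTQ records, the same read order in both files, and no tab characters in the data. The file names and reads are made up, and a dedicated tool is usually the safer choice for real data.

```shell
#!/bin/sh
# Demo input: two tiny paired FASTQ files (hypothetical reads) in a scratch directory.
work=$(mktemp -d)
cd "$work" || exit 1
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nTTTT\n+\nIIII\n' > sample_1.fastq
printf '@read1/2\nGGCC\n+\nIIII\n@read2/2\nAAAA\n+\nIIII\n' > sample_2.fastq

# Put each 4-line record on one tab-separated line.
paste - - - - < sample_1.fastq > sample_1.tab
paste - - - - < sample_2.fastq > sample_2.tab

# Pair record N of file 1 with record N of file 2, then restore one field per line.
paste sample_1.tab sample_2.tab | tr '\t' '\n' > interleaved.fastq

cat interleaved.fastq
```

Gzipped inputs would need to be decompressed first, for example with `gzip -dc`.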

ADD REPLY • link 18 months ago by GenoMax 141k

score 3 · Answer 1 · 2022-10-18

Listen to ATpoint's advice: don't work directly on the original files before ensuring everything is safe.

I'd recommend using brename, again =]

At first glance, I thought it was a simple task. But ... check the report below:

$ brename -p '.+_(PS\d+).+(_[12]).+' -r '$1$2.fastq.gz' -d
[INFO] main options:
[INFO]   ignore case: false
[INFO]   search pattern: .+_(PS\d+).+(_[12]).+
[INFO]   include filters: .
[INFO]   search paths: ./
[INFO] 
[INFO] checking: [ ok ] 'XG-31313_PS33_lib631817_10106_3_2.fastq.gz' -> 'PS33_2.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS34_lib631818_10106_3_1.fastq.gz' -> 'PS34_1.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS34_lib631818_10106_3_2.fastq.gz' -> 'PS34_2.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS34_lib631818_10107_2_1.fastq.gz' -> 'PS34_1.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS34_lib631818_10107_2_2.fastq.gz' -> 'PS34_2.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS35_lib631819_10106_3_1.fastq.gz' -> 'PS35_1.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS35_lib631819_10106_3_2.fastq.gz' -> 'PS35_2.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS36_lib631820_10106_3_1.fastq.gz' -> 'PS36_1.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS36_lib631820_10106_3_2.fastq.gz' -> 'PS36_2.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS36_lib631820_10107_2_1.fastq.gz' -> 'PS36_1.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS36_lib631820_10107_2_2.fastq.gz' -> 'PS36_2.fastq.gz'
[ERRO] 4 potential error(s) detected, please check

See the files again:

file1     XG-31313_PS34_lib631818_10106_3_1.fastq.gz   ->  PS34_1.fastq.gz    OK
file2     XG-31313_PS34_lib631818_10106_3_2.fastq.gz   ->  PS34_2.fastq.gz    OK
file3     XG-31313_PS34_lib631818_10107_2_1.fastq.gz   ->  PS34_1.fastq.gz    Danger! It overwrites the new PS34_1.fastq.gz (original file1)
file4     XG-31313_PS34_lib631818_10107_2_2.fastq.gz   ->  PS34_2.fastq.gz    Danger! It overwrites the new PS34_2.fastq.gz (original file2)

The consequence is that you'll lose file1 and file2.

So you need to concatenate the files that map to the same name (file1 with file3, file2 with file4) instead of renaming them!

A safe answer is (csvtk and rush are needed):

# check files
ls *.fastq.gz \
    | csvtk mutate -Ht \
    | csvtk replace -Ht -p '.+_(PS\d+).+(_[12]).+' -r '$1$2.fastq.gz' \
    | csvtk fold -Ht -f 1 -v 2 -s ' '

PS33_2.fastq.gz XG-31313_PS33_lib631817_10106_3_2.fastq.gz
PS34_1.fastq.gz XG-31313_PS34_lib631818_10106_3_1.fastq.gz XG-31313_PS34_lib631818_10107_2_1.fastq.gz
PS34_2.fastq.gz XG-31313_PS34_lib631818_10106_3_2.fastq.gz XG-31313_PS34_lib631818_10107_2_2.fastq.gz
PS35_1.fastq.gz XG-31313_PS35_lib631819_10106_3_1.fastq.gz
PS35_2.fastq.gz XG-31313_PS35_lib631819_10106_3_2.fastq.gz
PS36_1.fastq.gz XG-31313_PS36_lib631820_10106_3_1.fastq.gz XG-31313_PS36_lib631820_10107_2_1.fastq.gz
PS36_2.fastq.gz XG-31313_PS36_lib631820_10106_3_2.fastq.gz XG-31313_PS36_lib631820_10107_2_2.fastq.gz

# ready to go
ls *.fastq.gz \
    | csvtk mutate -Ht \
    | csvtk replace -Ht -p '.+_(PS\d+).+(_[12]).+' -r '$1$2.fastq.gz' \
    | csvtk fold -Ht -f 1 -v 2 -s ' ' \
    | rush -j 1 -d "\t" 'cat {2} > {1}' --dry-run

cat XG-31313_PS33_lib631817_10106_3_2.fastq.gz > PS33_2.fastq.gz
cat XG-31313_PS34_lib631818_10106_3_1.fastq.gz XG-31313_PS34_lib631818_10107_2_1.fastq.gz > PS34_1.fastq.gz
cat XG-31313_PS34_lib631818_10106_3_2.fastq.gz XG-31313_PS34_lib631818_10107_2_2.fastq.gz > PS34_2.fastq.gz
cat XG-31313_PS35_lib631819_10106_3_1.fastq.gz > PS35_1.fastq.gz
cat XG-31313_PS35_lib631819_10106_3_2.fastq.gz > PS35_2.fastq.gz
cat XG-31313_PS36_lib631820_10106_3_1.fastq.gz XG-31313_PS36_lib631820_10107_2_1.fastq.gz > PS36_1.fastq.gz
cat XG-31313_PS36_lib631820_10106_3_2.fastq.gz XG-31313_PS36_lib631820_10107_2_2.fastq.gz > PS36_2.fastq.gz

# remove --dry-run to apply the renaming.
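
If csvtk and rush are not available, roughly the same grouping can be sketched in plain shell. This is not the method above, only an approximation of it: it assumes the names follow the pattern in the question, and the scratch directory, the `merged` output directory, and the tiny placeholder files are made up for illustration.

```shell
#!/bin/sh
# Demo setup: tiny gzipped placeholder files in a scratch directory (hypothetical content).
work=$(mktemp -d)
cd "$work" || exit 1
for n in PS34_lib631818_10106_3_1 PS34_lib631818_10107_2_1 PS35_lib631819_10106_3_1; do
    printf '%s\n' "$n" | gzip > "XG-31313_$n.fastq.gz"
done

mkdir merged
# The glob expands in sorted order, so files of one sample are appended in a stable order.
for f in *.fastq.gz; do
    new=$(printf '%s\n' "$f" | sed -E 's/^.*_(PS[0-9]+)_.*_([12])\.fastq\.gz$/\1_\2.fastq.gz/')
    # Skip names the pattern did not match (sed leaves them unchanged).
    [ "$new" = "$f" ] && continue
    echo "cat $f >> merged/$new"
    cat "$f" >> "merged/$new"
done
```

Appending works because concatenated gzip files are themselves a valid gzip file. The inputs are only read, never moved, but since the loop appends, `merged` has to be empty before a rerun.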

score 0 · Answer 2 · 2022-10-18

0

18 months ago

iraun 6.2k

Try:

for f in *.fastq.gz; do
    # keep fields 2 (PS##) and 6 (read number plus extension) of the "_"-separated name
    newname=$(basename "$f" | cut -d'_' -f2,6)
    mv "$f" "$newname"
done

ADD COMMENT • link 18 months ago by iraun 6.2k

0

Sorry iraun but the command is dangerous here. Hope the OP did not try this.

ADD REPLY • link 18 months ago by shenwei356 8.4k

0

Well, the OP did not ask whether this was a recommended practice or not. He/she asked a specific question about how to rename files in a loop, and my answer fixes the specific question.

ADD REPLY • link 18 months ago by iraun 6.2k

score 0 · Answer 3 · 2022-10-18

I agree with the poster who said don't rename the original files (as seen with the mv command) but rather create symbolic links with short names that point to the original files. Do this in a directory for all your raw reads, and from that point forward you can work with the short, meaningful names at each step in the workflow. To minimize confusion, the output from each step should be in a separate directory.

The BIRCH system has a GUI that automates all steps in pre-processing and de novo assembly of genomes and transcriptomes. A point-and-click example for creating symbolic links with short names is found in the tutorial Pre-processing of RNA sequencing reads. Below is a screenshot from the renaming step that shows the original long filenames and the links (l) with the short names, in this case indicating 18 and 24 hr timepoints.

[Image: screenshot of the renaming step]