
# Get the number of characters between two patterns that are in the next line of a line that begins with '>' in a large file

Fernanda Costa
1#
Fernanda Costa Published in 2018-01-12 19:53:38Z
First of all, sorry about the length of the title; I couldn't find a better way to explain what I want to achieve with this bash script.

I have a very large file (multifasta) that looks like this:

    >NAME1
    GATATATAGATTAGATTTAGAGAGAGGAGCTATTCATCAGAGCTATCATCAGCTACAGCA
    >NAME2
    GCGCTAGAGAGCTAGCTACGACTAGCACTAGAGGATACATCATGGGTCATCAGCAGTCAGCATCAC
    >NAME3
    GCATCAGCATGATAGATCTCATGACTAGATAGAACTATCAT

and so on. I also have two patterns: 'GATA' and 'TCAT'.

I already know that both patterns occur in every line that doesn't begin with '>', sometimes more than once. My goal is to print each '>' line and then, for the line that follows it, the distance between every combination of the two patterns, like this:

    >NAME1
    29 #distance between the only 'GATA' and the first 'TCAT'
    41 #distance between the only 'GATA' and the second 'TCAT'
    >NAME2
    2 #distance between the only 'GATA' and the first 'TCAT'
    9 #distance between the only 'GATA' and the second 'TCAT'
    >NAME3
    4 #distance between the first 'GATA' and the first 'TCAT'
    23 #distance between the first 'GATA' and the second 'TCAT'
    6 #distance between the second 'GATA' and the second 'TCAT'

In the third block there is no distance between the second 'GATA' and the first 'TCAT', because that 'TCAT' appears before the second 'GATA'.

I tried the following code:

    while IFS= read -r line; do
      echo $line;
      if [[ "$line" == ">"* ]]; then
        echo $line;
      else
        count=$(sed -n '/GATA/,/TCAT/p' | wc -c);
        echo $count;
      fi
    done <$file

That gives me the following output:

    >NAME1
    3029

The output contains only the first '>' line and a strange, wrong distance between my two patterns, which suggests that I am doing at least two things wrong: the loop itself and the sed command.

I'm sorry if this post is confusing; I will be here to clarify things if necessary. I will appreciate any help, tips or useful links. Thank you all,
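A likely reason for the odd output of the loop above: `sed` is called without an input file, so it reads the loop's standard input (the rest of `$file`) and leaves nothing for the next `read`. In addition, `/GATA/,/TCAT/` is a line-range address, so it selects whole lines rather than the text between the two patterns, and `wc -c` then counts all of those bytes. Below is a minimal sketch of the same loop that works on the current line only, using shell parameter expansion. It reports just the first `GATA` and the first `TCAT` after it, not every combination, and the function name `first_distance` is made up for illustration.

```shell
# Hypothetical helper (name is illustrative): reads multifasta text on stdin,
# prints each header, then the number of characters between the first GATA
# and the first TCAT that follows it on the sequence line.
first_distance() {
    while IFS= read -r line; do
        case $line in
            ">"*)
                printf '%s\n' "$line"
                ;;
            *GATA*TCAT*)
                rest=${line#*GATA}       # drop everything up to and including the first GATA
                between=${rest%%TCAT*}   # keep only what precedes the first TCAT after it
                printf '%s\n' "${#between}"
                ;;
        esac
    done
}
```

It would be called as `first_distance < "$file"`. A pure-shell loop like this is slow on very large files, so it is meant to show where the original loop goes wrong rather than to be the final tool.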
RavinderSingh13
2#
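The following awk may help you with this. I'm also not sure how the 6 appears in the output for NAME3, since the string TCAT occurs only twice in the line after NAME3.

    awk -v pattern1="GATA" -v pattern2="TCAT" '
    /^>/{                          # header line: print it and reset the state
      print
      index2=prev=""
      next
    }
    {
      val=$0                       # working copy of the sequence line
      while(index(val,pattern2)){  # loop over the occurrences of pattern2
        # position of this pattern2 in the full line
        index2=prev?prev+length(pattern2)+index(val,pattern2):index(val,pattern2)
        # distance from the end of the first pattern1 to this pattern2
        print index2-(index($0,pattern1)+length(pattern1))
        # continue the search after this match
        val=substr(val,index(val,pattern2)+length(pattern2)+1)
        prev=index2
      }
    }
    ' Input_file

I will add an explanation shortly.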
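For reference, the 6 in the question's NAME3 block is the distance from the second `GATA` to the second `TCAT`. The command above measures only from the first `GATA`, so it would not print that value. Here is a sketch that covers every combination, assuming "distance" means the number of characters between the end of a `GATA` and the start of a later `TCAT`. The function name `pattern_distances` and the variable names are illustrative, not part of the original answer.

```shell
# Hypothetical wrapper (name is illustrative): reads multifasta from the
# given files or stdin and prints, for each sequence line, the distance
# from every GATA to every TCAT that starts after it.
pattern_distances() {
    awk -v p1="GATA" -v p2="TCAT" '
    # Header lines are printed unchanged.
    /^>/ { print; next }
    {
        # Record the 1-based start of every p2 occurrence in the line.
        n = 0; off = 0; s = $0
        while ((i = index(s, p2)) > 0) {
            pos2[++n] = off + i
            off += i                 # resume one character past the match start
            s = substr(s, i + 1)
        }
        # For each p1 occurrence, report every p2 that starts after p1 ends.
        off = 0; s = $0
        while ((i = index(s, p1)) > 0) {
            end1 = off + i + length(p1)   # first position after this p1
            for (k = 1; k <= n; k++)
                if (pos2[k] >= end1)
                    print pos2[k] - end1
            off += i
            s = substr(s, i + 1)
        }
    }' "$@"
}
```

It would be called as `pattern_distances Input_file`. Distances are grouped by `GATA` occurrence, which matches the order shown in the question, and awk streams the file line by line, which should suit a large input.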