Get the number of characters between two patterns that are in the next line of a line that begins with '>' in a large file

Fernanda Costa Published on 2018-01-12 19:53:38Z

First of all, sorry about the long title; I couldn't find a better way to explain what I want to achieve with this bash script.

I have a very large file (multifasta) that looks like this:

>NAME1
GATATATAGATTAGATTTAGAGAGAGGAGCTATTCATCAGAGCTATCATCAGCTACAGCA
>NAME2
GCGCTAGAGAGCTAGCTACGACTAGCACTAGAGGATACATCATGGGTCATCAGCAGTCAGCATCAC
>NAME3
GCATCAGCATGATAGATCTCATGACTAGATAGAACTATCAT

and so on.

I also have two patterns:

'GATA' and 'TCAT'

I already know that both patterns occur in every line that doesn't begin with '>', sometimes more than once. My goal is to print each '>' line and then, for the line that follows it, the distance between every combination of the two patterns, like this:

>NAME1
29 #distance between the only 'GATA' and the first 'TCAT'
41 #distance between the only 'GATA' and the second 'TCAT'
>NAME2
2 #distance between the only 'GATA' and the first 'TCAT'
9 #distance between the only 'GATA' and the second 'TCAT'
>NAME3
4 #distance between the first 'GATA' and first 'TCAT'
23 #distance between the first 'GATA' and second 'TCAT'
6 #distance between the second 'GATA' and the second 'TCAT'

In the third block, there is no distance between the second 'GATA' and the first 'TCAT' because the second pattern appears before the first pattern.
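To make the distance definition concrete, here is a minimal check of the first number for NAME1 using only shell parameter expansion. It assumes that "distance" means the number of characters strictly between the end of 'GATA' and the start of the next 'TCAT'; the variable names are just for illustration.

```shell
# Assumption: distance = characters strictly between the end of GATA
# and the start of the following TCAT.
sequence='GATATATAGATTAGATTTAGAGAGAGGAGCTATTCATCAGAGCTATCATCAGCTACAGCA'
rest=${sequence#*GATA}   # everything after the first GATA
gap=${rest%%TCAT*}       # the part of that which precedes the first TCAT
echo "${#gap}"           # prints 29
```

This only covers the first 'GATA' and the first 'TCAT' after it; it does not enumerate every combination.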

I tried the following code:

while IFS= read -r line;
do
        echo $line;
        if [[ "$line" == ">"* ]];
        then
                echo $line;
        else
                count=$(sed -n '/GATA/,/TCAT/p' | wc -c);
                echo $count;
        fi
done < $file

That gives me the following output:

>NAME1
3029

That output shows only the first '>' line and a strange, clearly wrong distance between my two patterns, which suggests I am doing at least two things wrong: the loop itself and the sed command.

I'm sorry if this post is confusing; I will be here to clarify things if necessary. I would appreciate any help, tips, or useful links.

Thank you all,

RavinderSingh13 Replied on 2018-01-12 22:24:58Z

The following awk script may help. I'm also not sure how the 6 appears in the output for NAME3, since the string TCAT occurs only twice in the line after NAME3.

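awk -v pattern1="GATA" -v pattern2="TCAT" '
/^>/ {
  print                      # header line: print it as-is
  next
}
{
  # Position just past the first pattern1 match (only the first one is used).
  end1 = index($0, pattern1) + length(pattern1)
  val = $0                   # part of the line not yet searched
  offset = 0                 # characters already cut from the front of val
  while ((pos = index(val, pattern2)) > 0) {
    start2 = offset + pos    # 1-based start of this pattern2 match in $0
    print start2 - end1
    # Continue right after this match so an adjacent match is not skipped.
    offset = start2 + length(pattern2) - 1
    val = substr(val, pos + length(pattern2))
  }
}
' Input_file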

I will add an explanation shortly.
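The script above measures only from the first GATA in each line. As a rough sketch of how the same idea might extend to every GATA/TCAT combination, the function below collects the start positions of both patterns and prints the gap for each pair where TCAT begins at or after the end of GATA. The names pattern_distances and starts are hypothetical helpers introduced here, and the rule for skipping pairs (negative gap) is an assumption based on the expected output in the question.

```shell
# Sketch only: pattern_distances and starts() are illustrative names.
# Assumes each sequence sits on a single line after its ">" header.
pattern_distances() {
  awk -v p1="GATA" -v p2="TCAT" '
  # Fill pos[] with the 1-based start of every match of pat in s
  # (overlapping matches included) and return how many were found.
  function starts(s, pat, pos,    n, i) {
    split("", pos)           # portable way to empty the array
    n = 0
    for (i = 1; i + length(pat) - 1 <= length(s); i++)
      if (substr(s, i, length(pat)) == pat) pos[++n] = i
    return n
  }
  /^>/ { print; next }       # header line: print it unchanged
  {
    n1 = starts($0, p1, a)
    n2 = starts($0, p2, b)
    for (i = 1; i <= n1; i++)
      for (j = 1; j <= n2; j++) {
        # Characters strictly between the end of p1 and the start of p2.
        gap = b[j] - (a[i] + length(p1))
        if (gap >= 0) print gap   # skip pairs where p2 comes first or overlaps
      }
  }' "$@"
}
```

It could be called as pattern_distances Input_file. For the NAME3 line in the question this prints 4, 23 and 6, which also shows where the 6 comes from: it is the gap between the second GATA and the second TCAT.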
