A Blazingly-Fast DNA Search Algorithm
Searching through DNA is a surprisingly important problem in bioinformatics. Some genetic circuits, for example, use RNA strands to carry signals through the cell. It would be really bad if RNA strands already present in the cell could interfere with these circuits, so it’s important to detect overlaps while designing the circuit.
This is where my algorithm comes in. My algorithm is based on this string-searching algorithm, but with a few modifications to make it work with DNA. My algorithm has a worst-case complexity of O(N) and a best-case complexity of O(M) when the DNA sequence you’re searching for is present and a best-case complexity of O(N/M) when the strand isn’t present, with M being the length of the substring.
Steps of the algorithm
Preprocessing and setup
First things first, the inputs have to be processed for the algorithm. The DNA sequences are converted into bytes where A is 0b10
, T is 0b11
, C is 0b00
, and G is 0b01
.
Next, we create a cache that contains which 4-bp-long subsequences are present in the sequence we’re searching for. The algorithm creates an array of 256 “buckets” that we can access using bit-shifting. The algorithm loops through the sequence we’re searching for and adds the subsequence at that location to the array using the following format.
cache[(sequence[i] << 6) + (sequence[i + 1] << 4) + ...] = true
This cache will allow us to detect the presence of 4-bp chunks in constant time. This preprocessing step is by far the slowest part of the algorithm, but we can cache the results in a binary file that can be loaded extremely quickly at runtime.
The main algorithm
The algorithm starts by aligning the subsequence (what we’re looking for) with the longer sequence like so:
Seq1: AAAAATTCGCGAT
SubSeq: ATTCG
The algorithm then enters the main loop. In this loop, the algorithm uses the cache to check if the current 4-bp chunk is present in the sequence.
Seq1: A[AAAA]TTCGCGAT
SubSeq: ATTCG
If the 4-bp chunk isn’t present, the algorithm advances by the length of the subsequence minus 1.
Seq1: AAAAATTCGCGAT
SubSeq: ATTCG
The algorithm then repeats this process until it either reaches the end or finds a match. If it finds a match, it moves backwards until it finds the location of the match in the sequence. Here’s an example:
Found match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCGATShift forwards by 1:
Found match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCGATShift forwards by 1:
Found match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCGATShift forwards by 1, sequences overlap:
Found match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCGAT
Now that the algorithm has found a match, it moves backwards through the subsequence and compares the two sequences until it either finds a mismatch or reaches the end of the substring:
Mismatch example:
Matches↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCCATShift backwards by 1:
Matches↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCCATShift backwards by 1, mismatch detected:
Doesn't match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCCAT
If the algorithm reaches the end of the substring, an exact match was found and the function returns. If it finds a mismatch, it advances by the distance of the mismatch from the current index:
Doesn't match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCCATAdvances:
Seq1: AAAAATTCGCGAT
SubSeq: TCGCCAT
Once the algorithm reaches the end of the DNA sequence, it returns and is finished.
There’s the algorithm! I’ve published it on GitHub under the MIT license, so feel free to include it in your own projects. Here’s the link! I’ve done some very informal benchmarking using Visual Studio, and it takes under 1ms for this algorithm to find a unique RNA strand not present in the human transcriptome.
Hey there 👋
Thank you so much for reading through this article. I hope it wasn’t too complicated and I hope this is useful for someone. It’s definitely possible I’m not the first person to do this, I just threw this together over a weekend.
If you want to stay informed about more of my projects like this, subscribe to my newsletter and follow me on Twitter.