A Blazingly-Fast DNA Search Algorithm

Cameron Kroll
3 min readMay 30, 2022

--

Searching through DNA is a surprisingly important problem in bioinformatics. Some genetic circuits, for example, use RNA strands to carry signals through the cell. It would be really bad if RNA strands already present in the cell could interfere with these circuits, so it’s important to detect overlaps while designing the circuit.

This is where my algorithm comes in. My algorithm is based on this string-searching algorithm, but with a few modifications to make it work with DNA. My algorithm has a worst-case complexity of O(N) and a best-case complexity of O(M) when the DNA sequence you’re searching for is present and a best-case complexity of O(N/M) when the strand isn’t present, with M being the length of the substring.

Steps of the algorithm

Preprocessing and setup

First things first, the inputs have to be processed for the algorithm. The DNA sequences are converted into bytes where A is 0b10, T is 0b11, C is 0b00, and G is 0b01.

Next, we create a cache that contains which 4-bp-long subsequences are present in the sequence we’re searching for. The algorithm creates an array of 256 “buckets” that we can access using bit-shifting. The algorithm loops through the sequence we’re searching for and adds the subsequence at that location to the array using the following format.

cache[(sequence[i] << 6) + (sequence[i + 1] << 4) + ...] = true

This cache will allow us to detect the presence of 4-bp chunks in constant time. This preprocessing step is by far the slowest part of the algorithm, but we can cache the results in a binary file that can be loaded extremely quickly at runtime.

The main algorithm

The algorithm starts by aligning the subsequence (what we’re looking for) with the longer sequence like so:

Seq1:   AAAAATTCGCGAT
SubSeq: ATTCG

The algorithm then enters the main loop. In this loop, the algorithm uses the cache to check if the current 4-bp chunk is present in the sequence.

Seq1:   A[AAAA]TTCGCGAT
SubSeq: ATTCG

If the 4-bp chunk isn’t present, the algorithm advances by the length of the subsequence minus 1.

Seq1:   AAAAATTCGCGAT
SubSeq: ATTCG

The algorithm then repeats this process until it either reaches the end or finds a match. If it finds a match, it moves backwards until it finds the location of the match in the sequence. Here’s an example:

      Found match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCGAT
Shift forwards by 1:
Found match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCGAT
Shift forwards by 1:
Found match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCGAT
Shift forwards by 1, sequences overlap:
Found match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCGAT

Now that the algorithm has found a match, it moves backwards through the subsequence and compares the two sequences until it either finds a mismatch or reaches the end of the substring:

Mismatch example:
Matches↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCCAT
Shift backwards by 1:
Matches↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCCAT
Shift backwards by 1, mismatch detected:
Doesn't match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCCAT

If the algorithm reaches the end of the substring, an exact match was found and the function returns. If it finds a mismatch, it advances by the distance of the mismatch from the current index:

Doesn't match↓
Seq1: AAAAATTCGCGAT
SubSeq: TCGCCAT
Advances:
Seq1: AAAAATTCGCGAT
SubSeq: TCGCCAT

Once the algorithm reaches the end of the DNA sequence, it returns and is finished.

There’s the algorithm! I’ve published it on GitHub under the MIT license, so feel free to include it in your own projects. Here’s the link! I’ve done some very informal benchmarking using Visual Studio, and it takes under 1ms for this algorithm to find a unique RNA strand not present in the human transcriptome.

Hey there 👋

Thank you so much for reading through this article. I hope it wasn’t too complicated and I hope this is useful for someone. It’s definitely possible I’m not the first person to do this, I just threw this together over a weekend.

If you want to stay informed about more of my projects like this, subscribe to my newsletter and follow me on Twitter.

--

--