CadBerry Devlog 3: GIL, Parsing, And Lexing

Cameron Kroll
3 min read · Oct 25, 2021

This week was a bit frustrating: I had a bunch of other work, and I didn’t make as much progress as I’d hoped 😒. Even so, I worked on the first two-thirds of the CadBerry GIL compiler.

How to Make a Compiler

A compiler has three main parts: the lexer, the parser, and the target code generator. The lexer takes something like this example GIL code:

#Target S.Cerevisiae
operation TestOp
{
AAAAATTTTTCCCCGGGGG
$InnerCode
@MMCTQQQP
LX@
}
sequence TestSeq
{
TTTTTTTTTTTTTTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCC
}
TestFSeq => TestSeq.Begin
{
.TestOp
{
TestFSeq
}
}

It then converts that code into a list of tokens that tell the compiler what each part of the code does. For the example above, the GIL lexer produces this token list:

set target, #Target
unknown token, #Target
ident, S.Cerevisiae
define operation, operation
ident, TestOp
begin,
newline,
dna, AAAAATTTTTCCCCGGGGG
innercode, $InnerCode
aminos, MMCTQQQPLX
newline,
end,
newline,
newline,
define sequence, sequence
ident, TestSeq
begin,
newline,
dna, TTTTTTTTTTTTTTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCC
end,
newline,
newline,
ident, TestFSeq
forward,
ident, TestSeq
newline,
call operation, Begin
begin,
newline,
call operation, TestOp
begin,
newline,
ident, TestFSeq
end,
newline,
end,
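
To make the lexing step more concrete, here is a minimal sketch of a hand-written, character-by-character lexer in Python. It is not the GIL++ source (which is C++) and it covers only a subset of the syntax: amino acid literals and dotted calls like TestSeq.Begin are left out. The function names `lex` and `classify`, the tuple layout, and the rule that a word made only of A, C, G, and T is DNA are all assumptions for illustration; the token names are borrowed from the listing above, but the sketch does not try to reproduce the real lexer's exact newline placement.

```python
from typing import List, Tuple

# (kind, text). Kind names are borrowed from the token listing; the layout is assumed.
Token = Tuple[str, str]

KEYWORDS = {"operation": "define operation", "sequence": "define sequence"}
DNA_BASES = set("ACGT")


def classify(word: str) -> Token:
    """Decide what kind of token a whitespace-delimited word is."""
    if word == "#Target":
        return ("set target", word)
    if word in KEYWORDS:
        return (KEYWORDS[word], word)
    if word.startswith("$"):
        return ("innercode", word)
    if word.startswith(".") and len(word) > 1:
        return ("call operation", word[1:])
    # Assumption: a word made only of A/C/G/T is DNA. A real lexer would
    # need context to tell a sequence named "TAG" apart from DNA.
    if all(ch in DNA_BASES for ch in word):
        return ("dna", word)
    return ("ident", word)


def lex(source: str) -> List[Token]:
    """Walk the source one character at a time, with no regular expressions."""
    tokens: List[Token] = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\n":
            tokens.append(("newline", ""))
            i += 1
        elif ch.isspace():
            i += 1  # other whitespace only separates words
        elif ch == "{":
            tokens.append(("begin", ""))
            i += 1
        elif ch == "}":
            tokens.append(("end", ""))
            i += 1
        elif source.startswith("=>", i):
            tokens.append(("forward", ""))
            i += 2
        else:
            # Read one word: everything up to whitespace or a brace.
            start = i
            while i < len(source) and not source[i].isspace() and source[i] not in "{}":
                i += 1
            tokens.append(classify(source[start:i]))
    return tokens


if __name__ == "__main__":
    for kind, text in lex("sequence TestSeq\n{\nTTTTCCCC\n}\n"):
        print(f"{kind}, {text}")
```

Keeping `classify` separate from the scanning loop means new keywords only touch one small function, which is one plausible way to organize a lexer like this.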

In GIL, the parser then takes the output tokens and converts them into a Project object in memory. The entry point to the GIL file will be converted into a sequence with an illegal character at the beginning (something like “\nMain”) so that the user can’t accidentally mess with the entry point. With all that technical jargon out of the way, let’s move on to some of the differences between the C# GIL compiler I wrote over quarantine and the C++ version I’m working on now.
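
The parsing step described above can be sketched roughly like this. It is a Python illustration, not the GIL++ parser: the `Project` and `Sequence` classes, their fields, and the helper `read_block` are hypothetical, and only sequence definitions are handled (operations are left out). The one detail taken from the text is the entry point: everything outside a definition is collected into a sequence whose name starts with a newline, which a user can't type as an identifier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (kind, text), assumed layout

# Assumed entry point name, following the "\nMain" idea from the text.
ENTRY_POINT = "\nMain"


@dataclass
class Sequence:
    body: List[Token] = field(default_factory=list)


@dataclass
class Project:
    sequences: Dict[str, Sequence] = field(default_factory=dict)


def read_block(tokens: List[Token], i: int) -> Tuple[List[Token], int]:
    """Collect tokens up to the 'end' that matches an already-consumed 'begin'."""
    depth = 1
    body: List[Token] = []
    while i < len(tokens):
        kind = tokens[i][0]
        if kind == "begin":
            depth += 1
        elif kind == "end":
            depth -= 1
            if depth == 0:
                return body, i + 1
        if kind != "newline":  # newlines carry no meaning inside a block here
            body.append(tokens[i])
        i += 1
    raise SyntaxError("block is missing its closing 'end'")


def parse(tokens: List[Token]) -> Project:
    project = Project()
    main = project.sequences.setdefault(ENTRY_POINT, Sequence())
    i = 0
    while i < len(tokens):
        kind = tokens[i][0]
        if kind == "define sequence":
            # Expected shape: define sequence, ident, optional newlines, begin ... end
            if i + 1 >= len(tokens) or tokens[i + 1][0] != "ident":
                raise SyntaxError("expected a name after 'sequence'")
            name = tokens[i + 1][1]
            i += 2
            while i < len(tokens) and tokens[i][0] == "newline":
                i += 1
            if i >= len(tokens) or tokens[i][0] != "begin":
                raise SyntaxError(f"expected a block after sequence {name}")
            body, i = read_block(tokens, i + 1)
            project.sequences[name] = Sequence(body)
        elif kind == "newline":
            i += 1
        else:
            # Anything outside a definition belongs to the entry point.
            main.body.append(tokens[i])
            i += 1
    return project
```

Because the entry point lives in the same table as every other sequence, a code generator could treat it like any other sequence and simply start from `ENTRY_POINT`.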

How This Version’s Different

My goal for GIL since the beginning has been to simplify synthetic biology. Unfortunately, the C# GIL version (hereafter referred to as GIL#) had a bunch of complicated features all implemented in different ways. In the C++ version (GIL++), I’ve cut out all but the most important features and reworked those that remained. For example, did you know that GIL# has two different ways to specify amino acids, one of which is completely useless? And the one that was useful kinda sucked too. If you wanted to add a sequence of amino acids, GIL# would make you do this:

AminoSequence
{
Some amino acid sequence
}

In GIL++, I’ve reworked this to make it look better, compile faster, and take up less RAM:

@Some amino acid sequence@
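
Part of the appeal of the @...@ form is that a lexer can read it with a single scan for the closing delimiter. Here is a minimal Python sketch of that idea; the function name `read_amino_literal` and its return shape are hypothetical, and this is not the GIL++ code. Dropping whitespace inside the literal is an inference from the earlier example, where a literal wrapped over two lines came out as the single token MMCTQQQPLX.

```python
from typing import Tuple


def read_amino_literal(source: str, start: int) -> Tuple[str, int]:
    """Read an @...@ literal that begins at source[start].

    Returns the amino acid letters and the index just past the closing '@'.
    """
    if start >= len(source) or source[start] != "@":
        raise ValueError("expected '@' at the start of an amino acid literal")
    end = source.find("@", start + 1)
    if end == -1:
        raise ValueError("amino acid literal is missing its closing '@'")
    # Assumption: whitespace inside the literal is ignored, so it can wrap lines.
    letters = "".join(source[start + 1:end].split())
    return letters, end + 1


if __name__ == "__main__":
    letters, next_index = read_amino_literal("@MMCTQQQP\nLX@\n}", 0)
    print(letters, next_index)
```

The returned index lets a surrounding lexer loop pick up right after the literal.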

I could talk all day about the performance improvements in GIL++ (for example, getting rid of the regex-based lexer), but that’s not what this article is for. At this point, all I have left to do is finish the code generator, add in a couple of settings, and stop procrastinating on debugging a couple of annoying errors.

I sense this article has been a bit more rambly than usual, but that’s because this week’s progress has been setting things up for next week, when I’ll (hopefully 🤞) be able to release a super early developer version on GitHub. As I’ve said previously, CadBerry will be released under the same GPL v3 open-source license as GIL. This basically means you can do just about anything you want with it, as long as any changes you make are released under the same license (not legal advice, btw).

I’m excited to see where this thing goes, especially with some of my ideas for integrating biological computing, and maybe even CRISPR eventually. With that said, see you next week 😃
