A Five Bit Scripting Language

Last night in bed, as I was thinking about things that still need to be done for my next game before I can really start diving into it, I began thinking about the scripting engine. Another Star just hard-coded all scripting in C#, but, for one, there’s no way I’m scripting scenes in C++, and two, it’s really tedious to hard-code scripts even in C#.

As I thought about the scripting language and how I could store all the script in files, I came up with the idea of packing all the scripts in one file. In the file header would be a table of all the individual scripts with their locations and sizes within the file, so I could load only what I need at a time.
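To make the single-file idea concrete, here is a minimal sketch in Python. The layout is purely an assumption for illustration (a 4-byte script count, then a fixed-size name/offset/size record per script, then the script bodies), and the names build_pack and load_script are hypothetical.

```python
import struct

# Hypothetical layout: a 4-byte entry count, one (name, offset, size)
# record per script, then the script bodies stored back to back.
ENTRY = struct.Struct("<16sII")  # 16-byte name, 4-byte offset, 4-byte size

def build_pack(scripts):
    """scripts: dict of name -> bytes. Returns the whole pack file as bytes."""
    header_size = 4 + ENTRY.size * len(scripts)
    table, bodies, offset = [], [], header_size
    for name, body in scripts.items():
        table.append(ENTRY.pack(name.encode("ascii"), offset, len(body)))
        bodies.append(body)
        offset += len(body)
    return struct.pack("<I", len(scripts)) + b"".join(table) + b"".join(bodies)

def load_script(pack, wanted):
    """Walk the header table, then slice out only the script asked for."""
    (count,) = struct.unpack_from("<I", pack, 0)
    for i in range(count):
        name, offset, size = ENTRY.unpack_from(pack, 4 + i * ENTRY.size)
        if name.rstrip(b"\0").decode("ascii") == wanted:
            return pack[offset:offset + size]
    raise KeyError(wanted)
```

With a real file you would read the header, seek to the offset, and read only that many bytes, rather than holding the whole pack in memory as this sketch does.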

And, since I want the language to be case insensitive anyway, I figured I could probably save some space by reducing each character to just six bits. Six bits gives enough for only a subset of ASCII: 64 characters, to be exact. And you want to convert lowercase letters to uppercase letters (or vice versa) so you don’t waste an extra 26 of those precious characters on just letters. With six bit characters, you can do what’s called packing. This is a form of data compression where you shove four six bit characters into three bytes, where normally each ASCII character would need a full byte. That reduces the file size by exactly 25% right off the bat. Well, not counting the file header and such, but that’s only a minute portion of the file.
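Here is a rough sketch of six bit packing in Python. As a stand-in for the 64-character subset it assumes the ASCII range 32 through 95 (space through underscore), which has no line break character, so a real script alphabet would have to swap one in. The names pack6 and unpack6 are made up for the example.

```python
def pack6(text):
    """Pack four six-bit characters into every three bytes."""
    # Uppercase first, then map ASCII 32..95 onto the codes 0..63.
    codes = [ord(c) - 32 for c in text.upper()]
    if any(not 0 <= v < 64 for v in codes):
        raise ValueError("character outside the 64-character subset")
    codes += [0] * (-len(codes) % 4)  # pad with spaces (code 0) to a multiple of four
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        word = (a << 18) | (b << 12) | (c << 6) | d  # 4 * 6 = 24 bits
        out += word.to_bytes(3, "big")
    return bytes(out)

def unpack6(data):
    """Reverse of pack6; any padding comes back as trailing spaces."""
    chars = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "big")
        for shift in (18, 12, 6, 0):
            chars.append(chr(((word >> shift) & 0x3F) + 32))
    return "".join(chars)
```

Sixteen characters go in and twelve bytes come out, which is where the 25% comes from.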

Granted, there’s really no need to save 25% by compressing a text file. I mean, really now. It’s highly unlikely that all the scripts in the entire game combined together will amount to much more than a megabyte or two, if even that. But it’s something I could do if I wanted to, nonetheless.

Then, I began to wonder if I could compress it even further. Five bits per character! Eight characters would fit into every five bytes! Mwa, ha, ha, ha! I’m a madman!

Thing is, it can be done, and it can be done without a lot of “shifting” like in old-timey standards such as Baudot code.
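The bit twiddling itself is simple enough. This sketch works on plain values from 0 to 31 and says nothing yet about which character gets which value; pack5 and unpack5 are hypothetical names.

```python
def pack5(codes):
    """Pack values in the range 0..31, eight to every five bytes."""
    if any(not 0 <= v < 32 for v in codes):
        raise ValueError("value does not fit in five bits")
    codes = list(codes) + [0] * (-len(codes) % 8)  # pad to a multiple of eight
    out = bytearray()
    for i in range(0, len(codes), 8):
        word = 0
        for v in codes[i:i + 8]:
            word = (word << 5) | v  # 8 * 5 = 40 bits in total
        out += word.to_bytes(5, "big")
    return bytes(out)

def unpack5(data):
    """Reverse of pack5; padding comes back as trailing zeros."""
    codes = []
    for i in range(0, len(data), 5):
        word = int.from_bytes(data[i:i + 5], "big")
        codes.extend((word >> shift) & 0x1F for shift in range(35, -1, -5))
    return codes
```

Compared with one byte per character, that is a saving of 37.5% instead of 25%.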

Let’s take the following code as an example. This is how it might be written in the source code file:

function Example
   let int $foo=15
   if $foo != 15 then

Ignoring for the moment that the example code is redundant and accomplishes absolutely nothing ($foo will always be 15 since we just set it), let’s think about this. Five bits gives us only 32 characters total to work with. We need the 26 Latin letters for English. There’s no way around that unless we do some sort of shifting. That leaves us only six more characters to work with. One of those characters will have to be a space. Another will have to mark line breaks. That leaves only four characters. Four! We don’t even have enough room for the ten digits!

Ah, but that’s where you’re wrong. Let’s see how the above example code would look after it’s been prepared to be packed for storage. I’ll use semicolons here to denote the line break characters.


See what I did there? I encoded the digits using letters. 0 becomes A, 1 becomes B, 2 becomes C, and so on. And by using one of the four leftover characters as a dollar sign to mark variables, I made it impossible for variables to be mixed up with numbers or keywords, so this actually works. Furthermore, operators and mathematical symbols were converted to letter-based keywords, eliminating the need for punctuation. Since the script always follows the alternating pattern keyword->number->keyword->number this also works.
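As a rough sketch of that preparation step in Python: the keyword spellings EQ, NEQ, ADD and MUL are placeholders I made up for this example, not the real keywords, and the tokenizer only knows the handful of tokens used in the sample code.

```python
import re

# Hypothetical keyword spellings, invented for this sketch.
OPERATORS = {"!=": "NEQ", "=": "EQ", "+": "ADD", "*": "MUL"}
# 0 becomes A, 1 becomes B, 2 becomes C, and so on up to 9 becoming J.
DIGITS = str.maketrans("0123456789", "ABCDEFGHIJ")

# Names (with an optional $ prefix), numbers, and the operators above.
# Anything else in the line is silently skipped by this sketch.
TOKEN = re.compile(r"\$?[A-Za-z][A-Za-z0-9]*|\d+|!=|[=+*]")

def prepare_line(line):
    words = []
    for tok in TOKEN.findall(line):
        if tok in OPERATORS:
            words.append(OPERATORS[tok])
        else:
            # Uppercase, and turn digits into letters everywhere,
            # including inside variable names.
            words.append(tok.upper().translate(DIGITS))
    return " ".join(words)

def prepare(source):
    """Return the source in its ready-to-pack form, with ';' for line breaks."""
    return ";".join(prepare_line(line) for line in source.splitlines())
```

Feeding it the three lines of example code gives FUNCTION EXAMPLE;LET INT $FOO EQ BF;IF $FOO NEQ BF THEN, where BF is 15.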

It’s not without its problems though.

Take the following, for example.

$foo = ((2 + 2) * (5 + 3))

See the problem? If you just converted the parentheses to keywords you’d have a problem. It wouldn’t follow the keyword->number->keyword->number pattern. The first opening parenthesis follows directly after the equal sign operator, and is then followed by a second opening parenthesis. That’s three keywords in a row. The parser would need some way to tell the opening parenthesis keyword from a number.

But there are solutions to even this. If the keywords for parentheses were something like POPEN and PCLOSE you could get away with it because the parser couldn’t mistake them for numbers. They both start with the letter P and the highest digit 9 is encoded as J, so there’s no way it could be a valid number.
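The check the parser would need is tiny. In this sketch, decode_number is a hypothetical helper that returns the number a word encodes, or None if the word contains any letter past J.

```python
def decode_number(word):
    """Return the integer a word encodes, or None if it cannot be a number."""
    # Only the letters A..J stand for digits (A is 0, J is 9).
    if not word or any(not "A" <= ch <= "J" for ch in word):
        return None
    return int("".join(str(ord(ch) - ord("A")) for ch in word))
```

So BF decodes to 15, while POPEN and PCLOSE both come back as None. The flip side is that any keyword allowed to appear where a number is expected needs at least one letter past J. A keyword spelled with only the letters A through J would decode as a perfectly valid number.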

Similarly, depending on how you want to use functions in the language, you’d probably want to take one of the three remaining character values to mark them with some sort of prefix, the way variables are denoted. This would practically be a requirement if you wanted functions in the scripting language that take parameters and return values. That leaves you with just two characters left.

Another problem would be variable name collisions. $foo1 and $foob would both parse to $FOOB. Granted, if you’re requiring variables always be defined (such as with my verbose “let” keyword in the original example code), then the parser would catch these when packing for storage and notify the coder that their code has an error because $FOOB is already defined.
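A packer could catch this with a simple lookup while it walks the declarations. This is only a sketch, and find_collisions is a hypothetical name. It assumes you already have the declared variable names in source order.

```python
# 0 becomes A, 1 becomes B, and so on up to 9 becoming J.
DIGITS = str.maketrans("0123456789", "ABCDEFGHIJ")

def find_collisions(declared):
    """declared: variable names in the order their `let` statements appear.

    Returns (earlier_name, later_name, packed_name) for every clash.
    """
    first_seen = {}
    collisions = []
    for name in declared:
        packed = name.upper().translate(DIGITS)
        if packed in first_seen:
            collisions.append((first_seen[packed], name, packed))
        else:
            first_seen[packed] = name
    return collisions
```

Given $foo1, $bar and $foob, it reports that $foob clashes with $foo1 because both pack to $FOOB.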

You’d still have to do some sort of shifting if you wanted any kind of strings, though. Otherwise string $bar = "Hello, I am 9 years old today!" would come out, at best, as STRING $BAR = "HELLO I AM J YEARS OLD TODAY". Notice I didn’t even try to encode the punctuation. For strings to be acceptable in the language, one of the two remaining character values would have to mark the beginning of a string, and then every pair of values after that would be parsed together as a ten bit value, enough to store all 128 ASCII characters and then some. The characters would be decoded like this until you hit some ten bit end-of-string character.
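Here is a sketch of that string escape in Python, working on lists of five bit values. The marker value 30 and the all-ones terminator 1023 are arbitrary picks for the example, and encode_string and decode_string are hypothetical names.

```python
STRING_MARK = 30       # hypothetical five-bit code that opens a string
END_OF_STRING = 1023   # hypothetical ten-bit end-of-string value

def encode_string(text):
    """Return the string marker followed by two five-bit values per character."""
    values = [ord(c) for c in text]
    if any(v >= END_OF_STRING for v in values):
        raise ValueError("character does not fit below the terminator value")
    codes = [STRING_MARK]
    for value in values + [END_OF_STRING]:
        codes += [value >> 5, value & 0x1F]  # high five bits, then low five bits
    return codes

def decode_string(codes, start=0):
    """Decode the string whose marker sits at codes[start].

    Returns (text, index of the first value after the string).
    """
    if codes[start] != STRING_MARK:
        raise ValueError("no string marker at this position")
    chars, i = [], start + 1
    while True:
        value = (codes[i] << 5) | codes[i + 1]
        i += 2
        if value == END_OF_STRING:
            return "".join(chars), i
        chars.append(chr(value))
```

Every character inside a string costs ten bits instead of five, so strings end up bigger than plain ASCII. The saving only holds as long as scripts are mostly code rather than text.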

So there you have it. Again, there’s really no reason to do any of this, other than because it can be done. But it was an interesting problem to solve, anyway.