Decoding the secret DNA code in Venter's synthetic bacterium

Craig Venter recently created a synthetic bacterium with a secret message encoded in the DNA. This is described in more detail at singularityhub. (My article will make much more sense if you read the singularityhub article.)

I tried unsuccessfully to decode the secret message yesterday. I realized the problem was that I was thinking like a computer scientist, not a biologist. Once I started thinking like a biologist, the solution was obvious.

I won't spoil your fun by giving away the full answer (at least for now), but I'll mention a few things. [Update: I've written up the details here.] There are four watermarks. The first contains HTML. (That's right! They put actual HTML - complete with DOCTYPE - into the DNA of a living creature. That's just crazy!)
The HTML in the genome
I sent email to the link, and they told me I was the 31st person to break the code. I guess I'll need to be faster next time.

The second, third, and fourth watermarks each contain co-authors and an interesting scientific quote. The first watermark also contains the full character set in order, to help verify the decoding, although I was unable to decode a few of the special characters because they only appear once.

I found the following quotes (which match the quotes given by JCVI and in the singularityhub article):
"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE." - JAMES JOYCE
"SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."
WHAT I CANNOT BUILD, I CANNOT UNDERSTAND." - RICHARD FEYNMAN
Note that the quotes are one per watermark, not all in the fourth watermark as the singularityhub article claims.

If you want to try decoding the code, the DNA of the watermarks is at Science Magazine. The complete genome is a NCBI, in case you want it.

My loyal readers will be disappointed that I didn't use the Arc language to decode this, unlike my previous cryptanalysis and genome adventures. I started out with Arc, but my existing Arc cryptanalysis tools didn't do the job. I switched to Python since I'm faster in Python and I wanted to do some heavy-duty graphical analysis.

I tried a bunch of things that didn't work for analysis. I tried autocorrelation to try to determine the length of each code element, but didn't get much of a signal. I tried looking for common substrings between the watermarks both visually and symbolically, and I found some substantial strings that appeared multiple times, but that didn't really help. I discovered that the sequence "CGAT" never appears in the watermarks, but that was entirely useless. I tried applying the standard genetic code (mapping DNA triples to amino acids), since apparently Craig Venter used that in the past, but didn't come up with anything. I tried to make various estimates of how many bits of data were in the watermarks and how many bits of data would be required using various encodings. I briefly considered the possibility that the DNA encoded vectors that would draw out the answers (I think a Pickover book suggested analyzing DNA in this way), or that the DNA drew out text as a raster, but decided the bit density would be too low and didn't actually try this.

Then I asked myself: "How would I, myself, encode text in DNA?" I figured Unicode would be overkill, so probably it would be best to use either ASCII or a 6-bit encoding (maybe base-64). Then each pair of bits could be encoded with one of C, G, A, and T. Based on the singularityhub article I assumed the word "TRIUMPH" appeared, so I took "riumph" (to avoid possible capitalization issues with the first letter), encoded it as 6-bits or 8-bits, with any possible starting value for 'a', and the 24 possibilities for mapping C/G/A/T to 2-bits, and then tried big-endian and little-endian, to yield 15360 possible DNA encodings for the word. I was certain that one of them would have to appear. However, a brute force search found that none of them were in the DNA. At that point, I started trying even more implausible encodings, with no success, and started to worry that maybe the data was zipped first, which would would make decoding almost impossible. I started to think about variable-length encodings (such as Huffman encodings), and then gave up for the night. I was wondering if it would be worth a blog post on things that don't work for decrypting the code or if I should quit in silence.

In the middle of the night it occurred to me that the designers of the code were biologists, so I should think like a biologist not a computer scientist to figure out the code. With this perspective, it was obvious how to extend the genetic code to handle a larger character set. The problem was there were 64! possibilities. Or maybe 64^26 or 26 choose 64 or something like that; the exact number doesn't matter, but it's way too many to brute force like I did earlier. However, based on the quotes in the singularityhub article, I assumed that the substring " OUT O" appeared in the string, made a regular expression, and boom, it actually matched the watermark, confirming my theory. Then it was a quick matter of extending the technique to decode the full watermarks. As far as I can tell, the encoding used was arbitrary, and does not have any fundamental meaning. As I mentioned earlier, I'm leaving out the details for now, since I don't want to ruin anyone else's fun in decoding the message.

In conclusion, decoding the secret message was a fun challenge.

2 comments:

UnknownJune 8, 2010 at 4:21 PM
TTAACTAGCTAATGATCATAGTGCGCAGCTATAGGCCGTCTAATAACACGTGCTTGACTGTGCTACATATGATCACTGGCTCGAATACTGTGAATACTATAATAGAACAACCATATATCATAAAACACATAAATTGAGGGCGCGCCTTAACTAGCTAA
UnknownJuly 13, 2010 at 12:28 PM
TTAACTAGCTAATGCCTGTTTTAAATAGTCCGTCTAGCAGAGGGGATTCGTATACATCGTTCCATAGCATGCCGTGTCATACTGGGCATAGTCTAAATATAGAACGTCTAGCATGCTATATCATAGTTGTAAATATGACGTATATAATGCATTATATGATCATAAATAGTCTAGTGATAACTACAATAGCTAGCAATAGTCCTGTGATCACAGGGGTACTACTTTTACTTTTACTTTTTTTGATGATAGTAGTTTTGATAGTACTTTTGATAGTAGGGGCGTCTAATACTGGCTGGGTGATGATAGTAGTTTTGATAGTACTTTTGATAGTAGGGGTAATGCCGTTCCTACTCAAAAATATGATCATAAATATGATCACTGCTAATTATAGTCTAGTGATAACTACAATAGCTAGCAATAGGCCTACGTCAAATATGATCATAAATAGCTTCCACAACACGTCTATGACTGTGCTACATACGTTGCAACCTGTGCTAAATACAATAGTGATAACTACTGTAGAACATACGTCAACTGTGAGCTATACTGTGACGAGGCGCGCCTTAACTAGCTAA