I tried unsuccessfully to decode the secret message yesterday. I realized the problem was that I was thinking like a computer scientist, not a biologist. Once I started thinking like a biologist, the solution was obvious.
I won't spoil your fun by giving away the full answer (at least for now), but I'll mention a few things. [Update: I've written up the details here.] There are four watermarks. The first contains HTML. (That's right! They put actual HTML - complete with DOCTYPE - into the DNA of a living creature. That's just crazy!)
I sent email to the link, and they told me I was the 31st person to break the code. I guess I'll need to be faster next time.
The second, third, and fourth watermarks each contain co-authors and an interesting scientific quote. The first watermark also contains the full character set in order, to help verify the decoding, although I was unable to decode a few of the special characters because they only appear once.
I found the following quotes (which match the quotes given by JCVI and in the singularityhub article):
"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE." - JAMES JOYCE
"SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."
WHAT I CANNOT BUILD, I CANNOT UNDERSTAND." - RICHARD FEYNMAN
Note that the quotes are one per watermark, not all in the fourth watermark as the singularityhub article claims.
My loyal readers will be disappointed that I didn't use the Arc language to decode this, unlike my previous cryptanalysis and genome adventures. I started out with Arc, but my existing Arc cryptanalysis tools didn't do the job. I switched to Python since I'm faster in Python and I wanted to do some heavy-duty graphical analysis.
I tried a bunch of things that didn't work for analysis. I tried autocorrelation to try to determine the length of each code element, but didn't get much of a signal. I tried looking for common substrings between the watermarks both visually and symbolically, and I found some substantial strings that appeared multiple times, but that didn't really help. I discovered that the sequence "CGAT" never appears in the watermarks, but that was entirely useless. I tried applying the standard genetic code (mapping DNA triples to amino acids), since apparently Craig Venter used that in the past, but didn't come up with anything. I tried to make various estimates of how many bits of data were in the watermarks and how many bits of data would be required using various encodings. I briefly considered the possibility that the DNA encoded vectors that would draw out the answers (I think a Pickover book suggested analyzing DNA in this way), or that the DNA drew out text as a raster, but decided the bit density would be too low and didn't actually try this.
Then I asked myself: "How would I, myself, encode text in DNA?" I figured Unicode would be overkill, so probably it would be best to use either ASCII or a 6-bit encoding (maybe base-64). Then each pair of bits could be encoded with one of C, G, A, and T. Based on the singularityhub article I assumed the word "TRIUMPH" appeared, so I took "riumph" (to avoid possible capitalization issues with the first letter), encoded it as 6-bits or 8-bits, with any possible starting value for 'a', and the 24 possibilities for mapping C/G/A/T to 2-bits, and then tried big-endian and little-endian, to yield 15360 possible DNA encodings for the word. I was certain that one of them would have to appear. However, a brute force search found that none of them were in the DNA. At that point, I started trying even more implausible encodings, with no success, and started to worry that maybe the data was zipped first, which would would make decoding almost impossible. I started to think about variable-length encodings (such as Huffman encodings), and then gave up for the night. I was wondering if it would be worth a blog post on things that don't work for decrypting the code or if I should quit in silence.
In the middle of the night it occurred to me that the designers of the code were biologists, so I should think like a biologist not a computer scientist to figure out the code. With this perspective, it was obvious how to extend the genetic code to handle a larger character set. The problem was there were 64! possibilities. Or maybe 64^26 or 26 choose 64 or something like that; the exact number doesn't matter, but it's way too many to brute force like I did earlier. However, based on the quotes in the singularityhub article, I assumed that the substring " OUT O" appeared in the string, made a regular expression, and boom, it actually matched the watermark, confirming my theory. Then it was a quick matter of extending the technique to decode the full watermarks. As far as I can tell, the encoding used was arbitrary, and does not have any fundamental meaning. As I mentioned earlier, I'm leaving out the details for now, since I don't want to ruin anyone else's fun in decoding the message.
In conclusion, decoding the secret message was a fun challenge.