Showing posts with label arc. Show all posts
Showing posts with label arc. Show all posts

Solving a math problem of Schrödinger with Arc and Python

I recently received an interesting math problem: how many 12-digit numbers can you make from the digits 1 through 5 if each digit must appear at least once. The problem seemed trivial at first, but it turned out to be more interesting than I expected, so I'm writing it up here. I was told the problem is in a book of Schrodinger's, but I don't know any more details of its origin.

(The problem came to me via Aardvark (vark.com), which is a service where you can send questions to random people, and random people send you questions. Aardvark tries to match up questions with people who may know the answer. I find it interesting to see the questions I receive.)

The brute force solution

I didn't see any obvious solution to the problem, so I figured I'd first use brute force. I didn't use the totally brute force solution - enumerating all 12 digit sequences and counting the valid ones - since it would be pretty slow, so I took a slightly less brute force approach.

If 1 appears a times, 2 appears b times, and 3, 4, and 5 appear c, d, and e times, then the total number of possibilities is 12!/(a!b!c!d!e!) - i.e. the multinomial coeffient.

Then we just need to sum across all the different combinations of a through e. a has to be at least 1, and can be at most 8 (in order to leave room for the other 4 digits). Then b has to be at least 1 and at most 9-a, and so on. Finally, e is whatever is left over.

The Python code is straightforward, noting that range(1, 9) counts from 1 to 8:

from math import factorial
total = 0
for a in range(1, 8+1):
  for b in range(1, 9-a+1):
    for c in range(1, 10-a-b+1):
      for d in range(1, 11-a-b-c+1):
        e = 12-a-b-c-d
        total += (factorial(12) / factorial(a) / factorial(b)
          / factorial(c) / factorial(d) / factorial(e))
print total
Or, in Arc:
(def factorial (n)
  (if (<= n 1) 1
     (* n (factorial (- n 1)))))

(let total 0
  (for a 1 8
    (for b 1 (- 9 a)
      (for c 1 (- 10 a b)
        (for d 1 (- 11 a b c)
          (let e (- 12 a b c d)
            (++ total (/ (factorial 12) (factorial a) (factorial b)
              (factorial c) (factorial d) (factorial e))))))))
  total)
Either way, we quickly get the answer 165528000.

The generalized brute force solution

The above solution is somewhat unsatisfying, since if we want to know how many 11 digits numbers can be made from at least one of 1 through 6, for instance, there's no easy way to generalize the code.

We can improve the solution by writing code to generate the vectors of arbitrary digits and length. In Python, we can use yield, which magically gives us the vectors one at a time. The vecs function recursively generates the vectors that sum to <= t. The solve routine adds the last entry in the vector to make the sum exactly equal the length. My multinomial function is perhaps overly fancy, using reduce to compute the factorials of all the denominator terms and multiply them together in one go.

from math import factorial

# Generate vectors of digits of length n with sum <= t, and minimum digit 1.
def vecs(n, t):
  if n <= 0:
    yield []
  else:
    for vec in vecs(n-1, t-1):
      for last in range(1, t - sum(vec) + 1):
        yield vec + [last]

def multinomial(n, ks):
  return factorial(n) / reduce(lambda prod, k: prod * factorial(k), ks, 1)
  
# Find number of sequences of digits 1 to n, of length len,
# with each digit appearing at least once
def solve(n, len):
  total = 0
  for vec0 in vecs(n-1, len-1):
    # Make the vector of length n summing exactly to len
    vec = vec0 + [len - sum(vec0)]
    total += multinomial(len, vec)
  return total
Now, solve(5, 12) solves the original problem, and we can easily solve related problems.

In Arc, doing yield would need some crazy continuation code, so I just accumulate the list of vectors with accum and then process it. The code is otherwise similar to the Python code:

(def vecs (n t)
  (if (<= n 0) '(())
    (accum accfn
      (each vec (vecs (- n 1) (- t 1))
       (for last 1 (- t (reduce + vec))
          (accfn (cons last vec)))))))

(def multinomial (n ks)
  (/ (factorial n) (reduce * (map factorial ks))))

(def solve (n len)
  (let total 0
    (each vec0 (vecs (- n 1) (- len 1))
        (++ total (multinomial len
                    (cons (- len (reduce + vec0)) vec0))))
    total))
This solution will solve our original problem quickly, but in general it's exponential, so solving something like n = 10 and length = 30 gets pretty slow.

The dynamic programming solution

By thinking about the problem a bit more, we can come up with a simpler solution. Let's take all the sequences of 1 through 5 of length 12 with each number appearing at least once, and call this value S(5, 12). How can we recursively find this value? Well, take a sequence and look at the last digit, say 5. Either 5 appears in the first 11 digits, or it doesn't. In the first case, the first 11 digits form a sequence of 1 through 5 of length 11 with each number appearing at least once. In the second case, the first 11 digits form a sequence of 1 through 4 of length 11 with each number appearing at least once. So we have S(5, 11) sequences of the first case, and S(4, 11) of the second case. We assumed the last digit was a 5, but everything works the same if it was a 1, 2, 3, or 4; there are 5 possibilities for the last digit.

The above shows that S(5, 12) = 5 * (S(5, 11) + S(4, 11)). In general, we have the nice recurrence S(n, len) = n * (S(n, len-1) + S(n-1, len-1)). Also, if you only have a single digit, there's only one solution, so S(1, len) = 1. And if you have more digits than length to put them in, there are clearly no solutions, so S(n, len) = 0 if n > len. We can code this up in a nice recursive solution with memoization:

memo = {}
def solve1(digits, length):
  if memo.has_key((digits, length)):
    return memo[(digits, length)]
  if digits > length:
    result = 0
  elif digits == 1:
    result = 1
  else:
    result = digits * (solve1(digits-1, length-1) + solve1(digits, length-1))
  memo[(digits, length)] = result
  return result
An interesting thing about this solution is it breaks down each function call into two simpler function calls. Each of these turns into two simpler function calls, and so forth, until it reaches the base case. This results in an exponential number of function calls. This isn't too bad for solving the original problem of length 12, but the run time rapidly gets way too slow.

Note that the actual number of different function evaluations is pretty small, but the same ones keep getting evaluated over and over exponentially often. The traditional way of fixing this is to use dynamic programming. In this approach, the structure of the algorithm is examined and each value is calculated just one, using previously-calculated values. In our case, we can evaluate S(1, 0), S(2, 0), S(3, 0), and so on. Then evaluate S(2, 1), S(2, 2), S(2, 3) using those values. Then evaluate S(3, 2), S(3, 3), and so on. Note that each evaluation only uses values that have already been calculated. For an introduction to dynamic programming, see Dynamic Programming Zoo Tour.

Rather than carefully looping over the sub-problems in the right order to implement a dynamic programming solution, it is much easier to just use memoization. The trick here is once a subproblem is solved, store the answer. If we need the answer again, just use the stored answer rather than recomputing it. Memoization is a huge performance win. Without memoization solve1(10, 30) takes about 15 seconds, and with it, the solution is almost immediate. (solve1 10 35) takes about 80 seconds.

The Arc implementation is nice, because defmemo creates a function that is automatically memoized.

(defmemo solve1 (digits length)
  (if (> digits length) 0
      (is digits 1) 1
      (* digits (+ (solve1 (- digits 1) (- length 1))
                   (solve1 digits (- length 1))))))

A faster dynamic programming solution

When trying to count the number of things satisfying a condition, it's often easier to figure out how many don't satisfy the condition. We can apply that strategy to this problem. The total number of 12-digit sequences made from 1 through 5 is obviously 5^12. We can just subtract the number of sequences that are missing one or more digits, and we get the answer we want.

How many sequences are missing 1 digit? S(4, 12) is the number of sequences containing at least one of 1 through 4, and by definition missing the digit 5. Of course, we need to consider sequences missing 4, 3, 2, or 1 as well. Likewise, there are S(4, 12) of each of these, so 5 * S(4, 12) sequences missing exactly one digit.

Next, sequences missing exactly 2 of the digits 1 through 5. S(3, 12) is the number of sequences containing at least one of 1 through 3, so they are missing 4 and 5. But we must also consider sequences missing 1 and 2, 1 and 3, and so on. There are 5 choose 2 pairs of digits that can be missing. Thus in total, (5 choose 2) * S(3, 12) sequences are missing exactly 2 of the digits.

Likewise, the number of sequences missing exactly 3 digits is (5 choose 3) * S(2, 12). The number of sequences missing exactly 4 digits is (5 choose 4) * S(1, 12). And obviously there are no sequences missing 5 digits.

Thus, we get the recurrence S(5, 12) = 5^12 - 5*S(4, 12) - 10*S(3, 12) - 10*S(2, 12) - 5*S(1, 12)

In general, we get the recurrence:

The same dynamic programming techniques and memoization can be applied to this. Note that the subproblems all have the same length, so there are many fewer subproblems than in the previous case. That is, instead of a 2-dimensional grid of subproblems, the subproblems are all in a single row.

In Python, the solution is:

memo = {}
def solve2(digits, length):
  if memo.has_key((digits, length)):
    return memo[(digits, length)]
  if digits > length:
    result = 0
  elif digits == 1:
    result = 1
  else:
    result = digits**length
    for i in range(1, digits):
      result -= sol2(i, length)*choose(digits, i)
  memo[(digits, length)] = result
  return result
And in Arc, the solution is similar. Instead of looping, I'm using a map and reduce just for variety. That is, the square brackets define a function to compute the choose and solve term. This is mapped over the range 1 to digits-1. Then the results are summed with reduce. As before, defmemo is used to memoize the results.
(def choose (m n)
  (/ (factorial m) (factorial n) (factorial (- m n))))

(defmemo solve2 (digits length)
  (if (> digits length) 0
      (is digits 1) 1
      (- (expt digits length)
         (reduce +
           (map [* (choose digits _) (solve2 _ length)]
             (range 1 (- digits 1)))))))

The mathematical exponential generating function solution

At this point, I dug out my old combinatorics textbook to figure out how to solve this problem. By using exponential generating functions and a bunch of algebra, we obtain a fairly tidy answer to the original problem:

Since the mathematics required to obtain this answer is somewhat complicated, I've written it up separately in Part II.

Discussion

This problem turned out to be a lot more involved than I thought. It gave me the opportunity to try out some new things in Python and Arc, and make use of my fading knowledge of combinatorics and generating functions.

The Python and Arc code was pretty similar. Python's yield construct was convenient. Arc's automatic memoization was also handy. On the whole, both languages were about equally powerful in solving these problems.

Acknowledgement: equations created through CodeCogs' equation editor

Solving edge-match puzzles with Arc and backtracking

I recently saw a tricky nine piece puzzle and wondered if I could use Arc to solve it. It turned out to be straightforward to solve using standard backtracking methods. This article describes how to solve the puzzle using the Arc language.

A edge-matching puzzle. The puzzle consists of nine square titles. The tiles must be placed so the pictures match up where two tiles meet. As the picture says, "Easy to play, but hard to solve." It turns out that this type of puzzle is called an edge-matching puzzle, and is NP-complete in general. For dozens of examples of these puzzles, see Rob's Puzzle Page.

Backtracking is a standard AI technique to find a solution to a problem step by step. If you reach a point where a solution is impossible, you backtrack a step and try the next possibility. Eventually, you will find all possible solutions. Backtracking is more efficient than brute-force testing of all possible solutions because you abandon unfruitful paths quickly.

To solve the puzzle by backtracking, we put down the first tile in the upper left. We then try a second tile in the upper middle. If it matches, we put it there. Then we try a third tile in the upper right. If it matches, we put it there. The process continues until a tile doesn't match, and then the algorithm backtracks. For instance, if the third tile doesn't match, we try a different third tile and continue. Eventually, after trying all possible third tiles, we backtrack and try a different second tile. And after trying all possible second tiles, we'll backtrack and try a new first tile. Thus, the algorithm will reach all possible solutions, but avoids investigating arrangements that can't possibly work.

The implementation

I represent each tile with four numbers indicating the pictures on each side. I give a praying mantis the number 1, a beetle 2, a dragonfly 3, and an ant 4. For the tail of the insect, I give it a negative value. Each tile is then a list of (top left right bottom). For instance, the upper-left tile is (2 1 -3 3). With this representation, tiles match if the value on one edge is the negative of the value on the other edge. I can then implement the list of tiles:
(with (mantis 1 beetle 2 dragonfly 3 ant 4)
  (= tiles (list
    (list beetle mantis (- dragonfly) dragonfly)
    (list (- beetle) dragonfly mantis (- ant))
    (list ant (- mantis) beetle (- beetle))
    (list (- dragonfly) (- ant) ant mantis)
    (list ant (- beetle) (- dragonfly) mantis)
    (list beetle (- mantis) (- ant) dragonfly)
    (list (- ant) (- dragonfly) beetle (- mantis))
    (list (- beetle) ant mantis (- dragonfly))
    (list mantis beetle (- dragonfly) ant))))
Next, I create some helper functions to access the edges of a tile, convert the integer to a string, and to prettyprint the tiles.
;; Return top/left/right/bottom entries of a tile
(def top (l) (l 0))
(def left (l) (l 1))
(def right (l) (l 2))
(def bottom (l) (l 3))

;; Convert an integer tile value to a displayable value
(def label (val)
  ((list "-ant" "-dgn" "-btl" "-man" "" " man" " btl" " dgn" " ant")
   (+ val 4)))

;; Print the tiles nicely
(def prettyprint (tiles (o w 3) (o h 3))
  (for y 0 (- h 1)
    (for part 0 4
      (for x 0 (- w 1)
        (withs (n (+ x (* y w)) tile (tiles n))
          (if
     (is part 0)
       (pr " ------------- ")
     (is part 1)
       (pr "|    " (label (top tile)) "     |")
     (is part 2)
       (pr "|" (label (left tile)) "    " (label (right tile)) " |")
     (is part 3)
       (pr "|    " (label (bottom tile)) "     |")
     (is part 4)
       (pr " ------------- "))))
      (prn))))
The prettyprint function uses optional arguments for width and height: (o w 3). This sets the width and height to 3 by default but allows it to be modified if desired. The part loop prints each tile row is printed as five lines. Now we can print out the starting tile set and verify that it matches the picture. I'll admit it's not extremely pretty, but it gets the job done:
arc> (prettyprint tiles)
 -------------  -------------  ------------- 
|     btl     ||    -btl     ||     ant     |
| man    -dgn || dgn     man ||-man     btl |
|     dgn     ||    -ant     ||    -btl     |
 -------------  -------------  ------------- 
 -------------  -------------  ------------- 
|    -dgn     ||     ant     ||     btl     |
|-ant     ant ||-btl    -dgn ||-man    -ant |
|     man     ||     man     ||     dgn     |
 -------------  -------------  ------------- 
 -------------  -------------  ------------- 
|    -ant     ||    -btl     ||     man     |
|-dgn     btl || ant     man || btl    -dgn |
|    -man     ||    -dgn     ||     ant     |
 -------------  -------------  ------------- 

Next is the meat of the solver. The first function is matches, which takes a grid of already-positioned tiles and a new tile, and tests if a particular edge of the new tile matches the existing tiles. (The grid is represented simply as a list of the tiles that have been put down so far.) This function is where all the annoying special cases get handled. First, the new tile may be along an edge, so there is nothing to match against. Second, the grid may not be filled in far enough for there to be anything to match against. Finally, if there is a grid tile to match against, and the value there is the negative of the new tile's value, then it matches. One interesting aspect of this function is that functions are passed in to it to select which edges (top/bottom/left/right) to match.

;; Test if one edge of a tile will fit into a grid of placed tiles successfully
;;  grid is the grid of placed tiles as a list of tiles e.g. ((1 3 4 2) nil (-1 2 -1 1) ...)
;;  gridedge is the edge of the grid cell to match (top/bottom/left/right)
;;  gridx is the x coordinate of the grid tile
;;  gridy is the y coordinate of the grid tile
;;  newedge is the edge of the new tile to match (top/bottom/left/right)
;;  newtile is the new tile e.g. (1 2 -1 -3)
;;  w is the width of the grid
;;  h is the height of the grid
(def matches (grid gridedge gridx gridy newedge newtile w h)
  (let n (+ gridx (* gridy w))
    (or
      (< gridx 0) ; nothing to left of tile to match, so matches by default
      (< gridy 0) ; tile is at top of grid, so matches
      (>= gridx w) ; tile is at right of grid, so matches
      (>= gridy h) ; tile is at bottom of grid, so matches
      (>= n (len grid)) ; beyond grid of tiles, so matches
      (no (grid n)) ; no tile placed in the grid at that position
      ; Finally, compare the two edges which should be opposite values
      (is (- (gridedge (grid n))) (newedge newtile)))))
With that method implemented, it's easy to test if a new tile will fit into the grid. We simply test that all four edges match against the existing grid. We don't need to worry about the edges of the puzzle, because the previous method handles them:
;; Test if a tile will fit into a grid of placed tiles successfully
;;   grid is the grid of tiles
;;   newtile is the tile to place into the grid
;;   (x, y) is the position to place the tile
(def is_okay (grid newtile x y w h)
    (and 
      (matches grid right (- x 1) y left newtile w h)
      (matches grid left (+ x 1) y right newtile w h)
      (matches grid top x (+ y 1) bottom newtile w h)
      (matches grid bottom x (- y 1) top newtile w h)))
Now we can implement the actual solver. The functions solve1 and try recursively call each other. solve1 calls try with each candidate tile in each possible orientation. If the tile fits, try updates the grid of placed tiles and calls solve1 to continue solving. Otherwise, the algorithm backtracks and solve1 tries the next possible tile. My main problem was accumulating all the solutions properly; on my first try, the solutions were wrapped in 9 layers of parentheses! One other thing to note is the conversion from a linear (0-8) position to an x/y grid position.

A couple refactorings are left as an exercise to the reader. The code to try all four rotations of the tile is a bit repetitive and could probably be cleaned up. More interesting would be to turn the backtracking solver into a general solver, with the puzzle just one instance of a problem.

;; grid is a list of tiles already placed
;; candidates is a list of tiles yet to be placed
;; nextpos is the next position to place a tile (0 to 8)
;; w and h are the dimensions of the puzzle
(def solve1 (grid candidates nextpos w h)
  (if
    (no candidates)
      (list grid) ; Success!
    (mappend idfn (accum addfn ; Collect results and flatten
      (each candidate candidates
        (addfn (try grid candidate (rem candidate candidates) nextpos w h))
        (addfn (try grid (rotate candidate) (rem candidate candidates) nextpos w h))
        (addfn (try grid (rotate (rotate candidate)) (rem candidate candidates) nextpos w h))
        (addfn (try grid (rotate (rotate (rotate candidate))) (rem candidate candidates) nextpos w h)))))))

; Helper to append elt to list
(def append (lst elt) (join lst (list elt)))

;; Try adding a candidate tile to the grid, and recurse if successful.
;; grid is a list of tiles already placed
;; candidate is the tile we are trying
;; candidates is a list of tiles yet to be placed (excluding candidate)
;; nextpos is the next position to place a tile (0 to 8)
;; w and h are the dimensions of the puzzle
(def try (grid candidate candidates nextpos w h)
  (if (is_okay grid candidate (mod nextpos w) (trunc (/ nextpos w)) w h)
    (solve1 (append grid candidate) candidates (+ nextpos 1) w h)))

The final step is a wrapper function to initialize the grid:

(def solve (tiles (o w 3) (o h 3))
  (solve1 nil tiles 0 w h))

With all these pieces, we can finally solve the problem, and obtain four solutions (just rotations of one solution):

arc> (solve tiles)
(((1 -2 -4 3) (2 4 1 -3) (4 -1 2 -2) (-3 1 4 -2) (3 -4 -1 2) (2 1 -3 3) (2 -4 -1 -3) (-2 1 4 -3) (-3 -4 4 1))
((2 4 -2 -1) (-3 2 3 1) (4 -3 1 -4) (1 2 -3 4) (-1 3 2 -4) (4 -2 -3 1) (-4 1 3 -2) (4 -3 -2 1) (-1 2 -3 -4))
((1 4 -4 -3) (-3 4 1 -2) (-3 -1 -4 2) (3 -3 1 2) (2 -1 -4 3) (-2 4 1 -3) (-2 2 -1 4) (-3 1 4 2) (3 -4 -2 1))
((-4 -3 2 -1) (1 -2 -3 4) (-2 3 1 -4) (1 -3 -2 4) (-4 2 3 -1) (4 -3 2 1) (-4 1 -3 4) (1 3 2 -3) (-1 -2 4 2)))
arc> (prettyprint (that 0))
 -------------  -------------  ------------- 
|     man     ||     btl     ||     ant     |
|-btl    -ant || ant     man ||-man     btl |
|     dgn     ||    -dgn     ||    -btl     |
 -------------  -------------  ------------- 
 -------------  -------------  ------------- 
|    -dgn     ||     dgn     ||     btl     |
| man     ant ||-ant    -man || man    -dgn |
|    -btl     ||     btl     ||     dgn     |
 -------------  -------------  ------------- 
 -------------  -------------  ------------- 
|     btl     ||    -btl     ||    -dgn     |
|-ant    -man || man     ant ||-ant     ant |
|    -dgn     ||    -dgn     ||     man     |
 -------------  -------------  ------------- 
I've used gimp on the original image to display the solution. I've labeled the original tiles A-I so you can see how the solution relates to the original image. Using Arc to display a solution as an image is left as an exercise to the reader :-) But seriously, this is where using a language with extensive libraries would be beneficial, such as Python's PIL imaging library.

solution

Theoretical analysis

I'll take a quick look at the theory of the puzzle. The tiles can be placed in 9! (9 factorial) different locations, and each tile can be oriented 4 ways, for a total of 9! * 4^9 possible arrangements of the tiles, which is about 95 billion combinations. Clearly this puzzle is hard to solve by randomly trying tiles.
arc> (* (apply * (range 1 9)) (expt 4 9))
95126814720

We can do a back of the envelope calculation to see how many solutions we can expect. If you put the tiles down randomly, there are 12 edge constraints that must be satisfied. Since only one of the 8 possibilities matches, the chance of all the edges matching randomly is 1 in 8^12 or 68719476736. Dividing this into the 95 billion possible arrangements yields 1.38 solutions for an arbitrary random puzzle. (If this number were very large, then it would be hard to create a puzzle with only one solution.)

We can test this calculation experimentally by seeing how many solutions there are are for a random puzzle. First we make a function to create a random puzzle, consisting of 9 tiles, each with 4 random values. Then we solve 100 of these:

arc> (def randtiles ()
  (n-of 9 (n-of 4 (rand-elt '(1 2 3 4 -1 -2 -3 -4)))))
arc> (n-of 100 (len (solve (randtiles))))
(0 0 0 8 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 8 0 0 4 0
0 0 4 8 0 8 0 0 0 0 0 0 0 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0
0 0 32 8 0 0 0 0 0 0 0 0 12 0 0 0 0 0 0 0 0 0 0 0 8 0 0 4 0 0 32)
arc> (apply + that)
196
Out of 100 random puzzles, there are 196 solutions, which is close to the 1.38 solutions per puzzle estimate above. (that is an obscure Arc variable that refers to the previous result.) Note that only 16% of the puzzles have solutions, though. Part of the explanation is that solutions always come in groups of 4, since the entire puzzle can be rotated 90 degrees into four different orientations. Solving 100 puzzles took 146 seconds, by the way.

Another interesting experiment is to add a counter to try to see how many tile combinations the solver actually tries. The result is 66384, which is much smaller than the 95 billion potential possibilities. This suggests that the puzzle is solvable manually by trial-and-error with backtracking; at a second per tile, it would probably take you 4.6 hours to get the first solution.

Conclusions

Solving the puzzle with Arc was easier than I expected, and I ran into only minor problems along the way. The code is available as puzzle.arc.

Using Arc to decode Venter's secret DNA watermark

Recently Craig Venter (who decoded the human genome) created a synthetic bacterium. The J. Craig Venter Institute (JCVI) took a bacterium's DNA sequence as a computer file, modified it, made physical DNA from this sequence, and stuck this DNA into a cell, which then reproduced under control of the new DNA to create a new bacterium. This is a really cool result, since it shows you can create the DNA of an organism entirely from scratch. (I wouldn't exactly call it synthetic life though, since it does require an existing cell to get going.) Although this project took 10 years to complete, I'm sure it's only a matter of time before you will be able to send a data file to some company and get the resulting cells sent back to you.

One interesting feature of this synthetic bacterium is it includes four "watermarks", special sequences of DNA that prove this bacterium was created from the data file, and is not natural. However, they didn't reveal how the watermarks were encoded. The DNA sequences were published (GTTCGAATATTT and so on), but how to get the meaning out of this was left a puzzle. For detailed background on the watermarks, see Singularity Hub. I broke the code (as I described earlier) and found the names of the authors, some quotations, and the following hidden web page. This seems like science fiction, but it's true. There's actually a new bacterium that has a web page encoded in its DNA:
DNA web page
I contacted the JCVI using the link and found out I was the 31st person to break the code, which is a bit disappointing. I haven't seen details elsewhere on how the code works, so I'll provide the details. (Spoilers ahead if you want to break the code yourself.) I originally used Python to break the code, but since this is nominally an Arc blog, I'll show how it can be done using the Arc language.

The oversimplified introduction to DNA

DNA double helix Perhaps I should give the crash course in DNA at this point. The genetic instructions for all organisms are contained in very, very long molecules called DNA. For our purposes, DNA can be described as a sequence of four different letters: C, G, A, and T. The human genome is 3 billion base pairs in 23 chromosome pairs, so your DNA can be described as a pair of sequences of 3 billion C's, G's, A's, and T's.

Watson and Crick won the Nobel Prize for figuring out the structure of DNA and how it encodes genes. Your body is made up of important things such as enzymes, which are made up of carefully-constructed proteins, which are made up of special sequences of 18 different amino acids. Your genes, which are parts of the DNA, specify these amino acid sequences. Each three DNA bases specify a particular amino acid in the sequence. For instance, the DNA sequence CTATGG specifies the amino acids leucine and tryptophan because CTA indicates leucine, and TGG indicates tryptophan. Crick and Watson and Rosalind Franklin discovered the genetic code that associates a particular amino acid with each of the 64 possible DNA triples. I'm omitting lots of interesting details here, but the key point is that each three DNA symbols specifies one amino acid.

Encoding text in DNA

Now, the puzzle is to figure out how JCVI encoded text in the synthetic bacterium's DNA. The obvious extension is instead of assigning an amino acid to each of the 64 DNA triples, assign a character to it. (I should admit that this was only obvious to me in retrospect, as I explained in my earlier blog posting.) Thus if the DNA sequence is GTTCGA, then GTT will be one letter and CGA will be another, and so on. Now we just need to figure out what letter goes with each DNA triple.

If we know some text and the DNA that goes with it, it is straightforward to crack the code that maps between the characters and the DNA triples. For instance, if we happen to know that the text "LIFE" corresponds to the DNA "AAC CTG GGC TAA", then we can conclude that AAA goes to L, CTG goes to I, GGC goes to F, and TAA goes to E. But how do we get started?

Conveniently, the Singularity Hub article gives the quotes that appear in the DNA, as reported by JCVI:

"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE."
"SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."
"WHAT I CANNOT BUILD, I CANNOT UNDERSTAND."
We can try matching the quotes against each position in the DNA and see if there is any position that works. A match can fail in two ways. First, if the same DNA triple corresponds to two different letters, something must be wrong. For instance, if we try to match "LIFE" against "AAC CTG GGC AAC", we would conclude that AAC means both L and E. Second, if the same letter corresponds to two DNA triples, we can reject it. For instance, if we try to match "ERR" against "TAA CTA GTC", then both CTA and GTC would mean R. (Interestingly, the real genetic code has this second type of ambiguity. Because there are only 18 amino acids and 64 DNA triples, many DNA triples indicate the same amino acid.)

Using Arc to break the code

To break the code, first download the DNA sequences of the four watermarks and assign them to variables w1, w2, w3, w4. (Download full code here.)
(= w1 "GTTCGAATATTTCTATAGCTGTACA...")
(= w2 "CAACTGGCAGCATAAAACATATAGA...")
(= w3 "TTTAACCATATTTAAATATCATCCT...")
(= w4 "TTTCATTGCTGATCACTGTAGATAT...")
Next, lets break the four watermarks into triples by first converting to a list and then taking triples of length 3, which we call t1 through t4:
(def strtolist (s) (accum accfn (each x s (accfn (string x)))))
(def tripleize (s) (map string (tuples (strtolist s) 3)))

(= t1 (tripleize w1))
(= t2 (tripleize w2))
(= t3 (tripleize w3))
(= t4 (tripleize w4))
The next function is the most important. It takes triples and text, matches the triples against the text, and generates a table that maps from each triple to the corresponding characters. If it runs into a problem (either two characters assigned to the same triple, or the same character assigned to two triples) then it fails. Otherwise it returns the code table. It uses tail recursion to scan through the triples and input text together.
; Match the characters in text against the triples.
; codetable is a table mapping triples to characters.
; offset is the offset into text
; Return codetable or nil if no match.
(def matchtext (triples text (o codetable (table)) (o offset 0))
  (if (>= offset (len text)) codetable ; success
      (with (trip (car triples) ch (text offset))
  ; if ch is assigned to something else in the table
  ; then not a match
  (if (and (mem ch (vals codetable))
           (isnt (codetable trip) ch))
        nil
             (empty (codetable trip))
        (do (= (codetable trip) ch) ; new char
           (matchtext (cdr triples) text codetable (+ 1 offset)))
             (isnt (codetable trip) ch)
        nil ; mismatch
      (matchtext (cdr triples) text codetable (+ 1 offset))))))
Finally, the function findtext finds a match anywhere in the triple list. In other words, the previous function matchtext assumes the start of the triples corresponds to the start of the text. But findtext tries to find a match at any point in the triples. It calls matchtext, and if there's no match then it drops the first triple (using cdr) and calls itself recursively, until it runs out of triples.
(def findtext (triples text)
  (if (< (len triples) (len text)) nil
    (let result (matchtext triples text)
      (if result result
        (findtext (cdr triples) text)))))
The Singularity Hub article said that the DNA contains the quote:
"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE." - JAMES JOYCE 
Let's start by just looking for "LIFE OUT OF LIFE", since we're unlikely to get LIFE to match twice just by chance:
arc> (findtext t1 "LIFE OUT OF LIFE")
nil
No match in the first watermark, so let's try the second.
arc> (findtext t2 "LIFE OUT OF LIFE")
#hash(("CGT" . #\O) ("TGA" . #\T) ("CTG" . #\I) ("ATA" . #\space) ("TAA" . #\E) ("AAC" . #\L) ("TCC" . #\U) ("GGC" . #\F))
How about that! Looks like a match in watermark 2, with DNA "AAC" corresponding to L, "CGT" corresponding to "I", "GGC" corresponding to "F", and so on. (The Arc format for hash tables is pretty ugly, but hopefully you can see that in the output.) Let's try matching the full quote and store the table in code2:
arc> (= code2 (findtext t2 "TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE."))
#hash(("TCC" . #\U) ("TCA" . #\H) ("TTG" . #\V) ("CTG" . #\I) ("ACA" . #\P) ("CTA" . #\R) ("CGA" . #\.) ("ATA" . #\space) ("CGT" . #\O) ("TTT" . #\C) ("TGA" . #\T) ("TAA" . #\E) ("AAC" . #\L) ("CAA" . #\M) ("GTG" . #\,) ("TAG" . #\A) ("GGC" . #\F))
Likewise, we can try matching the other quotes:
arc> (= code3 (findtext t3 "SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."))
#hash(("CTA" . #\R) ("TCA" . #\H) ("CTG" . #\I) ("GCT" . #\S) ("CGA" . #\.) ("ATA" . #\space) ("CAT" . #\Y) ("CGT" . #\O) ("TCC" . #\U) ("AGT" . #\B) ("TGA" . #\T) ("TAA" . #\E) ("TGC" . #\N) ("TAC" . #\G) ("CAA" . #\M) ("GTG" . #\,) ("TAG" . #\A))
arc> (= code4 (findtext t4 "WHAT I CANNOT BUILD, I CANNOT UNDERSTAND"))
#hash(("TCC" . #\U) ("TCA" . #\H) ("ATT" . #\D) ("CTG" . #\I) ("GCT" . #\S) ("TTT" . #\C) ("ATA" . #\space) ("TAA" . #\E) ("CGT" . #\O) ("CTA" . #\R) ("TGA" . #\T) ("AGT" . #\B) ("AAC" . #\L) ("TGC" . #\N) ("GTG" . #\,) ("TAG" . #\A) ("GTC" . #\W))
Happily, they all decode the same triples to the same letters, or else we would have a problem. We can merge these three decode tables into one:
arc> (= code (listtab (join (tablist code2) (tablist code3) (tablist code4))))#hash(("TCC" . #\U) ("TCA" . #\H) ("CTA" . #\R) ("CGT" . #\O) ("TGA" . #\T) ("CGA" . #\.) ("TAA" . #\E) ("AAC" . #\L) ("TAC" . #\G) ("TAG" . #\A) ("GGC" . #\F) ("ACA" . #\P) ("ATT" . #\D) ("TTG" . #\V) ("CTG" . #\I) ("GCT" . #\S) ("TTT" . #\C) ("CAT" . #\Y) ("ATA" . #\space) ("AGT" . #\B) ("GTG" . #\,) ("TGC" . #\N) ("CAA" . #\M) ("GTC" . #\W))
Now, let's make a decode function that will apply the string to the unknown DNA and see what we get. (We'll leave unknown triples unconverted.)
(def decode (decodemap triples)
  (string (accum accfn (each triple triples (accfn
      (if (decodemap triple)
        (decodemap triple)
        (string "(" triple ")" )))))))
arc> (decode code t1)
"(GTT). CRAIG VENTER INSTITUTE (ACT)(TCT)(TCT)(GTA)(GGG)ABCDEFGHI(GTT)(GCA)LMNOP(TTA)RSTUVW(GGT)Y(TGG)(GGG) (TCT)(CTT)(ACT)(AAT)(AGA)(GCG)(GCC)(TAT)(CGC)(GTA)(TTC)(TCG)(CCG)(GAC)(CCC)(CCT)(CTC)(CCA)(CAC)(CAG)(CGG)(TGT)(AGC)(ATC)(ACC)(AAG)(AAA)(ATG)(AGG)(GGA)(ACG)(GAT)(GAG)(GAA).,(GGG)SYNTHETIC GENOMICS, INC.(GGG)(CGG)(GAG)DOCTYPE HTML(AGC)(CGG)HTML(AGC)(CGG)HEAD(AGC)(CGG)TITLE(AGC)GENOME TEAM(CGG)(CAC)TITLE(AGC)(CGG)(CAC)HEAD(AGC)(CGG)BODY(AGC)(CGG)A HREF(CCA)(GGA)HTTP(CAG)(CAC)(CAC)WWW.(GTT)CVI.ORG(CAC)(GGA)(AGC)THE (GTT)CVI(CGG)(CAC)A(AGC)(CGG)P(AGC)PROVE YOU(GAA)VE DECODED THIS WATERMAR(GCA) BY EMAILING US (CGG)A HREF(CCA)(GGA)MAILTO(CAG)MRO(TTA)STI(TGG)(TCG)(GTT)CVI.ORG(GGA)(AGC)HERE(GAG)(CGG)(CAC)A(AGC)(CGG)(CAC)P(AGC)(CGG)(CAC)BODY(AGC)(CGG)(CAC)HTML(AGC)"
It looks like we're doing pretty well with the decoding, as there's a lot of recognizable text and some HTML in there. There's also conveniently the entire alphabet as a decoding aid. From this, we can fill in a lot of the gaps, e.g. GTT is J and GCA is K. From the HTML tags we can figure out angle brackets, quotes, and slash. We can guess that there are numbers in there and figure out that ACT TCT TCT GTA is 2009. These deductions can be added manually to the decode table:
arc> (= (code "GTT") #\J)
arc> (= (code "GCA") #\K)
arc> (= (code "ACT") #\2)
...
After a couple cycles of adding missing characters and looking at the decodings, we get the almost-complete decodings of the four watermarks:

The contents of the watermarks

J. CRAIG VENTER INSTITUTE 2009
ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789(TTC)@(CCG)(GAC)-(CCT)(CTC)=/:<(TGT)>(ATC)(ACC)(AAG)(AAA)(ATG)(AGG)"(ACG)(GAT)!'.,
SYNTHETIC GENOMICS, INC.
<!DOCTYPE HTML><HTML><HEAD><TITLE>GENOME TEAM</TITLE></HEAD><BODY><A HREF="HTTP://WWW.JCVI.ORG/">THE JCVI</A><P>PROVE YOU'VE DECODED THIS WATERMARK BY EMAILING US <A HREF="MAILTO:[email protected]">HERE!</A></P></BODY></HTML>

MIKKEL ALGIRE, MICHAEL MONTAGUE, SANJAY VASHEE, CAROLE LARTIGUE, CHUCK MERRYMAN, NINA ALPEROVICH, NACYRA ASSAD-GARCIA, GWYN BENDERS, RAY-YUAN CHUANG, EVGENIA DENISOVA, DANIEL GIBSON, JOHN GLASS, ZHI-QING QI.
"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE." - JAMES JOYCE

CLYDE HUTCHISON, ADRIANA JIGA, RADHA KRISHNAKUMAR, JAN MOY, MONZIA MOODIE, MARVIN FRAZIER, HOLLY BADEN-TILSON, JASON MITCHELL, DANA BUSAM, JUSTIN JOHNSON, LAKSHMI DEVI VISWANATHAN, JESSICA HOSTETLER, ROBERT FRIEDMAN, VLADIMIR NOSKOV, JAYSHREE ZAVERI.
"SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."

CYNTHIA ANDREWS-PFANNKOCH, QUANG PHAN, LI MA, HAMILTON SMITH, ADI RAMON, CHRISTIAN TAGWERKER, J CRAIG VENTER, EULA WILTURNER, LEI YOUNG, SHIBU YOOSEPH, PRABHA IYER, TIM STOCKWELL, DIANA RADUNE, BRIDGET SZCZYPINSKI, SCOTT DURKIN, NADIA FEDOROVA, JAVIER QUINONES, HANNA TEKLEAB.
"WHAT I CANNOT BUILD, I CANNOT UNDERSTAND." - RICHARD FEYNMAN

The first watermark consists of a copyright-like statement, a list of all the characters, and the hidden HTML page (which I showed above.) The second, third, and fourth watermarks consist of a list of the authors and three quotations.

Note that there are 14 undecoded triples that only appear once in the list of characters. They aren't in ASCII order, keyboard order, or any other order I can figure out, so I can't determine what they are, but I assume they are missing punctuation and special characters.

The DNA decoding table

The following table summarizes the 'secret' DNA to character code:
AAA ? AAC L AAG ? AAT 3
ACA P ACC ? ACG ? ACT 2
AGA 4 AGC > AGG ? AGT B
ATA space ATC ? ATG ? ATT D
CAA M CAC / CAG : CAT Y
CCA = CCC - CCG ? CCT ?
CGA . CGC 8 CGG < CGT O
CTA R CTC ? CTG I CTT 1
GAA ' GAC ? GAG ! GAT ?
GCA K GCC 6 GCG 5 GCT S
GGA " GGC F GGG newline GGT X
GTA 9 GTC W GTG , GTT J
TAA E TAC G TAG A TAT 7
TCA H TCC U TCG @ TCT 0
TGA T TGC N TGG Z TGT ?
TTA Q TTC ? TTG V TTT C
As far as I can tell, this table is in random order. I analyzed it a bunch of ways, from character frequency and DNA frequency to ASCII order and keyboard order, and couldn't figure out any rhyme or reason to it. I was hoping to find either some structure or another coded message, but I couldn't find anything.

More on breaking the code

It was convenient that the JCVI said in advance what quotations were in the DNA, making it easier to break the code. But could the code still be broken if they hadn't?

One way to break the code is to look at statistics of how often different triples appear. We count the triples, convert the table to a list, and then sort the list. In the sort we use a special comparison function that compares the counts, to sort on the counts, not on the triples.

arc> (sort
  (fn ((k1 v1) (k2 v2)) (> v1 v2))
  (tablist (counts t2)))
(("ATA" 41) ("TAG" 27) ("TAA" 25) ("CTG" 18) ("TGC" 16) ("GTG" 16) ("CTA" 15) ("CGT" 14) ("AAC" 13) ("TGA" 10) ("TTT" 10) ("GCT" 10) ("TAC" 10) ("TCA" 8) ("CAA" 7) ("CAT" 7) ("TCC" 7) ("TTG" 5) ("GGC" 4) ("GTT" 4) ("ATT" 4) ("CCC" 4) ("GCA" 3) ("AGT" 2) ("TTA" 2) ("ACA" 2) ("CGA" 2) ("GGA" 2) ("GTC" 1) ("TGG" 1) ("GGG" 1))
This tells us that ATA appears 41 times in the second watermark, TAG 27 times, and so on. If we assume that this encodes English text, then the most common characters will be space, E, T, A, O, and so on. Then it's a matter of trial-and-error trying the high-frequency letters for the high-frequency triples until we find a combination that yields real words. (It's just like solving a newspaper cryptogram.) You'll notice that the high-frequency triples do turn out to match high-frequency letters, but not in the exact order. (I've written before on simple cryptanalysis with Arc.)

Another possible method is to guess that a phrase such as "CRAIG VENTER" appears in the solution, and try to match that against the triples. This turns out to be the case. It doesn't give a lot of letters to work on, but it's a start.

Arc: still bad for exploratory programming

A couple years ago I wrote that Arc is bad for exploratory programming, which turned out to be hugely controversial. I did this DNA exploration using both Python and Arc, and found Python to be much better for most of the reasons I described in my earlier article.
  • Libraries: Matching DNA triples was trivial in Python because I could use regular expressions. Arc doesn't have regular expressions (unless you use a nonstandard variant such as Anarki). Another issue was when I wanted to do graphical display of watermark contents; Arc has no graphics support.
  • Performance: for the sorts of exploratory programming I do, performance is important. For instance, one thing I did when trying to figure out the code was match all substrings of one watermark against another, to see if there were commonalities. This is O(N^3) and was tolerably fast in Python, but Arc would be too painful.
  • Ease of and speed of programming. A lot of the programming speed benefit is that I have more familiarity with Python, but while programming in Arc I would constantly get derailed by cryptic error messages, or trying to figure out how to do simple things like breaking a string into triples. It doesn't help that most of the Arc documentation I wrote myself.
To summarize, I started off in Arc, switched to Python when I realized it would take me way too long to figure out the DNA code using Arc, and then went back to Arc for this writeup after I figured out what I wanted to do. In other words, Python was much better for the exploratory part.

Thoughts on the DNA coding technique

Part of the gene map.  Watermark 2b is shown in the DNA, with an arc gene below it I see some potential ways that the DNA coding technique could be improved and extended. Since the the full details of the DNA coding technique haven't been published by the JCVI yet, they may have already considered these ideas.

Being able to embed text in the DNA of a living organism is extremely cool. However, I'm not sure how practical it is. The data density in DNA is very high, maybe 500 times that of a hard drive (ref), but most of the DNA is devoted to keeping the bacterium alive and only about 1K of text is stored in the watermarks.

I received email from JCVI that said the coding mechanism also supports Java, math (LaTeX?), and Perl as well as HTML. Embedding Perl code in a living organism seems even more crazy than text or HTML.

If encoding text in DNA catches on, I'd expect that error correction would be necessary to handle mutations. An error-correction code like Reed-Solomon won't really work because DNA can suffer deletions and insertions as well as mutations, so you'd need some sort of generalized deletion/insertion correcting code.

I just know that a 6-bit code isn't going to be enough. Sooner or later people will want lower-case, accented character, Japanese, etc. So you might as well come up with a way to put Unicode into DNA, probably as UTF-8. And people will want arbitrary byte data. My approach would be to use four DNA base pairs instead of three, and encode bytes. Simply assign A=00, C=01, G=10, and T=11 (for instance), and encode your bytes.

If I were writing an RFC for data storage in DNA, I'd suggest that most of the time you'd want to store the actual data on the web, and just store a link in the DNA. (Similar to how a QR code or tinyurl can link to a website.) The encoded link could be surrounded by a special DNA sequence that indicates a tiny DNA link code. This sequence could also be used to find the link code by using PCR, so you don't need to do a genome sequence on the entire organism to find the watermark. (The existing watermarks are in fact surrounded by fixed sequences, which may be for that purpose.)

I think there should also be some way to tag and structure data that is stored DNA, to indicate what is identification, description, copyright, HTML, owners, and so forth. XML seems like an obvious choice, but embedding XML in living organisms makes me queasy.

Conclusions

Being able to fabricate a living organism from a DNA computer file is an amazing accomplishment of Craig Venter and the JCVI. Embedding the text of a web page in a living organism seems like science fiction, but it's actually happened. Of course the bacterium doesn't actually serve the web page... yet.

Sources

Decoding the secret DNA code in Venter's synthetic bacterium

Craig Venter recently created a synthetic bacterium with a secret message encoded in the DNA. This is described in more detail at singularityhub. (My article will make much more sense if you read the singularityhub article.)

I tried unsuccessfully to decode the secret message yesterday. I realized the problem was that I was thinking like a computer scientist, not a biologist. Once I started thinking like a biologist, the solution was obvious.

I won't spoil your fun by giving away the full answer (at least for now), but I'll mention a few things. [Update: I've written up the details here.] There are four watermarks. The first contains HTML. (That's right! They put actual HTML - complete with DOCTYPE - into the DNA of a living creature. That's just crazy!)
The HTML in the genome
I sent email to the link, and they told me I was the 31st person to break the code. I guess I'll need to be faster next time.

The second, third, and fourth watermarks each contain co-authors and an interesting scientific quote. The first watermark also contains the full character set in order, to help verify the decoding, although I was unable to decode a few of the special characters because they only appear once.

I found the following quotes (which match the quotes given by JCVI and in the singularityhub article):
"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE." - JAMES JOYCE
"SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."
WHAT I CANNOT BUILD, I CANNOT UNDERSTAND." - RICHARD FEYNMAN
Note that the quotes are one per watermark, not all in the fourth watermark as the singularityhub article claims.

If you want to try decoding the code, the DNA of the watermarks is at Science Magazine. The complete genome is a NCBI, in case you want it.

My loyal readers will be disappointed that I didn't use the Arc language to decode this, unlike my previous cryptanalysis and genome adventures. I started out with Arc, but my existing Arc cryptanalysis tools didn't do the job. I switched to Python since I'm faster in Python and I wanted to do some heavy-duty graphical analysis.

I tried a bunch of things that didn't work for analysis. I tried autocorrelation to try to determine the length of each code element, but didn't get much of a signal. I tried looking for common substrings between the watermarks both visually and symbolically, and I found some substantial strings that appeared multiple times, but that didn't really help. I discovered that the sequence "CGAT" never appears in the watermarks, but that was entirely useless. I tried applying the standard genetic code (mapping DNA triples to amino acids), since apparently Craig Venter used that in the past, but didn't come up with anything. I tried to make various estimates of how many bits of data were in the watermarks and how many bits of data would be required using various encodings. I briefly considered the possibility that the DNA encoded vectors that would draw out the answers (I think a Pickover book suggested analyzing DNA in this way), or that the DNA drew out text as a raster, but decided the bit density would be too low and didn't actually try this.

Then I asked myself: "How would I, myself, encode text in DNA?" I figured Unicode would be overkill, so probably it would be best to use either ASCII or a 6-bit encoding (maybe base-64). Then each pair of bits could be encoded with one of C, G, A, and T. Based on the singularityhub article I assumed the word "TRIUMPH" appeared, so I took "riumph" (to avoid possible capitalization issues with the first letter), encoded it as 6-bits or 8-bits, with any possible starting value for 'a', and the 24 possibilities for mapping C/G/A/T to 2-bits, and then tried big-endian and little-endian, to yield 15360 possible DNA encodings for the word. I was certain that one of them would have to appear. However, a brute force search found that none of them were in the DNA. At that point, I started trying even more implausible encodings, with no success, and started to worry that maybe the data was zipped first, which would would make decoding almost impossible. I started to think about variable-length encodings (such as Huffman encodings), and then gave up for the night. I was wondering if it would be worth a blog post on things that don't work for decrypting the code or if I should quit in silence.

In the middle of the night it occurred to me that the designers of the code were biologists, so I should think like a biologist not a computer scientist to figure out the code. With this perspective, it was obvious how to extend the genetic code to handle a larger character set. The problem was there were 64! possibilities. Or maybe 64^26 or 26 choose 64 or something like that; the exact number doesn't matter, but it's way too many to brute force like I did earlier. However, based on the quotes in the singularityhub article, I assumed that the substring " OUT O" appeared in the string, made a regular expression, and boom, it actually matched the watermark, confirming my theory. Then it was a quick matter of extending the technique to decode the full watermarks. As far as I can tell, the encoding used was arbitrary, and does not have any fundamental meaning. As I mentioned earlier, I'm leaving out the details for now, since I don't want to ruin anyone else's fun in decoding the message.

In conclusion, decoding the secret message was a fun challenge.

Arc beats Python for a web-controlled stereo

I recently set up a web interface to my stereo, implementing it in both Python and Arc. I found that, much to my surprise, writing the Arc implementation was much easier than the Python implementation in this case.

In my stereo controller system, I go to a web page that has an image of the appropriate remote control, and click the button. The Javascript in the web page sends a hex code back to the server, which sends the code to an Arduino microcontroller board, which converts the code into an infrared signal. The stereo interprets this signal as a command from the remote control and performs the desired action. It sounds complicated, but it responds quickly, even if I use the browser on my cell phone. I wrote up more details on the system earlier.
Remote control on G1 phone
The web server for this project is fairly straightforward: it needs to serve some static HTML pages and images, and needs to receive POST messages and forward the data over the serial port to the Arduino. I originally wrote the code in Python:

import cgi
import serial

from BaseHTTPServer import HTTPServer
from SimpleHTTPServer import SimpleHTTPRequestHandler

class MyHandler(SimpleHTTPRequestHandler):
  def do_POST(self):
    if self.path == '/arduino':
      form = cgi.FieldStorage(fp=self.rfile, headers=self.headers,
        environ={'REQUEST_METHOD':'POST'})
      code = form['code'].value
      print 'Sent:', code
      arduino.write(code)
      self.send_response(200)
      self.send_header('Content-type', 'text/html')
      return
    return self.do_GET()

arduino = serial.Serial('/dev/ttyUSB0', 9600, timeout=2)
server = HTTPServer(('', 8080), MyHandler).serve_forever()
The Python web server is based on SimpleHTTPServer. To handle POST requests to the /arduino URL, I made a simple handler subclass that pulls the data out of the response, sends it to the serial port, and returns a blank response. The last two lines open the serial port and start up the server. The static web page serving comes "for free."

After writing this, I decided to try writing a version in Arc:

(= srvops* (table))

(= arduino-serial (outfile "/dev/ttyUSB0" 'binary))

(defopr || req "/index.html")

(defop arduino req (w/stdout (stderr)
  (let code (arg req "code")
    (write code arduino-serial)
    (prn "Sent code:" code))))

(thread (serve 8080))
Since you may not be familiar with the Arc web server, I'll give a brief explanation. The first line clears out the server's default pages (login, whoami, prompt, etc) that make sense for news.yc, but not for me. The next line opens the serial port. Next, the mysterious || redirects the home page. The next four lines are the handler for "/arduino": the "code" parameter is pulled out of the request (req). The code is written to the serial port, and printed to the log. The last line starts a serving thread on port 8080.

I'm not going to count tokens since that's a silly metric, but the Arc code is a clear winner here. The first Python disadvantage is the hideous cgi.FieldStorage call to parse the POST data; apparently this is the way to do it, but it took me considerable time to get this right. The second Python disadvantage is the necessity to set up the return code and content-type for the response. I ended up with a browser-specific bug when I didn't have this right, since some Javascript interpreters expect a particular response. Arc automatically sends a valid response. The main Arc disadvantage is there's no real serial port support; the Arc code just opens the port and hopes for the best. On the other hand, Python's serial library lets you set the baud rate, timeouts, and other parameters.

Arc and Python have a few more minor differences. Python requires imports, which Arc doesn't. Arc has the silly clearing of srvops* to avoid unwanted pages. Starting the server thread is simpler in Arc but not as flexible.

To summarize, I found the Python code difficult and error-prone to write, and the Arc code was shorter and just worked. This was a surprise to me; usually I find Python straightforward and Arc gives me pain when I have to write a bunch of code for some trivial thing it doesn't support. (For example, generating a time string in my temperature monitoring.) In this case, however, Arc came out ahead. Perhaps this is because Arc's flagship application is a web server, so it handles web serving well.

Obviously this is a fairly trivial program - the complexity is actually in the Javascript code and the C++ code on the Arduino - so I'm not claiming this as an overall victory for Arc over Python. But it's still interesting to find Arc came out ahead in this case.

Simple genome analysis with Arc: In which I examine the XMRV genome and discover little of interest

I decided to examine the genome of the retrovirus XMRV to see what I could learn. This virus, with the absurd name Xenotropic Murine leukemia virus-Related Virus, was found a few years ago in many cases of prostate cancer and very recently found in many people suffering from chronic fatigue syndrome (more, more). If these studies turn out to be true, it would be quite remarkable that an obscure new retrovirus causes such differing diseases.

My goal was to take the XMRV genome (the sequence of c's, g's, a's, and t's that make up its RNA), and see if the combination "cg" is a lot more rare than you'd expect. This probably seems like a pretty random thing to do, but there's actually a biological reason.

One way the human immune system detects invaders is by looking for DNA containing a cytosine followed by a guanine (which is called a CpG dinucleotide). In mammals, most CpG dinucleotides are methylated (specially tagged), so any CpG that shows up is a sign of something invading. My hypothesis was that XMRV might have a very low number of CpG dinucleotides, helping it avoid the immune system and contributing to its ability to cause disease.

Conveniently, the genome of XMRV has been sequenced and can be downloaded, consisting of 8185 bases:

        1 gcgccagtca tccgatagac tgagtcgccc gggtacccgt gttcccaata aagccttttg
       61 ctgtttgcat ccgaagcgtg gcctcgctgt tccttgggag ggtctcctca gagtgattga
...
It would be a simple matter to use Python to count the number of CG pairs in the genome, but I figured using the Arc language would be more interesting.

The first step is to write a function to parse out the DNA sequence from the genome file; this consists of the lines between ORIGIN and //, with the spaces and numbers removed.

(def skip-to-origin (infile)
  (let line (readline infile)
    (if (is 'eof line) line ;quit on eof
        (litmatch "ORIGIN" line) line ;quit on ORIGIN
        (skip-to-origin infile)))) ; recurse

; Reads the data from one line
; Returns nil on EOF or // line
(def read-data-line (infile)
  (let line (readline infile)
    (if (is 'eof line) nil
        (litmatch "//" line) nil
 (keep [in _ #\c #\g #\a #\t] line))))

; Read the sequence data from ORIGIN to //
; Returns a single string
(def readseq (infile)
   (skip-to-origin infile)
   (string (drain (read-data-line infile))))
This code simply uses skip-to-origin to read up to the ORIGIN line, and then read-data-line to read the cgat data from each following line.

The next step is a histogram function to count the number of times each nucleotide occurs into a table. (Edit: I've shortened my original code with some advice from arclanguage.org.)

((def hist (seq)
  (counts (coerce seq 'cons)))
After downloading the genome as "xmrv", I can load the sequence into seq and generate the histogram:
arc> (= seq (w/infile inf "xmrv" (readseq inf)))
"gcgccagtcatccgatagactgagtcgcccgggtacccgtgt ...etc"
arc> (len seq)
8185
arc>(hist seq)
#hash((#\t . 1732) (#\g . 2057) (#\a . 2078) (#\c . 2318))
The sequence is 8185 nucleotides long as expected, with 1732 T's, 2057 G's, 2078 A's, and 2318 C's. To count the number of times each pair of nucleotides occurs, I made a more general function that will handle pairs or any other sequence of n nucleotides:
(def histn (seq n)
  (w/table h
    (for i 0 (- (len seq) n)
      (++ (h (cut seq i (+ i n)) 0)))
    h))
Since I got tired of looking at raw hash tables, I made a short function to format the output as well as generating percentages.
(def prettyhist (h)
  (let count (reduce + (vals h))
    (let sorted (sort (fn ((k1 v1) (k2 v2)) (> v1 v2))
                      (accum addit (each elt h (addit elt))))
      (each (k v) sorted
        (prn k ": " v " (" (num (* (/ v count) 100.) 2) "%)")))))
Running these functions:
arc> (prettyhist (histn seq 2))
cc: 862 (10.53%)
gg: 639 (7.81%)
ag: 616 (7.53%)
ct: 612 (7.48%)
aa: 588 (7.18%)
ga: 585 (7.15%)
ca: 537 (6.56%)
ac: 529 (6.46%)
tg: 494 (6.04%)
tc: 469 (5.73%)
gc: 458 (5.6%)
tt: 401 (4.9%)
gt: 375 (4.58%)
ta: 368 (4.5%)
at: 344 (4.2%)
cg: 307 (3.75%)
This shows that CG is the most uncommon combination, but not super-low. How does this compare to the frequence if the C/G/A/T frequency was the same, but they were ordered randomly? We can compute this by computing all the possibilities of taking two letters from the original sequence. Instead of actually summing all the combinations, we can just take the cross-product of the original histogram:
(def cross (h1 h2)
  (let h (table)
    (each (k1 v1) h1
      (each (k2 v2) h2
        (let combo (string k1 k2)
   (if (no (h combo))
     (= (h combo) 0))
   (= (h combo) (+ (h combo) (* v1 v2))))))
    h))
This gives us the result:
arc> (prettyhist (cross (hist seq) (hist seq)))
cc: 5373124 (8.02%)
ac: 4816804 (7.19%)
ca: 4816804 (7.19%)
cg: 4768126 (7.12%)
gc: 4768126 (7.12%)
aa: 4318084 (6.45%)
ag: 4274446 (6.38%)
ga: 4274446 (6.38%)
gg: 4231249 (6.32%)
tc: 4014776 (5.99%)
ct: 4014776 (5.99%)
at: 3599096 (5.37%)
ta: 3599096 (5.37%)
gt: 3562724 (5.32%)
tg: 3562724 (5.32%)
tt: 2999824 (4.48%)
This tells us that CG would appear 7.12% of the time if the sequence were randomly shuffled, but instead it appears only 3.75% of the time, so CG shows up about half as often as expected. This is in line with known results for the related MuLV virus, so it seems that there's nothing special about XMRV.

On the other hand, it's known that the HIV genome has a severely low amount of CpG:

arc> (prettyhist (histn (w/infile inf "hivbru" (readseq inf)) 2))
aa: 1096 (11.88%)
ag: 971 (10.52%)
ca: 766 (8.3%)
ga: 762 (8.26%)
at: 691 (7.49%)
ta: 665 (7.21%)
gg: 631 (6.84%)
tg: 548 (5.94%)
ac: 530 (5.74%)
tt: 524 (5.68%)
gc: 430 (4.66%)
ct: 428 (4.64%)
gt: 409 (4.43%)
cc: 381 (4.13%)
tc: 315 (3.41%)
cg: 81 (.88%)
I also took a look at the H1N1 flu sequences. The influenza genome is on 8 separate strands of RNA, so there are 8 separate sequences to process. (The HA segment is the "H" and the NA segment is the "N" in H1N1.) Many H1N1 sequences can be downloaded. I somewhat arbitrarily picked A/Beijing/02/2009(H1N1) since all 8 strands were sequenced. After saving the strands as flu1, ..., flu8, I ran the histogram on all the strands:
arc> (for i 1 8 (prn "---" i "---")
  (prettyhist (histn (w/infile inf (string "flu" i) (readseq inf)) 2)))
On all strands, "cg" was at the bottom of the frequency chart. Especially low-frequency strands are 1 (PB2) at 1.93% CpG, 2 (PB1) at 1.52%, 4 (HA) at 1.57%, and 6 (NA) at 1.77%. Strand 8 (NEP) was highest at 3.09%. So it looks like influenza is pretty low in CpG frequency, but not as low as HIV. (Edit: I've since found a paper that examines CpG in influenza in more detail.)

To conclude, Arc can be used for simple genome analysis. My original hypothesis that XMRV would have low levels of CpG dinucleotides holds, but not as dramatically as for HIV or H1N1 influenza. Apologies to biologists for my oversimplifications, and apologies to statisticians for my lack of p-values :-)

Arc + Arduino + ARM: Temperature monitoring

I'm using my Arc web server running on an ARM-based SheevaPlug to get analog data from my Arduino. The frame below and the graph of temperature and illumination are coming live are a screenshot from the SheevaPlug; they are dynamically generated by some simple Arc code which also collects the data from the Arduino. (Note: this is currently a screenshot, as I'm currently using the SheevaPlug for another project.)
screenshot
The graph shows the daily cycle of illumination (green line), as well as temperature climbing during the day (red line). The temperature goes way, way up when the sun hits the temperature sensor directly. Perhaps I should find some shade for it. For one day, the data went crazy; this is where the wires from the sensor got knocked out of the Arduino by the vacuum cleaner.

The SheevaPlug and Arduino

The Arc web server is running on the ARM-based SheevaPlug, a small Linux server that plugs into the wall. It is connected by USB to an Arduino microcontoller, which does the analog to digital conversion. In the photo, the SheevaPlug also has an Ethernet cable attached. A 4-conductor wire connects the Arduino to the temperature and light sensors outside. This wire is old phone cable, but as described below, I probably should have used something more shielded.

The Arduino code

The Arduino sketch (download) simply reads the voltages and writes them to the serial port:
#define SUPPLY 5.14 // Supply voltage (measured)

void setup() {
  Serial.begin(9600);
}

void loop() {
  float v = analogRead(TEMP_PIN) * SUPPLY / 1024;
  float c = (v - .5) * 100;
  Serial.print(c); 
  Serial.print(" ");
  Serial.println(analogRead(SUN_PIN), DEC);
  delay(5000);
}
Note that I've hardwired the supply voltage, as it's needed for the conversion. The Celsius temperature is obtained directly from the measured voltage (10mV per degree, and offset by .5V to allow negative temperatures).

The Arc code

The Arc server code (download) is straightforward. It leverages my earlier Arc example server code (download).

A background thread fetches the time / sun data lines from the serial port and writes the data to a file. It converts Celsius to Fahrenheit and limits the updates to one per minute:

(def logdata ()
  (w/appendfile outf "/tmp/data"
    (w/stdout outf
      (w/infile serial "/dev/ttyUSB0"
        (let oldtimestamp nil
          (while 1
            (with ((degc sun) (tokens (readline serial))
                  timestamp (gettimestamp))
              (when (isnt timestamp oldtimestamp)
                (let degf (+ 32 (* 9. (/ (coerce degc 'num) 5.)))
               (prn timestamp " " (num degf 2) " " sun)
                  (= oldtimestamp timestamp))))))))))

(new-bgthread 'logdata (fn () (logdata)) 0)
The simple web page is generated in Arc, but the graph itself is generated by gnuplot. Since this is currently a static page, it's a bit of overkill to use the Arc functions to generate the page.
(defop temperature req
  (system "gnuplot < gnuplotcmd")
  (page
    (tag h1 (prn "Temperature and light: Arc + Arduino + ARM"))
    (gentag img src "/graph.png")
    (tag br)
    (prn "This web page is being served by the") (link "Arc language" "http://www.righto.com/doc/index.html") (pr " web server running on a ")
    (link "SheevaPlug" "http://www.righto.com/2009/06/arduino-sheevaplug-cool-hardware.html")
    (pr " plug computer.") (pr "  For more details see ")
    (link "arcfn.com" "http://www.righto.com")
    (pr ".")
    (para)
    (link "View source code for this page" "/source-t")))

Arc vs. Python

I implemented a similar Arduino graph server in Python a few weeks ago. Overall, both Python and Arc make it easy to set up a simple web server. Comparing the Python code with the Arc code reveals Arc's lack of libraries.

The first problem with Arc is it doesn't have a serial library, so I can't configure the baud rate and parameters on my serial port from Arc. I'm just assuming it's set up right, and that's working but not very robust.

The second problem I encountered was creating the timestamps on the data. The latest version of Arc has a timedate function to generate a timestamp but unfortunately that's only in GMT. I tried writing a routine to convert the time to the local timezone, but that got rather annoying. In addition, Arc doesn't have any printf-like formatting, so I had to make my own formatting routine to generate zero-padded strings of the form "12:05". I rapidly decided that writing timezone functions wasn't what I wanted to do, and ended up just using the Unix date command. (This confirm's "kens' law": Any sufficiently complicated Arc application requries the use of 'system to get things done.)

; return a timestamp of the form 2009-08-20 19:22
(= tz "America/Los_Angeles")
(def gettimestamp ()
  (trim (tostring (system (string "TZ=" tz " date +'%m-%d-%Y %H:%M'")))))
I'd have to say that Python is clearly the easier solution overall.

Hardware details

sensors The circuit is pretty trivial. It measures light with a photocell and temperature with a TMP36 temperature sensor (tutorial). (The TMP36 is the three-wire black part that looks like a transistor.) These parts generate voltages that are read by the Arduino's analog-to-digital converters. The temperature sensor's voltage directly gives the Celsius temperature, while the photocell's light measurement is not calibrated to anything.

I had several problems with temperature measurement. First, the Arduino's A/D converter converts relative to its power supply voltage, so the temperature is only as stable as the power supply, and must be manually calibrated. Second, the 10 feet of unshielded phone wire between the Arduino and the temperature sensor introduced a lot of noise. Notice the 10 degree fluctuations on the first day of measurement. I added bypass capacitors after the first day, and the measurements are much smoother. Finally, the temperature sensor heats up a lot when the direct sun hits it in the late afternoon, apparently reaching 140° F. I guess there's a reason why real meteorologists put their temperature sensors in sheltered boxes instead of directly in the sun.

If I were doing this again, I'd probably use a digital temperature sensor such as the One-wire DS18B20; this would avoid the analog calibration and noise issues.

It would be cool to replace the wire between the Arduino and the sensors with Xbee wireless networking. Since I don't have any Xbees, the wire will have to suffice for now.

The photocell voltage is generated from a simple resistor divider. After the second day I changed the resistor from 10K to 1K so the curve wouldn't saturate in the bright sun. (10K worked fine indoors, but outdoors is much brighter.) You can see the break in the green line where I changed resistors.
Schematic

Conclusion

Arc and the SheevaPlug have been more reliable than I expected; my code has been running for a couple of weeks without problems. I think the SheevaPlug makes a good platform for this sort of project. Using Arc, however, is more of an "experimental curiosity"; I'd recommend a different language unless you really want to use Arc.

I have a few other related postings:

World's smallest Arc server

The frame below is being served was being served by the Arc language web server running on a SheevaPlug plug computer. (I've replaced the live connection with a screenshot, as I needed to use the SheevaPlug for something else.) Screenshot

The SheevaPlug

SheevaPlug in the wall The SheevaPlug is a compact, inexpensive ($99), low-power (5W), ARM-based Linux server built into a wall-plug. It's fairly powerful despite its size, with an Ethernet port, Flash storage, and a 1.2GHz processor. See plugcomputer.org for more information.

Marvell gave me a SheevaPlug and I thought it would be an interesting platform for experimenting with Arc as a web server. The SheevaPlug seems like a good platform for an always-running web server.

The picture shows the SheevaPlug plugged into the wall, with the Ethernet cable at the bottom. That's not a wall-wart power supply; that's the whole computer.

Since the plug runs Linux, I can just ssh into it and use it like any other Linux server. I installed mzscheme on the SheevaPlug (apt-get install mzscheme), downloaded the Arc language with wget, used iptables to map port 80 to port 8080, and so on. Dynamic DNS provides a DNS name for my SheevaPlug.

Arc

Arc is a new dialect of Lisp, designed by Paul Graham and Robert Morris. It's designed to be compact, flexible, and useful for exploratory programming.

The Arc language now (finally) runs on the current version of mzscheme (link). The ARM version of mzscheme 4.1.3 is easily installable on the SheevaPlug, making it straightforward to run Arc on the SheevaPlug.

The demo is a simple program using the Arc web server. It lets you vote on a set of choices and then graphs the result using a simple HTML/CSS graph. The Arc demo also serves its own source code here.

I'm using Arc on the SheevaPlug mostly to see if it works, since I've been using Arc for a while. For normal SheevaPlug use, you'd probably want to use Apache, PHP, Python, or another normal web server configuration; they all have ARM packages that you can simply install.

I'm just assuming this is the world's smallest Arc server; we'll see if anyone runs Arc on something smaller.

If the server is not working, either I shut it down to do something else or it crashed. Send me email and let me know. You can access the server without the iframe here.

I've also used a Python web server on the SheevaPlug, which is a more mainstream way to go than Arc. In that case, I connected an Arduino to the SheevaPlug, so my Arduino could be accessed over the web.

Admittedly, the demo is just a proof-of concept. The next step is to do something non-trivial with the SheevaPlug and Arc.

Inside the news.yc ranking formula

The recent arc3 release of Arc includes news.arc, the Arc source code for the Hacker News forum. Examining this code can give some insight into the ranking algorithm that selects the top articles on the Hacker News forum.

The ranking formula

In outline, each item is given a ranking, and the articles are sorted according to the ranking. The simplistic way to think about ranking is the number of votes is divided by time, so more votes results in a higher ranking, but the ranking also drops over time. The votes are raised to a power less than one, while the time is raised to a power greater than one, so time has more effect than votes. Some additional penalties also may be applied to the ranking.

To be specific, the following is the ranking code from news.arc:

(= gravity* 1.8 timebase* 120 front-threshold* 1
   nourl-factor* .4 lightweight-factor* .3 )

(def frontpage-rank (s (o scorefn realscore) (o gravity gravity*))
  (* (/ (let base (- (scorefn s) 1)
          (if (> base 0) (expt base .8) base))
        (expt (/ (+ (item-age s) timebase*) 60) gravity))
     (if (no (in s!type 'story 'poll))  1
         (blank s!url)                  nourl-factor*
         (lightweight s)                (min lightweight-factor*
                                             (contro-factor s))
                                        (contro-factor s))))

This algorithm can be expressed as a (slightly simplified) equation:

rank=\frac{ \left( score-1 \right) ^{.8}}{ \left( age _{hours} + 2 \right) ^ {1.8}} * penalties

This ranking algorithm is used for items on the front page as well as for ordering the comments on an item.

The key factor in the ranking formula is the exponent on time (i.e. the optional parameter gravity) is higher than the exponent on the votes. As a result, even if an item keeps getting a large number of votes, eventually the denominator will overwhelm it and its ranking will drop. For example, if a popular item gets 100 votes an hour, eventually it won't be able to keep pace with a new item getting 10 votes an hour. This ensures that even the most popular items won't stay on the list forever. In other words, gravity controls how fast articles get pulled down in ranking.

By default, the scoring function scorefn is the realscore, which is the net votes for an item ignoring any upvotes from pontential sockpuppet accounts. Sockpuppet accounts are accounts created to manipulate voting, and an account is considered potentially a sockpuppet based on several simple factors.

Because the score is given an exponent of .8 (for positive scores), the benefit of large numbers of votes is reduced. Mathematically, though, this exponent could be eliminated. Since only relative rankings matter, raise everything to the 1/.8 power, and the numerator's exponent disappears, along with the need for the negative check. This would be a win from the "code golf" perspective.

A few other constants are of interest. Since 1 is subtracted from the score, an item starts off with a rank of 0. In the denominator, 120 minutes are added to the time. Thus, an article can be considered two hours old at time of posting, limiting the ranking boost for very recent articles. The front-threshold* specifies a minimum ranking score for articles to appear.

Penalties

A story (i.e. a normal posting) or poll can receive additional ranking penalties. The penalties are not applied to comments or poll options. The penalties are as follows:
  • An item with a blank url is penalized by nourl-factor*. This lowers the ranking of an "Ask HN" item, for instance.
  • A "lightweight" item is penalized.
  • A "controversial" item is penalized.
An item is considered lightweight if the title is a "rallying cry", the post is an image, or the URL's site is in a list of lightweight sites. A lightweight item is penalized by lightweight-factor*. (The code doesn't show how a "rallying cry" is determined.)

The contro-factor penalty is applied to controversial articles. The contro-factor is computed as:

(def contro-factor (s)
  (aif (check (visible-family nil s) [> _ 20])
       (min 1 (expt (/ (realscore s) it) 2))
       1))
An item with too many comments (at least 20 and more comments than upvotes) gets the contro-factor penalty. An item's visible-family is the number of visible comments (plus one for the item itself). If the visible-family is more than 20, the penalty is:

min\left[ 1, \left( \frac{realscore}{visible\-family}\right) ^{2} \right]

A lightweight item gets the worse of the "lightweight" and "contro-factor" penalties. The contro-factor penalty is only applied if less than 1; i.e. an item can't get a boost from it. Because the ratio is squared, the effect of the penalty is more substantial. I would have expected contro-factor to penalize controversial comment subthreads too, but apparently it doesn't.

How the ranking is performed

When the server starts up, it loads the top stories from a topstories file. If the file is not present, it uses the ranking algorithm to select the top 180 stories out of the most recent 1000 stories. Stories normally are re-ranked by adjust-rank only when they receive a vote. Interestingly, only the updated item is re-ranked, rather than doing a global re-rank, probably for efficiency reasons. The adjust-rank function also saves the top 180 items to thetopstories file to disk.

There's also a background reranking thread to rerank a random story from the top 50 every 30 seconds (again using adjust-rank). This ensures that stories that are no longer receiving any votes at all get reranked occasionally.

As far as I can tell, the list of ranked stories never gets shrunk (except on restarts), but the number of displayed stories is capped at 210; you will hit this limit if you keep paging through the top stories.

Comparison with Arc2

The old code in arc2 is much simpler:
(= gravity* 1.4 timebase* 120 front-threshold* 1)

(def frontpage-rank (s (o gravity gravity*))
  (/ (- (realscore s) 1)
     (expt (/ (+ (item-age s) timebase*) 60) gravity)))
The main differences are that arc3 has higher gravity, so articles will drop off faster; arc3 has an exponent on the numerator, and arc3 adds the various penalties. This illustrates that the ranking formula is changing and gaining complexity over time.

Conclusions

There may be differences there are between the live Hacker News server and news.arc, so my analysis may not exactly describe what's really happening. I'd assume the code is constantly being modified; comparison with arc2 shows that the ranking formula has become considerably more complex. In addition, news.arc has some YC-specific stuff removed.. Finally, since this is Arc, the ranking constants and even the algorithms can be changed at the REPL while the server is running. Thus, the live ranking algorithm might never actually exist in a source file.

So, what's the secret to getting a high-ranked article? I don't see any magic bullets in the code, just the obvious: get lots of upvotes in a short period of time, avoid penalties (don't post fluff), and don't use sockpuppets.

What's new in arc3?

The long-awaited new update of Arc (arc3) has been made available. I've examined the release to determine what's new in arc3 beyond the mysterious new logo: arc logo.

arc3 is a major change from arc2, with about 3000 lines and most files changed. The following gratuitous pie chart illustrates that the largest number of changes are in the news server (news.arc), with many changes to the fundamental language (ac.scm and arc.arc) as well as the web and application servers (app.arc and srv.arc).
Summary of changes
The news server has been heavily modified with new ranking, support for polls, and improved spam rejection, among other features. A new "how-to-run-news" file explains how to set up your own Hacker News-like server. I will discuss the news server, web server, app server, and html library in more detail in a future article. I will focus this article on the language and main libraries.

If you're looking to try arc3, it can be downloaded here. Arc3 has significant incompatibilities with previous versions that will cause problems if you're not prepared.

Incompatibilities

  • Arc now requires mzscheme 372. Arc will not work with the version 352 required by arc2, or with the newer versions of mzscheme.
  • flush-output is disabled by default on write, disp, and functions that use these. See declare below.
  • The result of accum is no longer in reverse order.
  • news.arc is no longer loaded as part of arc.
  • a.b.c is now ((a b) c), not (a b c)
  • a!b!c is now ((a 'b) 'c), not (a 'b 'c)
  • The set function has been renamed assign, and the assert function has been renamed set.
  • writefile1 removed.
  • flat no longer supports stringstoo option.
  • to-nearest renamed to nearest.
  • random-elt renamed to rand-elt.
  • firstn-that renamed to retrieve.
  • plural renamed to pluralize.

New functions

  • int function coerces to an integer.
  • force-close discards buffered output and closes the descriptor.
  • mvfile moves a file.
  • memory returns the current memory use.
  • sin, cos, tan, and log are finally available.
  • declare is used to set system variables: explicit-flush to control automatic output flushing, and direct-calls for the function call optimization.
  • timedate converts seconds to date.
  • copylist copies a list.
  • defs applies def to name, args, body triples to define multiple functions at once.
  • get provides simpler table / array lookup. get!foo creates a function that takes a table and looks up 'foo. (Actually, it's a generic mechanism to create a function to apply an argument to something.)
  • writefile writes to a temporary file first.
  • letter tests if a character is a letter.
  • sum sums up function results.
  • med returns median of list.
  • minutes-since returns minutes since a time.
  • defcache wrapper around cache to reduce boilerplate.
  • datestring creates "y - m - d" string.
  • or= modifies a variable if the variable is false.
  • out evaluates and prints an expression.
  • fromdisk, diskvar, disktable, and todisk provide a simplified mechanism for loading and storing data.
  • halve splits a string in two on the first separator.
  • positions returns a list of positions where a test is true.
  • lines splits a string into lines.
  • urlencode urlencodes a string.
  • nonblank returns a string if it's nonblank.
  • plural returns a count and pluralized word.

Bug fixes

  • atomic-invoke fixed.
  • socket-accept uses custodians so web servers can forcibly close connections if the clients aren't reading data.
  • coerce of string into int rounds the value as does number to char.
  • list modified to copy the list.
  • safeset now writes warnings to stderr.
  • each fixed to avoid conflict with afn.
  • cut fixed.
  • trues fixed.
  • whiler fixed.
  • rand-string now more random to help fix huge security hole.
  • split fixed for negative positions..
  • memo now memoizes nil results too.
  • readline will terminate on nil.
  • copy fixed.
  • hours-since fixed.
  • date fixed to avoid system operation.
  • intersperse now works with nil.
  • load fixed.
  • posmatch fixed.
  • num fixed to handle negative numbers.

Optimizations

  • complement in function position
  • The <, >, and + operations when applied to two arguments.
  • (foo bar) is optimized to call foo directly if foo is a global bound to a function. This is controlled by declare.

Other changes

  • Debugging has been improved by giving the Arc names of things to Scheme, and turning on line counting.
  • A default value can be provided when accessing a hash table.
  • Currying was largely implemented but not enabled.
In conclusion, arc3 has some nice improvements, bug fixes, and optimizations, but nothing too dramatic in the language itself. If you're using the Arc news server, you'll definitely want arc3 for the security fixes. On the other hand, the unofficial Anarki release of Arc has a lot of interesting features too.

Disclaimers: I have provided my interpretations of the changes in arc3, and I'm sure there are some errors; please provide feedback if you find mistakes. Note that I am not connected with the arc team in any way, and this is not an "official" list of changes. In addition, arc3 may change without warning, so any of the above can be obsoleted. (There are some good reasons for real version numbers, bug tracking, and source code control, but I won't belabor that point.)