Simple genome analysis with Arc: In which I examine the XMRV genome and discover little of interest

I decided to examine the genome of the retrovirus XMRV to see what I could learn. This virus, with the absurd name Xenotropic Murine leukemia virus-Related Virus, was found a few years ago in many cases of prostate cancer and very recently found in many people suffering from chronic fatigue syndrome (more, more). If these studies turn out to be true, it would be quite remarkable that an obscure new retrovirus causes such differing diseases.

My goal was to take the XMRV genome (the sequence of c's, g's, a's, and t's that make up its RNA), and see if the combination "cg" is a lot more rare than you'd expect. This probably seems like a pretty random thing to do, but there's actually a biological reason.

One way the human immune system detects invaders is by looking for DNA containing a cytosine followed by a guanine (which is called a CpG dinucleotide). In mammals, most CpG dinucleotides are methylated (specially tagged), so any CpG that shows up is a sign of something invading. My hypothesis was that XMRV might have a very low number of CpG dinucleotides, helping it avoid the immune system and contributing to its ability to cause disease.

Conveniently, the genome of XMRV has been sequenced and can be downloaded, consisting of 8185 bases:

        1 gcgccagtca tccgatagac tgagtcgccc gggtacccgt gttcccaata aagccttttg
       61 ctgtttgcat ccgaagcgtg gcctcgctgt tccttgggag ggtctcctca gagtgattga
It would be a simple matter to use Python to count the number of CG pairs in the genome, but I figured using the Arc language would be more interesting.

The first step is to write a function to parse out the DNA sequence from the genome file; this consists of the lines between ORIGIN and //, with the spaces and numbers removed.

(def skip-to-origin (infile)
  (let line (readline infile)
    (if (is 'eof line) line ;quit on eof
        (litmatch "ORIGIN" line) line ;quit on ORIGIN
        (skip-to-origin infile)))) ; recurse

; Reads the data from one line
; Returns nil on EOF or // line
(def read-data-line (infile)
  (let line (readline infile)
    (if (is 'eof line) nil
        (litmatch "//" line) nil
 (keep [in _ #\c #\g #\a #\t] line))))

; Read the sequence data from ORIGIN to //
; Returns a single string
(def readseq (infile)
   (skip-to-origin infile)
   (string (drain (read-data-line infile))))
This code simply uses skip-to-origin to read up to the ORIGIN line, and then read-data-line to read the cgat data from each following line.

The next step is a histogram function to count the number of times each nucleotide occurs into a table. (Edit: I've shortened my original code with some advice from

((def hist (seq)
  (counts (coerce seq 'cons)))
After downloading the genome as "xmrv", I can load the sequence into seq and generate the histogram:
arc> (= seq (w/infile inf "xmrv" (readseq inf)))
"gcgccagtcatccgatagactgagtcgcccgggtacccgtgt ...etc"
arc> (len seq)
arc>(hist seq)
#hash((#\t . 1732) (#\g . 2057) (#\a . 2078) (#\c . 2318))
The sequence is 8185 nucleotides long as expected, with 1732 T's, 2057 G's, 2078 A's, and 2318 C's. To count the number of times each pair of nucleotides occurs, I made a more general function that will handle pairs or any other sequence of n nucleotides:
(def histn (seq n)
  (w/table h
    (for i 0 (- (len seq) n)
      (++ (h (cut seq i (+ i n)) 0)))
Since I got tired of looking at raw hash tables, I made a short function to format the output as well as generating percentages.
(def prettyhist (h)
  (let count (reduce + (vals h))
    (let sorted (sort (fn ((k1 v1) (k2 v2)) (> v1 v2))
                      (accum addit (each elt h (addit elt))))
      (each (k v) sorted
        (prn k ": " v " (" (num (* (/ v count) 100.) 2) "%)")))))
Running these functions:
arc> (prettyhist (histn seq 2))
cc: 862 (10.53%)
gg: 639 (7.81%)
ag: 616 (7.53%)
ct: 612 (7.48%)
aa: 588 (7.18%)
ga: 585 (7.15%)
ca: 537 (6.56%)
ac: 529 (6.46%)
tg: 494 (6.04%)
tc: 469 (5.73%)
gc: 458 (5.6%)
tt: 401 (4.9%)
gt: 375 (4.58%)
ta: 368 (4.5%)
at: 344 (4.2%)
cg: 307 (3.75%)
This shows that CG is the most uncommon combination, but not super-low. How does this compare to the frequence if the C/G/A/T frequency was the same, but they were ordered randomly? We can compute this by computing all the possibilities of taking two letters from the original sequence. Instead of actually summing all the combinations, we can just take the cross-product of the original histogram:
(def cross (h1 h2)
  (let h (table)
    (each (k1 v1) h1
      (each (k2 v2) h2
        (let combo (string k1 k2)
   (if (no (h combo))
     (= (h combo) 0))
   (= (h combo) (+ (h combo) (* v1 v2))))))
This gives us the result:
arc> (prettyhist (cross (hist seq) (hist seq)))
cc: 5373124 (8.02%)
ac: 4816804 (7.19%)
ca: 4816804 (7.19%)
cg: 4768126 (7.12%)
gc: 4768126 (7.12%)
aa: 4318084 (6.45%)
ag: 4274446 (6.38%)
ga: 4274446 (6.38%)
gg: 4231249 (6.32%)
tc: 4014776 (5.99%)
ct: 4014776 (5.99%)
at: 3599096 (5.37%)
ta: 3599096 (5.37%)
gt: 3562724 (5.32%)
tg: 3562724 (5.32%)
tt: 2999824 (4.48%)
This tells us that CG would appear 7.12% of the time if the sequence were randomly shuffled, but instead it appears only 3.75% of the time, so CG shows up about half as often as expected. This is in line with known results for the related MuLV virus, so it seems that there's nothing special about XMRV.

On the other hand, it's known that the HIV genome has a severely low amount of CpG:

arc> (prettyhist (histn (w/infile inf "hivbru" (readseq inf)) 2))
aa: 1096 (11.88%)
ag: 971 (10.52%)
ca: 766 (8.3%)
ga: 762 (8.26%)
at: 691 (7.49%)
ta: 665 (7.21%)
gg: 631 (6.84%)
tg: 548 (5.94%)
ac: 530 (5.74%)
tt: 524 (5.68%)
gc: 430 (4.66%)
ct: 428 (4.64%)
gt: 409 (4.43%)
cc: 381 (4.13%)
tc: 315 (3.41%)
cg: 81 (.88%)
I also took a look at the H1N1 flu sequences. The influenza genome is on 8 separate strands of RNA, so there are 8 separate sequences to process. (The HA segment is the "H" and the NA segment is the "N" in H1N1.) Many H1N1 sequences can be downloaded. I somewhat arbitrarily picked A/Beijing/02/2009(H1N1) since all 8 strands were sequenced. After saving the strands as flu1, ..., flu8, I ran the histogram on all the strands:
arc> (for i 1 8 (prn "---" i "---")
  (prettyhist (histn (w/infile inf (string "flu" i) (readseq inf)) 2)))
On all strands, "cg" was at the bottom of the frequency chart. Especially low-frequency strands are 1 (PB2) at 1.93% CpG, 2 (PB1) at 1.52%, 4 (HA) at 1.57%, and 6 (NA) at 1.77%. Strand 8 (NEP) was highest at 3.09%. So it looks like influenza is pretty low in CpG frequency, but not as low as HIV. (Edit: I've since found a paper that examines CpG in influenza in more detail.)

To conclude, Arc can be used for simple genome analysis. My original hypothesis that XMRV would have low levels of CpG dinucleotides holds, but not as dramatically as for HIV or H1N1 influenza. Apologies to biologists for my oversimplifications, and apologies to statisticians for my lack of p-values :-)


  1. I hope you don't mind that there are some comments on your XMRV genome analysis here. I'm guessing that people would be honored if you would reply:

    posts #23 and #24

  2. The virus' RNA is hidden inside a capsule and should not be available to the antibodies to attach to and therefore for the immune system cannot recognize the virus as an invading body based on the contents of its RNA. The composition of the RNA should not have any bearing on the immune response. Why do you think it is relevant?

  3. Anonymous: See Microbe DNA Seen as Alien By Immune Cells in the New York Times, which discusses how the immune system can recognize CpG dinucleotides in foreign DNA, stimulating B cell growth.