My goal was to take the XMRV genome (the sequence of c's, g's, a's, and t's that make up its RNA), and see if the combination "cg" is a lot more rare than you'd expect. This probably seems like a pretty random thing to do, but there's actually a biological reason.
One way the human immune system detects invaders is by looking for DNA containing a cytosine followed by a guanine (which is called a CpG dinucleotide). In mammals, most CpG dinucleotides are methylated (specially tagged), so any CpG that shows up is a sign of something invading. My hypothesis was that XMRV might have a very low number of CpG dinucleotides, helping it avoid the immune system and contributing to its ability to cause disease.
Conveniently, the genome of XMRV has been sequenced and can be downloaded, consisting of 8185 bases:
1 gcgccagtca tccgatagac tgagtcgccc gggtacccgt gttcccaata aagccttttg 61 ctgtttgcat ccgaagcgtg gcctcgctgt tccttgggag ggtctcctca gagtgattga ...It would be a simple matter to use Python to count the number of CG pairs in the genome, but I figured using the Arc language would be more interesting.
The first step is to write a function to parse out the DNA sequence from the genome file; this consists of the lines between
//, with the spaces and numbers removed.
(def skip-to-origin (infile) (let line (readline infile) (if (is 'eof line) line ;quit on eof (litmatch "ORIGIN" line) line ;quit on ORIGIN (skip-to-origin infile)))) ; recurse ; Reads the data from one line ; Returns nil on EOF or // line (def read-data-line (infile) (let line (readline infile) (if (is 'eof line) nil (litmatch "//" line) nil (keep [in _ #\c #\g #\a #\t] line)))) ; Read the sequence data from ORIGIN to // ; Returns a single string (def readseq (infile) (skip-to-origin infile) (string (drain (read-data-line infile))))This code simply uses
skip-to-originto read up to the
ORIGINline, and then
read-data-lineto read the cgat data from each following line.
The next step is a histogram function to count the number of times each nucleotide occurs into a table. (Edit: I've shortened my original code with some advice from arclanguage.org.)
((def hist (seq) (counts (coerce seq 'cons)))After downloading the genome as "xmrv", I can load the sequence into
seqand generate the histogram:
arc> (= seq (w/infile inf "xmrv" (readseq inf))) "gcgccagtcatccgatagactgagtcgcccgggtacccgtgt ...etc" arc> (len seq) 8185 arc>(hist seq) #hash((#\t . 1732) (#\g . 2057) (#\a . 2078) (#\c . 2318))The sequence is 8185 nucleotides long as expected, with 1732 T's, 2057 G's, 2078 A's, and 2318 C's. To count the number of times each pair of nucleotides occurs, I made a more general function that will handle pairs or any other sequence of n nucleotides:
(def histn (seq n) (w/table h (for i 0 (- (len seq) n) (++ (h (cut seq i (+ i n)) 0))) h))Since I got tired of looking at raw hash tables, I made a short function to format the output as well as generating percentages.
(def prettyhist (h) (let count (reduce + (vals h)) (let sorted (sort (fn ((k1 v1) (k2 v2)) (> v1 v2)) (accum addit (each elt h (addit elt)))) (each (k v) sorted (prn k ": " v " (" (num (* (/ v count) 100.) 2) "%)")))))Running these functions:
arc> (prettyhist (histn seq 2)) cc: 862 (10.53%) gg: 639 (7.81%) ag: 616 (7.53%) ct: 612 (7.48%) aa: 588 (7.18%) ga: 585 (7.15%) ca: 537 (6.56%) ac: 529 (6.46%) tg: 494 (6.04%) tc: 469 (5.73%) gc: 458 (5.6%) tt: 401 (4.9%) gt: 375 (4.58%) ta: 368 (4.5%) at: 344 (4.2%) cg: 307 (3.75%)This shows that CG is the most uncommon combination, but not super-low. How does this compare to the frequence if the C/G/A/T frequency was the same, but they were ordered randomly? We can compute this by computing all the possibilities of taking two letters from the original sequence. Instead of actually summing all the combinations, we can just take the cross-product of the original histogram:
(def cross (h1 h2) (let h (table) (each (k1 v1) h1 (each (k2 v2) h2 (let combo (string k1 k2) (if (no (h combo)) (= (h combo) 0)) (= (h combo) (+ (h combo) (* v1 v2)))))) h))This gives us the result:
arc> (prettyhist (cross (hist seq) (hist seq))) cc: 5373124 (8.02%) ac: 4816804 (7.19%) ca: 4816804 (7.19%) cg: 4768126 (7.12%) gc: 4768126 (7.12%) aa: 4318084 (6.45%) ag: 4274446 (6.38%) ga: 4274446 (6.38%) gg: 4231249 (6.32%) tc: 4014776 (5.99%) ct: 4014776 (5.99%) at: 3599096 (5.37%) ta: 3599096 (5.37%) gt: 3562724 (5.32%) tg: 3562724 (5.32%) tt: 2999824 (4.48%)This tells us that CG would appear 7.12% of the time if the sequence were randomly shuffled, but instead it appears only 3.75% of the time, so CG shows up about half as often as expected. This is in line with known results for the related MuLV virus, so it seems that there's nothing special about XMRV.
arc> (prettyhist (histn (w/infile inf "hivbru" (readseq inf)) 2)) aa: 1096 (11.88%) ag: 971 (10.52%) ca: 766 (8.3%) ga: 762 (8.26%) at: 691 (7.49%) ta: 665 (7.21%) gg: 631 (6.84%) tg: 548 (5.94%) ac: 530 (5.74%) tt: 524 (5.68%) gc: 430 (4.66%) ct: 428 (4.64%) gt: 409 (4.43%) cc: 381 (4.13%) tc: 315 (3.41%) cg: 81 (.88%)I also took a look at the H1N1 flu sequences. The influenza genome is on 8 separate strands of RNA, so there are 8 separate sequences to process. (The HA segment is the "H" and the NA segment is the "N" in H1N1.) Many H1N1 sequences can be downloaded. I somewhat arbitrarily picked A/Beijing/02/2009(H1N1) since all 8 strands were sequenced. After saving the strands as flu1, ..., flu8, I ran the histogram on all the strands:
arc> (for i 1 8 (prn "---" i "---") (prettyhist (histn (w/infile inf (string "flu" i) (readseq inf)) 2)))On all strands, "cg" was at the bottom of the frequency chart. Especially low-frequency strands are 1 (PB2) at 1.93% CpG, 2 (PB1) at 1.52%, 4 (HA) at 1.57%, and 6 (NA) at 1.77%. Strand 8 (NEP) was highest at 3.09%. So it looks like influenza is pretty low in CpG frequency, but not as low as HIV. (Edit: I've since found a paper that examines CpG in influenza in more detail.)
To conclude, Arc can be used for simple genome analysis. My original hypothesis that XMRV would have low levels of CpG dinucleotides holds, but not as dramatically as for HIV or H1N1 influenza. Apologies to biologists for my oversimplifications, and apologies to statisticians for my lack of p-values :-)