In the previous Xerox Alto restoration session, we got the disk working, but the system didn't boot. After much investigation, I discovered the explanation for the boot failure: the disk has been overwritten with random data! This article describes my journey through the Alto microcode to determine what happened.
For background, the Alto was a revolutionary computer designed at Xerox PARC in 1973 to investigate personal computing. It introduced the GUI, Ethernet and laser printers to the world, among other things. Y Combinator received an Alto from computer visionary Alan Kay and I'm helping restore it, along with Marc Verdiell, Luca Severini, Ron Crane, Carl Claunch and Ed Thelen (from the IBM 1401 restoration team). For posts on previous restoration days see 1, 2, 3, 4, 5 and 6.
Debugging the boot failure
Last session, after fixing a broken 7414 TTL chip on the disk interface board, we could fetch a block from disk but the Alto failed to boot. We used a logic analyzer to trace the microcode instructions and the ALU bus contents. Josh Dersch from the Living Computer Museum studied the traces and found that the boot program was executing a few instructions (jump, add, load), and then seemed to go off the rails. But it turns out things were more messed up than that.
I made a microcode trace browser to help figure out what was going on. With this program, I can step through an execution trace one micro-instruction at a time and see the corresponding source code line. (Click the image below for the live trace browser.) First, I examined the KWD (disk word task), which executes for each word from disk, and copies that word to memory. I verified that the disk read was working as expected. The second task of interest is the NOVEM (Nova emulator task), which runs a program. In our case, it runs the boot program as soon as it is loaded from disk. By examining this task, we can figure out what is going wrong with the boot process.
By studying the disk read microcode (KWD) closely, I was able to extract each word in the disk sector from the logic analyzer trace. This was very difficult for many reasons. For example, we logged the ALU bus which doesn't have the words from disk. I had to figure out the disk contents by reversing the checksum computation, which was on the ALU bus. Another problem was the Alto stores sectors on disk backwards. But eventually I extracted the contents of the boot sector, as read into the Alto:
16a5 2d4a 5a94 b528 14db 29b6 536c a6d8 333b 6676 ccec e753 b02d 1ed1 3da2 7b44 ...
I hand-disassembled these words into Data General Nova assembler code and discovered a few things. First, the first few instructions matched Josh's interpretation, so the CPU and the emulator task seemed to be working correctly. Second, the instructions didn't make any sense as code, and some words weren't even instructions, which explained why the boot rapidly fell apart. Third, and most puzzling, the instructions were nothing like what the Alto boot code was supposed to be.
The boot block seemed to contain random junk. The problem wasn't flaky hardware generating bad data, because the block checksum validated correctly. This wasn't the drive returning the wrong sector, because the sector header was correct. The sector didn't contain instructions, it wasn't ASCII, and it didn't look like a sensible file format. As I studied the sector contents more, I wondered if the data was literally random. I made a histogram of how many times each byte value occurred, and it was pretty much uniform. (In comparison, archived Alto disk sectors showed very non-uniform distributions.) But why would the boot block have been overwritten with (pseudo-) random data?
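A minimal Python sketch of this kind of byte histogram follows. The function name is hypothetical, and the sample is only the 16 words listed above, which is far too few to judge uniformity; a real check would run over the whole sector. The sketch only shows the mechanics of splitting each 16-bit word into two bytes and counting them.

```python
from collections import Counter

def byte_histogram(words):
    """Count how often each byte value (0-255) occurs in a list of 16-bit words."""
    counts = Counter()
    for w in words:
        counts[(w >> 8) & 0xFF] += 1  # high byte of the word
        counts[w & 0xFF] += 1         # low byte of the word
    return counts

# First 16 words of the boot sector, as listed above.
sample = [
    0x16a5, 0x2d4a, 0x5a94, 0xb528, 0x14db, 0x29b6, 0x536c, 0xa6d8,
    0x333b, 0x6676, 0xccec, 0xe753, 0xb02d, 0x1ed1, 0x3da2, 0x7b44,
]

hist = byte_histogram(sample)
# Number of distinct byte values seen, and the count of the most common one.
print(len(hist), max(hist.values()))
```

For data that is close to uniform, no byte value should dominate; text or code typically shows a few values (spaces, common opcodes, zeros) occurring far more often than the rest.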
Josh mentioned DiEx (Diablo disk exerciser), a utility program to diagnose problems with the Alto's Diablo disk drive, and suggested that it could have wiped the disk. I found the DiEx source code in the Computer History Museum's Alto archive, and sure enough, it has a feature to write random data to the disk (and then verify it).
I could believe someone had inconveniently wiped our disk with the DiEx utility, but I still had nagging doubts that maybe we were seeing a hardware issue. Could I prove that DiEx was responsible? All I had to do was show that the disk data wasn't arbitrary, but came from DiEx.
Generating random numbers on the Alto
I found the source code for RANDOM.ASM, the Alto's random number code, in the Computer History Museum's Alto archive. This algorithm generates 16-bit random numbers with the recurrence formula: "x[n] = (x[n-33] + x[n-13]) mod 2^16". (Note that these are very bad random numbers cryptographically, since once you have 33 numbers in the sequence you can generate them all.) I wanted to see if the data we read from disk was generated from this function, so I coded up the algorithm. This was somewhat difficult as the original was written in Nova assembler code. The results didn't match the disk data, no matter what I tried. Finally, I realized that I could just use a brute force solution and ignore the details of the algorithm. I picked random pairs of values in the data and checked if their sum appeared in the data. If the data came from any sort of recurrence, I would get a bunch of matches, but I didn't. I concluded that the disk data wasn't generated from this random number algorithm.
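The recurrence and the brute-force pair test can be sketched in Python as below. This is an illustration under assumptions, not a port of RANDOM.ASM: the function names, the seeding with 33 arbitrary starting words, and the number of trials are all hypothetical, and only the formula itself comes from the text above.

```python
import random

def additive_generator(seed_words, count, lag_a=33, lag_b=13):
    """Extend seed_words using x[n] = (x[n-33] + x[n-13]) mod 2^16.

    Returns only the newly generated words, not the seed.
    """
    state = list(seed_words)
    assert len(state) >= max(lag_a, lag_b), "need at least 33 starting words"
    for _ in range(count):
        state.append((state[-lag_a] + state[-lag_b]) & 0xFFFF)
    return state[len(seed_words):]

def pair_sum_hits(words, trials=1000, rng=None):
    """Pick random pairs of words and count how often their 16-bit sum
    also appears somewhere in the data."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    present = set(words)
    hits = 0
    for _ in range(trials):
        a, b = rng.sample(words, 2)
        if (a + b) & 0xFFFF in present:
            hits += 1
    return hits

# Hypothetical seed: 33 arbitrary starting values.
seed = list(range(33))
stream = seed + additive_generator(seed, 200)
print(pair_sum_hits(stream))
```

The pair test deliberately ignores the lags 33 and 13: if the data came from any additive recurrence of this shape, sums of earlier words keep reappearing later in the data, so the hit count should stand out against data unrelated to such a recurrence.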
However, on closer examination I noticed that the RANDOM.ASM function signature didn't match the DiEx code, so it probably wasn't the right function. After more searching I found TriexML.asm, another Alto random number function. To generate a random 16-bit word, this algorithm simply shifts the previous value one bit to the left. If there is an overflow, the result is xor'd with the number 077213. (It would be hard to come up with a cryptographically worse random number generator—from one number you can generate the whole sequence—but the algorithm is very fast.)
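A minimal Python sketch of this shift-and-xor generator follows. It is written from the description above rather than from TriexML.asm, so the function names are hypothetical, and "overflow" is assumed to mean that a 1 bit is shifted out of the top of the 16-bit word. That reading agrees with the boot sector words listed earlier.

```python
XOR_CONSTANT = 0o77213  # the octal constant quoted above

def next_word(x):
    """Shift a 16-bit word left by one bit; if a 1 bit falls off the top
    (assumed meaning of "overflow"), xor the result with XOR_CONSTANT."""
    shifted = x << 1
    if shifted & 0x10000:
        return (shifted & 0xFFFF) ^ XOR_CONSTANT
    return shifted

def generate(start, count):
    """Return the next `count` words that follow `start` in the sequence."""
    words = []
    x = start
    for _ in range(count):
        x = next_word(x)
        words.append(x)
    return words

# Starting from the first boot sector word, print the words that follow it.
print([format(w, "04x") for w in generate(0x16a5, 4)])
```

Each step is a shift, a test, and at most one xor, which is why the generator is so fast, and also why a single known word is enough to reproduce everything after it.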
To check the disk contents against this algorithm, I skipped the careful implementation and went straight to brute force. To see if any shift-and-xor algorithm would explain our data, I shifted each word from the disk sector and xor'd it with the next one. In each case, I got either 0 or octal 077213, matching the algorithm. Starting the algorithm with 012345 (the seed value in the code) eventually generates the exact sector of data we read, proving this algorithm generated the random data we saw on the disk.
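The brute-force check can be reproduced on the 16 boot sector words listed earlier with a few lines of Python. The function name is hypothetical; the data and the constant are taken from the text.

```python
def shift_xor_residues(words):
    """For each adjacent pair, shift the first word left one bit (keeping
    16 bits) and xor it with the word that follows."""
    return [((a << 1) & 0xFFFF) ^ b for a, b in zip(words, words[1:])]

# First 16 words of the boot sector, as listed above.
sector_start = [
    0x16a5, 0x2d4a, 0x5a94, 0xb528, 0x14db, 0x29b6, 0x536c, 0xa6d8,
    0x333b, 0x6676, 0xccec, 0xe753, 0xb02d, 0x1ed1, 0x3da2, 0x7b44,
]

residues = shift_xor_residues(sector_start)
print(sorted({oct(r) for r in residues}))
# prints ['0o0', '0o77213']
```

A residue of 0 means the next word is a plain left shift of the previous one; a residue of 077213 means the shift overflowed and the constant was xor'd in. Unrelated data would produce a scatter of different residues instead of just these two values.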
Thus, someone had clobbered our disk (probably decades ago) while testing the drive with DiEx. Since we couldn't boot off this disk, we'd need a new boot disk. Xerox PARC has dozens of old Alto disks lying around and they offered some of them to us. But the Living Computer Museum offered to send us a working Alto disk, rather than risk damage to the potentially-interesting contents of an old PARC disk, so we'll use the LCM disk instead.
Conclusion
Last repair session, we fixed a failed 7414 inverter chip on the disk interface board. With that fixed, we could read the disk but boot still failed. After careful investigation of the microcode and traces, I discovered that our disk had been overwritten with random data making it impossible to boot from it. In one way this is a good result, since it means our boot wasn't failing because of a hardware problem.
When we get a new Alto disk, we'll try booting again. I'm moderately optimistic that the system will come up successfully, but there could be more hardware problems waiting for us. For updates on the restoration, follow kenshirriff on Twitter.
Thanks to Josh Dersch and the Living Computer Museum for their debugging help. Thanks to Tim Curley and Xerox PARC for supplying additional disks.