Showing posts with label alto. Show all posts

Restoring YC's Xerox Alto day 9: tracing a crash through software and hardware

Last week, after months of restoration, we finally got the vintage Xerox Alto computer to boot (details) and run programs. However, some programs (such as the mouse-based Draw program) crashed so we knew there must still be a hardware problem somewhere in the system. In today's session we traced through the software, microcode and hardware to track down the cause of these crashes.

For background, the Alto was a revolutionary computer designed at Xerox PARC in 1973 to investigate personal computing. It introduced the GUI, Ethernet and laser printers to the world, among other things. Y Combinator received an Alto from computer visionary Alan Kay and I'm helping restore the system, along with Marc Verdiell, Luca Severini, Ron Crane, Carl Claunch and Ed Thelen. The full collection of Alto restoration posts is here.

When the Xerox Alto encounters a problem, it drops into the Swat debugger.

To assist with debugging, the Alto includes a debugger called Swat. If a program malfunctions, it drops into the Swat debugger, as seen above. The debugger lets you examine memory and set breakpoints. It is more advanced than I'd expect for 1973, including a feature to disassemble machine instructions from memory and view them with names from the symbol table.

In our case, the debugger showed that when we ran the MADTEST test program, the Alto had jumped to address 2, which triggered the debugger. The first 8 memory locations in the Xerox Alto contain TRAP instructions to catch erroneous jumps to a zero address (or near-zero address), which can happen if a subroutine return address is clobbered. By examining the stack frames, I determined which subroutine had been called when the system crashed. The problem occurred when the program was jumping to microcode that had been loaded into microcode RAM. Since this is an unusual operation, it would explain why most programs ran successfully and only a few crashed.

Microcode

Microcode is a low-level feature of most processors, but I should explain what it means for a program to jump to microcode, since this is a strange feature of the Alto. Computers execute machine code, the simple, low-level instructions that the CPU can understand; on modern computers this may be the x86 instruction set, while the Alto used the Data General Nova instruction set. Most processors, however, don't run machine instructions directly, but have a microcode layer that is invisible to the programmer. While the processor appears to be running machine instructions, it internally executes microcode, a simpler, low-level instruction set that is a better match for the hardware. Each machine instruction may turn into many micro-instructions.

The Xerox Alto uses microcode much more extensively than most computers, with microcode performing tasks such as device control that most computers do in hardware, resulting in a cheaper and more flexible system. (As Alan Kay wrote, "Hardware is just software crystallized early.") On the Alto, programmers have access to the microcode—a user program can load new microcode into special control RAM. This microcode can implement new machine instructions, optimize particular operations (analogous to GPU programming), or obtain low-level control over the system.

The Xerox Alto's CRAM board (Control RAM) stores 1024 microcode instructions. The 32 memory chips in the lower left provide the 1024x32 storage. Foreshadowing: the connector at the lower left connects the CRAM board to the microcode control board.

Our Alto has 1024 words of microcode in ROM (for the standard microcode) and 1024 words in RAM (for software-controlled microcode). The photo above shows the CRAM (control RAM) board that holds the user-modifiable microcode. This board illustrates the incredible improvements in memory density since 1973—this board required 32 memory chips to hold the 1024 32-bit words (4 Kbytes) of microcode.

The Alto's microcode uses a 1K (10-bit) address space. Since Altos can support up to 2K of microcode in ROM and 3K in RAM, bank switching is used to switch between different 1K memory banks. Bank switching is triggered by a special micro-instruction called SWMODE (switch mode).

Getting back to our crash, the MADTEST test program loads special test microcode into the control RAM. Then it executes the JMPRAM machine instruction to switch execution from machine instructions to the microcode in RAM. The microcode that implements JMPRAM performs a SWMODE to switch execution to the RAM microcode bank and the microcode in RAM will execute. When the microcode is done, it is supposed to return execution to the machine code emulator, and execution of the user-level program (MADTEST) will continue. But somehow execution ended up at address 2, causing the program to crash.

To track down a problem with the Xerox Alto's bank switching circuit, we attached many probes to the CPU control board.

We used a logic analyzer to record every micro-instruction and memory access, so we could determine exactly what went wrong. After a few tries, we captured a trace showing what the Alto was executing until it crashed. Over the past week, I've been using the Living Computer Museum's ContrAlto simulator of the Xerox Alto to understand how the Alto's software and microcode work. With this background, I could interpret the logic analyzer output and map it to the MADTEST code and the microcode. Everything proceeded fine until the JMPRAM instruction was executed. Instead of switching to the microcode in RAM, it was still running microcode from the ROM. Since the micro-address was intended for the RAM code, the processor was running essentially random microcode. Through pure luck, this microcode routine completed and returned control to the regular machine code emulator rather than hanging the system. Unfortunately this code didn't load the return address register, resulting in a jump to address 2 and the Swat crash we saw.

To summarize, everything was working fine except instead of switching to the microcode RAM bank, execution stayed in the microcode ROM bank. This was pretty clearly a hardware problem, so we started looking at the bank switch circuit, which consists of multiple integrated circuits.

The bank switch hardware

The Alto was built at the dawn of the microprocessor age, so instead of using a microprocessor chip, it used three boards of TTL chips for the CPU. The control board interprets the microcode, including performing bank switching, so that's where we started our investigation.

Bank switching in the Alto happens when the SWMODE micro-instruction is executed. The destination bank is selected following complex rules that depend on the hardware configuration and the current bank. Rather than implement these rules with a complex hardware circuit, the Alto designers used the short cut of encoding all the logic into a 256x4 PROM chip. (This also has the advantage that a different hardware configuration can be supported simply by replacing the PROM.) The schematic below shows the PROM (left) generating the bank select signals (yellow), which pass through various chips to create the current bank select signals (right), which are fed back into the PROM for the next cycle.

This schematic shows the Xerox Alto's bank switching circuit, allowing microcode to run from ROM or RAM banks.
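To make the PROM-with-feedback idea concrete, here is a toy sketch of bank selection as a table lookup. The table contents, the input bit layout and the bank numbers are invented for illustration; the real PROM's contents and pin assignments are not given in this post. The sketch only shows the structure (current bank plus SWMODE in, next bank out, fed back for the next cycle) and what happens when the feedback path is stuck low.

```python
# Toy model of bank selection via a lookup table with feedback.
# ASSUMPTION: the address layout and table contents are invented for
# illustration; they are not the contents of the Alto's real PROM.

ROM0, RAM0 = 0, 1  # hypothetical bank numbers


def build_prom():
    """Build a 256-entry table of 4-bit outputs (the shape of a 256x4 PROM)."""
    prom = [0] * 256
    for addr in range(256):
        swmode = addr & 1           # hypothetical: bit 0 = SWMODE executing
        current = (addr >> 1) & 1   # hypothetical: bit 1 = current bank
        # Hypothetical rule: SWMODE toggles between ROM0 and RAM0.
        prom[addr] = (current ^ 1) if swmode else current
    return prom


def run(prom, swmode_per_cycle, feedback_stuck_low=False):
    """Return the bank the PROM sees as 'current' at each step."""
    bank = ROM0
    seen = []
    for swmode in swmode_per_cycle:
        seen.append(bank)
        next_bank = prom[(bank << 1) | swmode] & 0xF
        # The feedback path carries the new bank back to the PROM inputs.
        # A short to ground on that path forces the value to 0.
        bank = 0 if feedback_stuck_low else next_bank
    seen.append(bank)
    return seen


prom = build_prom()
print(run(prom, [0, 1, 0, 0]))                           # healthy feedback
print(run(prom, [0, 1, 0, 0], feedback_stuck_low=True))  # shorted feedback
```

With healthy feedback the model switches to the RAM bank after the SWMODE cycle and stays there; with the feedback forced low it never leaves the ROM bank, which is the same symptom the logic analyzer showed.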

We connected logic analyzer probes so we could trace each chip in the bank select circuit. The PROM correctly generated the RAM bank signals when the SWMODE micro-instruction executed, but in the next step its inputs had reverted to the ROM bank for some reason. This showed the PROM worked, so we continued probing through the circuit. Each chip had the proper output until we got to the multiplexer chip that feeds back to the PROM. (This chip, on the right, handles microcode task switching by selecting either the current task's bank, or the new task's bank, which is recorded in a RAM chip.) The input signal to the multiplexer pulsed high for the new bank, but the output stayed low, blocking the bank switch signal. The oscilloscope trace below shows the problem: the input signal (bottom trace) is not passed to the output (middle trace).

A multiplexer IC in the Xerox Alto was failing to pass the bank switch signal from its input (bottom trace) to its output (middle trace).

We found a bad chip on the disk interface board a few weeks ago, so had we located a second bad chip? We pulled out the suspicious chip (a 74S157 multiplexer) and tested it in a breadboard to prove that it was faulty. Surprisingly, it worked just fine. Perhaps the problem only showed up at high frequency? We swapped it with an identical chip on the board and the crash still happened. Clearly there was nothing wrong with the chip. But its output stayed low when it should go high. Why was this?

We thought this 74S157 multiplexer IC from the Xerox Alto was faulty. However, the chip worked fine when tested in a breadboard.

Our next theory was that something was grounding the chip's output signal, forcing the output to remain low. To test this, we disconnected the chip's output pin from the rest of the circuit by bending the pin so it didn't go into the socket. With the output not connected to the circuit, the output went high as expected. (See oscilloscope trace below.) This proved that the chip worked and something else was pulling the signal low. Since the chip's output was connected to the PROM chip, the obvious suspect was the PROM, which might have an input shorted low. We hoped the PROM chip wasn't at fault, since locating a 1970s-era D3601 PROM chip and reprogramming it would be inconvenient. We pulled the PROM chip out of the board and the short to ground remained, demonstrating the PROM chip was not the culprit.

With the multiplexer's output disconnected from the circuit, the input signal (bottom) appears on the output (top) as expected.

We removed the control board from the Alto to examine it for short circuits. On the back of the circuit board, we noticed that two white wires were connected to the multiplexer chip that was causing us problems. (Wires are added to printed circuit boards to fix manufacturing problems, support new features, or support new hardware.) These wires went to the connector that was cabled to the CRAM (control RAM) board shown earlier. With the CRAM board disconnected, the short to ground went away. Thus, the cause of our crashes was these two wires that someone had added to the board! Could we simply cut these wires and have the system work correctly? We figured we should understand why the wires were there, rather than randomly ripping them out. Maybe our control board and CRAM board were incompatible? Maybe these wires were to support the Trident disk drive we aren't using? It was the end of the day by this point, so further investigation will wait until next time.

This is the Xerox Alto control board, one of three boards that make up the CPU. The board has been modified with several white wires which trigger our crashes.

Conclusion

After a bunch of software, microcode and hardware debugging we found that the crashes are due to some wires added to one of the circuit boards. These wires messed up microcode bank switching, causing programs that used custom microcode to crash. Fixing this should be straightforward, but we want to understand the motivation behind these wires. On the whole, the processor is working reliably other than this one issue. Once it is fixed, we can run MADTEST (the microcode test program) to stress-test the processor. If there are no more processor issues, we'll move on to getting the mouse working.

For updates on the restoration, follow me on Twitter at kenshirriff. Thanks to the Living Computer Museum for the extender board and the ContrAlto simulator.

Restoring a vintage Xerox Alto day 8: it boots!

We've been restoring a Xerox Alto from the 1970s for several months, and we finally got it to boot and run some programs! There's still some hardware debugging ahead of us, since the Alto drops into the debugger for many programs, but we're quite happy to see the system running. In this post, I describe our latest debugging session and show some programs running on the Alto.

The Xerox Alto, successfully booted and listing the files on the disk. The diagonal strips are an artifact of photographing the CRT and do not appear on the display.

For background, the Alto was a revolutionary computer designed at Xerox PARC in 1973 to investigate personal computing. It introduced the GUI, Ethernet and laser printers to the world, among other things. Y Combinator received an Alto from computer visionary Alan Kay and I'm helping restore the system, along with Marc Verdiell, Luca Severini, Ron Crane, Carl Claunch and Ed Thelen. For posts on previous restoration days see parts 1, 2, 3, 4, 5, 6, 6 update and 7.

The new boot disk

In an earlier session, we discovered that our boot disk had been used for drive testing decades earlier and was filled with random garbage, making it impossible to boot from the disk. Fortunately, the Living Computer Museum in Seattle sent us a new boot disk, loaded with diagnostic software. I received a vintage Digital RK05K-11 disk cartridge box:

Box for a vintage Digital RK05K-11 disk cartridge

Inside the box was the 14" disk. Despite its size, the disk cartridge only holds 2.5 megabytes, a tangible indication of the exponential improvements in disk density since the 1970s. We loaded the disk into the Alto's Diablo drive, waited a minute for the disk to spin up to speed and the heads to load, and Ed eagerly pressed the reset button. Would we be lucky and successfully boot the Alto? After all the anticipation, nothing happened.

An Alto diagnostic boot disk, sent to us by the Living Computer Museum in Seattle.

Why won't the system boot?

Since we had successfully loaded a disk sector (of random data) earlier, we knew that the system was working end-to-end, from the drive through the disk interface card and into the processor boards and memory. One possibility was that the alignment was different between our drive and the Living Computer Museum's drive, corrupting the data. Needing to hand-align our drive would be very difficult, so we hoped that wasn't the problem.

To see the words as they came off the disk, we added more logic analyzer probes to the Alto's backplane to trace the processor bus. At this point, the backplane is liberally decorated with probes, allowing us to monitor the buses and microcode execution in detail.

We added more probes to the Alto's backplane to monitor the processor bus. The probes are connected to a vintage Agilent logic analyzer.

Using the logic analyzer, we could step through the microcode to see each disk word getting loaded into memory, but the data didn't match the boot sector we expected. The Alto stores each sector on disk as a 2-word header (holding the disk address), an 8-word label (holding a next block pointer), and the 256-word data block. Although the data seemed wrong, more interesting was the octal value 000100 in the header coming from disk. (The Alto uses octal, causing us no end of confusion.) This header value corresponds to a disk address of cylinder 8, not the boot sector 0. Could we be reading the wrong sector?
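As a rough check on that header value, here is a small sketch that decodes a disk address word. It assumes the field layout from the Alto hardware documentation (a 4-bit sector, a 9-bit cylinder, then head, disk and restore bits, counting from the most significant bit). That layout is not stated in this post, so treat it as an assumption; the function name is made up.

```python
# Sketch: decode an Alto disk address word.
# ASSUMPTION: field layout (most significant bit first) is
# sector:4, cylinder:9, head:1, disk:1, restore:1. This layout is
# assumed from the Alto hardware manual and is not given in this post.


def decode_disk_address(word):
    """Split a 16-bit disk address word into its assumed fields."""
    return {
        "sector": (word >> 12) & 0xF,
        "cylinder": (word >> 3) & 0x1FF,
        "head": (word >> 2) & 1,
        "disk": (word >> 1) & 1,
        "restore": word & 1,
    }


header = 0o000100  # the octal value seen in the header coming from disk
fields = decode_disk_address(header)
print(fields)  # the cylinder field comes out as 8

# Words per sector as described above: header + label + data.
WORDS_PER_SECTOR = 2 + 8 + 256
print(WORDS_PER_SECTOR)
```

Under that assumed layout, octal 000100 has a single bit set inside the cylinder field, which is why it reads as cylinder 8 rather than as the decimal value 64.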

By removing the cover from the Diablo drive, you can watch it seek. Unlike modern hard drives, the Alto's disk isn't sealed so you can see the disk surface and head when the disk is loaded in the drive.

Looking inside the Diablo disk drive, you can see the head moving over the disk's surface as disk seeks take place. The green dial on the right rotates to indicate the current track. These seeks are from an earlier test, not from boot.

Watching the drive as the Alto attempted to boot, we saw the disk arm seek, which it shouldn't have done to read from boot sector 0. The seek dial rotated to cylinder 8—as the logic analyzer suggested, the Alto was trying to boot from the wrong disk cylinder, which clearly wouldn't work.

Inside the Diablo disk drive, the turquoise cylinder indicator shows the drive has seeked to cylinder 8.

Since the drive seeked correctly last week, why was it trying to read from the wrong cylinder today? Were we suffering another chip failure on the disk interface card? Had something malfunctioned in the drive? We pored over the disk interface schematics and suspected a problem with the nine cylinder select lines between the Alto and the drive. In particular, a malfunction in the CYL(5) line could set the cylinder to 8, causing the seek we saw. (Bits on the Alto are inconveniently numbered backwards, so cylinder bit 5 corresponds to the value 8.)
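The backwards bit numbering is easy to get wrong, so here is a minimal sketch of the conversion. It assumes the nine cylinder select lines are numbered CYL(0) through CYL(8), with CYL(0) as the most significant; the function name is hypothetical.

```python
# Sketch: convert an Alto-style bit number (bit 0 = most significant)
# into its numeric weight.
# ASSUMPTION: the nine cylinder lines are numbered CYL(0)..CYL(8).


def msb0_bit_value(bit, width):
    """Weight of bit number `bit` in a `width`-bit field, MSB numbered 0."""
    return 1 << (width - 1 - bit)


print(msb0_bit_value(5, 9))  # CYL(5) in a 9-bit field -> 8
print(msb0_bit_value(8, 9))  # the last line is the least significant -> 1
```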

We noticed a scratch in the 40-conductor ribbon cable between the Alto and the disk drive, exposing a wire. Could this be the cause of our problems? We carefully checked continuity and found no problems with the cable despite the scratch, so we hooked the cable back up along with an oscilloscope to monitor the offending signal, so we could debug the problem.

Running the Alto

We tried booting the Alto again, watching for the seek problem. This time the disk unexpectedly performed multiple seeks. And then the boot screen appeared on the Alto. We had a running system!

The Xerox Alto screen after booting, waiting for a command.

A few months ago, I had used the Salto simulator to see how the Alto worked. But now, facing a working system, I couldn't remember the commands. To see the files, was it LIST, or DIR? No. How about HELP? No good. After a minute or two, I remembered that a simple question mark was the command to list the disk, and I got a list of files. The system was working well enough to read a directory.

I tried running the WYSIWYG text editor Bravo and the mouse-based drawing program Draw, but they crashed, dropping the system into the debugger, Swat. Clearly some hardware problems remain and our debugging adventure is not over yet.

The Alto's debugger is called Swat, and runs if there is an error.

Some programs ran successfully. The CRT test program drew grids on the bitmapped screen. The CRT is a bit fuzzy in the upper left, but the quality is surprisingly good considering that this tube was almost too dim to see a few months ago. Apparently running the tube a while restored it by burning contaminants off the cathode (or some mysterious tube-era phenomenon like that).

The Xerox Alto running a CRT test program. Antique mechanical calculators are in the background.

The Ethernet diagnostic program ran and showed off the mouse-based GUI. I'm developing a BeagleBone-based Ethernet simulator for the Alto, so this program will be very helpful. We don't have a gridded optical mouse pad, so the mouse didn't work and we couldn't click anything.

The Alto's Ethernet Diagnostic Program uses a mouse-based GUI.

The keyboard test program graphically displays the keyboard and shows each key as it is pressed. We used this to verify the keys all work.

The Alto running the keyboard test program. Antique calculators are in the background.

A closeup of the Alto's keyboard test program. It highlights keys when they are pressed.

Conclusion

It was an exciting day, with the Alto finally booting successfully. A disk seek problem blocked us for a while, but then the problem mysteriously disappeared. We ran a bunch of test programs from the disk. About half of them ran successfully, and half crashed into the debugger. There may be a malfunction in the processor that we need to track down. Or perhaps we're getting memory errors; the parity errors we saw earlier could have returned. In any case, we have some more debugging ahead of us, but it's exciting to see the system finally running. Hopefully we will soon be playing Alto Trek and Maze War.

For updates on the restoration, follow me on Twitter at kenshirriff.

Thanks to Josh Dersch and the Living Computer Museum for their debugging help and sending out the boot disk.

Restoring YC's Xerox Alto: how our boot disk was trashed with random data

In the previous Xerox Alto restoration session, we got the disk working, but the system didn't boot. After much investigation, I discovered the explanation for the boot failure: the disk has been overwritten with random data! This article describes my journey through the Alto microcode to determine what happened.

Inserting a disk into the Xerox Alto's disk drive. The Alto's video display is visible at the back.

For background, the Alto was a revolutionary computer designed at Xerox PARC in 1973 to investigate personal computing. It introduced the GUI, Ethernet and laser printers to the world, among other things. Y Combinator received an Alto from computer visionary Alan Kay and I'm helping restore it, along with Marc Verdiell, Luca Severini, Ron Crane, Carl Claunch and Ed Thelen (from the IBM 1401 restoration team). For posts on previous restoration days see 1, 2, 3, 4, 5 and 6.

Debugging the boot failure

Last session, after fixing a broken 7414 TTL chip on the disk interface board, we could fetch a block from disk but the Alto failed to boot. We used a logic analyzer to trace the microcode instructions and the ALU bus contents. Josh Dersch from the Living Computer Museum studied the traces and found that the boot program was executing a few instructions (jump, add, load), and then seemed to go off the rails. But it turns out things were more messed up than that.

I made a microcode trace browser to help figure out what was going on. With this program, I can step through an execution trace one micro-instruction at a time and see the corresponding source code line. (Click the image below for the live trace browser.) First, I examined the KWD (disk word task), which executes for each word from disk, and copies that word to memory. I verified that the disk read was working as expected. The second task of interest is the NOVEM (Nova emulator task), which runs a program. In our case, it runs the boot program as soon as it is loaded from disk. By examining this task, we can figure out what is going wrong with the boot process.

Xerox Alto microcode trace viewer. With the viewer, you can step through the execution trace collected by the logic analyzer and see each source code line as it is executed. The buttons on the right indicate which microcode task is running at each step.

By studying the disk read microcode (KWD) closely, I was able to extract each word in the disk sector from the logic analyzer trace. This was very difficult for many reasons. For example, we logged the ALU bus which doesn't have the words from disk. I had to figure out the disk contents by reversing the checksum computation, which was on the ALU bus. Another problem was the Alto stores sectors on disk backwards. But eventually I extracted the contents of the boot sector, as read into the Alto:

16a5 2d4a 5a94 b528 14db 29b6 536c a6d8
333b 6676 ccec e753 b02d 1ed1 3da2 7b44
...

I hand-disassembled these words into Data General Nova assembler code and discovered a few things. First, the first few instructions matched Josh's interpretation, so the CPU and the emulator task seemed to be working correctly. Second, the instructions didn't make any sense as code, and some words weren't even instructions, which explained why the boot rapidly fell apart. Third, and most puzzling, the instructions were nothing like what the Alto boot code was supposed to be.

Backplane of the Xerox Alto wired with logic analyzer probes. These probes monitor the executing micro-instructions and the contents of the ALU bus.

The boot block seemed to contain random junk. The problem wasn't flaky hardware generating bad data, because the block checksum validated correctly. This wasn't the drive returning the wrong sector, because the sector header was correct. The sector didn't contain instructions, it wasn't ASCII, and it didn't look like a sensible file format. As I studied the sector contents more, I wondered if the data was literally random. I made a histogram of how many times each byte value occurred, and it was pretty much uniform. (In comparison, archived Alto disk sectors showed very non-uniform distributions.) But why would the boot block have been overwritten with (pseudo-) random data?
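A byte histogram like the one described is only a few lines of code. The sketch below is an illustration, not the script actually used; the `spread` flatness measure and the function names are made up. Only the first 16 words of the sector are shown in this post, which is far too few for a meaningful histogram, so they are used here just to exercise the code.

```python
from collections import Counter


def byte_histogram(words):
    """Count how often each byte value occurs in a list of 16-bit words."""
    counts = Counter()
    for w in words:
        counts[(w >> 8) & 0xFF] += 1  # high byte
        counts[w & 0xFF] += 1         # low byte
    return counts


def spread(counts):
    """Crude flatness measure: max count minus min count over all 256 values."""
    per_value = [counts.get(b, 0) for b in range(256)]
    return max(per_value) - min(per_value)


# The first 16 words of the boot sector as listed above.
sample = [
    0x16a5, 0x2d4a, 0x5a94, 0xb528, 0x14db, 0x29b6, 0x536c, 0xa6d8,
    0x333b, 0x6676, 0xccec, 0xe753, 0xb02d, 0x1ed1, 0x3da2, 0x7b44,
]
hist = byte_histogram(sample)
print(sum(hist.values()))   # total number of bytes counted
print(hist.most_common(3))  # the most frequent byte values
```

On a full 256-word sector, a small `spread` relative to the average count suggests a roughly uniform distribution, while text or code tends to pile up on a handful of byte values.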

Josh mentioned DiEx (Diablo disk exerciser), a utility program to diagnose problems with the Alto's Diablo disk drive, and suggested that it could have wiped the disk. I found the DiEx source code in the Computer History Museum's Alto archive, and sure enough, it has a feature to write random data to the disk (and then verify it).

Screenshot of the Diablo Disk Exerciser (DiEx) running on a Xerox Alto simulator. Note the early mouse-based GUI; clicking on an entry changes the value. Image courtesy of Nathan Lineback, toastytech.com.

I could believe someone had inconveniently wiped our disk with the DiEx utility, but I still had nagging doubts that maybe we were seeing a hardware issue. Could I prove that DiEx was responsible? All I had to do was show that the disk data wasn't arbitrary, but came from DiEx.

Generating random numbers on the Alto

I found the source code for RANDOM.ASM, the Alto's random number code, in the Computer History Museum's Alto archive. This algorithm generates 16-bit random numbers with the recurrence formula: "x[n] = (x[n-33] + x[n-13]) mod 2^16". (Note that these are very bad random numbers cryptographically, since once you have 33 numbers in the sequence you can generate them all.) I wanted to see if the data we read from disk was generated from this function, so I coded up the algorithm. This was somewhat difficult as the original was written in Nova assembler code. The results didn't match the disk data, no matter what I tried. Finally, I realized that I could just use a brute force solution and ignore the details of the algorithm. I picked random pairs of values in the data and checked if their sum appeared in the data. If the data came from any sort of recurrence, I would get a bunch of matches, but I didn't. I concluded that the disk data wasn't generated from this random number algorithm.
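The recurrence and the pair-sum check can be sketched as follows. This is an illustration under stated assumptions: the seed words come from Python's `random` module rather than from RANDOM.ASM, and the check tries every pair instead of a random sample of pairs, which is a simpler variant of the test described above.

```python
import random
from itertools import combinations

MASK = 0xFFFF


def additive_sequence(seed_words, count):
    """Extend 33 seed words with x[n] = (x[n-33] + x[n-13]) mod 2**16."""
    xs = list(seed_words)
    if len(xs) != 33:
        raise ValueError("the recurrence needs exactly 33 seed words")
    while len(xs) < count:
        xs.append((xs[-33] + xs[-13]) & MASK)
    return xs


def pair_sum_hits(words):
    """Count pairs of words whose 16-bit sum also appears in the data."""
    present = set(words)
    return sum(1 for a, b in combinations(words, 2)
               if (a + b) & MASK in present)


rng = random.Random(0)  # arbitrary seed, only to make the demo repeatable
seed = [rng.randrange(1 << 16) for _ in range(33)]
generated = additive_sequence(seed, 256)
unrelated = [rng.randrange(1 << 16) for _ in range(256)]

print(pair_sum_hits(generated))  # at least 256 - 33 hits by construction
print(pair_sum_hits(unrelated))  # only chance hits
```

Data produced by the recurrence is guaranteed one hit per generated word, on top of the chance hits that any 256 random words would show, so a count near the chance level argues against this kind of generator.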

However, on closer examination I noticed that the RANDOM.ASM function signature didn't match the DiEx code, so it probably wasn't the right function. After more searching I found TriexML.asm, another Alto random number function. To generate a random 16-bit word, this algorithm simply shifts the previous value one bit to the left. If there is an overflow, the result is xor'd with the number 077213. (It would be hard to come up with a cryptographically worse random number generator—from one number you can generate the whole sequence—but the algorithm is very fast.)

To check the disk contents against this algorithm, I skipped the careful implementation and went straight to brute force. To see if any shift-and-xor algorithm would explain our data, I shifted each word from the disk sector and xor'd it with the next one. In each case, I got either 0 or octal 077213, matching the algorithm. Starting the algorithm with 012345 (the seed value in the code) eventually generates the exact sector of data we read, proving this algorithm generated the random data we saw on the disk.
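Here is a minimal sketch of that shift-and-xor check, run against the 16 sector words listed earlier. The constant 077213 and the seed 012345 are from the description above; the function names and the exact overflow handling are my reading of that description, not code taken from TriexML.asm.

```python
POLY = 0o077213  # xor constant described above
SEED = 0o012345  # seed value mentioned above (not used in the check itself)


def next_word(x):
    """Shift left one bit; on overflow, xor the 16-bit result with POLY."""
    shifted = x << 1
    if shifted & 0x10000:
        return (shifted & 0xFFFF) ^ POLY
    return shifted


def residues(words):
    """Shift each word and xor it with its successor.

    For data from this generator, every result is either 0 or POLY.
    """
    return {((a << 1) & 0xFFFF) ^ b for a, b in zip(words, words[1:])}


# The first 16 words of the boot sector as listed above.
sector_start = [
    0x16a5, 0x2d4a, 0x5a94, 0xb528, 0x14db, 0x29b6, 0x536c, 0xa6d8,
    0x333b, 0x6676, 0xccec, 0xe753, 0xb02d, 0x1ed1, 0x3da2, 0x7b44,
]

print(residues(sector_start) <= {0, POLY})  # True

# Regenerate the same words starting from the first one.
regenerated = [sector_start[0]]
for _ in range(len(sector_start) - 1):
    regenerated.append(next_word(regenerated[-1]))
print(regenerated == sector_start)  # True
```

The residue check needs no knowledge of the seed, which is what makes it a convenient brute-force test: it only asks whether each word follows from the previous one.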

A few of the old Xerox Alto disks in Xerox PARC's collection. Hopefully they haven't been overwritten with junk.

Thus, someone had clobbered our disk (probably decades ago) while testing the drive with DiEx. Since we couldn't boot off this disk, we'd need a new boot disk. Xerox PARC has dozens of old Alto disks lying around and they offered some of them to us. But the Living Computer Museum offered to send us a working Alto disk, rather than risk damage to the potentially-interesting contents of an old PARC disk, so we'll use the LCM disk instead.

Conclusion

Last repair session, we fixed a failed 7414 inverter chip on the disk interface board. With that fixed, we could read the disk but boot still failed. After careful investigation of the microcode and traces, I discovered that our disk had been overwritten with random data, making it impossible to boot from it. In one way this is a good result, since it means our boot wasn't failing because of a hardware problem.

When we get a new Alto disk, we'll try booting again. I'm moderately optimistic that the system will come up successfully, but there could be more hardware problems waiting for us. For updates on the restoration, follow kenshirriff on Twitter.

Thanks to Josh Dersch and the Living Computer Museum for their debugging help. Thanks to Tim Curley and Xerox PARC for supplying additional disks.

Restoring YCombinator's Xerox Alto day 6: Fixed a chip, data read from disk

In today's Xerox Alto restoration session we investigated why the disk drive isn't working and found a failed chip. With this chip repaired, we were able to read a block from disk, although the system still doesn't boot. (In previous episodes, we fixed the power supply, got the CRT display working, cleaned up the disk drive and hooked up a logic analyzer: days 1, 2, 3, 4 and 5.)

Our test setup for the Xerox Alto. The Alto computer itself is the metal cabinet in the center with the visible circuit boards. On the left is a vintage HP line printer, with the logic analyzer behind it. The video display for the Alto is visible on the right, behind the oscilloscope.

The Alto was a revolutionary computer, designed at Xerox PARC in 1973 to investigate personal computing. It introduced the GUI, Ethernet and laser printers to the world, among other things. Y Combinator received an Alto from computer visionary Alan Kay and I'm helping restore the system, along with Marc Verdiell, Luca Severini, Ron Crane, Carl Claunch and Ed Thelen (from the IBM 1401 restoration team). Marc's video of this restoration session is below.

The missing disk sector task

In the Alto, like most modern computers, each machine instruction is implemented in an even more primitive form of code called microcode. But unlike most computers, the Alto also implements some of its low-level software in microcode. Part of the Alto's design philosophy was to use software (i.e. microcode) instead of hardware where possible. For instance, a microcode sector task processes each disk sector and a word task stores each word of data as it arrives from the disk drive; most computers do this with DMA hardware.

Last week we hooked a logic analyzer to the Alto to trace the executing microcode and found the disk sector task was failing to run. Each track on the Alto's hard disk is divided into 12 sectors, with 12 slots in the hub to indicate the sectors. We verified that the disk drive was detecting these slots and sending the sector pulses every 3.33 milliseconds. The disk sector task is supposed to run for each sector and perform any disk command, but the logic analyzer showed that this task was not running.
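As a quick sanity check on the sector timing, the figures above imply a rotation speed. The sketch below uses only the 12 sectors and the 3.33 millisecond pulse interval from this post; the RPM is derived from them, not measured.

```python
# Sector timing for a 12-sector disk, using the figures quoted in the post.
SECTORS_PER_TRACK = 12
SECTOR_PERIOD_MS = 3.33  # interval between sector pulses

# One revolution passes all 12 sector slots.
revolution_ms = SECTORS_PER_TRACK * SECTOR_PERIOD_MS

# Convert the revolution time to revolutions per minute.
rpm = 60_000 / revolution_ms

print(f"revolution: {revolution_ms:.1f} ms, about {rpm:.0f} RPM")
```

In other words, one revolution takes roughly 40 milliseconds, so the sector task gets about 3.33 milliseconds to handle each sector.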

The hard disk pack for the Xerox Alto has 12 sectors. Slots cut into the disk hub trigger a signal for each sector. Four of the sector slots are labeled in the photo.

Why was the sector task not running? The disk interface board provides a signal to indicate when the sector task should run (WAKEST), but we found it was not being activated even though the disk drive was providing sector pulses to the disk interface board. Looking at the disk interface board schematic, the sector pulse circuit is fairly simple: just a few flip flops. (You don't need to understand the schematic below. The key point is the sector pulse comes in on the left, goes through a few chips, and the wakeup signal comes out on the right.) I've heard that old TTL flip flops fail regularly, so I figured one of the flip flop chips had failed. We decided to hook up an oscilloscope and see where things were going wrong, but one problem stood in our way.

Schematic from the Xerox Alto's disk controller card. This circuit processes sector pulses from the disk drive and generates signals to wake up the microcode sector task.

The extender card

The Alto consists of 13 circuit cards plugged into a wire-wrapped backplane, making them inaccessible to probing. Fortunately, the Living Computer Museum gave us an extender card, a board that goes between an Alto board and the backplane, physically extending the Alto board out of the cabinet where it can be diagnosed. Last week, we used the extender card to probe signals on the CPU control board. But no matter how hard I tried, I couldn't get the extender board to plug into the disk interface board's slot. Marc noticed that the board was hitting something, and we realized that the disk interface board had a notch on the right, allowing the board to clear a bar that was in the way. The extender board, like most of the Alto boards, lacked this notch. A bit more investigation revealed that memory boards had a notch, but on the left.

Why did some boards have notches? Most of the boards are powered with 5 volts. The memory boards also require -5 volts and +12 volts for the 4116 DRAM chips. The I/O boards (Ethernet and disk) have +/- 15 volts as well as 5 volts. The Alto backplane was apparently designed so you couldn't plug a board into a slot with the wrong voltages (which would have been catastrophic). Boards with unusual power requirements had a notch that allowed them to fit into slots wired with unusual voltages. The consequence was that we couldn't use the extender board with the disk interface without cutting a notch in it, which we did (see photo below).

Milling a notch into the extender board.

We were worried that by cutting a notch in the extender board and using it in a slot where it wasn't intended we might destroy the computer in a spectacular show of sparks and smoke. The concern was that the extender board doesn't simply pass the 162 lines through, but wires all the ground lines to a ground plane and wires the +5 lines together. If the disk interface card had +15 volts where the extender board expected, say, +5 volts, the extender card would run +15 volts to all the chips and destroy them. We verified the wiring five times to make sure nothing would get shorted, plugged in the extender board, and turned the Alto on with some trepidation. Fortunately our calculations were correct and nothing blew up.

Debugging the disk interface

The photo below shows the disk interface card extended out of the Alto cabinet, with some oscilloscope probes attached to the flip flop chips. (The ribbon cable attached to the board connects to the disk drive, while the ribbon cable hanging above the board allows us to probe microcode signals with the logic analyzer.) Strangely, we didn't see any signals either going into the flip flops or coming out. We checked that the sector pulses were showing up in the logic analyzer, and on the connector from the disk drive, but the flip flops were getting nothing. Eventually we turned our attention to the inverter chip (see earlier schematic). We saw the sector signal going into the inverter, but not coming out. Could this simple chip be causing the problems?

Debugging the disk interface card in the Xerox Alto.

The 7414 TTL chip contains 6 inverters, which turn a 1 input into a 0 output and vice versa. We pulled the chip out of the disk interface board and tested it with a simple LED circuit (see photo below). Five of the six inverters worked fine, but one of the inverters had entirely failed. The chip is a bit unusual since it uses a Schmitt trigger—a circuit that cleans up noisy signals (such as the sector pulses that traveled over a long cable from the disk drive)—so we couldn't get a replacement at Fry's or Radio Shack. Were we stuck for the day?

Testing the 7414 inverter chip from the Xerox Alto's disk interface card. One inverter was burnt out, preventing the disk from working.
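To illustrate why a Schmitt trigger helps with noisy sector pulses, here is a minimal sketch of one inverting Schmitt-trigger gate. The two threshold voltages are assumed, illustrative values (roughly what a 7414 datasheet lists as typical), and the input samples are made up.

```python
def schmitt_inverter(voltages, v_rising=1.7, v_falling=0.9):
    """Model one inverting Schmitt-trigger gate; thresholds are assumed values."""
    outputs = []
    input_is_high = False  # state remembered between samples
    for v in voltages:
        if v >= v_rising:
            input_is_high = True
        elif v <= v_falling:
            input_is_high = False
        # Between the two thresholds the previous state is kept (hysteresis).
        outputs.append(0 if input_is_high else 1)
    return outputs

# Made-up noisy pulse: the wobble between 0.9 V and 1.7 V is ignored.
noisy_pulse = [0.2, 1.2, 1.8, 1.3, 1.1, 2.5, 1.0, 0.8, 1.2, 0.3]
print(schmitt_inverter(noisy_pulse))
```

An ordinary inverter with a single threshold could toggle several times on the same input, which is why an ordinary inverter chip was not a suitable substitute.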

Fortunately we could work around the faulty chip. Carl studied the schematics and discovered that one of the good inverters on the chip was unused. We rewired the chip to replace the bad inverter with the unused good inverter by using an ugly but effective "dead bug" hack. We bent out the pins from the good inverter and attached wires. We cut off the pins from the bad inverter. Finally, we stuck the wires into the socket along with the IC, so the good inverter was wired in place of the bad inverter.

We re-wired a 7414 inverter chip. An unused inverter replaced the failed inverter.

We booted the Alto and found that our chip hack actually worked: the sector pulses got through the inverter, were processed by the flip flops, and triggered the sector task as we hoped. The sector task read the disk command from memory and sent it to the disk drive. The disk drive read the desired sector and started sending bits back. For each word, the disk word task read the word from the disk interface and stored it in memory. In summary, we were now reading data from disk!

Reading data from disk was a big milestone, since most of the system needed to be working properly for this to happen. Unfortunately the Alto didn't boot up, and we'll need to figure out where things went wrong. Is the boot block not running correctly? Is the read data corrupted? Is the disk returning an error at some point? Is our disk not a boot disk? Strangely, there was no sign of the parity errors we kept seeing last week.

The timeline diagram below shows task switching in the Alto over an interval of 700 microseconds. You can see that the microcode is constantly switching between tasks. Today's accomplishment can be seen in the periodic execution of disk word task (KWD) at the bottom of the image; this task runs about every 9 microseconds when each word comes from the disk drive. The disk sector task (KSEC) runs at the start of the next sector (at which time the word task stops). Other tasks are the memory refresh task (MRT) and cursor task (CURT) that run periodically. (You can see where the higher-priority MRT task interrupted the KSEC task.) The lowest priority task is the Nova emulator (NOVEM), which runs program code when nothing else is happening. The numbers at the bottom show the micro-instruction count since boot; at this point we are 14.8 milliseconds into the boot process. I generated the diagram below by processing the logic analyzer output to show each running task. An interactive version is here, allowing zoom and pan with the mouse.

Timeline showing task switching on the Xerox Alto. These are microcode tasks switched by hardware, not operating system level processes or threads.
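The processing behind a timeline like this can be sketched as collapsing the per-micro-instruction task numbers into runs. This is not the actual script used for the diagram; the input format (one task number per captured micro-instruction) and the sample data are assumptions for illustration.

```python
from itertools import groupby

# Task numbers and names as used in these posts (only a few are listed here).
TASK_NAMES = {0: "NOVEM", 4: "KSEC", 8: "MRT", 10: "CURT"}

def task_runs(task_per_cycle):
    """Collapse a list of task numbers into (name, start, length) runs."""
    runs = []
    position = 0
    for task, group in groupby(task_per_cycle):
        length = len(list(group))
        runs.append((TASK_NAMES.get(task, f"task{task}"), position, length))
        position += length
    return runs

# Made-up samples: MRT interrupts KSEC, which then resumes.
sample = [0, 0, 0, 4, 4, 8, 8, 4, 0, 0, 10]
for name, start, length in task_runs(sample):
    print(f"{name:6} starts at {start:3}, runs {length} micro-instructions")
```

Each run becomes one bar in the timeline, positioned by its starting micro-instruction count.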

Conclusion

In today's repair session, we found a failed 7414 inverter chip that was preventing disk operation. By working around that issue, we could finally read from disk, but boot is still failing for unknown reasons. Nonetheless, today's session got us much closer to a working system. We'll need to dig through the logic analyzer output to figure out where the boot process is breaking down.

Restoring YCombinator's Xerox Alto day 5: Microcode tracing with a logic analyzer

In today's Xerox Alto restoration session we investigated why the system doesn't boot. We found a broken wire, hooked up a logic analyzer, generated a cloud of smoke, and discovered that memory problems are preventing the computer from booting. (In previous episodes, we fixed the power supply, got the CRT display working and cleaned up the disk drive: days 1, 2, 3 and 4.)

The Alto was a revolutionary computer, designed at Xerox PARC in 1973 to investigate personal computing. It introduced the GUI, Ethernet and laser printers to the world, among other things. Y Combinator received an Alto from computer visionary Alan Kay and I'm helping restore the system, along with Marc Verdiell, Luca Severini, Ron Crane, Carl Claunch and Ed Thelen (from the IBM 1401 restoration team).

The broken wire

The Xerox Alto is built from 13 circuit boards, crammed with TTL chips. In 1973, minicomputers such as the Alto were built from a whole bunch of simple ICs instead of a primitive microprocessor chip. (People still do this as a retro project.) The Alto's CPU is split across 3 boards: an ALU board, a control board, and a control RAM board. The control board is the focus of today's adventures.

If a circuit board has a design defect or needs changes, it can be modified by attaching new wires to create the necessary connections. The photo below shows the control board with several white modification wires. While examining the control board, we noticed one of the wires had come loose. Could the boot failures be simply due to a broken wire?

Control board from the Xerox Alto, showing a broken wire. The white wires were for a modification, but one wire came loose.

We carefully resoldered the wire and powered up the system. The disk drive slowly came up to speed and the heads lowered onto the disk surface. We pressed the reset button (under the keyboard) to boot. As before, nothing happened and the display remained blank. Fixing the wire had no effect.

After investigation, it appears the rework wires were to support the Trident/Tricon hard disk. In the photo above, note the small edge connector in the upper right, with the white wires connected. The Trident disk controller used this connector, but our (Diablo) disk controller does not. In other words, the broken wire might have caused problems with a different disk drive, but it was irrelevant to us.

Microcode on the Xerox Alto

Some background on the Xerox Alto's architecture will help motivate our day's investigation. The Alto, like most modern computers, is implemented using microcode. Computers are programmed in machine instructions, where each instruction may involve several steps. For instance, a "load" instruction may first compute a memory address by adding an offset to an index register. Then the address is sent to memory. Finally the contents of memory are stored into a register. Instead of hardcoding these steps (as done in the 6502 or Z-80 for instance), modern computers run a sequence of "micro-instructions", where each micro-instruction performs one step of the larger machine instructions. This technique, called microcode, is used by the Xerox Alto.

The Alto uses microcode much more heavily than most computers. The Alto not only uses microcode to implement the instruction set, but implements part of the software in microcode directly. Part of the Alto's design philosophy was to use software (i.e. microcode) instead of hardware where possible. For instance, most video displays pull pixels out of memory and display them on the screen. In the Alto, the processor itself fetches pixels out of memory and passes them to the video hardware. Similarly, most disk interfaces transfer data between memory and the disk drive. But in the Alto, the processor moves each data word to/from memory itself. The code to perform these tasks is written in microcode.

To perform all these low-level activities, the Alto hardware manages 16 different tasks, listed below. High-priority tasks (such as handling high-speed data from the disk) can take over from low-priority tasks, such as handling the display cursor. The lowest-priority task is the "emulator", the task that executes program instructions. (In a normal computer, the emulator task is the only thing microcode is doing.) Remember, these tasks are not threads or processes handled by the operating system. These are microcode tasks, below the operating system and scheduled directly by the hardware.

Task  Name      Description
0     Emulator  Lowest priority.
1     -         unused
2     -         unused
3     -         unused
4     KSEC      Disk sector task
5     -         unused
6     -         unused
7     ETHER     Ethernet task
8     MRT       Memory refresh task. Wakeup every 38.08 microseconds.
9     DWT       Display word task
10    CURT      Cursor task
11    DHT       Display horizontal task
12    DVT       Display vertical task. Wakeup every 16.666 milliseconds.
13    PART      Parity task. Wakeup generated by parity error.
14    KWD       Disk word task
15    -         unused

Last episode, we found that the processor was running the various tasks, but never tried to access the disk. System boot is started by the emulator task, which stores a disk command in memory. The disk sector task (KSEC) periodically checks if there are any disk commands to perform. Thus, it seemed like something was going wrong in either the emulator task (setting up the disk request), or the disk sector task (performing the disk request). To figure out exactly what was happening, we needed to hook up a logic analyzer.

The logic analyzer

A logic analyzer is a piece of test equipment a bit like an oscilloscope, except instead of measuring voltages, it just measures 0's or 1's. A logic analyzer also has dozens of inputs, allowing many signals to be analyzed at once. By using a logic analyzer, we can log every micro-instruction the processor runs, track each task, and even record every memory access.

Most of the signals of interest are available on the Alto's backplane, which connects all the circuit cards. Since the backplane is wire-wrapped, it consists of pins that conveniently fit the logic analyzer probes. For each signal, you need to find the right card, and then count the pins until you find the right pin to attach the probe. This setup is very tedious, but Marc patiently connected all the probes, while Carl entered the configuration into the logic analyzer.

The backplane of the Xerox Alto, with probes from the logic analyzer attached to trace microcode execution. Note the thick power wires on the left.

Unfortunately, a few important signals (the addresses of the micro-instructions) were not available on the backplane, and we needed to attach probes to one of the PROM chips that hold the microcode. Fortunately, the Living Computer Museum in Seattle gave us an extender card; by plugging the extender card into the backplane and the circuit board into the extender card, the board was accessible and we could connect the probes.

Probes from the logic analyzer hooked up to the Xerox Alto. By plugging the control board into an extension board, probes can be attached to it.

Hours later, with all the probes connected and the configuration programmed into the logic analyzer, we were ready to power up the system and collect data.

Running the logic analyzer

"Smoke! Stop! Shut it off!"

As soon as we flipped the power switch, smoke poured out of the backplane. Had we destroyed this rare computing artifact? What had gone wrong? When something starts smoking, it's usually pretty obvious where the problem is. In our case, one of the ground wires from the logic analyzer pod had melted, turning its insulation into smoke. A bit of discussion followed: "Pin 3 is ground, right?" "No, pin 9 is ground, pin 3 is 5 volts." "Oops." It turns out that when you short +5 and ground, a probe wire is no match for a 60 amp power supply. Fortunately, this wire was the only casualty of the mishap.

This logic probe wire melted when we accidentally connected +5 volts and ground with it.

With this problem fixed, we were able to get a useful trace from the logic analyzer. The trace showed that the Alto started off with the emulator/boot task. After just four instructions, execution switched to the disk word task, which was rapidly interrupted by the parity error task. When that task finished, execution went back to the disk word task, which was interrupted a few instructions later by the display vertical task. The disk word task was able to run a few more instructions before the display horizontal task ran, followed by the cursor task.

The vintage Agilent 1670G logic analyzer that we connected to the Xerox Alto. The screen shows the start of the Alto's boot sequence.

It's rather amazing how much task switching is going on in the Alto, with low-priority tasks only getting a few instructions executed before being interrupted by a higher-priority task. Looking at the trace made me realize how much overhead these tasks have. In our case, the emulator task is running the boot code, so progress towards boot requires looking at hundreds of instructions in the logic analyzer.

The key thing we noticed in the traces is the parity error task ran right near the start, indicating an error in memory. This would explain why the system doesn't boot up. We ran a few more boot cycles through the logic analyzer. The specific order of tasks varied each time, as you'd expect since they are triggered asynchronously from hardware events. But we kept seeing the parity errors.

The Alto's memory system

The Alto was built in the early days of semiconductor memory, when RAM chips were expensive and unreliable. The original Alto module used Intel's 1103 memory chips, the first commercially available DRAM chips, each holding just 1 kilobit. To provide 128 kilobytes of memory, the Alto I used 16 boards crammed full of chips. (If you're reading this on a computer with 4 gigabytes of memory, think about how much memory capacity has improved since the 1970s.)

We have the later Alto II XM (extended memory) system, which used more advanced 16 kilobit chips to fit 512 kilobytes of storage onto 4 boards. Each memory board stored a 10 bit chunk—why 10 bits? Because memory chips were unreliable, the Alto used error correction. To store a 32-bit word pair, 6 bits of Hamming error correction were added, along with a parity bit, and one unused bit. The extra bits allow single-bit errors to be corrected and double-bit errors to be detected. The four memory boards in parallel stored 40 bits at a time—the 32 bit word pair and the extra bits for error correction.
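The bit counts above can be checked with a little arithmetic. The sketch below computes how many Hamming check bits a 32-bit value needs for single-error correction, then adds the parity bit and unused bit mentioned above; the function name is just for illustration.

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r

DATA_BITS = 32                                   # one word pair
check_bits = hamming_check_bits(DATA_BITS)       # Hamming bits needed
stored_bits = DATA_BITS + check_bits + 1 + 1     # plus parity bit and unused bit
bits_per_board = stored_bits // 4                # four memory boards in parallel

print(check_bits, stored_bits, bits_per_board)
```

The 6 check bits can point at any single flipped bit among the 38 data and check bits, and the extra parity bit is what lets a double-bit error be detected rather than miscorrected.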

A 128KB memory card from the Xerox Alto. The board has eighty 4116 DRAM chips, each with 16 kilobits of storage.

In addition to the 4 memory boards, the Alto has three circuit boards to control memory. The "MEAT" (Memory Extension And Terminator) is a trivial board to support four memory banks (the extended memory in the Alto XM). The "AIM" board (Address Interface Module) is a complex board that maps addresses to memory control signals, as well as handling memory-mapped peripherals such as the keyboard, mouse, and printer. Finally, the "DIM" board (Data Interface Module) generates the Hamming error correcting code signals, and performs error detection and correction.

More probing showed that the DIM board was always signaling a parity error. At this point, we're not sure if some of the memory chips are bad or if the complex circuitry on the DIM board is malfunctioning and reporting errors. As you can tell from the above description, the memory system on the Alto is complex. It may be a challenge to debug the memory and find out why we're getting errors.

A look at the microcode

In this section, I'll give a brief view of what the microcode looks like and how it appears in the logic analyzer. Microcode is generally hard to understand because it is at a very low level in the system, below the instruction set and running on the bare hardware. The Alto's microcode seems especially confusing.

Each Alto micro-instruction specifies an ALU operation and two "functions". A function can be something like "read a register" or "send an address to memory". But a function can also change meaning depending on what task is running. For instance, when the Ethernet task is running, a function might mean "do a four-way branch depending on the Ethernet state". But during the display task, the same function could mean "display these pixels on the screen". As a result, you can't figure out what an instruction does unless you know which task it is a part of.

The image below shows a small part of the logic analyzer output (as printed on Marc's vintage HP line printer). Each line corresponds to one executed micro-instruction. The "address" column shows the address of the micro-instruction in the 1K PROM storage. The task field shows which task is running. You can see the task switch midway through execution; 0 is the emulator and 13 is the parity task. Finally, the 32-bit micro-instruction is broken into fields such as RSEL (register select), ALUF (ALU function) and F1 (function 1).

The start of the logic analyzer trace from booting the Xerox Alto. The trace shows us each micro-instruction that was executed.
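Splitting a captured 32-bit micro-instruction into its fields can be sketched as below. The post only names RSEL, ALUF and F1; the full field order and widths used here are an assumption based on the Alto hardware manual's description of the micro-instruction format and should be checked against it before being relied on.

```python
# Assumed field layout, most significant bit first: (name, width in bits).
FIELDS = [("RSEL", 5), ("ALUF", 4), ("BS", 3), ("F1", 4),
          ("F2", 4), ("LOADT", 1), ("LOADL", 1), ("NEXT", 10)]

def decode(word):
    """Split a 32-bit micro-instruction word into named fields."""
    fields = {}
    shift = 32
    for name, width in FIELDS:
        shift -= width  # move down to the start of this field
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields

# Made-up word: RSEL=1, F1=5, NEXT=0o1777 (all ten address bits set).
example = (1 << 27) | (5 << 16) | 0o1777
print(decode(example))
```

The widths add up to 32 bits, with the last 10 bits holding the address of the next micro-instruction in the 1K microcode store.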

Note that the addresses jump around a lot; this is because the microcode isn't stored linearly in the PROM. Every micro-instruction has a "next instruction address" field in the instruction, so you can think of it as a GOTO inside every instruction. To make it worse, this field can be modified by the interface hardware, turning a GOTO into a computed GOTO. To make this work, the assembler shuffles instructions around in memory, so it's hard to figure out what code goes with a particular address. The point of this is that the logic analyzer output shows us every micro-instruction as it executes, but the output is somewhat difficult and tedious to interpret.

Fortunately we have the source code for the microcode, but understanding it is a challenge. The image below shows a small section of the boot code. I won't attempt to explain the microcode in detail, but want to give you a feel for what it is like. Labels (along the left) and jumps to labels are highlighted in blue. Things such as IR, L, and T are registers, and they get assigned values as indicated by the arrows. MAR is the memory address register (giving an address to memory) and MD is memory data, reading or writing the memory value.

A short section of the Xerox Alto's microcode. Labels and jumps are colored blue. Comments are gray.

Figuring out the control flow of the microcode requires detailed understanding of what is happening in the hardware. For example, in the last line above, ":Q0" indicates a jump to label "Q0". However the previous line says "BUS", which means the contents of the data bus are ORed into the address, turning the jump into a conditional jump to Q0, Q1, Q2, etc. depending on the bus value. And "TASK" indicates that a task switch can happen after the next instruction. So matching up the instructions in the logic analyzer output with instructions in the source code is non-trivial.
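The way ORing bits into the next-instruction address creates a multi-way branch can be sketched as follows. The address of Q0 is hypothetical, and exactly which bus bits get ORed in is simplified; the point is only that the targets must sit at addresses differing in the ORed bits, which is why the assembler has to shuffle instructions around.

```python
def next_address(next_field, or_bits=0):
    """OR hardware-supplied bits into the 10-bit next-address field."""
    return (next_field | or_bits) & 0x3FF

Q0 = 0o100  # hypothetical address of label Q0; its low bits must be zero

# With nothing ORed in, ":Q0" is a plain jump to Q0.
print(oct(next_address(Q0)))

# With two bits supplied by the hardware, the same instruction
# lands on Q0, Q1, Q2 or Q3 at consecutive addresses.
targets = [next_address(Q0, bits) for bits in range(4)]
print([oct(t) for t in targets])
```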

I should mention that the authors of the Alto's microcode were really amazing programmers. An important feature for graphics displays is BITBLT, bit block transfer. The idea is to take an arbitrary rectangle of pixels in memory (such as a character, image, or window) and copy it onto the screen. The tricky part is that the regions may not be byte-aligned, so you may need to extract part of a byte, shift it over, and combine it with part of the destination byte. In addition, BITBLT supports multiple writing modes (copy, XOR, merge) and other features. So BITBLT is a difficult function to implement, even in a high-level language. The incredible part is that the Xerox Alto has BITBLT implemented in hundreds of lines of complex microcode! Using microcode for BITBLT made the operation considerably faster than implementing it in assembly code. (It also meant that BITBLT was used as a single machine language instruction.)
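To give a feel for the alignment problem, here is a deliberately naive sketch that copies one row of pixels between arbitrary bit positions, one bit at a time. It is not how the Alto's microcode does it (a real implementation shifts and masks whole words), and treating "merge" as a bitwise OR is my assumption.

```python
WORD_BITS = 16  # the Alto uses 16-bit words

def get_bit(words, bit_index):
    """Read one pixel; bit 0 is the leftmost bit of the first word."""
    word, offset = divmod(bit_index, WORD_BITS)
    return (words[word] >> (WORD_BITS - 1 - offset)) & 1

def set_bit(words, bit_index, value):
    """Write one pixel without disturbing its neighbors in the same word."""
    word, offset = divmod(bit_index, WORD_BITS)
    mask = 1 << (WORD_BITS - 1 - offset)
    words[word] = (words[word] | mask) if value else (words[word] & ~mask)

def blt_row(src, src_x, dst, dst_x, width, mode="copy"):
    """Transfer `width` pixels from src to dst at arbitrary bit offsets."""
    for i in range(width):
        s = get_bit(src, src_x + i)
        d = get_bit(dst, dst_x + i)
        if mode == "copy":
            result = s
        elif mode == "xor":
            result = s ^ d
        else:  # "merge", assumed here to mean OR
            result = s | d
        set_bit(dst, dst_x + i, result)

# An 8-pixel run that straddles two destination words.
source = [0xFFFF]
screen = [0x0000, 0x0000]
blt_row(source, 0, screen, 12, 8)
print([hex(w) for w in screen])
```

Even this toy version has to touch two destination words for a single 8-pixel run, which hints at how much bookkeeping the real word-at-a-time version needs.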

Conclusion

Hooking up the logic analyzer was time consuming, but succeeded in showing us exactly what was happening inside the Alto processor. Although interpreting the logic analyzer output and mapping it to the microcode source is difficult, we were able to follow the execution and determined that the parity task was running. It appears that memory parity errors are preventing the system from booting. Next step will be to understand the memory system in detail to determine where these errors are coming from and how to fix them.

Restoring Y Combinator's Xerox Alto, day 4: What's running on the system

This post describes our continuing efforts to restore a Xerox Alto. We checked that the low-level microcode tasks are running correctly and the processor is functioning. (The Alto uses an unusual architecture that runs multiple tasks in microcode.) Unfortunately the system still doesn't boot from disk, so the next step will be to get out the logic analyzer and see exactly what's happening. Here's Marc's video of the day's session:

The Alto was a revolutionary computer, designed at Xerox PARC to investigate personal computing, introducing the GUI, Ethernet and laser printers to the world. Y Combinator received an Alto from computer visionary Alan Kay. I'm helping restore the system, along with Marc Verdiell, Luca Severini, Ron Crane, Carl Claunch and Ed Thelen (from the IBM 1401 restoration team). For background, see my previous restoration articles: day 1, day 2, day 3.

Checking the clocks

We started by checking that all the clock signals were working properly by connecting an oscilloscope to the wirewrap pins on the computer's backplane. This took a lot of careful counting to make sure we connected to the right pins! The system clock signals are generated by an oscillator on the video display card, which isn't where I'd expect to find them. Since the clock signals control the timing of the entire system, nothing will happen if the clock is bad. Thus, checking the clock was an important first step.

At first, the clock signals all looked awful, but after finding a decent ground for the oscilloscope probes, the clock signals looked much better. We verified that the multiple clock outputs were all running nicely. We also tested the reset line to make sure it was being triggered properly - the Alto is reset by pushing a button at the back of the keyboard.

Connecting oscilloscope probes to the Xerox Alto backplane.

Microcode tasks

Next we looked at the running tasks. The Alto has 16 separate tasks running in microcode, doing everything from pushing pixels to the display to refreshing memory to moving disk words. Keep in mind that these are microcode tasks, not operating-system level tasks. The Alto was designed to minimize hardware by performing as many tasks in software as possible, which reduced price and increased flexibility. The downside is the CPU can spend the majority of its time doing these tasks rather than "useful" work.

Alto task scheduling is fairly complex. Each task has a priority. When a task is ready to run, its request line is activated (by the associated hardware). The current task can offer to yield by executing the TASK function at convenient points. If there is a higher-priority task ready to run, it preempts the running task. If there's nothing better to run, task 0 runs - this task is what actually runs user code, emulating the Data General Nova instruction set.

The point of this explanation is that microcode instructions need to be running properly for task switching to happen. If the TASK function doesn't get called, the current task will run forever. And if all the task scheduling hardware isn't working right, task switching also won't happen.
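The scheduling rule described above can be sketched in a few lines. This is a simplified model, not the actual hardware logic: it assumes a higher task number means higher priority (task 0 being the lowest), and that a task's request line stays active for as long as it still wants to run.

```python
def next_task(current, wakeups, yields):
    """Pick the task that runs the next micro-instruction.

    current: number of the running task (0-15)
    wakeups: set of task numbers whose request line is active
    yields:  True if the current micro-instruction executed the TASK function
    """
    if not yields:
        return current  # without TASK, the running task keeps the processor
    candidates = wakeups | {0}  # task 0 (the emulator) is always ready to run
    return max(candidates)      # priority encoder: highest number wins

print(next_task(0, {8}, yields=False))    # emulator keeps running
print(next_task(0, {8}, yields=True))     # memory refresh takes over
print(next_task(8, set(), yields=True))   # nothing waiting: back to task 0
```

This also shows the failure mode mentioned above: if `yields` is never true, `next_task` returns the current task forever.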

Below is a picture of the microcode control board from the Alto. When you're using 1973-era chips, it takes a lot of chips to do anything. This board manages which task is running and the memory address of each task. It uses two special priority encoder chips to determine which waiting task has the highest priority. The board holds the microcode, 1024 micro-instructions of 32 bits each, using eight 1K x 4 bit PROM chips. (PROM, programmable read-only memory, is sort of like non-erasable flash memory.) The board has 8 open sockets allowing an upgrade of 1K of additional microcode to be installed. Note the tiny memory capacity of the time, just 512 bytes of storage per chip.

The microcode control board from a Xerox Alto
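The chip count and per-chip capacity quoted above follow from simple arithmetic, sketched here using only the numbers in this post.

```python
MICROCODE_WORDS = 1024      # micro-instructions stored on the board
BITS_PER_WORD = 32          # width of one micro-instruction
CHIP_WORDS = 1024           # each PROM is 1K x 4 bits
CHIP_BITS = 4

# Chips are placed side by side to build up the 32-bit width.
chips_needed = (MICROCODE_WORDS // CHIP_WORDS) * (BITS_PER_WORD // CHIP_BITS)

bytes_per_chip = CHIP_WORDS * CHIP_BITS // 8
total_bytes = MICROCODE_WORDS * BITS_PER_WORD // 8

print(chips_needed, bytes_per_chip, total_bytes)
```

So the entire microcode store is 4 kilobytes, spread over eight chips that each contribute 4 bits of every micro-instruction.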

Since tasks can be interrupted, the board needs to store the current address of each task. It uses two i3101A RAM chips for this storage. The 3101 is historically interesting: it was the first solid state memory chip, introduced by Intel in 1969. This chip holds 64 bits as 16 words of 4 bits each. Just imagine a time when a memory chip held not gigabits but just 64 total bits.

Looking at the running tasks

The control board has a 4-bit task number available on the backplane, indicating which task is running. We hooked up the oscilloscope so we could see the running tasks. The good news is we saw the appropriate tasks running at the right intervals, with preemption working properly. The following traces show the four task number bits. Most of the time the low-priority task 0 runs (all active-low signals high). Task 12 is running in the middle. Task 8 (memory refresh) runs three times, 38.08 microseconds apart as expected. From the traces, everything seems to be functioning correctly with the task execution.

Trace of the 4-bit microcode task select lines on the Xerox Alto. Top (red) is 8, then 4, 2 and bottom (yellow) is 1 bit. Signals are active-low. Each time interval is 10 microseconds, so this shows a 100 microsecond time interval.
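
Reading task numbers off active-low traces takes a moment of mental inversion. The following Python sketch shows the conversion, assuming each scope level is recorded as 1 for high and 0 for low, in the same top-to-bottom order as the trace (8, 4, 2, 1). The function name `decode_task` is hypothetical.

```python
def decode_task(bit8, bit4, bit2, bit1):
    """Convert four active-low line levels (1 = high, 0 = low) to a task number.

    Active-low means a line pulled low represents a 1 bit, so a line
    contributes its weight only when its level is 0.
    """
    levels = (bit8, bit4, bit2, bit1)
    weights = (8, 4, 2, 1)
    return sum(weight for level, weight in zip(levels, weights) if level == 0)


print(decode_task(1, 1, 1, 1))  # 0: all lines high, the emulator task
print(decode_task(0, 1, 1, 1))  # 8: only the top line low
print(decode_task(0, 0, 1, 1))  # 12: the top two lines low
```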

Seeing the running tasks is a big thing, since it shows a whole lot of the system is working properly. As explained earlier, since tasks are running and switching, the microcode processor must be fetching and executing micro-instructions correctly.

Display working better now

You may remember from the previous article that the Alto display was very, very dim and we suspected the CRT was failing. The good news is the display has steadily increased in brightness from its original very dim state, so we probably won't need to replace the CRT. We also managed to see some garbage on the screen along with a cursor, showing that RAM is storing something and the display interface is working.

The display of the Xerox Alto displaying random junk.

Boot still doesn't work

Lots of things are working at this point. The minor :-) remaining problem is the system doesn't boot. Last time, we got the disk drive working: we can put a 14-inch disk cartridge (below) in the drive, the drive spins up, and the heads load. But looking at the backplane signals, we found nothing is getting read from the disk (which explains the boot failure). The oscilloscope showed that the Alto isn't sending any commands to the disk - the Alto isn't even trying to read the disk. We checked for various hardware issues and couldn't find any problems. My suspicion is the boot code in microcode isn't running properly.

Inserting a hard disk into the Diablo drive.

A bit of explanation on the boot process: On reset, microcode task 0 handles the boot. If backspace is pressed on the keyboard, the Alto does an Ethernet boot. Otherwise it does a disk boot by setting up a disk command block in RAM. The microcode disk sector task gets triggered on each sector pulse (which we saw coming from the disk). It checks if there is a command block in RAM, and if so sends the command to the disk. When the read data comes from the disk, the disk word task copies the data into memory. At the end, the block read from disk will be executed, performing the disk boot. So three microcode tasks need to cooperate to boot from disk.
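
The hand-off between the three tasks can be sketched as a toy Python model. This is only an illustration of the flow described above, not the real microcode: the dictionary standing in for RAM, the layout of the command block, and all function names are hypothetical.

```python
def task0_reset(ram, backspace_pressed):
    """Task 0: choose the boot source and, for disk, set up a command block."""
    if backspace_pressed:
        return "ethernet"
    ram["command_block"] = {"op": "read", "sector": 0}  # hypothetical layout
    return "disk"


def sector_task(ram):
    """Sector task: on a sector pulse, return the pending command, if any."""
    return ram.get("command_block")


def word_task(ram, words):
    """Word task: copy the words read from the disk into memory."""
    ram["boot_block"] = list(words)


def boot(disk_words, backspace_pressed=False, ram=None):
    """Run the three tasks in order and report how far the boot got."""
    ram = {} if ram is None else ram
    if task0_reset(ram, backspace_pressed) == "ethernet":
        return "ethernet boot", ram
    command = sector_task(ram)
    if command is None:
        # Reached only if task 0's write to RAM was lost or corrupted.
        return "no disk command", ram
    word_task(ram, disk_words)
    return "run boot block", ram
```

Passing in a `ram` mapping that silently drops writes makes `boot` stop at "no disk command", because the sector task never sees a command block.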

Since we're seeing no command sent to the disk, something must be going wrong between task 0 setting up the command block in RAM and the sector task executing the command block. There's a lot that needs to go right here. If anything is wrong in the ALU or RAM has problems, the command block will get corrupted. And then no disk operation will happen.

Conclusion

The next step is to use a logic analyzer to see exactly what is running, instruction by instruction. By looking at the microcode address lines, we will be able to see what code is executing and where things go wrong. Then we can probe the memory bus to see if RAM is the problem, and look at the ALU to see if it is causing the problem. This is where debugging will get more complex.

I've studied the microcode and it is very bizarre. (You can see the source here.) Instructions are in random order in the PROM, what an instruction does depends on what task is running, branches happen when a device board flips address bits on the bus, and some bits in the PROM are inverted for no good reason (probably to save an inverter chip somewhere). So looking at the microcode can be mind-bending. But hopefully with the logic analyzer we can narrow the problem down. We can also use the Living Computer Museum's simulator to cross-check against what microcode should be running.

For updates on the restoration, follow kenshirriff on Twitter.

Restoring Y Combinator's Xerox Alto, day 3: Inside the disk drive

I'm helping restore a Xerox Alto — a legendary minicomputer from 1973 that helped set the direction for personal computing. This post describes how we cleaned and restored the disk drive and then powered up the system. Spoiler: the drive runs but the system doesn't boot yet.

While creating the Alto, Xerox PARC invented much of the modern personal computer: everything from Ethernet and the laser printer to WYSIWYG editors with high-quality fonts. Getting this revolutionary system running again is a big effort but fortunately I'm working with a strong team: Marc Verdiell, Luca Severini, Ron Crane, Carl Claunch and Ed Thelen, along with special guest Tim Curley from PARC.

If this article gives you deja vu, you probably saw Marc's restoration video (above) on Hacker News last week or read the earlier restoration updates: introduction, day 1, day 2.

Hard disk technology of the 1970s

For mass storage, the Alto uses a Diablo disk drive, which stores 2.5 megabytes on a removable 14 inch disk cartridge. With 1970s technology, you don't get much storage even on an inconveniently large disk, so Alto users were constantly short of disk space. The photo below shows the Xerox Alto, with the computer chassis (bottom) partially removed. Above the chassis and below the keyboard is the Diablo disk drive, which is the focus of this article.

The Xerox Alto II XM 'personal computer'. The card cage below the disk drive has been partially removed. Four cooling fans are visible at the front of it.

To insert the disk cartridge into the drive, the front of the drive swings down and the cartridge slides into place. The cartridge is an IBM 2315 disk pack, which was used by many manufacturers of the era such as DEC and HP, and contains a single platter inside the hard white protective case. The disk drive has been partially pulled out of the cabinet and the top removed, revealing the internals of the drive. During normal use, the disk drive is inside the cabinet, of course.

Inserting a hard disk into the Diablo drive.

Unlike modern hard disks, the Alto's disk is not sealed; the disk pack opens during use to provide access to the heads. To protect against contamination and provide cooling, filtered air is blown through the disk pack during use. Air enters the disk through a metal panel on the bottom of the disk (as seen below) and exits through the head opening, blowing any dust away from the disk surface.

Hard disk for the Xerox Alto, showing the air intake vent.

Although the heads are widely separated during disk pack insertion, they move very close to the disk surface during operation, floating on a cushion of air about one thousandth of a millimeter above the surface. The diagram below from the manual illustrates the danger of particles on the disk's surface. Any contamination can cause the head to crash into the disk surface, gouging out the oxide layer and destroying the disk and the head.

The Diablo disk and why contaminants are bad, from the Alto disk manual.

The magnified photo below shows the read/write head. The two air bleed holes ensure that the head is flying at the correct height above the disk surface. The long part of the cross contains the read/write coil, while the short part of the cross contains the erase coils (which erase a band between tracks).

Read/write head for the Diablo drive.

The following diagram shows how data is stored on the disk in 203 tracks (actually 203 cylinders, since there are tracks on the top and bottom surfaces). The drive moves the tiny read/write heads to the desired track. Each track is divided into 12 sectors, with 256 words of data in each sector.

Diagram of how the Diablo disk drive's read/write head stores data in tracks on the disk surface. From the Maintenance Manual.
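
As a sanity check, the disk geometry multiplies out to the 2.5 megabyte figure. The short Python calculation below uses the numbers from this article plus one assumption of mine: that each word is 16 bits (2 bytes).

```python
CYLINDERS = 203
SURFACES = 2             # top and bottom of the single platter
SECTORS_PER_TRACK = 12
WORDS_PER_SECTOR = 256
BYTES_PER_WORD = 2       # assumption: 16-bit words


def data_capacity_bytes():
    """Data capacity of one disk pack, counting only the data words."""
    return (CYLINDERS * SURFACES * SECTORS_PER_TRACK
            * WORDS_PER_SECTOR * BYTES_PER_WORD)


print(data_capacity_bytes())  # 2494464, roughly 2.5 megabytes
```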

In the photo below, we have removed the top of the disk pack, revealing the hard disk inside. Note the vertical metal ring along the inside of the disk; it has twelve narrow slots that physically indicate the twelve sectors of the disk. A double slot is the index mark, indicating the first sector. To make sure the disk was clean, we wiped its surfaces with isopropyl alcohol. This seemed a bit crazy to me, but apparently it's a normal thing to do with disks of that era.

Inside the disk pack used by the Xerox Alto.

The photo below shows the motor spindle that rotates the hard disk at 1500 RPM. In front of the spindle, you can see the sensor that detects the slots that indicate sectors. To the left is the air duct that provides filtered airflow into the disk pack. (The air intake on the bottom of the disk pack was shown in an earlier photo.) Around the edge of the air duct is foam to provide a seal between the duct and the disk cartridge, ensuring airflow through the cartridge.

The motor spindle (center) rotates the hard disk. In front of the spindle is the sensor to detect sectors. To the left is the ventilation air duct for the disk.

After 40 years, the foam had deteriorated into mush and needed to be replaced. The foam no longer provided an airtight seal. Even worse, particles could break off the foam. If a piece of foam got blown onto the disk surface, it would probably trigger a catastrophic disk crash. To replace the foam, we used weatherstripping, which isn't standard but seemed to get the job done.

As well as replacing the foam, we vacuumed any dust out of the drive and carefully cleaned the heads and other drive components.

How the Diablo drive works

The drive itself has fairly limited logic, with most of the functionality inside the Alto. The drive can seek to a particular track, indicate the current sector, and read or write a stream of raw bits. Since there's no buffering in the disk drive, the Alto must supply every bit at the precise time based on the disk's rotation. In the Alto, microcode performs many interfacing tasks that are usually done in hardware. Instead of using DMA, the Alto's microcode moves data words one at a time to the disk interface card in the Alto, which does the serial/parallel conversion.
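
To get a feel for how tight this timing is, the Python sketch below works out the time budget from the figures given earlier (1500 RPM, 12 sectors per track, 256 data words per sector). The per-word figure is only an upper bound: it assumes the whole sector time is spent on data words, while a real sector also has to fit in gaps and other overhead, which I am not modeling.

```python
RPM = 1500
SECTORS_PER_TRACK = 12
WORDS_PER_SECTOR = 256


def revolution_ms():
    """Time for one full rotation of the disk, in milliseconds."""
    return 60_000 / RPM


def sector_ms():
    """Time for one sector to pass under the head, in milliseconds."""
    return revolution_ms() / SECTORS_PER_TRACK


def max_word_time_us():
    """Upper bound on the time available per data word, in microseconds.

    Assumes the entire sector time is used for the 256 data words.
    """
    return sector_ms() * 1000 / WORDS_PER_SECTOR


print(revolution_ms())               # 40.0
print(round(sector_ms(), 2))         # 3.33
print(round(max_word_time_us(), 1))  # 13.0
```

So the microcode has at most about 13 microseconds to deliver each word, and in practice less.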

The Diablo drive opened for servicing.

Modern disk drives use a dense disk controller integrated circuit. The Diablo drive, in contrast, implements its limited functionality with transistors and individual chips (mostly gates and flip flops), so it requires boards of components. The photo above shows the 6 main circuit boards of the drive, plugged into the "mother board": three on the left side and three on the right side. For ease of maintenance, the electronics assembly pops up as seen above, allowing access to the boards. The leftmost board is the analog circuitry, generating the write signals for the heads and amplifying the signals read back from the disk. You can see a wire running from the board to the read/write heads. The next board detects sector and index marks and controls the motor speed. The third board has a counter to keep track of the current sector number.

The three boards on the right perform seeks, moving the disk head to the desired track. The first board computes the difference between the previous track number and the requested track number. The next board counts tracks as the head moves to determine the distance remaining. The rightmost board controls the servo that moves the head to the right track. The seek servo has a four-speed drive, so the head moves rapidly at first and slows down as it approaches the right track, more sophistication than I expected. The Diablo drive manual has detailed schematics.
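
The division of labor between the three seek boards can be sketched in Python. This is a loose model of the behavior described above, not the drive's actual logic: the distance thresholds for the four speeds are invented, and the function names are hypothetical.

```python
def speed_for(distance_remaining):
    """Pick one of four drive speeds (4 = fastest) from the distance left.

    The thresholds are made up; all we know is that there are four speeds
    and the head slows down as it nears the target track.
    """
    if distance_remaining > 32:
        return 4
    if distance_remaining > 8:
        return 3
    if distance_remaining > 2:
        return 2
    return 1


def seek(current_track, requested_track):
    """Return the (track, speed) pairs the head passes through during a seek."""
    # First board: difference between the current and requested track.
    distance = abs(requested_track - current_track)
    step = 1 if requested_track > current_track else -1
    track = current_track
    path = []
    # Second board: count tracks crossed to track the distance remaining.
    while distance > 0:
        # Third board: drive the servo at a speed suited to that distance.
        path.append((track, speed_for(distance)))
        track += step
        distance -= 1
    return path


print(seek(0, 3))  # [(0, 2), (1, 1), (2, 1)]
```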

The photo below shows some of the colorful resistors and diodes on the analog read/write board, along with some transistors. Modern circuit boards would be much denser, with tightly packed surface mounted components.

Circuitry inside the Diablo 31 drive.

The head positioning mechanism is shown below. The turquoise circles rotate as the drive moves to a new track and the yellow pointer indicates the track number on the dial. The heads themselves are on the arm below (lower center). In front of the heads (bottom of the picture) is the metal bar that opens the disk pack when it is inserted.

Inside the Diablo disk drive. The heads are visible in the center. In front of them is the metal bar that opens the disk pack.

As the disk pack enters the drive, it opens up to provide access to the disk surface. The photo below shows the same mechanism as the previous photo, but from the side and with a disk inserted. You can see the exposed surface of the disk, brownish from the magnetizable iron oxide layer over the aluminum platter. As described earlier, the airflow exits the cartridge here, preventing dust from entering through this opening. The read/write head is visible above the disk's surface, with another head below the disk.

Closeup of the hard disk inside the Diablo drive. The read/write head (metal/yellow) is visible above the disk surface (brown).

The drive largely uses primitive DTL (diode-transistor logic) chips, an early form of digital logic, along with some slightly more modern TTL chips. The photo below shows some of the chips on the sector counting board. The chips labeled MC858P provide four NAND gates, so there's not much logic per chip. (7651 is the date code, indicating the chip was manufactured in week 51 of 1976.)

Chips on a control board for the Diablo drive.
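
Date codes like the 7651 mentioned above follow a year-then-week pattern, which is simple to decode. Here is a small Python helper as an illustration; the function name is mine, and it assumes a four-digit code on a 20th-century part.

```python
def parse_date_code(code):
    """Split a four-digit YYWW date code into (year, week).

    Assumption: the part dates from the 1900s, so "76" means 1976.
    """
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected four digits, got {code!r}")
    year = 1900 + int(code[:2])
    week = int(code[2:])
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range in {code!r}")
    return year, week


print(parse_date_code("7651"))  # (1976, 51)
```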

Conclusion

After putting the disk drive back together, we carefully powered up the system. The disk drive spun up to high speed, the heads dropped to the surface, and the disk slowed to 1500 RPM as expected. (One surprising complexity of the drive is it runs at a faster speed for a while so the airflow will blow contaminants out of the disk pack before loading the heads; it has counters and logic to implement this.) We verified that the disk surface remained undamaged, so the drive works properly, at least mechanically.

This was the first time we had powered up the Alto circuitry. Happily, nothing emitted smoke. But not surprisingly, the Alto failed to boot from the disk. Unless the Alto can read boot code from the disk (or Ethernet), nothing happens, not even a prompt on the screen. The photo below shows the disk drive with the ready light illuminated, and the empty screen.

The Xerox Alto's drive powered up, along with monitor (showing a white screen).

We have a long debugging task ahead of us, to trace through the Alto's logic circuits and find out what's going wrong. We're also building a disk emulator using an FPGA, so we will be able to run the Alto with an emulated disk, rather than depending on the Diablo drive to keep running. The restoration is likely to keep us busy for a while, so expect more updates. One item we are missing is the Alignment Cartridge (or C.E. Pack), a disk cartridge with specially-recorded tracks used to align the drive; let us know if you happen to have one lying around!

For updates on the restoration, follow kenshirriff on Twitter. Thanks to Al Kossow and Keith Hayes for assistance with restoration.