
Reverse engineering the mysterious Up-Data Link Test Set from Apollo

Back in 2021, a collector friend of ours was visiting a dusty warehouse in search of Apollo-era communications equipment. A box with NASA-style lights caught his eye—the "AGC Confirm" light suggested a connection with the Apollo Guidance Computer. Disappointingly, the box was just an empty chassis and the circuit boards were all missing. He continued to poke around the warehouse when, to his surprise, he found a bag on the other side of the warehouse that contained the missing boards! After reuniting the box with its wayward circuit cards, he brought it to us: could we make this undocumented unit work?

The Up-Data Link Confidence Test Set, powered up.


A label on the back indicated that it is an "Up-Data Link Confidence Test Set", built by Motorola. As the name suggests, the box was designed to test Apollo's Up-Data Link (UDL), a system that allowed digital commands to be sent up to the spacecraft. As I'll explain in detail below, these commands allowed ground stations to switch spacecraft circuits on or off, interact with the Apollo Guidance Computer, or set the spacecraft's clock. The Up-Data Link needed to be tested on the ground to ensure that its functions operated correctly. Generating the test signals for the Up-Data Link and verifying its outputs was the responsibility of the Up-Data Link Confidence Test Set (which I'll call the Test Set for short).

The Test Set illustrates how, before integrated circuits, complicated devices could be constructed from thumb-sized encapsulated modules. Since I couldn't uncover any documentation on these modules, I had to reverse-engineer them, discovering that different modules implemented everything from flip-flops and logic gates to opto-isolators and analog circuits. With the help of a Lumafield 3-dimensional X-ray scanner, we looked inside the modules and examined the discrete transistors, resistors, diodes, and other components mounted inside.

Four of the 13-pin Motorola modules. These implement logic gates (2/2G & 2/1G), lamp drivers (LD), more logic gates (2P/3G), and a flip-flop (LP FF). The modules have 13 staggered pins, ensuring that they can't be plugged in backward.


Reverse-engineering this system—from the undocumented modules to the mess of wiring—was a challenge. Mike found one NASA document that mentioned the Test Set, but the document was remarkably uninformative.1 Moreover, key components of the box were missing, probably removed for salvage years ago. In this article, I'll describe how we learned the system's functionality, uncovered the secrets of the encapsulated modules, built a system to automatically trace the wiring, and used the UDL Test Set in a large-scale re-creation of the Apollo communications system.

The Apollo Up-Data Link

Before describing the Up-Data Link Test Set, I'll explain the Up-Data Link (UDL) itself. The Up-Data Link provided a mechanism for the Apollo spacecraft to receive digital commands from ground stations. These commands allowed ground stations to control the Apollo Guidance Computer, turn equipment on or off, or update the spacecraft's clock. Physically, the Up-Data Link is a light blue metal box with an irregular L shape, weighing almost 20 pounds.

The Up-Data Link box.


The Apollo Command Module was crammed with boxes of electronics, from communication and navigation to power and sequencing. The Up-Data Link was mounted above the AC power inverters, below the Apollo Guidance Computer, and to the left of the waste management system and urine bags.

The lower equipment bay of the Apollo Command Module. The Up-Data Link is highlighted in yellow. From Command/Service Module Systems Handbook p212.


Up-Data Link Messages

The Up-Data Link supported four types of messages:

  • Mission Control had direct access to the Apollo Guidance Computer (AGC) through the UDL, controlling the computer, keypress by keypress. That is, each message caused the UDL to simulate a keypress on the Display/Keyboard (DSKY), the astronaut's interface to the computer.

  • The spacecraft had a clock, called the Central Timing Equipment or CTE, that tracked the elapsed time of the mission, from days to seconds. A CTE message could set the clock to a specified time.

  • A system called Real Time Control (RTC) allowed the UDL to turn relays on or off, so some spacecraft systems could be controlled from the ground.2 These 32 relays, mounted inside the Up-Data Link box, could do everything from illuminating an Abort light (indicating that Mission Control says to abort) to controlling the data tape recorder or the S-band radio.

  • Finally, the UDL supported two test messages to "exercise all process, transfer and program control logic" in the UDL.

The diagram below shows the format of messages to the Up-Data Link. Each message consisted of 12 to 30 bits, depending on the message type. The first three bits, the Vehicle Address, selected which spacecraft should receive the message. (This allowed messages to be directed to the Saturn V booster, the Command Module, or the Lunar Module.3) Next, three System Address bits specified the spacecraft system to receive the message, corresponding to the four message types above. The remaining bits supplied the message text.

Format of the messages to the Up-Data Link. From Telecommunication Systems Study Guide.
Note that the vehicle access code uses a different sub-bit pattern from the rest of the message.
This diagram shows an earlier sub-bit encoding, not the encoding used by the Test Set.


The contents of the message text depended on the message type. A Real Time Control (RTC) message had a six-bit value specifying the relay number as well as whether it should be turned off or on. An Apollo Guidance Computer (AGC) message had a five-bit value specifying a key on the Display/Keyboard (DSKY). For reliability, the message was encoded in 16 bits: the message, the message inverted, the message again, and a padding bit; any mismatching bits would trigger an error. A CTE message set the clock using four 6-bit values indicating seconds, minutes, hours, and days. The UDL processed the message by resetting the clock and then advancing the time by issuing the specified number of pulses to the CTE to advance the seconds, minutes, hours, and days. (This is similar to setting a digital alarm clock by advancing the digits one at a time.) Finally, the two self test messages consisted of 24-bit patterns that would exercise the UDL's internal circuitry. The results of the test were sent back to Earth via Apollo's telemetry system.
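The triple-redundant AGC encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the bit order within each field (most significant bit first) and the value of the padding bit (0) are guesses rather than details taken from the UDL documentation.

```python
from typing import List, Optional


def encode_agc_keycode(keycode: int) -> List[int]:
    """Expand a 5-bit keycode into 16 bits: code, inverted code, code, pad."""
    if not 0 <= keycode < 32:
        raise ValueError("keycode must fit in five bits")
    # Most significant bit first (an assumption for this sketch).
    bits = [(keycode >> i) & 1 for i in range(4, -1, -1)]
    inverted = [1 - b for b in bits]
    return bits + inverted + bits + [0]  # padding bit value is assumed


def check_agc_keycode(word: List[int]) -> Optional[int]:
    """Return the keycode if all three copies are consistent, else None."""
    if len(word) != 16:
        return None
    first, middle, last = word[0:5], word[5:10], word[10:15]
    # The first and last copies must match, and the middle copy must be
    # the bit-by-bit complement of the first.
    if first != last or any(a == b for a, b in zip(first, middle)):
        return None
    return int("".join(str(b) for b in first), 2)
```

Flipping any single bit in the first 15 positions breaks either the copy-to-copy match or the complement relationship, so the check returns `None` for such a word.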

For reliability, each bit transmitted to the UDL was replaced by five "sub-bits": each "1" bit was replaced with the sub-bit sequence "01011", and each "0" bit was replaced with the complement, "10100".4 The purpose of the sub-bits was that any corrupted data would result in an invalid sub-bit code so corrupted messages could be rejected. The Up-Data Link performed this validation by matching the input data stream against "01011" or "10100". (The vehicle address at the start of a message used a different sub-bit code, ensuring that the start of the message was properly identified.) By modern standards, sub-bits are an inefficient way of providing redundancy, since the message becomes five times larger. As a consequence, the effective transmission rate was low: 200 bits per second.
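As a rough sketch of the sub-bit scheme, the Python below expands message bits into the two five-sub-bit patterns given above and rejects any group that matches neither. The function names are hypothetical, the input is assumed to be already aligned on a five-sub-bit boundary, and the separate sub-bit code used for the vehicle address is not modeled.

```python
SUB_BITS = {1: "01011", 0: "10100"}  # patterns quoted in the text
DECODE = {pattern: bit for bit, pattern in SUB_BITS.items()}


def expand(bits):
    """Replace each message bit with its five sub-bits."""
    return "".join(SUB_BITS[b] for b in bits)


def collapse(stream):
    """Recover message bits, raising ValueError on any invalid group."""
    if len(stream) % 5 != 0:
        raise ValueError("stream length is not a multiple of five")
    bits = []
    for i in range(0, len(stream), 5):
        group = stream[i:i + 5]
        if group not in DECODE:
            raise ValueError("invalid sub-bit group: " + group)
        bits.append(DECODE[group])
    return bits
```

Because the two patterns are complements, they differ in all five positions, so corrupting up to four sub-bits in a group can never turn one valid pattern into the other.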

There was no security in the Up-Data Link messages, apart from the need for a large transmitter. Of the systems on Apollo, only the rocket destruct system—euphemistically called the Propellant Dispersion System—was cryptographically secure.5

Since the Apollo radio system was analog, the digital sub-bits couldn't be transmitted from ground to space directly. Instead, a technique called phase-shift keying (PSK) converted the data into an audio signal. This audio signal consists of a sine wave that is inverted to indicate a 0 bit versus a 1 bit; in other words, its phase is shifted by 180 degrees for a 0 bit. The Up-Data Link box takes this audio signal as input and demodulates it to extract the digital message data. (Transmitting this audio signal from ground to the Up-Data Link required more steps that aren't relevant to the Test Set, so I'll describe them in a footnote.6)

The Up-Data Link Test Set

Now that I've explained the Up-Data Link, I can describe the Test Set in more detail. The purpose of the UDL Test Set is to test the Up-Data Link system. It sends a message—as an audio signal—to the Up-Data Link box, implementing the message formatting, sub-bit encoding, and phase shift keying described above. Then it verifies the outputs from the UDL to ensure that the UDL performed the correct action.

Perhaps the most visible feature of the Test Set is the paper tape reader on the front panel: this reader is how the Test Set obtains messages to transmit. Messages are punched onto strips of paper tape, encoded as a sequence of 13 octal digits.7 After a message is read from paper tape, it is shown on the 13-digit display. The first three digits are an arbitrary message number, while the remaining 10 octal digits denote the 30-bit message to send to the UDL. Based on the type of message, specified by the System Address digit, the Test Set validates the UDL's response and indicates success or errors on the panel lights.
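To make the digit layout concrete, here is a minimal Python sketch that splits a 13-digit octal string into the message number and the 30 message bits. The function name, the example digits in the usage note, and the choice of most-significant-bit-first within each digit are assumptions for illustration; the sketch also does not model how messages shorter than 30 bits are represented on tape.

```python
def parse_tape_message(digits):
    """Split 13 octal digits into a message number and a 30-bit message."""
    if len(digits) != 13 or any(d not in "01234567" for d in digits):
        raise ValueError("expected 13 octal digits")
    # Each of the last ten octal digits contributes three bits.
    bits = "".join(format(int(d, 8), "03b") for d in digits[3:])
    return {
        "message_number": digits[:3],
        "vehicle_address": bits[0:3],  # first three bits of the message
        "system_address": bits[3:6],   # next three bits
        "text": bits[6:],              # remaining 24 bits
        "bits": bits,
    }
```

For a made-up tape entry such as `"0011234567012"`, the message number is `"001"`, and the fourth and fifth digits map directly onto the Vehicle Address and System Address fields, since each octal digit is exactly three bits.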

I created the block diagram below to explain the architecture and construction of the Test Set. The system has 25 circuit boards, labeled A1 through A25;8 for the most part, they correspond to functional blocks in the diagram.

My block diagram of the Up-Data Link Test Set.


The Test Set's front panel is dominated by its display of 13 large digits. It turns out that the storage of these digits is the heart of the Test Set. This storage (A3-A9) assembles the digits as they are read from the paper tape, circulates the bits for transmission, and provides digits to the other circuits to select the message type and validate the results. To accomplish this, the 13 digit circuits are configured as a 39-bit shift register. As the message is read from the paper tape, its bits are shifted into the digit storage, right to left, and the message is shown on the display. To send the message, the shift register is reconfigured so the 10 digits form a loop, excluding the message number. As the bits cycle through the loop, the leftmost bit is encoded and transmitted. At the end of the transmission, the digits have cycled back to their original positions, so the message can be transmitted again if desired. Thus, the shift-register mechanism both deserializes the message when it is read and serializes the message for transmission.
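The two modes of the digit storage can be modeled in software as a simple list of bits. The sketch below is a behavioral illustration only, with hypothetical function names; it assumes the transmitted bit is tapped from the left end of the loop just before each shift, which may not match the exact timing of the hardware.

```python
def load_from_tape(tape_bits):
    """Shift 39 bits into the register; new bits enter at the right."""
    register = [0] * 39
    for bit in tape_bits:
        register = register[1:] + [bit]  # everything moves one place left
    return register


def transmit(register):
    """Circulate the 30 message bits once, returning (sent bits, register)."""
    number, loop = register[:9], register[9:]  # 3 number digits, 10 message digits
    sent = []
    for _ in range(len(loop)):
        sent.append(loop[0])          # leftmost bit goes to the encoder
        loop = loop[1:] + [loop[0]]   # and is fed back in at the right
    return sent, number + loop
```

After 30 shifts the loop has rotated completely, so the returned register equals the one passed in, and the message can be sent again without re-reading the tape.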

The Test Set uses three boards (A15, A2, and A1) to expand the message with sub-bits and to encode the message into audio. The first board converts each bit into five sub-bits. The second board applies phase-shift keying (PSK) modulation, and the third board has filters to produce clean sine waves from the digital signals.

On the input side, the Test Set receives signals from the Up-Data Link (UDL) box through round military-style connectors. These input signals are buffered by boards A25, A22, A23, A10, and A24. Board A15 verifies the input sub-bits by comparing them with the transmitted sub-bits. For an AGC message, the computer signals are verified by board A14. The timing (CTE) signals are verified by boards A20 and A21. The UDL status (validity) signals are processed by board A12. Board A11 implements a switching power supply to power the interface boards.

You can see from the block diagram that the Test Set is complex and implements multiple functions. On the other hand, the block diagram also shows that it takes a lot of 1960s circuitry to implement anything. For instance, one board can only handle two digits, so the digit display alone requires seven boards. Another example is the inputs, requiring a full board for two or three input bits.

Encapsulated modules

The box is built from modules that are somewhat like integrated circuits but contain discrete components. Modules like these were used in the early 1960s before ICs caught on. Each module implements a simple function such as a flip-flop or buffer. They were more convenient than individual components, since a module provided a ready-made function. They were also compact, since the components were tightly packaged inside the module.

Physically, each module has 13 pins: a row of 7 on one side and a row of 6 offset on the other side. This arrangement ensures that a module cannot be plugged in backward.

A Motorola "LP FF" module. This module implements a J-K flip-flop. "LP" could indicate low performance, low power, or low propagation; the system also uses "HP FF" modules, which could be high performance.


Reverse engineering these modules was difficult since they were encapsulated in plastic and the components were inaccessible. The text printed on each module hinted at its function. For example, the J-K flip-flop module above is labeled "LP FF". The "2/2G & 2/1G" module turned out to contain two NAND gates and two inverters (the 2G and 1G gates). A "2P/3G" module contains two pull-up resistors and two three-input NAND gates. Other modules provided special-purpose analog functions for the PSK modulation.

I reverse-engineered the functions of the modules by applying signals and observing the results. Conveniently, the pins are on 0.200" spacing so I could plug modules into a standard breadboard. The functions of the logic modules were generally straightforward to determine. The analog modules were more difficult; for instance, the "-3.9V" module contains a -3.9-volt Zener diode, six resistors, and three capacitors in complicated arrangements.

To determine how the modules are constructed internally, we had a module X-rayed by John McMaster and another module X-rayed in three dimensions by Lumafield. The X-rays revealed that modules were built with "cordwood construction", a common technique in the 1960s. That is, cylindrical components were mounted between two boards, stacked parallel similar to a pile of wood logs. Instead of using printed-circuit boards, the leads of the components were welded to metal strips to provide the interconnections.

A 3-D scan of the module showing the circuitry inside the compact package, courtesy of Lumafield. Two transistors are visible near the center.


For more information on these modules, see my articles "Reverse-engineering a 1960s cordwood flip-flop module with X-ray CT scans" and "X-ray reverse-engineering a hybrid module". You can interact with the scan here.

The boards

In this section, I'll describe some of the circuit boards and point out their interesting features. A typical board has up to 15 modules, arranged as five rows of three. The modules are carefully spaced so that two boards can be meshed with the components on one board fitting into the gaps on the other board. Thus, a pair of boards forms a dense block.

This photo shows how the modules of the two circuit boards are arranged so the boards can be packed together tightly.


Each pair of boards is attached to side rails and a mounting bracket, forming a unit.8 The bracket has ejectors to remove the board unit, since the backplane connectors grip the boards tightly. Finally, each bracket is labeled with the board numbers, the test point numbers, and the Motorola logo. The complexity of this mechanical assembly suggests that Motorola had developed an integrated prototyping system around the circuit modules, prior to the Test Set.

Digit driver boards

The photo below shows a typical board, the digit driver board. At the left, a 47-pin plug provides the connection between the board and the Test Set's backplane. At the right, 15 test connections allow the board to be probed and tested while it is installed. The board itself is a two-sided printed circuit board with gold plating. Boards are powered with +6V, -6V, and ground; the two red capacitors in the lower left filter the two voltages.

Boards A4 through A9 are identical digit driver boards.


The digit driver is the most common board in the system, appearing six times.9 Each board stores two octal digits in a shift register and drives two digit displays on the front panel. Since the digits are octal, each digit requires three bits of storage, implemented with three flip-flop modules connected as a shift register. If you look closely, you can spot the six flip-flop modules, labeled "LP FF".

The digits are displayed through an unusual technology: an edge-lit lightguide display.10 From a distance, it resembles a Nixie tube, but it uses 10 lightbulbs, one for each number value, with a plastic layer for each digit. Each plastic sheet has numerous dots etched in the shape of the corresponding number. One sheet is illuminated from the edge, causing the dots in the sheet to light up and display that number. In the photo below, you can see both the illuminated and the unilluminated dots. The displays take 14 volts, but the box runs at 28 volts, so a board full of resistors on the front panel drops the voltage from 28 to 14, giving off noticeable heat in the process.

A close-up of a digit in the Test Set, showing the structure of the edge-lit lightguide display.


For each digit position, the driver board provides eight drive signals, one for each bulb. The drivers are implemented in "LD" modules. Since each LD module contains two drive transistors controlled by 4-input AND gates, a module supports two bulbs. Thus, a driver board holds eight LD modules in total. The LD modules are also used on other boards to drive the lights on the front panel.

Ring counters

The Test Set contains multiple counters to count bits, sub-bits, digits, states, and so forth. While a modern design would use binary counters, the Test Set is implemented with a circuit called a ring counter that optimizes the hardware.

For instance, to count to ten, five flip-flops are arranged as a shift register so each flip-flop sends its output to the next one. However, the last flip-flop sends its inverted output to the first. The result is that the counter will proceed: 10000, 11000, 11100, 11110, 11111 as 1 bits are shifted in at the left. But after a 1 reaches the last bit, 0 bits will be shifted in at the left: 01111, 00111, 00011, 00001, and finally 00000. Thus, the counter moves through ten states.

Why not use a 4-bit binary counter and save a flip-flop? First, the binary counter requires additional logic to go from 9 back to 0. Moreover, acting on a particular binary value requires a 4-input gate to check the four bits. But a particular value of a ring counter can be detected with a smaller 2-input gate by checking the bits on either side of the 0/1 boundary. For instance, to detect a count of 3 (11100), only the third and fourth bits need to be tested: a 1 followed by a 0. Thus, the decoding logic is much simpler for a ring counter, which is important when each gate comes in an expensive module.
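The ten-state sequence and the two-bit decoding trick are easy to check with a short simulation. The Python below is a behavioral sketch with hypothetical function names, not a model of the actual flip-flop modules.

```python
def step(state):
    """Shift right one place, feeding the inverted last bit into the first stage."""
    return [1 - state[-1]] + state[:-1]


def all_states(n_stages=5):
    """Return the full cycle of states as strings, starting after 00000."""
    state = [0] * n_stages
    states = []
    for _ in range(2 * n_stages):
        state = step(state)
        states.append("".join(str(b) for b in state))
    return states


def is_count_three(state_string):
    """Two-input decode: third bit is 1 and fourth bit is 0."""
    return state_string[2] == "1" and state_string[3] == "0"
```

Running `all_states()` yields the ten states in the order listed above, and `is_count_three` is true for exactly one of them, 11100, even though it examines only two of the five bits.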

Another use of the ring counter is in the sub-state generator, counting out the five states. Since this ring counter uses three flip-flops, you might expect it to count to six. However, the first flip-flop gets one of its inputs from the second flip-flop, resulting in five states: 000, 100, 110, 011, and 001, with the 111 state skipped.11 This illustrates the flexibility of ring counters to generate arbitrary numbers of states.
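One feedback arrangement that reproduces this five-state sequence is sketched below in Python. The rule used here, that the first stage becomes 1 only when the second and third stages are both 0, is an assumption chosen because it generates the listed states; the actual gating in the Test Set may be wired differently.

```python
def step5(state):
    """Advance the three-stage counter by one clock.

    Assumed feedback: the first stage loads NOT(second) AND NOT(third),
    while the other stages simply shift.
    """
    a, b, c = state
    return (int(not b and not c), a, b)


def cycle_from(state, steps):
    """Return the list of states visited, including the starting state."""
    visited = [state]
    for _ in range(steps):
        state = step5(state)
        visited.append(state)
    return visited
```

Starting from `(0, 0, 0)`, five clocks visit 000, 100, 110, 011, and 001 before returning to 000. With this particular rule, the skipped state 111 falls into 011 on the next clock, so the counter would recover if it ever powered up in that state.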

The PSK boards

Digital data could not be broadcast directly to the spacecraft, so the data was turned into an audio signal using phase-shift keying (PSK). The Test Set uses two boards (A1 and A2) to produce this signal. These boards are interesting and unusual because they are analog, unlike the other boards in the Test Set.

The idea behind phase-shift keying is to change the phase of a sine wave depending on the bit (i.e., sub-bit) value. Specifically, a 2 kHz sine wave indicated a one bit, while the sine wave was inverted for a zero bit. That is, a phase shift of 180º indicated a 0 bit. But how do you tell which sine wave is original and which is flipped? The solution was to combine the information signal with a 1 kHz reference signal that indicates the start and phase of each bit. The diagram below shows how the bits 1-0-1 are encoded into the composite audio signal that is decoded by the Up-Data Link box.

The phase-shift keying modulation process. This encoded digital data into an audio signal for transmission to the Up-Data Link. Note that "1 kc" is 1 kilocycle, or 1 kilohertz in modern usage. From Apollo Digital Up-Data Link Description.
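The composite signal can be approximated numerically. The Python sketch below is a software illustration rather than a model of the Test Set's analog circuitry, and it makes several assumptions: each sub-bit lasts one cycle of the 1 kHz reference (two cycles of the 2 kHz tone), the two components are summed at equal amplitude, the reference starts in phase with the data tone, and the 48 kHz sample rate and function names are arbitrary choices.

```python
import math

SAMPLE_RATE = 48_000   # samples per second (arbitrary choice)
SUB_BIT_RATE = 1_000   # assumed: one sub-bit per reference cycle
DATA_HZ = 2_000
REF_HZ = 1_000
SAMPLES_PER_SUB_BIT = SAMPLE_RATE // SUB_BIT_RATE


def psk_composite(sub_bits, ref_level=1.0):
    """Sum a phase-keyed 2 kHz tone and a 1 kHz reference tone."""
    wave = []
    for n, bit in enumerate(sub_bits):
        sign = 1.0 if bit else -1.0  # a 0 inverts the tone (180 degree shift)
        for k in range(SAMPLES_PER_SUB_BIT):
            t = (n * SAMPLES_PER_SUB_BIT + k) / SAMPLE_RATE
            data = sign * math.sin(2 * math.pi * DATA_HZ * t)
            ref = ref_level * math.sin(2 * math.pi * REF_HZ * t)
            wave.append(data + ref)
    return wave


def demodulate(wave):
    """Recover sub-bits by correlating each period with an in-phase 2 kHz tone."""
    bits = []
    for start in range(0, len(wave), SAMPLES_PER_SUB_BIT):
        corr = sum(
            wave[start + k]
            * math.sin(2 * math.pi * DATA_HZ * (start + k) / SAMPLE_RATE)
            for k in range(SAMPLES_PER_SUB_BIT)
        )
        bits.append(1 if corr > 0 else 0)
    return bits
```

The demodulator here cheats by knowing where each sub-bit starts. In the real system that timing and phase information is what the 1 kHz reference supplies; over one sub-bit period the reference averages out of the 2 kHz correlation, so it does not disturb the recovered data.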


The core of the PSK modulation circuit is a transformer with a split input winding. The 2 kHz sine wave is applied to the winding's center tap. One side of the winding is grounded (by the "ø DET" module) for a 0 bit, but the other side of the winding is grounded for a 1 bit. This causes the signal to go through the winding in one direction for a 1 bit and the opposite direction for a 0 bit. The transformer's output winding thus receives an inverted signal for a 0 bit, giving the 180º phase shift seen in the second waveform above. Finally, the board produces the composite audio signal by mixing in the reference signal through a potentiometer and the "SUM" module.12

Board A2 is the heart of the PSK encoding. The black transformer selects the phase shift, controlled by the "ø DET" and "ø DET D" modules in front of it. The two central potentiometers balance the components of the output signal.


Inconveniently, some key components of the Test Set were missing; probably the most valuable components were salvaged when the box was scrapped. The missing components included the power supplies and amplifiers on the back of the box, as well as parts from PSK board A1. This board had ten white wires that had been cut, going to missing components labeled MP1, R2, L1, and L2. By studying the circuitry, I determined that MP1 had been a 4-kHz oscillator that provided the master clock for the Test Set. R2 was simply a potentiometer to adjust signal levels.

Marc added circuitry to board A1 to replace the two missing filters and the missing oscillator. (The oscillator was used earlier to drive a clock from Soyuz.)


But L1 and L2 were more difficult. It took a lot of reverse-engineering before we determined that L1 and L2 were resonant filters to convert the digital waveforms to the sine waves needed for the PSK output. Marc used a combination of theory and trial-and-error to determine the inductor and capacitor values that produced a clean signal. The photo above shows our substitute filters, along with a replacement oscillator.

Input boards

The Test Set receives signals from the Up-Data Link box under test and verifies that these signals are correct. The Test Set has five input boards (A10 and A22 through A25) to buffer the input signals and convert them to digital levels. The input boards also provide electrical isolation between the input signals and the Test Set, avoiding problems caused by ground loops or different voltage levels.

A typical input board is A22, which receives two input signals, supplied through coaxial cables. The board buffers the signals with op-amps, and then produces a digital signal for use by the box. The op-amp outputs go into "1 SS" isolation modules that pass the signal through to the box while ensuring isolation. These modules are optocouplers, using an LED and a phototransistor to provide isolation.13 The op-amps are powered by an isolated power supply.

Board A22 handles two input signals. It has two op-amps and associated circuitry. Note the empty module positions; board A23 has these positions populated so it supports three inputs.


Each op-amp module is a Burr-Brown Model 1506 module,14 encapsulating a transistorized op-amp into a convenient 8-pin module. The module is similar to an integrated-circuit op-amp, except it has discrete components inside and is considerably larger than an integrated circuit. Burr-Brown is said to have created the first solid-state op-amp in 1957, and started making op-amp modules around 1962.

Board A24 is also an isolated input board, but uses different circuitry. It has two modules that each contain four Schmitt triggers, circuits to sharpen up a noisy input. These modules have the puzzling label "-12+6LC". Each output goes through a "1 SS" isolation module, as with the previous input boards. This board receives the 8-bit "validity" signal from the Up-Data Link.

The switching power supply board

Board A11 is interesting: instead of sealed modules, it has a large green cube with numerous wires attached. This board turned out to be a switching power supply that implements six dual-voltage power supplies. The green cube is a transformer with 14 center-tapped windings connected to 42 pins. The transformer ensures that the power supply's outputs are isolated. This allows the op-amps on the input boards to remain electrically isolated from the rest of the Test Set.

The switching power supply board is dominated by a large green transformer with many windings. The two black power transistors are at the front.


The power supply uses a design known as a Royer Converter; the two transistors drive the transformer in a push-pull configuration. The transistors are turned on alternately at high frequency, driven by a feedback winding. The transformer has multiple windings, one for each output. Each center-tapped winding uses two diodes to produce a DC output, filtered by the large capacitors. In total, the power supply has four ±7V outputs and two ±14V outputs to supply the input boards.

This switching power supply is independent from the power supplies for the rest of the Test Set. On the back of the box, we could see where power supplies and amplifiers had been removed. Determining the voltages of the missing power supplies would have been a challenge. Fortunately, the front of the box had test points with labels for the various voltages: -6, +6, and +28, so we knew what voltages were required.

The front panel

The front panel reveals many of the features of the Test Set. At the top, lights indicate the success or failure of various tests. "Sub-bit agree/error" indicates if the sub-bits read back into the Test Set match the values sent. "AGC confirm/error" shows the results of an Apollo Guidance Computer message, while "CTE confirm/error" shows the results of a Central Timing Equipment message. "Verif confirm/error" indicates if the verification message from the UDL matches the expected value for a test message. At the right, lights indicate the status of the UDL: standby, active, or powered off.

A close-up of the Test Set's front panel.


In the middle, toggle switches control the Test Set's operation. The "Sub-bit spoil" switch causes sub-bits to be occasionally corrupted for testing purposes. "Sub-bit compare/override" enables or disables sub-bit verification. The four switches on the right control the paper tape reader. The "Program start" switch is the important one: it causes the Test Set to send one message (in "Single" mode) or multiple messages (in "Serial" mode). The Test Set can stop or continue when an error occurs ("Stop on error" / "Bypass error"). Finally, "Tape advance" causes messages to be read from paper tape, while "Tape stop" causes the Test Set to re-use the current message rather than loading a new one.

The UDL provides a verification code that indicates its status. The "Verification Return" knob selects the source of this verification code: the "Direct" position uses a 4-bit verification code, while "Remote" uses an 8-bit verification code.15

At the bottom, "PSK high/low" selects the output level for the PSK signal from the Test Set. (Since the amplifier was removed from our Test Set, this switch has no effect. Likewise, the "Power On / Off" switch has no effect since the power supplies were removed. We power the Test Set with an external lab supply.) In the middle, 15 test points allow access to various signals inside the Test Set. The round elapsed time indicator shows how many hours the Test Set has been running (apparently over 12 months of continuous operation).

Reverse-engineering the backplane

Once I figured out the circuitry on each board, the next problem was determining how the boards were connected. The backplane consists of rows of 47-pin sockets, one for each board. Dense white wiring runs between the sockets as well as to switches, displays, and connectors. I started beeping out the connections with a multimeter, picking a wire and then trying to find the other end. Some wires were easy since I could see both ends, but many wires disappeared into a bundle. I soon realized that manually tracing the wiring was impractically slow: with 25 boards and 47 connections per board, brute-force testing of every pair of connections would require hundreds of thousands of checks.

The backplane wiring of the Test Set consisted of bundles of white wires, as shown in this view of the underside of the Test Set.


To automate the beeping-out of connections, I built a system that I call Beep-o-matic. The idea behind Beep-o-matic is to automatically find all the connections between two motherboard slots by plugging two special boards into the slots. By energizing all the pins on the first board in sequence, a microcontroller can detect connected pins on the second board, revealing the wiring between the two slots.

This system worked better than I expected, rapidly generating a list of connections. I still had to plug the Beep-o-matic boards into each pair of slots (about 300 combinations in total), but each scan took just a few seconds, so a full scan was practical. To find the wiring to the switches and connectors, I used a variant of the process. I plugged a board into a slot and used a program to continuously monitor the pins for changes. I went through the various switch positions and applied signals to the connectors to find the associated connections.
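
The scale of the problem and the slot-pair scan can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual Beep-o-matic firmware: `is_connected` is a hypothetical stand-in for the hardware (energize one pin, read another), and `FAKE_WIRES` is made-up wiring, not the real Test Set netlist.

```python
from itertools import combinations
from math import comb

BOARDS = 25          # board slots in the Test Set
PINS_PER_SLOT = 47   # pins per backplane socket

# Brute force: check every pair of pins on the whole backplane.
pin_pairs = comb(BOARDS * PINS_PER_SLOT, 2)
# Beep-o-matic: one automated scan per pair of slots.
slot_pairs = comb(BOARDS, 2)


def scan_slot_pair(is_connected, slot_a, slot_b, pins=PINS_PER_SLOT):
    """Drive each pin of slot_a in turn and record which pins of slot_b respond."""
    found = []
    for pin_a in range(1, pins + 1):
        for pin_b in range(1, pins + 1):
            if is_connected(slot_a, pin_a, slot_b, pin_b):
                found.append((pin_a, pin_b))
    return found


def scan_backplane(is_connected, boards=BOARDS):
    """Scan every pair of slots; return {(slot_a, slot_b): [(pin_a, pin_b), ...]}."""
    wiring = {}
    for slot_a, slot_b in combinations(range(1, boards + 1), 2):
        links = scan_slot_pair(is_connected, slot_a, slot_b)
        if links:
            wiring[(slot_a, slot_b)] = links
    return wiring


# Made-up wiring for demonstration only.
FAKE_WIRES = {((1, 5), (2, 12)), ((3, 40), (9, 7))}


def fake_backplane(slot_a, pin_a, slot_b, pin_b):
    """Hypothetical hardware probe backed by the fake wiring table."""
    return ((slot_a, pin_a), (slot_b, pin_b)) in FAKE_WIRES


print(pin_pairs, slot_pairs)
print(scan_backplane(fake_backplane))
```

The two counts show why the scan is practical: checking every pin pair by hand means 689,725 measurements, while the automated approach needs only 300 board placements, each of which tests its 47×47 pin combinations in seconds.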

Conclusions

I started reverse-engineering the Test Set out of curiosity: given an undocumented box made from mystery modules and missing key components, could we understand it? Could we at least get the paper tape reader to run and the lights to flash? It was a tricky puzzle to figure out the modules and the circuitry, but eventually we could read a paper tape and see the results on the display.

But the box turned out to be useful. Marc has amassed a large and operational collection of Apollo communications hardware. We use the UDL Test Set to generate realistic signals that we feed into Apollo's S-band communication system. We haven't transmitted these signals to the Moon, but we have transmitted signals between antennas a few feet apart, receiving them with a box called the S-band Transponder. Moreover, we have used the Test Set to control an Up-Data Link box, a CTE clock, and a simulated Apollo Guidance Computer, reading commands from the paper tape and sending them through the complete communication path. Ironically, the one thing we haven't done with the Test Set is use it to test the Up-Data Link in the way it is intended: connecting the UDL's outputs to the Test Set and checking the panel lights.

From a wider perspective, the Test Set provides a glimpse of the vast scope of the Apollo program. This complicated box was just one part of the test apparatus for one small part of Apollo's electronics. Think of the many different electronic systems in the Apollo spacecraft, and consider the enormous effort to test them all. And electronics was just a small part of Apollo alongside the engines, mechanical structures, fuel cells, and life support systems. With all this complexity, it's not surprising that the Apollo program employed 400,000 people.

For more information, the footnotes include a list of UDL documentation16 and CuriousMarc's videos17. Follow me on Bluesky (@righto.com), Mastodon (@[email protected]), or RSS. (I've given up on Twitter.) I worked on this project with CuriousMarc, Mike Stewart, and Eric Schlapfer. Thanks to John McMaster for X-rays, thanks to Lumafield for the CT scans, and thanks to Marcel for providing the box.

Notes and references

  1. Mike found a NASA document Functional Integrated System Schematics that includes "Up Data Link GSE/SC Integrated Schematic Diagram". Unfortunately, this was not very helpful since the diagram merely shows the Test Set as a rectangle with one wire in and one wire out. The remainder of the diagram (omitted) shows that the output line passes through a dozen boxes (modulators, switches, amplifiers, and so forth) and then enters the UDL onboard the Spacecraft Command Module. At least we could confirm that the Test Set was part of the functional integrated testing of the UDL.

    Detail from "Up Data Link GSE/SC Integrated Schematic Diagram", page GT3.


    Notably, this diagram has the Up-Data Link Confidence Test Set denoted with "2A17". If you examine the photo of the Test Set at the top of the article, you can see that the physical box has a Dymo label "2A17", confirming that this is the same box. 

  2. The table below lists the functions that could be performed by sending a "realtime command" to the Up-Data Link to activate a relay. The crew could reset any of the relays except for K1-K5 (Abort Light A and Crew Alarm).

    The functions controlled by the relays. Adapted from Command/Service Module Systems Handbook.


    A message selected one of 32 relays and specified if the relay should be turned on or off. The relays were magnetic latching relays, so they stayed in the selected position even when de-energized. The relay control also supported "salvo reset": four commands to reset a bank of relays at once. 

  3. The Saturn V booster had a system for receiving commands from the ground, closely related to the Up-Data Link, but with some differences. The Saturn V system used the same Phase-Shift Keying (PSK) and 70 kHz subcarrier as the Up-Data Link, but the frequency of the S-band signal was different for Saturn V (2101.8 MHz). (Since the Command Module and the booster use separate frequencies, the use of different addresses in the up-data messages was somewhat redundant.) Both systems used sub-bit encoding. Both systems used three bits for the vehicle address, but the remainder of the Saturn message was different, consisting of 14 bits for the decoder address, and 18 bits for message data. A typical message for the Launch Vehicle Digital Computer (LVDC) includes a 7-bit command followed by the 7 bits inverted for error detection. The command system for the Saturn V was located in the Instrument Unit, the ring containing most of the electronic systems that was mounted at the top of the rocket, below the Lunar Module. The command system is described in Astrionics System Handbook section 6.2.

    The Saturn Command Decoder. From Saturn IB/V Instrument Unit System Description and Component Data.


    The Lunar Module also had an Up-Data system, called the Digital Up-link Assembly (DUA) and built with integrated circuits. The Digital Up-link Assembly was similar to the Command Module's Up-Data Link and allowed ground stations to control the Lunar Guidance Computer. The DUA also controlled relays to arm the ascent engine. The DUA messages consisted of three vehicle address bits, three system address bits, and 16 information bits. Unlike the Command Module's UDL, the DUA includes the 70-kHz discriminator to demodulate the sub-band. The DUA also provided a redundant up-link voice path, using the data subcarrier to transmit audio. (The Command Module had a similar redundant voice path, but the demodulation was performed in the Premodulation Processor.) The DUA was based on the Digital-Command Assembly (DCA) that received up-link commands on the development vehicles. See Lunar Module Communication System and LM10 Handbook 2.7.4.2.2. 

  4. Unexpectedly, we found three different sets of sub-bit codes in different documents. The Telecommunications Study Guide says that the first digit (the Vehicle Address) encodes a one bit with the sub-bits 11011; for the remaining digits, a one bit is encoded by 10101. Apollo Digital Command System says that the first digit uses 11001 and the remainder use 10001. The schematic in Apollo Digital Up-Data Link Description shows that the first digit uses 11000 and the remainder use 01011. This encoding matches our Up-Data Link and the Test Set, although the Test Set flipped the phase in the PSK signal. (In all cases, a zero bit is encoded by inverting all five sub-bits.) 
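
To make the sub-bit scheme in this note concrete, here is a minimal Python sketch using the codes from the Apollo Digital Up-Data Link Description schematic (11000 for the first digit, 01011 for the rest). The function names are mine, and `address_len=3` assumes that the "first digit" is the three-bit vehicle address.

```python
# Sub-bit codes for a "one" bit, as quoted in this note.
VEHICLE_ADDRESS_ONE = "11000"   # first digit (vehicle address)
REMAINDER_ONE = "01011"         # all later digits


def invert(sub_bits):
    """A zero bit is the one-bit pattern with all five sub-bits inverted."""
    return "".join("1" if b == "0" else "0" for b in sub_bits)


def encode(message_bits, address_len=3):
    """Expand each message bit ('0' or '1') into five sub-bits."""
    out = []
    for i, bit in enumerate(message_bits):
        one = VEHICLE_ADDRESS_ONE if i < address_len else REMAINDER_ONE
        out.append(one if bit == "1" else invert(one))
    return "".join(out)


def decode(sub_bits, address_len=3):
    """Collapse each group of five sub-bits back into one message bit."""
    if len(sub_bits) % 5:
        raise ValueError("sub-bit stream must be a multiple of five sub-bits")
    bits = []
    for i in range(0, len(sub_bits), 5):
        group = sub_bits[i:i + 5]
        one = VEHICLE_ADDRESS_ONE if i // 5 < address_len else REMAINDER_ONE
        if group == one:
            bits.append("1")
        elif group == invert(one):
            bits.append("0")
        else:
            raise ValueError(f"invalid sub-bit group {group!r}")
    return "".join(bits)


print(encode("1010"))
```

Any group that is neither the expected pattern nor its inverse is rejected, which is the point of the redundancy: most transmission errors produce an invalid group rather than a wrong bit.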

  5. To provide range safety if the rocket went off course, the Saturn V booster had a destruct system. This system used detonating fuses along the RP-1 and LOX tanks to split the tanks open. As this happened, the escape tower at the top of the rocket would pull the astronauts to safety, away from the booster. The destruct system was controlled by the Digital Range Safety Command System (DRSCS), which used a cryptographic plug to prevent a malevolent actor from blowing up the rocket.

    The DRSCS—used on both the Saturn and Skylab programs—received a message consisting of a 9-character "Address" word and a 2-character "Command" word. Each character was composed of two audio-frequency tones from an "alphabet" of seven tones, reminiscent of the Dual-Tone Multi-Frequency (DTMF) signals used by Touch-Tone phones. The commands could arm the destruct circuitry, shut off propellants, disperse propellants, or switch the DRSCS off.

    To make this system secure, a "code plug" was carefully installed in the rocket shortly before launch. This code plug provided the "key-of-the-day" by shuffling the mapping between tone pairs and characters. With 21 characters, there were 21! (factorial) possible keys, so the chances of spoofing a message were astronomically small. Moreover, as the System Handbook writes with understatement: "Much attention has been given to preventing execution of a catastrophic command should one component fail during flight."

    For details of the range safety system, see Saturn Launch Vehicle Systems Handbook, Astrionics System Handbook (schematic in section 6.3), Apollo Spacecraft & Saturn V Launch Vehicle Pyrotechnics / Explosive Devices, The Evolution of Electronic Tracking, Optical, Telemetry, and Command Systems at the Kennedy Space Center, and Saturn V Stage I (S-IC) Overview.
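
The key-count arithmetic in this note can be checked quickly in Python. The sketch assumes that each character uses two distinct tones out of the seven, which is what gives 21 characters.

```python
from math import comb, factorial

TONES = 7
tone_pairs = comb(TONES, 2)    # two distinct tones out of seven
keys = factorial(tone_pairs)   # ways to assign tone pairs to characters

print(tone_pairs)        # 21
print(f"{keys:.2e}")     # 5.11e+19
```

That is roughly 5×10^19 possible code plugs.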

  6. I explained above how the Up-Data Link message was encoded into an audio signal using phase-shift keying. However, more steps were required before this signal could be transmitted over Apollo's complicated S-band radio system. Rather than using a separate communication link for each subsystem, Apollo unified most communication over a high-frequency S-band link, calling this the "Unified S-Band". Apollo had many communication streams—voice, control data, scientific data, ranging, telemetry, television—so cramming them onto a single radio link required multiple layers of modulation, like nested Russian Matryoshka dolls with a message inside.

    For the Up-Data Link, the analog PSK signal was modulated onto a subcarrier using frequency modulation. It was combined with the voice signal from ground and the pseudo-random ranging signal, and the combined signal was phase-modulated at 2106.40625 MHz and transmitted to the spacecraft through an enormous dish antenna at a ground station.

    The spectrum of the S-band signal to the Command Module. The Up-Data is transmitted on the 70 kHz subcarrier. Note the very wide spectrum of the pseudo-random ranging signal.


    Thus, the initial message was wrapped in several layers of modulation before transmission: the binary message was expanded to five times its length by the sub-bits, modulated with Phase-Shift Keying, modulated with frequency modulation, and modulated with phase modulation.

    On the spacecraft, the signal went through corresponding layers of demodulation to extract the message. A box called the Unified S-band Transceiver demodulated the phase-modulated signal and sent the data and voice signals to the pre-modulation processor (PMP). The PMP split out the voice and data subcarriers and demodulated the signals with FM discriminators. It sent the data signal (now a 2-kHz audio signal) to the Up-Data Link, where a phase-shift keying demodulator produced a binary output. Finally, each group of five sub-bits was converted to a single bit, revealing the message. 

  7. The Test Set uses eight-bit paper tape, but the encoding is unusual. Each character of the paper tape consists of a three-bit octal digit, the same digit inverted, and two control bits. Because of this redundancy, the Test Set could detect errors while reading the tape.

    One puzzling aspect of the paper tape reader was that we got it working, but when we tilted the Test Set on its side, the reader completely stopped working. It turned out that the reader's motor was controlled by a mercury-wetted relay, a high-current relay that uses mercury for the switch. Since mercury is a liquid, the relay would only work in the proper orientation; when we tilted the box, the mercury rolled away from the contacts. 
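
The tape character format described at the start of this note (digit, inverted digit, two control bits) lends itself to a simple validity check. The sketch below is illustrative only: the bit positions are an assumption, since the actual assignment of tape holes to bits isn't given here, and the function names are hypothetical.

```python
def encode_tape_char(digit, control=0):
    """Build one 8-bit tape character.

    Assumed layout: bits 0-2 = octal digit, bits 3-5 = inverted digit,
    bits 6-7 = control bits.
    """
    if not 0 <= digit <= 7 or not 0 <= control <= 3:
        raise ValueError("digit must be 0-7 and control 0-3")
    return (control << 6) | ((~digit & 0b111) << 3) | digit


def decode_tape_char(byte):
    """Return (digit, control), or raise ValueError on a bad character."""
    digit = byte & 0b111
    inverted = (byte >> 3) & 0b111
    control = (byte >> 6) & 0b11
    # A valid character has every digit bit opposite to its inverted copy.
    if digit ^ inverted != 0b111:
        raise ValueError(f"tape read error in character {byte:#010b}")
    return digit, control


char = encode_tape_char(5)
print(f"{char:08b}", decode_tape_char(char))
```

With this layout, any single misread hole in the six digit bits breaks the digit/inverse relationship and is detected.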

  8. This view of the Test Set from the top shows the positions of the 25 circuit boards, A1 through A25. Most of the boards are mounted in pairs, although A1, A2, and A15 are mounted singly. Because boards A1 and A11 have larger components, they have empty slots next to them; these are not missing boards. Each board unit has two ejector levers to remove it, along with two metal tabs to lock the unit into position. The 15 numbered holes allow access to the test points for each board. (I don't know the meaning of the text "CTS" on each board unit.) The thirteen digit display modules are at the bottom, with their dropping resistors at the bottom right.

    Top view of the Test Set.


     

  9. There are seven driver boards: A3 through A9. Board A3 is different from the others because it implements one digit instead of two. In place of the second digit, board A3 includes validation logic for the paper tape data.

  10. Here is the datasheet for the digit displays in the Test Set: "Numerik Indicator IND-0300". In current dollars, they cost over $200 each! The cutaway diagram shows how the bent plastic sheets are stacked and illuminated.

    Datasheet from General Radio Catalog, 1963.


    For amazing photos that show the internal structure of the displays, see this article. Fran Blanche's video discusses a similar display. Wikipedia has a page on lightguide displays.

    While restoring the Test Set, we discovered that a few of the light bulbs were burnt out. Since displaying an octal digit only uses eight of the ten bulbs, we figured that we could swap the failed bulbs with unused bulbs from "8" or "9". It turned out that we weren't the first people to think of this—many of the "unused" bulbs were burnt out. 

  11. I'll give more details on the count-to-five ring counter. The first flip-flop gets its J input from the Q' output of the last flip-flop as expected, but it gets its K input from the Q output of the second flip-flop, not the last flip-flop. If you examine the states, this causes the transition from 110 to 011 (a toggle instead of a set to 111), resulting in five states instead of six. 
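
The state sequence in this note is easy to check with a small simulation. The sketch below assumes three JK flip-flops starting at 000, with the second and third stages wired as ordinary shift stages (J from the previous Q, K from the previous Q'); only the first flip-flop's wiring is taken from the description above.

```python
def jk(q, j, k):
    """Next state of a JK flip-flop: toggle, set, reset, or hold."""
    if j and k:
        return 1 - q
    if j:
        return 1
    if k:
        return 0
    return q


def step(state, five_count=True):
    """Clock all three flip-flops once."""
    a, b, c = state
    # First flip-flop: J comes from Q' of the last flip-flop. K comes from Q of
    # the second flip-flop (count-to-five) or Q of the last (ordinary
    # twisted-ring counter).
    k_a = b if five_count else c
    return (jk(a, 1 - c, k_a), jk(b, a, 1 - a), jk(c, b, 1 - b))


def cycle(five_count=True):
    """Return the sequence of states, starting from 000, until it repeats."""
    state, seen = (0, 0, 0), []
    while state not in seen:
        seen.append(state)
        state = step(state, five_count)
    return seen


print(cycle(True))    # five states
print(cycle(False))   # six states
```

With the modified K input, the counter goes 000, 100, 110, 011, 001 and back to 000, skipping 111.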

  12. To explain the phase-shift keying circuitry in a bit more detail, board A1 produces a 4 kHz clock signal. Board A2 divides the clock, producing a 2 kHz signal and a 1 kHz signal. The 2 kHz signal is fed into the transformer to be phase-shifted. Then the 1 kHz reference signal is mixed in to form the PSK output. Resonant filters on board A1 convert the square-wave clock signals to smooth sine waves. 
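
As a rough model of the output described in this note, the Python sketch below flips the phase of a 2 kHz sine wave according to the data and adds the 1 kHz reference. The sub-bit rate (one sub-bit per reference cycle), the equal amplitudes, the phase alignment, and the sample rate are all assumptions for illustration; the real Test Set does this with a transformer and analog filters.

```python
import math

CARRIER_HZ = 2000     # 2 kHz signal that gets phase-shifted
REFERENCE_HZ = 1000   # 1 kHz reference mixed into the output
SUB_BIT_RATE = 1000   # assumption: one sub-bit per reference cycle
SAMPLE_RATE = 48000   # arbitrary sample rate for the model


def psk_samples(sub_bits):
    """Return waveform samples for a string of '0'/'1' sub-bits."""
    samples_per_bit = SAMPLE_RATE // SUB_BIT_RATE
    out = []
    for n, bit in enumerate(sub_bits):
        sign = 1.0 if bit == "1" else -1.0   # 180-degree phase flip for a zero
        for i in range(samples_per_bit):
            t = (n * samples_per_bit + i) / SAMPLE_RATE
            data = sign * math.sin(2 * math.pi * CARRIER_HZ * t)
            reference = math.sin(2 * math.pi * REFERENCE_HZ * t)
            out.append(data + reference)
    return out


wave = psk_samples("11000")
print(len(wave))
```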

  13. I was surprised to find LED opto-isolators in a device from the mid-1960s. I expected that the Test Set isolator used a light bulb, but testing showed that it switches on at 550 mV (like a diode) and operates successfully at over 100 kHz, impossible with a light bulb or photoresistor. It turns out that Texas Instruments filed a patent for an LED-based opto-isolator in 1963 and turned this into a product in 1964. The "PEX 3002" used a gallium-arsenide LED and a silicon phototransistor. Strangely, TI called this product a "molecular multiplex switch/chopper". Nowadays, an opto-isolator costs pennies, but at the time, these devices were absurdly expensive: TI's device sold for $275 (almost $3000 in current dollars). For more, see The Optical Link: A New Circuit Tool, 1965.

  14. For more information on the Burr-Brown 1506 op amp module, see Burr-Brown Handbook of Operational Amplifier RC Networks. Other documents are Burr-Brown Handbook of Operational Amplifier Applications, Op-Amp History, Operational Amplifier Milestones, and an ad for the Burr-Brown 130 op amp. 

  15. I'm not sure of the meaning of the Direct versus Remote verification codes. The Block I (earlier) UDL had an 8-bit code, while the Block II (flight) UDL had a 4-bit code. The Direct code presumably comes from the UDL itself, while the Remote code is perhaps supplied through telemetry? 

  16. The block diagram below shows the structure of the Up-Data Link (UDL). It uses the sub-bit decoder and a 24-stage register to deserialize the message. Based on the message, the UDL triggers relays (RTC), outputs data to the Apollo Guidance Computer (called the CMC, Command Module Computer here), sends pulses to the CTE clock, or sends validity signals back to Earth.

    UDL block diagram, from Apollo Operations Handbook, page 31


    For details of the Apollo Up-Data system, see the diagram below (click it for a very large image). This diagram is from the Command/Service Module Systems Handbook (PDF page 64); see page 80 for written specifications of the UDL.

    This diagram of the Apollo Updata system specifies the message formats, relay usages, and internal structure of the UDL.


    Other important sources of information: Apollo Digital Up-Data Link Description contains schematics and a detailed description of the UDL. Telecommunication Systems Study Guide describes the earlier UDL that included a 450 MHz FM receiver. 

  17. The following CuriousMarc videos describe the Up-Data Link and the Test Set, so smash that Like button and subscribe :-)

     

Reverse engineering the 386 processor's prefetch queue circuitry

In 1985, Intel introduced the groundbreaking 386 processor, the first 32-bit processor in the x86 architecture. To improve performance, the 386 has a 16-byte instruction prefetch queue. The purpose of the prefetch queue is to fetch instructions from memory before they are needed, so the processor usually doesn't need to wait on memory while executing instructions. Instruction prefetching takes advantage of times when the processor is "thinking" and the memory bus would otherwise be unused.

In this article, I look at the 386's prefetch queue circuitry in detail. One interesting circuit is the incrementer, which adds 1 to a pointer to step through memory. This sounds easy enough, but the incrementer uses complicated circuitry for high performance. The prefetch queue uses a large network to shift bytes around so they are properly aligned. It also has a compact circuit to extend signed 8-bit and 16-bit numbers to 32 bits. There aren't any major discoveries in this post, but if you're interested in low-level circuits and dynamic logic, keep reading.

The photo below shows the 386's shiny fingernail-sized silicon die under a microscope. Although it may look like an aerial view of a strangely-zoned city, the die photo reveals the functional blocks of the chip. The Prefetch Unit in the upper left is the relevant block. In this post, I'll discuss the prefetch queue circuitry (highlighted in red), skipping over the prefetch control circuitry to the right. The Prefetch Unit receives data from the Bus Interface Unit (upper right) that communicates with memory. The Instruction Decode Unit receives prefetched instructions from the Prefetch Unit, byte by byte, and decodes the opcodes for execution.

This die photo of the 386 shows the location of the registers. Click this image (or any other) for a larger version.


The left quarter of the chip consists of stripes of circuitry that appears much more orderly than the rest of the chip. This grid-like appearance arises because each functional block is constructed (for the most part) by repeating the same circuit 32 times, once for each bit, side by side. Vertical data lines run up and down, in groups of 32 bits, connecting the functional blocks. To make this work, each circuit must fit into the same width on the die; this layout constraint forces the circuit designers to develop a circuit that uses this width efficiently without exceeding the allowed width. The circuitry for the prefetch queue uses the same approach: each circuit is 66 µm wide1 and repeated 32 times. As will be seen, fitting the prefetch circuitry into this fixed width requires some layout tricks.

What the prefetcher does

The purpose of the prefetch unit is to speed up performance by reading instructions from memory before they are needed, so the processor won't need to wait to get instructions from memory. Prefetching takes advantage of times when the memory bus is otherwise idle, minimizing conflict with other instructions that are reading or writing data. In the 386, prefetched instructions are stored in a 16-byte queue, consisting of four 32-bit blocks.2

The diagram below zooms in on the prefetcher and shows its main components. You can see how the same circuit (in most cases) is repeated 32 times, forming vertical bands. At the top are 32 bus lines from the Bus Interface Unit. These lines provide the connection between the datapath and external memory, via the Bus Interface Unit. These lines form a triangular pattern as the 32 horizontal lines on the right branch off and form 32 vertical lines, one for each bit. Next are the fetch pointer and the limit register, with a circuit to check if the fetch pointer has reached the limit. Note that the two low-order bits (on the right) of the incrementer and limit check circuit are missing. At the bottom of the incrementer, you can see that some bit positions have a blob of circuitry missing from others, breaking the pattern of repeated blocks. The 16-byte prefetch queue is below the incrementer. Although this memory is the heart of the prefetcher, its circuitry takes up a relatively small area.

A close-up of the prefetcher with the main blocks labeled. At the right, the prefetcher receives control signals.


The bottom part of the prefetcher shifts data to align it as needed. A 32-bit value can be split across two 32-bit rows of the prefetch buffer. To handle this, the prefetcher includes a data shift network to shift and align its data. This network occupies a lot of space, but there is no active circuitry here: just a grid of horizontal and vertical wires.

Finally, the sign extend circuitry converts a signed 8-bit or 16-bit value into a signed 16-bit or 32-bit value as needed. You can see that the sign extend circuitry is highly irregular, especially in the middle. A latch stores the output of the prefetch queue for use by the rest of the datapath.

Limit check

If you've written x86 programs, you probably know about the processor's Instruction Pointer (EIP) that holds the address of the next instruction to execute. As a program executes, the Instruction Pointer moves from instruction to instruction. However, it turns out that the Instruction Pointer doesn't actually exist! Instead, the 386 has an "Advance Instruction Fetch Pointer", which holds the address of the next instruction to fetch into the prefetch queue. But sometimes the processor needs to know the Instruction Pointer value, for instance, to determine the return address when calling a subroutine or to compute the destination address of a relative jump. So what happens? The processor gets the Advance Instruction Fetch Pointer address from the prefetch queue circuitry and subtracts the current length of the prefetch queue. The result is the address of the next instruction to execute, the desired Instruction Pointer value.
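
The subtraction is simple enough to state in one line of Python. This is just the arithmetic from the paragraph above, with hypothetical names; the example values are made up.

```python
MASK32 = 0xFFFFFFFF


def instruction_pointer(fetch_pointer, queued_bytes):
    """Recover the Instruction Pointer from the Advance Instruction Fetch Pointer.

    queued_bytes is the number of prefetched bytes not yet executed.
    """
    return (fetch_pointer - queued_bytes) & MASK32


# Made-up example: the prefetcher has fetched up to 0x1010 and six bytes
# are still waiting in the queue.
print(hex(instruction_pointer(0x1010, 6)))   # 0x100a
```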

The Advance Instruction Fetch Pointer—the address of the next instruction to prefetch—is stored in a register at the top of the prefetch queue circuitry. As instructions are prefetched, this pointer is incremented by the prefetch circuitry. (Since instructions are fetched 32 bits at a time, this pointer is incremented in steps of four and the bottom two bits are always 0.)

But what keeps the prefetcher from prefetching too far and going outside the valid memory range? The x86 architecture infamously uses segments to define valid regions of memory. A segment has a start and end address (known as the base and limit) and memory is protected by blocking accesses outside the segment. The 386 has six active segments; the relevant one is the Code Segment that holds program instructions. Thus, the limit address of the Code Segment controls when the prefetcher must stop prefetching.3 The prefetch queue contains a circuit to stop prefetching when the fetch pointer reaches the limit of the Code Segment. In this section, I'll describe that circuit.

Comparing two values may seem trivial, but the 386 uses a few tricks to make this fast. The basic idea is to use 30 XOR gates to compare the bits of the two registers. (Why 30 bits and not 32? Since 32 bits are fetched at a time, the bottom bits of the address are 00 and can be ignored.) If the two registers match, all the XOR values will be 0, but if they don't match, an XOR value will be 1. Conceptually, connecting the XORs to a 30-input OR gate will yield the desired result: 0 if all bits match and 1 if there is a mismatch. Unfortunately, building a 30-input OR gate using standard CMOS logic is impractical for electrical reasons, as well as inconveniently large to fit into the circuit. Instead, the 386 uses dynamic logic to implement a spread-out NOR gate with one transistor in each column of the prefetcher.
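
In software terms, the comparison looks something like the Python sketch below. It models only the logic (one XOR per column feeding a shared NOR), not the timing or the dynamic circuitry; the function name is hypothetical.

```python
def limit_reached(fetch_pointer, limit):
    """Compare the upper 30 bits of two 32-bit values.

    Each of the 30 columns contributes one XOR; the shared NOR output is 1
    only if no column reports a mismatch.
    """
    nor_output = 1
    for bit in range(2, 32):                 # skip the two low-order bits
        differ = ((fetch_pointer >> bit) ^ (limit >> bit)) & 1
        if differ:
            nor_output = 0                   # any mismatch forces the output to 0
    return nor_output == 1


print(limit_reached(0x1000, 0x1003))   # True: only the ignored low bits differ
print(limit_reached(0x1000, 0x1004))   # False
```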

The schematic below shows the implementation of one bit of the equality comparison. The mechanism is that if the two registers differ, the transistor on the right is turned on, pulling the equality bus low. This circuit is replicated 30 times, comparing all the bits: if there is any mismatch, the equality bus will be pulled low, but if all bits match, the bus remains high. The three gates on the left implement XNOR; this circuit may seem overly complicated, but it is a standard way of implementing XNOR. The NOR gate at the right blocks the comparison except during clock phase 2. (The importance of this will be explained below.)

This circuit is repeated 30 times to compare the registers.


The equality bus travels horizontally through the prefetcher, pulled low if any bits don't match. But what pulls the bus high? That's the job of the dynamic circuit below. Unlike regular static gates, dynamic logic is controlled by the processor's clock signals and depends on capacitance in the circuit to hold data. The 386 is controlled by a two-phase clock signal.4 In the first clock phase, the precharge transistor below turns on, pulling the equality bus high. In the second clock phase, the XOR circuits above are enabled, pulling the equality bus low if the two registers don't match. Meanwhile, the CMOS switch turns on in clock phase 2, passing the equality bus's value to the latch. The "keeper" circuit keeps the equality bus held high unless it is explicitly pulled low, to avoid the risk of the voltage on the equality bus slowly dissipating. The keeper uses a weak transistor to keep the bus high while inactive. But if the bus is pulled low, the keeper transistor is overpowered and turns off.

This is the output circuit for the equality comparison. This circuit is located to the right of the prefetcher.


This dynamic logic reduces power consumption and circuit size. Since the bus is charged and discharged during opposite clock phases, you avoid steady current through the transistors. (In contrast, an NMOS processor like the 8086 might use a pull-up on the bus. When the bus is pulled low, you would end up with current flowing through both the pull-up and the pull-down transistors. This would increase power consumption, make the chip run hotter, and limit your clock speed.)

The incrementer

After each prefetch, the Advance Instruction Fetch Pointer must be incremented to hold the address of the next instruction to prefetch. Incrementing this pointer is the job of the incrementer. (Because each fetch is 32 bits, the pointer is incremented by 4 each time. But in the die photo, you can see a notch in the incrementer and limit check circuit where the circuitry for the bottom two bits has been omitted. Thus, the incrementer's circuitry increments its value by 1, so the pointer (with two zero bits appended) increases in steps of 4.)

Building an incrementer circuit is straightforward; for example, you can use a chain of 30 half-adders. The problem is that incrementing a 30-bit value at high speed is difficult because of the carries from one position to the next. It's similar to calculating 99999999 + 1 in decimal; you need to tediously carry the 1, carry the 1, carry the 1, and so forth, through all the digits, resulting in a slow, sequential process.
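
For reference, here is the slow half-adder chain modeled in Python, along with the way a 30-bit increment turns into a step of 4 for the pointer. This is the straightforward approach, not what the 386 does; the function names are mine.

```python
def half_adder(a, carry_in):
    """Return (sum, carry_out) for one bit position."""
    return a ^ carry_in, a & carry_in


def ripple_increment(value, width=30):
    """Add 1 by rippling the carry through a chain of half-adders."""
    carry = 1            # the "+1" enters as a carry into the lowest bit
    result = 0
    for bit in range(width):
        s, carry = half_adder((value >> bit) & 1, carry)
        result |= s << bit
    return result        # the final carry is dropped, so the value wraps


def next_fetch_pointer(pointer):
    """Increment the upper 30 bits; the two low bits stay 00."""
    return ripple_increment(pointer >> 2) << 2


print(bin(ripple_increment(0b11011)))    # 0b11100
print(hex(next_fetch_pointer(0x1000)))   # 0x1004
```

Each pass through the loop depends on the carry from the previous pass, which is exactly the sequential bottleneck.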

The incrementer uses a faster approach. First, it computes all the carries at high speed, almost in parallel. Then it computes each output bit in parallel from the carries—if there is a carry into a position, it toggles that bit.

Computing the carries is straightforward in concept: if there is a block of 1 bits at the end of the value, all those bits will produce carries, but carrying is stopped by the rightmost 0 bit. For instance, incrementing binary 11011 results in 11100; there are carries from the last two bits, but the zero stops the carries. A circuit to implement this was developed at the University of Manchester in England way back in 1959, and is known as the Manchester carry chain.

In the Manchester carry chain, you build a chain of switches, one for each data bit, as shown below. For a 1 bit, you close the switch, but for a 0 bit you open the switch. (The switches are implemented by transistors.) To compute the carries, you start by feeding in a carry signal at the right. The signal will go through the closed switches until it hits an open switch, and then it will be blocked.5 The outputs along the chain give us the desired carry value at each position.

Concept of the Manchester carry chain, 4 bits.


Since the switches in the Manchester carry chain can all be set in parallel and the carry signal blasts through the switches at high speed, this circuit rapidly computes the carries we need. The carries then flip the associated bits (in parallel), giving us the result much faster than a straightforward adder.
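
The switch-chain idea can be modeled in Python as follows. This is a logical sketch only: in software the loop walks the chain one bit at a time, whereas in the hardware the switches are all set at once and the signal simply propagates through them. It also ignores the inverted signal, precharge, and amplification discussed below.

```python
def carry_chain(value, width):
    """Return carries[i], the carry into bit i (bit 0 is the rightmost).

    A 1 bit is a closed switch that passes the carry signal along;
    a 0 bit is an open switch that blocks it.
    """
    carries = []
    signal = 1                          # carry fed in at the right
    for bit in range(width):
        carries.append(signal)
        if (value >> bit) & 1 == 0:     # open switch: block the signal
            signal = 0
    return carries


def increment(value, width=30):
    """Toggle every bit that has a carry coming into it."""
    result = 0
    for bit, carry in enumerate(carry_chain(value, width)):
        result |= (((value >> bit) & 1) ^ carry) << bit
    return result


print(carry_chain(0b11011, 5))    # [1, 1, 1, 0, 0]
print(bin(increment(0b11011)))    # 0b11100
```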

There are complications, of course, in the actual implementation. The carry signal in the carry chain is inverted, so a low signal propagates through the carry chain to indicate a carry. (It is faster to pull a signal low than high.) But something needs to make the line go high when necessary. As with the equality circuitry, the solution is dynamic logic. That is, the carry line is precharged high during one clock phase and then processing happens in the second clock phase, potentially pulling the line low.

The next problem is that the carry signal weakens as it passes through multiple transistors and long lengths of wire. The solution is that each segment has a circuit to amplify the signal, using a clocked inverter and an asymmetrical inverter. Importantly, this amplifier is not in the carry chain path, so it doesn't slow down the signal through the chain.

The Manchester carry chain circuit for a typical bit in the incrementer.


The schematic above shows the implementation of the Manchester carry chain for a typical bit. The chain itself is at the bottom, with the transistor switch as before. During clock phase 1, the precharge transistor pulls this segment of the carry chain high. During clock phase 2, the signal on the chain goes through the "clocked inverter" at the right to produce the local carry signal. If there is a carry, the next bit is flipped by the XOR gate, producing the incremented output.6 The "keeper/amplifier" is an asymmetrical inverter that produces a strong low output but a weak high output. When there is no carry, its weak output keeps the carry chain pulled high. But as soon as a carry is detected, it strongly pulls the carry chain low to boost the carry signal.

But this circuit still isn't enough for the desired performance. The incrementer uses a second carry technique in parallel: carry skip. The concept is to look at blocks of bits and allow the carry to jump over the entire block. The diagram below shows a simplified implementation of the carry skip circuit. Each block consists of 3 to 6 bits. If all the bits in a block are 1's, then the AND gate turns on the associated transistor in the carry skip line. This allows the carry skip signal to propagate (from left to right), a block at a time. When it reaches a block with a 0 bit, the corresponding transistor will be off, stopping the carry as in the Manchester carry chain. The AND gates all operate in parallel, so the transistors are rapidly turned on or off in parallel. Then, the carry skip signal passes through a small number of transistors, without going through any logic. (The carry skip signal is like an express train that skips most stations, while the Manchester carry chain is the local train to all the stations.) Like the Manchester carry chain, the implementation of carry skip needs precharge circuits on the lines, a keeper/amplifier, and clocked logic, but I'll skip the details.

An abstracted and simplified carry-skip circuit. The block sizes don't match the 386's circuit.
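
As a rough illustration of the carry-skip idea, the sketch below computes which blocks receive a carry. The function name and the block sizes are hypothetical (the caption above already notes that the diagram's block sizes don't match the 386), and blocks are listed starting from the least-significant bits.

```python
def carry_into_blocks(value: int, block_sizes: list[int]) -> list[bool]:
    """For each block (least-significant first), report whether a carry enters it."""
    carries = []
    carry = True  # an incrementer always injects a carry at the low end
    position = 0
    for size in block_sizes:
        carries.append(carry)
        mask = ((1 << size) - 1) << position
        all_ones = (value & mask) == mask  # the block's wide AND gate
        # If the block is all 1's, its skip transistor is on and the carry
        # jumps over the whole block; otherwise the carry stops here.
        carry = carry and all_ones
        position += size
    return carries
```

With three 4-bit blocks and the value 0x07F, the lowest block is all 1's, so the carry skips into the second block, but the 0 bit in the second block keeps the carry out of the third.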

One interesting feature is the layout of the large AND gates. A 6-input AND gate is a large device, difficult to fit into one cell of the incrementer. The solution is that the gate is spread out across multiple cells. Specifically, the gate uses a standard CMOS NAND gate circuit with NMOS transistors in series and PMOS transistors in parallel. Each cell has an NMOS transistor and a PMOS transistor, and the chains are connected at the end to form the desired NAND gate. (Inverting the output produces the desired AND function.) This spread-out layout technique is unusual, but keeps each bit's circuitry approximately the same size.

The incrementer circuitry was tricky to reverse engineer because of these techniques. In particular, most of the prefetcher consists of a single block of circuitry repeated 32 times, once for each bit. The incrementer, on the other hand, consists of four different blocks of circuitry, repeating in an irregular pattern. Specifically, one block starts a carry chain, a second block continues the carry chain, and a third block ends a carry chain. The block before the ending block is different (one large transistor to drive the last block), making four variants in total. This irregular pattern is visible in the earlier photo of the prefetcher.

The alignment network

The bottom part of the prefetcher rotates data to align it as needed. Unlike some processors, the x86 does not enforce aligned memory accesses. That is, a 32-bit value does not need to start on a 4-byte boundary in memory. As a result, a 32-bit value may be split across two 32-bit rows of the prefetch queue. Moreover, when the instruction decoder fetches one byte of an instruction, that byte may be at any position in the prefetch queue.

To deal with these problems, the prefetcher includes an alignment network that can rotate bytes to output a byte, word, or four bytes with the alignment required by the rest of the processor.

The diagram below shows part of this alignment network. Each bit exiting the prefetch queue (top) has four wires, for rotates of 24, 16, 8, or 0 bits. Each rotate wire is connected to one of the 32 horizontal bit lines. Finally, each horizontal bit line has an output tap, going to the datapath below. (The vertical lines are in the chip's lower M1 metal layer, while the horizontal lines are in the upper M2 metal layer. For this photo, I removed the M2 layer to show the underlying layer. Shadows of the original horizontal lines are still visible.)

Part of the alignment network.

The idea is that by selecting one set of vertical rotate lines, the 32-bit output from the prefetch queue will be rotated left by that amount. For instance, to rotate by 8, bits are sent down the "rotate 8" lines. Bit 0 from the prefetch queue will energize horizontal line 8, bit 1 will energize horizontal line 9, and so forth, with bit 31 wrapping around to horizontal line 7. Since horizontal bit line 8 is connected to output 8, the result is that bit 0 is output as bit 8, bit 1 is output as bit 9, and so forth.
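
The wiring rule in the paragraph above amounts to "queue bit i drives horizontal line (i + amount) mod 32". A small Python model, with a function name invented for illustration, makes the wraparound explicit:

```python
def rotate_left_wiring(word: int, amount: int) -> int:
    """Model the rotate lines: queue bit i drives horizontal line (i + amount) % 32."""
    if amount not in (0, 8, 16, 24):
        raise ValueError("the network only provides rotates of 0, 8, 16, or 24 bits")
    out = 0
    for i in range(32):
        if (word >> i) & 1:
            out |= 1 << ((i + amount) % 32)
    return out
```

With a rotate of 8, bit 0 comes out as bit 8 and bit 31 wraps around to bit 7, matching the description.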

The four possibilities for aligning a 32-bit value. The four bytes above are shifted as specified to produce the desired output below.

For the alignment process, one 32-bit output may be split across two 32-bit entries in the prefetch queue in four different ways, as shown above. These combinations are implemented by multiplexers and drivers. Two 32-bit multiplexers select the two relevant rows in the prefetch queue (blue and green above). Four 32-bit drivers are connected to the four sets of vertical lines, with one set of drivers activated to produce the desired shift. Each byte of each driver is wired to achieve the alignment shown above. For instance, the rotate-8 driver gets its top byte from the "green" multiplexer and the other three bytes from the "blue" multiplexer. The result is that the four bytes, split across two queue rows, are rotated to form an aligned 32-bit value.
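
The per-byte selection followed by a rotate can be modeled as below. This is a sketch under assumptions, not the 386's actual control logic: the names `first_row`, `second_row`, and `offset` are invented, byte 0 is taken to be the low byte of a row, and the mapping of the blue and green multiplexers onto "first" and "second" is deliberately left open.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def aligned_dword(first_row: int, second_row: int, offset: int) -> int:
    """Return the 32-bit value that starts at byte `offset` (0-3) of first_row."""
    if offset not in (0, 1, 2, 3):
        raise ValueError("offset must be 0, 1, 2, or 3")
    # Per-byte selection: bytes at or above the offset come from the first
    # row; the lower bytes come from the second row.
    keep_first = (MASK32 << (8 * offset)) & MASK32
    driver_word = (first_row & keep_first) | (second_row & ~keep_first & MASK32)
    # Rotate so the byte at `offset` lands in the lowest byte position.
    return rotl32(driver_word, (32 - 8 * offset) % 32)
```

For an offset of 3, the driver word takes only its top byte from the first row and the other three bytes from the second row, then rotates left by 8, which is the rotate-8 case described above.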

Sign extension

The final circuit is sign extension. Suppose you want to add an 8-bit value to a 32-bit value. An unsigned 8-bit value can be extended to 32 bits by simply filling the upper bits with zeroes. But for a signed value, it's trickier. For instance, -1 is the eight-bit value 0xFF, but the 32-bit value is 0xFFFFFFFF. To convert an 8-bit signed value to 32 bits, the top 24 bits must be filled in with the top bit of the original value (which indicates the sign). In other words, for a positive value, the extra bits are filled with 0, but for a negative value, the extra bits are filled with 1. This process is called sign extension.9

In the 386, a circuit at the bottom of the prefetcher performs sign extension for values in instructions. This circuit supports extending an 8-bit value to 16 bits or 32 bits, as well as extending a 16-bit value to 32 bits. This circuit will extend a value with zeros or with the sign, depending on the instruction.
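
In software terms, the three supported extensions behave like the following sketch. The function name and its parameters are made up for illustration; the real circuit works on latched bits rather than Python integers.

```python
def extend(value: int, from_bits: int, to_bits: int, signed: bool) -> int:
    """Extend an 8- or 16-bit value to 16 or 32 bits with zeros or with the sign bit."""
    if (from_bits, to_bits) not in ((8, 16), (8, 32), (16, 32)):
        raise ValueError("unsupported extension")
    low_mask = (1 << from_bits) - 1
    value &= low_mask
    sign = (value >> (from_bits - 1)) & 1
    if signed and sign:
        # Negative value: fill every added upper bit with 1.
        value |= ((1 << to_bits) - 1) & ~low_mask
    return value
```

For example, `extend(0xFF, 8, 32, signed=True)` gives 0xFFFFFFFF (that is, -1), while the same call with `signed=False` leaves the value as 0xFF.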

The schematic below shows one bit of this sign extension circuit. It consists of a latch on the left and right, with a multiplexer in the middle. The latches are constructed with a standard 386 circuit using a CMOS switch (see footnote).7 The multiplexer selects one of three values: the bit value from the swap network, 0 for sign extension, or 1 for sign extension. The multiplexer is constructed from a CMOS switch if the bit value is selected and two transistors for the 0 or 1 values. This circuit is replicated 32 times, although the bottom byte only has the latches, not the multiplexer, as sign extension does not modify the bottom byte.

The sign extend circuit associated with bits 31-8 from the prefetcher.

The second part of the sign extension circuitry determines if the bits should be filled with 0 or 1 and sends the control signals to the circuit above. The gates on the left determine if the sign extension bit should be a 0 or a 1. For a 16-bit sign extension, this bit comes from bit 15 of the data, while for an 8-bit sign extension, the bit comes from bit 7. The four gates on the right generate the signals to sign extend each bit, producing separate signals for the bit range 31-16 and the range 15-8.

This circuit determines which bits should be filled with 0 or 1.
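
One way to think about this control logic is as a function that picks a multiplexer setting for each of the two bit ranges. The sketch below is a guess at the behavior, not a transcription of the gates: the function name, the string labels, and the assumption that an 8-bit extension fills both ranges are all mine.

```python
def sign_extend_controls(data: int, source_bits: int, signed: bool) -> tuple[str, str]:
    """Return multiplexer settings for bits 31-16 and bits 15-8.

    Each setting is "pass" (use the data bit), "zero", or "one".
    """
    if source_bits == 32:
        return "pass", "pass"      # nothing to extend
    if source_bits == 16:
        fill_bit = (data >> 15) & 1  # sign comes from bit 15
    elif source_bits == 8:
        fill_bit = (data >> 7) & 1   # sign comes from bit 7
    else:
        raise ValueError("source_bits must be 8, 16, or 32")

    fill = "one" if (signed and fill_bit) else "zero"
    select_31_16 = fill
    # Bits 15-8 are only replaced when the source value is a single byte.
    select_15_8 = fill if source_bits == 8 else "pass"
    return select_31_16, select_15_8
```

The two returned settings correspond to the separate signals for the range 31-16 and the range 15-8 mentioned above.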

The layout of this circuit on the die is somewhat unusual. Most of the prefetcher circuitry consists of 32 identical columns, one for each bit.8 The circuitry above is implemented once, using about 16 gates (buffers and inverters are not shown above). Despite this, the circuitry above is crammed into bit positions 17 through 7, creating irregularities in the layout. Moreover, the implementation of the circuitry in silicon is unusual compared to the rest of the 386. Most of the 386's circuitry uses the two metal layers for interconnection, minimizing the use of polysilicon wiring. However, the circuit above also uses long stretches of polysilicon to connect the gates.

Layout of the sign extension circuitry. This circuitry is at the bottom of the prefetch queue.

The diagram above shows the irregular layout of the sign extension circuitry amid the regular datapath circuitry that is 32 bits wide. The sign extension circuitry is shown in green; this is the circuitry described at the top of this section, repeated for each bit 31-8. The circuitry for bits 15-8 has been shifted upward, perhaps to make room for the sign extension control circuitry, indicated in red. Note that the layout of the control circuitry is completely irregular, since there is one copy of the circuitry and it has no internal structure. One consequence of this layout is the wasted space to the left and right of this circuitry block, the tan regions with no circuitry except vertical metal lines passing through. At the far right, a block of circuitry to control the latches has been wedged under bit 0. Intel's designers go to great effort to minimize the size of the processor die since a smaller die saves substantial money. This layout must have been the most efficient they could manage, but I find it aesthetically displeasing compared to the regularity of the rest of the datapath.

How instructions flow through the chip

Instructions follow a tortuous path through the 386 chip. First, the Bus Interface Unit in the upper right corner reads instructions from memory and sends them over a 32-bit bus (blue) to the prefetch unit. The prefetch unit stores the instructions in the 16-byte prefetch queue.

Instructions follow a twisting path to and from the prefetch queue.

How is an instruction executed from the prefetch queue? It turns out that there are two distinct paths. Suppose you're executing an instruction to add the hexadecimal value 12345678 to the EAX register. The prefetch queue will hold the five bytes 05 (the opcode), 78, 56, 34, and 12, with the value stored low byte first. The prefetch queue provides opcodes to the decoder one byte at a time over the 8-bit bus shown in red. The bus takes the lowest 8 bits from the prefetch queue's alignment network and sends this byte to a buffer (the small square at the head of the red arrow). From there, the opcode travels to the instruction decoder.10 The instruction decoder, in turn, uses large tables (PLAs) to convert the x86 instruction into a 111-bit internal format with 19 different fields.11
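
The five bytes in the ADD example can be reproduced with a short Python snippet. The helper name `encode_add_eax_imm32` is invented for illustration, and the snippet assumes the 32-bit operand size, where opcode 05 is followed by a four-byte immediate in little-endian order.

```python
import struct

def encode_add_eax_imm32(imm: int) -> bytes:
    """Encode ADD EAX, imm32: opcode 0x05 followed by the little-endian immediate."""
    return bytes([0x05]) + struct.pack("<I", imm & 0xFFFFFFFF)

encoded = encode_add_eax_imm32(0x12345678)
print(encoded.hex(" "))  # 05 78 56 34 12
```

The first byte is what travels over the 8-bit bus to the decoder; the remaining four are the immediate data.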

The data bytes of an instruction, on the other hand, go from the prefetch queue to the ALU (Arithmetic Logic Unit) through a 32-bit data bus (orange). Unlike the previous buses, this data bus is spread out, with one wire through each column of the datapath. This bus extends through the entire datapath so values can also be stored into registers. For instance, the MOV (move) instruction can store a value from an instruction (an "immediate" value) into a register.

Conclusions

The 386's prefetch queue contains about 7400 transistors, more than an Intel 8080 processor. (And this is just the queue itself; I'm ignoring the prefetch control logic.) This illustrates the rapid advance of processor technology: part of one functional unit in the 386 contains more transistors than an entire 8080 processor from 11 years earlier. And this unit is less than 3% of the entire 386 processor.

Every time I look at an x86 circuit, I see the complexity required to support backward compatibility, and I gain more understanding of why RISC became popular. The prefetcher is no exception. Much of the complexity is due to the 386's support for unaligned memory accesses, requiring a byte shift network to move bytes into 32-bit alignment. Moreover, at the other end of the instruction bus is the complicated instruction decoder that decodes intricate x86 instructions. Decoding RISC instructions is much easier.

In any case, I hope you've found this look at the prefetch circuitry interesting. I plan to write more about the 386, so follow me on Bluesky (@righto.com) or RSS for updates. I've written multiple articles on the 386 previously; a good place to start might be my survey of the 386 dies.

Footnotes and references

  1. The width of the circuitry for one bit changes a few times: while the prefetch queue and segment descriptor cache use a circuit that is 66 µm wide, the datapath circuitry is a bit tighter at 60 µm. The barrel shifter is even narrower at 54.5 µm per bit. Connecting circuits with different widths wastes space, since the wiring to connect the bits requires horizontal segments to adjust the spacing. But it also wastes space to use widths that are wider than needed. Thus, changes in the spacing are rare, where the tradeoffs make it worthwhile. 

  2. The Intel 8086 processor had a six-byte prefetch queue, while the Intel 8088 (used in the original IBM PC) had a prefetch queue of just four bytes. In comparison, the 16-byte queue of the 386 seems luxurious. (Some 386 processors, however, are said to only use 12 bytes due to a bug.)

    The prefetch queue assumes instructions are executed in linear order, so it doesn't help with branches or loops. If the processor encounters a branch, the prefetch queue is discarded. (In contrast, a modern cache will work even if execution jumps around.) Moreover, the prefetch queue doesn't handle self-modifying code. (It used to be common for code to change itself while executing to squeeze out extra performance.) By loading code into the prefetch queue and then modifying instructions, you could determine the size of the prefetch queue: if the old instruction was executed, it must be in the prefetch queue, but if the modified instruction was executed, it must be outside the prefetch queue. Starting with the Pentium Pro, x86 processors flush the prefetch queue if a write modifies a prefetched instruction. 

  3. The prefetch unit generates "linear" addresses that must be translated to physical addresses by the paging unit (ref). 

  4. I don't know which phase of the clock is phase 1 and which is phase 2, so I've assigned the numbers arbitrarily. The 386 creates four clock signals internally from a clock input CLK2 that runs at twice the processor's clock speed. The 386 generates a two-phase clock with non-overlapping phases. That is, there is a small gap between when the first phase is high and when the second phase is high. The 386's circuitry is controlled by the clock, with alternate blocks controlled by alternate phases. Since the clock phases don't overlap, this ensures that logic blocks are activated in sequence, allowing the orderly flow of data. But because the 386 uses CMOS, it also needs active-low clocks for the PMOS transistors. You might think that you could simply use the phase 1 clock as the active-low phase 2 clock and vice versa. The problem is that these clock phases overlap when used as active-low; there are times when both clock signals are low. Thus, the two clock phases must be explicitly inverted to produce the two active-low clock phases. I described the 386's clock generation circuitry in detail in this article.

  5. The Manchester carry chain is typically used in an adder, which makes it more complicated than shown here. In particular, a new carry can be generated when two 1 bits are added. Since we're looking at an incrementer, this case can be ignored.

    The Manchester carry chain was first described in Parallel addition in digital computers: a new fast ‘carry’ circuit. It was developed at the University of Manchester in 1959 and used in the Atlas supercomputer. 

  6. For some reason, the incrementer uses a completely different XOR circuit from the comparator, built from a multiplexer instead of logic. In the circuit below, the two CMOS switches form a multiplexer: if the first input is 1, the top switch turns on, while if the first input is a 0, the bottom switch turns on. Thus, if the first input is a 1, the second input passes through and then is inverted to form the output. But if the first input is a 0, the second input is inverted before the switch and then is inverted again to form the output. Thus, the second input is inverted if the first input is 1, which is a description of XOR.

    The implementation of an XOR gate in the incrementer.

    I don't see any clear reason why two different XOR circuits were used in different parts of the prefetcher. Perhaps the available space for the layout made a difference. Or maybe the different circuits have different timing or output current characteristics. Or it could just be the personal preference of the designers. 

  7. The latch circuit is based on a CMOS switch (or transmission gate) and a weak inverter. Normally, the inverter loop holds the bit. However, if the CMOS switch is enabled, its output overpowers the signal from the weak inverter, forcing the inverter loop into the desired state.

    The CMOS switch consists of an NMOS transistor and a PMOS transistor in parallel. By setting the top control input high and the bottom control input low, both transistors turn on, allowing the signal to pass through the switch. Conversely, by setting the top input low and the bottom input high, both transistors turn off, blocking the signal. CMOS switches are used extensively in the 386, to form multiplexers, create latches, and implement XOR. 

  8. Most of the 386's control circuitry is to the right of the datapath, rather than awkwardly wedged into the datapath. So why is this circuit different? My hypothesis is that since the circuit needs the values of bit 15 and bit 7, it made sense to put the circuitry next to bits 15 and 7; if this control circuitry were off to the right, long wires would need to run from bits 15 and 7 to the circuitry. 

  9. In case this post is getting tedious, I'll provide a lighter footnote on sign extension. The obvious mnemonic for a sign extension instruction is SEX, but that mnemonic was too risque for Intel. The Motorola 6809 processor (1978) used this mnemonic, as did the related 68HC12 microcontroller (1996). However, Steve Morse, architect of the 8086, stated that the sign extension instructions on the 8086 were initially named SEX but were renamed before release to the more conservative CBW and CWD (Convert Byte to Word and Convert Word to Double word).

    The DEC PDP-11 was a bit contradictory. It has a sign extend instruction with the mnemonic SXT; the Jargon File claims that DEC engineers almost got SEX as the assembler mnemonic, but marketing forced the change. On the other hand, SEX was the official abbreviation for Sign Extend (see PDP-11 Conventions Manual, PDP-11 Paper Tape Software Handbook) and SEX was used in the microcode for sign extend.

    RCA's CDP1802 processor (1976) may have been the first with a SEX instruction, using the mnemonic SEX for the unrelated Set X instruction. See also this Retrocomputing Stack Exchange page.

  10. It seems inconvenient to send instructions all the way across the chip from the Bus Interface Unit to the prefetch queue and then back across the chip to the instruction decoder, which is next to the Bus Interface Unit. But this was probably the best alternative for the layout, since you can't put everything close to everything. The 32-bit datapath circuitry is on the left, organized into 32 columns. It would be nice to put the Bus Interface Unit over there too, but there isn't room, so you end up with the wide 32-bit data bus going across the chip. Sending instruction bytes across the chip is less of an impact, since the instruction bus is just 8 bits wide. 

  11. See "Performance Optimizations of the 80386", Slager, Oct 1986, in Proceedings of ICCD, pages 165-168.