The transparent chip inside a vintage Hewlett-Packard floppy drive

While repairing an eight-inch HP floppy drive, we found that the problem was a broken interface chip. Since the chip was bad, I decapped it and took photos. This chip is very unusual: instead of a silicon substrate, the chip is formed on a base of sapphire, with silicon and metal wiring on top. As a result, the chip is transparent as you can see from the gold "X" visible through the die in the photo below.

The PHI die as seen through an inspection microscope. Click this image (or any other) for a larger version.

The chip is a custom HP chip from 1977 that provides an interface between HP's interface bus (HP-IB) and the Z80 processor in the floppy drive controller. HP designed this interface bus as a low-cost bus to connect test equipment, computers, and peripherals. The chip, named PHI (Processor-to-HP-IB Interface), was used in multiple HP products. It handles the bus protocol and buffers data between the interface bus and a device's microprocessor.1 In this article, I'll take a look inside this "silicon-on-sapphire" chip, examine its metal-gate CMOS circuitry, and explain how it works.

Silicon-on-sapphire

Most integrated circuits are formed on a silicon wafer. Silicon-on-sapphire, on the other hand, starts with a sapphire substrate. A thin layer of silicon is built up on the sapphire substrate to form the circuitry. The silicon is N-type, and is converted to P-type where needed by ion implantation. A metal wiring layer is created on top, forming the wiring as well as the metal-gate transistors. The diagram below shows a cross-section of the circuitry.

Cross-section from HP Journal, April 1977.

The important thing about silicon-on-sapphire is that silicon regions are separated from each other. Since the sapphire substrate is an insulator, transistors are completely isolated, unlike a regular integrated circuit. This reduces the capacitance between transistors, improving performance. The insulation also prevents stray currents, protecting against latch-up and radiation.

An HP MC2 die, illuminated from behind with fiber optics. From Hewlett-Packard Journal, April 1977.

Silicon-on-sapphire integrated circuits date back to research in 1963 at Autonetics, an innovative but now-forgotten avionics company that produced guidance computers for the Minuteman ICBMs, among other things. RCA developed silicon-on-sapphire integrated circuits in the 1960s and 1970s such as the CDP1821 silicon-on-sapphire 1K RAM. HP used silicon-on-sapphire for multiple chips starting in 1977, such as the MC2 Micro-CPU Chip. HP also used SOS for the three-chip CPU in the HP 3000 Amigo (1978), but the system was a commercial failure. The popularity of silicon-on-sapphire peaked in the early 1980s and HP moved to bulk silicon integrated circuits for calculators such as the HP-41C. Silicon-on-sapphire is still used in various products, such as LEDs and RF applications, but is now mostly a niche technology.

Inside the PHI chip

HP used an unusual package for the PHI chip. The chip is mounted on a ceramic substrate, protected by a ceramic cap. The package has 48 gold fingers that press into a socket; the chip is held in the socket by two metal spring clips.

Package of the PHI chip, showing the underside. The package is flipped over when mounted in a socket.

Decapping the chip was straightforward, but more dramatic than I expected. The chip's cap is attached with adhesive, which can be softened by heating. Hot air wasn't sufficient, so we used a hot plate. Eric tested the adhesive by poking it with an X-Acto knife, causing the cap to suddenly fly off with a loud "pop", sending the blade flying through the air. I was happy to be wearing safety glasses.

Decapping the chip with a hotplate and hot air.

After decapping the chip, I created the high-resolution die photo below. The metal layer is clearly visible as white lines, while the silicon is grayish and the sapphire appears purple. Around the edge of the die, bond wires connect the die to the chip's 48 external connections. Slightly above and to the left of center, a large, regular rectangular block of circuitry provides 160 bits of storage: two 8-word FIFO buffers that pass 10-bit words between the interface bus and a connected microprocessor. The thick metal traces around the edges provide +12 volts, +5 volts, and ground to the chip.

Die photo of the PHI chip, created by stitching together microscope photos. Click for a much larger image.

Logic gates

Circuitry on this chip has an unusual appearance due to the silicon-on-sapphire implementation as well as the use of metal-gate transistors, but fundamentally the circuits are standard CMOS. The photo below shows a block that implements an inverter and a NAND gate. The sapphire substrate appears as dark purple. On top of this, the thick gray lines are the silicon. The white metal on top connects the transistors. The metal can also form the gates of transistors when it crosses silicon (indicated by letters). Inconveniently, metal that contacts silicon, metal that crosses over silicon, and metal that forms a transistor all appear very similar in this chip. This makes it more difficult to determine the wiring.

This diagram shows an inverter and a NAND gate on the die.

The schematic below shows how the gates are implemented, matching the photo above. The metal lines at the top and bottom provide the power and ground rails respectively. The inverter is formed from NMOS transistor A and PMOS transistor B; the output goes to transistors D and F. The NAND gate is formed by NMOS transistors E and F in conjunction with PMOS transistors C and D. The components of the NAND gate are joined at the square of metal, and then the output leaves through silicon on the right. Note that signals can only cross when one signal is in the silicon layer and one is in the metal layer. With only two wiring layers, signals in the PHI chip must often meander to avoid crossings, wasting a lot of space. (This wiring is much more constrained than typical chips of the 1970s that also had a polysilicon layer, providing three wiring layers in total.)

This schematic shows how the inverter and a NAND gate are implemented.


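The logical behavior of these two gates is easy to state in code. Here's a minimal Python sketch of the pull-up/pull-down logic; the transistor letters match the schematic, but this models logic levels only, not the analog circuit:

```python
def inverter(a: int) -> int:
    # PMOS transistor B pulls the output high when the input is 0;
    # NMOS transistor A pulls it low when the input is 1.
    return 0 if a else 1

def nand(a: int, b: int) -> int:
    # NMOS transistors E and F in series pull the output low only when
    # both inputs are 1; PMOS transistors C and D in parallel pull it
    # high when either input is 0.
    return 0 if (a and b) else 1

# As wired on the die, the inverter's output feeds the NAND gate:
assert nand(inverter(0), 1) == 0   # inverter outputs 1; 1 NAND 1 = 0
assert nand(inverter(1), 1) == 1   # inverter outputs 0; 0 NAND 1 = 1
```

The key CMOS property, visible in both functions, is that the pull-up and pull-down networks are complementary: exactly one of them conducts for any input, so the output is always driven to a rail and no static current flows.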
The FIFOs

The PHI chip has two first-in-first-out buffers (FIFOs) that occupy a substantial part of the die. Each FIFO holds 8 words of 10 bits, with one FIFO for data being read from the bus and the other for data written to the bus. These buffers help match the bus speed to the microprocessor speed, ensuring that data transmission is as fast as possible.

Each bit of the FIFO is essentially a static RAM cell, as shown below. Inverters A and B form a loop to hold a bit. Pass transistor C provides feedback so the inverter loop remains stable. To write a word, 10 bits are fed through vertical bit-in lines. A horizontal word write signal is activated to select the word to update. This disables transistor C and turns on transistor D, allowing the new bit to flow into the inverter loop. To read a word, a horizontal word read line is activated, turning on pass transistor F. This allows the bit in the cell to flow onto the vertical bit-out line, buffered by inverter E. The two FIFOs have separate lines so they can be read and written independently.

One cell of the FIFO.


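Functionally, the cell array plus its control logic behaves like a ring buffer. Here's a sketch of that behavior in Python; the pointer-and-counter structure is my modeling choice, not a description of HP's control circuitry:

```python
class Fifo:
    """Functional model of one PHI FIFO: 8 words of 10 bits."""
    DEPTH, WIDTH = 8, 10

    def __init__(self):
        self.cells = [0] * self.DEPTH   # each entry models one row of 10 cells
        self.read_ptr = self.write_ptr = self.count = 0

    def write(self, word: int) -> bool:
        if self.count == self.DEPTH:
            return False                # FIFO full; the writer must wait
        self.cells[self.write_ptr] = word & ((1 << self.WIDTH) - 1)
        self.write_ptr = (self.write_ptr + 1) % self.DEPTH
        self.count += 1
        return True

    def read(self):
        if self.count == 0:
            return None                 # FIFO empty
        word = self.cells[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.DEPTH
        self.count -= 1
        return word
```

Data is written at one end and read from the other, so the two sides can run at different speeds; the chip's real control logic does the equivalent with read and write selection signals generated by the circuitry flanking the cell array.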
The diagram below shows nine FIFO cells as they appear on the die. The red box indicates one cell, with components labeled to match the schematic. Cells are mirrored vertically and horizontally to increase the layout density.

Nine FIFO cells as they appear on the die.

Control logic (not shown) to the left and right of the FIFOs manages the FIFOs. This logic generates the appropriate read and write signals so data is written to one end of the FIFO and read from the other end.

The address decoder

Another interesting circuit is the decoder that selects a particular register based on the address lines. The PHI chip has eight registers, selected by three address lines. The decoder takes the address lines and generates 16 control lines (more or less), one to read from each register, and one to write to each register.

A die photo of the address decoder.

The decoder has a regular matrix structure for efficient implementation. Row lines are in pairs, with a line for each address bit input and its complement. Each column corresponds to one output, with the transistors arranged so the column will be activated when given the appropriate inputs. At the top and bottom are inverters. These latch the incoming address bits, generate the complements, and buffer the outputs.

Schematic of the decoder.

The schematic above shows how the decoder operates. (I've simplified it to two inputs and two outputs.) At the top, the address line goes through a latch formed from two inverters and a pass transistor. The address line and its complement form two row lines; the other row lines are similar. Each column has a transistor on one row line and a diode on the other, selecting the address for that column. For instance, suppose a0 is 1 and a1 is 0. This matches the first column since the transistor lines are low and the diode lines are high. The PMOS transistors in the column will all turn on, pulling the input to the inverter high. However, if any of the inputs are "wrong", the corresponding transistor will turn off, blocking the +12 volts. Moreover, the output will be pulled low through the corresponding diode. Thus, each column will be pulled high only if all the inputs match; otherwise, it will be pulled low. Each column output controls one of the chip's registers, allowing that register to be accessed.
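Stripped of its transistor-level details, the decoder computes a simple function: three address bits plus a read/write strobe select one of 16 column outputs. A Python sketch of that function (the column ordering here is my assumption, not taken from the die):

```python
def decode(a2: int, a1: int, a0: int, write: bool) -> list[int]:
    """Return the 16 column outputs; exactly one is active."""
    register = (a2 << 2) | (a1 << 1) | a0        # which of the 8 registers
    column = register * 2 + (1 if write else 0)  # one read, one write line each
    outputs = [0] * 16
    outputs[column] = 1
    return outputs

# Selecting register 5 for writing activates exactly one column line:
lines = decode(1, 0, 1, write=True)
assert lines.count(1) == 1 and lines[11] == 1
```

The matrix layout on the die implements this one-hot selection directly: each column's transistors and diodes form the AND of the required address bits, so a separate gate per output isn't needed.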

The HP-IB bus and the PHI chip

The Hewlett-Packard Interface Bus (HP-IB) was designed in the early 1970s as a low-cost bus for connecting diverse devices including instruments (such as a digital voltmeter or frequency counter), storage, and computers. This bus became an IEEE standard in 1975, known as the IEEE-488 bus.2 The bus transfers 8 bits in parallel, with handshaking between devices so slow devices can control the speed.
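The handshake is worth sketching, since it is what lets slow devices control the speed. The model below uses the standard's three signal names (DAV, NRFD, NDAC) but runs the steps sequentially; the real bus uses open-collector, wired-OR lines so all listeners participate concurrently:

```python
class Listener:
    def __init__(self, name: str):
        self.name = name
        self.ready = True        # models releasing NRFD (not ready for data)
        self.accepted = False    # models releasing NDAC (not data accepted)
        self.received = []

    def latch(self, byte: int):
        self.received.append(byte & 0xFF)
        self.accepted = True

def send_byte(byte: int, listeners: list):
    """One handshake cycle, from the talker's point of view."""
    # 1. The talker waits until every listener has released NRFD.
    assert all(l.ready for l in listeners), "a listener is not ready"
    # 2. The talker drives the byte onto the 8 data lines and asserts DAV.
    for l in listeners:
        l.latch(byte)
    # 3. The talker waits until every listener has released NDAC, then
    #    releases DAV. The slowest listener sets the pace of the bus.
    assert all(l.accepted for l in listeners), "a listener has not accepted"
    for l in listeners:
        l.accepted = False       # listeners reassert NDAC for the next byte
```

Because the byte isn't considered delivered until the last listener accepts it, a slow plotter and a fast disk drive can share the bus without any per-device speed configuration.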

In 1977, HP developed a chip known as PHI (Processor-to-HP-IB Interface) to implement the bus protocol and provide a microprocessor interface. This chip not only simplified construction of a bus controller but also ensured that devices implemented the protocol consistently. The block diagram below shows the components of the PHI chip. It's not an especially complex chip, but it isn't trivial either; I estimate that it has several thousand transistors.

Block diagram from HP Journal, July 1989.

The die photo below shows some of the functional blocks of the PHI chip. The microprocessor connects to the top pins, while the interface bus connects to the lower pins.

The PHI die with some functional blocks labeled.

Conclusions

Top of the PHI chip, with the 1AA6-6004 part number. I'm not sure what the oval stamp at the top is, maybe a turtle?

The PHI chip is interesting as an example of a "technology of the future" that didn't quite pan out. HP put a lot of effort into silicon-on-sapphire chips, expecting that this would become an important technology: dense, fast, and low power. However, regular silicon chips turned out to be the winning technology and silicon-on-sapphire was relegated to niche markets.

Comparing HP's silicon-on-sapphire chips to regular silicon chips at the time shows some advantages and disadvantages. HP's MC2 16-bit processor (1977) used silicon-on-sapphire technology and had 10,000 transistors and ran at 8 megahertz, using 350 mW. In comparison, the Intel 8086 (1978) was also a 16-bit processor, but implemented on regular silicon and using NMOS instead of CMOS. The 8086 had 29,000 transistors, ran at 5 megahertz (at first) and used up to 2.5 watts. The sizes of the chips were almost identical: 34 mm2 for the HP processor and 33 mm2 for the Intel processor. This illustrates that CMOS uses much less power than NMOS, one of the reasons that CMOS is now the dominant technology. For the other factors, silicon-on-sapphire had a bit of a speed advantage but wasn't as dense. Silicon-on-sapphire's main problem was its low yield and high cost. Crystal incompatibilities between silicon and sapphire made manufacturing difficult; HP achieved a yield of 9%, meaning 91% of the dies failed.

The time period of the PHI chip is also interesting since interface buses were transitioning from straightforward buses to high-performance buses with complex protocols. Early buses could be implemented with simple integrated circuits, but as protocols became more complex, custom interface chips became necessary. (The MOS 6522 Versatile Interface Adapter chip (1977) is another example, used in many home computers of the 1980s.) But these interfaces were still simple enough that the interface chips didn't require microcontrollers, using simple state machines instead.

The HP logo on the die of the PHI chip.

For more, follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @[email protected]. Thanks to CuriousMarc for providing the chip and to TubeTimeUS for help with decapping.

Notes and references

  1. More information: The article What is Silicon-on-Sapphire discusses the history and construction. Details on the HP-IB bus are here. The HP 12009A HP-IB Interface Reference Manual has information on the PHI chip and the protocol. See also the PHI article from HP Journal, July 1989. EvilMonkeyDesignz also shows a decapped PHI chip. 

  2. People with Commodore PET computers may recognize the IEEE-488 bus since peripherals such as floppy disk drives were connected using the IEEE-488 bus. The cables were generally expensive and harder to obtain than the interface cables used by other computers. The devices were also slow compared to those on other computers, although I think this was due to the hardware, not the bus. 

Reverse engineering the Intel 386 processor's register cell

The groundbreaking Intel 386 processor (1985) was the first 32-bit processor in the x86 line. It has numerous internal registers: general-purpose registers, index registers, segment selectors, and more specialized registers. In this blog post, I look at the silicon die of the 386 and explain how some of these registers are implemented at the transistor level. The registers that I examined are implemented as static RAM, with each bit stored in a common 8-transistor circuit, known as "8T". Studying this circuit shows the interesting layout techniques that Intel used to squeeze two storage cells together to minimize the space they require.

The diagram below shows the internal structure of the 386. I have marked the relevant registers with three red boxes. Two sets of registers are in the segment descriptor cache, presumably holding cache entries, and one set is at the bottom of the data path. Some of the registers at the bottom are 32 bits wide, while others are half as wide and hold 16 bits. (There are more registers with different circuits, but I won't discuss them in this post.)

The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici.

The 6T and 8T static RAM cells

First, I'll explain how a 6T or 8T static cell holds a bit. The basic idea behind a static RAM cell is to connect two inverters into a loop. This circuit will be stable, with one inverter on and one inverter off, and each inverter supporting the other. Depending on which inverter is on, the circuit stores a 0 or a 1.

Two inverters in a loop can store a 0 or a 1.

To write a new value into the circuit, two signals are fed in, forcing the inverters to the desired new values. One inverter receives the new bit value, while the other inverter receives the complemented bit value. This may seem like a brute-force way to update the bit, but it works. The trick is that the inverters in the cell are small and weak, while the input signals are higher current, able to overpower the inverters.1 The write data lines (called bitlines) are connected to the inverters by pass transistors.2 When the pass transistors are on, the signals on the write lines can pass through to the inverters. But when the pass transistors are off, the inverters are isolated from the write lines. Thus, the write control signal enables writing a new value to the inverters. (This signal is called a wordline since it controls access to a word of storage.) Since each inverter consists of two transistors7, the circuit below consists of six transistors, forming the 6T storage cell.

Adding pass transistors so the cell can be written.


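A logic-level model of the 6T cell makes the wordline's role concrete. In this sketch, the analog contest between the strong bitline drivers and the weak inverters is reduced to a simple assignment:

```python
class Cell6T:
    """Logic-level model of a 6T static cell (names are mine)."""

    def __init__(self):
        self.q = 0   # output of one inverter; the other inverter holds not-q

    def write(self, bit: int, wordline: bool):
        # When the wordline turns on the pass transistors, the strong
        # bitline drivers overpower the weak inverters, forcing the loop
        # to the new state. With the wordline off, the loop just holds.
        if wordline:
            self.q = bit

    def read(self, wordline: bool):
        # The 6T cell reuses the same bitlines for reading.
        return self.q if wordline else None

cell = Cell6T()
cell.write(1, wordline=True)
cell.write(0, wordline=False)   # wordline off: this write has no effect
assert cell.read(wordline=True) == 1
```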
The 6T cell uses the same bitlines for reading and writing. Adding two transistors creates the 8T circuit, which has the advantage that you can read one register and write to another register at the same time. (I.e. the register file is two-ported.) In the 8T cell below, two additional transistors (G and H) are used for reading. Transistor G buffers the cell's value; it turns on if the inverter output is high, pulling the read output bitline low.3 Transistor H is a pass transistor that blocks this signal until a read is performed on this register; it is controlled by a read wordline.

Schematic of a storage cell. Each transistor is labeled with a letter.

To form registers (or memory), a grid is constructed from these cells. Each row corresponds to a register, while each column corresponds to a bit position. The horizontal lines are the wordlines, selecting which word to access, while the vertical lines are the bitlines, passing bits in or out of the registers. For a write, the vertical bitlines provide the 32 bits (along with their complements). For a read, the vertical bitlines receive the 32 bits from the register. A wordline is activated to read or write the selected register.

Static memory cells (8T) organized into a grid.


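Functionally, the grid of 8T cells forms a two-ported register file: in one cycle, one wordline can be activated for reading while a different wordline is activated for writing. A Python sketch of that behavior (the sizes and names are illustrative, not taken from the die):

```python
class RegisterFile:
    """Functional model of a grid of 8T cells: rows are registers
    (wordlines), columns are bit positions (bitlines)."""

    def __init__(self, words: int = 8, bits: int = 32):
        self.bits_per_word = bits
        self.rows = [0] * words     # each row of cells holds one register

    def cycle(self, read_row: int, write_row: int, write_value: int) -> int:
        """One cycle: the read and write wordlines are independent,
        so both can be active at once (the two-ported property)."""
        value = self.rows[read_row]                 # read bitlines
        mask = (1 << self.bits_per_word) - 1
        self.rows[write_row] = write_value & mask   # write bitlines
        return value

rf = RegisterFile()
rf.cycle(read_row=0, write_row=3, write_value=0xDEADBEEF)
assert rf.cycle(read_row=3, write_row=0, write_value=0) == 0xDEADBEEF
```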
Silicon circuits in the 386

Before showing the layout of the circuit on the die, I should give a bit of background on the technology used to construct the 386. The 386 was built with CMOS technology, with NMOS and PMOS transistors working together, an advance over the earlier x86 chips that were built with NMOS transistors. Intel called this CMOS technology CHMOS-III (complementary high-performance metal-oxide-silicon), with 1.5 µm features. While Intel's earlier chips had a single metal layer, CHMOS-III provided two metal layers, making signal routing much easier.

Because CMOS uses both NMOS and PMOS transistors, fabrication is more complicated. In an MOS integrated circuit, a transistor is formed where a polysilicon wire crosses active silicon, creating the transistor's gate. A PMOS transistor is constructed directly on the silicon substrate (which is N-doped). However, an NMOS transistor is the opposite, requiring a P-doped substrate. This is created by forming a P well, a region of P-doped silicon that holds NMOS transistors. Each P well must be connected to ground; this is accomplished by connecting ground to specially-doped regions of the P well, called "well taps".

The diagram below shows a cross-section through two transistors, showing the layers of the chip. There are four important layers: silicon (which has some regions doped to form active silicon), polysilicon for wiring and transistors, and the two metal layers. At the bottom is the silicon, with P or N doping; note the P-well for the NMOS transistor on the left. Next is the polysilicon layer. At the top are the two layers of metal, named M1 and M2. Conceptually, the chip is constructed from flat layers, but the layers have a three-dimensional structure influenced by the layers below. The layers are separated by silicon dioxide ("ox") or silicon oxynitride4; the oxynitride under M2 caused me considerable difficulty.

A cross-section of circuitry formed with the CHMOS-III process. From A double layer metal CHMOS III technology.

The image below shows how circuitry appears on the die;5 I removed the metal layers to show the silicon and polysilicon that form transistors. (As will be described below, this image shows two static cells, holding two bits.) The pinkish and dark regions are active silicon, doped to take part in the circuits, while the "background" silicon can be ignored. The green lines are polysilicon lines on top of the silicon. Transistors are the most important feature here: a transistor gate is formed when polysilicon crosses active silicon, with the source and drain on either side. The upper part of the image has PMOS transistors, while the lower part of the image has the P well that holds NMOS transistors. (The well itself is not visible.) In total, the image shows four PMOS transistors and 12 NMOS transistors. At the bottom, the well taps connect the P well to ground. Although the metal has been removed, the contacts between the lower metal layer (M1) and the silicon or polysilicon are visible as faint circles.

A (heavily edited) closeup of the die.

Register layout in the 386

Next, I'll explain the layout of these cells in the 386. To increase the circuit density, two cells are put side-by-side, with a mirrored layout. In this way, each row holds two interleaved registers.6 The schematic below shows the arrangement of the paired cells, matching the die image above. Transistors A and B form the first inverter,7 while transistors C and D form the second inverter. Pass transistors E and F allow the bitlines to write the cell. For reading, transistor G amplifies the signal while pass transistor H connects the selected bit to the output.

Schematic of two static cells in the 386. The schematic approximately matches the physical layout.

The left and right sides are approximately mirror images, with separate read and write control lines for each half. Because the control lines for the left and right sides are in different positions, the two sides have some layout differences, in particular, the bulging loop on the right. Mirroring the cells increases the density since the bitlines can be shared by the cells.

The diagram below shows the various components on the die, labeled to match the schematic above. I've drawn the lower M1 metal wiring in blue, but omitted the M2 wiring (horizontal control lines, power, and ground). "Read crossover" indicates the connection from the read output on the left to the bitline on the right. Black circles indicate vias between M1 and M2, green circles indicate contacts between silicon and M1, and reddish circles indicate contacts between polysilicon and M1.

The layout of two static cells. The M1 metal layer is drawn in blue; the horizontal M2 lines are not shown.

One more complication is that alternating registers (i.e. rows) are reflected vertically, as shown below. This allows one horizontal power line to feed two rows, and similarly for a horizontal ground line. This cuts the number of power/ground lines in half, making the layout more efficient.

Multiple storage cells.

Having two layers of metal makes the circuitry considerably more difficult to reverse engineer. The photo below (left) shows one of the static RAM cells as it appears under the microscope. Although the structure of the metal layers is visible in the photograph, there is a lot of ambiguity. It is difficult to distinguish the two layers of metal. Moreover, the metal completely hides the polysilicon layer, not to mention the underlying silicon. The large black circles are vias between the two metal layers. The smaller faint circles are contacts between a metal layer and the underlying silicon or polysilicon.

One cell as it appears on the die, with a diagram of the upper (M2) and lower (M1) metal layers.

With some effort, I determined the metal layers, which I show on the right: M2 (upper) and M1 (lower). By comparing the left and right images, you can see how the structure of the metal layers is somewhat visible. I use black circles to indicate vias between the layers, green circles indicate contacts between M1 and silicon, and pink circles indicate contacts between M1 and polysilicon. Note that both metal layers are packed as tightly as possible. The layout of this circuit was highly optimized to minimize the area. It is interesting to note that decreasing the size of the transistors wouldn't help with this circuit, since the size is limited by the metal density. This illustrates that a fabrication process must balance the size of the metal features, polysilicon features, and silicon features since over-optimizing one won't help the overall chip density.

The photo below shows the bottom of the register file. The "notch" makes the registers at the very bottom half-width: four rows corresponding to eight 16-bit registers. Since there are six 16-bit segment registers in the 386, I suspect these are the segment registers and two mystery registers.

The bottom of the register file.

I haven't been able to determine which registers in the 386 correspond to the other registers on the die. In the segment descriptor circuitry, there are two rows of register cells with ten more rows below, corresponding to 24 32-bit registers. These are presumably segment descriptors. At the bottom of the datapath, there are 10 32-bit registers with the 8T circuit. The 386's programmer-visible registers consist of eight general-purpose 32-bit registers (EAX, etc.). The 386 has various control registers, test registers, and segmentation registers8 that are not well known. The 8086 has a few registers for internal use that aren't visible to the programmer, so the 386 presumably has even more invisible registers. At this point, I can't narrow down the functionality.

Conclusions

It's interesting to examine how registers are implemented in a real processor. There are plenty of descriptions of the 8T static cell circuit, but it turns out that the physical implementation is more complicated than the theoretical description. Intel put a lot of effort into optimizing this circuit, resulting in a dense block of circuitry. By mirroring cells horizontally and vertically, the density could be increased further.

Reverse engineering one small circuit of the 386 turned out to be pretty tricky, so I don't plan to do a complete reverse engineering. The main difficulty is that the two layers of metal are hard to untangle. Moreover, I lost most of the polysilicon when removing the metal. Finally, it is hard to draw diagrams with four layers without the diagram turning into a mess, but hopefully the diagrams made sense.

I plan to write more about the 386, so follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @[email protected].

Notes and references

  1. Typically the write driver circuit generates a strong low on one of the bitlines, flipping the corresponding inverter to a high output. As soon as one inverter flips, it will force the other inverter into the right state. To support this, the pullup transistors in the inverters are weaker than normal. 

  2. The pass transistor passes its signal through or blocks it. In CMOS, this is usually implemented with a transmission gate with an NMOS and a PMOS transistor in parallel. The cell uses only the NMOS transistor, which makes it worse at passing a high signal, but substantially reduces the size, a reasonable tradeoff for a storage cell. 

  3. The bitline is typically precharged to a high level for a read, and then the cell pulls the line low for a 0. This is more compact than including circuitry in each cell to pull the line high. 

  4. One problem is that the 386 uses a layer of insulating silicon oxynitride as well as the usual silicon dioxide. I was able to remove the oxynitride with boiling phosphoric acid, but this removed most of the polysilicon as well. I'm still experimenting with the timing; 20 minutes of boiling was too long. 

  5. The image is an edited composite of multiple cells since the polysilicon was highly damaged when removing the metal layers. Unfortunately, I haven't found a process for the 386 to remove one layer of metal at a time. As a result, reverse-engineering the 386 is much more difficult than earlier processors such as the 8086; I have to look for faint traces of polysilicon and puzzle over what connections the circuit requires. 

  6. You might wonder why they put two cells side-by-side instead of simply cramming the cells together more tightly. The reason for putting two cells in each row is presumably to match the size of each bit with the rest of the circuitry in the datapath. If the register circuitry is half the width of the ALU circuitry, a bunch of space will be wasted by the wiring to line up each register bit with the corresponding ALU bit. 

  7. A CMOS inverter is constructed from an NMOS transistor (which pulls the output low on a 1 input) and a PMOS transistor (which pulls the output high on a 0 input), as shown below.

    A CMOS inverter.


  8. The 386 has multiple registers that are documented but not well known. Chapter 4 of the 386 Programmer's Reference Manual discusses various registers that are only relevant to operating systems programmers. These include the Global Descriptor Table Register (GDTR), Local Descriptor Table Register (LDTR), Interrupt Descriptor Table Register (IDTR), and Task Register (TR). There are four Control Registers CR0-CR3; CR0 controls coprocessor usage, paging, and a few other things. The six Debug Registers for hardware breakpoints are named DR0-DR3, DR6, and DR7 (which suggests undocumented DR4 and DR5 registers). The two Test Registers for TLB testing are named TR6 and TR7 (which suggests undocumented TR0-TR5 registers). I expect that these registers are located near the relevant functional units, rather than part of the processing datapath. 

Reverse-engineering Ethernet backoff on the Intel 82586 network chip's die

Introduced in 1973, Ethernet is the predominant way of wiring computers together. Chips were soon introduced to handle the low-level aspects of Ethernet: converting data packets into bits, implementing checksums, and handling network collisions. In 1982, Intel announced the i82586 Ethernet LAN coprocessor chip, which went much further by offloading most of the data movement from the main processor to an on-chip coprocessor. Modern Ethernet networks handle a gigabit of data per second or more, but at the time, the Intel chip's support for 10 Mb/s Ethernet put it on the cutting edge. (Ethernet was surprisingly expensive, about $2000 at the time, but expected to drop under $1000 with the Intel chip.) In this blog post, I focus on a specific part of the coprocessor chip: how it handles network collisions and implements exponential backoff.
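For reference, the algorithm in question is the truncated binary exponential backoff defined by Ethernet (later IEEE 802.3): after the nth collision on a frame, the transmitter waits a random number of slot times in the range [0, 2^k − 1], where k = min(n, 10), and abandons the frame after 16 attempts. Here is a sketch of the standard algorithm in Python; it describes the specification, not the chip's internal implementation, which is examined below:

```python
import random

MAX_ATTEMPTS = 16   # the standard abandons the frame after 16 tries

def backoff_slots(collision_count: int) -> int:
    """Slot times to wait after the nth collision on one frame.
    (On 10 Mb/s Ethernet, one slot time is 512 bit times.)"""
    if collision_count >= MAX_ATTEMPTS:
        raise RuntimeError("excessive collisions: frame abandoned")
    k = min(collision_count, 10)        # the exponent is truncated at 10
    return random.randrange(2 ** k)     # uniform over [0, 2^k - 1]
```

Doubling the range after each collision is what makes the scheme adaptive: a lightly loaded network retries almost immediately, while a congested one spreads retransmissions over a wider window.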

The die photo below shows the i82586 chip. This photo shows the metal layer on top of the chip, which hides the underlying polysilicon wiring and silicon transistors. Around the edge of the chip, square bond pads provide the link to the chip's 48 external pins. I have labeled the function blocks based on my reverse engineering and published descriptions. The left side of the chip is called the "receive unit" and handles the low-level networking, with circuitry for the network transmitter and receiver. The left side also contains low-level control and status registers. The right side is called the "command unit" and interfaces to memory and the main processor. The right side contains a simple processor controlled by a microinstruction ROM.1 Data is transmitted between the two halves of the chip by 16-byte FIFOs (first in, first out queues).

The die of the Intel 82586 with the main functional blocks labeled. Click this image (or any other) for a larger version.


The 82586 chip is more complex than the typical Ethernet chip at the time. It was designed to improve system performance by moving most of the Ethernet processing from the main processor to the coprocessor, allowing the main processor and the coprocessor to operate in parallel. The coprocessor provides four DMA channels to move data between memory and the network without the main processor's involvement. The main processor and the coprocessor communicate through complex data structures2 in shared memory: the main processor puts control blocks in memory to tell the I/O coprocessor what to do, specifying the locations of transmit and receive buffers in memory. In response, the I/O coprocessor puts status blocks in memory. The processor onboard the 82586 chip allows the chip to handle these complex data structures in software. Meanwhile, the transmission/receive circuitry on the left side of the chip uses dedicated circuitry to handle the low-level, high-speed aspects of Ethernet.

Ethernet and collisions

A key problem with a shared network is how to prevent multiple computers from trying to send data on the network at the same time. Instead of a centralized control mechanism, Ethernet allows computers to transmit whenever they want.3 If two computers transmit at the same time, the "collision" is detected and the computers try again, hoping to avoid a collision the next time. Although this may sound inefficient, it turns out to work remarkably well.4 To avoid a second collision, each computer waits a random amount of time before retransmitting the packet. If a collision happens again (which is likely on a busy network), an exponential backoff algorithm is used, with each computer waiting longer and longer after each collision. This automatically balances the retransmission delay to minimize collisions and maximize throughput.

I traced out a bunch of circuitry to determine how the exponential backoff logic is implemented. To summarize, exponential backoff is implemented with a 10-bit counter to provide a pseudorandom number, a 10-bit mask register to get an exponentially sized delay, and a delay counter to count down the delay. I'll discuss how these are implemented, starting with the 10-bit counter.
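To make the datapath concrete, here's a minimal Python sketch of what these three pieces compute together. The function name is mine, and passing the collision count directly is a simplification: in the chip, the mask is lengthened one bit per collision and the counter free-runs to supply pseudorandom bits.

```python
def backoff_slots(collision_count, counter_value):
    """Truncated binary exponential backoff, sketching the 82586's
    datapath: a free-running 10-bit counter supplies pseudorandom
    bits, and a mask with one low bit set per collision truncates
    them to the desired range (capped at 10 bits)."""
    mask = (1 << min(collision_count, 10)) - 1  # e.g. 4 collisions -> 0b1111
    return counter_value & mask                 # wait 0..2^n - 1 slot times
```

After the first collision a node waits 0 or 1 slot times; after the fourth, anywhere from 0 to 15; after ten or more, 0 to 1023.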

The 10-bit counter

A 10-bit counter may seem trivial, but it still takes up a substantial area of the chip. The straightforward way of implementing a counter is to hook up 10 latches as a "ripple counter". The counter is controlled by a clock signal that indicates that the counter should increment. The clock toggles the lowest bit of the counter. If this bit flips from 1 to 0, the next higher bit is toggled. The process is repeated from bit to bit, toggling a bit if there is a carry. The problem with this approach is that the carry "ripples" through the counter. Each bit is delayed by the lower bit, so the bits don't all flip at the same time. This limits the speed of the counter as the top bit isn't settled until the carry has propagated through the nine lower bits.

The counter in the chip uses a different approach with additional circuitry to improve performance. Each bit has logic to check if all the lower bits are ones. If so, the clock signal toggles the bit. All the bits toggle at the same time, rapidly incrementing the counter in response to the clock signals. The drawback of this approach is that it requires much more logic.
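In software terms, the toggle rule looks like this (a sketch with my own function name; bit 0 is the least significant):

```python
def increment(bits):
    """One synchronous increment: each bit toggles on the clock iff
    every lower bit is 1, so all bits flip in the same clock cycle
    instead of waiting for a carry to ripple through."""
    toggle = 1
    out = []
    for b in bits:            # bits[0] is the LSB
        out.append(b ^ toggle)
        toggle &= b           # stays 1 only while all lower bits are 1
    return out
```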

The diagram below shows how the carry logic is implemented. The circuitry is optimized to balance speed and complexity. In particular, bits are examined in groups of three, allowing some of the logic to be shared across multiple bits. For instance, instead of using a 9-input gate to examine the nine lower bits, separate gates test bits 0-2 and 3-5.

The circuitry to generate the toggle signals for each bit of the counter.


The implementation of the latches is also interesting. Each latch is implemented with dynamic logic, using the circuit's capacitance to store each bit. The input is connected to the output with two inverters. When the clock is high, the transistor turns on, connecting the inverters in a loop that holds the value. When the clock is low, the transistor turns off. However, the 0 or 1 value will still remain on the input to the first inverter, held by the charge on the transistor's gate. At this time, an input can be fed into the latch, overriding the old value.

The basic dynamic latch circuit.


The latch has some additional circuitry to make it useful. To toggle the latch, the output is inverted before feeding it back to the input. The toggle control signal selects the inverted output through another pass transistor. The toggle signal is only activated when the clock is low, ensuring that the circuit doesn't repeatedly toggle, oscillating out of control.

One bit of the counter.


The image below shows how the counter circuit is implemented on the die. I have removed the metal layer to show the underlying transistors; the circles are contacts where the metal was connected to the underlying silicon. The pinkish regions are doped silicon. The pink-gray lines are polysilicon wiring. When polysilicon crosses doped silicon, it creates a transistor. The blue color swirls are not significant; they are bits of oxide remaining on the die.

The counter circuitry on the die.


The 10-bit mask register

The mask register has a particular number of low bits set, providing a mask of length 0 to 10. For instance, with 4 bits set, the mask register is 0000001111. The mask register can be updated in two ways. First, it can be set to length 1-8 with a three-bit length input.5 Second, the mask can be lengthened by one bit, for example going from 0000001111 to 0000011111 (length 4 to 5).

The mask register is implemented with dynamic latches similar to the counter, but the inputs to the latches are different. To load the mask to a particular length, each bit has logic to determine if the bit should be set based on the three-bit input. For example, bit 3 is cleared if the specified length is 0 to 3, and set otherwise. The lengthening feature is implemented by shifting the mask value to the left by one bit and inserting a 1 into the lowest bit.
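The two update operations can be sketched in Python as follows (the constant and function names are mine):

```python
MASK_WIDTH = 10

def load_mask(length):
    """Load a mask with the low `length` bits set; the chip's
    three-bit input selects lengths 1 through 8."""
    return (1 << length) - 1

def lengthen_mask(mask):
    """Lengthen the mask by one bit: shift left and insert a 1 at
    the bottom, e.g. 0b0000001111 -> 0b0000011111.  Once all ten
    bits are set, lengthening has no further effect."""
    return ((mask << 1) | 1) & ((1 << MASK_WIDTH) - 1)
```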

The schematic below shows one bit of the mask register. At the center is a two-inverter latch as seen before. When the clock is high, it holds its value. When the clock is low, the latch can be loaded with a new value. The "shift" line causes the bit from the previous stage to be shifted in. The "load" line loads the mask bit generated from the input length. The "reset" line clears the mask. At the right is the NAND gate that applies the mask to the count and inverts the result. As will be seen below, these NAND gates are unusually large.

One stage of the mask register.


The logic to set a mask bit based on the length input is shown below.6 The three-bit "sel" input selects the mask length from 1 to 8 bits; note that the mask0 bit is always set while bits 8 and 9 are cleared.7 Each set of gates energizes the corresponding mask line for the appropriate inputs.

The control logic to enable mask bits based on length.


The diagram below shows the mask register on the die. I removed the metal layer to show the underlying silicon and polysilicon, so the transistors are visible. On the left are the NAND gates that combine each bit of the counter with the mask. Note the large snake-like transistors; these larger transistors provide enough current to drive the signal over the long bus to the delay counter register at the bottom of the chip. Bit 0 of the mask is always set, so it doesn't have a latch. Bits 8 and 9 of the mask are only set by shifting, not by selecting a mask length, so they don't have mask logic.8

The mask register on the die.


The delay counter register

To generate the pseudorandom exponential backoff, the counter register and the mask register are NANDed together. This generates a number of the desired binary length, which is stored in the delay counter. Note that the NAND operation inverts the result, making it negative. Thus, as the delay counter counts up, it counts toward zero, reaching zero after the desired number of clock ticks.

The implementation of the delay counter is similar to the first counter, so I won't include a schematic. However, the delay counter is attached to the register bus, allowing its value to be read by the chip's CPU. Control lines allow the delay counter's value to pass onto the register bus.

The diagram below shows the locations of the counter, mask, and delay register on the die. In this era, something as simple as a 10-bit register occupied a significant part of the die. Also note the distance between the counter and mask and the delay register at the bottom of the chip. The NAND gates for the counter and mask required large transistors to drive the signal across this large distance.

The die, with counter, mask, and delay register.


Conclusions

The Intel Ethernet chip provides an interesting example of how a real-world circuit is implemented on a chip. Exponential backoff is a key part of the Ethernet standard. This chip implements backoff with a simple but optimized circuit.9

A high-resolution image of the die with the metal removed. (Click for a larger version.) Some of the oxide layer remains, causing colored regions due to thin-film interference.


For more chip reverse engineering, follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @[email protected]. Acknowledgments: Thanks to Robert Garner for providing the chip and questions.

Notes and references

  1. I think the on-chip processor is a very simple processor that doesn't match other Intel architectures. It is described as executing microcode. I don't think this is microcode in the usual sense of machine instructions being broken down into microcode. Instead, I think the processor's instructions are primitive, single-clock instructions that are more like micro-instructions than machine instructions. 

  2. The diagram below shows the data structures in shared memory for communication between the main processor and the coprocessor. The Command List specifies the commands that the coprocessor should perform. The Receive Frame area provides memory blocks for incoming network packets.

    A diagram of the 82586 shared memory structures, from the 82586 datasheet.


    I think Intel was inspired by mainframe-style I/O channels, which moved I/O processing to separate processors communicating through memory. Another sign of Intel's attempts to move mainframe technology to microprocessors was the ill-fated iAPX 432 processor, which Intel called a "micro-mainframe." (I discuss the iAPX 432 as part of this blog post.)

     

  3. An alternative approach to networking is token-ring, where the computers in the network pass a token from machine to machine. Only the machine with the token can send a packet on the network, ensuring collision-free transmission. I looked inside an IBM token-ring chip in this post. 

  4. Ethernet's technique is called CSMA/CD (Carrier Sense Multiple Access with Collision Detection). The idea of Carrier Sense is that the "carrier" signal on the network indicates that the network is in use. Each computer on the network listens for the lack of carrier before transmitting, which avoids most collisions. However, there is still a small chance of collision. (In particular, the speed of light means that there is a delay on a long network between when one computer starts transmitting and when a second computer can detect this transmission. Thus, both computers can think the network is free while the other computer is transmitting. This factor also imposes a maximum length on an Ethernet network segment: if the network is too long, a computer can finish transmitting a packet before the collision occurs, and it won't detect the collision.) Modern Ethernet has moved from the shared network to a star topology that avoids collisions. 

  5. The length of the mask is one more than the three-bit length input; e.g., an input of 7 sets eight mask bits. 

  6. The mask generation logic is a bit tricky to understand. You can try various bit combinations to see how it works. The logic is easier to understand if you apply De Morgan's law to change the NOR gates to AND gates, which also removes the negation on the inputs. 

  7. The control line appears to enable or disable mask selection but its behavior is inexplicably negated on bit 1. 

  8. The circuitry below the counter appears to be a state machine that is unrelated to the exponential backoff. From reverse engineering, my hypothesis is that the counter is reused by the state machine: it both generates pseudorandom numbers for exponential backoff and times events when a packet is being received. In particular, it has circuitry to detect when the counter reaches 9, 20, and 48, and takes actions at these values.

    The state itself is held in numerous latches. The new state is computed by a PLA (Programmable Logic Array) below and to the right of the counter along with numerous individual gates. 

  9. One drawback of this exponential backoff circuit is that the pseudorandom numbers are completely synchronous. If two network nodes happen to be in the exact same counter state when they collide, they will go through the same exponential backoff delays, causing a collision every time. While this may seem unlikely, it apparently happened occasionally during use. The LANCE Ethernet chip from AMD used a different approach. Instead of running the pseudorandom counter from the highly accurate quartz clock signal, the counter used an on-chip ring oscillator that was deliberately designed to be inaccurate. This prevented two nodes from locking into inadvertent synchronization. 

A close look at the 8086 processor's bus hold circuitry

The Intel 8086 microprocessor (1978) revolutionized computing by founding the x86 architecture that continues to this day. One of the lesser-known features of the 8086 is the "hold" functionality, which allows an external device to temporarily take control of the system's bus. This feature was most important for supporting the 8087 math coprocessor chip, which was an option on the IBM PC; the 8087 used the bus hold so it could interact with the system without conflicting with the 8086 processor.

This blog post explains in detail how the bus hold feature is implemented in the processor's logic. (Be warned that this post is a detailed look at a somewhat obscure feature.) I've also found some apparently undocumented characteristics of the 8086's hold acknowledge circuitry, designed to make signal transition faster on the shared control lines.

The die photo below shows the main functional blocks of the 8086 processor. In this image, the metal layer on top of the chip is visible, while the silicon and polysilicon underneath are obscured. The 8086 is partitioned into a Bus Interface Unit (upper) that handles bus traffic, and an Execution Unit (lower) that executes instructions. The two units operate mostly independently, which will turn out to be important. The Bus Interface Unit handles read and write operations as requested by the Execution Unit. The Bus Interface Unit also prefetches instructions that the Execution Unit uses when it needs them. The hold control circuitry is highlighted in the upper right; it takes a nontrivial amount of space on the chip. The square pads around the edge of the die are connected by tiny bond wires to the chip's 40 external pins. I've labeled the MN/MX, HOLD, and HLDA pads; these are the relevant signals for this post.

The 8086 die under the microscope, with the main functional blocks and relevant pins labeled. Click this image (or any other) for a larger version.


How bus hold works

In an 8086 system, the processor communicates with memory and I/O devices over a bus consisting of address and data lines along with various control signals. For high-speed data transfer, it is useful for an I/O device to send data directly to memory, bypassing the processor; this is called DMA (Direct Memory Access). Moreover, a co-processor such as the 8087 floating point unit may need to read data from memory. The bus hold feature supports these operations: it is a mechanism for the 8086 to give up control of the bus, letting another device use the bus to communicate with memory. Specifically, an external device requests a bus hold and the 8086 stops putting electrical signals on the bus and acknowledges the bus hold. The other device can now use the bus. When the other device is done, it signals the 8086, which then resumes its regular bus activity.

Most things in the 8086 are more complicated than you might expect, and the bus hold feature is no exception, largely due to the 8086's minimum and maximum modes. The 8086 can be designed into a system in one of two ways—minimum mode and maximum mode—that redefine the meanings of the 8086's external pins. Minimum mode is designed for simple systems and gives the control pins straightforward meanings such as indicating a read versus a write. Minimum mode provides bus signals that were similar to the earlier 8080 microprocessor, making migration to the 8086 easier. On the other hand, maximum mode is designed for sophisticated, multiprocessor systems and encodes the control signals to provide richer system information.

In more detail, minimum mode is selected if the MN/MX pin is wired high, while maximum mode is selected if the MN/MX pin is wired low. Nine of the chip's pins have different meanings depending on the mode, but only two pins are relevant to this discussion. In minimum mode, pin 31 has the function HOLD, while pin 30 has the function HLDA (Hold Acknowledge). In maximum mode, pin 31 has the function RQ/GT0', while pin 30 has the function RQ/GT1'.

I'll start by explaining how a hold operation works in minimum mode. When an external device wants to use the bus, it pulls the HOLD pin high. At the end of the current bus cycle, the 8086 acknowledges the hold request by pulling HLDA high. The 8086 also puts its bus output pins into "tri-state" mode, in effect disconnecting them electrically from the bus. When the external device is done, it pulls HOLD low and the 8086 regains control of the bus. Don't worry about the details of the timing below; the key point is that a device pulls HOLD high and the 8086 responds by pulling HLDA high.

This diagram shows the HOLD/HLDA sequence. From iAPX 86,88 User's Manual, Figure 4-14.


The 8086's maximum mode is more complex, allowing two other devices to share the bus by using a priority-based scheme. Maximum mode uses two bidirectional signals, RQ/GT0 and RQ/GT1.2 When a device wants to use the bus, it issues a pulse on one of the signal lines, pulling it low. The 8086 responds by pulsing the same line. When the device is done with the bus, it issues a third pulse to inform the 8086. The RQ/GT0 line has higher priority than RQ/GT1, so if two devices request the bus at the same time, the RQ/GT0 device wins and the RQ/GT1 device needs to wait.1 Keep in mind that the RQ/GT lines are bidirectional: the 8086 and the external device both use the same line for signaling.

This diagram shows the request/grant sequence. From iAPX 86,88 User's Manual, Figure 4-16.


The bus hold does not completely stop the 8086. The hold operation stops the Bus Interface Unit, but the Execution Unit will continue executing instructions until it needs to perform a read or write, or it empties the prefetch queue. Specifically, the hold signal blocks the Bus Interface Unit from starting a memory cycle and blocks an instruction prefetch from starting.

Bus sharing and the 8087 coprocessor

Probably the most common use of the bus hold feature was to support the Intel 8087 math coprocessor. The 8087 coprocessor greatly improved the performance of floating-point operations, making them up to 100 times faster. As well as floating-point arithmetic, the 8087 supported trigonometric operations, logarithms and powers. The 8087's architecture became part of later Intel processors, and the 8087's instructions are still a part of today's x86 computers.3

The 8087 had its own registers and didn't have access to the 8086's registers. Instead, the 8087 could transfer values to and from the system's main memory. Specifically, the 8087 used the RQ/GT mechanism (maximum mode) to take control of the bus if the 8087 needed to transfer operands to or from memory.4 The 8087 could be installed as an option on the original IBM PC, which is why the IBM PC used maximum mode.

The enable flip-flop

The circuit is built from six flip-flops. The flip-flops are a bit different from typical D flip-flops, so I'll discuss the flip-flop behavior before explaining the circuit.

A flip-flop can store a single bit, 0 or 1. Flip-flops are very important in the 8086 because they hold information (state) in a stable way, and they synchronize the circuitry with the processor's clock. A common type of flip-flop is the D flip-flop, which takes a data input (D) and stores that value. In an edge-triggered flip-flop, this storage happens on the edge when the clock changes state from low to high.5 (Except at this transition, the input can change without affecting the output.) The output is called Q, while the inverted output is called Q-bar.

The symbol for the D flip-flop with enable.


Many of the 8086's flip-flops, including the ones in the hold circuit, have an "enable" input. When the enable input is high, the flip-flop records the D input, but when the enable input is low, the flip-flop keeps its previous value. Thus, the enable input allows the flip-flop to hold its value for more than one clock cycle. The enable input is very important to the functioning of the hold circuit, as it is used to control when the circuit moves to the next state.
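In software, such a flip-flop might be modeled like this (a sketch; the class and method names are mine):

```python
class DFlipFlop:
    """Edge-triggered D flip-flop with an enable input: the D value
    is stored only on a low-to-high clock edge while enable is high;
    otherwise the output holds its previous value."""
    def __init__(self):
        self.q = 0            # the Q output (Q-bar would be `1 - q`)
        self._prev_clk = 0

    def clock(self, clk, d, enable=True):
        if clk and not self._prev_clk and enable:  # rising edge, enabled
            self.q = d
        self._prev_clk = clk
        return self.q
```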

How bus hold is implemented (minimum mode)

I'll start by explaining how the hold circuitry works in minimum mode. To review, in minimum mode the external device requests a hold through the HOLD input, keeping the input high for the duration of the request. The 8086 responds by pulling the hold acknowledge HLDA output high for the duration of the hold.

In minimum mode, only three of the six flip-flops are relevant. The diagram below is highly simplified to show the essential behavior. (The full schematic is in the footnotes.6) At the left is the HOLD signal, the request from the external device.

Simplified diagram of the circuitry for minimum mode.


When a HOLD request comes in, the first flip-flop is activated, and remains activated for the duration of the request. The second flip-flop waits if any condition is blocking the hold request: a LOCK instruction, an unaligned memory access, and so forth. When the HOLD can proceed, the second flip-flop is enabled and it latches the request. The second flip-flop controls the internal hold signal, causing the 8086 to stop further bus activity. The third flip-flop is then activated when the current bus cycle (if any) completes; when it latches the request, the hold is "official". The third flip-flop drives the external HLDA (Hold Acknowledge) pin, indicating that the bus is free. This signal also clears the bus-enabled latch (elsewhere in the 8086), putting the appropriate pins into floating tri-state mode. The key point is that the flip-flops control the timing of the internal hold and the external HLDA, moving to the next step as appropriate.
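The three-stage sequence can be sketched as a tiny simulation (the names and the per-clock simplification are mine; each stage samples the previous stage's value only when its gating condition allows):

```python
class HoldPipeline:
    """Minimum-mode hold as three gated latches: ff1 tracks the HOLD
    pin, ff2 is the internal hold (gated by "hold ok"), and ff3
    drives the HLDA pin (gated by the bus cycle finishing)."""
    def __init__(self):
        self.ff1 = self.ff2 = self.ff3 = 0

    def clock(self, hold_pin, hold_ok, bus_free):
        # Update back-to-front so each stage sees the previous cycle's value.
        if bus_free:
            self.ff3 = self.ff2   # acknowledge: drive HLDA
        if hold_ok:
            self.ff2 = self.ff1   # internal hold: stop bus activity
        self.ff1 = hold_pin       # track the external request
        return self.ff3           # HLDA output
```

Raising hold_pin propagates stage by stage to HLDA once both gates open; dropping it clears the stages in the same order.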

When the external device signals an end to the hold by pulling the HOLD pin low, the process reverses. The three flip-flops return to their idle state in sequence. The second flip-flop clears the internal hold signal, restarting bus activity. The third flip-flop clears the HLDA pin.7

How bus hold is implemented (maximum mode)

The implementation of maximum mode is tricky because it uses the same circuitry as minimum mode, but the behavior is different in several ways. First, minimum mode and maximum mode use opposite polarities: a hold is requested by pulling HOLD high in minimum mode versus pulling a request line low in maximum mode. Moreover, in minimum mode, a request on the HOLD pin triggers a response on the opposite pin (HLDA), while in maximum mode, the request and response are on the same pin. Finally, using the same pin for the request and grant signals requires the pin to act as both an input and an output, with tricky electrical properties.

In maximum mode, the top three flip-flops handle the request and grant on line 0, while the bottom three flip-flops handle line 1. At a high level, these flip-flops behave roughly the same as in the minimum mode case, with the first flip-flop tracking the hold request, the second flip-flop activated when the hold is "approved", and the third flip-flop activated when the bus cycle completes. An RQ 0 input will generate a GT 0 output, while a RQ 1 input will generate a GT 1 output. The diagram below is highly simplified, but illustrates the overall behavior. Keep in mind that RQ 0, GT 0, and HOLD use the same physical pin, as do RQ 1, GT 1, and HLDA.

Simplified diagram of the circuitry for maximum mode.


In more detail, the first flip-flop converts the pulse request input into a steady signal. This is accomplished by configuring the first flip-flop to toggle on when the request pulse is received and toggle off when the end-of-request pulse is received.10 The toggle action is implemented by feeding the output pulse back to the input, inverted (A); since the flip-flop is enabled by the RQ input, the flip-flop holds its value until an input pulse. One tricky part is that the acknowledge pulse must not toggle the flip-flop. This is accomplished by using the output signal to block the toggle. (To keep the diagram simple, I've just noted the "block" action rather than showing the logic.)

As before, the second flip-flop is blocked until the hold is "authorized" to proceed. However, the circuitry is more complicated since it must prioritize the two request lines and ensure that only one hold is granted at a time. If RQ0's first flip-flop is active, it blocks the enable of RQ1's second flip-flop (B). Conversely, if RQ1's second flip-flop is active, it blocks the enable of RQ0's second flip-flop (C). Note the asymmetry, blocking on RQ0's first flip-flop and RQ1's second flip-flop. This enforces the priority of RQ0 over RQ1, since an RQ0 request blocks RQ1 but only an RQ1 "approval" blocks RQ0.

When the second flip-flop is activated in either path, it triggers the internal hold signal (D).8 As before, the hold request is latched into the third flip-flop when any existing memory cycle completes. When the hold request is granted, a pulse is generated (E) on the corresponding GT pin.9

The same circuitry is used for minimum mode and maximum mode, although the above diagrams show differences between the two modes. How does this work? Essentially, logic gates are used to change the behavior between minimum mode and maximum mode as required. For the most part, the circuitry works the same, so only a moderate amount of logic is required to make the same circuitry work for both. On the schematic, the signal MN is active during minimum mode, while MX is active during maximum mode, and these signals control the behavior.

The "hold ok" circuit

As usually happens with the 8086, there are a bunch of special cases when different features interact. One special case is if a bus hold request comes in while the 8086 is acknowledging an interrupt. In this case, the interrupt takes priority and the bus hold is not processed until the interrupt acknowledgment is completed. A second special case is if the bus hold occurs while the 8086 is halted. In this case, the 8086 issues a second HALT indicator at the end of the bus hold. Yet another special case is the 8086's LOCK prefix, which locks the use of the bus for the following instruction, so a bus hold request is not honored until the locked instruction has completed. Finally, the 8086 performs an unaligned word access to memory by breaking it into two 8-bit bus cycles; these two cycles can't be interrupted by a bus hold.

In more detail, the "hold ok" circuit determines at each cycle if a hold could proceed. There are several conditions under which the hold can proceed:

  • The bus cycle is `T2`, except if an unaligned bus operation is taking place (i.e. a word split into two byte operations), or
  • A memory cycle is not active and a microcode memory operation is not taking place, or
  • A memory cycle is not active and a hold is currently active.

The first case occurs during bus (memory) activity, where a hold request before cycle T2 will be handled at the end of that cycle. The second case allows a hold if the bus is inactive. But if microcode is performing a memory operation, the hold will be delayed, even if the request is just starting. The third case is the opposite of the other two: it enables the flip-flop so a hold request can be dropped. (This ensures that the hold request can still be dropped in the corner case where a hold starts and then the microcode makes a memory request, which will be blocked by the hold.)
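Expressed as a boolean function (the argument names are mine, not taken from the schematic), the three conditions are:

```python
def hold_ok(in_t2, unaligned, mem_cycle, ucode_mem_op, hold_active):
    """True if the hold state can advance this cycle: bus cycle at T2
    (unless unaligned), or bus idle with no microcode memory operation,
    or bus idle with a hold already active (so it can be dropped)."""
    return ((in_t2 and not unaligned) or
            (not mem_cycle and not ucode_mem_op) or
            (not mem_cycle and hold_active))
```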

The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear.


An instruction with the LOCK prefix causes the bus to be locked against other devices for the duration of the instruction. Thus, a hold cannot be granted while the instruction is running. This is implemented through a separate path. This logic is between the output of the first (request) flip-flop and the second (accepted) flip-flop, tied into the LOCK signal. Conceptually, it seems that the LOCK signal should block hold-ok and thus block the second (accepted) flip-flop from being enabled. But instead, the LOCK signal blocks the data path, unless the request has already been granted. I think the motivation is to allow dropping of a hold request to proceed uninterrupted. In other words, LOCK prevents a hold from being accepted, but it doesn't prevent a hold from being dropped, and it was easier to implement this in the data path.

The pin drive circuitry

The circuitry for the HOLD/RQ0/GT0 and HLDA/RQ1/GT1 pins is somewhat complicated, since the pins are used for both input and output. In minimum mode, the HOLD pin is an input, while the HLDA pin is an output. In maximum mode, both pins act as inputs, with a low-going pulse from an external device starting or stopping the hold. But the 8086 also issues pulses on these pins to grant the hold. Pull-up resistors inside the 8086 ensure that the pins remain high (idle) when unused. Finally, an undocumented active pull-up system restores a pin to a high state after it is pulled low, providing faster response than the resistor alone.

The schematic below shows the heart of the tri-state output circuit. Each pin is connected to two large output MOSFETs, one to drive the pin high and one to drive the pin low. The transistors have separate control lines; if both control lines are low, both transistors are off and the pin floats in the "tri-state" condition. This permits the pin to be used as an input, driven by an external device. The pull-up resistor keeps the pin in a high state.

The tri-state output circuit for each hold pin.

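A hedged sketch of how the tri-state pin resolves its level, modeling the two output MOSFETs and the internal pull-up (signal names are mine):

```python
def resolve_pin(drive_high, drive_low, external=None):
    """Resolve the level of a tri-state pin with an internal pull-up.

    drive_high/drive_low model the two large output MOSFETs. With both
    off, the pin floats: an external device can drive it, and otherwise
    the pull-up resistor holds it high.
    """
    if drive_high and drive_low:
        raise ValueError("contention: both output transistors on")
    if drive_high:
        return 1
    if drive_low:
        return 0
    return external if external is not None else 1  # pull-up wins when floating
```

For example, `resolve_pin(False, False)` models the idle (tri-state) pin held high by the pull-up, while `resolve_pin(False, False, external=0)` models an external device pulling the floating pin low.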

The diagram below shows how this circuitry looks on the die. In this image, the metal and polysilicon layers have been removed with acid to show the underlying doped silicon regions. The thin white stripes are transistor gates where polysilicon wiring crossed the silicon. The black circles are vias that connected the silicon to the metal on top. The empty regions at the right are where the metal pads for HOLD and HLDA were. Next to the pads are the large transistors to pull the outputs high or low. Because the outputs require much higher current than internal signals, these transistors are much larger than logic transistors. They are composed of several transistors placed in parallel, resulting in the parallel stripes. The small pull-up resistors are also visible. For efficiency, these resistors are actually depletion-mode transistors, specially doped to act as constant-current sources.

The HOLD/HLDA pin circuitry on the die.


At the left, some of the circuitry is visible. The large output transistors are driven by "superbuffers" that provide more current than regular NMOS buffers. (A superbuffer uses separate transistors to pull the signal high and low, rather than using a pull-up to pull the signal high as in a standard NMOS gate.) The small transistors are the pass transistors that gate output signals according to the clock. The thick rectangles are crossovers, allowing the vertical metal wiring (no longer visible) to cross over a horizontal signal in the silicon layer. The 8086 has only a single metal layer, so the layout requires a crossover where signals would otherwise intersect. Because silicon's resistance is much higher than metal's, the crossover is relatively wide to reduce the resistance.

The problem with a pull-up resistor is that it is relatively slow when connected to a line with high capacitance. You essentially end up with a resistor-capacitor delay circuit, as the resistor slowly charges the line and brings it up to a high level. To get around this, the 8086 has an active drive circuit to pulse the RQ/GT lines high to pull them back from a low level. This circuit pulls the line high one cycle after the 8086 drops it low for a grant acknowledge. This circuit also pulls the line high after the external device pulls it low.11 (The schematic for this circuit is in the footnotes.) The curious thing is that I couldn't find this behavior documented in the datasheets. The datasheets describe the internal pull-up resistor, but don't mention that the 8086 actively pulls the lines high.12
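To see why the resistor alone is slow, consider the RC charging curve. The component values below are illustrative assumptions for the sake of the arithmetic, not measured 8086 values:

```python
import math

def rc_voltage(v_start, v_end, r_ohms, c_farads, t_seconds):
    """Voltage on a capacitive line charging through a resistor after time t."""
    tau = r_ohms * c_farads
    return v_end + (v_start - v_end) * math.exp(-t_seconds / tau)

# Assumed values: a 20 kilohm pull-up into 100 pF of line capacitance gives
# tau = 2 microseconds -- roughly ten clock cycles at 5 MHz, and even then the
# line has only reached ~63% of its final voltage. An active pull-up pulse
# restores the line in a single cycle instead.
tau = 20e3 * 100e-12
v_after_one_tau = rc_voltage(0.0, 5.0, 20e3, 100e-12, tau)
```

This is why the active drive circuit matters: the RC time constant of a passive pull-up is large compared to a clock cycle.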

Conclusions

The hold circuitry was a key feature of the 8086, largely because it was necessary for the 8087 math coprocessor chip. The hold circuitry seems like it should be straightforward, but there are many corner cases in this circuitry: it interacts with unaligned memory accesses, the LOCK prefix, and minimum vs. maximum modes. As a result, it is fairly complicated.

Personally, I find the hold circuitry somewhat unsatisfying to study, with few fundamental concepts but a lot of special-case logic. The circuitry seems overly complicated for what it does. Much of the complexity is probably due to the wildly different behavior of the pins between minimum and maximum mode. Intel should have simply used a larger package (like the Motorola 68000) rather than re-using pins to support different modes, as well as using the same pin for a request and a grant. It's impressive, however, that the same circuitry was made to work for both minimum and maximum modes, despite using completely different signals to request and grant holds. This circuitry must have been a nightmare for Intel's test engineers, trying to ensure that the circuitry performed properly when there were so many corner cases and potential race conditions.

I plan to write more on the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected] and Bluesky as @righto.com so you can follow me there too.

Notes and references

  1. The timing of priority between RQ0 and RQ1 is left vague in the documentation. In practice, even if RQ1 is requested first, a later RQ0 can still preempt it until the point that RQ1 is internally granted (i.e. the second flip-flop is activated). This happens before the hold is externally acknowledged, so it's not obvious to the user at what point priority no longer applies. 

  2. The RQ/GT0 and RQ/GT1 signals are active-low. These signals should have an overbar to indicate this, but it makes the page formatting ugly :-) 

  3. Modern x86 processors still support the 8087 (x87) instruction set. Starting with the 80486DX, the floating point unit was included on the CPU die, rather than as an external coprocessor. The x87 instruction set used a stack-based model, which made it harder to parallelize. To mitigate this, Intel introduced SSE in 1999, a different set of floating point instructions that worked on an independent register set. The x87 instructions are now considered mostly obsolete and are deprecated in 64-bit Windows. 

  4. The 8087 provides another RQ/GT input line for an external device. Thus, two external devices can still be used in a system with an 8087. That is, although the 8087 uses up one of the 8086's two RQ/GT lines, the 8087 provides another one, so there are still two lines available. The 8087 has logic to combine its bus requests and external bus requests into a single RQ/GT line to the 8086. 

  5. Confusingly, some of the flip-flops in the hold circuit transition when the clock goes high, while others use the inverted clock signal and transition when the clock goes low. Moreover, the flip-flops are inconsistent about how they treat the data. In each group of three flip-flops, the first flip-flop is active-high, while the remaining flip-flops are active-low. For the most part, I'll ignore this in the discussion. You can look at the schematic if you want to know the details. 

  6. The schematics below show my reverse-engineered hold circuitry. I have partitioned the schematic into the hold logic and the output driver circuitry. This split matches the physical partitioning of the circuitry on the die.

    In the first schematic, the upper part handles HOLD and request0, while the lower part handles request1. There is some circuitry in the middle to handle the common enabling and to generate the internal hold signal. I won't explain the circuitry in much detail, but there are a few things I want to point out. First, even though the hold circuit seems like it should be simple, there are a lot of gates connected in complicated ways. Second, although there are many inverters, NAND, and NOR gates, there are also complex gates such as AND-NOR, OR-NAND, AND-OR-NAND, and so forth. These are implemented as single gates. Due to how gates are constructed from NMOS transistors, it is just as easy to build a hierarchical gate as a single gate. (The last step must be inversion, though.) The XOR gates are more complex; they are constructed from a NOR gate and an AND-NOR gate.

    Schematic of the hold circuitry. Click this image (or any other) for a larger version.

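As a sketch, these complex gates can be modeled as boolean functions. The XOR construction described above, a NOR gate feeding an AND-NOR gate, works out as follows (function names are mine):

```python
def and_nor(a, b, c):
    """AND-NOR implemented as a single NMOS gate: NOT((a AND b) OR c)."""
    return not ((a and b) or c)

def xor_gate(a, b):
    """XOR as built on the die: a NOR gate feeding one input of an AND-NOR gate."""
    return and_nor(a, b, not (a or b))
```

Expanding the expression: NOT((a AND b) OR NOT(a OR b)) is false when both inputs match (either both high or both low) and true otherwise, which is exactly XOR.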

    The schematic below shows the output circuits for the two pins. These circuits are similar, but have a few differences because only the bottom one is used as an output (HLDA) in minimum mode. Each circuit has two inputs: what the current value of the pin is, and what the desired value of the pin is.

    Schematic of the pin output circuits.



  7. Interestingly, the external pins aren't taken out of tri-state mode immediately when the HLDA signal is dropped. Instead, the 8086's bus drivers are re-enabled when a bus cycle starts, which is slightly later. The bus circuitry has a separate flip-flop to manage the enable/disable state, and the start of a bus cycle is what re-enables the bus. This is another example of behavior that the documentation leaves ambiguous. 

  8. There's one more complication for the hold-out signal. If a hold is granted on one line, a request comes in on the other line, and then the hold is released on the first line, the desired behavior is for the bus to remain in the hold state as the hold switches to the second line. However, because of the way a hold on line 1 blocks a hold on line 0, the GT1 second flip-flop will drop a cycle before the GT0 second flip-flop is activated. This would cause hold-out to drop for a cycle and the 8086 could start unwanted bus activity. To prevent this case, the hold-out line is also activated if there is an RQ0 request and RQ1 is granted. This condition seems a bit random but it covers the "gap". I have to wonder if Intel planned the circuit this way, or they added the extra test as a bug fix. (The asymmetry of the priority circuit causes this problem to occur only going from a hold on line 1 to line 0, not the other direction.) 

  9. The pulse-generating circuit is a bit tricky. A pulse signal is generated if the request has been accepted, has not been granted, and will be granted on the next clock (i.e. no memory request is active so the flip-flop is enabled). (Note that the pulse will be one cycle wide, since granting the request on the next cycle will cause the conditions to be no longer satisfied.) This provides the pulse one clock cycle before the flip-flop makes it "official". Moreover, the signals come from the inverted Q outputs from the flip-flops, which are updated half a clock cycle earlier. The result is that the pulse is generated 1.5 clock cycles before the flip-flop output. Presumably the point of this is to respond to hold requests faster, but it seems overly complicated. 
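A cycle-by-cycle sketch shows why the pulse is exactly one clock wide. This is a behavioral model of the description above, not the actual gate network, and ignores the extra half-cycle from the inverted Q outputs:

```python
def grant_pulses(cycles):
    """Simulate the grant-pulse logic over a list of (accepted, ff_enabled)
    tuples, one per clock. A pulse fires when the request is accepted, not
    yet granted, and the grant flip-flop will update on the next clock."""
    granted = False
    pulses = []
    for accepted, enabled in cycles:
        pulses.append(accepted and not granted and enabled)
        if enabled:              # flip-flop clocks in the new state
            granted = accepted
    return pulses
```

With the request held and the flip-flop enabled, only the first cycle pulses: `grant_pulses([(True, True)] * 3)` yields `[True, False, False]`, since granting the request immediately invalidates the pulse conditions.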

  10. The request pulse is required to be one clock cycle wide. The feedback loop shows why: if the request is longer than one clock cycle, the first flip-flop will repeatedly toggle on and off, resulting in unexpected behavior. 
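A simple toggling model (my own simplification of the feedback loop, not the exact circuit) illustrates the failure mode:

```python
def request_state(request_line_low):
    """Model the first flip-flop with toggle feedback: each cycle the
    request line is sampled low, the state flips. One sample per clock."""
    state = False
    history = []
    for low in request_line_low:
        if low:
            state = not state
        history.append(state)
    return history
```

A one-clock pulse latches the request cleanly, but a two-clock pulse toggles the flip-flop back off, which is why the request pulse must be exactly one cycle wide.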

  11. The details of the active pull-up circuitry don't make sense to me. First it XORs the state of the pin with the desired state of the pin and uses this signal to control a multiplexer, which generates the pull-up action based on other gates. The result of all this ends up being simply NAND, implemented with excessive gates. Another issue is that minimum mode blocks the active pull-up, which makes sense. But there are additional logic gates so minimum mode can affect the value going into the multiplexer, which gets blocked in minimum mode, so that logic seems wasted. There are also two separate circuits to block pull-up during reset. My suspicion is that the original logic accumulated bug fixes and redundant logic wasn't removed. But it's possible that the implementation is doing something clever that I'm just missing. 

  12. My analysis of the RQ/GT lines being pulled high is based on simulation. It would be interesting to verify this behavior on a physical 8086 chip. By measuring the current out of the pin, the pull-up pulses should be visible.