
Counting the transistors in the 8086 processor: it's harder than you might think

How many transistors are in Intel's 8086 processor? This seems like a straightforward question, but it doesn't have a straightforward answer. Most sources say that this processor has 29,000 transistors.1 However, I have traced out every transistor from die photos and my count is 19,618. What accounts for the 9,382 missing transistors?

The explanation is that when manufacturers report the transistor count of chips, they typically report "potential" transistors. Chips that include a ROM will have different numbers of transistors depending on the values stored in the ROM. Since marketing doesn't want to publish varying numbers depending on the number of 1 bits and 0 bits, they often count ROM sites: places that could have transistors, but might not. A PLA (Programmable Logic Array) has similar issues; the transistor count depends on the desired logic functions.

What are these potential transistor sites? ROMs are typically constructed as a grid of cells, with a transistor at a cell for a 1 bit, and no transistor for a 0 bit.2 In the 8086, transistors are created or not through the pattern of silicon doping. The photo below shows a closeup of the silicon layer for part of the 8086's microcode ROM. The empty regions are undoped silicon, while the other regions are doped silicon. Transistor gates are formed where vertical polysilicon lines (removed for the photo) passed over the doped silicon. Thus, depending on the data encoded into the ROM during manufacturing, the number of transistors varies.

A closeup of part of the microcode ROM. The dark circles indicate vias between the silicon and the metal on top.

The diagram below provides more detail, showing the microcode ROM up close. Green T's indicate transistors, while red X's indicate positions with no transistor. As you can see, the potential transistor positions form a grid, but only some of the positions are occupied by transistors. The common method for counting transistors counts all the potential positions (18 below) rather than the actual transistors that are implemented (12 below).

An extreme closeup of the microcode ROM. Green T's indicate transistors, while red X's indicate positions with no transistor.
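Counting both kinds from a bit grid is straightforward. The grid below is made-up illustrative data chosen to match the 18-site, 12-transistor example, not the actual ROM contents.

```python
# Toy model of ROM transistor counting: each cell in the grid is a potential
# transistor site; a 1 bit means a transistor is actually present.
rom_bits = [
    [1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1],
]

sites = sum(len(row) for row in rom_bits)        # potential positions
transistors = sum(sum(row) for row in rom_bits)  # transistors implemented

print(sites, transistors)  # 18 12
```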

I found an Intel history that confirmed that the 8086 transistor count includes potential sites, saying "This is 29,000 transistors if all ROM and PLA available placement sites are counted." That paper gives the approximate number of (physical) transistors in the 8086 as 20,000. This number is close to my count of 19,618.

To get a transistor count that includes empty sites, I counted the number of transistor sites in the various ROMs and PLAs in the 8086 chip. This is harder than you might expect because the smaller ROMs, such as the constant ROM, have some layout optimization. The photo below shows a closeup of the constant ROM. It is essentially a grid, but has been "squeezed" slightly to optimize its layout, making it slightly irregular. I'm counting its "potential" transistors, but one could argue that it shouldn't be counted because filling in these transistors might run into problems.

Closeup of the constant ROM showing the silicon and polysilicon.

The following table breaks down the ROM and PLA counts by subcomponent. I found a total of 9,659 transistor vacancies, that is, unfilled sites. If you add those to my transistor count, it works out to 29,277 transistors.

Component         Transistor sites  Transistors  Vacancies
Microcode         13904             6210         7694
Group Decode ROM  1254              603          651
Translation ROM   1050              431          619
Register PLAs     465               182          283
ALU PLA           354               170          184
Constant ROM      203               109          94
Condition PLA     160               74           86
Segment PLA       90                42           48
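As a sanity check on the table, the per-component numbers can be totaled programmatically:

```python
# Per-component counts from the die analysis: (sites, transistors, vacancies).
components = {
    "Microcode":        (13904, 6210, 7694),
    "Group Decode ROM": (1254, 603, 651),
    "Translation ROM":  (1050, 431, 619),
    "Register PLAs":    (465, 182, 283),
    "ALU PLA":          (354, 170, 184),
    "Constant ROM":     (203, 109, 94),
    "Condition PLA":    (160, 74, 86),
    "Segment PLA":      (90, 42, 48),
}

# Each row should satisfy: transistors + vacancies == sites.
for sites, transistors, vacancies in components.values():
    assert transistors + vacancies == sites

total_vacancies = sum(v for _, _, v in components.values())
print(total_vacancies)          # 9659
print(19618 + total_vacancies)  # 29277: physical transistors plus vacancies
```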

The image below shows these ROMs and PLAs on the die and how much the vacancies increase the transistor count. Not surprisingly, the large microcode ROM and its decoding PLA are responsible for most of the vacancies.

The 8086 die with transistor vacancy counts and how much they contribute to the final transistor count. (Click this image or any other for a larger version.)

Potential exclusions

So are my counts of 19,618 transistors and 29,277 transistor sites correct? There are some subtleties that could lower this count slightly. First, the output pins use large, high-current transistors. Each output transistor is constructed from more than a dozen transistors wired in parallel. Should this be counted as a dozen transistors or a single transistor? I'm counting the component transistors.

An output pad with a bond wire attached. Driver transistors next to the pad are constructed from multiple transistors in parallel.

The 8086 has about 43 transistors wired as diodes for various purposes. Some are input protection diodes, while others are used in the charge pump for the substrate bias generator. Should these be excluded from the transistor count? Physically they are transistors but functionally they aren't.

The 8086 is built with NMOS logic, which builds gates out of active "enhancement" transistors as well as "depletion" transistors that basically act as pull-up resistors. I count 2,689 depletion-mode transistors, but you could exclude them from the count as not "real" transistors.

Conclusions

The number of transistors in a chip is harder to define than you might expect. The 8086 is commonly described as having 29,000 transistors when including empty sites in ROMs and PLAs that potentially could have a transistor. The published number of physical transistors in the 8086 is "approximately 20,000". From my counts, the 8086 has 19,618 physical transistors and 29,277 transistors when including empty sites. Given the potential uncertainties in counting, it's not surprising that Intel rounded the numbers to the nearest thousand.

The practice of counting empty transistor sites may seem like an exaggeration of the real transistor count, but there are some good reasons to count this way. Including empty sites gives a better measure of the size and complexity of the chip, since these sites take up area whether or not they are used. This number also lets one count the number of transistors before the microcode is written, and it is also stable as the microcode changes. But when looking at transistor counts, it's good to know exactly what is getting counted.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected]. I discussed the transistor count in the 6502 processor here.

Notes and references

  1. For example, The 8086 Family Users Manual says on page A-210: "The central processor for the iSBC 86/12 board is Intel's 8086, a powerful 16-bit H-MOS device. The 225 sq. mil chip contains 29,000 transistors and has a clock rate of 5MHz." 

  2. ROMs can also be constructed the other way around, with a transistor indicating a 0. It's essentially an arbitrary decision, depending on whether the output buffer inverts the bit or not. Other ROM technologies may have transistors at all the sites but only connect the desired ones. 

Reverse-engineering an airspeed/Mach indicator from 1977

How does a vintage airspeed indicator work? CuriousMarc picked one up for a project, but it didn't have any documentation, so I reverse-engineered it. This indicator was used in the cockpit panel for business jets such as the Gulfstream G-III, Cessna Citation, and Bombardier Challenger CL600. It was probably manufactured in 1977 based on the dates on its transistors.

You might expect that the indicators on an aircraft control panel are simple dials. But behind this dial is a large, 2.8-pound box with a complex system of motors, gears, and feedback potentiometers, controlled by two boards of electronics. Yet for all this complexity, the indicator doesn't have any smarts: the pointers just indicate voltages fed into it from an air data computer. This is a quick blog post to summarize what I found.

Front view of the indicator.

The dial has two rotating pointers: the white pointer indicates airspeed in knots while the striped pointer indicates the maximum airspeed (which varies depending on altitude). The "digital" indicator at the top shows Mach number from 0.10 to 0.99, implemented with rotating digit wheels. When the unit is operating, the OFF indicator flag switches to black. The flag switches to a bright VMO warning if the pilot exceeds the maximum airspeed.1 On the rim of the dial, two small markers called "bugs" can be manually moved to indicate critical speeds such as takeoff speed.

In use, the indicator is connected to a Sperry air data computer and receives voltage signals to control the dial positions.3 The air data computer measures the static and dynamic air pressure from pitot tubes and determines the airspeed, Mach number, altitude, and other parameters. (These calculations become nontrivial near Mach 1 as air compresses and the fluid dynamics change.) Since we didn't have the air data computer or its specifications, I needed to figure out the connections from the computer to the display.

With the unit's cover removed, you can see the internal mechanisms and circuitry. Each of the three indicators is controlled by a small DC motor with a potentiometer providing feedback. To the right, two circuit boards provide the electronics to drive the indicators.4 At the upper right, the black blob is a 26-volt 400-Hertz transformer to power the unit. Some power supply components are in front of it. Below the transformer is an orangish flexible printed-circuit board, which seems advanced for the timeframe. This flexible ribbon connects the transformer, the external connector, and the printed-circuit board sockets, providing the backplane for the system.

A side view of the unit shows the gears to control the indicators.

The diagram below shows the principle behind the servo mechanism that controls each indicator. The goal is to rotate the indicator to a position corresponding to the input voltage. A feedback loop is used to achieve this. The potentiometer provides a voltage proportional to its rotation. The input voltage and the feedback voltage are inputs to an op amp, which generates an error signal based on the difference between the inputs. The error signal rotates the DC motor in the appropriate direction until the potentiometer voltage matches the input voltage. Because the indicator and the potentiometer are geared together, the indicator will be in the correct position. As the input voltage changes, the system will continuously track the changes and keep the indicator updated.

A diagram illustrating the servo feedback loop.
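The feedback loop described above can be sketched numerically. The gain and step count here are illustrative, not measured from the unit.

```python
# Numerical sketch of the servo feedback loop: the error amplifier drives the
# motor in proportion to (input - feedback), and the geared potentiometer
# turns motor rotation back into a voltage.
def servo_track(v_input, v_pot=0.0, gain=0.5, steps=50):
    """Step the loop until the pot voltage settles at the input voltage."""
    for _ in range(steps):
        error = v_input - v_pot    # op-amp error signal
        v_pot += gain * error      # motor turns; pot voltage follows
    return v_pot

print(round(servo_track(5.0), 3))  # 5.0 -- the dial settles at the input
```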

Because the DC motor spins much faster than the dial moves, reduction gears slow the rotation. The photo below shows the gear train in the unit. A potentiometer is at the upper-right with three wires attached.

A closeup of the gear train. A potentiometer is on the right.

The Mach display has additional gearing to rotate the numbered wheels. When the low-digit wheel cycles around, it advances the high-digit wheel, similar to an odometer.

The mechanism to rotate the digit wheels for the Mach number.

Fault checking

One interesting feature of the indicator unit is that it implements fault checking to alert the pilot if something goes wrong. The front panel has a three-position flag. By default it's in the OFF position. Powering the coil in one direction rotates the flag to the blank side. Powering the coil in the other direction rotates the flag to the "VMO" position which indicates that the pilot has exceeded the maximum operating speed.

I figured that powering up the unit would move the flag out of the OFF position, but it's more complicated than that. First, the unit checks that the air data computer is providing a suitable reference voltage. Second, the unit verifies that the motor voltages for the two needles are within limits; this ensures that the servo loop is operating successfully. Third, the unit checks that signals are received on status pins K and L. The unit only moves out of the OFF state if all these conditions are satisfied.5 Thus, if the unit receives bad signals or is malfunctioning, the pilot will be alerted by the OFF indicator, rather than trusting the faulty display.
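The checks above can be summarized as a small decision function. The condition names are my own labels for the behavior I observed, not Sperry's, and this simplifies how the K and L status pins interact.

```python
# A decision-function summary of the inferred fault logic: the flag only
# leaves OFF when every health check passes; a low L ("speed ok") pin then
# rotates the flag to the VMO overspeed warning.
def flag_position(ref_ok, motors_ok, pin_k, speed_ok):
    """Return the front-panel flag state."""
    if not (ref_ok and motors_ok and pin_k):
        return "OFF"                       # any failed check keeps OFF showing
    return "BLANK" if speed_ok else "VMO"  # low L pin signals overspeed

print(flag_position(True, True, True, True))   # BLANK -- normal operation
print(flag_position(False, True, True, True))  # OFF -- bad reference voltage
print(flag_position(True, True, True, False))  # VMO -- overspeed warning
```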

The circuitry

The unit is powered by 26 volts, 400 Hz, a standard voltage for aviation. A small transformer provides multiple outputs for the various internal voltages. The unit has four power supplies: three on the first board and one on the back wall of the unit. One power supply is for the status indicator, one is for the op amps, one powers the 41.7V motors, and the fourth provides other power.

One subtlety is how the feedback potentiometers are powered. The servo loop compares the potentiometer voltage with the input voltage. But this only works if the potentiometer and the input voltage are using the same reference. One solution would be for the indicator unit and the air data computer to contain matching precision voltage regulators. Instead, the system uses a simpler, more reliable approach: the air data computer provides a reference voltage that the indicator unit uses to power the potentiometers.6 With this approach, the air data computer's voltage reference can fluctuate and the indicator will still reach the right position. (In other words, a 5V input with a 10V reference and a 6V input with a 12V reference are both 50%.)
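The ratiometric idea in the parenthetical can be expressed directly:

```python
# The servo settles where the pot voltage matches the input, i.e. at the
# fraction input/reference -- so drift in the reference voltage cancels out.
def dial_fraction(v_input, v_reference):
    return v_input / v_reference

print(dial_fraction(5.0, 10.0))  # 0.5
print(dial_fraction(6.0, 12.0))  # 0.5 -- same dial position despite drift
```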

The diagram below shows the board with the servo circuitry. The board uses dual op-amp integrated circuits, packaged in 10-pin metal cans that protect against interference.7 The ICs and some of the other components have obscure military part numbers; I don't know if this unit was built for military use or if military-grade parts were used for reliability.

The servo board is full of transistors, resistors, capacitors, diodes, and op-amp integrated circuits.

The circuitry in the lower-left corner handles the reference voltage from the air data computer. The board buffers this voltage with an op amp to power the three feedback potentiometers. The op amp also ensures that the reference voltage is at least 10 volts. If not, the indicator unit shows the "OFF" flag to alert the pilot.

The schematic below shows one of the servo circuits; the three circuits are roughly the same. The heart of the circuit is the error op amp in the center. It compares the voltage from the potentiometer with the input voltage and generates an error output that moves the motor appropriately. A positive error output will turn on the upper transistor, driving the motor with a positive voltage. Conversely, a negative error output will turn on the lower transistor, driving the motor with a negative voltage. The motor drive circuit has clamp diodes to limit the transistor base voltages.

Schematic of one of the servo circuits.

The op amp also receives a feedback signal from the motor output. I don't entirely understand this signal, which goes through a filter circuit with resistors, diodes, and a capacitor. I think it dampens the motor signal so the motor doesn't overshoot the desired position. I think it also keeps the transistor drive signal biased relative to the emitter voltage (i.e. the motor output).

On the input side, the potentiometer voltage goes through an op amp follower buffer, which simply outputs its input voltage. This may seem pointless, but the op amp provides a high-impedance input so the potentiometer's voltage doesn't get distorted.

The external input voltage goes through a resistor/capacitor circuit to scale it and filter out noise. Curiously, the circuit board was modified by cutting a trace and adding a resistor and capacitor to change the input circuit for one of the inputs. In the photo below, you can see the added resistor and capacitor; the cut trace is just to the right of the capacitor. I don't know if this modification changed the scale factor or if it filtered out noise. A label on the box says that Honeywell performed a modification on November 8, 1991, which presumably was this circuit.

A closeup of the circuit board showing the modification.

The second board implements three power supplies as well as the circuitry for the OFF/VMO flag. The power supplies are simple and unregulated, just diode bridges to convert AC to DC, along with filter capacitors. Most of the circuitry on the board controls the status flag. Two dual op amps check the motor voltages against upper and lower limits to ensure that the motors are tracking the inputs. These outputs, along with other logic status signals, are combined with diode-transistor logic to determine the flag status. Driver transistors provide +18 or -18 volts to the flag's coil to drive it to the desired position.

This board has power supply circuitry and the control circuitry for the indicator flag.

Conclusions

After reverse-engineering the pinout, I connected the airspeed indicator to a stack of power supplies and succeeded in getting the indicators to operate (video). This unit is much more complex than I expected for a simple display, with servoed motors controlled by two boards of electronics. Air safety regulations probably account for much of the complexity, ensuring that the display provides the pilot with accurate information. For all that complexity, the unit is essentially a voltmeter, indicating three voltages on its display. This airspeed indicator is a bit different from most of the hardware I examine, but hopefully you found this look at its internal circuitry interesting.

With the case removed, the internal circuitry is visible.

You can follow me on Twitter @kenshirriff or RSS. I've also started experimenting with Mastodon recently as @[email protected].

Notes and references

  1. Since the unit has airspeed and maximum airspeed indicators, you might expect it to display the maximum airspeed warning flag based on the two speed inputs. Instead, the flag is controlled by input pin "L". In other words, the air data computer, not the indicator unit, determines when the maximum airspeed is exceeded. 

  2. This unit is a "Mach Airspeed Indicator", part number 4018366, apparently also called the SI-225.

    Product label with part number 4018366-901.

    Note that the label says Sperry. In 1986, Sperry attempted to buy Honeywell but instead Burroughs made a hostile takeover bid. The merger of Sperry and Burroughs formed Unisys. A couple of months after the merger, the Sperry Aerospace Group was sold to Honeywell for $1.025 billion. Thus, the indicator became a Honeywell product. This corporate history explains why the unit has a Honeywell product support sticker.

    Labels on top of the unit indicate that it worked with the Sperry 4013242 and 4013244 air data computers. These became the Honeywell AZ-242 and AZ-244.

     

  3. The connector is a 32-pin MIL Spec round connector. Most of the 32 pins are unused. The connector has complex keying with 5 slots. I assume the keying is specific to this indicator, so the wrong indicator doesn't get connected.

    A closeup of the 32-pin connector, probably a MIL Spec 18-32.

    For reference, here is the pinout of the unit. Since this is based on reverse engineering, I don't guarantee it 100%. Don't use this for flight!

    Pin  Use
    A    5V illumination
    B    Chassis ground
    C    AC ground
    E    26V 400 Hz
    F    26V 400 Hz
    K    Enable
    L    Speed ok
    M    Signal ground
    N    Ref. voltage
    P    Vmax control voltage
    R    Airspeed control voltage
    S    Mach control voltage
    V    Chassis ground

    Pins D, G, H, J, T, U, W, X, Y, Z, a, b, c, d, e, f, g, h, and j are unused. 

  4. The chassis has an empty slot for a third circuit board. My guess is that this chassis was used for multiple types of indicators and others required a third board. 

  5. If the L pin goes low, the indicator will move to the VMO position. 

  6. My hypothesis is that the correct reference voltage is 11.7 volts. This yields a scale factor of 1 volt equals 50 knots. It also matches up the display's change in scale at 250 knots with the measured scale change. 

  7. The meter uses three different integrated circuits in 10-pin metal cans with mysterious military markings: "FHL 24988", "JM38510/10102BIC 27014", and "SL14040". These appear to all be equivalent to uA747 dual op amps. (Note that JM38510 is not a part number; it is a general military specification for integrated circuits. The number after it is the relevant part number.) 

How the 8086 processor's microcode engine works

The 8086 microprocessor was a groundbreaking processor introduced by Intel in 1978. It led to the x86 architecture that still dominates desktop and server computing. The 8086 chip uses microcode internally to implement its instruction set. I've been reverse-engineering the 8086 from die photos and this blog post discusses how the chip's microcode engine works. I'm not going to discuss the contents of the microcode1 or how the microcode controls the rest of the processor here. Instead, I'll look at how the 8086 decides what microcode to run, steps through the microcode, handles jumps and calls inside the microcode, and physically stores the microcode. It was a challenge to fit the microcode onto the chip with 1978 technology, so Intel used many optimization techniques to reduce the size of the microcode.

In brief, the microcode in the 8086 consists of 512 micro-instructions, each 21 bits wide. The microcode engine has a 13-bit register that steps through the microcode, along with a 13-bit subroutine register to store the return address for microcode subroutine calls. The microcode engine is assisted by two smaller ROMs: the "Group Decode ROM" to categorize machine instructions, and the "Translation ROM" to branch to microcode subroutines for address calculation and other roles. Physically, the microcode is stored in a 128×84 array. It has a special address decoder that optimizes the storage. The microcode circuitry is visible in the die photo below.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

What is microcode?

Machine instructions are generally considered the basic steps that a computer performs. However, each instruction usually requires multiple operations inside the processor. For instance, an ADD instruction may involve computing the memory address, accessing the value, moving the value to the Arithmetic-Logic Unit (ALU), computing the sum, and storing the result in a register. One of the hardest parts of computer design is creating the control logic that signals the appropriate parts of the processor for each step of an instruction. The straightforward approach is to build a circuit from flip-flops and gates that moves through the various steps and generates the control signals. However, this circuitry is complicated and error-prone.

In 1951, Maurice Wilkes came up with the idea of microcode: instead of building the control circuitry from complex logic gates, the control logic could be replaced with another layer of code (i.e. microcode) stored in a special memory called a control store. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns the processor's control logic into a programming task instead of a difficult logic design task. Microcode also permits complex instructions and a large instruction set to be implemented without making the processor more complex (apart from the size of the microcode). Finally, it is generally easier to fix a bug in microcode than in circuit logic.

Early computers didn't use microcode, largely due to the lack of good storage technologies to hold the microcode. This changed in the 1960s; for example, IBM made extensive use of microcode in the System/360 (1964). (I've written about that here.) But early microprocessors didn't use microcode, returning to hard-coded control logic with logic gates.3 This logic was generally more compact and ran faster than microcode, since the circuitry could be optimized. Since space was at a premium in early microprocessors and the instruction sets were relatively simple, this tradeoff made sense. But as microprocessor instruction sets became complex and transistors became cheaper, microcode became appealing. This led to the use of microcode in the Intel 8086 (1978) and 8088 (1979) and Motorola 68000 (1979), for instance.2

The 8086's microcode

The 8086's microcode is much simpler than in most processors, but it's still fairly complex. The code below is the microcode routine from the 8086 for a routine called "CORD", part of integer division, consisting of 16 micro-instructions. I'm not going to explain how this microcode works in detail, but I want to give a flavor of it. Each line has an address on the left (blue) and the micro-instruction on the right (yellow), specifying the low-level actions during one time step (i.e. clock cycle). Each micro-instruction performs a move, transferring data from a source register (S) to a destination register (D). (The source Σ indicates the ALU output.) For parallelism, the micro-instruction performs an operation or two at the same time as the move. This operation is specified by the "a" and "b" fields; their meanings depend on the type field. For instance, type 1 indicates an ALU instruction such as subtract (SUBT) or left-rotate through carry (LRCY). Type 4 selects two general operations such as "RTN" which returns from a microcode subroutine. Type 0 indicates a jump operation; "UNC 10" is an unconditional jump to line 10 while "CY 13" jumps to line 13 if the carry flag is set. Finally, the "F" field indicates if the condition code flags should be updated. The key points are that the micro-instructions are simple and execute in one clock cycle, they can perform multiple operations in parallel to maximize performance, and they include control-flow operations such as conditional jumps and subroutines.

An example of a microcode routine. The CORD routine implements integer division with subtracts and left rotates. This is from patent 4,449,184.
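To give a flavor of what CORD computes, here is the textbook shift-and-subtract division idea in ordinary Python. This is a sketch of the algorithm's principle, not a transcription of the microcode, which works with rotates and the carry flag instead.

```python
# Restoring shift-and-subtract division: bring in one dividend bit at a time,
# try subtracting the divisor, and record a quotient bit when it fits.
def divide(dividend, divisor, bits=16):
    quotient = 0
    remainder = 0
    for i in range(bits - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # shift in a bit
        quotient <<= 1
        if remainder >= divisor:   # trial subtraction succeeds
            remainder -= divisor
            quotient |= 1
    return quotient, remainder

print(divide(100, 7))  # (14, 2)
```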

Each micro-instruction is stored at a 13-bit address (blue) which consists of 9 bits shown explicitly and a 4-bit sequence counter "CR". The eight numbered address bits usually correspond to the machine instruction's opcode. The "X" bit is an extra bit to provide more address space for code that is not directly tied to a machine instruction, such as reset and interrupt code, address computation, and the multiply/divide algorithms.

A micro-instruction is encoded into 21 bits as shown below. Every micro-instruction contains a move from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits is a bit tricky since it depends on the type field, which is two or three bits long. The "short jump" (type 0) is a conditional jump within the current block of 16 micro-instructions. The ALU operation (type 1) sets up the arithmetic-logic unit to perform an operation. Bookkeeping operations (type 4) are anything from flushing the prefetch queue to ending the current instruction. A memory read or write is type 6. A "long jump" (type 5) is a conditional jump to any of 16 fixed microcode locations (specified in an external table). Finally, a "long call" (type 7) is a conditional subroutine call to one of 16 locations (different from the jump targets).

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?
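A sketch of unpacking these fields: the field widths (5-bit source, 5-bit destination, 2- or 3-bit type) come from the encoding above, but the bit positions and the treatment of the type field below are my own assumptions for illustration; the real ROM's ordering may differ.

```python
# Hypothetical unpacking of a 21-bit micro-instruction word. Types 4-7 use a
# 3-bit type field; a leading 0 bit instead indicates a 2-bit type (0 or 1),
# freeing the third bit for other use -- an assumed scheme, for illustration.
TYPE_NAMES = {0: "short jump", 1: "ALU operation", 4: "bookkeeping",
              5: "long jump", 6: "memory read/write", 7: "long call"}

def decode(word):
    source = (word >> 16) & 0x1F  # 5-bit source register (assumed position)
    dest = (word >> 11) & 0x1F    # 5-bit destination register
    type3 = (word >> 8) & 0x7     # read three bits of type field
    typ = type3 if type3 >= 4 else type3 >> 1
    return source, dest, TYPE_NAMES[typ]

word = (3 << 16) | (5 << 11) | (0b101 << 8)  # hypothetical micro-instruction
print(decode(word))  # (3, 5, 'long jump')
```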

This "vertical" microcode format reduces the storage required for the microcode by encoding control signals into various fields. However, it requires some decoding logic to process the fields and generate the low-level control signals. Surprisingly, there's no specific "microcode decoder" circuit. Instead, the logic is scattered across the chip, looking for various microcode bit patterns to generate control signals where they are needed.

How instructions map onto the ROM

One interesting issue is how the micro-instructions are organized in the ROM, and how the right micro-instructions are executed for a particular machine instruction. The 8086 uses a clever mapping from the machine instruction to a microcode address that allows machine instructions to share microcode.

Different processors use a variety of approaches to microcode organization. One technique is for each micro-instruction to contain a field with the address of the next micro-instruction. This provides complete flexibility for the arrangement of micro-instructions, but requires a field to hold the address, increasing the number of bits in each micro-instruction. A common alternative is to execute micro-instructions sequentially, with a micro-program-counter stepping through each micro-address unless there is an explicit jump to a new address. This approach avoids the cost of an address field in each instruction, but requires a program counter with an incrementer, increasing the hardware complexity.

The 8086 uses a hybrid approach. A 4-bit program counter steps through the bottom 4 bits of the address, so up to 16 micro-instructions can be executed in sequence without a jump. This approach has the advantage of requiring a smaller 4-bit incrementer for the program counter, rather than a 13-bit incrementer. The microcode engine provides a "short jump" operation that makes it easy to jump within the group of 16 instructions using a 4-bit jump target, rather than a full 13-bit address.
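The address update under this scheme can be sketched as follows, assuming the 13-bit address is simply the 9 fixed bits concatenated with the 4-bit sequence counter:

```python
# Only the low 4 bits increment, so execution wraps within a block of 16
# micro-instructions unless a jump loads a completely new address.
def next_address(addr):
    high = addr & ~0xF       # upper 9 bits stay fixed
    low = (addr + 1) & 0xF   # 4-bit incrementer wraps at 16
    return high | low

print(hex(next_address(0x105)))  # 0x106
print(hex(next_address(0x10f)))  # 0x100 -- wraps within the block
```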

Another important design decision in microcode is how to determine the starting micro-address for each machine instruction. In other words, if you want to do an ADD, where does the microcode for ADD start? One approach is a table of starting addresses: the system looks in the table to find the starting address for ADD, but this requires a large table of 256 entries. A second approach is to use the opcode value as the starting address. That is, an ADD instruction 0x05 would start at micro-address 5. This approach has two problems. First, you can't run the microcode sequentially since consecutive micro-instructions belong to different machine instructions. Moreover, you can't share microcode since each instruction has a different address in the microcode ROM.

The 8086 solves these problems in two ways. First, the machine instructions are spaced sixteen slots apart in the microcode. In other words, the opcode is multiplied by 16 (has four zeros appended) to form the starting address in the microcode ROM, so there is plenty of space to implement each machine instruction. The second technique is that the ROM's addressing is partially decoded rather than fully decoded, so multiple micro-addresses can correspond to the same physical storage.4

To make this concrete, consider the 8086's arithmetic-logic instructions: one-byte add register to memory, one-byte add memory to register, one-word subtract memory from register, one-word xor register to memory, and so forth. There are 8 ALU operations and each can be byte- or word-sized, with memory as source or destination. This yields 32 different machine opcodes. These opcodes were carefully assigned, so they all have the format 00xxx0xx. The ROM address decoder is designed to look for three 0 bits in those positions, and ignore the other bits, so it will match that pattern. The result is that all 32 of these ALU instructions activate the same ROM column select line, and thus they all share the same microcode, shrinking the size of the ROM.

The microcode ROM's physical layout

The microcode ROM holds 512 words of 21 bits, so the obvious layout would be 512 columns and 21 rows. However, these dimensions are not practical for physically building the ROM because it would be too long and skinny. Instead, the ROM is constructed by grouping four words in each column, resulting in 128 columns of 84 rows, much closer to square. Not only does this make the physical layout more convenient, but it also reduces the number of column decoders from 512 to 128, reducing the circuitry size. Although the ROM now requires 21 multiplexers to select which of the four rows corresponds to each output bit, the circuitry is still much smaller. There is a tradeoff with the ability to merge addresses together by ignoring bits, though. Each decoder now selects a column of four words, rather than a single word, so each block of four words must have consecutive addresses.

The main components of the microcode engine. The metal layer has been removed to show the silicon and polysilicon underneath. If you zoom in, the bit pattern is visible in the silicon doping pattern.

The image above shows how microcode is stored and accessed. At the top is the 13-bit microcode address register, which will be discussed in detail below. The column selection circuit decodes 11 of the 13 address bits to select one column of the microcode storage. At the left, multiplexers select one bit out of each four rows using the two remaining address bits (specifically, the two lowest sequence bits). The selected 21 microcode outputs are latched and fed to the rest of the processor, where they are decoded as described earlier and control the processor's actions.

Optimizing the microcode

In 1978, the number of bits that could be stored in the microcode ROM was rather limited. In particular, the 8086 holds only 512 micro-instructions. Since it has approximately 256 machine-code instructions in its one-byte opcode space, combined with multiple addressing modes, and each instruction requires multiple micro-instructions, compression and optimization were necessary to make the microcode fit.5 The main idea was to move functionality out of the microcode and into discrete logic when it made sense. I'll describe some of the ways they did this.

The 8086 has an arithmetic-logic unit (ALU) that performs operations such as addition and subtraction, as well as logical operations such as AND and XOR. Consider the machine instruction ADD, implemented with a few micro-operations that compute the memory address, fetch data, perform the addition, and store the result. The machine instructions for subtraction, AND, or XOR require identical steps, except that the ALU performs a different operation. In total, the 8086 has eight ALU-based operations that are identical except for the operation performed by the ALU.6 The 8086 uses a "trick" where these eight machine instructions share the same microcode. Specifically, the microcode tells the ALU to perform a special operation XI, which indicates that the ALU should look at the appropriate bits of the instruction and do the appropriate operation.7 This shrinks the microcode for these operations by a factor of eight, at the cost of requiring additional logic for the ALU. In particular, the ALU control circuitry has a register to hold the relevant instruction bits, and a PLA to decode these bits into low-level ALU control signals.
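The XI dispatch can be modeled as a table lookup on opcode bits 5-3. This is a sketch: the real selection is done by a register and PLA next to the ALU, as described in footnote 7:

```python
# Sketch of the "XI" idea: the microcode issues one generic ALU
# command, and separate circuitry derives the actual operation from
# bits 5-3 of the opcode. The table follows the 8086's standard ALU
# operation encoding.

ALU_OPS = ["ADD", "OR", "ADC", "SBB", "AND", "SUB", "XOR", "CMP"]

def decode_xi(opcode):
    # Bits 5-3 of the opcode select which of the 8 operations to run.
    return ALU_OPS[(opcode >> 3) & 7]
```

Eight opcodes that differ only in these three bits thus execute identical microcode, with the ALU control logic supplying the difference.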

Similarly, the 8086 has eight machine instructions to increment a specific register (out of a set of 8), and eight instructions to decrement a register. All 16 instructions are handled by the same set of micro-instructions and the ALU does the increment or decrement as appropriate. Moreover, the register control circuitry determines which register is specified by the instruction, without involvement from the microcode.

Another optimization is that the 8086 has many machine instructions in pairs: an 8-bit version and a 16-bit version. One approach would be to have separate microcode for the two instructions, one to handle a single byte and one to handle two bytes. Instead, the machine instructions share microcode. The complexity is moved to the circuitry that moves data on the bus: it looks at the low bit of the instruction to determine if it should process a byte or a word. This cuts the microcode size in half for the many affected instructions.
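The byte/word selection amounts to testing the low opcode bit (the "W" bit); a minimal sketch, with the function name invented:

```python
# Sketch of byte/word sharing: bit 0 of the opcode tells the bus
# circuitry whether to transfer one byte or two. The microcode is
# identical for both widths.

def operand_width(opcode):
    return 2 if opcode & 1 else 1  # bytes moved on the bus
```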

Finally, simple instructions that can take place in one cycle are implemented with logic gates, rather than through microcode. For instance, the CLC (clear carry flag) instruction updates the flag directly. Similarly, prefix instructions for segment selection, instruction locking, or repetition are performed in logic. These instructions don't use any microcode at all, which will be important below.

Using techniques such as these, about 75 different instruction types are implemented in the microcode (instead of about 256), making the microcode much smaller. The tradeoff is that the 8086 requires more logic circuitry, but the designers found the tradeoff to be worthwhile.

The ModR/M byte

There's another complication for 8086 microcode, however. Most 8086 instructions have a second byte: the ModR/M byte, which controls the addressing mode for the instructions in a complex way (shown below). This byte gives 8086 instructions a lot of flexibility: you can use two registers, a register and a memory location, or a register and an "immediate" value specified in the instruction. The memory location can be specified by 8 index register combinations with a one-byte or two-byte displacement optionally added. (This is useful for accessing data in an array or structure, for instance.) Although these addressing modes are powerful, they pose a problem for the microcode.

A summary of the ModR/M byte, from MCS-86 Assembly Language Reference Guide.

These different addressing modes need to be implemented in microcode, since different addressing modes require different sequences of steps. In other words, you can't use the previous trick of pushing the problem into logic gates. And you clearly don't want a separate implementation of each instruction for each addressing mode since the size of the microcode would multiply out of control.

The solution is to use a subroutine (in microcode) to compute the memory address. Thus, instructions can share the microcode for each addressing mode. This adds a lot of complexity to the microcode engine, however, since it needs to store the micro-address for a micro-subroutine-call so it can return to the right location. To support this, the microcode engine has a register to hold this return address. (Since it doesn't have a full stack, you can't perform nested subroutine calls, but this isn't a significant limitation.)
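The one-level subroutine mechanism can be modeled with a single return register. This is a toy model; the class and field names are invented:

```python
# Toy model of the 8086's microcode subroutine mechanism: a single
# return register instead of a stack, so calls cannot nest.

class MicroEngine:
    def __init__(self):
        self.pc = 0
        self.return_reg = None  # holds at most one return address

    def call(self, target):
        self.return_reg = self.pc  # save the current micro-address
        self.pc = target

    def ret(self):
        self.pc = self.return_reg  # restore; no stack, so no nesting
```

A second call before a return would overwrite `return_reg`, which is exactly the limitation the text describes.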

The microcode ends up having about 10 subroutines for the different addressing modes, as well as four routines for the different sizes of displacement. (The 8 possibilities for source registers are handled in the register selection logic, rather than microcode.) Thus, the microcode handles the 256 different addressing modes with about 14 short routines that add the appropriate address register(s) and the displacement to obtain the memory address.

One more complication is that machine instructions can switch the source and destination specified by the ModR/M byte, depending on the opcode. For example, one subtract instruction will subtract a memory location from a register, while a different subtract instruction subtracts a register from a memory location. The two variants are distinguished by bit 1 of the instruction, the "direction" bit. These variants are handled by the control logic, so the microcode can ignore them. Specifically, before the source and destination specifications go to the register control circuitry, a crossover circuit can swap them based on the value of the direction bit.
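The crossover can be sketched as a conditional swap on the direction bit (illustrative only; the operand names are placeholders):

```python
# Sketch of the direction-bit crossover: bit 1 of the opcode decides
# whether the reg field or the r/m field is the destination, and a
# swap before the register control circuitry hides this from the
# microcode.

def apply_direction(opcode, reg_spec, rm_spec):
    if opcode & 0b10:             # D=1: register is the destination
        return rm_spec, reg_spec  # (source, destination)
    return reg_spec, rm_spec      # D=0: register is the source
```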

The Translation ROM

As explained above, the starting address for a machine instruction is derived directly from the instruction's opcode. However, the microcode engine needs a mechanism to provide the address for jump and call operations. In the 8086, this address is hard-coded into the Translation ROM, which provides a 13-bit address.8 It holds ten destination addresses for jump operations and ten (different) addresses for call operations.

A second role of the Translation ROM is to hold target addresses for each ModR/M addressing mode, pointing to the code to compute the effective address. As a complication, two of the jump table entries in the Translation ROM are implemented with conditional logic, depending on whether or not the instruction's memory address calculation includes a displacement. By wiring this condition into the Translation ROM, the microcode avoids the need to test this condition.

The image below shows how the Translation ROM appears on the die. It is implemented as a partially-decoded ROM with multiplexed inputs.9 The inputs are at the bottom left. For a jump or call, the ROM uses 4 input bits from the microcode output, since the microcode selects the jump targets. For an address computation, it takes 5 bits from the instruction's ModR/M byte, so the routine is selected by the instruction. The ROM has additional input bits to select the mode (jump, call, or address) and for the conditional jumps. The decoding logic (left half) activates a row in the right half, generating the output address. This address exits at the bottom and is loaded into the micro-address register below the Translation ROM.

The Translation ROM holds addresses of routines in the microcode.

The Group Decode ROM

In the discussion above, I've discussed how various categories of instructions are optimized. For instance, many instructions have a bit that selects if they act on a byte or a word. Many instructions have a bit to reverse the direction of the operation's memory and register accesses. These features are implemented in logic rather than microcode. Other instructions are implemented outside microcode entirely. How does the 8086 determine which way to process an instruction?

The Group Decode ROM takes an instruction opcode and generates 15 signals that indicate various categories of instructions that are handled differently.10 The outputs from the Group Decode ROM are used by various logic circuits to determine how to handle the instruction. Some cases affect the microcode, for instance calling a microcode addressing routine if the instruction has a ModR/M byte. In other cases, these signals act "downstream" of the microcode, for example to determine if the operation should act on a byte or a word. Other signals cause the microcode to be bypassed completely.

A closeup of the Group Decode ROM. The circuit uses two layers of NOR gates to generate the output signals from the opcode inputs. This image shows a composite of the metal, polysilicon, and silicon layers.

Specially-encoded instructions

For most of the 8086 instructions, the first byte specifies the instruction. However, the 8086 has a few instructions where the ModR/M byte completely changes the meaning of the first byte. For instance, opcode 0xF6 (Grp 1 below) can be a TEST, NOT, NEG, MUL, IMUL, DIV, or IDIV instruction based on the value of the ModR/M byte. Similarly, opcode 0xFE (Grp 2) indicates an INC, DEC, CALL, JMP, or PUSH instruction.11

The 8086 instruction map for opcodes 0xF0 to 0xFF. Based on MCS-86 Assembly Language Reference Guide.

This encoding may seem a bit random, but there's a reason behind it. Most instructions act on a source and a destination. But some, such as INC (increment) use the same register or memory location for the source and the destination. Others, such as CALL or JMP, only use one address. Thus, the "reg" field in the ModR/M byte is redundant. Since these bits would be otherwise "wasted", they are used instead to specify different instructions. (There are only 256 single-byte opcodes, so you want to make the best use of them.)

The implementation of these instructions in microcode is interesting. Since the instructions share the same first byte, the standard microcode mapping would put them at the same microcode address. However, these instructions are treated specially, with the "reg" field from the ModR/M byte copied into the lower bits of the microcode address. In effect, the instructions are treated as opcodes 0xF0 through 0xFF, so the different instruction variants execute at separate microcode addresses. You might expect a collision with the opcodes that really have the values 0xF0 through 0xFF. However, the 8086 opcodes were cleverly arranged so none of the other instructions in this range use microcode. As you can see above, the other instructions are prefixes (LOCK, REP, REPZ), halt (HLT), or flag operations (CMC, CLC, STC, CLI, STI, CLD, STD), all implemented outside microcode. Thus, the range 0xF0-0xFF is freed up for the "expanded" instructions.

The hardware implementation for this is not too complex. The Group Decode ROM produces an output for these special instructions. This output causes the microcode address register to load the appropriate bits from the ModR/M byte, so the appropriate microcode routine is executed.
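One plausible reading of this address substitution, sketched in Python (the exact bit-level merging is an assumption on my part, not taken from the die):

```python
# Hypothetical sketch of the "group" instruction mapping: the 3-bit
# reg field of the ModR/M byte replaces the low opcode bits, so the
# variants of 0xF6/0xF7 and 0xFE/0xFF land at distinct microcode
# addresses in the otherwise-unused 0xF0-0xFF range.

GROUP_OPCODES = (0xF6, 0xF7, 0xFE, 0xFF)

def effective_opcode(opcode, modrm):
    if opcode in GROUP_OPCODES:
        reg = (modrm >> 3) & 7        # reg field selects the variant
        return (opcode & 0xF8) | reg  # substitute into the low bits
    return opcode                     # normal instructions unchanged
```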

The microcode address register

The heart of the microcode engine is the microcode address register, which determines which microcode address to execute. As described earlier, the microcode address is 13 bits, of which 8 bits generally correspond to the instruction opcode, one bit is an extra "X" instruction bit, and 4 bits are sequentially incremented. The diagram below shows how the circuitry for the bits is arranged. The 9 instruction bits each have a nearly-identical circuit. The sequence bits have more circuitry and each one is different, because the circuit to increment the address is different for each bit.

Layout of the microcode address register. Each bit has a roughly vertical block of circuitry.

The schematic below shows the circuitry for one bit in the microcode address register. It has two flip-flops: one to hold the current address bit and one to hold the old address while performing a subroutine call. A multiplexer (mux) selects the input to each flip-flop. For instance, if the microcode is waiting for a memory access, the "hold" input to the multiplexer causes the current address to loop around and get reloaded into the flip-flop. For a subroutine call, the "call" input saves the current address in the subroutine flip-flop. Conversely, when returning from a subroutine, the "return" input loads the old address from the subroutine flip-flop. The address flip-flop also has inputs to load the instruction as the address, to load an address from the translation ROM, or to load an interrupt microcode handler address. The circuit sends the address bit (and inverted address bit) to the microcode ROM's address decoder.

Schematic of a typical bit in the microcode address register.

Each bit has some special-case handling, so this schematic should be viewed as an illustration, not an accurate wiring diagram. In particular, the sequence bits also have inputs from the incrementer, so they can step to the next address. The low-order bits have instruction inputs to handle the specially-encoded "group" instructions discussed in the previous section.

The control signals for the multiplexers are generated from various sources. A circuit called the loader starts processing of an instruction, synchronized to the prefetch queue and instruction fetch from memory. The call and return operations are microcode instructions. The Group Decode ROM controls some of the inputs, for instance to process a ModR/M byte. Thus, there is a moderate amount of conditional logic that determines the microcode address and thus what microcode gets executed.

Conclusions

This has been a lot of material, so thank you for sticking with it to the end. I draw three conclusions from studying the microcode engine of the 8086. First, the implementation of microcode is considerably more complex than the clean description of microcode that is presented in books. A lot of functionality is implemented in logic outside of microcode, so it's not a "pure" microcode implementation. Moreover, there are many optimizations and corner cases. The microcode engine has two supporting ROMs: the Translation ROM and the Group Decode ROM. Even the microcode address register has complications.

Second, the need for all these optimizations shows how the 8086 was just on the edge of what was practical. The designers clearly went to a lot of effort to get the microcode to fit in the space available.

Finally, looking at the 8086 in detail shows how complex its instruction set is. I knew in the abstract that it was much more convoluted than, say, an ARM chip. But seeing all the special case circuitry on the die to handle the corner cases of the instruction set really makes this clear.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected]. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

  1. The 8086 microcode was disassembled (link) a couple of years ago by Andrew Jenner by extracting the bits from my die photos. My post here is a bit different, looking at hardware that runs the microcode, rather than the contents of the microcode. 

  2. According to Wikipedia, the Zilog Z8000 (1979) didn't use microcode, which is a bit surprising for that timeframe. This design decision had the advantage of reducing transistor count, but the disadvantage of hard-to-fix logic bugs in the instruction decoder. 

  3. As an example of a non-microcoded processor, the MOS 6502 (1975) used a PLA (Programmable Logic Array) to perform much of the decoding (details). A PLA provides a structured way of implementing logic gates in a relatively dense array. A PLA is kind of like a ROM—a PLA can implement a ROM and vice versa—so it can be hard to see the difference. The usual distinction is that only one row of a ROM is active at a time, the address that you're reading out. A PLA is more general since multiple rows can be active at a time, combined to form the outputs.

    The Z80 had a slightly different implementation. It used a smaller PLA to perform basic decoding of instructions into various types. It then generated control signals through a large amount of "random" logic (so-called because of its appearance, not because it's actually random). This logic combined instruction types and timing signals to generate the appropriate control signals. 

  4. In a "normal" ROM, the address bits are decoded with logic gates so each unique address selects a different storage column in the ROM. However, parts of the decoder circuitry can "ignore" bits so multiple addresses select the same storage column. For a hypothetical example, suppose you're decoding 5-bit addresses. Instead of decoding 11010 and 11011 separately, you could ignore the last bit so both addresses access the same ROM data. Or you could ignore the last three bits so all 8 addresses of the form 11xxx go to the same ROM location (where x indicates a bit that can be either 0 or 1). This makes the ROM more like a PLA (Programmable Logic Array), but still accessing a single row at a time. 

  5. The Intel 8087 was the floating point co-processor for the 8086. The 8087 required a lot of microcode, more than could fit in a standard ROM on the die. To get the microcode to fit, Intel created a special ROM that stored two bits per transistor (instead of one) by using four different sizes of transistors to generate four different voltages. Analog circuitry converted each voltage level into two bits. This complex technique doubled the density (at least in theory), allowing the microcode to fit on the chip. I wrote about the 8087's non-binary ROM here.

  6. The ALU operations that are grouped together are add, add with carry, subtract, subtract with borrow, logical AND, logical XOR, logical OR, and compare. The compare operation may seem out of place in this list, but it is implemented as a subtract operation that updates the condition flags without changing the value. Operations such as increment, decrement, negation, and logical NOT may seem like they should be included, but since they operate on a single argument instead of two, they are implemented differently at the microcode level. Increment and decrement are combined in the microcode, however. Negation and logical NOT could be combined except that negation affects the condition code flags, while NOT doesn't, so they need separate microcode. (This illustrates how subtle features of the instruction set can have more impact than you might expect.) Since the ALU doesn't have hardware multiplication and division, the multiplication and division operations are implemented separately in microcode with loops. 

  7. The ALU itself isn't examining instruction bits to decide what to do. There's some control circuitry next to the ALU that uses a PLA (Programmable Logic Array) to examine the instruction bits and the microcode's ALU command to generate low-level control signals for the ALU. These signals control things such as carry propagation, argument negation, and logical operation selection to cause the ALU to perform the desired operation. 

  8. The Translation ROM has one additional output: a wire indicating an address mode that uses the BP register. This output goes to the segment register selection circuitry and selects a different segment register. The reason is that the 8086 uses the Data Segment by default for effective address computation, unless the address mode uses BP as a base register. In that case, the Stack Segment is used. This is an example of how the 8086 architecture is not orthogonal and has lots of corner cases. 

  9. You can also view the Translation ROM as a PLA (Programmable Logic Array) constructed from two layers of NOR gates. The conditional entries make it seem more like a PLA than a ROM. Technically, it can be considered a ROM since a single row is active at a time. I'm using the name "Translation ROM" because that's what Intel calls it in the patents. 

  10. Although the Group Decode ROM is called a ROM in the patent, I'd consider it more of a PLA (programmable logic array). Conceptually it holds 256 words, one for each instruction. But its implementation is an array of logic functions. 

  11. These instructions were called "Group 1" and "Group 2" instructions in the 8086 documentation. Later Intel documentation renamed them as "Unary Group 3", "INC/DEC Group 4" and "Indirect Group 5". Some details are here. The 8086 has two other groups of instructions where the reg field defines the instruction: the "Immediate" instructions 0x80-0x83 and the "Shift" instructions 0xD0-0xD3. For these opcodes, the different instructions were implemented by the ALU. As far as the microcode was concerned, these were "normal" instructions so I won't discuss them in this post.

    I should mention that although the 8086 opcodes are always expressed in hexadecimal, the encoding makes much more sense if you look at it in octal. Details here. The octal encoding also applies to other related chips including the 8008, 8080, and Z80. 

The unusual bootstrap drivers inside the 8086 microprocessor chip

The 8086 microprocessor is one of the most important chips ever created; it started the x86 architecture that still dominates desktop and server computing today. I've been reverse-engineering its circuitry by studying its silicon die. One of the most unusual circuits I found is a "bootstrap driver", a way to boost internal signals to improve performance.1

The bootstrap driver circuit from the 8086 processor.

This circuit consists of just three NMOS transistors, amplifying an input signal to produce an output signal, but it doesn't resemble typical NMOS logic circuits and puzzled me for a long time. Eventually, I stumbled across an explanation:2 the "bootstrap driver" uses the transistor's capacitance to boost its voltage. It produces control pulses with higher current and higher voltage than otherwise possible, increasing performance. In this blog post, I'll attempt to explain how the tricky bootstrap driver circuit works.

A die photo of the 8086 processor. The metal layer on top of the silicon is visible. Around the edge of the chip, bond wires provide connections to the chip's external pins. Click this image (or any other) for a larger version.

NMOS transistors

The 8086 is built from MOS transistors (MOSFETs), specifically NMOS transistors. Understanding the bootstrap driver requires some understanding of these transistors. If you're familiar with MOSFETs as components, they have source and drain pins and current flows from the drain to the source, controlled by the gate pin. Most of the time I treat an NMOS transistor as a digital switch between the drain and the source: a 1 input turns the transistor on, closing the switch, while a 0 turns the transistor off. However, for the bootstrap driver, we must consider the MOSFET in a bit more detail.

A MOSFET switches current from the drain to the source, under control of the gate.

The important aspect of the gate is the difference between the gate voltage and the (typically lower) source voltage; this is denoted as Vgs. Without going into semiconductor physics, a slightly more accurate model is that the transistor turns on when the voltage between the gate and the source exceeds the fixed threshold voltage, Vth. This creates a conducting channel between the transistor's source and drain. Thus, if Vgs > Vth, the transistor turns on and current flows. Otherwise, the transistor turns off and no current flows.

The voltage between the gate and the source (Vgs) controls the transistor.

The threshold voltage has an important consequence for a chip such as the 8086. The 8086, like most chips of that era, used a 5-volt power supply. The threshold voltage depends on manufacturing characteristics, but I'll use 1 volt as a typical value.3 The result is that if you put 5 volts on the drain and on the gate, the transistor can pull the source up to about 4 volts, but then Vgs falls to the threshold voltage and the transistor stops conducting. Thus, the transistor can't pull the source all the way up to the 5-volt supply, but falls short by a volt on the output. In some circumstances this is a problem, and this is the problem that the bootstrap driver fixes.
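Numerically, the threshold loss looks like this (a simplified model with an assumed 1-volt threshold; it ignores the body effect and other second-order behavior):

```python
# Simple numeric model of the threshold-voltage loss: an NMOS pass
# transistor conducts only while Vgs exceeds Vth, so the source can
# rise no higher than the gate voltage minus the threshold.

V_TH = 1.0  # volts; a typical value for this process

def pass_transistor_output(v_gate, v_drain):
    return min(v_drain, v_gate - V_TH)
```

With 5 volts on both gate and drain, the output stalls at about 4 volts, which is the problem the bootstrap driver solves.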

Due to the threshold voltage, the transistor doesn't pull the source all the way to the drain's voltage, but "loses" a volt.

If you get a transistor as a physical component, the source and drain are not interchangeable. However, in an integrated circuit, there is no difference between the source and the drain, and this will be important.4 The diagram below shows how a MOSFET is constructed on the silicon die. The source and drain consist of regions of silicon doped with impurities to change their property. Between them is a channel of undoped silicon, which normally does not conduct. Above the channel is the gate, made of a special type of silicon called polysilicon. The voltage on the gate controls the conductivity of the channel. A very thin insulating layer separates the gate from the channel. As a side effect, the insulating layer creates some capacitance between the gate and the underlying silicon.

Diagram of an NMOS transistor in an integrated circuit.

Basic NMOS circuits

Before getting to the bootstrap driver, I'll explain how a basic inverter is implemented in an NMOS chip like the 8086. The inverter is built from two transistors: a normal transistor on the bottom, and a special load transistor on top that acts like a pull-up resistor, providing a small constant current.5 With a 1 input, the lower transistor turns on, pulling the output to ground to produce a 0 output. With a 0 input, the lower transistor turns off and the current from the upper transistor drives the output high to produce a 1 output. Thus, the circuit implements an inverter: producing a 1 when the input is 0 and vice versa.

A standard NMOS inverter is built from two transistors. The upper transistor is a "depletion load" transistor.

The disadvantage of this inverter circuit is that when it produces a 0 output, current continuously flows through the load transistor and the lower transistor to ground. This wastes power, leading to high power consumption for NMOS circuitry. (To solve this, CMOS circuitry took over in the 1980s and is used in modern microprocessors.) This also limits the current that the inverter can provide.

If a gate needs to provide a relatively large current, for instance to drive a long bus inside the chip, a more complex circuit is used, the "superbuffer". The superbuffer uses one transistor to pull the output high and a second transistor to pull the output low.6 Because only one transistor is on at a time, a high-current output can be produced without wasting power. There are two disadvantages of the superbuffer, though. First, the superbuffer requires an inverter to control the high-side transistor, so it uses considerably more space on the die. Second, the superbuffer can't pull the high output all the way up; it loses a volt due to the threshold voltage as described earlier.

Combining two output transistors with an inverter produces a higher-current output, known as a superbuffer.

The bootstrap driver

In some circumstances, you want both a high-current output and the full output voltage. One example is connecting a register to an internal bus. Since the 8086 is a 16-bit chip, it uses 16 transistors for the bus connection. Driving 16 transistors in parallel requires a fairly high current. But the bus transistors are "pass" transistors, which lose a volt due to the threshold voltage, so you want to start with the full voltage, not already down one volt. To provide both high current and the full voltage, bootstrap drivers are used to control the buses, as well as for similar tasks such as ALU control.

The concept behind the bootstrap driver is to drive the gate voltage significantly higher than 5 volts, so even after losing the threshold voltage, the transistor can produce the full 5-volt output.7 The higher voltage is generated by a charge pump, as illustrated below. Suppose you charge a capacitor with 5 volts. Now, disconnect the bottom of the capacitor from ground, and connect it to +5 volts. The capacitor is still charged with 5 volts, so now the high side is at +10 volts with respect to ground. Thus, a capacitor can be used to create a higher voltage by "pumping" the charge to a higher level.
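The charge-pump arithmetic from the paragraph above is simple enough to spell out directly (an idealized calculation, ignoring charge sharing and leakage):

```python
# "Flying capacitor" charge pump, idealized. A capacitor holds its voltage
# across its plates, so raising the bottom plate raises the top plate with it.

vdd = 5.0
cap_voltage = vdd                       # capacitor charged to 5 V, bottom plate grounded
bottom_plate = vdd                      # now switch the bottom plate from 0 V to +5 V
top_plate = bottom_plate + cap_voltage  # the capacitor still holds 5 V across it
print(top_plate)                        # 10.0 volts with respect to ground
```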

On the left, the "flying capacitor' is charged to 5 volts. By switching the lower terminal to +5 volts, the capacitor now outputs +10 volts

On the left, the "flying capacitor' is charged to 5 volts. By switching the lower terminal to +5 volts, the capacitor now outputs +10 volts

The idea of the bootstrap driver is to attach a capacitor to the gate and charge it to 5 volts. Then, the low side of the capacitor is raised to 5 volts, boosting the gate side of the capacitor to 10 volts. With this high voltage on the gate, the threshold voltage is easily exceeded and the transistor can pass the full 5 volts from the drain to the source, producing a 5-volt output.

With a large voltage on the gate, the threshold voltage is exceeded and the transistor remains on until the source reaches 5 volts.

In the 8086 bootstrap driver,8 an explicit capacitor is not used.9 Instead, the transistor's inherent capacitance is sufficient. Due to the thin insulating oxide layer between the gate and the underlying silicon, the gate acts as the plate of a capacitor relative to the source and drain. This "parasitic" capacitance is usually a bad thing, but the bootstrap driver takes advantage of it.

The diagrams below show how the bootstrap driver works. Unlike an inverter, the bootstrap driver is controlled by the chip's clock, generating an output only when the clock is high. In the first diagram, we assume that the input is a 1 and the clock is low (0). Two things happen. First, the inverted clock turns on the bottom transistor, pulling the output to ground. Second, the 5V input passes through the first transistor; the left side of the transistor acts as the drain and the right side as the source. Due to the threshold voltage, a volt is "lost" so about 4 volts reaches the gate of the second transistor. Since the source and drain of the second transistor are at 0 volts, the gate capacitors are charged with 4 volts. (Recall that these are not explicit capacitors, but are parasitic capacitors.)

The first step in the operation of the bootstrap driver. The gate capacitance is charged by the input.

In the next step, the clock switches state and things become more interesting. The second transistor is on due to the voltage on the gate, so current flows from the clock to the output. In a "normal" circuit, the output would rise to 4 volts, losing a volt due to the threshold voltage of the second transistor. However, as the output voltage rises, it boosts the voltage on the gate capacitors and thus raises the gate voltage. The increased gate voltage allows the output voltage to rise above 4 volts, pushing the gate voltage even higher, until the output reaches 5 volts.10 Thus, the bootstrap driver produces a high-current output with the full 5 volts.
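The two clock phases described above can be stepped through with simple arithmetic (a sketch using the article's illustrative numbers; as note 10 says, the real gate voltage depends on capacitance ratios):

```python
# Idealized walk-through of the bootstrap driver's two clock phases.
VDD, VTH = 5.0, 1.0

# Phase 1: clock low. The 5 V input passes through the first transistor,
# losing one threshold voltage, and charges the second transistor's gate
# capacitance. The pull-down transistor holds the output at ground.
gate = VDD - VTH             # ~4 V stored on the gate capacitance
output = 0.0

# Phase 2: clock high. As the output rises, the gate capacitance couples the
# rise back onto the gate ("bootstrapping" it), so Vgs stays above threshold
# and the output can reach the full supply voltage.
output = VDD                 # full 5 V at the output...
gate = gate + output         # ...because the gate is boosted to ~9 V

assert gate - output >= VTH  # the drive transistor stays on all the way up
print(gate, output)          # ~9.0 V on the gate, 5.0 V out
```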

The second step in the operation of the bootstrap driver. As the output rises, it boosts the gate voltage even higher.

An important factor is that the first transistor now has a higher voltage on the right than on the left, so the source and drain switch roles. Since the transistor has 5 volts on the gate and on the (now) source, Vgs is 0 and current can't flow. Thus the first transistor blocks current flow from the gate, keeping the gate at its higher voltage. This is the critical role of the first transistor in the bootstrap driver, acting as a diode to block current flow out of the gate.

The diagram below shows what happens when the clock switches state again, assuming a low input. Now the first transistor's source voltage drops, making Vgs large and turning the transistor on. This allows the second transistor's gate voltage to flow out. Note that the first transistor is no longer acting as a diode, since current can flow in the "reverse" direction. The other important action in this clock phase is that the bottom transistor turns on, pulling the output low. These actions discharge the gate capacitance, preparing it for the next bootstrap cycle.
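The first transistor's two roles across these phases come down to a Vgs comparison, which can be checked directly (threshold value illustrative, matching the article's examples):

```python
# The first transistor's gate is tied to +5 V. Whether it conducts depends
# only on the gate-to-source voltage exceeding the threshold.

def conducts(v_gate, v_source, vth=1.0):
    """An NMOS transistor conducts only when Vgs = Vgate - Vsource exceeds Vth."""
    return (v_gate - v_source) > vth

# Boost phase: the input side (now the source) sits at 5 V, so Vgs = 0 and
# the boosted charge is trapped on the second transistor's gate.
print(conducts(5.0, 5.0))   # False: diode-like blocking

# Discharge phase: the input drops to 0 V, so Vgs = 5 V, the transistor turns
# on, and the gate capacitance discharges through it.
print(conducts(5.0, 0.0))   # True: "reverse" current flow is allowed
```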

When the clock switches off, the driver is discharged, preparing it for the next cycle.

The 8086 die

Now that I've explained the theory, how do bootstrap drivers appear on the silicon die of the 8086? The diagram below shows six drivers that control the ALU operation.11 There's a lot happening in this diagram, but I'll try to explain what's going on. For this photo, I removed the metal layer with acid to reveal the silicon underneath; the yellow lines show where the metal wiring was. The large pinkish regions are doped silicon, while the gray speckled lines are polysilicon on top. The greenish and reddish regions are undoped silicon, which doesn't conduct and can be ignored. A transistor is formed where a polysilicon line crosses silicon, with the source and drain on opposite sides. Note that some transistors share the source or drain region with a neighboring transistor, saving space. The circles are vias, connections between the metal and a lower layer.

Six bootstrap drivers as they appear on the chip.

The drivers start with six inputs at the right. Each input goes through a "diode" transistor with the gate tied to +5V. I've labeled two of these transistors and the other four are scattered around the image. Next, each signal goes to the gate of one of the drive transistors. These six large transistors pass the clock to the output when turned on. Note that the clock signal flows through large silicon regions, rather than "wires". Finally, each output has a pull-down transistor on the left, connecting it to ground (another large silicon region) under control of the inverted clock. The drive transistors are much larger than the other transistors, so they can provide much more current. Their size also provides the gate capacitance necessary for the operation of the bootstrap driver.

Although the six drivers in this diagram are electrically identical, each one has a different layout instead of repeating the same layout six times. This demonstrates how the layout has been optimized, moving transistors around to use space most efficiently.

In total, the 8086 has 81 bootstrap drivers, mostly controlling the register file and the ALU (arithmetic-logic unit). The die photo below shows the location of the drivers, indicated with red dots. Most of them are in the center-left of the chip, between the registers and ALU on the left and the control circuitry in the center.

The 8086 die with main functional blocks labeled. The bootstrap drivers are indicated with red dots.

Conclusions

For the most part, the 8086 uses standard NMOS logic circuits. However, a few of its circuits are unusual, and the bootstrap driver is one of them. This driver is a tricky circuit, depending on some subtle characteristics of MOS transistors, so I hope my explanation made sense. This driver illustrates how Intel used complex, special-case circuitry when necessary to get as much performance from the chip as possible.

If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier. I plan to write more about the 8086 so follow me on Twitter @kenshirriff or RSS for updates.

Notes and references

  1. Intel used a "bootstrap load" circuit in the 4004 and 8008 processors. The bootstrap load has many similarities to the bootstrap driver, using capacitance to boost the output voltage. But it is a different circuit, used in a different role. The bootstrap load was designed for PMOS circuits to boost the voltage from a pull-up transistor, using explicit capacitors, built with a process invented by Federico Faggin. I wrote about the bootstrap load here.

  2. The only explanation of a bootstrap driver that I could find is in section 2.3.1 of DRAM Circuit Design: A Tutorial. The 8086 transistors with the gate wired to +5V puzzled me for the longest time. It seemed to me that this transistor would always be on, and thus had no function. However, the high voltage of the bootstrap driver gives it a function. I was randomly reading the DRAM book and suddenly recognized that one of the circuits in that book was similar to the mysterious 8086 circuit. 

  3. The threshold voltage was considerably higher for older PMOS transistors. To get around this, old chips used considerably higher supply voltages, so "losing" the threshold voltage wasn't as much of a problem. For instance, the Intel 4004 used a 15-volt supply. 

  4. The reason that MOSFETs are symmetrical in an integrated circuit and asymmetrical as physical components is that MOSFETs really have four terminals: source, gate, drain, and the substrate (the underlying silicon on which the transistor is constructed). In component MOSFETs, the substrate is internally connected to the source, so the transistor has three pins. However, the source-substrate connection creates a diode, making the component MOSFET asymmetrical. Four-terminal MOSFETs such as the 3N155 exist but are rare. The MOnSter 6502 made use of 4-terminal MOSFET modules to implement the 6502's pass transistors. 

  5. The load transistor is a special type of transistor, a depletion transistor that is doped differently. The doping produces a negative threshold voltage, so the transistor remains on and provides a relatively constant current. See Wikipedia for more on depletion loads. 

  6. The superbuffer has some similarity with a CMOS gate. Both use separate transistors to pull the signal high or low, with only one transistor on at a time. The difference is that CMOS uses a complementary transistor, i.e. PMOS, to pull the signal high. PMOS performs better in this role than NMOS. Moreover, a PMOS transistor is turned on by a 0 on the gate. This behavior eliminates the need for the inverter in a superbuffer. 

  7. The 8086 processor also uses completely different charge pumps to create a negative voltage for a substrate bias. I discuss that use of charge pumps here.

  8. Why is it called a bootstrap driver? The term originates with footwear: boots often had boot straps on the top, physical straps to help pull the boots on. In the 1800s, the saying "No man can lift himself by his own boot straps" was used as a metaphor for the impossibility of improvement solely through one's own effort. (Pulling on the straps on your boots superficially seems like it should lift you off the ground, but is of course physically impossible.) By the mid-1940s, "bootstrap" was used in electronics to describe a circuit that started itself up through positive feedback, metaphorically pulling itself up by its bootstraps. The bootstrap driver continues this tradition, pulling itself up to a higher voltage. 

  9. Some circuits in the 8086 use physical capacitors on the die, constructed from a metal layer over silicon. The substrate bias generators use relatively large capacitors. There are also some small capacitors that appear to be used for timing reasons. 

  10. The exact voltage on the gate will depend on the relative capacitances of different parts of the circuit, but I'm ignoring these factors. The voltages that I show in the diagram are illustrations of the principle, not accurate values. 

  11. Some of the 8086's bootstrap drivers pre-discharge when the clock is low and produce an output when the clock is high, while other drivers operate on the opposite clock phases. The ALU drivers in the die photo operate on the opposite phases, but I've labeled the diagram to match the previous discussion.