Showing posts with label 8086. Show all posts
Showing posts with label 8086. Show all posts

Reverse-engineering the Intel 8086 processor's HALT circuits

The 8086 processor was introduced in 1978 and has greatly influenced modern computing through the x86 architecture. One unusual instruction in this processor is HLT, which stops the processor and puts it in a halt state. In this blog post, I explain in detail how the halt circuitry is implemented and how it interacts with the 8086's architecture.

The die photo below shows the 8086 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. I've labeled the key functional blocks; the ones that are important to this discussion are darker and will be discussed in detail below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions. Both are stopped by a halt instruction.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Halt processing in the Execution Unit

In this section, I'll explain how the HLT instruction is decoded and handled in the Execution Unit. The 8086 uses a combination of lookup ROMs, logic, and microcode to implement instructions. The process starts with the loader, a state machine that provides synchronization between the prefetch queue and the decoding circuitry. When an instruction byte is available, the loader provides a signal called First Clock that loads the instruction into the Instruction Register and starts the instruction decoding process.

Before microcode gets involved, the Group Decode ROM classifies instructions by producing about 15 signals, indicating properties such as instructions with a Mod R/M byte, instructions with a byte/word bit, instructions that always act on a byte, and so forth. For the HLT instruction, the Group Decode ROM provides two important signals. The first is one-byte logic (1BL), indicating that the instruction is one byte long and is implemented with logic circuitry rather than microcode.1 The second signal is produced for the HLT instruction specifically and generates the internal HALT signal. This signal travels to various parts of the 8086 to halt the processor.

The Group Decode ROM. The yellow rectangle detects the HLT instruction, with an output at the bottom. The red rectangle generates the 1BL (one-byte logic) signal.

The Group Decode ROM. The yellow rectangle detects the HLT instruction, with an output at the bottom. The red rectangle generates the 1BL (one-byte logic) signal.

In the Execution Unit, the HALT signal blocks the reading of new instructions from the prefetch queue. This causes the loader to wait indefinitely and stops execution of new instructions. Since no new instruction replaces HLT, the Group Decode ROM continues to generate the HALT signal. The HALT signal also blocks most of the other outputs from the Group Decode ROM, preventing other decoding actions.

Thus, the Execution Unit sits idle as a result of the HLT instruction, unable to start a new instruction. Modern processors often have low-power halt modes, where part of the processor is shut down or a clock domain is stopped to reduce power consumption. The 8086, however, doesn't do anything clever to minimize power consumption in the halt mode, since this wasn't a concern for processors in the 1970s.

Halt processing in the Bus Interface Unit

Memory and I/O devices are connected to the 8086 chip through a bus that transmits address, data, and control information. The 8086's Bus Interface Unit handles reads and writes over this bus, running independently from the Execution Unit. A complete bus cycle for a read or write takes four clock periods, called T1, T2, T3, and T4,2 with specific signals on the bus for each time state.

A HLT instruction stops the Bus Interface Unit, but this takes several steps. First, the Bus Interface Unit must complete any currently-running bus cycle. Any new bus cycle must be blocked. Finally, the processor indicates the HALT state to any devices on the bus by issuing a special T1 cycle over the bus.

The main HALT control signal inside the Bus Interface Unit is something I call halt-not-hold, indicating a HALT is active, but not a HOLD. (Ignore the HOLD part for now.) This signal is activated by the HLT instruction signal from the Group Decode ROM, except it is blocked by any bus operations in progress. Once any current bus operation reaches T2, halt-not-hold gets activated and starts the halt process while the current bus cycle finishes up.

To prevent new bus activity, the halt-not-hold signal blocks new prefetch requests. The only other source of bus activity is an instruction that performs reads or writes. But the current instruction is HLT, so it won't generate any bus traffic. Thus, the Bus Interface Unit will remain idle.

The read/write control circuitry on the die with the flip-flops labeled. Metal and polysilicon were removed to show the underlying silicon.

The read/write control circuitry on the die with the flip-flops labeled. Metal and polysilicon were removed to show the underlying silicon.

The circuitry to control the bus cycle is complicated with many flip-flops and logic gates; the diagram above shows the flip-flops. I plan to write about the bus cycle circuitry in detail later, but for now, I'll give an extremely simplified description. Internally, there is a T0 state before T1 to provide a cycle to set up the bus operation. The bus timing states are controlled by a chain of flip-flops configured like a shift register with additional logic: the output from the T0 flip-flop is connected to the input of the T1 flip-flop and likewise with T2 and T3, forming a chain. A bus cycle is started by putting a 1 into the input of the T0 flip-flop.3 When the CPU's clock transitions, the flip-flop latches this signal, indicating the (internal) T0 bus state. On the next clock cycle, this 1 signal goes from the T0 flip-flop to the T1 flip-flop, creating the externally-visible T1 state. Likewise, the signal passes to the T2 and T3 flip-flops in sequence, creating the bus cycle.

A slightly different path is used to generate the special T1 signal that indicates a HALT. Once any bus activity is completed, the halt-not-hold signal puts a 1 into the T1 flip-flop through some gates. This generates the T1 signal, bypassing T0. Moreover, this signal does not propagate to the T2 flip-flop because it is blocked by halt-not-hold and some gates. Another flip-flop blocks this T1 cycle after the first cycle so halt-not-hold doesn't repeately trigger it. Overall, this special HALT T1 state looks like a special case that was hacked into the circuitry.

One complication is the bus hold feature. The 8086 supports complex bus configurations, where external devices may take control of the bus. For instance, peripherals may use the bus for direct memory access, bypassing the CPU. A device can request control of the bus, a "bus hold", through the 8086's HOLD pin.4 This causes the 8086 to electrically stop putting signals on the bus (i.e. a high-impedance, tri-state off state). This allows another device to use the bus until it releases HOLD.

Even when the CPU is halted, the CPU still has "ownership" of the bus and drives the bus with idle signals.5 If a device requests a bus hold when the CPU is halted, the halt-not-hold signal is blocked. When the device releases the hold, halt-not-hold is unblocked. This causes the 8086 to go through the special T1 cycle again, using the same flip-flop process described above. This lets listeners on the bus know that the CPU is still halted.

Exiting the halt state

The processor exits the halt state when it receives a reset, interrupt, or non-maskable interrupt. To implement this, an interrupt unblocks the instruction decoder by overriding the queue-unavailable signal. This causes the loader, which controls instruction decoding, to move into the First Clock state. Meanwhile, the interrupt causes the microcode address register to be loaded with the hardcoded microcode address of the appropriate interrupt routine. Thus, the microcode engine starts running the interrupt handler microcode.

The Instruction Register holds the 8-bit opcode that is currently being processed. It has a ninth bit that indicates if an interrupt is being processed. The Instruction Register (including the interrupt bit) is loaded on First Clock (described above). It outputs the instruction and interrupt bit to the Group Decode ROM one clock cycle later. The interrupt bit blocks regular instruction decoding by the Group Decode ROM. In particular, the HLT instruction will no longer be decoded, dropping the HALT signal throughout the CPU. In the Execution Unit, this reactivates the prefetch queue. This will allow instruction execution once the microcode finishes executing the interrupt handling code. In the Bus Interface Unit, dropping the HALT signal causes halt-not-hold to drop. This enables bus activity from the Bus Interface Unit.6

History of HALT and x86

Historically, computers usually had some sort of "stop" or "wait" instruction to stop execution at the end of a program. This goes back to the electromechanical Harvard Mark I (1944), EDSCAC (1949), and Univac I (1951), among other machines. Most (but not all) mainframes and minicomputers continued this approach.7

The HLT instruction in the 8086, like many other features, derives from the Datapoint 2200, and there's an interesting story behind that. The Datapoint 2200 was a desktop computer announced in 1970, and sold as a "programmable terminal". The processor of the Datapoint 2200 was implemented with a board of TTL integrated circuits, since this was before microprocessors. The Datapoint manufacturer talked to Intel and Texas Instruments about replacing the board of chips with a single processor chip. Texas Instruments produced the TMX 1795 microprocessor chip and Intel produced the 8008 shortly after,8 both copying the Datapoint 2200's architecture and instruction set. Datapoint didn't like the performance of these chips and decided to stick with a TTL-based processor. Texas Instruments couldn't find a customer for the TMX 1795 and abandoned it. Intel, on the other hand, sold the 8008 as an 8-bit microprocessor, creating the microprocessor market in the process. Intel improved the 8008 to create the popular 8080 microprocessor (1974). Zilog produced the more powerful Z80 (1976), backward-compatible with the 8080.

The Datapoint 2200. This is the later Model II with an improved TTL processor using the 74181 ALU chip.

The Datapoint 2200. This is the later Model II with an improved TTL processor using the 74181 ALU chip.

Intel started designing the iAPX 432 in 1975 to be their high-end 32-bit processor, a "micromainframe" that supported garbage collection and objects in the processor. The iAPX 432 was too complex for the time and as the schedule slipped, Intel decided to produce a stopgap 16-bit processor to compete with Zilog and Motorola: this processor became the 8086. To make it easier for Intel customers to move to the 8086, the processor was designed for compatibility with 8080 assembly language so it inherited much of the architecture and instruction set, although extended from 8 bits to 16 bits.9

The consequence of this history is that the 8086 inherited many features from the Datapoint 2200. The Datapoint 2200 used cheaper shift-register memory so it had a serial processor that operated on one bit at a time. This required the Datapoint 2200 to be little-endian, a feature that lives on in the x86 architecture. Since the Datapoint 2200 was marketed as a programmable terminal, it had parity calculation built into the hardware. Thus, the 8008 and descendants have a parity flag, in contrast to contemporary processors such as the 6800 and 6502 that omitted this moderately complex feature. The use of I/O ports instead of memory-mapped I/O is another feature of the Datapoint 2200 that persists in the x86, but was not used in the 6800 and 6502 and their descendants. The opcodes of the Datapoint 2200 were based on octal 3-bit fields for hardware reasons. The x86 instructions are still designed around octal, but the usual hexadecimal display obscures their structure. Finally, the Datapoint 2200's HALT instruction was exactly copied by the 800810 and persists in x86.

Conclusions

The HLT instruction seems like a simple function, but its implementation touches many parts of the 8086. It is implemented in logic circuitry, completely bypassing the microcode. The implementation became more complicated because of the 8086's four-step bus protocol, as well as interaction between halting and the bus hold feature. This illustrates how complexity creates more complexity, something the RISC processors of the 1980s tried to counter.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff. Thanks to monocasa for suggesting this topic.

Notes and references

  1. The instructions implemented outside microcode are the segment register prefixes (ES:, CS:, SS:, DS:), the other prefixes (LOCK, REPNZ, REPZ), the simple flag instructions (CMC, CLC, STC, CLI, STI, CLD, STD), and HLT. These instructions are indicated by the 1BL (one-byte logic) output from the Group Decode ROM. 

  2. The bus cycle may also include optional Tw wait states after T3 for slow memory. The memory (or I/O device) lowers the READY pin until it is ready to proceed and the Bus Interface Unit waits. I'm ignoring Tw states in this discussion to keep things simpler. 

  3. For some reason, the T-state flip-flops all hold inverted signals, so strictly speaking a 0 bit goes through the flip-flops. 

  4. The 8086 has a separate prioritized "request/grant" way for a device to obtain a bus hold, but it doesn't change the underlying hold behavior. 

  5. During a HALT, the 8086 is not actively using the bus, but it does not release the bus either; it is still electrically driving the bus. Otherwise, the bus would float to random voltages, confusing attached memory chips or other circuitry. 

  6. When the Bus Interface Unit is unhalted due to an interrupt, you might expect it to immediately start prefetching, accessing unwanted instructions. It turns out that the prefetch circuitry does try to start prefetching and reaches the internal T0 bus state. But it then gets preempted by the interrupt handler microcode, which uses the bus to send two interrupt acknowledge cycles. Immediately after, the microcode routine suspends prefetching. Thus, prefetching doesn't run until the interrupt microcode finishes and reenables prefetching. There's a lot of tricky timing in the 8086 to make everything work. 

  7. For more history of the stop instruction, see "Computer Architecture", Blaauw and Brooks, page 349. (This the same Brooks who wrote "The Mythical Man-Month" and "No Silver Bullet".) 

  8. You might wonder how the Intel 4004 fits into this history. Although many of the same people worked on both chips, they have completely different architectures. The 8008 is not at all an 8-bit version of the 4-bit 4004. 

  9. Assembly code for the 8-bit 8080 processor couldn't run directly on the 16-bit 8086. Instead, a translation program converted the 8080 assembly language to be compatible with the 8086, making some changes in the process. The 8086 dropped some of the less-useful instructions of the 8080, replacing them with multiple instructions in the translation. For instance, the 8080 had conditional subroutine call and return instructions (inherited from the Datapoint 2200), but the 8086 dropped them. 

  10. To see that the 8008 copied the Datapoint 2200's HALT instruction, note that the Datapoint had three opcodes for HALT (00, 01, and FF), which is a bit unusual. The 8008 also has three opcodes for HLT: 00, 01, and FF. Most instructions in the 8008 used the same opcode values as the Datapoint, with a few minor changes. 

Reverse-engineering the conditional jump circuitry in the 8086 processor

Intel introduced the 8086 microprocessor in 1978 and it had a huge influence on computing. I'm reverse-engineering the 8086 by examining the circuitry on its silicon die and in this blog post I take a look at how conditional jumps are implemented. Conditional jumps are an important part of any instruction set, changing the flow of execution based on a condition. Although this instruction may seem simple, it involves many parts of the CPU: the 8086 uses microcode along with special-purpose condition logic.

The die photo below shows the 8086 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. I've labeled the key functional blocks; the ones that are important to this discussion are darker and will be discussed in detail below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions. Most of the relevant circuitry is in the Execution Unit, such as the condition evaluation circuitry near the center, and the microcode in the lower right. But the Bus Interface Unit plays a part too, holding and modifying the program counter.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Microcode

Most people think of machine instructions as the basic steps that a computer performs. However, many processors (including the 8086) have another layer of software underneath: microcode. One of the hardest parts of computer design is creating the control logic that directs the processor for each step of an instruction. The straightforward approach is to build a circuit from flip-flops and gates that moves through the various steps and generates the control signals. However, this circuitry is complicated, error-prone, and hard to design.

The alternative is microcode: instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns design of control circuitry into a programming task instead of a difficult logic design task.

The 8086 uses a hybrid approach: although the 8086 uses microcode, much of the instruction functionality is implemented with gate logic. This approach removed duplication from the microcode and kept the microcode small enough for 1978 technology. In a sense, the microcode is parameterized. For instance, the microcode can specify a generic Arithmetic/Logic Unit (ALU) operation, and the gate logic determines from the instruction which ALU (Arithmetic/Logic Unit) operation to perform. More relevant to this blog post, the microcode can specify a generic conditional test and the gate logic determines which condition to use. Although this made the 8086's gate logic more complicated, the tradeoff was worthwhile.

Microcode for conditional jumps

The 8086 processor has six status flags: carry, parity, auxiliary carry, zero, sign, and overflow.1 These flags are updated by arithmetic and logic operations based on the result. The 8086 has sixteen different conditional jump instructions2 that test status flags and jump if conditions are satisfied, such as zero, less than, or odd parity. These instructions are very important since they permit if statements, loops, comparisons, and so forth.

In machine language, a conditional jump opcode is followed by a signed offset byte which specifies a location relative to the current program counter, from 127 bytes ahead to 128 bytes back. This is a fairly small range, but the benefit is that the offset fits in a single byte, reducing the code size.3 For typical applications such as loops or conditional code, jumps usually stay in the same neighborhood of code, so the tradeoff is worthwhile.

The 8086's microcode was disassembled by Andrew Jenner (link) from my die photos, so we can see exactly what micro-instructions the 8086 is running for each machine instruction. The microcode below implements conditional jumps. In brief, the conditional jump code (Jcond) gets the branch offset byte. It tests the appropriate condition and, if satisfied, jumps to the relative jump microcode (RELJUMP). The RELJMP code adds the offset to the program counter. In either case, the microcode routine ends when it runs the next instruction (RNI).

   move       action
Jcond:
1 Q→tmpBL
2          XC    RELJMP                    
3          RNI                       

RELJMP:
4          SUSP
5          CORR                      
6 PC→tmpA  ADD   tmpA
7 Σ→PC     FLUSH RNI                       

In more detail, micro-instruction 1 (arbitrary numbering) moves a byte from the prefetch queue (Q) across the queue bus to the ALU's temporary B register.4 (Arguments for ALU operations are first stored in temporary registers, invisible to the programmer.) Instruction 2 tests the appropriate condition with XC, and jumps to the RELJMP routine if the condition is satisfied.5 Otherwise, RNI (Run Next Instruction) ends this sequence and loads the next machine instruction without jumping.

If the condition is satisfied, the relative jump routine starts with instruction 4, which suspends prefetching.6 Instruction 5 corrects the program counter value, since it normally points to the next byte to prefetch, not the next byte to execute. Instruction 6 moves the corrected program counter address to the ALU's temporary A register. It also starts an ALU operation to add temporary A and temporary B. Instruction 7 moves the sum (Σ) to the program counter. It flushes the prefetch queue, which starts up prefetching from the new PC value. Finally, RNI runs the next instruction, from the updated address.

This code supports all 16 conditional jumps because the microcode tests the generic "XC" condition. This indicates that the specific test depends on the four low bits of the opcode, and the hardware determines exactly what to test. It's important to keep the two levels straight: the machine instruction is doing a conditional jump to a different memory address, while the microcode that implements this instruction is performing a conditional jump to a different micro-address.

The timing for conditional jumps

The RNI (Run Next Instruction) micro-operation initiates processing of the next machine instruction. However, it takes a clock cycle to get the next instruction from the prefetch queue, decode it, and start the appropriate micro-instruction. This causes a wasted clock cycle before the next micro-instruction executes. To avoid this delay, most microcode routines issue a NXT micro-operation one cycle before they end. This gives the 8086 time to decode the next machine instruction so micro-instructions can run uninterrupted.

Unfortunately, the conditional jump instructions can't take advantage of NXT. The problem is that the control flow in the microcode depends on whether the conditional jump is taken or not. By the time the microcode knows it is not taking the branch, it's too late to issue NXT.

The datasheet gives the timing of a conditional jump as 4 clock cycles if the jump is not taken, and 8 clock cycles if the jump is taken. Looking at the microcode explains these timings. There are 3 micro-instructions executed if the jump is not taken, and 7 if it is taken. Because of the RNI, there is one wasted clock cycle, resulting in the documented 4 or 8 cycles in total.

The conditions

At this point I will review the 8086's conditional jumps. The 8086 implements 16 conditional jumps. (This is a large number compared to earlier CPUs: the 8080, 6502, and Z80 all had 8 conditional jumps, specified by 3 bits.) The table below shows which flags are tested for each condition, specified by the low four bits of the opcode. Some jump instructions have multiple names depending on the programmer's interpretation, but they map to the same machine instruction.7

ConditionBitsCondition trueCondition false
Overflow Flag (OF)=1000xoverflow (JO)not overflow (JNO)
Carry Flag (CF)=1001xcarry (JC)
below (JB)
not above or equal (JNAE)
not carry (JNC)
not below (JNB)
above or equal (JAE)
Zero Flag (ZF)=1010xzero (JZ)
equal (JE)
not zero (JNZ)
not equal (JNE)
CF=1 or ZF=1011xbelow or equal (JBE)
not above (JNA)
not below or equal (JNBE)
above (JA)
Sign Flag (SF)=1100xsign (JS)not sign (JNS)
Parity Flag (PF)=1101xparity (JP)
parity even (JPE)
not parity (JNP)
parity odd (JPO)
SF ≠ OF110xless (JL)
not greater or equal (JNGE)
not less (JNL)
greater or equal (JGE)
ZF=1 or SF ≠ OF111xless or equal (JLE)
not greater (JNG)
not less or equal (JNLE)
greater (JG)

From the hardware perspective, the important thing is that there are eight different condition flag tests. Each test has two jump instructions associated with it: one that jumps if the condition is true, and one that jumps if the condition is false. The low bit of the opcode selects "if true" or "if false".

The image below shows the condition evaluation circuitry as it appears on the die. There isn't much structure to it; it's just a bunch of gates. This image shows the doped silicon regions that form transistors. The numerous small polygons with a circle inside are connections between the metal layer and the polysilicon layer. Many of these connections use the silicon layer to optimize the layout.

The circuitry to compute conditions as it appears on the die. The metal and polysilicon layers have been removed for this image, showing the silicon underneath.

The circuitry to compute conditions as it appears on the die. The metal and polysilicon layers have been removed for this image, showing the silicon underneath.

This circuitry evaluates each condition by getting the instruction bits from the Instruction Register, checking the bits to match each condition, and testing if the condition is satisfied. For instance, the overflow condition (with instruction bits 000x) is computed by a NOR gate: NOR(IR3, IR2, IR1, OF'), which will be true if instruction register bits 3, 2, and 1 are zero and the Overflow Flag is 1.

The results from the individual condition tests are combined with a 7-input NOR gate, producing a result that is 0 if the specified 3-bit condition is satisfied. Finally, the "if true" and "if false" cases are handled by flipping this signal depending on the low bit of the instruction. This final result indicates if the 4-bit condition in the instruction is satisfied, and this signal is passed on to the microcode control circuitry.

One unexpected feature of the implementation is that a 7-input NOR gate combines the various conditions to test if the selected condition is satisfied. You'd expect that with eight conditions, there would be eight inputs to the NOR gate. However, there is a clever optimization that takes advantage of conditions that are combinations of clauses, for example, "less or equal". Specifically, the zero flag is tested for bit pattern 01xx (where x indicates a 0 or 1), which covers two conditions with one gate. Likewise, SF≠OF is tested for bit pattern 11xx and CF=1 is tested for bit pattern 0x1x. With these optimizations, the eight conditions are covered with seven checks. (This shows that the opcodes weren't assigned arbitrarily: the bit patterns needed to be carefully assigned for this to work.)

Back to the microcode

Before explaining how the microcode jump circuitry works, I'll briefly discuss the microcode format. A micro-instruction is encoded into 21 bits as shown below. Every micro-instruction contains a move from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits is a bit tricky since it depends on the type field, which is two or three bits long. The "short jump" (type 0) is a conditional jump within the current block of 16 micro-instructions. The ALU operation (type 1) sets up the arithmetic-logic unit to perform an operation. Bookkeeping operations (type 4) are anything from flushing the prefetch queue to ending the current instruction. A memory read or write is type 6. A "long jump" (type 5) is a conditional jump to any of 16 fixed microcode locations (specified in an external table). Finally, a "long call" (type 7) is a conditional subroutine call to one of 16 locations (different from the jump targets).

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

I'm going to focus on the XC RELJMP micro-instruction that we saw in the microcode earlier. This is a "long jump" with XC as the condition and RELJMP as the target tag. Another layer of hardware is required to implement the microcode conditions. The microcode supports 16 conditions, which are completely different from the 16 programmer-level conditions.8 Some microcode conditions test special-purpose internal flags, while others test conditions such as an interrupt, the chip's TEST pin, bit 3 of the opcode, or if the instruction has a one-byte address offset. The XC condition is one of these 16 conditions, number 15 specifically.

The conditions are evaluated by the condition PLA (Programmable Logic Array, a grid of gates), shown below. The four condition bits from the micro-instruction, along with their complements, are fed into the columns. The PLA has 16 rows, one for each condition. Each row is a NOR gate matching one bit combination (i.e. selecting a condition) and the corresponding signal value to test.9 Thus, if a particular condition is specified and is satisfied, that row will be 1. The 16 row outputs are combined by the 16-input NOR gate at the left. Thus, if the specified condition is satisfied, this output will be 0, and if the condition is unsatisfied, the output will be 1. This signal controls the jump or call micro-instruction: if the condition is satisfied, the new micro-address is loaded into the microcode address register. If the condition is not satisfied, the microcode proceeds sequentially.

The condition PLA evaluates microcode conditionals.

The condition PLA evaluates microcode conditionals.

Conclusions

To summarize, the 8086 processor implements 16 conditional jump instructions. One piece of microcode efficiently implements all 16 instructions, with gate logic determining which flags to test, depending on bits in the machine instruction. The result of this test is used by the microcode XC conditional jump, one of 16 completely different microcode-level conditions. If the XC condition is satisfied, the program counter is updated by adding the offset, jumping to the new location.

Conditional jumps are relatively straightforward instructions from the programmer's perspective, but they interact with most parts of the 8086 processor including the prefetch queue, the address adder, the ALU, microcode, and the Translation ROM. The diagram below shows the interactions for each step of the jump.

The conditional jump involves many parts of the die, shown in this diagram.

The conditional jump involves many parts of the die, shown in this diagram.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff.

Notes and references

  1. In addition to the six status flags, the 8086 has three control flags: trap, direction, and interrupt enable. These flags aren't tested by conditional branches so I won't discuss them further. 

  2. Strictly speaking, the 8086 has a few more conditional jumps. The JCXZ instruction tests if the CX register is zero. The LOOP, LOOPNZ, and LOOPZ instructions decrement the CX register and loop if it is nonzero. The last two only loop if the zero flag indicates nonzero or zero, respectively. I'm ignoring these instructions in the blog post. 

  3. Although a conditional jump only supports a small range, it's still possible to conditionally jump to a distant location by using two instructions. A conditional jump with the opposite condition can skip over a longer unconditional jump instruction. The 80386 removed this restriction by providing long-displacement conditional jumps, which could perform a 16-bit or 32-bit relative jump. 

  4. The relative offset byte is sign-extended when it is moved to the temporary B register. That is, if the top bit is high, the high byte is set to all 1's to produce a 16-bit negative value. 

  5. The details of how the microcode jumps to the RELJMP routine are interesting, but a bit of a tangent, so I've put this discussion in a footnote. For long jumps (and long calls) in microcode, the target micro-addresses are stored in the Translation ROM, and the 4-bit target tag indexes into this ROM. The motivation for this structure is that micro-addresses are 13 bits, which is a lot of bits to try to fit into a 21-bit micro-instruction. Using a 4-bit tag keeps the microcode compact, but at the cost of requiring a small ROM in the 8086.

    The translation ROM on the die.

    The translation ROM on the die.

    Above is a view of the Translation ROM, with the RELJMP entry highlighted. The left half decodes tags, while the right half provides the corresponding microcode address. The row for RELJMP is highlighted. 

  6. Much of this microcode snippet deals with the prefetch queue. To increase efficiency, the 8086 processor fetches instructions from memory before they are needed and stores them in a 6-byte prefetch queue. In most processors, the program counter points to the memory address of the next instruction to execute. However, in the 8086, the program counter advances during prefetching, so it points to the memory address of the next instruction to fetch. This discrepancy is invisible to the programmer, but the microcode needs to handle it.

    First, the microcode issues a SUSP micro-operation to suspend prefetching. This ensures that the program counter will not be changed due to more prefetching. Next, the CORR micro-operation corrects the program counter to point to the next address to execute. This correction is performed by subtracting the number of unused bytes in the prefetch queue. You might expect this correction to be performed by the Arithmetic/Logic Unit (ALU). However, the 8086 has a separate adder that is used for memory address computations: each memory access in the 8086 requires a segment register base address to be added to an offset address. This address adder is also used for program counter correction. The constant ROM holds the values -1 through -6, the appropriate constant is selected based on the number of bytes in the prefetch queue, and this constant is added to the program counter. (Interestingly, the address adder is used for program counter correction, while the ALU is used to modify the program counter for the relative jump computation.)

    The address adder has multiple uses. It is also used for updating the program counter during prefetching. It updates addresses when performing block copy operations. Finally, it updates addresses when performing an unaligned word operation. The constant ROM holds constants for these operations.

    At the end of the microcode sequence, the FLUSH micro-operation flushes the stale bytes from the prefetch queue, resets the prefetch queue pointers, and restarts prefetching. I wrote about prefetching in detail here

  7. Often, the compare (CMP) instruction will be executed to compare two numbers by subtracting and discarding the result but keeping the condition codes. One complication is that some tests make sense for signed numbers, while other tests make sense for unsigned numbers. Specifically, "greater", "greater or equal", "less", and "less or equal" make sense for signed comparisons. On the other hand, "above", "above or equal", "below", and "below or equal" make sense for unsigned comparisons.

    The 8086 supports both signed and unsigned numbers. The arithmetic operations are the same for both; it's just the programmer's interpretation that differs. For instance, consider adding hex numbers 0xfe and 0x01. Treating them as unsigned numbers, the sum is 254 + 1 = 255. But as signed numbers, -2 + 1 = -1. In either case, the processor computes the same result, 0xff, but the interpretation is different.

    The signed vs unsigned distinction matters for comparisons. For instance, as unsigned numbers, 0xfe (254) is above 0x01 (1). But as signed numbers, 0xfe (-2) is less than 0x01 (1). This is why different instructions are used to compare unsigned versus signed numbers.

    Another important factor is that the carry flag indicates an unsigned result is too large for its byte (or word), while the overflow flag indicates that a signed result is too large for its byte (or word). For instance, adding unsigned bytes 0xff (255) and 0x02 (2) yields 0x01 (1) and a carry, indicating the result is too big for a byte. However, as signed bytes this corresponds to -1 + 2 = 1, which fits in a byte, so the overflow flag is not set. Conversely, 0x7f + 0x01 = 0x80. As unsigned bytes, this corresponds to 127 + 1 = 128 which is fine. But as signed bytes, this corresponds to 127 + 1, which unexpectedly yields -128 due to overflow. Thus, the carry flag is not set, but the overflow flag is set in this case. 

  8. Short jumps have four bits to specify the condition, so they can access 16 conditions. For long jumps and long calls, one bit is "stolen" from the condition to indicate the type, so they can only access eight of the conditions. Thus, the conditions need to be assigned carefully so the necessary ones are available. 

  9. PLAs are typically uniform grids, but the grid pattern breaks down a bit in the condition PLA. The reason is that each test uses a separate signal, so there is a different signal into each row (unlike a typical PLA where each row receives the same signals). Moreover, some of the test signals are processed at the left, distorting the 16-input NOR gate. This illustrates the degree of layout optimization in the 8086, squeezing transistors in to save a bit of space. 

Counting the transistors in the 8086 processor: it's harder than you might think

How many transistors are in Intel's 8086 processor? This seems like a straightforward question, but it doesn't have a straightforward answer. Most sources say that this processor has 29,000 transistors.1 However, I have traced out every transistor from die photos and my count is 19,618. What accounts for the 9382 missing transistors?

The explanation is that when manufacturers report the transistor count of chips, typically often report "potential" transistors. Chips that include a ROM will have different numbers of transistors depending on the values stored in the ROM. Since marketing doesn't want to publish varying numbers depending on the number of 1 bits and 0 bits, they often count ROM sites: places that could have transistors, but might not. A PLA (Programmable Logic Array) has similar issues; the transistor count depends on the desired logic functions.

What are these potential transistor sites? ROMs are typically constructed as a grid of cells, with a transistor at a cell for a 1 bit, and no transistor for a 0 bit.2 In the 8086, transistors are created or not through the pattern of silicon doping. The photo below shows a closeup of the silicon layer for part of the 8086's microcode ROM. The empty regions are undoped silicon, while the other regions are doped silicon. Transistor gates are formed where vertical polysilicon lines (removed for the photo) passed over the doped silicon. Thus, depending on the data encoded into the ROM during manufacturing, the number of transistors varies.

A closeup of part of the microcode ROM. The dark circles indicate vias between the silicon and the metal on top.

A closeup of part of the microcode ROM. The dark circles indicate vias between the silicon and the metal on top.

The diagram below provides more detail, showing the microcode ROM up close. Green T's indicate transistors, while red X's indicate positions with no transistor. As you can see, the potential transistor positions form a grid, but only some of the positions are occupied by transistors. The common method for counting transistors counts all the potential positions (18 below) rather than the actual transistors that are implemented (12 below).

An extreme closeup of the microcode ROM. Green T's indicate transistors, while red X's indicate positions with no transistor.

An extreme closeup of the microcode ROM. Green T's indicate transistors, while red X's indicate positions with no transistor.

I found an Intel history that confirmed that the 8086 transistor count includes potential sites, saying "This is 29,000 transistors if all ROM and PLA available placement sites are counted." That paper gives the approximate number of (physical) transistors in the 8086 as 20,000. This number is close to my count of 19,618.

To get a transistor count that includes empty sites, I counted the number of transistor sites in the various ROMs and PLAs in the 8086 chip. This is harder than you might expect because the smaller ROMs, such as the constant ROM, have some layout optimization. The photo below shows a closeup of the constant ROM. It is essentially a grid, but has been "squeezed" slightly to optimize its layout, making it slightly irregular. I'm counting its "potential" transistors, but one could argue that it shouldn't be counted because filling in these transistors might run into problems.

Closeup of the constant ROM showing the silicon and polysilicon.

Closeup of the constant ROM showing the silicon and polysilicon.

The following table breaks down the ROM and PLA counts by subcomponent. I found a total of approximately 9659 unfilled transistor vacancies. If you add those to my transistor count, it works out to 29,277 transistors.

ComponentTransistor sitesTransistorsVacancies
Microcode1390462107694
Group Decode ROM1254603651
Translation ROM1050431619
Register PLAs465182283
ALU PLA354170184
Constant ROM20310994
Condition PLA1607486
Segment PLA904248

The image below shows these ROMs and PLAs on the die and how much the vacancies increase the transistor count. Not surprisingly, the large microcode ROM and its decoding PLA are responsible for most of the vacancies.

The 8086 die with transistor vacancy counts and how much they contribute to the final transistor count. (Click this image or any other for a larger version.)

The 8086 die with transistor vacancy counts and how much they contribute to the final transistor count. (Click this image or any other for a larger version.)

Potential exclusions

So are my counts of 19,618 transistors and 29,277 transistor sites correct? There are some subtleties that could lower this count slightly. First, the output pins use large, high-current transistors. Each output transistor is constructed from more than a dozen transistors wired in parallel. Should this be counted as a dozen transistors or a single transistor? I'm counting the component transistors.

An output pad with a bond wire attached. Driver transistors next to the pad are constructed from multiple transistors in parallel.

An output pad with a bond wire attached. Driver transistors next to the pad are constructed from multiple transistors in parallel.

The 8086 has about 43 transistors wired as diodes for various purposes. Some are input protection diodes, while others are used in the charge pump for the substrate bias generator. Should these be excluded from the transistor count? Physically they are transistors but functionally they aren't.

The 8086 is built with NMOS logic which builds gates out of active "enhancement" transistors as well as "depletion" transistors which basically act as pull-up resistors. I count 2689 depletion-mode transistors, but you could exclude them from the count as not "real" transistors.

Conclusions

The number of transistors in a chip is harder to define than you might expect. The 8086 is commonly described as having 29,000 transistors when including empty sites in ROMs and PLAs that potentially could have a transistor. The published number of physical transistors in the 8086 is "approximately 20,000". From my counts, the 8086 has 19,618 physical transistors and 29,277 transistors when including empty sites. Given the potential uncertainties in counting, it's not surprising that Intel rounded the numbers to the nearest thousand.

The practice of counting empty transistor sites may seem like an exaggeration of the real transistor count, but there are some good reasons to count this way. Including empty sites gives a better measure of the size and complexity of the chip, since these sites take up area whether or not they are used. This number also lets one count the number of transistors before the microcode is written, and it is also stable as the microcode changes. But when looking at transistor counts, it's good to know exactly what is getting counted.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff. I discussed the transistor count in the 6502 processor here.

Notes and references

  1. For example, The 8086 Family Users Manual says on page A-210: "The central processor for the iSBC 86/12 board is Intel's 8086, a powerful 16-bit H-MOS device. The 225 sq. mil chip contains 29,000 transistors and has a clock rate of 5MHz." 

  2. ROMs can also be constructed the other way around, with a transistor indicating a 0. It's essentially an arbitrary decision, depending on whether the output buffer inverts the bit or not. Other ROM technologies may have transistors at all the sites but only connect the desired ones. 

The 8086 processor's microcode pipeline from die analysis

Intel introduced the 8086 microprocessor in 1978, and its influence still remains through the popular x86 architecture. The 8086 was a fairly complex microprocessor for its time, implementing instructions in microcode with pipelining to improve performance. This blog post explains the microcode operations for a particular instruction, "ADD immediate". As the 8086 documentation will tell you, this instruction takes four clock cycles to execute. But looking internally shows seven clock cycles of activity. How does the 8086 fit seven cycles of computation into four cycles? As I will show, the trick is pipelining.

The die photo below shows the 8086 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below, which will be important in the discussion. The Bus Interface Unit handles memory accesses (including instruction prefetching), while the Execution Unit executes instructions. The functional blocks labeled in black are the ones that are part of the discussion below. In particular, the registers and ALU (Arithmetic/Logic Unit) are at the left and the large microcode ROM is in the lower-right.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Microcode for "ADD"

Most people think of machine instructions as the basic steps that a computer performs. However, many processors (including the 8086) have another layer of software underneath: microcode. The motivation is that instructions usually require multiple steps inside the processor. One of the hardest parts of computer design is creating the control logic that directs the processor for each step of an instruction. The straightforward approach is to build a circuit from flip-flops and gates that moves through the various steps and generates the control signals. However, this circuitry is complicated, error-prone, and hard to design.

The alternative is microcode: instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns the processor's control logic into a programming task instead of a difficult logic design task.

The 8086 uses a hybrid approach: although the 8086 uses microcode, much of the instruction functionality is implemented with gate logic. This approach removed duplication from the microcode and kept the microcode small enough for 1978 technology. In a sense the microcode is parameterized. For instance, the microcode can specify a generic ALU operation, and the gate logic determines from the instruction which ALU operation to perform. Likewise, the microcode can specify a generic register and the gate logic determines which register to use. The simplest instructions (such as prefixes or condition-code operations) don't use microcode at all. Although this made the 8086's gate logic more complicated, the tradeoff was worthwhile.

The 8086's microcode was disassembled by Andrew Jenner (link) from my die photos, so we can see exactly what micro-instructions the 8086 is running for each machine instruction. In this post, I will focus on the ADD instruction, since it is fairly straightforward. In particular, the "ADD AX, immediate" instruction contains a 16-bit value that is added to the value in the 16-bit AX register. This instruction consists of three bytes: the opcode 05, followed by the two-byte immediate value. (An "immediate" value is included in the instruction, rather than coming from a register or memory location.)

This ADD instruction is implemented in the 8086's microcode as four micro-instructions, shown below. Each micro-instruction specifies a move operation across the internal ALU bus. It also specifies an action. In brief, the first two instructions get the immediate argument from the prefetch queue. The third instruction gets the argument from the AX register and starts the ALU (Arithmetic/Logic Unit) operation. The final instruction stores the result into the AX register and updates the condition flags.

µ-address    move        action
   018    Q → tmpBL     L8    2
   019    Q → tmpBH
   01a    M → tmpA      XI    tmpA, NXT
   01b    Σ → M         RNI   FLAGS

In detail, the first instruction moves a byte from the prefetch queue (Q) to one of the ALU's temporary registers, specifically the low byte of the tmpB register. (The ALU has three temporary registers to hold arguments: tmpA, tmpB, and tmpC. These temporary registers are invisible to the programmer and are unrelated to the AX, BX, CX registers.) Likewise, the second instruction fetches the high byte of the immediate value from the queue and stores it in the high byte of the ALU's tmpB register. The action in the first micro-instruction, L8, will branch to step 2 (01a) if the instruction specifies an 8-bit operation, skipping the load of the high byte. Thus, the same microcode supports the 8-bit and 16-bit ADD instructions.1

The third micro-instruction is more complicated. The move section moves the AX register's contents (indicated by M) to the accumulator's tmpA register, getting both arguments ready for the operation. XI tmpA starts an ALU operation, in this case adding tmpA to tmpB.2 Finally, NXT indicates that this is the next-to-last micro-instruction, as will be discussed below.

The last micro-instruction stores the ALU's result (Σ) into the AX register. The end of the microcode for this machine instruction is indicated by RNI (Run Next Instruction). Finally, FLAGS causes the 8086's condition flags register to be updated, indicating if the result is zero, negative, and so forth.

You may have noticed that the microcode doesn't explicitly specify the ADD operation or the AX register, using XI and M instead. This illustrates the "parameterized" microcode mentioned earlier. The microcode specifies a generic ALU operation with XI,3 and the hardware fills in the particular ALU operation from bits 5-3 of the machine instruction. Thus, the microcode above can be used for addition, subtraction, exclusive-or, comparisons, and four other arithmetic/logic operations.

The other parameterized aspect is the generic M register specification. The 8086's instruction set has a flexible way of specifying registers for the source and destination of an operation: registers are often specified by a "Mod R/M" byte, but can also be specified by bits in the first opcode. Moreover, many instructions have a bit to switch the source and destination, and another bit to specify an 8-bit or 16-bit register. The microcode can ignore all this; a micro-instruction uses M and N for the source and destination registers, and the hardware handles the details.4 The M and N values are implemented by 5-bit registers that are invisible to the programmer and specify the "real" register to use. The diagram below shows how they appear on the die.

Die photo of the circuitry that implements the M and N registers. A multiplexer selects a source for the N register value and feeds it into the 5-bit N register. The M register is similar. Between the two registers is a "swap" circuit to swap the outputs of the two registers based on the instruction's "direction" bit. In this image, the metal layer has been dissolved with acid to show the transistors in the silicon layer underneath.

Die photo of the circuitry that implements the M and N registers. A multiplexer selects a source for the N register value and feeds it into the 5-bit N register. The M register is similar. Between the two registers is a "swap" circuit to swap the outputs of the two registers based on the instruction's "direction" bit. In this image, the metal layer has been dissolved with acid to show the transistors in the silicon layer underneath.

Pipelining

The 8086 documentation says this ADD instruction takes four clock cycles, and as we have seen, it is implemented with four micro-instructions. One micro-instruction is executed per clock cycle, so the timing seems straightforward. The problem, however, is that a micro-instruction can't be completed in one clock cycle. It takes a clock cycle to read a micro-instruction from the microcode ROM. Sending signals across an internal bus typically takes a clock cycle and other actions take more time. So a typical micro-instruction ends up taking 2½ clock cycles from start to end. One solution would be to slow down the clock, so the micro-instruction can complete in one cycle, but that would drastically reduce performance. A better solution is pipelining the execution so a micro-instruction can complete every cycle.5

The idea of pipelining is to break instruction processing into "stages", so different stages can work on different instructions at the same time. It's sort of like an assembly line, where a particular car might take an hour to manufacture, but a new car comes off the assembly line every minute. The diagram below shows a simple example. Suppose executing an instruction requires three steps: A, B, and C. Executing four instructions, as shown at the top would take 12 steps in total.

Diagram of a simple pipeline showing four instructions executing through three stages.

Diagram of a simple pipeline showing four instructions executing through three stages.

However, suppose the steps can execute independently, so step B for one instruction can execute at the same time as step A for another instruction. Now, as soon as instruction 1 finishes step A and moves on to step B, instruction 2 can start step A. Next, instruction 3 starts step A as instructions 2 and 1 move to steps B and C respectively. The first instruction still takes 3 time units to complete, but after that, an instruction completes every time unit, providing a theoretical 3× speedup.6 In a bit, I will show how the 8086 uses the idea of pipelining.

The prefetch queue

The 8086 uses instruction prefetching to improve performance. Prefetching is not the focus of this article, but a brief explanation is necessary. (I wrote about the prefetch circuitry in detail earlier.) Memory accesses on the 8086 are relatively slow (at least four clock cycles), so we don't want to wait every time the processor needs a new instruction. The idea behind prefetching is that the processor fetches future instructions from memory while the CPU is busy with the current instruction. When the CPU is ready to execute the next instruction, hopefully the instruction is already in the prefetch queue and the CPU doesn't need to wait for memory. The 8086 appears to be the first microprocessor to implement prefetching.

In more detail, the 8086 fetches instructions into its prefetch queue asynchronously from instruction execution: The "Bus Interface Unit" performs prefetches, while the "Execution Unit" executes instructions. Prefetched instructions are stored in the 6-byte prefetch queue. The Q bus (short for "Queue bus") provides bytes, one at a time, from the prefetch queue to the Execution Unit.7 If the prefetch queue doesn't have a byte available when the Execution Unit needs one, the Execution Unit waits until the prefetch circuitry can complete a memory access.

The loader

To decode and execute an instruction, the Execution Unit must get instruction bytes from the prefetch queue, but this is not entirely straightforward. The main problem is that the prefetch queue can be empty, blocking execution. Second, instruction decoding is relatively slow, so for maximum performance, the decoder needs a new byte before the current instruction is finished. A circuit called the "loader" solves these problems by using a small state machine (below) to efficiently fetch bytes from the queue at the right time.

The state machine for the 8086 "loader" circuit. I'm not going to explain how it works in this post, but the diagram looks pretty cool.
From patent US4449184.

The state machine for the 8086 "loader" circuit. I'm not going to explain how it works in this post, but the diagram looks pretty cool. From patent US4449184.

The loader generates two timing signals that synchronize instruction decoding and microcode execution with the prefetch queue. The FC (First Clock) indicates that the first instruction byte is available, while the SC (Second Clock) indicates the second instruction byte. Note that the First Clock and Second Clock are not necessarily consecutive clock cycles because the first byte could be the last one in the queue, delaying the Second Clock.

At the end of a microcode sequence, the Run Next Instruction (RNI) micro-operation causes the loader to fetch the next machine instruction. However, microcode execution would be blocked for a cycle due to the delay of fetching and decoding the next instruction. In many cases, this can be avoided: if the microcode knows that it is one micro-instruction away from finishing, it issues a Next-to-last (NXT) micro-operation so the loader can start loading the next instruction before the previous instruction finishes. As will be shown in the next section, this usually allows micro-instructions to run without interruption.

Instruction execution

Putting this all together, we can see how the ADD instruction is executed, cycle by cycle. Each clock cycle starts with the clock high (H) and ends with the clock low (L).8 The sequence starts with the prefetch queue supplying the ADD instruction across the Q bus in cycle 1. The loader indicates that this is First Clock and the instruction is loaded into the microcode address register. It takes a clock cycle for the address to exit the address register (as indicated by an arrow) along with the microcode counter value indicating step 0. To remember the ALU operation, bits 5-3 of the instruction are saved in the internal X register (unrelated to the AX register).

In cycle 2, the prefetch queue has supplied the second byte of the instruction so the loader indicates Second Clock. In the second half of cycle 2, the microcode address decoder has converted the instruction-based address to the micro-address 018 and supplies it to the microcode ROM.

In cycle 3, the microcode ROM outputs the micro-instruction at micro-address 018: Q→tmpBL, which will move a byte from the prefetch queue bus (Q bus) to the low byte of the ALU temporary B register, as described earlier. It takes a full clock cycle for this action to take place, as the byte traverses buses to reach the register. This micro-instruction also generates the L8 micro-op, which will branch if an 8-bit operation is taking place. As this is a 16-bit operation, no branch takes place.9 Meanwhile, the microcode address register moves to step 1, causing the decoder to produce the micro-address 019.

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro".

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro".

In cycle 4, the prefetch queue provides a new byte, the high byte of the immediate value. The microcode ROM outputs the micro-instruction at micro-address 019: Q→tmpBH, which will move this byte from the prefetch queue bus to the high byte of the ALU temporary B register. As before, it takes a full cycle for this move to complete. Meanwhile, the microcode address register moves to step 2, causing the decoder to produce the micro-address 01a.

In cycle 5, the microcode ROM outputs the micro-instruction at micro-address 01a: M→tmpA,XI tmpA,NXT. Since the M (source) register specifies AX, the contents of the AX register will be moved into the ALU tmpA register, but this will take a cycle to complete. The XI tmpA part starts decoding the ALU operation saved in the X register, in this case ADD. Finally, NXT indicates that the next micro-instruction is the last one in this instruction. In combination with the next instruction on the Q bus, this causes the loader to issue First Clock. This starts execution of the next machine instruction, even though the current instruction is still executing.

In cycle 6, the microcode ROM outputs the micro-instruction at micro-address 01b: Σ→M,RNI. This will store the ALU output into the register indicated by M (i.e. AX), but not yet. In the first half of cycle 6, the ALU decoder determines the ALU control signals that will cause an ADD to take place. In the second half of cycle 6, the ALU receives these control signals and computes the sum. The RNI (Run Next Instruction) and the second instruction byte from the prefetch queue cause the loader to issue Second Clock, and the micro-address for the next machine instruction is sent to the microcode ROM.

Finally, in cycle 7, the sum is written to the AX register and the flags are updated, completing the ADD instruction. Meanwhile, the next instruction is well underway with its first micro-instruction being executed.

As you can see, execution of a micro-instruction is pipelined, with three full clock cycles from the arrival of an instruction until the first micro-instruction completes in cycle 4. Although this system is complex, in the best case it achieves the goal of running a micro-instruction each cycle, without gaps. (There are gaps in some cases, most commonly when the prefetch queue is empty. A gap will also occur if the microcode control flow doesn't allow a NXT micro-instruction to be issued. In that case, the loader can't issue First Clock until the RNI micro-instruction is issued, resulting in a delay.)

Conclusions

The 8086 uses multiple types of pipelining to increase performance. I've focused on the pipelining at the microcode level, but the 8086 uses at least four interlocking types of pipelining. First, microcode pipelining allows micro-instructions to complete at the rate of one per clock cycle, even though it takes multiple cycles for a micro-instruction to complete. Admittedly, this pipeline is not very deep compared to the pipelines in RISC processors; the 8086 designers called the overlap in the microcode ROM a "sort of mini-pipeline."10

The second type of pipelining overlaps instruction decoding and execution. Instruction decoding is fairly complicated on the 8086 since there are many different formats of instructions, usually depending on the second byte (Mod R/M). The loader coordinates this pipelining, issuing the First Clock and Second Clock signals so decoding on the next instruction can start before the previous instruction has completed. Third is the prefetch queue, which overlaps fetching instructions from memory with execution. This is accomplished by partitioning the processor into the Bus Interface Unit and the Execution Unit, with the prefetch queue in between. (I recently wrote about instruction prefetching in detail.)

There's a final type of pipelining that I haven't discussed. Inside the memory access sequence, computing the memory address from a segment register and offset is overlapped with the previous memory access. The result is that memory accesses appear to take four cycles, even though they really take six cycles. I plan to write more about memory access in a later post.

The 8086 was a large advance in size, performance, and architecture compared to earlier microprocessors such as the Z80 (1976), 8085 (1977), and 6809 (1978). As well as moving to 16 bits, the 8086 had a considerably more complex architecture with instruction prefetching and microcode, among other features. At the same time, the 8086 avoided the architectural overreach of Intel's ill-fated iAPX 432, a complex processor that supported garbage collection and objects in hardware. Although the 8086's architecture had flaws, it was a success and led to the x86 architecture, still dominant today.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

  1. The lowest bit of many 8086 instructions selects if the instruction operates on a byte or a word. Thus, many instructions in the instruction set appear in pairs. The support for byte operations gave the 16-bit 8086 processor compatibility with the older 8-bit 8080, if assembly code was suitably translated. 

  2. The microcode for an ALU operation can select the first operand from tmpA, tmpB, or tmpC. The second operand is always tmpB. 

  3. I don't know why Intel used XI to indicate the ALU opcode. I don't think it's the Greek letter Ξ, although they did use Σ (sigma) for the ALU output. The opcode is stored in the X register, so maybe XI is X Instruction? (It's also unclear why the register is called X.) 

  4. Normally, the internal M register specifies the source register and the N register specifies the destination register, and these two registers are loaded from the instruction. However, some instructions only use the A or AX register, depending on whether the instruction acts on bytes or words. These instructions are the ALU immediate instructions, accumulator move instructions, string instructions, and the TEST, IN, and OUT instructions. For these instructions, the Group Decode ROM activates a signal that forces the M register to specify the AX register for a 16-bit operation, or the A register for an 8-bit operation. Thus, by specifying the M register in the microcode above, the same microcode is used for instructions with an 8-bit immediate argument or a 16-bit immediate argument. This also illustrates how the designers of the 8086 kept the microcode small by moving a lot of logic into hardware. 

  5. I should mention that the pipelining in the 8086 is completely different from the parallelism in modern superscalar CPUs. The 8086 is executing instructions linearly, step-by-step, even though instructions overlap. There is only one execution path and no speculative execution, for instance. 

  6. I showed a theoretical speedup from pipelining. Several issues make the real speedup smaller. First, the steps of an instruction typically don't take the same amount of time, so you're limited by the slowest step. Second, the overhead to handle the steps adds some delay. Finally, conflicts between instructions and other "hazards" may prevent overlap in various cases. 

  7. The interaction between the prefetch queue and the Execution Unit is a "push" model rather than a "pull" model. If the prefetch queue contains a byte, the prefetch circuitry puts the byte on the Q bus and lets the Execution Unit know that a byte is available. The Execution Unit signals the prefetch circuitry when it uses a byte, and the prefetch queue moves to the next byte in the queue. If the Execution Unit needs a byte and it isn't ready, it blocks until a byte is available. The prefetch queue loads new words as it empties, when the memory bus isn't in use for other purposes. 

  8. The 8086 is active during both phases (low and high) of the clock, with things happening both while the clock is high and while it is low. One unusual feature of the 8086 is that the clock signal is asymmetrical with a 33% duty cycle, so the clock is low for twice as long as the clock is high. In other words, the 8086 does twice as much (by time) during the low part of the clock cycle as during the high part of the clock cycle. There are multiple reasons why actions take a full clock cycle to complete. Much of the circuitry uses edge-triggered flip-flops to hold state. These latch data on one clock edge and move data internally during the other part of the clock. (The 8086 uses both positive-edge and negative-edge triggered flip flops; some latch when the clock goes high and others latch when the clock goes low.) Many control signals have their voltage level boosted by a bootstrap driver circuit, driven by the clock.

    Many buses are precharged during one clock phase and then transmit a signal during the other phase. The motivation behind precharging the bus is that NMOS transistors are much better at pulling a line low than pulling it high (i.e. they can provide more current). This especially affects buses because they have relatively high capacitance due to their length, so pulling the bus high is slow. Thus, the bus is "leisurely" precharged to a high state during one clock phase, and then it can be rapidly pulled low (if the bit is a 0) and transmit the data during the other clock phase. 

  9. You might expect that the 8-bit ADD would be faster than the 16-bit ADD since it is a 2-byte instruction instead of a 3-byte instruction and one micro-instruction is skipped. However, both the 8-bit and the 16-bit ADD instructions take 4 cycles. The reason is that branching to a new micro-instruction requires updating the microcode address register, which takes a clock cycle, resulting in a wasted clock cycle where no micro-instruction is executed. (Specifically, the next micro-instruction is on the way, so it is blocked by the ROM Enable (ROME) signal going low.) The result of this is that the branch for an 8-bit ADD costs an extra cycle, which cancels out the saved cycle. (In practice, the 16-bit instruction might be slower because it needs one more byte from the prefetch queue, which could cause a delay.) Just as a branch in the machine instructions can cause a delay (a "bubble") in the instruction pipeline, a branch in the microcode causes a delay in the micro-instruction pipeline. 

  10. The design decisions for the 8086 are described in: J. McKevitt and J. Bayliss, "New options from big chips," in IEEE Spectrum, vol. 16, no. 3, pp. 28-34, March 1979, doi: 10.1109/MSPEC.1979.6367944. 

Inside the 8086 processor's instruction prefetch circuitry

The groundbreaking 8086 microprocessor was introduced by Intel in 1978 and led to the x86 architecture that still dominates desktop and server computing. One way that the 8086 increased performance was by prefetching: the processor fetches instructions from memory before they are needed, so the processor can execute them without waiting on the (relatively slow) memory. I've been reverse-engineering the 8086 from die photos and this blog post discusses what I've uncovered about the prefetch circuitry.

The 8086 was introduced at an interesting point in microprocessor history, where memory was becoming slower than the CPU. For the first microprocessors, the speed of the CPU and the speed of memory were comparable.1 However, as processors became faster, the speed of memory failed to keep up. The 8086 was probably the first microprocessor to prefetch instructions to improve performance. While modern microprocessors have megabytes of fast cache2 to act as a buffer between the CPU and much-slower main memory, the 8086 has just 6 bytes of prefetch queue. However, this was enough to increase performance by about 50%.3

The die photo below shows the 8086 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. I've labeled the key functional blocks; ones that are important to the prefetch queue are highlighted in red and will be discussed in detail below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Prefetching and the architecture of the 8086

Prefetching had a major impact on the design of the 8086. Earlier processors such as the 6502, 8080, or Z80 were deterministic. The processor fetched an instruction, executed the instruction, fetched the next instruction, and so forth. Memory accesses corresponded directly to instruction fetching and execution and instructions took a predictable number of clock cycles. This all changed with the introduction of the prefetch queue. Memory operations became unlinked from instruction execution since prefetches happen as needed and when the memory bus is available.

Since memory operations and instruction execution happen independently, the implementors of the 8086 split the chip into two processing units: the Bus Interface Unit (BIU) that handles memory accesses, and the Execution Unit (EU) that executes instructions, as shown below.4 The Bus Interface Unit contains the 6-byte instruction prefetch queue; it supplies instructions to the Execution Unit via the Q (queue) bus. The adder (Σ) performs address calculation, adding the segment register base to an address offset, among other things. The Execution Unit is what comes to mind when you think of a processor: it has most of the registers, the arithmetic/logic unit (ALU), and the microcode that implements instructions. The address adder and the ALU are independent arithmetic units. The segment registers (CS, DS, SS, ES) and the Instruction Pointer (IP) are in the Bus Interface Unit since they are directly involved in memory accesses, while the general-purpose registers are in the Execution Unit.

Block diagram of the 8086 processor.
This diagram differs from most 8086 block diagrams because it shows the actual physical implementation, rather than the programmer's view of the processor.
The "Internal Communication Registers" consist of the Indirect Register (IND) and the Operand Register (OPR). These hold a memory address and memory data value respectively.
From The 8086 Family User's Manual.

Block diagram of the 8086 processor. This diagram differs from most 8086 block diagrams because it shows the actual physical implementation, rather than the programmer's view of the processor. The "Internal Communication Registers" consist of the Indirect Register (IND) and the Operand Register (OPR). These hold a memory address and memory data value respectively. From The 8086 Family User's Manual.

The 8086's segment registers play an important part in this architecture, so I'll review them quickly. One of the challenges of the 8086 was how to support more than 64K of memory with 16-bit registers. The much-reviled solution was to create a 1-megabyte (20-bit) address space consisting of 64K segments, with segment registers indicating the start of each segment. Specifically, a memory address was specified by a 16-bit offset address along with a particular segment register selecting a segment (Code Segment, Data Segment, Stack Segment, or Extra Segment). The segment register's value was shifted by 4 bits to give the segment's 20-bit base address. The 16-bit offset address was added, yielding a 20-bit memory address. This gave the processor a 1-megabyte address space, although only 64K could be accessed without changing a segment register.

It may seem inefficient for the Bus Interface Unit to have its own adder instead of using the ALU, but there are a couple of reasons for the separate adder. First, every memory access uses the adder at least once to add the segment base and offset. The adder is also used to increment the PC or index registers. Since these operations are so frequent, they would create a bottleneck if they used the ALU. Second, since the Execution Unit and the Bus Interface Unit run asynchronously with respect to each other, it would be complicated to share the ALU without causing delays and conflicts.

Prefetching had another major but little-known effect on the 8086 architecture: the designers were considering making the 8086 a two-chip microprocessor. Prefetching, however, required a one-chip design because the number of control signals required to synchronize prefetching across two chips exceeded the package pins available. This became a compelling argument for the one-chip design that was used for the 8086.3 (The unsuccessful Intel iAPX 432, which was under development at the same time, ended up being a two-chip processor: one to fetch and decode instructions, and one to execute them.)

Implementing the queue

The instruction prefetch queue is implemented with three 16-bit queue registers along with two hardware pointers that keep track of the current position in the queue. One two-bit counter keeps track of the current read position from 0 to 2, i.e. the queue register that will provide the next instruction. The second counter keeps track of the current write position, i.e. the queue register that will receive the next instruction from memory. As words are fetched from the queue, the read pointer advances. As words are added to the queue, the write pointer advances. Because the queue registers hold words, while the prefetch circuitry provides bytes, another flip-flop keeps track of whether the high byte or the low byte of the word is being used. I call this the HL flip-flop. This causes the low byte to be provided first and then the high byte (since the 8086 is little-endian).

The diagram below shows an example queue configuration with four bytes. The first two queue registers (Q0 and Q1) hold data. The read pointer and HL pointer indicate that the next prefetched byte will come from the low byte of Q0. The write pointer indicates that the next prefetched word will go into Q2.

A queue configuration with four bytes in the prefetch queue. Bytes in blue hold prefetched data.

A queue configuration with four bytes in the prefetch queue. Bytes in blue hold prefetched data.

The diagram below shows how the queue pointers can wrap around. In this configuration, one byte has been used from Q2 so the next byte will be Q2's high byte. Q0 holds the next prefetched word. The next word to be prefetched will be stored in Q1, as indicated by the write pointer.

A queue configuration with three bytes in the prefetch queue.

A queue configuration with three bytes in the prefetch queue.

The relative positions of the write and read pointers indicate how much data is in the queue. If the write pointer is one position before the read pointer (modulo 3), the queue holds 3 or 4 bytes. If the write pointer is one position after the read pointer (modulo 3), the queue holds 1 or 2 bytes. But what about when the read pointer and write pointer indicate the same register? This can either indicate that the queue is empty or that the queue is full (5 or 6 bytes). To distinguish these cases, a flip-flop is set if the queue enters the empty state. This flip-flop generates a signal that Intel called MT (empty).

Another complication occurs if you jump to an odd address. Because of its 16-bit bus, the 8086 will prefetch a word from the even address one less. This loads one usable byte and one byte that needs to be discarded. The 8086 handles this by setting the HL flip-flop high, using a handful of gates to detect this case. As in the diagram above, the unwanted low byte will be skipped.

The diagram below zooms in on the prefetch and queue control circuitry on the die, with the main flip-flops and circuitry labeled. The lower half manages the queue, keeping track of the read and write positions and computing the queue length. The upper circuitry controls prefetch operations and interacts with the rest of the memory cycle circuitry.

The queue and prefetch circuitry on the die. The metal layer has been removed for the closeup to show the silicon of the underlying transistors.

The queue and prefetch circuitry on the die. The metal layer has been removed for the closeup to show the silicon of the underlying transistors.

Even though there is not a lot of circuitry involved (about a dozen flip-flops and associated logic gates), this circuitry occupies a substantial part of the 8086 die. (The relatively small amount of circuitry did not make this easy to reverse-engineer, however!) Compared to modern chips, the density of the 8086 is very low; you can almost see the flip-flops with the naked eye. This diagram only shows the circuitry directly involved in prefetching. Additional circuitry is scattered through the memory cycle control circuitry to deal with prefetching, and the queue registers take up a substantial part of the register file. Thus, prefetching was a moderately expensive feature for the 8086, as far as die area.

The loader

To decode and execute an instruction, the Execution Unit must get instruction bytes from the Bus Interface Unit, but this is not entirely straightforward. The main problem is that the queue can be empty, in which case instruction decoding must block until a byte is available from the queue. The second problem is that instruction decoding is relatively slow, so for maximum performance, the decoder needs a new byte before the current instruction is finished. A circuit called the "loader" solves these problems by providing synchronization between the prefetch queue and the instruction decoder. The loader uses a small state machine to efficiently fetch bytes from the queue at the right time and to provide timing signals to the decoder and microcode engine.

In more detail, as the loader requests the first two instruction bytes from the prefetch queue, it generates two timing signals that control the microcode execution. The FC (First Clock) indicates that the first instruction byte is available, while the SC (Second Clock) indicates the second instruction byte. Note that the First Clock and Second Clock are not necessarily consecutive clock cycles because the prefetch queue could be empty or contain just one byte, in which case the First Clock and/or Second Clock would be delayed. The instruction decoding circuitry and the microcode engine are controlled by the First Clock and Second Clock signals, so they remain synchronized with the bytes supplied by the prefetch queue.

At the end of a microcode sequence, the Run Next Instruction (RNI) micro-operation causes the loader to fetch the next machine instruction. However, fetching and decoding the next instruction is a bit slow so microcode execution would be blocked for a cycle. In many cases, this slowdown can be avoided: if the microcode knows that it is one micro-instruction away from finishing, it issues a Next-to-last (NXT) micro-operation so the loader can start loading the next instruction. This achieves a degree of pipelining in most cases; fetching the next instruction is overlapped with finishing the execution of the previous instruction.

The state machine for the 8086 "loader" circuit.
The 1BL signal indicates a 1-byte instruction implemented in logic rather than microcode.
From patent US4449184.

The state machine for the 8086 "loader" circuit. The 1BL signal indicates a 1-byte instruction implemented in logic rather than microcode. From patent US4449184.

The diagram above shows the state machine for the loader. I won't explain it in detail, but essentially it keeps track of whether it is waiting for a First Clock byte or a Second Clock byte, and if it is performing a fetch in advance (NXT) or at the end of an instruction (RNI). The state machine is implemented with two flip-flops to support its four states.

Other memory accesses

The loader takes care of fetching an instruction that consists of an opcode byte and a Mod R/M (addressing mode) byte. However, many instructions have additional bytes or don't follow this format For example, an opcode such as "ADD AX" can be followed by an 8- or 16-bit immediate value, adding that value to the AX register. Or a "move memory to AX" instruction can be followed by a 16-bit memory address The microcode uses a separate mechanism for fetching these instruction bytes from the queue. Specifically, each micro-instruction contains a source register and a destination register that specify a data move. By specifying "Q" (the queue) as the source, a byte is fetched from the prefetch queue.

A third path is used for arbitrary memory reads and writes, such as when an instruction stores a register's contents to memory. In this case, the microcode puts the memory address in the IND (indirect) register. The microcode then issues a read or write micro-operation which causes the memory contents to be read into the OPR (operand register) or written from the OPR. (The IND and OPR registers are internal 8086 registers that are not visible to the programmer.) In the 8086, a memory cycle takes at least four clock cycles (called T1 through T4), including adding the segment register to compute the memory address. An "unaligned" memory access takes twice as long, though, because the 8086 has a word-based 16-bit data bus. Thus, if you try to access a word from an odd address, two memory accesses are required, one for the first byte and one for the second byte.5

As you can see, a memory access is a fairly complex operation. In the 8086, all these steps are done by hardware in the Bus Interface Unit, rather than being performed by microcode. (I'll discuss the complex memory control circuitry in detail in a future post.) After issuing a memory read or write, the microcode engine is blocked until the memory request completes.

Microcode instructions and the correction circuitry

The microcode interacts with prefetching in several ways. In addition to requesting a byte from the queue (as discussed above), microcode can perform three micro-instructions that involve prefetching: SUSP, FLUSH, and CORR. The SUSP (suspend) micro-instruction stops prefetching, typically before a change to execution flow. The FLUSH micro-instruction flushes the prefetch queue and resumes prefetching. To implement these, the prefetching circuitry has a flip-flop to keep track of the suspended state, and logic to reset the queue pointer counters to flush the queue.

The CORR (correct) micro-instruction corrects the Instruction Pointer to point to the next execution position. This is an interesting and more complicated micro-instruction. Like most processors, the 8086 has a program counter (PC) to keep track of what instruction to execute; the 8086 calls this the Instruction Pointer (IP). In the programmer's view, the Instruction Pointer points to the memory address of the next instruction to execute. However, in the hardware, the Instruction Pointer points to the next instruction to be fetched, which is generally several bytes after the next instruction to be executed.6

For the most part, this doesn't matter; the queue provides instructions in the order they were fetched and it doesn't matter if the Instruction Pointer runs ahead. However, there are a few cases where the "real" Instruction Pointer address is needed. For example, a relative jump instruction causes execution to jump to an address relative to the current instruction. When performing a subroutine call, the return address must be pushed on the stack. The correct Instruction Pointer value is also needed for an interrupt. Thus, the 8086 needs a mechanism to compute the real Instruction Pointer value from the value in the Instruction Pointer register.

The solution is the CORR micro-instruction, which corrects the Instruction Pointer value by subtracting the prefetch queue length, so the Instruction Pointer holds the "true" value. For instance, if there are 4 bytes in the queue, then the address in the Instruction Pointer register is four more than the desired Instruction Pointer address. The Bus Interface Unit performs this subtraction by using the addressing adder and a small table of constants called the Constant ROM.7

The diagram below zooms in on the Constant ROM, located next to the adder. The Constant ROM is implemented as a PLA (programmable logic array), a two-level structured arrangement of gates. The first level (bottom) selects the desired correction constant, while the second level (middle) generates the bits of the constant: three bits plus a sign bit. The necessary correction constant is selected based on the length of the queue in words, the HL pointer, and the empty (MT) flag.

The Constant ROM, highlighted on the die.

The Constant ROM, highlighted on the die.

The Constant ROM is used for more than just address correction. For example, it is also used to increment the Instruction Pointer by 2 after a prefetch. Other constants are used for the 8086's string operations, which act on a block of memory. The index registers are incremented or decremented by 1 for bytes or 2 for words. When popping a value from the stack, the stack pointer is decremented, which uses the constant -2. Additional constants are required to increment and decrement the IND register when accessing words from unaligned (odd) addresses. These increment/decrement values are selected in the upper part of the Constant ROM. In total, the Constant ROM holds values from -6 to +2.

Policy

There are some "policy" decisions on prefetching, and it's interesting to see how the 8086 implements them. Prefetching is not free: there is a tradeoff when performing a prefetch between saving time later versus delaying memory accesses from an executing instruction. Moreover, if a jump operation takes place, the prefetch queue is discarded and the memory cycles were wasted. Thus, the length of the queue is an "extremely tricky design problem, because performance can deteriorate if the queue is too long as well as if it is too short."3

Intel performed simulations to determine the best queue length. A 4-byte queue provided a large benefit, while a 6-byte queue (which they chose) was slightly better. The designers were surprised to find that performance flattened out after that; they expected a much longer queue would be necessary. The 8088 process has only a 4-byte prefetch queue because its 8-bit bus changes the tradeoffs.8

The basic prefetch policy is that if a memory access and a prefetch are requested at the same time, the memory access "wins", since it is guaranteed to be useful while the prefetch is just speculative. If the queue holds 0 to 2 bytes, prefetch happens during the next free memory cycle. If the queue holds 5 or 6 bytes, no prefetch can happen, since prefetch happens a word at a time. However, if the queue holds 3 or 4 bytes, prefetch is delayed for two clock cycles, which is an interesting choice. This gives an instruction more opportunities to perform a memory operation without being delayed by a prefetch. There is a tradeoff because maybe delaying the prefetch will waste two cycles of memory bandwidth, but performing the prefetch might waste four cycles of memory bandwidth. The motivation for this delay is that the last two bytes in the queue are less valuable because they are more likely to be discarded.

Another policy decision is how to handle a change in execution flow, such as a jump or subroutine call. The 8086 simply discards the prefetch queue and starts fetching from the new address. The 8086 designers considered better ways of handling jumps, but it wasn't practical to implement at the time. There is no intelligence if the instructions are already in the queue (e.g. jumping forward a couple of bytes). There is also no branch prediction; prefetching proceeds linearly regardless of branch instructions.

The 8086 does nothing to ensure consistency between the prefetch queue and memory if a prefetched instruction is modified in memory.9 In this case, the "stale" instruction in the queue is executed. This situation may seem contrived, but self-modifying code used to be fairly popular, where a program would change its own instructions.10

Prefetching and the 8087 coprocessor

One feature of the 8086 microprocessor is that it supports coprocessors such as the 8087 floating point chip.11 The 8087 implements high-performance floating-point computation, performing arithmetic and transcendental computations up to 100 times faster than the 8086. The 8087 gets instructions in an interesting fashion, executing floating-point instructions from the 8086's instruction stream. Specifically, an "ESCAPE" opcode indicates an instruction that is performed by the 8087 rather than the 8086. However, prefetching adds a lot of complexity to the coprocessor because the 8087 monitors the bus to determine when it should execute an instruction. With prefetching, the instruction on the bus doesn't match the instruction being executed. An instruction may be executed many cycles after it was fetched over the bus. A prefetched instruction may even be discarded and never executed.

To solve this problem, the 8087 manages its own copy of the prefetch queue to determine when the 8086 would be executing a floating-point instruction. The 8087 watches the bus to see when instructions are prefetched. The 8086 provides queue status signals (QS0 and QS1) to indicate when it takes bytes from the queue or flushes the queue. These signals allow the 8087 coprocessor to keep track of the 8086's queue state so it can tell what instruction the 8086 is executing. In other words, the 8086 doesn't tell the 8087 coprocessor what to do; instead, the two chips process the instruction stream in parallel. Another complication is the 8087 coprocessor can be used with the 8088 processor chip, which has a smaller 4-byte queue. Thus, the 8087 coprocessor must detect whether it is connected to an 8086 or an 8088 and maintain its queue appropriately.

Brief history

Caching and prefetching were used in mainframe computers dating back to the 1960s. For instance, the IBM System/360 Model 91 (1966) had a cache with prefetching. Minicomputers such as the VAX 11/780 (1977) later used caching and prefetching. However, these features took a while to trickle down to microprocessors. The Motorola 68000 (1980) had a 4-byte prefetch queue. As far as I can tell, the 8086 was the first microprocessor with a prefetch queue.

We can view the 8086 as a stepping-stone towards the large caches first used externally in the 80386 and internally in the 486. The 80186 and 80286 kept the 6-byte prefetch buffer size of the 8086. The 80386 has a 16-byte prefetch buffer, although apparently due to a bug it was shrunk to 12 bytes in later revisions. As well as the prefetch queue, the 80386 supported an external cache.

Early microprocessors such as the 6502 or Z80 could fetch the next instruction while they were finishing the previous instruction. This minimal two-stage pipelining improved performance, but was much more limited than 8086-style prefetching. An Intel study found that this simple overlap provides a 35% performance increase with 15% more hardware, while implementing prefetching provided an additional 11% gain with 14% more hardware.3 This illustrates how the increasing transistor counts from Moore's law opened up new opportunities to improve performance. But it also shows diminishing returns as performance increases become smaller and more expensive.

Conclusions

Well, this was supposed to be a quick post about the prefetch queue, but the topic turned out to have a lot more complexity than I expected. A six-byte prefetch queue may seem like a simple feature to add to a processor, but it affects many parts of the system. Prefetching is tied closely to the memory access circuitry, of course, but it also required a Constant ROM to handle the difference between the execution address and the prefetch address. Prefetching also impacted the microcode, with three micro-instructions to support prefetching.

Prefetching also illustrates some of the ways that each feature and corner case of a processor like the 8086 leads to more complexity. For instance, byte-aligned (rather than word-aligned) instructions require a mechanism to fetch bytes as well as words. Supporting multiple instruction formats (1-byte opcodes, an opcode byte followed by a Mod R/M byte, multi-byte instructions) resulted in the loader state machine. The segment registers required an adder to compute the memory address for every access. Looking at the 8086 internals makes it easier to understand the motivation behind RISC processors, discarding the complexity and corner cases to create a simpler but faster processor.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

  1. Steve Furber, co-creator of the ARM chip, mentions that "The first integrated CPUs were coincidentally quite well matched to semiconductor memory speeds, and were therefore built without caches. This can now be seen as a temporary aberration." See VLSI Risc Architecture and Organization p77. To make this concrete, the Apple II (1977) used a MOS 6502 processor running at about 1 megahertz while its 4116 DRAM chips could perform an access in 250 nanoseconds (4 times the clock speed). The 8086 processor ran at 5-10 MHz which meant that 250 ns DRAM chips were slower than the clock speed. Nowadays, processors run at 4 GHz but DRAM access speed is about 50 nanoseconds (1/200 the clock speed). 

  2. Modern processors use caches to improve memory performance; caches are often megabytes in size. Accessing data from a cache is faster than accessing it from main memory, but the tradeoff is that caches are smaller. The 8086's prefetch queue is similar to a cache in some ways, but there are some key differences. First, the prefetch queue is strictly sequential. If you jump ahead two bytes, even if the prefetch queue has those instruction bytes, the processor can't use them. Second, the prefetch queue can't reuse bytes. If you have a 6-byte loop, even though all the code fits in the prefetch queue, it will be reloaded every time. Third, the prefetch queue doesn't provide any consistency. If you modify an instruction in memory a couple of bytes ahead of the PC, the 8086 will run the old instruction if it's in the queue. 

  3. The design decisions for the 8086 prefetch cache (and many other aspects of the chip) are described in: J. McKevitt and J. Bayliss, "New options from big chips," in IEEE Spectrum, vol. 16, no. 3, pp. 28-34, March 1979, doi: 10.1109/MSPEC.1979.6367944. 

  4. A detailed block diagram of the 8086 is provided in the patent. Conveniently, the layout of the diagram is close to the physical layout of the chip.

    Detailed block diagram of the 8086, based on patent US4449184. I have modified the register names to match the common naming.

    Detailed block diagram of the 8086, based on patent US4449184. I have modified the register names to match the common naming.

    I won't discuss this block diagram in detail here, but I'll point out the Q (queue) control logic in the upper center, with its associated read and write pointers. The Q bus connects the queue to various parts of the instruction decoding circuitry.

     

  5. Supporting misaligned memory accesses (i.e. accessing a word at an odd address) adds complexity to the 8086. It's not surprising that many RISC chips prohibit unaligned accesses. On SPARC, for instance, an unaligned access fails with a "bus error", which the Sun programmers out there probably recognize. ARM processors before ARMv7 didn't support unaligned accesses. RISC-V supports misaligned data accesses but not misaligned instructions. 

  6. The 8086 patent describes how the program counter in the 8086 does not hold the "real" value:

    PC is not a real or true program counter in that it does not, nor does any other register within CPU, maintain the actual execution point at any time. PC actually points to the next byte to be input into queue. The real program counter is calculated by instruction whenever a relative jump or call is required by subtracting the number of accessed instructions still remaining unused in queue from PC.
     

  7. The CORR correction operation adds more complexity to the system than you might expect, with synchronization between the Bus Interface Unit and the Execution Unit. Because the correction computation uses the addressing adder, the correction operation must be synchronized with memory accesses that also use the adder. To accomplish this, the Bus Interface Unit waits until any memory operation is finished and then generates two clock cycles of "fake" memory operation, keeping the adder free for the CORR instruction. As a result, the memory control circuitry needs logic to implement this memory cycle. Meanwhile, the microcode engine is stopped until the CORR instruction completes, requiring synchronization circuitry. 

  8. The 8088 is famous as the processor in the original IBM PC. The 8088 processor is essentially the same as the 8086 except that it has an 8-bit data bus instead of a 16-bit data bus, so it performs memory accesses a byte at a time instead of a word at a time. Internally, the 8088 is nearly identical to the 8086 but there are a few differences in microcode and in the bus circuitry. The most visible difference is that the 8088 has a 4-byte prefetch queue instead of a 6-byte prefetch queue. Simulations showed that a 4-byte queue was sufficient for the 8088. Because it fetches one byte at a time instead of two bytes, the 8088 fills the prefetch queue more slowly and wouldn't get much benefit from the larger queue. I haven't looked at the 8088's prefetch circuitry in detail, so I can't describe it exactly. 

  9. At some point, Intel implemented consistency between cached instructions and memory. This ensures that self-modifying code will run the latest version of an instruction rather than a stale instruction in the cache. I couldn't determine exactly when this was implemented; various sources say the 486, the Pentium, or the Pentium Pro. (If you have a definitive answer, please let me know.) 

  10. Self-modifying code can be used as a way to distinguish between the 8086 and 8088 chips in software. Since the 8086 has a 6-byte queue and the 8088 has a 4-byte queue, you can create self-modifying code that will run a prefetched instruction on the 8086 but run the modified instruction on the 8088. 

  11. Although the 8087 is the most well-known coprocessor for the 8086, it was not the only coprocessor. The Intel 8089 input/output coprocessor provided mainframe-style I/O channels, offloading I/O processing from the 8086. More than just a DMA engine, the 8089 was a separate processor with its own instruction set. Unlike the 8087, the 8089 didn't take instructions from the 8086's instruction stream so it didn't interact with prefetching; instead, the 8086 sent a Channel Attention signal to the 8089 and the 8089 read instructions from shared memory. The 8089 was complex and expensive and wasn't very popular. The Intel 82586 Ethernet coprocessor used a similar Channel Attention scheme.