Showing posts with label intel. Show all posts
Showing posts with label intel. Show all posts

The Group Decode ROM: The 8086 processor's first step of instruction decoding

A key component of any processor is instruction decoding: analyzing a numeric opcode and figuring out what actions need to be taken. The Intel 8086 processor (1978) has a complex instruction set, making instruction decoding a challenge. The first step in decoding an 8086 instruction is something called the Group Decode ROM, which categorizes instructions into about 35 types that control how the instruction is decoded and executed. For instance, the Group Decode ROM determines if an instruction is executed in hardware or in microcode. It also indicates how the instruction is structured: if the instruction has a bit specifying a byte or word operation, if the instruction has a byte that specifies the addressing mode, and so forth.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The diagram above shows the position of the Group Decode ROM on the silicon die, as well as other key functional blocks. The 8086 chip is partitioned into a Bus Interface Unit that communicates with external components such as memory, and the Execution Unit that executes instructions. Machine instructions are fetched from memory by the Bus Interface Unit and stored in the prefetch queue registers, which hold 6 bytes of instructions. To execute an instruction, the queue bus transfers an instruction byte from the prefetch queue to the instruction register, under control of a state machine called the Loader. Next, the Group Decode ROM categorizes the instruction according to its structure. In most cases, the machine instruction is implemented in low-level microcode. The instruction byte is transferred to the Microcode Address Register, where the Microcode Address Decoder selects the appropriate microcode routine that implements the instruction. The microcode provides the micro-instructions that control the Arithmetic/Logic Unit (ALU), registers, and other components to execute the instruction.

In this blog post, I will focus on a small part of this process: how the Group Decode ROM decodes instructions. Be warned that this post gets down into the weeds, so you might want to start with one of my higher-level posts, such as how the 8086's microcode engine works.

Microcode

Most instructions in the 8086 are implemented in microcode. Most people think of machine instructions as the basic steps that a computer performs. However, many processors have another layer of software underneath: microcode. With microcode, instead of building the CPU's control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.

Microcode is only used if the Group Decode ROM indicates that the instruction is implemented in microcode. In that case, the microcode address register is loaded with the instruction and the address decoder selects the appropriate microcode routine. However, there's a complication. If the second byte of the instruction is a Mod R/M byte, the Group Decode ROM indicates this and causes a memory addressing micro-subroutine to be called.

Some simple instructions are implemented entirely in hardware and don't use microcode. These are known as 1-byte logic instructions (1BL) and are also indicated by the Group Decode ROM.

The Group Decode ROM's structure

The Group Decode ROM takes an 8-bit instruction as input, along with an interrupt signal. It produces 15 outputs that control how the instruction is handled. In this section I'll discuss the physical implementation of the Group Decode ROM; the various outputs are discussed in a later section.

Although the Group Decode ROM is called a ROM, its implementation is really a PLA (Programmable Logic Array), two levels of highly-structured logic gates.1 The idea of a PLA is to create two levels of NOR gates, each in a grid. This structure has the advantages that it implements the logic densely and is easy to modify. Although physically two levels of NOR gates, a PLA can be thought of as an AND layer followed by an OR layer. The AND layer matches particular bit patterns and then the OR layer combines multiple values from the first layer to produce arbitrary outputs.

The Group Decode ROM. This photo shows the metal layer on top of the die.

The Group Decode ROM. This photo shows the metal layer on top of the die.

Since the output values are highly structured, a PLA implementation is considerably more efficient than a ROM, since in a sense it combines multiple entries. In the case of the Group Decode ROM, using a ROM structure would require 256 columns (one for each 8-bit instruction pattern), while the PLA implementation requires just 36 columns, about 1/7 the size.

The diagram below shows how one column of the Group Decode ROM is wired in the "AND" plane. In this die photo, I removed the metal layer with acid to reveal the polysilicon and silicon underneath. The vertical lines show where the metal line for ground and the column output had been. The basic idea is that each column implements a NOR gate, with a subset of the input lines selected as inputs to the gate. The pull-up resistor at the top pulls the column line high by default. But if any of the selected inputs are high, the corresponding transistor turns on, connecting the column line to ground and pulling it low. Thus, this implements a NOR gate. However, it is more useful to think of it as an AND of the complemented inputs (via De Morgan's Law): if all the inputs are "correct", the output is high. In this way, each column matches a particular bit pattern.

Closeup of a column in the Group Decode ROM.

Closeup of a column in the Group Decode ROM.

The structure of the ROM is implemented through the silicon doping pattern, which is visible above. A transistor is formed where a polysilicon wire crosses a doped silicon region: the polysilicon forms the gate, turning the transistor on or off. At each intersection point, a transistor can be created or not, depending on the doping pattern. If a particular transistor is created, then the corresponding input must be 0 to produce a high output.

At the top of the diagram above, the column outputs are switched from the metal layer to polysilicon wires and become the inputs to the upper "OR" plane. This plane is implemented in a similar fashion as a grid of NOR gates. The plane is rotated 90 degrees, with the inputs vertical and each row forming an output.

Intermediate decoding in the Group Decode ROM

The first plane of the Group Decode ROM categorizes instructions into 36 types based on the instruction bit pattern.2 The table below shows the 256 instruction values, colored according to their categorization.3 For instance, the first blue block consists of the 32 ALU instructions corresponding to the bit pattern 00XXX0XX, where X indicates that the bit can be 0 or 1. These instructions are all decoded and executed in a similar way. Almost all instructions have a single category, that is, they activate a single column line in the Group Decode ROM. However, a few instructions activate two lines and have two colors below.

Grid of 8086 instructions, colored according to the first level of the Group Decode Rom.

Grid of 8086 instructions, colored according to the first level of the Group Decode Rom.

Note that the instructions do not have arbitrary numeric opcodes, but are assigned in a way that makes decoding simpler. Because these blocks correspond to bit patterns, there is little flexibility. One of the challenges of instruction set design for early microprocessors was to assign numeric values to the opcodes in a way that made decoding straightforward. It's a bit like a jigsaw puzzle, fitting the instructions into the 256 available values, while making them easy to decode.

Outputs from the Group Decode ROM

The Group Decode ROM has 15 outputs, one for each row of the upper half. In this section, I'll briefly discuss these outputs and their roles in the 8086. For an interactive exploration of these signals, see this page, which shows the outputs that are triggered by each instruction.

Out 0 indicates an IN or OUT instruction. This signal controls the M/IO (S2) status line, which distinguishes between a memory read/write and an I/O read/write. Apart from this, memory and I/O accesses are basically the same.

Out 1 indicates (inverted) that the instruction has a Mod R/M byte and performs a read/modify/write on its argument. This signal is used by the Translation ROM when dispatching an address handler (details). (This signal distinguishes between, say, ADD [AX],BX and MOV [AX],BX. The former both reads and writes [AX], while the latter only writes to it.)

Out 2 indicates a "group 3/4/5" opcode, an instruction where the second byte specifies the particular instruction, and thus decoding needs to wait for the second byte. This controls the loading of the microcode address register.

Out 3 indicates an instruction prefix (segment, LOCK, or REP). This causes the next byte to be decoded as a new instruction, while blocking interrupt handling.

Out 4 indicates (inverted) a two-byte ROM instruction (2BR), i.e. an instruction is handled by the microcode ROM, but requires the second byte for decoding. This is an instruction with a Mod R/M byte. This signal controls the loader indicating that it needs to fetch the second byte. This signal is almost the same as output 1 with a few differences.

Out 5 specifies the top bit for an ALU operation. The 8086 uses a 5-bit field to specify an ALU operation. If not specified explicitly by the microcode, the field uses bits 5 through 3 of the opcode. (These bits distinguish, say, an ADD instruction from AND or SUB.) This control line sets the top bit of the ALU field for instructions such as DAA, DAS, AAA, AAS, INC, and DE that fall into a different set from the "regular" ALU instructions.

Out 6 indicates an instruction that sets or clears a condition code directly: CLC, STC, CLI, STI, CLD, or STD (but not CMC). This signal is used by the flag circuitry to update the condition code.

Out 7 indicates an instruction that uses the AL or AX register, depending on the instruction's size bit. (For instance MOVSB vs MOVSW.) This signal is used by the register selection circuitry, the M register specifically.

Out 8 indicates a MOV instruction that uses a segment register. This signal is used by the register selection circuitry, the N register specifically.

Out 9 indicates the instruction has a d bit, where bit 1 of the instruction swaps the source and destination. This signal is used by the register selection circuitry, swapping the roles of the M and N registers according to the d bit.

Out 10 indicates a one-byte logic (1BL) instruction, a one-byte instruction that is implemented in logic, not microcode. These instructions are the prefixes, HLT, and the condition-code instructions. This signal controls the loader, causing it to move to the next instruction.

Out 11 indicates instructions where bit 0 is the byte/word indicator. This signal controls the register handling and the ALU functionality.

Out 12 indicates an instruction that operates only on a byte: DAA, DAS, AAA, AAS, AAM, AAD, and XLAT. This signal operates in conjunction with the previous output to select a byte versus word.

Out 13 forces the instruction to use a byte argument if instruction bit 1 is set, overriding the regular byte/word pattern. Specifically, it forces the L8 (length 8 bits) condition for the JMP direct-within-segment and the ALU instructions that are immediate with sign extension (details).

Out 14 allows a carry update. This prevents the carry from being updated by the INC and DEC operations. This signal is used by the flag circuitry.

Columns

Most of the Group Decode ROM's column signals are used to derive the outputs listed above. However, some column outputs are also used as control signals directly. These are listed below.

Column 10 indicates an immediate MOV instruction. These instructions use instruction bit 3 (rather than bit 1) to select byte versus word, because the three low bits specify the register. This signal affects the L8 condition described earlier and also causes the M register selection to be converted from a word register to a byte register if necessary.

Column 12 indicates an instruction with bits 5-3 specifying the ALU instruction. This signal causes the X register to be loaded with the bits in the instruction that specify the ALU operation. (To be precise, this signal prevents the X register from being reloaded from the second instruction byte.)

Column 13 indicates the CMC (Complement Carry) instruction. This signal is used by the flags circuitry to complement the carry flag (details).

Column 14 indicates the HLT (Halt) instruction. This signal stops instruction processing by blocking the instruction queue.

Column 31 indicates a REP prefix. This signal causes the REPZ/NZ latch to be loaded with instruction bit 0 to indicate if the prefix is REPNZ or REPZ. It also sets the REP latch.

Column 32 indicates a segment prefix. This signal loads the segment latches with the desired segment type.

Column 33 indicates a LOCK prefix. It sets the LOCK latch, locking the bus.

Column 34 indicates a CLI instruction. This signal immediately blocks interrupt handling to avoid an interrupt between the CLI instruction and when the interrupt flag bit is cleared.

Timing

One important aspect of the Group Decode ROM is that its outputs are not instantaneous. It takes a clock cycle to get the outputs from the Group Decode ROM. In particular, when instruction decoding starts, the timing signal FC (First Clock) is activated to indicate the first clock cycle. However, the Group Decode ROM's outputs are not available until the Second Clock SC.

One consequence of this is that even the simplest instruction (such as a flag operation) takes two clock cycles, as does a prefix. The problem is that even though the instruction could be performed in one clock cycle, it takes two clock cycles for the Group Decode ROM to determine that the instruction only needs one cycle. This illustrates how a complex instruction format impacts performance.

The FC and SC timing signals are generated by a state machine called the Loader. These signals may seem trivial, but there are a few complications. First, the prefetch queue may run empty, in which case the FC and/or SC signal is delayed until the prefetch queue has a byte available. Second, to increase performance, the 8086 can start decoding an instruction during the last clock cycle of the previous instruction. Thus, if the microcode indicates that there is one cycle left, the Loader can proceed with the next instruction. Likewise, for a one-byte instruction implemented in hardware (one-byte logic or 1BL), the loader proceeds as soon as possible.

The diagram below shows the timing of an ADD instruction. Each line is half of a clock cycle. Execution is pipelined: the instruction is fetched during the first clock cycle (First Clock). During Second Clock, the Group Decode ROM produces its output. The microcode address register also generates the micro-address for the instruction's microcode. The microcode ROM supplies a micro-instruction during the third clock cycle and execution of the micro-instruction takes place during the fourth clock cycle.

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro".

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro".

The Group Decode ROM's outputs during Second Clock control the decoding. Most importantly, the ADD imm instruction used microcode; it is not a one-byte logic instruction (1BL). Moreover, it does not have a Mod R/M byte, so it does not need two bytes for decoding (2BR). For a 1BL instruction, microcode execution would be blocked and the next instruction would be immediately fetched. On the other hand, for a 2BR instruction, the loader would tell the prefetch queue that it was done with the second byte during the second half of Second Clock. Microcode execution would be blocked during the third cycle and the fourth cycle would execute a microcode subroutine to determine the memory address.

For more details, see my article on the 8086 pipeline.

Interrupts

The Group Decode ROM takes the 8 bits of the instruction as inputs, but it has an additional input indicating that an interrupt is being handled. This signal blocks most of the Group Decode ROM outputs. This prevents the current instruction's outputs from interfering with interrupt handling. I wrote about the 8086's interrupt handling in detail here, so I won't go into more detail in this post.

Conclusions

The Group Decode ROM indicates one of the key differences between CISC processors (Complex Instruction Set Computer) such as the 8086 and the RISC processors (Reduced Instruction Set Computer) that became popular a few years later. A RISC instruction set is designed to make instruction decoding very easy, with a small number of uniform instruction forms. On the other hand, the 8086's CISC instruction set was designed for compactness and high code density. As a result, instructions are squeezed into the available opcode space. Although there is a lot of structure to the 8086 opcodes, this structure is full of special cases and any patterns only apply to a subset of the instructions. The Group Decode ROM brings some order to this chaotic jumble of instructions, and the number of outputs from the Group Decode ROM is a measure of the instruction set's complexity.

The 8086's instruction set was extended over the decades to become the x86 instruction set in use today. During that time, more layers of complexity were added to the instruction set. Now, an x86 instruction can be up to 15 bytes long with multiple prefixes. Some prefixes change the register encoding or indicate a completely different instruction set such as VEX (Vector Extensions) or SSE (Streaming SIMD Extensions). Thus, x86 instruction decoding is very difficult, especially when trying to decode multiple instructions in parallel. This has an impact in modern systems, where x86 processors typically have 4 complex instruction decoders while Apple's ARM processors have 8 simpler decoders; this is said to give Apple a performance benefit. Thus, architectural decisions from 45 years ago are still impacting the performance of modern processors.

I've written numerous posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected]. Thanks to Arjan Holscher for suggesting this topic.

Notes and references

  1. You might wonder what the difference is between a ROM and a PLA. Both of them produce arbitrary outputs for a set of inputs. Moreover, you can replace a PLA with a ROM or vice versa. Typically a ROM has all the input combinations decoded, so it has a separate row for each input value, i.e. 2N rows. So you can think of a ROM as a fully-decoded PLA.

    Some ROMs are partially decoded, allowing identical rows to be combined and reducing the size of the ROM. This technique is used in the 8086 microcode, for instance. A partially-decoded ROM is fairly similar to a PLA, but the technical distinction is that a ROM has only one output row active at a time, while a PLA can have multiple output rows active and the results are OR'd together. (This definition is from The Architecture of Microprocessors p117.)

    The Group Decode ROM, however, has a few cases where multiple rows are active at the same time (for instance the segment register POP instructions). Thus, the Group Decode ROM is technically a PLA and not a ROM. This distinction isn't particularly important, but you might find it interesting. 

  2. The Group Decode ROM has 38 columns, but two columns (11 and 35) are unused. Presumably, these were provided as spares in case a bug fix or modification required additional decoding. 

  3. Like the 8008 and 8080, the 8086's instruction set was designed around a 3-bit octal structure. Thus, the 8086 instruction set makes much more sense if viewed in octal instead of hexadecimal. The table below shows the instructions with an octal organization. Each 8×8 block uses the two low octal digits, while the four large blocks are positioned according to the top octal digit (labeled). As you can see, the instruction set has a lot of structure that is obscured in the usual hexadecimal table.

    The 8086 instruction set, put in a table according to the octal opcode value.

    The 8086 instruction set, put in a table according to the octal opcode value.

    For details on the octal structure of the 8086 instruction set, see The 80x86 is an Octal Machine

Reverse-engineering the division microcode in the Intel 8086 processor

While programmers today take division for granted, most microprocessors in the 1970s could only add and subtract — division required a slow and tedious loop implemented in assembly code. One of the nice features of the Intel 8086 processor (1978) was that it provided machine instructions for integer multiplication and division. Internally, the 8086 still performed a loop, but the loop was implemented in microcode: faster and transparent to the programmer. Even so, division was a slow operation, about 50 times slower than addition.

I recently examined multiplication in the 8086, and now it's time to look at the division microcode.1 (There's a lot of overlap with the multiplication post so apologies for any deja vu.) The die photo below shows the chip under a microscope. I've labeled the key functional blocks; the ones that are important to this post are darker. At the left, the ALU (Arithmetic/Logic Unit) performs the arithmetic operations at the heart of division: subtraction and shifts. Division also uses a few special hardware features: the X register, the F1 flag, and a loop counter. The microcode ROM at the lower right controls the process.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

Microcode

Like most instructions, the division routines in the 8086 are implemented in microcode. Most people think of machine instructions as the basic steps that a computer performs. However, many processors have another layer of software underneath: microcode. With microcode, instead of building the CPU's control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. This is especially useful for a machine instruction such as division, which performs many steps in a loop.

Each micro-instruction in the 8086 is encoded into 21 bits as shown below. Every micro-instruction moves data from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits depends on the type field and can be anything from an ALU operation to a memory read or write to a change of microcode control flow. Thus, an 8086 micro-instruction typically does two things in parallel: the move and the action. For more about 8086 microcode, see my microcode blog post.

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

A few details of the ALU (Arithmetic/Logic Unit) operations are necessary to understand the division microcode. The ALU has three temporary registers that are invisible to the programmer: tmpA, tmpB, and tmpC. An ALU operation takes its first argument from the specified temporary register, while the second argument always comes from tmpB. An ALU operation requires two micro-instructions. The first micro-instruction specifies the ALU operation and source register, configuring the ALU. For instance, ADD tmpA to add tmpA to the default tmpB. In the next micro-instruction (or a later one), the ALU result can be accessed through a register called Σ (Sigma) and moved to another register.

The carry flag plays a key role in division. The carry flag is one of the programmer-visible status flags that is set by arithmetic operations, but it is also used by the microcode. For unsigned addition, the carry flag is set if there is a carry out of the word (or byte). For subtraction, the carry flag indicates a borrow, and is set if the subtraction requires a borrow. Since a borrow results if you subtract a larger number from a smaller number, the carry flag also indicates the "less than" condition. The carry flag (and other status flags) are only updated if micro-instruction contains the F bit.

The RCL (Rotate through Carry, Left) micro-instruction is heavily used in the division microcode.2 This operation shifts the bits in a 16-bit word, similar to the << bit-shift operation in high-level languages, but with an additional feature. Instead of discarding the bit on the end, that bit is moved into the carry flag. Meanwhile, the bit formerly in the carry flag moves into the word. You can think of this as rotating the bits while treating the carry flag as a 17th bit of the word. (The RCL operation can also act on a byte.)

The rotate through carry left micro-instruction.

The rotate through carry left micro-instruction.

These shifts perform an important part of the division process since shifting can be viewed as multiplying or dividing by two. RCL also provides a convenient way to move the most-significant bit to the carry flag, where it can be tested for a conditional jump. (This is important because the top bit is used as the sign bit.) Another important property is that performing RCL on a lower word and then RCL on an upper word will perform a 32-bit shift, since the high bit of the lower word will be moved into the low bit of the upper word via the carry bit. Finally, the shift moves the quotient bit from the carry into the register.

Binary division

The division process in the 8086 is similar to grade-school long division, except in binary instead of decimal. The diagram below shows the process: dividing 67 (the dividend) by 9 (the divisor) yields the quotient 7 at the top and the remainder 4 at the bottom. Long division is easier in binary than decimal because you don't need to guess the right quotient digit. Instead, at each step you either subtract the divisor (appropriately shifted) or subtract nothing. Note that although the divisor is 4 bits in this example, the subtractions use 5-bit values. The need for an "extra" bit in division will be important in the discussion below; 16-bit division needs a 17-bit value.

 0111
100101000011
 -0000
 10000
 -1001
 01111
 -1001
 01101
 -1001
 0100

Instead of shifting the divisor to the right each step, the 8086's algorithm shifts the quotient and the current dividend to the left each step. This trick reduces the register space required. Dividing a 32-bit number (the dividend) by a 16-bit number yields a 16-bit result, so it seems like you'd need four 16-bit registers in total. The trick is that after each step, the 32-bit dividend gets one bit smaller, while the result gets one bit larger. Thus, the dividend and the result can be packed together into 32 bits. At the end, what's left of the dividend is the 16-bit remainder. The table below illustrates this process for a sample dividend (blue) and quotient (green).3 At the end, the 16-bit blue value is the remainder.

dividendquotient
00001111000000001111111100000000
00001110000001011111111000000000
00001100000011111111110000000000
00001000001000111111100000000000
00000000010010111111000000000000
00000000100101111110000000000001
00000001001011111100000000000011
00000010010111111000000000000111
00000100101111110000000000001111
00001001011111100000000000011111
00000011000000000000000000111110
00000110000000000000000001111101
00001100000000000000000011111011
00001000000001000000000111110110
00000000000011000000001111101100
00000000000110000000011111011001
00000000001100000000111110110011

The division microcode

The 8086 has four division instructions to handle signed and unsigned division of byte and word operands. I'll start by describing the microcode for the unsigned word division instruction DIV, which divides a 32-bit dividend by a 16-bit divisor. The dividend is supplied in the AX and DX registers while the divisor is specified by the source operand. The 16-bit quotient is returned in AX and the 16-bit remainder in DX. For a divide-by-zero, or if the quotient is larger than 16 bits, a type 0 "divide error" interrupt is generated.

CORD: The core division routine

The CORD microcode subroutine below implements the long-division algorithm for all division instructions; I think CORD stands for Core Divide. At entry, the arguments are in the ALU temporary registers: tmpA/tmpC hold the 32-bit dividend, while tmpB holds the 16-bit divisor. (I'll explain the configuration for byte division later.) Each cycle of the loop shifts the values and then potentially subtracts the divisor. One bit is appended to the quotient to indicate whether the divisor was subtracted or not. At the end of the loop, whatever is left of the dividend is the remainder.

Each micro-instruction specifies a register move on the left, and an action on the right. The moves transfer words between the visible registers and the ALU's temporary registers, while the actions are mostly ALU operations or control flow. As is usually the case with microcode, the details are tricky. The first three lines below check if the division will overflow or divide by zero. The code compares tmpA and tmpB by subtracting tmpB, discarding the result, but setting the status flags (F). If the upper word of the divisor is greater or equal to the dividend, the division will overflow, so execution jumps to INT0 to generate a divide-by-zero interrupt.4 (This handles both the case where the dividend is "too large" and the divide-by-0 case.) The number of loops in the division algorithm is controlled by a special-purpose loop counter. The MAXC micro-instruction initializes the counter to 7 or 15, for a byte or word divide instruction respectively.

   move        action
             SUBT tmpA   CORD: set up compare
Σ → no dest  MAXC F       compare, set up counter, update flags
             JMP NCY INT0 generate interrupt if overflow
             RCL tmpC    3: main loop: left shift tmpA/tmpC
Σ → tmpC     RCL tmpA     
Σ → tmpA     SUBT tmpA    set up compare/subtract
             JMPS CY 13   jump if top bit of tmpA was set
Σ → no dest  F            compare, update flags
             JMPS NCY 14  jump for subtract
             JMPS NCZ 3   test counter, loop back to 3
             RCL tmpC    10: done:
Σ → tmpC                  shift last bit into tmpC
Σ → no dest  RTN          done: get top bit, return

             RCY         13: reset carry
Σ → tmpA     JMPS NCZ 3  14: subtract, loop
             JMPS 10     done, goto 10

The main loop starts at 3. The tmpC and tmpA registers are shifted left. This has two important side effects. First, the old carry bit (which holds the latest quotient bit) is shifted into the bottom of tmpC. Second, the top bit of tmpA is shifted into the carry bit; this provides the necessary "extra" bit for the subtraction below. Specifically, if the carry (the "extra" tmpA bit) is set, tmpB can be subtracted, which is accomplished by jumping to 13. Otherwise, the code compares tmpA and tmpB by subtracting tmpB, discarding the result, and updating the flags (F). If there was no borrow/carry (tmpA ≥ tmpB), execution jumps to 14 to subtract. Otherwise, the internal loop counter is decremented and control flow goes back to the top of the loop if not done (NCZ, Not Counter Zero). If the loop is done, tmpC is rotated left to pick up the last quotient bit from the carry flag. Then a second rotate of tmpC is performed but the result is discarded; this puts the top bit of tmpC into the carry flag for use later in POSTIDIV. Finally, the subroutine returns.

The subtraction path is 13 and 14, which subtract tmpB from tmpA by storing the result (Σ) in tmpA. This path resets the carry flag for use as the quotient bit. As in the other path, the loop counter is decremented and tested (NCZ) and execution either continues back at 3 or finishes at 10.

One complication is that the carry bit is the opposite of the desired quotient bit. Specifically, if tmpA < tmpB, the comparison generates a borrow so the carry flag is set to 1. In this case, the desired quotient bit is 0 and no subtraction takes place. But if tmpA ≥ tmpB, the comparison does not generate a borrow (so the carry flag is set to 0), the code subtracts tmpB, and the desired quotient bit is 1. Thus, tmpC ends up holding the complement of the desired result; this is fixed later.

The microcode is carefully designed to pack the divide loop into a small number of micro-instructions. It uses the registers and the carry flag in tricky ways, using the carry flag to hold the top bit of tmpA, the comparison result, and the generated quotient bit. This makes the code impressively dense but tricky to understand.

The top-level division microcode

Now I'll pop up a level and take a look at the top-level microcode (below) that implements the DIV and IDIV machine instructions. The first three instructions load tmpA, tmpC, and tmpB from the specified registers. (The M register refers to the source specified in the instruction, either a register or a memory location.) Next, the X0 condition tests bit 3 of the instruction, which in this case distinguishes DIV from IDIV. For signed division (IDIV), the microcode calls PREIDIV, which I'll discuss below. Next, the CORD micro-subroutine discussed above is called to perform the division.

DX → tmpA                      iDIV rmw: load tmpA, tmpC, tmpB 
AX → tmpC    RCL tmpA           set up RCL left shift operation
M → tmpB     CALL X0 PREIDIV    set up integer division if IDIV
             CALL CORD          call CORD to perform division 
             COM1 tmpC          set up to complement the quotient 
DX → tmpB    CALL X0 POSTIDIV   get original dividend, handle IDIV
Σ → AX       NXT                store updated quotient
tmpA → DX    RNI                store remainder, run next instruction

As discussed above, the quotient in tmpC needs to be 1's-complemented, which is done with COM1. For IDIV, the micro-subroutine POSTIDIV sets the signs of the results appropriately. The results are stored in the AX and DX registers. The NXT micro-operation indicates the next micro-instruction is the last one, directing the microcode engine to start the next machine instruction. Finally, RNI directs the microcode engine to run the next machine instruction.

8-bit division

The 8086 has separate opcodes for 8-bit division. The 8086 supports many instructions with byte and word versions, using 8-bit or 16-bit arguments respectively. In most cases, the byte and word instructions use the same microcode, with the ALU and register hardware using bytes or words based on the instruction. In the case of division, the shift micro-operations act on tmpA and tmpC as 8-bit registers rather than 16-bit registers. Moreover, the MAXC micro-operation initializes the internal counter to 7 rather than 15. Thus, the same CORD microcode is used for word and byte division, but the number of loops and the specific operations are changed by the hardware.

The diagram below shows the tmpA and tmpC registers during each step of dividing 0x2345 by 0x34. Note that the registers are treated as 8-bit registers. The divisor (blue) steadily shrinks with the quotient (green) taking its place. At the end, the remainder is 0x41 (blue) and the quotient is 0xad, complement of the green value.

tmpAtmpC
00000000001000110000000001000101
00000000000100100000000010001010
00000000001001010000000000010101
00000000000101100000000000101010
00000000001011000000000001010101
00000000001001000000000010101010
00000000000101010000000001010100
00000000001010100000000010101001
00000000001000010000000001010010

Although the CORD routine is shared for byte and word division, the top-level microcode is different. In particular, the byte and word division instructions use different registers, requiring microcode changes. The microcode below is the top-level code for byte division. It is almost the same as the microcode above, except it uses the top and bottom bytes of the accumulator (AH and AL) rather than the AX and DX registers.

AH → tmpA                     iDIV rmb: get arguments
AL → tmpC    RCL tmpA          set up RCL left shift operation
M → tmpB     CALL X0 PREIDIV   handle signed division if IDIV
             CALL CORD         call CORD to perform division
             COM1 tmpC         complement the quotient
AH → tmpB    CALL X0 POSTIDIV  handle signed division if IDIV
Σ → AL       NXT               store quotient
tmpA → AH    RNI               store remainder, run next instruction

Signed division

The 8086 (like most computers) represents signed numbers using a format called two's complement. While a regular byte holds a number from 0 to 255, a signed byte holds a number from -128 to 127. A negative number is formed by flipping all the bits (known as the one's complement) and then adding 1, yielding the two's complement value. For instance, +5 is 0x05 while -5 is 0xfb. (Note that the top bit of a number is set for a negative number; this is the sign bit.) The nice thing about two's complement numbers is that the same addition and subtraction operations work on both signed and unsigned values. Unfortunately, this is not the case for signed multiplication and division.

The 8086 has separate IDIV (Integer Divide) instructions to perform signed division. The 8086 performs signed division by converting the arguments to positive values, performing unsigned division, and then negating the results if necessary. As shown earlier, signed and unsigned division both use the same top-level microcode and the microcode conditionally calls some subroutines for signed division. These additional subroutines cause a significant performance penalty, making signed division over 20 cycles slower than unsigned division. I will discuss those micro-subroutines below.

PREIDIV

The first subroutine for signed division is PREIDIV, performing preliminary operations for integer division. It converts the two arguments, stored in tmpA/tmpC and tmpB, to positive values. It keeps track of the signs using an internal flag called F1, toggling this flag for each negative argument. This conveniently handles the rule that two negatives make a positive since complementing the F1 flag twice will clear it. The point of this is that the main division code (CORD) only needs to handle unsigned division.

The microcode below implements PREIDIV. First it tests if tmpA is negative, but the 8086 does not have a microcode condition to directly test the sign of a value. Instead, the microcode determines if a value is negative by shifting the value left, which moves the top (sign) bit into the carry flag. The conditional jump (NCY) then tests if the carry is clear, jumping if the value is non-negative. If tmpA is negative, execution falls through to negate the first argument. This is tricky because the argument is split between the tmpA and tmpC registers. The two's complement operation (NEG) is applied to the low word, while either 2's complement or one's complement (COM1) is applied to the upper word, depending on the carry for mathematical reasons.5 The F1 flag is complemented to keep track of the sign. (The multiplication process reuses most of this code, starting at the NEGATE entry point.)

Σ → no dest             PREIDIV: shift tmpA left
             JMPS NCY 7  jump if non-negative
             NEG tmpC   NEGATE: negate tmpC
Σ → tmpC     COM1 tmpA F maybe complement tmpA
             JMPS CY 6  
             NEG tmpA    negate tmpA if there's no carry
Σ → tmpA     CF1        6: toggle F1 (sign)

             RCL tmpB  7: test sign of tmpB
Σ → no dest  NEG tmpB    maybe negate tmpB
             JMPS NCY 11 skip if tmpB positive
Σ → tmpB     CF1 RTN     else negate tmpB, toggle F1 (sign)
             RTN        11: return

The next part of the code, starting at 7, negates tmpB (the divisor) if it is negative. Since the divisor is a single word, this code is simpler. As before, the F1 flag is toggled if tmpB is negative. At the end, both arguments (tmpA/tmpC and tmpB) are positive, and F1 indicates the sign of the result.

POSTIDIV

After computing the result, the POSTIDIV routine is called for signed division. The routine first checks for a signed overflow and raises a divide-by-zero interrupt if so. Next, the routine negates the quotient and remainder if necessary.6

In more detail, the CORD routine left the top bit of tmpC (the complemented quotient) in the carry flag. Now, that bit is tested. If the carry bit is 0 (NCY), then the top bit of the quotient is 1 so the quotient is too big to fit in a signed value.7 In this case, the INT0 routine is executed to trigger a type 0 interrupt, indicating a divide overflow. (This is a rather roundabout way of testing the quotient, relying on a carry bit that was set in a previous subroutine.)

             JMP NCY INT0 POSTIDIV: if overflow, trigger interrupt
             RCL tmpB      set up rotate of tmpB
Σ → no dest  NEG tmpA      get sign of tmpB, set up negate of tmpA
             JMPS NCY 5    skip if tmpB non-negative
Σ → tmpA                   otherwise negate tmpA (remainder)
             INC tmpC     5: set up increment
             JMPS F1 8     test sign flag, skip if set
             COM1 tmpC     otherwise set up complement
             CCOF RTN     8: clear carry and overflow flags, return

Next, tmpB (the divisor) is rotated to see if it is negative. (The caller loaded tmpB with the original divisor, replacing the dividend that was in tmpB previously.) If the divisor is negative, tmpA (the remainder) is negated. This implements the 8086 rule that the sign of the remainder matches the sign of the divisor.

The quotient handling is a bit tricky. Recall that tmpC holds the complemented quotient. the F1 flag is set if the result should be negative. In that case, the complemented quotient needs to be incremented by 1 (INC) to convert from 1's complement to 2's complement. On the other hand, if the quotient should be positive, 1's-complementing tmpC (COM1) will yield the desired positive quotient. In either case, the ALU is configured in POSTIDIV, but the result will be stored back in the main routine.

Finally, the CCOF micro-operation clears the carry and overflow flags. Curiously, the 8086 documentation declares that the status flags are undefined following IDIV, but the microcode explicitly clears the carry and overflow flags. I assume that the flags were cleared in analogy with MUL, but then Intel decided that this wasn't useful so they didn't document it. (Documenting this feature would obligate them to provide the same functionality in later x86 chips.)

The hardware for division

For the most part, the 8086 uses the regular ALU addition and shifts for the division algorithm. Some special hardware features provide assistance. In this section, I'll look at this hardware.

Loop counter

The 8086 has a 4-bit loop counter for multiplication and division. This counter starts at 7 for byte division and 15 for word division, based on the low bit of the opcode. This loop counter allows the microcode to decrement the counter, test for the end, and perform a conditional branch in one micro-operation. The counter is implemented with four flip-flops, along with logic to compute the value after decrementing by one. The MAXC (Maximum Count) micro-instruction sets the counter to 7 or 15 for byte or word operations respectively. The NCZ (Not Counter Zero) micro-instruction has two actions. First, it performs a conditional jump if the counter is nonzero. Second, it decrements the counter.

The F1 flag

Signed multiplication and division use an internal flag called F18 to keep track of the sign. The F1 flag is toggled by microcode through the CF1 (Complement F1) micro-instruction. The F1 flag is implemented with a flip-flop, along with a multiplexer to select the value. It is cleared when a new instruction starts, set by a REP prefix, and toggled by the CF1 micro-instruction. The diagram below shows how the F1 latch and the loop counter appear on the die. In this image, the metal layer has been removed, showing the silicon and the polysilicon wiring underneath.

The counter and F1 latch as they appear on the die. The latch for the REP state is also here.

The counter and F1 latch as they appear on the die. The latch for the REP state is also here.

X register

The division microcode uses an internal register called the X register to distinguish between the DIV and IDIV instructions. The X register is a 3-bit register that holds the ALU opcode, indicated by bits 5–3 of the instruction.9 Since the instruction is held in the Instruction Register, you might wonder why a separate register is required. The motivation is that some opcodes specify the type of ALU operation in the second byte of the instruction, the ModR/M byte, bits 5–3.10 Since the ALU operation is sometimes specified in the first byte and sometimes in the second byte, the X register was added to handle both these cases.

For the most part, the X register indicates which of the eight standard ALU operations is selected (ADD, OR, ADC, SBB, AND, SUB, XOR, CMP). However, a few instructions use bit 0 of the X register to distinguish between other pairs of instructions. For instance, it distinguishes between MUL and IMUL, DIV and IDIV, CMPS and SCAS, MOVS and LODS, or AAA and AAS. While these instruction pairs may appear to have arbitrary opcodes, they have been carefully assigned so the microcode can distinguish them.

The implementation of the X register is straightforward, consisting of three flip-flops to hold the three bits of the instruction. The flip-flops are loaded from the prefetch queue bus during First Clock and during Second Clock for appropriate instructions, as the instruction bytes travel over the bus. Testing bit 0 of the X register with the X0 condition is supported by the microcode condition evaluation circuitry, so it can be used for conditional jumps in the microcode.

Algorithmic and historical context

As you can see from the microcode, division is a complicated and relatively slow process. On the 8086, division takes up to 184 clock cycles to perform all the microcode steps. (In comparison, two registers can be added in 3 clock cycles.) Multiplication and division both loop over the bits, performing repeated addition or subtraction respectively. But division requires a decision (subtract or not?) at each step, making it even slower, about half the speed of multiplication.11

Various algorithms have been developed to speed up division. Rather than performing long division one bit at a time, you can do long division in, say, base 4, producing two quotient bits in each step. As with decimal long division, the tricky part is figuring out what digit to select. The SRT algorithm (1957) uses a small lookup table to estimate the quotient digit from a few bits of the divisor and dividend. The clever part is that the selected digit doesn't need to be exactly right at each step; the algorithm will self-correct after a wrong "guess". The Pentium processor (1993) famously had a floating point division bug due to a few missing values in the SRT table. This bug cost Intel $475 million to replace the faulty processors.

Intel's x86 processors steadily improved divide performance. The 80286 (1982) performed a word divide in 22 clocks, about 6 times as fast as the 8086. In the Penryn architecture (2007), Intel upgraded from Radix-4 to Radix-16 division. Rather than having separate integer and floating-point hardware, integer divides were handled through the floating point divider. Although modern Intel processors have greatly improved multiplication and division compared to the 8086, division is still a relatively slow operation. While a Tiger Lake (2020) processor can perform an integer multiplication every clock cycle (with a latency of 3 cycles), division is much slower and can only be done once every 6-10 clock cycles (details).

I've written numerous posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected].

Notes and references

  1. My microcode analysis is based on Andrew Jenner's 8086 microcode disassembly

  2. The 8086 patent and Andrew Jenner's microcode use the name LRCY (Left Rotate through Carry) instead of RCL. I figure that RCL will be more familiar to people because of the corresponding machine instruction. 

  3. In the dividend/quotient table, the tmpA register is on the left and the tmpC register is on the right. 0x0f00ff00 divided by 0x0ffc yielding the remainder 0x0030 (blue) and quotient 0xf04c (green). (The green bits are the complement of the quotient due to implementation in the 8086.) 

  4. I described the 8086's interrupt circuitry in detail in this post

  5. The negation code is a bit tricky because the result is split across two words. In most cases, the upper word is bitwise complemented. However, if the lower word is zero, then the upper word is negated (two's complement). I'll demonstrate with 16-bit values to keep the examples small. The number 257 (0x0101) is negated to form -257 (0xfeff). Note that the upper byte is the one's complement (0x01 vs 0xfe) while the lower byte is two's complement (0x01 vs 0xff). On the other hand, the number 256 (0x0100) is negated to form -256 (0xff00). In this case, the upper byte is the two's complement (0x01 vs 0xff) and the lower byte is also the two's complement (0x00 vs 0x00).

    (Mathematical explanation: the two's complement is formed by taking the one's complement and adding 1. In most cases, there won't be a carry from the low byte to the upper byte, so the upper byte will remain the one's complement. However, if the low byte is 0, the complement is 0xff and adding 1 will form a carry. Adding this carry to the upper byte yields the two's complement of that byte.)

    To support multi-word negation, the 8086's NEG instruction clears the carry flag if the operand is 0, and otherwise sets the carry flag. (This is the opposite of the above because subtractions (including NEG) treat the carry flag as a borrow flag, with the opposite meaning.) The microcode NEG operation has identical behavior to the machine instruction, since it is used to implement the machine instruction.

    Thus to perform a two-word negation, the microcode negates the low word (tmpC) and updates the flags (F). If the carry is set, the one's complement is applied to the upper word (tmpA). But if the carry is cleared, the two's complement is applied to tmpA. 

  6. There is a bit of ambiguity with the quotient and remainder of negative numbers. For instance, consider -27 ÷ 7. -27 = 7 × -3 - 6 = 7 * -4 + 1. So you could consider the result to be a quotient of -3 and remainder of -6, or a quotient of -4 and a remainder of 1. The 8086 uses the rule that the remainder will have the same sign as the dividend, so the first result would be used. The advantage of this rule is that you can perform unsigned division and adjust the signs afterward:
    27 ÷ 7 = quotient 3, remainder 6.
    -27 ÷ 7 = quotient -3, remainder -6.
    27 ÷ -7 = quotient -3, remainder 6.
    -27 ÷ -7 = quotient 3, remainder -6.

    This rule is known as truncating division, but some languages use different approaches such as floored division, rounded division, or Euclidean division. Wikipedia has details. 

  7. The signed overflow condition is slightly stricter than necessary. For a word division, the 16-bit quotient is restricted to the range -32767 to 32767. However, a 16-bit signed value can take on the values -32768 to 32767. Thus, a quotient of -32768 fits in a 16-bit signed value even though the 8086 considers it an error. This is a consequence of the 8086 performing unsigned division and then updating the sign if necessary. 

  8. The internal F1 flag is also used to keep track of a REP prefix for use with a string operation. I discussed string operations and the F1 flag in this post

  9. Curiously, the 8086 patent states that the X register is a 4-bit register holding bits 3–6 of the byte (col. 9, line 20). But looking at the die, it is a 3-bit register holding bits 3–5 of the byte. 

  10. Some instructions are specified by bits 5–3 in the ModR/M byte rather than in the first opcode byte. The motivation is to avoid wasting bits for instructions that use a ModR/M byte but don't need a register specification. For instance, consider the instruction ADD [BX],0x1234. This instruction uses a ModR/M byte to specify the memory address. However, because it uses an immediate operand, it does not need the register specification normally provided by bits 5–3 of the ModR/M byte. This frees up the bits to specify the instruction. From one perspective, this is an ugly hack, while from another perspective it is a clever optimization. 

  11. Even the earliest computers such as ENIAC (1945) usually supported multiplication and division. However, early microprocessors did not provide multiplication and division instructions due to the complexity of these instructions. Instead, the programmer would need to write an assembly code loop, which was very slow. Early microprocessors often had binary-coded decimal instructions that could perform additions and subtractions in decimal. One motivation for these instructions was that converting between binary and decimal was extremely slow due to the need for multiplication and division. Instead, it was easier and faster to keep the values as decimal if that was how they were displayed.

    The Texas Instruments TMS9900 (1976) was one of the first microprocessors with multiplication and division instructions. Multiply and divide instructions remained somewhat controversial on RISC (Reduced Instruction-Set Computer) processors due to the complexity of these instructions. The early ARM processors, for instance, did not support multiplication and division. Multiplication was added to ARMv2 (1986) but most ARM processors still don't have integer division. The popular open-source RISC-V architecture (2015) doesn't include integer multiply and divide by default, but provides them as an optional "M" extension.

    The 8086's algorithm is designed for simplicity rather than speed. It is a "restoring" algorithm that checks before subtracting to ensure that the current term is always positive. This can require two ALU operations (comparison and subtraction) per cycle. A slightly more complex approach is a "nonrestoring" algorithm that subtracts even if it yields a negative term, and then adds during a later loop iteration. 

How the 8086 processor's microcode engine works

The 8086 microprocessor was a groundbreaking processor introduced by Intel in 1978. It led to the x86 architecture that still dominates desktop and server computing. The 8086 chip uses microcode internally to implement its instruction set. I've been reverse-engineering the 8086 from die photos and this blog post discusses how the chip's microcode engine operated. I'm not going to discuss the contents of the microcode1 or how the microcode controls the rest of the processor here. Instead, I'll look at how the 8086 decides what microcode to run, steps through the microcode, handles jumps and calls inside the microcode, and physically stores the microcode. It was a challenge to fit the microcode onto the chip with 1978 technology, so Intel used many optimization techniques to reduce the size of the microcode.

In brief, the microcode in the 8086 consists of 512 micro-instructions, each 21 bits wide. The microcode engine has a 13-bit register that steps through the microcode, along with a 13-bit subroutine register to store the return address for microcode subroutine calls. The microcode engine is assisted by two smaller ROMs: the "Group Decode ROM" to categorize machine instructions, and the "Translation ROM" to branch to microcode subroutines for address calculation and other roles. Physically, the microcode is stored in a 128×84 array. It has a special address decoder that optimizes the storage. The microcode circuitry is visible in the die photo below.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

What is microcode?

Machine instructions are generally considered the basic steps that a computer performs. However, each instruction usually requires multiple operations inside the processor. For instance, an ADD instruction may involve computing the memory address, accessing the value, moving the value to the Arithmetic-Logic Unit (ALU), computing the sum, and storing the result in a register. One of the hardest parts of computer design is creating the control logic that signals the appropriate parts of the processor for each step of an instruction. The straightforward approach is to build a circuit from flip-flops and gates that moves through the various steps and generates the control signals. However, this circuitry is complicated and error-prone.

In 1951, Maurice Wilkes came up with the idea of microcode: instead of building the control circuitry from complex logic gates, the control logic could be replaced with another layer of code (i. e. microcode) stored in a special memory called a control store. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns the processor's control logic into a programming task instead of a difficult logic design task. Microcode also permits complex instructions and a large instruction set to be implemented without making the processor more complex (apart from the size of the microcode). Finally, it is generally easier to fix a bug in microcode than in circuit logic.

Early computers didn't use microcode, largely due to the lack of good storage technologies to hold the microcode. This changed in the 1960s; for example IBM made extensive use of microcode in the System/360 (1964). (I've written about that here.) But early microprocessors didn't use microcode, returning to hard-coded control logic with logic gates.3 This logic was generally more compact and ran faster than microcode, since the circuitry could be optimized. Since space was at a premium in early microprocessors and the instruction sets were relatively simple, this tradeoff made sense. But as microprocessor instruction sets became complex and transistors became cheaper, microcode became appealing. This led to the use of microcode in the Intel 8086 (1978) and 8088 (1979) and Motorola 68000 (1979), for instance.2

The 8086's microcode

The 8086's microcode is much simpler than in most processors, but it's still fairly complex. The code below is the microcode routine from the 8086 for a routine called "CORD", part of integer division, consisting of 16 micro-instructions. I'm not going to explain how this microcode works in detail, but I want to give a flavor of it. Each line has an address on the left (blue) and the micro-instruction on the right (yellow), specifying the low-level actions during one time step (i.e. clock cycle). Each micro-instruction performs a move, transferring data from a source register (S) to a destination register (D). (The source Σ indicates the ALU output.) For parallelism, the micro-instruction performs an operation or two at the same time as the move. This operation is specified by the "a" and "b" fields; their meanings depend on the type field. For instance, type 1 indicates an ALU instruction such as subtract (SUBT) or left-rotate through carry (LRCY). Type 4 selects two general operations such as "RTN" which returns from a microcode subroutine. Type 0 indicates a jump operation; "UNC 10" is an unconditional jump to line 10 while "CY 13" jumps to line 13 if the carry flag is set. Finally, the "F" field indicates if the condition code flags should be updated. The key points are that the micro-instructions are simple and execute in one clock cycle, they can perform multiple operations in parallel to maximize performance, and they include control-flow operations such as conditional jumps and subroutines.

An example of a microcode routine. The CORD routine implements integer division with subtracts and left rotates. This is from patent 4,449,184.

An example of a microcode routine. The CORD routine implements integer division with subtracts and left rotates. This is from patent 4,449,184.

Each instruction is stored at a 13-bit address (blue) which consists of 9 bits shown explicitly and a 4-bit sequence counter "CR". The eight numbered address bits usually correspond to the machine instruction's opcode. The "X" bit is an extra bit to provide more address space for code that is not directly tied to a machine instruction, such as reset and interrupt code, address computation, and the multiply/divide algorithms.

A micro-instruction is encoded into 21 bits as shown below. Every micro-instruction contains a move from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits is a bit tricky since it depends on the type field, which is two or three bits long. The "short jump" (type 0) is a conditional jump within the current block of 16 micro-instructions. The ALU operation (type 1) sets up the arithmetic-logic unit to perform an operation. Bookkeeping operations (type 4) are anything from flushing the prefetch queue to ending the current instruction. A memory read or write is type 6. A "long jump" (type 5) is a conditional jump to any of 16 fixed microcode locations (specified in an external table). Finally, a "long call" (type 7) is a conditional subroutine call to one of 16 locations (different from the jump targets).

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

This "vertical" microcode format reduces the storage required for the microcode by encoding control signals into various fields. However, it requires some decoding logic to process the fields and generate the low-level control signals. Surprisingly, there's no specific "microcode decoder" circuit. Instead, the logic is scattered across the chip, looking for various microcode bit patterns to generate control signals where they are needed.

How instructions map onto the ROM

One interesting issue is how the micro-instructions are organized in the ROM, and how the right micro-instructions are executed for a particular machine instruction. The 8086 uses a clever mapping from the machine instruction to a microcode address that allows machine instructions to share microcode.

Different processors use a variety of approaches to microcode organization. One technique is for each micro-instruction to contain a field with the address of the next micro-instruction. This provides complete flexibility for the arrangement of micro-instructions, but requires a field to hold the address, increasing the number of bits in each micro-instruction. A common alternative is to execute micro-instructions sequentially, with a micro-program-counter stepping through each micro-address unless there is an explicit jump to a new address. This approach avoids the cost of an address field in each instruction, but requires a program counter with an incrementer, increasing the hardware complexity.

The 8086 uses a hybrid approach. A 4-bit program counter steps through the bottom 4 bits of the address, so up to 16 micro-instructions can be executed in sequence without a jump. This approach has the advantage of requiring a smaller 4-bit incrementer for the program counter, rather than a 13-bit incrementer. The microcode engine provides a "short jump" operation that makes it easy to jump within the group of 16 instructions using a 4-bit jump target, rather than a full 13-bit address.

Another important design decision in microcode is how to determine the starting micro-address for each machine instruction. In other words, if you want to do an ADD, where does the microcode for ADD start? One approach is a table of starting addresses: the system looks in the table to find the starting address for ADD, but this requires a large table of 256 entries. A second approach is to use the opcode code value as the starting address. That is, an ADD instruction 0x05 would start at micro-address 5. This approach has two problems. First, you can't run the microcode sequentially since consecutive micro-instructions belong to different machine instructions. Moreover, you can't share microcode since each instruction has a different address in the microcode ROM.

The 8086 solves these problems in two ways. First, the machine instructions are spaced sixteen slots apart in the microcode. In other words, the opcode is multiplied by 16 (has four zeros appended) to form the starting address in the microcode ROM, so there is plenty of space to implement each machine instruction. The second technique is that the ROM's addressing is partially decoded rather than fully decoded, so multiple micro-addresses can correspond to the same physical storage.4

To make this concrete, consider the 8086's arithmetic-logic instructions: one-byte add register to memory, one-byte add memory to register, one-word subtract memory from register, one-word xor register to memory, and so forth. There are 8 ALU operations and each can be byte- or word-sized, with memory as source or destination. This yields 32 different machine opcodes. These opcodes were carefully assigned, so they all have the format 00xxx0xx. The ROM address decoder is designed to look for three 0 bits in those positions, and ignore the other bits, so it will match that pattern. The result is that all 32 of these ALU instructions activate the same ROM column select line, and thus they all share the same microcode, shrinking the size of the ROM.

The microcode ROM's physical layout

The microcode ROM holds 512 words of 21 bits, so the obvious layout would be 512 columns and 21 rows. However, these dimensions are not practical for physically building the ROM because it would be too long and skinny. Instead, the ROM is constructed by grouping four words in each column, resulting in 128 columns of 84 rows, much closer to square. Not only does this make the physical layout more convenient, but it also reduces the number of column decoders from 512 to 128, reducing the circuitry size. Although the ROM now requires 21 multiplexers to select which of the four rows corresponds to each output bit, the circuitry is still much smaller. There is a tradeoff with the ability to merge addresses together by ignoring bits, though. Each decoder now selects a column of four words, rather than a single word, so each block of four words must have consecutive addresses.

The main components of the microcode engine. The metal layer has been removed to show the silicon and polysilicon underneath. If you zoom in, the bit pattern is visible in the silicon doping pattern.

The main components of the microcode engine. The metal layer has been removed to show the silicon and polysilicon underneath. If you zoom in, the bit pattern is visible in the silicon doping pattern.

The image above shows how microcode is stored and accessed. At the top is the 13-bit microcode address register, which will be discussed in detail below. The column selection circuit decodes 11 of the 13 address bits to select one column of the microcode storage. At the left, multiplexers select one bit out of each four rows using the two remaining address bits (specifically, the two lowest sequence bits). The selected 21 microcode outputs are latched and fed to the rest of the processor, where they are decoded as described earlier and control the processor's actions.

Optimizing the microcode

In 1978, the number of bits that could be stored in the microcode ROM was rather limited. In particular, the 8086 holds only 512 micro-instructions. Since it has approximately 256 machine-code instructions in its one-byte opcode, combined with multiple addressing modes, and each instruction requires multiple micro-instructions, compression and optimization were necessary to make the microcode fit.5 The main idea was to move functionality out of the microcode and into discrete logic when it made sense. I'll describe some of the ways they did this.

The 8086 has an arithmetic-logic unit (ALU) that performs operations such as addition and subtraction, as well as logical operations such as AND and XOR. Consider the machine instruction ADD, implemented with a few micro-operations that compute the memory address, fetch data, perform the addition, and store the result. The machine instructions for subtraction, AND, or XOR require identical steps, except that the ALU performs a different operation. In total, the 8086 has eight ALU-based operations that are identical except for the operation performed by the ALU.6 The 8086 uses a "trick" where these eight machine instructions share the same microcode. Specifically, the microcode tells the ALU to perform a special operation XI, which indicates that the ALU should look at the appropriate bits of the instruction and do the appropriate operation.7 This shrinks the microcode for these operations by a factor of eight, at the cost of requiring additional logic for the ALU. In particular, the ALU control circuitry has a register to hold the relevant instruction bits, and a PLA to decode these bits into low-level ALU control signals.

Similarly, the 8086 has eight machine instructions to increment a specific register (out of a set of 8), and eight instructions to decrement a register. All 16 instructions are handled by the same set of micro-instructions and the ALU does the increment or decrement as appropriate. Moreover, the register control circuitry determines which register is specified by the instruction, without involvement from the microcode.

Another optimization is that the 8086 has many machine instructions in pairs: an 8-bit version and a 16-bit version. One approach would be to have separate microcode for the two instructions, one to handle a single byte and one to handle two bytes. Instead, the machine instructions share microcode. The complexity is moved to the circuitry that moves data on the bus: it looks at the low bit of the instruction to determine if it should process a byte or a word. This cuts the microcode size in half for the many affected instructions.

Finally, simple instructions that can take place in one cycle are implemented with logic gates, rather than through microcode. For instance, the CLC (clear carry flag) instruction updates the flag directly. Similarly, prefix instructions for segment selection, instruction locking, or repetition are performed in logic. These instructions don't use any microcode at all, which will be important later below.

Using techniques such as these, about 75 different instruction types are implemented in the microcode (instead of about 256), making the microcode much smaller. The tradeoff is that the 8086 requires more logic circuitry, but the designers found the tradeoff to be worthwhile.

The ModR/M byte

There's another complication for 8086 microcode, however. Most 8086 instructions have a second byte: the ModR/M byte, which controls the addressing mode for the instructions in a complex way (shown below). This byte gives 8086 instructions a lot of flexibility: you can use two registers, a register and a memory location, or a register and an "immediate" value specified in the instruction. The memory location can be specified by 8 index register combinations with a one-byte or two-byte displacement optionally added. (This is useful for accessing data in an array or structure, for instance.) Although these addressing modes are powerful, they pose a problem for the microcode.

A summary of the ModR/M byte, from MCS-86 Assembly Language Reference Guide.

A summary of the ModR/M byte, from MCS-86 Assembly Language Reference Guide.

These different addressing modes need to be implemented in microcode, since different addressing modes require different sequences of steps. In other words, you can't use the previous trick of pushing the problem into logic gates. And you clearly don't want a separate implementation of each instruction for each addressing mode since the size of the microcode would multiply out of control.

The solution is to use a subroutine (in microcode) to compute the memory address. Thus, instructions can share the microcode for each addressing mode. This adds a lot of complexity to the microcode engine, however, since it needs to store the micro-address for a micro-subroutine-call so it can return to the right location. To support this, the microcode engine has a register to hold this return address. (Since it doesn't have a full stack, you can't perform nested subroutine calls, but this isn't a significant limitation.)

The microcode ends up having about 10 subroutines for the different addressing modes, as well as four routines for the different sizes of displacement. (The 8 possibilities for source registers are handled in the register selection logic, rather than microcode.) Thus, the microcode handles the 256 different addressing modes with about 14 short routines that add the appropriate address register(s) and the displacement to obtain the memory address.

One more complication is that machine instructions can switch the source and destination specified by the ModR/M byte, depending on the opcode. For example, one subtract instruction will subtract a memory location from a register, while a different subtract instruction subtracts a register from a memory location. The two variants are distinguished by bit 1 of the instruction, the "direction" bit. These variants are handled by the control logic, so the microcode can ignore them. Specifically, before the source and destination specifications go to the register control circuitry, a crossover circuit can swap them based on the value of the direction bit.

The Translation ROM

As explained above, the starting address for a machine instruction is derived directly from the instruction's opcode. However, the microcode engine needs a mechanism to provide the address for jump and call operations. In the 8086, this address is hard-coded into the Translation ROM, which provides a 13-bit address.8 It holds ten destination addresses for jump operations and ten (different) addresses for call operations.

A second role of the Translation ROM is to hold target addresses for each ModR/M addressing mode, pointing to the code to compute the effective address. As a complication, two of the jump table entries in the Translation ROM are implemented with conditional logic, depending on whether or not the instruction's memory address calculation includes a displacement. By wiring this condition into the Translation ROM, the microcode avoids the need to test this condition.

The image below shows how the Translation ROM appears on the die. It is implemented as a partially-decoded ROM with multiplexed inputs.9 The inputs are at the bottom left. For a jump or call, the ROM uses 4 input bits from the microcode output, since the microcode selects the jump targets. For an address computation, it takes 5 bits from the instruction's ModR/M byte, so the routine is selected by the instruction. The ROM has additional input bits to select the mode (jump, call, or address) and for the conditional jumps. The decoding logic (left half) activates a row in the right half, generating the output address. This address exits at the bottom and is loaded into the micro-address register below the Translation ROM.

The Translation ROM holds addresses of routines in the microcode.

The Translation ROM holds addresses of routines in the microcode.

The Group Decode ROM

In the discussion above, I've discussed how various categories of instructions are optimized. For instance, many instructions have a bit that selects if they act on a byte or a word. Many instructions have a bit to reverse the direction of the operation's memory and register accesses. These features are implemented in logic rather than microcode. Other instructions are implemented outside microcode entirely. How does the 8086 determine which way to process an instruction?

The Group Decode ROM takes an instruction opcode and generate 15 signals that indicate various categories of instructions that are handled differently.10 The outputs from the Group Decode ROM are used by various logic circuits to determine how to handle the instruction. Some cases affect the microcode, for instance calling a microcode addressing routine if the instruction has a ModR/M byte. In other cases, these signals act "downstream" of the microcode, for example to determine if the operation should act on a byte or a word. Other signals cause the microcode to be bypassed completely.

A closeup of the Group Decode ROM. The circuit uses two layers of NOR gates to generate the output signals from the opcode inputs. This image shows a composite of the metal, polysilicon, and silicon layers.

A closeup of the Group Decode ROM. The circuit uses two layers of NOR gates to generate the output signals from the opcode inputs. This image shows a composite of the metal, polysilicon, and silicon layers.

Specially-encoded instructions

For most of the 8086 instructions, the first byte specifies the instruction. However, the 8086 has a few instructions where the ModR/M byte completely changes the meaning of the first byte. For instance, opcode 0xF6 (Grp 1 below) can be a TEST, NOT, NEG, MUL, IMUL, DIV, or IDIV instruction based on the value of the ModR/M byte. Similarly, opcode 0xFE (Grp 2) indicates an INC, DEC, CALL, JMP, or PUSH instruction.11

The 8086 instruction map for opcodes 0xF0 to 0xFF. Based on MCS-86 Assembly Language Reference Guide.

The 8086 instruction map for opcodes 0xF0 to 0xFF. Based on MCS-86 Assembly Language Reference Guide.

This encoding may seem a bit random, but there's a reason behind it. Most instructions act on a source and a destination. But some, such as INC (increment) use the same register or memory location for the source and the destination. Others, such as CALL or JMP, only use one address. Thus, the "reg" field in the ModR/M byte is redundant. Since these bits would be otherwise "wasted", they are used instead to specify different instructions. (There are only 256 single-byte opcodes, so you want to make the best use of them.)

The implementation of these instructions in microcode is interesting. Since the instructions share the same first byte, the standard microcode mapping would put them at the same microcode address. However, these instructions are treated specially, with the "reg" field from the ModR/M byte copied into the lower bits of the microcode address. In effect, the instructions are treated as opcodes 0xF0 through 0xFF, so the different instruction variants execute at separate microcode addresses. You might expect a collision with the opcodes that really have the values 0xF0 through 0xFF. However, the 8086 opcodes were cleverly arranged so none of the other instructions in this range use microcode. As you can see above, the other instructions are prefixes (LOCK, REP, REPZ), halt (HLT), or flag operations (CMC, CLC, STC, CLI, STI, CLD, STD), all implemented outside microcode. Thus, the range 0xF0-0xFF is freed up for the "expanded" instructions.

The hardware implementation for this is not too complex. The Group ROM produces an output for these special instructions. This causes the microcode address register to load the appropriate bits from the ModR/M byte, causing the appropriate microcode routine to be executed.

The microcode address register

The heart of the microcode engine is the microcode address register, which determines which microcode address to execute. As described earlier, the microcode address is 13 bits, of which 8 bits generally correspond to the instruction opcode, one bit is an extra "X" instruction bit, and 4 bits are sequentially incremented. The diagram below shows how the circuitry for the bits is arranged. The 9 instruction bits each have a nearly-identical circuit. The sequence bits have more circuitry and each one is different, because the circuit to increment the address is different for each bit.

Layout of the microcode address register. Each bit has a roughly vertical block of circuitry.

Layout of the microcode address register. Each bit has a roughly vertical block of circuitry.

The schematic below shows the circuitry for one bit in the microcode address register. It has two flip-flops: one to hold the current address bit and one to hold the old address while performing a subroutine call. A multiplexer (mux) selects the input to each flip-flop. For instance, if the microcode is waiting for a memory access, the "hold" input to the multiplexer causes the current address to loop around and get reloaded into the flip-flop. For a subroutine call, the "call" input saves the current address in the subroutine flip-flop. Conversely, when returning from a subroutine, the "return" input loads the old address from the subroutine flip-flop. The address flip-flop also has inputs to load the instruction as the address, to load an address from the translation ROM, or to load an interrupt microcode handler address. The circuit sends the address bit (and inverted address bit) to the microcode ROM's address decoder.

Schematic of a typical bit in the microcode address register.

Schematic of a typical bit in the microcode address register.

Each bit has some special-case handling, so this schematic should be viewed as an illustration, not an accurate wiring diagram. In particular, the sequence bits also have inputs from the incrementer, so they can step to the next address. The low-order bits have instruction inputs to handle the specially-encoded "group" instructions discussed in the previous section.

The control signals for the multiplexers are generated from various sources. A circuit called the loader starts processing of an instruction, synchronized to the prefetch queue and instruction fetch from memory. The call and return operations are microcode instructions. The Group Decode ROM controls some of the inputs, for instance to process a ModR/M byte. Thus, there is a moderate amount of conditional logic that determines the microcode address and thus what microcode gets executed.

Conclusions

This has been a lot of material, so thank you for sticking with it to the end. I draw three conclusions from studying the microcode engine of the 8086. First, the implementation of microcode is considerably more complex than the clean description of microcode that is presented in books. A lot of functionality is implemented in logic outside of microcode, so it's not a "pure" microcode implementation. Moreover, there are many optimizations and corner cases. The microcode engine has two supporting ROMs: the Translation ROM and the Group Decode ROM. Even the microcode address register has complications.

Second, the need for all these optimizations shows how the 8086 was just on the edge of what was practical. The designers clearly went to a lot of effort to get the microcode to fit in the space available.

Finally, looking at the 8086 in detail shows how complex its instruction set is. I knew in the abstract that it was much more convoluted than, say, an ARM chip. But seeing all the special case circuitry on the die to handle the corner cases of the instruction set really makes this clear.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected]. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

  1. The 8086 microcode was disassembled (link) a couple of years ago by Andrew Jenner by extracting the bits from my die photos. My post here is a bit different, looking at hardware that runs the microcode, rather than the contents of the microcode. 

  2. According to Wikipedia, the Zilog Z8000 (1979) didn't use microcode, which is a bit surprising for that timeframe. This design decision had the advantage of reducing transistor count, but the disadvantage of hard-to-fix logic bugs in the instruction decoder. 

  3. As an example of a non-microcoded processor, the MOS 6502 (1975) used a PLA (Programmable Logic Array) to perform much of the decoding (details). A PLA provides a structured way of implementing logic gates in a relatively dense array. A PLA is kind of like a ROM—a PLA can implement a ROM and vice versa—so it can be hard to see the difference. The usual distinction is that only one row of a ROM is active at a time, the address that you're reading out. A PLA is more general since multiple rows can be active at a time, combined to form the outputs.

    The Z80 had a slightly different implementation. It used a smaller PLA to perform basic decoding of instructions into various types. It then generated control signals through a large amount of "random" logic (so-called because of its appearance, not because it's actually random). This logic combined instruction types and timing signals to generate the appropriate control signals. 

  4. In a "normal" ROM, the address bits are decoded with logic gates so each unique address selects a different storage column in the ROM. However, parts of the decoder circuitry can "ignore" bits so multiple addresses select the same storage column. For a hypothetical example, suppose you're decoding 5-bit addresses. Instead of decoding 11010 and 11011 separately, you could ignore the last bit so both addresses access the same ROM data. or you could ignore the last three bits so all 8 addresses of the form 11xxx go to the same ROM location (where x indicates a bit that can be either 0 or 1). This makes the ROM more like a PLA (Programmable Logic Array), but still accessing a single row at a time. 

  5. The Intel 8087 was the floating point co-processor for the 8086. The 8087 required a lot of microcode, more than could fit in a standard ROM on the die. To get the microcode to fit, Intel created a special ROM that stored two bits per transistor (instead of one) by using four different sizes of transistors to generate four different voltages. Analog circuitry converted each voltage level into two bits. This complex technique doubled the density (at least in theory), allowing the microcode to fit on the chip. I wrote about the 8087's non-binary ROM here

  6. The ALU operations that are grouped together are add, add with carry, subtract, subtract with borrow, logical AND, logical XOR, logical OR, and compare. The compare operation may seem out of place in this list, but it is implemented as a subtract operation that updates the condition flags without changing the value. Operations such as increment, decrement, negation, and logical NOT may seem like they should be included, but since they operate on a single argument instead of two, they are implemented differently at the microcode level. Increment and decrement are combined in the microcode, however. Negation and logical NOT could be combined except that negation affects the condition code flags, while NOT doesn't, so they need separate microcode. (This illustrates how subtle features of the instruction set can have more impact than you might expect.) Since the ALU doesn't have hardware multiplication and division, the multiplication and division operations are implemented separately in microcode with loops. 

  7. The ALU itself isn't examining instruction bits to decide what to do. There's some control circuitry next to the ALU that uses a PLA (Programmable Logic Array) to examine the instruction bits and the microcode's ALU command to generate low-level control signals for the ALU. These signals control things such as carry propagation, argument negation, and logical operation selection to cause the ALU to perform the desired operation. 

  8. The Translation ROM has one additional output: a wire indicating an address mode that uses the BP register. This output goes to the segment register selection circuitry and selects a different segment register. The reason is that the 8086 uses the Data Segment by default for effective address computation, unless the address mode uses BP as a base register. In that case, the Stack Segment is used. This is an example of how the 8086 architecture is not orthogonal and has lots of corner cases. 

  9. You can also view the Translation ROM as a PLA (Programmable Logic Array) constructed from two layers of NOR gates. The conditional entries make it seem more like a PLA than a ROM. Technically, it can be considered a ROM since a single row is active at a time. I'm using the name "Translation ROM" because that's what Intel calls it in the patents. 

  10. Although the Group Decode ROM is called a ROM in the patent, I'd consider it more of a PLA (programmable logic array). Conceptually it holds 256 words, one for each instruction. But its implementation is an array of logic functions. 

  11. These instructions were called "Group 1" and "Group 2" instructions in the 8086 documentation. Later Intel documentation renamed them as "Unary Group 3", "INC/DEC Group 4" and "Indirect Group 5". Some details are here. The 8086 has two other groups of instructions where the reg field defines the instruction: the "Immediate" instructions 0x80-0x83 and the "Shift" instructions 0xD0-0xD3. For these opcodes, the different instructions were implemented by the ALU. As far as the microcode was concerned, these were "normal" instructions so I won't discuss them in this post.

    I should mention that although the 8086 opcodes are always expressed in hexadecimal, the encoding makes much more sense if you look at it in octal. Details here. The octal encoding also applies to other related chips including the 8008, 8080, and Z80.