Showing posts with label chips. Show all posts
Showing posts with label chips. Show all posts

Reverse-engineering the register codes for the 8086 processor's microcode

Like most processors, the Intel 8086 (1978) provides registers that are faster than main memory. As well as the registers that are visible to the programmer, the 8086 has a handful of internal registers that are hidden from the user. Internally, the 8086 has a complicated scheme to select which register to use, with a combination of microcode and hardware. Registers are assigned a 5-bit identifying number, either from the machine instruction or from the microcode. In this blog post, I explain how this register system works.

My analysis is based on reverse-engineering the 8086 from die photos. The die photo below shows the chip under a microscope. For this die photo, I removed the the metal and polysilicon layers, revealing the silicon underneath. I've labeled the key functional blocks; the ones that are important to this post are darker. In particular, the registers and the Arithmetic/Logic Unit (ALU) are at the left and the microcode ROM is in the lower right. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles bus and memory activity as well as instruction prefetching, while the Execution Unit (EU) executes the instructions.

The 8086 die under a microscope, with main functional blocks labeled.  Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. Click on this image (or any other) for a larger version.

Microcode

Most people think of machine instructions as the basic steps that a computer performs. However, many processors (including the 8086) have another layer of software underneath: microcode. With microcode, instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.

The 8086 uses a hybrid approach: although it uses microcode, much of the instruction functionality is implemented with gate logic. This approach removed duplication from the microcode and kept the microcode small enough for 1978 technology. In a sense, the microcode is parameterized. For instance, the microcode can specify a generic Arithmetic/Logic Unit (ALU) operation and a generic register. The gate logic examines the instruction to determine which specific operation to perform and the appropriate register.

A micro-instruction in the 8086 is encoded into 21 bits as shown below. Every micro-instruction has a move from a source register to a destination register, each specified with 5 bits; this encoding is the main topic of this blog post. The meaning of the remaining bits depends on the type field and can be anything from an ALU operation to a memory read or write to a change of microcode control flow. For more about 8086 microcode, see my microcode blog post.

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

Let's look at how the machine instruction XCHG AX,reg is implemented in microcode. This instruction exchanges AX with the register specified in the low 3 bits of the instruction.1 The microcode for this instruction consists of three micro-instructions, so the instruction takes three clock cycles. Each micro-instruction contains a move, which is the interesting part.2 The specified register is moved to the ALU's temporary B register, the AX register is moved to the specified register, and finally the temporary B is moved to AX, completing the swap.

 move       action
M → tmpB           XCHG AX,rw: move reg to tmpB
AX → M      NXT     move AX to reg, Next to last
tmpB → AX   RNI     move tmpB to AX, Run Next Instruction

The key part for this discussion is how M indicates the desired register. Suppose the instruction is XCHG AX,DX. The bottom three bits of the instruction are 010, indicating DX. During the first clock cycle of instruction execution, the opcode byte is transferred from the prefetch queue over the queue bus. The M register is loaded with the DX number (which happens to be 26), based on the bottom three bits of the instruction. After a second clock cycle, the microcode starts. The first micro-instruction puts M's value (26) onto the source bus and the number for tmpB (13) on the destination bus, causing the transfer from DX to tmpB. The second micro-instruction puts the AX number (24) onto the source bus and the M value (26) onto the destination bus, causing the transfer from AX to DX. The third micro-instruction puts tmpB number (13) onto the source bus and the AX number (24) onto the destination bus, causing the transfer from tmpB to AX.

Thus, the values on the source and destination bus control the data transfer during each micro-instruction. Microcode can either specify these values explicitly (as for AX and tmpB) or can specify the M register to use the register defined in the instruction. Thus, the same microcode implements all the XCHG instructions and the microcode doesn't need to know which register is involved.

The register encoding

The microcode above illustrated how different numbers specified different registers. The table below shows how the number 0-31 maps onto a register. Some numbers have a different meaning for a source register or a destination register; a slash separates these entries.

0ES8AL16AH24AX
1CS9CL17CH25CX
2SS10DL18DH(M)26DX
3DS11BL19BH(N)27BX
4PC12tmpA20Σ/tmpAL28SP
5IND13tmpB21ONES/tmpBL29BP
6OPR14tmpC22CR/tmpAH30SI
7Q/-15F23ZERO/tmpBH31DI

Most of these entries are programmer-visible registers: the segment registers are in green, the 8-bit registers in blue, and the 16-bit registers in red. Some internal registers and pseudo-registers are also accessible: IND (Indirect register), holding the memory address for a read or write; OPR (Operand register), holding the data for a read or write; Q (Queue), reading a byte from the instruction prefetch queue; ALU temporary registers A, B, and C, along with low (L) and (H) bytes; F, Flags register; Σ, the ALU output; ONES, all ones; CR, the three low bits of the microcode address; and ZERO, the value zero. The M and N entries can only be specified from microcode, taking the place of DH and BH.

The table is kind of complicated, but there are reasons for its structure. First, machine instructions in the 8086 encode registers according to the system below. The 5-bit register number above is essentially an extension of the instruction encoding. Moreover, the AX/CX/DX/BX registers (red) are lined up with their upper-byte and lower-byte versions (blue). This simplifies the hardware since the low three bits of the register number select the register, while the upper two bits perform the byte versus word selection.3 The internal registers fit into available spots in the table.

The register assignments, from MCS-86 Assembly Language Reference Guide.

The register assignments, from MCS-86 Assembly Language Reference Guide.

The ModR/M byte

Many of the 8086 instructions use a second byte called the ModR/M byte to specify the addressing modes.4 The ModR/M byte gives the 8086 a lot of flexibility in how an instruction accesses its operands. The byte specifies a register for one operand and either a register or memory for the other operand. The diagram below shows how the byte is split into three fields: mod selects the overall mode, reg selects a register, and r/m selects either a register or memory mode. For a ModR/M byte, the reg and the r/m fields are read into the N and M registers respectively, so the registers specified in the ModR/M byte can be accessed by the microcode.

modregr/m
76543210

Let's look at the instruction SUB AX,BX which subtracts BX from AX. In the 8086, some important processing steps take place before the microcode starts. In particular, the "Group Decode ROM" categorizes the instruction into over a dozen categories that affect how it is processed, such as instructions that are implemented without microcode, one-byte instructions, or instructions with a ModR/M byte. The Group Decode ROM also indicates the structure of instructions, such as instructions that have a W bit selecting byte versus word operations, or a D bit reversing the direction of the operands. In this case, the Group Decode ROM classifies the instruction as containing a D bit, a W bit, an ALU operation, and a ModR/M byte.

Based on the Group Decode ROM's signals, fields from the opcode and ModR/M bytes are extracted and stored in various internal registers. The ALU operation type (SUB) is stored in the ALU opr register. The ModR/M byte specifies BX in the reg field and AX in the r/m field so the reg register number (BX, 27) is stored in the N register, and the r/m register number (AX, 24) is stored in the M register.

Once the preliminary decoding is done, the microcode below for this ALU instruction is executed.5 There are three micro-instructions, so the instruction takes three clock cycles. First, the register specified by M (i.e. AX) is moved to the ALU's temporary A register (tmpA). Meanwhile, XI configures the ALU to perform the operation specified by the instruction bits, i.e. SUB. The second micro-instruction moves the register specified by N (i.e. BX) to the ALU's tmpB register. The last micro-instruction stores the ALU's result (Σ, number 20) in the register indicated by M (i.e. AX).

  move       action
M → tmpA     XI tmpA   ALU rm↔r: AX to tmpA
N → tmpB     NXT       BX to tmpB
Σ → M        RNI F     result to AX, update flags

One of the interesting features of the 8086 is that many instructions contain a D bit that reverses the direction of the operation, swapping the source and the destination. If we keep the ModR/M byte but use the SUB instruction with the D bit set, the instruction becomes SUB BX,AX, subtracting AX from BX, the opposite of before. (Swapping the source and destination is more useful when one argument is in memory. But I'll use an example with two registers to keep it simple.) This instruction runs exactly the same microcode as before. The difference is that when the microcode accesses M, due to the direction bit it gets the value in N, i.e. BX instead of AX. The access to N is similarly swapped. The result is that AX is subtracted from BX, and the change of direction is transparent to the microcode.

The M and N registers

Now let's take a closer look at how the M and N registers are implemented. Each register holds a 5-bit register number, expanded from the three bits of the instruction. The M register is loaded from the three least significant bits of the instruction or ModR/M byte, while the N register is loaded with bits three through five. Most commonly, the registers are specified by the ModR/M byte, but some instructions specify the register in the opcode.6

The table below shows how the bits in the instruction's opcode or ModR/M byte (i5, i4, i3) are converted to a 5-bit number for the N register. There are three cases: a 16-bit register, an 8-bit register, and a segment register. The mappings below may seem random, but they result in the entries shown in the 5-bit register encoding table earlier. I've colored the entries so you can see the correspondence.

Mode43210
16-bit reg11i5i4i3
8-bit regi5i5'0i4i3
segment reg000i4i3

I'll go through the three cases in more detail. Many 8086 instructions have two versions, one that acts on bytes and one that acts on words, distinguished by the W bit (bit 0) in the instruction. If the Group Decode ROM indicates that the instruction has a W bit and the W bit is 0, then the instruction is a byte instruction.7 If the instruction has a ModR/M byte and the instruction operates on a byte, the N register is loaded with the 5-bit number for the specified byte register. This happens during "Second Clock", the clock cycle when the ModR/M byte is fetched from the instruction queue. The second case is similar; if the instruction operates on a word, the N register is loaded with the number for the word register specified in the ModR/M byte.

The third case handles a segment register. The N register is loaded with a segment register number during Second Clock if the Group Decode ROM indicates the instruction has a ModR/M byte with a segment-register field (specifically the segment register MOV instructions). A bit surprisingly, a segment register number is also loaded during First Clock. This supports the PUSH and POP segment register instructions, which have the segment register encoded in bits 3 and 4 of the opcode.8

The table below shows how the bits are assigned in the M register, which uses instruction bits i2, i1, and i0. The cases are a bit more complicated than the N register. First, a 16-bit register number is loaded from the opcode byte during First Clock to support instructions that specify the register in the low bits. During Second Clock, this value may be replaced.

For a ModR/M byte using register mode, the M register is reloaded with the specified 8-bit or a 16-bit register, depending on the byte mode signal described earlier. However, for a ModR/M byte that uses a memory mode, the M register is loaded with OPR (Operand), the internal register that holds the word that is read or written to memory.

Mode43210
16-bit reg11i2i1i0
8-bit regi2i2'0i1i0
OPR00110
AX/ALbyte'1000
convert to 8-bitm2m2'0m1m0

Many instructions use the AX or AL register, such as the ALU immediate instructions, the input and output instructions, and string instructions. For these, the Group Decode ROM triggers the AX or AL register number specifically to be loaded into the M register during Second Clock. The top bit is set for a word operation and cleared for a byte operation providing AX or AL as appropriate.

The final M register case is a bit tricky. For an immediate move instruction such as MOV BX,imm, bit 3 switches between a byte and a word move (rather than bit 0), because bits 2-0 specify the register. Unfortunately, the Group Decode ROM outputs aren't available during First Clock to indicate this case. Instead, M is loaded during First Clock with the assumption of a 16-bit register. If that turns out to be wrong, the M register is converted to an 8-bit register number during Second Clock by shuffling a few bits.

Producing the source and destination values

There are three cases for the number that goes on the source or destination buses: the register number can come from the micro-instruction, the value can come from the M or N register as specified in the micro-instruction, or the value can come from the M and N register with the roles swapped by the D bit. (Note that the source and destination can be different cases and are selected with separate circuitry.)

The first case is the default case, where the 5 bits from the micro-instruction source or destination specify the register directly. For instance, in the micro-instruction tmpB→AX, the microcode knows which registers are being used and specifies them directly.

The second and third cases involve more logic. Consider the source in M→tmpB. For an instruction without a D bit, the register number is taken from M. Likewise if the D bit is 0. But if the instruction uses a D bit and the D bit is 1, then the register number is taken from N. Multiplexers between the M and N registers select the appropriate register to put on the bus.

The M and N registers as they appear on the die. The metal layer has been removed from this image to show the silicon and polysilicon underneath.

The M and N registers as they appear on the die. The metal layer has been removed from this image to show the silicon and polysilicon underneath.

The diagram above shows how the M and N register circuitry is implemented on the die, with the N register at the top and the M register below. Each register has an input multiplexer that implements the tables above, selecting the appropriate 5 bits depending on the mode. The registers themselves are implemented as dynamic latches driven by the clock. In the middle, a crossover multiplexer drives the source and destination buses, selecting the M and N registers as appropriate and amplifying the signals with relatively large transistors. The third output from the multiplexer, the bits from the micro-instruction, is implemented in circuitry physically separated and closer to the microcode ROM.

The register selection hardware

How does the 5-bit number select a register? The 8086 has a bunch of logic that turns a register number into a control line that enables reading or writing of the register. For the most part, this logic is implemented with NOR gates that match a particular register number and generate a select signal. The signal goes through a special bootstrap driver to boost its voltage since it needs to control 16 register bits.

The 8086 registers are separated into two main groups. The "upper registers" are in the upper left of the chip, in the Bus Interface Unit. These are the registers that are directly involved with memory accesses. The "lower registers" are in the lower left of the chip, in the Execution Unit. From bottom to top, they are AX, CX, DX, BX, SP, BP, SI, and DI; their physical order matches their order in the instruction set.9 A separate PLA (Programmable Logic Array) selects the ALU temporary registers or flags as destination. Just below it, a PLA selects the source from ALU temporary registers, flags, or the ALU result (Σ).10 I've written about the 8086's registers and their low-level implementation here if you want more information.

Some history

The 8086's system of selecting registers with 3-bit codes originates with the Datapoint 2200,11 a desktop computer announced in 1970. The processor of the Datapoint 2200 was implemented with a board of TTL integrated circuits, since this was before microprocessors. Many of the Datapoint's instructions used a 3-bit code to select a register, with a destination register specification in bits 5-3 of the instruction and a source register in bits 2-0. (This layout is essentially the same as in 8086 instructions and the ModR/M byte.)12 The eight values of this code selected one of 7 registers, with the eighth value indicating a memory access. Intel copied the Datapoint 2200 architecture for the 800813 microprocessor (1972) and cleaned it up for the 8080 (1974), but kept the basic instruction layout and register/memory selection bits.

The 8086's use of a numbering system for all the registers goes considerably beyond this pattern, partly because its registers function both as general-purpose registers and special-purpose registers.14 Many instructions can act on the AX, BX, etc. registers interchangeably, treating them as general-purpose registers. But these registers each have their own special purposes for other instructions, so the microcode must be able to access them specifically. This motivates the 8086's approach where registers can be treated as general-purpose registers that are selected from instruction bits, or as special-purpose registers selected by the microcode.

The Motorola 68000 (1979) makes an interesting comparison to the 8086 since they were competitors. The 68000 uses much wider microcode (85-bit microinstructions compared to 21 bits in the 8086). It has two main internal buses, but instead of providing generic source and destination transfers like the 8086, the 68000 has a much more complicated system: about two dozen microcode fields that connect registers and other components to the bus in various ways.15

Conclusions

Internally, the 8086 represents registers with a 5-bit number. This is unusual compared to previous microprocessors, which usually selected registers directly from the instruction or control circuitry. Three factors motivated this design in the 8086. First, it used microcode, so a uniform method of specifying registers (both programmer-visible and internal) was useful. Second, being able to swap the source and destination in an instruction motivated a level of indirection in register specification, provided by the M and N registers. Finally, the flexibility of the ModR/M byte, in particular supporting byte, word, and segment registers, meant that the register specification needed 5 bits.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff.

Notes and references

  1. As an aside, the NOP instruction (no operation) in the 8086 is really XCHG AX,AX. Exchanging the AX register with itself accomplishes nothing but takes 3 clock cycles. 

  2. The action part of the micro-instructions indicates the second-last micro-instruction (NXT, next) and the last (RNI, Run Next Instruction), so execution of the next machine instruction can start. 

  3. Note that register #18 can refer both to DH and the destination register. This doesn't cause a conflict because it refers to DH if loaded from the instruction, and refers to the destination register if specified in the micro-instruction. The only issue is that a micro-instruction can't refer to the DH register explicitly (or the BH register similarly). This restriction isn't a problem because the microcode never needs to do this. 

  4. I discuss the 8086's ModR/M byte in detail here

  5. The microcode listings are based on Andrew Jenner's disassembly. I have made some modifications to (hopefully) make it easier to understand. 

  6. There are a few instructions that specify a register in the opcode rather than the ModR/M byte. For 16-bit registers, the INC, DEC, XCHG, PUSH, and POP instructions specify the register in the low three bits of the opcode. The MOV immediate instructions specify either an 8-bit or 16-bit register in the low three bits. On the other hand, the segment is specified by bits 3 and 4 of the segment prefixes, PUSH, and POP instructions. 

  7. A few instructions only have byte versions (DAA, AAA, DAS, AAS, AAM, AAD, XLAT). This is indicated by a Group Decode ROM output and forces instruction execution into byte mode. Thus, these instructions would load a byte register into N, but since these instructions don't have a register specification, it doesn't matter. 

  8. The segment prefixes use the same instruction bits (3 and 4) as PUSH and POP to select the segment register, so you might expect the prefixes to also load the N register. However, the prefixes are implemented in hardware, rather than microcode. Thus, they do not use the N register and the N register is not loaded with the segment register number. 

  9. You might wonder why the BX register is out of sequence with the other registers, both physically on the chip and in the instruction set. The 8086 was designed so 8080 assembly code could be translated to 8086 code. Originally, the 8086 registers had different names: XA, BC, DE, HL, SP, MP, IJ, and IK. The first four names matched the registers in the Intel 8080 processor, while MP was Memory Pointer and IJ and IK were Index registers. However, when the 8086 was released the registers were given names that corresponded to their functions in the 8086, abandoning the 8080 names. XA became the Accumulator AX, The BC register was used for counting, so it became the Count register CX. The DE register was a data register, so it became the Data register DX. The HL register was used as a base for memory accesses, so it became the Base register BX. The result is that the BX register ended up last.

    A program CONV-86 allowed 8080 assembly programs to be translated into 8086 assembly programs, with 8080 registers replaced with the corresponding 8086 registers. The old 8086 register names can be seen in the 8086 patent, while the Accumulator, Base, Count, Data names are in the MCS-86 Assembly Language Reference Guide. See also this Stack Exchange discussion

  10. The all-ones source doesn't have any decoding; the ALU bus is precharged to the high state, so it is all ones by default. 

  11. The system of using bit fields in instructions to select registers is much older, of course. The groundbreaking IBM System/360 architecture (1964), for instance, used 4-bit fields in instructions to select one of the 16 general-purpose registers. 

  12. Note that with this instruction layout, the instruction set maps cleanly onto octal. The Datapoint 2200 used octal to describe the instruction set, but Intel switched to hexadecimal for its processors. Hexadecimal was becoming more popular than octal at the time, but the move to hexadecimal hides most of the inherent structure of the instructions. See x86 is an octal machine for details. 

  13. The Datapoint manufacturer talked to Intel and Texas Instruments about replacing the board of chips with a single processor chip. Texas Instruments produced the TMX 1795 microprocessor chip and Intel produced the 8008 shortly after, both copying the Datapoint 2200's architecture and instruction set. Datapoint didn't like the performance of these chips and decided to stick with a TTL-based processor. Texas Instruments couldn't find a customer for the TMX 1795 and abandoned it. Intel, on the other hand, sold the 8008 as an 8-bit microprocessor, creating the microprocessor market in the process. Register selection in these processors was pretty simple: the 3 instruction bits were decoded into 8 select lines that selected the appropriate register (or memory). Since these processors had hard-coded control instead of microcode, the control circuity generated other register selection lines directly. 

  14. While the 8086 has eight registers that can be viewed as general-purpose, they all have some specific purposes. The AX register acts as the accumulator and has several special functions, such as its use in the XCHG (Exchange) operation, I/O operations, multiplication, and division. The BX register has a special role as a base register for memory accesses. The CX register acts as the counter for string operations and for the JCXZ (Jump if CX Zero) instruction. The DX register can specify the port for I/O operations and is used for CWD (Convert Word to Doubleword) and in multiplication and division. The SP register has a unique role as the stack pointer. The SI and DI registers are used as index registers for string operations and memory accesses. Finally, the BP register has a unique role as base pointer into the stack segment. On the 8-bit side, AX, BX, CX, and DX can be accessed as 8-bit registers, while the other registers cannot. The 8-bit AL register is used specifically for XLAT (Translate) while AH is used for the flag operations LAHF and SAHF. Thus, the 8086's registers are not completely orthogonal, and each one has some special cases, often for historical reasons. 

  15. Another way of looking at the Motorola 68000's microcode is that the register controls come from "horizontal" microcode, a micro-instruction with many bits, and fields that control functional elements directly. The 8086's microcode is more "vertical"; the micro-instructions have relatively few bits and the fields are highly encoded. In particular, the 8086's source and destination register fields are highly encoded, while the 68000 has fields that control the connection of an individual register to the bus. 

How the 8086 processor determines the length of an instruction

The Intel 8086 processor (1978) has a complicated instruction set with instructions ranging from one to six bytes long. This raises the question of how the processor knows the length of an instruction.1 The answer is that the 8086 uses an interesting combination of lookup ROMs and microcode to determine how many bytes to use for an instruction. In brief, the ROMs perform enough decoding to figure out if it needs one byte or two. After that, the microcode simply consumes instruction bytes as it needs them. Thus, nothing in the chip explicitly "knows" the length of an instruction. This blog post describes this process in more detail.

The die photo below shows the chip under a microscope. I've labeled the key functional blocks; the ones that are important to this post are darker. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles bus and memory activity as well as instruction prefetching, while the Execution Unit (EU) executes the instructions.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The prefetch queue, the loader, and the microcode

The 8086 uses a 6-byte instruction prefetch queue to hold instructions, and this queue will play an important role in this discussion.3 Earlier microprocessors read instructions from memory as they were needed, which could cause the CPU to wait on memory. The 8086, instead, read instructions from memory before they were needed, storing them in the instruction prefetch queue. (You can think of this as a primitive instruction cache.) To execute an instruction, the 8086 took bytes out of the queue one at a time. If the queue ran empty, the processor waited until more instruction bytes were fetched from memory into the queue.

A circuit called the loader handles the interaction between the prefetch queue and instruction execution. The loader is a small state machine that provides control signals to the rest of the execution circuitry. The loader gets the first byte of an instruction from the prefetch queue and issues a signal FC (First Clock) that starts execution of the instruction.

At this point, the Group Decode ROM performs the first stage of instruction decoding, classifying the instruction into various categories based on the opcode byte. Most of the 8086's instructions are implemented in microcode. However, a few instructions are so simple that they are implemented with logic circuits. For example, the CLC (Clear Carry) instruction clears the carry flag directly. The Group Decode ROM categorizes these instructions as 1BL (one-byte, implemented in logic). The loader responds by issuing an SC (Second Clock) signal to wrap up execution and start the next instruction. Thus, these simple instructions take two clock cycles.

The 8086 has various prefix bytes that can be put in front of an instruction to change its behavior. For instance, a segment prefix changes the memory segment that the instruction uses. A LOCK prefix locks the bus during the next instruction. The Group Decode ROM detects a prefix and outputs a prefix signal. This causes the prefix to be handled in logic, rather than microcode, similar to the 1BL instructions. Thus, a prefix also takes one byte and two clock cycles.

The remaining instructions are handled by microcode.2 Let's start with a one-byte instruction such as INC AX, which increments the AX register. As before, the loader gets the instruction byte from the prefix queue. The Group Decode ROM determines that this instruction is implemented in microcode and can start after one byte, so the microcode engine starts running. The microcode below handles the increment and decrement instructions. It moves the appropriate register, indicated by M to the ALU's temporary B register. It puts the incremented or decremented result (Σ) back into the register (M). RNI tells the loader to run the next instruction. With two micro-instruction, this instruction takes two clock cycles.

M → tmpB        XI tmpB, NX INC/DEC: get value from M, set up ALU
Σ → M           WB,RNI F     put result in M, run next instruction

But what happens with an instruction that is more than one byte long, such as adding an immediate value to a register? Let's look at ADD AX,1234, which adds 1234 to the AX register. As before, the loader reads one byte and then the microcode engine starts running. At this point, the 8086 doesn't "realize" that this is a 3-byte instruction. The first line of the microcode below gets one byte of the immediate operand: Q→tmpBL loads a byte from the instruction prefetch queue into the low byte of the temporary B register. Similarly, the second line loads the second byte. The next line puts the register value (M) in tmpA. The last line puts the sum back into the register and runs the next instruction. Since this instruction takes two bytes from the prefetch queue, it is three bytes long in total. But nothing explicitly indicates this instruction is three bytes long.

Q → tmpBL       JMPS L8 2  alu A,i: get byte from queue
Q → tmpBH                   get byte from queue
M → tmpA        XI tmpA, NX get value from M, set up ALU
Σ → M           WB,RNI F    put result in M, run next instruction

You can also add a one-byte immediate value to a register, such as ADD AL,12. This uses the same microcode above. However, in the first line, JMPS L8 is a conditional jump that skips the second micro-instruction if the data length is 8 bits. Thus, the microcode only consumes one byte from the prefetch queue, making the instruction two bytes long. In other words, what makes this instruction two bytes instead of three is the bit in the opcode which triggers the conditional jump in the microcode.

The 8086 has another class of instructions, those with a ModR/M byte following the opcode. The Group Decode ROM classifies these instructions as 2BR (two-byte ROM) indicating that the second byte must be fetched before processing by the microcode ROM. For these instructions, the loader fetches the second byte from the prefetch queue before triggering the SC (Second Clock signal) to start microcode execution.

The ModR/M byte indicates the addressing mode that the instruction should use, such as register-to-register or memory-to-register. The ModR/M can change the instruction length by specifying an address displacement of one or two bytes. A second ROM called the Translation ROM selects the appropriate microcode for the addressing mode (details). For example, if the addressing mode includes an address displacement, the microcode below is used:

Q → tmpBL   JMPS MOD1 12 [i]: get byte(s)
Q → tmpBH         
Σ → tmpA    BX EAFINISH 12: add displacement

This microcode fetches two displacement bytes from the prefetch queue (Q). However, if the ModR/M byte specifies a one-byte displacement, the MOD1 condition causes the microcode to jump over the second fetch. Thus, this microcode uses one or two additional instruction bytes depending on the value of the ModR/M byte.

To summarize, nothing in the 8086 "knows" how long an instruction is. The Group Decode ROM makes part of the decision, classifying instructions as a prefix, 1-byte logic, 2-byte ROM, or otherwise, causing the loader to fetch one or two bytes. The microcode then consumes instruction bytes as needed. In the end, the length of an 8086 instruction is determined by how many bytes are taken from the prefetch queue by the time it ends.

Some other systems

It's interesting to see how other processors deal with instruction length. For example, RISC processors (Reduced Instruction Set Computers) typically have fixed-length instructions. For instance, the ARM-1 processor used 32-bit instructions, making instruction decoding very simple.

Early microprocessors such as the MOS Technology 6502 (1975) didn't use microcode, but were controlled by state machines. The CPU fetches instruction bytes from memory as needed, as it moves through various execution states. Thus, as with the 8086, the length of an instruction wasn't explicit, but was how many bytes it used.

The IBM 1401 computer (1959) took a completely different approach with its variable-length words. Each character in memory had an associated "word mark" bit, which you can think of as a metadata bit. Each machine instruction consisted of a variable number of characters with a word mark on the first one. Thus, the processor could read instruction characters until it hit a word mark, which indicated the start of the next instruction. The word mark explicitly indicated to the processor how long each instruction was.

Perhaps the worst approach for variable-length instructions was the Intel iAPX 432 processor (1981), which had instructions with variable bit lengths, from 6 to 321 bits long. As a result, instructions weren't aligned on byte boundaries, making instruction decoding even more inconvenient. This was just one of the reasons that the iAPX 432 ended up overly complicated, years behind schedule, and a commercial failure.

The 8086's variable-length instructions led to the x86 architecture, with instructions from 1 to 15 bytes long. This is particularly inconvenient with modern superscalar processors that run multiple instructions in parallel. The problem is that the processor must break the instruction stream into individual instructions before they execute. The Intel P6 microarchitecture used in the Pentium Pro (1995) has instruction decoders to decode the instruction stream into micro-operations.4 It starts with an "instruction length block" that analyzes the first bytes of the instruction to determine how long it is. (This is not a straightforward task to perform rapidly on multiple instructions in parallel.) The "instruction steering block" uses this information to break the byte stream into instructions and steer instructions to the decoders.

The AMD K6 3D processor (1999) had predecode logic that associated 5 predecode bits with each instruction byte: three pointed to the start of the next instruction, one indicated the length depended on a D bit, and one indicated the presence of a ModR/M byte. This logic examined up to three bytes to make its decisions. Instructions were split apart and assigned to decoders based on the predecode bits. In some cases, the predecode logic gave up and flagged the instruction as "unsuccessfully predecoded", for instance an instruction longer than 7 bytes. These instructions were handled by a slower path.

Conclusions

The 8086 processor has instructions with a variety of lengths, but nothing in the processor explicitly determines the length. Instead, an instruction uses as many bytes as it needs. (That sounds sort of tautological, but I'm not sure how else to put it.) The Group Decode ROM makes an initial classification, the Translation ROM determines the addressing mode, and the microcode consumes bytes as needed.

While this approach gave the 8086 a flexible instruction set, it created a problem in the long run for the x86 architecture, requiring complicated logic to determine instruction length. One benefit of RISC-based processors such as the Apple M1 is that they have (mostly) constant instruction lengths, making instruction decoding faster and simpler.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff.

Notes and references

  1. I was inspired to investigate instruction length based on a Stack Overflow question

  2. I'll just give a brief overview of microcode here. Each micro-instruction is 21 bits long, as shown below. A micro-instruction specifies a move between a source register and destination register. It also has an action that depends on the micro-instruction type. For more details, see my post on the 8086 microcode pipeline.

    The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

    The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

     

  3. The 8088 processor, used in the original IBM PC, has a smaller 4-byte prefetch queue. The 8088 is almost the same as the 8086, except it has an 8-bit external bus instead of a 16-bit external bus. This makes memory accesses (including prefetches) slower, so a smaller prefetch queue works better. 

  4. The book Modern Processor Design discusses the P6 microarchitecture in detail. The book The Anatomy of a High-Performance Microprocessor discusses the AMD K5 3D processor in even more detail; see chapter 2. 

Reverse-engineering the Intel 8086 processor's HALT circuits

The 8086 processor was introduced in 1978 and has greatly influenced modern computing through the x86 architecture. One unusual instruction in this processor is HLT, which stops the processor and puts it in a halt state. In this blog post, I explain in detail how the halt circuitry is implemented and how it interacts with the 8086's architecture.

The die photo below shows the 8086 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. I've labeled the key functional blocks; the ones that are important to this discussion are darker and will be discussed in detail below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions. Both are stopped by a halt instruction.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Halt processing in the Execution Unit

In this section, I'll explain how the HLT instruction is decoded and handled in the Execution Unit. The 8086 uses a combination of lookup ROMs, logic, and microcode to implement instructions. The process starts with the loader, a state machine that provides synchronization between the prefetch queue and the decoding circuitry. When an instruction byte is available, the loader provides a signal called First Clock that loads the instruction into the Instruction Register and starts the instruction decoding process.

Before microcode gets involved, the Group Decode ROM classifies instructions by producing about 15 signals, indicating properties such as instructions with a Mod R/M byte, instructions with a byte/word bit, instructions that always act on a byte, and so forth. For the HLT instruction, the Group Decode ROM provides two important signals. The first is one-byte logic (1BL), indicating that the instruction is one byte long and is implemented with logic circuitry rather than microcode.1 The second signal is produced for the HLT instruction specifically and generates the internal HALT signal. This signal travels to various parts of the 8086 to halt the processor.

The Group Decode ROM. The yellow rectangle detects the HLT instruction, with an output at the bottom. The red rectangle generates the 1BL (one-byte logic) signal.

The Group Decode ROM. The yellow rectangle detects the HLT instruction, with an output at the bottom. The red rectangle generates the 1BL (one-byte logic) signal.

In the Execution Unit, the HALT signal blocks the reading of new instructions from the prefetch queue. This causes the loader to wait indefinitely and stops execution of new instructions. Since no new instruction replaces HLT, the Group Decode ROM continues to generate the HALT signal. The HALT signal also blocks most of the other outputs from the Group Decode ROM, preventing other decoding actions.

Thus, the Execution Unit sits idle as a result of the HLT instruction, unable to start a new instruction. Modern processors often have low-power halt modes, where part of the processor is shut down or a clock domain is stopped to reduce power consumption. The 8086, however, doesn't do anything clever to minimize power consumption in the halt mode, since this wasn't a concern for processors in the 1970s.

Halt processing in the Bus Interface Unit

Memory and I/O devices are connected to the 8086 chip through a bus that transmits address, data, and control information. The 8086's Bus Interface Unit handles reads and writes over this bus, running independently from the Execution Unit. A complete bus cycle for a read or write takes four clock periods, called T1, T2, T3, and T4,2 with specific signals on the bus for each time state.

A HLT instruction stops the Bus Interface Unit, but this takes several steps. First, the Bus Interface Unit must complete any currently-running bus cycle. Any new bus cycle must be blocked. Finally, the processor indicates the HALT state to any devices on the bus by issuing a special T1 cycle over the bus.

The main HALT control signal inside the Bus Interface Unit is something I call halt-not-hold, indicating a HALT is active, but not a HOLD. (Ignore the HOLD part for now.) This signal is activated by the HLT instruction signal from the Group Decode ROM, except it is blocked by any bus operations in progress. Once any current bus operation reaches T2, halt-not-hold gets activated and starts the halt process while the current bus cycle finishes up.

To prevent new bus activity, the halt-not-hold signal blocks new prefetch requests. The only other source of bus activity is an instruction that performs reads or writes. But the current instruction is HLT, so it won't generate any bus traffic. Thus, the Bus Interface Unit will remain idle.

The read/write control circuitry on the die with the flip-flops labeled. Metal and polysilicon were removed to show the underlying silicon.

The read/write control circuitry on the die with the flip-flops labeled. Metal and polysilicon were removed to show the underlying silicon.

The circuitry to control the bus cycle is complicated with many flip-flops and logic gates; the diagram above shows the flip-flops. I plan to write about the bus cycle circuitry in detail later, but for now, I'll give an extremely simplified description. Internally, there is a T0 state before T1 to provide a cycle to set up the bus operation. The bus timing states are controlled by a chain of flip-flops configured like a shift register with additional logic: the output from the T0 flip-flop is connected to the input of the T1 flip-flop and likewise with T2 and T3, forming a chain. A bus cycle is started by putting a 1 into the input of the T0 flip-flop.3 When the CPU's clock transitions, the flip-flop latches this signal, indicating the (internal) T0 bus state. On the next clock cycle, this 1 signal goes from the T0 flip-flop to the T1 flip-flop, creating the externally-visible T1 state. Likewise, the signal passes to the T2 and T3 flip-flops in sequence, creating the bus cycle.

A slightly different path is used to generate the special T1 signal that indicates a HALT. Once any bus activity is completed, the halt-not-hold signal puts a 1 into the T1 flip-flop through some gates. This generates the T1 signal, bypassing T0. Moreover, this signal does not propagate to the T2 flip-flop because it is blocked by halt-not-hold and some gates. Another flip-flop blocks this T1 cycle after the first cycle so halt-not-hold doesn't repeately trigger it. Overall, this special HALT T1 state looks like a special case that was hacked into the circuitry.

One complication is the bus hold feature. The 8086 supports complex bus configurations, where external devices may take control of the bus. For instance, peripherals may use the bus for direct memory access, bypassing the CPU. A device can request control of the bus, a "bus hold", through the 8086's HOLD pin.4 This causes the 8086 to electrically stop putting signals on the bus (i.e. a high-impedance, tri-state off state). This allows another device to use the bus until it releases HOLD.

Even when the CPU is halted, the CPU still has "ownership" of the bus and drives the bus with idle signals.5 If a device requests a bus hold when the CPU is halted, the halt-not-hold signal is blocked. When the device releases the hold, halt-not-hold is unblocked. This causes the 8086 to go through the special T1 cycle again, using the same flip-flop process described above. This lets listeners on the bus know that the CPU is still halted.

Exiting the halt state

The processor exits the halt state when it receives a reset, interrupt, or non-maskable interrupt. To implement this, an interrupt unblocks the instruction decoder by overriding the queue-unavailable signal. This causes the loader, which controls instruction decoding, to move into the First Clock state. Meanwhile, the interrupt causes the microcode address register to be loaded with the hardcoded microcode address of the appropriate interrupt routine. Thus, the microcode engine starts running the interrupt handler microcode.

The Instruction Register holds the 8-bit opcode that is currently being processed. It has a ninth bit that indicates if an interrupt is being processed. The Instruction Register (including the interrupt bit) is loaded on First Clock (described above). It outputs the instruction and interrupt bit to the Group Decode ROM one clock cycle later. The interrupt bit blocks regular instruction decoding by the Group Decode ROM. In particular, the HLT instruction will no longer be decoded, dropping the HALT signal throughout the CPU. In the Execution Unit, this reactivates the prefetch queue. This will allow instruction execution once the microcode finishes executing the interrupt handling code. In the Bus Interface Unit, dropping the HALT signal causes halt-not-hold to drop. This enables bus activity from the Bus Interface Unit.6

History of HALT and x86

Historically, computers usually had some sort of "stop" or "wait" instruction to stop execution at the end of a program. This goes back to the electromechanical Harvard Mark I (1944), EDSCAC (1949), and Univac I (1951), among other machines. Most (but not all) mainframes and minicomputers continued this approach.7

The HLT instruction in the 8086, like many other features, derives from the Datapoint 2200, and there's an interesting story behind that. The Datapoint 2200 was a desktop computer announced in 1970, and sold as a "programmable terminal". The processor of the Datapoint 2200 was implemented with a board of TTL integrated circuits, since this was before microprocessors. The Datapoint manufacturer talked to Intel and Texas Instruments about replacing the board of chips with a single processor chip. Texas Instruments produced the TMX 1795 microprocessor chip and Intel produced the 8008 shortly after,8 both copying the Datapoint 2200's architecture and instruction set. Datapoint didn't like the performance of these chips and decided to stick with a TTL-based processor. Texas Instruments couldn't find a customer for the TMX 1795 and abandoned it. Intel, on the other hand, sold the 8008 as an 8-bit microprocessor, creating the microprocessor market in the process. Intel improved the 8008 to create the popular 8080 microprocessor (1974). Zilog produced the more powerful Z80 (1976), backward-compatible with the 8080.

The Datapoint 2200. This is the later Model II with an improved TTL processor using the 74181 ALU chip.

The Datapoint 2200. This is the later Model II with an improved TTL processor using the 74181 ALU chip.

Intel started designing the iAPX 432 in 1975 to be their high-end 32-bit processor, a "micromainframe" that supported garbage collection and objects in the processor. The iAPX 432 was too complex for the time and as the schedule slipped, Intel decided to produce a stopgap 16-bit processor to compete with Zilog and Motorola: this processor became the 8086. To make it easier for Intel customers to move to the 8086, the processor was designed for compatibility with 8080 assembly language so it inherited much of the architecture and instruction set, although extended from 8 bits to 16 bits.9

The consequence of this history is that the 8086 inherited many features from the Datapoint 2200. The Datapoint 2200 used cheaper shift-register memory so it had a serial processor that operated on one bit at a time. This required the Datapoint 2200 to be little-endian, a feature that lives on in the x86 architecture. Since the Datapoint 2200 was marketed as a programmable terminal, it had parity calculation built into the hardware. Thus, the 8008 and descendants have a parity flag, in contrast to contemporary processors such as the 6800 and 6502 that omitted this moderately complex feature. The use of I/O ports instead of memory-mapped I/O is another feature of the Datapoint 2200 that persists in the x86, but was not used in the 6800 and 6502 and their descendants. The opcodes of the Datapoint 2200 were based on octal 3-bit fields for hardware reasons. The x86 instructions are still designed around octal, but the usual hexadecimal display obscures their structure. Finally, the Datapoint 2200's HALT instruction was exactly copied by the 800810 and persists in x86.

Conclusions

The HLT instruction seems like a simple function, but its implementation touches many parts of the 8086. It is implemented in logic circuitry, completely bypassing the microcode. The implementation became more complicated because of the 8086's four-step bus protocol, as well as interaction between halting and the bus hold feature. This illustrates how complexity creates more complexity, something the RISC processors of the 1980s tried to counter.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff. Thanks to monocasa for suggesting this topic.

Notes and references

  1. The instructions implemented outside microcode are the segment register prefixes (ES:, CS:, SS:, DS:), the other prefixes (LOCK, REPNZ, REPZ), the simple flag instructions (CMC, CLC, STC, CLI, STI, CLD, STD), and HLT. These instructions are indicated by the 1BL (one-byte logic) output from the Group Decode ROM. 

  2. The bus cycle may also include optional Tw wait states after T3 for slow memory. The memory (or I/O device) lowers the READY pin until it is ready to proceed and the Bus Interface Unit waits. I'm ignoring Tw states in this discussion to keep things simpler. 

  3. For some reason, the T-state flip-flops all hold inverted signals, so strictly speaking a 0 bit goes through the flip-flops. 

  4. The 8086 has a separate prioritized "request/grant" way for a device to obtain a bus hold, but it doesn't change the underlying hold behavior. 

  5. During a HALT, the 8086 is not actively using the bus, but it does not release the bus either; it is still electrically driving the bus. Otherwise, the bus would float to random voltages, confusing attached memory chips or other circuitry. 

  6. When the Bus Interface Unit is unhalted due to an interrupt, you might expect it to immediately start prefetching, accessing unwanted instructions. It turns out that the prefetch circuitry does try to start prefetching and reaches the internal T0 bus state. But it then gets preempted by the interrupt handler microcode, which uses the bus to send two interrupt acknowledge cycles. Immediately after, the microcode routine suspends prefetching. Thus, prefetching doesn't run until the interrupt microcode finishes and reenables prefetching. There's a lot of tricky timing in the 8086 to make everything work. 

  7. For more history of the stop instruction, see "Computer Architecture", Blaauw and Brooks, page 349. (This the same Brooks who wrote "The Mythical Man-Month" and "No Silver Bullet".) 

  8. You might wonder how the Intel 4004 fits into this history. Although many of the same people worked on both chips, they have completely different architectures. The 8008 is not at all an 8-bit version of the 4-bit 4004. 

  9. Assembly code for the 8-bit 8080 processor couldn't run directly on the 16-bit 8086. Instead, a translation program converted the 8080 assembly language to be compatible with the 8086, making some changes in the process. The 8086 dropped some of the less-useful instructions of the 8080, replacing them with multiple instructions in the translation. For instance, the 8080 had conditional subroutine call and return instructions (inherited from the Datapoint 2200), but the 8086 dropped them. 

  10. To see that the 8008 copied the Datapoint 2200's HALT instruction, note that the Datapoint had three opcodes for HALT (00, 01, and FF), which is a bit unusual. The 8008 also has three opcodes for HLT: 00, 01, and FF. Most instructions in the 8008 used the same opcode values as the Datapoint, with a few minor changes. 

Reverse-engineering the conditional jump circuitry in the 8086 processor

Intel introduced the 8086 microprocessor in 1978 and it had a huge influence on computing. I'm reverse-engineering the 8086 by examining the circuitry on its silicon die and in this blog post I take a look at how conditional jumps are implemented. Conditional jumps are an important part of any instruction set, changing the flow of execution based on a condition. Although this instruction may seem simple, it involves many parts of the CPU: the 8086 uses microcode along with special-purpose condition logic.

The die photo below shows the 8086 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. I've labeled the key functional blocks; the ones that are important to this discussion are darker and will be discussed in detail below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions. Most of the relevant circuitry is in the Execution Unit, such as the condition evaluation circuitry near the center, and the microcode in the lower right. But the Bus Interface Unit plays a part too, holding and modifying the program counter.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Microcode

Most people think of machine instructions as the basic steps that a computer performs. However, many processors (including the 8086) have another layer of software underneath: microcode. One of the hardest parts of computer design is creating the control logic that directs the processor for each step of an instruction. The straightforward approach is to build a circuit from flip-flops and gates that moves through the various steps and generates the control signals. However, this circuitry is complicated, error-prone, and hard to design.

The alternative is microcode: instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns design of control circuitry into a programming task instead of a difficult logic design task.

The 8086 uses a hybrid approach: although the 8086 uses microcode, much of the instruction functionality is implemented with gate logic. This approach removed duplication from the microcode and kept the microcode small enough for 1978 technology. In a sense, the microcode is parameterized. For instance, the microcode can specify a generic Arithmetic/Logic Unit (ALU) operation, and the gate logic determines from the instruction which ALU (Arithmetic/Logic Unit) operation to perform. More relevant to this blog post, the microcode can specify a generic conditional test and the gate logic determines which condition to use. Although this made the 8086's gate logic more complicated, the tradeoff was worthwhile.

Microcode for conditional jumps

The 8086 processor has six status flags: carry, parity, auxiliary carry, zero, sign, and overflow.1 These flags are updated by arithmetic and logic operations based on the result. The 8086 has sixteen different conditional jump instructions2 that test status flags and jump if conditions are satisfied, such as zero, less than, or odd parity. These instructions are very important since they permit if statements, loops, comparisons, and so forth.

In machine language, a conditional jump opcode is followed by a signed offset byte which specifies a location relative to the current program counter, from 127 bytes ahead to 128 bytes back. This is a fairly small range, but the benefit is that the offset fits in a single byte, reducing the code size.3 For typical applications such as loops or conditional code, jumps usually stay in the same neighborhood of code, so the tradeoff is worthwhile.

The 8086's microcode was disassembled by Andrew Jenner (link) from my die photos, so we can see exactly what micro-instructions the 8086 is running for each machine instruction. The microcode below implements conditional jumps. In brief, the conditional jump code (Jcond) gets the branch offset byte. It tests the appropriate condition and, if satisfied, jumps to the relative jump microcode (RELJUMP). The RELJMP code adds the offset to the program counter. In either case, the microcode routine ends when it runs the next instruction (RNI).

   move       action
Jcond:
1 Q→tmpBL
2          XC    RELJMP                    
3          RNI                       

RELJMP:
4          SUSP
5          CORR                      
6 PC→tmpA  ADD   tmpA
7 Σ→PC     FLUSH RNI                       

In more detail, micro-instruction 1 (arbitrary numbering) moves a byte from the prefetch queue (Q) across the queue bus to the ALU's temporary B register.4 (Arguments for ALU operations are first stored in temporary registers, invisible to the programmer.) Instruction 2 tests the appropriate condition with XC, and jumps to the RELJMP routine if the condition is satisfied.5 Otherwise, RNI (Run Next Instruction) ends this sequence and loads the next machine instruction without jumping.

If the condition is satisfied, the relative jump routine starts with instruction 4, which suspends prefetching.6 Instruction 5 corrects the program counter value, since it normally points to the next byte to prefetch, not the next byte to execute. Instruction 6 moves the corrected program counter address to the ALU's temporary A register. It also starts an ALU operation to add temporary A and temporary B. Instruction 7 moves the sum (Σ) to the program counter. It flushes the prefetch queue, which starts up prefetching from the new PC value. Finally, RNI runs the next instruction, from the updated address.

This code supports all 16 conditional jumps because the microcode tests the generic "XC" condition. This indicates that the specific test depends on the four low bits of the opcode, and the hardware determines exactly what to test. It's important to keep the two levels straight: the machine instruction is doing a conditional jump to a different memory address, while the microcode that implements this instruction is performing a conditional jump to a different micro-address.

The timing for conditional jumps

The RNI (Run Next Instruction) micro-operation initiates processing of the next machine instruction. However, it takes a clock cycle to get the next instruction from the prefetch queue, decode it, and start the appropriate micro-instruction. This causes a wasted clock cycle before the next micro-instruction executes. To avoid this delay, most microcode routines issue a NXT micro-operation one cycle before they end. This gives the 8086 time to decode the next machine instruction so micro-instructions can run uninterrupted.

Unfortunately, the conditional jump instructions can't take advantage of NXT. The problem is that the control flow in the microcode depends on whether the conditional jump is taken or not. By the time the microcode knows it is not taking the branch, it's too late to issue NXT.

The datasheet gives the timing of a conditional jump as 4 clock cycles if the jump is not taken, and 8 clock cycles if the jump is taken. Looking at the microcode explains these timings. There are 3 micro-instructions executed if the jump is not taken, and 7 if it is taken. Because of the RNI, there is one wasted clock cycle, resulting in the documented 4 or 8 cycles in total.

The conditions

At this point I will review the 8086's conditional jumps. The 8086 implements 16 conditional jumps. (This is a large number compared to earlier CPUs: the 8080, 6502, and Z80 all had 8 conditional jumps, specified by 3 bits.) The table below shows which flags are tested for each condition, specified by the low four bits of the opcode. Some jump instructions have multiple names depending on the programmer's interpretation, but they map to the same machine instruction.7

ConditionBitsCondition trueCondition false
Overflow Flag (OF)=1000xoverflow (JO)not overflow (JNO)
Carry Flag (CF)=1001xcarry (JC)
below (JB)
not above or equal (JNAE)
not carry (JNC)
not below (JNB)
above or equal (JAE)
Zero Flag (ZF)=1010xzero (JZ)
equal (JE)
not zero (JNZ)
not equal (JNE)
CF=1 or ZF=1011xbelow or equal (JBE)
not above (JNA)
not below or equal (JNBE)
above (JA)
Sign Flag (SF)=1100xsign (JS)not sign (JNS)
Parity Flag (PF)=1101xparity (JP)
parity even (JPE)
not parity (JNP)
parity odd (JPO)
SF ≠ OF110xless (JL)
not greater or equal (JNGE)
not less (JNL)
greater or equal (JGE)
ZF=1 or SF ≠ OF111xless or equal (JLE)
not greater (JNG)
not less or equal (JNLE)
greater (JG)

From the hardware perspective, the important thing is that there are eight different condition flag tests. Each test has two jump instructions associated with it: one that jumps if the condition is true, and one that jumps if the condition is false. The low bit of the opcode selects "if true" or "if false".

The image below shows the condition evaluation circuitry as it appears on the die. There isn't much structure to it; it's just a bunch of gates. This image shows the doped silicon regions that form transistors. The numerous small polygons with a circle inside are connections between the metal layer and the polysilicon layer. Many of these connections use the silicon layer to optimize the layout.

The circuitry to compute conditions as it appears on the die. The metal and polysilicon layers have been removed for this image, showing the silicon underneath.

The circuitry to compute conditions as it appears on the die. The metal and polysilicon layers have been removed for this image, showing the silicon underneath.

This circuitry evaluates each condition by getting the instruction bits from the Instruction Register, checking the bits to match each condition, and testing if the condition is satisfied. For instance, the overflow condition (with instruction bits 000x) is computed by a NOR gate: NOR(IR3, IR2, IR1, OF'), which will be true if instruction register bits 3, 2, and 1 are zero and the Overflow Flag is 1.

The results from the individual condition tests are combined with a 7-input NOR gate, producing a result that is 0 if the specified 3-bit condition is satisfied. Finally, the "if true" and "if false" cases are handled by flipping this signal depending on the low bit of the instruction. This final result indicates if the 4-bit condition in the instruction is satisfied, and this signal is passed on to the microcode control circuitry.

One unexpected feature of the implementation is that a 7-input NOR gate combines the various conditions to test if the selected condition is satisfied. You'd expect that with eight conditions, there would be eight inputs to the NOR gate. However, there is a clever optimization that takes advantage of conditions that are combinations of clauses, for example, "less or equal". Specifically, the zero flag is tested for bit pattern 01xx (where x indicates a 0 or 1), which covers two conditions with one gate. Likewise, SF≠OF is tested for bit pattern 11xx and CF=1 is tested for bit pattern 0x1x. With these optimizations, the eight conditions are covered with seven checks. (This shows that the opcodes weren't assigned arbitrarily: the bit patterns needed to be carefully assigned for this to work.)

Back to the microcode

Before explaining how the microcode jump circuitry works, I'll briefly discuss the microcode format. A micro-instruction is encoded into 21 bits as shown below. Every micro-instruction contains a move from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits is a bit tricky since it depends on the type field, which is two or three bits long. The "short jump" (type 0) is a conditional jump within the current block of 16 micro-instructions. The ALU operation (type 1) sets up the arithmetic-logic unit to perform an operation. Bookkeeping operations (type 4) are anything from flushing the prefetch queue to ending the current instruction. A memory read or write is type 6. A "long jump" (type 5) is a conditional jump to any of 16 fixed microcode locations (specified in an external table). Finally, a "long call" (type 7) is a conditional subroutine call to one of 16 locations (different from the jump targets).

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

I'm going to focus on the XC RELJMP micro-instruction that we saw in the microcode earlier. This is a "long jump" with XC as the condition and RELJMP as the target tag. Another layer of hardware is required to implement the microcode conditions. The microcode supports 16 conditions, which are completely different from the 16 programmer-level conditions.8 Some microcode conditions test special-purpose internal flags, while others test conditions such as an interrupt, the chip's TEST pin, bit 3 of the opcode, or if the instruction has a one-byte address offset. The XC condition is one of these 16 conditions, number 15 specifically.

The conditions are evaluated by the condition PLA (Programmable Logic Array, a grid of gates), shown below. The four condition bits from the micro-instruction, along with their complements, are fed into the columns. The PLA has 16 rows, one for each condition. Each row is a NOR gate matching one bit combination (i.e. selecting a condition) and the corresponding signal value to test.9 Thus, if a particular condition is specified and is satisfied, that row will be 1. The 16 row outputs are combined by the 16-input NOR gate at the left. Thus, if the specified condition is satisfied, this output will be 0, and if the condition is unsatisfied, the output will be 1. This signal controls the jump or call micro-instruction: if the condition is satisfied, the new micro-address is loaded into the microcode address register. If the condition is not satisfied, the microcode proceeds sequentially.

The condition PLA evaluates microcode conditionals.

The condition PLA evaluates microcode conditionals.

Conclusions

To summarize, the 8086 processor implements 16 conditional jump instructions. One piece of microcode efficiently implements all 16 instructions, with gate logic determining which flags to test, depending on bits in the machine instruction. The result of this test is used by the microcode XC conditional jump, one of 16 completely different microcode-level conditions. If the XC condition is satisfied, the program counter is updated by adding the offset, jumping to the new location.

Conditional jumps are relatively straightforward instructions from the programmer's perspective, but they interact with most parts of the 8086 processor including the prefetch queue, the address adder, the ALU, microcode, and the Translation ROM. The diagram below shows the interactions for each step of the jump.

The conditional jump involves many parts of the die, shown in this diagram.

The conditional jump involves many parts of the die, shown in this diagram.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff.

Notes and references

  1. In addition to the six status flags, the 8086 has three control flags: trap, direction, and interrupt enable. These flags aren't tested by conditional branches so I won't discuss them further. 

  2. Strictly speaking, the 8086 has a few more conditional jumps. The JCXZ instruction tests if the CX register is zero. The LOOP, LOOPNZ, and LOOPZ instructions decrement the CX register and loop if it is nonzero. The last two only loop if the zero flag indicates nonzero or zero, respectively. I'm ignoring these instructions in the blog post. 

  3. Although a conditional jump only supports a small range, it's still possible to conditionally jump to a distant location by using two instructions. A conditional jump with the opposite condition can skip over a longer unconditional jump instruction. The 80386 removed this restriction by providing long-displacement conditional jumps, which could perform a 16-bit or 32-bit relative jump. 

  4. The relative offset byte is sign-extended when it is moved to the temporary B register. That is, if the top bit is high, the high byte is set to all 1's to produce a 16-bit negative value. 

  5. The details of how the microcode jumps to the RELJMP routine are interesting, but a bit of a tangent, so I've put this discussion in a footnote. For long jumps (and long calls) in microcode, the target micro-addresses are stored in the Translation ROM, and the 4-bit target tag indexes into this ROM. The motivation for this structure is that micro-addresses are 13 bits, which is a lot of bits to try to fit into a 21-bit micro-instruction. Using a 4-bit tag keeps the microcode compact, but at the cost of requiring a small ROM in the 8086.

    The translation ROM on the die.

    The translation ROM on the die.

    Above is a view of the Translation ROM, with the RELJMP entry highlighted. The left half decodes tags, while the right half provides the corresponding microcode address. The row for RELJMP is highlighted. 

  6. Much of this microcode snippet deals with the prefetch queue. To increase efficiency, the 8086 processor fetches instructions from memory before they are needed and stores them in a 6-byte prefetch queue. In most processors, the program counter points to the memory address of the next instruction to execute. However, in the 8086, the program counter advances during prefetching, so it points to the memory address of the next instruction to fetch. This discrepancy is invisible to the programmer, but the microcode needs to handle it.

    First, the microcode issues a SUSP micro-operation to suspend prefetching. This ensures that the program counter will not be changed due to more prefetching. Next, the CORR micro-operation corrects the program counter to point to the next address to execute. This correction is performed by subtracting the number of unused bytes in the prefetch queue. You might expect this correction to be performed by the Arithmetic/Logic Unit (ALU). However, the 8086 has a separate adder that is used for memory address computations: each memory access in the 8086 requires a segment register base address to be added to an offset address. This address adder is also used for program counter correction. The constant ROM holds the values -1 through -6, the appropriate constant is selected based on the number of bytes in the prefetch queue, and this constant is added to the program counter. (Interestingly, the address adder is used for program counter correction, while the ALU is used to modify the program counter for the relative jump computation.)

    The address adder has multiple uses. It is also used for updating the program counter during prefetching. It updates addresses when performing block copy operations. Finally, it updates addresses when performing an unaligned word operation. The constant ROM holds constants for these operations.

    At the end of the microcode sequence, the FLUSH micro-operation flushes the stale bytes from the prefetch queue, resets the prefetch queue pointers, and restarts prefetching. I wrote about prefetching in detail here

  7. Often, the compare (CMP) instruction will be executed to compare two numbers by subtracting and discarding the result but keeping the condition codes. One complication is that some tests make sense for signed numbers, while other tests make sense for unsigned numbers. Specifically, "greater", "greater or equal", "less", and "less or equal" make sense for signed comparisons. On the other hand, "above", "above or equal", "below", and "below or equal" make sense for unsigned comparisons.

    The 8086 supports both signed and unsigned numbers. The arithmetic operations are the same for both; it's just the programmer's interpretation that differs. For instance, consider adding hex numbers 0xfe and 0x01. Treating them as unsigned numbers, the sum is 254 + 1 = 255. But as signed numbers, -2 + 1 = -1. In either case, the processor computes the same result, 0xff, but the interpretation is different.

    The signed vs unsigned distinction matters for comparisons. For instance, as unsigned numbers, 0xfe (254) is above 0x01 (1). But as signed numbers, 0xfe (-2) is less than 0x01 (1). This is why different instructions are used to compare unsigned versus signed numbers.

    Another important factor is that the carry flag indicates an unsigned result is too large for its byte (or word), while the overflow flag indicates that a signed result is too large for its byte (or word). For instance, adding unsigned bytes 0xff (255) and 0x02 (2) yields 0x01 (1) and a carry, indicating the result is too big for a byte. However, as signed bytes this corresponds to -1 + 2 = 1, which fits in a byte, so the overflow flag is not set. Conversely, 0x7f + 0x01 = 0x80. As unsigned bytes, this corresponds to 127 + 1 = 128 which is fine. But as signed bytes, this corresponds to 127 + 1, which unexpectedly yields -128 due to overflow. Thus, the carry flag is not set, but the overflow flag is set in this case. 

  8. Short jumps have four bits to specify the condition, so they can access 16 conditions. For long jumps and long calls, one bit is "stolen" from the condition to indicate the type, so they can only access eight of the conditions. Thus, the conditions need to be assigned carefully so the necessary ones are available. 

  9. PLAs are typically uniform grids, but the grid pattern breaks down a bit in the condition PLA. The reason is that each test uses a separate signal, so there is a different signal into each row (unlike a typical PLA where each row receives the same signals). Moreover, some of the test signals are processed at the left, distorting the 16-input NOR gate. This illustrates the degree of layout optimization in the 8086, squeezing transistors in to save a bit of space.