How the 8086 processor determines the length of an instruction

The Intel 8086 processor (1978) has a complicated instruction set with instructions ranging from one to six bytes long. This raises the question of how the processor knows the length of an instruction.1 The answer is that the 8086 uses an interesting combination of lookup ROMs and microcode to determine how many bytes to use for an instruction. In brief, the ROMs perform enough decoding to figure out if it needs one byte or two. After that, the microcode simply consumes instruction bytes as it needs them. Thus, nothing in the chip explicitly "knows" the length of an instruction. This blog post describes this process in more detail.

The die photo below shows the chip under a microscope. I've labeled the key functional blocks; the ones that are important to this post are darker. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles bus and memory activity as well as instruction prefetching, while the Execution Unit (EU) executes the instructions.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The prefetch queue, the loader, and the microcode

The 8086 uses a 6-byte instruction prefetch queue to hold instructions, and this queue will play an important role in this discussion.3 Earlier microprocessors read instructions from memory as they were needed, which could cause the CPU to wait on memory. The 8086, instead, read instructions from memory before they were needed, storing them in the instruction prefetch queue. (You can think of this as a primitive instruction cache.) To execute an instruction, the 8086 took bytes out of the queue one at a time. If the queue ran empty, the processor waited until more instruction bytes were fetched from memory into the queue.

A circuit called the loader handles the interaction between the prefetch queue and instruction execution. The loader is a small state machine that provides control signals to the rest of the execution circuitry. The loader gets the first byte of an instruction from the prefetch queue and issues a signal FC (First Clock) that starts execution of the instruction.

At this point, the Group Decode ROM performs the first stage of instruction decoding, classifying the instruction into various categories based on the opcode byte. Most of the 8086's instructions are implemented in microcode. However, a few instructions are so simple that they are implemented with logic circuits. For example, the CLC (Clear Carry) instruction clears the carry flag directly. The Group Decode ROM categorizes these instructions as 1BL (one-byte, implemented in logic). The loader responds by issuing an SC (Second Clock) signal to wrap up execution and start the next instruction. Thus, these simple instructions take two clock cycles.

The 8086 has various prefix bytes that can be put in front of an instruction to change its behavior. For instance, a segment prefix changes the memory segment that the instruction uses. A LOCK prefix locks the bus during the next instruction. The Group Decode ROM detects a prefix and outputs a prefix signal. This causes the prefix to be handled in logic, rather than microcode, similar to the 1BL instructions. Thus, a prefix also takes one byte and two clock cycles.

The remaining instructions are handled by microcode.2 Let's start with a one-byte instruction such as INC AX, which increments the AX register. As before, the loader gets the instruction byte from the prefix queue. The Group Decode ROM determines that this instruction is implemented in microcode and can start after one byte, so the microcode engine starts running. The microcode below handles the increment and decrement instructions. It moves the appropriate register, indicated by M to the ALU's temporary B register. It puts the incremented or decremented result (Σ) back into the register (M). RNI tells the loader to run the next instruction. With two micro-instruction, this instruction takes two clock cycles.

M → tmpB        XI tmpB, NX INC/DEC: get value from M, set up ALU
Σ → M           WB,RNI F     put result in M, run next instruction

But what happens with an instruction that is more than one byte long, such as adding an immediate value to a register? Let's look at ADD AX,1234, which adds 1234 to the AX register. As before, the loader reads one byte and then the microcode engine starts running. At this point, the 8086 doesn't "realize" that this is a 3-byte instruction. The first line of the microcode below gets one byte of the immediate operand: Q→tmpBL loads a byte from the instruction prefetch queue into the low byte of the temporary B register. Similarly, the second line loads the second byte. The next line puts the register value (M) in tmpA. The last line puts the sum back into the register and runs the next instruction. Since this instruction takes two bytes from the prefetch queue, it is three bytes long in total. But nothing explicitly indicates this instruction is three bytes long.

Q → tmpBL       JMPS L8 2  alu A,i: get byte from queue
Q → tmpBH                   get byte from queue
M → tmpA        XI tmpA, NX get value from M, set up ALU
Σ → M           WB,RNI F    put result in M, run next instruction

You can also add a one-byte immediate value to a register, such as ADD AL,12. This uses the same microcode above. However, in the first line, JMPS L8 is a conditional jump that skips the second micro-instruction if the data length is 8 bits. Thus, the microcode only consumes one byte from the prefetch queue, making the instruction two bytes long. In other words, what makes this instruction two bytes instead of three is the bit in the opcode which triggers the conditional jump in the microcode.

The 8086 has another class of instructions, those with a ModR/M byte following the opcode. The Group Decode ROM classifies these instructions as 2BR (two-byte ROM) indicating that the second byte must be fetched before processing by the microcode ROM. For these instructions, the loader fetches the second byte from the prefetch queue before triggering the SC (Second Clock signal) to start microcode execution.

The ModR/M byte indicates the addressing mode that the instruction should use, such as register-to-register or memory-to-register. The ModR/M can change the instruction length by specifying an address displacement of one or two bytes. A second ROM called the Translation ROM selects the appropriate microcode for the addressing mode (details). For example, if the addressing mode includes an address displacement, the microcode below is used:

Q → tmpBL   JMPS MOD1 12 [i]: get byte(s)
Q → tmpBH         
Σ → tmpA    BX EAFINISH 12: add displacement

This microcode fetches two displacement bytes from the prefetch queue (Q). However, if the ModR/M byte specifies a one-byte displacement, the MOD1 condition causes the microcode to jump over the second fetch. Thus, this microcode uses one or two additional instruction bytes depending on the value of the ModR/M byte.

To summarize, nothing in the 8086 "knows" how long an instruction is. The Group Decode ROM makes part of the decision, classifying instructions as a prefix, 1-byte logic, 2-byte ROM, or otherwise, causing the loader to fetch one or two bytes. The microcode then consumes instruction bytes as needed. In the end, the length of an 8086 instruction is determined by how many bytes are taken from the prefetch queue by the time it ends.

Some other systems

It's interesting to see how other processors deal with instruction length. For example, RISC processors (Reduced Instruction Set Computers) typically have fixed-length instructions. For instance, the ARM-1 processor used 32-bit instructions, making instruction decoding very simple.

Early microprocessors such as the MOS Technology 6502 (1975) didn't use microcode, but were controlled by state machines. The CPU fetches instruction bytes from memory as needed, as it moves through various execution states. Thus, as with the 8086, the length of an instruction wasn't explicit, but was how many bytes it used.

The IBM 1401 computer (1959) took a completely different approach with its variable-length words. Each character in memory had an associated "word mark" bit, which you can think of as a metadata bit. Each machine instruction consisted of a variable number of characters with a word mark on the first one. Thus, the processor could read instruction characters until it hit a word mark, which indicated the start of the next instruction. The word mark explicitly indicated to the processor how long each instruction was.

Perhaps the worst approach for variable-length instructions was the Intel iAPX 432 processor (1981), which had instructions with variable bit lengths, from 6 to 321 bits long. As a result, instructions weren't aligned on byte boundaries, making instruction decoding even more inconvenient. This was just one of the reasons that the iAPX 432 ended up overly complicated, years behind schedule, and a commercial failure.

The 8086's variable-length instructions led to the x86 architecture, with instructions from 1 to 15 bytes long. This is particularly inconvenient with modern superscalar processors that run multiple instructions in parallel. The problem is that the processor must break the instruction stream into individual instructions before they execute. The Intel P6 microarchitecture used in the Pentium Pro (1995) has instruction decoders to decode the instruction stream into micro-operations.4 It starts with an "instruction length block" that analyzes the first bytes of the instruction to determine how long it is. (This is not a straightforward task to perform rapidly on multiple instructions in parallel.) The "instruction steering block" uses this information to break the byte stream into instructions and steer instructions to the decoders.

The AMD K6 3D processor (1999) had predecode logic that associated 5 predecode bits with each instruction byte: three pointed to the start of the next instruction, one indicated the length depended on a D bit, and one indicated the presence of a ModR/M byte. This logic examined up to three bytes to make its decisions. Instructions were split apart and assigned to decoders based on the predecode bits. In some cases, the predecode logic gave up and flagged the instruction as "unsuccessfully predecoded", for instance an instruction longer than 7 bytes. These instructions were handled by a slower path.

Conclusions

The 8086 processor has instructions with a variety of lengths, but nothing in the processor explicitly determines the length. Instead, an instruction uses as many bytes as it needs. (That sounds sort of tautological, but I'm not sure how else to put it.) The Group Decode ROM makes an initial classification, the Translation ROM determines the addressing mode, and the microcode consumes bytes as needed.

While this approach gave the 8086 a flexible instruction set, it created a problem in the long run for the x86 architecture, requiring complicated logic to determine instruction length. One benefit of RISC-based processors such as the Apple M1 is that they have (mostly) constant instruction lengths, making instruction decoding faster and simpler.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff.

Notes and references

  1. I was inspired to investigate instruction length based on a Stack Overflow question

  2. I'll just give a brief overview of microcode here. Each micro-instruction is 21 bits long, as shown below. A micro-instruction specifies a move between a source register and destination register. It also has an action that depends on the micro-instruction type. For more details, see my post on the 8086 microcode pipeline.

    The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

    The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

     

  3. The 8088 processor, used in the original IBM PC, has a smaller 4-byte prefetch queue. The 8088 is almost the same as the 8086, except it has an 8-bit external bus instead of a 16-bit external bus. This makes memory accesses (including prefetches) slower, so a smaller prefetch queue works better. 

  4. The book Modern Processor Design discusses the P6 microarchitecture in detail. The book The Anatomy of a High-Performance Microprocessor discusses the AMD K5 3D processor in even more detail; see chapter 2. 

Reverse-engineering the ModR/M addressing microcode in the Intel 8086 processor

One interesting aspect of a computer's instruction set is its addressing modes, how the computer determines the address for a memory access. The Intel 8086 (1978) used the ModR/M byte, a special byte following the opcode, to select the addressing mode.1 The ModR/M byte has persisted into the modern x86 architecture, so it's interesting to look at its roots and original implementation.

In this post, I look at the hardware and microcode in the 8086 that implements ModR/M2 and how the 8086 designers fit multiple addressing modes into the 8086's limited microcode ROM. One technique was a hybrid approach that combined generic microcode with hardware logic that filled in the details for a particular instruction. A second technique was modular microcode, with subroutines for various parts of the task.

I've been reverse-engineering the 8086 starting with the silicon die. The die photo below shows the chip under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. I've labeled the key functional blocks; the ones that are important to this discussion are darker and will be discussed in detail below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles bus and memory activity as well as instruction prefetching, while the Execution Unit (EU) executes instructions and microcode. Both units play important roles in memory addressing.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

8086 addressing modes

Let's start with an addition instruction, ADD dst,src, which adds a source value to a destination value and stores the result in the destination.3 What are the source and destination? Memory? Registers? The addressing mode answers this question.

You can use a register as the source and another register as the destination. The instruction below uses the AX register as the destination and the BX register as the source. Thus, it adds BX to AX and puts the result in AX.

ADD AX, BX           Add the contents of the BX register to the AX register

A memory access is indicated with square brackets around the "effective address"4 to access. For instance, [1234] means the memory location with address 1234, while [BP] means the memory location that the BP register points to. For a more complicated addressing mode, [BP+SI+1234] means the memory location is determined by adding the BP and SI registers to the constant 1234 (known as the displacement). On the 8086, you can use memory for the source or the destination, but not both. Here are some examples of using memory as a source:

ADD AX, [1234]       Add the contents of memory location 1234 to AX register
ADD CX, [BP]         Add memory pointed to by BP register to CX register
ADD DX, [BX+SI+1234] Source memory address is BX + SI + constant 1234

Here are examples with memory as the destination:

ADD [1234], AX       Add AX to the contents of memory location 1234
ADD [BP], CX         Add CX to memory pointed to by BP register
ADD [BX+SI+1234], DX Destination memory address is BX + SI + constant 1234

You can also operate on bytes instead of words, using a byte register and accessing a memory byte:

ADD AL, [SI+1234]    Add to the low byte of AX register
ADD AH, [BP+DI+1234] Add to the high byte of AX register

As you can see, the 8086 supports many different addressing schemes. To understand how they are implemented, we must first look at how instructions encode the addressing schemes in the ModR/M byte.

The ModR/M byte

The ModR/M byte follows many opcodes to specify the addressing mode. This byte is fairly complicated but I'll try to explain it in this section. The diagram below shows how the byte is split into three fields:5 mod selects the overall mode, reg selects a register, and r/m selects either a register or memory mode.

modregr/m
76543210

I'll start with the register-register mode, where the mod bits are 11 and the reg and r/m fields each select one of eight registers, as shown below. The instruction ADD AX,BX would use reg=011 to select BX and r/m=000 to select AX, so the ModR/M byte would be 11011000. (The register assignment depends on whether the instruction operates on words, bytes, or segment registers. For instance, in a word instruction, 001 selects the CX register, while in a byte instruction, 001 selects the CL register, the low byte of CX.)

The register assignments, from MCS-86 Assembly Language Reference Guide.

The register assignments, from MCS-86 Assembly Language Reference Guide.

The next addressing mode specifies a memory argument and a register argument. In this case, the mod bits are 00, the reg field specifies a register as described above, and the r/m field specifies a memory address according to the table below. For example, the instruction ADD [SI],CX would use reg=001 to select CX and r/m=100 to select [SI], so the ModR/M byte would be 00001100.

r/mOperand Address
000[BX+SI]
001[BX+DI]
010[BP+SI]
011[BP+DI]
100[SI]
101[DI]
110[BP]
111[BX]

The next mode, 01, adds an 8-bit signed displacement to the address. This displacement consists of one byte following the ModR/M byte. This supports addressing modes such as [BP+5]. The mode 10 is similar except the displacement is two bytes long, for addressing modes such as [BP+DI+0x1234].

The table below shows the meaning of all 256 values for the ModR/M byte. The mod bits are colored red, the reg bits green, and the r/m bits blue. Note the special case "disp16" to support a 16-bit fixed address.

The ModR/M values. Note that this table would be trivial if it used octal rather than hexadecimal. Based on Table 6-13 in the ASM386 Assembly Language Reference.

The ModR/M values. Note that this table would be trivial if it used octal rather than hexadecimal. Based on Table 6-13 in the ASM386 Assembly Language Reference.

The register combinations for memory accesses may seem random but they were designed to support the needs of high-level languages, such as arrays and data structures. The idea is to add a base register, an index register, and/or a fixed displacement to determine the address.6 The base register can indicate the start of an array, the index register holds the offset in the array, and the displacement provides the offset of a field in the array entry. The base register is BX for data or BP for information on the stack. The index registers are SI (Source Index) and DI (Destination Index).7

Some addressing features are handled by the opcode, not the ModR/M byte. For instance, the ModR/M byte doesn't distinguish between ADD AX,[SI] and ADD [SI],AX. Instead, the two variants are distinguished by bit 1 of the instruction, the D or "direction" bit.8 Moreover, many instructions have one opcode that operates on words and another that operates on bytes, distinguished by bit 0 of the opcode, the W or word bit.

The D and W bits are an example of orthogonality in the 8086 instruction set, allowing features to be combined in various combinations. For instance, the addressing modes combine 8 types of offset computation with three sizes of displacements and 8 target registers. Arithmetic instructions combine these addressing modes with eight ALU operations, each of which can act on a byte or a word, with two possible memory directions. All of these combinations are implemented with one block of microcode, implementing a large instruction set with a small amount of microcode. (The orthogonality of the 8086 shouldn't be overstated, though; it has many special cases and things that don't quite fit.)

An overview of 8086 microcode

Most people think of machine instructions as the basic steps that a computer performs. However, many processors (including the 8086) have another layer of software underneath: microcode. With microcode, instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.

The 8086 uses a hybrid approach: although it uses microcode, much of the instruction functionality is implemented with gate logic. This approach removed duplication from the microcode and kept the microcode small enough for 1978 technology. In a sense, the microcode is parameterized. For instance, the microcode can specify a generic Arithmetic/Logic Unit (ALU) operation and a generic register. The gate logic examines the instruction to determine which specific operation to perform and the appropriate register.

A micro-instruction in the 8086 is encoded into 21 bits as shown below. Every micro-instruction has a move from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits depends on the type field. A "short jump" is a conditional jump within the current block of 16 micro-instructions. An ALU operation sets up the arithmetic-logic unit to perform an operation. Bookkeeping operations are anything from flushing the prefetch queue to ending the current instruction. A memory operation triggers a bus cycle to read or write memory. A "long jump" is a conditional jump to any of 16 fixed microcode locations (specified in an external table called the Translation ROM). Finally, a "long call" is a conditional subroutine call to one of 16 locations. For more about 8086 microcode, see my microcode blog post.

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

Some examples of microcode for addressing

In this section, I'll take a close look at a few addressing modes and how they are implemented in microcode. In the next section, I'll summarize all the microcode for addressing modes.

A register-register operation

Let's start by looking at a register-to-register instruction, before we get into the complications of memory accesses: ADD BX,AX which adds AX to BX, storing the result in BX. This instruction has the opcode value 01 and ModR/M value C3 (hex).

Before the microcode starts, the hardware performs some decoding of the opcode. The Group Decode ROM (below) classifies an instruction into multiple categories: this instruction contains a D bit, a W bit, and an ALU operation, and has a ModR/M byte. Fields from the opcode and ModR/M bytes are extracted and stored in various internal registers. The ALU operation type (ADD) is stored in the ALU opr register. From the ModR/M byte, the reg register code (AX) is stored in the N register, and the r/m register code (BX) is stored in the M register. (The M and N registers are internal registers that are invisible to the programmer; each holds a 5-bit register code that specifies a register.9)

This diagram shows the Group Decode ROM. The Group Decode ROM is more of a PLA (programmable logic array) with two layers of NOR gates. Its input lines are at the lower left and its outputs are at the upper right.

This diagram shows the Group Decode ROM. The Group Decode ROM is more of a PLA (programmable logic array) with two layers of NOR gates. Its input lines are at the lower left and its outputs are at the upper right.

Once the preliminary decoding is done, the microcode below for this ALU instruction is executed.10 (There are three micro-instructions, so the instruction takes three clock cycles.) Each micro-instruction contains a move and an action. First, the register specified by M (i.e. BX) is moved to the ALU's temporary A register (tmpA). Meanwhile, the ALU is configured to perform the appropriate operation on tmpA; XI indicates that the ALU operation is specified by the instruction bits, i.e. ADD).

The second instruction moves the register specified by N (i.e. AX) to the ALU's tmpB register. The action NX indicates that this is the next-to-last micro-instruction so the microcode engine can start processing the next machine instruction. The last micro-instruction stores the ALU's result (Σ) in the register indicated by M (i.e. BX). The status flags are updated because of the F. WB,RNI (Run Next Instruction) indicates that this is the end and the microcode engine can process the next machine instruction. The WB prefix would skip the actions if a memory writeback were pending (which is not the case).

  move       action
M → tmpA     XI tmpA   ALU rm↔r: BX to tmpA
N → tmpB     WB,NX      AX to tmpB
Σ → M        WB,RNI F   result to BX, run next instruction.

This microcode packs a lot into three micro-instructions. Note that it is very generic: the microcode doesn't know what ALU operation is being performed or which registers are being used. Instead, the microcode deals with abstract registers and operations, while the hardware fills in the details using bits from the instructions. The same microcode is used for eight different ALU operations. And as we'll see, it supports multiple addressing modes.

Using memory as the destination

Memory operations on the 8086 involve both microcode and hardware. A memory operation uses two internal registers: IND (Indirect) holds the memory address, while OPR (Operand) holds the word that is read or written. A typical memory micro-instruction is R DS,P0, which starts a read from the Data Segment with a "Plus 0" on the IND register afterward. The Bus Interface Unit carries out this operation by adding the segment register to compute the physical address, and then running the memory bus cycles.

With that background, let's look at the instruction ADD [SI],AX, which adds AX to the memory location indexed by SI. As before, the hardware performs some analysis of the instruction (hex 01 04). In the ModR/M byte, mod=00 (memory, no displacement), reg=000 (AX), and R/M=100 ([SI]). The N register is loaded with the code for AX as before. The M register, however, is loaded with OPR (the memory data register) since the Group Decode ROM determines that the instruction has a memory addressing mode.

The microcode below starts in an effective address microcode subroutine for the [SI] mode. The first line of the microcode subroutine computes the effective address simply by loading the tmpA register with SI. It jumps to the micro-routine EAOFFSET which ends up at EALOAD (for reasons that will be described below), which loads the value from memory. Specifically, EALOAD puts the address in IND, reads the value from memory, puts the value into tmpB, and returns from the subroutine.

SI → tmpA   JMP EAOFFSET [SI]: put SI in tmpA
tmpA → IND  R DS,P0      EALOAD: read memory
OPR → tmpB  RTN  
M → tmpA    XI tmpA      ALU rm↔r: OPR to tmpA
N → tmpB    WB,NX         AX to tmpB
Σ → M       WB,RNI F      result to BX, run next instruction.
            W DS,P0 RNI   writes result to memory

Microcode execution continues with the ALU rm↔r routine described above, but with a few differences. The M register indicates OPR, so the value read from memory is put into tmpA. As before, the N register specifies AX, so that register is put into tmpB. In this case, the WB,NX determines that the result will be written back to memory so it skips the NXT operation. The ALU's result (Σ) is stored in OPR as directed by M. The WB,RNI is skipped so microcode execution continues. The W DS,P0 micro-instruction writes the result (in OPR) to the memory address in IND. At this point, RNI terminates the microcode sequence.

A lot is going on here to add two numbers! The main point is that the same microcode runs as in the register case, but the results are different due to the M register and the conditional WB code. By running different subroutines, different effective address computations can be performed.

Using memory as the source

Now let's look at how the microcode uses memory as a source, as in the instruction ADD AX,[SI]. This instruction (hex 03 04) has the same ModR/M byte as before, so the N register holds AX and the M register holds OPR. However, because the opcode has the D bit set, the M and N registers are swapped when accessed. Thus, when the microcode uses M, it gets the value AX from N, and vice versa. (Yes, this is confusing.)

The microcode starts the same as the previous example, reading [SI] into tmpB and returning to the ALU code. However, since the meaning of M and N are reversed, the AX value goes into tmpA while the memory value goes into tmpB. (This switch doesn't matter for addition, but would matter for subtraction.) An important difference is that there is no writeback to memory, so WB,NX starts processing the next machine instruction. In the last micro-instruction, the result is written to M, indicating the AX register. Finally, WB,RNI runs the next machine instruction.

SI → tmpA   JMP EAOFFSET [SI]: put SI in tmpA
tmpA → IND  R DS,P0      EALOAD: read memory
OPR → tmpB  RTN  
M → tmpA    XI tmpA      ALU rm↔r: AX to tmpA
N → tmpB    WB,NX         OPR to tmpB
Σ → M       WB,RNI F      result to AX, run next instruction.

The main point is that the same microcode handles memory as a source and a destination, simply by setting the D bit. First, the D bit reverses the operands by swapping M and N. Second, the WB conditionals prevent the writeback to memory that happened in the previous case.

Using a displacement

The memory addressing modes optionally support a signed displacement of one or two bytes. Let's look at the instruction ADD AX,[SI+0x1234]. In hex, this instruction is 03 84 34 12, where the last two bytes are the displacement, reversed because the 8086 uses little-endian numbers. The mod bits are 10, indicating a 16-bit displacement, but the other bits are the same as in the previous example.

Microcode execution again starts with the [SI] subroutine. However, the jump to EAOFFSET goes to [i] this time, to handle the displacement offset. (I'll explain how, shortly.) This code loads the offset as two bytes from the instruction prefetch queue (Q) into the tmpB register. It adds the offset to the previous address in tmpA and puts the sum Σ in tmpA, computing the effective address. Then it jumps to EAFINISH (EALOAD). From there, the code continues as earlier, reading an argument from memory and computing the sum.

SI → tmpA   JMP EAOFFSET [SI]: put SI in tmpA
Q → tmpBL   JMPS MOD1 12 [i]: load from queue, conditional jump
Q → tmpBH     
Σ → tmpA    JMP EAFINISH 12:
tmpA → IND  R DS,P0      EALOAD: read memory
OPR → tmpB  RTN  
M → tmpA    XI tmpA      ALU rm↔r: AX to tmpA
N → tmpB    WB,NX         OPR to tmpB
Σ → M       WB,RNI F      result to AX, run next instruction.

For the one-byte displacement case, the conditional MOD1 will jump over the fetch of the second displacement byte. When the first byte is loaded into the low byte of tmpB, it was sign-extended into the high byte.14 Thus, the one-byte displacement case uses the same microcode but ends up with a sign-extended 1-byte displacement in tmpB.

The Translation ROM

Now let's take a closer look at the jumps to EAOFFSET, EAFINISH, and the effective address subroutines, which use something called the Translation ROM. The Translation ROM converts the 5-bit jump tag in a micro-instruction into a 13-bit microcode address. It also provides the addresses of the effective address subroutines. As will be seen below, there are some complications.11

The Translation ROM as it appears on the die. The metal layer has been removed to expose the silicon and polysilicon underneath. The left half decodes the inputs to select a row. The right half outputs the corresponding microcode address.

The Translation ROM as it appears on the die. The metal layer has been removed to expose the silicon and polysilicon underneath. The left half decodes the inputs to select a row. The right half outputs the corresponding microcode address.

The effective address micro-routines

Register calculations

The Translation ROM has an entry for the addressing mode calculations such as [SI] and [BP+DI], generally indicated by the r/m bits, the three low bits of the ModR/M byte. Each routine computes the effective address and puts it into the ALU's temporary A register and jumps to EAOFFSET, which adds any displacement offset. The microcode below shows the four simplest effective address calculations, which just load the appropriate register into tmpA.

SI → tmpA   JMP EAOFFSET   [SI]: load SI into tmpA
DI → tmpA   JMP EAOFFSET   [DI]: load SI into tmpA
BP → tmpA   JMP EAOFFSET   [BP]: load BP into tmpA
BX → tmpA   JMP EAOFFSET   [BX]: load BX into tmpA

For the cases below, an addition is required, so the registers are loaded into the ALU's temporary A and temporary B registers. The effective address is the sum (indicated by Σ), which is moved to temporary A.12 These routines are carefully arranged in memory so [BX+DI] and [BP+SI] each execute one micro-instruction and then jump into the middle of the other routines, saving code.13

BX → tmpA         [BX+SI]: get regs
SI → tmpB         1:
Σ → tmpA   JMP EAOFFSET  

BP → tmpA         [BP+DI]: get regs
DI → tmpB         4:
Σ → tmpA   JMP EAOFFSET  

BX → tmpA  JMPS 4 [BX+DI]: short jump to 4
BP → tmpA  JMPS 1 [BP+SI]: short jump to 1

The EAOFFSET and EAFINISH targets

After computing the register portion of the effective address, the routines above jump to EAOFFSET, but this is not a fixed target. Instead, the Translation ROM selects one of three target microcode addresses based on the instruction and the ModR/M byte:
If there's a displacement, the microcode jumps to [i] to add the displacement value.
If there is no displacement but a memory read, the microcode otherwise jumps to EALOAD to load the memory contents.
If there is no displacement and no memory read should take place, the microcode jumps to EADONE.
In other words, the microcode jump is a three-way branch that is implemented by the Translation ROM and is transparent to the microcode.

For a displacement, the [i] immediate code below loads a 1-byte or 2-byte displacement into the tmpB register and adds it to the tmpA register, as described earlier. At the end of a displacement calculation, the microcode jumps to the EAFINISH tag, which is another branching target. Based on the instruction, the Translation ROM selects one of two microcode targets: EALOAD to load from memory, or EADONE to skip the load.

Q → tmpBL   JMPS MOD1 12 [i]: get byte(s)
Q → tmpBH         
Σ → tmpA    JMP EAFINISH 12: add displacement

The EALOAD microcode below reads the value from memory, using the effective address in tmpA. It puts the result in tmpB. The RTN micro-instruction returns to the microcode that implements the original machine instruction.

tmpA → IND  R DS,P0   EALOAD: read from tmpA address
OPR → tmpB  RTN        store result in tmpB, return

The EADONE routine puts the effective address in IND, but it doesn't read from the memory location. This supports machine instructions such as MOV (some moves) and LEA (Load Effective Address) that don't read from memory

tmpA → IND  RTN   EADONE: store effective address in IND

To summarize, the microcode runs different subroutines and different paths, depending on the addressing mode, executing the appropriate code. The Translation ROM selects the appropriate control flow path.

Special cases

There are a couple of special cases in addressing that I will discuss in this section.

Supporting a fixed address

It is common to access a fixed memory address, but the standard addressing modes use a base or index register. The 8086 replaces the mode of [BP] with no displacement with 16-bit fixed addressing. In other words, a ModR/M byte with the pattern 00xxx110 is treated specially. (This special case is the orange disp16 line in the ModR/M table earlier.) This is implemented in the Translation ROM which has additional rows to detect this pattern and execute the immediate word [iw] microcode below instead. This microcode fetches a word from the instruction prefetch queue (Q) into the tmpA register, a byte at a time. It jumps to EAFINISH instead of EAOFFSET because it doesn't make sense to add another displacement.

Q → tmpAL          [iw]: get bytes
Q → tmpAH  JMP EAFINISH  

Selecting the segment

Memory accesses in the 8086 are relative to one of the 64-kilobyte segments: Data Segment, Code Segment, Stack Segment, or Extra Segment. Most addressing modes use the Data Segment by default. However, addressing modes that use the BP register use the Stack Segment by default. This is a sensible choice since the BP (Base Pointer) register is intended for accessing values on the stack.

This special case is implemented in the Translation ROM. It has an extra output bit that indicates that the addressing mode should use the Stack Segment. Since the Translation ROM is already decoding the addressing mode to select the right microcode routine, adding one more output bit is straightforward. This bit goes to the segment register selection circuitry, changing the default segment. This circuitry also handles prefixes that change the segment. Thus, segment register selection is handled in hardware without any action by the microcode.

Conclusions

I hope you have enjoyed this tour through the depths of 8086 microcode. The effective address calculation in the 8086 uses a combination of microcode and logic circuitry to implement a variety of addressing methods. Special cases make the addressing modes more useful, but make the circuitry more complicated. This shows the CISC (Complex Instruction Set Computer) philosophy of x86, making the instructions complicated but highly functional. In contrast, the RISC (Reduced Instruction Set Computer) philosophy takes the opposite approach, making the instructions simpler but allowing the processor to run faster. RISC vs. CISC was a big debate of the 1980s, but isn't as relevant nowadays.

People often ask if microcode could be updated on the 8086. Microcode was hardcoded into the ROM, so it could not be changed. This became a big problem for Intel with the famous Pentium floating-point division bug. The Pentium chip turned out to have a bug that resulted in rare but serious errors when dividing. Intel recalled the defective processors in 1994 and replaced them at a cost of $475 million. Starting with the Pentium Pro (1995), microcode could be patched at boot time, a useful feature that persists in modern CPUs.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff.

Notes and references

  1. There are additional addressing modes that don't use a ModR/M byte. For instance, immediate instructions use a constant in the instruction. For instance ADD AX,42 adds 42 to the AX register. Other instructions implicitly define the addressing mode. I'm ignoring these instructions for now. 

  2. The 8086 supports more addressing modes than the ModR/M byte provides, by using separate opcodes. For instance, arithmetic instructions can take an "immediate" value, an 8- or 16-bit value specified as part of the instruction. Other instructions operate on specific registers rather than memory or access memory through the stack. For this blog post, I'm focusing on the ModR/M modes and ignoring the other instructions. Also, although I'm discussing the 8086, this blog post applies to the Intel 8088 processor as well. The 8088 has an 8-bit bus, a smaller prefetch queue, and some minor internal changes, but for this post you can consider them to be the same. 

  3. My assembly code examples are based on Intel ASM86 assembly syntax. There's a completely different format of x86 assembly language known as AT&T syntax. Confusingly, it reverses the source and destination. For example, in AT&T syntax, addw %bx, %cx% stores the result in CX. AT&T syntax is widely used, for instance in Linux code. The AT&T syntax is based on earlier PDP-11 assembly code

  4. The term "effective address" dates back to the 1950s, when computers moved beyond fixed memory addresses and started using index registers. The earliest uses that I could find are from 1955 for the IBM 650 data processing machine and the IBM 704 mainframe. The "Load Effective Address" instruction, which provides the effective address as a value instead of performing the memory access, was perhaps introduced in the IBM System/360 (1964) under the name "Load Address". It has been a part of many subsequent processors including the 8086. 

  5. Note that the ModR/M byte has the bits grouped in threes (as do many instructions). This is due to the octal heritage of the 8086, dating back through the 8080 and the 8008 to the Datapoint 2200 (which used TTL chips to decode groups of three bits). Although the 8086 instruction set is invariably described in hexadecimal, it makes much more sense when viewed in octal. See x86 is an octal machine for details. 

  6. The 8086's addressing schemes are reminiscent of the IBM System/360 (1964). In particular, System/360 had a "RX" instruction format that accessed memory through a base register plus an index register plus a displacement, using another register for the other argument. This is very similar to the 8086's base + index + displacement method. The System/360's "RR" (register-register) instruction format accessed two registers, much like the register mode of the ModR/M byte. The details are very different, though, between the two systems. See the IBM System/360 Principles of Operation for more details. 

  7. The motivation behind the ModR/M options is discussed in The 8086/8088 Primer by 8086 designer Steve Morse, pages 23-33. 

  8. The D bit is usually called the register direction bit, but the designer of the 8086 instruction set calls it the destination field; see The 8086/8088 Primer, Steve Morse, page 28. To summarize:
    If the bit is 0, the result is stored into the location indicated by the mod and r/m fields while the register specified by reg is the source.
    If the bit is 1, the result is stored into the register indicated by the reg field.

    For the W word bit, 0 indicates a byte operation and 1 indicates a word operation.

    One curious side-effect of the D bit is that an instruction like ADD AX,BX can be implemented in two ways since both arguments are registers. The reg field can specify AX while the r/m field specifies BX or vice versa, depending on the D bit. Different 8086 assemblers can be "fingerprinted" based on their decisions in these ambiguous cases. 

  9. The M and N registers hold a 5-bit code. This code indicates a 16-bit register (e.g. AX or IND), an 8-bit register (e.g. AL), or a special value (e.g. Σ, the ALU result; ZEROS, all zero bits; or F, the flags). The 3-bit register specification is mapped onto the 5-bit code depending on whether the W bit is set (byte or word register), or if the operation is specifying a segment register. 

  10. The microcode listings are based on Andrew Jenner's disassembly. I have made some modifications to (hopefully) make it easier to understand. 

  11. You can also view the Translation ROM as a PLA (Programmable Logic Array) constructed from two layers of NOR gates. The conditional entries make it seem more like a PLA than a ROM. Technically, it can be considered a ROM since a single row is active at a time. I'm using the name "Translation ROM" because that's what Intel calls it in the patents. 

  12. Normally, an ALU operation requires a micro-instruction to specify the desired ALU operation and temporary register. For the address addition, the ALU operation is not explicitly specified because it uses the ALU's default, of adding tmpA and tmpB. The ALU is reset to this default at the beginning of each machine instruction. 

  13. A microcode jump takes an extra clock cycle for the microcode address register to get updated. This is why, for instance, [BP+DI] takes 7 clock cycles but [BX+DI] takes 8 clock cycles. Thus, the 8086 implementers took the tradeoff of slowing down some addressing modes by a clock cycle in order to save a few micro-instructions in the small microcode ROM.

    This table shows the clock cycles required for effective address calculations. From MCS-86 Assembly Language Reference Guide.

    This table shows the clock cycles required for effective address calculations. From MCS-86 Assembly Language Reference Guide.

     

  14. A one-byte signed number can be sign-extended into a two-byte signed number. This is done by copying the top bit (the sign) from the low byte and filling the top byte with that bit. For example, 0x64 is sign-extended to 0x0064 (+100), while 0x9c is sign-extended to 0xff9c (-100). 

Reverse-engineering the interrupt circuitry in the Intel 8086 processor

Interrupts have been an important part of computers since the mid-1950s,1 providing a mechanism to interrupt a program's execution. Interrupts allows the computer to handle time-critical tasks such as I/O device operations. In this blog post, I look at the interrupt features in the Intel 8086 (1978) and how they are implemented in silicon, a combination of interesting circuitry and microcode.

I've been reverse-engineering the 8086 starting with the silicon die. The die photo below shows the chip under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins; relevant pins are marked in yellow. I've labeled the key functional blocks; the ones that are important to this discussion are darker and will be discussed in detail below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles bus activity, while the Execution Unit (EU) executes instructions and microcode. Both parts are extensively involved in interrupt handling.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Interrupts in the 8086

The idea behind an interrupt is to stop the current flow of execution, run an interrupt handler to perform a task, and then continue execution where it left off. An interrupt is like a subroutine call in some ways; it pushes the current segment register and program counter on the stack and continues at a new address. However, there are a few important differences. First, the address of the interrupt handler is obtained indirectly, through an interrupt vector table. Interrupts are numbered 0 through 255, and each interrupt has an entry in the vector table that gives the address of the code to handle the interrupt. Second, an interrupt pushes the processor flags to the stack, so they can be restored after the interrupt. Finally, an interrupt clears the interrupt and trap flags, blocking more interrupts while handling the interrupt.

The 8086 provides several types of interrupts, some generated by hardware and some generated by software. For hardware interrupts, the INTR pin on the chip generates a maskable interrupt when activated, while the NMI pin on the chip generates a higher-priority non-maskable interrupt.2 Typically, most interrupts use the INTR pin, signaling things such as a timer, keyboard request, real-time clock, or a disk needing service. The NMI interrupt is designed for things such as parity error or an impending power failure, which are so critical they can't be delayed. The 8086 also has a RESET pin that resets the CPU. Although not technically an interrupt, the RESET action has many features in common with interrupts, so I'll discuss it here.

On the software side, the 8086 has multiple types of interrupts generated by different instructions. The INT n instruction creates an interrupt of the specified type (0 to 255). These software interrupts were used in the IBM PC to execute a function in the BIOS, the layer underneath the operating system. These functions could be everything from a floppy disk operation to accessing the printer. The one-byte INT 3 instruction creates a breakpoint interrupt for debugging. The divide instructions generate an interrupt if a divide-by-zero or overflow occurs. The INTO instruction (Interrupt if Overflow) generates an interrupt if the overflow flag is set. To support single-step mode in debuggers, the Trap flag generate an interrupt on every instruction.

The diagram below shows how the vector table is implemented. Each of the 256 interrupt types has an entry holding the address of the interrupt handler (the code segment value and the instruction pointer (program counter) value). In the next section, I'll show below how the microcode loads the vector from the table and switches execution to that interrupt handler.

This diagram shows where the interrupt vectors are stored in memory.  From iAPX 86/88 User's Manual, Figure 4-18.

This diagram shows where the interrupt vectors are stored in memory. From iAPX 86/88 User's Manual, Figure 4-18.

Microcode

Most of the operations in the 8086 are implemented in microcode, a low-level layer of code that sits between the machine code instructions and the chip's hardware. I'll explain a few features of the 8086's microcode that are important for the interrupt code. Each micro-instruction is 21 bits long, as shown below. The first part of the micro-instruction specifies a move between a source register and a destination register; these may be special-purpose internal registers, not just the registers visible to the programmer. The meaning of the remaining bits depends on the type of micro-instruction, but includes jumps within the microcode, ALU (Arithmetic/Logic Unit) operations, and memory operations.

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

For a memory access, microcode issues a memory read or write micro-instruction. Memory accesses use two internal registers: the IND (Indirect) register holds the address in the segment, while the OPR (Operand) register holds the word that is read or written. A micro-instruction such as W SS,P2 writes the OPR register to the memory address specified by the IND register and the segment register (SS indicates the stack segment). The IND register can also be incremented or decremented (P2 indicates "Plus 2").

The 8086's Bus Interface Unit (BIU) handles the memory request in hardware, while the microcode waits. The BIU has an adder to combine the segment address and the offset to obtain the "absolute" address. It also has a constant ROM to increment or decrement the IND register. Memory accesses are complicated in the 8086 and take at least four clock cycles,3 called T1, T2, T3, and T4. An interrupt acknowledge is almost the same as a memory read, except the IAK bit is set in the microcode, causing some behavior changes.

The interaction between microcode and the ALU (Arithmetic/Logic Unit) will also be important. The ALU has three temporary registers that hold the arguments for operations, called temporary A, B, and C. These registers are invisible to the programmer. The first argument for, say, an addition can come from any of the three registers, while the second argument is always from the temporary B register. Performing an ALU operation takes at least two micro-instructions. First, the ALU is configured to perform an operation, for example, ADD or DEC2 (decrement by 2). The result is then read from the ALU, denoted as the Σ register.

Software interrupts

The main microcode for interrupt handling is shown below.4 Each line specifies a move operation and an action, with my comments in green. On entry to INTR interrupt handler, the OPR operand register holds the interrupt type. This chunk of microcode looks up the interrupt handler in the vector table, pushes the status flags onto the stack, and then branches to a routine FARCALL2 to perform a subroutine call the interrupt handler.

       move            action
19d  OPR → tmpBL     SUSP         INTR: OPR to tmpB(low), suspend prefetch
19e  0 → tmpbH       ADD tmpB      0 to tmpB(high), add tmpB to tmpB
19f  Σ → tmpB                      ALU sum to tmpB, add tmpB to tmpB
1a0  Σ → IND         R S0,P2       ALU sum to IND, read from memory, IND+=2
1a1  OPR → tmpB      DEC2 tmpC     memory to tmpB, set up decrement tmpC
1a2  SP → tmpC       R S0,P0       SP to tmpC, read from memory
1a3  OPR → tmpA                    memory to tmpA
1a4  F → OPR         CITF          Flags to OPR, clear interrupt and trap flags
1a5  Σ → IND         W SS,P0       ALU dec to IND, Write to memory
1a6  IND → tmpC      JMP FARCALL2  IND to tmpC, branch to FARCALL2

In more detail, the microcode routine starts at 19d by moving the interrupt number from the OPR register to the low byte of the ALU's temporary B register. The SUSP action suspends instruction prefetching since we'll start executing instructions from a new location. Next, line 19e zeros out the top byte of the temporary B register and tells the ALU to add temporary B to itself. The next micro-instruction puts the ALU result (indicated by Σ) into temporary B, doubling the value.

Line 1a0 calculates another sum (doubling) from the ALU and stores it in the IND register. In other words, the interrupt number has been multiplied by 4, yielding an address into the vector table. The interrupt handle address is read from the vector table: R S0,P2 operation reads from memory, segment 0, and performs a "Plus 2" on the IND register. Line 1a1 puts the result (OPR) into the temporary B register.

Line 1a2 stores the current stack pointer register into temporary C. It also performs a second read to get the handler code segment from the vector table. Line 1a3 stores this in the temporary A register. Line 1a4 puts the flags (F) into the OPR register. It also clears the interrupt and trap flags (CITF), blocking further interrupts.

Line 1a5 puts the ALU result (the decremented stack pointer) into the IND register. (This ALU operation was set up back in line 1a1.) To push the flags on the stack, W SS,P0 writes OPR to the Stack segment and does a "Plus 0" on the IND register. Finally, line 1a6 stores the IND register (the new top-of-stack) into the temporary C register and jumps to the FARCALL2 micro-routine.5

Understanding microcode can be tricky, since it is even more low-level than machine instructions, but hopefully this discussion gives you a feel for it. Everything is broken down into very small steps, even more basic than machine instructions. Microcode is a bit like a jigsaw puzzle, carefully fit together to ensure everything is at the right place at the right time, as efficiently as possible.

Subroutine call microcode: FARCALL2

Next, I'll describe the FARCALL2 microcode. Because of its segment registers, the 8086 has two types of calls (and jumps): a near call is a subroutine call within the same code segment, while a far call is a subroutine call to a different code segment. A far call target is specified with two words: the new code segment register and the new program counter.

The FARCALL2 micro-routine performs a subroutine call to a particular segment and offset. At entry to FARCALL2, the target code segment in temporary A, the offset is in temporary B, and the decremented stack pointer will be provided by the ALU. The microcode below pushes the code segment register to the stack, updates the code segment register with the new value, and then jumps to NEARCALL to finish the subroutine call.

06c  Σ → IND      CORR        FARCALL2: ALU (IND-2) to IND, correct PC
06d  CS → OPR     W SS,M2      CS to OPR, write to memory, IND-=2
06e  tmpA → CS    PASS tmpC    tmpA to CS, ALU passthrough
06f  PC → OPR     JMP NEARCALL PC to OPR, branch to NEARCALL

For a subroutine call, the program counter is saved so execution can resume where it left off. But because of prefetching, the program counter in the 8086 points to the next instruction to fetch, not the next instruction to execute. To fix this, the CORR (Correction) micro-instruction corrects the program counter value by subtracting the length of the prefetch queue. Line 06c also puts the decremented stack location into IND using the ALU decrement operation set up way back at line 1a1.

Line 06d puts the code segment value (CS) into the OPR register and then writes it to the stack segment, performing a "Minus 2" on IND. In other words, the CS register is pushed onto the stack. Line 06e stores the new value (from temporary A) into the CS register. It also sets up the ALU to pass the value of the temporary C register as its result. Finally, line 06f puts the (corrected) program counter into the OPR register and jumps to the NEARCALL routine.

Subroutine call microcode: NEARCALL

The NEARCALL micro-routine does a near subroutine call, updating the program counter but not the segment register. At entry, the target address is in temporary B, the IND register indicates the top of the stack, and OPR holds the program counter.

077  tmpB → PC    FLUSH      NEARCALL: tmpB to PC, restart prefetch
078  IND → tmpC               IND to tmpC
079  Σ → IND                  ALU to IND
07a  Σ → SP       W SS,P0 RNI ALU to SP, write PC to memory, run next instruction

Line 077 puts temporary B into the program counter. The FLUSH operation flushes the stale instructions from the prefetch queue and starts prefetching from the new PC address. Line 078 puts IND (i.e. the new stack pointer value) into temporary C. Line 079 puts this value into the IND register and line 07a puts this value into the SP register. (The ALU was configured at line 06e to pass the temporary C value unmodified.)

Line 07a pushes the PC to the stack by writing OPR (the old program counter) to the stack segment. Finally, RNI (Run Next Instruction) ends this microcode sequence and causes the 8086 to run the next machine instruction, the first instruction of the interrupt handler.

Starting an interrupt

The above microcode handles a generic interrupt. But there's one more piece: setting up the interrupt type for the instruction. For instance, the INT ib machine instruction has the interrupt type in the second byte of the opcode. This machine instruction has the two micro-instructions below. The microcode loads the type from the instruction prefetch queue (Q) and puts it into temporary B and then OPR. Then it jumps to the INTR microcode discussed earlier.

1a8  Q → tmpB             INT ib: load a byte from the queue
1a9  tmpB → OPR  JMP INTR  Put the byte in OPR and jump to INTR

Several instructions require specific interrupt numbers, and the microcode uses a tricky technique to obtain these numbers. The numbers are obtained from a special pseudo-register called CR, which is all zeros except the three low bits come from the microcode address.6 The microcode is carefully arranged in memory so the micro-instruction is at the right address to generate the necessary value. For instance, in the microcode below, entry point INT1 will load the number 1 into OPR, entry point INT2 will load the number 2 into OPR, and INT0 will load 0 into OPR. Each line then jumps to the main INTR interrupt microcode.

198  CR → OPR     JMP INTR      INT1: num to OPR, branch to INTR
199  CR → OPR     JMP INTR      INT2: num to OPR, branch to INTR
...
1a7  CR → OPR     JMP INTR      INT0: num to OPR, branch to INTR

The microcode for the INT 3 and INTO (not to be confused with INT0) machine instructions has some wasted micro-instructions to ensure that the CR → OPR is at the right address. This wastes a couple of cycles and a couple of micro-instructions.7

Return from interrupt

The IRET interrupt is used to return from interrupts. It pops the program counter, code segment register, and flags from the stack, so execution can continue at the point where the interrupt happened. It calls the microcode subroutine FARRET to pop the code segment register and the PC from the stack. (I won't go into FARRET in this post.) Then it pops the flags from the stack, updates the Stack Pointer, and runs the next instruction.

0c8               CALL FARRET IRET: call Far Return
0c9               R SS,P2      read from stack, IND+=2
0ca  OPR → F                   mem to Flags
0cb  IND → SP     RNI          IND to stack pointer, run next instruction

External hardware interrupts

As well as software interrupts, the 8086 has hardware interrupts. The 8086 chip has pins for INTR and NMI; pulling the pin high causes a hardware interrupt. This section discusses the hardware circuitry and the microcode that handles these interrupts.

The interrupt pin circuit

The schematic below shows the input circuitry for the INTR pin; the NMI, RESET, and TEST pins use the same circuit. The function of this circuit is to clean up the input and ensure that it is synchronized with the clock. The chip's INTR pin is connected to a protection diode to drain a negative voltage to ground. Next, the signal goes through three inverters, probably to force a marginal voltage to either 0 or 1. Finally, the signal goes through an edge-triggered flip-flop to synchronize it with the clock. The flip-flop is constructed from two set-reset latches, the first gated by clk' and the second gated by clk. At the output of each stage is a "superbuffer", two transistors that produce a higher-current output than a regular inverter. This flip-flop circuit is unusual for the 8086; most flip-flops and latches are constructed from dynamic logic with pass transistors, which is much more compact. The more complicated circuitry on the INTR input probably protects against metastability and other problems that could occur with poor-quality input signals.

Schematic of the input circuitry for the INTR pin.

Schematic of the input circuitry for the INTR pin.

The interrupt logic circuitry

The chip has a block of interrupt logic to receive interrupts, handle interrupt priorities, and execute an interrupt at the appropriate time. This circuitry is in the top right part of the chip, on the opposite side of the chip from the interrupt pins. The schematic below shows this circuitry.

The interrupt logic circuitry activates the microcode interrupt code at the appropriate time.

The interrupt logic circuitry activates the microcode interrupt code at the appropriate time.

The top chunk of logic latches an NMI (non-maskable interrupt) until it runs or it is cleared by a reset.8 The first flip-flop helps convert an NMI input into a one-clock pulse. The second flip-flop holds the NMI until it runs.

The middle chunk of logic handles traps. If the trap flag is high, the latch will hold the trap request until it can take place. The latch is loaded on First Clock (FC), which indicates the start of a new instruction. The NOR gate blocks the trap if there is an interrupt or NMI, which has higher priority.9

The third chunk of logic schedules the interrupt. Three things can delay an interrupt: an interrupt delay micro-instruction, an instruction that modifies a segment register, or an instruction prefix.10 If not delayed, the interrupt (NMI, trap, or INTR pin) will run at the start of the next instruction (i.e. FC).11 The microcode interrupt code is run for these cases as well as a reset. Note that the reset is not gated by First Clock, but can run at any time.

The interrupt signal from this circuitry loads a hardcoded interrupt address into the microcode address latches, depending on the type of interrupt.12 This happens for an interrupt at First Clock, while a reset can happen any time in the instruction cycle. A trap goes to the INT1 microcode routine described earlier, while an NMI interrupt goes to INT2 microcode routine. The microcode for the INTR interrupt will be discussed in the next section.

The interrupt signal also goes to the Group Decode ROM (via the instruction register), where it blocks regular instruction decoding. Finally, the interrupt signal goes to a circuit called the loader, where it prevents fetching of the next instruction from the prefetch queue.

The external INTR interrupt

The INTR interrupt has some special behavior to communicate with the device that triggered the interrupt: the 8086 performs two bus cycles to acknowledge the interrupt and to obtain the interrupt number from the device. This is implemented with a combination of microcode and the bus interface logic. The bus cycles are similar to memory read cycles, but with some behavior specific to interrupts.

The INTR interrupt has its own microcode, shown below. The first micro-instruction zeros the IND memory address register and then performs a special IAK bus cycle.13 This is similar to a memory read but asserts the INTA interrupt acknowledge line so the device knows that its interrupt has been received. Next, prefetching is suspended. The third line performs a second IAK bus cycle and the external device puts the interrupt number onto the bus. The interrupt number is read into the ORD register, just like a memory read. At this point, the code falls through into the interrupt microcode described previously.

19a  0 → IND   IAK S0,P0  IRQ: 0 to IND, run interrupt bus cycle
19b            SUSP        suspend prefetch
19c            IAK S0,P0   run interrupt bus cycle
19d            ...        The INTR routine discussed earlier

The bus cycle

The diagram below provides timing details of the two interrupt acknowledge bus cycles. Each cycle is similar to a memory read bus cycle, going through the T1 through T4 states, starting with an ALE (Address Latch Enable) signal. The main difference is the interrupt acknowledge bus cycle also raises the INTA (Interrupt Acknowledge) to let the requesting device know that its interrupt has been acknowledged.14 On the second cycle, the device provides an 8-bit type vector that provides the interrupt number. The 8086 also issues a LOCK signal to lock the bus from other uses during this sequence. The point of this is that the 8086 goes through a fairly complex bus sequence when handling a hardware interrupt. The microcode triggers these two bus cycles with the IAK micro-operation, but the bus interface circuitry goes through the various states of the bus cycle in hardware, without involving the microcode.

This diagram shows the interrupt acknowledge sequence. From Intel 8086 datasheet.

This diagram shows the interrupt acknowledge sequence. From Intel 8086 datasheet.

The circuitry to control the bus cycle is complicated with many flip-flops and logic gates; the diagram below shows the flip-flops. I plan to write about the bus cycle circuitry in detail later, but for now, I'll give an extremely simplified description. Internally, there is a T0 state before T1 to provide a cycle to set up the bus operation. The bus timing states are controlled by a chain of flip-flops configured like a shift register with additional logic: the output from the T0 flip-flop is connected to the input of the T1 flip-flop and likewise with T2 and T3, forming a chain. A bus cycle is started by putting a 1 into the input of the T0 flip-flop. When the CPU's clock transitions, the flip-flop latches this signal, indicating the (internal) T0 bus state. On the next clock cycle, this 1 signal goes from the T0 flip-flop to the T1 flip-flop, creating the externally-visible T1 state. Likewise, the signal passes to the T2 and T3 flip-flops in sequence, creating the bus cycle. Some special-case logic changes the behavior for an interrupt versus a read.15

The read/write control circuitry on the die with the flip-flops labeled. Metal and polysilicon were removed to show the underlying silicon.

The read/write control circuitry on the die with the flip-flops labeled. Metal and polysilicon were removed to show the underlying silicon.

Reset

The reset pin resets the CPU to an initial state. This isn't an interrupt, but much of the circuitry is the same, so I'll describe it for completeness. The reset microcode below initializes the segment registers, program counter, and flags to 0, except the code segment is initialized to 0xffff. Thus, after a reset, instruction execution will start at absolute address 0xffff0. The reset line is also connected to numerous flip-flops and circuits in the 8086 to ensure that they are initialized to the proper state. These initializations happen outside of the microcode.

1e4  0 → DS     SUSP   RESET: 0 to DS, suspend prefetch
1e5  ONES → CS          FFFF to CS
1e6  0 → PC     FLUSH   0 to PC, start prefetch
1e7  0 → F              0 to Flags
1e8  0 → ES             0 to ES
1e9  0 → SS     RNI     0 to SS, run next instruction

A bit of history

The 8086's interrupt system inherits a lot from the Intel 8008 processor. Interrupts were a bit of an afterthought on the 8008 so the interrupt handling was primitive and designed to simplify implementation.17 In particular, an interrupt response acts like an instruction fetch except the interrupting device "jams" an instruction on the bus. To support this, the 8008 provided one-byte RST (Restart) instructions that would call a fixed location. The Intel 8080 improved the 8008, but kept this model of performing an instruction fetch cycle that received a "jammed" instruction for an interrupt. With more pins available, the 8080 added the INTA Interrupt Acknowledge pin.

The approach of "jamming" an instruction onto the bus for an interrupt is rather unusual. Other contemporary microprocessors such as the 6800, 6502, or Intel 8048 used an interrupt vector approach, which is much more standard: an interrupt vector table held pointers to the interrupt service routines.

The 8086 switched to an interrupt vector table, but retained some 8080 interrupt characteristics for backward compatibility. In particular, the 8086 performs a memory cycle very much like an instruction fetch, but instead of an instruction, it receives an interrupt number. The 8086 performs two interrupt ack bus cycles but ignores the first one, which lets the same hardware work with the 8080 and 8086.16

Conclusions

This is another blog post that I expected would be quick and easy, but there's a lot going on in the 8086's interrupt system, both in hardware and microcode. The 8086 has some strange characteristics, such as acknowledging interrupts twice, but these features make more sense when looking at the 8086's history and what it inherited from the 8008 and 8080.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff. Thanks to pwg on HN for suggesting interrupts as a topic.

Notes and references

  1. The first computer to provide interrupts was probably the UNIVAC 1103A (1956). The book "Computer Architecture" by Blaauw and Brooks discusses different approaches to interrupts in great detail, pages 418-434. A history of interrupts is in this article

  2. The maskable interrupt can be blocked in software, while the non-maskable interrupt cannot be blocked. 

  3. A typical memory access takes four clock cycles. However, slow memory can add wait states, adding as many clock cycles as necessary. Moreover, accessing a word from an unaligned (odd) address results in two complete bus cycles to access the two bytes, since the bus can only read an aligned word at a time. Thus, memory accesses can take much more than four cycles. 

  4. The 8086's microcode was disassembled by Andrew Jenner (link) from my die photos, so the microcode listings are based on his disassembly. 

  5. The microcode jumps use a level of indirection because there isn't room in the micro-instruction for the full micro-address. Instead, the micro-instruction has a four-bit tag specifying the desired routine. The Translation ROM holds the corresponding micro-address for the routine, which is loaded into the microcode address register. 

  6. The CR transfer source loads the low three bits of the microcode address. Its implementation is almost the same as the ZERO source, which loads zero. A signal zeroes bits 15-3 for both sources. The bottom 3 bits are pulled low for the ZERO source or if the corresponding microcode bit is 0. By the time this load happens, the microcode counter has incremented, so the value is one more than the address with the micro-instruction. Also note that it uses the raw 13-bit microcode address which is 9 bits plus four counter bits. The address decoder converts this to the 9-bit "physical" microcode address that I'm showing. The point is that the 3 lower bits from my microcode listing won't give the right value. 

  7. The jump in the microcode is why the one-byte INT 3 instruction takes 52 clocks while the two-byte INT nn instruction takes 51 clocks. You'd expect INT nn to be slower because it has an extra byte to execute, but the microcode layout for INT 3 makes it slower. 

  8. There's a subtle difference between the NMI and the INTR interrupts. Once the NMI pin goes high, the interrupt is scheduled, even if the pin goes low. For a regular interrupt, the INTR pin must be high at the start of an instruction. Thus, NMI is latched but INTR is not. 

  9. Since the 8086 has multiple interrupt sources, you might wonder how multiple interrupts are handled at the same time. The chip makes sure the interrupts are handled correctly, according to their priorities. The diagram below shows what happens if trapping (single-step) is enabled, a divide causes a divide-by-0 exception, and an external interrupt arrives.

    Processing simultaneous interrupts. From iAPX 86/88 User's Manual, Figure 2-31.

    Processing simultaneous interrupts. From iAPX 86/88 User's Manual, Figure 2-31.

     

  10. The interrupt delay micro-instruction is used for the WAIT machine instruction. I think that a long string of prefix instructions will delay an interrupt (even an NMI) for an arbitrarily long time. 

  11. Interrupts usually happen after the end of a machine instruction, rather than interrupting an instruction during execution. There are a couple of exceptions, however, for instructions that can take a very long time to complete (block copies) or could take forever (WAIT). The solution is that the microcode for these instructions checks to see if an interrupt is pending, so the instruction can explicitly stop and the interrupt can be handled. 

  12. The microcode address is 13 bits long: a special bit, 8 instruction bits, and four counter bits. For an interrupt, it is set to 1r0000000.00ab, where r indicates a reset and ab indicate an interrupt of various types:
    Trap: goes to vector 1, INT1 addr 198 10000000.00
    NMI: goes to vector 2, INT2 addr 199 100000000.01 (Bit b is NMI)
    INTR: goes to IRQ addr 19a 100000000.10, vector specified by device. Bit a is intr-enable', blocked by NMI.
    This logic takes into account the relative priorities of the different interrupts. This address is initialized through a special multiplexer path for interrupts that loads bits directly into the microcode address latch. 

  13. The IAK micro-instruction is the same as a read micro-instruction except the IAK (Interrupt Acknowledge) bit is set. This bit controls the differences between a read micro-instruction and an interrupt acknowledge micro-instruction.

    The logic that makes the bus cycle an interrupt acknowledge rather than a read is a bit strange. A signal indicates that the cycle is an I/O operation or an interrupt acknowledge. This is determined by the instruction decoder (for I/O) or bit P of the microcode (for an interrupt acknowledge). This signal is used for the S2 status pin. Later, this signal is ANDed with the read/write signal to determine that it is an interrupt acknowledge and not an I/O. This probably optimized signal generation, but it seems inconvenient to merge two signals together and then split them out later. 

  14. The 8086 has two different modes, giving its pins different meanings in the different modes. In "Minimum" mode, the control pins have simple functions. In particular, the interrupt is acknowledged using the INTA pin. In "Maximum" mode, the control pins are encoded to provide more state information to a bus controller. In this mode, the interrupt acknowledge state is encoded and signaled over the S2, S1, and S0 state pins. I'm discussing minimum mode; the sequence is the same in maximum mode but uses different pins. 

  15. To prevent another device from grabbing the bus during interrupt acknowledgement, the LOCK pin is activated. The hardware for this LOCK signal toggles the internal lock latch on each of the interrupt ALE signals, so the lock is enabled on the first one and disabled on the second. The interrupt ack signal also prevents the address lines from being enabled during the interrupt ack. 

  16. An 8086/8088 system will typically use an external chip, the 8259A Programmable Interrupt Controller. This chip extends the interrupt functionality of the 8086 by providing 8 priority-based hardware interrupts.

    The Intel 8259A Programmable Interrupt Controller chip was designed to receive interrupt requests from multiple devices, prioritize interrupts, and direct the 8080 or 8086 processor accordingly. When used with the 8080, the interrupt controller chip will issue a CALL instruction to execute the specified interrupt handler. In particular, when the 8080 responds to an interrupt with INTA pulses, the interrupt controller chip will first put a CALL instruction on the bus, and then will respond to the next two INTA pulses with the address. For the 8086, the interrupt controller ignores the first INTA pulse and responds to the second INTA pulse with the 8-bit pointer. The point of this is that for both processors, the interrupt controller freezes its state on the first INTA and sends the interrupt-specific byte on the second INTA. Thus, even though the interrupt controller responds to the 8080 with an instruction and responds to the 8086 with an interrupt code, the underlying timing and logic are mostly the same. 

  17. The article Intel Microprocessors: 8008 to 8086 provides some history on interrupts in the 8008. Also see Intel 8008 Microprocessor Oral History Panel, pages 5-6. Most of the 8008's features were inherited from the Datapoint 2200 desktop computer, but the interrupts were not part of the Datapoint 2200. Instead, Intel decided to add interrupt functionality.