Reverse-engineering the conditional jump circuitry in the 8086 processor

Intel introduced the 8086 microprocessor in 1978 and it had a huge influence on computing. I'm reverse-engineering the 8086 by examining the circuitry on its silicon die and in this blog post I take a look at how conditional jumps are implemented. Conditional jumps are an important part of any instruction set, changing the flow of execution based on a condition. Although this instruction may seem simple, it involves many parts of the CPU: the 8086 uses microcode along with special-purpose condition logic.

The die photo below shows the 8086 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. I've labeled the key functional blocks; the ones that are important to this discussion are darker and will be discussed in detail below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions. Most of the relevant circuitry is in the Execution Unit, such as the condition evaluation circuitry near the center, and the microcode in the lower right. But the Bus Interface Unit plays a part too, holding and modifying the program counter.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Microcode

Most people think of machine instructions as the basic steps that a computer performs. However, many processors (including the 8086) have another layer of software underneath: microcode. One of the hardest parts of computer design is creating the control logic that directs the processor for each step of an instruction. The straightforward approach is to build a circuit from flip-flops and gates that moves through the various steps and generates the control signals. However, this circuitry is complicated, error-prone, and hard to design.

The alternative is microcode: instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns design of control circuitry into a programming task instead of a difficult logic design task.

The 8086 uses a hybrid approach: although the 8086 uses microcode, much of the instruction functionality is implemented with gate logic. This approach removed duplication from the microcode and kept the microcode small enough for 1978 technology. In a sense, the microcode is parameterized. For instance, the microcode can specify a generic Arithmetic/Logic Unit (ALU) operation, and the gate logic determines from the instruction which ALU (Arithmetic/Logic Unit) operation to perform. More relevant to this blog post, the microcode can specify a generic conditional test and the gate logic determines which condition to use. Although this made the 8086's gate logic more complicated, the tradeoff was worthwhile.

Microcode for conditional jumps

The 8086 processor has six status flags: carry, parity, auxiliary carry, zero, sign, and overflow.1 These flags are updated by arithmetic and logic operations based on the result. The 8086 has sixteen different conditional jump instructions2 that test status flags and jump if conditions are satisfied, such as zero, less than, or odd parity. These instructions are very important since they permit if statements, loops, comparisons, and so forth.

In machine language, a conditional jump opcode is followed by a signed offset byte which specifies a location relative to the current program counter, from 127 bytes ahead to 128 bytes back. This is a fairly small range, but the benefit is that the offset fits in a single byte, reducing the code size.3 For typical applications such as loops or conditional code, jumps usually stay in the same neighborhood of code, so the tradeoff is worthwhile.

The 8086's microcode was disassembled by Andrew Jenner (link) from my die photos, so we can see exactly what micro-instructions the 8086 is running for each machine instruction. The microcode below implements conditional jumps. In brief, the conditional jump code (Jcond) gets the branch offset byte. It tests the appropriate condition and, if satisfied, jumps to the relative jump microcode (RELJUMP). The RELJMP code adds the offset to the program counter. In either case, the microcode routine ends when it runs the next instruction (RNI).

   move       action
Jcond:
1 Q→tmpBL
2          XC    RELJMP                    
3          RNI                       

RELJMP:
4          SUSP
5          CORR                      
6 PC→tmpA  ADD   tmpA
7 Σ→PC     FLUSH RNI                       

In more detail, micro-instruction 1 (arbitrary numbering) moves a byte from the prefetch queue (Q) across the queue bus to the ALU's temporary B register.4 (Arguments for ALU operations are first stored in temporary registers, invisible to the programmer.) Instruction 2 tests the appropriate condition with XC, and jumps to the RELJMP routine if the condition is satisfied.5 Otherwise, RNI (Run Next Instruction) ends this sequence and loads the next machine instruction without jumping.

If the condition is satisfied, the relative jump routine starts with instruction 4, which suspends prefetching.6 Instruction 5 corrects the program counter value, since it normally points to the next byte to prefetch, not the next byte to execute. Instruction 6 moves the corrected program counter address to the ALU's temporary A register. It also starts an ALU operation to add temporary A and temporary B. Instruction 7 moves the sum (Σ) to the program counter. It flushes the prefetch queue, which starts up prefetching from the new PC value. Finally, RNI runs the next instruction, from the updated address.

This code supports all 16 conditional jumps because the microcode tests the generic "XC" condition. This indicates that the specific test depends on the four low bits of the opcode, and the hardware determines exactly what to test. It's important to keep the two levels straight: the machine instruction is doing a conditional jump to a different memory address, while the microcode that implements this instruction is performing a conditional jump to a different micro-address.

The timing for conditional jumps

The RNI (Run Next Instruction) micro-operation initiates processing of the next machine instruction. However, it takes a clock cycle to get the next instruction from the prefetch queue, decode it, and start the appropriate micro-instruction. This causes a wasted clock cycle before the next micro-instruction executes. To avoid this delay, most microcode routines issue a NXT micro-operation one cycle before they end. This gives the 8086 time to decode the next machine instruction so micro-instructions can run uninterrupted.

Unfortunately, the conditional jump instructions can't take advantage of NXT. The problem is that the control flow in the microcode depends on whether the conditional jump is taken or not. By the time the microcode knows it is not taking the branch, it's too late to issue NXT.

The datasheet gives the timing of a conditional jump as 4 clock cycles if the jump is not taken, and 8 clock cycles if the jump is taken. Looking at the microcode explains these timings. There are 3 micro-instructions executed if the jump is not taken, and 7 if it is taken. Because of the RNI, there is one wasted clock cycle, resulting in the documented 4 or 8 cycles in total.

The conditions

At this point I will review the 8086's conditional jumps. The 8086 implements 16 conditional jumps. (This is a large number compared to earlier CPUs: the 8080, 6502, and Z80 all had 8 conditional jumps, specified by 3 bits.) The table below shows which flags are tested for each condition, specified by the low four bits of the opcode. Some jump instructions have multiple names depending on the programmer's interpretation, but they map to the same machine instruction.7

ConditionBitsCondition trueCondition false
Overflow Flag (OF)=1000xoverflow (JO)not overflow (JNO)
Carry Flag (CF)=1001xcarry (JC)
below (JB)
not above or equal (JNAE)
not carry (JNC)
not below (JNB)
above or equal (JAE)
Zero Flag (ZF)=1010xzero (JZ)
equal (JE)
not zero (JNZ)
not equal (JNE)
CF=1 or ZF=1011xbelow or equal (JBE)
not above (JNA)
not below or equal (JNBE)
above (JA)
Sign Flag (SF)=1100xsign (JS)not sign (JNS)
Parity Flag (PF)=1101xparity (JP)
parity even (JPE)
not parity (JNP)
parity odd (JPO)
SF ≠ OF110xless (JL)
not greater or equal (JNGE)
not less (JNL)
greater or equal (JGE)
ZF=1 or SF ≠ OF111xless or equal (JLE)
not greater (JNG)
not less or equal (JNLE)
greater (JG)

From the hardware perspective, the important thing is that there are eight different condition flag tests. Each test has two jump instructions associated with it: one that jumps if the condition is true, and one that jumps if the condition is false. The low bit of the opcode selects "if true" or "if false".

The image below shows the condition evaluation circuitry as it appears on the die. There isn't much structure to it; it's just a bunch of gates. This image shows the doped silicon regions that form transistors. The numerous small polygons with a circle inside are connections between the metal layer and the polysilicon layer. Many of these connections use the silicon layer to optimize the layout.

The circuitry to compute conditions as it appears on the die. The metal and polysilicon layers have been removed for this image, showing the silicon underneath.

The circuitry to compute conditions as it appears on the die. The metal and polysilicon layers have been removed for this image, showing the silicon underneath.

This circuitry evaluates each condition by getting the instruction bits from the Instruction Register, checking the bits to match each condition, and testing if the condition is satisfied. For instance, the overflow condition (with instruction bits 000x) is computed by a NOR gate: NOR(IR3, IR2, IR1, OF'), which will be true if instruction register bits 3, 2, and 1 are zero and the Overflow Flag is 1.

The results from the individual condition tests are combined with a 7-input NOR gate, producing a result that is 0 if the specified 3-bit condition is satisfied. Finally, the "if true" and "if false" cases are handled by flipping this signal depending on the low bit of the instruction. This final result indicates if the 4-bit condition in the instruction is satisfied, and this signal is passed on to the microcode control circuitry.

One unexpected feature of the implementation is that a 7-input NOR gate combines the various conditions to test if the selected condition is satisfied. You'd expect that with eight conditions, there would be eight inputs to the NOR gate. However, there is a clever optimization that takes advantage of conditions that are combinations of clauses, for example, "less or equal". Specifically, the zero flag is tested for bit pattern 01xx (where x indicates a 0 or 1), which covers two conditions with one gate. Likewise, SF≠OF is tested for bit pattern 11xx and CF=1 is tested for bit pattern 0x1x. With these optimizations, the eight conditions are covered with seven checks. (This shows that the opcodes weren't assigned arbitrarily: the bit patterns needed to be carefully assigned for this to work.)

Back to the microcode

Before explaining how the microcode jump circuitry works, I'll briefly discuss the microcode format. A micro-instruction is encoded into 21 bits as shown below. Every micro-instruction contains a move from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits is a bit tricky since it depends on the type field, which is two or three bits long. The "short jump" (type 0) is a conditional jump within the current block of 16 micro-instructions. The ALU operation (type 1) sets up the arithmetic-logic unit to perform an operation. Bookkeeping operations (type 4) are anything from flushing the prefetch queue to ending the current instruction. A memory read or write is type 6. A "long jump" (type 5) is a conditional jump to any of 16 fixed microcode locations (specified in an external table). Finally, a "long call" (type 7) is a conditional subroutine call to one of 16 locations (different from the jump targets).

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

I'm going to focus on the XC RELJMP micro-instruction that we saw in the microcode earlier. This is a "long jump" with XC as the condition and RELJMP as the target tag. Another layer of hardware is required to implement the microcode conditions. The microcode supports 16 conditions, which are completely different from the 16 programmer-level conditions.8 Some microcode conditions test special-purpose internal flags, while others test conditions such as an interrupt, the chip's TEST pin, bit 3 of the opcode, or if the instruction has a one-byte address offset. The XC condition is one of these 16 conditions, number 15 specifically.

The conditions are evaluated by the condition PLA (Programmable Logic Array, a grid of gates), shown below. The four condition bits from the micro-instruction, along with their complements, are fed into the columns. The PLA has 16 rows, one for each condition. Each row is a NOR gate matching one bit combination (i.e. selecting a condition) and the corresponding signal value to test.9 Thus, if a particular condition is specified and is satisfied, that row will be 1. The 16 row outputs are combined by the 16-input NOR gate at the left. Thus, if the specified condition is satisfied, this output will be 0, and if the condition is unsatisfied, the output will be 1. This signal controls the jump or call micro-instruction: if the condition is satisfied, the new micro-address is loaded into the microcode address register. If the condition is not satisfied, the microcode proceeds sequentially.

The condition PLA evaluates microcode conditionals.

The condition PLA evaluates microcode conditionals.

Conclusions

To summarize, the 8086 processor implements 16 conditional jump instructions. One piece of microcode efficiently implements all 16 instructions, with gate logic determining which flags to test, depending on bits in the machine instruction. The result of this test is used by the microcode XC conditional jump, one of 16 completely different microcode-level conditions. If the XC condition is satisfied, the program counter is updated by adding the offset, jumping to the new location.

Conditional jumps are relatively straightforward instructions from the programmer's perspective, but they interact with most parts of the 8086 processor including the prefetch queue, the address adder, the ALU, microcode, and the Translation ROM. The diagram below shows the interactions for each step of the jump.

The conditional jump involves many parts of the die, shown in this diagram.

The conditional jump involves many parts of the die, shown in this diagram.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff.

Notes and references

  1. In addition to the six status flags, the 8086 has three control flags: trap, direction, and interrupt enable. These flags aren't tested by conditional branches so I won't discuss them further. 

  2. Strictly speaking, the 8086 has a few more conditional jumps. The JCXZ instruction tests if the CX register is zero. The LOOP, LOOPNZ, and LOOPZ instructions decrement the CX register and loop if it is nonzero. The last two only loop if the zero flag indicates nonzero or zero, respectively. I'm ignoring these instructions in the blog post. 

  3. Although a conditional jump only supports a small range, it's still possible to conditionally jump to a distant location by using two instructions. A conditional jump with the opposite condition can skip over a longer unconditional jump instruction. The 80386 removed this restriction by providing long-displacement conditional jumps, which could perform a 16-bit or 32-bit relative jump. 

  4. The relative offset byte is sign-extended when it is moved to the temporary B register. That is, if the top bit is high, the high byte is set to all 1's to produce a 16-bit negative value. 

  5. The details of how the microcode jumps to the RELJMP routine are interesting, but a bit of a tangent, so I've put this discussion in a footnote. For long jumps (and long calls) in microcode, the target micro-addresses are stored in the Translation ROM, and the 4-bit target tag indexes into this ROM. The motivation for this structure is that micro-addresses are 13 bits, which is a lot of bits to try to fit into a 21-bit micro-instruction. Using a 4-bit tag keeps the microcode compact, but at the cost of requiring a small ROM in the 8086.

    The translation ROM on the die.

    The translation ROM on the die.

    Above is a view of the Translation ROM, with the RELJMP entry highlighted. The left half decodes tags, while the right half provides the corresponding microcode address. The row for RELJMP is highlighted. 

  6. Much of this microcode snippet deals with the prefetch queue. To increase efficiency, the 8086 processor fetches instructions from memory before they are needed and stores them in a 6-byte prefetch queue. In most processors, the program counter points to the memory address of the next instruction to execute. However, in the 8086, the program counter advances during prefetching, so it points to the memory address of the next instruction to fetch. This discrepancy is invisible to the programmer, but the microcode needs to handle it.

    First, the microcode issues a SUSP micro-operation to suspend prefetching. This ensures that the program counter will not be changed due to more prefetching. Next, the CORR micro-operation corrects the program counter to point to the next address to execute. This correction is performed by subtracting the number of unused bytes in the prefetch queue. You might expect this correction to be performed by the Arithmetic/Logic Unit (ALU). However, the 8086 has a separate adder that is used for memory address computations: each memory access in the 8086 requires a segment register base address to be added to an offset address. This address adder is also used for program counter correction. The constant ROM holds the values -1 through -6, the appropriate constant is selected based on the number of bytes in the prefetch queue, and this constant is added to the program counter. (Interestingly, the address adder is used for program counter correction, while the ALU is used to modify the program counter for the relative jump computation.)

    The address adder has multiple uses. It is also used for updating the program counter during prefetching. It updates addresses when performing block copy operations. Finally, it updates addresses when performing an unaligned word operation. The constant ROM holds constants for these operations.

    At the end of the microcode sequence, the FLUSH micro-operation flushes the stale bytes from the prefetch queue, resets the prefetch queue pointers, and restarts prefetching. I wrote about prefetching in detail here

  7. Often, the compare (CMP) instruction will be executed to compare two numbers by subtracting and discarding the result but keeping the condition codes. One complication is that some tests make sense for signed numbers, while other tests make sense for unsigned numbers. Specifically, "greater", "greater or equal", "less", and "less or equal" make sense for signed comparisons. On the other hand, "above", "above or equal", "below", and "below or equal" make sense for unsigned comparisons.

    The 8086 supports both signed and unsigned numbers. The arithmetic operations are the same for both; it's just the programmer's interpretation that differs. For instance, consider adding hex numbers 0xfe and 0x01. Treating them as unsigned numbers, the sum is 254 + 1 = 255. But as signed numbers, -2 + 1 = -1. In either case, the processor computes the same result, 0xff, but the interpretation is different.

    The signed vs unsigned distinction matters for comparisons. For instance, as unsigned numbers, 0xfe (254) is above 0x01 (1). But as signed numbers, 0xfe (-2) is less than 0x01 (1). This is why different instructions are used to compare unsigned versus signed numbers.

    Another important factor is that the carry flag indicates an unsigned result is too large for its byte (or word), while the overflow flag indicates that a signed result is too large for its byte (or word). For instance, adding unsigned bytes 0xff (255) and 0x02 (2) yields 0x01 (1) and a carry, indicating the result is too big for a byte. However, as signed bytes this corresponds to -1 + 2 = 1, which fits in a byte, so the overflow flag is not set. Conversely, 0x7f + 0x01 = 0x80. As unsigned bytes, this corresponds to 127 + 1 = 128 which is fine. But as signed bytes, this corresponds to 127 + 1, which unexpectedly yields -128 due to overflow. Thus, the carry flag is not set, but the overflow flag is set in this case. 

  8. Short jumps have four bits to specify the condition, so they can access 16 conditions. For long jumps and long calls, one bit is "stolen" from the condition to indicate the type, so they can only access eight of the conditions. Thus, the conditions need to be assigned carefully so the necessary ones are available. 

  9. PLAs are typically uniform grids, but the grid pattern breaks down a bit in the condition PLA. The reason is that each test uses a separate signal, so there is a different signal into each row (unlike a typical PLA where each row receives the same signals). Moreover, some of the test signals are processed at the left, distorting the 16-input NOR gate. This illustrates the degree of layout optimization in the 8086, squeezing transistors in to save a bit of space. 

11 comments:

Ruik said...

Hi,
The legal NEC v.Intel paper you mention has an interresting remark:

"Intel's translation was questionable in other respects as well. For example, Intel did not include in its translation those Intel sequences that did not have a corresponding NEC routine. Most notable was the "magic instruction", an undocumented instruction hidden by Intel in the Intel Microcode to detect copying. 107 When Intel analyzed the NEC Microcode it found, to its disappointment, that it did not contain the magic instruction.108"


It sort of looks it should be there... but it is not. Unfortunately I don't know what the 108 note means. It just says: Tr. 2483:25-2485:16. probably it is some sort of reference.

Is there a chance that something could be hidden in the wired logic instead?

Ken Shirriff said...

Hi Rulk,
I haven't noticed the "magic instruction" yet, so I'll keep my eyes open. As far as "Tr 2483:25", my guess is that refers to page 2483 line 25 of a legal transcript.

Anonymous said...

Magic instruction?: Using the REP or REPNE prefix with a MUL or IMUL instruction negates the product. Using the REP or REPNE prefix with an IDIV instruction negates the quotient.

Andrew Jenner said...

Could the magic instruction be SALC? That would explain why it was left undocumented, as a copyright trap. If not that then there are a couple of nonsense microcode instructions at the very end of the sequence which could be what the remark is talking about.

JO said...

Funny that the microcode for conditional branch has its own conditional branches. It's conditional branches all the way down!

It seems like it would be useless to jump 1 byte, since that's where you were going next anyway. It's probably rare to jump 2 bytes. I feel like they could have extended the range of the jumps a little bit by offseting by one or two, but adding a little complexity to the jump arithmetic. maybe the ability to jump 1 byte further isn't worth it. Wonder what the statistical distributions are for jump length on various architectures.

Anonymous said...

JO,
You seem to have forgotten that the jump offset is SIGNED and simply added to the program counter. Yes, jumps to the next byte following is rather useless, and jumps to the byte following are rare, but to avoid those and get an extra 2 possible destinations would be a lot more complex in terms of hardware since it wouldn't be just a simple offset.

Anonymous said...

Hi,

Thank you all for the the comments and especially Ken for his wonderful blog posts! It seems it could be really SALC because NEC did not implemented that it and it just does XLAT (see https://forum.vcfed.org/index.php?threads/undocumented-instructions-on-intel-8088-and-nec-v20.36551/ and https://forum.vcfed.org/index.php?threads/the-d6-instruction-on-the-nec-v-series-v20-v30-chips.29471/)

Thanks,
Ruik

Paul Laba said...

Another great article Ken, thanks!

It’s worth noting that while the Z80’s conditional instructions does have only 2 bits to indicate which flag to test (and 1 bit to indicate if the flag is set or reset), it has several conditional instructions: conditional jump absolute, conditional jump relative, conditional call (absolute) and conditional return. It also has the handy DJNZ instruction, which decrements register B and jumps to the (relative) address if the result is non-zero (Z flag reset).

Toivo Henningsson said...

Nice post!
One question though:
Couldn't the microcode have combined taking a byte from the prefetch queue and doing executing the conditional microcode jump during the first cycle, to save a cycle in both the taken and not taken cases, like this:

Jcond:
1 Q→tmpBL XC RELJMP
2 RNI

? As I understand it, Q→tmpBL and XC RELJMP are specified by different fields in the microcode, and the evaluation of the condition code and the relative jump shouldn't depend on the byte being fetched from the prefetch queue first.
But I'm sure there's a reason why they didn't do it like this.

barnacle said...

An interesting similarity with the 8086's ancestor the 8080 - which includes eight conditional tests on four flags. While amusing myself with designing an 8080 emulator/simulator in 74HC logic (I didn't say I wasn't crazy!) I noticed that the conditional instructions had two bits which could be decoded with a 1-of-4 and a single bit which indicated whether the test was for the condition being true or false.

An XOR gate tests the desired condition against each of the flags; a NOR gate selects the output of the XOR and a final 4-input OR gate gives a true output if the desired condition is met.

My system control doesn't use microcode but rather a wide ROM; to accommodate the conditional instructions the OR gate output is fed as a suitable address input to the ROM. There are two huge advantages here: first that the condition is evaluated as soon as the instruction opcode is loaded, and second that the code (rom contents) are identical for all the condition tests.

The disadvantage, of course, is the huge amount of ROM it takes - 16k by 48 bits - which would have been a show-stopper back in the day. These days it's difficult finding ROM that small...

Neil

Ken Shirriff said...

Toivo, I believe the issue is that it takes a couple of clock cycles for the flag update to complete from the previous instruction. So if the XC happens in the first instruction of the jump, it could do the comparison with stale flags.