
The microcode and hardware in the 8086 processor that perform string operations

Intel introduced the 8086 microprocessor in 1978. This processor ended up being hugely influential, setting the path for the x86 architecture that is extensively used today. One interesting feature of the 8086 was instructions that can efficiently operate on blocks of memory up to 64K bytes long.1 These instructions rapidly copy, compare, or scan data and are known as "string" instructions.2

In this blog post, I explain string operations in the 8086, analyze the microcode that it used, and discuss the hardware circuitry that helped it out. My analysis is based on reverse-engineering the 8086 from die photos. The photo below shows the chip under a microscope. I've labeled the key functional blocks; the ones that are important to this post are darker. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions. The microcode ROM at the lower right controls the process.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.


Segments and addressing

Before I get into the details of the string instructions, I need to give a bit of background on how the 8086 accesses memory through segments. Earlier microprocessors such as the Intel 8080 (1974) used 16 bits to specify a memory address, allowing a maximum of 64K of memory. This memory capacity is absurdly small by modern standards, but at the time when a 4K memory board cost hundreds of dollars, this limit was not a problem. However, due to Moore's Law and the exponential growth in memory capacity, the follow-on 8086 processor needed to support more memory. At the same time, the 8086 needed to use 16-bit registers for backward compatibility with the 8080.

The much-reviled solution was to create a 1-megabyte (20-bit) address space consisting of 64K segments, with a 16-bit address specifying a position within the segment. In more detail, the memory address was specified by a 16-bit offset address along with a particular 16-bit segment register selecting a segment. The segment register's value was shifted by 4 bits to give the segment's 20-bit base address. The 16-bit offset address was added, yielding a 20-bit memory address. This gave the processor a 1-megabyte address space, although only 64K could be accessed without changing a segment register. The 8086 had four segment registers so it could use multiple segments at the same time: the Code Segment, Data Segment, Stack Segment, and Extra Segment.
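
To make the arithmetic concrete, here is a minimal sketch in C of the segment:offset calculation described above. The function name is mine; in the real chip this addition happens in the Bus Interface Unit's dedicated address adder (discussed below), not in software.

#include <stdint.h>

// 20-bit physical address = (segment << 4) + offset, truncated to 20 bits.
uint32_t physical_address(uint16_t segment, uint16_t offset) {
    return (((uint32_t)segment << 4) + offset) & 0xFFFFF;
}

// Example: segment 0x1234 with offset 0x0010 yields address 0x12350.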

The 8086 chip is split into two processing units: the Bus Interface Unit (BIU) that handles segments and memory accesses, and the Execution Unit (EU) that executes instructions. The Execution Unit is what comes to mind when you think of a processor: it has most of the registers, the arithmetic/logic unit (ALU), and the microcode that implements instructions. The Bus Interface Unit interacts with memory and other external systems, performing the steps necessary to read and write memory.

Among other things, the Bus Interface Unit has a separate adder for address calculation; this adds the shifted segment register (the segment base) to the offset to determine the final 20-bit memory address. Every memory access uses the address adder at least once to add the segment base and offset. The address adder is also used to increment the program counter. Finally, the address adder increments and decrements the index registers used for block operations. This will be discussed in more detail below.

Microcode in the 8086

Most people think of machine instructions as the basic steps that a computer performs. However, many processors (including the 8086) have another layer of software underneath: microcode. With microcode, instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. This provides a considerable performance improvement for the block operations, which require many steps in a loop. Performing this loop in microcode is considerably faster than writing the loop in assembly code.

A micro-instruction in the 8086 is encoded into 21 bits as shown below. Every micro-instruction specifies a move operation from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits depends on the type field and can be anything from an ALU operation to a memory read or write to a change of microcode control flow. Thus, an 8086 micro-instruction typically does two things in parallel: the move and the action. For more about 8086 microcode, see my microcode blog post.

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?


I'll explain the behavior of an ALU micro-operation since it is important for string operations. The Arithmetic/Logic Unit (ALU) is the heart of the processor, performing addition, subtraction, and logical operations. The ALU has three temporary input registers that are invisible to the programmer: tmpA, tmpB, and tmpC. An ALU operation takes its first argument from any temporary register, while the second argument always comes from tmpB. Performing an ALU operation requires two micro-instructions. The first micro-instruction specifies the ALU operation and source register, configuring the ALU. For instance, ADD tmpA configures the ALU to add the tmpA register to the default tmpB register. In the next micro-instruction (or a later one), the ALU result can be accessed through a special register called Σ (SIGMA) and moved to another register.

I'll also explain the memory read and write micro-operations. A memory operation uses two internal registers: IND (Indirect) holds the memory address, while OPR (Operand) holds the word that is read or written. A typical memory micro-instruction for a read is R DS,BL. This causes the Bus Interface Unit to compute the memory address by adding the Data Segment (DS) to the IND register and then perform the read. The Bus Interface Unit determines if the instruction is performing a byte operation or a word operation and reads a byte or word as appropriate, going through the necessary bus cycles. The BL option3 causes the Bus Interface Unit to update the IND register as appropriate, incrementing or decrementing it by 1 or 2 depending on the Direction Flag and the size of the access (byte or word). All of this complexity happens in the hardware of the Bus Interface Unit and is invisible to the microcode. The tradeoff is that this simplifies the microcode but makes the chip's hardware considerably more complicated.

The string move instruction

The 8086 has five types of string instructions, operating on blocks of memory: MOVS (Move String), CMPS (Compare Strings), SCAS (Scan String), LODS (Load String), and STOS (Store String). Each instruction operates on a byte or word, but by using a REP prefix, the operation can be repeated for up to 64k bytes, controlled by a counter. Conditional repetitions can terminate the loop on various conditions. The string instructions provide a flexible way to operate on blocks of memory, much faster than a loop written in assembly code.

The MOVS (Move String) operation copies one memory region to another. The CMPS (Compare Strings) operation compares two memory blocks and sets the status flags. In particular, this indicates if one string is greater, less, or equal to the other. The SCAS (Scan String) operation scans memory, looking for a particular value. The LODS (Load String) operation moves an element into the accumulator, generally as part of a more complex loop. Finally, STOS (Store String) stores the accumulator value, either to initialize a block of memory or as part of a more complex loop.4

Like many 8086 instructions, each string instruction has two opcodes: one that operates on bytes and one that operates on words. One of the interesting features of the 8086 is that the same microcode implements the byte and word instructions, while the hardware takes care of the byte- or word-sized operations as needed. Another interesting feature of the string operations is that they can go forward through memory, incrementing the pointers, or they can go backward, decrementing the pointers. A special processor flag, the Direction Flag, indicates the direction: 0 for incrementing and 1 for decrementing. Thus, there are four possibilities for stepping through memory, part of the flexibility of the string operations.

The flowchart below shows the complexity of these instructions. I'm not going to explain the flowchart at this point, but the point is that there is a lot going on. This functionality is implemented by the microcode.

This flowchart shows the operation of a string instruction. From The 8086 Family User's Manual, fig 2-33.


I'll start by explaining the MOVS (Move String) instruction, which moves (copies) a block of memory. Before executing this instruction, the registers should be configured so the SI Source Index register points to the first block, the DI Destination Index register points to the second block, and the CX Count register holds the number of bytes or words to move. The basic action of the MOVS instruction reads a byte (or word) from the SI address and updates SI, writes the value to the DI address and updates DI, and decrements the CX counter.
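
As a rough model of the instruction's semantics (not the microcode), a REP MOVSB behaves something like the C loop below. The flat mem array and the function name are my simplifications; segments, word operations, and interrupts are ignored.

#include <stdint.h>

// Copy CX bytes from [SI] to [DI], stepping up or down per the Direction Flag.
void rep_movsb(uint8_t *mem, uint16_t *si, uint16_t *di, uint16_t *cx, int df) {
    int step = df ? -1 : 1;       // DF=0: increment pointers, DF=1: decrement
    while (*cx != 0) {
        mem[*di] = mem[*si];      // copy one byte
        *si += step;
        *di += step;
        (*cx)--;                  // count down
    }
}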

The microcode block below is executed for the MOVS (and LODS) instructions. There's a lot happening in this microcode with a variety of control paths, so it's a bit tricky to understand, but let's see how it goes. Each micro-instruction has a register-to-register move on the left and an action on the right, happening in parallel. The first micro-instruction handles the REP prefix, if any; let's assume for now that there's no prefix so it is skipped. Next is the read from memory, which requires the memory address to be in the IND register. Thus, the micro-instruction moves SI to IND and starts the read cycle (R DS,BL). When the read completes, the updated IND register is moved back to SI, updating that register. Meanwhile, X0 tests the opcode and jumps to "8" for LODS. The MOVS path falls through, getting the address from the DI register and writing to memory the value that we just read. The updated IND register is moved to DI while another conditional jump goes to "7" if there's no REP prefix. Micro-instruction 7 performs an RNI (Run Next Instruction), which ends the microcode and causes the next machine instruction to be decoded. As you can see, microcode is very low-level.

 move       action           
           CALL F1 RPTS MOVS/LODS: handle REP if active
SI → IND   R DS,BL      1: Read byte/word from SI
IND → SI   JMPS X0 8     test instruction bit 3: jump if LODS
DI → IND   W DA,BL       MOVS path: write to DI
IND → DI   JMPS NF1 7   4: run next instruction if not REP
Σ → tmpC   JMP INT RPTI 5: handle any interrupt
tmpC → CX  JMPS NZ 1     update CX, loop if not zero
           RNI          7: run next instruction

OPR → M    JMPS F1 5    8: LODS path: store AL/AX, jump back if REP
           RNI           run next instruction

Now let's look at the case with a REP prefix, causing the instruction to loop. The first step is to test if the count register CX is zero, and bail out of the loop if so. In more detail, the REP prefix sets an internal flag called F1. The first micro-instruction for MOVS above conditionally calls the RPTS subroutine if F1 is set. The RPTS subroutine below is a bit tricky. First, it moves the count in CX to the ALU's temporary C register. It also configures the ALU to pass tmpC through unchanged. The next move discards the ALU result Σ, but as a side effect, sets a flag if the value is zero. This micro-instruction also configures the ALU to perform DEC tmpC, but the decrement doesn't happen yet. Next, if the value is nonzero (NZ), the microcode execution jumps to 10 and returns from the microcode subroutine, continuing execution of the MOVS code described above. On the other hand, if CX is zero, execution falls through to RNI (Run Next Instruction), which terminates execution of the MOVS instruction.

CX → tmpC     PASS tmpC   RPTS: test CX
Σ → no dest   DEC tmpC     Set up decrement for later
              JMPS NZ 10   Jump to 10 if CX not zero
              RNI          If 0, run next instruction
              RTN         10: return

If execution returns to the MOVS microcode, it will execute as described earlier until the NF1 test below. With a REP prefix, the test fails and microcode execution falls through. The next micro-instruction performs Σ → tmpC, which puts the ALU result into tmpC. The ALU was configured back in the RPTS subroutine to decrement tmpC, which holds the count from CX, so the result is that CX is decremented, put into tmpC, and then put back into CX in the next micro-instruction. It seems like a roundabout way to decrement the counter, but that's microcode. Finally, if the value is nonzero (NZ), microcode execution jumps back to 1 (near the top of the MOVS code earlier), repeating the whole process. Otherwise, RNI ends processing of the instruction. Thus, the MOVS instruction repeats until CX is zero. In the next section, I'll explain how JMP INT RPTI handles an interrupt.

IND → DI   JMPS NF1 7   4: run next instruction if not REP
Σ → tmpC   JMP INT RPTI 5: handle any interrupt
tmpC → CX  JMPS NZ 1     update CX, loop if not zero
           RNI          7: run next instruction

The NZ (not zero) condition tests a special 16-bit zero flag, not the standard zero status flag. This allows zero to be tested without messing up the zero status flag.

Interrupts

Interrupts pose a problem for the string operations. The idea behind interrupts is that the computer can be interrupted during processing to handle a high-priority task, such as an I/O device that needs servicing. The processor stops its current task, executes the interrupt handling code, and then returns to the original task. The 8086 processor normally completes the instruction that it is executing before handling the interrupt, so it can continue from a well-defined state. However, a string operation can perform up to 64k moves, which could take a large fraction of a second.5 If the 8086 waited for the string operation to complete, interrupt handling would be way too slow and could lose network packets or disk data, for instance.

The solution is that a string instruction can be interrupted in the middle of the instruction, unlike most instructions. The string instructions are designed to use registers in a way that allows the instruction to be restarted. The idea is that the CX register holds the current count, while the SI and DI registers hold the current memory pointers, and these registers are updated as the instruction progresses. If the instruction is interrupted it can simply continue where it left off. After the interrupt, the 8086 restarts the string operation by backing the program counter up by two bytes (one byte for the REP prefix and one byte for the string opcode.) This causes the interrupted string operation to be re-executed, continuing where it left off.

If there is an interrupt, the RPTI microcode routine below is called to update the program counter. Updating the program counter is harder than you'd expect because the 8086 prefetches instructions. The idea is that while the memory bus is idle, instructions are read from memory into a prefetch queue. Then, when an instruction is needed, the processor can (hopefully) get the instruction immediately from the prefetch queue instead of waiting for a memory access. As a result, the program counter in the 8086 points to the memory address of the next instruction to fetch, not the next instruction to execute. To get the "real" program counter value, prefetching is first suspended (SUSP). Then the PC value is corrected (CORR) by subtracting the length of the prefetch queue. At this point, the PC points to the next instruction to execute.

tmpC → CX   SUSP        RPTI: store CX
            CORR         correct PC
PC → tmpB   DEC2 tmpB  
Σ → PC      FLUSH RNI    PC -= 2, end instruction

At last, the microcode gets to the purpose of this subroutine: the PC is decremented by 2 (DEC2) using the ALU. The prefetch queue is flushed and restarted and the RNI micro-operation terminates the microcode and runs the next instruction. Normally this would execute the instruction from the new program counter value (which now points to the string operation). However, since there is an interrupt pending, the interrupt will take place instead, and the interrupt handler will execute. After the interrupt handler finishes, the interrupted string operation will be re-executed, continuing where it left off.
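
In rough C terms, the program-counter fix-up amounts to the sketch below; the function and parameter names are mine, and the queue length stands in for whatever the CORR hardware actually subtracts.

#include <stdint.h>

// Undo prefetching (CORR), then back up over the REP prefix and opcode (DEC2).
uint16_t restart_pc(uint16_t fetch_pc, uint16_t prefetch_queue_bytes) {
    uint16_t pc = fetch_pc - prefetch_queue_bytes; // now the next instruction to execute
    return pc - 2;                                 // re-point at the REP prefix
}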

There's another complication, of course. An 8086 instruction can have multiple prefixes attached, for example using a segment register prefix to access a different segment. The approach of backing up two bytes will only execute the last prefix, ignoring any others, so if you have two prefixes, the instruction doesn't get restarted correctly. The 8086 documentation describes this unfortunate behavior. Apparently a comprehensive solution (e.g. counting the prefixes or providing a buffer to hold prefixes during an interrupt) was impractical for the 8086. I think this was fixed in the 80286.

The remaining string instructions

I'll discuss the microcode for the other string operations briefly. The LODS instruction loads from memory into the accumulator. It uses the same microcode routine as MOVS; the code below is the same code discussed earlier. However, the path through the microcode is different for LODS since the JMPS X0 8 conditional jump will be taken. (This tests bit 3 of the opcode, which is set for LODS.) At step 8, a value has been read from memory and is in the OPR (Operand) register. This micro-instruction moves the value from OPR to the accumulator (represented by M for complicated reasons6). If there is a repeat prefix, the microcode jumps back to the previous flow (5). Otherwise, RNI runs the next instruction. Thus, LODS shares almost all its microcode with MOVS, making the microcode more compact at the cost of slowing it down slightly due to the conditional jumps.

 move       action           
           CALL F1 RPTS MOVS/LODS: handle REP if active
SI → IND   R DS,BL      1: Read byte/word from SI
IND → SI   JMPS X0 8     test instruction bit 3: jump if LODS
DI → IND   W DA,BL       MOVS path: write to DI
IND → DI   JMPS NF1 7   4: run next instruction if not REP
Σ → tmpC   JMP INT RPTI 5: handle any interrupt
tmpC → CX  JMPS NZ 1     update CX, loop if not zero
           RNI          7: run next instruction

OPR → M    JMPS F1 5    8: LODS path: store AL/AX, jump back if REP
           RNI           run next instruction

The STOS instruction is the opposite of LODS, storing the accumulator value into memory. The microcode (below) is essentially the second half of the MOVS microcode. The memory address in DI is moved to the IND register and the value in the accumulator is moved to the OPR register to set up the write operation. (As with LODS, the M register indicates the accumulator.6) The CX register is decremented using the ALU.

DI → IND    CALL F1 RPTS   STOS: if REP prefix, test if done
M → OPR     W DA,BL        1: write the value to memory
IND → DI    JMPS NF1 5      Quit if not F1 (repeat)
Σ → tmpC    JMP INT RPTI    Jump to RPTI if interrupt
tmpC → CX   JMPS NZ 1       Loop back if CX not zero
            RNI            5: run next instruction

The CMPS instruction compares strings, while the SCAS instruction scans memory, comparing each element against the accumulator; with a repeat prefix it stops on a match or a mismatch, depending on the prefix. They share the microcode routine below, with the X0 condition testing bit 3 of the instruction to select the path. The difference is that CMPS reads the comparison character from SI, while SCAS compares against the character in the accumulator. The comparison itself is done by subtracting the two values and discarding the result. The F bit in the micro-instruction causes the processor's status flags to be updated with the result of the subtraction, indicating less than, equal, or greater than.

            CALL F1 RPTS   CMPS/SCAS: if RPT, quit if done
M → tmpA    JMPS X0 5      1:accum to tmpA, jump if SCAS
SI → IND    R DS,BL         CMPS path, read from SI to tmpA
IND → SI                    update SI
OPR → tmpA                  fallthrough
DI → IND    R DA,BL        5: both: read from DI to tmpB
OPR → tmpB  SUBT tmpA       subtract to compare
Σ → no dest DEC tmpC F      update flags, set up DEC
IND → DI    JMPS NF1 12     return if not RPT
Σ → CX      JMPS F1ZZ 12    update CX, exit if condition
Σ → tmpC    JMP INT RPTI    if interrupt, jump to RPTI
            JMPS NZ 1       loop if CX ≠ 0
            RNI            12: run next instruction

One tricky part about the scan and compare instructions is that you can either repeat until the values are equal or until they are unequal, with the REPE or REPNE prefixes respectively. Rather than implementing this two-part condition in microcode, the F1ZZ condition above tests the right condition depending on the prefix.
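
To make the combined test concrete, here is a rough C model of REPE/REPNE CMPSB, assuming flat memory, an incrementing direction, and no interrupts. The names are mine; repe_mode plays the role of the F1Z latch.

#include <stdint.h>

// Compare bytes at [SI] and [DI] until CX runs out or the repeat condition fails.
void rep_cmpsb(const uint8_t *mem, uint16_t *si, uint16_t *di, uint16_t *cx,
               int repe_mode, int *zero_flag) {
    while (*cx != 0) {
        *zero_flag = (mem[*si] == mem[*di]);  // flags come from the subtraction
        (*si)++;
        (*di)++;
        (*cx)--;
        if (*zero_flag != repe_mode)          // F1ZZ-style exit test
            break;
    }
}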

Hardware support

Although the 8086 uses microcode to implement instructions, it also uses a considerable amount of hardware to simplify the microcode. This hybrid approach was necessary in order to fit the microcode into the small ROM capacity available in 1978.7 This section discusses some of the hardware circuitry in the 8086 that supports the string operations.

Implementing the REP prefixes

Instruction prefixes, including REPNZ and REPZ, are executed in hardware rather than microcode. The first step of instruction decoding, before microcode starts, is the Group Decode ROM. This ROM categorizes instructions into various groups. For instructions that are categorized as prefixes, the signal from the Group Decode ROM delays any interrupts (because you don't want an interrupt between the prefix and the instruction) and starts the next instruction without executing microcode. The Group Decode ROM also outputs a REP signal specifically for these two prefixes. This signal causes the F1 latch to be loaded with 1, indicating a REP prefix. (This latch is also used during multiplication to track the sign.) This signal also causes the F1Z latch to be loaded with bit 0 of the instruction, which is 0 for REPNZ and 1 for REPZ. The microcode uses these latches to determine the appropriate behavior of the string instruction.

Updating SI and DI: the Constant ROM

The SI and DI index registers are updated during each step to point to the next element. This update is more complicated than you might expect, though, since the registers are incremented or decremented based on the Direction Flag. Moreover, the step size, 1 or 2, varies for a byte or word operation. Another complication is unaligned word accesses, using an odd memory address to access a word. The 8086's bus can only handle aligned words, so an unaligned word access is split into two byte accesses, incrementing the address after the first access. If the operation is proceeding downward, the address then needs to be decremented by 3 (not 2) at the end to cancel out this increment. The point is that updating the index registers is not trivial but requires an adjustment anywhere between -3 and +2, depending on the circumstances.
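
Based purely on the description above, the final adjustment could be sketched in C as follows. This is my interpretation, not the actual circuitry: the real selection is made by the Constant ROM and its surrounding logic, described next.

// Final index-register adjustment; "unaligned" means a word access was split
// into two byte accesses, so the address already received an extra +1.
int index_adjustment(int word_op, int df, int unaligned) {
    int step = word_op ? 2 : 1;            // byte vs. word operation
    if (df) step = -step;                  // Direction Flag set: walk downward
    if (word_op && unaligned) step -= 1;   // cancel the +1 from the split access
    return step;                           // ranges from -3 to +2
}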

The Bus Interface Unit performs these updates automatically, without requiring the microcode to implement the addition or subtraction. The arithmetic is not performed by the regular ALU (Arithmetic/Logic Unit) but by the special adder dedicated to addressing arithmetic. The increment or decrement value is supplied by a special ROM called the Constant ROM, located next to the adder. The Constant ROM (shown below) is implemented as a PLA (programmable logic array), a two-level structured arrangement of gates. The first level (bottom) selects the desired constant, while the second level (middle) generates the bits of the constant: three bits plus a sign bit. The constant ROM is also used for correcting the program counter value as described earlier.

The Constant ROM, highlighted on the die. The correction constants are used to correct the PC.


Condition testing

The microcode supports conditional jumps based on 16 conditions. Several of these conditions are designed to support the string operations. To test if a REP prefix is active, microcode uses the F1 test, which tests the F1 latch. The REPZ and REPNZ prefixes loop while the zero flag is 1 or 0 respectively. This somewhat complicated test is supported in microcode by the F1ZZ condition, which evaluates the zero flag XOR the F1Z latch. Thus, as an exit test it is satisfied on a nonzero result with REPZ (F1Z=1) and on a zero result with REPNZ (F1Z=0), ending the loop when the repeat condition no longer holds.

Looping happens as long as the CX register is nonzero. This is tested in microcode with the NZ (Not Zero) condition. A bit surprisingly, this test doesn't use the standard zero status flag, but a separate latch that tracks if an ALU result is zero. (I call this the Z16 flag since it tests the 16-bit value, unlike the regular zero flag which tests either a byte or word.) The Z16 flag is only used by the microcode and is invisible to the programmer. The motivation behind this separate flag is so the string operations can leave the visible zero flag unchanged.8

Another important conditional jump is X0, which tests bit 3 of the instruction. This condition distinguishes between the MOVS and LODS instructions, which differ in bit 3, and similarly for CMPS versus SCAS. The test uses the X register which stores part of the instruction during decoding. Note that the opcodes aren't arbitrarily assigned to instructions like MOVS and LODS. Instead, the opcodes are carefully assigned so the instructions can share microcode but be distinguished by X0. Finally, the string operation microcode also uses the INT condition, which tests if an interrupt is pending.
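
For reference, the opcode assignments show this pairing; each byte/word pair on the left differs from the corresponding pair on the right only in bit 3:

MOVSB/MOVSW = A4/A5    LODSB/LODSW = AC/AD
CMPSB/CMPSW = A6/A7    SCASB/SCASW = AE/AF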

The conditions are evaluated by the condition PLA (Programmable Logic Array, a grid of gates), shown below. The four condition bits from the micro-instruction, along with their complements, are fed into the columns. The PLA has 16 rows, one for each condition. Each row is a NOR gate matching one bit combination (i.e. selecting a condition) and the corresponding signal value to test. Thus, if a particular condition is specified and is satisfied, that row will be 1. The 16 row outputs are combined by the 16-input NOR gate at the left. Thus, if the specified condition is satisfied, this output will be 0, and if the condition is unsatisfied, the output will be 1. This signal controls the jump or call micro-instruction: if the condition is satisfied, the new micro-address is loaded into the microcode address register. If the condition is not satisfied, the microcode proceeds sequentially. I discuss the 8086's conditional circuitry in more detail in this post.

The condition PLA evaluates microcode conditionals.


Conclusions

Hopefully you have found this close examination of microcode interesting. Microcode is implemented at an even lower level than assembly code, so it can be hard to understand. Moreover, the microcode in the 8086 was carefully optimized to make it compact, so it is even more obscure.

One of the big computer architecture debates of the 1980s was "RISC vs CISC", pitting Reduced Instruction Set Computers against Complex Instruction Set Computers. Looking at the 8086 in detail has given me more appreciation for the issues in a CISC processor such as the 8086. The 8086's string instructions are an example of the complex instructions in the 8086 that reduced the "semantic gap" between assembly code and high-level languages and minimized code size. While these instructions are powerful, their complexity spreads through the chip, requiring additional hardware features described above. These instructions also caused a great deal of complications for interrupt handling, including prefix-handling bugs that weren't fixed until later processors.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected].

Notes and references

  1. Block move instructions didn't originate with the 8086. The IBM System/360 series of mainframes had an extensive set of block instructions providing moves, compare, logical operations (AND, OR, Exclusive OR), character translation, formatting, and decimal arithmetic. These operations supported blocks of up to 256 characters.

    The Z80 processor (1976) had block instructions to move and compare blocks of data. The Z80 supported ascending and descending movements, but used separate instructions instead of a direction flag like the 8086. 

  2. The "string" operations process arbitrary memory bytes or words. Despite the name, these instructions are not specific to zero-terminated strings or any other string format. 

  3. The BL value in the micro-instruction indicates that the IND register should be incremented or decremented by 1 or 2 as appropriate. I'm not sure what BL stands for in the microcode. The patent says "BL symbolically represents a two bit code which causes external logic to examine the byte or word line and the direction flag in PSW register to generate, according to random logic well known to the art, the address factor required." So perhaps "Byte Logic"? 

  4. The designer of the 8086 instruction set, Steve Morse, discusses the motivation behind the string operations in his book The 8086/8088 primer. These instructions were designed to be flexible and support a variety of use cases. The XLAT (Translate) and JCXZ (Jump if CX Zero) instructions were designed to work well with the string instructions.

    The implementation of string instructions is discussed in detail in the 8086 patent, section 13 onward. 

  5. A string operation could perform 64k moves, each of which consists of a read and a write, yielding 128k memory operations. I think that if the memory accesses are unaligned, i.e. a word access to an odd address, then each byte of the word needs to be accessed separately. So I think you could get up to 256k memory accesses. Each memory operation takes at least 4 clock cycles, more if the memory is slow and has wait states. So one string instruction could take over a million clock cycles. 

  6. You might wonder why the register M indicates the accumulator, and the explanation is a bit tricky. The microcode uses 5-bit register specifications to indicate the source and destination for a data move. Registers can be specified explicitly, such as AX or BX, or a byte register such as AL or an internal register such as IND or tmpA. However, the microcode can also specify a generic source or destination register with M or N. The motivation is that the 8086 has a lot of operations that use an arbitrary source and destination register, for instance ADD AX, BX. Rather than making the microcode figure out which registers to use for these instructions, the hardware decodes the register fields from the instruction and substitutes the appropriate registers for M and N. This makes the microcode much simpler.

    But why does the LODS microcode use the M register instead of AX when this instruction only works with the accumulator? The microcode takes advantage of another clever feature of the M and N registers. The hardware looks at the instruction to determine if it is a byte or word instruction, and performs an 8-bit or 16-bit transfer accordingly. If the LODS microcode was hardcoded for the accumulator, the microcode would need separate paths for AX and AL, the full accumulator and the lower byte of the accumulator.

    The final piece of the puzzle is how the hardware knows to use the accumulator for the string instructions when they don't explicitly specify a register. The first step of instruction decoding is the Group Decode ROM, which categorizes instructions into various groups. One group is "instructions that use the accumulator". The string operations are categorized in this group, which causes the hardware to use the accumulator when the M register is specified. (Other instructions in this group include the immediate ALU operations, I/O operations, and accumulator moves.)

    I discussed the 8086's register codes in more detail here.

  7. The 8086's microcode ROM was small: 512 words of 21 bits. In comparison, the VAX 11/780 minicomputer (1977) had 5120 words of 96 bits, over 45 times as large. 

  8. The internal Z16 zero flag is mostly used by the string operations. It is also used by the LOOP iteration-control instructions and the shift instructions that take a shift count. 

Reverse-engineering the multiplication algorithm in the Intel 8086 processor

While programmers today take multiplication for granted, most microprocessors in the 1970s could only add and subtract — multiplication required a slow and tedious loop implemented in assembly code.1 One of the nice features of the Intel 8086 processor (1978) was that it provided machine instructions for multiplication,2 able to multiply 8-bit or 16-bit numbers with a single instruction. Internally, the 8086 still performed a loop, but the loop was implemented in microcode: faster and transparent to the programmer. Even so, multiplication was a slow operation, about 24 to 30 times slower than addition.

In this blog post, I explain the multiplication process inside the 8086, analyze the microcode that it used, and discuss the hardware circuitry that helped it out.3 My analysis is based on reverse-engineering the 8086 from die photos. The die photo below shows the chip under a microscope. I've labeled the key functional blocks; the ones that are important to this post are darker. At the left, the ALU (Arithmetic/Logic Unit) performs the arithmetic operations at the heart of multiplication: addition and shifts. Multiplication also uses a few other hardware features: the X register, the F1 flag, and a loop counter. The microcode ROM at the lower right controls the process.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.


Microcode

The multiplication routines in the 8086 are implemented in microcode. Most people think of machine instructions as the basic steps that a computer performs. However, many processors (including the 8086) have another layer of software underneath: microcode. With microcode, instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. This is especially useful for a machine instruction such as multiplication, which requires many steps in a loop.

A micro-instruction in the 8086 is encoded into 21 bits as shown below. Every micro-instruction has a move from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits depends on the type field and can be anything from an ALU operation to a memory read or write to a change of microcode control flow. Thus, an 8086 micro-instruction typically does two things in parallel: the move and the action. For more about 8086 microcode, see my microcode blog post.

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?


The behavior of an ALU micro-operation is important for multiplication. The ALU has three temporary registers that are invisible to the programmer: tmpA, tmpB, and tmpC. An ALU operation takes its first argument from any temporary register, while the second argument always comes from tmpB. An ALU operation requires two micro-instructions. The first micro-instruction specifies the ALU operation and source register, configuring the ALU. For instance, ADD tmpA to add tmpA to the default tmpB. In the next micro-instruction (or a later one), the ALU result can be accessed through the Σ register and moved to another register.

Before I get into the microcode routines, I should explain two ALU operations that play a central role in multiplication: LRCY and RRCY, Left Rotate through Carry and Right Rotate through Carry. (These correspond to the RCL and RCR machine instructions, which rotate through carry left or right.) These operations shift the bits in a 16-bit word, similar to the << and >> bit-shift operations in high-level languages, but with an additional feature. Instead of discarding the bit on the end, that bit is moved into the carry flag (CF). Meanwhile, the bit formerly in the carry flag moves into the word. You can think of this as rotating the bits while treating the carry flag as a 17th bit of the word.

The left rotate through carry and right rotate through carry micro-instructions.


These shifts perform an important part of the multiplication process since shifting can be viewed as multiplying by two. LRCY also provides a convenient way to move the most-significant bit to the carry flag, where it can be tested for a conditional jump. (This is important because the top bit is used as the sign bit.) Similarly, RRCY provides access to the least significant bit, very important for the multiplication process. Another important property is that performing RRCY on an upper word and then RRCY on a lower word will perform a 32-bit shift, since the low bit of the upper word will be moved into the high bit of the lower word via the carry bit.
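
As a small sketch in C (ignoring how the ALU actually implements it), a 16-bit rotate right through carry looks like this; LRCY is the mirror image, rotating left.

#include <stdint.h>

// Rotate right through carry: the old carry enters the top bit, and the bit
// shifted off the bottom becomes the new carry.
uint16_t rrcy(uint16_t value, int *carry) {
    int new_carry = value & 1;
    uint16_t result = (uint16_t)((value >> 1) | ((*carry & 1) << 15));
    *carry = new_carry;
    return result;
}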

Binary multiplication

The shift-and-add method of multiplication (below) is similar to grade-school long multiplication, except it uses binary instead of decimal. In each row, the multiplicand is multiplied by one digit of the multiplier. (The multiplicand is the value that gets repeatedly added, and the multiplier controls how many times it gets added.) Successive rows are shifted left one digit. At the bottom, the rows are added together to yield the product. The example below shows how 6×5 is calculated in binary using long multiplication.

    0110
   ×0101
    0110
   0000
  0110
 0000
00011110

Binary long multiplication is much simpler than decimal multiplication: at each step, you're multiplying by 0 or 1. Thus, each row is either zero or the multiplicand appropriately shifted (0110 in this case). (Unlike decimal long multiplication, you don't need to know the multiplication table.) This simplifies the hardware implementation, since each step either adds the multiplicand or doesn't. In other words, each step tests a bit of the multiplier, starting with the low bit, to determine if an add should take place or not. This bit can be obtained by shifting the multiplier one bit to the right each step.

Although the diagram above shows the sum at the end, a real implementation performs the addition at each step of the loop, keeping a running total. Moreover, in the 8086, instead of shifting the multiplicand to the left during each step, the sum shifts to the right. (The result is the same but it makes the implementation easier.) Thus, multiplying 6×5 goes through the steps below.

  0101
 ×0110
 00000
 001010
 0011110
 00011110

Why would you shift the result to the right? There's a clever reason for this. Suppose you're multiplying two 16-bit numbers, which yields a 32-bit result. That requires four 16-bit words of storage if you use the straightforward approach. But if you look more closely, the first sum fits into 16 bits, and then you need one more bit at each step. Meanwhile, you're "using up" one bit of the multiplier at each step. So if you squeeze the sum and the multiplier together, you can fit them into two words. Shifting right accomplishes this, as the diagram below illustrates for 0xffff×0xf00f. The sum (blue) starts in a 16-bit register called tmpA while the multiplier (green) is stored in the 16-bit tmpC register. In each step, they are both shifted right, so the sum gains one bit and the multiplier loses one bit. By the end, the sum takes up all 32 bits, split across both registers.

sum (tmpA)        multiplier (tmpC)
0000000000000000  1111000000001111
0111111111111111  1111100000000111
1011111111111111  0111110000000011
1101111111111111  0011111000000001
1110111111111111  0001111100000000
0111011111111111  1000111110000000
0011101111111111  1100011111000000
0001110111111111  1110001111100000
0000111011111111  1111000111110000
0000011101111111  1111100011111000
0000001110111111  1111110001111100
0000000111011111  1111111000111110
0000000011101111  1111111100011111
1000000001110111  0111111110001111
1100000000111011  0011111111000111
1110000000011101  0001111111100011
1111000000001110  0000111111110001
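
The process illustrated above can be modeled in C roughly as follows. This is a sketch of the algorithm, not the microcode; mul16 and the variable names are mine.

#include <stdint.h>

// Right-shifting shift-and-add multiply: tmpA holds the upper half of the
// running sum, tmpC holds the remaining multiplier bits. Each step optionally
// adds the multiplicand (tmpB), then shifts tmpA/tmpC right as a 32-bit unit.
uint32_t mul16(uint16_t multiplicand, uint16_t multiplier) {
    uint16_t tmpA = 0;
    uint16_t tmpB = multiplicand;
    uint16_t tmpC = multiplier;
    for (int i = 0; i < 16; i++) {
        int low_bit = tmpC & 1;                   // next multiplier bit
        uint32_t sum = tmpA;
        if (low_bit)
            sum += tmpB;                          // conditional add (up to 17 bits)
        tmpC = (uint16_t)((tmpC >> 1) | ((sum & 1) << 15)); // shift right across...
        tmpA = (uint16_t)(sum >> 1);              // ...both halves (32-bit shift)
    }
    return ((uint32_t)tmpA << 16) | tmpC;         // the 32-bit product
}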

The multiplication microcode

The 8086 has four multiply instructions to handle signed and unsigned multiplication of byte and word operands. These machine instructions are implemented in microcode. I'll start by describing the unsigned word multiplication, which multiplies two 16-bit values and produces a 32-bit result. The source word is provided by either a register or memory. It is multiplied by AX, the accumulator register. The 32-bit result is returned in the DX and AX registers.

The microcode below is the main routine for word multiplication, both signed and unsigned. Each micro-instruction specifies a register move on the left, and an action on the right. The moves transfer words between the visible registers and the ALU's temporary registers, while the actions are mostly subroutine calls to other micro-routines.

  move        action
AX → tmpC   LRCY tmpC        iMUL rmw:
M → tmpB    CALL X0 PREIMUL   called for signed multiplication
            CALL CORX         the core routine
            CALL F1 NEGATE    called for negative result
            CALL X0 IMULCOF   called for signed multiplication
tmpC → AX   JMPS X0 7  
            CALL MULCOF       called for unsigned multiplication
tmpA → DX   RNI  

The microcode starts by moving one argument AX into the ALU's temporary C register and setting up the ALU to perform a Left Rotate through Carry on this register, in order to access the sign bit. Next, it moves the second argument M into the temporary B register; M references the register or memory specified in the second byte of the instruction, the "ModR/M" byte. For a signed multiply instruction, the PREIMUL micro-subroutine is called, but I'll skip that for now. (The X0 condition tests bit 3 of the instruction, which in this case distinguishes MUL from IMUL.) Next, the CORX subroutine is called, which is the heart of the multiplication.4 If the result needs to be negated (indicated by the F1 condition), the NEGATE micro-subroutine is called. For signed multiplication, IMULCOF is then called to set the carry and overflow flags, while MULCOF is called for unsigned multiplication. Meanwhile, the two halves of the result are moved from the temporary C and temporary A registers to the AX and DX registers. Finally, RNI runs the next machine instruction, ending the microcode routine.

CORX

The heart of the multiplication code is the CORX routine, which performs the multiplication loop, computing the product through shifts and adds. The first two lines set up the loop, initializing the sum (tmpA) to 0. The number of loops is controlled by a special-purpose loop counter. The MAXC micro-instruction initializes the counter to 7 or 15, for a byte or word multiply respectively. The first shift of tmpC is performed, putting the low bit into the carry flag.

The loop body performs the shift-and-add step. It tests the carry flag, which holds the low bit of the multiplier. It skips over the ADD if there is no carry (NCY). Otherwise, tmpB is added to tmpA. (As tmpA gets shifted to the right, tmpB gets added to higher and higher positions in the result.) The tmpA and tmpC registers are rotated right. This also puts the next bit of the multiplier into the carry flag for the next cycle. The microcode jumps to the top of the loop if the counter is not zero (NCZ). Otherwise, the subroutine returns with the result in tmpA and tmpC.

ZERO → tmpA  RRCY tmpC   CORX: initialize right rotate
Σ → tmpC     MAXC          get rotate result, initialize counter to max value
             JMPS NCY 8  5: top of loop
             ADD tmpA     conditionally add
Σ → tmpA               F  sum to tmpA, update flags to get carry
             RRCY tmpA   8: 32-bit shift of tmpA/tmpC
Σ → tmpA     RRCY tmpC  
Σ → tmpC     JMPS NCZ 5   loop to 5 if counter is not 0
             RTN  

MULCOF

The last subroutine is MULCOF, which configures the carry and overflow flags. The 8086 uses the rule that if the upper half of the result is nonzero, the carry and overflow flags are set, otherwise they are cleared. The first two lines pass tmpA (the upper half of the result) through the ALU to set the zero flag for the conditional jump. As a side-effect, the other status flags will get set but these values are "undefined" in the documentation.6 If the test is nonzero, the carry and overflow flags are set (SCOF), otherwise they are cleared (CCOF).5 The SCOF and CCOF micro-operations were implemented solely for use by multiplication, illustrating how microcode can be designed around specific needs.

             PASS tmpA  MULCOF: pass tmpA through to test if zero
Σ → no dest  JMPS 12   F update flags

             JMPS Z 8   12: jump if zero
             SCOF RTN    otherwise set carry and overflow

             CCOF RTN   8: clear carry and overflow

8-bit multiplication

The 8086 has separate instructions for 8-bit multiplication. The process for 8-bit multiplication is similar to 16-bit multiplication, except the values are half as long and the shift-and-add loop executes 8 times instead of 16. As shown below, the 8-bit sum starts in the low half of the temporary A register and is shifted right into the top of tmpC. Meanwhile, the 8-bit multiplier starts in the low half of tmpC and is shifted out to the right. At the end, the result is split between tmpA and tmpC.

sum (tmpA)        multiplier (tmpC)
0000000000000000  0000000001010101
0000000001111111  1000000000101010
0000000000111111  1100000000010101
0000000010011111  0110000000001010
0000000001001111  1011000000000101
0000000010100111  0101100000000010
0000000001010011  1010110000000001
0000000010101001  0101011000000000
0000000001010100  1010101100000000

The 8086 supports many instructions with byte and word versions, using 8-bit or 16-bit arguments. In most cases, the byte and word instructions use the same microcode, with the ALU and register hardware using bytes or words based on the instruction. However, the byte- and word-multiply instructions use different registers, requiring microcode changes. In particular, the multiplier is in AL, the low half of the accumulator. At the end, the 16-bit result is returned in AX, the full 16-bit accumulator; two micro-instructions assemble the result from tmpC and tmpA into the two bytes of the accumulator, 'AL' and 'AH' respectively. Apart from those changes, the microcode is the same as the word multiply microcode discussed earlier.

AL → tmpC    LRCY tmpC         iMUL rmb:
M → tmpB     CALL X0 PREIMUL  
             CALL CORX  
             CALL F1 NEGATE  
             CALL X0 IMULCOF  
tmpC → AL    JMPS X0 7  
             CALL MULCOF  
tmpA → AH    RNI

Signed multiplication

The 8086 (like most computers) represents signed numbers using a format called two's complement. While a regular byte holds a number from 0 to 255, a signed byte holds a number from -128 to 127. A negative number is formed by flipping all the bits (known as the one's complement) and then adding 1, yielding the two's complement value.7 For instance, +5 is 0x05 while -5 is 0xfb. (Note that the top bit of a number is set for a negative number; this is the sign bit.) The nice thing about two's complement numbers is that the same addition and subtraction operations work on both signed and unsigned values. Unfortunately, this is not the case for signed multiplication, since signed and unsigned values yield different results due to sign extension.

The 8086 has separate IMUL (Integer Multiply) instructions to perform signed multiplication. The 8086 performs signed multiplication by converting the arguments to positive values, performing unsigned multiplication, and then negating the result if necessary. As shown above, signed and unsigned multiplication both use the same microcode, but the microcode conditionally calls some subroutines for signed multiplication. I will discuss those micro-subroutines below.
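
At a high level, the combined effect of these subroutines is roughly the C sketch below, which reuses the hypothetical mul16 from the earlier sketch; the names are mine, not the microcode's.

#include <stdint.h>

uint32_t mul16(uint16_t multiplicand, uint16_t multiplier); // unsigned core (see earlier sketch)

// Signed multiply via the unsigned core: make both arguments positive,
// toggling an F1-style flag for each negative one, then negate if needed.
int32_t imul16(int16_t a, int16_t b) {
    int f1 = 0;
    uint16_t ua = (uint16_t)a, ub = (uint16_t)b;
    if (a < 0) { ua = (uint16_t)-ua; f1 = !f1; }   // PREIMUL: negate, flip F1
    if (b < 0) { ub = (uint16_t)-ub; f1 = !f1; }
    uint32_t product = mul16(ua, ub);
    return f1 ? -(int32_t)product : (int32_t)product; // NEGATE if F1 is set
}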

PREIMUL

The first subroutine for signed multiplication is PREIMUL, performing preliminary operations for integer multiplication. It converts the two arguments, stored in tmpC and tmpB, to positive values. It keeps track of the signs using an internal flag called F1, toggling this flag for a negative argument. This conveniently handles the rule that two negatives make a positive since complementing the F1 flag twice will clear it.

This microcode, below, illustrates the complexity of microcode and how micro-operations are carefully arranged to get the right values at the right time. The first micro-instruction performs one ALU operation and sets up a second operation. The calling code had set up the ALU to perform LRCY tmpC, so that's the result returned by Σ (and discarded). Performing a left rotate and discarding the result may seem pointless, but the important side-effect is that the top bit (i.e. the sign bit) ends up in the carry flag. The microcode does not have a conditional jump based on the sign, but has a conditional jump based on carry, so the point is to test if tmpC is negative. The first micro-instruction also sets up negation (NEG tmpC) for the next ALU operation.

Σ → no dest  NEG tmpC   PREIMUL: set up negation of tmpC
             JMPS NCY 7  jump if tmpC positive
Σ → tmpC     CF1         if negative, negate tmpC, flip F1
             JMPS 7      jump to shared code

             LRCY tmpB  7:
Σ → no dest  NEG tmpB    set up negation of tmpB
             JMPS NCY 11 jump if tmpB positive
Σ → tmpB     CF1 RTN     if negative, negate tmpB, flip F1
             RTN        11: return

For the remaining lines, if the carry is clear (NCY), the next two lines are skipped. Otherwise, the ALU result (Σ) is written to tmpC, making it positive, and the F1 flag is complemented with CF1. (The second short jump (JMPS) may look unnecessary, but I reordered the code for clarity.) The second half of the microcode performs a similar test on tmpB. If tmpB is negative, it is negated and F1 is toggled.

NEGATE

The microcode below is called after computing the result, if the result needs to be made negative. Negation is harder than you might expect because the result is split between the tmpA and tmpC registers. The two's complement operation (NEG) is applied to the low word, while either two's complement or one's complement (COM1) is applied to the upper word, depending on the carry for mathematical reasons.8 The code also toggles F1 and makes tmpB positive; I think this code is only useful for division, which also uses the NEGATE subroutine.

             NEG tmpC   NEGATE: negate tmpC
Σ → tmpC     COM1 tmpA F maybe complement tmpA
             JMPS CY 6  
             NEG tmpA    negate tmpA if there's no carry
Σ → tmpA     CF1        6: toggle F1 for some reason

             LRCY tmpB  7: test sign of tmpB
Σ → no dest  NEG tmpB    maybe negate tmpB
             JMPS NCY 11 skip if tmpB positive
Σ → tmpB     CF1 RTN     else negate tmpB, toggle F1
             RTN        11: return

IMULCOF

The IMULCOF routine is similar to MULCOF, but the calculation is a bit trickier for a signed result. This routine sets the carry and overflow flags if the upper half of the result is significant, that is, it is not just the sign extension of the lower half.9 In other words, the top byte is not significant if it duplicates the top bit (the sign bit) of the lower byte. The trick in the microcode is to add the top bit of the lower byte to the upper byte by putting it in the carry flag and performing an add with carry (ADC) of 0. If the result is 0, the upper byte is not significant, handling the positive and negative cases. (This also holds for words instead of bytes.)

ZERO → tmpB  LRCY tmpC  IMULCOF: get top bit of tmpC
Σ → no dest  ADC tmpA    add to tmpA and 0 (tmpB)
Σ → no dest   F          update flags
             JMPS Z 8   12: jump if zero result
             SCOF RTN    otherwise set carry and overflow

             CCOF RTN   8: clear carry and overflow
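
In C terms, the significance test amounts to something like this (byte version; the function name is mine):

#include <stdint.h>

// The high byte matters unless it is just the sign extension of the low byte.
// Adding the low byte's sign bit to the high byte gives 0 exactly when the
// high byte is 0x00 (positive) or 0xFF (negative sign extension).
int high_half_significant(uint8_t lo, uint8_t hi) {
    uint8_t sign_bit = lo >> 7;              // top bit of the low half
    return (uint8_t)(hi + sign_bit) != 0;    // nonzero: set carry and overflow
}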

The hardware for multiplication

For the most part, the 8086 uses the regular ALU addition and shifts for the multiplication algorithm. Some special hardware features provide assistance.

Loop counter

The 8086 has a special 4-bit loop counter for multiplication. This counter starts at 7 for byte multiplication and 15 for word multiplication, based on the instruction. This loop counter allows the microcode to decrement the counter, test for the end, and perform a conditional branch in one micro-operation. The counter is implemented with four flip-flops, along with logic to compute the value after decrementing by one. The MAXC (Maximum Count) micro-instruction sets the counter to 7 or 15 for byte or word operations respectively. The NCZ (Not Counter Zero) micro-instruction has two actions. First, it performs a conditional jump if the counter is nonzero. Second, it decrements the counter.

X register

The multiplication microcode uses an internal register called the X register to distinguish between the MUL and IMUL instructions. The X register is a 3-bit register that holds the ALU opcode, indicated by bits 5–3 of the instruction.10 Since the instruction is held in the Instruction Register, you might wonder why a separate register is required. The motivation is that some opcodes specify the type of ALU operation in the second byte of the instruction, the ModR/M byte, bits 5–3.11 Since the ALU operation is sometimes specified in the first byte and sometimes in the second byte, the X register was added to handle both these cases.

For the most part, the X register indicates which of the eight standard ALU operations is selected (ADD, OR, ADC, SBB, AND, SUB, XOR, CMP). However, a few instructions use bit 0 of the X register to distinguish between other pairs of instructions. For instance, it distinguishes between MUL and IMUL, DIV and IDIV, CMPS and SCAS, MOVS and LODS, or AAA and AAS. While these instruction pairs may appear to have arbitrary opcodes, they have been carefully assigned. The microcode can test this bit using the X0 condition and perform conditional jumps.

The implementation of the X register is straightforward, consisting of three flip-flops to hold the three bits of the instruction. The flip-flops are loaded from the prefetch queue bus during First Clock and during Second Clock for appropriate instructions, as the instruction bytes travel over the bus. Testing bit 0 of the X register with the X0 condition is supported by the microcode condition evaluation circuitry, so it can be used for conditional jumps in the microcode.

The F1 flag

The multiplication microcode uses an internal flag called F1,12 which has two distinct uses. The flag keeps track of a REP prefix for use with a string operation. But the F1 flag is also used by signed multiplication and division to keep track of the sign. The F1 flag can be toggled by microcode through the CF1 (Complement F1) micro-instruction. The F1 flag is implemented with a flip-flop, along with a multiplexer to select the value. It is cleared when a new instruction starts, set by a REP prefix, and toggled by the CF1 micro-instruction.

The diagram below shows how the F1 latch and the loop counter appear on the die. In this image, the metal layer has been removed, showing the silicon and the polysilicon wiring underneath.

The counter and F1 latch as they appear on the die. The latch for the REP state is also here.

Later advances in multiplication

The 8086 was pretty slow at multiplying compared to later Intel processors.13 It took up to 133 clock cycles to multiply unsigned 16-bit values because of the complicated microcode loops. By 1982, the Intel 286 processor cut this time down to 21 clock cycles. The Intel 486 (1989) used an improved algorithm that could end early, so multiplying by a small number could take just 9 cycles.

Although these optimizations improved performance, they still depended on looping over the bits. With the shift to 32-bit processors, the loop time became unwieldy. The solution was to replace the loop with hardware: instead of performing 32 shift-and-add loops, an array of adders could compute the multiplication in one step. This quantity of hardware was unreasonable in the 8086 era, but as Moore's Law made transistors smaller and cheaper, hardware multiplication became practical. For instance, the Cyrix Cx486SLC (1992) had a 16-bit hardware multiplier that cut word multiply down to 3 cycles. The Intel Core 2 (2006) was even faster, able to complete a 32-bit multiplication every clock cycle.

Hardware multiplication is a fairly complicated subject, with many optimizations to maximize performance while minimizing hardware.14 Simply replacing the loop with a sequence of 32 adders is too slow because the result would be delayed while propagating through all the adders. The solution is to arrange the adders as a tree to provide parallelism. The first layer has 16 adders to add pairs of terms. The next layer adds pairs of these partial sums, and so forth. The resulting tree of adders is 5 layers deep rather than 32, reducing the time to compute the sum. Real multipliers achieve further performance improvements by splitting up the adders and creating a more complex tree: the venerable Wallace tree (1964) and Dadda multiplier (1965) are two popular approaches. Another optimization is the Booth algorithm (1951), which performs signed multiplication directly, without converting the arguments to positive values first. The Pentium 4 (2000) used a Booth encoder and a Wallace tree (ref), but research in the early 2000s found the Dadda tree to be faster, and it is now more popular.
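
To see why a tree helps, here's a purely illustrative Python sketch: summing 32 partial products pairwise takes ceil(log2 32) = 5 levels of adders instead of a 31-adder chain. (Real Wallace and Dadda multipliers are cleverer, using carry-save adders that reduce three terms to two at each level, but the logarithmic idea is the same.)

def tree_sum(terms):
    """Sum a list by adding adjacent pairs in parallel, counting adder levels."""
    levels = 0
    while len(terms) > 1:
        pairs = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            pairs.append(terms[-1])   # an odd leftover passes through
        terms = pairs
        levels += 1
    return terms[0], levels

a, b = 123456789, 987654321
partials = [(a << i) if (b >> i) & 1 else 0 for i in range(32)]
total, levels = tree_sum(partials)
assert total == a * b and levels == 5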

Conclusions

Multiplication is much harder to compute than addition or subtraction. The 8086 processor hid this complexity from the programmer by providing four multiplication instructions for byte and word multiplication of signed or unsigned values. These instructions implemented multiplication in microcode, performing shifts and adds in a loop. By using microcode subroutines and conditional execution, these four machine instructions share most of the microcode. As the microcode capacity of the 8086 was very small, this was a critical feature of the implementation.

If you made it through all the discussion of microcode, congratulations! Microcode is even harder to understand than assembly code. Part of the problem is that microcode is very fine-grained, with even ALU operations split into multiple steps. Another complication is that 8086 microcode performs a register move and another operation in parallel, so it's hard to keep track of what's going on. Microcode can seem a bit like a jigsaw puzzle, with pieces carefully fit together as compactly as possible. I hope the explanations here made sense, or at least gave you a feel for how microcode operates.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected].

Notes and references

  1. Mainframes going back to ENIAC had multiply and divide instructions. However, early microprocessors took a step back and didn't support these more complex operations. (My theory is that the decline in memory prices made it more cost-effective to implement multiply and divide in software than hardware.) The National Semiconductor IMP-16, a 16-bit bit-slice microprocessor from 1973, may be the first with multiply and divide instructions. The 8-bit Motorola 6809 processor (1978) included 8-bit multiplication but not division. I think the 8086 was the first Intel processor to support multiplication.

  2. The 8086 also supported division. Although the division instructions are similar to multiplication in many ways, I'm focusing on multiplication and ignoring division for this blog post. 

  3. My microcode analysis is based on Andrew Jenner's 8086 microcode disassembly.

  4. I think CORX stands for Core Multiply and CORD stands for Core Divide.

  5. The definitions of carry and overflow are different for multiplication compared to addition and subtraction. Note that the result of a multiplication operation will always fit in the available result space, which is twice as large as the arguments. For instance, the biggest value you can get by multiplying 16-bit values is 0xffff×0xffff=0xfffe0001, which fits into 32 bits. (Signed and 8-bit multiplications fit similarly.) This is in contrast to addition and subtraction, which can exceed their available space. A carry indicates that an addition exceeded its space when treated as unsigned, while an overflow indicates that an addition exceeded its space when treated as signed.

  6. The Intel documentation states that the sign, carry, overflow, and parity flags are undefined after the MUL operation, even though the microcode causes them to be computed. The meaning of "undefined" is that programmers shouldn't count on the flag values because Intel might change the behavior in later chips. This thread discusses the effects of MUL on the flags, and how the behavior is different on the NEC V20 chip. 

  7. It may be worth explaining why the two's complement of a number is defined by adding 1 to the one's complement. The one's complement of a number simply flips all the bits. If you take a byte value n, 0xff - n is the one's complement, since a 1 bit in n produces a 0 bit in the result.

    Now, suppose we want to represent -5 as a signed byte. Adding 0x100 will keep the same byte value with a carry out of the byte. But 0x100 - 5 = (1 + 0xff) - 5 = 1 + (0xff - 5) = 1 + (one's complement of 5). Thus, it makes sense mathematically to represent -5 by adding 1 to the one's complement of 5, and this holds for any value. 
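
    A quick check of this identity in Python for all byte values (just a verification of the arithmetic above):

    for n in range(256):
        ones_complement = 0xff - n                           # flip all the bits
        assert (ones_complement + 1) & 0xff == (-n) & 0xff   # equals the two's complement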

  8. The negation code is a bit tricky because the result is split across two words. In most cases, the upper word is bitwise complemented. However, if the lower word is zero, then the upper word is negated (two's complement). I'll demonstrate with 16-bit values to keep the examples small. The number 257 (0x0101) is negated to form -257 (0xfeff). Note that the upper byte is the one's complement (0x01 vs 0xfe) while the lower byte is two's complement (0x01 vs 0xff). On the other hand, the number 256 (0x0100) is negated to form -256 (0xff00). In this case, the upper byte is the two's complement (0x01 vs 0xff) and the lower byte is also the two's complement (0x00 vs 0x00).

    (Mathematical explanation: the two's complement is formed by taking the one's complement and adding 1. In most cases, there won't be a carry from the low byte to the upper byte, so the upper byte will remain the one's complement. However, if the low byte is 0, the complement is 0xff and adding 1 will form a carry. Adding this carry to the upper byte yields the two's complement of that byte.)

    To support multi-word negation, the 8086's NEG instruction clears the carry flag if the operand is 0, and otherwise sets the carry flag. (This is the reverse of the carry behavior described above because subtractions, including NEG, treat the carry flag as a borrow flag, which has the opposite sense.) The microcode NEG operation behaves identically to the machine instruction, since it is used to implement the machine instruction.

    Thus to perform a two-word negation, the microcode negates the low word (tmpC) and updates the flags (F). If the carry is set, the one's complement is applied to the upper word (tmpA). But if the carry is cleared, the two's complement is applied to tmpA. 
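
    Here is the same two-word negation written out in Python, a sketch of the logic just described (not the microcode itself), with tmpC as the low word and tmpA as the high word:

    def neg32(value):
        """Negate a 32-bit value held as two 16-bit words, the way the microcode
        does: negate the low word, then complement or negate the high word
        depending on the carry (borrow) flag left by the low-word NEG."""
        lo = value & 0xffff
        hi = (value >> 16) & 0xffff
        neg_lo = (-lo) & 0xffff
        carry = 0 if lo == 0 else 1      # NEG clears carry only for a zero operand
        if carry:
            hi ^= 0xffff                 # one's complement of the high word
        else:
            hi = (-hi) & 0xffff          # two's complement of the high word
        return (hi << 16) | neg_lo

    assert neg32(257) == (-257) & 0xffffffff         # 0xfffffeff
    assert neg32(256) == (-256) & 0xffffffff         # 0xffffff00
    assert neg32(0x00010000) == 0xffff0000           # low word zero case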

  9. The IMULCOF routine considers the upper half of the result significant if it is not the sign extension of the lower half. For instance, dropping the top byte of 0x0005 (+5) yields 0x05 (+5). Dropping the top byte of 0xfffb (-5) yields 0xfb (-5). Thus, the upper byte is not significant in these cases. Conversely, dropping the top byte of 0x00fb (+251) yields 0xfb (-5), so the upper byte is significant. 

  10. Curiously, the 8086 patent states that the X register is a 4-bit register holding bits 3–6 of the byte (col. 9, line 20). But looking at the die, it is a 3-bit register holding bits 3–5 of the byte. 

  11. Some instructions are specified by bits 5–3 in the ModR/M byte rather than in the first opcode byte. The motivation is to avoid wasting bits for instructions that use a ModR/M byte but don't need a register specification. For instance, consider the instruction ADD [BX],0x1234. This instruction uses a ModR/M byte to specify the memory address. However, because it uses an immediate operand, it does not need the register specification normally provided by bits 5–3 of the ModR/M byte. This frees up the bits to specify the instruction. From one perspective, this is an ugly hack, while from another perspective it is a clever optimization. 
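
    As a concrete illustration (easy to verify with an assembler), ADD [BX],0x1234 should assemble to four bytes, with the "register" field of the ModR/M byte reused to select ADD:

    # 0x81: group opcode for "ALU op r/m16, imm16"
    # 0x07: ModR/M = 00 000 111: mod=00 with rm=111 means [BX];
    #       bits 5-3 = 000 select ADD within the group
    # 0x34 0x12: the immediate 0x1234, little-endian
    instruction = bytes([0x81, 0x07, 0x34, 0x12])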

  12. Andrew Jenner discusses the F1 flag and the interaction between REP and multiplication here.

  13. Here are some detailed performance numbers. The 8086 processor takes 70–77 clock cycles to multiply 8-bit values and 118–133 clock cycles to multiply 16-bit values. Signed multiplies are a bit slower because of the sign calculations: 80–98 and 128–154 clock cycles respectively. The time is variable because of the conditional jumps in the multiplication process.

    The Intel 186 (1982) optimized multiplication slightly, bringing a register word multiply down to 35–37 cycles. The Intel 286 (also 1982) reduced this to 21 clocks. The 486 (1989) used a shift-and-add multiplier, but with an "early out" algorithm that stopped when the remaining bits were zero, so a 16-bit multiply could take from 9 to 22 clocks. The 8087 floating-point coprocessor (1980) used radix-4 multiplication, multiplying by pairs of bits at a time and either adding or subtracting; this halves the number of addition cycles. The Pentium's P5 micro-architecture (1993) took the unusual approach of reusing the floating-point unit's hardware multiplier for integer multiplication, taking 10 cycles for a 32-bit multiplication.

  14. This presentation gives a good overview of implementations of multiplication in hardware.