Showing posts with label 8086. Show all posts
Showing posts with label 8086. Show all posts

Reverse-engineering the 8086 processor's address and data pin circuits

The Intel 8086 microprocessor (1978) started the x86 architecture that continues to this day. In this blog post, I'm focusing on a small part of the chip: the address and data pins that connect the chip to external memory and I/O devices. In many processors, this circuitry is straightforward, but it is complicated in the 8086 for two reasons. First, Intel decided to package the 8086 as a 40-pin DIP, which didn't provide enough pins for all the functionality. Instead, the 8086 multiplexes address, data, and status. In other words, a pin can have multiple roles, providing an address bit at one time and a data bit at another time.

The second complication is that the 8086 has a 20-bit address space (due to its infamous segment registers), while the data bus is 16 bits wide. As will be seen, the "extra" four address bits have more impact than you might expect. To summarize, 16 pins, called AD0-AD15, provide 16 bits of address and data. The four remaining address pins (A16-A19) are multiplexed for use as status pins, providing information about what the processor is doing for use by other parts of the system. You might expect that the 8086 would thus have two types of pin circuits, but it turns out that there are four distinct circuits, which I will discuss below.

The 8086 die under the microscope, with the main functional blocks and address pins labeled. Click this image (or any other) for a larger version.

The 8086 die under the microscope, with the main functional blocks and address pins labeled. Click this image (or any other) for a larger version.

The microscope image above shows the silicon die of the 8086. In this image, the metal layer on top of the chip is visible, while the silicon and polysilicon underneath are obscured. The square pads around the edge of the die are connected by tiny bond wires to the chip's 40 external pins. The 20 address pins are labeled: Pins AD0 through AD15 function as address and data pins. Pins A16 through A19 function as address pins and status pins.1 The circuitry that controls the pins is highlighted in red. Two internal busses are important for this discussion: the 20-bit AD bus (green) connects the AD pins to the rest of the CPU, while the 16-bit C bus (blue) communicates with the registers. These buses are connected through a circuit that can swap the byte order or shift the value. (The lines on the diagram are simplified; the real wiring twists and turns to fit the layout. Moreover, the C bus (blue) has its bits spread across the width of the register file.)

Segment addressing in the 8086

One goal of the 8086 design was to maintain backward compatibility with the earlier 8080 processor.2 This had a major impact on the 8086's memory design, resulting in the much-hated segment registers. The 8080 (like most of the 8-bit processors of the early 1970s) had a 16-bit address space, able to access 64K (65,536 bytes) of memory, which was plenty at the time. But due to the exponential growth in memory capacity described by Moore's Law, it was clear that the 8086 needed to support much more. Intel decided on a 1-megabyte address space, requiring 20 address bits. But Intel wanted to keep the 16-bit memory addresses used by the 8080.

The solution was to break memory into segments. Each segment was 64K long, so a 16-bit offset was sufficient to access memory in a segment. The segments were allocated in a 1-megabyte address space, with the result that you could access a megabyte of memory, but only in 64K chunks.3 Segment addresses were also 16 bits, but were shifted left by 4 bits (multiplied by 16) to support the 20-bit address space.

Thus, every memory access in the 8086 required a computation of the physical address. The diagram below illustrates this process: the logical address consists of the segment base address and the offset within the segment. The 16-bit segment register was shifted 4 bits and added to the 16-bit offset to yield the 20-bit physical memory address.

The segment register and the offset are added to create a 20-bit physical address.  From iAPX 86,88 User's Manual, page 2-13.

The segment register and the offset are added to create a 20-bit physical address. From iAPX 86,88 User's Manual, page 2-13.

This address computation was not performed by the regular ALU (Arithmetic/Logic Unit), but by a separate adder that was devoted to address computation. The address adder is visible in the upper-left corner of the die photo. I will discuss the address adder in more detail below.

The AD bus and the C Bus

The 8086 has multiple internal buses to move bits internally, but the relevant ones are the AD bus and the C bus. The AD bus is a 20-bit bus that connects the 20 address/data pins to the internal circuitry.4 A 16-bit bus called the C bus provides the connection between the AD bus, the address adder and some of the registers.5 The diagram below shows the connections. The AD bus can be connected to the 20 address pins through latches. The low 16 pins can also be used for data input, while the upper 4 pins can also be used for status output. The address adder performs the 16-bit addition necessary for segment arithmetic. Its output is shifted left by four bits (i.e. it has four 0 bits appended), producing the 20-bit result. The inputs to the adder are provided by registers, a constant ROM that holds small constants such as +1 or -2, or the C bus.

My reverse-engineered diagram showing how the AD bus and the C bus interact with the address pins.

My reverse-engineered diagram showing how the AD bus and the C bus interact with the address pins.

The shift/crossover circuit provides the interface between these two buses, handling the 20-bit to 16-bit conversion. The busses can be connected in three ways: direct, crossover, or shifted.6 The direct mode connects the 16 bits of the C bus to the lower 16 bits of the address/data pins. This is the standard mode for transferring data between the 8086's internal circuitry and the data pins. The crossover mode performs the same connection but swaps the bytes. This is typically used for unaligned memory accesses, where the low memory byte corresponds to the high register byte, or vice versa. The shifted mode shifts the 20-bit AD bus value four positions to the right. In this mode, the 16-bit output from the address adder goes to the 16-bit C bus. (The shift is necessary to counteract the 4-bit shift applied to the address adder's output.) Control circuitry selects the right operation for the shift/crossover circuit at the right time.7

Two of the registers are invisible to the programmer but play an important role in memory accesses. The IND (Indirect) register specifies the memory address; it holds the 16-bit memory offset in a segment. The OPR (Operand) register holds the data value.9 The IND and OPR registers are not accessed directly by the programmer; the microcode for a machine instruction moves the appropriate values to these registers prior to the write.

Overview of a write cycle

I hesitate to present a timing diagram, since I may scare off my readers, but the 8086's communication is designed around a four-step bus cycle. The diagram below shows simplified timing for a write cycle, when the 8086 writes to memory or an I/O device.8 The external bus activity is organized as four states, each one clock cycle long: T1, T2, T3, T4. These T states are very important since they control what happens on the bus. During T1, the 8086 outputs the address on the pins. During the T2, T3, and T4 states, the 8086 outputs the data word on the pins. The important part for this discussion is that the pins are multiplexed depending on the T-state: the pins provide the address during T1 and data during T2 through T4.

A typical write bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

A typical write bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

There two undocumented T states that are important to the bus cycle. The physical address is computed in the two clock cycles before T1 so the address will be available in T1. I give these "invisible" T states the names TS (start) and T0.

The address adder

The operation of the address adder is a bit tricky since the 16-bit adder must generate a 20-bit physical address. The adder has two 16-bit inputs: the B input is connected to the upper registers via the B bus, while the C input is connected to the C bus. The segment register value is transferred over the B bus to the adder during the second half of the TS state (that is, two clock cycles before the bus cycle becomes externally visible during T1). Meanwhile, the address offset is transferred over the C bus to the adder, but the adder's C input shifts the value four bits to the right, discarding the four low bits. (As will be explained later, the pin driver circuits latch these bits.) The adder's output is shifted left four bits and transferred to the AD bus during the second half of T0. This produces the upper 16 bits of the 20-bit physical memory address. This value is latched into the address output flip-flops at the start of T1, putting the computed address on the pins. To summarize, the 20-bit address is generated by storing the 4 low-order bits during T0 and then the 16 high-order sum bits during T1.

The address adder is not needed for segment arithmetic during T1 and T2. To improve performance, the 8086 uses the adder during this idle time to increment or decrement memory addresses. For instance, after popping a word from the stack, the stack pointer needs to be incremented by 2. The address adder can do this increment "for free" during T1 and T2, leaving the ALU available for other operations.10 Specifically, the adder updates the memory address in IND, incrementing it or decrementing it as appropriate. First, the IND value is transferred over the B bus to the adder during the second half of T1. Meanwhile, a constant (-3 to +2) is loaded from the Constant ROM and transferred to the adder's C input. The output from the adder is transferred to the AD bus during the second half of T2. As before, the output is shifted four bits to the left. However, the shift/crossover circuit between the AD bus and the C bus is configured to shift four bits to the right, canceling the adder's shift. The result is that the C bus gets the 16-bit sum from the adder, and this value is stored in the IND register.11 For more information on the implemenation of the address adder, see my previous blog post.

The pin driver circuit

Now I'll dive down to the hardware implementation of an output pin. When the 8086 chip communicates with the outside world, it needs to provide relatively high currents. The tiny logic transistors can't provide enough current, so the chip needs to use large output transistors. To fit the large output transistors on the die, they are constructed of multiple wide transistors in parallel.12 Moreover, the drivers use a somewhat unusual "superbuffer" circuit with two transistors: one to pull the output high, and one to pull the output low.13

The diagram below shows the transistor structure for one of the output pins (AD10), consisting of three parallel transistors between the output and +5V, and five parallel transistors between the output and ground. The die photo on the left shows the metal layer on top of the die. This shows the power and ground wiring and the connections to the transistors. The photo on the right shows the die with the metal layer removed, showing the underlying silicon and the polysilicon wiring on top. A transistor gate is formed where a polysilicon wire crosses the doped silicon region. Combined, the +5V transistors are equivalent to about 60 typical transistors, while the ground transistors are equivalent to about 100 typical transistors. Thus, these transistors provide substantially more current to the output pin.

Two views of the output transistors for a pin. The first shows the metal layer, while the second shows the polysilicon and silicon.

Two views of the output transistors for a pin. The first shows the metal layer, while the second shows the polysilicon and silicon.

Tri-state output driver

The output circuit for an address pin uses a tri-state buffer, which allows the output to be disabled by putting it into a high-impedance "tri-state" configuration. In this state, the output is not pulled high or low but is left floating. This capability allows the pin to be used for data input. It also allows external devices to device can take control of the bus, for instance, to perform DMA (direct memory access).

The pin is driven by two large MOSFETs, one to pull the output high and one to pull it low. (As described earlier, each large MOSFET is physically multiple transistors in parallel, but I'll ignore that for now.) If both MOSFETs are off, the output floats, neither on nor off.

Schematic diagram of a "typical" address output pin.

Schematic diagram of a "typical" address output pin.

The tri-state output is implemented by driving the MOSFETs with two "superbuffer"15 AND gates. If the enable input is low, both AND gates produce a low output and both output transistors are off. On the other hand, if enable is high, one AND gate will be on and one will be off. The desired output value is loaded into a flip-flop to hold it,14 and the flip-flop turns one of the output transistors on, driving the output pin high or low as appropriate. (Conveniently, the flip-flop provides the data output Q and the inverted data output Q.) Generally, the address pin outputs are enabled for T1-T4 of a write but only during T1 for a read.16

In the remainder of the discussion, I'll use the tri-state buffer symbol below, rather than showing the implementation of the buffer.

The output circuit, expressed with a tri-state buffer symbol.

The output circuit, expressed with a tri-state buffer symbol.

AD4-AD15

Pins AD4-AD15 are "typical" pins, avoiding the special behavior of the top and bottom pins, so I'll discuss them first. The behavior of these pins is that the value on the AD bus is latched by the circuit and then put on the output pin under the control of the enable signal. The circuit has three parts: a multiplexer to select the output value, a flip-flop to hold the output value, and a tri-state driver to provide the high-current output to the pin. In more detail, the multiplexer selects either the value on the AD bus or the current output from the flip-flop. That is, the multiplexer can either load a new value into the flip-flop or hold the existing value.17 The flip-flop latches the input value on the falling edge of the clock, passing it to the output driver. If the enable line is high, the output driver puts this value on the corresponding address pin.

The output circuit for AD4-AD15 has a latch to hold the desired output value, an address or data bit.

The output circuit for AD4-AD15 has a latch to hold the desired output value, an address or data bit.

For a write, the circuit latches the address value on the bus during the second half of T0 and puts it on the pins during T1. During the second half of the T1 state, the data word is transferred from the OPR register over the C bus to the AD bus and loaded into the AD pin latches. The word is transferred from the latches to the pins during T2 and held for the remainder of the bus cycle.

AD0-AD3

The four low address bits have a more complex circuit because these address bits are latched from the bus before the address adder computes its sum, as described earlier. The memory offset (before the segment addition) will be on the C bus during the second half of TS and is loaded into the lower flip-flop. This flip-flop delays these bits for one clock cycle and then they are loaded into the upper flip-flop. Thus, these four pins pick up the offset prior to the addition, while the other pins get the result of the segment addition.

The output circuit for AD0-AD3 has a second latch to hold the low address bits before the address adder computes the sum.

The output circuit for AD0-AD3 has a second latch to hold the low address bits before the address adder computes the sum.

For data, the AD0-AD3 pins transfer data directly from the AD bus to the pin latch, bypassing the delay that was used to get the address bits. That is, the AD0-AD3 pins have two paths: the delayed path used for addresses during T0 and the direct path otherwise used for data. Thus, the multiplexer has three inputs: two for these two paths and a third loop-back input to hold the flip-flop value.

A16-A19: status outputs

The top four pins (A16-A19) are treated specially, since they are not used for data. Instead, they provide processor status during T2-T4.18 The pin latches for these pins are loaded with the address during T0 like the other pins, but loaded with status instead of data during T1. The multiplexer at the input to the latch selects the address bit during T0 and the status bit during T1, and holds the value otherwise. The schematic below shows how this is implemented for A16, A17, and A19.

The output circuit for AD16, AD17, and AD19 selects either an address output or a status output.

The output circuit for AD16, AD17, and AD19 selects either an address output or a status output.

Address pin A18 is different because it indicates the current status of the interrupt enable flag bit. This status is updated every clock cycle, unlike the other pins. To implement this, the pin has a different circuit that isn't latched, so the status can be updated continuously. The clocked transistors act as "pass transistors", passing the signal through when active. When a pass transistor is turned off, the following logic gate holds the previous value due to the capacitance of the wiring. Thus, the pass transistors provide a way of holding the value through the clock cycle. The flip-flops are implemented with pass transistors internally, so in a sense the circuit below is a flip-flop that has been "exploded" to provide a second path for the interrupt status.

The output circuit for AD18 is different from the rest so the I flag status can be updated every clock cycle.

The output circuit for AD18 is different from the rest so the I flag status can be updated every clock cycle.

Reads

A memory or I/O read also uses a 4-state bus cycle, slightly different from the write cycle. During T1, the address is provided on the pins, the same as for a write. After that, however, the output circuits are tri-stated so they float, allowing the external memory to put data on the bus. The read data on the pin is put on the AD bus at the start of the T4 state. From there, the data passes through the crossover circuit to the C bus. Normally the 16 data bits pass straight through to the C bus, but the bytes will be swapped if the memory access is unaligned. From the C bus, the data is written to the OPR register, a byte or a word as appropriate. (For an instruction prefetch, the word is written to a prefetch queue register instead.)

A typical read bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

A typical read bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

To support data input on the AD0-AD15 pins, they have a circuit to buffer the input data and transfer it to the AD bus. The incoming data bit is buffered by the two inverters and sampled when the clock is high. If the enable' signal is low, the data bit is transferred to the AD bus when the clock is low.19 The two MOSFETs act as a "superbuffer", providing enough current for the fairly long AD bus. I'm not sure what the capacitor accomplishes, maybe avoiding a race condition if the data pin changes just as the clock goes low.20

Schematic of the input circuit for the data pins.

Schematic of the input circuit for the data pins.

This circuit has a second role, precharging the AD bus high when the clock is low, if there's no data. Precharging a bus is fairly common in the 8086 (and other NMOS processors) because NMOS transistors are better at pulling a line low than pulling it high. Thus, it's often faster to precharge a line high before it's needed and then pull it low for a 0.21

Since pins A16-A19 are not used for data, they operate the same for reads as for writes: providing address bits and then status.

The pin circuit on the die

The diagram below shows how the pin circuitry appears on the die. The metal wiring has been removed to show the silicon and polysilicon. The top half of the image is the input circuitry, reading a data bit from the pin and feeding it to the AD bus. The lower half of the image is the output circuitry, reading an address or data bit from the AD bus and amplifying it for output via the pad. The light gray regions are doped, conductive silicon. The thin tan lines are polysilicon, which forms transistor gates where it crosses doped silicon.

The input/output circuitry for an address/data pin. The metal layer has been removed to show the underlying silicon and polysilicon. Some crystals have formed where the bond pad was.

The input/output circuitry for an address/data pin. The metal layer has been removed to show the underlying silicon and polysilicon. Some crystals have formed where the bond pad was.

A historical look at pins and timing

The number of pins on Intel chips has grown exponentially, more than a factor of 100 in 50 years. In the early days, Intel management was convinced that a 16-pin package was large enough for any integrated circuit. As a result, the Intel 4004 processor (1971) was crammed into a 16-pin package. Intel chip designer Federico Faggin22 describes 16-pin packages as a completely silly requirement that was throwing away performance, but the "God-given 16 pins" was like a religion at Intel. When Intel was forced to use 18 pins by the 1103 memory chip, it "was like the sky had dropped from heaven" and he had "never seen so many long faces at Intel." Although the 8008 processor (1972) was able to use 18 pins, this low pin count still harmed performance by forcing pins to be used for multiple purposes.

The Intel 8080 (1974) had a larger, 40-pin package that allowed it to have 16 address pins and 8 data pins. Intel stuck with this size for the 8086, even though competitors used larger packages with more pins.23 As processors became more complex, the 40-pin package became infeasible and the pin count rapidly expanded; The 80286 processor (1982) had a 68-pin package, while the i386 (1985) had 132 pins; the i386 needed many more pins because it had a 32-bit data bus and a 24- or 32-bit address bus. The i486 (1989) went to 196 pins while the original Pentium had 273 pins. Nowadays, a modern Core I9 processor uses the FCLGA1700 socket with a whopping 1700 contacts.

Looking at the history of Intel's bus timing, the 8086's complicated memory timing goes back to the Intel 8008 processor (1972). Instruction execution in the 8008 went through a specific sequence of timing states; each clock cycle was assigned a particular state number. Memory accesses took three cycles: the address was sent to memory during states T1 and T2, half of the address at a time since there were only 8 address pins. During state T3, a data byte was either transmitted to memory or read from memory. Instruction execution took place during T4 and T5. State signals from the 8008 chip indicated which state it was in.

The 8080 used an even more complicated timing system. An instruction consisted of one to five "machine cycles", numbered M1 through M5, where each machine cycle corresponded to a memory or I/O access. Each machine cycle consisted of three to five states, T1 through T5, similar to the 8008 states. The 8080 had 10 different types of machine cycle such as instruction fetch, memory read, memory write, stack read or write, or I/O read or write. The status bits indicated the type of machine cycle. The 8086 kept the T1 through T4 memory cycle. Because the 8086 decoupled instruction prefetching from execution, it no longer had explicit M machine cycles. Instead, it used status bits to indicate 8 types of bus activity such as instruction fetch, read data, or write I/O.

Conclusions

Well, address pins is another subject that I thought would be straightforward to explain but turned out to be surprisingly complicated. Many of the 8086's design decisions combine in the address pins: segmented addressing, backward compatibility, and the small 40-pin package. Moreover, because memory accesses are critical to performance, Intel put a lot of effort into this circuitry. Thus, the pin circuitry is tuned for particular purposes, especially pin A18 which is different from all the rest.

There is a lot more to say about memory accesses and how the 8086's Bus Interface Unit performs them. The process is very complicated, with interacting state machines for memory operation and instruction prefetches, as well as handling unaligned memory accesses. I plan to write more, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected] and Bluesky as @righto.com so you can follow me there too.

Notes and references

  1. In the discussion, I'll often call all the address pins "AD" pins for simplicity, even though pins 16-19 are not used for data. 

  2. The 8086's compatibility with the 8080 was somewhat limited since the 8086 had a different instruction set. However, Intel provided a conversion program called CONV86 that could convert 8080/8085 assembly code into 8086 assembly code that would usually work after minor editing. The 8086 was designed to make this process straightforward, with a mapping from the 8080's registers onto the 8086's registers, along with a mostly-compatible instruction set. (There were a few 8080 instructions that would be expanded into multiple 8086 instructions.) The conversion worked for straightforward code, but didn't work well with tricky, self-modifying code, for instance. 

  3. To support the 8086's segment architecture, programmers needed to deal with "near" and "far" pointers. A near pointer consisted of a 16-bit offset and could access 64K in a segment. A far pointer consisted of a 16-bit offset along with a 16-bit segment address. By modifying the segment register on each access, the full megabyte of memory could be accessed. The drawbacks were that far pointers were twice as big and were slower. 

  4. The 8086 patent provides a detailed architectural diagram of the 8086. I've extracted part of the diagram below. In most cases the diagram is accurate, but its description of the C bus doesn't match the real chip. There are some curious differences between the patent diagram and the actual implementation of the 8086, suggesting that the data pins were reorganized between the patent and the completion of the 8086. The diagram shows the address adder (called the Upper Adder) connected to the C bus, which is connected to the address/data pins. In particular, the patent shows the data pins multiplexed with the high address pins, while the low address pins A3-A0 are multiplexed with three status signals. The actual implementation of the 8086 is the other way around, with the data pins multiplexed with the low address pins while the high address pins A19-A16 are multiplexed with the status signals. Moreover, the patent doesn't show anything corresponding to what I call the AD bus; I made up that name. The moral is that while patents can be very informative, they can also be misleading.

    A diagram from patent US4449184 showing the connections to the address pins. This diagram does not match the actual chip. The diagram also shows the old segment register names: RC, RD, RS, and RA became CS, DS, SS, and ES.

    A diagram from patent US4449184 showing the connections to the address pins. This diagram does not match the actual chip. The diagram also shows the old segment register names: RC, RD, RS, and RA became CS, DS, SS, and ES.

     

  5. The C bus is connected to the PC, OPR, and IND registers, as well as the prefetch queue, but is not connected to the segment registers. Two other buses (the ALU bus and the B bus) provide access to the segment registers. 

  6. Swapping the bytes on the data pins is required in a few cases. The 8086 has a 16-bit data bus, so transfers are usually a 16-bit word, copied directly between memory and a register. However, the 8086 also allows 8-bit operations, in which case either the top half or bottom half of the word is accessed. Loading an 8-bit value from the top half of a memory word into the bottom half of a register uses the crossover circuit. Another case is performing a 16-bit access to an "unaligned" address, that is, an odd address so the word crosses the normal word boundaries. From the programmer's perspective, an unaligned access is permitted (unlike many RISC processors), but the hardware converts this access into two 8-bit accesses, so the bus itself never handles an unaligned access.

    The 8086 has the ability to access a single memory byte out of a word, either for a byte operation or for an unaligned word operation. This behavior has some important consequences on the address pins. In particular, the low address pin AD0 doesn't behave like the rest of the address pins due to the special handling of odd addresses. Instead, this pin indicates which half of the word to transfer. The AD0 line is low (0) when the lower portion of the bus transfers a byte. Another pin, BHE (Bus High Enable) has a similar role for the upper half of the bus: it is low (0) if a byte is transferred over D15-D8. (Keep in mind that the 8086 is little-endian, so the low byte of the word is first in memory, at the even address.)

    The following table summarizes how BHE and A0 work together to select a byte or word. When accessing a byte at an odd address, A0 is odd as you might expect.

    Access typeBHEA0
    Word00
    Low byte10
    High byte01
     

  7. The cbus-adbus-shift signal is activated during T2, when a memory index is being updated, either the instruction pointer or the IND register. The address adder is used to update the register and the shift undoes the 4-bit left shift applied to the adder's output. The shift is also used for the CORR micro-instruction, which corrects the instruction pointer to account for prefetching. The CORR micro-instruction generates a "fake" short bus cycle in which the constant ROM and the address adder are used during T0. I discuss the CORR micro-instruction in more detail in this post

  8. I've made the timing diagram somewhat idealized so actions line up with the clock. In the real datasheet, all the signals are skewed by various amounts so the timing is more complicated. Moreover, if the memory device is slow, it can insert "wait" states between T3 and T4. (Cheap memory was slower and would need wait states.) Moreover, actions don't exactly line up with the clock. I'm also omitting various control signals. The datasheet has pages of timing constraints on exactly when signals can change. 

  9. Instruction prefetches don't use the IND and OPR registers. Instead, the address is specified by the Instruction Pointer (or Program Counter), and the data is stored directly into one of the instruction prefetch registers. 

  10. A single memory operation takes six clock cycles: two preparatory cycles to compute the address before the four visible cycles. However, if multiple memory operations are performed, the operations are overlapped to achieve a degree of pipelining. Specifically, the address calculation for the next memory operation takes place during the last two clock cycles of the current memory operation, saving two clock cycles. That is, for consecutive bus cycles, T3 and T4 overlap with TS and T0 of the next cycle. In other words, during T3 and T4 of one bus cycle, the memory address gets computed for the next bus cycle. This pipelining improves performance, compared to taking 6 clock cycles for each bus cycle. 

  11. The POP operation is an example of how the address adder updates a memory pointer. In this case, the stack address is moved from the Stack Pointer to the IND register in order to perform the memory read. As part of the read operation, the IND register is incremented by 2. The address is then moved from the IND register to the Stack Pointer. Thus, the address adder not only performs the segment arithmetic, but also computes the new value for the SP register.

    Note that the increment/decrement of the IND register happens after the memory operation. For stack operations, the SP must be decremented before a PUSH and incremented after a POP. The adder cannot perform a predecrement, so the PUSH instruction uses the ALU (Arithmetic/Logic Unit) to perform the decrement. 

  12. The current from an MOS transistor is proportional to the width of the gate divided by the length (the W/L ratio). Since the minimum gate length is set by the manufacturing process, increasing the width of the gate (and thus the overall size of the transistor) is how the transistor's current is increased. 

  13. Using one transistor to pull the output high and one to pull the output low is normal for CMOS gates, but it is unusual for NMOS chips like the 8086. A normal NMOS gate only has active transistor to pull the output low and uses a depletion-mode transistor to provide a weak pull-up current, similar to a pull-up resistor. I discuss superbuffers in more detail here

  14. The flip-flop is controlled by the inverted clock signal, so the output will change when the clock goes low. Meanwhile, the enable signal is dynamically latched by a MOSFET, also controlled by the inverted clock. (When the clock goes high, the previous value will be retained by the gate capacitance of the inverter.) 

  15. The superbuffer AND gates are constructed on the same principle as the regular superbuffer, except with two inputs. Two transistors in series pull the output high if both inputs are high. Two transistors in parallel pull the output low if either input is low. The low-side transistors are driven by inverted signals. I haven't drawn these signals on the schematic to simplify it.

    The superbuffer AND gates use large transistors, but not as large as the output transistors, providing an intermediate amplification stage between the small internal signals and the large external signals. Because of the high capacitance of the large output transistors, they need to be driven with larger signals. There's a lot of theory behind how transistor sizes should be scaled for maximum performance, described in the book Logical Effort. Roughly speaking, for best performance when scaling up a signal, each stage should be about 3 to 4 times as large as the previous one, so a fairly large number of stages are used (page 21). The 8086 simplifies this with two stages, presumably giving up a bit of performance in exchange for keeping the drivers smaller and simpler. 

  16. The enable circuitry has some complications. For instance, I think the address pins will be enabled if a cycle was going to be T1 for a prefetch but then got preempted by a memory operation. The bus control logic is fairly complicated. 

  17. The multiplexer is implemented with pass transistors, rather than gates. One of the pass transistors is turned on to pass that value through to the multiplexer's output. The flip-flop is implemented with two pass transistors and two inverters in alternating order. The first pass transistor is activated by the clock and the second by the complemented clock. When a pass transistor is off, its output is held by the gate capacitance of the inverter, somewhat like dynamic RAM. This is one reason that the 8086 has a minimum clock speed: if the clock is too slow, these capacitively-held values will drain away. 

  18. The status outputs on the address pins are defined as follows:
    A16/S3, A17/S4: these two status lines indicate which relocation register is being used for the memory access, i.e. the stack segment, code segment, data segment, or alternate segment. Theoretically, a system could use a different memory bank for each segment and increase the total memory capacity to 4 megabytes.
    A18/S5: indicates the status of the interrupt enable bit. In order to provide the most up-to-date value, this pin has a different circuit. It is updated at the beginning of each clock cycle, so it can change during a bus cycle. The motivation for this is presumably so peripherals can determine immediately if the interrupt enable status changes.
    A19/S6: the documentation calls this a status output, even though it always outputs a status of 0. 

  19. For a read, the enable signal is activated at the end of T3 and the beginning of T4 to transfer the data value to the AD bus. The signal is gated by the READY pin, so the read doesn't happen until the external device is ready. The 8086 will insert Tw wait states in that case. 

  20. The datasheet says that a data value must be held steady for 10 nanoseconds (TCLDX) after the clock goes low at the start of T4. 

  21. The design of the AD bus is a bit unusual since the adder will put a value on the AD bus when the clock is high, while the data pin will put a value on the AD bus when the clock is low (while otherwise precharging it when the clock is low). Usually the bus is precharged during one clock phase and all users of the bus pull it low (for a 0) during the other phase. 

  22. Federico Faggin's oral history is here. The relevant part is on pages 55 and 56. 

  23. The Texas Instruments TMS9900 (1976) used a 64-pin package for instance, as did the Motorola 68000 (1979). 

The Group Decode ROM: The 8086 processor's first step of instruction decoding

A key component of any processor is instruction decoding: analyzing a numeric opcode and figuring out what actions need to be taken. The Intel 8086 processor (1978) has a complex instruction set, making instruction decoding a challenge. The first step in decoding an 8086 instruction is something called the Group Decode ROM, which categorizes instructions into about 35 types that control how the instruction is decoded and executed. For instance, the Group Decode ROM determines if an instruction is executed in hardware or in microcode. It also indicates how the instruction is structured: if the instruction has a bit specifying a byte or word operation, if the instruction has a byte that specifies the addressing mode, and so forth.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The diagram above shows the position of the Group Decode ROM on the silicon die, as well as other key functional blocks. The 8086 chip is partitioned into a Bus Interface Unit that communicates with external components such as memory, and the Execution Unit that executes instructions. Machine instructions are fetched from memory by the Bus Interface Unit and stored in the prefetch queue registers, which hold 6 bytes of instructions. To execute an instruction, the queue bus transfers an instruction byte from the prefetch queue to the instruction register, under control of a state machine called the Loader. Next, the Group Decode ROM categorizes the instruction according to its structure. In most cases, the machine instruction is implemented in low-level microcode. The instruction byte is transferred to the Microcode Address Register, where the Microcode Address Decoder selects the appropriate microcode routine that implements the instruction. The microcode provides the micro-instructions that control the Arithmetic/Logic Unit (ALU), registers, and other components to execute the instruction.

In this blog post, I will focus on a small part of this process: how the Group Decode ROM decodes instructions. Be warned that this post gets down into the weeds, so you might want to start with one of my higher-level posts, such as how the 8086's microcode engine works.

Microcode

Most instructions in the 8086 are implemented in microcode. Most people think of machine instructions as the basic steps that a computer performs. However, many processors have another layer of software underneath: microcode. With microcode, instead of building the CPU's control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.

Microcode is only used if the Group Decode ROM indicates that the instruction is implemented in microcode. In that case, the microcode address register is loaded with the instruction and the address decoder selects the appropriate microcode routine. However, there's a complication. If the second byte of the instruction is a Mod R/M byte, the Group Decode ROM indicates this and causes a memory addressing micro-subroutine to be called.

Some simple instructions are implemented entirely in hardware and don't use microcode. These are known as 1-byte logic instructions (1BL) and are also indicated by the Group Decode ROM.

The Group Decode ROM's structure

The Group Decode ROM takes an 8-bit instruction as input, along with an interrupt signal. It produces 15 outputs that control how the instruction is handled. In this section I'll discuss the physical implementation of the Group Decode ROM; the various outputs are discussed in a later section.

Although the Group Decode ROM is called a ROM, its implementation is really a PLA (Programmable Logic Array), two levels of highly-structured logic gates.1 The idea of a PLA is to create two levels of NOR gates, each in a grid. This structure has the advantages that it implements the logic densely and is easy to modify. Although physically two levels of NOR gates, a PLA can be thought of as an AND layer followed by an OR layer. The AND layer matches particular bit patterns and then the OR layer combines multiple values from the first layer to produce arbitrary outputs.

The Group Decode ROM. This photo shows the metal layer on top of the die.

The Group Decode ROM. This photo shows the metal layer on top of the die.

Since the output values are highly structured, a PLA implementation is considerably more efficient than a ROM, since in a sense it combines multiple entries. In the case of the Group Decode ROM, using a ROM structure would require 256 columns (one for each 8-bit instruction pattern), while the PLA implementation requires just 36 columns, about 1/7 the size.

The diagram below shows how one column of the Group Decode ROM is wired in the "AND" plane. In this die photo, I removed the metal layer with acid to reveal the polysilicon and silicon underneath. The vertical lines show where the metal line for ground and the column output had been. The basic idea is that each column implements a NOR gate, with a subset of the input lines selected as inputs to the gate. The pull-up resistor at the top pulls the column line high by default. But if any of the selected inputs are high, the corresponding transistor turns on, connecting the column line to ground and pulling it low. Thus, this implements a NOR gate. However, it is more useful to think of it as an AND of the complemented inputs (via De Morgan's Law): if all the inputs are "correct", the output is high. In this way, each column matches a particular bit pattern.

Closeup of a column in the Group Decode ROM.

Closeup of a column in the Group Decode ROM.

The structure of the ROM is implemented through the silicon doping pattern, which is visible above. A transistor is formed where a polysilicon wire crosses a doped silicon region: the polysilicon forms the gate, turning the transistor on or off. At each intersection point, a transistor can be created or not, depending on the doping pattern. If a particular transistor is created, then the corresponding input must be 0 to produce a high output.

At the top of the diagram above, the column outputs are switched from the metal layer to polysilicon wires and become the inputs to the upper "OR" plane. This plane is implemented in a similar fashion as a grid of NOR gates. The plane is rotated 90 degrees, with the inputs vertical and each row forming an output.

Intermediate decoding in the Group Decode ROM

The first plane of the Group Decode ROM categorizes instructions into 36 types based on the instruction bit pattern.2 The table below shows the 256 instruction values, colored according to their categorization.3 For instance, the first blue block consists of the 32 ALU instructions corresponding to the bit pattern 00XXX0XX, where X indicates that the bit can be 0 or 1. These instructions are all decoded and executed in a similar way. Almost all instructions have a single category, that is, they activate a single column line in the Group Decode ROM. However, a few instructions activate two lines and have two colors below.

Grid of 8086 instructions, colored according to the first level of the Group Decode Rom.

Grid of 8086 instructions, colored according to the first level of the Group Decode Rom.

Note that the instructions do not have arbitrary numeric opcodes, but are assigned in a way that makes decoding simpler. Because these blocks correspond to bit patterns, there is little flexibility. One of the challenges of instruction set design for early microprocessors was to assign numeric values to the opcodes in a way that made decoding straightforward. It's a bit like a jigsaw puzzle, fitting the instructions into the 256 available values, while making them easy to decode.

Outputs from the Group Decode ROM

The Group Decode ROM has 15 outputs, one for each row of the upper half. In this section, I'll briefly discuss these outputs and their roles in the 8086. For an interactive exploration of these signals, see this page, which shows the outputs that are triggered by each instruction.

Out 0 indicates an IN or OUT instruction. This signal controls the M/IO (S2) status line, which distinguishes between a memory read/write and an I/O read/write. Apart from this, memory and I/O accesses are basically the same.

Out 1 indicates (inverted) that the instruction has a Mod R/M byte and performs a read/modify/write on its argument. This signal is used by the Translation ROM when dispatching an address handler (details). (This signal distinguishes between, say, ADD [AX],BX and MOV [AX],BX. The former both reads and writes [AX], while the latter only writes to it.)

Out 2 indicates a "group 3/4/5" opcode, an instruction where the second byte specifies the particular instruction, and thus decoding needs to wait for the second byte. This controls the loading of the microcode address register.

Out 3 indicates an instruction prefix (segment, LOCK, or REP). This causes the next byte to be decoded as a new instruction, while blocking interrupt handling.

Out 4 indicates (inverted) a two-byte ROM instruction (2BR), i.e. an instruction is handled by the microcode ROM, but requires the second byte for decoding. This is an instruction with a Mod R/M byte. This signal controls the loader indicating that it needs to fetch the second byte. This signal is almost the same as output 1 with a few differences.

Out 5 specifies the top bit for an ALU operation. The 8086 uses a 5-bit field to specify an ALU operation. If not specified explicitly by the microcode, the field uses bits 5 through 3 of the opcode. (These bits distinguish, say, an ADD instruction from AND or SUB.) This control line sets the top bit of the ALU field for instructions such as DAA, DAS, AAA, AAS, INC, and DE that fall into a different set from the "regular" ALU instructions.

Out 6 indicates an instruction that sets or clears a condition code directly: CLC, STC, CLI, STI, CLD, or STD (but not CMC). This signal is used by the flag circuitry to update the condition code.

Out 7 indicates an instruction that uses the AL or AX register, depending on the instruction's size bit. (For instance MOVSB vs MOVSW.) This signal is used by the register selection circuitry, the M register specifically.

Out 8 indicates a MOV instruction that uses a segment register. This signal is used by the register selection circuitry, the N register specifically.

Out 9 indicates the instruction has a d bit, where bit 1 of the instruction swaps the source and destination. This signal is used by the register selection circuitry, swapping the roles of the M and N registers according to the d bit.

Out 10 indicates a one-byte logic (1BL) instruction, a one-byte instruction that is implemented in logic, not microcode. These instructions are the prefixes, HLT, and the condition-code instructions. This signal controls the loader, causing it to move to the next instruction.

Out 11 indicates instructions where bit 0 is the byte/word indicator. This signal controls the register handling and the ALU functionality.

Out 12 indicates an instruction that operates only on a byte: DAA, DAS, AAA, AAS, AAM, AAD, and XLAT. This signal operates in conjunction with the previous output to select a byte versus word.

Out 13 forces the instruction to use a byte argument if instruction bit 1 is set, overriding the regular byte/word pattern. Specifically, it forces the L8 (length 8 bits) condition for the JMP direct-within-segment and the ALU instructions that are immediate with sign extension (details).

Out 14 allows a carry update. This prevents the carry from being updated by the INC and DEC operations. This signal is used by the flag circuitry.

Columns

Most of the Group Decode ROM's column signals are used to derive the outputs listed above. However, some column outputs are also used as control signals directly. These are listed below.

Column 10 indicates an immediate MOV instruction. These instructions use instruction bit 3 (rather than bit 1) to select byte versus word, because the three low bits specify the register. This signal affects the L8 condition described earlier and also causes the M register selection to be converted from a word register to a byte register if necessary.

Column 12 indicates an instruction with bits 5-3 specifying the ALU instruction. This signal causes the X register to be loaded with the bits in the instruction that specify the ALU operation. (To be precise, this signal prevents the X register from being reloaded from the second instruction byte.)

Column 13 indicates the CMC (Complement Carry) instruction. This signal is used by the flags circuitry to complement the carry flag (details).

Column 14 indicates the HLT (Halt) instruction. This signal stops instruction processing by blocking the instruction queue.

Column 31 indicates a REP prefix. This signal causes the REPZ/NZ latch to be loaded with instruction bit 0 to indicate if the prefix is REPNZ or REPZ. It also sets the REP latch.

Column 32 indicates a segment prefix. This signal loads the segment latches with the desired segment type.

Column 33 indicates a LOCK prefix. It sets the LOCK latch, locking the bus.

Column 34 indicates a CLI instruction. This signal immediately blocks interrupt handling to avoid an interrupt between the CLI instruction and when the interrupt flag bit is cleared.

Timing

One important aspect of the Group Decode ROM is that its outputs are not instantaneous. It takes a clock cycle to get the outputs from the Group Decode ROM. In particular, when instruction decoding starts, the timing signal FC (First Clock) is activated to indicate the first clock cycle. However, the Group Decode ROM's outputs are not available until the Second Clock SC.

One consequence of this is that even the simplest instruction (such as a flag operation) takes two clock cycles, as does a prefix. The problem is that even though the instruction could be performed in one clock cycle, it takes two clock cycles for the Group Decode ROM to determine that the instruction only needs one cycle. This illustrates how a complex instruction format impacts performance.

The FC and SC timing signals are generated by a state machine called the Loader. These signals may seem trivial, but there are a few complications. First, the prefetch queue may run empty, in which case the FC and/or SC signal is delayed until the prefetch queue has a byte available. Second, to increase performance, the 8086 can start decoding an instruction during the last clock cycle of the previous instruction. Thus, if the microcode indicates that there is one cycle left, the Loader can proceed with the next instruction. Likewise, for a one-byte instruction implemented in hardware (one-byte logic or 1BL), the loader proceeds as soon as possible.

The diagram below shows the timing of an ADD instruction. Each line is half of a clock cycle. Execution is pipelined: the instruction is fetched during the first clock cycle (First Clock). During Second Clock, the Group Decode ROM produces its output. The microcode address register also generates the micro-address for the instruction's microcode. The microcode ROM supplies a micro-instruction during the third clock cycle and execution of the micro-instruction takes place during the fourth clock cycle.

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro".

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro".

The Group Decode ROM's outputs during Second Clock control the decoding. Most importantly, the ADD imm instruction used microcode; it is not a one-byte logic instruction (1BL). Moreover, it does not have a Mod R/M byte, so it does not need two bytes for decoding (2BR). For a 1BL instruction, microcode execution would be blocked and the next instruction would be immediately fetched. On the other hand, for a 2BR instruction, the loader would tell the prefetch queue that it was done with the second byte during the second half of Second Clock. Microcode execution would be blocked during the third cycle and the fourth cycle would execute a microcode subroutine to determine the memory address.

For more details, see my article on the 8086 pipeline.

Interrupts

The Group Decode ROM takes the 8 bits of the instruction as inputs, but it has an additional input indicating that an interrupt is being handled. This signal blocks most of the Group Decode ROM outputs. This prevents the current instruction's outputs from interfering with interrupt handling. I wrote about the 8086's interrupt handling in detail here, so I won't go into more detail in this post.

Conclusions

The Group Decode ROM indicates one of the key differences between CISC processors (Complex Instruction Set Computer) such as the 8086 and the RISC processors (Reduced Instruction Set Computer) that became popular a few years later. A RISC instruction set is designed to make instruction decoding very easy, with a small number of uniform instruction forms. On the other hand, the 8086's CISC instruction set was designed for compactness and high code density. As a result, instructions are squeezed into the available opcode space. Although there is a lot of structure to the 8086 opcodes, this structure is full of special cases and any patterns only apply to a subset of the instructions. The Group Decode ROM brings some order to this chaotic jumble of instructions, and the number of outputs from the Group Decode ROM is a measure of the instruction set's complexity.

The 8086's instruction set was extended over the decades to become the x86 instruction set in use today. During that time, more layers of complexity were added to the instruction set. Now, an x86 instruction can be up to 15 bytes long with multiple prefixes. Some prefixes change the register encoding or indicate a completely different instruction set such as VEX (Vector Extensions) or SSE (Streaming SIMD Extensions). Thus, x86 instruction decoding is very difficult, especially when trying to decode multiple instructions in parallel. This has an impact in modern systems, where x86 processors typically have 4 complex instruction decoders while Apple's ARM processors have 8 simpler decoders; this is said to give Apple a performance benefit. Thus, architectural decisions from 45 years ago are still impacting the performance of modern processors.

I've written numerous posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected]. Thanks to Arjan Holscher for suggesting this topic.

Notes and references

  1. You might wonder what the difference is between a ROM and a PLA. Both of them produce arbitrary outputs for a set of inputs. Moreover, you can replace a PLA with a ROM or vice versa. Typically a ROM has all the input combinations decoded, so it has a separate row for each input value, i.e. 2N rows. So you can think of a ROM as a fully-decoded PLA.

    Some ROMs are partially decoded, allowing identical rows to be combined and reducing the size of the ROM. This technique is used in the 8086 microcode, for instance. A partially-decoded ROM is fairly similar to a PLA, but the technical distinction is that a ROM has only one output row active at a time, while a PLA can have multiple output rows active and the results are OR'd together. (This definition is from The Architecture of Microprocessors p117.)

    The Group Decode ROM, however, has a few cases where multiple rows are active at the same time (for instance the segment register POP instructions). Thus, the Group Decode ROM is technically a PLA and not a ROM. This distinction isn't particularly important, but you might find it interesting. 

  2. The Group Decode ROM has 38 columns, but two columns (11 and 35) are unused. Presumably, these were provided as spares in case a bug fix or modification required additional decoding. 

  3. Like the 8008 and 8080, the 8086's instruction set was designed around a 3-bit octal structure. Thus, the 8086 instruction set makes much more sense if viewed in octal instead of hexadecimal. The table below shows the instructions with an octal organization. Each 8×8 block uses the two low octal digits, while the four large blocks are positioned according to the top octal digit (labeled). As you can see, the instruction set has a lot of structure that is obscured in the usual hexadecimal table.

    The 8086 instruction set, put in a table according to the octal opcode value.

    The 8086 instruction set, put in a table according to the octal opcode value.

    For details on the octal structure of the 8086 instruction set, see The 80x86 is an Octal Machine

Reverse-engineering the division microcode in the Intel 8086 processor

While programmers today take division for granted, most microprocessors in the 1970s could only add and subtract — division required a slow and tedious loop implemented in assembly code. One of the nice features of the Intel 8086 processor (1978) was that it provided machine instructions for integer multiplication and division. Internally, the 8086 still performed a loop, but the loop was implemented in microcode: faster and transparent to the programmer. Even so, division was a slow operation, about 50 times slower than addition.

I recently examined multiplication in the 8086, and now it's time to look at the division microcode.1 (There's a lot of overlap with the multiplication post so apologies for any deja vu.) The die photo below shows the chip under a microscope. I've labeled the key functional blocks; the ones that are important to this post are darker. At the left, the ALU (Arithmetic/Logic Unit) performs the arithmetic operations at the heart of division: subtraction and shifts. Division also uses a few special hardware features: the X register, the F1 flag, and a loop counter. The microcode ROM at the lower right controls the process.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

Microcode

Like most instructions, the division routines in the 8086 are implemented in microcode. Most people think of machine instructions as the basic steps that a computer performs. However, many processors have another layer of software underneath: microcode. With microcode, instead of building the CPU's control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. This is especially useful for a machine instruction such as division, which performs many steps in a loop.

Each micro-instruction in the 8086 is encoded into 21 bits as shown below. Every micro-instruction moves data from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits depends on the type field and can be anything from an ALU operation to a memory read or write to a change of microcode control flow. Thus, an 8086 micro-instruction typically does two things in parallel: the move and the action. For more about 8086 microcode, see my microcode blog post.

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

A few details of the ALU (Arithmetic/Logic Unit) operations are necessary to understand the division microcode. The ALU has three temporary registers that are invisible to the programmer: tmpA, tmpB, and tmpC. An ALU operation takes its first argument from the specified temporary register, while the second argument always comes from tmpB. An ALU operation requires two micro-instructions. The first micro-instruction specifies the ALU operation and source register, configuring the ALU. For instance, ADD tmpA to add tmpA to the default tmpB. In the next micro-instruction (or a later one), the ALU result can be accessed through a register called Σ (Sigma) and moved to another register.

The carry flag plays a key role in division. The carry flag is one of the programmer-visible status flags that is set by arithmetic operations, but it is also used by the microcode. For unsigned addition, the carry flag is set if there is a carry out of the word (or byte). For subtraction, the carry flag indicates a borrow, and is set if the subtraction requires a borrow. Since a borrow results if you subtract a larger number from a smaller number, the carry flag also indicates the "less than" condition. The carry flag (and other status flags) are only updated if micro-instruction contains the F bit.

The RCL (Rotate through Carry, Left) micro-instruction is heavily used in the division microcode.2 This operation shifts the bits in a 16-bit word, similar to the << bit-shift operation in high-level languages, but with an additional feature. Instead of discarding the bit on the end, that bit is moved into the carry flag. Meanwhile, the bit formerly in the carry flag moves into the word. You can think of this as rotating the bits while treating the carry flag as a 17th bit of the word. (The RCL operation can also act on a byte.)

The rotate through carry left micro-instruction.

The rotate through carry left micro-instruction.

These shifts perform an important part of the division process since shifting can be viewed as multiplying or dividing by two. RCL also provides a convenient way to move the most-significant bit to the carry flag, where it can be tested for a conditional jump. (This is important because the top bit is used as the sign bit.) Another important property is that performing RCL on a lower word and then RCL on an upper word will perform a 32-bit shift, since the high bit of the lower word will be moved into the low bit of the upper word via the carry bit. Finally, the shift moves the quotient bit from the carry into the register.

Binary division

The division process in the 8086 is similar to grade-school long division, except in binary instead of decimal. The diagram below shows the process: dividing 67 (the dividend) by 9 (the divisor) yields the quotient 7 at the top and the remainder 4 at the bottom. Long division is easier in binary than decimal because you don't need to guess the right quotient digit. Instead, at each step you either subtract the divisor (appropriately shifted) or subtract nothing. Note that although the divisor is 4 bits in this example, the subtractions use 5-bit values. The need for an "extra" bit in division will be important in the discussion below; 16-bit division needs a 17-bit value.

 0111
100101000011
 -0000
 10000
 -1001
 01111
 -1001
 01101
 -1001
 0100

Instead of shifting the divisor to the right each step, the 8086's algorithm shifts the quotient and the current dividend to the left each step. This trick reduces the register space required. Dividing a 32-bit number (the dividend) by a 16-bit number yields a 16-bit result, so it seems like you'd need four 16-bit registers in total. The trick is that after each step, the 32-bit dividend gets one bit smaller, while the result gets one bit larger. Thus, the dividend and the result can be packed together into 32 bits. At the end, what's left of the dividend is the 16-bit remainder. The table below illustrates this process for a sample dividend (blue) and quotient (green).3 At the end, the 16-bit blue value is the remainder.

dividendquotient
00001111000000001111111100000000
00001110000001011111111000000000
00001100000011111111110000000000
00001000001000111111100000000000
00000000010010111111000000000000
00000000100101111110000000000001
00000001001011111100000000000011
00000010010111111000000000000111
00000100101111110000000000001111
00001001011111100000000000011111
00000011000000000000000000111110
00000110000000000000000001111101
00001100000000000000000011111011
00001000000001000000000111110110
00000000000011000000001111101100
00000000000110000000011111011001
00000000001100000000111110110011

The division microcode

The 8086 has four division instructions to handle signed and unsigned division of byte and word operands. I'll start by describing the microcode for the unsigned word division instruction DIV, which divides a 32-bit dividend by a 16-bit divisor. The dividend is supplied in the AX and DX registers while the divisor is specified by the source operand. The 16-bit quotient is returned in AX and the 16-bit remainder in DX. For a divide-by-zero, or if the quotient is larger than 16 bits, a type 0 "divide error" interrupt is generated.

CORD: The core division routine

The CORD microcode subroutine below implements the long-division algorithm for all division instructions; I think CORD stands for Core Divide. At entry, the arguments are in the ALU temporary registers: tmpA/tmpC hold the 32-bit dividend, while tmpB holds the 16-bit divisor. (I'll explain the configuration for byte division later.) Each cycle of the loop shifts the values and then potentially subtracts the divisor. One bit is appended to the quotient to indicate whether the divisor was subtracted or not. At the end of the loop, whatever is left of the dividend is the remainder.

Each micro-instruction specifies a register move on the left, and an action on the right. The moves transfer words between the visible registers and the ALU's temporary registers, while the actions are mostly ALU operations or control flow. As is usually the case with microcode, the details are tricky. The first three lines below check if the division will overflow or divide by zero. The code compares tmpA and tmpB by subtracting tmpB, discarding the result, but setting the status flags (F). If the upper word of the divisor is greater or equal to the dividend, the division will overflow, so execution jumps to INT0 to generate a divide-by-zero interrupt.4 (This handles both the case where the dividend is "too large" and the divide-by-0 case.) The number of loops in the division algorithm is controlled by a special-purpose loop counter. The MAXC micro-instruction initializes the counter to 7 or 15, for a byte or word divide instruction respectively.

   move        action
             SUBT tmpA   CORD: set up compare
Σ → no dest  MAXC F       compare, set up counter, update flags
             JMP NCY INT0 generate interrupt if overflow
             RCL tmpC    3: main loop: left shift tmpA/tmpC
Σ → tmpC     RCL tmpA     
Σ → tmpA     SUBT tmpA    set up compare/subtract
             JMPS CY 13   jump if top bit of tmpA was set
Σ → no dest  F            compare, update flags
             JMPS NCY 14  jump for subtract
             JMPS NCZ 3   test counter, loop back to 3
             RCL tmpC    10: done:
Σ → tmpC                  shift last bit into tmpC
Σ → no dest  RTN          done: get top bit, return

             RCY         13: reset carry
Σ → tmpA     JMPS NCZ 3  14: subtract, loop
             JMPS 10     done, goto 10

The main loop starts at 3. The tmpC and tmpA registers are shifted left. This has two important side effects. First, the old carry bit (which holds the latest quotient bit) is shifted into the bottom of tmpC. Second, the top bit of tmpA is shifted into the carry bit; this provides the necessary "extra" bit for the subtraction below. Specifically, if the carry (the "extra" tmpA bit) is set, tmpB can be subtracted, which is accomplished by jumping to 13. Otherwise, the code compares tmpA and tmpB by subtracting tmpB, discarding the result, and updating the flags (F). If there was no borrow/carry (tmpA ≥ tmpB), execution jumps to 14 to subtract. Otherwise, the internal loop counter is decremented and control flow goes back to the top of the loop if not done (NCZ, Not Counter Zero). If the loop is done, tmpC is rotated left to pick up the last quotient bit from the carry flag. Then a second rotate of tmpC is performed but the result is discarded; this puts the top bit of tmpC into the carry flag for use later in POSTIDIV. Finally, the subroutine returns.

The subtraction path is 13 and 14, which subtract tmpB from tmpA by storing the result (Σ) in tmpA. This path resets the carry flag for use as the quotient bit. As in the other path, the loop counter is decremented and tested (NCZ) and execution either continues back at 3 or finishes at 10.

One complication is that the carry bit is the opposite of the desired quotient bit. Specifically, if tmpA < tmpB, the comparison generates a borrow so the carry flag is set to 1. In this case, the desired quotient bit is 0 and no subtraction takes place. But if tmpA ≥ tmpB, the comparison does not generate a borrow (so the carry flag is set to 0), the code subtracts tmpB, and the desired quotient bit is 1. Thus, tmpC ends up holding the complement of the desired result; this is fixed later.

The microcode is carefully designed to pack the divide loop into a small number of micro-instructions. It uses the registers and the carry flag in tricky ways, using the carry flag to hold the top bit of tmpA, the comparison result, and the generated quotient bit. This makes the code impressively dense but tricky to understand.

The top-level division microcode

Now I'll pop up a level and take a look at the top-level microcode (below) that implements the DIV and IDIV machine instructions. The first three instructions load tmpA, tmpC, and tmpB from the specified registers. (The M register refers to the source specified in the instruction, either a register or a memory location.) Next, the X0 condition tests bit 3 of the instruction, which in this case distinguishes DIV from IDIV. For signed division (IDIV), the microcode calls PREIDIV, which I'll discuss below. Next, the CORD micro-subroutine discussed above is called to perform the division.

DX → tmpA                      iDIV rmw: load tmpA, tmpC, tmpB 
AX → tmpC    RCL tmpA           set up RCL left shift operation
M → tmpB     CALL X0 PREIDIV    set up integer division if IDIV
             CALL CORD          call CORD to perform division 
             COM1 tmpC          set up to complement the quotient 
DX → tmpB    CALL X0 POSTIDIV   get original dividend, handle IDIV
Σ → AX       NXT                store updated quotient
tmpA → DX    RNI                store remainder, run next instruction

As discussed above, the quotient in tmpC needs to be 1's-complemented, which is done with COM1. For IDIV, the micro-subroutine POSTIDIV sets the signs of the results appropriately. The results are stored in the AX and DX registers. The NXT micro-operation indicates the next micro-instruction is the last one, directing the microcode engine to start the next machine instruction. Finally, RNI directs the microcode engine to run the next machine instruction.

8-bit division

The 8086 has separate opcodes for 8-bit division. The 8086 supports many instructions with byte and word versions, using 8-bit or 16-bit arguments respectively. In most cases, the byte and word instructions use the same microcode, with the ALU and register hardware using bytes or words based on the instruction. In the case of division, the shift micro-operations act on tmpA and tmpC as 8-bit registers rather than 16-bit registers. Moreover, the MAXC micro-operation initializes the internal counter to 7 rather than 15. Thus, the same CORD microcode is used for word and byte division, but the number of loops and the specific operations are changed by the hardware.

The diagram below shows the tmpA and tmpC registers during each step of dividing 0x2345 by 0x34. Note that the registers are treated as 8-bit registers. The divisor (blue) steadily shrinks with the quotient (green) taking its place. At the end, the remainder is 0x41 (blue) and the quotient is 0xad, complement of the green value.

tmpAtmpC
00000000001000110000000001000101
00000000000100100000000010001010
00000000001001010000000000010101
00000000000101100000000000101010
00000000001011000000000001010101
00000000001001000000000010101010
00000000000101010000000001010100
00000000001010100000000010101001
00000000001000010000000001010010

Although the CORD routine is shared for byte and word division, the top-level microcode is different. In particular, the byte and word division instructions use different registers, requiring microcode changes. The microcode below is the top-level code for byte division. It is almost the same as the microcode above, except it uses the top and bottom bytes of the accumulator (AH and AL) rather than the AX and DX registers.

AH → tmpA                     iDIV rmb: get arguments
AL → tmpC    RCL tmpA          set up RCL left shift operation
M → tmpB     CALL X0 PREIDIV   handle signed division if IDIV
             CALL CORD         call CORD to perform division
             COM1 tmpC         complement the quotient
AH → tmpB    CALL X0 POSTIDIV  handle signed division if IDIV
Σ → AL       NXT               store quotient
tmpA → AH    RNI               store remainder, run next instruction

Signed division

The 8086 (like most computers) represents signed numbers using a format called two's complement. While a regular byte holds a number from 0 to 255, a signed byte holds a number from -128 to 127. A negative number is formed by flipping all the bits (known as the one's complement) and then adding 1, yielding the two's complement value. For instance, +5 is 0x05 while -5 is 0xfb. (Note that the top bit of a number is set for a negative number; this is the sign bit.) The nice thing about two's complement numbers is that the same addition and subtraction operations work on both signed and unsigned values. Unfortunately, this is not the case for signed multiplication and division.

The 8086 has separate IDIV (Integer Divide) instructions to perform signed division. The 8086 performs signed division by converting the arguments to positive values, performing unsigned division, and then negating the results if necessary. As shown earlier, signed and unsigned division both use the same top-level microcode and the microcode conditionally calls some subroutines for signed division. These additional subroutines cause a significant performance penalty, making signed division over 20 cycles slower than unsigned division. I will discuss those micro-subroutines below.

PREIDIV

The first subroutine for signed division is PREIDIV, performing preliminary operations for integer division. It converts the two arguments, stored in tmpA/tmpC and tmpB, to positive values. It keeps track of the signs using an internal flag called F1, toggling this flag for each negative argument. This conveniently handles the rule that two negatives make a positive since complementing the F1 flag twice will clear it. The point of this is that the main division code (CORD) only needs to handle unsigned division.

The microcode below implements PREIDIV. First it tests if tmpA is negative, but the 8086 does not have a microcode condition to directly test the sign of a value. Instead, the microcode determines if a value is negative by shifting the value left, which moves the top (sign) bit into the carry flag. The conditional jump (NCY) then tests if the carry is clear, jumping if the value is non-negative. If tmpA is negative, execution falls through to negate the first argument. This is tricky because the argument is split between the tmpA and tmpC registers. The two's complement operation (NEG) is applied to the low word, while either 2's complement or one's complement (COM1) is applied to the upper word, depending on the carry for mathematical reasons.5 The F1 flag is complemented to keep track of the sign. (The multiplication process reuses most of this code, starting at the NEGATE entry point.)

Σ → no dest             PREIDIV: shift tmpA left
             JMPS NCY 7  jump if non-negative
             NEG tmpC   NEGATE: negate tmpC
Σ → tmpC     COM1 tmpA F maybe complement tmpA
             JMPS CY 6  
             NEG tmpA    negate tmpA if there's no carry
Σ → tmpA     CF1        6: toggle F1 (sign)

             RCL tmpB  7: test sign of tmpB
Σ → no dest  NEG tmpB    maybe negate tmpB
             JMPS NCY 11 skip if tmpB positive
Σ → tmpB     CF1 RTN     else negate tmpB, toggle F1 (sign)
             RTN        11: return

The next part of the code, starting at 7, negates tmpB (the divisor) if it is negative. Since the divisor is a single word, this code is simpler. As before, the F1 flag is toggled if tmpB is negative. At the end, both arguments (tmpA/tmpC and tmpB) are positive, and F1 indicates the sign of the result.

POSTIDIV

After computing the result, the POSTIDIV routine is called for signed division. The routine first checks for a signed overflow and raises a divide-by-zero interrupt if so. Next, the routine negates the quotient and remainder if necessary.6

In more detail, the CORD routine left the top bit of tmpC (the complemented quotient) in the carry flag. Now, that bit is tested. If the carry bit is 0 (NCY), then the top bit of the quotient is 1 so the quotient is too big to fit in a signed value.7 In this case, the INT0 routine is executed to trigger a type 0 interrupt, indicating a divide overflow. (This is a rather roundabout way of testing the quotient, relying on a carry bit that was set in a previous subroutine.)

             JMP NCY INT0 POSTIDIV: if overflow, trigger interrupt
             RCL tmpB      set up rotate of tmpB
Σ → no dest  NEG tmpA      get sign of tmpB, set up negate of tmpA
             JMPS NCY 5    skip if tmpB non-negative
Σ → tmpA                   otherwise negate tmpA (remainder)
             INC tmpC     5: set up increment
             JMPS F1 8     test sign flag, skip if set
             COM1 tmpC     otherwise set up complement
             CCOF RTN     8: clear carry and overflow flags, return

Next, tmpB (the divisor) is rotated to see if it is negative. (The caller loaded tmpB with the original divisor, replacing the dividend that was in tmpB previously.) If the divisor is negative, tmpA (the remainder) is negated. This implements the 8086 rule that the sign of the remainder matches the sign of the divisor.

The quotient handling is a bit tricky. Recall that tmpC holds the complemented quotient. the F1 flag is set if the result should be negative. In that case, the complemented quotient needs to be incremented by 1 (INC) to convert from 1's complement to 2's complement. On the other hand, if the quotient should be positive, 1's-complementing tmpC (COM1) will yield the desired positive quotient. In either case, the ALU is configured in POSTIDIV, but the result will be stored back in the main routine.

Finally, the CCOF micro-operation clears the carry and overflow flags. Curiously, the 8086 documentation declares that the status flags are undefined following IDIV, but the microcode explicitly clears the carry and overflow flags. I assume that the flags were cleared in analogy with MUL, but then Intel decided that this wasn't useful so they didn't document it. (Documenting this feature would obligate them to provide the same functionality in later x86 chips.)

The hardware for division

For the most part, the 8086 uses the regular ALU addition and shifts for the division algorithm. Some special hardware features provide assistance. In this section, I'll look at this hardware.

Loop counter

The 8086 has a 4-bit loop counter for multiplication and division. This counter starts at 7 for byte division and 15 for word division, based on the low bit of the opcode. This loop counter allows the microcode to decrement the counter, test for the end, and perform a conditional branch in one micro-operation. The counter is implemented with four flip-flops, along with logic to compute the value after decrementing by one. The MAXC (Maximum Count) micro-instruction sets the counter to 7 or 15 for byte or word operations respectively. The NCZ (Not Counter Zero) micro-instruction has two actions. First, it performs a conditional jump if the counter is nonzero. Second, it decrements the counter.

The F1 flag

Signed multiplication and division use an internal flag called F18 to keep track of the sign. The F1 flag is toggled by microcode through the CF1 (Complement F1) micro-instruction. The F1 flag is implemented with a flip-flop, along with a multiplexer to select the value. It is cleared when a new instruction starts, set by a REP prefix, and toggled by the CF1 micro-instruction. The diagram below shows how the F1 latch and the loop counter appear on the die. In this image, the metal layer has been removed, showing the silicon and the polysilicon wiring underneath.

The counter and F1 latch as they appear on the die. The latch for the REP state is also here.

The counter and F1 latch as they appear on the die. The latch for the REP state is also here.

X register

The division microcode uses an internal register called the X register to distinguish between the DIV and IDIV instructions. The X register is a 3-bit register that holds the ALU opcode, indicated by bits 5–3 of the instruction.9 Since the instruction is held in the Instruction Register, you might wonder why a separate register is required. The motivation is that some opcodes specify the type of ALU operation in the second byte of the instruction, the ModR/M byte, bits 5–3.10 Since the ALU operation is sometimes specified in the first byte and sometimes in the second byte, the X register was added to handle both these cases.

For the most part, the X register indicates which of the eight standard ALU operations is selected (ADD, OR, ADC, SBB, AND, SUB, XOR, CMP). However, a few instructions use bit 0 of the X register to distinguish between other pairs of instructions. For instance, it distinguishes between MUL and IMUL, DIV and IDIV, CMPS and SCAS, MOVS and LODS, or AAA and AAS. While these instruction pairs may appear to have arbitrary opcodes, they have been carefully assigned so the microcode can distinguish them.

The implementation of the X register is straightforward, consisting of three flip-flops to hold the three bits of the instruction. The flip-flops are loaded from the prefetch queue bus during First Clock and during Second Clock for appropriate instructions, as the instruction bytes travel over the bus. Testing bit 0 of the X register with the X0 condition is supported by the microcode condition evaluation circuitry, so it can be used for conditional jumps in the microcode.

Algorithmic and historical context

As you can see from the microcode, division is a complicated and relatively slow process. On the 8086, division takes up to 184 clock cycles to perform all the microcode steps. (In comparison, two registers can be added in 3 clock cycles.) Multiplication and division both loop over the bits, performing repeated addition or subtraction respectively. But division requires a decision (subtract or not?) at each step, making it even slower, about half the speed of multiplication.11

Various algorithms have been developed to speed up division. Rather than performing long division one bit at a time, you can do long division in, say, base 4, producing two quotient bits in each step. As with decimal long division, the tricky part is figuring out what digit to select. The SRT algorithm (1957) uses a small lookup table to estimate the quotient digit from a few bits of the divisor and dividend. The clever part is that the selected digit doesn't need to be exactly right at each step; the algorithm will self-correct after a wrong "guess". The Pentium processor (1993) famously had a floating point division bug due to a few missing values in the SRT table. This bug cost Intel $475 million to replace the faulty processors.

Intel's x86 processors steadily improved divide performance. The 80286 (1982) performed a word divide in 22 clocks, about 6 times as fast as the 8086. In the Penryn architecture (2007), Intel upgraded from Radix-4 to Radix-16 division. Rather than having separate integer and floating-point hardware, integer divides were handled through the floating point divider. Although modern Intel processors have greatly improved multiplication and division compared to the 8086, division is still a relatively slow operation. While a Tiger Lake (2020) processor can perform an integer multiplication every clock cycle (with a latency of 3 cycles), division is much slower and can only be done once every 6-10 clock cycles (details).

I've written numerous posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected].

Notes and references

  1. My microcode analysis is based on Andrew Jenner's 8086 microcode disassembly

  2. The 8086 patent and Andrew Jenner's microcode use the name LRCY (Left Rotate through Carry) instead of RCL. I figure that RCL will be more familiar to people because of the corresponding machine instruction. 

  3. In the dividend/quotient table, the tmpA register is on the left and the tmpC register is on the right. 0x0f00ff00 divided by 0x0ffc yielding the remainder 0x0030 (blue) and quotient 0xf04c (green). (The green bits are the complement of the quotient due to implementation in the 8086.) 

  4. I described the 8086's interrupt circuitry in detail in this post

  5. The negation code is a bit tricky because the result is split across two words. In most cases, the upper word is bitwise complemented. However, if the lower word is zero, then the upper word is negated (two's complement). I'll demonstrate with 16-bit values to keep the examples small. The number 257 (0x0101) is negated to form -257 (0xfeff). Note that the upper byte is the one's complement (0x01 vs 0xfe) while the lower byte is two's complement (0x01 vs 0xff). On the other hand, the number 256 (0x0100) is negated to form -256 (0xff00). In this case, the upper byte is the two's complement (0x01 vs 0xff) and the lower byte is also the two's complement (0x00 vs 0x00).

    (Mathematical explanation: the two's complement is formed by taking the one's complement and adding 1. In most cases, there won't be a carry from the low byte to the upper byte, so the upper byte will remain the one's complement. However, if the low byte is 0, the complement is 0xff and adding 1 will form a carry. Adding this carry to the upper byte yields the two's complement of that byte.)

    To support multi-word negation, the 8086's NEG instruction clears the carry flag if the operand is 0, and otherwise sets the carry flag. (This is the opposite of the above because subtractions (including NEG) treat the carry flag as a borrow flag, with the opposite meaning.) The microcode NEG operation has identical behavior to the machine instruction, since it is used to implement the machine instruction.

    Thus to perform a two-word negation, the microcode negates the low word (tmpC) and updates the flags (F). If the carry is set, the one's complement is applied to the upper word (tmpA). But if the carry is cleared, the two's complement is applied to tmpA. 

  6. There is a bit of ambiguity with the quotient and remainder of negative numbers. For instance, consider -27 ÷ 7. -27 = 7 × -3 - 6 = 7 * -4 + 1. So you could consider the result to be a quotient of -3 and remainder of -6, or a quotient of -4 and a remainder of 1. The 8086 uses the rule that the remainder will have the same sign as the dividend, so the first result would be used. The advantage of this rule is that you can perform unsigned division and adjust the signs afterward:
    27 ÷ 7 = quotient 3, remainder 6.
    -27 ÷ 7 = quotient -3, remainder -6.
    27 ÷ -7 = quotient -3, remainder 6.
    -27 ÷ -7 = quotient 3, remainder -6.

    This rule is known as truncating division, but some languages use different approaches such as floored division, rounded division, or Euclidean division. Wikipedia has details. 

  7. The signed overflow condition is slightly stricter than necessary. For a word division, the 16-bit quotient is restricted to the range -32767 to 32767. However, a 16-bit signed value can take on the values -32768 to 32767. Thus, a quotient of -32768 fits in a 16-bit signed value even though the 8086 considers it an error. This is a consequence of the 8086 performing unsigned division and then updating the sign if necessary. 

  8. The internal F1 flag is also used to keep track of a REP prefix for use with a string operation. I discussed string operations and the F1 flag in this post

  9. Curiously, the 8086 patent states that the X register is a 4-bit register holding bits 3–6 of the byte (col. 9, line 20). But looking at the die, it is a 3-bit register holding bits 3–5 of the byte. 

  10. Some instructions are specified by bits 5–3 in the ModR/M byte rather than in the first opcode byte. The motivation is to avoid wasting bits for instructions that use a ModR/M byte but don't need a register specification. For instance, consider the instruction ADD [BX],0x1234. This instruction uses a ModR/M byte to specify the memory address. However, because it uses an immediate operand, it does not need the register specification normally provided by bits 5–3 of the ModR/M byte. This frees up the bits to specify the instruction. From one perspective, this is an ugly hack, while from another perspective it is a clever optimization. 

  11. Even the earliest computers such as ENIAC (1945) usually supported multiplication and division. However, early microprocessors did not provide multiplication and division instructions due to the complexity of these instructions. Instead, the programmer would need to write an assembly code loop, which was very slow. Early microprocessors often had binary-coded decimal instructions that could perform additions and subtractions in decimal. One motivation for these instructions was that converting between binary and decimal was extremely slow due to the need for multiplication and division. Instead, it was easier and faster to keep the values as decimal if that was how they were displayed.

    The Texas Instruments TMS9900 (1976) was one of the first microprocessors with multiplication and division instructions. Multiply and divide instructions remained somewhat controversial on RISC (Reduced Instruction-Set Computer) processors due to the complexity of these instructions. The early ARM processors, for instance, did not support multiplication and division. Multiplication was added to ARMv2 (1986) but most ARM processors still don't have integer division. The popular open-source RISC-V architecture (2015) doesn't include integer multiply and divide by default, but provides them as an optional "M" extension.

    The 8086's algorithm is designed for simplicity rather than speed. It is a "restoring" algorithm that checks before subtracting to ensure that the current term is always positive. This can require two ALU operations (comparison and subtraction) per cycle. A slightly more complex approach is a "nonrestoring" algorithm that subtracts even if it yields a negative term, and then adds during a later loop iteration.