
The Intel 8086 processor's registers: from chip to transistors

The Intel 8086 microprocessor is one of the most influential chips ever created; it led to the x86 architecture that dominates desktop and server computing today. I've been reverse-engineering the 8086 from die photos, and in this post I discuss how its register file is implemented.

The 8086 die, showing the register storage. The upper registers are used by the Bus Interface Unit for memory accesses, while the general-purpose lower registers are used by the Execution Unit. The instruction buffer is a 6-byte queue of prefetched instructions.

The photo above shows the silicon die of the 8086 processor under a microscope. The metal layer on top of the chip is visible, with the silicon hidden underneath. Around the outside edge, bond wires connect pads on the die to the chip's 40 external pins.

The highlighted region indicates the 8086's fifteen 16-bit registers and six bytes of instruction prefetch queue.1 Registers take up a significant portion of the die, even though they are just 36 bytes in total. Due to space limitations, early microprocessors had a relatively small number of registers; in comparison, a modern processor chip has kilobytes of registers and megabytes of cache storage.2

How a register is implemented in silicon

I'll start by explaining how the 8086 is built from NMOS transistors. Then I'll explain how an inverter is constructed, how a single bit is stored using inverters, and how a register is constructed.

The 8086 and other chips of that era were built from a type of transistor called NMOS. These chips consisted of a silicon substrate, which was "doped" by diffusion of arsenic or boron to form transistors. Above the silicon, polysilicon wiring created the gates of the transistors and wired components together. Finally, a metal layer on top provided more wiring. (Modern processors, in comparison, use CMOS technology, which combines NMOS and PMOS transistors, and they have many metal layers.)

The schematic below shows an inverter built from an NMOS transistor and a resistor.3 With a low input, the transistor is off, so the pull-up resistor pulls the output high. With a high input, the transistor turns on, connecting ground and the output, pulling the output low. Thus, the input signal is inverted.

This schematic shows how an inverter is created from a transistor and resistor. The photo shows the implementation on the chip. The metal layer was removed to show the polysilicon and silicon underneath.

The photo above shows how an inverter is physically constructed in the 8086. The pinkish regions are conductive doped silicon and the sparkly copper-colored lines are polysilicon on top. A transistor is created where polysilicon crosses silicon: the polysilicon forms the transistor's gate, while the silicon regions on either side are the transistor's source and drain. The large polysilicon rectangle forms the pull-up resistor between +5 volts and the output. Thus, the chip's circuitry matches the inverter schematic. Under a microscope, circuits such as this inverter are visible and can be reverse-engineered.

The building block for the registers is two inverters in a feedback loop, storing a single bit, as shown below. If the top wire has a 0, the right inverter will output a 1 to the bottom wire. The left inverter will then output a 0 to the top wire, completing the cycle. Thus, the circuit is stable and will "remember" the 0. Likewise, if the top wire is a 1, this will get inverted to a 0 at the bottom wire, and back to a 1 at the top. Thus, this circuit can store either a 0 or a 1, forming a 1-bit memory.
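A minimal Python sketch of this feedback loop, assuming ideal inverters with no timing or analog behavior. The names `inverter` and `settle` are illustrative only; the point is that either starting value survives repeated trips around the loop.

```python
def inverter(x: int) -> int:
    """Model an NMOS inverter: a high input pulls the output low, and vice versa."""
    return 0 if x else 1


def settle(top: int, steps: int = 4):
    """Propagate a value around the two-inverter loop a few times.

    `top` is the value on the top wire. The right inverter drives the bottom
    wire from it, and the left inverter drives the top wire from the bottom.
    Returns the (top, bottom) wire values after `steps` trips around the loop.
    """
    bottom = inverter(top)
    for _ in range(steps):
        bottom = inverter(top)   # right inverter: top -> bottom
        top = inverter(bottom)   # left inverter: bottom -> top
    return top, bottom


for stored in (0, 1):
    print(stored, settle(stored))
```

Both starting values come back unchanged on the top wire, with the complement on the bottom wire, which is what makes the pair usable as a 1-bit memory.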

In the 8086, two coupled inverters hold a single bit in the register. This circuit is stable in either the 0 or 1 state.

Three transistors are added to make a usable register cell from the inverter pair.4 One transistor selects the cell for reading, another transistor selects the cell for writing, and the third amplifies the signal when reading. In the center of the schematic below, two inverters store the bit. To read the bit, the read line is energized. This connects the inverter output to the bit line through the amplifying transistor. To write the bit, the write line is energized, connecting the bit line to the inverters. By putting a high-current 0 or 1 signal on the bit line, the inverters (and thus the stored bit) are forced to the desired value. Note that the bit line is used for both reading and writing.
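The read and write behavior described above can be sketched as a small behavioral model. This is an assumption-laden simplification: the class `RegisterCell` and its `step` method are hypothetical names, and the model ignores electrical details such as signal polarity, precharging, and drive strength.

```python
from typing import Optional


class RegisterCell:
    """Behavioral model of a one-bit cell: an inverter pair plus read/write access."""

    def __init__(self, bit: int = 0) -> None:
        self.bit = bit  # value held by the inverter pair

    def step(self, read: bool, write: bool, bit_line: Optional[int]) -> Optional[int]:
        """Return the bit-line value after this cell has acted on it.

        `bit_line` is the value driven onto the line from outside the cell,
        or None if nothing is driving it.
        """
        if write and bit_line is not None:
            # The write line connects the bit line to the inverters; the
            # strong bit-line signal overrides the stored value.
            self.bit = bit_line
        if read:
            # The read line puts the stored value onto the shared bit line.
            return self.bit
        return bit_line


cell = RegisterCell()
cell.step(read=False, write=True, bit_line=1)            # store a 1
print(cell.step(read=True, write=False, bit_line=None))  # read it back
```

The single `bit_line` argument reflects the shared line: the same wire carries the value in on a write and out on a read.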

Schematic diagram of a register cell storing a single bit. The register file is built from an array of these cells.

The register file consists of a matrix of register cells like the one above. The matrix is 16 cells wide since registers are 16 bits wide. Each register is arranged horizontally, so a read line or write line selects all the cells for a particular register. The 16 vertical bit lines form a bus, so all 16 bits in the selected register are read or written in parallel.
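As a rough illustration of the row and column organization, here is a sketch in Python. The `RegisterFile` class is hypothetical, and the mapping of bit `c` to column `c` is an assumption made for simplicity; it is not the physical column order on the die.

```python
WIDTH = 16  # bits per register: one column (and one bit line) per bit


class RegisterFile:
    """A matrix of one-bit cells: each row is a register, each column a bit line."""

    def __init__(self, rows: int = 8) -> None:
        # cells[r][c] is the bit stored in row r, column c
        self.cells = [[0] * WIDTH for _ in range(rows)]

    def write(self, row: int, value: int) -> None:
        """Energize the write line of `row`; the bus drives all 16 bit lines."""
        bus = [(value >> c) & 1 for c in range(WIDTH)]
        self.cells[row] = bus

    def read(self, row: int) -> int:
        """Energize the read line of `row`; each cell drives its own bit line."""
        bus = self.cells[row]
        return sum(bit << c for c, bit in enumerate(bus))


rf = RegisterFile()
rf.write(3, 0xBEEF)
print(hex(rf.read(3)))
```

Selecting a row touches all 16 cells at once, so a whole word moves across the bus in a single access.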

The photo below zooms in on the 8086's general-purpose register file, showing the matrix of register cells: 16 columns and 8 rows for eight 16-bit registers. It then zooms in on a single register cell in the register file. I'll now explain how this cell is implemented.

Die photo of the 8086, zooming in on the lower register file (eight 16-bit registers) and then a single register cell. The metal and polysilicon were removed for this photo to show the silicon structures.

The 8086 is constructed from doped silicon and polysilicon wiring with metal wiring on top. The left photo below shows the vertical metal wiring of a register cell. The ground, power, and bit line wires are indicated. (The remaining wire crosses the register file but isn't connected to it.) In the right photo, the metal layer has been dissolved to show the polysilicon and silicon underneath. The read and write lines are horizontal polysilicon wires. (Because the chip has only one layer of metal, the register uses metal for the vertical lines and polysilicon for the horizontal lines so they don't run into each other.) The connections (called vias) between the metal and the silicon are visible as brighter circles in the metal photo and as circular spots in the silicon photo.

A register storage cell. The photo on the left shows the metal layer, while the photo on the right shows the corresponding polysilicon and silicon underneath. The bright circles on the metal layer are vias connected to the circles on the silicon.

The diagram below shows how the physical layout of the register cell matches up with the schematic. The inverters are formed from transistors A and B, along with the resistors. Transistors C, D, and E are formed by the labeled strips of polysilicon. The bit line is not visible below, since it is in the metal layer. Note that the layout of the memory cell is highly optimized to minimize its size. Also note that transistor A is much smaller than the other transistors; inverter A has a weak output so it can be overpowered by the bit line when a value is written.

A register cell in the 8086 with the corresponding schematic.

8-bit register support

Careful examination of the die shows that some of the register cells have a slightly different structure. On the left is a pair of the register cells discussed above,5 while the right photo shows a pair of register cells with two write control lines instead of one. In the left photo, the write line crosses the silicon in both register cells. However, in the right photo, the "write right" line crosses the silicon on the right side but goes between the silicon regions on the left. Conversely, the "write left" line crosses the silicon on the left side and goes between the silicon on the right. Thus, one write line controls writes to the right-hand bit, while the other controls writes to the left-hand bit. In the full 16-bit register, this allows alternating 8-bit parts to be written separately.6

Two pairs of memory cells, showing different circuitry. The left cells have a single write line, while the right cells have separate write lines for the left and right bits.

Why do some registers have two write lines while others have one? The reason is that the 8086 has 16-bit registers, but four of them can also be accessed as 8-bit registers, as shown below. For example, the 16-bit accumulator A can be accessed as an 8-bit AH (accumulator high) register and an 8-bit AL (accumulator low) register. By implementing the registers with two write control lines, either half of the register can be written separately.7
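A logical sketch of the two write control lines, in Python. The class `SplitRegister` and the flags `write_low` and `write_high` are hypothetical names, and the sketch assumes the value for each half arrives on the matching byte lane of the bus. It models the programmer-visible effect, not the interleaved physical layout.

```python
class SplitRegister:
    """16-bit register with separate write enables for its low and high bytes."""

    def __init__(self) -> None:
        self.value = 0

    def write(self, bus: int, write_low: bool, write_high: bool) -> None:
        # Build a mask of the bits whose write line is energized.
        mask = (0x00FF if write_low else 0) | (0xFF00 if write_high else 0)
        # Only the selected cells take the bus value; the rest keep their bits.
        self.value = (self.value & ~mask & 0xFFFF) | (bus & mask)


ax = SplitRegister()
ax.write(0x1234, write_low=True, write_high=True)   # write the full 16-bit register
ax.write(0xAB00, write_low=False, write_high=True)  # write AH only
print(hex(ax.value))
```

With only one write line, the second call would have overwritten the low byte as well; the separate lines let the low byte keep its old value.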

The general-purpose registers in the 8086 processor. The A, B, C, and D registers can be split into two 8-bit registers. From The 8086 Family User's Manual.

Multi-port registers

So far, I've discussed the eight general-purpose "lower registers". The 8086 also has seven "upper registers" used for memory accesses, including the infamous segment registers.8 These registers have a more complex "multi-port" design, allowing multiple reads and writes to take place simultaneously.9 For instance, the multi-ported register file would allow the program counter to be read, a segment register to be read, and a different segment register to be written, all at the same time.

The multi-ported register cell below is built around the same two-inverter circuit as before but it has three bit lines (compared to one earlier) and five control lines (compared to two). The three read control lines allow the register cell contents to be read to any of the three bit lines, while the two write control lines allow bit line A or bit line C to be written to the register cell.
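The port arrangement can be sketched at the word level in Python. This is a simplified model with hypothetical names (`MultiPortFile`, `cycle`). It assumes every register has all three read lines and both write lines, and it disallows reading and writing the same bit line in one cycle purely to keep the sketch small.

```python
class MultiPortFile:
    """Registers that share three bit-line buses: A, B, and C."""

    READ_PORTS = {"A", "B", "C"}   # three read control lines per cell
    WRITE_PORTS = {"A", "C"}       # two write control lines per cell

    def __init__(self, names) -> None:
        self.regs = {name: 0 for name in names}

    def cycle(self, reads=None, writes=None):
        """Perform simultaneous accesses in one cycle.

        reads:  {bit_line: register_name}; that register drives the bit line
        writes: {bit_line: (register_name, value)}; the value on the bit line
                is stored into the register
        Returns {bit_line: value} for the bit lines that were read.
        """
        reads = reads or {}
        writes = writes or {}
        if set(reads) - self.READ_PORTS or set(writes) - self.WRITE_PORTS:
            raise ValueError("no control line for that bit line")
        if set(reads) & set(writes):
            # Simplification of this sketch, not a claim about the hardware.
            raise ValueError("bit line used for both a read and a write")
        # Sample the reads first, then apply the writes.
        out = {line: self.regs[name] for line, name in reads.items()}
        for line, (name, value) in writes.items():
            self.regs[name] = value & 0xFFFF
        return out


bank = MultiPortFile(["PC", "CS", "DS", "SS", "ES"])
bank.cycle(writes={"A": ("PC", 0x0100), "C": ("CS", 0xF000)})
# Read the program counter and one segment register while writing another.
result = bank.cycle(reads={"A": "PC", "B": "CS"}, writes={"C": ("ES", 0x2000)})
print(result)
```

The last call mirrors the example in the text: two registers are read and a third is written in the same cycle, each over a different bit line.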

A multi-ported register cell in the 8086 processor.

At first glance, the 8086's register file looked like a uniform set of registers, but close examination reveals that each register has been optimized based on its function.10 Some registers are simple 16-bit registers, which have the most compact layout. Other 16-bit registers can also be accessed as two 8-bit registers, requiring another control line. The most complex registers have two or three read ports and one or two write ports. In each case, the physical layout of the register cell has been carefully designed to be as compact as possible, with elaborate transistor shapes, as seen below. Intel's engineers shrank the register layout as much as possible to fit all the registers in the available space.

The upper register file, consisting of ten 16-bit registers. This photo shows the silicon and polysilicon. The vertical red lines are traces of the metal layer that was removed. Click for a larger image.

Conclusions

Although the 8086 processor is 42 years old, it still heavily influences modern computing through the x86 architecture in heavy use today. The registers of the 8086 still exist in modern x86 computers, although the registers are now 64 bits long and have been joined by many new registers.

The 8086 is an interesting subject for die analysis since its transistors are large enough to be visible under a microscope. It was a complex processor at the time, with 29,000 transistors, but it is still simple enough that the circuitry can be traced out and understood. I plan to analyze the 8086 in more detail in future blog posts so follow me on Twitter @kenshirriff or on RSS for updates.11

Notes and references

  1. The 8086 was apparently the first microprocessor to implement instruction prefetching. The Motorola 68000 (1979) had a 4-byte instruction prefetch buffer. Prefetching in mainframes goes back to the IBM Stretch (1961), CDC 6600 (1964), and IBM System/360 Model 91 (1966). 

  2. It's difficult to determine how many registers are in a modern processor; the only accurate description I could find was in The Anatomy of a High-Performance Microprocessor, which describes the AMD K6 processor (1997) in detail. Due to register renaming, modern processors have many more physical registers than architectural registers (the registers visible to a programmer), and the number of physical registers is not documented. (In addition to the eight general-purpose x86 registers, the K6 had 16 microarchitecture scratch registers for renaming.)

    Processors supporting AVX-512 include 32 512-bit registers, so that's 2 kilobytes of registers for that feature alone. This makes it even harder to determine the register size. As for cache size, high-end processors have up to 77 MB of cache storage. 

  3. The pull-up resistor in an NMOS gate is actually a special transistor. The depletion-mode transistor acts as a resistor but is more compact and performs better than an actual resistor. 

  4. Other processors use slightly different register storage cells. The 6502 uses an additional transistor in the inverter feedback loop to break the feedback loop when writing a new value. The Z-80 writes to both inverters at the same time, making the transition "easier" but requiring two write wires. While the 8086 has an amplification transistor in each register cell for reads, other processors read the outputs from both inverters and use an external differential amplifier to strengthen the signal. The 8086's basic register cell uses 7 transistors (7T), more than a typical 6-transistor (6T) or 4-transistor (4T) static RAM cell, but it only uses one bit line rather than two differential bit lines. Dynamic memory (DRAM) is much more efficient, using one transistor and a capacitor, but data will be lost without refresh. 

  5. On the die, register cells are not repeated uniformly, but instead alternating cells are mirror images. This improves the density of the register cells because a power line running between two mirror-image cells can feed both of them (and the same with ground). Thus, the mirror-image layout reduces the number of power and ground lines by half. 

  6. Although block diagrams always show the 16-bit registers split into a left half and a right half, the actual implementation alternates the bits from each half instead of storing one 8-bit part on the left and the other on the right. This implementation makes it easier to swap the two halves of a 16-bit word, which is required in several cases. (One is an unaligned memory read or write. Another is an ALU operation using the top half of a register, such as AH.) Swapping bits between the left half and the right half would require running long wires between the halves for each bit. But with the interleaved implementation, swapping the two halves is a matter of swapping each pair of neighboring bits, which doesn't need long wires. In other words, the interleaved layout in the 8086's registers simplifies the wiring for swapping the two halves of a word. 

  7. If the register file only supported 16-bit registers instead of 8-bit half-registers, the processor could still work but would be less efficient. Writes to an 8-bit half could be done by reading the full 16 bits, modifying the 8-bit half, and then writing back the full 16 bits. This would take three register accesses instead of one. Note that the register file doesn't need special support for 8-bit reads since the unwanted half can be ignored. 

  8. The block diagram below is different from most 8086 block diagrams because it shows the actual physical implementation, rather than the programmer's view of the processor. In particular, this diagram shows two "Internal Communication Registers" in the Bus Interface Unit registers (right) along with the segment registers, matching the 7 registers visible on the die. (The temporary registers below are physically part of the ALU, so I'm not discussing them in this blog post.)

    Block diagram of the 8086 processor. From The 8086 Family User's Manual.


  9. The book Modern Processor Design discusses the complex register systems of processors from the early 2000s. It says that circuit complexity increases rapidly beyond 3 ports, but some high-end processors had register files with 20 ports or more.  

  10. The upper registers have differing numbers of read and write ports, as follows: two registers with 3 read control lines and 2 write lines, one register with 2 read lines and 2 write lines, and four registers with 2 read lines and 1 write line. The first three registers are probably the program counter, the "indirect" temporary register, and the "operand" temporary register. The last four are probably the CS, DS, SS, and ES segment registers. There are also three instruction prefetch buffer registers, each with 1 read line and 1 write line.

    The 8088 processor, used in the original IBM PC, was essentially identical to the 8086, except it had an external 8-bit bus instead of a 16-bit bus to reduce system cost. The 8088's prefetch buffer was four bytes instead of six, presumably because four bytes was sufficient with the 8088's slower memory bus.

    Unlike the 8086, the prefetch registers in the 8088 support writing to 8-bit halves independently (similar to the 8088's A, B, C, and D registers, but with a different register cell design). The reason is the 8088 fetched instructions one byte at a time instead of one word at a time, due to its narrower bus. Thus, the 8088's prefetch registers need to support byte-sized writes, while the 8086 does word-sized prefetches. 

  11. I wrote about the 8086 die and the die shrink process earlier. For more about register files, see my posts on registers in the Z-80 and in the 8085.
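As a sketch of the interleaving described in note 6 above, the Python below lays out a word with the bits of the two bytes alternating and shows that swapping neighboring cells swaps the two halves. The function names are hypothetical, and the choice of which byte occupies the even positions is an assumption; the note does not specify the order.

```python
from typing import List


def interleave(word: int) -> List[int]:
    """Lay out a 16-bit word so bits of the low and high bytes alternate."""
    cells = []
    for i in range(8):
        cells.append((word >> i) & 1)        # bit i of the low byte
        cells.append((word >> (i + 8)) & 1)  # bit i of the high byte
    return cells


def deinterleave(cells: List[int]) -> int:
    """Recover the 16-bit word from the interleaved cell order."""
    word = 0
    for i in range(8):
        word |= cells[2 * i] << i
        word |= cells[2 * i + 1] << (i + 8)
    return word


def swap_neighbors(cells: List[int]) -> List[int]:
    """Swap each pair of adjacent cells; only short, local connections are needed."""
    out = list(cells)
    for i in range(0, 16, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


word = 0x12AB
swapped = deinterleave(swap_neighbors(interleave(word)))
print(hex(word), "->", hex(swapped))
```

Every exchange is between adjacent cells, which is the wiring advantage the note describes.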

Die shrink: How Intel scaled down the 8086 processor

The revolutionary Intel 8086 microprocessor was introduced 42 years ago this month so I've been studying its die.1 I came across two 8086 dies with different sizes, which reveal details of how a die shrink works. The concept of a die shrink is that as technology improved, a manufacturer could shrink the silicon die, reducing costs and improving performance. But there's more to it than simply scaling down the whole die. Although the internal circuitry can be directly scaled down,2 external-facing features can't shrink as easily. For instance, the bonding pads need a minimum size so wires can be attached, and the power-distribution traces must be large enough for the current. The result is that Intel scaled the interior of the 8086 without change, but the circuitry and pads around the edge of the chip were redesigned.

The photo below shows an 8086 chip from 1979, and a version with a visibly smaller die from 1986.3 (The ceramic lids have been removed to show the silicon dies inside.) In the updated 8086, the internal circuitry was scaled to about 64% of the original size by length, so it took 40% of the original area. The die as a whole wasn't reduced as much; it was about 54% of the original area. (The chip's package was unchanged, the 40-pin DIP package commonly used for microprocessors of that era.)
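The relationship between the linear scale and the area is simple arithmetic, sketched here in Python using only the figures quoted above. The variable names are illustrative.

```python
linear_scale = 0.64      # interior circuitry, by length (from the text)
whole_die_area = 0.54    # whole die, as a fraction of the original area (from the text)

# Area scales with the square of the linear dimension.
interior_area = linear_scale ** 2

print(f"interior area: {interior_area:.1%} of original")
print(f"whole die area: {whole_die_area:.1%} of original")
# Uniform linear scale that would give the same whole-die area
print(f"equivalent whole-die linear scale: {whole_die_area ** 0.5:.1%}")
```

The whole-die figure works out to a larger equivalent linear scale than the interior's 64%, which reflects the pads and power wiring around the edge not shrinking as much.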

Comparison of two 8086 chips. The newer chip on the bottom has a significantly smaller die. The rectangle in the upper-right of each die is the microcode ROM.

The 8086 is one of the most influential chips ever created; it started the x86 architecture that still dominates desktop and server computing today. Unlike modern CMOS processors, the 8086 was built from NMOS transistors, as were the 6502, Z-80, and other early processors.4 The first chip was built with HMOS,5 Intel's name for this process. Intel introduced the improved HMOS-II in 1979, and in 1982 Intel moved to HMOS-III, the process used for the newer 8086 chip.6 Each newer HMOS version shrank the size of features on the chip and improved performance.

Two versions of the 8086 die, at the same scale. The bond wires are connected to pads around the edge of the die.

The photo above shows the two 8086 dies at the same scale. The two chips have identical layout in the interior,7 although they may look different at first. The chip on the right has many dark lines in the middle that don't appear on the left, but this is an artifact. These lines are the polysilicon layer, underneath the metal; the die on the left has the same wiring, but it is very faint. I think the newer chip has a thinner metal layer, making the polysilicon more visible.

The magnified photo below shows the same circuitry on the two dies. There is an exact correspondence between components in the two images, showing the circuitry was reduced in size, not redesigned. (These photos show the metal layer on top of the chip; some polysilicon is visible in the right photo.)

The same region of the two dies at the same scale.

However, there are significant differences around the edges of the dies. The bond pads around the outside are closer together, especially in the bottom right. There are two reasons for this. First, the bond pads can't shrink very much, since they need to be large enough to attach bond wires. Second, the power distribution traces around the edges are wider in order to support the necessary current. (Look to the right of the microcode ROM in the lower right, for instance.) Part of this is because the power traces in the middle of the circuitry were scaled down with the rest of the circuitry, so they are smaller; the outside traces need to pick up the slack. In addition, the thinner metal layer in the newer chip can't support as much current without being widened.

A bond pad and associated transistors, comparing the old chip (left) and new chip (right). In the copyright date, the top of the "6" is strangely flat; it looks like they changed a "1985" to "1986".

The photo above shows a bonding pad with an attached bond wire. The drive transistors are above the pad. The newer chip has almost the same size pad, but the power drive transistors have both shrunk and been redesigned. Note the much thicker metal power wiring on the newer chip. The Intel logo was moved from the bottom right to the bottom left, probably because that's where there was room.

A closer look at the dies

First, a bit of background on the NMOS construction used in the 8086 and other chips of that era. These chips consist of a silicon substrate, which is doped by diffusion of arsenic or boron to form transistors. On top, a layer of polysilicon creates the gates of the transistors as well as providing wiring between components. Finally, a single metal layer on top wires up the components.

A semiconductor process (such as HMOS-III) has specific rules on the minimum size and spacing for features on the silicon, polysilicon, and metal layers. By looking closely at the chips, we can see how the features correspond to the design rules for HMOS I and HMOS III. The table below (from HMOS III Technology) summarizes the characteristics of the different HMOS processes. The features get smaller and the performance gets better with each version. (Intel got a 40% overall performance improvement going from HMOS-II to HMOS-III.)

                              HMOS I    HMOS II   HMOS III
Diffusion Pitch (µm)          8.0       6.4       5.0
Poly Pitch (µm)               7.0       5.6       4.0
Metal Pitch (µm)              11.0      8.0       6.4
Gate Oxide Thickness (Å)      700       400       250
Channel Length (µm)           3.0       2.0       1.5
Idsat (mA)                    8.0       14.0      27.0
Minimum Gate-Delay (ps)       1000      400       200
Speed-Power Product (pJ)      1.0       0.5       0.25
Linear Shrink Factor          1.0       0.8       0.64

The microscope photo below shows a complex arrangement of transistors in the older 8086 chip. The dark regions are doped silicon, while the white rectangles are the transistor gates. (There are about 21 transistors in this photo.) A key measurement is the channel length, the length of the gate between the source and drain. (This is the narrower dimension of the white rectangles.) I measured 3 μm for these transistors, which nicely matches the published value for HMOS I.8 This indicates the chip was manufactured with a 3 μm process; in comparison, processors are now moving to a 5 nm process, 600 times smaller.

Transistors in the older 8086 chip. The metal and polysilicon were removed for this photo. Circles are vias that connect to the metal layer.

The photo below shows transistors in the newer 8086 at the same scale; the transistors are much smaller. The linear dimensions are scaled to 64%, so the transistors have 40% of their original area. Because I processed this die differently, the polysilicon remained on the die, visible as the yellowish lines. The doped silicon appears pinkish, much less visible than before. I measured the gate length as 1.9 μm, which is 64% of the previous 3 μm. Note that HMOS-III supports a considerably smaller 1.5 μm channel length, but since everything shrinks by the same 64% factor, the channel length is larger than necessary. This illustrates that uniformly shrinking the die wastes some of the potential gain from the new process, but it is much easier than completely redesigning the chip.

Transistors in the later 8086 chip. There are many vias between the silicon or polysilicon and the metal (which has been removed).

I also looked at the spacing (pitch) of lines in the metal layer. The photo below shows some horizontal and vertical metal wiring in the older chip. I measured 11 μm pitch for the metal lines, which matches the published HMOS I figure. The shrink to 64% yields 7 μm pitch on the new chip, even though HMOS III supported 6.4 μm. As before, the constant shrink factor doesn't take full advantage of the new process.

The metal layer of the older 8086 chip. Reddish polysilicon wiring is visible underneath the metal.

Finally, I looked at the pitch of the polysilicon wiring. The photo below shows the older 8086; the polysilicon has been removed, leaving faint white traces. These parallel polysilicon lines probably formed a bus, routing signals from one part of the chip to another. I measured 7 μm pitch for the polysilicon lines, matching the published HMOS I figure. (Interestingly, polysilicon wiring can be denser than metal wiring under HMOS rules.) The newer chip has 4.5 μm polysilicon pitch, compared to a possible 4.0 μm.
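The three comparisons above (channel length, metal pitch, polysilicon pitch) follow the same pattern, which can be sketched in Python. The HMOS I and HMOS III values are taken from the table earlier in this section; the dictionary and function names are illustrative.

```python
SHRINK = 0.64  # linear shrink factor applied uniformly to the interior

# feature: (HMOS I value, HMOS III minimum), both in micrometers, from the table
rules = {
    "channel length": (3.0, 1.5),
    "metal pitch": (11.0, 6.4),
    "poly pitch": (7.0, 4.0),
}


def shrunk_vs_minimum(rules, shrink=SHRINK):
    """For each feature, return (size after the shrink, ratio to the HMOS III minimum)."""
    rows = {}
    for name, (hmos1, hmos3_min) in rules.items():
        shrunk = hmos1 * shrink
        rows[name] = (shrunk, shrunk / hmos3_min)
    return rows


for name, (shrunk, ratio) in shrunk_vs_minimum(rules).items():
    print(f"{name}: {shrunk:.2f} um after shrink, {ratio:.2f}x the HMOS III minimum")
```

Every ratio comes out above 1, so each shrunken feature is larger than HMOS III requires, with the channel length furthest from its minimum.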

Polysilicon traces on the older 8086 chip.

Conclusions

A die shrink provides a way to improve the performance of a processor and reduce its cost without the effort of a complete redesign. Comparing the two chips, however, shows that a die shrink is more complex than uniformly shrinking the whole die. While most of the circuitry is a straightforward shrink, the bond pads didn't shrink to the same degree, so they needed to be moved around. The power distribution was also modified, adding more power wiring around the outer part of the chip.

Modern microprocessors still use die shrinks. In 2007, Intel moved to a tick-tock model, where they would alternate shrinks of an existing chip (the "tick") with the production of a new microarchitecture (the "tock").

I plan to analyze the 8086 in more detail in future blog posts so follow me on Twitter at @kenshirriff for updates. I also have an RSS feed.

Notes and references

  1. The 8086 was released on June 8, 1978. 

  2. It's actually quite remarkable that MOSFET circuits still work after being scaled down over a large range, since most things don't scale as easily. For instance, you can't scale down an engine by a factor of 10 and expect it to work. Most physical things suffer from the square-cube law: the area scales with the square of the ratio, while the volume scales with the cube of the ratio. For MOS circuits, however, most things either stay the same with scaling, or get better (such as frequency and power consumption). For more details on scaling, see Mead and Conway's Introduction to VLSI Systems Ch 1 sect 2. Interestingly, that 1978 book says that scaling had a fundamental limit of 1/4 micron (250 nm) channel length due to physical effects. That limit was wildly wrong; transistors are now moving to 5 nm, through technologies such as FinFETs. 

  3. The older chip says ©'78, ©'79 on the package and ©1979 on the die and has a 7947 (47th week of 1979) date code on the underside. The newer chip says ©1978 on the package but ©1986 on the die and has no identifiable date code, so I figure it is from 1986 or slightly later. It's unclear why the newer chip has an older copyright date on the external package. 

  4. A brief description of the technologies in early processors. N-channel MOSFETs are a particular type of MOSFET transistor. They have considerably better performance than the P-channel MOSFETs used in the earliest microprocessors, such as the Intel 4004. (Modern processors use N-channel and P-channel transistors together for lower power consumption; this is CMOS.) Gates built from N-channel MOSFETs require a pull-up resistor, which is implemented by a transistor. Depletion load transistors are a type of transistor introduced in the mid-1970s that perform better as pull-up resistors and don't require an extra power supply voltage. Finally, MOS transistors originally used metal for the gate (the M in MOS). But in the late 1960s, Fairchild developed the use of polysilicon for the gate instead of metal. This provided much better performance and was easier to manufacture. The point of all this is that between the late 1960s and mid-1970s, several radical changes were introduced in MOS integrated circuit production, and these led to the success of the 6502, Z-80, 8085, 8086, and other early processors. In the 1980s, CMOS processors took over due to their lower power consumption and better performance. 

  5. Strangely, it's unclear what the "H" stands for in HMOS. I couldn't find anywhere that Intel expands the acronym; databooks refer to "Intel's advanced N-channel silicon gate HMOS process" or say "HMOS is a high-performance n-channel MOS process". Intel later defined CHMOS as Complementary High Speed Metal Oxide Semiconductor (example). Motorola defined HMOS as High-density MOS (example) while other sources defined it as High-speed MOS or High-density, short channel MOS. Intel has a patent on "High density/high speed MOS process and device", so perhaps the "H" stands for both "high density" and "high speed". 

  6. Interestingly, Intel used a 4K static RAM chip to develop each of their HMOS processes, before using the process for their microprocessors and other chips. They probably developed with the RAM chip because it has dense circuitry, but is relatively easy to design because it repeats the same memory cell over and over. Once they had all the design rules figured out, then they could create the much more complex processor. 

  7. I scaled complete, high-resolution images of the two chips to compare and the main part of the chips is an exact match except for some trivial changes. I found a couple of places where a via was slightly moved, which is puzzling because I see no logical reason for that. The circuit was unchanged, so it's not a bug fix. One question is if there were any microcode changes. The microcode looks identical, but I didn't do a bit-by-bit comparison. 

  8. You may have noticed that three transistors in the photo have much larger gates. These are transistors that are acting as pull-up resistors, as is typical for NMOS circuits. The larger size makes the transistors weaker, so they provide a weak pull-up current. 

Die analysis of the 8087 math coprocessor's fast bit shifter

Floating-point numbers are very useful for scientific programming, but early microprocessors only supported integers directly.1 Although floating-point was common in mainframes back in the 1950s and 1960s, it wasn't until 1980 that Intel introduced the 8087 floating-point coprocessor for microcomputers.2 Adding this chip to a microcomputer such as the IBM PC made floating-point operations up to 100 times faster. This was a huge benefit for applications such as AutoCAD, spreadsheets, or flight simulators.3 The downside was the 8087 chip cost hundreds of dollars.4

It's hard to implement floating-point operations so they are computed quickly and accurately. Problems can arise from overflow, rounding, transcendental operations, and numerous edge cases. Prior to the 8087, each manufacturer had their own incompatible ad hoc implementation of floating point. Intel, however, enlisted numerical analysis expert William Kahan to design accurate floating point based on rigorous principles.5 The result was the floating-point architecture of the 8087. This became the IEEE 754 standard used in almost all modern computers, so I consider the 8087 one of the most influential chips ever designed.

Die of the Intel 8087 floating point unit chip, with main functional blocks labeled. The die is 5mm×6mm. The shifter is outlined in red. Click for a larger image.

To explore how the 8087 works, I opened up an 8087 chip and took photos of the silicon die with a microscope. Containing 40,000 transistors, the 8087 pushed chip manufacturing to the limit; in comparison, the companion 8086 microprocessor only had 29,000 transistors. To make the chip possible, Intel developed new techniques. In this article, I focus on the high-speed binary shifter (outlined in red above). The shifter takes up a large fraction of the chip's area, so minimizing its area was vital to making the 8087 possible.

A floating-point number consists of a fraction (also called significand or mantissa), an exponent, and a sign bit. (These are expressed in binary, but for a base-10 analogy, the number 6.02×10^23 has 6.02 as the fraction and 23 as the exponent.) The circuitry to process the fraction is at the bottom of the die photo. From left to right, the fraction circuitry consists of a constant ROM, a shifter (highlighted), adder/subtracters, and the register stack. The exponent processing circuitry is in the middle of the chip. Above it, the microcode engine and ROM control the chip.

The shifter

The role of the shifter is to shift binary numbers left or right, a task with several critical roles in floating-point operations. When two floating-point numbers are added or subtracted, the numbers must be shifted so the binary points line up. (The binary point is like the decimal point, but for a binary number.) The 8087's transcendental instructions are built around shift and add operations, using an algorithm called CORDIC. The shifter is also used to assemble a floating-point number from 16-bit chunks read from memory.8

Since shifts are so essential to performance, the 8087 uses a "barrel shifter", which can shift a number by any number of bits in a single step.6 Intel used a two-stage shifter design that kept its size manageable while still providing high performance. The first stage shifts the value by 0 to 7 bits, while the second stage shifts by 0 to 7 bytes. In combination, the two stages shift a value by any amount from 0 to 63 bits.
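
As a rough software analogy for the two-stage design, the sketch below splits a shift count of 0 to 63 into a bit part and a byte part and applies them in sequence. The function name and the use of Python integers are illustrative assumptions; the 68-bit width is the shifter width given in this article and is used only to mask the result.

```python
WIDTH = 68  # shifter width from the article; used here only to mask results

def two_stage_shift_left(value, amount):
    """Shift `value` left by 0..63 bits using a bit stage and then a byte stage."""
    if not 0 <= amount <= 63:
        raise ValueError("shift amount must be in 0..63")
    bit_part = amount & 0b111     # low three bits: 0..7 bit positions
    byte_part = amount >> 3       # high three bits: 0..7 byte positions
    mask = (1 << WIDTH) - 1
    after_bit_stage = (value << bit_part) & mask
    after_byte_stage = (after_bit_stage << (8 * byte_part)) & mask
    return after_byte_stage
```

The point of the split is that each stage only needs eight select lines, rather than one stage needing 64.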

The bit shifter

I'll start by describing the bit shifter, which performs a shift of 0 to 7 bit positions. The diagram below outlines the structure of the bit shifter, showing five of the inputs and outputs; the full shifter supports 68 bits.7 The concept is that by activating a particular column, the input is shifted by the desired amount. Each circle indicates a transistor that can act as a switch between an input line and an output line. The vertical select lines are used to activate the desired transistors. Each input line is connected diagonally to eight transistors, allowing it to be directed to one of eight outputs. For example, the diagram shows shift select line 3 activated, turning on the associated transistors (green). The highlighted input 20 (orange) is directed to output 23 (blue). Similarly, the other inputs are connected to the corresponding outputs, yielding a shift by 3. By activating a different shift select line, the input will be shifted by a different amount between 0 and 7 bits.

Structure of the bit shifter. By energizing a shift select line, the inputs are connected to outputs with the desired bit shift.
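
The switching idea can be sketched in a few lines of Python. This is only a model of the connectivity, not of the electrical behavior: the list representation and the function name are assumptions made for illustration, and outputs with no connected input are simply treated as 0.

```python
def bit_shifter(inputs, select):
    """Model one column of pass transistors being switched on.

    inputs[i] is the bit on input line i; `select` (0..7) is the active
    shift select line. Output line i + select is connected to input line i.
    """
    if not 0 <= select <= 7:
        raise ValueError("select line must be in 0..7")
    outputs = [0] * len(inputs)   # outputs with no connected input read as 0
    for i, bit in enumerate(inputs):
        j = i + select            # diagonal wiring: input i reaches output i + select
        if j < len(outputs):
            outputs[j] = bit
    return outputs

inputs = [0] * 68
inputs[20] = 1                    # the highlighted input from the diagram
outputs = bit_shifter(inputs, 3)  # activate shift select line 3
print(outputs.index(1))           # 23
```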

To explain the internal construction of the shifter, I'll start by describing the NMOS transistors used in the 8087 chip. Transistors are built by doping areas of the silicon substrate with impurities to create "diffusion" regions with different electrical properties. The transistor can be considered a switch, controlling the flow of current between two regions called the source and drain. The transistor is activated by the gate, made of a special type of silicon called polysilicon, layered above the substrate silicon. Applying voltage to the gate lets current flow between the source and drain, which is otherwise blocked. Transistors are wired together by a metal layer on top, building a complex integrated circuit.

Structure of a MOSFET as implemented in an integrated circuit.

The photo below shows a transistor in the 8087 as it appears under the microscope. Its structure matches the diagram above, although its shape is more complex. The source, gate, and drain all continue out of the photo, connected to other transistors. In addition, wiring in the metal layer is connected to the silicon at the circular vias. (The metal layer was removed with acid for this photo.)

An NMOS transistor in the 8087 chip, as seen under the microscope.

Zooming out, the diagram below shows part of the bit shifter as implemented on the chip. About 48 transistors, similar to the one above, are in this photo. The orange and yellow diagonal corresponds to one of the inputs: the orange regions show transistors connected through the silicon, while the yellow lines show connections in the metal layer. (The metal layer is used to jump over the polysilicon select lines.) The green highlight shows the polysilicon line for shift-by-three. In the center, this polysilicon gate line turns on a transistor, connecting the input to the long yellow output line, shifting the highlighted input by three positions. (The other non-highlighted inputs are shifted similarly.) Thus, this circuit implements the shifter as described at the beginning of the section. The photo shows six of the 68 inputs, so the complete shifter is much taller.

Closeup of the silicon circuitry for the bit shifter. The path of one signal is shown, as controlled by the shift-by-three control (green).

The byte shifter

The byte shifter shifts its inputs by multiples of eight bits, rather than one bit. Its design is similar to the bit shifter, except each input connects to every eighth output. For instance, input 20 connects to outputs 20, 28, 36, and so forth, shifting by bytes. As a result, the diagonal connections are steep and packed tightly, with eight lines between each switch. In the diagram below, the line for shift-by-four is selected, with the connection from input 0 to output 32 highlighted. Note the lack of wires in the right half of the diagram because any bit shifted from beyond input 0 becomes zeroed. For instance, when shifting left by 4 bytes, low-order bits 31 and below become zero.

The structure of the byte shifter.

The die photo below shows part of the bit shifter and the byte shifter. This photo is zoomed-out to show the overall structure; individual transistors are barely visible. The bit shifter's area is densely packed with transistors, but the byte shifter consists mostly of wiring, with columns of transistors in between.9 Also note that the byte shifter is partially empty at the top, filling in with more wiring towards the bottom. The wiring layout isn't as orderly as in the diagram above, but is arranged for maximum efficiency.

The bit shifter and byte shifter in the 8087 chip.

The bidirectional drivers

So far, the bit and byte shifters only shift bits in one direction.11 However, bits need to be shifted in both directions. One of the key innovations of the 8087's shifter is its bidirectional design: data can be passed through the shifter in reverse to shift bits the opposite direction. This is possible because the shifter is constructed with pass transistors, not logic gates. Pass transistor logic uses transistors as switches that pass or block signals, so signals can travel in either direction. (In contrast, regular logic gates such as NOR gates have specific inputs and outputs.)

Special driver circuitry on the left and right sides of the shifter allows the shifter to operate in either direction. To send data from left to right, the left-hand driver reads data from the fraction bus and sends it into the shifter. The right-hand driver circuit receives this shifted data, latches it temporarily, and then writes it back to the fraction bus. To send data in the opposite direction, the driver circuits reverse roles: the right-hand driver sends data from the fraction bus into the shifter while the left-hand circuit receives the shifted data.10
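
To see why reversing the data flow reverses the shift, it may help to model the switched-on pass transistors as plain connections between a left-side line and a right-side line. The sketch below is a simplified model under that assumption; the function names are hypothetical, and the drivers, latches and pre-charging are not modeled.

```python
def pass_matrix(width, shift):
    """Pairs (left_line, right_line) joined by the transistors that are switched on."""
    return [(i, i + shift) for i in range(width) if i + shift < width]

def drive_left_to_right(bits, shift):
    """Left-hand driver sends data in; bits arrive `shift` positions higher."""
    out = [0] * len(bits)
    for left, right in pass_matrix(len(bits), shift):
        out[right] = bits[left]
    return out

def drive_right_to_left(bits, shift):
    """Right-hand driver sends data in; the same connections move bits lower."""
    out = [0] * len(bits)
    for left, right in pass_matrix(len(bits), shift):
        out[left] = bits[right]
    return out
```

Both functions use the same set of connections; only the side that drives the lines changes, which is the property that pass transistors provide and ordinary logic gates do not.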

The multiplexer / decoders

The final feature I'll describe is the circuitry that controls the shifter. Three different sources control how many positions to shift. First, the microcode engine can specify the number directly. Second, the number can come from a loop counter; this is used as part of the CORDIC transcendental algorithms. Finally, the number can come from a leading zero counter; this allows numbers to be normalized by eliminating leading zeroes through shifting. Each of these sources provides a 6-bit shift number; the six multiplexers each select one bit from the desired source.12

The multiplexer/decoder circuitry.

Next, decoders activate one of eight bit-shift lines and one of eight byte-shift lines to control the appropriate pass transistors in the shifter. (Each decoder takes a 3-bit input and activates one of 8 output lines.) Because each decoder line controls a large column of pass transistors in the shifter, the decoder uses relatively large power transistors.13 At the bottom, the 16 control lines exit the circuitry.
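
The control path can be sketched as a multiplexer followed by two 3-to-8 decoders. The source names below are made up for illustration, and the split of the 6-bit count (low three bits to the bit shifter, high three bits to the byte shifter) is my assumption based on the shift ranges of the two stages.

```python
def select_shift_count(source, microcode, loop_counter, leading_zeros):
    """Multiplexer: pick the 6-bit shift count from one of three sources."""
    choices = {"microcode": microcode, "loop": loop_counter, "zeros": leading_zeros}
    return choices[source] & 0b111111

def decode_3_to_8(value):
    """Decoder: activate exactly one of eight lines for a 3-bit input."""
    return [1 if line == value else 0 for line in range(8)]

def control_lines(count):
    """Produce the 16 control lines: eight bit-shift lines, then eight byte-shift lines."""
    bit_lines = decode_3_to_8(count & 0b111)   # assumed: low 3 bits drive the bit shifter
    byte_lines = decode_3_to_8(count >> 3)     # assumed: high 3 bits drive the byte shifter
    return bit_lines + byte_lines
```

For a count of 19, this activates bit-shift line 3 and byte-shift line 2, since 19 = 2×8 + 3.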

Conclusion

The 8087 is a complex chip with many functional units. However, by examining the die closely, the circuits of the 8087 can be understood. This blog post described the 8087's fast barrel shifter, capable of shifting by up to 63 bits at a time.14 Intel received a patent on this innovative programmable bidirectional shifter.

The shifter was just one of the features that let the 8087 compute floating-point operations much faster than the 8086 processor could. The 8087 operates on 80 bits at a time instead of 16. The 8087 has 80-bit wide registers, reducing memory accesses during computations. The 8087 stores constants for transcendental operations in a ROM, also avoiding memory accesses. Hardware in the 8087 checks for NaN, underflow, overflow, etc., avoiding slow checks in code. The 8087's hardware also makes multiplication and division faster. I don't know the relative contributions of these factors, but in combination, they improved floating-point performance dramatically, by up to a factor of 100.

The benefits of floating point hardware are so great that Intel started integrating the floating-point unit into the processor with the 80486 (1989). Now, most processors include a floating-point unit and the expense of purchasing a separate floating-point coprocessor is a thing of the past.

Die photo of the 8087 with the metal layer removed. The colors are due to some of the oxide layer remaining. Click for a larger image.

For more information on the 8087, see my other articles: Extracting ROM constants from the 8087, The two-bit-per-transistor ROM and The substrate bias generator. I announce my latest blog posts on Twitter, so follow me @kenshirriff for future articles. I also have an RSS feed.

Notes and references

  1. Even without floating-point hardware, early microcomputers could perform floating-point operations. The operations would be broken down into many integer operations, manipulating the exponent and fraction as necessary. In other words, floating-point support didn't make floating-point operations possible, it just made them much faster. (Another way to represent non-integers is fixed-point numbers, which have a fixed number of digits after the decimal. Fixed-point numbers are simpler than floating-point, but can't represent as large a range.) 

  2. The 8087 wasn't the first floating-point chip. National Semiconductor introduced the MM57109 Number Cruncher Unit (that is the real name) in 1977. It was essentially a repackaged 12-digit scientific calculator chip, operating on binary-coded decimal values with values entered in Reverse Polish Notation. This chip was absurdly slow; a tangent, for instance, could take over a second. AMD introduced their floating-point chip, the Am9511, in 1978 (details). This chip supported 32-bit floating-point numbers and took up to 1.4 milliseconds for a tangent. (Intel ended up licensing the Am9511 from AMD and selling it as the 8231.) A 10-MHz 8087, in comparison, could do a tangent in 54 microseconds, operating on an 80-bit floating-point number. Thus, the 8087's performance and accuracy were far superior to previous chips. 

  3. The original IBM PC (1981) had an empty socket on its motherboard for adding an 8087 coprocessor. The large empty socket is visible in the upper left below, above the 8088 microprocessor. A list of applications with support for the 8087 is here.

    Motherboard of the original IBM PC (1981).
Photo from Wikimedia, CC BY-SA 3.0.


  4. I couldn't find the original price for the 8087, but it was expensive. At first, Intel only sold the 8087 as a matched and tested pair with an 8088, due to timing flakiness with the 8087. By 1982, Intel dropped the price of the 8087 to $230, equivalent to about $500 in current dollars. Compared to today's open-source world, it seems strange that customers also had to pay for software support: using the 8087 with the BASIC language cost another $150, while Intel's 8087 development library was $1250. 

  5. The designers of the 8087 commented on the guidance offered by Professor Kahan: "We did not do as well as he wanted, but we did better than he expected." Kahan later received a Turing Award for his work on floating point.  

  6. Processors often include a variety of shift instructions, including rotate operations that shift bits from one end of the word to the other. The 8087 only performs straight shifts, not rotates. 

  7. The shifter handles the 8087's 64-bit fraction, along with three extra bits for rounding accuracy, so it supports 67 bits. Unless I miscounted, the shifter also has an extra bit in the most significant position, making it 68 bits wide. 

  8. Multiplication and division make heavy use of shifting; multiplication is performed by shifts and adds, while division uses shifts and subtracts. However, the 8087 does not use the general-purpose shifter for these operations, but has specialized shifters optimized for these operations. 

  9. In order to pack the wiring as close together as possible, the shifter alternated wires of diffused silicon and wires of polysilicon. In the photo below, the diffused silicon wires are pinkish, while the polysilicon is yellowish. The 8087 was built with Intel's HMOS III process, which required a 4µm spacing for polysilicon and 5µm for diffusion, probably due to the resolution of the photolithography process. However, the spacing between a diffusion line and a polysilicon line could be much smaller, probably because they were created with separate masks and were on separate layers. Thus, alternating diffusion and polysilicon lines could be packed together tightly, saving space.

    Wiring in the byte shifter consists of alternating, tightly-packed silicon and polysilicon lines. The large rectangles on either side are pairs of transistors, controlled by vertical polysilicon lines.

  10. The driver circuitry has a few subtleties. Instead of sending data directly into the shifter, bits are transferred in two steps. First, the shifter lines are pre-charged to a high level. Then, any 1-bit inputs cause the corresponding shifter lines to be pulled low. In other words, the shifter lines are active-low, with a low voltage representing a 1. Since any unused outputs keep their high voltage (a 0 bit), 0 bits are shifted into low bit positions automatically. I think the pre-charge technique also was a better match for NMOS circuitry, which was better at pulling a signal low than pulling it high, so pre-charging the lines helped performance, especially given their relatively high capacitance. The latch between the shifter and the fraction bus prevents an unwanted cycle with the shifted data immediately flowing back into the shifter and getting re-shifted. 

  11. This footnote will clarify the physical shift versus the logical shift. On the die, the fraction circuitry is arranged with the most-significant bit at the bottom. Passing data through the shifter from left to right shifts bits physically downward. This corresponds to a left-shift of a binary number, moving bits to a higher position. In the opposite direction, passing data through the shifter from right to left performs a right-shift of the data. 

  12. The left/right direction also needs to be selected from one of the three shift sources, but I haven't located the circuitry for that yet. 

  13. Each decoder essentially consists of eight NOR gates: seven will be pulled low and only the one with all inputs low will be high. However, it's not implemented as a straightforward logic gate. Instead, all outputs are precharged high, and then the seven undesired outputs are pulled low. This sort of dynamic precharge logic is still used in modern circuits; see the book Synchronous Precharge Logic. The multiplexers are also implemented with precharge logic. 

  14. Intel's x86 processors didn't include a barrel shifter until the 80386 (1985), which provided a 64-bit barrel shifter. Before that, the 8086 and descendants shifted one bit at a time, so shifts by many bit positions were much slower. 

Extracting ROM constants from the 8087 math coprocessor's die

Intel introduced the 8087 chip in 1980 to improve floating-point performance on the 8086 and 8088 processors, and it was used with the original IBM PC. Since early microprocessors operated only on integers, arithmetic with floating-point numbers was slow and transcendental operations such as arctangent or logarithms were even worse. Adding the 8087 co-processor chip to a system made floating-point operations up to 100 times faster.

I opened up an 8087 chip and took photos with a microscope. The photo below shows the chip's tiny silicon die. Around the edges of the chip, tiny bond wires connect the chip to the 40 external pins. The labels show the main functional blocks, based on my reverse engineering. By examining the chip closely, various constants can be read out of the chip's ROM, numbers such as pi that the chip uses in its calculations.

Die of the Intel 8087 floating point unit chip, with main functional blocks labeled. The constant ROM is outlined in green. Click for a larger image.

The top half of the chip contains the control circuitry. Performing a floating-point instruction might require 1000 steps; the 8087 used microcode to specify these steps. The die photo above shows the "engine" that ran the microcode program; it is basically a simple CPU. Next to it is the large ROM that holds the microcode.

The bottom half of the die holds the circuitry that processes floating-point numbers. A floating-point number consists of a fraction (also called significand or mantissa), an exponent, and a sign bit. (For a base-10 analogy, in the number 6.02×10^23, 6.02 is the fraction and 23 is the exponent.) The chip has separate circuitry to process the fraction and the exponent in parallel. The fraction processing circuitry supports 67-bit values, a 64-bit fraction with three extra bits for accuracy. From left to right, the fraction circuitry consists of a constant ROM, a shifter, adder/subtracters, and the register stack. The constant ROM (highlighted in green) is the subject of this post.

The 8087 operated as a co-processor with the 8086 processor. When the 8086 encountered a special floating-point instruction, the processor ignored it and let the 8087 execute the instruction in parallel.1 I won't explain in detail how the 8087 works internally, but as an overview, floating-point operations are implemented using integer adds/subtracts and shifts. To add or subtract two floating-point numbers, the 8087 shifts the numbers until the binary points (i.e. the decimal points but in binary) line up, and then adds or subtracts the fraction. Multiplication, division, and square root are performed through repeated shifts and adds or subtracts. Transcendental operations (tan, arctan, log, power) use CORDIC algorithms, which use shifts and adds of special constants for efficient computation.

Implementation of the ROM

This post describes the ROM that holds constants (not to be confused with the larger, four-level microcode ROM).2 The constant ROM holds the constants (such as pi, ln(2), and sqrt(2)) that the 8087 needs for its computations. The photo below shows part of the constant ROM. The metal layer has been removed to show the silicon underneath. The pinkish regions are silicon doped to have different properties, while the reddish and greenish lines are polysilicon, a special type of silicon wiring layered on top. Note the regular grid structure of the ROM. The ROM consists of two columns of transistors, holding the bits. To explain how the ROM works, I'll start by explaining how a transistor works.

Part of the constant ROM, with the metal layer removed. The three columns of larger transistors are used to select between rows.

High-density integrated circuits in the 1970s were usually built from a type of transistor known as NMOS. (Modern computers are built from CMOS, which consists of NMOS transistors along with opposite-polarity PMOS transistors.) The diagram below shows the structure of an NMOS transistor. An integrated circuit is constructed from a silicon substrate, with transistors built on it. Regions of the silicon are doped with impurities to create "diffusion" regions with desired electrical properties. The transistor can be viewed as a switch, allowing current to flow between two diffusion regions called the source and drain. The transistor is controlled by the gate, made of a special type of silicon called polysilicon. Applying voltage to the gate lets current flow between the source and drain, which is otherwise blocked. The die of the 8087 is fairly complex, with about 40,000 of these transistors.3

Structure of a MOSFET as implemented in an integrated circuit.

Zooming in on the ROM shows the individual transistors. The pinkish regions are the doped silicon, forming transistor sources and drains. The vertical polysilicon select lines form the gates of the transistors. The indicated silicon regions are connected to ground, pulling one side of each transistor low. The circles are connections called vias between the silicon and the metal lines above. (The metal lines have been removed; the orange line shows the position of one.)

A portion of the constant ROM. Each select line selects a particular constant. Transistors are indicated by the yellow symbols. An X indicates a missing transistor, corresponding to a 0 bit. The orange line indicates the position of a metal wire. (The metal layer was dissolved for this picture.)

The important feature of the ROM is that some of the transistors are missing, the first one in the upper row, and two marked with X in the lower row. Bits are programmed into the ROM by changing the silicon doping pattern, creating transistors or leaving insulating regions. Each transistor or missing transistor represents one bit. When a select line is activated, all the transistors in that column will turn on, pulling the corresponding output lines low. But if the transistor is missing from a selected position, the corresponding output line will remain high. Thus, a value is read from the ROM by activating a select line, reading that ROM value onto the output lines.
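
The read operation can be modeled with a small sketch. The ROM contents in the usage example are invented for illustration, not values read from the chip, and the function names are hypothetical. The bit interpretation follows this article's finding that a transistor represents a 1 bit.

```python
HIGH, LOW = 1, 0

def read_rom(transistors, select):
    """Return the voltage level on each output line when one select line is active.

    transistors[s][line] is True where a transistor exists at that position.
    """
    levels = []
    for present in transistors[select]:
        # Lines start high; a transistor that turns on pulls its line to ground.
        levels.append(LOW if present else HIGH)
    return levels

def levels_to_bits(levels):
    """Interpret levels with a transistor (line pulled low) meaning a 1 bit."""
    return [1 if level == LOW else 0 for level in levels]

# Made-up contents: two select lines, four output lines each.
example_rom = [
    [False, True, True, True],
    [True, False, True, False],
]
print(levels_to_bits(read_rom(example_rom, 1)))  # [1, 0, 1, 0]
```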

Contents of the ROM

The constant ROM has 134 rows of 21 columns.5 Under a microscope, the bit pattern of the ROM is visible and can be extracted.4 How to interpret the raw bits is not obvious, though. The first question is if a transistor (versus a gap) indicates a 0 or a 1. (It turns out that a transistor indicates a 1 bit.) The next issue is how to map the 134×21 grid of bits into values.6

The chip's data path consists of 67 horizontal rows, so it seemed pretty clear that the 134 rows in the ROM corresponded to two sets of 67-bit constants. I extracted one set of constants for the odd rows and one for the even rows, but the values didn't make any sense. After more thought, I determined that the rows do not alternate but are arranged in a repeating "ABBA" pattern.7 Using this pattern yielded a bunch of recognizable constants, including pi and 1. Bits from those constants are shown in the diagram below. (In this photo, a 1 bit appears as a green stripe, while a 0 bit appears as a red stripe.) In binary, pi is 11.001001... and this value is visible in the upper labeled bits. The bottom value is the constant 1.8

Bit values labeled in the constant ROM. The top bits are the first part of pi, while the lower bits are the constant 1. This diagram has been rotated 90 degrees compared to the other diagrams. The unlabeled bits form other constants.
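
The "ABBA" de-interleaving can be sketched as follows. This is a minimal illustration, assuming the pattern starts with an A row at index 0 and repeats every four rows; which physical row counts as the first, and which set is called A, are assumptions of the sketch rather than details taken from the die.

```python
def split_abba(rows):
    """Split ROM rows arranged A, B, B, A, A, B, B, A, ... into two row lists."""
    a_rows, b_rows = [], []
    for index, row in enumerate(rows):
        if index % 4 in (0, 3):   # positions 0 and 3 of each group of four
            a_rows.append(row)
        else:                     # positions 1 and 2
            b_rows.append(row)
    return a_rows, b_rows

a_rows, b_rows = split_abba(list(range(134)))
print(len(a_rows), len(b_rows))   # 67 67
```

With 134 rows, each set ends up with 67 rows, matching the 67-bit data path.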

The next difficulty in interpretation is that this ROM holds just the fractional parts of the numbers, not the exponents. (I haven't found the separate exponent ROM yet.) I experimented with various exponents until I got values that were sensible numbers. Some were straightforward: for instance, the constant 1.204120 yielded log10(2) when the exponent 2^-2 was used. Others were harder,9 such as 1.734723. Eventually, I figured out that 1.734723×2^59 is 10^18.10
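
These two identifications are easy to check numerically. The short sketch below uses the rounded fraction values quoted above, so the results agree only to about six or seven digits; the helper name is my own.

```python
import math

def scaled(fraction, exponent):
    """Value of a constant given its fraction and a power-of-two exponent."""
    return math.ldexp(fraction, exponent)   # fraction * 2**exponent

print(scaled(1.204120, -2), math.log10(2))  # both close to 0.30103
print(scaled(1.734723, 59))                 # close to 1e18
```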

The complete table of constants is in the footnotes.11 Physically, the constants are arranged in three groups. The first group is values that the user can load (1, pi, log2(10), log2(e), log10(2), and ln(2))12 along with values used internally (10^18, ln(2)/3, 3*log2(e), log2(e), and sqrt(2)). The second group is sixteen arctan constants, and the third is fourteen log2 constants. The last two groups of constants are used to compute transcendental functions using the CORDIC algorithm, which I will discuss next.

The CORDIC algorithms

The constants in the ROM reveal some details about the algorithms used by the 8087. The ROM contains 16 arctangent values, the arctans of 2^-n. It also contains 14 log values, the base-2 logs of (1+2^-n). These may seem like unusual values, but they are used in an efficient algorithm called CORDIC, which was invented in 1958.

The basic idea of CORDIC is to compute tangent and arctangent by breaking down an angle into smaller angles, and rotating a vector by these angles. The trick is that by carefully choosing the smaller angles, each rotation can be computed with efficient shifts and adds instead of trig functions. Specifically, suppose we want to find tan(z). We can break z into a sum of smaller angles: z ≈ {atan(2^-1) or 0} + {atan(2^-2) or 0} + {atan(2^-3) or 0} + ... + {atan(2^-16) or 0}. Now, rotating a vector by, say, atan(2^-2) can be done by multiplying by 2^-2 and adding. The key thing is that multiplying by 2^-2 is just a fast bit shift. Putting this all together, computing tan(z) can be done by comparing z with the atan constants, and then doing 16 cycles of additions and shifts, which are fast to perform in hardware.13 To make the algorithm work, the atan constants are precomputed and stored in the constant ROM.14
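
The idea can be illustrated with a small Python sketch. This is not the 8087's microcode: it uses ordinary floating-point arithmetic, with a multiplication by 2^-n standing in for the hardware shift, computes the table with `math.atan` instead of reading it from a ROM, and finishes with a plain division. It also ignores the small angle left over after the last step, so it is only accurate to about four decimal places.

```python
import math

# Stand-in for the ROM table: atan(2^-n) for n = 1..16.
ATAN_TABLE = [math.atan(2.0 ** -n) for n in range(1, 17)]

def cordic_tan(z):
    """Approximate tan(z) for 0 <= z <= pi/4 by rotating the vector (1, 0)."""
    x, y = 1.0, 0.0
    remaining = z
    for n, angle in enumerate(ATAN_TABLE, start=1):
        if remaining >= angle:          # use this angle only if it still fits
            remaining -= angle
            factor = 2.0 ** -n          # a right shift by n bits in hardware
            # Rotate by atan(2^-n); the common scale factor cancels in y / x.
            x, y = x - y * factor, y + x * factor
    return y / x

print(cordic_tan(0.5), math.tan(0.5))
```

Each rotation also stretches the vector slightly, but since the tangent is the ratio y/x, that stretch cancels out and no correction is needed in this sketch.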

Computing the base-2 log and base-2 exponential also uses CORDIC algorithms, with the associated logarithmic constants. The key observation is that multiplying by (1 + 2⁻ⁿ) can be done quickly with a shift and addition. By multiplying one side of the equation by the sequence of values, and adding the corresponding log constants to the other side, the log or exponential can be computed.15
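The same idea can be sketched for the base-2 logarithm. This is a generic textbook-style shift-and-add logarithm, not the 8087's documented algorithm (which I couldn't find): the name `shift_add_log2`, the fixed-point width, and the input range of 1 to 2 are assumptions. Note that this sketch also uses the factor (1 + 2⁻¹), which is not among the ROM's constants.

```python
import math

FRAC_BITS = 40          # fixed-point precision; an arbitrary choice for this sketch
ONE = 1 << FRAC_BITS

# Precomputed table of log2(1 + 2^-n), n = 1..15, as fixed-point integers.
LOG_TABLE = {n: round(math.log2(1 + 2.0 ** -n) * ONE) for n in range(1, 16)}

def shift_add_log2(v):
    """Approximate log2(v) for 1 <= v < 2 using compares, shifts, and adds."""
    if not 1 <= v < 2:
        raise ValueError("this sketch only handles 1 <= v < 2")
    target = round(v * ONE)
    product = ONE       # running product of the chosen (1 + 2^-n) factors
    log_sum = 0         # running sum of the matching log constants
    for n in range(1, 16):
        # Multiplying by (1 + 2^-n) is just a shift and an add.
        candidate = product + (product >> n)
        if candidate <= target:
            product = candidate
            log_sum += LOG_TABLE[n]
    return log_sum / ONE

print(shift_add_log2(1.75), math.log2(1.75))
```

The product is pushed toward v from below while the log constants accumulate on the other side; once the product has nearly reached v, the sum is approximately log₂(v).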

The 8087's support for transcendental functions is more limited than you might expect. It only supports tangent and arctangent, not sine or cosine; the user must apply trig identities to compute sine or cosine. Logs and exponentials only support base 2; for base 10 or base e, the user must apply the appropriate scale factor. At the time, the 8087 pushed the limits of what could fit on a chip, so the instruction set was limited to the essentials.
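As an illustration of the extra work left to the user, the following Python sketch derives sine, cosine, and other-base logarithms from just a tangent and a base-2 log. Here `math.tan` and `math.log2` are stand-ins for the coprocessor's primitives, and the function names are hypothetical; the half-angle identities shown are one common choice, not necessarily what any particular 8087 library used.

```python
import math

def sin_cos_from_tan(z):
    """Compute (sin z, cos z) from a single tangent, via half-angle identities."""
    t = math.tan(z / 2)              # stand-in for the tangent primitive
    denom = 1 + t * t
    return 2 * t / denom, (1 - t * t) / denom

def log10_from_log2(x):
    """log10(x) = log10(2) * log2(x); log10(2) is one of the loadable constants."""
    return math.log10(2) * math.log2(x)

def ln_from_log2(x):
    """ln(x) = ln(2) * log2(x); ln(2) is one of the loadable constants."""
    return math.log(2) * math.log2(x)

s, c = sin_cos_from_tan(0.5)
print(s, c, log10_from_log2(1000.0), ln_from_log2(math.e))
```

This also suggests why log₁₀2 and ln 2 are among the constants the user can load: they are exactly the scale factors needed to convert a base-2 log to base 10 or base e.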

Conclusion

The 8087 is a complex chip and at first it looks like a hopeless maze of circuitry. But much of it can be understood with careful study. It contains 42 constants in a ROM, and the values of these constants can be extracted under a microscope. Some of the constants (such as pi) are expected, while others (such as ln(2)/3) are more puzzling. Many of the constants are used for computing the tangent, arctangent, log, and power functions, using fast CORDIC algorithms.

Die photo of the 8087 with the metal layer removed.

Even though Intel's 8087 floating point unit chip was introduced 40 years ago, it still has a large influence today. It spawned the IEEE 754 floating-point standard used for most modern floating-point arithmetic, and the 8087's instructions remain a part of the x86 processors used in most computers.

For more information on the 8087, see my other articles: the two-bit-per-transistor ROM and the substrate bias generator. I announce my latest blog posts on Twitter, so follow me @kenshirriff for future articles. I also have an RSS feed.

Notes and references

  1. The interaction between the 8086 processor and the 8087 floating point unit is somewhat tricky; I'll discuss some highlights. The simplified view is that the 8087 watches the 8086's instruction stream, and executes any instructions that are 8087 instructions. The complication is that the 8086 has an instruction prefetch buffer, so the instruction being fetched isn't the one being executed. Thus, the 8087 duplicates the 8086's prefetch buffer (or the 8088's smaller prefetch buffer), so it knows what the 8086 is doing. (A Twitter thread discusses this in detail.) Another complication is the complex addressing modes used by the 8086, which use registers inside the 8086. The 8087 can't perform these addressing modes since it doesn't have access to the 8086 registers. Instead, when the 8086 sees an 8087 instruction, it does a memory fetch from the addressed location and ignores the result. Meanwhile, the 8087 grabs the address off the bus so it can use the address if it needs it. If there is no 8087 present, you might expect a trap, but that's not what happens. Instead, for a system without an 8087, the linker rewrites the 8087 instructions, replacing them with subroutine calls to the emulation library.

  2. The 8087's microcode ROM is built with an unusual technique that stores two bits per transistor. It does this by using three different transistor sizes or no transistor in each position. The four possibilities at each position represent two bits. This complex technique was necessary in order to fit the large ROM onto the 8087 die. I wrote a blog post with more details. The constant ROM, in comparison, is built using standard techniques. 

  3. Sources provide inconsistent values for the number of transistors in the 8087: Intel claims 40,000 transistors while Wikipedia claims 45,000. The discrepancy could be due to different ways of counting transistors. In particular, since the number of transistors in a ROM, PLA or similar structure depends on the data stored in it, sources often count "potential" transistors rather than the number of physical transistors. Other discrepancies can be due to whether or not pull-up transistors are counted and if high-current drivers are counted as multiple transistors in parallel or one large transistor. 

  4. Instead of copying bits from the ROM by hand, I made a simple JavaScript program to help me read out the ROM. I clicked on the ROM image to indicate each transistor, and the program produced the corresponding pattern of 0's and 1's. 

  5. The ROM has 134 rows of 21 bits, except there is a 6×6 chunk missing from the upper left. Thus, the physical size of the constant ROM is 2778 bits (134×21 − 36).

    The upper-left corner of the constant ROM, showing the missing 6×6 section.

    Because of the ROM layout, this missing section means that the first 12 constants are 64 bits long, rather than 67 bits. These are the non-CORDIC constants, which apparently don't require the extra bits for accuracy. 

  6. There are two ways to determine the encoding of the bits. The first is to trace out the circuitry that reads from the ROM and examine how the data is used. The second is to look for patterns in the raw data, and determine what makes sense for an encoding. Since the 8087 is very complex, I wanted to avoid a full reverse-engineering to understand the constants and I used the second approach. 

  7. The organization of the rows follows the pattern ABBAABBAABBA..., where "A" rows hold bits for one set of constants and "B" rows hold bits for the second set of constants. This layout was probably used instead of alternating rows ("ABAB") because one connection can drive two neighboring selection transistors. That is, each "AA" or "BB" group can be selected with one wire. 

  8. A bit more trial-and-error was necessary to pull the values out of the ROM. I determined three key factors. First, the bits started at the bottom of the ROM, going up. Second, a transistor indicated a 1, rather than a 0. Third, the constants did not have an implicit 1 bit at the beginning. (In other words, the constant format does not match the external data format used by the 8087.) 

  9. Some of the exponents were tricky to determine. I used brute force for some of them, seeing if any exponent would yield the log or power of some number. One of the hardest numbers to figure out was ln(2)/3; I'm not sure why this value is important. 

  10. Why does the 8087 contain the constant 10¹⁸? Probably because the 8087 supports a packed BCD datatype holding 18 digits, so it can hold up to 10¹⁸.

  11. The following table summarizes the contents of the constant ROM. The "meaning" column is my interpretation of the number.

    | Constant | Decimal value | Meaning |
    |---|---|---|
    | 1.204120×2⁻² | 0.3010300 | log₁₀(2) |
    | 1.386294×2⁻¹ | 0.6931472 | ln(2) |
    | 1.442695×2⁰ | 1.4426950 | log₂(e) |
    | 1.570796×2¹ | 3.1415927 | Pi |
    | 1.000000×2⁰ | 1.0000000 | 1 |
    | 1.660964×2¹ | 3.3219281 | log₂(10) |
    | 1.734723×2⁵⁹ | 1.000e+18 | 10¹⁸ |
    | 1.734723×2⁵⁹ | 1.000e+18 | 10¹⁸ |
    | 1.848392×2⁻³ | 0.2310491 | ln(2)/3 |
    | 1.082021×2² | 4.3280851 | 3*log₂(e) |
    | 1.442695×2⁰ | 1.4426950 | log₂(e) |
    | 1.414214×2⁰ | 1.4142136 | sqrt(2) |
    | 1.570796×2⁻¹ | 0.7853982 | atan(2⁰) |
    | 1.854590×2⁻² | 0.4636476 | atan(2⁻¹) |
    | 2.000000×2⁻¹⁵ | 0.0000610 | atan(2⁻¹⁴) |
    | 2.000000×2⁻¹⁶ | 0.0000305 | atan(2⁻¹⁵) |
    | 1.959829×2⁻³ | 0.2449787 | atan(2⁻²) |
    | 1.989680×2⁻⁴ | 0.1243550 | atan(2⁻³) |
    | 2.000000×2⁻¹³ | 0.0002441 | atan(2⁻¹²) |
    | 2.000000×2⁻¹⁴ | 0.0001221 | atan(2⁻¹³) |
    | 1.997402×2⁻⁵ | 0.0624188 | atan(2⁻⁴) |
    | 1.999349×2⁻⁶ | 0.0312398 | atan(2⁻⁵) |
    | 1.999999×2⁻¹¹ | 0.0009766 | atan(2⁻¹⁰) |
    | 2.000000×2⁻¹² | 0.0004883 | atan(2⁻¹¹) |
    | 1.999837×2⁻⁷ | 0.0156237 | atan(2⁻⁶) |
    | 1.999959×2⁻⁸ | 0.0078123 | atan(2⁻⁷) |
    | 1.999990×2⁻⁹ | 0.0039062 | atan(2⁻⁸) |
    | 1.999997×2⁻¹⁰ | 0.0019531 | atan(2⁻⁹) |
    | 1.441288×2⁻⁹ | 0.0028150 | log₂(1+2⁻⁹) |
    | 1.439885×2⁻⁸ | 0.0056245 | log₂(1+2⁻⁸) |
    | 1.437089×2⁻⁷ | 0.0112273 | log₂(1+2⁻⁷) |
    | 1.431540×2⁻⁶ | 0.0223678 | log₂(1+2⁻⁶) |
    | 1.442343×2⁻¹¹ | 0.0007043 | log₂(1+2⁻¹¹) |
    | 1.441991×2⁻¹⁰ | 0.0014082 | log₂(1+2⁻¹⁰) |
    | 1.420612×2⁻⁵ | 0.0443941 | log₂(1+2⁻⁵) |
    | 1.399405×2⁻⁴ | 0.0874628 | log₂(1+2⁻⁴) |
    | 1.442607×2⁻¹³ | 0.0001761 | log₂(1+2⁻¹³) |
    | 1.442519×2⁻¹² | 0.0003522 | log₂(1+2⁻¹²) |
    | 1.359400×2⁻³ | 0.1699250 | log₂(1+2⁻³) |
    | 1.287712×2⁻² | 0.3219281 | log₂(1+2⁻²) |
    | 1.442673×2⁻¹⁵ | 0.0000440 | log₂(1+2⁻¹⁵) |
    | 1.442651×2⁻¹⁴ | 0.0000881 | log₂(1+2⁻¹⁴) |

    It's clear from the CORDIC constants that the values in the ROM are not physically stored in order, i.e. sequential rows are not addressed in order. I'm not sure why 10¹⁸ appears twice; probably one exponent is different. The binary exponents are not in the ROM that I examined, so I had to estimate them.

  12. The 8087 provides seven instructions to load constants directly. The instructions FLDZ, FLD1, FLDPI, FLDL2T, FLDL2E, FLDLG2, and FLDLN2 load onto the stack the constants 0, 1, pi, log₂10, log₂e, log₁₀2, and ln 2, respectively. Apart from 0, these constants can be found in the ROM.

  13. The 8087's CORDIC algorithm is described in Implementation of transcendental functions on a numerics processor. I wrote sample tangent code based on that description here. There are also a couple of multiplications and divisions in the 8087's full tan algorithm. It uses a simple rational approximation of tangent on the "leftover" angle, giving it a bit more accuracy than straight CORDIC. 

  14. Computing the arctangent of an angle uses an algorithm that is similar to the tangent algorithm, but in reverse: as rotations are performed, the angles (from the constant ROM) are summed up to yield the resulting angle. 

  15. I couldn't find documentation on the 8087's log and exponent algorithms. I think the algorithms are very similar to the ones on this page, except the 8087 uses base 2 instead of base e. I'm a bit puzzled why the 8087 doesn't need the constant log₂(1 + 2⁻¹), which is used by that algorithm.