Showing posts with label intel. Show all posts
Showing posts with label intel. Show all posts

Inside the die of Intel's 8087 coprocessor chip, root of modern floating point

Looking inside the Intel 8087, an early floating point chip, I noticed an interesting feature on the die: the substrate bias generation circuit. In this article I explain how this circuit is implemented, using analog and digital circuitry to create a negative voltage.

Intel introduced the 8087 chip in 1980 to improve floating-point performance on 8086/8088 computers such as the original IBM PC. Since early microprocessors were designed to operate on integers, arithmetic on floating point numbers was slow, and transcendental operations such as trig or logarithms were even worse. But the 8087 co-processor greatly improved floating point speed, up to 100 times faster. The 8087's architecture became part of later Intel processors, and the 8087's instructions are still a part of today's x86 desktop computers.1

I opened up an 8087 chip and took die photos with a microscope yielding the composite photo below. The die of the 8087 is fairly complex, with 40,000 transistors (according to Intel) or 45,000 transistors (according to Wikipedia). The photo shows the metal layer of the chip, the connections on top of the chip. The thickest white lines provide power and ground connections to all parts of the chip. Hidden underneath the metal are the polysilicon and silicon that form the chip's transistors. (Click the photo for a large image.)

Die photo of the Intel 8087 floating point coprocessor chip.

Die photo of the Intel 8087 floating point coprocessor chip.

The bottom half of the chip holds the 80 bit wide arithmetic circuitry: an adder, shifters, mathematical constant storage and registers. The large rectangle in the middle of the chip is the microcode that controls the chip. At the top is control logic and bus circuitry that interfaced with the 8086 processor. (I'll discuss the inner workings of the 8087 in more detail in later blog posts.)

The black lines around the outside of the die photo are the tiny bond wires connecting the pads on the die to the 40 pins of the chip. By studying the 8087 datasheet, it's not too hard to figure out which pad on the die corresponds to each pin of the chip; the chip's 40 pins (numbered counterclockwise) are wired in order to 40 pads on the chip. The diagram below zooms in on the center right part of the die, labeling some of the pads. (Note that the ground and +5V power (Vcc) pads have multiple wires in parallel to carry more current.) However, one puzzle appeared—an extra pad and wire located between pads 40 and 1, not associated with any of the chip's pins.

Each pad on the die of the 8087 FPU chip is wired to one of the 40 pins of the chip. But there is one extra wire between pins 1 and 40. It is connected to the chips's substrate.

Each pad on the die of the 8087 FPU chip is wired to one of the 40 pins of the chip. But there is one extra wire between pins 1 and 40. It is connected to the chips's substrate.

Looking at the bond wires on the chip (below) revealed that the mystery pad wasn't connected to one of the pins but to a tiny cubical block to the right of the die. Since the cube is on the same metallic base as the die, it connects to the die's underlying silicon, the substrate. I did some reverse-engineering and determined that this is part of the 8087's substrate bias circuit, which uses this connection to put a negative voltage on the substrate. The rest of this blog post explains how this circuit works.

The die of the 8087 FPU chip, showing the bond wires from the die to the package.

The die of the 8087 FPU chip, showing the bond wires from the die to the package.

What is substrate bias?

High-density integrated circuits in the 1970s were usually built from NMOS transistors. The diagram below shows the structure of an NMOS transistor. The integrated circuit starts with a silicon substrate, and transistors are built on this. Regions of the silicon are doped with impurities to create diffusion regions with desired properties. The transistor can be viewed as a switch, allowing current to flow between two diffusion regions called the source and drain. The transistor is controlled by the gate, made of a special type of silicon called polysilicon. A high signal voltage on the gate lets current flow between the source and drain, while a low signal voltage blocks current flow. An insulating oxide layer separates the gate from the silicon underneath; this insulating layer will be important later. These tiny transistors can be combined to form logic gates, the components of microprocessors and other digital chips.

Structure of a MOSFET as implemented in an integrated circuit.

Structure of a MOSFET as implemented in an integrated circuit.

For high-performance integrated circuits, it was beneficial to apply a negative "bias" voltage to the substrate. 2 To obtain this substrate bias voltage, many chips in the 1970s had an external pin that was connected to -5V.3 However, engineers didn't like chips that required an inconvenient extra voltage. Even worse, chips of that era often required a third voltage,4 so systems required three power supplies to support these chips. In addition, the number of pins on ICs was limited (typically just 18 pins for memory chips), so using up two pins for extra voltages was unfortunate. Part of the solution, developed around the end of the 1970s, was for chips to generate the negative bias voltage internally. The result was chips that used a single convenient +5V supply, making engineers happier.

Inside the 8087's substrate bias circuit

You might wonder how a chip can turn a positive voltage into a negative voltage. The answer is a circuit called the charge pump, which uses capacitors to generate the desired voltage.

The 8087's bias generator has two charge pumps working in alternation. The schematics below show the operation of one of the charge pumps. The charge pump is driven by an oscillating signal (Q) and its inverse (Q). In the first step, the upper transistor is switched on, causing the capacitor to charge to 5 volts with respect to ground. The second step is where the magic happens. The lower transistor turns on, connecting the high side of the capacitor to ground. Since the capacitor is still charged to 5 volts, the low side of the capacitor must now be at -5 volts, producing the desired negative voltage at the output. When the oscillator flips again, the upper transistor is turned on and the cycle repeats.5 The charge pump gets its name because it pumps charge from the output to ground. If you view the diodes as check valves, the charge pump is analogous to a manual water pump.

Schematic of the charge pump used in the Intel 8087 to provide negative substrate bias.

Schematic of the charge pump used in the Intel 8087 to provide negative substrate bias.

To reverse engineer the charge pump circuitry, I examined the die with a microscope. The metal layer obscures the transistors underneath, making it difficult to see the circuitry. But by dissolving the metal layer with hydrochloric acid, I exposed the polysilicon and silicon layers, revealing the transistors and capacitors, as seen below. (The colorful regions are simply interference patterns due to some oxide that wasn't fully removed.) The die photo below shows the two charge pumps: one to the left of the pad, and one below. Each charge pump matches the schematic above, with two diodes, a large capacitor, and two drive transistors.

The substrate bias circuit of the 8087. The metal layer has been removed in this die photo.

The substrate bias circuit of the 8087. The metal layer has been removed in this die photo.

The capacitors are the most visible feature of the substrate bias circuitry. Although microscopic, they are huge by chip standards. The area used by the capacitors is about the same as 72 bits of register storage, over 400 transistors. Each capacitor consists of polysilicon over a silicon region, separated by insulating oxide; the polysilicon and silicon form the plates of the capacitor. In the photo, the capacitors are studded with squares; these squares are contacts between the polysilicon or silicon and the metal layer on top. (The metal layer is not visible as it was removed.)

The four drive transistors are much larger than regular transistors since they must handle high current. The red lines are the polysilicon wires forming the gates. The green lines are contacts to the metal layer, connecting the transistors to +5V or ground. The diodes next to the pad are formed from transistors by connecting the gate and drain together (details).

The charge pumps are driven by the ring oscillator at the bottom of the above image. This ring oscillator consists of five inverters in a loop as shown below. Because the number of inverters is odd, the system is unstable and will oscillate. For instance, if the input to the first inverter is 0, the output from the fifth inverter will be 1. This will flip the first inverter, and the "flip" will travel through the loop causing oscillation. To slow down the oscillation rate, two resistor-capacitor networks are inserted into the ring. Since the capacitors will take some time to charge and discharge, the oscillations will be slowed, giving the charge pump time to operate.

The ring oscillator circuit in the 8087's charge pump.

The ring oscillator circuit in the 8087's charge pump.

Before explaining the ring oscillator, I'll show how a standard NMOS inverter is implemented in silicon. The diagram below shows an inverter, its schematic, and how it appears on the die. The inverter uses a transistor and a pull-up resistor (which is really a transistor). If the input is low, the transistor is off and the pull-up resistor pulls the output to +5V. If the input is high, the transistor is on, pulling the output to ground. Thus, the circuit inverts the input.

How an inverter is implemented with NMOS logic, and how it appears on the chip die.

How an inverter is implemented with NMOS logic, and how it appears on the chip die.

In the die photo above, the inverter's physical layout matches the schematic. The large beige regions are doped silicon. The thinner yellow areas bordered with purple are polysilicon. The input is a polysilicon wire. Where it crosses the doped silicon it forms the gate of a transistor between ground (below the input) and the output (above the input). The pull-up resistor is implemented with a transistor that has the gate and drain tied together; the indicated contact forms this connection between the transistor's polysilicon gate and its silicon drain. The polysilicon also forms the output wire. Thus, an inverter is implemented on the chip with two transistors.

The ring oscillator in the 8087 FPU chip, as seen on the die.

The ring oscillator in the 8087 FPU chip, as seen on the die.

The photo above shows how the ring oscillator appears on the die. The five inverters are outlined. Each inverter has a different orientation to optimize the layout, but careful examination shows the same transistor and pull-up structure explained above. The resistors and capacitors for the R-C delays are also indicated. The resistors are simply transistors with a long distance between source and drain, reducing the current flow. These capacitors are constructed like the charge pump capacitors, but are much smaller; the silicon on the bottom and the polysilicon on top form the capacitor plates, separated by the thin insulating oxide layer.

Conclusions

The substrate bias generator on the 8087 chip is an interesting combination of digital circuitry (a ring oscillator formed from inverters) and an analog charge pump. Substrate bias generator circuits were introduced in the late 1970s, helping memory chips and microprocessors to operate from a single +5V supply, much more convenient than requiring three different voltages. The substrate bias generator produces a negative voltage from the positive supply voltage by using a charge pump.

While the bias generator may seem like an obscure part of 1970s computer history, bias generation is still part of modern integrated circuits but has become much more complex, with multiple carefully regulated biases in multiple power domains. There is even a standard (IEEE 1801 power format) that allows IC design tools to generate the necessary circuitry.6

Likewise, even though Intel's 8087 floating point unit chip was introduced 38 years ago, it still has a large impact today. It spawned the IEEE 754 floating point standard used for most modern floating point arithmetic, and the 8087's instructions remain a part of the x86 processors used in most computers.

I'll end with one more 8087 die photo; this one shows the polysilicon and silicon after stripping off the metal. You may recognize the substrate bias generator circuit at the center right. (Click for a large image.)

Die photo of the Intel 8087 floating point unit. The metal layer has been stripped off with acid, revealing the polysilicon and silicon underneath.

Die photo of the Intel 8087 floating point unit. The metal layer has been stripped off with acid, revealing the polysilicon and silicon underneath.

I announce my latest blog posts on Twitter, so follow me at @kenshirriff for future 8087 articles. I also have an RSS feed. Thanks to Ed Spittles and Eric Smith for comments.

Notes and references

  1. The 8087 introduced a bunch of new instructions to the 8086, such as FADD (floating add), FDIV (floating divide) and FPTAN (tangent). These instructions were implemented using the 8086's ESC "escape" instruction, which was designed to let the 8086 processor interact with a coprocessor.

    The 8087 led to the IEEE 754 floating point standard in 1985; this defines the floating point used by most computers today. For more information on how the 8087 works, see The Intel 8087 numeric data processor by John Palmer or The 8087 Primer

  2. Putting a negative bias voltage on the substrate had several benefits. It decreased parasitic capacitance making the chip faster, made the transistor threshold voltage more predictable, and reduced leakage current

  3. Early DRAM memory chips and microprocessor chips often required three supplies: +5V (Vcc), +12V (Vdd) and -5V (Vbb) bias voltage. In the late 1970s, improvements in chip technology allowed a single supply to be used instead. For example, Mostek's MK4116 (a 16 kilobit DRAM from 1977) required three voltages while the improved MK4516 (1981) operated on a single +5V supply, simplifying hardware designs. (Amusingly, some of these chips still kept the Vbb and Vcc pins for backwards compatibility but left them unconnected.) Intel's memory chips followed a similar path, with the 2116 DRAM (16K, 1977) using three voltages and the improved 2118 (1979) using a single voltage. Similarly, the famous Intel 8080 microprocessor (1974) used enhancement-mode transistors and required three voltages. An improved version, the 8085 (1976), used depletion-mode transistors and was powered by a single +5V supply. The Motorola 6800 microprocessor (1974) used a different approach for a single supply; although the 6800 was built from the older enhancement-load transistors it avoided the +12 supply by implementing an on-chip voltage doubler, a charge pump that increased the voltage. 

  4. The third (+12V) supply in old chips is unrelated to the substrate bias. This supply was used because early MOS integrated circuits used enhancement-mode transistors as pull-up loads in gates. These transistors couldn't pull signals all the way up to the +5V level, so chips added an an even higher +12V supply. In the mid 1970's, new technology (ion implantation) allowed the creation of depletion-load transistors, which functioned much better as pull-up loads and eliminated the need for the +12V supply. For details, see Wikipedia, StackExchange and VLSI design techniques for analog and digital circuits page 539. 

  5. I've simplified the charge pump discussion slightly. Due to voltage drops in the transistors, the substrate voltage will probably be around -3V, not -5V. (If a chip requires a larger voltage drop, charge pump stages can be cascaded.) For the pump direction, I'm referring to current flow. If you think of it as pumping electrons, the negative electrons are being pumped the opposite direction, into the substrate. 

  6. Bias generators are now available as IP blocks that can be licensed and be plugged into a chip design. For more information on bias in modern chips, see Body bias, Multi bias domain implementation, or this presentation

Inside Intel's first product: the 3101 RAM chip held just 64 bits

Intel's first product was not a processor, but a memory chip: the 31011 RAM chip, released in April 1969. This chip held just 64 bits of data (equivalent to 8 letters or 16 digits) and had the steep price tag of $99.50.2 The chip's capacity was way too small to replace core memory, the dominant storage technology at the time, which stored bits in tiny magnetized ferrite cores. However, the 3101 performed at high speed due to its special Schottky transistors, making it useful in minicomputers where CPU registers required fast storage. The overthrow of core memory would require a different technology—MOS DRAM chips—and the 3101 remained in use in the 1980s.3

This article looks inside the 3101 chip and explains how it works. I received two 3101 chips from Evan Wasserman and used a microscope to take photos of the tiny silicon die inside.4 Around the outside of the die, sixteen black bond wires connect pads on the die to the chip's external pins. The die itself consists of silicon circuitry connected by a metal layer on top, which appears golden in the photo. The thick metal lines through the middle of the chip power the chip. The silicon circuitry has a grayish-purple color, but it largely covered by the metal layer. Most of the chip contains a repeated pattern: this is the 16x4 array of storage cells. In the upper left corner of the chip, the digits "3101" in metal identify the chip, but "Intel" is not to be found.

Die photo of the Intel 3101 64-bit RAM chip. Click for a larger image.

Die photo of the Intel 3101 64-bit RAM chip. Click for a larger image.

Overview of the chip

The 3101 chip is controlled through its 16 external pins. To select one of the chip's 16 words of memory, the address in binary is fed into the chip through the four address pins (A0 to A3). Memory is written by providing the 4-bit value on the data input pins (D1 to D4). Four data output pins (O1 to O4) are used to read memory; these pins are inverted as indicated by the overbar. The chip has two control inputs. The chip select pin (CS) enables or disables the chip. The write enable pin (WE) selects between reading or writing the memory. The chip is powered with 5 volts across the Vcc and ground pins.

The diagram below shows how the key components of the 3101 are arranged on the die. The RAM storage cells are arranged as 16 rows of 4 bits. Each row stores a word, with bits D1 and D2 on the left and D3 and D4 on the right. The address decode logic in the middle selects which row of storage is active, based on the address signals coming from the address drivers at the top. At the bottom, the read/write drivers provide the interface between the storage cells and the data in and out pins.

Block diagram of the 3101 RAM chip.

Block diagram of the 3101 RAM chip.

Transistors

Transistors are the key components in a chip. The 3101 uses NPN bipolar transistors, different from the MOS transistors used in modern memory chips. The diagram below shows one of the transistors in the 3101 as it appears on the die. The slightly different tints in the silicon indicate regions that have been doped to form N and P type silicon with different semiconductor properties. The cross-section diagram illustrates the internal structure of the transistor. On top (black) are the metal contacts for the collector (C), emitter (E), and base (B). Underneath, the silicon has been doped to form the N and P regions that make up the transistor.

A key innovation of the 3101 was using Schottky transistors (details), which made the 3101 almost twice as fast as other memory chips.5 In the cross section, note that the base's metal contact touches both the P and N regions. You might think this shorts the two regions together, but instead a Schottky diode is formed where the metal contacts the N layer.6

The structure of an NPN Schottky transistor inside the Intel 3101 chip.

The structure of an NPN Schottky transistor inside the Intel 3101 chip.

The 3101 also used many multiple-emitter transistors. While a multiple-emitter transistors may seem strange, they are common in bipolar integrated circuits, especially TTL logic chips. A multiple-emitter transistor simply has several emitter regions embedded in the base region. The die photo below shows one of these transistors with the collector on the left, followed by the base and two emitters.

A multiple-emitter transistor from the Intel 3101 chip.

A multiple-emitter transistor from the Intel 3101 chip.

Driving the data output pins requires larger, high-current transistors. The image below shows one of these transistors. The central rectangle is the base, surrounded by the C-shaped emitter in the middle and the large collector on the outside. Eight of these high-current transistors are also used to drive the internal address select lines.

For the high-current output, the Intel 3101 chip uses larger transistors.

For the high-current output, the Intel 3101 chip uses larger transistors.

Diodes

While examining the 3101 chip, I was surprised by the large number of diodes on the chip. Eventually I figured out that the chip used DTL (diode-transistor logic) for most of its logic rather than TTL (transistor-transistor logic) that I was expecting. The diagram below shows one of the diodes on the chip. I believe the chip builds diodes using the standard technique of connecting an NPN transistor as a diode.

Presumed structure of a diode inside the 3101 chip. I believe this is a regular diode, not a Schottky diode.

Presumed structure of a diode inside the 3101 chip. I believe this is a regular diode, not a Schottky diode.

Resistors

The die photo below shows several resistors on the 3101 die. The long, narrow snaking regions of p-type silicon provide resistance. Resistors in integrated circuits are inconveniently large, but are heavily used in the 3101 for pull-up and pull-down resistors. At the right is a square resistor, which has low resistance because it is very wide.7 It is used to route a signal under the metal layer, rather than functioning a resistor per se.

Resistors inside the 3101 chip.

Resistors inside the 3101 chip.

The static RAM cell

Now that I've explained the individual components of the chip, I'll explain how the circuitry is wired together for storage. The diagram below shows the cell for one bit of storage with the circuit diagram overlaid. Each cell consists of two multi-emitter transistors (outlined in red) and two resistors (at the top). The horizontal and vertical wiring connects cells together. This circuit forms a static RAM cell, basically a latch that can be in one of two states, storing one data bit.

The circuitry of one storage cell of the 3101 RAM chip. The two multiple-emitter transistors are outlined in red.

The circuitry of one storage cell of the 3101 RAM chip. The two multiple-emitter transistors are outlined in red.

Before explaining how this storage cell works, I'll explain a simpler latch circuit, below. This circuit has two transistors cross-connected so if one transistor is on, it forces the other off. In the diagram, the left transistor is on, which keeps the right transistor off, which keeps the left transistor on. Thus, the circuit will remain in this stable configuration. The opposite state—with the left transistor off and the right transistor on—is also stable. Thus, the latch has two stable configurations, allowing it to hold a 0 or a 1.

A simple latch circuit. The transistor on the left is on, forcing the transistor on the right off, forcing the transistor on the left off...

A simple latch circuit. The transistor on the left is on, forcing the transistor on the right off, forcing the transistor on the left off...

To make this circuit usable—so the bit can be read or modified—more complex transistors with two emitters are used. One emitter is used to select which cell to read or write, while the other emitter is used for the read or write data. This yields the schematic below, which matches the storage cell die photo diagram above.

The RAM cell used in the Intel 3101 is based on multiple-emitter transistors. The row select lines are raised to read/write the row of cells. Each data line accesses a column of cells.

The RAM cell used in the Intel 3101 is based on multiple-emitter transistors. The row select lines are raised to read/write the row of cells. Each data line accesses a column of cells.

Multiple storage cells are combined into a grid to form the memory memory. One word of memory consists of cells in the same row that share select lines. All the cells in a column store the same bit position; their data lines are tied together. (The bias line provides a voltage level to all cells in the memory.8)

Note that unlike the simplified cell, the circuit above doesn't have an explicit ground connection; to be powered, it requires a low input on either the select or data/bias lines. There are three cases of interest:

  • Unselected: If the negative row select line is low, current flows out through the row select line. The data and bias lines are unaffected by this cell.
  • Read: If the negative row select line is higher than the data and bias lines, current will flow out the data line if the left transistor is on, and out the bias line if the right transistor is on. Thus, the state of the cell can be read by examining the current on the data line.
  • Write: If the negative row select line is higher and the data and bias lines have significantly different voltages, the transistor on the lower side will switch on, forcing the cell into a particular state. This allows a 0 or 1 to be written to the cell.

Thus, by carefully manipulating the voltages on the select lines, data lines and the bias line, one row of memory can be read or written, while the other cells hold their current value without influencing the data line. The storage cell and the associated read/write circuitry are essentially analog circuits rather than digital since the select, data, and bias voltages must be carefully controlled voltages rather than logic levels.

The address decode logic

The address decode circuitry determines which row of memory cells is selected by the address lines.11 The interesting thing about this circuitry is that you can easily see how it works just by looking at the die photo. The address driver circuitry sends the four address signals along with their complements on eight metal traces through the chip. Each storage row has a four-emitter transistor. In each row you can see four black dots, which are the connections between emitters and address lines. A row will be selected if all the emitter inputs are high.9 A dot on an address line (e.g. A0) will "match" a 1, while a dot on the complemented address line (e.g. A0) will match a 0, so each row matches a unique four-bit address. In the die photo below, you can see the decoding logic counting down in binary for rows 15 down to 11;10 the remainder of the circuit follows the same pattern.

The address decode logic in the Intel 3101 RAM chip. Each row decodes matches four address lines to decode one of the 16 address combinations. You can see the value counting down in binary.

The address decode logic in the Intel 3101 RAM chip. Each row decodes matches four address lines to decode one of the 16 address combinations. You can see the value counting down in binary.

Some systems that used the 3101

The 64-bit storage capacity of the 3101 was too small for a system's main memory, but the chip had a role in many minicomputers. For example, the Burroughs D Machine was a military computer (and the source of the chips I examined). It used core memory for its main storage, but a board full of 3101 chips provided high-speed storage for its microcode. The Xerox Alto used four 3101 chips to provide 16 high-speed registers for the CPU, while the main memory used slower DRAM chips. Interdata used 3101 chips in many of its 16- and 32-bit minicomputers up until the 1980s.12

The 3101 was also used in smaller systems. The Diablo 8233 terminal used them as RAM.13 The Datapoint 2200 was a "programmable terminal" that held its processor stack in fast 3101 chips rather than the slow main memory which was built from Intel 1405 shift registers.

The CPU of the Datapoint 2200 computer was built from a board full of TTL chips. The four white chips in the lower center-right are Intel 3101 RAM chips holding the stack. Photo courtesy of Austin Roche (I think).

The CPU of the Datapoint 2200 computer was built from a board full of TTL chips. The four white chips in the lower center-right are Intel 3101 RAM chips holding the stack. Photo courtesy of Austin Roche (I think).

How I created the die photos

To get the die photos, I started with two chips that I received thanks to Evan Wasserman and John Culver. The pins on the chips had been crushed in the mail, but this didn't affect the die photos. The chips had two different lot numbers that indicate they were manufactured a few months apart. Strangely, the metal lids on the chips were different sizes and the dies were slightly different. For more information, see the CPU Shack writeup of the 3101.

Two 3101 RAM chips. The chip on the right was manufactured slightly later and has a larger lid over the die.

Two 3101 RAM chips. The chip on the right was manufactured slightly later and has a larger lid over the die.

Popping the metal lid off the chips was easy—just a tap with a hammer and chisel. This revealed the die inside.

With the lid removed, you can see the die of the 3101 RAM chip and the bond wires connected to the die.

With the lid removed, you can see the die of the 3101 RAM chip and the bond wires connected to the die.

Using a metallurgical microscope and Hugin stitching software (details), I stitched together multiple microscope photos to create an image of the die. The metal layer is clearly visible, but it obscures the silicon underneath, making it hard to determine the chip's circuitry. The photo below shows a closeup of the die showing the "3101" part number.

The die photo of the Intel 3101 shows mostly the metal layer.

The die photo of the Intel 3101 shows mostly the metal layer.

I applied acid14 to remove the metal layer. This removed most of the metal, revealing the silicon circuitry underneath. Some of the metal is still visible, but thinner, appearing transparent green. Strangely, the number 3101 turned into 101; apparently the first digit wasn't as protected by oxide as the other digits.

Treatment with acid dissolved most of the metal layer of the 3101 chip, revealing the silicon circuits underneath.

Treatment with acid dissolved most of the metal layer of the 3101 chip, revealing the silicon circuits underneath.

Below is the complete die photo of the chip with the metal layer partially stripped off. (Click it for a larger version.) This die photo was most useful for analyzing the chip. Enough of the metal was removed to clearly show the silicon circuits, but the remaining traces of metal showed most of the wiring. The N+ silicon regions appear to have darkened in this etch cycle.

Die photo of the Intel 3101 64-bit RAM chip with metal layer partially stripped off.

Die photo of the Intel 3101 64-bit RAM chip with metal layer partially stripped off.

I wanted to see how the chip looked with the metal entirely removed so I did a second etch cycle. Unfortunately, this left the die looking like it had been destroyed.

After dissolving most of the oxide layer, the die looks like a mess. (This is a different region from the other photos.)

After dissolving most of the oxide layer, the die looks like a mess. (This is a different region from the other photos.)

I performed a third etch cycle. It turns out that the previous etch hadn't destroyed the die, but just left a thin layer of oxide that caused colored interference bands. The final etch removed the remaining oxide, leaving a nice, clean die. Only a ghost of the "101" number is visible. The contacts between the metal layer and the silicon remained after the etch; they may be different type of metal that didn't dissolve.

The metal and oxide have been completely removed from the 3101 die, showing the silicon layer.

The metal and oxide have been completely removed from the 3101 die, showing the silicon layer.

Below is the full die photo with all the metal stripped off. (Click it for a full-size image.)

Die photo of the Intel 3101 64-bit RAM chip with metal layer stripped off.

Die photo of the Intel 3101 64-bit RAM chip with metal layer stripped off.

Conclusion

The 3101 RAM chip illustrates the amazing improvements in integrated circuits driven by Moore's Law.15 While the 3101 originally cost $99.50 for 64 bits, you can now buy 16 gigabytes of RAM for that price, two billion times as much storage. If you built a 16 GB memory from two billion 3101 chips, the chips alone would weigh about 3000 tons and use over a billion watts, half of Hoover Dam's power. A modern 16GB DRAM module, in comparison, uses only about 5 watts.

As for Intel, the 3101 RAM was soon followed by many other memory products with rapidly increasing capacity, making Intel primarily a memory company that also produced processors. However, facing strong competition from Japanese memory manufacturers, Intel changed its focus to microprocessors and abandoned the DRAM business in 1985.16 By 1992, the success of the x86 processor line had made Intel the largest chip maker, justifying this decision. Even though Intel is now viewed as a processor company, it was the humble 3101 memory chip that gave Intel its start.

Thanks to Evan Wasserman and John Culver for sending me the chips. John also did a writeup of the 3101 chip, which you can read at CPU Shack.

Notes and references

  1. You might wonder why Intel's first chip had the seemingly-arbitrary number 3101. Intel had a highly-structured naming system. A 3xxx part number indicated a bipolar product. A 1 for the second digit indicated RAM, while the last two digits (01) were a sequence number. Fortunately, the marketing department stepped in and gave the 4004 and 8008 processors better names. 

  2. Memory chips started out very expensive, but prices rapidly dropped. Computer Design Volume 9 page 28, 1970, announced a price drop of the 3101 from $99.50 to $40 in small volumes. Ironically, the Intel 3101 is now a collector's item and on eBay costs much more than the original price—hundreds of dollars for the right package. 

  3. Several sources say that the 3101 was the first solid state memory, but this isn't accurate. There were many companies making memory chips in the 1960s. For instance, Texas Instruments announced the 16-bit SN5481 bipolar memory chip in 1966 (Electronics, V39 #1, p151) and Transitron had the TMC 3162 and 3164 16-bit RAM (Electrical Design News, Volume 11, p14). In 1968, RCA made 72-bit and 288-bit CMOS memories for the Air Force (document, photo). Lee Boysel built 256-bit dynamic RAMs at Fairchild in 1968 and 1K dynamic RAMs at Four Phase Systems in 1969 (timeline and Boysel presentation). For more information on the history of memory technology, see timeline and History of Semiconductor Engineering, p215. Another source for memory history is To the Digital Age, p193. 

  4. From my measurements, the 3101 die is about 2.39mm by 3.65mm. Feature size is about 12µm. 

  5. If you've used TTL chips, you probably used the 74LSxx family. The "S" stands for the Schottky transistors that make these chip fast. These chips were "the single most profitable product line in the history of Texas Instruments" (ref). 

  6. The Schottky diode in the Schottky transistor is formed between the base and collector. This diode prevents the transistor from becoming saturated, allowing it to switch faster. 

  7. The resistance of an IC resistor is proportional to the length divided by the width. The sheet resistance of a material is measured in the unusual unit of ohms per square. You might think it should be per square nanometer or square mm or something, but since the resistance depends on the ratio of length to width, the unit cancels out. 

  8. The bias line is shared by all the cells. For reading, it is set to a low voltage. For writing, it is set to an intermediate voltage: higher than the data 0 voltage, but lower than the data 1 voltage. The bias voltage is controlled by the write enable pin.

    More advanced chips use two data lines instead of a bias line for more sensitivity. A differential amplifier to compare the currents on the two data lines and distinguish the tiny change between a zero bit and a one bit. However, the 3101 uses such high currents internally that this isn't necessary; it can read the data line directly. 

  9. If my analysis is correct, when a row is selected, the address decode logic raises both the positive row select and negative row select lines by about 0.8 volts (one diode drop). Thus, the cell is still powered by the same voltage differential, but the voltage shift makes the data and bias lines active. 

  10. Address lines A3 and A2 are reversed in the decoding logic, presumably because it made chip layout simpler. This has no effect on the operation of the chip since it doesn't matter of the physical word order matches the binary order. 

  11. The 3101 has a chip select pin that makes it easy to combine multiple chips into a larger memory. If this pin is high, the chip will not read or write its contents. One strange thing about the address decoding logic is that each pair of address lines is driven by a NAND gate latch. There's no actual latching happening, so I don't understand why this circuit is used.

    How the 3101 implements this feature is a bit surprising. The chip select signal is fed into the address decoding circuit; if the chip is not selected, both A0 and the complement A0 are forced low. Thus, none of the rows will match in the address decoding logic and the chip doesn't respond. 

  12. The Interdata 7/32 (the first 32-bit minicomputer) used 3101 chips in its memory controller. (See the maintenance manual page 338.) The Interdata 16/HSALU used 3101 chips for its CPU registers. (See the maintenance manual page 259.) As late as 1982, the Interdata 3210 used 3101 chips to hold cache tags (see manual page 456). On the schematics note that part number 19-075 indicates the 3101. 

  13. The Diablo 8233 terminal used 3101A (74S289) chips as RAM for its discrete TTL-based processor (which was more of a microcontroller) that controlled the printer. (See maintenance manual page 187.) This systems was unusual since it contained both an 8080 microprocessor and a TTL-based processor. 

  14. The metal layer of the chip is protected by silicon dioxide passivation layer. The professional way to remove this layer is with dangerous hydrofluoric acid. Instead, I used Armour Etch glass etching cream, which is slightly safer and can be obtained at craft stores. I applied the etching cream to the die and wiped it for four minutes with a Q-tip. (Since the cream is designed for frosting glass, it only etches in spots. It must be moved around to obtain a uniform etch.) Next, I applied a few drops of hydrochloric acid (pool acid from the hardware store) to the die for a few hours. 

  15. Moore's law not only describes the exponential growth in transistors per chip, but drives this growth. The semiconductor industry sets its roadmap according to Moore's law, making it in some sense a self-fulfilling prophecy. See chapter 8 of Technological Innovation in the Semiconductor Industry for a thorough discussion. 

  16. Intel's 1985 Annual Report says "It was a miserable year for Intel" and discusses the decision to leave the DRAM business. 

Analyzing the vintage 8008 processor from die photos: its unusual counters

The revolutionary Intel 8008 microprocessor is 45 years old today (March 13, 2017), so I figured it's time for a blog post on reverse-engineering its internal circuits. One of the interesting things about old computers is how they implemented things in unexpected ways, and the 8008 is no exception. Compared to modern architectures, one unusual feature of the 8008 is it had an on-chip stack for subroutine calls, rather than storing the stack in RAM. And instead of using normal binary counters for the stack, the 8008 saved a few gates by using shift-register counters that generated pseudo-random values. In this article, I reverse-engineer these circuits from die photos and explain how they work.

The image below shows the 8008's tiny silicon die, highly magnified. Around the outside of the die, you can see the 18 wires connecting the die to the chip's external pins. The 8008's circuitry is built from about 3500 tiny transistors (yellow) connected by a metal wiring layer (white). This article will focus on the stack circuits on the right side of the chip and how they interact with the data bus (blue).

The die of the Intel 8008 microprocessor, showing the stack and other important subcomponents.

The die of the Intel 8008 microprocessor, showing the stack and other important subcomponents.

For the 8008 processor's birthday, I'm using the date of its first public announcement, an article in Electronics on March 13, 1972 entitled "8-bit parallel processor offered on a single chip." This article described the 8008 as a complete central processing unit for use in "intelligent terminals" and stated that chips were available at $200 each.1

You might think that an intelligent terminal is a curiously specific application for the 8008 processor. There's an interesting story behind that, going back to the roots of the chip: the Datapoint 2200 "programmable terminal", introduced in June 1970. The popular Datapoint 2200 was essentially a desktop minicomputer with its processor consisting of a board full of simple TTL chips. The photo below shows the CPU board from the Datapoint 2200. The chips are gates, flip flops, decoders, and so forth, combined to build a processor, since microprocessors didn't exist at the time.

The processor board from the Datapoint 2200. The 8008 microprocessor was created to replace this board, but was never used by Datapoint. Photo courtesy of unknown source.

The processor board from the Datapoint 2200. The 8008 microprocessor was created to replace this board, but was never used by Datapoint. Photo courtesy of unknown source.

Processors typically use a stack to store addresses for subroutine calls, so they can "pop" the return address off the stack. This stack is usually stored in main memory. However, the Datapoint 2200 used slow shift-register memory2 instead of expensive RAM for its main storage, so implementing a stack in main memory would be slow and inconvenient. Instead, the Datapoint 2200's stack was stored in four i3101 RAM chips, providing a small stack of 16 entries. 3 4 The i3101 was Intel's very first product, and held just 64 bits. In the photo above, you can see the chips in their distinctive white packaging each with a large "i" for Intel. 5

To keep track of the top of the stack, the Datapoint 2200 used a 4-bit up/down counter chip to hold the stack pointer. The clever thing about this design is there's no separate program counter (PC) and stack; the PC is simply the value at the top of the stack. You don't need to explicitly push and pop the PC onto the stack; for a subroutine call you just update the counter and write the subroutine address to the stack.

The story of the 8008's origin is that Datapoint went to Intel and asked if Intel could build a chip that combined the stack memory and the stack pointer onto a single chip. Intel said not only could they do that, they could put the whole processor board onto a single chip! This was the start of Intel's 8008 project to duplicate the Datapoint 2200's processor board onto a chip, keeping the Datapoint 2200 instruction set and architecture.6 After various delays, Intel completed the 8008 microprocessor, but Datapoint rejected it. Intel decided to sell the 8008 as a general-purpose processor chip, sparking the microprocessor revolution. Intel improved the 8008 with the 8080 and then the 16-bit 8086, leading to the x86 architecture that dominates desktop and server computers today.

The consequence of the 8008's history is that it inherited its architecture and instruction set from the Datapoint 2200 intelligent terminal. One of these features was the fixed, internal stack. But the 8008's implementation of that stack is unusual.

Shift-register counter

The most unexpected part of the 8008's stack is how it keeps track of the current position. The straightforward way to implement the stack would be with a binary up/down counter to keep track of the current stack position (which is what the Datapoint 2200 did). But to save a few transistors, the 8008 uses a nonlinear feedback shift register instead of a counter. The result is the stack entries are accessed in a pseudo-random order! But since they are read and written in the same order, everything works out fine.

The shift register outputs are based on a de Bruijn sequence, a cyclic sequence in which every possible output occurs as a subsequence exactly once. The 8008's de Bruijn sequence is shown below. The first value (000) is underlined in red. Shifting to the blue position yields the second value (001). Proceeding around the circle clockwise yields all eight values in the sequence: 000, 001, 010, 101, 011, 111, 110, 100 and finally back to 000. Note that each value appears exactly once, but they are not in standard binary order.

8This de Bruijn sequence contains all eight 3-bit values as subsequences. 000 and 001 are underlined. The Intel 8008's internal counters are built form this sequence.

This de Bruijn sequence contains all eight 3-bit values as subsequences. 000 and 001 are underlined. The Intel 8008's internal counters are built form this sequence.

At each step in the sequence, the last two bits are shifted to the left and a new bit is placed on the right. Counting down is the converse: the first two bits are shifted to the right and a new bit is placed one the left. This process can be implemented with a shift register, a circuit that allows a bit sequence to be shifted and an additional bit inserted.7

The diagram below shows how the 8008 implements the nonlinear feedback shift register counter. While it make look complex, it's a straightforward implementation of the de Bruijn sequence. The three latches in the middle form a shift register, with each latch holding one bit. To count up, each bit is shifted to the left and a new bit is added on the right (green arrows). To count down, each bit is shifted to the right and a new bit is added on the left (purple arrows). The logic gate on the left generate the "new" bit for counting down and the gates on the right generate the new bit for counting up.

The 8008 uses the above circuit for its internal stack counter. The refresh counter is based on this, but counts up only.

The 8008 uses the above circuit for its internal stack counter. The refresh counter is based on this, but counts up only.

The logic gates may appear complex. However, one feature of PMOS logic is it's as simple to build an AND-OR-NOR gate as a plain NOR gate, just by wiring transistors in parallel or series. Designing the logic is also straightforward: for each triple of current bits, the de Bruijn sequence specifies the next bit. If you've studied digital logic, Karnaugh maps can be used to create the logic circuits to generate the desired next bit.

Inside the stack storage

The 8008 uses dynamic RAM (DRAM) to for its stack storage and its registers. The other 1970s microprocessors that I've examined use static latches, so the 8008 is a bit unusual in this regard. Since Intel was primarily a RAM company at the time, I assume they wanted to leverage their RAM skills and save transistors by using DRAM.

Each bit of storage in the 8008 uses a cell with three transistors and one capacitors, called a 3T1C cell, similar to the cell in Intel's i1103 DRAM chip. The diagram below shows a closeup of the 8008's stack storage, with six DRAM cells visible. Each row is one 14-bit address in the stack. Each row has a read enable and write enable control line coming from the left. Each column stores one of the 14 bits; the column sense line is used to read and write the selected bit.

Detail of the Intel 8008 microprocessor's die, showing six storage cells for the stack registers. Each bit is stored with a DRAM cell consisting of three transistors and a capacitor.

Detail of the Intel 8008 microprocessor's die, showing six storage cells for the stack registers. Each bit is stored with a DRAM cell consisting of three transistors and a capacitor.

The transistors for the first cell are labeled T1, T2 and T3. The value is stored on the capacitor labeled C. (There is no separate physical capacitor; the capacitance of the wiring is sufficient to store the bit.)

To write a bit, the write line for the desired row is pulled low, turning on T1. The desired voltage (low or high) is fed onto the sense line, passes through T1, and is stored by the capacitor. To read the value, the appropriate read line is pulled low, turning on T3. If C has a low voltage, T2 is turned on. This connects the sense line to ground through T3 and T2. On the other hand, if C has a high voltage, T2 is turned off and the sense line is not grounded. Thus, the circuitry connected to the sense line can tell what bit value is stored on C.

The inconvenience with dynamic RAM is that values can only be stored temporarily. After a few hundred microseconds, the charge stored on capacitor C will leak away and the value will be lost. The solution is a refresh circuit that periodically reads each value and writes it back, before the bit fades away. (A similar refresh process is used by your computer's RAM.) The 8008's internal RAM is refreshed at least every 240 microseconds, ensuring that bits are not lost. (Static RAM, on the other hand, uses a larger, more complex circuit for each bit, but will preserve the bit as long as the circuit is powered up.)

In the 8008, the stack storage (and the registers) are refreshed by continuously stepping through each entry: reading it and writing it back. To accomplish this, a second 3-bit shift-register counter is used as a refresh counter, tracking the current position that is being refreshed. The circuit for this is the same as the stack counter, except it omits the logic to count down, as it only needs to count in one direction.9

Understanding the die photo

I'll briefly explain what you're looking at in the die photo above. The chip itself is made from a silicon wafer. Plain silicon is essentially an insulator, but by doping it with impurities, it becomes a semiconductor. The dark lines indicate the boundary between doped and undoped regions; the doped silicon in the first cell is indicated in red.

On top of the silicon is the polysilicon layer, which is the yellowish stripes. Polysilicon acts as a conductor and is used as internal wiring of the chip. More importantly, a transistor is created when polysilicon crosses doped silicon. A thin oxide layer separates the polysilicon from the silicon, forming the transistor's gate. A low voltage on the polysilicon gate causes the transistor to conduct, connecting the two sides (called source and drain) of the transistor. A high voltage on the gate turns the transistor off, disconnecting the two sides. Thus, the transistor acts as a switch, controlled by the gate.

The top layer of the chip is the metal layer, which is also used as wiring. For the photo above, I removed the metal layer with hydrochloric acid to make the underlying silicon more visible. The green, blue and gray lines indicate where the metal wiring was before being removed. Transistors T1 and T3 are connected to the sense line (blue), while transistor T2 is connected to ground (green). The read and write lines enter the circuit on the left as metal wiring, connected to polysilicon lines.

The interface between the stack and the data bus

To access memory, the address in the stack must be provided to external memory via the 8 data/address pins on the chip. These pins are connected to the stack (and other parts of the 8008) via the data bus. The die photo below shows the circuitry that interfaces the 14-bit stack storage to the 8-bit data bus.11 At the top of the photo are the metal control lines and three of the data bus lines. At the bottom are the sense lines, discussed earlier, from the stack storage. In between are the transistors (orange) that connect the data bus and the stack.

The control lines select the low (L) or high (H) half of the address. These activate the appropriate read or write transistors, connecting the appropriate stack columns to the data bus.

The stack /bus driver circuit provides the "glue" between the data bus and the stack DRAM storage.

The stack /bus driver circuit provides the "glue" between the data bus and the stack DRAM storage.

The transistors to write an address to the data bus are much larger than typical transistors, appearing as vertical yellow bars in the die photo. The reason for this is the data bus passes through the whole chip. Due to the length of the bus, it has relatively high capacitance and larger, high-current transistors are required to drive a signal on the data bus.

Near the bottom of the photo are the inverter amplifiers. Each sense line is attached to an inverter that boosts the signal from the stack storage. During refresh, this boosted signal is written back, strengthening the bit stored on the capacitor.10

Conclusion

By examining die photos, it is possible to reverse-engineer the 8008 microprocessor. One unusual feature of the 8008 is that instead of using standard binary counters internally, it saves a few gates by using shift-register counters. Although these count in a pseudo-random order rather than sequentially, the 8008 still functions correctly. One counter is used for the on-chip address stack. The 8008 also uses DRAM internally for stack storage and register storage, requiring a second counter to refresh the DRAM. Since every transistor was precious at the dawn of the microprocessor age, the 8008 has these interesting design decisions that produced compact circuitry.

If you're interested in the 8008, my previous article has a detailed discussion of the architecture, more die photos and information on how to take them. This article explains the 8008's ALU.

I announce my latest blog posts on Twitter, so follow me at kenshirriff. I also have an RSS feed.

Notes and references

  1. The first announcement of the 8008 microprocessor in Electronics is shown below (click for a larger version). The announcement called the chip a "parallel processor", a term that had a different meaning back then, indicating that the processor operated on all 8 bits at the same time. This was in contrast to serial processors (such as the Datapoint 2200) that handled one bit of the word at a time.)

    The 8008 chip was announced in Electronics on March 13, 1972: "8-bit parallel processor offered on a single chip."

    The 8008 chip was announced in Electronics on March 13, 1972: "8-bit parallel processor offered on a single chip."

  2. In 1970, RAM memory chips were extremely expensive: $99.50 for an i3101 chip with just 64 bits of storage. Shift-register memory was cheaper and denser, with 512 bits of storage in an Intel 1405 chip. The big disadvantage is the bits were circulated around and around inside the chip, with only one bit available at a time. Sequential access wasn't a problem, but if you wanted to read memory out of order, you might need to wait half a millisecond for the right bit to circle around. I wrote about shift-register memories in detail here, with detailed die photos. 

  3. The i3101 memory was called the 3101 due to Intel's part numbering system at the time, described in Intel Technology Journal, Q1 2001. To summarize, the first digit indicate the product family: 1xxx is PMOS, 2xxx is NMOS, 3xxx is bipolar and so forth. The second digit indicates the product type: 1 is RAM, 2 is a controller, 3 is ROM, and so forth. The last two digits are sequence numbers typically starting with 01. Thus, the first bipolar RAM was the 3101.

    During development, the 8008 chip was called the 1201, following Intel's naming scheme: the 1 indicated the chip was built from PMOS technology, the 2 indicated a custom chip and the 01 was a serial number. Fortunately, when it came time to market microprocessors, Intel decided that marketing was more important than systematic numbering: Intel's 4-bit microprocessor became the 4004 and their 8-bit microprocessor the 8008. 

  4. Intel introduced the i3101 chip in April 1969. The i3101 RAM chip was a static memory chip, rather than the dynamic RAM chips common today. It was also built from Schottky TTL technology, rather than MOS used in modern RAM chips. Other companies, such as National Semiconductor, Signetics and Fairchild, made 64-bit memory chips compatible with the Intel i3101. However, they typically used the standard 74xx numbering scheme, calling the chip the 7489

  5. Although the Datapoint's stack could hold 16-bit values, the Datapoint 2200 only used 13 address bits, supporting a maximum of 8K of memory. The 8008 expanded the address range to 14 bits, supporting 16K of memory, which was a huge amount at that time. However, the 8008's internal stack was only 8 values, rather than the 16 of the Datapoint 2200. 

  6. Texas Instruments heard that Intel was designing a processor for Datapoint and asked Datapoint if they could build a processor for Datapoint too. TI beat Intel to the finish, creating the TMC 1795 processor before Intel completed the 8008, largely because Intel put the 8008 on the back burner. After Datapoint rejected TI's microprocessor, TI tried to find a new customer for the chip. TI was unsuccessful, and the TMC 1795 was abandoned and mostly forgotten. I've written about the TI chip in more detail here

  7. You may be familiar with linear-feedback shift registers (LFSRs), which can be used as pseudo-random number generators or noise generators. With N stages, a LFSR can generate 2N-1 output values. The de Bruijn sequence is generated from a nonlinear-feedback shift register. Nonlinear-feedback shift registers are a generalization of LFSRs; by using more complex feedback circuitry than just XOR, a nonlinear feedback shift register can generate sequences of arbitrary length. In particular, it can generate a sequence of 2N values, while a LFSR is limited to 2N-1. 

  8. Nonlinear feedback shift registers seem pretty obscure. The only other use I've seen is the TMS 0100 calculator chip, which generates an internal sequence of length 11. For information on the theory, see The Synthesis of Nonlinear Feedback Shift Registers and Counting with Nonlinear Binary Feedback Shift Registers. The book Shift Register Sequences goes into great detail on linear and nonlinear sequences; Section VII:5 is probably most relevant, describing how to make a shift register cycle of any length.

    The TMS 1000 microcontroller saves a few gates by using a LFSR for the program counter. Instead of incrementing, the PC goes through a pseudo-random sequence. The code is stored in the ROM in the same sequence; everything works out, but it seems like a strange way to implement a program counter. 

  9. I was expecting the stack counter and refresh counter to have a regular layout on the chip, with a single shift register stage repeated three times. However, on the 8008 die, the transistors are arranged irregularly, scattered around where there was room. Presumably this made the layout more compact. 

  10. Since the signal read from stack storage passes through an inverter before being written back, you might expect the bit to get flipped. The explanation is that transistor T2 in the storage cell inverts the value on C. Thus, the value read from a sense line is inverted compared to the value written on the sense line. The inverter amplifier provides a second inversion, restoring the original value. 

  11. Each 8008 instruction takes multiple clock cycles to execute. An instruction is broken into one or more machine cycles; each machine cycle typically corresponds to one memory access for instruction or data. Each machine cycle consists of up to 5 states (T1 through T5). An address is transmitted to memory during state T1 and T2, and the memory location is read or written during T3. Each T state requires two clock cycles, so an 8008 instruction takes a minimum of 10 clock cycles. The Intel 8008 user's manual provides detailed timings. 

Reverse-engineering the surprisingly advanced ALU of the 8008 microprocessor

A computer's arithmetic-logic unit (ALU) is the heart of the processor, performing arithmetic and logic operations on data. If you've studied digital logic, you've probably learned how to combine simple binary adder circuits to build an ALU. However, the 8008's ALU uses clever logic circuits that can perform multiple operations efficiently. And unlike most 1970's microprocessors, the 8008 uses a complex carry-lookahead circuit to increase its performance.

The 8008 was Intel's first 8-bit microprocessor, introduced 45 years ago.1 While primitive by today's standards, the 8008 is historically important because it essentially started the microprocessor revolution and is the ancestor of the x86 processor family that you are probably using right now.2 I recently took some die photos of the 8008, which I described earlier. In this article, I reverse-engineer the 8008's ALU circuits from these die photos and explain how the ALU functions.

Inside the 8008 chip

The image below shows the 8008's tiny silicon die, highly magnified. Around the outside of the die, you can see the 18 wires connecting the die to the chip's external pins. The rest of the chip contains the chip's circuitry, built from about 3500 tiny transistors (yellow) connected by a metal wiring layer (white).

Die photo of the 8008 microprocessor, showing important functional blocks.

Die photo of the 8008 microprocessor, showing important functional blocks.

Many parts of the chip work together to perform an arithmetic operation. First, two values are copied from the registers (on the right side of the chip) to the ALU's temporary registers (left side of the chip) via the 8-bit data bus. The ALU computes the result, which is stored back into the accumulator register via the data bus. (Note that the data bus splits and goes around both sides of the ALU to simplify routing.) The carry lookahead circuit generates the carry bits for the sum in parallel for higher performance.3 This is all controlled by the instruction decode logic in the center of the chip that examines each machine instruction and generates signals that control the ALU (and other parts of the chip).

The Arithmetic-Logic Unit

The 8008's ALU implements four functions: Sum, AND, XOR and OR. The Sum operation adds two 8-bit numbers. The remaining three operations are standard Boolean logic operations. The AND operation sets an output bit if the bit is set in the first AND the second number. OR checks if a bit is set in the first OR the second number (or both). XOR (exclusive-or) checks if a bit is set in the first OR the second number (but not both).

The concept of carries during addition is a key part of the ALU. Binary addition in a processor is similar to grade-school long addition, except with binary numbers instead of decimal. Starting at the right, each column of two numbers is added and there can be a carry to the next column. Thus, in each column, the ALU adds two bits as well as a carry bit.

In most early microprocessors, addition of each column needs to wait until the column to the right has been added and the carry is available. The carry "ripples" through the bits, right to left, slowing the addition. The 8008, however, uses a fast carry-lookahead circuit3 to generate the carries for all 8 columns in parallel before the addition happens. Then all the columns can all be added in parallel without waiting for the carry to "ripple" through the sum. This carry-lookahead circuit is an unusual feature to see in an early microprocessor due to its complexity.

Since the 8008 is an 8-bit processor, the ALU operates on two eight-bit arguments. Most 8-bit processors (including the 8008) use a "bit-slice" construction for the ALU, with a one-bit ALU slice repeated eight times. Each one-bit ALU slice takes two input bits and the carry-in bit, and produces the output bit. In most 8-bit processors, the bit-slice ALU is arranged by stacking 8 rectangular ALU slices to form a compact, regular block. However, the 8008 has its eight ALU slices arranged in an irregular fashion—some blocks are even sideways—as shown in the diagram below. The motivation for this is that the carry lookahead circuit takes up a triangular space on the chip. To fit the remaining space better, the 8008's ALU is arranged into its unusual triangular layout.

Arrangement of the eight ALU slices on the 8008 microprocessor die. Unlike most processors, the 8008's ALU slices are arranged in a haphazard triangular arrangement. This fits better with the triangular carry-lookahead circuit above the ALU.

Arrangement of the eight ALU slices on the 8008 microprocessor die. Unlike most processors, the 8008's ALU slices are arranged in a haphazard triangular arrangement. This fits better with the triangular carry-lookahead circuit above the ALU.

Zooming in on the die photo, we can look at one of the ALU slices and see how the circuitry is constructed. The chip is built from three layers (to simplify slightly). The topmost layer is the metal wiring. It is the most visible feature, and looks metallic (not surprisingly). In the detail below, you can see the horizontal and vertical metal traces. The polysilicon layer is underneath the metal layer and appears yellow/orange under the microscope. Polysilicon can act as wiring, but more importantly it forms the gates of the transistors, switching them on and off. The bottom layer is the grayish silicon die itself, but it is hard to see under the other layers.

Die photo of the 8008 processor, zoomed in on the circuit for one bit of the ALU.

Die photo of the 8008 processor, zoomed in on the circuit for one bit of the ALU.

In the diagram above, the carry c and the complemented a and b inputs enter through the metal wires at the top. The ALU output is at the bottom. The control signals are horizontal metal lines. The circuit is powered by the Vcc (+5 volts) and Vdd (-9 volts) metal lines. The brighter yellow polysilicon regions are transistors. Each gate in the circuit requires a "load resistor" connected to Vdd to pull its output low; for improved performance, these are implemented with transistors rather than resistors.

Removing the metal layer with acid makes the silicon and polysilicon layers more visible, as shown below.6 The chip is formed on a silicon wafer with regions of it "doped" with impurities to create regions of semiconducting silicon. You can see dark lines along the border between doped silicon and undoped silicon. A transistor is formed where a yellowish polysilicon wire crosses the doped silicon. The transistor forms a switch between the two silicon sides, controlled by the polysilicon gate. Each ALU slice contains 20 transistors; the diagram below points out two of them.5

With the metal layer removed from the 8008 processor die, the underlying silicon is visible. The photo shows bit 1 of the 8008's ALU.

With the metal layer removed from the 8008 processor die, the underlying silicon is visible. The photo shows bit 1 of the 8008's ALU.

Simulating one slice of the ALU

By examining the die photos carefully, you can map out the ALU slice's 20 transistors and their connections. From this, you can reverse-engineer the gates that make up the circuit. I explained in my previous article how PMOS gates are structured, so I won't go into the details here. The result is the schematic below, showing one bit of the ALU. Each ALU slice takes two inputs (a and b) and the input carry c, and outputs one result bit. There are three mode lines (m1, m2 and m3) that select one of the four ALU operations.7

The schematic below is interactive. First, select an operation and the table will update with results for the eight different inputs. Next, click a row in the table, and the schematic will update, showing how the ALU computes that row. (Note that the a and b inputs to the ALU are inverted, indicated by an overbar.)

Operation:

While this ALU slice looks like it is made of many gates, physically it is only three gates: two large, multilevel AND-OR-NAND gates and one NAND gate. The AND-OR-NAND logic is implemented on the chip as a single complex gate, rather than by combining simpler gates, since a single large gate provides better performance with less circuitry than multiple small gates. One feature of MOS logic is it's just as easy to form an AND-OR-NAND gate (for instance) as a plain NAND gate.

Understanding the ALU logic

The 8008's ALU circuit above looks like a mysterious collection of gates, but eventually I figured out the structure behind it. The starting point is a full adder that handles the Sum operation. (A full adder adds three input bits (a, b and c) and outputs the (low-order) sum bit and a carry bit.) The full adder is then heavily modified to support the logic operations, yielding the ALU from the previous section. The logic operations are implemented by using the mode lines to block parts of the circuit, yielding XOR, AND or OR, rather than the more complex Sum.

The diagram below strips down the 8008's ALU circuit to reveal the full adder "hidden" inside. The gate in red generates the carry-out from the three inverted inputs, using relatively straightforward logic. (Since the 8008 uses carry-lookahead, this carry-out signal isn't passed to the next ALU slice, but just used to generate the ALU output.) If you examine the possible sum cases, you will see that the sum bit is almost always just the carry-out inverted, except for the 0+0+0 and 1+1+1 cases. Thus, the sum bit can be generated by inverting the carry-out and handling the two exceptional cases.8 The two gates indicated below handle the exceptions by forcing the sum output to the correct value.

Simplified 8008 ALU slice, showing the full adder circuit.

Simplified 8008 ALU slice, showing the full adder circuit.

Comparing the full adder with the full ALU circuit earlier shows how the mode lines support the logic operations. Once you have a full adder, generating XOR is simply a matter of setting the carry-in to 0, which is done by the m3 control line. For the OR and AND operations, mode lines m3 and m2 respectively disable all of the circuit except the gates labeled in green.9 Thus, if you start with a full-adder and extend it to support XOR, AND and OR, the 8008's ALU circuit is a logical result.

Intel's earlier 4004 microprocessor had a simple ALU that only supported addition and subtraction, not any logic operations.10 Interestingly, the 4004's ALU circuit is almost identical to the full adder circuit shown above. So it's very likely that Intel designed the 8008 ALU by extending the 4004 ALU as described above. This would explain why the 8008's ALU generates carries internally, even though the carry lookahead circuit made this redundant.11

The 8008's ALU logic is very similar to the Z80's ALU,12 although the Z80's ALU is (surprisingly) 4 bits (details). The 8085 uses a different complex gate arrangement. The 6502 on the other hand, uses an entirely different approach: straightforward circuits for addition, AND, OR, XOR and shift-right, using pass-transistor multiplexers to select the operation.

Instruction decoding: how the ALU knows what operation to do

The 8008 executes 8-bit instructions, which move data, perform I/O, branch, call subroutines, and so forth. The instruction decoding logic examines the instruction and determines what operation to perform, generating about 30 control signals.13 Over a quarter of the instructions perform ALU operations, and the instruction set is carefully designed so three bits of the instruction specify which of the eight operations to perform.14 By examining these bits, the instruction decoder generates the ALU's mode control lines m1, m2 and m3.

Looking at AND instructions illustrates how this works. All AND instructions have the bit pattern xx100xxx (where x is either 0 or 1). For instance, the instruction to AND with memory is 10100111 and the instruction to AND with a constant is 00100100. When the instruction decode circuit matches this pattern, it pulls the m1 control line low, which causes the ALU to perform an AND operation.7 Other bit patterns generate the other ALU control signals.15

Part of the 8008's instruction decode PLA. The three indicated transistors match opcode pattern XX100XXX, indicating an AND instruction.

Part of the 8008's instruction decode PLA. The three indicated transistors match opcode pattern XX100XXX, indicating an AND instruction.

The diagram above shows part of the instruction decode circuit. The instruction bits (and their complements) are on yellow polysilicon wires running vertically through the circuit. Each row matches a bit pattern, with a transistor connected to each instruction bit to be matched. (The doped silicon regions forming transistors are the black outlines. Circles are connections between a transistor and the row's metal line.) For example, the three transistors marked with arrows match bit 3 low, bit 4 low, and bit 5 high, detecting the AND instruction pattern. Thus, the processor uses the grid of transistors in the instruction decoder to determine the meaning of each instruction.

Loose ends: Subtraction and rotating

The ALU implements a Sum operation, so you might wonder how subtraction is implemented. By using two's complement arithmetic, the CPU can perform subtraction by simply flipping all the bits on a value and then adding it. The ALU uses two temporary registers to hold the two operands since the ALU can't read the operands from the register file and write the result back simultaneously. One of the temporary registers has the feature that its value can be fed to the ALU directly or inverted. The subtraction instructions generate a signal causing the temporary register to provide the inverted value to the ALU, causing the ALU to perform subtraction.

One important operation in most processors is rotating or shifting the bits in a value, to the left or to the right. In most of the microprocessors I've examined, shifting is performed by the ALU.16 The 8008, on the other hand, implements the rotate logic in the register access circuit, on the opposite side of the chip from the ALU. When reading a register, the bits can be shifted one position left or right by a simple circuit before going onto the data bus.

History of the 8008

The Intel 8008 is important historically since it is the ancestor of the dominant Intel x86 architecture that you're probably using right now.2 I wrote a detailed article for the IEEE Spectrum on early microprocessor history, so I'll just give the outline of the 8008's complicated history here.

The 8008 copies the instruction set and architecture of the Datapoint 2200, a popular minicomputer introduced in 1970 as a programmable terminal.17 As was typical for minicomputers, the Datapoint 2200 contained a CPU build from individual TTL chips, filling up a circuit board. Datapoint contracted with both Intel and Texas Instruments to build a single-chip CPU that would replace this processor board, but keeping the same architecture and instruction set.

The Datapoint 2200 computer. The 8008 microprocessor was built to implement the Datapoint 2200's architecture and instruction set. Photo courtesy of Austin Roche.

The Datapoint 2200 computer. The 8008 microprocessor was built to implement the Datapoint 2200's architecture and instruction set. Photo courtesy of Austin Roche.

Texas Instruments was first to build a 2200-compatible microprocessor, creating the TMC 1795 chip. Intel got their version, the 8008, working a bit later, around the end of 1971. Datapoint rejected both processors, instead updating the Datapoint 2200 to use the 74181 TTL ALU chip. Texas Instruments couldn't find a new customer for the TMC 1795 and abandoned it. Intel, on the other hand, came up with the idea of selling the 8008 as a well-supported general-purpose processor. The 8008 led to the 8080, the 8085, 8086, and Intel's x86 line, which still retains some features of the 8008.

Conclusion

Although the 8008 was a very early microprocessor, its ALU was more advanced than you might expect. In particular, it used a complex carry-lookahead circuit for higher performance. Unfortunately, even with the carry-lookahead circuit, the 8008 was slower than the TTL-based Datapoint 2200 processor it was supposed to replace; addition took 20µs on the 8008, compared to 16µs on the original Datapoint 2200 and just 3.2µs on the upgraded Datapoint 2200. This illustrates the speed advantage that TTL had over MOS in the early 1970s. To us, a microprocessor may seem obviously better than a board of chips, but this wasn't always the case.

If you're interested in the 8008, my previous article has a detailed discussion of the architecture, more die photos and information on how to take them, and information on semiconductor history, so take a look.

I announce my latest blog posts on Twitter, so follow me at kenshirriff. I also have an RSS feed.

Notes and references

  1. The 8008 chip was publicly announced in an article in Electronics on March 13, 1972, entitled "8-bit parallel processor offered on a single chip", offering the chips for $200 each. 

  2. If you're not using an x86 processor right now, you're probably using an ARM processor. Don't feel neglected, though, since I've reverse-engineered the ARM-1 too. (Although there are many more ARM chips out there than x86, analytics show 71% of my readers are on x86.)  

  3. Using a carry look ahead circuit avoids the delay from a standard ripple-carry adder, where the carries propagate through the sum. The 8008's carry-lookahead is based on the Manchester carry chain, but with a separate carry chain for each carry, yielding the triangular structure you see on the die. For performance, the carry chain is implemented with dynamic logic, depending on wire capacitance, rather than with standard Boolean gates. The 74181 ALU chip in comparison, uses a different carry lookahead scheme implemented with standard logic. I plan to write more about the 8008's carry lookahead later. 

  4. The 8008 implements eight different arithmetic/logic functions: Add, Add with carry, Subtract, Subtract with borrow, AND, XOR, OR, and Compare.14 These are implemented in terms of the ALU's four basic operations. Subtraction is performed by inverting the second argument. The operations without carry/borrow clear the carry-in bit. Compare is simply a subtraction that doesn't store the result; it just sets the flags with the status. Thus, the four fundamental operations of the ALU are used to implement eight different arithmetic/logic operations. 

  5. Note that the 8008 uses PMOS transistors, rather than the faster NMOS transistors in later microprocessors such as the 8080, 6502 and Z80. If you're familiar with NMOS circuits, PMOS can be confusing since everything is backwards. PMOS transistors turn on if the gate is low, and typically pull the output high. Vdd in PMOS is negative, and "ground" is positive. The "pull-up resistor" in a PMOS gate pulls the output down. A PMOS NAND gate has transistors in parallel (compared to serial for an NMOS NAND gate). A PMOS NOR gate has transistors in serial (compared to parallel for an NMOS NOR gate). 

  6. The metal layer of the chip is protected by silicon dioxide passivation layer. The professional way to remove this layer is with dangerous hydrofluoric acid. Instead, I used Armour Etch glass etching cream, which is slightly safer and can be obtained at craft stores. I applied the etching cream to the die and wiped it for four minutes with a Q-tip. (Since the cream is designed for frosting glass, it only etches in spots. It must be moved around to obtain a uniform etch.) After this, I soaked the die in hydrochloric acid (pool acid from the hardware store) overnight to dissolve the metal. This was probably too long, since the edges of the polysilicon were eaten away in places. 

  7. The following values are used for the three mode lines to select the ALU function:

    Operationm1m2m3
    Sum111
    And010
    Or100
    Xor110
     

  8. A more straightforward way of generating the sum bit is by xoring the three inputs: a⊕b⊕c. Unfortunately, an XOR gate is relatively difficult to implement with Boolean logic, so designers will often try to avoid XOR. 

  9. You might wonder why the OR operation is implemented with an AND gate, and vice versa. Since the inputs and the output of the OR gate are inverted, this is equivalent to an AND gate (by De Morgan's laws), and similarly for the AND gate. 

  10. Strictly speaking, the 4004 microprocessor has an AU (arithmetic unit), not an ALU (arithmetic/logic unit), since it doesn't do logical operations. Since the 4004 was designed for a calculator, logical operations weren't required. 

  11. The 8008's full adder generates the carry-out first, and generates the sum from that. In contrast, the typical full adder circuit combines two half adders to generate the sum and carry-out separately. If the typical full adder circuit had been used in the 8008, the carry-out logic could easily be omitted. 

  12. To see the similarity between the Z80's ALU circuit and the 8008's, you need to swap AND and OR gates. (Apply De Morgan's laws since the 8008's ALU inputs are inverted.) In the Z80, the carry-out comes from the ALU rather than a carry-lookahead circuit, so the control lines are somewhat different. But the fundamental ALU circuit is otherwise the same between the 8008 and Z80, which is not surprising since Federico Faggin worked on both chips. 

  13. Instruction decoding is based on a Programmable Logic Array (PLA), an arrangement of transistors that efficiently implements logic gates. These gates match bit patterns and generate the appropriate control signals for the rest of the chip. The 8008's PLA has 16 input lines flow vertically through the PLA. Each row in the PLA matches a bit pattern and generates a control signal output.

    In more detail, each row output line is pulled low by a load resistor/transistor to Vdd. The transistors are connected between the row line and Vcc (+5V). The bit lines are connected to the transistor's gate. If any bit line is low (indicating a mismatch), the PMOS transistor turns on, pulling the row line high. Thus, if there is no mismatch, the control line is low, and if there is a mismatch, the control line is high. In other words, each row is a NAND gate with instruction bit inputs.

    The input lines are ordered as follows: bit 3, bit 3 complement, 4, 4', 5, 5', 0, 0', 1, 1', 2, 2', 6, 6', 7, 7'. This order may seem strange, but there's a reason for it. In the 8008, the ALU operation is selected by bits 3, 4 and 5 of the instruction. By putting those bits on the left side of the PLA, they are closer to the ALU. Some rows of the PLA actually decode two instructions: bits 3, 4 and 5 are decoded on the left side, generating an ALU control signal, while the remaining bits are decoded on the right side generating a different control signal. This increases the PLA density and saves space on the chip. 

  14. The 8008's instruction set is designed around octal. Among other things, there are 8 ALU operations, 8 registers and 8 conditionals. In octal, the ALU instructions have the value 2ar, where a is the ALU operation to perform (0 through 7) and r is the register to use (0 through 7, where 7 indicates memory). The octal structure originates with the Datapoint 2200, which decoded instructions with TTL 7442 BCD chips that decoded groups of three bits. This octal structure persisted in descendants of the 8008, including the Z80 and x86. Unfortunately, these instruction sets are almost always presented in hexadecimal, which hides the underlying structure. 

  15. The instruction decoder generates all the signals required by the ALU. As described above, AND matches xx100xxx, pulling the m1 control signal low. An OR opcode has the bit pattern xx110xxx, which causes the instruction decode circuit to pull the m2 control line low. An XOR instruction has the bit pattern xx101xxx. The m3 control line is pulled low for patterns xx10xxxx or xx1x0xxx, matching AND, OR or XOR instructions. The subtract (with and without borrow) instructions match xx01xxxx, generating a signal that inverts the second argument. 

  16. Different processors use a variety of techniques for shifting. In the Z80, shifting is performed as data enters the ALU. The 6502 performs a left shift with "A plus A", and has a path inside the ALU for right shifts; the 8085 is similar. The ARM-1 has a barrel shifter next to the ALU that performs arbitrary shifts. 

  17. The instruction set of the Datapoint 2200 is described in the Reference Manual. The 8008 has a couple minor changes. For instance, the 8008 has increment and decrement instructions that are not present in the 2200.