Reverse-engineering the classic MK4116 16-kilobit DRAM chip

Back in the late 1970s, the most popular memory chip was Mostek's MK4116, holding a whopping (for the time) 16 kilobits. It provided storage for computers such as the Apple II, TRS-80, ZX Spectrum, Commodore PET, IBM PC, and Xerox Alto as well as video games such as Defender and Missile Command. To see how the chip is implemented I opened one up and reverse-engineered it. I expected the circuitry to be similar to other chips of the era, using standard NMOS gates, but it was much more complex than I expected, built from low-power dynamic logic. The MK4116 also used advanced manufacturing processes to fit 16,384 high-density memory cells on the chip.12

I created the die photo below from multiple microscope images. The white lines are the metal wiring on top of the chip, while the silicon underneath appears dark red. The two large rectangular regions are the 16,384 memory cells, arranged as a 128×128 matrix split in two. In between the two memory arrays are the amplifiers and selection circuits. The control and interface circuitry is at the left and right, connected to the external pins via tiny bond wires.

Die photo of the 4116 memory chip.

In dynamic RAM, each bit is stored in a capacitor with the bit's value, 0 or 1, represented by the voltage on the capacitor.3 The advantage of dynamic RAM is that each memory cell is very small, so a lot of data can be stored on one chip.4 The downside of dynamic RAM is that the charge on a capacitor leaks away after a few milliseconds. To avoid losing data, dynamic RAM must be constantly refreshed: bits are read from the capacitors, amplified, and then written back to the capacitors. For the MK4116, all the data must be refreshed every two milliseconds.
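As a rough illustration of the refresh budget, the sketch below assumes the chip refreshes one 128-cell row per refresh cycle (the row-at-a-time organization described later in this post) and that the computer spreads those refreshes evenly across the 2 ms window. The function name and the even spacing are assumptions for illustration, not taken from the datasheet.

```python
ROWS = 128                 # rows in the 128x128 cell matrix
REFRESH_WINDOW_S = 2e-3    # every cell must be refreshed within 2 ms

def row_refresh_interval_us(rows=ROWS, window_s=REFRESH_WINDOW_S):
    """Average time between row refreshes, in microseconds,
    assuming refreshes are spread evenly over the window."""
    return window_s / rows * 1e6

print(f"{row_refresh_interval_us():.1f} us between row refreshes")
```

Under these assumptions, the computer has to refresh a row roughly every 15.6 microseconds.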

The diagram below illustrates four of the 16,384 memory cells. Each memory cell has a capacitor, along with a transistor that connects the capacitor to the associated bit line. To read or write data, a row select line is energized, turning on the transistors in that row. The row's capacitors are connected to the bit lines, allowing the bits in that row to be accessed.

Structure of the memory cells, based on the patent.

One of Mostek's key innovations was to multiplex the address pins.6 Earlier memory chips used a separate pin for each address bit; as memory sizes increased, so did the number of address pins. This forced Intel's 4096-bit memory chip, for instance, to use a large, more costly 22-pin package.5 Mostek cut the number of address pins in half by using each address pin twice, first for a "row" address, and then a "column" address. This approach became the industry standard, allowing memory chips to fit into inexpensive 16-pin packages.

Externally, the chip stores a single bit for 16,384 different addresses. (Typically, eight of these chips were used in parallel to store bytes.) Internally, however, the chip is implemented as a 128×128 matrix of storage cells. The row address selects a row of 128 cells7 and then the column address selects one of these 128 cells to read or write.8 Meanwhile, the entire row of 128 cells is refreshed by amplifying the signals and storing them back in the capacitors.
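The row-then-column access pattern can be sketched as a small behavioral model. This is only an illustration: the class and method names are hypothetical, the choice of which address half is sent as the row is an assumption (it depends on the computer, not the chip), and timing, leakage, and analog effects are ignored.

```python
ROWS = COLS = 128

def split_address(addr):
    """Split a 14-bit address into (row, column) 7-bit halves.
    Assumption: the low 7 bits are sent first as the row address."""
    if not 0 <= addr < ROWS * COLS:
        raise ValueError("address must fit in 14 bits")
    return addr & 0x7F, addr >> 7

class Dram16K:
    """Behavioral model only: no timing, leakage, or analog effects."""

    def __init__(self):
        self.cells = [[0] * COLS for _ in range(ROWS)]
        self.row = None      # row address latched by RAS
        self.sensed = None   # the 128 amplified bits of that row

    def ras(self, row):
        """Latch the row address and sense the whole row of 128 cells."""
        self.row = row
        self.sensed = list(self.cells[row])

    def cas(self, col, write_bit=None):
        """Select one column, optionally overwrite it, then write the row back."""
        if self.sensed is None:
            raise RuntimeError("RAS must come before CAS")
        if write_bit is not None:
            self.sensed[col] = write_bit
        self.cells[self.row] = list(self.sensed)  # write-back doubles as refresh
        return self.sensed[col]

    def read(self, addr):
        row, col = split_address(addr)
        self.ras(row)
        return self.cas(col)

    def write(self, addr, bit):
        row, col = split_address(addr)
        self.ras(row)
        self.cas(col, write_bit=bit)

ram = Dram16K()
ram.write(5000, 1)
print(split_address(5000), ram.read(5000), ram.read(5001))
```

Note that every access in the model rewrites the full row, mirroring how the real chip refreshes 128 cells as a side effect of reading or writing one of them.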

The 4116 die with key blocks labeled. Most of the memory cell area has been cut out.

The die image above is labeled with the main functional blocks.9 The chip's 16 pins are labeled around the perimeter,10 including the seven address pins (A0-A6). The Row Address Strobe pin (RAS) is used to indicate the row address is ready, while the Column Address Strobe pin (CAS) indicates that the column address is ready. The two memory arrays are in the center; I've cut out most of the cells to keep the diagram compact. The column select circuitry and sense amplifiers are between the two memory arrays. At the right, the row decode circuitry selects a row based on the address pins, while the column address circuitry buffers the address for the column select circuitry. At the left, the clock circuits generate the chip's timing pulses, triggered by the RAS, CAS, and WRITE pins. Finally, the Data Out and Data In pins provide access to the selected data bit.

Memory cell structure

The key to the DRAM chip is the memory storage cell, designed to be as compact as possible. The highly magnified photo below shows some of the storage cells, densely packed together. It's a bit hard to visualize what's going on because the chip is constructed from multiple layers. The bottom layer is the grayish silicon die. On top of the silicon are two layers of polysilicon, a special type of deposited silicon used for transistor gates, capacitors, and wiring. The top layer of the chip is the metal wiring, which was removed for this photo. The photo shows three bit lines in the silicon, with bulb-shaped storage cells connected on either side. Vertical strips of polysilicon (poly 1) over the storage cells implement capacitors: the silicon forms the lower plate, while the polysilicon forms the upper plate. The second layer of polysilicon (poly 2) is arranged in diagonal regions to implement the selection transistors, where square notches in the poly 1 layer allow the poly 2 layer to approach the silicon.

A closeup of the memory chip under the microscope, showing individual storage cells.

The cross-section diagram below shows the three-dimensional, layered structure of a memory cell. At the bottom is the silicon (brown); the bit line (dark brown) is made from doped silicon. Above the silicon are the two polysilicon layers (red) and the metal layer (purple), separated by insulating silicon dioxide (gray). At the far left, the poly 1 layer and underlying silicon form a capacitor. In between the capacitor and the bit line, the poly 2 layer forms the gate of the transistor. At the left, the poly 2 layer is connected to the metal of the word line, which turns the transistor on, connecting the capacitor to the bit line.

Cross-section structure of a storage cell. Based on 16K—The new generation dynamic RAM.

The diagram below illustrates how bits are addressed in the storage matrix. The arrangement is somewhat confusing because columns of cells are offset and interlocked like zippers. A row select line is connected to the centers of diagonal poly 2 regions, so each region controls two transistors on neighboring bit lines. (For instance, in the upper left, the poly region connected to row select 0 forms transistors 0A and 0B.) The result is that each row select line activates 128 cells, one for each bit line in a staggered arrangement.

Arrangement of bits in the matrix. The transistors are labeled according to their corresponding row and bit line.

Low-power circuitry

A key feature of the MK4116 memory chip is that it uses very little power when it is sitting idle: it consumes 462 milliwatts when active but just 20 milliwatts in standby mode. Low-power circuitry is straightforward to build with modern CMOS technology, but the 4116 used earlier NMOS transistors. Most NMOS integrated circuits constructed logic gates with load transistors, a simple technique with the disadvantage of wasting power. Instead, the MK4116 memory chip uses dynamic logic, which is considerably more complex but saves power while idle.

A typical dynamic logic gate (below) operates in two phases. In the first phase, a clock signal turns on the upper transistor, precharging the output to +12 volts, the "1" state. The upper transistor then turns off, but the output remains high due to the capacitance of the wire. In the second phase, the lower transistors can pull the output low. In particular, if either input is 1, the corresponding transistor turns on and pulls the output low, so the circuit implements a NOR gate. This circuit doesn't consume any static power, just a small current to charge the wire capacitance when switching. (The inputs must be carefully timed so they don't overlap with the precharge clock.) The use of dynamic circuitry makes the 4116 much more complex than it would be otherwise since the gates are controlled by clock signals, which need to be generated.
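The two-phase behavior can be sketched in a few lines. This is a purely logical model with hypothetical names; it ignores voltages, charge leakage, and the timing constraints mentioned above.

```python
class DynamicNor:
    """Logical model of a precharged (dynamic) NOR gate."""

    def __init__(self):
        self.output = None  # undefined until the first precharge

    def precharge(self):
        """Phase 1: the clocked upper transistor charges the output high."""
        self.output = 1

    def evaluate(self, *inputs):
        """Phase 2: any high input discharges the output to ground.
        Once discharged, the output stays low until the next precharge."""
        if self.output is None:
            raise RuntimeError("gate must be precharged before evaluation")
        if any(inputs):
            self.output = 0
        return self.output

gate = DynamicNor()
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    gate.precharge()
    print(a, b, gate.evaluate(a, b))
```

The model also shows why the clocking matters: an input that goes high even briefly during evaluation discharges the output, and nothing restores it until the next precharge.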

A NOR gate using dynamic logic.

The row select circuitry

The purpose of the row-select circuitry is to decode the 7 address bits and energize the corresponding row select line (out of 128) to read one row of memory. In the first step, 32 5-input NOR gates decode address bits A0 through A4. These NOR gates are implemented in the compact circuit shown below. Each NOR gate takes a different combination of non-inverted and inverted address bits and matches a particular 5-bit address. These NOR gates use dynamic logic, first pulled high and then discharged to ground, except for the selected address which remains high. Next, each NOR output is split into four, based on A5 and A6. The result is that one of 128 row select lines is activated, turning on the transistors for that row in the matrix.

The NOR gates are implemented in several compact blocks; one block of three NOR gates is shown below. Each NOR gate is a horizontal stripe of doped silicon, with ground above and below it. Each NOR gate has transistors (pink stripes) connected to ground alternating above and below it. A transistor will pull the NOR gate low if the connected address line is high. The precharge transistors at the left pull the NOR gates to +12 volts, while the output control transistors control the flow of the decoded outputs to the rest of the circuitry.

Three NOR gates in the row decoder. The vertical yellow strips indicate metal wiring, removed for this photo.

The small greenish blobs at the end of a transistor gate (pink stripe) are connections (vias) between a transistor gate and an address line. The address lines are represented as vertical yellow stripes (since the metal layer was removed). Note that each transistor gate has an address line at the right and the inverted address line at the left; thus, the NOR gates all have the same basic layout, but with the contacts changed to match a particular address. For instance, the upper NOR gate has transistors connected to A0, A2, A1, A3, and A4, so it will be active for address 00000; any other address will pull it low.
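The decoding scheme can be sketched as follows. Each of the 32 NOR gates has five transistors, each tied to either an address line or its inverse, so the gate stays high only for its matching 5-bit address. The numbering of gates and the order in which A5 and A6 split the outputs are assumptions for illustration; the physical row order on the die may differ.

```python
def nor5_output(gate, addr5):
    """Output of NOR gate number `gate` (0..31) for 5-bit address `addr5`.
    The gate is pulled low if any line it is connected to is high."""
    for bit in range(5):
        a = (addr5 >> bit) & 1
        # A 0 bit in the gate number means a contact to the true address line;
        # a 1 bit means a contact to the inverted address line.
        line = a if ((gate >> bit) & 1) == 0 else 1 - a
        if line:
            return 0
    return 1

def row_select_lines(addr7):
    """Return the 128 row select lines (exactly one is 1).
    Assumption: line number = gate + 32 * (value of A6 A5)."""
    low5 = addr7 & 0x1F
    quarter_selected = addr7 >> 5   # A5 and A6 pick one of four outputs
    lines = []
    for quarter in range(4):
        for gate in range(32):
            active = nor5_output(gate, low5) and quarter == quarter_selected
            lines.append(1 if active else 0)
    return lines

lines = row_select_lines(0b1000011)
print(lines.index(1), sum(lines))
```

Gate 0 in this sketch corresponds to the example above: all five of its transistors connect to non-inverted address lines, so it stays high only for address 00000.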

The sense amplifiers

The sense amplifiers are one of the most challenging parts of designing a memory chip. The job of the sense amplifier is to take the tiny voltage from a capacitor and amplify it into a binary 0 or 1.11 The challenge is that even though 12 volts is stored in a capacitor, the signal from the capacitor is very small, only 100 millivolts or so. (Because the bit line is much larger than the tiny memory cell capacitor, the capacitor causes a very small voltage swing.)12 It is critically important for the sense amplifier to operate accurately, even in the presence of noise or voltage fluctuations, because any error will corrupt the data. The sense amplifier circuit must also be compact and low power since there are 128 sense amplifiers.

The 128 sense amplifiers are in the middle of the die, between the upper and lower memory arrays.

The chip's 128 sense amplifiers, one for each column, are located between the two memory arrays as shown above. During a read, 128 values in a row are accessed in parallel and amplified by the sense amplifiers. These 128 values are then written back to refresh the values in the capacitor. For a write operation, one of the bits is updated with the new value before they are written back.

The sense amplifier as it appears on the die, and corresponding schematic.

Each sense amplifier (above) is a very simple circuit. It takes two inputs and compares them, pulling the lower one to 0.13 It is built from two cross-coupled transistors, each trying to pull the other one low. Whichever transistor has the higher voltage to start with will "win", forcing the other side low.14 The sense amplifier is sensitive to very small voltage differentials, allowing it to distinguish the small signals from a storage cell.

Locating the sense amplifiers between the two memory arrays isn't arbitrary, but the key to their operation: this is the "divided bit line" architecture introduced in 1972. The idea is that one input to the sense amp is the voltage from the desired memory cell, while the other input is a threshold voltage from a "dummy cell" in the opposite memory array. Dummy cells are constructed and precharged like real memory cells except the capacitor is half-sized, so they provide a voltage midway between a 0 bit and a 1 bit.3 If the voltage from the real memory cell is lower, the sense amp outputs a 0, and if higher, it outputs a 1.

Dummy cells provide the threshold voltage for deciding if a bit is 0 or 1. The dummy cells are located at the top and bottom of the memory arrays. They are on the same bit lines as real memory cells.

The dummy cells are located on the edges of the memory arrays, as shown above. They consist of capacitors and transistors (similar to real memory cells), but with a separate line to charge them. The advantage of the dummy cell approach is that manufacturing differences or fluctuations during operation will (hopefully) affect the real cells and dummy cells equally, so the voltage from the dummy cell will remain at the correct level to distinguish between a 0 and a 1. Address bit A0 controls which half of the array provides real data to the bit lines and which half connects dummy cells to the bit lines.
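A back-of-the-envelope charge-sharing calculation shows why a half-sized dummy capacitor lands midway between the two data levels. The sketch uses the capacitance figures quoted in the notes (800 fF bit line, 40 fF cell) and the +12 V precharge. It assumes an ideal switch, a dummy cell that holds 0 V, and stored levels of exactly 0 V and 12 V; these are simplifying assumptions, and the function names are hypothetical.

```python
C_BIT_FF = 800.0    # bit line capacitance quoted in the notes
C_CELL_FF = 40.0    # storage cell capacitance quoted in the notes
V_PRECHARGE = 12.0  # bit lines are precharged to +12 V

def shared_voltage(c_cell_ff, v_cell):
    """Bit line voltage after a cell capacitor is connected (charge conservation)."""
    total_charge = C_BIT_FF * V_PRECHARGE + c_cell_ff * v_cell
    return total_charge / (C_BIT_FF + c_cell_ff)

def sense(v_data, v_reference):
    """Cross-coupled pair: the lower side is pulled to 0 V, the higher side is kept.
    Returns (data side, reference side) voltages after sensing."""
    if v_data > v_reference:
        return v_data, 0.0
    return 0.0, v_reference

# Assumption: the dummy cell is half-sized and holds 0 V before it is connected.
v_dummy = shared_voltage(C_CELL_FF / 2, 0.0)

for v_stored in (0.0, 12.0):
    v_data = shared_voltage(C_CELL_FF, v_stored)
    data_side, _ = sense(v_data, v_dummy)
    print(f"stored {v_stored:4.1f} V -> bit line {v_data:.3f} V, "
          f"dummy {v_dummy:.3f} V, sensed {data_side:.3f} V")
```

With these idealized numbers the two data levels sit a few hundred millivolts on either side of the dummy level; the real margin depends on effects the sketch leaves out, such as transistor threshold drops and noise.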

The column select circuitry

The purpose of the column select circuitry is to select one column out of the 128-bit row; this is the bit that is read or written. Each column select circuit is twice as wide as a memory cell, so they only decode one of 64 columns. The result is that two bits are selected at a time, and circuitry elsewhere selects one of the two bits. Like the row select circuitry, the column select circuitry is implemented by numerous NOR gates, each matching one address. For column select, address bits A0 through A5 select one of 64 lines, selecting two columns at a time. These two bit lines are connected to data lines transmitting the signals to the I/O circuitry. (Since the bit lines for the upper and lower halves of the matrix are separate, there are actually four bit lines selected by the column select circuit.) As with the row select circuitry, dynamic logic is used, controlled by various timing signals. Note that each NOR gate is physically split into two parts with the sense amp in the middle.15

Five of the column decoders, with one highlighted.

The schematic below shows how the column decoder works with the sense amplifier. The diagram shows two bit lines and the top half of the column decoder and sense circuitry; it is mirrored for the lower array. At the top, the sense precharge circuit pulls all the bit lines high. At the bottom, the sense amplifiers amplify and refresh the signals as explained above. The column decoder matches a particular 6-bit address, so one of the 64 decoders will activate the associated sense select circuit, connecting the chip's I/O circuitry to four bit lines (two from the upper memory array as shown here and two from the lower memory array).

Schematic of half the column decoder and sense amplifier.

At this point, four bit lines have been selected for use and their signals are passed to the input/output circuitry; the column select circuitry only decoded 1-of-64, while there are 128 columns, and each half of the array has separate bit lines. Column address bit A6 provides the final selection between the two columns. The selected bit is sent to the data-out pin for a read. For a write, the value on the data-in pin is sent back through the appropriate wire to overwrite the value in the sense amplifier. This circuitry is implemented using dynamic logic and latches, controlled by various timing signals. Much of the circuitry is duplicated, with one copy for the upper half of the memory array and one copy for the lower half. Row address bit A0 distinguishes which half of the matrix is active and which half is providing dummy data. (Note that row address bit A0 was already used to select a particular row, but the circuitry has "lost track" of which was the real row and which was the dummy row, so it must make the selection again.)
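The selection steps in this section can be summarized in a short sketch: a 1-of-64 decode picks a pair of columns, which puts four bit lines (two per array half) on the data lines, and then A6 and row address bit A0 narrow that down to one bit. The function name, the pairing of adjacent columns, and the polarity of A0 are assumptions for illustration; the actual wiring on the die may differ.

```python
def read_selected_bit(upper_lines, lower_lines, col_addr, row_addr):
    """Pick one bit from the sensed bit lines.
    upper_lines / lower_lines: 128 sensed values for each half of the matrix."""
    decoder = col_addr & 0x3F   # column A0-A5: 1-of-64 column decoder
    a6 = (col_addr >> 6) & 1    # column A6: choose within the selected pair

    # Assumption: decoder n drives the adjacent columns 2n and 2n + 1.
    pair = slice(2 * decoder, 2 * decoder + 2)
    upper_pair = upper_lines[pair]   # two bit lines from the upper array
    lower_pair = lower_lines[pair]   # two bit lines from the lower array

    # Assumption: row A0 = 0 means the real row is in the upper array,
    # so the lower array is supplying the dummy reference.
    real_pair = upper_pair if (row_addr & 1) == 0 else lower_pair
    return real_pair[a6]

upper = [0] * 128
lower = [0] * 128
upper[11] = 1                        # column 11 = decoder 5, A6 = 1
print(read_selected_bit(upper, lower, col_addr=5 | (1 << 6), row_addr=0))
```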

Clock generation

The chip requires many timing signals for the various steps in a memory operation. The memory chip doesn't use an external clock, unlike a CPU, but generates its own timing signals internally. The diagram below illustrates the clock generators, using buffers to create a delay between each successive clock output. The first set of timing signals is triggered by the row address strobe (RAS), indicating that the computer has put the row address on the address pins. The next set of timing signals is triggered by the column address strobe (CAS), indicating the column address is on the address pins. Other timing signals are triggered by the WRITE pin.

Conceptual diagram of the clock generation, from 16K—The new generation dynamic RAM.

The real clock circuitry is much more complex than the diagram indicates, consisting of dozens of transistors in multiple chains, feeding back in complex ways to shape the pulses. (Among other things, using dynamic logic requires each buffer to have both an input that pulls it high and an input that pulls it low, forming almost a circular problem.) These gates are mostly built from large transistors, as shown below, to provide enough current to drive the circuitry, and to increase the gate delay sufficiently. The clock circuitry also uses many capacitors, probably bootstrap loads to pull signals up more sharply. I'm not going to describe the clocks in detail since it's a complicated mess.

A small part of the clock circuitry.

Input pins

The chip uses surprisingly complex circuits for the address pins and the data input pin. Mostek's earlier memory chip had problems due to noise margins on the inputs, so the MK4116 uses a complex circuit with an analog threshold, capacitor drive, and multiple controls and latches.

The diagram below shows the threshold generation circuit, which generates a 1.5-volt reference. It uses many tiny transistors in series to generate the voltage level. Conceptually, it is similar to a resistor divider between power and ground to produce an output voltage. However, resistors are both power-hungry and difficult to build in integrated circuits, so transistors are used instead. Since this circuit is always active, the designers needed to minimize its current; this was achieved by using many transistors in series.

The voltage reference used for the address pins. (×18 indicates 18 transistors in series, for instance.)

The voltage on the input pin and the threshold voltage are fed into a differential amplifier/comparator, conceptually similar to the sense amplifiers. Each side tries to pull the other side low, ending up with a 1 for the "winning" side and 0 for the "losing" side. Thus, the input is converted into a binary value. The result from the comparator is stored in a latch. Multiple timing signals gate the input signal, precharge the circuitry, and control the latch.

The data-in circuitry: the pin, latch circuit, and voltage reference. This circuitry is in the lower-left corner of the die.

The photo above shows the input circuit for the data-in pin. Next to the pin's bond wire is the threshold circuit and latch; the two capacitors are the large rectangles of metal. The voltage reference circuit is next; the data-in voltage reference is similar to the address voltage reference described above. (I left the metal layer on for this photo; the polysilicon and silicon underneath are obscured by the oxide layer.)

Conclusion

This memory chip was much more complex than I expected. I studied a simple Intel memory chip earlier so I assumed this DRAM would be larger but not much more complicated. Instead, the MK4116 has complex circuitry with over 1000 transistors controlling it, in addition to the 16,384 transistors for the memory cells and about 1500 transistors for the column selects and sense amps. A cause of the complexity is that the design needed to optimize multiple axes: density, speed, and power efficiency.16

The table below shows that each generation of DRAM chips required substantial technological changes and new developments. Memory designers don't just sit around waiting for Moore's Law to increase the memory capacity; they have to constantly develop new techniques because DRAM storage cells are fundamentally analog. Fortunately, DRAM designers have continued to solve memory scaling problems; 16-gigabit DRAMs recently went into production, an amazing factor of a million larger than the 16-kilobit MK4116 DRAM chip of 1976.

DRAM cell evolution from 4 kilobits to 16 megabits. From Impact of Processing Technology on DRAM Sense Amplifier Design (Gealow, 1990).

I announce my latest blog posts on Twitter, so follow me @kenshirriff or my RSS feed. Thanks to Mike Braden for suggesting the MK4332 chip to me.

Notes and references

  1. A brief history of memory innovations is here. For detailed information on DRAM circuits, see this 1990 thesis on sense amplifier design. For history, Storage array and sense/refresh circuit for single-transistor memory cells (1972) introduced the concepts of dummy cells and cross-coupled sense amplifiers. Intel's chip is discussed in A 16 384-Bit Dynamic RAM (1976) while Mostek's chip is discussed in A 16K × 1 bit dynamic RAM (1977) and 16K - The new generation dynamic RAM (1977). Inconveniently, I found most of these references after I had this blog post nearly completed. 

  2. An unusual characteristic of the chip is that it doesn't use "buried contacts". The issue is how to connect a polysilicon wire to a silicon circuit. In integrated circuits of the 1960s, polysilicon couldn't be connected to silicon directly, so a via connected the polysilicon wire to the metal layer, which had a short connection to a second via that connected down to the silicon. In 1968 at Fairchild, Federico Faggin invented the buried contact, a way to connect the polysilicon and silicon directly. This was much more convenient, so all the NMOS chips that I have examined use buried contacts.

    However, the 4116 doesn't use buried contacts. Instead, it uses the obsolete connections through the metal layer. It's a mystery why they did this. Perhaps the metal wiring density was low enough that the additional segments weren't a problem and they could eliminate one masking and processing step. (Another theory is maybe there were patent issues, but I'm not aware of any patent on the buried contact.) But this illustrates that technological progress isn't consistently linear. Even an advanced chip like the 4116 can use obsolete techniques in some areas. 

  3. In the MK4116, a 0 bit is represented by storing 12 volts on the capacitor, while a 1 bit is represented by 0 volts on the capacitor. This is backward from what you might expect, but probably saved an inverter somewhere in the circuitry. To avoid confusion, I ignore this in the text. 

  4. Early dynamic RAMs such as the Intel 1103 used three transistors per cell and used separate lines for reading and writing data. Improvements in memory technology shrank the circuit to a single transistor and a single data line. Static RAM, in comparison, often requires 6 transistors per bit, but has the advantage of not needing to be refreshed.

  5. For example, Intel's 2107 4096-bit DRAM required 22 pins, as did the 2101 256×4 static RAM chip. It's ironic that Intel used larger packages for these memory chips because a few years earlier, Intel had steadfastly refused to go beyond 16 pins, forcing the Intel 4004 microprocessor to use a 16-pin package. The 8008 microprocessor was barely allowed 18 pins, when 24 pins would have been more convenient. This made the 8008 slower and harder to use. 

  6. Although multiplexing the address pins might seem trivial, Mostek claims that they bet the company on this idea. The problem is how to implement multiplexing without making memory accesses wait while both parts of the address are loaded. (The time to read memory was a key factor in computer design, so every nanosecond counted.) In Mostek's solution, first the row address is put on the address pins, and the row address strobe (RAS) is activated. While the chip is reading that row from memory, the computer puts the column address on the address pins and activates the column address strobe (CAS). By the time the 128 bits of the storage row have been read, the column address is available and the desired bit is selected from the row of 128 bits. In other words, reading of the row is overlapped with loading of the column address, so multiplexing doesn't slow the system. However, careful timing is required to make this multiplexing work; much of the chip is devoted to clock circuitry to generate the necessary timing pulses.

  7. The RAM chip operates on memory a row at a time, and then selects one entry from the row. This isn't the obvious way to access memory. In comparison, magnetic core memory also holds memory cells (cores) in a matrix, but accesses a single cell using X and Y select lines. A key reason for a DRAM to operate a row at a time is so the entire row can be refreshed at once, dramatically reducing the performance overhead from refresh operations. 

  8. You might wonder if it's possible to read multiple bits from a row without repeating the entire row-read operation. The chip designers thought of that and provided several techniques to boost efficiency. The page-read and page-write functions let you rapidly access multiple bits in a 128-bit page (i.e. row). A read-modify-write sequence lets you read a row, modify bits in it, and write it back without repeating the row-read. A RAS-only refresh operation lets you read and refresh a row without providing a column address. The point of this is that the chip designers implemented clever features so customers could squeeze as much performance out of their memory system as possible. See the datasheet for details. 

  9. The block diagram below shows the main functional blocks of the 4116. Many parts of this block diagram didn't make sense to me until after I had reverse-engineered the chip, such as the clock generator, dummy cells, and "1 of 2 data bus select". Many datasheets present a somewhat abstracted view of how the chip operates, but the 4116 datasheet accurately matches the implementation.

    Block diagram of the 4116 memory chip, from the databook.

  10. One inconvenient feature of the memory chip is it requires three different voltages: +12 volts, +5 volts, and -5 volts. Almost all the circuitry runs on 12 volts. The 5-volt supply is used only to provide a standard TTL voltage level for the data out pin. The -5 volts is a substrate bias, connected to the underlying silicon die to improve the characteristics of the transistors. Later chips implemented a charge pump circuit to generate the bias voltage, eliminating the need for an external bias voltage. Later memory chips also eliminated the need for +12 volts. This simplified use of the chips, since only a single-voltage power supply was required. A less-obvious benefit is that this made two of the chip's 16 pins available for other uses. Specifically, these pins were used as additional address bits in the next two generations of memory chips, the 64-kilobit and 256-kilobit chips. As a side effect, the address pins are in a somewhat scrambled order, due to the location of the available pins. 

  11. It's not a coincidence that the input to the sense amp is very small, just enough to be reliably amplified. This is a consequence of economics: if the DRAM produced a large voltage difference, the designers would shrink the cells to save money. But if the voltage difference was too small for reliability, the designers would need to increase the cells. The result is a design where the voltage difference is just barely large enough to be reliably amplified by advanced circuitry. (We noticed the same thing when using a vintage 1960s IBM core memory (video); we were just barely able to read the core values. The cause is the same: if the cores had produced nice clean pulses, they were larger than they needed to be.) 

  12. When the capacitor is connected to the bit line, the resulting voltage will depend on the relative capacitances of the capacitor and the bit line. The bit line capacitance is said to be 800 fF, while the storage cell has 40 fF capacitance, for a 20:1 ratio. Thus, the resulting voltage will be very close to the +12V precharge voltage on the bit line, but perturbed a few hundred millivolts. 

  13. The sense amplifier can only pull a signal low, not raise it, so you might wonder where the amplification happens. Both sides are precharged to +12 volts and the memory cell capacitance only pulls the sides down by 100 millivolts or so. The "winning" side will remain very close to 12 volts, while the other side is pulled to 0 by the sense amp. Thus a 1 bit is pulled higher by the precharge, while a 0 bit is pulled lower by the sense amp. 

  14. The diagram below shows the sense amplifier voltages during operation of a prototype DRAM sense amp. First, the two sides of the sense amp are precharged to the same voltage. Next, a DRAM storage node is selected on one side and a dummy node on the other. Note that the voltage difference between the two sides is very small, maybe 200 millivolts. Finally, the difference is amplified, forcing the higher side up and the lower side down. In this case, the storage node held a 1 so it started slightly higher. If it held a 0, it would start slightly lower and the two lines would diverge in opposite directions. The point is that the sense amp takes a very small voltage differential and amplifies it into a large binary signal.

    Voltage diagram for a prototype sense amp (not the 4116). Based on Storage Array and Sense/Refresh Circuit for Single-Transistor Memory Cells, 1972.

    One difference between this sense amp and the MK4116 is that this circuit is precharged to a midpoint voltage, while the MK4116's is precharged to +12 volts. In this sense amp, one signal must be pulled high, while in the MK4116 both signals start near +12V and one is forced low.  

  15. Robert Proebsting, co-founder of Mostek and developer of address multiplexing, has an oral history that provides some information on the 4116. He discusses why the column decoder selects one of 64 columns and the selection between the pair happens earlier. The reason is they wanted the noise from the address lines to be equal on both sides of the sense amp, so they have three address line pairs on each side.

  16. Intel produced 16,384-bit DRAM chips before Mostek, the 2116 and others, but Mostek's chips beat Intel in the marketplace. Interestingly, the internal structure was completely different from the MK4116. The 2116 contained four memory arrays internally and was structured as two independent 8-kilobit memories. This saved on power since the unused half could be left unpowered during a memory access. Moreover, if a 2116 chip had a manufacturing flaw in one half, Intel repackaged it as an 8-kilobit 2108 chip with either the upper or lower half operational. The user had to set address bit A6 appropriately to get the working half. 
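
The charge-sharing arithmetic in note 12 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 800 fF, 40 fF, and +12V figures come from the note, but the stored cell voltages tried below are made-up values, and the function names are hypothetical.

```python
# Charge sharing between one storage cell and its bit line (values from note 12).
C_BITLINE_FF = 800.0   # bit line capacitance, femtofarads
C_CELL_FF = 40.0       # storage cell capacitance, femtofarads
V_PRECHARGE = 12.0     # bit line precharge voltage, volts


def shared_voltage(v_cell, v_bitline=V_PRECHARGE,
                   c_cell=C_CELL_FF, c_bitline=C_BITLINE_FF):
    """Voltage after the cell capacitor is connected to the bit line.

    Total charge is conserved, so the final voltage is the
    capacitance-weighted average of the two starting voltages.
    """
    total_charge = c_cell * v_cell + c_bitline * v_bitline
    return total_charge / (c_cell + c_bitline)


def bitline_shift(v_cell):
    """Change in bit line voltage caused by connecting the cell."""
    return shared_voltage(v_cell) - V_PRECHARGE


if __name__ == "__main__":
    # Assumed cell voltages, chosen only to show the range of outcomes.
    for v_cell in (0.0, 6.0, 12.0):
        shift_mv = bitline_shift(v_cell) * 1000
        print(f"cell at {v_cell:4.1f} V -> bit line moves {shift_mv:+.0f} mV")
```

With these numbers, even a fully discharged cell moves the bit line by well under a volt, which is why the sense amplifier has so little signal to work with.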

Inside the stacked RAM modules used in the Apple III

In 1978, a memory chip stored just 16 kilobits of data. To make a 32-kilobit memory chip, Mostek came up with the idea of putting two 16K chips onto a carrier the size of a standard integrated circuit, creating the first memory module, the MK4332 "RAM-pak". This module allowed computer manufacturers to double the density of their memory systems and by 1982, Mostek had sold over 3 million modules. The Apple III is the best-known system that used these memory modules.

The MK4332 memory module combined two 16-kilobit memory chips on a ceramic substrate.

This module was built from two 16-kilobit memory chips, each a standard MK4116 dynamic RAM (DRAM) chip packaged in a leadless ceramic chip carrier; these are the golden rectangles on top of the module.

You might wonder why customers didn't simply use these surface-mount packages directly, but at the time soldering surface-mount components was still a challenge for many customers. However, mounting two leadless chips on a dual in-line package (DIP) carrier allowed customers to double their memory density while still using their standard through-hole soldering techniques.

The purple carrier holding the chips was a ceramic substrate designed for thermal compatibility with the chips.1 There is no circuitry inside the ceramic carrier except wiring between the chips and the eighteen DIP pins. The two memory chips were wired in parallel except for their two select lines, which were kept separate. This allowed the desired memory chip to be selected. As a result, the MK4332 module has 18 pins, compared to 16 pins for the chips on top. Mostek used the same module design with the next generation of RAM chips, creating a 128-kilobit RAM module (MK4528) from two 64-kilobit RAM chips (MK4564).

Inside the 4116 memory chip

Although you might expect a complex mounting technique, the two 4116 chips are simply soldered onto the substrate with standard reflow techniques. For the photo below, I removed the metal lid from the left chip with a chisel and unsoldered the right chip with a hot air gun. On the left, you can see the rectangular silicon die inside the leadless carrier package. On the right are the 16 solder pads on the ceramic substrate. The wiring between the solder pads and the DIP pins is inside the ceramic substrate.

The MK4332 with the left package opened and the right package unsoldered.

I created the die photo below from multiple microscope images. The white lines are the metal wiring on top of the chip, while the silicon underneath appears dark red. The two large rectangular regions are the 16,384 memory cells, arranged as a 128×128 matrix, split in two. The circuitry in between these regions consists of 128 sense amplifiers to amplify the bits read from memory, and selection circuitry to select one bit out of the 128. (Externally, the chip is accessed as 16,384×1, outputting a single bit. Typically, eight of these chips were used to store bytes.) The control and interface circuitry is at the left and right, connected to the external pads via tiny bond wires.

Die photo of the 4116 memory chip.

In dynamic RAM, a bit is stored in a capacitor, with a transistor providing access to the capacitor. The value of the bit is represented by the presence or absence of charge on the capacitor. The advantage of dynamic RAM is that each memory cell is very small, constructed from just two components,2 allowing a high memory density. (In comparison, static RAM may require six transistors per cell.) The downside of dynamic RAM is that the charge on a capacitor leaks away after a few milliseconds. To avoid losing data, dynamic RAM must be constantly refreshed: bits are read from the capacitors, amplified, and then written back to the capacitors. For this particular chip, all the data must be refreshed every two milliseconds.

The diagram below illustrates the wiring of the memory cells, showing two of the 128 rows and columns. To read or write data, a row select line is energized. The transistors in that row turn on, connecting that row's capacitors to the data in/out lines. The data from that row is read out of the capacitors and amplified. At that point, the data can either be written back to refresh the row, or a new bit can be written. Note that although the chip accesses 128 bits in parallel internally, the chip provides access to one bit at a time externally, selecting one of the 128 bits to read or write.

Structure of the memory cells, based on the patent.
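
The row-at-a-time access described above can be modeled with a small Python sketch. It is a toy under stated assumptions, not a description of the 4116's actual circuitry: the class and method names are hypothetical, leakage and timing are not simulated, and the per-row refresh interval simply assumes that the 128 row refreshes are spread evenly over the two-millisecond window.

```python
class DramArray:
    """Toy model of a 128x128 one-bit-wide DRAM array (not cycle-accurate)."""

    ROWS = 128
    COLS = 128
    REFRESH_PERIOD_US = 2000.0  # every row must be refreshed within 2 ms

    def __init__(self):
        # cells[row][col] holds the stored bit; real cells hold leaky charge.
        self.cells = [[0] * self.COLS for _ in range(self.ROWS)]

    def _sense_row(self, row):
        """Energize one row select line and amplify all 128 bits of that row."""
        return list(self.cells[row])

    def read(self, row, col):
        """Read one bit; the whole row is sensed and written back (refreshed)."""
        sensed = self._sense_row(row)
        self.cells[row] = sensed
        return sensed[col]

    def write(self, row, col, bit):
        """Sense the row, replace one bit, and write the row back."""
        sensed = self._sense_row(row)
        sensed[col] = bit & 1
        self.cells[row] = sensed

    def refresh_all(self):
        """Touch every row once, as the refresh logic must do every 2 ms."""
        for row in range(self.ROWS):
            self.cells[row] = self._sense_row(row)

    @classmethod
    def row_refresh_interval_us(cls):
        """Average time between row refreshes if they are spread evenly."""
        return cls.REFRESH_PERIOD_US / cls.ROWS


if __name__ == "__main__":
    ram = DramArray()
    ram.write(5, 100, 1)
    print(ram.read(5, 100), ram.read(5, 101))
    print(DramArray.row_refresh_interval_us(), "microseconds per row")
```

Under the even-spacing assumption, one row needs refreshing roughly every 15.6 microseconds, which gives a feel for how much memory bandwidth refresh consumes.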

The magnified photo below shows some of the storage cells, densely packed together. It's a bit hard to visualize what's going on because the chip is constructed from multiple layers. The bottom layer is the grayish silicon die. On top of the silicon are two layers of polysilicon. Above this is the metal wiring, which was removed for this photo. The photo shows three sense lines (data in/out) in the silicon, with bulb-shaped storage cells connected on either side. Vertical strips of polysilicon (poly 1) over the storage cells implement capacitors: the silicon forms the lower plate, while the polysilicon forms the upper plate. The second layer of polysilicon (poly 2) is arranged in diagonal regions to implement the selection transistors. Square notches in the poly 1 layer allow the poly 2 layer to approach the silicon to form transistors. Horizontal metal wiring (not visible) is connected to the poly 2 regions to select a row by driving the transistors. Note that the rows are staggered and interlocking (kind of like a zipper) due to the highly-optimized layout. At the time, fitting this much memory on a chip was a challenge that pushed the limits of integrated circuit technology.

A closeup of the memory chip under the microscope, showing individual storage cells.

Memory chips in the Apple III

Apple was a major customer of these memory modules, using them in the Apple III computer (1980). The Apple III was marketed as a business computer to follow the popular Apple II. Unfortunately, the Apple III was a business failure due to reliability issues and competition from the IBM PC introduced a year later.

Apple III Plus computer. Photo by Bilby, CC BY 3.0.

As was usual for the time, the Apple III's memory board3 was stuffed with memory chips to achieve more capacity. An unusual part of the design is that it used three rows of memory chips (instead of a power of two), mixing 16-kilobit and 32-kilobit memory chips to achieve 128 kilobytes of storage. (The Apple III's case was designed before the boards, so the boards had to be designed to fit the available space.) In the photo below, the top row holds MK4332 memory modules, while the bottom two rows hold 16-kilobit MK4116 chips.4

Apple III main memory card. Photo courtesy of DigiBarn, CC BY-NC 3.0
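
The capacity arithmetic for the three-row board can be checked with a short Python sketch. The chip capacities and the 128-kilobyte total come from the text; the figure of 16 chips per row is an assumption inferred from those totals rather than something stated here, and the function name is hypothetical.

```python
KILOBITS_PER_MK4116 = 16
KILOBITS_PER_MK4332 = 32
CHIPS_PER_ROW = 16  # assumption: inferred from the stated totals, not from the text


def board_kilobytes(rows_of_4332, rows_of_4116, chips_per_row=CHIPS_PER_ROW):
    """Total board capacity in kilobytes for a given mix of rows."""
    kilobits = chips_per_row * (rows_of_4332 * KILOBITS_PER_MK4332
                                + rows_of_4116 * KILOBITS_PER_MK4116)
    return kilobits // 8


if __name__ == "__main__":
    print(board_kilobytes(rows_of_4332=1, rows_of_4116=2))  # one row of modules
    print(board_kilobytes(rows_of_4332=0, rows_of_4116=3))  # all 16-kilobit chips
```

Under that assumption, one row of MK4332 modules plus two rows of MK4116 chips gives 128 kilobytes, and three rows of MK4116 chips gives 96 kilobytes, matching the lower-cost configuration described in the notes.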

A brief history of memory

Memory is an under-appreciated part of computing. The CPU usually gets the attention, but memory was often the limiting factor. The problem with memory is that storing a single bit is easy, but most approaches are impractical when you try to scale up to thousands or millions of bits.

The early ENIAC computer (1946) used vacuum tubes for storage, but these were bulky and expensive, limiting ENIAC to just 20 words (of 10 digits) stored in its accumulators. Early computers such as EDSAC (1949) used mercury delay lines for memory, sending pulse trains of sound waves through tubes of mercury. Although EDSAC could store 512 words, you had to wait for bits to circulate serially through the mercury. An improvement was the random-access Williams tube which stored data as spots on a cathode-ray tube screen. Although they were temperamental, Williams tubes were used in the Manchester Mark 1 (1949) and the commercial IBM 701 (1952).

The introduction of core memory revolutionized computing, providing fast, cheap, and reliable storage, storing each bit in a tiny magnetized ferrite ring. Core memory was introduced in the Whirlwind computer (1953) and used in most computers of the late 1950s and 1960s. However, since each bit required a separate physical ferrite core, memory sizes were limited to a few megabytes for even the largest customers. For example, memory cabinets for the IBM System/360 (1969) held 256 kilobytes but weighed over a ton each (below).

Magnetic core memory was relatively bulky. This photo shows an IBM System/360 Model 85 installation. The cabinets in the front are IBM 2365 Processor Storage, each holding 256 kilobytes. The double-H cabinet in the center is the CPU. Photo from IBM.

Semiconductor memory led to another dramatic shift. At first, semiconductor memory was costly and had very small capacity; Intel's first product was a memory chip holding just 64 bits and costing $99.50. In 1968, Dennard at IBM invented cost-effective dynamic RAM and semiconductor DRAM technology advanced quickly at various companies. Intel introduced the first commercially available DRAM chip in 1970, the i1103 holding 1K bits. This chip was nicknamed the "core killer" because of its impact on the magnetic core memory industry.

Computer storage rapidly moved from core memory to DRAM as the capacity of DRAM increased and the price fell.5 Mostek introduced the 4-kilobit MK4096 chip in 1973, followed by the 16-kilobit MK4116 in 1976. In 1978, Fujitsu introduced the first commercial 64-kilobit DRAM chip and Japan took the lead in DRAM manufacturing.6 Intel left the DRAM industry in 1985 due to decreasing market share and profits, followed by the remaining US DRAM manufacturers.

Fifty years after the introduction of DRAM, it is still the dominant technology for main storage, a remarkably long lifetime. Compared to the 16-kilobit chip I described, Samsung's recent 16-gigabit DRAMs are a factor of a million larger, showing the incredible increase in density. It remains to be seen if anything will challenge the long storage leadership of DRAM.

I announce my latest blog posts on Twitter, so follow me at kenshirriff. I also have an RSS feed. Thanks to Mike Braden for suggesting the MK4332 chip to me.

Notes and references

  1. For details on the construction of the memory modules, see Rectangular chip-carriers double memory-board density, Electronics, 1982. 

  2. Early dynamic RAMs such as the Intel 1103 used three transistors per cell and used separate lines for reading and writing data. Improvements in memory technology shrank the circuit to a single transistor and a single data line.

  3. The Apple III memory board pictured is the "12 volt memory board", given that name because the memory chips required 12 volts (as well as +5 and -5). It was later replaced by the "5 volt memory board", which used only a 5 volt supply. The 5 volt memory board used more modern 64-kilobit memory chips (4864) giving it a larger capacity of 128 or 256 kilobytes. Inconveniently, the power supply required a 12-volt load to operate, so the 5-volt memory board has a power resistor to draw 0.4 amps from the otherwise-unused 12-volt supply. Details are in the Apple III reference manual.

  4. The Apple III memory board was also available in a lower-cost 96-kilobyte module. In that configuration, the 4332 memory modules were replaced with the 16-kilobit (MK4116) chips used on the rest of the board. One clever feature of the 4332 module is that the two "extra" select pins are on the end of the package. The result is that a memory board (such as the Apple III's) can be designed to accept either the 16-pin 16-kilobit chips or the 18-pin 32-kilobit modules, depending on how much memory is desired. With the smaller chips, the two extra pins are unused. It's strange, however, that the Apple III memory board only accepted the larger modules in one of the three rows of chips.

  5. The industry switch from magnetic core memory to semiconductor memory wasn't as straightforward as superior semiconductor memory overthrowing inferior core memory. Instead, there was a time period where they co-existed, due to tradeoffs. For instance, in 1972, a customer could select core memory, semiconductor memory, or a mixture for the D-112 minicomputer (a PDP-8 clone); semiconductor memory was 5 times faster, but core memory supplied four times the capacity per board. By 1973, industry publications were reporting that "Semiconductor memories are taking over data-storage applications". As late as 1980, core memory manufacturers were advertising the benefits of core memory, battling the "myths" that semiconductor was better.

    Was the overthrow of magnetic core by semiconductor memory inevitable? My view is that "technological determinism" acts in some ways; the development of DRAM memory was almost unavoidable following the development of MOS transistors. However, "economic determinism" was more responsible for the success of semiconductor memory: if magnetic core had remained the lower-cost option, it probably would have remained dominant. As a counterexample, CCD (charge-coupled device) memory and bubble memory were hyped as storage technologies of the future, but couldn't achieve the price-performance to dislodge either semiconductor memory or hard disks. 

  6. Note that the capacity of memory chips increased by a factor of 4 each generation (1-, 4-, 16-, 64-kilobit) rather than a factor of 2. The reason is that each address pin was multiplexed to provide two address bits, so each additional address pin resulted in a factor of four increase. By reusing each address pin for both a row address and a column address, the number of address pins was kept low so compact 16-pin packages could be used even as memory sizes expanded to 256-kilobit. Conveniently, as technology improved, memory chips required fewer voltages, freeing up pins formerly used for power. One consequence, though, was that the ordering of address pins on the chip was essentially random, as new address pins were assigned based on which pins were available, rather than sequentially. The multiplexed address system was introduced in the Mostek MK4096 chip and meant that the 256-kilobit 41256 chip used fewer pins than the original 1-kilobit Intel 1103 (16 pins vs 18).
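
The factor-of-four growth in note 6 follows directly from address multiplexing, as this short Python sketch illustrates. The function names are hypothetical, and which half of the address is sent as the row and which as the column is an assumption made only for the example.

```python
def capacity_bits(address_pins):
    """Each multiplexed pin carries one row bit and one column bit."""
    return 2 ** (2 * address_pins)


def split_address(address, address_pins):
    """Split a full address into the (row, column) halves sent over the same pins.

    Assumption for illustration: the high half is the row, the low half the column.
    """
    mask = (1 << address_pins) - 1
    row = (address >> address_pins) & mask
    column = address & mask
    return row, column


if __name__ == "__main__":
    for pins in (6, 7, 8, 9):
        print(pins, "address pins ->", capacity_bits(pins) // 1024, "kilobits")
    print(split_address(130, 7))  # a 14-bit address on a 7-pin, 16-kilobit chip
```

Six, seven, eight, and nine multiplexed pins give 4, 16, 64, and 256 kilobits respectively, so each added pin quadruples the capacity.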

Inside the HP Nanoprocessor: a high-speed processor that can't even add

The Nanoprocessor is a mostly-forgotten processor developed by Hewlett-Packard in 19741 as a microcontroller2 for their products. Strangely, this processor couldn't even add or subtract,3 probably why it was called a nanoprocessor and not a microprocessor. Despite this limitation, the Nanoprocessor powered numerous Hewlett-Packard devices ranging from interface boards and voltmeters to spectrum analyzers and data capture terminals.4 The Nanoprocessor's key feature was its low cost and high speed: Compared against the contemporary Motorola 6800,7 the Nanoprocessor cost $15 instead of $360 and was an order of magnitude faster for control tasks.

Recently, the six masks used to manufacture the Nanoprocessor were released by Larry Bower, the chip's designer, revealing details about its design. The masks were carefully cleaned and scanned by The CPU Shack, and stitched by Antoine Bercovici. The composite mask image below shows the internal circuitry of the integrated circuit.5 The blue layer shows the metal on top of the chip, while the green shows the silicon underneath. The black squares around the outside are the 40 pads for connection to the IC's external pins. I used these masks to reverse-engineer the circuitry of the processor and understand its simple but clever RISC-like design.6

Combined masks from the Nanoprocessor. "GLB", to the left of the data bus, stands for the designers George Latham and Larry Bower. Files courtesy of Antoine Bercovici from scans by The CPU Shack.

The Nanoprocessor was designed in 1974, the same year that the classic Intel 8080 and Motorola 6800 microprocessors were announced. However, the Nanoprocessor's silicon fabrication technology was a few years behind, using metal-gate transistors rather than silicon-gate transistors that were developed in the late 1960s. This may seem like an obscure difference, but silicon gate technology was much better in several ways. First, silicon-gate transistors were smaller, faster, and more reliable. Second, silicon-gate chips had a layer of polysilicon wiring in addition to the metal wiring; this made chip layouts about twice as dense.8 Third, metal-gate circuitry required an additional +12 V power supply. The Intel 4004 processor used silicon gates in 1971, so I'm surprised that HP was still using metal gates in 1974.9

A bizarre characteristic of the Nanoprocessor is its variable substrate bias voltage. For performance reasons, many 1970s microprocessors applied a negative voltage to the silicon substrate, with -5V provided through a bias pin.10 The Nanoprocessor has a bias pin, but strangely the bias voltage varied from chip to chip, from -2 volts to -5 volts. During manufacturing, the required voltage was hand-written on the chip (below). Each Nanoprocessor had to be installed with a matching resistor to provide the right voltage. If a Nanoprocessor was replaced on a board, the resistor had to be replaced as well. The variable bias voltage seems like a flaw in the manufacturing process; I can't imagine Intel making a processor like that.

The HP Nanoprocessor, part number 1820-1691. Note the hand-written voltage "-2.5 V". The last digit (1) of the part number is also hand-written, indicating the speed of the chip. Photo courtesy of Marc Verdiell.

Like most processors of that era, the Nanoprocessor was an 8-bit processor. However, it didn't use RAM, but ran code from an external 2-kilobyte ROM. It contained 16 8-bit registers, more than most processors and enough to make up for the lack of RAM in many applications. Based on transistor count, the Nanoprocessor is more complex than the Intel 8008 (1972) and slightly less complex than the 6800 (1974) or 6502 (1975).11 Its architecture spends its transistors on different purposes than these processors, though. The Nanoprocessor lacks ALU functionality but in exchange, it has a large register set, taking up much of the die area. The Nanoprocessor has 48 instructions, a considerably smaller instruction set than the 6800's 72 instructions. However, the Nanoprocessor includes convenient bit set, clear, and test operations, which these other processors lacked.12 The Nanoprocessor supports indexed register access, but lacks the complex addressing modes of the other processors.

The block diagram below shows the internal structure of the Nanoprocessor. The main I/O feature is the 4-bit "I/O Instruction Device Select" which allows 15 devices to receive I/O operations. In other words, the select pins indicate which I/O device is being read or written over the data lines. External circuitry uses these signals to do whatever is necessary for the particular application, such as storing the data in a latch, sending it to another system, or reading values. More I/O is provided through seven "Direct Control I/O" pins (GPIO pins) that can be used for inputs or outputs. If not connected to external circuitry, these pins operate as convenient bit flags; the Nanoprocessor can set a value and then read it back. The Control Logic Unit performs increments, decrements, shifts, and bit operations on the accumulator, lacking the arithmetic and logical operations of a standard Arithmetic/Logic Unit (ALU).

Block diagram, from the Nanoprocessor User's Guide.

I reverse-engineered the Nanoprocessor's circuitry from the masks and determined how the functional blocks map onto the die, below. The largest feature is the set of 16 registers in the center-left. To the right is the comparator and then the accumulator, along with its increment, decrement, shift, and complement circuitry. The instruction decoder circuitry takes up much of the space above and to the right of the comparator and accumulator. The bottom part of the chip is dominated by the 11-bit program counter, along with the one-entry interrupt stack and subroutine stack. The control circuitry implements the Nanoprocessor's almost-trivial instruction timing: one fetch cycle followed by one execute cycle.13 In most microprocessors, the control circuitry takes up a large fraction of the chip, but the Nanoprocessor's control circuitry is just a small block.

Functional components of the HP Nanoprocessor, based on my reverse-engineering. Underlying die photo by Pauli Rautakorpi, CC BY 3.0.

Understanding the masks

The chip was fabricated using six masks, each used for constructing one layer of the processor using photolithography. The photo below shows the masks; each one is a 47.2×39.8 cm Mylar sheet. These sheets are 100× enlargements of the masks used to produce the 4.72×3.98 mm silicon die (for comparison, about 33% smaller than the 6800's die). Each 3-inch silicon wafer held about 200 integrated circuits, fabricated together on the wafer, and then tested, cut apart, and packaged.

The chip's masks, courtesy of The CPU Shack.
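
As a rough sanity check on the mask and wafer numbers above, the following Python sketch scales the Mylar sheets down to die size and divides the wafer area by the die area. It is a back-of-the-envelope estimate under stated assumptions: it ignores scribe lines, edge losses, and test structures, so it gives an upper bound rather than the real die count, and the function names are hypothetical.

```python
import math

MASK_W_CM, MASK_H_CM = 47.2, 39.8   # Mylar sheet dimensions from the text
MAGNIFICATION = 100                 # the sheets are 100x enlargements
WAFER_DIAMETER_MM = 3 * 25.4        # 3-inch wafer


def die_size_mm():
    """Die width and height in millimeters, scaled down from the masks."""
    return (MASK_W_CM * 10 / MAGNIFICATION, MASK_H_CM * 10 / MAGNIFICATION)


def gross_dies_by_area():
    """Wafer area divided by die area: an optimistic upper bound on die count."""
    width, height = die_size_mm()
    wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
    return wafer_area / (width * height)


if __name__ == "__main__":
    width, height = die_size_mm()
    print(f"die: {width:.2f} x {height:.2f} mm")
    print(f"upper bound: {gross_dies_by_area():.0f} dies per wafer")
```

The area ratio comes out somewhat above 200, which is consistent with "about 200" usable positions once the partial dies around the wafer's edge are discarded.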

To explain the role of the masks, I'll start with the structure of a metal-gate MOSFET, the transistor used in the Nanoprocessor. At the bottom, two regions of silicon (green) are doped to make them conductive, forming the source and drain of the transistor. A metal strip in between forms the gate, separated from the silicon by a thin layer of insulating oxide. (These layers—Metal, Oxide, Semiconductor—give the MOS transistor its name.) The transistor can be considered a switch controlled by the gate. The metal layer also provides the main wiring of the integrated circuit, although the silicon layer is also used for some wiring.

Structure of a metal-gate MOSFET.

Masks are a key part of the integrated circuit construction process, specifying the position of the components. The diagram below shows how a mask is used to dope regions of the silicon. First, the silicon wafer is oxidized to form an insulating oxide layer on top, and then light-sensitive photoresist is applied. Ultraviolet light polymerizes and hardens the photoresist, except where the mask blocks the light. Next, the soft, unexposed photoresist is dissolved. The wafer is exposed to hydrofluoric acid, which removes the oxide layer where it is not protected by photoresist. This yields holes in the oxide that match the mask pattern. The wafer is then exposed to a high-temperature gas which diffuses into the unprotected silicon regions, modifying the silicon's conductivity. These processing steps create tiny doped silicon regions matching the mask's pattern. As will be shown below, the other masks are used for different processing steps, but using the same photoresist-and-mask process.

How a photomask is used to dope regions of silicon.

I'll zoom in on the Nanoprocessor's die and show how one of its circuits is constructed from the six masks. (This two-transistor circuit is an inverter, flipping the binary value of its input.) The first mask dopes regions of silicon to make them conductive, using the photolithography steps described above. The doped regions (green) will become transistor source/drains or wiring between components.

The first mask creates conductive silicon regions.

Next, the die is covered with an insulating oxide layer. The second mask (magenta) is used to etch openings in the oxide, exposing the silicon underneath. These openings will be used to create transistor gates as well as connecting metal wiring to the silicon.

The second mask creates openings in the oxide layer.

The third mask (gray) exposes a region to ion implantation, which changes the doping of the silicon, and thus the transistor's properties. This turns the upper transistor into a special depletion-mode transistor that pulls logic gate outputs high.

The third mask is used to increase the doping of the upper transistor.

Next, the silicon is covered with an additional thin layer of insulating oxide, forming the gate oxide for the transistors. The fourth mask (orange) removes this oxide from regions that will become contacts between the silicon and the metal layer. After this step, most of the die is covered with a thick insulating oxide layer. The oxide layer is very thin over the transistor gates (magenta), and there are contact holes in the oxide from the current mask (orange).

The fourth mask creates holes in the oxide.

The fifth mask (blue) is used to create the metal wiring on top; a uniform metal layer is applied and then the undesired parts are etched off. In locations where the fourth mask created holes in the oxide, the metal layer contacts the silicon and forms a connection. In locations where the second mask created a thin oxide layer, the metal layer forms the transistor gate between two silicon regions. Finally, the entire wafer is covered with a protective glassy layer. The sixth mask (not shown) is used to form holes in this layer over the pads around the edges of each chip. Once the wafer is cut into individual silicon dies (dice?), bond wires are attached to the pads, connecting the die to the external pins.

The fifth mask creates the metal wiring.

The schematic below shows how the circuitry above forms a two-transistor inverter. The two transistor symbols correspond to the two transistors created by the masks. When the input is low, the lower transistor is off and the upper transistor (connected to +5 volts) pulls the output high. When the input is high, it turns on the lower transistor. This connects the output to ground, pulling the output low. Thus, the circuit inverts its input.

Schematic of an NMOS inverter, corresponding to the masks above.

Although the diagrams above show just a single inverter, these masking steps create the entire processor with its 4639 transistors.11 The diagram below shows a larger part of the die with dozens of transistors forming more complex gates and circuitry. One cute thing I noticed on the masks is a tiny heart with HP inside, below the chip's number.14

Chip art: HP inside a heart, below the part number 9-4332A

Controlling a clock with the Nanoprocessor

To understand how the Nanoprocessor was used in practice, I reverse-engineered the code from an HP 98035 clock module. This module was plugged into an HP desktop computer15 to provide a real-time clock, as well as millisecond-accurate timings, intervals, and periodic events. The design of the clock module was rather unusual. To preserve the time when the computer was powered-down, the clock module was built around a digital watch chip with a backup battery.17 Inconveniently, the digital watch chip wasn't designed for computer control: it generated 7-segment signals to drive an LED, and it was set through three buttons. To read the time, the Nanoprocessor had to convert the 7-segment display outputs back into digits. And to set the time, the Nanoprocessor had to simulate the right sequence of button presses to advance through the digits.

Nanoprocessor (white chip) as part of an HP clock module. The 2-kilobyte ROM is to the left of the Nanoprocessor. The two 256-bit×4 RAM chips are to the right. The Texas Instruments clock chip is the large black chip below the green NiCad battery. Photo courtesy of Marc Verdiell.
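
To give a feel for the "7-segment back to digits" step, here is a minimal Python sketch of the idea using a lookup table. It is not the Nanoprocessor's code: the bit assignment for the segments is a hypothetical choice, the patterns are the conventional digit shapes (the watch chip may draw 6, 7, or 9 differently), and the function names are made up for the example.

```python
# Hypothetical bit assignment: bit 0 = segment a (top), then b, c, d, e, f,
# and bit 6 = segment g (middle bar).
SEGMENTS_TO_DIGIT = {
    0b0111111: 0,
    0b0000110: 1,
    0b1011011: 2,
    0b1001111: 3,
    0b1100110: 4,
    0b1101101: 5,
    0b1111101: 6,
    0b0000111: 7,
    0b1111111: 8,
    0b1101111: 9,
}


def decode_digit(segments):
    """Return the digit shown by a 7-bit segment pattern, or None if unrecognized."""
    return SEGMENTS_TO_DIGIT.get(segments & 0x7F)


def decode_display(patterns):
    """Decode a sequence of segment patterns into a list of digits."""
    digits = [decode_digit(p) for p in patterns]
    if None in digits:
        raise ValueError("unrecognized segment pattern")
    return digits


if __name__ == "__main__":
    # Segment patterns for a display reading 12:45
    print(decode_display([0b0000110, 0b1011011, 0b1100110, 0b1101101]))
```

A table lookup like this needs no arithmetic, which suits a processor whose strengths are bit tests and comparisons.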

The host computer controlled the clock module by sending it ASCII strings such as "S 12:07:12:45:00" to set the clock to 12:45:00 on December 7 (or on July 12 if the module was running in European mode). The module's various interval timers, periodic alarms, and counters were controlled with similar commands such as "Unit 2 Period 12345". The module supported 24 different commands, and the Nanoprocessor had to parse them. (See the manual for details.)
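
The set-clock command format can be illustrated with a short Python sketch. It models only the single example given above; the full command grammar is in the manual and is not reproduced here, so the function name, the field names, and the error handling are assumptions made for illustration.

```python
def parse_set_command(command, european=False):
    """Parse a command like "S 12:07:12:45:00" into date and time fields.

    In the default mode the first two fields are month and day; in European
    mode they are day and month, as described in the text.
    """
    verb, _, argument = command.strip().partition(" ")
    if verb != "S":
        raise ValueError("not a set-clock command")
    fields = argument.split(":")
    if len(fields) != 5 or not all(f.isdigit() for f in fields):
        raise ValueError("expected five numeric fields")
    first, second, hour, minute, sec = (int(f) for f in fields)
    month, day = (second, first) if european else (first, second)
    return {"month": month, "day": day, "hour": hour, "minute": minute, "second": sec}


if __name__ == "__main__":
    print(parse_set_command("S 12:07:12:45:00"))
    print(parse_set_command("S 12:07:12:45:00", european=True))
```

The same string yields December 7 in the default mode and July 12 in European mode, with the time 12:45:00 in both cases.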

Here's some sample code reverse-engineered from the clock board ROM. This code is from the interrupt handler that increments the time and date every second. The code below determines how many days are in the current month so it knows when to move to the next month. The columns are the byte value, the corresponding opcode, and my description of the instruction.

This code takes a month number (01-12 BCD) in the accumulator and returns (in register 0) the number of days in the month (28, 30, or 31 BCD). Not bad for 16 bytes of code, even if it ignores leap years. How does it work? For months past 7 (July), it subtracts 1. Then, if the month is odd, it has 31 days, while an even month has 30 days. To handle February, the code clears bit 1 of the month. If the month is now 0 (i.e. February), it has 28 days.
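
The logic described above can be restated as a Python sketch. This is not the Nanoprocessor listing and does not use its opcodes; it only mirrors the steps in the description (decrement past July, test the low bit, clear bit 1 to detect February). The helper for decrementing a packed-BCD byte and both function names are assumptions introduced for the example.

```python
def bcd_decrement(value):
    """Decrement a two-digit packed-BCD byte, e.g. 0x10 -> 0x09."""
    if value & 0x0F:
        return value - 1
    return ((value - 0x10) & 0xF0) | 0x09


def days_in_month(month_bcd):
    """Days in a month (ignoring leap years); input and result are packed BCD."""
    month = month_bcd
    if month > 0x07:              # months past July: subtract 1
        month = bcd_decrement(month)
    if month & 0x01:              # odd month: 31 days
        return 0x31
    if (month & ~0x02) == 0:      # clearing bit 1 leaves zero only for February
        return 0x28
    return 0x30                   # remaining even months: 30 days


if __name__ == "__main__":
    for month in (0x01, 0x02, 0x04, 0x08, 0x09, 0x10, 0x11, 0x12):
        print(f"month {month:02x}: {days_in_month(month):02x} days")
```

Note that the whole calculation needs only a comparison, a decrement, and bit operations, which is exactly the set of operations the Nanoprocessor provides.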

This code demonstrates that even though a processor without addition sounds useless, the Nanoprocessor's bit operations and increment/decrement allow more computation than you'd expect.16 It also shows that Nanoprocessor code is compact and efficient. Many things can be done in a single byte (such as bit test and skip) that would take multiple bytes on other processors.12 The Nanoprocessor's large register file also avoids much of the tedious shuffling of data back and forth often required in other processors. Although some call the Nanoprocessor more of a state machine controller than a microprocessor, that understates the capabilities and role of the Nanoprocessor.

While the Nanoprocessor doesn't include an ALU or have instructions for accessing RAM, these could be added as I/O devices. The clock module has 256 bytes of RAM to hold its multiple counter and timer values, accessed through four I/O ports. Other products added ALU chips to support arithmetic operations.18
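The following Python sketch models the idea of RAM accessed through I/O ports, following the read and write sequences described in note 18. The class, the method names, and the port numbers are all assumptions for illustration; the actual port assignments on the clock board are not given here.

```python
# Hypothetical port numbers; the real board's assignments may differ.
PORT_READ_ADDR = 0    # output: select the address to read
PORT_READ_DATA = 1    # input: fetch the byte at that address
PORT_WRITE_DATA = 2   # output: hold the byte in the data latches
PORT_WRITE_ADDR = 3   # output: send the address, storing the latched byte


class PortMappedRam:
    """Toy model of 256 bytes of RAM reached only through I/O ports."""

    def __init__(self, size: int = 256) -> None:
        self.cells = [0] * size
        self.address = 0
        self.latch = 0

    def output(self, port: int, value: int) -> None:
        value &= 0xFF  # everything travels over the 8-bit data bus
        if port == PORT_READ_ADDR:
            self.address = value
        elif port == PORT_WRITE_DATA:
            self.latch = value
        elif port == PORT_WRITE_ADDR:
            self.address = value
            self.cells[value] = self.latch
        else:
            raise ValueError(f"port {port} is not an output port")

    def input(self, port: int) -> int:
        if port == PORT_READ_DATA:
            return self.cells[self.address]
        raise ValueError(f"port {port} is not an input port")


ram = PortMappedRam()
ram.output(PORT_WRITE_DATA, 0x45)   # write step 1: byte into the latches
ram.output(PORT_WRITE_ADDR, 0x10)   # write step 2: address to the RAM
ram.output(PORT_READ_ADDR, 0x10)    # read step 1: address to the RAM
print(hex(ram.input(PORT_READ_DATA)))  # read step 2: data back
```

Each read or write takes two I/O operations, matching the two-step sequences in the note.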

Conclusions

The Nanoprocessor is an unusual processor. My first impression was that it wasn't even a "real processor", lacking basic arithmetic functionality. The chip was built with obsolete metal-gate technology, a few years behind other microprocessors. Most bizarrely, each chip required a different voltage, hand-written on the package, suggesting difficulty with manufacturing consistency. However, the Nanoprocessor provided high performance in its microcontroller role, much faster than other processors at the time. Hewlett-Packard used the Nanoprocessor in many products in the 1970s and 1980s, in roles that were more complex than you'd expect, such as parsing strings and performing calculations.

While the Nanoprocessor has languished in obscurity, without even a mention on Wikipedia, the masks recently revealed by its designer shed light on this unusual corner of processor history. Thanks to Larry Bower for the donation of the masks, John Culver at The CPU Shack for scanning and sharing the masks, and Antoine Bercovici for remastering the masks. Thanks to Marc Verdiell for dumping the clock board ROM.

I plan to write about the internal circuitry of the Nanoprocessor, so follow me on Twitter at @kenshirriff for updates on Part II. I also have an RSS feed.

Notes and references

  1. More information on the HP Nanoprocessor and its history is in CPU Shack's recent article The Forgotten Ones: HP Nanoprocessor, as well as at HP9825.com and The HP 9845 Project.

  2. I'm not completely comfortable calling the Nanoprocessor a microcontroller since it uses an external program ROM, while a microcontroller usually has everything, including the ROM, on a single chip. (It is like the Intel 4004 in this way.) However, the Nanoprocessor resembles a microcontroller in most ways: it is designed for embedded control applications, with a Harvard architecture and an instruction set optimized for I/O, running a program from ROM with minimal storage. 

  3. On the topic of computers that can't add, the desk-sized IBM 1620 computer (1959) didn't have addition circuitry, but used table lookup for addition. It had the codename CADET; people joked that this stood for "Can't Add, Doesn't Even Try." 

  4. I've determined that the Nanoprocessor was used in the following HP products (and probably others): HP 9845B, HP 3585A spectrum analyzer, HP 3325A Synthesizer / Function Generator, HP 9885 floppy disk drive, HP 3070B data capture terminal, HP 98034 HPIB interface for the HP 9825 calculator, HP 98035 real time clock for the HP 9825 desktop computer, HP 7970E tape drive interface, HP 4262A LCR meter, HP 3852 Spectrum Analyzer, and HP 3455A voltmeter. Poul-Henning Kamp informs me that the HP 3336 Synthesizer / Function Generator and HP 9411 Switch Controller also used the Nanoprocessor. I've also been informed that the HP 3437A System Voltmeter uses the Nanoprocessor. 

  5. The mask images can be downloaded here (warning: 122 MB PSD file). 

  6. The Nanoprocessor is like a RISC (Reduced Instruction Set Computer) processor in many ways, although it predated the RISC concept by several years. In particular, the Nanoprocessor is designed with a simple opcode structure, all instructions execute in one cycle (after the fetch cycle), the register set is large and orthogonal, and addressing is simple. These RISC characteristics yielded a high clock speed compared to more complex processors. 

  7. Interestingly, the Nanoprocessor's competition during development was the Motorola 6800, rather than an Intel processor. The Nanoprocessor's key feature was performance: it ran at 4 MHz, compared to 1 MHz for the 6800. (Both processors took 2 cycles to perform a basic instruction, while the 6800 took up to 7 cycles for more complex instructions.)

    The Nanoprocessor designers wrote a timing comparison, estimating that the Nanoprocessor could count six times faster than the 6800 and handle interrupts over sixteen times faster. The proposal assumed a 5 MHz Nanoprocessor while the actual chip fell a bit short, running at 4 MHz. The projected cost of the Nanoprocessor was $15 per chip, compared to $360 for the Motorola 6800. 

  8. I'm impressed with the density of the Nanoprocessor's layout given its limitations: one layer of metal wiring and no polysilicon. I've looked at other metal-gate chips and their layouts are horribly inefficient, with a lot more wiring than transistors. However, the Nanoprocessor's circuits are arranged efficiently, with very little wasted space. 

  9. The Nanoprocessor's fabrication technology was ahead of the Intel 8080 and Motorola 6800 in one way: it used depletion-mode pull-up transistors, more advanced than the enhancement-mode transistors in the 8080 and 6800. Depletion-mode transistors resulted in faster, lower-power logic gates, but required an additional manufacturing step. For the Nanoprocessor, this step used mask #3 (the gray mask). In processors such as the MOS Technology 6502 and Zilog Z-80, depletion-mode transistors allowed the processor to run off a single voltage instead of three. Unfortunately, the Nanoprocessor still required three voltages due to its metal-gate transistors. 

  10. Early DRAM memory chips and microprocessor chips often required three supplies: +5V (Vcc), +12V (Vdd) and -5V (Vbb) bias voltage. In the late 1970s, improvements in chip technology allowed a single supply to be used instead. The Intel 8080 microprocessor (1974) used enhancement-mode transistors and required three voltages, but the improved 8085 (1976) used depletion-mode transistors and was powered by a single +5V supply. Starting in the late 1970s, many microprocessors used an on-chip charge pump to generate the negative bias voltage. I wrote about the 8086's charge pump here.

  11. By my count, the Nanoprocessor has 4639 transistors. The instruction decoder is constructed from pairs of small transistors for layout reasons; combining these pairs yields 3829 unique transistors. Of these, 1061 act as pull-ups, while 2668 are active. In comparison, the 6502 has 4237 transistors, of which 3218 are active. The 8008 has 3500 transistors and the Motorola 6800 has 4100 transistors. 

  12. Early microprocessors didn't have bit set, reset, and test operations (although these could be accomplished with AND and OR). The Z-80 (1976) added bit operations, but they took two bytes and were much slower than the Nanoprocessor. 

  13. The Nanoprocessor sticks to its model of executing the instruction in one cycle even for two-byte instructions: The second byte is fetched during the execute cycle, so the overall timing is unchanged. 

  14. The Nanoprocessor has two different part numbers. The 1820-1691 was the 2.66 MHz version, while the 1820-1692 was the 4 MHz version. The last digit of the part number was hand-written on each chip after testing its performance. (The part number is unrelated to the chip's number 9-4332A on the die.) 

  15. The HP 9825 was a 16-bit desktop computer, running a BASIC-like language. It was introduced in 1976, five years before the IBM PC, and was a remarkably advanced system for its time. The back of the HP 9825 had three I/O slots for adding modules such as the real time clock.

    An HP 9825 with tape drive, LED display, and printer. From Marc Verdiell's collection.

  16. I came across one place in the code where it needs to add two BCD digits to form one byte. This was accomplished by a loop that decremented one number while incrementing the second. When the first number reached zero, the result was the sum. Thus, even without an ALU, addition is possible but slow. 

  17. The Texas Instruments watch chip was implemented with Integrated Injection Logic (I2L) to keep power consumption low. Nowadays, a low-power chip would use CMOS, but that wasn't common at the time. Integrated Injection Logic was built from bipolar transistors, similar to TTL, but using different high-density, low-power circuitry. I discussed Integrated Injection Logic in detail in this blog post. The Texas Instruments chip may be the X-902 in a DIP package. 

  18. The clock board schematic shows how the two 256×4 RAM chips are connected to the Nanoprocessor. The Nanoprocessor's I/O port select pins are connected to the "3-8 Decoder" U5, which produces a separate signal for each I/O port. Three of these signals go to the RAM chip's control pins, while one signal controls the Data Latch chips U9 and U10 that hold write data.

    RAM chips connected to the Nanoprocessor. From the Clock service manual.

    All I/O ports use the Nanoprocessor's data bus (top) for communication, so the data bus is connected to both the address and data pins of the RAM chips. For a read, the RAM address is written to the RAM chips via one I/O port and then the data is read from RAM via a second port. In both cases, the values go across the data bus, while the signal from the "3-8 Decoder" indicates what to do with the values. For a write, the first I/O operation stores the byte value in the latches, and then the second I/O operation sends the address to the RAM chips. While this may seem like a clunky, Rube-Goldberg approach, it works well in practice; a read or write can be done with two bytes of instructions.

    (Many processors, such as the 6502, used memory-mapped I/O; I/O devices were mapped into the memory address space and accessed through memory read/write operations. The Nanoprocessor is the opposite, putting RAM into the I/O port space and accessing it through I/O operations.)

    Adding an ALU uses a similar approach, as in the HP 3455A voltmeter (schematic), which contains two Nanoprocessors. The voltmeter uses two 74LS181 ALU chips to implement an 8-bit ALU that it uses to scale values and compute percentage error. Two output ports provide the arguments and another port specifies the operation. The 8-bit result is read from a port, while the processor reads the carry through a GPIO pin. (At this point, I'd wonder if it wouldn't be better to use a processor that includes arithmetic.)
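
The ALU-as-I/O-device arrangement in note 18 can be sketched in Python as follows. This is a toy model under stated assumptions: the port numbers and operation codes are invented for illustration and do not follow the 74LS181 function table, and only two operations are modeled.

```python
# Hypothetical port numbers and operation codes, for illustration only.
PORT_ARG_A = 0     # output: first argument
PORT_ARG_B = 1     # output: second argument
PORT_OP = 2        # output: operation select
PORT_RESULT = 3    # input: 8-bit result
OP_ADD = 0
OP_SUB = 1


class PortMappedAlu:
    """Toy 8-bit ALU driven through I/O ports, with carry on a separate pin."""

    def __init__(self) -> None:
        self.a = 0
        self.b = 0
        self.op = OP_ADD

    def output(self, port: int, value: int) -> None:
        if port == PORT_ARG_A:
            self.a = value & 0xFF
        elif port == PORT_ARG_B:
            self.b = value & 0xFF
        elif port == PORT_OP:
            self.op = value
        else:
            raise ValueError(f"port {port} is not an output port")

    def _compute(self) -> int:
        # Keep the full-width result so the carry/borrow can be derived.
        if self.op == OP_ADD:
            return self.a + self.b
        if self.op == OP_SUB:
            return self.a - self.b
        raise ValueError(f"unknown operation {self.op}")

    def input(self, port: int) -> int:
        if port != PORT_RESULT:
            raise ValueError(f"port {port} is not an input port")
        return self._compute() & 0xFF

    def carry_pin(self) -> bool:
        # True on carry out (add) or borrow (subtract); read separately
        # from the result, like the GPIO pin in the note.
        total = self._compute()
        return total > 0xFF or total < 0


alu = PortMappedAlu()
alu.output(PORT_ARG_A, 200)
alu.output(PORT_ARG_B, 100)
alu.output(PORT_OP, OP_ADD)
print(alu.input(PORT_RESULT), alu.carry_pin())
```

A single addition costs three output operations, one input operation, and a pin test, which gives a sense of the overhead of bolting arithmetic onto the processor this way.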