
How the 8086 processor's microcode engine works

The 8086 microprocessor was a groundbreaking processor introduced by Intel in 1978. It led to the x86 architecture that still dominates desktop and server computing. The 8086 chip uses microcode internally to implement its instruction set. I've been reverse-engineering the 8086 from die photos and this blog post discusses how the chip's microcode engine operated. I'm not going to discuss the contents of the microcode1 or how the microcode controls the rest of the processor here. Instead, I'll look at how the 8086 decides what microcode to run, steps through the microcode, handles jumps and calls inside the microcode, and physically stores the microcode. It was a challenge to fit the microcode onto the chip with 1978 technology, so Intel used many optimization techniques to reduce the size of the microcode.

In brief, the microcode in the 8086 consists of 512 micro-instructions, each 21 bits wide. The microcode engine has a 13-bit register that steps through the microcode, along with a 13-bit subroutine register to store the return address for microcode subroutine calls. The microcode engine is assisted by two smaller ROMs: the "Group Decode ROM" to categorize machine instructions, and the "Translation ROM" to branch to microcode subroutines for address calculation and other roles. Physically, the microcode is stored in a 128×84 array. It has a special address decoder that optimizes the storage. The microcode circuitry is visible in the die photo below.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

What is microcode?

Machine instructions are generally considered the basic steps that a computer performs. However, each instruction usually requires multiple operations inside the processor. For instance, an ADD instruction may involve computing the memory address, accessing the value, moving the value to the Arithmetic-Logic Unit (ALU), computing the sum, and storing the result in a register. One of the hardest parts of computer design is creating the control logic that signals the appropriate parts of the processor for each step of an instruction. The straightforward approach is to build a circuit from flip-flops and gates that moves through the various steps and generates the control signals. However, this circuitry is complicated and error-prone.

In 1951, Maurice Wilkes came up with the idea of microcode: instead of building the control circuitry from complex logic gates, the control logic could be replaced with another layer of code (i.e., microcode) stored in a special memory called a control store. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns the processor's control logic into a programming task instead of a difficult logic design task. Microcode also permits complex instructions and a large instruction set to be implemented without making the processor more complex (apart from the size of the microcode). Finally, it is generally easier to fix a bug in microcode than in circuit logic.

Early computers didn't use microcode, largely due to the lack of good storage technologies to hold the microcode. This changed in the 1960s; for example, IBM made extensive use of microcode in the System/360 (1964). (I've written about that here.) But early microprocessors didn't use microcode, returning to hard-coded control logic with logic gates.3 This logic was generally more compact and ran faster than microcode, since the circuitry could be optimized. Since space was at a premium in early microprocessors and the instruction sets were relatively simple, this tradeoff made sense. But as microprocessor instruction sets became complex and transistors became cheaper, microcode became appealing. This led to the use of microcode in the Intel 8086 (1978) and 8088 (1979) and Motorola 68000 (1979), for instance.2

The 8086's microcode

The 8086's microcode is much simpler than in most processors, but it's still fairly complex. The code below is an 8086 microcode routine called "CORD", part of integer division, consisting of 16 micro-instructions. I'm not going to explain how this microcode works in detail, but I want to give a flavor of it. Each line has an address on the left (blue) and the micro-instruction on the right (yellow), specifying the low-level actions during one time step (i.e. clock cycle). Each micro-instruction performs a move, transferring data from a source register (S) to a destination register (D). (The source Σ indicates the ALU output.) For parallelism, the micro-instruction performs an operation or two at the same time as the move. This operation is specified by the "a" and "b" fields; their meanings depend on the type field. For instance, type 1 indicates an ALU instruction such as subtract (SUBT) or left-rotate through carry (LRCY). Type 4 selects two general operations such as "RTN", which returns from a microcode subroutine. Type 0 indicates a jump operation; "UNC 10" is an unconditional jump to line 10 while "CY 13" jumps to line 13 if the carry flag is set. Finally, the "F" field indicates if the condition code flags should be updated. The key points are that the micro-instructions are simple and execute in one clock cycle, they can perform multiple operations in parallel to maximize performance, and they include control-flow operations such as conditional jumps and subroutines.

An example of a microcode routine. The CORD routine implements integer division with subtracts and left rotates. This is from patent 4,449,184.

Each instruction is stored at a 13-bit address (blue) which consists of 9 bits shown explicitly and a 4-bit sequence counter "CR". The eight numbered address bits usually correspond to the machine instruction's opcode. The "X" bit is an extra bit to provide more address space for code that is not directly tied to a machine instruction, such as reset and interrupt code, address computation, and the multiply/divide algorithms.

A micro-instruction is encoded into 21 bits as shown below. Every micro-instruction contains a move from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits is a bit tricky since it depends on the type field, which is two or three bits long. The "short jump" (type 0) is a conditional jump within the current block of 16 micro-instructions. The ALU operation (type 1) sets up the arithmetic-logic unit to perform an operation. Bookkeeping operations (type 4) are anything from flushing the prefetch queue to ending the current instruction. A memory read or write is type 6. A "long jump" (type 5) is a conditional jump to any of 16 fixed microcode locations (specified in an external table). Finally, a "long call" (type 7) is a conditional subroutine call to one of 16 locations (different from the jump targets).

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

This "vertical" microcode format reduces the storage required for the microcode by encoding control signals into various fields. However, it requires some decoding logic to process the fields and generate the low-level control signals. Surprisingly, there's no specific "microcode decoder" circuit. Instead, the logic is scattered across the chip, looking for various microcode bit patterns to generate control signals where they are needed.

How instructions map onto the ROM

One interesting issue is how the micro-instructions are organized in the ROM, and how the right micro-instructions are executed for a particular machine instruction. The 8086 uses a clever mapping from the machine instruction to a microcode address that allows machine instructions to share microcode.

Different processors use a variety of approaches to microcode organization. One technique is for each micro-instruction to contain a field with the address of the next micro-instruction. This provides complete flexibility for the arrangement of micro-instructions, but requires a field to hold the address, increasing the number of bits in each micro-instruction. A common alternative is to execute micro-instructions sequentially, with a micro-program-counter stepping through each micro-address unless there is an explicit jump to a new address. This approach avoids the cost of an address field in each instruction, but requires a program counter with an incrementer, increasing the hardware complexity.

The 8086 uses a hybrid approach. A 4-bit program counter steps through the bottom 4 bits of the address, so up to 16 micro-instructions can be executed in sequence without a jump. This approach has the advantage of requiring a smaller 4-bit incrementer for the program counter, rather than a 13-bit incrementer. The microcode engine provides a "short jump" operation that makes it easy to jump within the group of 16 instructions using a 4-bit jump target, rather than a full 13-bit address.
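To make the hybrid scheme concrete, here is a minimal Python sketch of stepping and short jumps on a 13-bit micro-address. The function names are my own, and the wrap-around from 15 back to 0 is an assumption of the sketch; the text above doesn't say what the hardware does if the counter overflows.

```python
SEQ_BITS = 4                      # low address bits driven by the small incrementer
SEQ_MASK = (1 << SEQ_BITS) - 1    # 0b1111

def step(addr: int) -> int:
    """Advance to the next micro-instruction: only the low 4 bits count up."""
    upper = addr & ~SEQ_MASK          # the upper 9 bits are left untouched
    seq = (addr + 1) & SEQ_MASK       # assumed to wrap within the block of 16
    return upper | seq

def short_jump(addr: int, target: int) -> int:
    """Jump within the current block of 16 by replacing the low 4 bits."""
    if not 0 <= target <= SEQ_MASK:
        raise ValueError("short jump target must fit in 4 bits")
    return (addr & ~SEQ_MASK) | target
```

For example, `step(0x50)` gives `0x51`, and `short_jump(0x53, 10)` gives `0x5A`: in both cases the upper bits of the address never change, which is why a 4-bit incrementer and a 4-bit jump target are enough.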

Another important design decision in microcode is how to determine the starting micro-address for each machine instruction. In other words, if you want to do an ADD, where does the microcode for ADD start? One approach is a table of starting addresses: the system looks in the table to find the starting address for ADD, but this requires a large table of 256 entries. A second approach is to use the opcode value as the starting address. That is, an ADD instruction 0x05 would start at micro-address 5. This approach has two problems. First, you can't run the microcode sequentially since consecutive micro-instructions belong to different machine instructions. Moreover, you can't share microcode since each instruction has a different address in the microcode ROM.

The 8086 solves these problems in two ways. First, the machine instructions are spaced sixteen slots apart in the microcode. In other words, the opcode is multiplied by 16 (has four zeros appended) to form the starting address in the microcode ROM, so there is plenty of space to implement each machine instruction. The second technique is that the ROM's addressing is partially decoded rather than fully decoded, so multiple micro-addresses can correspond to the same physical storage.4

To make this concrete, consider the 8086's arithmetic-logic instructions: one-byte add register to memory, one-byte add memory to register, one-word subtract memory from register, one-word xor register to memory, and so forth. There are 8 ALU operations and each can be byte- or word-sized, with memory as source or destination. This yields 32 different machine opcodes. These opcodes were carefully assigned, so they all have the format 00xxx0xx. The ROM address decoder is designed to look for three 0 bits in those positions, and ignore the other bits, so it will match that pattern. The result is that all 32 of these ALU instructions activate the same ROM column select line, and thus they all share the same microcode, shrinking the size of the ROM.
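The two techniques can be illustrated with a short Python sketch. This models only the idea of a partially-decoded address (a pattern with "don't care" bits), not the actual decoder circuitry; the helper names are hypothetical.

```python
def start_address(opcode: int) -> int:
    """Opcode with four zero bits appended, i.e. opcode * 16."""
    return opcode << 4

def matches(opcode: int, pattern: str) -> bool:
    """True if the 8-bit opcode fits a pattern such as '00xxx0xx' (x = ignored bit)."""
    bits = format(opcode, "08b")
    return all(p in ("x", b) for p, b in zip(pattern, bits))

ALU_PATTERN = "00xxx0xx"   # the three 0 bits are checked; the five x bits are ignored

# Every opcode that would activate the same column select line in this model.
alu_opcodes = [op for op in range(256) if matches(op, ALU_PATTERN)]
```

With this pattern, `len(alu_opcodes)` is 32: five ignored bits give 2^5 opcodes that all land on the same microcode.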

The microcode ROM's physical layout

The microcode ROM holds 512 words of 21 bits, so the obvious layout would be 512 columns and 21 rows. However, these dimensions are not practical for physically building the ROM because it would be too long and skinny. Instead, the ROM is constructed by grouping four words in each column, resulting in 128 columns of 84 rows, much closer to square. Not only does this make the physical layout more convenient, but it also reduces the number of column decoders from 512 to 128, reducing the circuitry size. Although the ROM now requires 21 multiplexers to select which of the four rows corresponds to each output bit, the circuitry is still much smaller. There is a tradeoff with the ability to merge addresses together by ignoring bits, though. Each decoder now selects a column of four words, rather than a single word, so each block of four words must have consecutive addresses.

The main components of the microcode engine. The metal layer has been removed to show the silicon and polysilicon underneath. If you zoom in, the bit pattern is visible in the silicon doping pattern.

The image above shows how microcode is stored and accessed. At the top is the 13-bit microcode address register, which will be discussed in detail below. The column selection circuit decodes 11 of the 13 address bits to select one column of the microcode storage. At the left, multiplexers select one bit out of each four rows using the two remaining address bits (specifically, the two lowest sequence bits). The selected 21 microcode outputs are latched and fed to the rest of the processor, where they are decoded as described earlier and control the processor's actions.
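The arithmetic of this layout can be checked with a small Python model. It is a sketch under assumptions: words are indexed by their physical position (0 to 511) rather than by the 13-bit micro-address, and the ordering of rows within each group of four is my guess, not something read off the die.

```python
WORDS, WIDTH, WORDS_PER_COLUMN = 512, 21, 4
COLUMNS = WORDS // WORDS_PER_COLUMN       # 128 columns
ROWS = WIDTH * WORDS_PER_COLUMN           # 84 rows

def pack(words):
    """Lay out 512 21-bit words as columns[col][row]: 128 columns of 84 rows."""
    columns = [[0] * ROWS for _ in range(COLUMNS)]
    for index, word in enumerate(words):
        col, select = divmod(index, WORDS_PER_COLUMN)
        for bit in range(WIDTH):
            # Each output bit owns a group of four adjacent rows (assumed order).
            columns[col][bit * WORDS_PER_COLUMN + select] = (word >> bit) & 1
    return columns

def read(columns, col, select):
    """The column decoder picks col; 21 four-way multiplexers pick one row per group."""
    word = 0
    for bit in range(WIDTH):
        word |= columns[col][bit * WORDS_PER_COLUMN + select] << bit
    return word
```

Here `select` plays the role of the two lowest sequence bits, and `col` stands in for the output of the column selection circuit.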

Optimizing the microcode

In 1978, the number of bits that could be stored in the microcode ROM was rather limited. In particular, the 8086 holds only 512 micro-instructions. Since it has approximately 256 machine instructions encoded in its one-byte opcodes, combined with multiple addressing modes, and each instruction requires multiple micro-instructions, compression and optimization were necessary to make the microcode fit.5 The main idea was to move functionality out of the microcode and into discrete logic when it made sense. I'll describe some of the ways they did this.

The 8086 has an arithmetic-logic unit (ALU) that performs operations such as addition and subtraction, as well as logical operations such as AND and XOR. Consider the machine instruction ADD, implemented with a few micro-operations that compute the memory address, fetch data, perform the addition, and store the result. The machine instructions for subtraction, AND, or XOR require identical steps, except that the ALU performs a different operation. In total, the 8086 has eight ALU-based operations that are identical except for the operation performed by the ALU.6 The 8086 uses a "trick" where these eight machine instructions share the same microcode. Specifically, the microcode tells the ALU to perform a special operation XI, which indicates that the ALU should look at the appropriate bits of the instruction and do the appropriate operation.7 This shrinks the microcode for these operations by a factor of eight, at the cost of requiring additional logic for the ALU. In particular, the ALU control circuitry has a register to hold the relevant instruction bits, and a PLA to decode these bits into low-level ALU control signals.
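As a rough illustration of what the XI operation asks the ALU control circuitry to do, the Python sketch below picks the operation from three bits of the opcode. The bit positions (bits 5 to 3) and the ordering of the table come from the standard 8086 opcode encoding; the function name is hypothetical, and the real chip does this with a register and a PLA rather than a table lookup.

```python
# Operation selected by opcode bits 5-3 for the 00xxx0xx ALU instructions.
ALU_OPS = ["ADD", "OR", "ADC", "SBB", "AND", "SUB", "XOR", "CMP"]

def xi_operation(opcode: int) -> str:
    """Return the ALU operation implied by the instruction's opcode bits."""
    return ALU_OPS[(opcode >> 3) & 0b111]
```

For instance, `xi_operation(0x29)` returns `"SUB"` and `xi_operation(0x31)` returns `"XOR"`, even though both opcodes would run the same micro-instructions.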

Similarly, the 8086 has eight machine instructions to increment a specific register (out of a set of 8), and eight instructions to decrement a register. All 16 instructions are handled by the same set of micro-instructions and the ALU does the increment or decrement as appropriate. Moreover, the register control circuitry determines which register is specified by the instruction, without involvement from the microcode.

Another optimization is that the 8086 has many machine instructions in pairs: an 8-bit version and a 16-bit version. One approach would be to have separate microcode for the two instructions, one to handle a single byte and one to handle two bytes. Instead, the machine instructions share microcode. The complexity is moved to the circuitry that moves data on the bus: it looks at the low bit of the instruction to determine if it should process a byte or a word. This cuts the microcode size in half for the many affected instructions.

Finally, simple instructions that can take place in one cycle are implemented with logic gates, rather than through microcode. For instance, the CLC (clear carry flag) instruction updates the flag directly. Similarly, prefix instructions for segment selection, instruction locking, or repetition are performed in logic. These instructions don't use any microcode at all, which will be important later.

Using techniques such as these, about 75 different instruction types are implemented in the microcode (instead of about 256), making the microcode much smaller. The tradeoff is that the 8086 requires more logic circuitry, but the designers found the tradeoff to be worthwhile.

The ModR/M byte

There's another complication for 8086 microcode, however. Most 8086 instructions have a second byte: the ModR/M byte, which controls the addressing mode for the instructions in a complex way (shown below). This byte gives 8086 instructions a lot of flexibility: you can use two registers, a register and a memory location, or a register and an "immediate" value specified in the instruction. The memory location can be specified by 8 index register combinations with a one-byte or two-byte displacement optionally added. (This is useful for accessing data in an array or structure, for instance.) Although these addressing modes are powerful, they pose a problem for the microcode.

A summary of the ModR/M byte, from MCS-86 Assembly Language Reference Guide.

These different addressing modes need to be implemented in microcode, since different addressing modes require different sequences of steps. In other words, you can't use the previous trick of pushing the problem into logic gates. And you clearly don't want a separate implementation of each instruction for each addressing mode since the size of the microcode would multiply out of control.

The solution is to use a subroutine (in microcode) to compute the memory address. Thus, instructions can share the microcode for each addressing mode. This adds a lot of complexity to the microcode engine, however, since it needs to store the micro-address for a micro-subroutine-call so it can return to the right location. To support this, the microcode engine has a register to hold this return address. (Since it doesn't have a full stack, you can't perform nested subroutine calls, but this isn't a significant limitation.)

The microcode ends up having about 10 subroutines for the different addressing modes, as well as four routines for the different sizes of displacement. (The 8 possibilities for source registers are handled in the register selection logic, rather than microcode.) Thus, the microcode handles the 256 different addressing modes with about 14 short routines that add the appropriate address register(s) and the displacement to obtain the memory address.
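To show what these routines have to compute, here is a Python sketch of ModR/M decoding and effective-address calculation, following the standard 8086 encoding. It is not a model of the microcode routines themselves: the function names are mine, segment registers are ignored, and the caller is assumed to pass a displacement that has already been sign-extended.

```python
from typing import NamedTuple

class ModRM(NamedTuple):
    mod: int   # top 2 bits: addressing mode / displacement size
    reg: int   # middle 3 bits: register operand (or sub-opcode)
    rm: int    # low 3 bits: register or memory addressing combination

def split_modrm(byte: int) -> ModRM:
    return ModRM(byte >> 6, (byte >> 3) & 0b111, byte & 0b111)

# The 8 register combinations selected by the r/m field for memory operands.
BASE_INDEX = [("BX", "SI"), ("BX", "DI"), ("BP", "SI"), ("BP", "DI"),
              ("SI",), ("DI",), ("BP",), ("BX",)]

def effective_address(byte: int, regs: dict, disp: int = 0) -> int:
    """16-bit offset for a memory operand; disp is the (signed) displacement."""
    m = split_modrm(byte)
    if m.mod == 0b11:
        raise ValueError("mod=11 selects a register, not memory")
    if m.mod == 0b00 and m.rm == 0b110:
        return disp & 0xFFFF                  # special case: direct address
    total = sum(regs[name] for name in BASE_INDEX[m.rm])
    if m.mod != 0b00:
        total += disp                         # mod=01: 8-bit, mod=10: 16-bit
    return total & 0xFFFF
```

With `regs = {"BX": 0x1000, "BP": 0, "SI": 0x20, "DI": 0}`, the byte `0x40` (mod=01, r/m=000) and a displacement of 4 yield `0x1024`, i.e. BX + SI + 4.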

One more complication is that machine instructions can switch the source and destination specified by the ModR/M byte, depending on the opcode. For example, one subtract instruction will subtract a memory location from a register, while a different subtract instruction subtracts a register from a memory location. The two variants are distinguished by bit 1 of the instruction, the "direction" bit. These variants are handled by the control logic, so the microcode can ignore them. Specifically, before the source and destination specifications go to the register control circuitry, a crossover circuit can swap them based on the value of the direction bit.
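The effect of the direction bit, together with the byte/word bit mentioned earlier, can be sketched in a few lines of Python. This is only an illustration of the bit tests for instructions that follow this pattern; `operand_roles` is a hypothetical name, and the real crossover circuit swaps register-select signals rather than strings.

```python
def operand_roles(opcode: int, reg_field: str, rm_field: str):
    """Return (source, destination, width_in_bits) for a two-operand instruction."""
    word = opcode & 0b01        # low bit: 0 = byte, 1 = word
    direction = opcode & 0b10   # bit 1: 0 = reg is the source, 1 = reg is the destination
    source, destination = reg_field, rm_field
    if direction:
        # The "crossover": swap the two specifications before register selection.
        source, destination = destination, source
    return source, destination, 16 if word else 8
```

For example, `operand_roles(0x29, "reg", "r/m")` gives `("reg", "r/m", 16)`, while `operand_roles(0x2A, "reg", "r/m")` gives `("r/m", "reg", 8)`. The microcode is the same in both cases.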

The Translation ROM

As explained above, the starting address for a machine instruction is derived directly from the instruction's opcode. However, the microcode engine needs a mechanism to provide the address for jump and call operations. In the 8086, this address is hard-coded into the Translation ROM, which provides a 13-bit address.8 It holds ten destination addresses for jump operations and ten (different) addresses for call operations.

A second role of the Translation ROM is to hold target addresses for each ModR/M addressing mode, pointing to the code to compute the effective address. As a complication, two of the jump table entries in the Translation ROM are implemented with conditional logic, depending on whether or not the instruction's memory address calculation includes a displacement. By wiring this condition into the Translation ROM, the microcode avoids the need to test this condition.

The image below shows how the Translation ROM appears on the die. It is implemented as a partially-decoded ROM with multiplexed inputs.9 The inputs are at the bottom left. For a jump or call, the ROM uses 4 input bits from the microcode output, since the microcode selects the jump targets. For an address computation, it takes 5 bits from the instruction's ModR/M byte, so the routine is selected by the instruction. The ROM has additional input bits to select the mode (jump, call, or address) and for the conditional jumps. The decoding logic (left half) activates a row in the right half, generating the output address. This address exits at the bottom and is loaded into the micro-address register below the Translation ROM.

The Translation ROM holds addresses of routines in the microcode.

The Group Decode ROM

Above, I've described how various categories of instructions are optimized. For instance, many instructions have a bit that selects if they act on a byte or a word. Many instructions have a bit to reverse the direction of the operation's memory and register accesses. These features are implemented in logic rather than microcode. Other instructions are implemented outside microcode entirely. How does the 8086 determine which way to process an instruction?

The Group Decode ROM takes an instruction opcode and generates 15 signals that indicate various categories of instructions that are handled differently.10 The outputs from the Group Decode ROM are used by various logic circuits to determine how to handle the instruction. Some cases affect the microcode, for instance calling a microcode addressing routine if the instruction has a ModR/M byte. In other cases, these signals act "downstream" of the microcode, for example to determine if the operation should act on a byte or a word. Other signals cause the microcode to be bypassed completely.

A closeup of the Group Decode ROM. The circuit uses two layers of NOR gates to generate the output signals from the opcode inputs. This image shows a composite of the metal, polysilicon, and silicon layers.
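The idea of categorizing opcodes can be sketched in Python as follows. To be clear, these three outputs are illustrative categories taken from this article, not the ROM's actual 15 signals, and the signal names are invented. The opcode values are the standard 8086 encodings.

```python
SEGMENT_PREFIXES = {0x26, 0x2E, 0x36, 0x3E}    # ES:, CS:, SS:, DS:
LOCK_REP_PREFIXES = {0xF0, 0xF2, 0xF3}         # LOCK, REPNZ, REPZ
FLAG_OPS = {0xF5, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD}  # CMC, CLC, STC, CLI, STI, CLD, STD
HLT = 0xF4
GROUP_OPCODES = {0xF6, 0xF7, 0xFE, 0xFF}       # meaning depends on the ModR/M byte

def group_decode(opcode: int) -> dict:
    """Classify an opcode into a few illustrative categories (hypothetical signals)."""
    prefix = opcode in SEGMENT_PREFIXES or opcode in LOCK_REP_PREFIXES
    return {
        "prefix": prefix,
        # Instructions the article describes as handled outside microcode.
        "no_microcode": prefix or opcode in FLAG_OPS or opcode == HLT,
        "group": opcode in GROUP_OPCODES,
    }
```

A table lookup like this is the software analogue of the ROM: the opcode goes in, and a handful of category bits come out for other circuits to act on.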

Specially-encoded instructions

For most of the 8086 instructions, the first byte specifies the instruction. However, the 8086 has a few instructions where the ModR/M byte completely changes the meaning of the first byte. For instance, opcode 0xF6 (Grp 1 below) can be a TEST, NOT, NEG, MUL, IMUL, DIV, or IDIV instruction based on the value of the ModR/M byte. Similarly, opcode 0xFE (Grp 2) indicates an INC, DEC, CALL, JMP, or PUSH instruction.11

The 8086 instruction map for opcodes 0xF0 to 0xFF. Based on MCS-86 Assembly Language Reference Guide.

This encoding may seem a bit random, but there's a reason behind it. Most instructions act on a source and a destination. But some, such as INC (increment), use the same register or memory location for the source and the destination. Others, such as CALL or JMP, only use one address. Thus, the "reg" field in the ModR/M byte is redundant. Since these bits would be otherwise "wasted", they are used instead to specify different instructions. (There are only 256 single-byte opcodes, so you want to make the best use of them.)

The implementation of these instructions in microcode is interesting. Since the instructions share the same first byte, the standard microcode mapping would put them at the same microcode address. However, these instructions are treated specially, with the "reg" field from the ModR/M byte copied into the lower bits of the microcode address. In effect, the instructions are treated as opcodes 0xF0 through 0xFF, so the different instruction variants execute at separate microcode addresses. You might expect a collision with the opcodes that really have the values 0xF0 through 0xFF. However, the 8086 opcodes were cleverly arranged so none of the other instructions in this range use microcode. As you can see above, the other instructions are prefixes (LOCK, REP, REPZ), halt (HLT), or flag operations (CMC, CLC, STC, CLI, STI, CLD, STD), all implemented outside microcode. Thus, the range 0xF0-0xFF is freed up for the "expanded" instructions.

The hardware implementation for this is not too complex. The Group Decode ROM produces an output for these special instructions. This causes the microcode address register to load the appropriate bits from the ModR/M byte, causing the appropriate microcode routine to be executed.
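Here is one way the remapping could work, as a Python sketch. The exact bit assignment is an assumption on my part: I keep bit 3 of the opcode (which separates 0xF6/0xF7 from 0xFE/0xFF) and put the "reg" field in the low three bits, which happens to fill the range 0xF0 to 0xFF exactly. The function name is hypothetical.

```python
GROUP_OPCODES = (0xF6, 0xF7, 0xFE, 0xFF)

def effective_opcode(opcode: int, modrm: int) -> int:
    """Opcode bits used to form the micro-address (assumed mapping, for illustration)."""
    if opcode not in GROUP_OPCODES:
        return opcode                      # ordinary instructions map directly
    reg = (modrm >> 3) & 0b111             # "reg" field of the ModR/M byte
    # Keep the upper nibble and bit 3; replace the low three bits with reg.
    return 0xF0 | (opcode & 0b1000) | reg
```

Under this assumed mapping, the byte and word forms (0xF6 and 0xF7, for example) land on the same address, consistent with the byte/word sharing described earlier, and the 16 combinations of group and "reg" value occupy 16 distinct slots.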

The microcode address register

The heart of the microcode engine is the microcode address register, which determines which microcode address to execute. As described earlier, the microcode address is 13 bits, of which 8 bits generally correspond to the instruction opcode, one bit is an extra "X" instruction bit, and 4 bits are sequentially incremented. The diagram below shows how the circuitry for the bits is arranged. The 9 instruction bits each have a nearly-identical circuit. The sequence bits have more circuitry and each one is different, because the circuit to increment the address is different for each bit.

Layout of the microcode address register. Each bit has a roughly vertical block of circuitry.

The schematic below shows the circuitry for one bit in the microcode address register. It has two flip-flops: one to hold the current address bit and one to hold the old address while performing a subroutine call. A multiplexer (mux) selects the input to each flip-flop. For instance, if the microcode is waiting for a memory access, the "hold" input to the multiplexer causes the current address to loop around and get reloaded into the flip-flop. For a subroutine call, the "call" input saves the current address in the subroutine flip-flop. Conversely, when returning from a subroutine, the "return" input loads the old address from the subroutine flip-flop. The address flip-flop also has inputs to load the instruction as the address, to load an address from the translation ROM, or to load an interrupt microcode handler address. The circuit sends the address bit (and inverted address bit) to the microcode ROM's address decoder.

Schematic of a typical bit in the microcode address register.

Each bit has some special-case handling, so this schematic should be viewed as an illustration, not an accurate wiring diagram. In particular, the sequence bits also have inputs from the incrementer, so they can step to the next address. The low-order bits have instruction inputs to handle the specially-encoded "group" instructions discussed in the previous section.

The control signals for the multiplexers are generated from various sources. A circuit called the loader starts processing of an instruction, synchronized to the prefetch queue and instruction fetch from memory. The call and return operations are microcode instructions. The Group Decode ROM controls some of the inputs, for instance to process a ModR/M byte. Thus, there is a moderate amount of conditional logic that determines the microcode address and thus what microcode gets executed.
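The behavior of the address register and its single subroutine register can be summarized with a small Python model. Like the schematic, it is an illustration rather than an accurate description: the class and method names are mine, it treats all 13 bits as one integer, it ignores the X bit and the Translation ROM, and it follows the description above in saving the current address on a call.

```python
class MicroAddressRegister:
    """Toy model: one address register plus one saved return address (no stack)."""

    def __init__(self):
        self.address = 0
        self.saved = 0

    def load_instruction(self, opcode: int) -> None:
        self.address = opcode << 4          # opcode with four zero sequence bits

    def hold(self) -> None:
        pass                                # address loops back unchanged

    def step(self) -> None:
        low = (self.address + 1) & 0xF      # only the 4 sequence bits increment
        self.address = (self.address & ~0xF) | low

    def call(self, target: int) -> None:
        self.saved = self.address           # overwrites any earlier return address
        self.address = target

    def ret(self) -> None:
        self.address = self.saved
```

Because `call` overwrites `saved`, a second call made inside a subroutine would lose the original return address, which is the "no nested subroutine calls" limitation mentioned earlier.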

Conclusions

This has been a lot of material, so thank you for sticking with it to the end. I draw three conclusions from studying the microcode engine of the 8086. First, the implementation of microcode is considerably more complex than the clean description of microcode that is presented in books. A lot of functionality is implemented in logic outside of microcode, so it's not a "pure" microcode implementation. Moreover, there are many optimizations and corner cases. The microcode engine has two supporting ROMs: the Translation ROM and the Group Decode ROM. Even the microcode address register has complications.

Second, the need for all these optimizations shows how the 8086 was just on the edge of what was practical. The designers clearly went to a lot of effort to get the microcode to fit in the space available.

Finally, looking at the 8086 in detail shows how complex its instruction set is. I knew in the abstract that it was much more convoluted than, say, an ARM chip. But seeing all the special case circuitry on the die to handle the corner cases of the instruction set really makes this clear.

I plan to continue reverse-engineering the 8086 die, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

  1. The 8086 microcode was disassembled (link) a couple of years ago by Andrew Jenner by extracting the bits from my die photos. My post here is a bit different, looking at hardware that runs the microcode, rather than the contents of the microcode. 

  2. According to Wikipedia, the Zilog Z8000 (1979) didn't use microcode, which is a bit surprising for that timeframe. This design decision had the advantage of reducing transistor count, but the disadvantage of hard-to-fix logic bugs in the instruction decoder. 

  3. As an example of a non-microcoded processor, the MOS 6502 (1975) used a PLA (Programmable Logic Array) to perform much of the decoding (details). A PLA provides a structured way of implementing logic gates in a relatively dense array. A PLA is kind of like a ROM—a PLA can implement a ROM and vice versa—so it can be hard to see the difference. The usual distinction is that only one row of a ROM is active at a time, the address that you're reading out. A PLA is more general since multiple rows can be active at a time, combined to form the outputs.

    The Z80 had a slightly different implementation. It used a smaller PLA to perform basic decoding of instructions into various types. It then generated control signals through a large amount of "random" logic (so-called because of its appearance, not because it's actually random). This logic combined instruction types and timing signals to generate the appropriate control signals. 

  4. In a "normal" ROM, the address bits are decoded with logic gates so each unique address selects a different storage column in the ROM. However, parts of the decoder circuitry can "ignore" bits so multiple addresses select the same storage column. For a hypothetical example, suppose you're decoding 5-bit addresses. Instead of decoding 11010 and 11011 separately, you could ignore the last bit so both addresses access the same ROM data. Or you could ignore the last three bits so all 8 addresses of the form 11xxx go to the same ROM location (where x indicates a bit that can be either 0 or 1). This makes the ROM more like a PLA (Programmable Logic Array), but still accessing a single row at a time. 

  5. The Intel 8087 was the floating point co-processor for the 8086. The 8087 required a lot of microcode, more than could fit in a standard ROM on the die. To get the microcode to fit, Intel created a special ROM that stored two bits per transistor (instead of one) by using four different sizes of transistors to generate four different voltages. Analog circuitry converted each voltage level into two bits. This complex technique doubled the density (at least in theory), allowing the microcode to fit on the chip. I wrote about the 8087's non-binary ROM here.

  6. The ALU operations that are grouped together are add, add with carry, subtract, subtract with borrow, logical AND, logical XOR, logical OR, and compare. The compare operation may seem out of place in this list, but it is implemented as a subtract operation that updates the condition flags without changing the value. Operations such as increment, decrement, negation, and logical NOT may seem like they should be included, but since they operate on a single argument instead of two, they are implemented differently at the microcode level. Increment and decrement are combined in the microcode, however. Negation and logical NOT could be combined except that negation affects the condition code flags, while NOT doesn't, so they need separate microcode. (This illustrates how subtle features of the instruction set can have more impact than you might expect.) Since the ALU doesn't have hardware multiplication and division, the multiplication and division operations are implemented separately in microcode with loops. 

  7. The ALU itself isn't examining instruction bits to decide what to do. There's some control circuitry next to the ALU that uses a PLA (Programmable Logic Array) to examine the instruction bits and the microcode's ALU command to generate low-level control signals for the ALU. These signals control things such as carry propagation, argument negation, and logical operation selection to cause the ALU to perform the desired operation. 

  8. The Translation ROM has one additional output: a wire indicating an address mode that uses the BP register. This output goes to the segment register selection circuitry and selects a different segment register. The reason is that the 8086 uses the Data Segment by default for effective address computation, unless the address mode uses BP as a base register. In that case, the Stack Segment is used. This is an example of how the 8086 architecture is not orthogonal and has lots of corner cases. 

  9. You can also view the Translation ROM as a PLA (Programmable Logic Array) constructed from two layers of NOR gates. The conditional entries make it seem more like a PLA than a ROM. Technically, it can be considered a ROM since a single row is active at a time. I'm using the name "Translation ROM" because that's what Intel calls it in the patents. 

  10. Although the Group Decode ROM is called a ROM in the patent, I'd consider it more of a PLA (programmable logic array). Conceptually it holds 256 words, one for each instruction. But its implementation is an array of logic functions. 

  11. These instructions were called "Group 1" and "Group 2" instructions in the 8086 documentation. Later Intel documentation renamed them as "Unary Group 3", "INC/DEC Group 4" and "Indirect Group 5". Some details are here. The 8086 has two other groups of instructions where the reg field defines the instruction: the "Immediate" instructions 0x80-0x83 and the "Shift" instructions 0xD0-0xD3. For these opcodes, the different instructions were implemented by the ALU. As far as the microcode was concerned, these were "normal" instructions so I won't discuss them in this post.

    I should mention that although the 8086 opcodes are always expressed in hexadecimal, the encoding makes much more sense if you look at it in octal. Details here. The octal encoding also applies to other related chips including the 8008, 8080, and Z80. 
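
The "ignored bits" idea in note 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patterns below are hypothetical (they extend the note's 5-bit example) and are not taken from the 8086's actual decoder, and the function names are made up for this sketch. Each storage column is described by a pattern of 0, 1, and x, where x marks a bit the decoder ignores.

```python
def matches(pattern: str, address: int) -> bool:
    """Return True if the address matches a pattern of '0', '1', and 'x' (ignored) bits."""
    bits = format(address, f"0{len(pattern)}b")
    return all(p == "x" or p == b for p, b in zip(pattern, bits))


def select_column(patterns: list[str], address: int) -> int:
    """Return the index of the single storage column that the address selects."""
    hits = [i for i, p in enumerate(patterns) if matches(p, address)]
    if len(hits) != 1:
        # A ROM-style decoder activates exactly one column per address.
        raise ValueError(f"address {address:05b} selects {len(hits)} columns")
    return hits[0]


# Hypothetical decoder: column 0 ignores the last bit, column 1 ignores the last three bits.
PATTERNS = ["1101x", "10xxx"]
```

With these patterns, addresses 11010 and 11011 both select column 0, and all eight addresses of the form 10xxx select column 1, so ten addresses are served by two columns of storage.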

Reverse-engineering the 8086's Arithmetic/Logic Unit from die photos

The Intel 8086 processor was introduced in 1978, setting the course of modern computing. While the x86 processor family has supported 64-bit processing for decades, the original 8086 was a 16-bit processor. As such, it has a 16-bit arithmetic logic unit (ALU).1 The arithmetic logic unit is the heart of a processor: it performs arithmetic operations such as addition and subtraction. It also carries out Boolean logic operations such as bitwise AND and OR, as well as bit shifts and rotates. Since a fast ALU is essential to the overall performance of a processor, ALUs often incorporate interesting design tricks.

The die photo below shows the silicon die of the 8086 processor. The ALU is in the lower-left corner. Above it are the general- and special-purpose registers. An adder, used for address calculation, is in the upper left. (For performance, the 8086 has a separate adder to add the segment register and memory offset when accessing memory.) The large microcode ROM is in the lower right.

The 8086 die, zooming in on one bit of the ALU. The metal and polysilicon layers were removed for this photo, showing the silicon layer.

Zooming in on the ALU shows that it is constructed from 16 nearly-identical stages, one for each bit. The upper row handles bits 7 to 0 while the lower row handles bits 15 to 8.3 In between, the flag circuitry indicates the status of an arithmetic operation through condition codes such as zero or nonzero, positive or negative, carry, overflow, parity, and so forth. These are typically used for conditional branches.

In this blog post, I reverse-engineer the 8086's ALU and explain how it works. It's more complex than other vintage ALUs that I've studied,2 using a flexible circuit that can implement arbitrary bit functions. The carry is implemented with a Manchester carry chain, a fast design dating back to a 1960s supercomputer.

The ALU circuitry

The 8086's ALU circuitry is a bit tricky, so I'll start by explaining how it adds two numbers. If you've studied digital logic, you may be familiar with the full adder, a building-block for adding binary numbers. Specifically, a full adder takes two bits and a carry-in bit. It adds these three bits and outputs the 1-bit sum, as well as a carry-out bit. (For instance 1+0+1 = 10 in binary, so the carry-out is 1 and the sum bit is 0.) A 16-bit adder can be created by joining 16 full-adders, with the carry-out from one fed into the carry-in of the next.
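
As a rough software model of the textbook full adder described above (not of the 8086's actual circuit), the sketch below chains 16 one-bit adders. The function names are made up for illustration.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three bits and return (sum_bit, carry_out)."""
    total = a + b + carry_in
    return total & 1, total >> 1


def ripple_add(x: int, y: int, width: int = 16) -> tuple[int, int]:
    """Add two width-bit values by chaining full adders, starting at bit 0."""
    carry = 0
    result = 0
    for i in range(width):
        # The carry-out of each stage feeds the carry-in of the next stage.
        sum_bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= sum_bit << i
    return result, carry
```

For example, `full_adder(1, 0, 1)` gives a sum bit of 0 and a carry-out of 1, matching the 1+0+1 = 10 case in the text.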

The simplified diagram below represents one stage of the ALU's adder. It takes two inputs and the carry-in and sums them, forming a 1-bit sum output and a carry-out. (Note that the carry signal travels right-to-left.) The sum bit output is generated by the exclusive-or of the two arguments and the carry-in, using the two exclusive-or gates at the bottom. Generating the carry, however, is more complex.

A simplified diagram of the 8086 ALU, showing how it performs addition. Two transistors control the carry-out.

The carry computation uses an optimization called the Manchester carry chain4, dating back to 1959, to avoid delays as the carry ripples from one stage to the next. The idea is to decide, in parallel, if each stage will generate a carry, propagate an existing carry, or block an incoming carry. Then, the carry can rapidly flow through the "carry chain" without sequential evaluation. To understand this, consider the cases when adding two bits and a carry-in. For 0+0, there will be no carry-out, regardless of any carry-in. On the other hand, adding 1+1 will always produce a carry, regardless of any carry-in; this case is called "carry-generate". The interesting cases are 0+1 and 1+0; there will be a carry-out if there was a carry-in. This case is called "carry-propagate" since the carry-in propagates through the stage unchanged.

In the Manchester carry chain, the carry-propagate signal opens or closes transistors in the carry line. In the carry-propagate case, the top transistor is activated, connecting carry-in to carry-out, so the carry can flow through. Otherwise, the lower transistor is activated and the carry-out receives the carry-generate signal, generating a carry if both arguments are 1. Since these transistors can all be set in parallel, carry computation is quick. There is still some propagation delay as the carry signal flows through the transistors in the carry chain, but this is much faster than computing the carry through a sequence of logic gates.5
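
The propagate/generate scheme can be modeled in Python as a sketch. The model is sequential, so it captures the logic of the carry chain but not its speed advantage: on the chip, the propagate and generate signals for all stages are set in parallel and the carry then flows through the pass transistors. The function name and structure are assumptions made for illustration.

```python
def manchester_add(x: int, y: int, width: int = 16, carry_in: int = 0) -> tuple[int, int]:
    """Add two values using per-stage carry-propagate and carry-generate signals."""
    # Step 1: every stage decides its signals independently of the carry.
    propagate = [((x >> i) & 1) ^ ((y >> i) & 1) for i in range(width)]
    generate = [((x >> i) & 1) & ((y >> i) & 1) for i in range(width)]

    # Step 2: the carry flows along the chain, steered by those signals.
    carry = carry_in
    result = 0
    for i in range(width):
        result |= (propagate[i] ^ carry) << i  # sum bit = a XOR b XOR carry-in
        # Propagate case: pass the incoming carry through unchanged.
        # Otherwise: the carry-out is the generate signal (1 only for 1+1).
        carry = carry if propagate[i] else generate[i]
    return result, carry
```

Note that the "block" case (0+0) needs no special handling: propagate is 0, so the carry-out takes the generate value, which is also 0.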

That explains how the ALU performs addition,6 but what about logic functions? How does it compute AND, OR, or XOR? Suppose you replace the carry-propagate XOR gate with a logic gate (AND, OR, or XOR) and replace the carry-generate gate with 0, as shown below. The output will simply be the AND (or OR or XOR) of the two arguments, depending on the new gate. (The right XOR gate has no effect since XOR with 0 passes the value through unchanged.) The point is that if you could somehow replace the gates, the same circuit could compute the AND, OR, and XOR logic functions, as well as addition.

To compute a logic function, the XOR gate is (conceptually) replaced by a different logic gate, and the carry-generation is blocked.

Another important operation is bit shifting. The ALU shifts a value to the left by taking advantage of the carry line in an unusual way (below).7 The bit from the first argument is directed into the carry-out, sending it one bit position to the left. The received carry bit passes through the XOR gate, resulting in a left shift by one bit. The carry-propagate signal is set to 0; this both directs the argument bit to carry-out, and turns the XOR gate into a pass-through. (A right shift is implemented with a separate circuit, as will be explained below.)

Shifting left by one bit takes advantage of the carry line to pass each bit to the left.
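
The left shift can be modeled with the same carry-chain structure. In this sketch (an interpretation of the description above, with made-up names), carry-propagate is forced to 0 and the argument bit is placed on the generate line, so each bit leaves through carry-out and arrives at the next stage as carry-in.

```python
def shift_left_via_carry(value: int, width: int = 16, carry_in: int = 0) -> tuple[int, int]:
    """Shift left by one bit by routing each bit through the carry line."""
    carry = carry_in
    result = 0
    for i in range(width):
        arg_bit = (value >> i) & 1
        propagate = 0       # forced to 0 for a shift
        generate = arg_bit  # the argument bit is steered to carry-out
        # XOR with 0 passes the incoming carry straight to the output bit.
        result |= (propagate ^ carry) << i
        carry = carry if propagate else generate
    return result, carry
```

The carry-in supplies the new low bit and the final carry-out holds the bit shifted out of the top.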

Thus, the ALU can reuse this circuit to perform a variety of operations, by reprogramming the carry-propagate and generate gates with different functions. But how are these magic reprogrammable gates implemented? The trick is that any Boolean function of two variables can be specified by the four values in the truth table. For instance, AND has the truth table below, so it can be specified by the four values: 0, 0, 0, 1:

A  B  A AND B
0  0  0
0  1  0
1  0  0
1  1  1

If we feed those values into a multiplexer, and select the desired value based on the two inputs, we will get the AND of the inputs. If instead, we feed 0, 1, 1, 0 into the multiplexer, we will get the XOR of the inputs. Other inputs create other logic functions similarly. With the appropriate values, any logic function of two variables can be implemented.8 (Some special cases: 0, 0, 0, 0 will output the constant 0; while 0, 0, 1, 1 will output the input A.) This multiplexer circuit is used for the carry-propagate gate. A similar but half-sized circuit is used for the carry-generate gate.9

A multiplexer acts as a generic gate in the ALU.
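
The multiplexer-as-gate idea can be sketched in Python as follows. The four control values are the truth-table outputs listed for inputs (A, B) = 00, 01, 10, 11, as in the AND table above. The names `mux_gate` and `bitwise` are made up for this illustration.

```python
def mux_gate(table: tuple[int, int, int, int], a: int, b: int) -> int:
    """Select one of four control values using the two input bits."""
    return table[(a << 1) | b]


# Truth tables, listed for (A, B) = 00, 01, 10, 11.
AND = (0, 0, 0, 1)
OR = (0, 1, 1, 1)
XOR = (0, 1, 1, 0)
ZERO = (0, 0, 0, 0)    # constant 0
PASS_A = (0, 0, 1, 1)  # outputs input A


def bitwise(table: tuple[int, int, int, int], x: int, y: int, width: int = 16) -> int:
    """Apply the same programmable gate to every bit position."""
    result = 0
    for i in range(width):
        result |= mux_gate(table, (x >> i) & 1, (y >> i) & 1) << i
    return result
```

Changing the four control values changes the function computed by every stage at once, without changing the circuit.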

Now that I've presented the background, the complete ALU circuit is shown below, with multiplexers in place of the carry-propagate and generate gates. On the chip, the carry-in and carry-out are inverted, and this is reflected below. The schematic also shows the connection from the ALU to the bus, outputting the result. The circuitry at the bottom supports the shift right operation, which doesn't fit into the general circuit of the ALU. For this blog post, I'll ignore how the control signals are generated.10

One bit of the ALU circuitry in the 8086.

The ALU's implementation in silicon

The 8086 and other processors of that era were built from a type of transistor called NMOS. The silicon substrate was "doped" by diffusion of arsenic or boron to form conductive silicon and transistors. On top of the silicon, polysilicon wiring created the gates of the transistors and wired components together. Finally, a metal layer on top provided more wiring. (In comparison, modern processors are built from CMOS technology, which combines NMOS and PMOS transistors, and they have many layers of metal wiring.)

Structure of an NMOS transistor (MOSFET) as implemented in an integrated circuit.

The diagram above shows the structure of an NMOS transistor. The transistor can be viewed as a switch, allowing current to flow between two diffusion regions called the source and drain. The transistor is controlled by the gate, made of a special type of silicon called polysilicon. A high voltage on the gate lets current flow between the source and drain, while low voltage on the gate blocks the current flow.

The simplest logic gate is an inverter; the diagram below shows how an inverter is built from an NMOS transistor and a resistor.11 The pinkish regions are doped silicon, while the brownish lines are the polysilicon wiring on top. A transistor is formed where the polysilicon line crosses the doped silicon. With a low input, the transistor is off, so the pull-up resistor pulls the output high. With a high input, the transistor turns on. This connects the output to ground, pulling the output low. Thus, the input signal is inverted.

An inverter, as implemented in the 8086's ALU.

A more complex gate, such as the 2-input NOR gate below, uses similar principles. With low inputs, the transistors are turned off, so the pullup resistor pulls the output high. If either input is high, the corresponding transistor turns on and pulls the output low. Thus, this circuit implements a NOR gate. The die layout matches the schematic, but has a complicated appearance due to space-saving optimization. You might expect the transistors to be simple rectangles, but the silicon regions have irregular shapes to make the most use of the space. In addition, other transistors (not part of the NOR gate) share the ground connections to save space.

A NOR gate as implemented in the 8086's ALU. The metal wiring has been removed for this photo, showing the silicon and polysilicon underneath.

The multiplexers are built using a completely different technique: pass transistors. Instead of pulling the output to ground, pass transistors pass an input signal through to the output. In the multiplexer, each input is connected to a different pair of transistors. Depending on the arguments, exactly one pair will have both transistors on. For instance, if arg2 is 0 and arg1 is 1, the transistor pair in the upper left will connect ctl01 to the output. Each other input will be blocked by a transistor that is turned off. Thus, the multiplexer selects one of the four inputs, passing it through to the output. (This pass-transistor approach is more compact than building a multiplexer out of standard logic gates.)

Implementation of the multiplexed gate in the ALU.

The diagram below shows an ALU stage with the major components labeled. You may spot the inverter, NOR gate, and multiplexer described earlier. Other components are implemented with similar techniques. This diagram can be compared with the earlier schematic. The reddish horizontal lines are remnants of the metal layer, which was removed for this photo. These lines carried the control signals, power, and ground.

Die photo of ALU with main components labeled.

The ALU's temporary registers

The diagram below (from the 8086 patent) shows how the ALU is connected to the rest of the processor by the ALU bus. The discussion above covered the "Full Function ALU" in the middle of the diagram, which takes two 16-bit inputs and produces a 16-bit output. These inputs are supplied from three temporary registers: A, B, and C. (These temporary registers are invisible to the programmer and should not be confused with the 8086's AX, BX, and CX registers.) I'll mention a few features of these registers that will be important later. Any register can provide the ALU's first input, but the second input always comes from the B register. These registers have a bidirectional connection to the ALU bus, so they can be both written and read. One unusual feature of the ALU is that it has a single data connection to the rest of the 8086, through the ALU bus.12 This seems like a bottleneck, since two clock cycles are required to load the registers, followed by another clock cycle to access the result. But apparently the single bus worked well enough for the 8086.

This diagram from the 8086 patent shows the ALU and its associated registers.

The Processor Status Word (PSW) shown above holds the condition flags, status bits on the ALU result: zero, negative, overflow, and so forth. Although the PSW looks trivial in the diagram above, the die photo at the top of the article shows that it constitutes about a third of the ALU circuitry. I'll leave the flag circuitry for a later discussion due to its complexity: each flag has unique circuitry that handles many special cases.

The schematic below shows one bit of the reverse-engineered implementation of the ALU's temporary registers. The registers are implemented with latches; each box represents a latch, a circuit that holds one bit. The two large AND-NOR gates act as multiplexers, selecting the output from one of the latches. The upper gate selects one of the registers for reading. The lower gate selects one of the registers as an argument for the ALU.

This circuit implements the ALU's three temporary registers and the associated circuitry.

While the 6-input AND-NOR gate multiplexer may look complex, it is straightforward to implement with NMOS transistors. The schematic below shows how it is built from 6 transistors and a pull-up. You can verify that if both transistors in a pair are energized, the output will be pulled to ground, providing the AND-NOR function.

The 6-input AND-NOR gate is built from 6 transistors, arranged in pairs.
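
A small Python model shows why an AND-NOR gate works as a multiplexer. This is a logical sketch only: the function names are made up, and pairing each select line with one latch value is an assumption about how the gate is wired as a register selector.

```python
def and_nor(pairs) -> int:
    """Model a pull-up with series transistor pairs: any pair with both inputs high pulls the output low."""
    pulled_low = any(x and y for x, y in pairs)
    return 0 if pulled_low else 1


def read_register(select: tuple[int, int, int], stored: tuple[int, int, int]) -> int:
    """Pair each select line with a latch value; the output is the inverted selected bit."""
    return and_nor(zip(select, stored))
```

With exactly one select line high, only that pair can pull the output low, so the output is the inverse of the selected latch value. Unselected pairs have no effect regardless of what their latches hold.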

The latch circuit is shown below. I've written about the 8086's latches in detail, so I'll give just a quick summary. The idea of the latch is that it can stably hold either a 0 or a 1 bit. When the clock signal clk' is high, the upper transistor is on, connecting the inverters into a loop. If the input to the first inverter is 1, it outputs a 0 to the second inverter, which outputs a 1 to the first, so they stay in that state, storing the bit. Similarly, the loop is stable if the input is a 0.

A one-bit latch in the 8086's ALU.

The special thing about this latch is that it's a dynamic latch. When the clock signal clk' is low, the loop is broken, but the input on the first inverter remains, due to the capacitance of the wire and transistor. When clk' goes high again, this voltage is refreshed. Alternatively, when clk' is low, a new value can be loaded into the latch by activating load, turning on the first transistor and allowing a new input signal to pass into the latch. The 8086 uses dynamic latches because the latch is compact, using just two transistors and two inverters. The latch is implemented in silicon as shown below.

Implementation of a latch in the 8086's ALU.

The diagram below summarizes the components of the temporary register implementation. This circuitry is repeated 16 times to complete the registers.13 The output from the registers is fed into the ALU circuitry described earlier.

The ALU uses three temporary registers to hold arguments. This diagram shows the implementation of one bit.

Conclusions

Although the Intel 8086 has complex circuits, its features are large enough that it can be studied under a microscope. The ALU is a key part of the processor and takes up a large fraction of the die. Its circuitry can be reverse-engineered through careful examination, revealing its interesting construction. It uses a Manchester carry chain for fast carry propagation. The carry-generate and carry-propagate signals are created by multiplexers that operate as arbitrary function generators, creating a flexible ALU with a small amount of circuitry. The ALU is built from a combination of standard logic, pass-transistor logic, and dynamic logic to optimize performance and minimize size.

You might have noticed that the 8086's ALU doesn't have support for multiplication, division, or multiple-bit shifts, even though the 8086 has instructions for these operations. These operations are computed in microcode using simpler ALU operations (shift, add, subtract for multiplication and division, and repeated single-bit shifts for larger shifts).

Some features of the ALU remain to be described, in particular the condition flags and how the ALU control signals are generated from opcodes. I plan to write about these soon, so follow me on Twitter @kenshirriff or RSS for updates.

Notes and references

  1. The ALU size almost always matches the processor word size, but there are exceptions. Notably, the Z-80 is an 8-bit processor but has a 4-bit ALU. As a result, the Z-80's ALU runs twice for each arithmetic operation, processing half the byte at a time. Some early computers used a 1-bit ALU to keep costs down, but these serial processors were slow. 

  2. I've looked at the ALU of various other early microprocessors including the 8008, Z-80, and the 8085. I've also reverse-engineered the 74181 and Am2901 ALU chips. 

  3. The ALU's layout has bits 15-8 in the top and bits 7-0 below. This layout is a consequence of the bit ordering in the data path: the bits are interleaved 15-7-14-6-...-8-0, instead of linearly 15-14-...-0. The reason behind this interleaving is that it makes it easy to swap the two bytes in the 16-bit word, by swapping pairs of bits. The ALU is split into two rows so it fits into the horizontal space available. Even with the tall, narrow layout of an ALU stage, a bit of the ALU is wider than a bit of the register file. Splitting the ALU into two rows keeps the bit spacing approximately the same, avoiding long wires between the register file and the ALU. 

  4. The Manchester carry chain was developed by the University of Manchester and described in the article Parallel addition in digital computers: a new fast 'carry' circuit, 1959. It was used in the Atlas supercomputer (1962). 

  5. The ALU also uses carry-skip techniques to speed up carry calculation; I'll briefly summarize. The idea of carry-skip is to skip over some of the stages in the carry chain if possible, reducing the worst-case delay through the chain. For example, if there is a carry-in to bit 8, and the carry-propagate is set for bits 8, 9, 10, and 11, then it can be immediately determined that there is a carry-in to bit 12. Thus, by ANDing together the carry-in and the four carry-propagate values, the carry-in to bit 12 can be calculated immediately for this case. In other words, the carry skips from bit 8 to bit 12. Likewise, similar carry-skip circuits allow the carry to skip from bit 2 to bit 4, and bit 4 to bit 8. These carry-skip circuits reduced the ALU's worst-case computation time. The carry-skip circuitry explains why each stage in the ALU is similar but not quite identical. Note that for logic operations or shift, either carry-propagate or carry-generate is 0, so the carry-skip won't activate and corrupt the result. 

  6. I should mention how subtraction is handled. A typical ALU inverts one of the inputs before adding, reusing the addition circuitry for subtraction. However, the 8086's ALU implements subtraction by changing the inputs to the multiplexers, as shown below. This leverages the general-purpose multiplexer and avoids implementing separate negation circuitry. (The comparison operation is implemented as subtraction but without storing the result. If the difference is zero, the values are equal, while a positive difference indicates the first value is larger.)

    Subtraction is similar to addition, but with the second argument negated. This is accomplished by inverting one input of the carry-generate AND gate and changing the carry-propagate XOR to XNOR.

  7. The typical way a processor implements a left shift by one bit is by adding the value to itself. I don't know why the 8086 used the carry approach rather than the adding approach. 

  8. An FPGA (field-programmable gate array) uses similar techniques to implement arbitrary logic functions. The truth table is stored in a lookup table (LUT). These lookup tables are typically larger; a 6-input lookup table has 2^6 = 64 entries. One difference between the FPGA and the ALU is that the FPGA is programmed and then the gate functions are fixed, while the ALU's gates can change functions every operation.

  9. The carry-generate multiplexer returns 0 if argument 1 is 0. In other words, it only implements two cases of the truth table and has two control inputs. To handle the other two cases, it is pulled low by the clock signal so it outputs 0. Because it is driven by the clock and depends on the value held by the circuit capacitance, it is a form of dynamic logic. The 8086 primarily uses standard static logic, but uses dynamic logic in some places. 

  10. The control signals for the ALU are generated from a PLA (similar to a ROM) that takes a 5-bit opcode as input. This opcode can either come from the instruction or be specified by the microcode. For an instruction, the ALU portion of the instruction is typically bits 5-3 of the first byte of the instruction or bits 5-3 of the MOD R/M byte. The point of this is that one microcode routine can handle all the similar arithmetic/logic instructions, making the microcode smaller. The ALU control PLA generates the signals to perform the correct ALU operation, transparently to the microcode. I should mention that there are many more ALU control signals than I described. Many of these control the flag handling, while others control various special cases.

    The control signals pass through the peculiar circuit below. If the input is high, it sends a clock pulse to the ALU. Otherwise, it remains low. The drive signal is discharged to ground on the negative clock phase by the lower transistor. In the absence of an input, the signal is not driven during the positive clock phase, but remains low due to dynamic capacitance. One mystery is the transistor with its gate tied to +5V, leaving it permanently on, which seems pointless. It will reduce the voltage to the gate of the clk transistor, and thus the output voltage, but I don't see why. Maybe to reduce current? To slow the signal?

    The drive signals to the ALU gates are generated with this dynamic circuit.

  11. The pull-up resistor in an NMOS gate is implemented by a special depletion-mode transistor. The depletion-mode transistor acts as a resistor but is more compact and performs better than an actual resistor. 

  12. In the 6502, the two inputs of the ALU are connected to separate buses (details), so they can both be loaded at the same time. The 8085 (and many other early microprocessors) connect the accumulator register to one input of the ALU to avoid use of the bus (details). 

  13. The silicon implementation of the lower eight bits of the ALU / registers is flipped compared to the upper eight bits. The motivation is to put the ALU signals next to the flag circuitry that needs these signals. Since the flag circuitry is sandwiched between the two halves of the ALU, the two halves become (approximate) mirror images. (See the die photo at the top of the article.) 
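
The subtraction scheme in note 6 above can be checked with a short Python sketch. It uses XNOR for carry-propagate and an AND with the second argument inverted for carry-generate, as the note describes. One assumption is labeled in the code: the carry-in is set to 1 to complete the two's-complement subtraction, which the note doesn't spell out. The function name is made up for illustration.

```python
def subtract_via_mux(x: int, y: int, width: int = 16, carry_in: int = 1) -> tuple[int, int]:
    """Compute x - y by changing the propagate and generate functions."""
    # Assumption: carry_in = 1 supplies the "+1" of two's-complement negation.
    carry = carry_in
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        propagate = 1 ^ a ^ b   # XNOR instead of XOR
        generate = a & (1 ^ b)  # AND with the second argument inverted
        result |= (propagate ^ carry) << i
        carry = carry if propagate else generate
    return result, carry
```

In this model the final carry-out is 1 when no borrow is needed (x is at least y) and 0 otherwise. How the 8086 maps this onto its carry flag is not covered here.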

A visit to the Large Scale Systems Museum

I didn't expect to find two floors filled with vintage computers in a small town outside Pittsburgh. But that's the location of the Large Scale Systems Museum, housed in a former department store. The ground floor of this private collection concentrates on mainframes and minicomputers from the 1970s to 1990s featuring IBM, Cray, and DEC systems, along with less common computers. Amazingly, most of these vintage systems are working. Upstairs, the museum is filled with vintage home computers from the pre-PC era.

IBM

IBM set the standard for the mainframe computer with its introduction of the System/360 in 1964, a line of computers designed to support the full circle (i.e. 360°) of business and scientific applications. The System/360 evolved into the System/370 in the 1970s and the System/390 in the 1990s. Most of these mainframes filled a data center, but the museum has some smaller S/370 and S/390 mainframes designed for offices. The IBM System/370 9375 (1986; below), is described as a "baby mainframe" or "super-mini computer" for engineering or commercial applications.

IBM System/370 9375. The computer itself is in the middle rack. The left rack has a 3490E tape cartridge storage system, while the right rack holds 9335 disk controllers and disk drives (856 MB per drive).

The System/390 line is represented by the IBM System/390 Multiprise-2003 (1997; below). This mainframe could not boot up on its own, but required a special desktop PC called the Mainframe Service Element (photo) to initialize the mainframe with microcode and start it up.

This low-end IBM System/390 Multiprise-2003 had 1 GB of memory and supported hundreds of simultaneous database transactions.

To support smaller customers, IBM also produced minicomputers, which they called "midrange systems". The IBM System/32 (1975; below) is a minicomputer built into a desk, designed for small businesses. IBM's midrange systems evolved into the IBM AS/400 (1992; photo).

This IBM System/32 had 16 KB of memory and 13 MB of disk storage. It leased for $1200 per month.

The museum has many disk drives and tape drives. One example is the massive 3380E disk drive (below; 1985), providing 5 gigabytes of storage. It's amazing to think that you can now hold a thousand times as much storage in your hand.

The IBM 3380E disk system stored 5 gigabytes of data. The 14-inch disk platter is in the center, labeled "E".

Cray

Computer designer Seymour Cray and his company Cray Research were famed for building the world's fastest supercomputers. The museum has several Cray computers from the 1990s. The Cray YMP-EL supercomputer (1992; below) was an "Entry Level" Cray, costing $300,000. It was built from CMOS chips rather than the fast but hot ECL chips in earlier Crays, allowing it to be air-cooled rather than Freon-cooled. The museum also has the related, low-end Cray EL-94, packaged in an ugly box (photo; 1992).

The Cray YMP-EL supercomputer.

The Cray J90 (1996; below) was a popular low-end Cray, an evolution of the Y-MP EL. This one holds 1 GB of memory and cost $300,000.

Cray J90 supercomputer.

The Cray SV1 (1999; below) followed the J90. It introduced more high-performance features such as a vector cache and multi-streaming. This one has 16 processors and 16 GB of memory, and cost about $1 million.

The Cray SV1 supercomputer.

Digital Equipment Corporation (DEC)

Dave McGuire, curator of the large systems, in front of PDP "Straight 8" minicomputers.

Digital Equipment Corporation was founded in 1957 and became the second-largest computer manufacturer, concentrating on minicomputers. DEC's PDP-8 was a very popular 12-bit minicomputer that essentially created the "minicomputer" category of computers. The first PDP-8 was the Straight-8 (1966; photos above and below), a compact all-transistor computer built from circuit cards plugged into a wire-wrapped backplane.

The "Straight 8" PDP-8 was built from transistorized circuits on small cards.

The PDP-8/E (1969; below) used integrated circuits (7400-series TTL) in place of discrete transistors, as did the compact and cheaper PDP-8/A (1974; photo).

PDP-8/E minicomputer. The paper tape reader is at the top, above the front panel. An RK05 DECpack is at the bottom, storing 2.4 megabytes on a removable disk pack.

DEC started producing mainframes in 1966 with the PDP-10, a 36-bit computer that popularized time-sharing. The museum has a DECsystem-2020 (1978), the smallest member of the PDP-10 family.

A DECsystem-2020 mainframe next to an RM02 disk drive. The drive's removable disk packs each store 67 megabytes.

In 1970, DEC introduced the 16-bit PDP-11, which became the most popular minicomputer with about 600,000 sold. The museum has many different PDP-11 models including the PDP-11/05 (1972; photo, console), the fast PDP-11/50 (1972; below, photo), the compact and popular PDP-11/34 (1976; photo), and the PDP-11/44 (1981; photo).

Console of the PDP-11/50 minicomputer.

DEC's PDP-11 evolved into the VAX line of 32-bit computers. Larger and more powerful than earlier minicomputers, these systems were known as superminicomputers. The VAX-11/780 (1978; below) was the first member of the VAX family, and was implemented with TTL chips. The museum also has a VAX-11/750 (1980), the cheap single-cabinet VAX-11/730 (1982; photo), the powerful VAX-6000 (1991; photo), and the top-of-the-line VAX-7000 (1992; photo). The VAXstation 4000 Model 90 (1991; photo) was a workstation implementing the VAX instruction set.

The VAX 11/780 "superminicomputer".

DEC started struggling in the 1990s as the market shifted to personal computers. DEC was acquired in 1998 by personal computer manufacturer Compaq, which in turn was soon acquired by Hewlett-Packard in 2002.

Other systems

The museum has systems from many other companies such as Varian, Control Data, Wang, Panasonic, Silicon Graphics, Singer, and Tektronix, but I'll just touch on some highlights.

Data General was a major producer of minicomputers, third behind DEC and IBM. The Data General Eclipse was the successor to the popular Data General Nova 16-bit minicomputer. It is represented in the museum by the Eclipse S/280 (1975; below) and Eclipse S/120 (1982; photo). Data General moved into the microcomputer market with the microNOVA (1977; photo), but it wasn't commercially successful.

Data General Eclipse S/280 minicomputer.

In the late 1970s, Hewlett-Packard was the fourth-largest producer of minicomputers. The HP 2116B minicomputer (1968; photo) was part of the HP 1000 (photo) family of 16-bit minicomputers designed for instrument control and automation. The HP 2645A terminal (below) was part of HP's line of terminals.

HP 2645A terminal

Another interesting terminal is the Friden Flexowriter from the early 1960s (below). It has a paper tape reader and punch on the left. Flexowriters were often used as console terminals for computers.

Friden Flexowriter

The Burroughs B80 is a multi-user office minicomputer (1978; below). It has a dot-matrix printer above the keyboard. The computer on display was used by a funeral home, and has a paper product list taped above the keyboard with products such as "Tranquility urn", "Open/Close grave", and "Move dirt more than 25 miles".

The Burroughs B80 office minicomputer.

The collection also includes analog computers, such as the Heathkit H-1 (1950s) which used vacuum tube amplifiers and represented values by signals from -100 to 100 volts. It could be programmed to solve differential equations by wiring the patch board. The museum also has a Comdyna GP-6 (photo), a more modern transistorized analog computer from the late 1960s.

A Heathkit H1 analog computer. Vacuum tubes are on top, the plugboard is in the middle, and potentiometer controls are in the front.

Microcomputers in the Large Scale Integration Museum

Upstairs is the "Large Scale Integration Museum", a large collection of microcomputers of the 1970s and 1980s. The collection focuses on microcomputers prior to the IBM PC and x86 processors. Since I'm more interested in the larger computers, I'll discuss this collection briefly, but I don't want to downplay its impressive scope.

Corey Little, curator of the microcomputer collection, in front of Imsai, ASR-33 teletype, Kenbak-1 replica, and Altair.

The first commercial microprocessor was Intel's 4-bit 4004, introduced in 1971. The Intel Intellec 4/40 development system (below), used the 4040 microprocessor (1974), an improved version of the 4004. This system was intended for engineers to develop software for embedded systems using the 4040 chip.

Intel Intellec 4/40 development system. An EPROM socket below the key allowed software to be burned into EPROM chips.

The microcomputer revolution took off when Intel released the 8-bit 8080 microprocessor in 1974, leading to the first commercially successful personal computer, the MITS Altair 8800 kit (1975). In addition to the Altair 8800, the museum has the updated Altair 8800b and the more obscure Altair 680, which uses the Motorola 6800 microprocessor.

Altair 8800 (with the famous manifesto Computer Lib on top), Altair 680, Altair 8800b, and disk drive for Altair.

Single-board computers also helped popularize microprocessors. Companies produced development kits for engineers to experiment with new microprocessors and hobbyists often used them due to their low cost. The museum has several racks of these development boards; the rack below includes the Intel SDK-85 System Design Kit for the 8085 microprocessor, Artisan Electronics Model 85 microcalculator (a single-board scientific calculator that could be interfaced to a microcomputer), Rockwell's 6502-based AIM-65, Synertek's 6502-based SYM-1, and Transputer parallel processor boards.

A variety of development boards and single-board computers.

By the late 1970s, microcomputers became mass-market products, with the introduction of home computers that were more affordable and usable by the general public. The museum has many other popular home computers from manufacturers such as Atari, Sinclair, Radio Shack, Heathkit, and Texas Instruments. The photo below shows part of the Commodore collection.

The Commodore collection includes calculators, Commodore Super PET, Educator 64, PET 4032, and PET 2001.

Early portable computers were suitcase-sized and often called luggables. The museum has a large collection including the IBM 5100 (1975; below), Osborne One (1981), Osborne Executive, Osborne Vixen, and Kaypro II, as well as more obscure machines such as the Telcon Zorba and General Electric Workmaster.

The IBM 5100 portable computer was introduced in 1975, six years before the IBM PC. Its keyboard has special characters for the APL language.

Apple is represented by a variety of Apple II, Apple III, Lisa, and Macintosh systems. The collection also includes a NeXTcube, the workstation developed by Steve Jobs in the 1980s after he was forced out of Apple. Steve Jobs returned to Apple when Apple purchased NeXT in 1997, leading to Apple's dramatic rise. The NeXTcube's operating system led to Apple's current macOS and iOS operating systems.

The NeXTcube workstation was packaged in a 1-foot magnesium cube.

The museum has various toys and educational devices that were produced to explain computers, including the CALCULO Analog Computer (1959), Minivac 6010 (1962) created by the father of information theory Claude Shannon, Radio Shack Science Fair Digital Computer Kit (1977), and Digi-Comp 1 (1963).

The collection includes toy computers such as the CALCULO Analog Computer, MINIVAC 6010, Radio Shack ScienceFair Digital Computer, and Digi-Comp 1.

Heathkit introduced the HERO-1 kit robot in 1982, providing a way for hobbyists to experiment with robotics. Nowadays, Arduinos and cheap servos and stepper motors make it easy to build a simple robot, but in 1982, robotics was much more difficult. The HERO-1 kit cost $1500 (equivalent to about $4000 today).

Three Heathkit HERO robots. The HERO 2000 (1986, left) included multiple processors and speech synthesis, while the older HERO-1 robots have a single 6808 processor. The "eyes" are an ultrasonic distance sensor.

Conclusion

The Large Scale Systems Museum contains a remarkable collection of large computer systems and microcomputers from the 1970s to 1990s. The museum, hidden behind a storefront on a quiet small-town main street, illustrates an interesting period in computer history. During this time, mainframes, minicomputers, and supercomputers reached their peak and then went into steep decline. Meanwhile, the microprocessor passed through the hobbyist phase and the home computer phase before achieving its dominance. Amazingly, most of the systems at the museum are up and running, giving the visitor a feel for the computers of that era.

The museum is open by appointment only; details are here and on their Facebook page. If you ever find yourself near New Kensington, PA (half an hour outside Pittsburgh), get in touch with them. I've only presented the highlights of the museum; more photos are here. I announce my latest blog posts on Twitter, so follow me @kenshirriff for future articles. I also have an RSS feed.