Showing posts with label 6502. Show all posts
Showing posts with label 6502. Show all posts

The Z-80 has a 4-bit ALU. Here's how it works.

The 8-bit Z-80 processor is famed for use in many early personal computers such the Osborne 1, TRS-80, and Sinclair ZX Spectrum, and it is still used in embedded systems and TI graphing calculators. I had always assumed that the ALU (arithmetic-logic unit) in the Z-80 was 8 bits wide, like just about every other 8-bit processor. But while reverse-engineering the Z-80, I was shocked to discover the ALU is only 4 bits wide! The founders of Zilog mentioned the 4-bit ALU in a very interesting discussion at the Computer History Museum, so it's not exactly a secret, but it's not well-known either.

I have been reverse-engineering the Z-80 processor using images from the Visual 6502 team. The image below shows the overall structure of the Z-80 chip and the location of the ALU. The remainder of this article dives into the details of the ALU: its architecture, how it works, and exactly how it is implemented.

I've created the following block diagram to give an overview of the structure of the Z-80's ALU. Unlike Z-80 block diagrams published elsewhere, this block diagram is based on the actual silicon. The ALU consists of 4 single-bit units that are stacked to form a 4-bit ALU. At the left of the diagram, the register bus provides the ALU's connection to the register file and the rest of the CPU.

The operation of the ALU starts by loading two 8-bit operands from registers into internal latches. The ALU does a computation on the low 4 bits of the operands and stores the result internally in latches. Next the ALU processes the high 4 bits of the operands. Finally, the ALU writes the 8 bits of result (the 4 low bits from the latch, and the 4 high bits just computed) back to the registers. Thus, by doing two computation cycles, the ALU is able to process a full 8 bits of data. ("Full 8 bits" may not sound like much if you're reading this on a 64-bit processor, but it was plenty at the time.)

As the block diagram shows, the ALU has two internal 4-bit buses connected to the 8-bit register bus: the low bus provides access to bits 0, 1, 2, and 3 of registers, while the high bus provides access to bits 4, 5, 6, and 7. The ALU uses latches to store the operands until it can use them. The op1 latches hold the first operand, and the op2 latches hold the second operand. Each operand has 4 bits of low latch and 4 bits of high latch, to store 8 bits.

Multiplexers select which data is used for the computation. The op1 latches are connected to a multiplexer that selects either the low or high four bits. The op2 latches are connected to a multiplexer that selects either the low or high four bits, as well as selecting either the value or the inverted value. The inverted value is used for subtraction, negation, and comparison.

The two operands go to the "alu core", which performs the desired operation: addition, logical AND, logical OR, or logical XOR. The ALU first performs one computation on the low bits, storing the 4-bit result into the result low latch. The ALU then performs a second computation on the high bits, writing the latched low result and the freshly-computed high bits back to the bus. The carry from the first computation is used in the second computation if needed.

The Z-80 provides extensive bit-addressed operations, allowing a single bit in a byte to be set, reset, or tested. In a bit-addressed operation, bits 5, 4, and 3 of the instruction select which of the 8 bits to use. On the far right of the ALU block diagram is the bit select circuit that support these operations. In this circuit, simple logic gates select one of eight bits based on the instruction. The 8-bit result is written to the ALU bus, where it is used for the bit-addressed operation. Thus, decoding this part of an instruction happens right at the ALU, rather than in the regular instruction decode logic.

The Z-80's shift circuitry is interesting. The 6502 and 8085 have an additional ALU operation for shift right, and perform shift left by adding the number to itself. The Z-80 in comparison performs a shift while loading a value into the ALU. While the Z-80 reads a value from the register bus, the shift circuit selects which lines from the register bus to use. The circuit loads the value unchanged, shifted left one bit, or shifted right one bit. Values shifted in to bit 0 and 7 are handled separately, since they depend on the specific instruction.

The block diagram also shows a path from the low bus to the high op2 latch, and from the high bus to the low op1 latch. These are for the 4-bit BCD shifts RRD and RLD, which rotate a 4-bit digit in the accumulator with two digits in memory.

Not shown in the block diagram are the simple circuits to compute parity, test for zero, and check if a 4-bit value is less than 10. These values are used to set the condition flags.

The silicon that implements the ALU

The image above zooms in on the ALU region of the Z-80 chip. The four horizontal "slices" are visible. The organization of each slice approximately matches the block diagram. The register bus is visible on the left, running vertically with the shifter inputs sticking out from the ALU like "fingers" to obtain the desired bits. The data bus is visible on the right, also running vertically. The horizontal ALU low and ALU high lines are visible at the top and bottom of each slice. The yellow arrows show the locations of some ALU components in one of the slices, but the individual circuits of the ALU are not distinguishable at this scale. In a separate article, I zoom in to some individual gates in the ALU and show how they work: Reverse-engineering the Z-80: the silicon for two interesting gates explained.

The ALU's core computation circuit

The silicon that implements one bit of ALU processing

The heart of each bit of the ALU is a circuit that computes the sum, AND, OR, or XOR for two one-bit operands. Zooming in shows the silicon that implements this circuit; at this scale the transistors and connections that make up the gates are visible. Power, ground, and the control lines are the vertical metal stripes. The shiny horizontal bands are polysilicon "wires" which form the connections in the circuit as well as the transistors. I know this looks like mysterious gray lines, but by examining it methodically, you can figure out the underlying circuit. (For details on how to figure out the logic from this silicon, see my article on the Z-80's gates.) The circuit is shown in the schematic below.

The Z-80 ALU circuit that computes one bit

This circuit takes two operands (op1 and op2), and a carry in. It performs an operation (selected by control lines R, S, and V) and generates an internal carry, a carry-out, and the result.

ALU computation logic in detail

The first step is the "carry computation", which is done by one big multi-level gate. It takes the two operand bits (op1 and op2) and the carry in, and computes the (complemented) internal carry that results from adding op1 plus op2 plus carry-in. There are just two ways this sum can cause a carry: if op1 and op2 are both 1 (bottom AND gate); or if there's a carry-in and at least one of the operands is a 1 (top gates). These two possibilities are combined in the NOR gate to yield the (complemented) internal carry. The internal carry is inverted by the NOR gate at the bottom to yield the carry out, which is the carry in for the next bit. There are a couple control lines that complicate carry generation slightly. If S is 1, the internal carry will be forced to 0. If R is 1, the carry out will be forced to 0 (and thus the carry in for the next bit).

The multi-level result computation gate is interesting as it computes the SUM, XOR, AND or OR. It takes some work to step through the different cases, but if anyone wants the details:

  • SUM: If R is 0, S is 0, and V is 0, then the circuit generates the 1's bit of op1 plus op2 plus carry-in, i.e. op1 xor op2 xor carry-in. To see this, the output is 1 if all three of op1, op2, and carry-in are set, or if at least one is set and there's no internal carry (i.e. exactly one is set).
  • XOR: If R is 1, S is 0, and V is 0, then the circuit generates op1 xor op2 To see this, note that this is like the previous case except carry-in is 0 because of R.
  • AND: If R is 0, S is 1, and V is 0, then the circuit generates op1 and op2. To see this, first note the internal carry is forced to 0, so the lower AND gate can never be active. The carry-in is forced to 1, so the result is generated by the upper AND gate.
  • OR: If R is 1, S is 1, and V is 1, then the circuit generates op1 or op2. The internal carry is forced to 0 by S and the the carry-out (carry-in) is forced to 0 by R. Thus, the top AND gate is disabled, and the 3-input OR gate controls the result.

Believe it or not, this is conceptually a lot simpler than the 8085's ALU, which I described in detail earlier. It's harder to understand, though, then the 6502's ALU, which uses simple gates to compute the AND, OR, SUM, and XOR in parallel, and then selects the desired result with pass transistors.

Conclusion

The Z-80's ALU is significantly different from the 6502 or 8085's ALU. The biggest difference is the 6502 and 8085 use 8-bit ALUs, while the Z-80 uses a 4-bit ALU. The Z-80 supports bit-addressed operations, which the 6502 and 8085 do not. The Z-80's BCD support is more advanced than the 8085's decimal adjust, since the Z-80 handles addition and subtraction, while the 8085 only handles addition. But the 6502 has more advanced BCD support with a decimal mode flag and fast, patented BCD logic.

If you've designed an ALU as part of a college class, it's interesting to compare an "academic" ALU with the highly-optimized ALU used in a real chip. It's interesting to see the short-cuts and tradeoffs that real chips use.

I've created a more detailed schematic of the Z-80 ALU that expands on the block diagram and the core schematic above and shows the gates and transistors that make up the ALU.

I hope this exploration into the Z-80 has convinced you that even with a 4-bit ALU, the Z-80 could still do 8-bit operations. You didn't get ripped off on your old TRS-80.

Credits: This couldn't have been done without the Visual 6502 team especially Chris Smith, Ed Spittles, Pavel Zima, Phil Mainwaring, and Julien Oster.

Intel x86 documentation has more pages than the 6502 has transistors

Microprocessors have become immensely more complex thanks to Moore's Law, but one thing that has been lost is the ability to fully understand them. The 6502 microprocessor was simple enough that its instruction set could almost be memorized. But now processors are so complex that understanding their architecture and instruction set even at a superficial level is a huge task. I've been reverse-engineering parts of the 6502, and with some work you can understand the role of each transistor in the 6502. After studying the x86 instruction set, I started wondering which was bigger: the number of transistors in the 6502 or the number of pages of documentation for the x86.

It turns out that Intel's Intel® 64 and IA-32 Architectures Software Developer Manuals (2011) have 4181 pages in total, while the 6502 has 3510 transistors. There are actually more pages of documentation for the x86 than the number of individual transistors in the 6502.

The above photo shows Intel's IA-32 software developer's manuals from 2004 on top of the 6502 chip's schematic. Since then the manuals have expanded to 7 volumes.

The 6502 has 3510 transistors, or 4528, or 6630, or maybe 9000?

As a slight tangent, it's actually hard to define the transistor count of a chip. The 6502 is usually reported as having 3510 transistors. This comes from the Visual 6502 team, which dissolved a 6502 chip in acid, photographed the die (below), traced every transistor in the image, and built a transistor-level simulator that runs 6502 code (which you really should try). Their number is 3510 transistors.

The 6502 processor chip

One complication is the 6502 is built with NMOS logic which builds gates out of active "enhancement" transistors as well as pull-up "depletion" transistors which basically act as resistors. The count of 3510 is just the enhancement transistors. If you include the 2102 1018 depletion transistors, the total transistor count is 5612 4528.

A second complication is that when manufacturers report the transistor count of chips, they often report "potential" transistors. Chips that include a ROM or PLA will have different numbers of transistors depending on the values stored in the ROM. Since marketing doesn't want to publish different transistor numbers depending on the number of 1 bits and 0 bits programmed into the chip, they often count ROM or PLA sites: places that could have transistors, but might not. By my count, the 6502 decode PLA has 21×131=2751 PLA sites, of which 649 actually have transistors. Adding these 2102 "potential" transistors yields a count of 6630 transistors.

Finally, some sources such as Microsoft Encarta and A History of the Personal Computer state the 6502 contains 9000 transistors, but I don't know how they could have come up with that value.

(The number of pages of Intel documentation is also not constant; the latest 2013 Software Developer Manuals have shrunk to 3251 pages.)

Thus, the x86 has more pages of documentation than the 6502 has transistors, but it depends how you count.

Reverse-engineering the Z-80: the silicon for two interesting gates explained

I've been reverse-engineering the Z-80 processor, using images from the Visual 6502 team. One interesting thing about the Z-80's silicon is it uses complex gates with multiple inputs and multiple levels of logic. It also implements an XOR gate with an unusual pass-transistor circuit. I thought it would be interesting to examine these gates at the silicon level and show how they work.

The image above shows the overall organization of the Z-80 chip. I'm going to zoom way in on the ALU and look at the silicon that implements one of the complex gates there: a 5-input, three-level gate. I'll walk through this gate and show how it works at the silicon level. While the silicon look like a jumble of lines, its operation is actually straightforward if you step through it.

Let's begin with an (oversimplified) description of how the chip is constructed. The chip starts with the silicon wafer. Regions are diffused with an element such as boron, yielding conductive diffusion regions. A layer of polysilicon strips is put on top. Finally, a layer of metal "wires" above the polysilicon provides more connections. For our purposes, diffusion regions, polysilicon, and metal can all be consider conductors.

In the image below, the bright vertical bands are metal wires. The slightly darker horizontal bands are polysilicon; the borders are more visible than the regions themselves. In this part of the Z-80, the polysilicon connections run mostly horizontally, and the metal wires run vertically. The large irregular regions outlined in black are doped silicon diffusion regions. The circles are vias between different layers.

Transistors are formed where a polysilicon line crosses a diffusion region. You might expect transistors to be very visible in the image, but a polysilicon line looks the same whether its a conductor or a transistor. So transistors just appear as long skinny regions in the image. The diagram below shows the physical structure of a transistor: the source and drain are connected if the gate is positive.

Structure of an NMOS transistor

Let's dive in and see how this circuit works. There's a lot going on, but the image below has been colored to make it clearer. Only three of the vertical metal lines are relevant. On the left, the yellow metal line ties together parts of the gate. In the middle is the blue ground line, which is critical to the operation of the gate. At the right, the red positive voltage line is used to pull the output high through a resistor. The large diffusion region has been tinted cyan. This region can be thought of as big conductive areas interrupted by transistors. There are 5 pinkish polysilicon input wires, labeled A, B, C, D, E. When they cross the diffusion region they still act as wires, but also form a transistor below in the diffusion region. For instance, input A is connected to two transistors.

With all the pieces labeled, we can figure out the operation of the circuit. If input A is high, the first transistor will conduct and connect the yellow strip to ground (dotted line 1). Likewise, if input B is high, the second transistor will conduct and ground the yellow strip (dotted line 2). C will ground the yellow strip via 3. So the yellow strip will be grounded for A or B or C. This forms a three-input OR gate.

If input D is high, transistor 4 will connect the yellow strip to the output. Likewise, if input E is high, transistor 5 will connect the yellow strip to the output. Thus, the output will be grounded if (A or B or C) and (D or E).

In the upper right, arrow 6/7/8 will ground the output if A and B and C are high and the three associated transistors (6, 7, 8) conduct. This computes A and B and C.

Putting this all together, the output will be grounded if [(A or B or C) and (D or E)] or [A and B and C]. If the output is not grounded, the resistor (actually a depletion transistor) will pull the output high. Thus, the final output is not [(A or B or C) and (D or E)] or [A and B and C].

The diagram below shows the gate logic implemented by this circuit. This rather complex gate is created from just nine transistors. Note that the final AND and NOR gates are "for free" - they are formed by wiring together previous outputs and don't require additional transistors. Another point of interest is that with NMOS, the output will be high unless something pulls it low, which explains why circuits are based on NAND and NOR gates rather than AND and OR gates.

If you want to see more low-level silicon analysis, see my article on the overflow circuit in the 6502 at the silicon level.

What does this gate do?

This gate is a key part of one bit of the Z-80's ALU. The gate generates the (inverted) sum, AND, OR, or XOR of B and C depending on the inputs. Specifically, B and C are the two operand inputs, and A is the carry in. D is a control input and E is an inverted intermediate carry from B plus C plus carry_in. By controlling D and overriding A and E, the operation is selected.

The Z-80's interesting XOR gate

The Z-80 uses an unusual circuit for its XOR gate. XOR is an inconvenient function to implement since it has a worst-case Karnaugh map, making it expensive to implement from simple gates. Instead, the Z-80 uses a combination of inverters and pass transistors, different from regular NMOS logic.

As before, the diagram below shows the power and ground metal lines, a connecting metal line in yellow, the polysilicon in pink, the polysilicon transistor gates in green, and diffusion in cyan. The two inputs are A and B.

Starting with input A: if it is high, transistor 1 will connect A' to ground. Otherwise the pullup resistor (way on the left), will pull A' high. (Note that A' is the whole diffusion region between transistor 1 and transistor 3 up to the resistor.) Thus transistor 1 forms a simple inverter with inverted output A'. Likewise, transistor 2 inverts input B to give inverted B' (in the whole diffusion region between transistors 2 and 4).

Now comes the tricky part. If A' is high, pass transistor 4 will connect B' to the yellow metal. If B' is high, pass transistor 3 will connect A' to the yellow metal. The third pullup resistor will pull the yellow metal high unless something ties it to ground . Working through the combinations, if A' and B' are both high, both A' and B' are connected to the yellow metal, which gets pulled high. If A' is high and B' is low, B' is connected to the yellow metal, pulling it low. Likewise, if A' is low and B' is high, A' pulls the yellow metal low. Finally if A' and B' are low, nothing gets connected to the yellow metal, so the resistor pulls it high.

To summarize, the yellow metal is pulled high if A' and B' are both high or both low. That is, it is the exclusive-nor of A' and B', which is also the exclusive-or of A and B.

Finally, the xnor value controls transistors 5a and 5b which form an inverter. If xnor is high, transistors 5a and 5b conduct and the xor output is connected to ground, and if xnor is low, the pullup resistors pull the xor output high. One unusual feature here is the parallel transistors 5a and 5b with separate pullup resistors. I haven't seen this in the 8085 or 6502; they use a single larger transistor instead of parallel transistors.

The schematic below summarizes the circuit. In case you're wondering, this XOR gate is used to compute the parity flag. All the bits are XORed together to generate the parity flag.

Comparison to other processors

From what I've seen so far, the Z-80 uses considerably more complex gates than the 8085 and the 6502. The 6502 uses mostly simple NAND/NOR gates and only a few two-level gates, not as complex as on the Z-80. The 8085 uses more complex gates, but still less than the Z-80. I don't know if the difference is due to technical limits on the number of gate levels, or the preferences of the designers.

The XOR circuit in the Z-80 is different from the 8085 and 6502. I'm not sure it saves any transistors, but it is unusual. I've seen other pass-transistor implementations of XOR, but none like the Z-80.

Credits: The Visual 6502 team especially Chris Smith, Ed Spittles, Pavel Zima, Phil Mainwaring, and Julien Oster.

Reverse-engineering the 8085's decimal adjust circuitry

In this post I reverse-engineer and describe the simple decimal adjust circuit in the 8085 microprocessor. Binary-coded decimal arithmetic was an important performance feature on early microprocessors. The idea behind BCD is to store two 4-bit decimal numbers in a byte. For instance, the number 42 is represented in BCD as 0100 0010 (0x42) instead of binary 00101010 (0x2a). This continues my reverse engineering series on the 8085's ALU, flag logic, undocumented flags, register file, and instruction set.

The motivation behind BCD is to make working with decimal numbers easier. Programs usually need to input and output numbers in decimal, so if the number is stored in binary it must be converted to decimal for output. Since early microprocessors didn't have division instructions, converting a number from binary to decimal is moderately complex and slow. On the other hand, if a number is stored in BCD, outputting decimal digits is trivial. (Nowadays, the DAA operation is hardly ever used).

Photograph of the 8085 chip showing the location of the ALU, flags, and registers.

One problem with BCD is the 8085's ALU operates on binary numbers, not BCD. To support BCD operations, the 8085 provides a DAA (decimal adjust accumulator) operation that adjusts the result of an addition to correct any overflowing BCD values. For instance, adding 5 + 6 = binary 0000 1011 (hex 0x0b). The value needs to be corrected by adding 6 to yield hex 0x11. Adding 9 + 9 = binary 0001 0010 (hex 0x12) which is a valid BCD number, but the wrong one. Again, adding 6 fixes the value. In general, if the result is ≥ 10 or has a carry, it needs to be decimal adjusted by adding 6. Likewise, the upper 4 BCD bits get corrected by adding 0x60 as necessary. The DAA operation performs this adjustment by adding the appropriate value. (Note that the correction value 6 is the difference between a binary carry at 16 and a decimal carry at 10.)

The DAA operation in the 8085 is implemented by several components: a signal if the lower bits of the accumulator are ≥ 10, a signal if the upper bits are ≥ 10 (including any half carry from the lower bits), and circuits to load the ACT register with the proper correction constant 0x00, 0x06, 0x60, or 0x66. The DAA operation then simply uses the ALU to add the proper correction constant.

The block diagram below shows the relevant parts of the 8085: the ALU, the ACT (accumulator temp) register, the connection to the data bus (dbus), and the various control lines.

The accumulator and ACT (Accumulator Temporary) registers and their control lines in the 8085 microprocessor.

The circuit below implements this logic. If the low-order 4 bits of the ALU are 10 or more, alu_lo_ge_10 is set. The logic to compute this is fairly simple: the 8's place must be set, and either the 4's or 2's. If DAA is active, the low-order bits must be adjusted by 6 if either the low-order bits are ≥ 10 or there was a half-carry (A flag).

Similarly, alu_hi_ge_10 is set if the high-order 4 bits are 10 or more. However, a base-10 overflow from the low order bits will add 1 to the high-order value so a value of 9 will also set alu_hi_ge_10 if there's an overflow from the low-order bits. A decimal adjust is performed by loading 6 into the high-order bits of the ACT register and adding it. A carry out also triggers this decimal adjust.

Schematic of the decimal adjust circuitry in the 8085 microprocessor.

Schematic of the decimal adjust circuitry in the 8085 microprocessor.

The circuits to load the correction value into ACT are controlled by the load_act_x6 signal for the low digit and load_act_6x for the high digit. These circuits are shown in my earlier article Reverse-engineering the 8085's ALU and its hidden registers.

Comparison to the 6502

By reverse-engineering the 8085, we see how the simple decimal adjust circuit in the 8085 works. In comparison, the 6502 handles BCD in a much more efficient but complex way. The 6502 has a decimal mode flag that causes addition and subtraction to automatically do decimal correction, rather than using a separate instruction. This patented technique avoids the performance penalty of using a separate DAA instruction. To correct the result of a subtraction, the 6502 needs to subtract 6 (or equivalently add 10). The 6502 uses a fast adder circuit that does the necessary correction factor addition or subtraction without using the ALU. Finally, the 6502 determines if correction is needed before the original addition/subtraction completes, rather than examining the result of the addition/subtraction, providing an additional speedup.

This information is based on the 8085 reverse-engineering done by the visual 6502 team. This team dissolves chips in acid to remove the packaging and then takes many close-up photographs of the die inside. Pavel Zima converted these photographs into mask layer images, generated a transistor net from the layers, and wrote a transistor-level 8085 simulator.

The 8085's register file reverse engineered

On the surface, a microprocessor's registers seem like simple storage, but not in the 8085 microprocessor. Reverse-engineering the 8085 reveals many interesting tricks that make the registers fast and compact. The picture below shows that the registers and associated control circuitry occupy a large fraction of the chip, so efficiency is important. Each bit is implemented with a surprisingly compact circuit. The instruction set is designed to make register accesses efficient. An indirection trick allows quick register exchanges. Many register operations use the unexpected but efficient data path of going through the ALU.

While the 8085's register complement is tiny compared to current processors, it has a solid register set by 1977 standards - about twice as many registers as the 6502. The 8085 has a 16-bit program counter, a 16-bit stack pointer, 16-bit BC, DE, and HL register pairs, and the 8-bit accumulator. The 8085 also has little-known hidden registers that are invisible to the programmer but used internally: the WZ register pair, and two 8-bit registers for the ALU: ACT and TMP.

Photograph of the 8085 chip showing components relevant to register operations.

Photograph of the 8085 chip showing components relevant to register operations.

The register file is in the lower left quadrant of the chip. It contains the 6 register pairs and associated circuitry. Underneath the registers is the 16-bit address latch and increment/decrement circuit. The register file is controlled by a set of control lines on the right, which are driven by register control logic circuits and the register control PLA. The current instruction is loaded into the instruction register (upper right) via the data bus. In the upper left is the 8-bit arithmetic-logic unit (ALU), with the accumulator and two temporary registers (ACT and TMP).

The 8085 has only 40 pins (visible around the edge of the image) to communicate with the outside world, a tiny number compared to current microprocessors with more than 1000 pins. For memory accesses, the 8085 reads or writes 8 bits of data using a 16-bit memory address (for a maximum of 64K of memory). In the image above, memory addresses flow through the 16-bit address bus (abus) provides memory addresses, while data flows through the chip over the 8-bit data bus (dbus). The 8 A pins handle half of the address, while the 8 AD pins are used both for the other half of the address and for data (at different times). This frees up pins for other uses, but makes computers using the 8085 slightly more complicated. In comparison, the 6502 is more straightforward, with separate pins for address and data.

Overall architecture of the register file

The diagram below shows the implementation of the 8085 register file in the same layout as on the actual chip. The 8-bit data bus is at the top, and the 16-bit address bus is at the bottom. The register control lines are on the right.

In the middle are the registers, arranged as pairs of 8-bit registers. Note that the registers are arranged "backwards" with the high-order bit on the right and the low-order bit on the left. The 16-bit program counter and stack pointer are first. Next is the WZ temporary register, and underneath it the BC register pair. The HL and DE register pairs are at the bottom - these registers do not have fixed locations, but can swap roles during execution. A 16-bit register bus (regbus) provides access to the registers.

Underneath the registers is the address latch, which holds a 16-bit value that is written to the address bus. This value is also the input to the 16-bit increment/decrement circuit. The output of the incrementer/decrementer can be written back to the registers.

The triangles indicate tri-state buffers, basically switches that control the flow of data. Buffers containing a + are amplifiers to boost the weak signals from the registers. Buffers containing a S are superbuffers, that provide extra current to send data across the long data bus.

Architecture diagram of the 8085 register file, as it is implemented on the chip. The register file is connected to the data bus at top, and address bus at bottom. The control lines are along the right.

Architecture diagram of the 8085 register file, as it is implemented on the chip. The register file is connected to the data bus at top, and address bus at bottom. The control lines are along the right.

The picture below zooms in on the chip image above, showing the register file in detail. The components in silicon exactly map onto the diagram above. Note the repeated patterns for the 16-bit circuits. The large transistors used as high-current drivers are clearly visible. The transistors in each bit of register storage are much smaller.

A closeup of the 8085 microprocessor, showing the details of the register file and the locations of the major components.

A closeup of the 8085 microprocessor, showing the details of the register file and the locations of the major components.

Storing bits in the register file

The implementation of the 8085 registers is unusual in several ways. The registers don't have explicit read and write modes; instead the register will be overwritten if there is a stronger signal on the bus. Instead of having a bus with one wire for each bit, the 8085 uses a sort of differential bus, with two wires for each bit: one wire transmits the value, and the other transmits the complement of the value.

Each bit consists of two inverters in a feedback loop, with pass transistors to connect the inverters to the bus. An unusual feature of this is the lack of any circuit to break the feedback loop when modifying the register (unlike the 6502). Instead, the 8085 uses a "might makes right" technique - if a stronger signal is written to the bus, it will overwrite a register connected to the bus. The transistors driving the register bus are about twice as large as the transistors in the inverters, so they can forcibly overwrite the inverter loop.

One consequence of this register implementation is that a register can't be copied directly to another register, since there's nothing to distinguish the source register from the destination register - each register could potentially damage the other's bits. To get around this, the 8085 uses an interesting trick - copies are actually done through the ALU, as will be explained later.

One bit of a register in the 8085 register file. Each bit is stored in two inverters in a feedback loop. The register bus uses two lines of opposite polarity for each bit. Access to the register is controlled by the reg_rw control line, which connects the inverters to the bus, allowing the value to be read or written.

One bit of a register in the 8085 register file. Each bit is stored in two inverters in a feedback loop. The register bus uses two lines of opposite polarity for each bit. Access to the register is controlled by the reg_rw control line, which connects the inverters to the bus, allowing the value to be read or written.
The image below zooms in on the chip closer, showing the silicon for six individual register bits. The schematic for one bit is overlaid, as are some of the metal lines providing power, ground, and the register bus. Each bit consists of two transistors for the inverters, two depletion pullup transistors for the inverters (shown as resistors), and two pass transistors connecting the bit to the register bus. The pink regions are transistors, with the green strips the gates (details).

Detail of the 8085 chip showing six bits in the 8085's register file. Bit 2 of the stack pointer is shown with schematic. The two transistors form two inverters in a feedback loop. The light blue lines are the metal layer wires connected to bit 2. The program counter is in the upper half of the image.

Detail of the 8085 chip showing six bits in the 8085's register file. Bit 2 of the stack pointer is shown with schematic. The two transistors form two inverters in a feedback loop. The light blue lines are the metal layer wires connected to bit 2. The program counter is in the upper half of the image.

To read a register, an amplifier circuit is used to boost the signal from the differential register bus to write it to the dbus or address latch. I assume this is a tradeoff to make the register file smaller. Each inverter pair can be made as small as possible, but then requires amplification to produce a signal strong enough for use elsewhere in the chip. The amplification circuit that drives the data bus is more complex than I'd expect, probably because of the extra power to drive the bus (details and schematic).

The incrementer/decrementer

The 16-bit incrementer/decrementer at the bottom of the register file is used for multiple purposes. It increments the program counter as instructions execute, increments and decrements the stack pointer as needed, and supports the 16-bit increment and decrement instructions.

An interesting feature of the incrementer is it also supports incrementing by 2, which is used to quickly skip over the two byte address in a call or jump not taken. This allows these operations to complete faster on the 8085 than the 8080.

Two bits of the 16-bit increment/decrement circuit in the 8085. Odd bits and even bits use a different circuit for efficiency. The carry out from even bits is complemented.

Two bits of the 16-bit increment/decrement circuit in the 8085. Odd bits and even bits use a different circuit for efficiency. The carry out from even bits is complemented.

The incrementer/decrementer is implemented by a chain of adders with ripple carry - the carry from each bit flows into the adder for the next bit. (The above schematic shows two bits, and is repeated 8 times in the full circuit.) The DREG_INC and DREG_DEC control lines select increment or decrement. One performance trick is that alternating bits are implemented with different circuits and the carry out of even bits is inverted. This avoids the inverters that would otherwise be needed to flip the carry back to its regular state. This saves space, but even more importantly it speeds up carry propagation. Because the carry has to propagate bit-by-bit through all 16 bits to generate the final result, adding an inverter to each bit would slow it down significantly. The carry out is used to compute the undocumented K flag value (details).

In comparison, the 6502 has a 16-bit incrementer (no decrement) used exclusively by the program counter. To reduce the carry propagation speed, this incrementer uses a carry-skip. That is, the carry out of the low-order byte is immediately generated and fed into the high-order byte. Thus the carries only need to propagate through 8-bits, the two bytes working in parallel. (The carry is easily generated by ANDing together the low-order bits. If they are all 1, there will be a carry into the high-order byte.)

The WZ Temporary registers

The WZ register pair in the 8085 is used for temporary storage, but is invisible to the programmer. Internally, the WZ register pair is implemented like the other register pairs.

The primary use of WZ is to hold operands from a two or three byte instruction until it can be used. The WZ registers are used to hold 16-bit addresses for LDA, STA, LHLD, JMP, CALL, and RST instructions. The registers hold the port for IN and OUT. The WZ register pair can also temporarily hold information read from memory. The registers hold the address popped off the stack for RET. For XTHL, the registers hold the value from the stack.

Register decoding and the instruction set

The instruction set of the 8085 is organized so an instruction can be quickly and easily decoded to determine the instruction to use. The underlying structure for most 8085 instructions is the octal bit pattern bbDDDSSS, where destination bits DDD and/or source bits SSS select the register usage. The move (MOV) instructions follow this structure. Other instructions (e.g. INR) use just the DDD bits to select the register, while math instructions use the three SSS bits. Some instructions only use DDD or SSS, and some instructions operate on register pairs so they don't use the lowest bit. This instruction pattern is visible if the instructions are arranged in an instruction table according to their octal values.

The three bits select the register as follows:

D2D1D0Register
000B
001C
010D
011E
100H
101L
110M
111A

M indicates a memory operation and is treated as a pseudo-register in the instruction set. Some instructions (e.g. INX) use the top two bits to select a register pair: BC, DE, HL, or "special" (stack pointer or accumulator). Note that in the table above the low-order bit selects a register out of a register pair.

This instruction set structure allows simple logic to control the registers. A multiplexer pulls out the right group of three bits, depending on the instruction and the cycle in the instruction (link to schematic). These three bits are then used to pick the specific register control lines to activate at each step.

The registers are controlled by about 18 control lines that affect the movement of data and the operation of the incremented/decrementer. The following table summarizes the control lines.

/RREG_RDReads the right-hand side register bus onto the data bus.
This implements the multiplexing of 16-bit registers onto the 8-bit data bus.
/LREG_RDReads the left-hand side register bus onto the data bus.
LREG_WRWrites the data bus to the left-hand side register bus.
This implements the demultiplexing of the 8-bit data bus to the 16-bit registers.
RREG_WRWrites the data bus to the right-hand side register bus.
REG_PC_RWConnects the PC to the register bus.
REG_SP_RWConnects the SP to the register bus.
REG_WZ_RWConnects the WZ register pair to the register bus.
REG_BC_RWConnects the BC register pair to the register bus.
REG_HL_RWConnects the HL (DE) register pair to the register bus.
REG_DE_RWConnects the DE (HL) register pair to the register bus.
DREG_WRWrites the output of the incrementer/decrementer to the register bus.
DREG_RDReads the register bus into the address latch.
/DREG_RDInverted DREG_RD.
DREG_DECIncrementer/decrementer performs decrement.
DREG_INCIncrementer/decrementer performs increment.
CARRY_OUTThe carry/borrow out from the incrementer/decrementer.
DREG_CNTIncrement/decrement by 1.
DREG_CNT2Increment/decrement by 2.

The first step in register control is the register control PLA, which generates 19 control signals based on the instruction type and the cycle step. The register control logic (between the register file and the PLA) mixes in the register selection bits as appropriate (and a few other inputs) to generate the register control lines listed above.

For instance, REG_BC_RW control line is activated if the PLA indicates a register access and the register bits are 00x. The RREG_RD control line is activated for a single-register read instruction if the register bits are xx0, and LREG_RD is activated if the bits are xx1. Both control lines are activated at the same time if the PLA indicates a register pair read.

The DE/HL exchange trick

The XCHG instruction exchanges the contents of the HL register pair with the contents of the DE register pair in a single M-cycle. You might wonder how the registers can be exchanged so quickly. It turns out that this instruction is implemented with a trick - an extra level of indirection.

Although most 8085 architecture diagrams label one register pair as DE and another as HL, this isn't exactly true. In fact, the 8085 has two register pairs and either one can be the DE or HL pair. A status flip flop keeps track of which pair is DE and which is HL. As Pavel Zima figured out, the XCHG instruction doesn't move any data; it simply toggles the flip flop. The data remains in the same place, but the DE register is now HL and vice versa. Thus, the XCHG instruction is completed quickly. The consequence is every use of DE or HL uses this flip flop to determine which register to access (link to schematic).

Using the ALU to move registers

You wouldn't expect the ALU (arithmetic-logic unit) to take part in a register-to-register move, but it happens in the 8085. Many register operations take advantage of the ALU's temporary registers.

The ALU doesn't directly operate on the accumulator and input register. Instead, the accumulator is copied to the ACT (Accumulator Temporary) register and the other input is copied to the TMP register. This way, the result can be written to the accumulator without the race condition that would occur if the accumulator were an input and output at the same time.

For register moves, the source value is copied to the TMP register, the ACT register is set to 0, and the ALU performs an OR operation (ALU details), writing the result (i.e. the source value) to the dbus. This result can then be stored to the register file during a later cycle.

The register file in action

The step-by-step operation of the register file is surprisingly complex. One complication is that the register file and buses must handle stepping the program counter, fetching the instruction, and performing any register moves, without interference. A second complication is that register moves go through the ALU as described above.

Stepping through an operation in detail will show the complexity of the register operations. The following shows the data flow for a MOV B,E instruction, which copies the contents of the E register into the B register.

To understand this table, a bit of background on 8085 instruction timing. An instruction cycle is broken down into one or more M (machine) cycles, where an 8-bit memory access can be done in one M cycle. Each M cycle is broken down into several T-states, where each T-state corresponds to one clock cycle. Each clock cycle has a low phase and a high phase.

The single-byte register-to-register MOV instruction takes one M cycle (M1), 4 T cycles, or 8 clock phases. Each clock phase is a separate line in the table. To make things more complicated, the activity for an instruction isn't entirely within its own instruction cycle. To improve performance, the 8085 uses simple pipelining, where the M1 opcode fetch of the next instruction overlaps with completion of the previous instruction.

The MOV B, E instruction (which copies the E register to the B register) is illustrated in the table below. The PC is copied to the incrementer latch at the end of the previous operation, and then is written to the address pins during the T1 cycle. The PC is updated with the incremented value at the end of the T2 cycle.

The instruction opcode is fetched in the T3 cycle, and at this point execution can start on the instruction. It's not until the T1 cycle of the next instruction that the register file swings into action. The E register is written to the dbus at the end of the T1 cycle. Then the ALU's TMP register is loaded from the dbus. The ALU's other argument, the ACT register is 0 at this point, and the ALU is configured to perform an OR operation. At the end of the (next instruction's) T3 cycle, the result of the ALU operation (i.e. the E register) is stored in the B register via the dbus. Meanwhile, the next instruction is getting fetched (grayed out).

CycleT/clockPC actionRegister action
T4/0
T4/1PC → inc latch
M1
opcode fetch
T1/0inc latch → address pins
T1/1inc latch → address pins
T2/0
T2/1inc → PC
T3/0data pins → dbus → instruction reg
T3/1
T4/0
T4/1PC → inc latch
M1
opcode fetch
T1/0inc latch → address pins
T1/1inc latch → address pinsE reg → dbus
T2/0dbus → TMP reg
T2/1inc → PC
T3/0data pins → dbus → instruction reg
T3/1ALU → dbus → B reg

Each step in the table above is activated by the appropriate register control lines. For instance, in T2/1, the PC is updated by triggering the reg_pc_rw and dreg_wr lines.

Conclusion

The 8085 has a complex register set, and it uses some interesting tricks to reduce the size of the chip and to optimize some operations. The register set is much harder to understand than I expected, but with careful examination it reveals its secrets.

Credits: The chip images are from visual6502.org. The visual6502 team did the hard work of dissolving chips in acid to remove the packaging and then taking many close-up photographs of the die inside. Pavel Zima converted these photographs into mask layer images, a transistor net, an 8085 simulator, and register file schematics (top, bottom).

See discussion at Hacker News. Thanks for visiting!

Silicon reverse engineering: The 8085's undocumented flags

The 8085 microprocessor has two undocumented status flags: V and K. These flags can be reverse-engineered by looking at the silicon of the chip, and their function turns out to be different from previous explanations. In addition, the implementation of these flags shows that they were deliberately implemented, which raises the question of why there were not documented or supported by Intel. Finally, examining how these flag circuits were implemented in silicon provides an interesting look at how microprocessors are physically implemented.

Like most microprocessors, the 8085 has a flag register that holds status information on the results of an operation. The flag register is 8 bits: bit 0 holds the carry flag, bit 2 holds the parity, bit 3 is always 0, bit 4 holds the half-carry, bit 6 holds the zero status, and bit 7 holds the sign. But what about the missing bits: 1 and 5?

Back in 1979, users of the 8085 determined that these flag bits had real functions.[1] Bit 1 is a signed-number overflow flag, called V, indicating that the result of a signed add or subtract won't fit in a byte.[2] Bit 5 of the flag is poorly understood and has been given the names K, X5, or UI. For an increment/decrement operation it simply indicates 16-bit overflow or underflow. But it has a totally diffrent value for arithmetic operations. The flag has been described[1][3] as:

K =  O1·O2 + O1·R + O2·R, where:
O1 = sign of operand 1
O2 = sign of operand 2
R = sign of result
For subtraction and comparisons, replace O2 with complement of O2.
As I will show, that published description is mistaken. The K flag actually is the V flag exclusive-ored with the sign of the result. And the purpose of the K flag is to compare signed numbers.

The circuit for the K and V flags

The following schematic shows the reverse-engineered circuit for the K and V flags in the 8085. The V flag is simply the exclusive-or of the carry into the top bit and the carry out of the top bit. This is a standard formula for computing overflow[2] for signed addition and subtraction. (The 6502 computes the same overflow value through different logic.) The V flag has values for other arithmetic operations, but the values aren't useful.[4] A latch stores the value of the V flag. The computed V value is stored in the latch under the control of a store_v_flag control signal. Alternatively, the flag value can be read off the bus and stored in the latch under the control of the bus_to_flags control signal; this is how the POP PSW instruction, which pops the flags from the stack, is implemented. Finally, a tri-state superbuffer (the large triangle) writes the flag value to the bus when needed.

The K flag circuitry is on the right. The first function of the K flag is overflow/underflow for an INX/DEX instruction. This is implemented simply: the carry_to_k_flag control line sets the K flag according to the carry from the incrementer/decrementer. The next function of K flag is reading from the databus for the POP PSW instruction, which is the same as for the V flag. The final function of the K flag is the result of a signed comparison. The K flag is the exclusive-or of the V flag and the sign bit of the result. For subtraction and comparison, the K flag is 1 if the second value is larger than the first.[5] The K flag is set for other arithmetic operations, but doesn't have a useful value except for signed comparison and subtraction.[4]

The circuit in the 8085 for the undocumented V and K flags. The flags are generated from the carries and results from the ALU. The K flag can also be set by the carry from the incrementer/decrementer.

The circuit in the 8085 for the undocumented V and K flags. The flags are generated from the carries and results from the ALU. The K flag can also be set by the carry from the incrementer/decrementer.
One mystery was the purpose of the K flag: "It does not resemble any normal flag bit."[1] Its use for increment and decrement is clear, but for arithmetic operations why would you want the exclusive-or of the overflow and sign? It turns out the the K flag is useful for signed comparisons. If you're comparing two signed values, the first is smaller if the exclusive-or of the sign and overflow is 1.[6] This is exactly what the K flag computes.

From the circuit above, it is clear that the V and K flags were deliberately added to the chip. (This is in contrast to the 6502, where undocumented opcodes have arbitrary results due to how the circuitry just happens to work for unexpected inputs.[7]) Why would Intel add the above circuitry to the chip and then not document or support it? My theory is that Intel decided they didn't want to support K or (8-bit) V flags in the 8086, so in order to make the 8086 source-compatible with the 8085, they dropped those flags from the 8085 documentation, but the circuitry remained in the chip.

The silicon

The 8085 microprocessor showing the data bus, ALU, flag logic, registers, and incrementer/decrementer.
The 8085 microprocessor showing the data bus, ALU, flag logic, registers, and incrementer/decrementer.
The remainder of this article will show how the V and K flag circuits work, diving all the way down to the silicon circuits. The above image of the 8085 chip shows the layout of the chip and the components that are important to the discussion. In the upper left of the chip is the ALU (arithmetic-logic unit), where computations happen (details). The data bus is the main interconnect in the chip, connects the data pins (upper left), the ALU, the data registers, the flag register, and the instruction decoding (upper right). In the lower left of the chip is the 16-bit register file. Underneath the register file is a 16-bit increment/decrement circuit which handles incrementing the program counter, as well as supporting 16-bint increment and decrement instructions. The increment/decrement circuit has a carry-out in the lower right corner - this will be important for the discussion of the K flag. For some reason, the ALU has the low-order bit on the right, while the registers have the low-order bit on the left.

The flag logic circuitry sits underneath the ALU, with high-current drivers right on top of the data bus. The flags are arranged in apparently-random order with bit 7 (sign) on the left and bit 6 (zero) on the right. Because the carry logic is much more complicated (handling not only arithmetic operations but shifts and rotates, carry complement, and decimal adjust), the carry logic is stuck off to the right of the ALU where there was enough room.

Zooming in

Next we will zoom in on the V flag circuitry, labeled V1 above. Looking at the die under a microscope shows the metal layer of the chip, consisting of mostly-horizontal metal interconnects, which are the white lines below. The bottom part of the chip has the 8-bit data bus. Other wires are the VCC power supply, ground, and a variety of signals. While modern processors can have ten or more metal layers, the 8085 only has a single layer. Some of the circuitry underneath the metal is visible.

The metal layer of the 8085 microprocessor, zoomed in on the V flag circuit.

The metal layer of the 8085 microprocessor, zoomed in on the V flag circuit.
If the metal is removed from the chip, the silicon layer becomes visible. The blotchy green/purple is plain silicon. The pink regions are N-type doped silicon. The grayish regions are polysilicon, which can be considered as simply conductive wires. When polysilicon crosses doped silicon, it forms a transistor, which appears light green in this image. Note that transistors form a fairly small portion of the chip; there is a lot more connection and wiring than actual transistors. The small squares are vias, connections to the metal layer.

The V flag circuit in the 8085 CPU. This is the silicon/polysilicon after the metal layer has been removed. The data bus is not visible as it is in the metal layer, but it is in the lower third of the image. The rectangles at the bottom connect the data bus to the registers.

The V flag circuit in the 8085 CPU. This is the silicon/polysilicon after the metal layer has been removed. The data bus is not visible as it is in the metal layer, but it is in the lower third of the image. The rectangles at the bottom connect the data bus to the registers.

MOSFET transistors

For this discussion, a MOSFET can be considered simply a switch that closes if the gate input is 1 and opens if the gate input is 0. A MOSFET transistor is implemented by separating two diffusion regions, and putting a polysilicon wire over the gate. An insulating layer prevents any current from flowing between the gate and the rest of the transistor. In the following diagram, the n+ diffusion regions are pink, the polysilicon gate conductor is dull green, and the insulating oxide layer is turquoise.

Structure of a MOSFET transistor. The n+ diffusion regions are pink, the polysilicon gate conductor is dull green, and the insulating oxide layer is turquoise.

NOR gate

The NOR gate is a fundamental building block in the 8085, since it is a very simple gate that can form more complex logic. A NOR gate is implemented through two transistors and a pullup transistor. If either input (or both) is 1, the corresponding transistor connects the output to ground. Otherwise, the transistors are open, and the pullup pulls the output high. The pullup is shown as a resistor in the schematic, but it is actually a type of transistor called a depletion-mode transistor for better performance.

A NOR gate is implemented through two transistors and a pullup transistor. If either (or both) input is 1, the corresponding transistor connects the output to ground. Otherwise, the transistors are open, and the pullup pulls the output high.

By zooming in to a single NOR gate in the 8085, we can see how the gate is actually implemented. One surprise is that the circuit is almost all wiring; the transistors form a very small part of the circuit. The two transistors are connected to ground on the left, and tied together on the right. The pullup transistor is much larger than the other transistors for technical reasons.[8]

To understand the circuit, trace the path from ground to each transistor, across the gate, and to the output. In this way you can see there are two paths from ground to the output, and if either input is 1 the output will be 0.

The layout of the gate is intended to be as efficient as possible, given the constraints of where the power (VCC), ground, and other connections are, yielding a layout that looks a bit unusual. The power, ground, and input signals are all in the metal layer above (not shown here), and are connected to this circuit through vias between the metal and the silicon below.

A NOR gate in the 8085 microprocessor, showing the components.If either input is high, the associated transistor will connect the output to ground. Otherwise the pullup transistor will pull the output high.

A NOR gate in the 8085 microprocessor, showing the components.If either input is high, the associated transistor will connect the output to ground. Otherwise the pullup transistor will pull the output high.

Exclusive-or gates

The exclusive-or circuit (which outputs a 1 if exactly one input is 1) is a key component of the flag circuitry, and illustrates how more complex logic can be formed out of simpler gates. The schematic below shows how the exclusive-or is built from a NOR gate and an AND-NOR gate; it is straightforward to verify that if both inputs are 0 or both inputs are 1, the output is will be 0.

You may wonder why the 8085 uses so many "strange" gates such as a combined AND-NOR, instead of "normal" gates like AND. The transistor-level schematic shows that an AND-NOR gate can actually be implemented very simply with MOSFETs, in fact simpler than a plain AND gate. The two rightmost transistors form the "AND" - if they both have 1 inputs, they connect the output to ground. The transistor to the left forms the other part of the NOR - if it has a 1 input, it pulls the output to ground.

The exclusive-or circuit used in the 8085: gate-level and transistor-level.

The following diagram shows an XOR circuit in the 8085 that matches the schematic above. (This is the XOR gate that generates the K flag.) On the left is the NOR gate discussed above, and on the right is the AND-NOR circuit, both outlined with a dotted line. As before, the circuit is mostly wiring, with the transistors forming a small part of the circuit (the green regions between pink diffusion regions).

An XOR gate in the 8085 microprocessor, formed from a NOR gate and an AND-NOR gate. If both inputs are 0, the NOR gate output will be 1, and the NOR transistor will pull the output to 0. If both inputs are 1, the AND transistors will pull the output to 0. Otherwise the pullup transistor will pull the output 1.

An XOR gate in the 8085 microprocessor, formed from a NOR gate and an AND-NOR gate. If both inputs are 0, the NOR gate output will be 1, and the NOR transistor will pull the output to 0. If both inputs are 1, the AND transistors will pull the output to 0. Otherwise the pullup transistor will pull the output 1.

The flag latch

Each flag bit is stored in a simple latch circuit made up of two inverters. To store a 1, the inverter on the right outputs a 0, which is fed into the inverter on the left, which outputs a 1, which is fed back to the inverter on the right. A zero is stored in a similar (but opposite) manner. When the clock input is low, the pass transistor opens, breaking the feedback loop, and new data can be written into the latch. The complemented output (/out) is taken from the inverter.

You might wonder why the latch doesn't lose its data whenever the clock goes low. There's an interesting trick here called dynamic logic. Because the gate of a MOSFET consists of an insulating layer it has very high resistance. Thus, any electrical charge on the gate will remain there for some time[9] when the pass transistor opens. When the pass transistor closes, the charge is refreshed.

The latch used in the 8085 to store a flag value. The latch uses two inverters to store the data. When the clock is low, a new value can be written to the latch.

The latch used in the 8085 to store a flag value. The latch uses two inverters to store the data. When the clock is low, a new value can be written to the latch.

The following part of the 8085 chip shows the implementation of the latch for the V flag. The circuit closely matches the schematic above. The two inverters are outlined with dotted lines. The red arrows show the flow of data through the circuit. As before, the wiring and pullup transistors take up most of the silicon real estate.

Each flag in the 8085 uses a two-inverter latch to store the flag. This shows the latch for the undocumented V flag. The red arrows show the flow of data.

Each flag in the 8085 uses a two-inverter latch to store the flag. This shows the latch for the undocumented V flag. The red arrows show the flow of data.

Driving the data bus with a superbuffer

Another interesting feature of the flag circuit is the "superbuffer". Most transistors in the 8085 only send a signal a short distance. However, to send a signal on the data bus across the whole chip takes a lot more power, so a superbuffer is used. In the superbuffer, one transistor is driven to pull the output low, while a second transistor is driven to pull the output high. (This is in contrast to a regular gate, which uses a depletion-mode pullup transistor to pull the output high.) In addition, these transistors are considerably larger, to provide more current.[8] These two transistors are shown at the bottom the schematic below.

The other feature of this superbuffer is that it is tri-state. In addition to a 0 or 1 output, it has a third state, which basically consists of providing no output. This way, the flags do not affect the data bus except when desired. In the schematic, it can be seen that if the control input is 1, both NOR gates will output 0, and both transistors will do nothing.

The superbuffer used in the 8085 to drive the data bus.

The superbuffer used in the 8085 to drive the data bus.

The following diagram shows the two drive transistors, as well as the line used to read the flag from the data bus. (The NOR gates are not shown.) Note the size of these transistors compared to transistors seen earlier. Each flag bit requires a superbuffer such as this. Even flag bit 3, which is always 0, requires a large transistor to drive the 0 onto the bus - it's surprising that a do-nothing flag still takes up a fair bit of silicon.

Each flag in the 8085 uses a superbuffer to drive the value onto the data bus. This figure shows the two large transistors that drive the V flag onto bit 1 of the data bus.

Each flag in the 8085 uses a superbuffer to drive the value onto the data bus. This figure shows the two large transistors that drive the V flag onto bit 1 of the data bus.

Putting it all together

The above discussion has shown the details of the XOR gate that computes the K flag, and the latch and superbuffer for the V flag. The following diagram shows how these pieces fit into the overall circuitry. The latch and driver for the K flag are outside this image, to the right. The circuits below are tied together by the metal layer, which isn't shown. Compare this diagram with the schematic at the top of the article to see how the components are implemented. The two XOR circuits look totally different, since their layouts have been optimized to fit with the signals they need.

The 8085 circuits to implement the undocumented V and K flags. The ALU provides /carry6, /carry7, and result7. The XOR circuit on the left generates V, and the XOR circuit in the middle generates K. On the right are the latch for the V flag, and the superbuffer that outputs the flag to the data bus. The K flag latch and superbuffer are to the right, not shown.

The 8085 circuits to implement the undocumented V and K flags. The ALU provides /carry6, /carry7, and result7. The XOR circuit on the left generates V, and the XOR circuit in the middle generates K. On the right are the latch for the V flag, and the superbuffer that outputs the flag to the data bus. The K flag latch and superbuffer are to the right, not shown.
By looking at the silicon chip carefully, the transistors, gates, and complex circuits start to make sense. It's amazing to think that the complex computers we use are built out of these simple components. Of course, processors now are way more complex than the 8085, with billions of transistors instead of thousands, but the basic principles are still the same.

If you found this discussion interesting, check out my earlier analysis of the 6502's overflow flag and the 8085's ALU. You may also be interested in the book The Elements of Computing Systems, which describes how to build a computer starting with Boolean logic.

Credits

The chip images are from visual6502.org. The visual6502 team did the hard work of dissolving chips in acid to remove the packaging and then taking many close-up photographs of the die inside. Pavel Zima converted these photographs into mask layer images, a transistor net, and an 8085 simulator.

Notes and references

[1] The undocumented instructions and flags of the 8085 were discovered by Wolfgang Sehnhardt and Villy M. Sorensen in the process of writing an 8085 assembler, and were written up in the article Unspecified 8085 op codes enhance programming, Engineer's Notebook, "Electronics" magazine, Jan 18, 1979 p 144-145.

[2] See my article The 6502 overflow flag explained mathematically for details on overflow. There are multiple ways of computing overflow, and the 6502 uses a different technique.

[3] Tundra Semiconductor sold the CA80C85B, a CMOS version of the 8085. Interestingly, the undocumented opcodes and flags are described in the datasheet for this part: CA80C85B datasheet, 8000-series components.

The interesting thing about the Tundra datasheet is the descriptions of the "new" flags and instructions are copied almost exactly from Dehnhardt's article except for the introduction of errors, missing parentheses, and renaming the K flag as UI. In addition, as I described earlier, the published K/UI flag formula doesn't always work. Thus, it appears that despite manufacturing the chip, Tundra didn't actually know how these circuits worked.

[4] The V flag makes sense for signed addition and subtraction, and the K flag makes sense for signed subtraction and comparison. Many other operations affect these flags, but the flags may not have any useful meaning.

The V flag is 0 for RRC, RAR, AND, OR, and XOR operations, since these operations have constant carry values inside the ALU (details). The RLC and RAL operations add the accumulator to itself, so they can be treated the same as addition: V is set if the signed result is too big for a byte. The V flag for DAA can also be understood in terms of the underlying addition: V will only be set if the top digit goes from 7 to 8. However, since BCD digits are unsigned, V has no useful meaning with DAA. DAD is an interesting case, since the V flag indicates 16-bit signed overflow; it is actually computed from the result of the high-order addition. For INR, the only overflow case is going from 0x7f to 0x80 (127 to -128); note that going from 0xff to 0x00 corresponds to -1 to 0, which is not signed overflow even though it is unsigned overflow. Likewise, DCR sets the V flag going from hex 80 to 7f (-128 to 127); likewise 0x00 to 0xff is not signed overflow.

The K flag has a few special cases. For AND, OR, and XOR, the K flag is the same as the sign, since the V flag is 0. Note that the K flag is computed entirely differently for INR/DCR compared to INX/DCX. For INR and DCR, the K flag is S^V, which almost always is S. The K flag is set for DAA if S^V is true, which doesn't have any useful meaning since BCD values are unsigned.

The published formula for the K flag gives the wrong value for XOR if both arguments are negative.

[5] The following table illustrates the 8 possible cases when comparing signed numbers A and B. The inputs are the top bit of A, the top bit of B, and the carry from bit 6 when subtracting B from A. The outputs are the carry, borrow (complement of carry), sign, overflow, and K flags. An example is given for each row. Note that the K flag is set if A is less than B when treated as signed numbers.

InputsOutputsExample
A7B7C6CBSVKHexSigned comparison
010010000x50 - 0xf0 = 0x6080 - -16 = 96
011011100x50 - 0xb0 = 0xa080 - -80 = -96
000011010x50 - 0x70 = 0xe080 - 112 = -32
001100000x50 - 0x30 = 0x12080 - 48 = 32
110011010xd0 - 0xf0 = 0xe0-48 - -16 = -32
111100000xd0 - 0xb0 = 0x120-48 - -80 = 32
100100110xd0 - 0x70 = 0x160-48 - 112 = 96
101101010xd0 - 0x30 = 0x1a0-48 - 48 = -96

[6] A detailed explanation of signed comparisons is given in Beyond 8-bit Unsigned Comparisons by Bruce Clark, section 5. While this article is in the context of the 6502, the discussion applies equally to the 8085.

[7] The illegal opcodes in the 6502 are discussed in detail in How MOS 6502 Illegal Opcodes really work. In the 6502, the operations performed by illegal opcodes are unintended, just chance based on what the chip logic happens to do with unexpected inputs. In contrast, the undocumented opcodes in the 8085, like the undocumented flags, are deliberately implemented.

[8] The key parameter in the performance of a MOSFET transistor is the width to length ratio of the gate. Oversimplifying slightly, the current provided by the transistors is proportional to this ratio. (Width is the width of the source or drain, and length is the length across the gate from source to drain.) For an inverter, the W/L ratio of the pullup should be approximately 1/4 the W/L ratio of the input transistor for best performance. (See Introduction to VLSI Systems, Mead, Conway, p 8.) The result is that pullup transistors are big and blocky compared to pulldown transistors. Another consequence is that high-current transistors in a superbuffer have a very wide gate. The 8085 register file has some transistors where the W/L ratios are carefully configured so one transistor will "win" over the other if both are on at the same time. (This is why the 8085 simulator is more complex than the 6502 simulator, needing to take transistor sizes into account.)

[9] One effect of using pass-transistor dynamic buffers is that if the clock speed is too small, the charge will eventually drain away causing data loss. As a result the 8085 has a minimum clock speed of 500 kHz. Likewise, the 6502 has a minimum clock speed. The Z-80 in contrast is designed with static logic, so it has no minimum clock speed - the clock can be stepped as slowly as desired.

Inside the ALU of the 8085 microprocessor

The arithmetic-logic unit is a fundamental part of any computer, performing addition, subtraction, and logic operations, but how it works is a mystery to many people. I've reverse-engineered the ALU circuit from the 8085 microprocessor and explain how it works. The 8085's ALU is a surprisingly complex circuit that at first looks like a mysterious jumble of gates, but it can be understood if you don't mind diving into some Boolean logic.

The following diagram shows the location of the ALU in the 8085. The ALU is 8 bits wide, with the high-order bit on the left. The register file is the large block below the ALU. The registers are 16 bits wide, made up of pairs of 8-bit registers. Surprisingly, the register file has the high-order bit on the right, the opposite order from the ALU.

The 8085 microprocessor, showing the location of the 8-bit ALU.

The ALU takes two 8-bit inputs, which I'll call A and X, and performs one of five basic operations: ADD, OR, XOR, AND, and SHIFT-RIGHT. As well, if the input X is inverted, the ALU can perform subtraction and complement operations. You might think SHIFT-LEFT is missing from this list. However, it is simply performed by adding the number to itself, which shifts it to the left one bit in binary. Note that the 8085 arithmetic operations are very basic. There is no multiplication or division operation - these were added in the 8086.

The ALU consists of 8 mostly-identical slices, one for each bit. For addition, each slice of the ALU adds the appropriate input bits, computing the sum A + X + carry-in, generating a sum bit and a carry-out bit. That is, each bit of the ALU implements a full adder. The logic operations simply operate on the two input bits: A AND X, A OR X, A XOR X. Shift-right simply outputs the A bit from the slice to the right.

ALU schematic

The following schematic shows one bit of the ALU. The schematic has roughly the same layout as the implementation on the chip, flowing from bottom to top. Eight of these circuits are stacked side-by-side, with the low-order bit on the right. Carries flow from right to left, and bits shifted right flow from left to right.

Schematic of one bit of the ALU in the 8085 microprocessor.

Negation

Starting at the bottom of the schematic, is the complex gate labeled Negation. This gate optionally selects a negated second argument by selecting either XN or /XN. (XN is the Nth bit of the second argument, which I'll call X. The / indicates the complement.) For most of the discussion below I'll assume XN is uncomplemented to keep things simpler.

Operation

Above the complement selector are a few gates labeled Operation that perform the desired 2-input operation. The NAND gate on the left generates either A NAND X or 1 based on the select_op1 control line. The OR gate on the right generates either A OR X or 1, based on the select_op2 control line. Combining these in the NAND gate yields four different possibilities:
select_op1select_op2Result
00A NOR X
010
10A NXOR X
11A AND X
Note that instead of OR and XOR, the complemented value is produced by this circuit. This will be fixed in the next step.

Combine with carry

Above the operation circuit is the next block of gates labeled Combine with carry that generates the ALU output by merging the carry-in with the operation value via XOR.

To understand this circuit, first consider the following simple XOR circuit, which is used a couple times in the ALU. It can be understood fairly simply: if both inputs are 0 (top) or both inputs are 1 (bottom) then the output is 0.

An XOR gate can be implemented by a NOR gate and an AND-NOR gate. This circuit is used in the 8085 ALU.

Ignoring the shift_right circuit for a moment, the block of gates is simply the XOR circuit above. Note that XOR with 0 is a no-op, while XOR with 1 complements the value. And A XOR X XOR CARRY is the low-order bit of adding A, X, and CARRY.

The key point of this circuit is that the incoming carry is generated with the proper value to convert the operation output into the desired final result. The incoming carry /carry(N-1) is either 0, 1, or the complemented carry from bit N-1 as appropriate.

OpOperation outputCarryResult
orA NOR X1A OR X
addA NXOR X/carryA XOR X XOR CARRY
xorA NXOR X1A XOR X
andA AND X0A AND X
shift right00A(N+1)
complementA NOR /X1A OR /X
subtractA NXOR /X/carryA XOR /X XOR CARRY

Note that the carry-in line must have the right value in order to generate the appropriate output. For addition it passes the inverted carry from one bit to the next. But for OR, XOR, the line is set to 1. And for AND and SHIFT_RIGHT it is set to 0. As will be seen below, the carry circuitry generates the right value for the right operation.

The final aspect of this circuit is the shift-right circuit. With a 0 op input, 0 carry input, and shift_right set, the output is simply the bit from the right: A(N+1).

Generate carry

The circuit on the left, labeled Generate carry generates the carry out. It can generate three different outputs: 1, 0, or the (complemented) carry from the sum. If select_op2 is set, it will force the carry to 0. Otherwise if force_ncarry_1 is set, it will force the carry to 1. Otherwise, the carry is generated for the sum of A + X + carry-in through straightforward logic: If the carry-in is set, and one of the inputs is set, there will be a carry out. If both input bits are set, there will be a carry out.

Flags

The 8085 has a parity flag, which is 1 if the number of 1 bits is even, and 0 if the number of parity bits is odd. The parity flag is generated by XORing all the result bits together (and complementing). Each bit is XORed with the lower-order parity value by the parity circuit near the top of the schematic. The XOR circuit is the same circuit described above.

The zero flag is computed by a simple circuit: each result bit drives a transistor that will pull the zero line low if the bit is set. This forms an 8-input NOR gate, spread across the ALU.

The control lines

As seen in the schematic, the 8085 uses multiple control lines to control the activity inside the ALU. In total, the ALU provides 7 different operations and the following table summarizes the control lines that are used for each operation. It also lists the opcodes that use each ALU operation.

Operationselect_negselect_op1select_op2shift_rightforce_ncarry_1Opcodes
or00001ORA,ORI (and default)
add01000INR,DCR,RLC,DAD,RAL,DAA,ADD,ADC,ADI,ACI (and undocumented LDSI,LDHI,RDEL)
xor01001XRA,XRI
and01101ANA,ANI
shift right00111RRC,RAR (ARHL)
complement10001CMA
subtract11000SUB,SBB,SUI,SBI,CMP,CPI (DSUB)

The ALU control lines are generated from the opcode by the programmable logic array. Specifically, they are outputs from PLA F, which is to the right of the ALU. More details are in my article on the PLA. The ALU has additional control lines to set up the registers, initialize the carry bits, and set the flags. These control the differences between different op codes, beyond the categories above. the I will explain those in a future article.

The PLA in the 8085

Reverse-engineering the ALU

This information is based on the 8085 reverse-engineering done by the visual 6502 team. This team dissolves chips in acid to remove the packaging and then takes many close-up photographs of the die inside. Pavel Zima converted these photographs into mask layer images, generated a transistor net from the layers, and wrote a transistor-level 8085 simulator.

I took the transistor net and used it to figure out how the ALU works. First, I converted the transistor net into gates. Next I figured out which gates are part of the ALU and put them into a schematic. Then I examined how the circuit worked for different operations and eventually figured out how it works.

Conclusion

The ALU of the 8085 is an interesting circuit. At first it seemed like an incomprehensible pile of gates with mysterious control lines, but after some investigation I figured it out. The 8085 ALU is implemented very differently from the 6502's ALU (which I'll write up later). The 6502's ALU uses fairly straightforward circuits to generate the SUM, AND, XOR, OR, and SHIFT values in parallel, and then uses a simple pass-transistor multiplexor to pick the desired operation. This is in contrast to the 8085 ALU, which generates only the desired value.