Tracing the roots of the 8086 instruction set to the Datapoint 2200 minicomputer

The Intel 8086 processor started the x86 architecture that is still extensively used today. The 8086 has some quirky characteristics: it is little-endian, has a parity flag, and uses explicit I/O instructions instead of just memory-mapped I/O. It has four 16-bit registers that can be split into 8-bit registers, but only one that can be used for memory indexing. Surprisingly, the reason for these characteristics and more is compatibility with a computer dating back before the creation of the microprocessor: the Datapoint 2200, a minicomputer with a processor built out of TTL chips. In this blog post, I'll look in detail at how the Datapoint 2200 led to the architecture of Intel's modern processors, step by step through the 8008, 8080, and 8086 processors.

The Datapoint 2200

In the late 1960s, 80-column IBM punch cards were the primary way of entering data into computers, although CRT terminals were growing in popularity. The Datapoint 2200 was designed as a low-cost terminal that could replace a keypunch, with a squat CRT display the size of a punch card. By putting some processing power into the Datapoint 2200, it could perform data validation and other tasks, making data entry more efficient. Even though the Datapoint 2200 was typically used as an intelligent terminal, it was really a desktop minicomputer with a "unique combination of powerful computer, display, and dual cassette drives." Although now mostly forgotten, the Datapoint 2200 was the origin of the 8-bit microprocessor, as I'll explain below.

The Datapoint 2200 computer (Version II).

The memory storage of the Datapoint 2200 had a large impact on its architecture and thus the architecture of today's computers. In the 1960s and early 1970s, magnetic core memory was the dominant form of computer storage. It consisted of tiny ferrite rings, threaded into grids, with each ring storing one bit. Magnetic core storage was bulky and relatively expensive, though. Semiconductor RAM was new and very expensive; Intel's first product in 1969 was a RAM chip called the 3101, which held just 64 bits and cost $99.50. To minimize storage costs, the Datapoint 2200 used an alternative: MOS shift-register memory. The Intel 1405 shift-register memory chip provided much more storage than RAM chips at a much lower cost (512 bits for $13.30).1

Intel 1405 shift-register memory chips in metal cans, in the Datapoint 2200.

The big problem with shift-register memory is that it is sequential: the bits come out one at a time, in the same order you put them in. This wasn't a problem when executing instructions sequentially, since the memory provided each instruction as it was needed. For a random access, though, you need to wait until the bits circulate around and you get the one you want, which is very slow. To minimize the number of memory accesses, the Datapoint 2200 had seven registers, a relatively large number of registers for the time.2 The registers were called A, B, C, D, E, H, and L, and these names had a lasting impact on Intel processors.

Another consequence of shift-register memory was that the Datapoint 2200 was a serial computer, operating on one bit at a time as the shift-register memory provided it, using a 1-bit ALU. To handle arithmetic operations, the ALU needed to start with the lowest bit so it could process carries. Likewise, a 16-bit value (such as a jump target) needed to start with the lowest bit. This resulted in a little-endian architecture, with the low byte first. The little-endian architecture has remained in Intel processors to the present.
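
To make the byte ordering concrete, here is a minimal Python sketch (my own illustration, not anything from Datapoint or Intel) showing a 16-bit value stored with its low byte first, the ordering that the Datapoint 2200 and x86 share:

```python
import struct

# A 16-bit value such as a jump target, stored low byte first (little-endian).
value = 0x1234
print(struct.pack('<H', value).hex())  # '3412': the low byte 0x34 comes first
print(struct.pack('>H', value).hex())  # '1234': big-endian order, for contrast
```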

Since the Datapoint 2200 was designed before the creation of the microprocessor, its processor was built from a board of TTL chips (as was typical for minicomputers at the time). The diagram below shows the processor board with the chips categorized by function. The board has a separate chip for each 8-bit register (B, C, D, etc.) and separate chips for control flags (Z, carry, etc.). The Arithmetic/Logic Unit (ALU) takes about 18 chips, while instruction decoding is another 18 chips. Because every feature required more chips, the designers of the Datapoint 2200 were strongly motivated to make the instruction set as simple as possible. This was necessary since the Datapoint 2200 was a low-cost device, renting for just $148 a month. In contrast, the popular PDP-8 minicomputer rented for $500 a month.

The Datapoint 2200 processor board with registers, flags, and other blocks labeled. Click this image (or any other) for a larger version.

One way that the Datapoint 2200 simplified the hardware was by creating a large set of instructions by combining simpler pieces in an orthogonal way. For instance, the Datapoint 2200 has 64 ALU instructions that apply one of eight ALU operations to one of the eight registers. This requires a small amount of hardware—eight ALU circuits and a circuit to select the register—but provides a large number of instructions. Another example is the register-to-register move instructions. Specifying one of eight source registers and one of eight destination registers provides a large, flexible set of instructions to move data.

The Datapoint 2200's instruction format was designed around this principle, with groups of three bits specifying a register. A common TTL chip could decode the group of three bits and activate the desired circuit.3 For instance, a data move instruction had the bit pattern 11DDDSSS to move a byte from the specified source (SSS) to the specified destination (DDD). (Note that this bit pattern maps onto three octal digits very nicely since the source and destination are separate digits.4)

One unusual feature of the Datapoint instruction set is that a memory access was just like a register access. That is, an instruction could specify one of the seven physical registers or could specify a memory access (M), using the identical instruction format. One consequence of this is that you couldn't include a memory address in an instruction. Instead, memory could only be accessed by first loading the address into the H and L registers, which held the high and low byte of the address respectively.5 This is very unusual and inconvenient, since a memory access took three instructions: two to load the H and L registers and one to access memory as the M "register". The advantage was that it simplified the instruction set and the decoding logic, saving chips and thus reducing the system cost. This decision also had lasting impact on Intel processors and how they access memory.
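
Here is a rough Python model of this addressing idiom, purely as an illustration (the memory size and helper function are my own, not Datapoint specifics):

```python
# Toy model: the M "register" is just the memory byte addressed by H (high) and L (low).
memory = bytearray(1 << 16)   # size chosen only for illustration
reg = {'A': 0, 'B': 0, 'C': 0, 'D': 0, 'E': 0, 'H': 0, 'L': 0}

def read_m():
    return memory[(reg['H'] << 8) | reg['L']]

# Reading the byte at address 0x1234 into A takes three instructions:
reg['H'] = 0x12       # LH 0x12: load H immediate
reg['L'] = 0x34       # LL 0x34: load L immediate
reg['A'] = read_m()   # LAM: load A from the M "register"
```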

The table below shows the Datapoint 2200's instruction set as an octal table of the 256 potential opcodes.6 I have roughly classified the instructions as arithmetic/logic (purple), control-flow (blue), data movement (green), input/output (orange), and miscellaneous (yellow). Note how the orthogonal instruction format produces large blocks of related instructions. The instructions in the lower right (green) load (L) a value from a source to a destination. (The no-operation NOP and HALT instructions are special cases.7) In the upper-left are Load operations (LA, etc.) that use an "immediate" byte, a data byte that follows the instruction. They use the same DDD code to specify the destination register, reusing that circuitry.

  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
0 | HALT | HALT | SLC | RFC | AD | - | LA | RETURN | JFC | INPUT | CFC | - | JMP | - | CALL | -
1 | - | - | SRC | RFZ | AC | - | LB | - | JFZ | - | CFZ | - | - | - | - | -
2 | - | - | - | RFS | SU | - | LC | - | JFS | EX ADR | CFS | EX STATUS | - | EX DATA | - | EX WRITE
3 | - | - | - | RFP | SB | - | LD | - | JFP | EX COM1 | CFP | EX COM2 | - | EX COM3 | - | EX COM4
4 | - | - | - | RTC | ND | - | LE | - | JTC | - | CTC | - | - | - | - | -
5 | - | - | - | RTZ | XR | - | LH | - | JTZ | EX BEEP | CTZ | EX CLICK | - | EX DECK1 | - | EX DECK2
6 | - | - | - | RTS | OR | - | LL | - | JTS | EX RBK | CTS | EX WBK | - | - | - | EX BSP
7 | - | - | - | RTP | CP | - | - | - | JTP | EX SF | CTP | EX SB | - | EX REWND | - | EX TSTOP
0 | ADA | ADB | ADC | ADD | ADE | ADH | ADL | ADM | NOP | LAB | LAC | LAD | LAE | LAH | LAL | LAM
1 | ACA | ACB | ACC | ACD | ACE | ACH | ACL | ACM | LBA | LBB | LBC | LBD | LBE | LBH | LBL | LBM
2 | SUA | SUB | SUC | SUD | SUE | SUH | SUL | SUM | LCA | LCB | LCC | LCD | LCE | LCH | LCL | LCM
3 | SBA | SBB | SBC | SBD | SBE | SBH | SBL | SBM | LDA | LDB | LDC | LDD | LDE | LDH | LDL | LDM
4 | NDA | NDB | NDC | NDD | NDE | NDH | NDL | NDM | LEA | LEB | LEC | LED | LEE | LEH | LEL | LEM
5 | XRA | XRB | XRC | XRD | XRE | XRH | XRL | XRM | LHA | LHB | LHC | LHD | LHE | LHH | LHL | LHM
6 | ORA | ORB | ORC | ORD | ORE | ORH | ORL | ORM | LLA | LLB | LLC | LLD | LLE | LLH | LLL | LLM
7 | CPA | CPB | CPC | CPD | CPE | CPH | CPL | CPM | LMA | LMB | LMC | LMD | LME | LMH | LML | HALT

The lower-left quadrant (purple) has the bulk of the ALU instructions. These instructions have a regular, orthogonal structure making the instructions easy to decode: each row specifies the operation while each column specifies the source. This is due to the instruction structure: eight bits in the pattern 10AAASSS, where the AAA bits specified the ALU operation and the SSS bits specified the register source. The three-bit ALU code specifies the operations Add, Add with Carry, Subtract, Subtract with Borrow, logical AND, logical XOR, logical OR, and Compare. This list is important because it defined the fundamental ALU operations for later Intel processors.8 In the upper-left are ALU operations that use an "immediate" byte. These instructions use the same AAA bit pattern to select the ALU operation, reusing the decoding hardware. Finally, the shift instructions SLC and SRC are implemented as special cases outside the pattern.

The upper quadrants contain the conditional instructions (blue): Return, Jump, and Call. The eight conditions test the four status flags (Carry, Zero, Sign, and Parity) for either True or False. (For example, JFZ Jumps if the Zero flag is False.) A 3-bit field selects the condition, allowing it to be easily decoded in hardware. The parity flag is somewhat unusual: parity is surprisingly expensive to compute in hardware, but since the Datapoint 2200 operated as a terminal, parity was important for detecting transmission errors.

The Datapoint 2200 has an input instruction as well as many output instructions for a variety of specific hardware tasks (orange, labeled EX for external). Typical operations are STATUS to get I/O status, BEEP and CLICK to make sound, and REWIND to rewind the tape. As a result of this decision to use separate I/O instructions, Intel processors still use I/O instructions operating in an I/O space, different from processors such as the MOS 6502 and the Motorola 68000 that used memory-mapped I/O.

To summarize, the Datapoint 2200 has a fairly large number of instructions, but they are generated from about a dozen simple patterns that are easy to decode.9 By combining orthogonal bit fields (e.g. 8 ALU operations multiplied by 8 source registers), 64 instructions can be generated from one underlying pattern.
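
As a rough illustration of how little decoding these patterns require, here is a small Python sketch (my own, covering only the two large quadrants and ignoring the NOP and HALT special cases) that splits an opcode into its octal-style fields:

```python
# Decode the two biggest Datapoint 2200 patterns: 10AAASSS (ALU) and 11DDDSSS (load).
REGS = ['A', 'B', 'C', 'D', 'E', 'H', 'L', 'M']             # 3-bit register/memory codes
ALU_OPS = ['AD', 'AC', 'SU', 'SB', 'ND', 'XR', 'OR', 'CP']  # add, add w/carry, sub, sub w/borrow, and, xor, or, compare

def decode(opcode):
    top = opcode >> 6              # top two bits select the quadrant
    mid = (opcode >> 3) & 0b111    # middle three bits (ALU op or destination)
    low = opcode & 0b111           # low three bits (source register)
    if top == 0b10:                # 10 AAA SSS: ALU operation on a source register
        return ALU_OPS[mid] + REGS[low]
    if top == 0b11:                # 11 DDD SSS: load destination from source
        return 'L' + REGS[mid] + REGS[low]
    return 'control / immediate / I/O instruction'

print(decode(0o240))   # 'NDA': AND register A
print(decode(0o321))   # 'LCB': load register C from register B
```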

Intel 8008

The Intel 8008 was created as a clone of the Datapoint 2200 processor.10 Around the end of 1969, the Datapoint company talked with Intel and Texas Instruments about the possibility of replacing the processor board with a single chip. Even though the microprocessor didn't exist at this point, both companies said they could create such a chip. Texas Instruments was first with a chip called the TMX 1795 that they advertised as a "CPU on a chip". Slightly later, Intel produced the 8008 microprocessor. Both chips copied the Datapoint 2200's instruction set architecture with minor changes.

The Intel 8008 chip in its 18-pin package. The small number of pins hampered the performance of the 8008, but Intel was hesitant to even go to the 18-pin package. Photo by Thomas Nguyen, (CC BY-SA 4.0).

By the time the chips were completed, however, the Datapoint corporation had lost interest in the chips. They were designing a much faster version of the Datapoint 2200 with improved TTL chips (including the well-known 74181 ALU chip). Even the original Datapoint 2200 model was faster than the Intel 8008 processor, and the Version II was over 5 times faster,11 so moving to a single-chip processor would be a step backward.

Texas Instruments unsuccessfully tried to find a customer for their TMX 1795 chip and ended up abandoning the chip. Intel, however, marketed the 8008 as an 8-bit microprocessor, essentially creating the microprocessor industry. In my view, Intel's biggest innovation with the microprocessor wasn't creating a single-chip CPU, but creating the microprocessor as a product category: a general-purpose processor along with everything customers needed to take advantage of it. Intel put an enormous amount of effort into making microprocessors a success: from documentation and customer training to Intellec development systems, from support chips to software tools such as assemblers, compilers, and operating systems.

The table below shows the opcodes of the 8008. For the most part, the 8008 copies the Datapoint 2200, with identical instructions that have identical opcodes (in color). There are a few additional instructions (shown in white), though. Intel designer Ted Hoff realized that increment and decrement instructions (IN and DC) would be very useful for loops. There are two additional bit rotate instructions (RAL and RAR) as well as the "missing" LMI (Load Immediate to Memory) instruction. The RST (restart) instructions act as short call instructions to fixed addresses for interrupt handling. Finally, the 8008 turned the Datapoint 2200's device-specific I/O instructions into 32 generic I/O instructions.

  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
0 | HLT | HLT | RLC | RFC | ADI | RST 0 | LAI | RET | JFC | INP 0 | CFC | INP 1 | JMP | INP 2 | CAL | INP 3
1 | INB | DCB | RRC | RFZ | ACI | RST 1 | LBI | - | JFZ | INP 4 | CFZ | INP 5 | - | INP 6 | - | INP 7
2 | INC | DCC | RAL | RFS | SUI | RST 2 | LCI | - | JFS | OUT 8 | CFS | OUT 9 | - | OUT 10 | - | OUT 11
3 | IND | DCD | RAR | RFP | SBI | RST 3 | LDI | - | JFP | OUT 12 | CFP | OUT 13 | - | OUT 14 | - | OUT 15
4 | INE | DCE | - | RTC | NDI | RST 4 | LEI | - | JTC | OUT 16 | CTC | OUT 17 | - | OUT 18 | - | OUT 19
5 | INH | DCH | - | RTZ | XRI | RST 5 | LHI | - | JTZ | OUT 20 | CTZ | OUT 21 | - | OUT 22 | - | OUT 23
6 | INL | DCL | - | RTS | ORI | RST 6 | LLI | - | JTS | OUT 24 | CTS | OUT 25 | - | OUT 26 | - | OUT 27
7 | - | - | - | RTP | CPI | RST 7 | LMI | - | JTP | OUT 28 | CTP | OUT 29 | - | OUT 30 | - | OUT 31
0 | ADA | ADB | ADC | ADD | ADE | ADH | ADL | ADM | NOP | LAB | LAC | LAD | LAE | LAH | LAL | LAM
1 | ACA | ACB | ACC | ACD | ACE | ACH | ACL | ACM | LBA | LBB | LBC | LBD | LBE | LBH | LBL | LBM
2 | SUA | SUB | SUC | SUD | SUE | SUH | SUL | SUM | LCA | LCB | LCC | LCD | LCE | LCH | LCL | LCM
3 | SBA | SBB | SBC | SBD | SBE | SBH | SBL | SBM | LDA | LDB | LDC | LDD | LDE | LDH | LDL | LDM
4 | NDA | NDB | NDC | NDD | NDE | NDH | NDL | NDM | LEA | LEB | LEC | LED | LEE | LEH | LEL | LEM
5 | XRA | XRB | XRC | XRD | XRE | XRH | XRL | XRM | LHA | LHB | LHC | LHD | LHE | LHH | LHL | LHM
6 | ORA | ORB | ORC | ORD | ORE | ORH | ORL | ORM | LLA | LLB | LLC | LLD | LLE | LLH | LLL | LLM
7 | CPA | CPB | CPC | CPD | CPE | CPH | CPL | CPM | LMA | LMB | LMC | LMD | LME | LMH | LML | HLT

Intel 8080

The 8080 improved the 8008 in many ways, focusing on speed and ease of use, and resolving customer issues with the 8008.12 Customers had criticized the 8008 for its small memory capacity, low speed, and difficult hardware interfacing. The 8080 increased memory capacity from 16K to 64K and was over an order of magnitude faster than the 8008. The 8080 also moved to a 40-pin package that made interfacing easier, but the 8080 still required a large number of support chips to build a working system.

Although the 8080 was widely used in embedded systems, it is more famous for its use in the first generation of home computers, boxes such as the Altair and IMSAI. Famed chip designer Federico Faggin said that the 8080 really created the microprocessor; the 4004 and 8008 suggested it, but the 8080 made it real.13

Altair 8800 computer on display at the Smithsonian. Photo by Colin Douglas, (CC BY-SA 2.0).

The table below shows the instruction set for the 8080. The 8080 was designed to be compatible with 8008 assembly programs after a simple translation process, although the instructions were shifted to different opcodes and renamed.15 The instructions from the Datapoint 2200 (colored) form the majority of the 8080's instruction set. The instruction set was expanded by adding some 16-bit support, allowing register pairs (BC, DE, HL) to be used as 16-bit registers for double add, 16-bit increment and decrement, and 16-bit memory transfers. Many of the new instructions in the 8080 may seem like contrived special cases, such as SPHL (Load SP from HL) and XCHG (Exchange DE and HL), but they made accesses to memory easier. The I/O instructions from the 8008 have been condensed to just IN and OUT, opening up room for new instructions.

  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
0 | NOP | LXI B | STAX B | INX B | INR B | DCR B | MVI B | RLC | MOV B,B | MOV B,C | MOV B,D | MOV B,E | MOV B,H | MOV B,L | MOV B,M | MOV B,A
1 | - | DAD B | LDAX B | DCX B | INR C | DCR C | MVI C | RRC | MOV C,B | MOV C,C | MOV C,D | MOV C,E | MOV C,H | MOV C,L | MOV C,M | MOV C,A
2 | - | LXI D | STAX D | INX D | INR D | DCR D | MVI D | RAL | MOV D,B | MOV D,C | MOV D,D | MOV D,E | MOV D,H | MOV D,L | MOV D,M | MOV D,A
3 | - | DAD D | LDAX D | DCX D | INR E | DCR E | MVI E | RAR | MOV E,B | MOV E,C | MOV E,D | MOV E,E | MOV E,H | MOV E,L | MOV E,M | MOV E,A
4 | - | LXI H | SHLD | INX H | INR H | DCR H | MVI H | DAA | MOV H,B | MOV H,C | MOV H,D | MOV H,E | MOV H,H | MOV H,L | MOV H,M | MOV H,A
5 | - | DAD H | LHLD | DCX H | INR L | DCR L | MVI L | CMA | MOV L,B | MOV L,C | MOV L,D | MOV L,E | MOV L,H | MOV L,L | MOV L,M | MOV L,A
6 | - | LXI SP | STA | INX SP | INR M | DCR M | MVI M | STC | MOV M,B | MOV M,C | MOV M,D | MOV M,E | MOV M,H | MOV M,L | HLT | MOV M,A
7 | - | DAD SP | LDA | DCX SP | INR A | DCR A | MVI A | CMC | MOV A,B | MOV A,C | MOV A,D | MOV A,E | MOV A,H | MOV A,L | MOV A,M | MOV A,A
0 | ADD B | ADD C | ADD D | ADD E | ADD H | ADD L | ADD M | ADD A | RNZ | POP B | JNZ | JMP | CNZ | PUSH B | ADI | RST 0
1 | ADC B | ADC C | ADC D | ADC E | ADC H | ADC L | ADC M | ADC A | RZ | RET | JZ | - | CZ | CALL | ACI | RST 1
2 | SUB B | SUB C | SUB D | SUB E | SUB H | SUB L | SUB M | SUB A | RNC | POP D | JNC | OUT | CNC | PUSH D | SUI | RST 2
3 | SBB B | SBB C | SBB D | SBB E | SBB H | SBB L | SBB M | SBB A | RC | - | JC | IN | CC | - | SBI | RST 3
4 | ANA B | ANA C | ANA D | ANA E | ANA H | ANA L | ANA M | ANA A | RPO | POP H | JPO | XTHL | CPO | PUSH H | ANI | RST 4
5 | XRA B | XRA C | XRA D | XRA E | XRA H | XRA L | XRA M | XRA A | RPE | PCHL | JPE | XCHG | CPE | - | XRI | RST 5
6 | ORA B | ORA C | ORA D | ORA E | ORA H | ORA L | ORA M | ORA A | RP | POP PSW | JP | DI | CP | PUSH PSW | ORI | RST 6
7 | CMP B | CMP C | CMP D | CMP E | CMP H | CMP L | CMP M | CMP A | RM | SPHL | JM | EI | CM | - | CPI | RST 7

The 8080 also moved the stack to external memory, rather than using an internal fixed special-purpose stack as in the 8008 and Datapoint 2200. This allowed PUSH and POP instructions to put register data on the stack. Interrupt handling was also improved by adding the Enable Interrupt and Disable Interrupt instructions (EI and DI).14

Intel 8085

The Intel 8085 was designed as a "mid-life kicker" for the 8080, providing incremental improvements while maintaining compatibility. From the hardware perspective, the 8085 was much easier to use than the 8080. While the 8080 required three voltages, the 8085 required a single 5-volt power supply (represented by the "5" in the part number). Moreover, the 8085 eliminated most of the support chips required with the 8080; a working 8085 computer could be built with just three chips. Finally, the 8085 provided additional hardware functionality: better interrupt support and serial I/O.

The Intel 8085, like the 8080 and the 8086, was packaged in a 40-pin DIP. Photo by Thomas Nguyen, (CC BY-SA 4.0).

On the software side, the 8085 is curious: 12 instructions were added to the instruction set (finally using every opcode), but all but two were left undocumented.16 Moreover, the 8085 added two new condition codes, but these were also hidden. This situation occurred because the 8086 project started up in 1976, near the release of the 8085 chip. Intel wanted the 8086 to be compatible (to some extent) with the 8080 and 8085, but providing new instructions in the 8085 would make compatibility harder. It was too late to remove the instructions from the 8085 chip, so Intel did the next best thing and removed them from the documentation. These instructions are shown in red in the table below. Only the new SIM and RIM instructions were documented, since they were necessary to use the 8085's new interrupt and serial I/O features.

  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
0 | NOP | LXI B | STAX B | INX B | INR B | DCR B | MVI B | RLC | MOV B,B | MOV B,C | MOV B,D | MOV B,E | MOV B,H | MOV B,L | MOV B,M | MOV B,A
1 | DSUB | DAD B | LDAX B | DCX B | INR C | DCR C | MVI C | RRC | MOV C,B | MOV C,C | MOV C,D | MOV C,E | MOV C,H | MOV C,L | MOV C,M | MOV C,A
2 | ARHL | LXI D | STAX D | INX D | INR D | DCR D | MVI D | RAL | MOV D,B | MOV D,C | MOV D,D | MOV D,E | MOV D,H | MOV D,L | MOV D,M | MOV D,A
3 | RDEL | DAD D | LDAX D | DCX D | INR E | DCR E | MVI E | RAR | MOV E,B | MOV E,C | MOV E,D | MOV E,E | MOV E,H | MOV E,L | MOV E,M | MOV E,A
4 | RIM | LXI H | SHLD | INX H | INR H | DCR H | MVI H | DAA | MOV H,B | MOV H,C | MOV H,D | MOV H,E | MOV H,H | MOV H,L | MOV H,M | MOV H,A
5 | LDHI | DAD H | LHLD | DCX H | INR L | DCR L | MVI L | CMA | MOV L,B | MOV L,C | MOV L,D | MOV L,E | MOV L,H | MOV L,L | MOV L,M | MOV L,A
6 | SIM | LXI SP | STA | INX SP | INR M | DCR M | MVI M | STC | MOV M,B | MOV M,C | MOV M,D | MOV M,E | MOV M,H | MOV M,L | HLT | MOV M,A
7 | LDSI | DAD SP | LDA | DCX SP | INR A | DCR A | MVI A | CMC | MOV A,B | MOV A,C | MOV A,D | MOV A,E | MOV A,H | MOV A,L | MOV A,M | MOV A,A
0 | ADD B | ADD C | ADD D | ADD E | ADD H | ADD L | ADD M | ADD A | RNZ | POP B | JNZ | JMP | CNZ | PUSH B | ADI | RST 0
1 | ADC B | ADC C | ADC D | ADC E | ADC H | ADC L | ADC M | ADC A | RZ | RET | JZ | RSTV | CZ | CALL | ACI | RST 1
2 | SUB B | SUB C | SUB D | SUB E | SUB H | SUB L | SUB M | SUB A | RNC | POP D | JNC | OUT | CNC | PUSH D | SUI | RST 2
3 | SBB B | SBB C | SBB D | SBB E | SBB H | SBB L | SBB M | SBB A | RC | SHLX | JC | IN | CC | JNK | SBI | RST 3
4 | ANA B | ANA C | ANA D | ANA E | ANA H | ANA L | ANA M | ANA A | RPO | POP H | JPO | XTHL | CPO | PUSH H | ANI | RST 4
5 | XRA B | XRA C | XRA D | XRA E | XRA H | XRA L | XRA M | XRA A | RPE | PCHL | JPE | XCHG | CPE | LHLX | XRI | RST 5
6 | ORA B | ORA C | ORA D | ORA E | ORA H | ORA L | ORA M | ORA A | RP | POP PSW | JP | DI | CP | PUSH PSW | ORI | RST 6
7 | CMP B | CMP C | CMP D | CMP E | CMP H | CMP L | CMP M | CMP A | RM | SPHL | JM | EI | CM | JK | CPI | RST 7

Intel 8086

Following the 8080, Intel intended to revolutionize microprocessors with a 32-bit "micro-mainframe", the iAPX 432. This extremely complex processor implemented objects, memory management, interprocess communication, and fine-grained memory protection in hardware. The iAPX 432 was too ambitious and the project fell behind schedule, leaving Intel vulnerable to competitors such as Motorola and Zilog. Intel quickly threw together a 16-bit processor as a stopgap until the iAPX 432 was ready; to show its continuity with the 8-bit processor line, this processor was called the 8086. The iAPX 432 ended up being one of the great disaster stories of modern computing and quietly disappeared.

The "stopgap" 8086 processor, however, started the x86 architecture that changed the history of Intel. The 8086's victory was powered by the IBM PC, designed in 1981 around the Intel 8088, a variant of the 8086 with a cheaper 8-bit bus. The IBM PC was a rousing success, defining the modern computer and making Intel's fortune. Intel produced a succession of more powerful chips that extended the 8086: 286, 386, 486, Pentium, and so on, leading to the current x86 architecture.

The original IBM PC used the Intel 8088 processor, a variant of the 8086 with an 8-bit bus. Photo by Ruben de Rijcke, (CC BY-SA 3.0).

The 8086 was a major change from the 8080/8085, jumping from an 8-bit architecture to a 16-bit architecture and expanding from 64K of memory to 1 megabyte. Nonetheless, the 8086's architecture is closely related to the 8080. The designers of the 8086 wanted it to be compatible with the 8080/8085, but the difference was too wide for binary compatibility or even assembly-language compatibility. Instead, the 8086 was designed so a program could translate 8080 assembly language to 8086 assembly language.17 To accomplish this, each 8080 register had a corresponding 8086 register and most 8080 instructions had corresponding 8086 instructions.

The 8086's instruction set was designed with a new concept, the "ModR/M" byte, which usually follows the opcode byte. The ModR/M byte specifies the memory addressing mode and the register (or registers) to use, allowing that information to be moved out of the opcode. For instance, where the 8080 had a quadrant of 64 instructions to move from register to register, the 8086 has a single move instruction, with the ModR/M byte specifying the particular source and destination. (The move instruction, however, has variants to handle byte vs. word operations, moves to or from memory, and so forth, so the 8086 ends up with a few move opcodes.) The ModR/M byte preserves the Datapoint 2200's concept of using the same instruction for memory and register operations, but allows a memory address to be provided in the instruction.
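
Here is a simplified sketch of how the ModR/M fields break down for the 8086's 16-bit addressing, written in Python purely as an illustration (it ignores displacement bytes, segment overrides, and the byte/word distinction selected by the opcode):

```python
# Split a ModR/M byte into its mod (2 bits), reg (3 bits), and r/m (3 bits) fields.
REG16 = ['AX', 'CX', 'DX', 'BX', 'SP', 'BP', 'SI', 'DI']
MEM   = ['BX+SI', 'BX+DI', 'BP+SI', 'BP+DI', 'SI', 'DI', 'BP', 'BX']

def decode_modrm(byte):
    mod = byte >> 6              # addressing mode
    reg = (byte >> 3) & 0b111    # register operand
    rm  = byte & 0b111           # register or memory operand
    if mod == 0b11:
        operand = REG16[rm]                        # register-to-register form
    elif mod == 0b00 and rm == 0b110:
        operand = '[disp16]'                       # direct memory address
    else:
        disp = {0b00: '', 0b01: '+disp8', 0b10: '+disp16'}[mod]
        operand = '[' + MEM[rm] + disp + ']'
    return REG16[reg], operand

print(decode_modrm(0xC3))   # ('AX', 'BX'): mod=11, a register-to-register operation
print(decode_modrm(0x47))   # ('AX', '[BX+disp8]'): note that BX can address memory
```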

The 8086 also cleans up some of the historical baggage in the instruction set, freeing up space in the precious 256 opcodes for new instructions. The conditional call and return instructions were eliminated, while the conditional jumps were expanded. The 8008's RST (Restart) instructions were eliminated, replaced by interrupt vectors.

The 8086 extended its registers to 16 bits and added several new registers. An Intel patent (below) shows that the 8086's registers were originally called A, B, C, D, E, H, and L, matching the Datapoint 2200. The A register was extended to the 16-bit XA register, while the BC, DE, and HL registers were used unchanged. When the 8086 was released, these registers were renamed to AX, CX, DX, and BX respectively.18 In particular, the HL register was renamed to BX; this is why BX can specify a memory address in the ModR/M byte, but AX, CX, and DX can't.

A patent diagram showing the 8086's registers with their original names. (MP, IJ, and IK are now known as BP, SI, and DI.) From patent US4449184.

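To summarize the register correspondence used when translating 8080 assembly to 8086 assembly, here is a small sketch based on the renaming described above (my own summary, not CONV86's actual output):

```python
# 8080 register (or register pair) -> corresponding 8086 register or operand.
PAIR_MAP = {'BC': 'CX', 'DE': 'DX', 'HL': 'BX', 'SP': 'SP'}
BYTE_MAP = {'A': 'AL', 'B': 'CH', 'C': 'CL', 'D': 'DH', 'E': 'DL',
            'H': 'BH', 'L': 'BL', 'M': '[BX]'}   # M becomes a memory operand through BX

print(BYTE_MAP['A'], PAIR_MAP['HL'])   # AL BX
```
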
The table below shows the 8086's instruction set, with "b", "w", and "i" indicating byte (8-bit), word (16-bit), and immediate instructions. The Datapoint 2200 instructions (colored) are all still supported. The number of Datapoint instructions looks small because the ModR/M byte collapses groups of old opcodes into a single new one. This opened up space in the opcode table, though, allowing the 8086 to have many new instructions as well as 16-bit instructions.19

  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
0 | ADD b | ADD w | ADD b | ADD w | ADD bi | ADD wi | PUSH ES | POP ES | INC AX | INC CX | INC DX | INC BX | INC SP | INC BP | INC SI | INC DI
1 | OR b | OR w | OR b | OR w | OR bi | OR wi | PUSH CS | - | DEC AX | DEC CX | DEC DX | DEC BX | DEC SP | DEC BP | DEC SI | DEC DI
2 | ADC b | ADC w | ADC b | ADC w | ADC bi | ADC wi | PUSH SS | POP SS | PUSH AX | PUSH CX | PUSH DX | PUSH BX | PUSH SP | PUSH BP | PUSH SI | PUSH DI
3 | SBB b | SBB w | SBB b | SBB w | SBB bi | SBB wi | PUSH DS | POP DS | POP AX | POP CX | POP DX | POP BX | POP SP | POP BP | POP SI | POP DI
4 | AND b | AND w | AND b | AND w | AND bi | AND wi | ES: | DAA | - | - | - | - | - | - | - | -
5 | SUB b | SUB w | SUB b | SUB w | SUB bi | SUB wi | CS: | DAS | - | - | - | - | - | - | - | -
6 | XOR b | XOR w | XOR b | XOR w | XOR bi | XOR wi | SS: | AAA | JO | JNO | JB | JNB | JZ | JNZ | JBE | JA
7 | CMP b | CMP w | CMP b | CMP w | CMP bi | CMP wi | DS: | AAS | JS | JNS | JPE | JPO | JL | JGE | JLE | JG
0 | GRP1 b | GRP1 w | GRP1 b | GRP1 w | TEST b | TEST w | XCHG b | XCHG w | - | - | RET | RET | LES | LDS | MOV b | MOV w
1 | MOV b | MOV w | MOV b | MOV w | MOV sr | LEA | MOV sr | POP | - | - | RETF | RETF | INT 3 | INT | INTO | IRET
2 | NOP | XCHG CX | XCHG DX | XCHG BX | XCHG SP | XCHG BP | XCHG SI | XCHG DI | Shift b | Shift w | Shift b | Shift w | AAM | AAD | - | XLAT
3 | CBW | CWD | CALL | WAIT | PUSHF | POPF | SAHF | LAHF | ESC 0 | ESC 1 | ESC 2 | ESC 3 | ESC 4 | ESC 5 | ESC 6 | ESC 7
4 | MOV AL,M | MOV AX,M | MOV M,AL | MOV M,AX | MOVS b | MOVS w | CMPS b | CMPS w | LOOPNZ | LOOPZ | LOOP | JCXZ | IN b | IN w | OUT b | OUT w
5 | TEST b | TEST w | STOS b | STOS w | LODS b | LODS w | SCAS b | SCAS w | CALL | JMP | JMP | JMP | IN b | IN w | OUT b DX | OUT w DX
6 | MOV AL,i | MOV CL,i | MOV DL,i | MOV BL,i | MOV AH,i | MOV CH,i | MOV DH,i | MOV BH,i | LOCK | - | REPNZ | REPZ | HLT | CMC | GRP3a | GRP3b
7 | MOV AX,i | MOV CX,i | MOV DX,i | MOV BX,i | MOV SP,i | MOV BP,i | MOV SI,i | MOV DI,i | CLC | STC | CLI | STI | CLD | STD | GRP4 | GRP5

The 8086 has a 16-bit flags register, shown below, but the low byte remained compatible with the 8080. The four highlighted flags (sign, zero, parity, and carry) are the ones originating in the Datapoint 2200.

The flag word of the 8086 contains the original Datapoint 2200 flags.

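As a quick illustration, here is where the four Datapoint-era flags sit in the 8080-compatible low byte of the 8086 flag word (bit positions from the standard FLAGS layout; the helper function is my own):

```python
# Bit positions of the four Datapoint 2200 flags in the low byte of FLAGS.
DATAPOINT_FLAGS = {'carry': 0, 'parity': 2, 'zero': 6, 'sign': 7}

def decode_flags(flags):
    return {name: (flags >> bit) & 1 for name, bit in DATAPOINT_FLAGS.items()}

print(decode_flags(0x44))   # zero and parity set, carry and sign clear
```
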
Modern x86 and x86-64

The modern x86 architecture has extended the 8086 to a 32-bit architecture (IA-32) and a 64-bit architecture (x86-6420), but the Datapoint features remain. At startup, an x86 processor runs in "real mode", which operates like the original 8086. More interesting is 64-bit mode, which has some major architectural changes. In 64-bit mode, the 8086's general-purpose registers are extended to sixteen 64-bit registers (and soon to be 32 registers). However, the original Datapoint registers are special and can still be accessed as byte registers within the corresponding 64-bit register; these are highlighted in the table below.21

General purpose registers in x86-64. From Intel Software Developer's Manual.

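For example, the Datapoint-era byte registers are simply the low bytes of the corresponding 64-bit register. Here is a small Python illustration of the aliasing (my own, using masks rather than real register accesses):

```python
# AL, AH, AX, and EAX are all views of the low bits of RAX.
rax = 0x1122334455667788

al  = rax & 0xFF            # lowest byte  -> 0x88
ah  = (rax >> 8) & 0xFF     # next byte up -> 0x77
ax  = rax & 0xFFFF          # 16-bit AX    -> 0x7788
eax = rax & 0xFFFFFFFF      # 32-bit EAX   -> 0x55667788

print(hex(al), hex(ah), hex(ax), hex(eax))
```
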
The flag register of the 8086 was extended to 32 bits or 64 bits in x86. As the diagram below shows, the original Datapoint 2200 status flags are still there (highlighted in yellow).

The 32-bit and 64-bit flags of x86 contain the original Datapoint 2200 flags. From Intel Software Developer's Manual.

The instruction set in x86 has been extended from the 8086, mostly through prefixes, but the instructions from the Datapoint 2200 are still there. The ModR/M byte was changed in 32-bit mode so the BX (originally HL) register is no longer special when accessing memory (although it's still special with 16-bit addressing, until Intel removes that in the upcoming x86-S simplification.) I/O ports still exist in x86, although they are viewed as more of a legacy feature: modern I/O devices typically use memory-mapped I/O instead of I/O ports. To summarize, fifty years later, x86-64 is slowly moving away from some of the Datapoint 2200 features, but they are still there.

Conclusions

The modern x86 architecture is descended from the Datapoint 2200's architecture. Because there is backward-compatibility at each step, you should theoretically be able to take a Datapoint 2200 binary, disassemble it to 8008 assembly, automatically translate it to 8080 assembly, automatically convert it to 8086 assembly, and then run it on a modern x86 processor. (The I/O devices would be different and cause trouble, of course.)

The Datapoint 2200's complete instruction set, its flags, and its little-endian architecture have persisted into current processors. This shows the critical importance of backward compatibility to customers. While Intel keeps attempting to create new architectures (iAPX 432, i960, i860, Itanium), customers would rather stay on a compatible architecture. Remarkably, Intel has managed to move from 8-bit computers to 16, 32, and 64 bits, while keeping systems mostly compatible. As a result, design decisions made for the Datapoint 2200 over 50 years ago are still impacting modern computers. Will processors still have the features of the Datapoint 2200 another fifty years from now? I wouldn't be surprised.22

Thanks to Joe Oberhauser for suggesting this topic. I plan to write more on the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected] so you can follow me there too.

Notes and references

  1. Shift-register memory was also used in the TV Typewriter (1973) and the display storage of the Apple I (1976). However, dynamic RAM (DRAM) rapidly dropped in price, making shift-register memory obsolete by the mid 1970s. (I wrote about the Intel 1405 shift register memory in detail in this article.) 

  2. For comparison, the popular PDP-8 minicomputer had just two main registers: the accumulator and a multiplier-quotient register; instructions typically operated on the accumulator and a memory location. The Data General Nova, a minicomputer released in 1969, had four accumulator / index registers. Mainframes generally had many more registers; the IBM System/360 (1964), for instance, had 16 general registers and four floating-point registers. 

  3. On the hardware side, instructions were decoded with BCD-to-decimal decoder chips (type 7442). These decoders normally decoded a 4-bit BCD value into one of 10 output lines. In the Datapoint 2200, they decoded a 3-bit value into one of 8 output lines, and the other two lines were ignored. This allowed the high-bit line to be used as a selection line; if it was set, none of the 8 outputs would be active. 

  4. These bit patterns map cleanly onto octal, so the opcodes are clearest when specified in octal. This octal structure has persisted in Intel processors including modern x86 processors. Unfortunately, Intel invariably specifies the opcodes in hexadecimal rather than octal, which obscures the underlying structure. This structure is described in detail in The 80x86 is an Octal Machine

  5. It is unusual for an instruction set to require memory addresses to be loaded into a register in order to access memory. This technique was common in microcode, where memory addresses were loaded into the Memory Address Register (MAR). As pwg pointed out, the CDC mainframes (e.g. 6600) had special address registers; when you changed an address register, the specified memory location was read or written to the corresponding operand register automatically.

    At first, I thought that serial memory might motivate the use of an address register, but I don't think there's a connection. Most likely, the Datapoint 2200 used these techniques to create a simple, orthogonal instruction set that was easy to decode, and they weren't particularly concerned with performance. 

  6. The instruction tables in this article are different from most articles, because I use octal instead of hexadecimal. (Displaying an octal-based instruction in a hexadecimal table obscures much of the underlying structure.) To display the table in octal, I break it into four quadrants based on the top octal digit of a three-digit opcode: 0, 1, 2, or 3. The digit 0-7 along the left is the middle octal digit and the digit along the top is the low octal digit. 

  7. The regular pattern of Load instructions is broken by the NOP and HALT instructions. All the register-to-register load instructions along the diagonal accomplish nothing since they move a register to itself, but only the first one is explicitly called NOP. Moving a memory location to itself doesn't make sense, so its opcode is assigned the HALT instruction. Note that the all-0's opcode and the all-1's opcode are both HALT instructions. This is useful since it can stop execution if the program tries executing uninitialized memory. 

  8. You might think that Datapoint and Intel used the same ALU operations simply because they are the obvious set of 8 operations. However, if you look at other processors around that time, they use a wide variety of ALU operations. Similarly, the status flags in the Datapoint 2200 aren't the obvious set; systems with four flags typically used Sign, Carry, Zero, and Overflow (not Parity). Parity is surprisingly expensive to implement on a standard processor, but (as Philip Freidin pointed out) parity is cheap on a serial processor like the Datapoint 2200. Intel processors didn't provide an Overflow flag until the 8086; even the 8080 didn't have it although the Motorola 6800 and MOS 6502 did. The 8085 implemented an overflow flag (V) but it was left undocumented. 

  9. You might wonder if the Datapoint 2200 (and 8008) could be considered RISC processors since they have simple, easy-to-decode instruction sets. I think it is a mistake to try to wedge every processor into the RISC or CISC categories (Reduced Instruction Set Computer or Complex Instruction Set Computer). In particular, the Datapoint 2200 wasn't designed with the RISC philosophy (make a processor more powerful by simplifying the instruction set), its instruction set architecture is very different from RISC chips, and its implementation is different from RISC chips. Similarly, it wasn't designed with a CISC philosophy (make a processor more powerful by narrowing the semantic gap with high-level languages) and it doesn't look like a CISC chip.

    So where does that leave the Datapoint 2200? In "RISC: Back to the future?", famed computer architect Gordon Bell uses the term MISC (Minimal Instruction Set Computer) to describe the architecture of simple, early computers and microprocessors such as the Manchester Mark I (1948), the PDP-8 minicomputer (1966), and the Intel 4004 (1971). Computer architecture evolved from these early hardwired "simple computers" to microprogrammed processors, processors with cache, and hardwired, pipelined processors. "Minimal Instruction Set Computer" seems like a good description of the Datapoint 2200, since it is about the smallest, simplest processor that could get the job done. 

  10. Many people think that the Intel 8008 is an extension of the 4-bit Intel 4004 processor, but they are completely unrelated aside from the part numbers. The Intel 4004 is a 4-bit processor designed to implement a calculator for a company called Busicom. Its architecture is completely different from the 8008. In particular, the 4004 is a "Harvard architecture" system, with data storage and instruction storage completely separate. The 4004 also has a fairly strange instruction set, designed for calculators. For instance, it has a special instruction to convert a keyboard scan code to binary. The 4004 team and the 8008 team at Intel had many people in common, however, so the two chips have physical layouts (floorplans) that are very similar. 

  11. In this article, I'm focusing on the Datapoint 2200 Version I. Any time I refer to the Datapoint 2200, I mean the version I specifically. The Version II has an expanded instruction set, but it was expanded in an entirely different direction from the Intel 8080, so it's not relevant to this post. The Version II is interesting, however, since it provides a perspective of how the Intel 8080 could have developed in an "alternate universe". 

  12. Federico Faggin wrote The Birth of the Microprocessor in Byte Magazine, March 1992. This article describes in some detail the creation of the 8008 and 8080.

    The Oral History of the 8080 discusses many of the problems with the 8008 and how the 8080 addressed them. (See page 4.) Masatoshi Shima, one of the architects of the 4004, described five problems with the 8008: It was slow because it used two clock cycles per state. It had no general-purpose stack and was weak with interrupts. It had limited memory and I/O space. The instruction set was primitive, with only 8-bit data, limited addressing, and a single address pointer register. Finally, the system bus required a lot of interface circuitry. (See page 7.) 

  13. The 8080 is often said to be the "first truly usable microprocessor". Supposedly the source of this quote is Forgotten PC history, but the statement doesn't appear there. I haven't been able to find the original source of this statement, so let me know. In any case, I don't think that statement is particularly accurate, as the Motorola 6800 was "truly usable" and came out before the Intel 8080.

    The 8080 was first in one important way, though: it was Intel's first microprocessor that was designed with feedback from customers. Both the 4004 and the 8008 were custom chips for a single company. The 8080, however, was based on extensive customer feedback about the flaws in the 8008 and what features customers wanted. The 8080 oral history discusses this in more detail. 

  14. The 8008 was built with PMOS circuitry, while the 8080 was built with NMOS. This may seem like a trivial difference, but NMOS provided much superior performance. NMOS became the standard microprocessor technology until the rise of CMOS in the 1980s, combining NMOS and PMOS to dramatically reduce power consumption.

    Another key hardware improvement was that the 8080 used a 40-pin package, compared to the 18-pin package of the 8008. Intel had long followed the "religion" of small 16-pin packages, and only reluctantly moved to 18 pins (as in the 8008). However, by the time the 8080 was introduced, Intel recognized the utility of industry-standard 40-pin packages. The additional pins made the 8080 much easier to interface to a system. Moreover, the 8080's 16-bit address bus supported four times the memory of the 8008's 14-bit address bus. (The 40-pin package was still small for the time; some companies used 50-pin or 64-pin packages for microprocessors.) 

  15. The 8080 is not binary-compatible with the 8008 because almost all the instructions were shifted to different opcodes. One important but subtle change was that the 8 register/memory codes were reordered to start with B instead of A. The motivation is that this gave registers in a 16-bit register pair (BC, DE, or HL) codes that differ only in the low bit. This makes it easier to specify a register pair with a two-bit code. 

  16. Stan Mazor (one of the creators of the 4004 and 8080) explained that the 8085 removed 10 of the 12 new instructions because "they would burden the 8086 instruction set." Because the decision came near the 8085's release, they would "leave all 12 instructions on the already designed 8085 CPU chip, but document and announce only two of them" since modifying a CPU is hard but modifying a CPU's paper reference manual is easy.

    Several of the Intel 8086 engineers provided a similar explanation in Intel Microprocessors: 8008 to 8086: While the 8085 provided the new RIM and SIM instructions, "several other instructions that had been contemplated were not made available because of the software ramifications and the compatibility constraints they would place on the forthcoming 8086."

    For more information on the 8085's undocumented instructions, see Unspecified 8085 op codes enhance programming. The two new condition flags were V (2's complement overflow) and X5 (underflow on decrement or overflow on increment). The opcodes were DSUB (double (i.e. 16-bit) subtraction), ARHL (arithmetic shift right of HL), RDEL (rotate DE left through carry), LDHI (load DE with HL plus an immediate byte), LDSI (load DE with SP plus an immediate byte), RSTV (restart on overflow), LHLX (load HL indirect through DE), SHLX (store HL indirect through DE), JX5 (jump on X5), and JNX5 (jump on not X5). 

  17. Conversion from 8080 assembly code to 8086 assembly code was performed with a tool called CONV86. Each line of 8080 assembly code was converted to the corresponding line (or sometimes a few lines) of 8086 assembly code. The program wasn't perfect, so it was expected that the user would need to do some manual editing. In particular, CONV86 couldn't handle self-modifying code, where the program changed its own instructions. (Nowadays, self-modifying code is almost never used, but it was more common in the 1970s in order to make code smaller and get more performance.) CONV86 also didn't handle the 8085's RIM and SIM instructions, recommending a rewrite if code used these instructions heavily.

    Writing programs in 8086 assembly code manually was better, of course, since the program could take advantage of the 8086's new features. Moreover, a program converted by CONV86 might be 25% larger, due to the 8086's use of two-byte instructions and inefficiencies in the conversion. 

  18. This renaming is why the instruction set has the registers in the order AX, CX, DX, BX, rather than in alphabetical order as you might expect. The other factor is that Intel decided that AX, BX, CX, and DX corresponded to Accumulator, Base, Count, and Data, so they couldn't assign the names arbitrarily. 

  19. A few notes on how the 8086's instructions relate to the earlier machines, since the ModR/M byte and 8- vs. 16-bit instructions make things a bit confusing. For an instruction like ADD, I have three 8-bit opcodes highlighted: an add to memory/register, an add from memory/register, and an immediate add. The neighboring unhighlighted opcodes are the corresponding 16-bit versions. Likewise, for MOV, I have highlighted the 8-bit moves to/from a register/memory. 

  20. Since the x86's 32-bit architecture is called IA-32, you might expect that IA-64 would be the 64-bit architecture. Instead, IA-64 is the completely different architecture used in the ill-fated Itanium. IA-64 was supposed to replace IA-32, despite being completely incompatible. Since AMD was cut out of IA-64, AMD developed their own 64-bit extension of the existing x86 architecture and called it AMD64. Customers flocked to this architecture while the Itanium languished. Intel reluctantly copied the AMD64 architecture, calling it Intel 64. 

  21. The x86 architecture allows byte access to certain parts of the larger registers (accessing AL, AH, etc.) as well as word and larger accesses. These partial-width reads and writes to registers make the implementation of the processor harder due to register renaming. The problem is that writing to part of a register means that the register's value is a combination of the old and new values. The Register Alias Table in the P6 architecture deals with this by adding a size field to each entry. If you write a short value and then read a longer value, the pipeline stalls to figure out the right value. Moreover, some 16-bit code uses the two 8-bit parts of a register as independent registers. To support this, the Register Alias Table keeps separate entries for the high and low byte. (For details, see the book Modern Processor Design, in particular the chapter on Intel's P6 Microarchitecture.) The point of this is that obscure features of the Datapoint 2200 (such as H and L acting as a combined register) can cause implementation difficulties 50 years later. 

  22. Some miscellaneous references: For a detailed history of the Datapoint 2200, see Datapoint: The Lost Story of the Texans Who Invented the Personal Computer Revolution. The 8008 oral history provides a lot of interesting information on the development of the 8008. For another look at the Datapoint 2200 and instruction sets, see Comparing Datapoint 2200, 8008, 8080 and Z80 Instruction Sets

A close look at the 8086 processor's bus hold circuitry

The Intel 8086 microprocessor (1978) revolutionized computing by founding the x86 architecture that continues to this day. One of the lesser-known features of the 8086 is the "hold" functionality, which allows an external device to temporarily take control of the system's bus. This feature was most important for supporting the 8087 math coprocessor chip, which was an option on the IBM PC; the 8087 used the bus hold so it could interact with the system without conflicting with the 8086 processor.

This blog post explains in detail how the bus hold feature is implemented in the processor's logic. (Be warned that this post is a detailed look at a somewhat obscure feature.) I've also found some apparently undocumented characteristics of the 8086's hold acknowledge circuitry, designed to make signal transitions faster on the shared control lines.

The die photo below shows the main functional blocks of the 8086 processor. In this image, the metal layer on top of the chip is visible, while the silicon and polysilicon underneath are obscured. The 8086 is partitioned into a Bus Interface Unit (upper) that handles bus traffic, and an Execution Unit (lower) that executes instructions. The two units operate mostly independently, which will turn out to be important. The Bus Interface Unit handles read and write operations as requested by the Execution Unit. The Bus Interface Unit also prefetches instructions that the Execution Unit uses when it needs them. The hold control circuitry is highlighted in the upper right; it takes a nontrivial amount of space on the chip. The square pads around the edge of the die are connected by tiny bond wires to the chip's 40 external pins. I've labeled the MN/MX, HOLD, and HLDA pads; these are the relevant signals for this post.

The 8086 die under the microscope, with the main functional blocks and relevant pins labeled. Click this image (or any other) for a larger version.

How bus hold works

In an 8086 system, the processor communicates with memory and I/O devices over a bus consisting of address and data lines along with various control signals. For high-speed data transfer, it is useful for an I/O device to send data directly to memory, bypassing the processor; this is called DMA (Direct Memory Access). Moreover, a co-processor such as the 8087 floating point unit may need to read data from memory. The bus hold feature supports these operations: it is a mechanism for the 8086 to give up control of the bus, letting another device use the bus to communicate with memory. Specifically, an external device requests a bus hold and the 8086 stops putting electrical signals on the bus and acknowledges the bus hold. The other device can now use the bus. When the other device is done, it signals the 8086, which then resumes its regular bus activity.

Most things in the 8086 are more complicated than you might expect, and the bus hold feature is no exception, largely due to the 8086's minimum and maximum modes. The 8086 can be designed into a system in one of two ways—minimum mode and maximum mode—that redefine the meanings of the 8086's external pins. Minimum mode is designed for simple systems and gives the control pins straightforward meanings such as indicating a read versus a write. Minimum mode provides bus signals similar to those of the earlier 8080 microprocessor, making migration to the 8086 easier. On the other hand, maximum mode is designed for sophisticated, multiprocessor systems and encodes the control signals to provide richer system information.

In more detail, minimum mode is selected if the MN/MX pin is wired high, while maximum mode is selected if the MN/MX pin is wired low. Nine of the chip's pins have different meanings depending on the mode, but only two pins are relevant to this discussion. In minimum mode, pin 31 has the function HOLD, while pin 30 has the function HLDA (Hold Acknowledge). In maximum mode, pin 31 has the function RQ/GT0', while pin 30 has the function RQ/GT1'.

I'll start by explaining how a hold operation works in minimum mode. When an external device wants to use the bus, it pulls the HOLD pin high. At the end of the current bus cycle, the 8086 acknowledges the hold request by pulling HLDA high. The 8086 also puts its bus output pins into "tri-state" mode, in effect disconnecting them electrically from the bus. When the external device is done, it pulls HOLD low and the 8086 regains control of the bus. Don't worry about the details of the timing below; the key point is that a device pulls HOLD high and the 8086 responds by pulling HLDA high.

This diagram shows the HOLD/HLDA sequence. From iAPX 86,88 User's Manual, Figure 4-14.

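To make the handshake concrete, here is a toy software model of the minimum-mode protocol from the requesting device's point of view; the class and function names are hypothetical, not anything from Intel's documentation:

```python
# Toy simulation of the HOLD/HLDA handshake (hypothetical names, my own model).
class Bus:
    def __init__(self):
        self.hold = 0    # driven by the external device
        self.hlda = 0    # driven by the 8086

def cpu_step(bus):
    # The 8086 acknowledges a hold at the end of its current bus cycle
    # and resumes driving the bus once HOLD is dropped.
    bus.hlda = 1 if bus.hold else 0

def device_dma(bus):
    bus.hold = 1             # request the bus
    cpu_step(bus)            # (in hardware, this wait can span a bus cycle)
    assert bus.hlda == 1     # 8086 has tri-stated its bus outputs
    print("device transfers data while the 8086's Execution Unit keeps running")
    bus.hold = 0             # release the bus
    cpu_step(bus)
    assert bus.hlda == 0     # 8086 resumes normal bus activity

device_dma(Bus())
```
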
The 8086's maximum mode is more complex, allowing two other devices to share the bus by using a priority-based scheme. Maximum mode uses two bidirectional signals, RQ/GT0 and RQ/GT1.2 When a device wants to use the bus, it issues a pulse on one of the signal lines, pulling it low. The 8086 responds by pulsing the same line. When the device is done with the bus, it issues a third pulse to inform the 8086. The RQ/GT0 line has higher priority than RQ/GT1, so if two devices request the bus at the same time, the RQ/GT0 device wins and the RQ/GT1 device needs to wait.1 Keep in mind that the RQ/GT lines are bidirectional: the 8086 and the external device both use the same line for signaling.

This diagram shows the request/grant sequence. From iAPX 86,88 User's Manual, Figure 4-16.

The bus hold does not completely stop the 8086. The hold operation stops the Bus Interface Unit, but the Execution Unit will continue executing instructions until it needs to perform a read or write, or it empties the prefetch queue. Specifically, the hold signal blocks the Bus Interface Unit from starting a memory cycle and blocks an instruction prefetch from starting.

Bus sharing and the 8087 coprocessor

Probably the most common use of the bus hold feature was to support the Intel 8087 math coprocessor. The 8087 coprocessor greatly improved the performance of floating-point operations, making them up to 100 times faster. As well as floating-point arithmetic, the 8087 supported trigonometric operations, logarithms and powers. The 8087's architecture became part of later Intel processors, and the 8087's instructions are still a part of today's x86 computers.3

The 8087 had its own registers and didn't have access to the 8086's registers. Instead, the 8087 could transfer values to and from the system's main memory. Specifically, the 8087 used the RQ/GT mechanism (maximum mode) to take control of the bus if the 8087 needed to transfer operands to or from memory.4 The 8087 could be installed as an option on the original IBM PC, which is why the IBM PC used maximum mode.

The enable flip-flop

The hold circuit is built from six flip-flops. These flip-flops are a bit different from typical D flip-flops, so I'll discuss their behavior before explaining the circuit.

A flip-flop can store a single bit, 0 or 1. Flip flops are very important in the 8086 because they hold information (state) in a stable way, and they synchronize the circuitry with the processor's clock. A common type of flip-flop is the D flip-flop, which takes a data input (D) and stores that value. In an edge-triggered flip-flop, this storage happens on the edge when the clock changes state from low to high.5 (Except at this transition, the input can change without affecting the output.) The output is called Q, while the inverted output is called Q-bar.

The symbol for the D flip-flop with enable.

Many of the 8086's flip-flops, including the ones in the hold circuit, have an "enable" input. When the enable input is high, the flip-flop records the D input, but when the enable input is low, the flip-flop keeps its previous value. Thus, the enable input allows the flip-flop to hold its value for more than one clock cycle. The enable input is very important to the functioning of the hold circuit, as it is used to control when the circuit moves to the next state.
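
Here is a behavioral sketch of such a flip-flop in Python, just to pin down the enable semantics (my own model, not derived from the 8086's actual circuitry):

```python
# On each rising clock edge, the stored bit updates only if enable is high.
class DFlipFlopWithEnable:
    def __init__(self):
        self.q = 0

    def clock_edge(self, d, enable):
        if enable:
            self.q = d     # capture the data input on this clock edge
        return self.q      # otherwise hold the previous value

ff = DFlipFlopWithEnable()
print(ff.clock_edge(d=1, enable=0))   # 0: enable low, old value held
print(ff.clock_edge(d=1, enable=1))   # 1: enable high, new value latched
print(ff.clock_edge(d=0, enable=0))   # 1: held again across this clock
```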

How bus hold is implemented (minimum mode)

I'll start by explaining how the hold circuitry works in minimum mode. To review, in minimum mode the external device requests a hold through the HOLD input, keeping the input high for the duration of the request. The 8086 responds by pulling the hold acknowledge HLDA output high for the duration of the hold.

In minimum mode, only three of the six flip-flops are relevant. The diagram below is highly simplified to show the essential behavior. (The full schematic is in the footnotes.6) At the left is the HOLD signal, the request from the external device.

Simplified diagram of the circuitry for minimum mode.

When a HOLD request comes in, the first flip-flop is activated and remains activated for the duration of the request. The second flip-flop waits if any condition is blocking the hold request: a LOCK instruction, an unaligned memory access, and so forth. When the HOLD can proceed, the second flip-flop is enabled and it latches the request. The second flip-flop controls the internal hold signal, causing the 8086 to stop further bus activity. The third flip-flop is then activated when the current bus cycle (if any) completes; when it latches the request, the hold is "official". The third flip-flop drives the external HLDA (Hold Acknowledge) pin, indicating that the bus is free. This signal also clears the bus-enabled latch (elsewhere in the 8086), putting the appropriate pins into floating tri-state mode. The key point is that the flip-flops control the timing of the internal hold and the external HLDA, moving to the next step as appropriate.

When the external device signals an end to the hold by pulling the HOLD pin low, the process reverses. The three flip-flops return to their idle state in sequence. The second flip-flop clears the internal hold signal, restarting bus activity. The third flip-flop clears the HLDA pin.7

How bus hold is implemented (maximum mode)

The implementation of maximum mode is tricky because it uses the same circuitry as minimum mode, but the behavior is different in several ways. First, minimum mode and maximum mode operate on opposite polarity: a hold is requested by pulling HOLD high in minimum mode versus pulling a request line low in maximum mode. Moreover, in minimum mode, a request on the HOLD pin triggers a response on the opposite pin (HLDA), while in maximum mode, a request and response are on the same pin. Finally, using the same pin for the request and grant signals requires the pin to act as both an input and an output, with tricky electrical properties.

In maximum mode, the top three flip-flops handle the request and grant on line 0, while the bottom three flip-flops handle line 1. At a high level, these flip-flops behave roughly the same as in the minimum mode case, with the first flip-flop tracking the hold request, the second flip-flop activated when the hold is "approved", and the third flip-flop activated when the bus cycle completes. An RQ 0 input will generate a GT 0 output, while an RQ 1 input will generate a GT 1 output. The diagram below is highly simplified, but illustrates the overall behavior. Keep in mind that RQ 0, GT 0, and HOLD use the same physical pin, as do RQ 1, GT 1, and HLDA.

Simplified diagram of the circuitry for maximum mode.

Simplified diagram of the circuitry for maximum mode.

In more detail, the first flip-flop converts the pulse request input into a steady signal. This is accomplished by configuring the first flip-flop to toggle on when the request pulse is received and toggle off when the end-of-request pulse is received.10 The toggle action is implemented by feeding the output back to the input, inverted (A); since the flip-flop is enabled by the RQ input, the flip-flop holds its value until an input pulse arrives. One tricky part is that the acknowledge pulse must not toggle the flip-flop. This is accomplished by using the output signal to block the toggle. (To keep the diagram simple, I've just noted the "block" action rather than showing the logic.)
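
Here's a rough Python model of that pulse-to-level conversion; the class and argument names are mine, and the real circuit is of course a flip-flop and a few gates rather than method calls:

```python
class RequestLatch:
    """Toy model of the first maximum-mode flip-flop."""
    def __init__(self):
        self.q = 0  # 1 while a hold request is pending or active

    def clock(self, rq_pulse, grant_pulse):
        # The RQ pin enables the flip-flop, and the inverted output is fed
        # back to D (point A), so each request or end-of-request pulse
        # toggles the stored value. The grant pulse arrives on the same pin,
        # so the output is used to block it from toggling the latch.
        if rq_pulse and not grant_pulse:
            self.q = 1 - self.q
        return self.q
```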

As before, the second flip-flop is blocked until the hold is "authorized" to proceed. However, the circuitry is more complicated since it must prioritize the two request lines and ensure that only one hold is granted at a time. If RQ0's first flip-flop is active, it blocks the enable of RQ1's second flip-flop (B). Conversely, if RQ1's second flip-flop is active, it blocks the enable of RQ0's second flip-flop (C). Note the asymmetry: the blocking comes from RQ0's first flip-flop but from RQ1's second flip-flop. This enforces the priority of RQ0 over RQ1, since a mere RQ0 request blocks RQ1, but only an RQ1 "approval" blocks RQ0.
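
Reduced to approximate Boolean logic, the enables of the two second ("accepted") flip-flops look something like the Python sketch below. The names are mine, and hold_ok is the condition discussed in the next section:

```python
def accept_enables(rq0_requested, rq1_accepted, hold_ok):
    """Sketch of the asymmetric priority between the two request lines.

    rq0_requested: output of RQ0's first (request) flip-flop
    rq1_accepted:  output of RQ1's second (accepted) flip-flop
    """
    # RQ0's acceptance is blocked only by an already-accepted RQ1 hold (C).
    enable_rq0_accept = hold_ok and not rq1_accepted
    # RQ1's acceptance is blocked by a mere RQ0 request (B), so RQ0 wins ties.
    enable_rq1_accept = hold_ok and not rq0_requested
    return enable_rq0_accept, enable_rq1_accept
```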

When the second flip-flop is activated in either path, it triggers the internal hold signal (D).8 As before, the hold request is latched into the third flip-flop when any existing memory cycle completes. When the hold request is granted, a pulse is generated (E) on the corresponding GT pin.9

The same circuitry is used for minimum mode and maximum mode, even though the diagrams above show differences between the two modes. How does this work? Essentially, logic gates change the behavior between minimum mode and maximum mode as required. For the most part, the circuitry works the same way in both modes, so only a moderate amount of extra logic is needed. On the schematic, the signal MN is active during minimum mode, while MX is active during maximum mode, and these signals control the behavior.

The "hold ok" circuit

As usually happens with the 8086, there are a bunch of special cases when different features interact. One special case is if a bus hold request comes in while the 8086 is acknowledging an interrupt. In this case, the interrupt takes priority and the bus hold is not processed until the interrupt acknowledgment is completed. A second special case is if the bus hold occurs while the 8086 is halted. In this case, the 8086 issues a second HALT indicator at the end of the bus hold. Yet another special case is the 8086's LOCK prefix, which locks the use of the bus for the following instruction, so a bus hold request is not honored until the locked instruction has completed. Finally, the 8086 performs an unaligned word access to memory by breaking it into two 8-bit bus cycles; these two cycles can't be interrupted by a bus hold.

In more detail, the "hold ok" circuit determines at each cycle if a hold could proceed. There are several conditions under which the hold can proceed:

  • The bus cycle is `T2`, except if an unaligned bus operation is taking place (i.e. a word split into two byte operations), or
  • A memory cycle is not active and a microcode memory operation is not taking place, or
  • A memory cycle is not active and a hold is currently active.

The first case occurs during bus (memory) activity, where a hold request before cycle T2 will be handled at the end of that cycle. The second case allows a hold if the bus is inactive. But if microcode is performing a memory operation, the hold will be delayed, even if the request is just starting. The third case is the opposite of the other two: it enables the flip-flop so a hold request can be dropped. (This ensures that the hold request can still be dropped in the corner case where a hold starts and then the microcode makes a memory request, which will be blocked by the hold.)

The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear.

The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear.
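
Putting the three cases from the list above together, the condition amounts to roughly the following Python expression; the argument names are my own stand-ins for the internal signals, not Intel's:

```python
def hold_ok(t_state, unaligned_access, memory_cycle_active,
            microcode_memory_op, hold_active):
    """Approximate form of the 'hold ok' condition evaluated each cycle."""
    # Case 1: in a bus cycle, a hold can be taken at T2 unless this is one
    # half of an unaligned word access (two byte cycles that must not split).
    case1 = t_state == 'T2' and not unaligned_access
    # Case 2: the bus is idle and microcode isn't performing a memory operation.
    case2 = not memory_cycle_active and not microcode_memory_op
    # Case 3: a hold is already active; keep the flip-flop enabled so the
    # hold request can also be *dropped*.
    case3 = not memory_cycle_active and hold_active
    return case1 or case2 or case3
```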

An instruction with the LOCK prefix causes the bus to be locked against other devices for the duration of the instruction, so a hold cannot be granted while the instruction is running. This is implemented through a separate path: logic between the output of the first (request) flip-flop and the second (accepted) flip-flop, tied into the LOCK signal. Conceptually, it seems that the LOCK signal should block hold-ok and thus block the second (accepted) flip-flop from being enabled. But instead, the LOCK signal blocks the data path, unless the request has already been granted. I think the motivation is to allow the dropping of a hold request to proceed uninterrupted. In other words, LOCK prevents a hold from being accepted, but it doesn't prevent a hold from being dropped, and it was easier to implement this in the data path.
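
My reading of that data-path gating, boiled down to a single Boolean expression (in Python, with invented names, and simplified from the actual gates):

```python
def accepted_ff_d_input(hold_requested, hold_already_granted, lock_active):
    # LOCK masks a *new* hold request on its way into the second flip-flop,
    # but an already-granted hold passes through, so releasing the request
    # (hold_requested going low) still propagates and the hold can be dropped.
    return hold_requested and (not lock_active or hold_already_granted)
```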

The pin drive circuitry

The circuitry for the HOLD/RQ0/GT0 and HLDA/RQ1/GT1 pins is somewhat complicated, since the pins are used for both input and output. In minimum mode, the HOLD pin is an input, while the HLDA pin is an output. In maximum mode, both pins act as inputs, with a low-going pulse from an external device starting or stopping the hold; the 8086 also issues pulses on these pins to grant the hold. Pull-up resistors inside the 8086 ensure that the pins remain high (idle) when unused. Finally, an undocumented active pull-up system restores a pin to a high state after it is pulled low, providing a faster response than the resistor alone.

The schematic below shows the heart of the tri-state output circuit. Each pin is connected to two large output MOSFETs, one to drive the pin high and one to drive the pin low. The transistors have separate control lines; if both control lines are low, both transistors are off and the pin floats in the "tri-state" condition. This permits the pin to be used as an input, driven by an external device. The pull-up resistor keeps the pin in a high state.

The tri-state output circuit for each hold pin.

The tri-state output circuit for each hold pin.
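
As a summary of the drive logic, here's a minimal Python sketch of the output stage's three states; the function and argument names are mine:

```python
def tri_state_pin(drive_high, drive_low):
    """Model of an output stage with separate high-side and low-side transistors."""
    assert not (drive_high and drive_low), "both transistors on would fight"
    if drive_high:
        return 1    # high-side transistor pulls the pin high
    if drive_low:
        return 0    # low-side transistor pulls the pin low
    return 'Z'      # both off: the pin floats (tri-state), held high weakly by
                    # the pull-up resistor unless an external device drives it
```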

The diagram below shows how this circuitry looks on the die. In this image, the metal and polysilicon layers have been removed with acid to show the underlying doped silicon regions. The thin white stripes are transistor gates where polysilicon wiring crossed the silicon. The black circles are vias that connected the silicon to the metal on top. The empty regions at the right are where the metal pads for HOLD and HLDA were. Next to the pads are the large transistors to pull the outputs high or low. Because the outputs require much higher current than internal signals, these transistors are much larger than logic transistors. They are composed of several transistors placed in parallel, resulting in the parallel stripes. The small pull-up resistors are also visible. For efficiency, these resistors are actually depletion-mode transistors, specially doped to act as constant-current sources.

The HOLD/HLDA pin circuitry on the die.

The HOLD/HLDA pin circuitry on the die.

At the left, some of the circuitry is visible. The large output transistors are driven by "superbuffers" that provide more current than regular NMOS buffers. (A superbuffer uses separate transistors to pull the signal high and low, rather than using a pull-up to pull the signal high as in a standard NMOS gate.) The small transistors are the pass transistors that gate output signals according to the clock. The thick rectangles are crossovers, allowing the vertical metal wiring (no longer visible) to cross over a horizontal signal in the silicon layer. The 8086 has only a single metal layer, so the layout requires a crossover if signals would otherwise intersect. Because silicon's resistance is much higher than metal's, the crossover is relatively wide to reduce its resistance.

The problem with a pull-up resistor is that it is relatively slow when connected to a line with high capacitance. You essentially end up with a resistor-capacitor delay circuit, as the resistor slowly charges the line and brings it up to a high level. To get around this, the 8086 has an active drive circuit to pulse the RQ/GT lines high to pull them back from a low level. This circuit pulls the line high one cycle after the 8086 drops it low for a grant acknowledge. This circuit also pulls the line high after the external device pulls it low.11 (The schematic for this circuit is in the footnotes.) The curious thing is that I couldn't find this behavior documented in the datasheets. The datasheets describe the internal pull-up resistor, but don't mention that the 8086 actively pulls the lines high.12
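
For a rough sense of why the resistor alone is slow, here's an illustrative back-of-the-envelope calculation in Python. The resistance, capacitance, and threshold values are guesses chosen for illustration, not measured or documented figures:

```python
import math

r_pullup = 10e3        # ohms: assumed internal pull-up resistance (illustrative)
c_line = 50e-12        # farads: assumed capacitance on the RQ/GT line (illustrative)
v_dd, v_ih = 5.0, 2.0  # supply voltage and a typical TTL input-high threshold

# Charging a capacitor through a resistor: v(t) = Vdd * (1 - exp(-t / (R*C))).
# Solve for the time needed to cross the input-high threshold.
t_rise = -r_pullup * c_line * math.log(1 - v_ih / v_dd)
print(f"Passive pull-up takes roughly {t_rise * 1e9:.0f} ns to reach V_IH")
# With these assumed values, that's on the order of a 5 MHz clock period
# (200 ns) or more, so actively pulsing the line high is considerably faster.
```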

Conclusions

The hold circuitry was a key feature of the 8086, largely because it was necessary for the 8087 math coprocessor chip. The hold circuitry seems like it should be straightforward, but there are many corner cases in this circuitry: it interacts with unaligned memory accesses, the LOCK prefix, and minimum vs. maximum modes. As a result, it is fairly complicated.

Personally, I find the hold circuitry somewhat unsatisfying to study, with few fundamental concepts but a lot of special-case logic. The circuitry seems overly complicated for what it does. Much of the complexity is probably due to the wildly different behavior of the pins between minimum and maximum mode. Intel should have simply used a larger package (like the Motorola 68000) rather than re-using pins to support different modes and sharing a single pin between request and grant. It's impressive, however, that the same circuitry was made to work for both minimum and maximum modes, despite using completely different signals to request and grant holds. This circuitry must have been a nightmare for Intel's test engineers, trying to ensure that the circuitry performed properly when there were so many corner cases and potential race conditions.

I plan to write more on the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected] and Bluesky as @righto.com so you can follow me there too.

Notes and references

  1. The timing of priority between RQ0 and RQ1 is left vague in the documentation. In practice, even if RQ1 is requested first, a later RQ0 can still preempt it until the point that RQ1 is internally granted (i.e. the second flip-flop is activated). This happens before the hold is externally acknowledged, so it's not obvious to the user at what point priority no longer applies. 

  2. The RQ/GT0 and RQ/GT1 signals are active-low. These signals should have an overbar to indicate this, but it makes the page formatting ugly :-) 

  3. Modern x86 processors still support the 8087 (x87) instruction set. Starting with the 80486DX, the floating point unit was included on the CPU die, rather than as an external coprocessor. The x87 instruction set used a stack-based model, which made it harder to parallelize. To mitigate this, Intel introduced SSE in 1999, a different set of floating point instructions that worked on an independent register set. The x87 instructions are now considered mostly obsolete and are deprecated in 64-bit Windows. 

  4. The 8087 provides another RQ/GT input line for an external device. Thus, two external devices can still be used in a system with an 8087. That is, although the 8087 uses up one of the 8086's two RQ/GT lines, the 8087 provides another one, so there are still two lines available. The 8087 has logic to combine its bus requests and external bus requests into a single RQ/GT line to the 8086. 

  5. Confusingly, some of the flip-flops in the hold circuit transition when the clock goes high, while others use the inverted clock signal and transition when the clock goes low. Moreover, the flip-flops are inconsistent about how they treat the data. In each group of three flip-flops, the first flip-flop is active-high, while the remaining flip-flops are active-low. For the most part, I'll ignore this in the discussion. You can look at the schematic if you want to know the details. 

  6. The schematics below show my reverse-engineered schematic for the hold circuitry. I have partitioned the schematic into the hold logic and the output driver circuitry. This split matches the physical partitioning of the circuitry on the die.

    In the first schematic, the upper part handles HOLD and request0, while the lower part handles request1. There is some circuitry in the middle to handle the common enabling and to generate the internal hold signal. I won't explain the circuitry in much detail, but there are a few things I want to point out. First, even though the hold circuit seems like it should be simple, there are a lot of gates connected in complicated ways. Second, although there are many inverters, NAND gates, and NOR gates, there are also complex gates such as AND-NOR, OR-NAND, AND-OR-NAND, and so forth. These are implemented as single gates: due to how gates are constructed from NMOS transistors, it is just as easy to build a complex compound gate as a simple one. (The last step must be inversion, though.) The XOR gates are more complex; they are constructed from a NOR gate and an AND-NOR gate.

    Schematic of the hold circuitry. Click this image (or any other) for a larger version.

    Schematic of the hold circuitry. Click this image (or any other) for a larger version.

    The schematic below shows the output circuits for the two pins. These circuits are similar, but have a few differences because only the bottom one is used as an output (HLDA) in minimum mode. Each circuit has two inputs: what the current value of the pin is, and what the desired value of the pin is.

    Schematic of the pin output circuits.

    Schematic of the pin output circuits.


  7. Interestingly, the external pins aren't taken out of tri-state mode immediately when the HLDA signal is dropped. Instead, the 8086's bus drivers are re-enabled when a bus cycle starts, which is slightly later. The bus circuitry has a separate flip-flop to manage the enable/disable state, and the start of a bus cycle is what re-enables the bus. This is another example of behavior that the documentation leaves ambiguous. 

  8. There's one more complication for the hold-out signal. If a hold is granted on one line, a request comes in on the other line, and then the hold is released on the first line, the desired behavior is for the bus to remain in the hold state as the hold switches to the second line. However, because of the way a hold on line 1 blocks a hold on line 0, the GT1 second flip-flop will drop a cycle before the GT0 second flip-flop is activated. This would cause hold-out to drop for a cycle and the 8086 could start unwanted bus activity. To prevent this case, the hold-out line is also activated if there is an RQ0 request and RQ1 is granted. This condition seems a bit random but it covers the "gap". I have to wonder if Intel planned the circuit this way, or they added the extra test as a bug fix. (The asymmetry of the priority circuit causes this problem to occur only going from a hold on line 1 to line 0, not the other direction.) 

  9. The pulse-generating circuit is a bit tricky. A pulse signal is generated if the request has been accepted, has not been granted, and will be granted on the next clock (i.e. no memory request is active so the flip-flop is enabled). (Note that the pulse will be one cycle wide, since granting the request on the next cycle will cause the conditions to be no longer satisfied.) This provides the pulse one clock cycle before the flip-flop makes it "official". Moreover, the signals come from the inverted Q outputs from the flip-flops, which are updated half a clock cycle earlier. The result is that the pulse is generated 1.5 clock cycles before the flip-flop output. Presumably the point of this is to respond to hold requests faster, but it seems overly complicated. 

  10. The request pulse is required to be one clock cycle wide. The feedback loop shows why: if the request is longer than one clock cycle, the first flip-flop will repeatedly toggle on and off, resulting in unexpected behavior. 

  11. The details of the active pull-up circuitry don't make sense to me. First it XORs the state of the pin with the desired state of the pin and uses this signal to control a multiplexer, which generates the pull-up action based on other gates. The result of all this ends up being simply NAND, implemented with excessive gates. Another issue is that minimum mode blocks the active pull-up, which makes sense. But there are additional logic gates so minimum mode can affect the value going into the multiplexer, which gets blocked in minimum mode, so that logic seems wasted. There are also two separate circuits to block pull-up during reset. My suspicion is that the original logic accumulated bug fixes and redundant logic wasn't removed. But it's possible that the implementation is doing something clever that I'm just missing. 

  12. My analysis of the RQ/GT lines being pulled high is based on simulation. It would be interesting to verify this behavior on a physical 8086 chip. By measuring the current out of the pin, the pull-up pulses should be visible.