Intel introduced the 8086 microprocessor in 1978. This processor ended up being hugely influential, setting the path for the x86 architecture that is extensively used today. One interesting feature of the 8086 was instructions that can efficiently operate on blocks of memory up to 64K bytes long.1 These instructions rapidly copy, compare, or scan data and are known as "string" instructions.2
In this blog post, I explain string operations in the 8086, analyze the microcode that it used, and discuss the hardware circuitry that helped it out. My analysis is based on reverse-engineering the 8086 from die photos. The photo below shows the chip under a microscope. I've labeled the key functional blocks; the ones that are important to this post are darker. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions. The microcode ROM at the lower right controls the process.
Segments and addressing
Before I get into the details of the string instructions, I need to give a bit of background on how the 8086 accesses memory through segments. Earlier microprocessors such as the Intel 8080 (1974) used 16 bits to specify a memory address, allowing a maximum of 64K of memory. This memory capacity is absurdly small by modern standards, but at the time when a 4K memory board cost hundreds of dollars, this limit was not a problem. However, due to Moore's Law and the exponential growth in memory capacity, the follow-on 8086 processor needed to support more memory. At the same time, the 8086 needed to use 16-bit registers for backward compatibility with the 8080.
The much-reviled solution was to create a 1-megabyte (20-bit) address space built from 64K-byte segments, with a 16-bit address specifying a position within the segment. In more detail, a memory address was specified by a 16-bit offset along with a 16-bit segment register selecting a segment. The segment register's value was shifted left by 4 bits to give the segment's 20-bit base address. The 16-bit offset was then added, yielding a 20-bit memory address. This gave the processor a 1-megabyte address space, although only 64K could be accessed without changing a segment register. The 8086 had four segment registers so it could use multiple segments at the same time: the Code Segment, Data Segment, Stack Segment, and Extra Segment.
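The address arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the shift-and-add calculation; the function name `physical_address` is made up for this example, and the wrap to 20 bits reflects the 8086's 20 address lines.

```python
def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address."""
    base = (segment & 0xFFFF) << 4               # segment base: shift left 4 bits
    return (base + (offset & 0xFFFF)) & 0xFFFFF  # add the offset, keep 20 bits


# Different segment:offset pairs can name the same byte of memory.
print(hex(physical_address(0x1234, 0x0010)))  # 0x12350
print(hex(physical_address(0x1235, 0x0000)))  # 0x12350
```

Because the segment only moves in 16-byte steps, many segment:offset pairs alias the same physical address, which is part of why the scheme was unpopular.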
The 8086 chip is split into two processing units: the Bus Interface Unit (BIU) that handles segments and memory accesses, and the Execution Unit (EU) that executes instructions. The Execution Unit is what comes to mind when you think of a processor: it has most of the registers, the arithmetic/logic unit (ALU), and the microcode that implements instructions. The Bus Interface Unit interacts with memory and other external systems, performing the steps necessary to read and write memory.
Among other things, the Bus Interface Unit has a separate adder for address calculation; this adds the segment base address to the offset to determine the final memory address. Every memory access uses the address adder at least once to add the segment base and offset. The address adder is also used to increment the program counter. Finally, the address adder increments and decrements the index registers used for block operations. This will be discussed in more detail below.
Microcode in the 8086
Most people think of machine instructions as the basic steps that a computer performs. However, many processors (including the 8086) have another layer of software underneath: microcode. With microcode, instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. This provides a considerable performance improvement for the block operations, which require many steps in a loop. Performing this loop in microcode is considerably faster than writing the loop in assembly code.
A micro-instruction in the 8086 is encoded into 21 bits as shown below. Every micro-instruction specifies a move operation from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits depends on the type field and can be anything from an ALU operation to a memory read or write to a change of microcode control flow. Thus, an 8086 micro-instruction typically does two things in parallel: the move and the action. For more about 8086 microcode, see my microcode blog post.
I'll explain the behavior of an ALU micro-operation since it is important for string operations. The Arithmetic/Logic Unit (ALU) is the heart of the processor, performing addition, subtraction, and logical operations. The ALU has three temporary input registers that are invisible to the programmer: tmpA, tmpB, and tmpC. An ALU operation takes its first argument from any temporary register, while the second argument always comes from tmpB. Performing an ALU operation requires two micro-instructions. The first micro-instruction specifies the ALU operation and source register, configuring the ALU. For instance, ADD tmpA configures the ALU to add the tmpA register to the default tmpB register. In the next micro-instruction (or a later one), the ALU result can be accessed through a special register called Σ (SIGMA) and moved to another register.
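To make the two-step ALU usage concrete, here is a toy Python model. It is a sketch under assumptions, not the chip's actual logic: the class name `Alu`, the dictionary of operations, and the choice to compute the result when Σ is read are all illustrative. The operation names (ADD, SUBT, PASS, DEC, DEC2) are the ones that appear in the microcode listings in this post.

```python
class Alu:
    """Toy model of the configure-then-read ALU micro-operations."""

    # Each operation takes (first argument, tmpB); some ignore tmpB.
    OPS = {
        "ADD": lambda a, b: a + b,
        "SUBT": lambda a, b: a - b,
        "PASS": lambda a, b: a,
        "DEC": lambda a, b: a - 1,
        "DEC2": lambda a, b: a - 2,
    }

    def __init__(self):
        self.tmp = {"tmpA": 0, "tmpB": 0, "tmpC": 0}
        self.op, self.src = "PASS", "tmpA"

    def configure(self, op, src):
        # First micro-instruction: only records the operation and its source.
        self.op, self.src = op, src

    @property
    def sigma(self):
        # A later micro-instruction reads the 16-bit result through SIGMA.
        result = self.OPS[self.op](self.tmp[self.src], self.tmp["tmpB"])
        return result & 0xFFFF


alu = Alu()
alu.tmp["tmpA"], alu.tmp["tmpB"] = 5, 7
alu.configure("ADD", "tmpA")   # step 1: set up the operation
print(alu.sigma)               # step 2: read the result, prints 12
```

The separation matters for the string instructions: the microcode can configure a decrement in one place and only pick up the result several micro-instructions later.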
I'll also explain the memory read and write micro-operations. A memory operation uses two internal registers: IND (Indirect) holds the memory address, while OPR (Operand) holds the word that is read or written. A typical memory micro-instruction for a read is R DS,BL. This causes the Bus Interface Unit to compute the memory address by adding the Data Segment (DS) to the IND register and then perform the read. The Bus Interface Unit determines if the instruction is performing a byte operation or a word operation and reads a byte or word as appropriate, going through the necessary bus cycles. The BL option causes the Bus Interface Unit to update the IND register as appropriate, incrementing or decrementing it by 1 or 2 depending on the Direction Flag and the size of the access (byte or word).3 All of this complexity happens in the hardware of the Bus Interface Unit and is invisible to the microcode. The tradeoff is that this simplifies the microcode but makes the chip's hardware considerably more complicated.
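As a rough illustration of what a single R DS,BL micro-operation accomplishes, the Python sketch below performs the read and the IND update in one call. The function name `read_ds_bl`, its parameters, and the use of a 1-megabyte `bytearray` for memory are assumptions for this example; the real Bus Interface Unit does this with bus cycles and dedicated hardware, and unaligned words are ignored here.

```python
def read_ds_bl(memory, ds, ind, word, direction_flag):
    """Toy model of 'R DS,BL': returns (OPR, updated IND)."""
    base = (ds & 0xFFFF) << 4
    addr = (base + ind) & 0xFFFFF
    if word:
        # A word is stored little-endian: low byte first.
        high_addr = (base + ((ind + 1) & 0xFFFF)) & 0xFFFFF
        opr = memory[addr] | (memory[high_addr] << 8)
    else:
        opr = memory[addr]
    step = 2 if word else 1          # size of the access
    if direction_flag:
        step = -step                 # Direction Flag 1 means decrement
    return opr, (ind + step) & 0xFFFF


mem = bytearray(1 << 20)
mem[0x12350], mem[0x12351] = 0x34, 0x12
opr, ind = read_ds_bl(mem, ds=0x1234, ind=0x0010, word=True, direction_flag=0)
print(hex(opr), hex(ind))  # 0x1234 0x12
```

The microcode only sees the equivalent of the two return values: the data in OPR and the already-adjusted address in IND.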
The string move instruction
The 8086 has five types of string instructions, operating on blocks of memory: MOVS (Move String), CMPS (Compare Strings), SCAS (Scan String), LODS (Load String), and STOS (Store String). Each instruction operates on a byte or word, but by using a REP prefix, the operation can be repeated for up to 64K bytes, controlled by a counter. Conditional repetitions can terminate the loop on various conditions. The string instructions provide a flexible way to operate on blocks of memory, much faster than a loop written in assembly code.
The MOVS (Move String) operation copies one memory region to another. The CMPS (Compare Strings) operation compares two memory blocks and sets the status flags. In particular, this indicates if one string is greater than, less than, or equal to the other. The SCAS (Scan String) operation scans memory, looking for a particular value. The LODS (Load String) operation moves an element into the accumulator, generally as part of a more complex loop. Finally, STOS (Store String) stores the accumulator value, either to initialize a block of memory or as part of a more complex loop.4
Like many 8086 instructions, each string instruction has two opcodes: one that operates on bytes and one that operates on words. One of the interesting features of the 8086 is that the same microcode implements the byte and word instructions, while the hardware takes care of the byte- or word-sized operations as needed. Another interesting feature of the string operations is that they can go forward through memory, incrementing the pointers, or they can go backward, decrementing the pointers. A special processor flag, the Direction Flag, indicates the direction: 0 for incrementing and 1 for decrementing. Thus, there are four possibilities for stepping through memory, part of the flexibility of the string operations.
The flowchart below shows the complexity of these instructions. I won't walk through the flowchart here; the point is that there is a lot going on. This functionality is implemented by the microcode.
I'll start by explaining the MOVS (Move String) instruction, which moves (copies) a block of memory. Before executing this instruction, the registers should be configured so the SI (Source Index) register points to the source block, the DI (Destination Index) register points to the destination block, and the CX (Count) register holds the number of bytes or words to move. The basic action of the MOVS instruction reads a byte (or word) from the SI address and updates SI, writes the value to the DI address and updates DI, and, when a REP prefix is present, decrements the CX counter.
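The programmer-visible effect of one MOVS iteration can be sketched in Python as follows. This is a simplified model, not the processor's implementation: the name `movs_step` and the register dictionary are invented for the example, segments are ignored (a single flat 64K `bytearray` stands in for memory), and the step shown is a single un-prefixed iteration, so CX is left alone (the counter handling belongs to the REP path).

```python
def movs_step(mem, regs, word=False, direction_flag=0):
    """One MOVS iteration on a flat 64K bytearray (segments ignored)."""
    size = 2 if word else 1
    step = -size if direction_flag else size
    si, di = regs["SI"], regs["DI"]
    for i in range(size):
        # Copy the byte, or both bytes of the word, from source to destination.
        mem[(di + i) & 0xFFFF] = mem[(si + i) & 0xFFFF]
    regs["SI"] = (si + step) & 0xFFFF   # both pointers move the same way
    regs["DI"] = (di + step) & 0xFFFF


mem = bytearray(0x10000)
mem[0x100:0x102] = b"\x34\x12"
regs = {"SI": 0x100, "DI": 0x200, "CX": 0}
movs_step(mem, regs, word=True)
print(mem[0x200:0x202].hex(), hex(regs["SI"]), hex(regs["DI"]))
```

Calling the function with `word` and `direction_flag` in their four combinations gives the four ways of stepping through memory mentioned earlier.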
The microcode block below is executed for the MOVS (and LODS) instructions. There's a lot happening in this microcode with a variety of control paths, so it's a bit tricky to understand, but let's see how it goes. Each micro-instruction has a register-to-register move on the left and an action on the right, happening in parallel. The first micro-instruction handles the REP prefix, if any; let's assume for now that there's no prefix so it is skipped. Next is the read from memory, which requires the memory address to be in the IND register. Thus, the micro-instruction moves SI to IND and starts the read cycle (R DS,BL). When the read completes, the updated IND register is moved back to SI, updating that register. Meanwhile, X0 tests the opcode and jumps to "8" for LODS. The MOVS path falls through, getting the address from the DI register and writing to memory the value that we just read. The updated IND register is moved to DI while another conditional jump goes to "7" if there's no REP prefix. Micro-instruction 7 performs an RNI (Run Next Instruction), which ends the microcode and causes the next machine instruction to be decoded. As you can see, microcode is very low-level.
move        action
            CALL F1 RPTS   MOVS/LODS: handle REP if active
SI → IND    R DS,BL        1: Read byte/word from SI
IND → SI    JMPS X0 8         test instruction bit 3: jump if LODS
DI → IND    W DA,BL           MOVS path: write to DI
IND → DI    JMPS NF1 7     4: run next instruction if not REP
Σ → tmpC    JMP INT RPTI   5: handle any interrupt
tmpC → CX   JMPS NZ 1         update CX, loop if not zero
            RNI            7: run next instruction
OPR → M     JMPS F1 5      8: LODS path: store AL/AX, jump back if REP
            RNI               run next instruction
Now let's look at the case with a REP prefix, causing the instruction to loop. The first step is to test if the count register CX is zero, and bail out of the loop if so. In more detail, the REP prefix sets an internal flag called F1. The first micro-instruction for MOVS above conditionally calls the RPTS subroutine if F1 is set. The RPTS subroutine below is a bit tricky. First, it moves the count in CX to the ALU's temporary C register. It also configures the ALU to pass tmpC through unchanged. The next move discards the ALU result Σ, but as a side effect, sets a flag if the value is zero. This micro-instruction also configures the ALU to perform DEC tmpC, but the decrement doesn't happen yet. Next, if the value is nonzero (NZ), the microcode execution jumps to 10 and returns from the microcode subroutine, continuing execution of the MOVS code described above. On the other hand, if CX is zero, execution falls through to RNI (Run Next Instruction), which terminates execution of the MOVS instruction.
CX → tmpC     PASS tmpC    RPTS: test CX
Σ → no dest   DEC tmpC        Set up decrement for later
              JMPS NZ 10      Jump to 10 if CX not zero
              RNI             If 0, run next instruction
              RTN          10: return
If execution returns to the MOVS microcode, it will execute as described earlier until the NF1 test below. With a REP prefix, the test fails and microcode execution falls through. The next micro-instruction performs Σ → tmpC, which puts the ALU result into tmpC. The ALU was configured back in the RPTS subroutine to decrement tmpC, which holds the count from CX, so the result is that CX is decremented, put into tmpC, and then put back into CX in the next micro-instruction. It seems like a roundabout way to decrement the counter, but that's microcode. Finally, if the value is nonzero (NZ), microcode execution jumps back to 1 (near the top of the MOVS code earlier), repeating the whole process. Otherwise, RNI ends processing of the instruction. Thus, the MOVS instruction repeats until CX is zero. In the next section, I'll explain how JMP INT RPTI handles an interrupt.
IND → DI    JMPS NF1 7     4: run next instruction if not REP
Σ → tmpC    JMP INT RPTI   5: handle any interrupt
tmpC → CX   JMPS NZ 1         update CX, loop if not zero
            RNI            7: run next instruction
The NZ (not zero) condition tests a special 16-bit zero flag, not the standard zero status flag. This allows zero to be tested without messing up the zero status flag.
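The control flow of REP MOVS can be summarized with a small Python model that follows the order of the micro-instructions: the RPTS test first, then the move, then the deferred decrement, the interrupt check, and the loop test. This is a sketch under stated assumptions: the name `rep_movsb`, the `interrupt_pending` callback, and the flat 64K `bytearray` are all invented for illustration, and only the byte-sized, forward-direction case is modeled.

```python
def rep_movsb(mem, regs, interrupt_pending=lambda: False):
    """Follow the REP MOVSB microcode flow on a flat 64K bytearray.

    Returns True if the instruction ran to completion, or False if it was
    interrupted, in which case SI, DI, and CX hold the state to resume from.
    """
    # RPTS: test CX up front; the decrement is only set up, not performed.
    if regs["CX"] == 0:
        return True                                # RNI: nothing to copy
    while True:
        mem[regs["DI"]] = mem[regs["SI"]]          # read via SI, write via DI
        regs["SI"] = (regs["SI"] + 1) & 0xFFFF     # IND -> SI
        regs["DI"] = (regs["DI"] + 1) & 0xFFFF     # IND -> DI
        tmp_c = (regs["CX"] - 1) & 0xFFFF          # SIGMA -> tmpC (deferred DEC)
        z16 = tmp_c == 0                           # internal zero latch, not ZF
        regs["CX"] = tmp_c                         # tmpC -> CX (also done by RPTI)
        if interrupt_pending():                    # JMP INT RPTI
            return False
        if z16:                                    # JMPS NZ 1 not taken
            return True                            # RNI


mem = bytearray(0x10000)
mem[0x100:0x105] = b"hello"
regs = {"SI": 0x100, "DI": 0x200, "CX": 5}
rep_movsb(mem, regs)
print(bytes(mem[0x200:0x205]), regs)
```

Note that the model never touches a visible zero flag: the loop exit depends only on the separate `z16` value, mirroring the point made above about the NZ condition.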
Interrupts
Interrupts pose a problem for the string operations. The idea behind interrupts is that the computer can be interrupted during processing to handle a high-priority task, such as an I/O device that needs servicing. The processor stops its current task, executes the interrupt handling code, and then returns to the original task. The 8086 processor normally completes the instruction that it is executing before handling the interrupt, so it can continue from a well-defined state. However, a string operation can perform up to 64k moves, which could take a large fraction of a second.5 If the 8086 waited for the string operation to complete, interrupt handling would be way too slow and could lose network packets or disk data, for instance.
The solution is that a string instruction can be interrupted in the middle of the instruction, unlike most instructions. The string instructions are designed to use registers in a way that allows the instruction to be restarted. The idea is that the CX register holds the current count, while the SI and DI registers hold the current memory pointers, and these registers are updated as the instruction progresses. If the instruction is interrupted it can simply continue where it left off. After the interrupt, the 8086 restarts the string operation by backing the program counter up by two bytes (one byte for the REP prefix and one byte for the string opcode). This causes the interrupted string operation to be re-executed, continuing where it left off.
If there is an interrupt, the RPTI microcode routine below is called to update the program counter. Updating the program counter is harder than you'd expect because the 8086 prefetches instructions. The idea is that while the memory bus is idle, instructions are read from memory into a prefetch queue. Then, when an instruction is needed, the processor can (hopefully) get the instruction immediately from the prefetch queue instead of waiting for a memory access. As a result, the program counter in the 8086 points to the memory address of the next instruction to fetch, not the next instruction to execute. To get the "real" program counter value, prefetching is first suspended (SUSP). Then the PC value is corrected (CORR) by subtracting the length of the prefetch queue. At this point, the PC points to the next instruction to execute.
tmpC → CX   SUSP        RPTI: store CX
            CORR           correct PC
PC → tmpB   DEC2 tmpB
Σ → PC      FLUSH RNI      PC -= 2, end instruction
At last, the microcode gets to the purpose of this subroutine: the PC is decremented by 2 (DEC2) using the ALU. The prefetch queue is flushed and restarted and the RNI micro-operation terminates the microcode and runs the next instruction. Normally this would execute the instruction from the new program counter value (which now points to the string operation). However, since there is an interrupt pending, the interrupt will take place instead, and the interrupt handler will execute. After the interrupt handler finishes, the interrupted string operation will be re-executed, continuing where it left off.
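The program counter arithmetic in RPTI is simple enough to check by hand, and the Python sketch below does just that. The function name `restart_pc` and the example addresses are hypothetical; the two steps correspond to the CORR and DEC2 micro-operations described above.

```python
def restart_pc(fetch_pc, queue_length):
    """PC value that re-executes an interrupted, REP-prefixed string op."""
    next_to_execute = (fetch_pc - queue_length) & 0xFFFF  # CORR: undo prefetch
    return (next_to_execute - 2) & 0xFFFF                 # DEC2: prefix + opcode


# REP MOVSB occupying 0x0100-0x0101, with 4 bytes already prefetched:
# the fetch PC has run ahead to 0x0106.
print(hex(restart_pc(0x0106, 4)))  # 0x100, the REP prefix

# With an extra prefix (3 bytes at 0x0100-0x0102, same 4-byte queue), the
# fixed back-up of 2 lands on the second byte and the first prefix is lost.
print(hex(restart_pc(0x0107, 4)))  # 0x101
```

The second call shows the limitation of a fixed two-byte back-up: any prefix beyond the last one is simply not re-executed.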
There's another complication, of course. An 8086 instruction can have multiple prefixes attached, for example using a segment register prefix to access a different segment. The approach of backing up two bytes will only execute the last prefix, ignoring any others, so if you have two prefixes, the instruction doesn't get restarted correctly. The 8086 documentation describes this unfortunate behavior. Apparently a comprehensive solution (e.g. counting the prefixes or providing a buffer to hold prefixes during an interrupt) was impractical for the 8086. I think this was fixed in the 80286.
The remaining string instructions
I'll discuss the microcode for the other string operations briefly. The LODS instruction loads from memory into the accumulator. It uses the same microcode routine as MOVS; the code below is the same code discussed earlier. However, the path through the microcode is different for LODS since the JMPS X0 8 conditional jump will be taken. (This tests bit 3 of the opcode, which is set for LODS.) At step 8, a value has been read from memory and is in the OPR (Operand) register. This micro-instruction moves the value from OPR to the accumulator (represented by M for complicated reasons6). If there is a repeat prefix, the microcode jumps back to the previous flow (5). Otherwise, RNI runs the next instruction. Thus, LODS shares almost all its microcode with MOVS, making the microcode more compact at the cost of slowing it down slightly due to the conditional jumps.
move        action
            CALL F1 RPTS   MOVS/LODS: handle REP if active
SI → IND    R DS,BL        1: Read byte/word from SI
IND → SI    JMPS X0 8         test instruction bit 3: jump if LODS
DI → IND    W DA,BL           MOVS path: write to DI
IND → DI    JMPS NF1 7     4: run next instruction if not REP
Σ → tmpC    JMP INT RPTI   5: handle any interrupt
tmpC → CX   JMPS NZ 1         update CX, loop if not zero
            RNI            7: run next instruction
OPR → M     JMPS F1 5      8: LODS path: store AL/AX, jump back if REP
            RNI               run next instruction
The STOS instruction is the opposite of LODS, storing the accumulator value into memory. The microcode (below) is essentially the second half of the MOVS microcode. The memory address in DI is moved to the IND register and the value in the accumulator is moved to the OPR register to set up the write operation. (As with LODS, the M register indicates the accumulator.6) The CX register is decremented using the ALU.
DI → IND    CALL F1 RPTS   STOS: if REP prefix, test if done
M → OPR     W DA,BL        1: write the value to memory
IND → DI    JMPS NF1 5        Quit if not F1 (repeat)
Σ → tmpC    JMP INT RPTI      Jump to RPTI if interrupt
tmpC → CX   JMPS NZ 1         Loop back if CX not zero
            RNI            5: run next instruction
The CMPS instruction compares strings, while the SCAS instruction scans for a value that matches (or doesn't match) the accumulator, depending on the prefix. They share the microcode routine below, with the X0 condition testing bit 3 of the instruction to select the path. The difference is that CMPS reads the comparison character from SI, while SCAS compares against the character in the accumulator. The comparison itself is done by subtracting the two values and discarding the result. The F bit in the micro-instruction causes the processor's status flags to be updated with the result of the subtraction, indicating less than, equal, or greater than.
              CALL F1 RPTS    CMPS/SCAS: if RPT, quit if done
M → tmpA      JMPS X0 5       1: accum to tmpA, jump if SCAS
SI → IND      R DS,BL            CMPS path, read from SI to tmpA
IND → SI                         update SI
OPR → tmpA                       fallthrough
DI → IND      R DA,BL         5: both: read from DI to tmpB
OPR → tmpB    SUBT tmpA          subtract to compare
Σ → no dest   DEC tmpC F         update flags, set up DEC
IND → DI      JMPS NF1 12        return if not RPT
Σ → CX        JMPS F1ZZ 12       update CX, exit if condition
Σ → tmpC      JMP INT RPTI       if interrupt, jump to RPTI
              JMPS NZ 1          loop if CX ≠ 0
              RNI             12: run next instruction
One tricky part about the scan and compare instructions is that you can either repeat while the values are equal or while they are unequal, with the REPE or REPNE prefixes respectively. Rather than implementing this two-part condition in microcode, the F1ZZ condition above tests the right condition depending on the prefix.
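A short Python model may help show how one condition can serve both prefixes. It is a sketch, not the microcode itself: the names `f1zz` and `rep_scasb` are made up, memory is a flat 64K `bytearray`, only the byte-sized forward case is modeled, and the polarity (F1Z = 1 for REPE/REPZ, 0 for REPNE/REPNZ) follows the prefix bit described in the hardware section of this post.

```python
def f1zz(zero_flag, f1z):
    """True when the repeat should stop: the zero flag differs from F1Z."""
    return bool(zero_flag) != bool(f1z)


def rep_scasb(mem, regs, al, f1z):
    """Scan bytes at DI under REPE (f1z=1) or REPNE (f1z=0)."""
    while regs["CX"] != 0:                          # RPTS / NZ loop test
        regs["ZF"] = int(mem[regs["DI"]] == al)     # SUBT with flags updated
        regs["DI"] = (regs["DI"] + 1) & 0xFFFF      # IND -> DI
        regs["CX"] = (regs["CX"] - 1) & 0xFFFF      # SIGMA -> CX
        if f1zz(regs["ZF"], f1z):                   # JMPS F1ZZ 12
            break
    return regs


mem = bytearray(0x10000)
mem[0x10:0x14] = b"aaab"
# REPNE SCASB: skip bytes until one equals 'b'.
print(rep_scasb(mem, {"DI": 0x10, "CX": 4, "ZF": 0}, ord("b"), f1z=0))
# REPE SCASB: skip bytes while they equal 'a'.
print(rep_scasb(mem, {"DI": 0x10, "CX": 4, "ZF": 0}, ord("a"), f1z=1))
```

In both runs the scan stops after the fourth byte with DI one past the byte that ended the loop; the zero flag then tells the program whether the loop ended on a match or a mismatch.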
Hardware support
Although the 8086 uses microcode to implement instructions, it also uses a considerable amount of hardware to simplify the microcode. This hybrid approach was necessary in order to fit the microcode into the small ROM capacity available in 1978.7 This section discusses some of the hardware circuitry in the 8086 that supports the string operations.
Implementing the REP prefixes
Instruction prefixes, including REPNZ and REPZ, are executed in hardware rather than microcode. The first step of instruction decoding, before microcode starts, is the Group Decode ROM. This ROM categorizes instructions into various groups. For instructions that are categorized as prefixes, the signal from the Group Decode ROM delays any interrupts (because you don't want an interrupt between the prefix and the instruction) and starts the next instruction without executing microcode. The Group Decode ROM also outputs a REP signal specifically for these two prefixes. This signal causes the F1 latch to be loaded with 1, indicating a REP prefix. (This latch is also used during multiplication to track the sign.) This signal also causes the F1Z latch to be loaded with bit 0 of the instruction, which is 0 for REPNZ and 1 for REPZ. The microcode uses these latches to determine the appropriate behavior of the string instruction.
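The effect of the REP signal on the two latches can be written as a tiny Python function. The function name `decode_rep_prefix` is hypothetical, and the opcode values 0xF2 (REPNZ/REPNE) and 0xF3 (REPZ/REPE/REP) are the standard x86 encodings rather than something stated in this post; the real chip derives the same information from the Group Decode ROM, not from a comparison like this.

```python
def decode_rep_prefix(opcode):
    """Return the (F1, F1Z) latch values for a REP prefix byte, else None."""
    if opcode & 0xFE != 0xF2:    # only 0xF2 and 0xF3 are REP prefixes
        return None
    f1 = 1                       # a REP prefix is active
    f1z = opcode & 1             # bit 0: 0 for REPNZ, 1 for REPZ
    return f1, f1z


print(decode_rep_prefix(0xF2))  # (1, 0)  REPNZ
print(decode_rep_prefix(0xF3))  # (1, 1)  REPZ
print(decode_rep_prefix(0x26))  # None    (a segment prefix, not REP)
```

Because the two prefixes differ only in bit 0, a single latch loaded from that bit is enough to remember which one was used.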
Updating SI and DI: the Constant ROM
The SI and DI index registers are updated during each step to point to the next element. This update is more complicated than you might expect, though, since the registers are incremented or decremented based on the Direction Flag. Moreover, the step size, 1 or 2, varies for a byte or word operation. Another complication is unaligned word accesses, using an odd memory address to access a word. The 8086's bus can only handle aligned words, so an unaligned word access is split into two byte accesses, incrementing the address after the first access. If the operation is proceeding downward, the address then needs to be decremented by 3 (not 2) at the end to cancel out this increment. The point is that updating the index registers is not trivial but requires an adjustment anywhere between -3 and +2, depending on the circumstances.
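The cases above can be tabulated with a few lines of Python. This is a sketch of the arithmetic only: the function name `index_adjustment` is invented, and the upward unaligned case (+1 after the intermediate increment) is inferred from the description rather than stated explicitly.

```python
def index_adjustment(word, direction_flag, address):
    """Final constant added to the address after one string access."""
    size = 2 if word else 1
    net = -size if direction_flag else size        # desired overall change
    unaligned = bool(word and (address & 1))
    # An unaligned word is two byte accesses with a +1 in between, so the
    # final adjustment must account for the increment already applied.
    already_applied = 1 if unaligned else 0
    return net - already_applied


for word in (False, True):
    for df in (0, 1):
        for addr in (0x100, 0x101):
            print(word, df, hex(addr), index_adjustment(word, df, addr))
```

Running the loop shows the adjustments -3, -2, -1, +1, and +2, consistent with the range of -3 to +2 given above.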
The Bus Interface Unit performs these updates automatically, without requiring the microcode to implement the addition or subtraction. The arithmetic is not performed by the regular ALU (Arithmetic/Logic Unit) but by the special adder dedicated to addressing arithmetic. The increment or decrement value is supplied by a special ROM called the Constant ROM, located next to the adder. The Constant ROM (shown below) is implemented as a PLA (programmable logic array), a two-level structured arrangement of gates. The first level (bottom) selects the desired constant, while the second level (middle) generates the bits of the constant: three bits plus a sign bit. The constant ROM is also used for correcting the program counter value as described earlier.
Condition testing
The microcode supports conditional jumps based on 16 conditions. Several of these conditions are designed to support the string operations. To test if a REP prefix is active, microcode uses the F1 test, which tests the F1 latch. The REPZ and REPNZ prefixes loop while the zero flag is 1 or 0 respectively. This somewhat complicated test is supported in microcode by the F1ZZ condition, which evaluates the zero flag XOR the F1Z latch. Since F1Z is 1 for REPZ and 0 for REPNZ, the condition is true when the zero flag no longer matches the prefix: a nonzero result with REPZ (F1Z=1) or a zero result with REPNZ (F1Z=0). That is exactly when the loop should exit.
Looping happens as long as the CX register is nonzero. This is tested in microcode with the NZ (Not Zero) condition. A bit surprisingly, this test doesn't use the standard zero status flag, but a separate latch that tracks if an ALU result is zero. (I call this the Z16 flag since it tests the 16-bit value, unlike the regular zero flag which tests either a byte or word.) The Z16 flag is only used by the microcode and is invisible to the programmer. The motivation behind this separate flag is so the string operations can leave the visible zero flag unchanged.8
Another important conditional jump is X0, which tests bit 3 of the instruction. This condition distinguishes between the MOVS and LODS instructions, which differ in bit 3, and similarly for CMPS versus SCAS. The test uses the X register which stores part of the instruction during decoding. Note that the opcodes aren't arbitrarily assigned to instructions like MOVS and LODS. Instead, the opcodes are carefully assigned so the instructions can share microcode but be distinguished by X0.
Finally, the string operation microcode also uses the INT condition, which tests if an interrupt is pending.
The conditions are evaluated by the condition PLA (Programmable Logic Array, a grid of gates), shown below. The four condition bits from the micro-instruction, along with their complements, are fed into the columns. The PLA has 16 rows, one for each condition. Each row is a NOR gate matching one bit combination (i.e. selecting a condition) and the corresponding signal value to test. Thus, if a particular condition is specified and is satisfied, that row will be 1. The 16 row outputs are combined by the 16-input NOR gate at the left. Thus, if the specified condition is satisfied, this output will be 0, and if the condition is unsatisfied, the output will be 1. This signal controls the jump or call micro-instruction: if the condition is satisfied, the new micro-address is loaded into the microcode address register. If the condition is not satisfied, the microcode proceeds sequentially. I discuss the 8086's conditional circuitry in more detail in this post.
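The NOR-NOR structure of the condition PLA can be modeled in Python to confirm the polarities described above. This is a sketch with assumptions: the function name `condition_pla` and the `signals` list are invented, the numbering of conditions is arbitrary, and the model assumes each row receives its tested signal in inverted form so that a satisfied condition pulls no input of that row's NOR gate high.

```python
def condition_pla(cond_bits, signals):
    """Model of the condition PLA; returns 0 if the condition is satisfied.

    cond_bits: the 4-bit condition number from the micro-instruction.
    signals:   16 booleans, signals[n] being the state tested by condition n.
    """
    rows = []
    for row in range(16):
        # For each condition bit the row is wired to either the bit or its
        # complement, so every input is 0 only when cond_bits equals row.
        inputs = [((cond_bits >> b) & 1) ^ ((row >> b) & 1) for b in range(4)]
        inputs.append(0 if signals[row] else 1)   # inverted signal under test
        rows.append(int(not any(inputs)))         # row NOR gate
    return int(not any(rows))                     # final 16-input NOR


signals = [False] * 16
signals[5] = True                   # suppose condition 5 currently holds
print(condition_pla(5, signals))    # 0: selected and satisfied
print(condition_pla(4, signals))    # 1: selected but not satisfied
```

At most one row can be 1 at a time, since only the row matching the condition bits has all of its selection inputs at 0.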
Conclusions
Hopefully you have found this close examination of microcode interesting. Microcode is implemented at an even lower level than assembly code, so it can be hard to understand. Moreover, the microcode in the 8086 was carefully optimized to make it compact, so it is even more obscure.
One of the big computer architecture debates of the 1980s was "RISC vs CISC", pitting Reduced Instruction Set Computers against Complex Instruction Set Computers. Looking at the 8086 in detail has given me more appreciation for the issues in a CISC processor such as the 8086. The 8086's string instructions are an example of the complex instructions in the 8086 that reduced the "semantic gap" between assembly code and high-level languages and minimized code size. While these instructions are powerful, their complexity spreads through the chip, requiring additional hardware features described above. These instructions also caused a great deal of complications for interrupt handling, including prefix-handling bugs that weren't fixed until later processors.
I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected].
Notes and references
-
Block move instructions didn't originate with the 8086. The IBM System/360 series of mainframes had an extensive set of block instructions providing moves, compare, logical operations (AND, OR, Exclusive OR), character translation, formatting, and decimal arithmetic. These operations supported blocks of up to 256 characters.
The Z80 processor (1976) had block instructions to move and compare blocks of data. The Z80 supported ascending and descending movements, but used separate instructions instead of a direction flag like the 8086. ↩
-
The "string" operations process arbitrary memory bytes or words. Despite the name, these instructions are not specific to zero-terminated strings or any other string format. ↩
-
The BL value in the micro-instruction indicates that the IND register should be incremented or decremented by 1 or 2 as appropriate. I'm not sure what BL stands for in the microcode. The patent says "BL symbolically represents a two bit code which causes external logic to examine the byte or word line and the direction flag in PSW register to generate, according to random logic well known to the art, the address factor required." So perhaps "Byte Logic"? ↩↩
-
The designer of the 8086 instruction set, Steve Morse, discusses the motivation behind the string operations in his book The 8086/8088 primer. These instructions were designed to be flexible and support a variety of use cases. The XLAT (Translate) and JCXZ (Jump if CX Zero) instructions were designed to work well with the string instructions.
The implementation of string instructions is discussed in detail in the 8086 patent, section 13 onward. ↩
-
A string operation could perform 64k moves, each of which consists of a read and a write, yielding 128k memory operations. I think that if the memory accesses are unaligned, i.e. a word access to an odd address, then each byte of the word needs to be accessed separately. So I think you could get up to 256k memory accesses. Each memory operation takes at least 4 clock cycles, more if the memory is slow and has wait states. So one string instruction could take over a million clock cycles. ↩
-
You might wonder why the register M indicates the accumulator, and the explanation is a bit tricky. The microcode uses 5-bit register specifications to indicate the source and destination for a data move. Registers can be specified explicitly, such as AX or BX, or a byte register such as AL or an internal register such as IND or tmpA. However, the microcode can also specify a generic source or destination register with M or N. The motivation is that the 8086 has a lot of operations that use an arbitrary source and destination register, for instance ADD AX, BX. Rather than making the microcode figure out which registers to use for these instructions, the hardware decodes the register fields from the instruction and substitutes the appropriate registers for M and N. This makes the microcode much simpler.
But why does the LODS microcode use the M register instead of AX when this instruction only works with the accumulator? The microcode takes advantage of another clever feature of the M and N registers. The hardware looks at the instruction to determine if it is a byte or word instruction, and performs an 8-bit or 16-bit transfer accordingly. If the LODS microcode was hardcoded for the accumulator, the microcode would need separate paths for AX and AL, the full accumulator and the lower byte of the accumulator.
The final piece of the puzzle is how the hardware knows to use the accumulator for the string instructions when they don't explicitly specify a register. The first step of instruction decoding is the Group Decode ROM, which categorizes instructions into various groups. One group is "instructions that use the accumulator". The string operations are categorized in this group, which causes the hardware to use the accumulator when the M register is specified. (Other instructions in this group include the immediate ALU operations, I/O operations, and accumulator moves.)
I discussed the 8086's register codes in more detail here. ↩↩
-
The 8086's microcode ROM was small: 512 words of 21 bits. In comparison, the VAX 11/780 minicomputer (1977) had 5120 words of 96 bits, over 45 times as large. ↩
-
The internal Z16 zero flag is mostly used by the string operations. It is also used by the LOOP iteration-control instructions and the shift instructions that take a shift count. ↩
10 comments:
Small point - you mention a "4 megabyte (20 bit)" address space. Perhaps a typo? The 8086 can only address 1 megabyte with its 20 bits. That aside, an excellent article as ever. I am enjoying these deep-dives into historic processors.
It's interesting that the flowchart and the microcode don't match up: the flowchart checks for pending interrupts before it performs the first iteration of the string operation, the microcode does this the other way around. I suspect an early iteration of the design used the logic in the flowchart (which would be simpler and only require one subroutine), but then it was realised that under heavy interrupt load (or possibly when using the trap flag) the string instructions would not make progress. A change in the design would also explain why the microcode blocks for STOS, CMPS/SCAS, and MOVS/LODS are not in contiguous places in the microcode ROM, whereas most microcode routines over 4 micro-instructions are contiguous. The other split routines are WAIT and JCXZ, but I don't know what changes might have been made to those.
LeoNerd: thanks, that's fixed. Andrew: I like your theory since it explains why the code is excessively non-contiguous.
In regards to your footnote 7, I have the full print set for an 11/785 sitting here at home if you ever want to see the details. Also the micro sequencer board out of an 11/785 that I used to run for many years.
Anonymous, the 11/785 print set seems like a good thing to put on bitsavers.
I wonder why the string instructions are not more popular than what they are. Seems like they provide functionality of, say, a blitter chip (such as used in the Amiga & later models of the atari ST) but in a way which generalizes and synergizes nicely with other features of the 8086.
e.g. Utf conversions, etc. Seems hard to imagine that the instruction bandwidth could be much smaller than using these instructions.
Am I right in assuming that movs is not slowed down by the conditional jump to check if it's a lods, only lods is? It seems that the movs code has enough that it needs to do in the left column that the conditional jump doesn't take any extra time from it?
It seems like the right call to prioritize the speed of movs, since lods is useless with a rep prefix.
I always thought that you could do rep cmps if you wanted, there just wasn't any point in doing so. But it seems that the microcode for cmps and scas is written only to support repe/repne. Which makes sense, why bother with rep?
Unknown: I'm quite sure that I wrote some sprite blitting code for a 286 or 486 that used rep movs to copy the pixels. But it was useless when you wanted to support a transparent color. (Or you would have to preprocess your sprite to spilt lines into opaque and transparent spans).
Rep stosw was definitely useful for clearing the screen. I'm quite sure I've used rep movsw to copy a back buffer to the screen.
Overall, the movs and stos instructions were pretty popular with me at least :)
Small typo in 8-bit division example: the remainder is 0x21. (I only noticed b/c remainder should be smaller than 0x34 :)
I read all your posts; very detailed and informative. Keep up the good work!