<h1>The first microcomputer: The transfluxor-powered Arma Micro Computer from 1962</h1>
<p>What would you say is the first microcomputer?<span id="fnref:firsts"><a class="ref" href="#fn:firsts">1</a></span> The Apple I from 1976? The Altair 8800 from 1974?
Perhaps the lesser-known <a href="https://en.wikipedia.org/wiki/Micral">Micral N</a> (1973) or <a href="https://www.tomshardware.com/desktops/house-cleaners-find-two-of-the-worlds-first-desktop-pcs-in-random-boxes">Q1</a> (1972)?
How about the Arma Micro Computer from way back in 1962?
The Arma Micro Computer was a compact 20-pound transistorized computer, designed for applications in space
such as inertial or celestial navigation, steering, radar, or engine control.</p>
<p>Obviously, the Arma Micro Computer is not a microcomputer according to modern definitions, since its processor
was made from discrete components.
But it's an interesting computer in many ways.
First, it is an example of the aerospace computers of the 1960s, advanced systems that are now almost
entirely forgotten.
People think of 1960s computers as room-filling mainframes, but there was a whole separate world of
cutting-edge miniaturized aerospace computers.
(Taking up just 0.4 cubic feet, the Arma Micro Computer was smaller than an Apple II.)
Second, the Arma Micro Computer used strange components such as transfluxors and had an unusual
22-bit serial architecture.
Finally, the Arma Micro Computer evolved into a series of computers used on Navy ships and submarines,
the E-2C Hawkeye airborne early warning plane, the Concorde, and even Air Force One.</p>
<h2>The Arma Micro Computer</h2>
<p><a href="https://static.righto.com/images/arma-microcomputer/arma_1.png"><img alt="The Arma Micro Computer, with a circuit board on top. Click this image (or any other) for a larger version. Photo courtesy of Daniel Plotnick." class="hilite" height="403" src="https://static.righto.com/images/arma-microcomputer/arma_1-w500.png" title="The Arma Micro Computer, with a circuit board on top. Click this image (or any other) for a larger version. Photo courtesy of Daniel Plotnick." width="500" /></a><div class="cite">The Arma Micro Computer, with a circuit board on top. Click this image (or any other) for a larger version. Photo courtesy of Daniel Plotnick.</div></p>
<p>The Micro Computer used 22-bit words, which may seem like a strange size from the modern perspective.
But there's no inherent need for a word size to be a power of 2.
In particular, the Micro Computer was designed for mathematical calculations, not dealing with 8-bit characters.
The word size was selected to provide enough accuracy for its navigational tasks.</p>
<p>Another strange aspect of the Micro Computer is that it was a serial machine,
sequentially operating on one bit of a word at a time.<span id="fnref:spaceborne"><a class="ref" href="#fn:spaceborne">2</a></span>
This approach was often used in early machines because it substantially reduced the amount of hardware
required: it only needs a 1-bit data bus and a 1-bit ALU.
The downside is that a serial machine is much slower because each 22-bit word takes 22 clock cycles
(plus 5 cycles of overhead).
As a result, the Micro Computer executed just 36,000 operations per second, despite its 1-megahertz clock
speed.</p>
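<p>The throughput follows from the cycle counts given above. A back-of-the-envelope sketch (figures from the text; the result lands near the quoted 36,000 operations per second, so the 5-cycle overhead presumably varied slightly by instruction):</p>

```python
# Rough model of serial-machine timing: each operation shifts a full
# word through the 1-bit ALU, one bit per clock cycle, plus overhead.
CLOCK_HZ = 1_000_000   # 1 MHz bit clock
WORD_BITS = 22         # 22-bit words
OVERHEAD_CYCLES = 5    # per-instruction overhead from the article

cycles_per_operation = WORD_BITS + OVERHEAD_CYCLES  # 27 cycles
ops_per_second = CLOCK_HZ / cycles_per_operation

print(f"{ops_per_second:.0f} operations/second")  # prints "37037 operations/second"
```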
<p><a href="https://static.righto.com/images/arma-microcomputer/arma-shopping.jpg"><img alt="Ad for the Arma Micro Computer (called the MICRO here). Source: Electronics, July 27, 1962." class="hilite" height="732" src="https://static.righto.com/images/arma-microcomputer/arma-shopping-w400.jpg" title="Ad for the Arma Micro Computer (called the MICRO here). Source: Electronics, July 27, 1962." width="400" /></a><div class="cite">Ad for the Arma Micro Computer (called the MICRO here). Source: <a href="https://www.worldradiohistory.com/Archive-Electronics/60s/62/Electronics-1962-07-27.pdf#page=76">Electronics</a>, July 27, 1962.</div></p>
<p>The Micro Computer had a small instruction set of 19 instructions.<span id="fnref:instructions"><a class="ref" href="#fn:instructions">3</a></span>
It included multiply, divide, and square root, instructions that weren't implemented in early microprocessors.
This illustrates how early microprocessors were a significant step backward in functionality.
Moreover, the multiply, divide, and square root instructions used a separate arithmetic unit, so they could
execute in parallel with other arithmetic instructions.
Because the Micro Computer needed to interact with spacecraft systems, it had a focus on I/O, with 120 digital inputs or outputs, configured as needed for a particular mission.</p>
<h2>Circuits</h2>
<p>The Micro Computer was built from silicon transistors and diodes, using diode-transistor logic.
The construction technique was somewhat unusual.
The basic circuits were the flip-flop, the complementary buffer (i.e. an inverter), and the diode gate.
Each basic circuit was constructed on a small wafer, 0.77 inches on a side.<span id="fnref:wafers"><a class="ref" href="#fn:wafers">5</a></span>
The photo below shows wafers for a two-transistor flip-flop and two diode gates.
Each wafer had up to 16 connection tabs on the edges.
These wafers are analogous to integrated circuits, but constructed from discrete components.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/arma_6.png"><img alt="Three circuit modules from the Arma Micro Computer. Image from "The Arma Micro Computer for Space Applications"." class="hilite" height="499" src="https://static.righto.com/images/arma-microcomputer/arma_6-w300.png" title="Three circuit modules from the Arma Micro Computer. Image from "The Arma Micro Computer for Space Applications"." width="300" /></a><div class="cite">Three circuit modules from the Arma Micro Computer. Image from "The Arma Micro Computer for Space Applications".</div></p>
<p>The wafers were mounted on printed circuit boards, with up to 22 wafers on a board.
Pairs of boards were mounted back to back with polyurethane foam between the boards to form a "sandwich", which
was conformally coated.
The result was a module that was protected against the harsh environment of a missile or spacecraft.
The computer could handle a shock of 100 g's and temperatures of 0°C to 85°C as well
as 100% humidity or a vacuum.</p>
<p>Because the Micro Computer was a serial machine, its bits were constantly moving.
For register storage such as the accumulator, it used six magnetostrictive torsional delay lines, storing
a sequence of bits as physical twists that formed pulses racing through a long coil of wire.</p>
<p>The photo below shows the Arma Micro Computer with the case removed.
If you look closely, you can see the 22 small circuit wafers mounted on each printed circuit board.
The memory driver boards and delay lines are towards the back,
spaced more widely than the other printed circuit boards.
The cable harness underneath the boards provides the connections between boards.<span id="fnref:block-diagram"><a class="ref" href="#fn:block-diagram">4</a></span></p>
<p><a href="https://static.righto.com/images/arma-microcomputer/arma_10.png"><img alt="Circuit boards inside the Arma Micro Computer. Photo courtesy of Daniel Plotnick." class="hilite" height="297" src="https://static.righto.com/images/arma-microcomputer/arma_10-w500.png" title="Circuit boards inside the Arma Micro Computer. Photo courtesy of Daniel Plotnick." width="500" /></a><div class="cite">Circuit boards inside the Arma Micro Computer. Photo courtesy of Daniel Plotnick.</div></p>
<h2>Transfluxors</h2>
<p>One of the most unusual parts of the Micro Computer was its storage.
Computers at the time typically used magnetic core memory, with each bit stored in a tiny ferrite ring,
magnetized either clockwise or counterclockwise to store a 0 or 1.
One drawback of standard core memory was that the process of reading a core also cleared the core, requiring
data to be written back after a read.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/storage-grid.jpg"><img alt="Diagram of Arma's memory system. From patent 3048828." class="hilite" height="346" src="https://static.righto.com/images/arma-microcomputer/storage-grid-w500.jpg" title="Diagram of Arma's memory system. From patent 3048828." width="500" /></a><div class="cite">Diagram of Arma's memory system. From <a href=https://patents.google.com/patent/US3048828)">patent 3048828</a>.</div></p>
<p>The Micro Computer used ferrite cores, but these were "two-aperture" cores, with a larger hole and a smaller hole,
as shown above.
Data is written to the "major aperture" and read from the "minor aperture".
Although the minor aperture switches state and is erased during a read, the major aperture retains the bit,
allowing the minor aperture to be switched back to its original state.
Thus, unlike regular core memory, transfluxors don't lose their data when reading.</p>
<p>The resulting system is called non-destructive readout (NDRO), compared to the destructive readout (DRO) of
regular core memory.<span id="fnref:transfluxors"><a class="ref" href="#fn:transfluxors">6</a></span>
The Micro Computer used non-destructive readout memory to ensure that the
program memory remained uncorrupted.
In contrast, if a program is stored in regular core memory, each instruction must be written back as it is executed,
creating the possibility that a transient could corrupt the software.
By using transfluxors, this possibility of error is eliminated.
(In either case, core memory has the convenient property that data is preserved when power is
removed, since data is stored magnetically. With modern semiconductor memory, you
lose data when the power goes off.)</p>
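<p>The behavior of the two-aperture core can be illustrated with a toy model (a deliberate simplification, with boolean state standing in for magnetic flux; the class and method names are mine, not Arma's terminology):</p>

```python
# Toy model of a transfluxor bit cell. The major aperture holds the
# stored bit; a read disturbs the minor aperture, which is then
# restored from the major aperture, so readout is non-destructive.
class Transfluxor:
    def __init__(self):
        self.major = 0   # bit written through the major aperture
        self.minor = 0   # read state, derived from the major aperture

    def write(self, bit):
        self.major = bit
        self.minor = bit

    def read(self):
        sensed = self.minor
        self.minor = 0            # reading erases the minor aperture...
        self.minor = self.major   # ...but it is restored from the major one
        return sensed

cell = Transfluxor()
cell.write(1)
assert cell.read() == 1
assert cell.read() == 1   # still 1: non-destructive readout
```

<p>A destructive-readout core, by contrast, would be left holding 0 after the first read until the data was written back.</p>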
<p>The photo below shows a compact transfluxor-based storage module used in the Micro Computer, holding 512 words.
In total, the computer could hold up to 7808 words of program memory and 256 words of data memory.
It appears that transfluxors didn't live up to their promise, since most computers used regular core
memory until semiconductor memory took over in the early 1970s.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/arma_9.png"><img alt="Transfluxor-based core memory module from the Arma Micro Computer. Image from "The Arma Micro Computer for Space Applications"." class="hilite" height="438" src="https://static.righto.com/images/arma-microcomputer/arma_9-w500.png" title="Transfluxor-based core memory module from the Arma Micro Computer. Image from "The Arma Micro Computer for Space Applications"." width="500" /></a><div class="cite">Transfluxor-based core memory module from the Arma Micro Computer. Image from "The Arma Micro Computer for Space Applications".</div></p>
<h2>Arma's history and the path to the Micro Computer</h2>
<p>The Arma Engineering Company was founded in 1918 and built advanced military equipment.<span id="fnref:capitalization"><a class="ref" href="#fn:capitalization">7</a></span>
Its first product was a searchlight for the Navy, followed by a gyroscopic compass and analog computers for naval gun targeting.
In 1939, Arma produced the Torpedo Data Computer, a remarkable electromechanical analog computer.
US submarines used this computer to track target ships and automatically aim torpedoes.
The Torpedo Data Computer performed complex trigonometric calculations and integration to account for the motion of
the target ship and the submarine.
While the Torpedo Data Computer performed well, the Navy's Mark 14 torpedo had many <a href="https://en.wikipedia.org/wiki/Mark_14_torpedo#Problems">problems</a>—running too deep, exploding too soon, or failing to explode—making torpedoes often ineffectual even with a perfect hit.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/tdc.jpg"><img alt="The Torpedo Data Computer Mark III in the USS Pampanito." class="hilite" height="580" src="https://static.righto.com/images/arma-microcomputer/tdc-w400.jpg" title="The Torpedo Data Computer Mark III in the USS Pampanito." width="400" /></a><div class="cite">The Torpedo Data Computer Mark III in the USS Pampanito.</div></p>
<p>Arma underwent major corporate changes due to World War II.
Before the war, the German-owned Bosch Company built vehicle starters and aircraft magnetos in the United States.
When the US entered World War II in 1941, the government was concerned that a German-controlled company was manufacturing key military hardware, so the Office of Alien Property Custodian took over the Bosch plant.
In 1948, the banking group that controlled Arma bought Bosch from the Office of the Alien Property Custodian, merging them into the American Bosch Arma Corporation (AMBAC).<span id="fnref:history"><a class="ref" href="#fn:history">8</a></span>
(Arma had earlier received the rights to gyrocompass technology from the German Anschutz company, <a href="https://archive.navalsubleague.org/1999/short-history-of-arma-gyrocompass-messrs-pekelney-lindell">seized</a> by
the Navy after World War I, so Arma benefitted twice from wartime government seizures.)</p>
<p>In the mid-1950s, Arma moved into digital computers, building an inertial guidance computer for the Atlas nuclear missile program.
America's first ICBM was the <a href="https://en.wikipedia.org/wiki/SM-65_Atlas">Atlas missile</a>, which became operational in 1959.
The first Atlas missiles used radio guidance from the launch site to direct the missile. Since radio signals could be jammed by the enemy, this wasn't a robust solution. </p>
<!-- Atlas details https://www.hsdl.org/?view&did=485671 Originally for Titan but redirected to Atlas.
Navigation system located in Atlas bulge https://www.designation-systems.net/dusrm/m-16.html
-->
<p>The solution to missile guidance was an <a href="https://en.wikipedia.org/wiki/Inertial_navigation_system">inertial navigation system</a>. By using sensitive gyroscopes and accelerometers, a missile could continuously track its position and velocity without any external input, making it unjammable.
A key developer of this system was Arma's <a href="https://en.wikipedia.org/wiki/Wen_Tsing_Chow">Wen Tsing Chow</a>,
one of the driving forces behind digital aviation computers.
He faced extreme skepticism in the 1950s for the idea of putting a computer in a missile.
One general mocked him, asking "Where are you going to put the five Harvard professors you'll need to keep it running?"
But computerized navigation was successful and
in 1961, the Atlas missile was updated to use the Arma inertial guidance computer.
It was said to be the first production airborne digital computer.<span id="fnref:tradic"><a class="ref" href="#fn:tradic">9</a></span>
Wen Tsing Chow also invented the programmable read-only memory (PROM), allowing missile targeting information to be programmed into a computer outside the factory.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/wtc.png"><img alt="Wen Tsing Chow, computer engineer, with Arma Micro Computer. From Control Engineering, January 1963, page 19. Courtesy of Daniel Plotnick." class="hilite" height="464" src="https://static.righto.com/images/arma-microcomputer/wtc-w500.png" title="Wen Tsing Chow, computer engineer, with Arma Micro Computer. From Control Engineering, January 1963, page 19. Courtesy of Daniel Plotnick." width="500" /></a><div class="cite">Wen Tsing Chow, computer engineer, with Arma Micro Computer. From Control Engineering, January 1963, page 19. Courtesy of Daniel Plotnick.</div></p>
<p>The photo below shows the Atlas ICBM's guidance system. The Arma W-107A computer is at the top and the
gyroscopes are in the middle.
This computer was an 18-bit serial machine running at 143.36 kHz. It ran a hard-wired program that integrated the accelerometer
information and solved equations for the crossrange error function, range error function, and gravity,
making these computations every half second.<span id="fnref:heiderstadt"><a class="ref" href="#fn:heiderstadt">10</a></span>
The computer weighed 240 pounds and consumed 1000 watts.
The computer contained about 36,000 components: discrete transistors, diodes, resistors, and capacitors
mounted on 9.5" × 6.5" printed-circuit boards.
On the ground, the computer was air-cooled to
55 °F, but there was no cooling after launch as the computer only operated for five minutes
of powered flight and wouldn't overheat during that time.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/AtlasInertialGuidanceSystem-Heiderstadt-3.png"><img alt="Guidance system for Atlas ICBM. From "Atlas Inertial Guidance System" by John Heiderstadt. Photo unclassified in 1967." class="hilite" height="630" src="https://static.righto.com/images/arma-microcomputer/AtlasInertialGuidanceSystem-Heiderstadt-3-w400.png" title="Guidance system for Atlas ICBM. From "Atlas Inertial Guidance System" by John Heiderstadt. Photo unclassified in 1967." width="400" /></a><div class="cite">Guidance system for Atlas ICBM. From "Atlas Inertial Guidance System" by John Heiderstadt. Photo unclassified in 1967.</div></p>
<p>The Atlas wasn't originally designed for a computerized guidance system so
there wasn't room inside the missile for the computer.
To get around this, a large pod was stuck on the side of the missile to hold the computer and gyroscopes, as
indicated in the photo below.
This doesn't look aerodynamic, but I guess it worked.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/atlas.jpg"><img alt="Atlas missile. Arrow indicates the pod containing the Arma guidance computer and inertial navigation system. Original photo by Robert DuHamel, CC BY-SA 3.0." class="hilite" height="553" src="https://static.righto.com/images/arma-microcomputer/atlas-w300.jpg" title="Atlas missile. Arrow indicates the pod containing the Arma guidance computer and inertial navigation system. Original photo by Robert DuHamel, CC BY-SA 3.0." width="300" /></a><div class="cite">Atlas missile. Arrow indicates the pod containing the Arma guidance computer and inertial navigation system. Original photo by <a href="https://commons.wikimedia.org/wiki/File:Atlas_Missile_-_panoramio_-_Robert_DuHamel.jpg">Robert DuHamel</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0/deed.en">CC BY-SA 3.0</a>.</div></p>
<p>The Atlas guidance computer (left, below) consisted of three aluminum sections called "decks".
The top deck held two replaceable target constant units, each
providing 54 navigation constants that specified a target.
The constants were stored in a stack of printed circuit boards 16" × 8" × 1.5", covered in over a thousand diodes, Wen Tsing Chow's PROM memory.
A target was programmed into the stack by a rack of equipment that would selectively burn out diodes, changing
the corresponding bit to a 1.
(This is why programming a PROM is referred to as "burning the PROM".<span id="fnref:burning"><a class="ref" href="#fn:burning">11</a></span>)
The diode matrix was later replaced with a transfluxor memory array, which had the advantage that it could
be reprogrammed as necessary.
The top deck also had connectors for the accelerometer inputs, the outputs, and connections for ground support equipment.
The bottom deck had power connectors for 28 volts DC and 115V 400 Hz 3-phase AC.
In the bottom deck, quartz delay lines were used for storage, representing bits as acoustic waves.
Twelve circuit cards, each with a faceted quartz block four inches in diameter,
provided a total of 32 words of storage.</p>
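<p>The one-way nature of diode-matrix PROM programming can be sketched in a few lines (an illustrative model, not Arma's actual circuitry; it follows the article's convention that burning out a diode changes the corresponding bit to a 1):</p>

```python
# Hypothetical model of a diode-matrix PROM: every bit starts as 0,
# and "burning" a diode irreversibly sets that bit to 1.
class DiodePROM:
    def __init__(self, n_bits):
        self.bits = [0] * n_bits   # intact diodes read as 0

    def burn(self, position):
        # Burning out a diode is irreversible: a bit can go 0 -> 1,
        # but never back. Changing a target meant a fresh diode stack.
        self.bits[position] = 1

    def read(self, position):
        return self.bits[position]

prom = DiodePROM(8)
prom.burn(3)
assert prom.read(3) == 1   # burned bit
assert prom.read(0) == 0   # untouched bit
```

<p>The transfluxor array that later replaced the diode matrix removed exactly this limitation: the stored constants could be rewritten instead of being burned in permanently.</p>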
<p><a href="https://static.righto.com/images/arma-microcomputer/Arma-Three-Generations.png"><img alt="Three generations of Arma Computers: the W-107A Atlas ICBM guidance computer, the Lightweight Airborne Digital Computer, and the Arma Micro Computer (perhaps a prototype). Photo courtesy of Daniel Plotnick." class="hilite" height="334" src="https://static.righto.com/images/arma-microcomputer/Arma-Three-Generations-w500.png" title="Three generations of Arma Computers: the W-107A Atlas ICBM guidance computer, the Lightweight Airborne Digital Computer, and the Arma Micro Computer (perhaps a prototype). Photo courtesy of Daniel Plotnick." width="500" /></a><div class="cite">Three generations of Arma Computers: the W-107A Atlas ICBM guidance computer, the Lightweight Airborne Digital Computer, and the Arma Micro Computer (perhaps a prototype). Photo courtesy of Daniel Plotnick.</div></p>
<p>Arma considered the Micro Computer the third generation of its airborne computers.
The first generation was the Atlas guidance computer, constructed from germanium transistors and diodes
(in the pre-silicon era).
The second-generation computer moved to silicon transistors and diodes.
The third-generation computers still used discrete components, but mounted on the small square wafers.
The third generation also had a general-purpose architecture and programmable transfluxor memory
instead of a hard-wired program.</p>
<!-- Micro C info https://books.google.com/books?id=1G0t_0MviUMC&pg=RA11-PA2&dq=%22arma%22+%22micro+c%22+%22computer%22&hl=en&newbks=1&newbks_redir=0&sa=X&ved=2ahUKEwjnqOPL-pCEAxWblGoFHYC1DA0Q6AF6BAgJEAI#v=onepage&q=%22arma%22%20%22micro%20c%22%20%22computer%22&f=false -->
<h2>After the Micro Computer</h2>
<p>Arma continued to develop computers, improving the Arma Micro Computer.
The Micro C computer (1965) was developed for Navy ships and submarines.
Much like the original Micro, the Micro C used transfluxor storage, but increased the clock frequency to
972 kHz.
The computer was much larger: 3.87 cubic feet and 150 pounds.
This <a href="https://nsarchive.gwu.edu/document/16941-office-naval-research-mathematical-science">description</a> states that "the machine is an outgrowth of the ARMA product line of micro computers and is logically and
electrically similar to micro-computers designed for missile environments."</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/micro-c.png"><img alt="Module from the Arma Micro-C Computer. Photo courtesy of Daniel Plotnick." class="hilite" height="313" src="https://static.righto.com/images/arma-microcomputer/micro-c-w400.png" title="Module from the Arma Micro-C Computer. Photo courtesy of Daniel Plotnick." width="400" /></a><div class="cite">Module from the Arma Micro-C Computer. Photo courtesy of Daniel Plotnick.</div></p>
<p>In mid-1966, Arma introduced the Micro D computer, built from TTL integrated circuits. Like the original Micro, this computer was serial, but
the Micro D had a word length of 18 bits and ran at 1.5 MHz. It weighed 5.25 pounds and was very compact, just 0.09 ft<sup>3</sup>.
Instead of transfluxors, the Micro D used regular magnetic core memory, 4K to 31K words.</p>
<!-- Micro-D: https://www.bitsavers.org/pdf/bellcomm/TM67-1031-1_State_Of_The_Art_Of_Aerospace_Digital_Computers_62-67_Jun67.pdf#page=20
Characteristics: https://www.osti.gov/servlets/purl/4623576#page=10
Also https://ntrs.nasa.gov/api/citations/19690025249/downloads/19690025249.pdf
-->
<p><a href="https://static.righto.com/images/arma-microcomputer/Arma-Micro-D-1801.jpg"><img alt="The Arma Micro-D 1801 computer. The 1808 was a slightly larger model. Photo courtesy of Daniel Plotnick." class="hilite" height="296" src="https://static.righto.com/images/arma-microcomputer/Arma-Micro-D-1801-w500.jpg" title="The Arma Micro-D 1801 computer. The 1808 was a slightly larger model. Photo courtesy of Daniel Plotnick." width="500" /></a><div class="cite">The Arma Micro-D 1801 computer. The 1808 was a slightly larger model. Photo courtesy of Daniel Plotnick.</div></p>
<p>The widely-used Litton LTN-51 inertial navigation system was built around the Arma Micro-D computer.<span id="fnref:ltn-51"><a class="ref" href="#fn:ltn-51">12</a></span>
This navigation system was designed for commercial aircraft, but was also used for military applications,
ships, and NASA aircraft.
Aircraft from early Concordes to Air Force One used the LTN-51 for navigation.
The photo below shows a navigation unit with the Arma Micro-D computer in the lower left and the gyroscope
unit on the right.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/ltn-51.jpg"><img alt="Litton LTN-51 inertial navigation system. Photo courtesy of pascal mz, concordescopia.com." class="hilite" height="385" src="https://static.righto.com/images/arma-microcomputer/ltn-51-w700.jpg" title="Litton LTN-51 inertial navigation system. Photo courtesy of pascal mz, concordescopia.com." width="700" /></a><div class="cite">Litton LTN-51 inertial navigation system. Photo courtesy of pascal mz, <a href="http://concordescopia.com">concordescopia.com</a>.</div></p>
<p>In early 1968, the Arma Portable Micro D was introduced, a 14-pound battery-powered computer also called the Celestial Data Processor.
This handheld computer was designed for navigation in crewed earth orbital flight, determining orbital parameters from stadimeter and sextant
measurements performed by astronauts.
As far as I can tell, this computer never made it beyond the prototype stage.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/cdp.jpg"><img alt="The Arma Celestial Data Processor (source)." class="hilite" height="561" src="https://static.righto.com/images/arma-microcomputer/cdp-w350.jpg" title="The Arma Celestial Data Processor (source)." width="350" /></a><div class="cite">The Arma Celestial Data Processor (<a href="https://www.sciencedirect.com/science/article/pii/S1474667017688207">source</a>).</div></p>
<h2>Conclusions</h2>
<p>The Arma Micro Computer is just one of the dozens of compact aerospace computers of the 1960s, a category that
is mostly forgotten and ignored.
Another example is the Delco <a href="https://www.righto.com/2020/03/the-delco-magic-line-of-aerospace.html">MAGIC I</a> (1961),
said to be the
"first complete airborne computer to have its logic functions mechanized exclusively with integrated circuits".
IBM's 4 Pi series started in 1966 and was used in many systems from the F-15 to the Space Shuttle.
By 1968, denser MOS/LSI chips were used in general-purpose aerospace computers such as the
<a href="https://dl.acm.org/doi/pdf/10.1145/1476589.1476701">Rockwell MOS GP</a> and the
Texas Instruments <a href="https://www.bitsavers.org/magazines/Datamation/19700715.pdf#page=92">Model 2502</a> LSI Computer.
<span id="fnref:aerospace"><a class="ref" href="#fn:aerospace">13</a></span></p>
<!--
The Burroughs D210 Magnetic Computer (1961) used magnetic logic as well as [rope memory](http://www.bitsavers.org/pdf/afips/1963-11_%2324.pdf#page=52), read-only memory made more famous by the Apollo Guidance Computer.
The SAAB CK 37 (mid-1964) was a 26-bit computer built from integrated circuits, becoming the
[central computer](https://www.datasaab.se/Papers/Articles/Viggenck37.pdf) for the Saab 37 Viggen combat aircraft.
-->
<p>Arma also illustrates that a company can be on the cutting edge of technology for decades and then suddenly go out
of business and be forgotten.
After some struggles, Arma was acquired by United Technologies in 1978 for $210 million and was then shut down in 1982.
(The German Bosch corporation remains, now a large multinational known for products such as dishwashers, auto parts, and power tools.)
Looking at a <a href="https://ntrs.nasa.gov/api/citations/19690025249/downloads/19690025249.pdf#page=8">list of aerospace computers</a> shows many innovative but vanished companies: Univac, Burroughs, Sperry (now all Unisys), AC Electronics (now part of Raytheon), Autonetics (acquired by Boeing), RCA (bought by GE), and TRW (acquired by Northrop Grumman).</p>
<p>Finally, the Micro Computer illustrates that terms such as "microcomputer" are not objective categories but
are social constructs.
At first, it seems obvious that the Arma Micro Computer is not a real microcomputer. If you consider a microcomputer
to be a computer built around a microprocessor, that's true. (Although "microprocessor" is also
<a href="https://spectrum.ieee.org/the-surprising-story-of-the-first-microprocessors">not as clear</a> as you might think.)
But a microcomputer can also be defined as "A small computer that includes one or more input/output units
and sufficient memory to execute instructions" (according to the IBM Dictionary of Computing, 1994)<span id="fnref:ibm"><a class="ref" href="#fn:ibm">14</a></span>
and the Arma Micro Computer meets that definition.
The "microcomputer" is a shifting concept, changing from the 1960s to the 1990s to today.</p>
<p>For more,
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Thanks to Daniel Plotnick for providing a great deal of information and photos.
Thanks to John Hartman for obtaining an obscure conference proceedings for me.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:firsts">
<p>I should mention the danger of "firsts" from a historical perspective.
Historian Michael Williams advised "not to use the word 'first'" and said, "If you add enough adjectives to a description you can always claim your own favorite."
(See <a href="https://amzn.to/3wnsf4H">ENIAC in Action</a>, p7.)</p>
<!--
Computer Historian Michael Mahoney also talks about "firsts" in "Histories of Computing" (p44),
in particular the risk of "retrospective judgements, where the first X had always been known as a Y."
-->
<p>The first usage of "micro-computer" that I could find is from 1956. In Isaac Asimov's short story
"The Dying Night", he mentions a "micro-computer" in passing: "In recent years, it [the handheld scanner] had become the hallmark
of the scientist, much as the stethoscope was that of the physician and the micro-computer that of
the statistician."</p>
<p>Another interesting example of a "micro-computer" is
the Texas Instruments Semiconductor Network Computer.
This palm-sized computer is often considered the first integrated-circuit computer.
It was an 11-bit serial computer running at 100 kHz, built out of RS flip-flops, NOR gates, and
logic drivers.
The 1961 article below described this computer as a "micro-computer", although this was a
one-off use of the term, not the computer's name.
This <a href="https://s3data.computerhistory.org/brochures/ti.molecular.1961.102646283.pdf">brochure</a> describes
the Semiconductor Network Computer in more detail and
Semiconductor Networks are described in detail in <a href="https://www.worldradiohistory.com/Archive-Electronics/60s/60/Electronics-1960-05-13.pdf#page=67">this article</a>.
Unlike modern ICs, these integrated circuits used flying wires for internal connections rather than
a deposited metal layer, making their design a dead end.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/ti.jpg"><img alt="The Texas Instruments Semiconductor Network Computer. From Computers and Automation, Dec. 1961." class="hilite" height="362" src="https://static.righto.com/images/arma-microcomputer/ti-w500.jpg" title="The Texas Instruments Semiconductor Network Computer. From Computers and Automation, Dec. 1961." width="500" /></a><div class="cite">The Texas Instruments Semiconductor Network Computer. From <a href="https://archive.org/details/sim_computers-and-people_1961-12_10_12/page/85/mode/1up">Computers and Automation</A>, Dec. 1961.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:firsts" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:spaceborne">
<p>Most of the information on the Arma Micro Computer in this article is from
"The Arma Micro Computer for Space Applications", by E. Keonjian and J. Marx,
Spaceborne Computing Engineering Conference, 1962, pages 103-116. <a class="footnote-backref" href="#fnref:spaceborne" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:instructions">
<p>The Arma Micro Computer's instruction set consisted of 19 22-bit instructions, shown below.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/instructions.png"><img alt="Instruction set of the Arma Micro Computer. Figure from "The Arma Micro Computer for Space Applications"." class="hilite" height="562" src="https://static.righto.com/images/arma-microcomputer/instructions-w450.png" title="Instruction set of the Arma Micro Computer. Figure from "The Arma Micro Computer for Space Applications"." width="450" /></a><div class="cite">Instruction set of the Arma Micro Computer. Figure from "The Arma Micro Computer for Space Applications".</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:instructions" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:block-diagram">
<p>This block diagram shows the structure of the Micro Computer.
The accumulator register (AC) is used for all data transfers as well as addition and subtraction.
The multiply-divide register is used for multiplication, division, and square roots.
The product register (PR), quotient register (QR), and square root register (SR) are used by the
corresponding instructions.
The data buffer register (S) holds data moving in or out of storage; it is shown with two 11-bit parts.</p>
<p><a href="https://static.righto.com/images/arma-microcomputer/arma_2.png"><img alt="Block diagram of the Arma Micro Computer. Figure from "The Arma Micro Computer for Space Applications"." class="hilite" height="468" src="https://static.righto.com/images/arma-microcomputer/arma_2-w600.png" title="Block diagram of the Arma Micro Computer. Figure from "The Arma Micro Computer for Space Applications"." width="600" /></a><div class="cite">Block diagram of the Arma Micro Computer. Figure from "The Arma Micro Computer for Space Applications".</div></p>
<p>For control logic, the location counter (L) is the 13-bit program counter.
For a subroutine call, the current address can be stored in the recall register (RR), which acts as a
link register to hold the return address. (The RR is not shown on the diagram because it is held in memory.)
Instruction decoding uses the instruction register (I), with the next instruction in the instruction
buffer (B). The operand register (P) contains the 13-bit address from an instruction, while the
remaining register (R) is used for I/O addressing. <a class="footnote-backref" href="#fnref:block-diagram" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:wafers">
<p>Arma's <a href="https://www.electronicdesign.com/technologies/industrial/boards/article/21772791/microminiature-guidance-computer">original plan</a> was to mount circuits on ceramic wafers.
Resistors would be printed onto the wafer and wiring silk-screened.
(This is similar to IBM's SLT modules (1964), although IBM mounted diodes and transistors as bare dies
rather than packaged components.)
However, the Micro Computer ended up using epoxy-glass wafers with small, but discrete components:
standard TO-46 transistors, "fly-speck" diodes, and 1/10 watt resistors.
I don't see much advantage to these wafers over mounting the components directly on the printed-circuit
board; maybe standardization is the benefit. <a class="footnote-backref" href="#fnref:wafers" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:transfluxors">
<p>The Micro Computer used an unusual mechanism to select a word to read or write.
Most computers used a grid of selection wires; by energizing an X and a Y wire at the same time,
the corresponding core was selected.
The key idea of this "coincident-current" approach is that each wire carries half the current necessary to
flip a core, so only the core with both an energized X wire and an energized Y wire receives enough current to flip.
This puts tight constraints on the current level: too much current will flip all the cores along
a wire, while too little will fail to flip the selected core.
What makes this difficult is that the properties of a core change with temperature, so either the cores
need to be temperature-stabilized or the current needs to be adjusted based on the temperature.</p>
<p>The Micro Computer instead used a separate wire for each word, so as long as the current is large enough,
the cores will flip.
This approach avoids the issues with temperature sensitivity, an important concern for a computer that needs
to handle the large temperature swings of a spacecraft, not an air-conditioned data center.
Unfortunately, it requires much more wiring.
Specifically, the large advantage of the coincident-current approach is that an N×N grid of wires
lets you select N<sup>2</sup> words. With the Micro Computer approach, N wires only select N words, so
the scalability is much worse.</p>
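To make the scaling concrete, here is a minimal sketch comparing the two selection schemes. (The 8192-word size is only an illustration, chosen to match the 13-bit addresses mentioned elsewhere; the actual memory capacity isn't stated here.)

```python
import math

def wires_coincident(words):
    # An N x N grid of X and Y select wires addresses N^2 words,
    # so only 2N select wires are needed
    n = math.ceil(math.sqrt(words))
    return 2 * n

def wires_linear_select(words):
    # One select wire per word, as in the Micro Computer's approach
    return words

# For a hypothetical 8192-word memory:
# coincident-current needs 182 wires; linear select needs 8192
```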
<p>For more on Arma's memory systems, see patents: <a href="https://patents.google.com/patent/US3048828A">Memory Device, 3048828</a> and <a href="https://patents.google.com/patent/US3289181A">Multiaperture Core Memory Matrix, 3289181</a>. <a class="footnote-backref" href="#fnref:transfluxors" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:capitalization">
<p>The capitalization of Arma vs. ARMA is inconsistent.
It often appears in all-caps, but both forms are used, sometimes in the same article.
"Arma" is not an acronym; the name came from the names of its founders: Arthur Davis and David Mahood
(source: <a href="https://s3.amazonaws.com/arena-attachments/2260731/51714a4d7f394844dffe73c2a4351902.pdf?1528090951">Between Human and Machine</a>, p54).
I suspect a 1960s corporate branding effort was responsible for the use of all-caps. <a class="footnote-backref" href="#fnref:capitalization" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:history">
<p>For more on the corporate history of Arma, see <a href="https://www.ieee.li/pulse/pulse_1958_03.pdf">IRE Pulse</a>, March 1958, p9-10. Details of corporate politics and what went wrong are <a href="https://downunderpatriot.blogspot.com/2014/11/arma-division-american-bosch-arma.html">here</a>.
More information on the financial ups and downs of Arma is in "Charles Perelle's Spacemanship", Fortune, January 1959, an article that focused
on Charles Perelle, the president of American Bosch Arma. <a class="footnote-backref" href="#fnref:history" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:tradic">
<p>Wikipedia <a href="https://en.wikipedia.org/wiki/Wen_Tsing_Chow#:~:text=production%20airborne%20digital%20computer">says</a> that Arma's guidance computer was "the first production airborne digital computer".
However, the Hughes Digitair (1958) has also been <a href="https://www.rfcafe.com/references/radio-news/airborne-digital-computer-radio-tv-news-february-1958.htm">called</a>
"the first airborne digital computer in actual production."
Another <a href="https://web.archive.org/web/20110722175532/http://www.afspc.af.mil/shared/media/document/AFD-100405-081.pdf">source</a> says the Arma computer was the "first all-solid-state, high-reliability, space-borne digital computer."
The <a href="https://en.wikipedia.org/wiki/TRADIC">TRADIC</a> (Transistorized Airborne Digital Computer) (1954)
was earlier, but was a prototype system, not a production system.
In turn, the TRADIC is said by some to be the first fully transistorized computer, but that depends on exactly
how you interpret "fully".</p>
<p>This is another example of how the "first" depends on the specific adjectives used. <a class="footnote-backref" href="#fnref:tradic" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:heiderstadt">
<p>The information on the Arma W-107A computer is from "Atlas Inertial Guidance System: As I Remember It"
by Principal Engineer John Heiderstadt. <a class="footnote-backref" href="#fnref:heiderstadt" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:burning">
<p>Wen Tsing Chow's <a href="https://patents.google.com/patent/US3028659A">PROM patent</a>
discusses the term "burning", explaining that it refers to burning out the diodes electrically.
To widen the patent, he clarifies that "The term 'blowing out' or 'burning out' further includes
any process which, by means less drastic than actual destruction of the non-linear elements, effects
a change of the circuit impedance to a level which makes the particular circuit inoperative."
This description prevented someone from trying to get around the patent by stating that nothing was <em>really</em>
burning. <a class="footnote-backref" href="#fnref:burning" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:ltn-51">
<p>Details on the LTN-51 navigation system and its uses are in <a href="https://www.osti.gov/servlets/purl/4623576#page=8">this document</a>. <a class="footnote-backref" href="#fnref:ltn-51" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
<li id="fn:aerospace">
<p>For more information on early aerospace computers, see
<a href="https://bitsavers.org/pdf/bellcomm/TM67-1031-1_State_Of_The_Art_Of_Aerospace_Digital_Computers_62-67_Jun67.pdf">State-of-the-art of Aerospace Digital Computers</a> (1967), updated as <a href="https://ntrs.nasa.gov/api/citations/19690025249/downloads/19690025249.pdf">Trends in Aerospace Digital Computer Design</a> (1969).
Also see the 1970 <a href="https://www.bitsavers.org/magazines/Datamation/19700715.pdf#page=89">Survey of Military CPUs</a>.
<a href="https://dl.acm.org/doi/10.1145/1476589.1476699">Efficient partitioning for the batch-fabricated fourth generation computer</a> (1968) discusses how "The computer industry is on the verge of an upheaval" from new hardware
including LSI and fast ROMs, and describes various LSI aerospace computers. <a class="footnote-backref" href="#fnref:aerospace" title="Jump back to footnote 13 in the text">↩</a></p>
</li>
<li id="fn:ibm">
<p>The "IBM Dictionary of Computing" (1994) has two definitions of "microcomputer":
"(1) A digital computer whose processing unit consists of one or more microprocessors, and includes
storage and input/output facilities. (2) A small computer that includes one or more input/output units
and sufficient memory to execute instructions; for example a personal computer. The essential components of
a microcomputer are often contained within a single enclosure."
The latter definition was from an ISO/IEC draft standard for terminology so it is somewhat "official". <a class="footnote-backref" href="#fnref:ibm" title="Jump back to footnote 14 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com10tag:blogger.com,1999:blog-6264947694886887540.post-13073208399945431512024-02-17T10:11:00.000-08:002024-02-19T13:58:12.267-08:00Inside the mechanical Bendix Air Data Computer, part 5: motor/tachometersThe Bendix Central Air Data Computer (CADC) is an electromechanical analog computer that uses
gears and cams for its mathematics. It was a key part of military planes such as the F-101 and the F-111 fighters,
computing
airspeed, Mach number, and other "air data".
The rotating gears are powered by six small servomotors, so these motors
are in a sense the fundamental component of the CADC.
In the photo below, you can see one of the cylindrical motors near the center, about 1/3 of the way down.</p>
<p>The servomotors in the CADC are unlike standard motors.
Their name—"Motor-Tachometer Generator" or "Motor and Rate Generator"<span id="fnref:part-numbers"><a class="ref" href="#fn:part-numbers">1</a></span>—indicates that each unit contains both a motor and a speed sensor.
Because the motor and generator use two-phase signals, there are a total of eight colorful wires coming out,
many more than a typical motor.
Moreover, the direction of the motor can be controlled, unlike typical AC motors.
I couldn't find a satisfactory explanation of how these units worked, so I bought one and disassembled it.
This article (part 5 of my series on the CADC<span id="fnref:series"><a class="ref" href="#fn:series">2</a></span>) provides a complete teardown of the motor/generator
and explains how it works.</p>
<p><a href="https://static.righto.com/images/bendix-motor/cadc-view.jpg"><img alt="The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version." class="hilite" height="688" src="https://static.righto.com/images/bendix-motor/cadc-view-w400.jpg" title="The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version." width="400" /></a><div class="cite">The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version.</div></p>
<p>The image below shows a closeup of two motors powering one of the pressure signal outputs.
Note the bundles of colorful wires to each motor, entering in two locations.
At the top, the motors drive complex gear trains.
The high-speed motors are geared down by the gear trains to provide much slower rotations with sufficient
torque to power the rest of the CADC's mechanisms.</p>
<p><a href="https://static.righto.com/images/bendix-motor/motors.jpg"><img alt="Two motor/generators in the pressure section of the CADC. The one at the back is mostly hidden." class="hilite" height="376" src="https://static.righto.com/images/bendix-motor/motors-w250.jpg" title="Two motor/generators in the pressure section of the CADC. The one at the back is mostly hidden." width="250" /></a><div class="cite">Two motor/generators in the pressure section of the CADC. The one at the back is mostly hidden.</div></p>
<p>The motor/tachometer that we disassembled is shorter than the ones in the CADC (despite having the same part
number), but the principles are the same.
We started by removing a small C-clip on the end of the motor and unscrewing the end plate.
The unit is pretty simple mechanically. It has bearings at each end for the rotor shaft.
There are four wires for the motor and four wires for the tachometer.<span id="fnref:wiring"><a class="ref" href="#fn:wiring">3</a></span></p>
<p><a href="https://static.righto.com/images/bendix-motor/teardown.jpg"><img alt="The motor disassembled to show the internal components." class="hilite" height="541" src="https://static.righto.com/images/bendix-motor/teardown-w600.jpg" title="The motor disassembled to show the internal components." width="600" /></a><div class="cite">The motor disassembled to show the internal components.</div></p>
<p>The rotor (below) has two parts on the shaft.
The left part is a <a href="https://en.wikipedia.org/wiki/Squirrel-cage_rotor">squirrel-cage rotor</a><span id="fnref:angle"><a class="ref" href="#fn:angle">4</a></span>
for the motor. It consists of conducting bars (light-colored) on an iron core,
all connected at both ends by conductive rings.
The metal drum on the right is used by the tachometer.
Note that there are no electrical connections between the rotor components and the rest of the motor:
there are no brushes or slip rings.
The interaction between the rotor and the windings in the body of the motor is purely magnetic, as will be explained.</p>
<p><a href="https://static.righto.com/images/bendix-motor/rotor.jpg"><img alt="The rotor and shaft." class="hilite" height="223" src="https://static.righto.com/images/bendix-motor/rotor-w600.jpg" title="The rotor and shaft." width="600" /></a><div class="cite">The rotor and shaft.</div></p>
<p>The motor/tachometer contains two cylindrical stators that create the magnetic fields, one for the motor and one for the tachometer.
The photo below shows the motor stator inside the unit after removing the tachometer stator.
The stators are encased in hard green plastic and tightly pressed inside the unit.
In the center, eight metal poles are visible. They direct the magnetic field onto the rotor.</p>
<p><a href="https://static.righto.com/images/bendix-motor/motor-inside.jpg"><img alt="Inside the motor after removing the tachometer winding." class="hilite" height="332" src="https://static.righto.com/images/bendix-motor/motor-inside-w300.jpg" title="Inside the motor after removing the tachometer winding." width="300" /></a><div class="cite">Inside the motor after removing the tachometer winding.</div></p>
<p>The photo below shows the stator for the tachometer, similar to the stator for the motor.
Note the shallow notches that look like black lines
in the body on the lower left.
These are probably adjustments to the tachometer during manufacturing to compensate for imperfections.
The adjustments ensure that the magnetic fields are
nulled out so the tachometer returns zero voltage when stationary.
The metal plate on top shields the tachometer from the motor's magnetic fields.</p>
<p><a href="https://static.righto.com/images/bendix-motor/windings3.jpg"><img alt="The stator for the tachometer." class="hilite" height="286" src="https://static.righto.com/images/bendix-motor/windings3-w300.jpg" title="The stator for the tachometer." width="300" /></a><div class="cite">The stator for the tachometer.</div></p>
<p>The poles and the metal case of the stator look solid, but they are not. Instead, they are formed from a stack of thin
laminations.
The reason to use laminations instead of solid metal is to reduce eddy currents in the metal.
Each lamination is varnished, so it is insulated from its neighbors, preventing the flow of eddy currents.</p>
<p><a href="https://static.righto.com/images/bendix-motor/lamination.jpg"><img alt="One lamination from the stack of laminations that make up the winding. The lamination suffered some damage during disassembly; it was originally round." class="hilite" height="281" src="https://static.righto.com/images/bendix-motor/lamination-w300.jpg" title="One lamination from the stack of laminations that make up the winding. The lamination suffered some damage during disassembly; it was originally round." width="300" /></a><div class="cite">One lamination from the stack of laminations that make up the winding. The lamination suffered some damage during disassembly; it was originally round.</div></p>
<p>In the photo below, I removed some of the plastic to show the wire windings underneath.
The wires look like bare copper, but they have a very thin layer of varnish to insulate them.
There are two sets of windings (orange and blue, or red and black) around alternating metal poles.
Note that the wires run along the pole, parallel to the rotor, and then wrap around the pole at the top and bottom,
forming oblong coils around each pole.<span id="fnref:cross-section"><a class="ref" href="#fn:cross-section">5</a></span>
This generates a magnetic field through each pole.</p>
<p><a href="https://static.righto.com/images/bendix-motor/windings2.jpg"><img alt="Removing the plastic reveals the motor windings." class="hilite" height="288" src="https://static.righto.com/images/bendix-motor/windings2-w300.jpg" title="Removing the plastic reveals the motor windings." width="300" /></a><div class="cite">Removing the plastic reveals the motor windings.</div></p>
<h2>The motor</h2>
<p>The motor part of the unit is a two-phase induction motor with a squirrel-cage rotor.<span id="fnref:servomotor"><a class="ref" href="#fn:servomotor">6</a></span>
There are no brushes or electrical connections to the rotor, and there are no magnets, so it isn't obvious
what makes the rotor rotate.
The trick is the "squirrel-cage" rotor, shown below. It consists of metal bars that are connected at the top
and bottom by rings.
Assume (for now) that the fixed part of the motor, the stator, creates a rotating magnetic
field.
The important principle is that a <em>changing</em> magnetic field will produce a current in a wire loop.<span id="fnref:faraday"><a class="ref" href="#fn:faraday">7</a></span>
As a result, each loop in the squirrel-cage rotor will have an induced current: current will flow up<span id="fnref:signs"><a class="ref" href="#fn:signs">9</a></span> the
bars facing the north magnetic field and down the south-facing bars, with the rings on the end closing the circuits.</p>
<p><a href="https://static.righto.com/images/bendix-motor/squirrel-cage.png"><img alt="A squirrel-cage rotor. The numbered parts are (1) shaft, (2) end cap, (3) laminations, and (4) splines to hold the laminations. Image from Robo Blazek." class="hilite" height="225" src="https://static.righto.com/images/bendix-motor/squirrel-cage-w300.png" title="A squirrel-cage rotor. The numbered parts are (1) shaft, (2) end cap, (3) laminations, and (4) splines to hold the laminations. Image from Robo Blazek." width="300" /></a><div class="cite">A squirrel-cage rotor. The numbered parts are (1) shaft, (2) end cap, (3) laminations, and (4) splines to hold the laminations. Image from <a href="https://commons.wikimedia.org/wiki/File:AM_Klietka.png">Robo Blazek</a>.</div></p>
<p>But how does the stator produce a rotating magnetic field? And how do you control the direction of rotation?
The next important principle is that current flowing through a wire produces a magnetic field.<span id="fnref:ampere"><a class="ref" href="#fn:ampere">8</a></span>
As a result, the currents in the squirrel cage rotor produce a magnetic field perpendicular to the cage.
This magnetic field causes the rotor to turn in the same direction as the stator's magnetic field, driving
the motor.
Because the rotor is powered by the induced currents, the motor is called an induction motor.</p>
<p>The diagram below shows how the motor is wired, with a control winding and a reference winding.
Both windings are powered with AC, but due to the capacitor, the control voltage
either lags or leads the reference voltage by 90°.
Suppose the current through the control winding lags by 90°. First, the reference voltage's sine wave will have a peak, producing
the magnetic field's north pole at A. Next (90° later), the control voltage will peak, producing the north
pole at B. The reference voltage will go negative, producing a south pole at A and thus a north pole at C.
The control voltage will go negative, producing a south pole at B and a north pole at D.
This cycle will repeat, with the magnetic field rotating counter-clockwise from A to D.
Conversely, if the control voltage leads the reference voltage, the magnetic field will rotate clockwise.
This causes the motor to spin in one direction or the other, with the direction controlled by the control voltage.
(The motor has four poles for each winding, rather than the single pole shown below; this increases the torque and
reduces the speed.)</p>
<p><a href="https://static.righto.com/images/bendix-motor/servomotor.jpg"><img alt="Diagram showing the servomotor wiring." class="hilite" height="298" src="https://static.righto.com/images/bendix-motor/servomotor-w350.jpg" title="Diagram showing the servomotor wiring." width="350" /></a><div class="cite">Diagram showing the servomotor wiring.</div></p>
<p>The purpose of the capacitor is to provide the 90° phase shift so the reference voltage and the control voltage
can be driven from the same single-phase AC supply (in this case, 26 volts, 400 hertz).
Switching the polarity of the control voltage reverses the direction of the motor.</p>
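The rotating-field behavior can be sketched numerically. The model below is a simplified two-pole, two-phase approximation (the winding axes and the sampling times are illustrative assumptions, not measurements of the actual motor): with the reference winding on the x-axis and the control winding on the y-axis, the net field angle advances one way when the control current lags and the other way when it leads.

```python
import math

def field_angle(t, control_lags, freq=400.0):
    """Direction of the net stator field for a two-pole, two-phase model."""
    w = 2 * math.pi * freq
    bx = math.sin(w * t)                            # reference winding (x-axis)
    shift = -math.pi / 2 if control_lags else math.pi / 2
    by = math.sin(w * t + shift)                    # control winding (y-axis)
    return math.atan2(by, bx)

# Sample the field direction over a fraction of one 400 Hz cycle
ts = [i * 1e-5 for i in range(1, 20)]
lag_angles = [field_angle(t, control_lags=True) for t in ts]
lead_angles = [field_angle(t, control_lags=False) for t in ts]
# lag_angles increase (counter-clockwise rotation);
# lead_angles decrease (clockwise rotation)
```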
<p>There are a few interesting things about induction motors.
You might expect that the motor would spin at the same rate as the rotating magnetic field.
However, this is not the case.
Remember that a <em>changing</em> magnetic field induces the current in the squirrel-cage rotor.
If the rotor is spinning at the same rate as the magnetic field, the rotor will encounter an unchanging
magnetic field and there will be no current in the bars of the rotor. As a result, the rotor will not generate
a magnetic field and there will be no torque to rotate it.
The consequence is that the rotor must spin somewhat slower than the magnetic field.
This is called "slippage" and is typically a few percent of the full speed, with more slippage as more torque is
required.</p>
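As a worked example of slip (the four pole-pairs and the 3% slip figure are illustrative assumptions, not specifications of the CADC's motors):

```python
def synchronous_speed_rpm(freq_hz, pole_pairs):
    # The field makes one revolution per pole-pair per AC cycle
    return freq_hz * 60 / pole_pairs

def rotor_speed_rpm(freq_hz, pole_pairs, slip):
    # The rotor turns slower than the field by the slip fraction
    return synchronous_speed_rpm(freq_hz, pole_pairs) * (1 - slip)

sync = synchronous_speed_rpm(400, 4)      # 6000 RPM field rotation
loaded = rotor_speed_rpm(400, 4, 0.03)    # with 3% slip under load
```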
<p>Many household appliances use induction motors, but how do they generate a rotating magnetic field from a single-phase AC
winding?
The problem is that the magnetic field in a single AC winding will just flip back and forth, so the motor will
not turn in either direction.
One solution is a <a href="https://en.wikipedia.org/wiki/Shaded-pole_motor">shaded-pole motor</a>, which puts a
copper bar around part of each pole to break the symmetry and produce a weakly rotating magnetic field.
More powerful induction motors use a startup winding with a capacitor (analogous to the control winding).
This winding can either be switched out of the circuit once the motor starts spinning,<span id="fnref:single-phase"><a class="ref" href="#fn:single-phase">10</a></span> or used continuously,
called a permanent-split capacitor (PSC) motor.
The best solution is three-phase power (if available); a three-phase winding automatically produces a rotating magnetic field.</p>
<h2>Tachometer/generator</h2>
<p>The second part of the unit is the tachometer generator, sometimes called the rate unit.<span id="fnref:tachometers"><a class="ref" href="#fn:tachometers">11</a></span>
The purpose of the generator is to produce a voltage proportional to the speed of the shaft.
The unusual thing about this generator is that it produces a 400-hertz output that is either in phase with
the input or 180° out of phase.
This is important because the phase indicates which direction the shaft is turning.
Note that a "normal" generator is different: the output frequency is proportional to the speed.</p>
<p>The diagram below shows the principle behind the generator.
It has two stator windings: the reference coil that is powered at 400 Hz, and the output coil that produces
the output signal.
When the rotor is stationary (A), the magnetic flux is perpendicular to the output coil, so no output voltage
is produced.
But when the rotor turns (B), eddy currents in the rotor distort the magnetic field.
It now couples with the output coil, producing a voltage.
As the rotor turns faster, the magnetic field is distorted more, increasing the coupling and thus the output voltage.
If the rotor turns in the opposite direction (C), the magnetic field couples with the output coil in the opposite
direction, inverting the output phase.
(This diagram is more conceptual than realistic, with the coils and flux 90° from their real orientation,
so don't take it too seriously.
As shown earlier, the coils are perpendicular to the rotor so the real flux lines are completely different.)</p>
<p><a href="https://static.righto.com/images/bendix-motor/drag-cup.jpg"><img alt="Principle of the drag-cup rate generator. From Navy electricity and electronics training series: Principles of synchros, servos, and gyros, Fig 2-16" class="hilite" height="419" src="https://static.righto.com/images/bendix-motor/drag-cup-w600.jpg" title="Principle of the drag-cup rate generator. From Navy electricity and electronics training series: Principles of synchros, servos, and gyros, Fig 2-16" width="600" /></a><div class="cite">Principle of the drag-cup rate generator. From <a href="https://books.google.com/books?id=rRAqVWwRVeAC">Navy electricity and electronics training series: Principles of synchros, servos, and gyros</a>, Fig 2-16</div></p>
<p>But why does the rotating drum change the magnetic field?
It's easier to understand by considering a tachometer that uses a squirrel-cage rotor instead of a drum.
When the rotor rotates, currents will be induced in the squirrel cage, as described earlier with the motor.
These currents, in turn, generate a perpendicular magnetic field, as before.
This magnetic field, perpendicular to the original field, will be aligned with the output coil and will be picked up.
The strength of the induced field (and thus the output voltage) is proportional to the speed, while the
direction of the field depends on the direction of rotation.
Because the primary coil is excited at 400 hertz, the currents in the squirrel cage and the resulting magnetic field also oscillate at 400 hertz.
Thus, the output is at 400 hertz, regardless of the input speed.</p>
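The tachometer's behavior, amplitude proportional to speed and phase flipping with direction, can be modeled in a couple of lines (the gain <code>k</code> is a made-up illustrative constant, not a measured value):

```python
import math

F_EXCITATION = 400.0  # excitation frequency in Hz; the output stays at 400 Hz

def tach_output(t, shaft_speed, k=0.01):
    # Amplitude scales with shaft speed; a negative speed flips the
    # sign, i.e. shifts the output phase by 180 degrees
    return k * shaft_speed * math.sin(2 * math.pi * F_EXCITATION * t)
```

Doubling the speed doubles the output amplitude, while reversing the shaft inverts the waveform, exactly the amplitude-plus-phase encoding described above.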
<p>Using a drum instead of a squirrel cage provides higher accuracy because there are no fluctuations due
to the discrete bars.
The operation is essentially the same, except that the currents pass through the metal of the drum continuously instead of
through individual bars. The result is eddy currents in the drum, producing the second magnetic field.
The diagram below shows the eddy currents (red lines) from a metal plate moving through a magnetic field (green), producing a second magnetic field (blue arrows).
For the rotating drum, the situation is similar except the metal surface is curved, so both field arrows
will have a component pointing to the left. This creates the directed magnetic field that produces the output.</p>
<p><a href="https://static.righto.com/images/bendix-motor/eddy-currents.png"><img alt="A diagram showing eddy currents in a metal plate moving under a magnet, Image from Chetvorno." class="hilite" height="279" src="https://static.righto.com/images/bendix-motor/eddy-currents-w500.png" title="A diagram showing eddy currents in a metal plate moving under a magnet, Image from Chetvorno." width="500" /></a><div class="cite">A diagram showing eddy currents in a metal plate moving under a magnet, Image from <a href=
"https://commons.wikimedia.org/wiki/File:Eddy_currents_due_to_magnet.svg">Chetvorno</a>.</div></p>
<h2>The servo loop</h2>
<p>The motor/generator is called a servomotor because it is used in a servo loop,
a control system that uses feedback to obtain precise positioning.
In particular, the CADC uses the rotational position of shafts to represent various values.
The servo loops convert the CADC's inputs (static pressure, dynamic pressure, temperature, and
pressure correction) into shaft positions.
The rotations of these shafts power the gears, cams, and differentials that perform the computations.</p>
<p>The diagram below shows a typical servo loop in the CADC.
The goal is to rotate the output shaft to a position that exactly matches the input voltage.
To accomplish this, the output position is converted into a feedback voltage by a potentiometer that rotates
as the output shaft rotates.<span id="fnref:feedback"><a class="ref" href="#fn:feedback">12</a></span>
The error amplifier compares the input voltage to the feedback voltage and generates an error signal, rotating
the servomotor in the appropriate direction.
Once the output shaft is in the proper position, the error signal drops to zero and the motor stops.
To improve the dynamic response of the servo loop, the tachometer signal is used as a negative feedback voltage.
This ensures that the motor slows as the system gets closer to the right position, so the motor doesn't overshoot
the position and oscillate.
(This is sort of like a <a href="https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller">PID controller</a>.)</p>
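The effect of the rate feedback can be demonstrated with a toy simulation (the gains, time step, and units are arbitrary; this is a conceptual sketch, not a model of the actual CADC electronics):

```python
def servo_step(position, velocity, target, dt=0.001,
               error_gain=400.0, tach_gain=30.0):
    # Error amplifier: drive proportional to position error,
    # minus the tachometer (rate) feedback that damps the motion
    error = target - position
    drive = error_gain * error - tach_gain * velocity
    velocity += drive * dt       # motor accelerates per the drive signal
    position += velocity * dt    # shaft position integrates the velocity
    return position, velocity

# Run the loop; the shaft settles at the commanded position
pos, vel = 0.0, 0.0
for _ in range(5000):
    pos, vel = servo_step(pos, vel, target=1.0)
```

Without the `tach_gain` term the loop is only lightly damped and rings around the target; the rate feedback makes it settle cleanly, which is the role the tachometer plays here.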
<p><a href="https://static.righto.com/images/bendix-motor/servo-loop.jpg"><img alt="Diagram of a servo loop in the CADC." class="hilite" height="180" src="https://static.righto.com/images/bendix-motor/servo-loop-w500.jpg" title="Diagram of a servo loop in the CADC." width="500" /></a><div class="cite">Diagram of a servo loop in the CADC.</div></p>
<p>The error amplifier and motor drive circuit for a pressure transducer are shown below.
Because of the state of electronics at the time, it took three circuit boards to implement a single servo loop.
The amplifier was implemented with germanium transistors (silicon transistors came later). These transistors weren't powerful enough to drive the
motors directly. Instead, <a href="https://spectrum.ieee.org/the-vacuum-tubes-forgotten-rival">magnetic amplifiers</a> (the yellow transformer-like modules at the front) powered
the servomotors. The large rectangular capacitors on the right provided the phase shift required for the
control voltage.</p>
<p><a href="https://static.righto.com/images/bendix-motor/amplifier-stack.jpg"><img alt="One of the three-board amplifiers for the pressure transducer." class="hilite" height="424" src="https://static.righto.com/images/bendix-motor/amplifier-stack-w400.jpg" title="One of the three-board amplifiers for the pressure transducer." width="400" /></a><div class="cite">One of the three-board amplifiers for the pressure transducer.</div></p>
<h2>Conclusions</h2>
<p>The Bendix CADC used a variety of electromechanical devices including synchros, control transformers, servo motors, and tachometer generators.
These were expensive military-grade components driven by complex electronics.
Nowadays, you can get a PWM servo motor for a <a href="https://www.adafruit.com/product/169">few dollars</a> with
the gearing, feedback, and control circuitry inside the motor housing.
These motors are widely used for hobbyist robotics, drones, and other applications.
It's amazing that servo motors have gone from specialized avionics hardware to an easy-to-use, inexpensive commodity.</p>
<p><a href="https://static.righto.com/images/bendix-motor/servo2.jpg"><img alt="A modern DC servo motor. Photo by Adafruit (CC BY-NC-SA 2.0 DEED)." class="hilite" height="308" src="https://static.righto.com/images/bendix-motor/servo2-w400.jpg" title="A modern DC servo motor. Photo by Adafruit (CC BY-NC-SA 2.0 DEED)." width="400" /></a><div class="cite">A modern DC servo motor. Photo by <a href="https://www.adafruit.com/product/169">Adafruit</a> (<a href="https://creativecommons.org/licenses/by-nc-sa/2.0/">CC BY-NC-SA 2.0 DEED</a>).</div></p>
<p>Follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Thanks to Joe for providing the CADC.
Thanks to Marc Verdiell for disassembling the motor.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:part-numbers">
<p>The two types of motors in the CADC are part number "FV-101-19-A1" and
part number "FV-101-5-A1" (or FV101-5A1). They are called either a "Tachometer Rate Generator" or "Tachometer
Motor Generator", with both names applied to the same part number.
The "19" and "5" units look the same, with the "19" used for one pressure servo loop and the "5" used everywhere else.</p>
<p>The motor that I got is similar to the ones in the CADC, but shorter.
The difference in size is mysterious since both have the Bendix part number FV-101-5-A1.</p>
<p>For reference, the motor I disassembled is labeled:
<div style="white-space:pre-wrap;">Cedar Division
Control Data Corp.
ST10162 Motor Tachometer
F0: 26V C0: 26V TACH: 18V 400 CPS
DSA-400-70C-4651
FSN6105-581-5331 US
BENDIX FV-101-5-A1
</div></p>
<p>I wondered why the motor listed both Control Data and Bendix.
In 1952, the Cedar Engineering Company was spun off from the Minneapolis Honeywell Regulator Company
(better known as Honeywell, the name it took in <a href="https://hrcaz.org/PermPages/HoneywellHistory.html#:~:text=1964%20The%20company%20name%20was,began%20at%20Honeywell%20Aero%20Division.">1964</a>).
Cedar Engineering produced motors, servos, and aircraft actuators.
In 1957, Control Data <a href="https://gallery.lib.umn.edu/exhibits/show/digital-state/control-data-corporation">bought</a> Cedar Engineering, which became the Cedar Division of CDC.
Then, Control Data <a href="https://www.worldradiohistory.com/Archive-Electronics/60s/63/Electronics-1963-03-15.pdf#page=10">acquired</a> Bendix's computer division in 1963.
Thus, three companies were involved. <a class="footnote-backref" href="#fnref:part-numbers" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:series">
<p>My previous articles on the CADC are:
<ul>
<li><a href="https://www.righto.com/2023/02/bendix-central-air-data-computer-cadc.html">Overview</a>
<li><a href="https://www.righto.com/2023/10/bendix-cadc-reverse-engineering.html">Top section</a>
<li><a href="https://www.righto.com/2024/01/bendix-cadc-pressure-transducers.html">Pressure transducers</a>
<li><a href="https://www.righto.com/2024/02/reverse-engineering-analog-bendix-air.html">Mach section</a>
</ul> <a class="footnote-backref" href="#fnref:series" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:wiring">
<p>From testing the motor, here is how I believe it is wired:
<br/>
Motor reference (power): red and black
<br/>
Motor control: blue and orange
<br/>
Generator reference (power): green and brown
<br/>
Generator out: white and yellow <a class="footnote-backref" href="#fnref:wiring" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:angle">
<p>The bars on the squirrel-cage rotor are at a slight angle.
Parallel bars would go in and out of alignment with the stator, causing fluctuations in the force,
while the angled bars avoid this problem. <a class="footnote-backref" href="#fnref:angle" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:cross-section">
<p>This cross-section through the stator shows the windings.
On the left, each winding is separated into the parts on either side of the pole.
On the right, you can see how the wires loop over from one side of the pole to the other.
Note the small circles in the 12 o'clock and 9 o'clock positions: cross sections of the input wires.
The individual horizontal wires near the circumference connect alternating windings.</p>
<p><a href="https://static.righto.com/images/bendix-motor/cross-section.jpg"><img alt="A cross-section of the stator, formed by sanding down the plastic on the end." class="hilite" height="472" src="https://static.righto.com/images/bendix-motor/cross-section-w500.jpg" title="A cross-section of the stator, formed by sanding down the plastic on the end." width="500" /></a><div class="cite">A cross-section of the stator, formed by sanding down the plastic on the end.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:cross-section" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:servomotor">
<p>It's hard to find explanations of AC servomotors since they are an old technology.
One discussion is in
<a href="https://archive.org/details/electromechanica00davi/page/234/mode/2up?view=theater">Electromechanical components for servomechanisms</a> (1961).
This book points out some interesting things about a servomotor.
The stall torque is proportional to the control voltage.
Servomotors are generally high-speed, but low-torque devices, heavily geared down.
Because of their high speed and their need to change direction, rotational inertia is a problem.
Thus, servomotors typically have a long, narrow rotor compared with typical motors.
(You can see in the teardown photo that the rotor is long and narrow.)
Servomotors are typically designed with many poles (to reduce speed) and smaller air gaps to increase inductance.
These small airgaps (e.g. 0.001") require careful manufacturing tolerance, making servomotors a precision part. <a class="footnote-backref" href="#fnref:servomotor" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:faraday">
<p>The principle is <a href="https://en.wikipedia.org/wiki/Faraday%27s_law_of_induction">Faraday's law of induction</a>:
"The electromotive force around a closed path is equal to the negative of the time rate of change of the magnetic flux enclosed by the path." <a class="footnote-backref" href="#fnref:faraday" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:ampere">
<p><a href="https://en.wikipedia.org/wiki/Amp%C3%A8re%27s_circuital_law">Ampère's law</a> states that "the integral of the magnetizing field <em>H</em>
around any closed loop is equal to the sum of the current flowing through the loop." <a class="footnote-backref" href="#fnref:ampere" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:signs">
<p>The direction of the current flow (up or down) depends on the direction of rotation.
I'm not going to worry about the specific direction of current flow, magnetic flux, and so forth in this article. <a class="footnote-backref" href="#fnref:signs" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:single-phase">
<p>Once an induction motor is spinning, it can be powered from a single AC phase since the rotor is
moving with respect to the stator's magnetic field.
This works for the servomotor too. I noticed that once the motor is spinning,
it can operate without the control voltage. This isn't the normal way of using the motor, though. <a class="footnote-backref" href="#fnref:single-phase" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:tachometers">
<p>A long discussion of tachometers is in the book <a href="https://archive.org/details/electromechanica0000unse_d6y6/page/188/mode/2up?view=theater">Electromechanical Components for Servomechanisms</a> (1961). The AC induction-generator
tachometer is described starting on page 193.</p>
<p>For a mathematical analysis of the tachometer generator, see <a href="https://apps.dtic.mil/sti/tr/pdf/AD0655862.pdf#page=67">Servomechanisms, Section 2, Measurement and Signal Converters, MCP 706-137, U.S. Army</a>.
This source also discusses sources of errors in detail.
Inexpensive tachometer generators may have an error of 1-2%, while precision devices can have an error of
about 0.1%.
Accuracy is worse for small airborne generators, though.
Since the Bendix CADC uses the tachometer output for damping, not as a signal output, accuracy
is less important. <a class="footnote-backref" href="#fnref:tachometers" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:feedback">
<p>Different inputs in the CADC use different feedback mechanisms.
The temperature servo uses a potentiometer for feedback.
The angle of attack correction uses a synchro control transformer, which generates a voltage based on the
angle error.
The pressure transducers contain inductive pickups that generate a voltage based on the pressure error.
For more details, see my article on the CADC's <a href="https://www.righto.com/2024/01/bendix-cadc-pressure-transducers.html">pressure transducer servo circuits</a>. <a class="footnote-backref" href="#fnref:feedback" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
</ol>
</div>
<h2>Reverse-engineering an analog Bendix air data computer: part 4, the Mach section (2024-02-11)</h2>
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js"></script>
<script>
MathJax = {
tex: {
inlineMath: [['$', '$'], ['\\(', '\\)']]
},
svg: {
fontCache: 'global'
},
chtml: { displayAlign: 'left' }
};
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
"HTML-CSS": { scale: 175}
});
</script>
<style type="text/css">
.MathJax {font-size: 1em !important}
</style>
<p>In the 1950s, many fighter planes used the
Bendix Central Air Data Computer (CADC) to compute airspeed, Mach number, and other "air data".
The CADC is an analog computer, using
tiny gears and specially-machined cams for its mathematics.
In this article, part 4 of my series,<span id="fnref:series"><a class="ref" href="#fn:series">1</a></span> I reverse engineer the Mach section of the CADC and explain its calculations.
(In the photo below, the Mach section is the middle section of the CADC.)</p>
<p><a href="https://static.righto.com/images/bendix-mach/cadc-side.jpg"><img alt="The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version." class="hilite" height="330" src="https://static.righto.com/images/bendix-mach/cadc-side-w600.jpg" title="The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version." width="600" /></a><div class="cite">The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version.</div></p>
<p>Aircraft have determined airspeed from
air pressure for over a century.
A port in the side of the plane provides the static air pressure,<span id="fnref:static"><a class="ref" href="#fn:static">2</a></span> the air pressure outside the aircraft.
A pitot tube points forward and receives the "total" air pressure, a higher pressure due to the air forced into the tube by the speed of the airplane.
The airspeed can be determined from the ratio of these two pressures, while the altitude can be determined
from the static pressure.</p>
<p>But as you approach the speed of sound, the fluid dynamics of air change and the calculations become very complicated.
With the development of supersonic fighter planes in the 1950s, simple mechanical instruments were no longer
sufficient.
Instead, an analog computer calculated the "air data" (airspeed, air density, Mach number, and so forth) from the pressure measurements.
This computer then transmitted the air data electrically to the systems that needed it: instruments, weapons
targeting, engine control, and so forth.
Since the computer was centralized, the
system was called a Central Air Data Computer or CADC, manufactured by Bendix and other companies.</p>
<p><a href="https://static.righto.com/images/bendix-mach/gears.jpg"><img alt="A closeup of the numerous gears inside the CADC. Three differential gear mechanisms are visible." class="hilite" height="367" src="https://static.righto.com/images/bendix-mach/gears-w600.jpg" title="A closeup of the numerous gears inside the CADC. Three differential gear mechanisms are visible." width="600" /></a><div class="cite">A closeup of the numerous gears inside the CADC. Three differential gear mechanisms are visible.</div></p>
<p>Each value in the Bendix CADC is indicated by the rotational position of a shaft.
Compact electric motors rotate the shafts, controlled by the pressure inputs.
Gears, cams, and differentials perform computations, with the results indicated by more rotations.
Devices called synchros converted the rotations to electrical outputs that are connected to other aircraft systems.
The CADC is said to contain 46 synchros, 511 gears, 820 ball bearings, and a total of 2,781 major parts (but I haven't counted).
These components are crammed into a compact cylinder: just 15 inches long and weighing 28.7 pounds.</p>
<p>The equations computed by the CADC are impressively complicated. For instance, one equation is:</p>
<p>\[~~~\frac{P_t}{P_s} = \frac{166.9215M^7}{( 7M^2-1)^{2.5}}\]</p>
<p>It seems incredible that these functions could be computed mechanically, but three
techniques make this possible.
The fundamental mechanism is the differential gear, which adds or subtracts values.
Second, logarithms are used extensively, so multiplications and divisions are implemented by additions and subtractions performed by
a differential, while square roots are calculated by gearing down by a factor of 2.
Finally, specially-shaped cams implement functions: logarithm, exponential, and application-specific functions.
By combining these mechanisms, complicated functions can be computed mechanically, as I will explain below.</p>
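<p>As a concrete illustration of the log-domain approach, here is how the pressure-ratio equation above decomposes into the additions, subtractions, and constant gear ratios that the mechanism performs. This is a numerical sketch in Python, not a model of the actual gearing:</p>

```python
import math

def pressure_ratio_direct(m):
    """The supersonic pitot equation, computed directly (valid for M > 1)."""
    return 166.9215 * m**7 / (7 * m**2 - 1) ** 2.5

def pressure_ratio_log_domain(m):
    """The same equation the way the CADC's mechanism evaluates it:
    cams take logs, a differential adds and subtracts them, gear
    ratios scale by constants (x7, x2.5), and a final cam
    exponentiates the result."""
    log_m = math.log(m)                # log cam
    log_term = math.log(7 * m**2 - 1)  # function cam feeding a log cam
    log_ratio = math.log(166.9215) + 7 * log_m - 2.5 * log_term
    return math.exp(log_ratio)         # exponential cam

# The two formulations agree, just as the gear train agrees with the math:
for mach in (1.2, 1.8, 2.5):
    assert abs(pressure_ratio_direct(mach) - pressure_ratio_log_domain(mach)) < 1e-9
```

<p>As a sanity check, at Mach 1 the formula yields a pressure ratio of about 1.893, matching the standard isentropic value \( 1.2^{3.5} \).</p>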
<!-- Aerodynamics for Naval Aviators (1965) https://www.faa.gov/sites/faa.gov/files/regulations_policies/handbooks_manuals/aviation/00-80T-80.pdf -->
<h3>The differential</h3>
<p>The differential gear assembly is the mathematical component of the CADC, as it performs addition or subtraction.<span id="fnref:gearing"><a class="ref" href="#fn:gearing">3</a></span>
The differential takes two input rotations and produces an output rotation that is the sum or difference of these rotations.<span id="fnref:differential"><a class="ref" href="#fn:differential">4</a></span>
Since most values in the CADC are expressed logarithmically, the differential computes multiplication and division when it
adds or subtracts its inputs.</p>
<p><a href="https://static.righto.com/images/bendix-mach/differential-closeup.png"><img alt="A closeup of a differential mechanism." class="hilite" height="332" src="https://static.righto.com/images/bendix-mach/differential-closeup-w350.png" title="A closeup of a differential mechanism." width="350" /></a><div class="cite">A closeup of a differential mechanism.</div></p>
<p>While the differential functions like the differential in a car, it is constructed differently, with a
<a href="https://en.wikipedia.org/wiki/Differential_(mechanical_device)#Spur-gear_design">spur-gear</a> design.
This compact arrangement of gears is about 1 cm thick and 3 cm in diameter.
The differential is mounted on a shaft along with three co-axial gears: two gears provide the inputs to the differential and the third
provides the output.
In the photo, the gears above and below the differential are the input gears.
The entire differential body rotates with the sum, connected to the output gear at the top through a concentric shaft.
(In practice, any of the three gears can be used as the output.)
The two thick gears inside the differential body are part of the mechanism.</p>
<h3>The cams</h3>
<p>The CADC uses cams to implement various functions. Most importantly, cams compute logarithms and exponentials.
Cams also implement complicated functions of one variable such as ${M}/{\sqrt{1 + .2 M^2}}$.
The function is encoded into the cam's shape during manufacturing, so a hard-to-compute nonlinear function isn't a problem for the CADC.
The photo below shows a cam with the follower arm in front. As the cam rotates, the follower
moves in and out according to the cam's radius.</p>
<p><a href="https://static.righto.com/images/bendix-mach/cam.jpg"><img alt="A cam inside the CADC implements a function." class="hilite" height="444" src="https://static.righto.com/images/bendix-mach/cam-w500.jpg" title="A cam inside the CADC implements a function." width="500" /></a><div class="cite">A cam inside the CADC implements a function.</div></p>
<p>However, the shape of the cam doesn't provide the function directly, as you might expect.
The main problem with the straightforward approach is the discontinuity when the cam wraps around.
For example, if the cam implemented an exponential directly, its radius would spiral exponentially and there would be
a jump back to the starting value when it wraps around.
Instead, the CADC uses a clever <a href="https://patents.google.com/patent/US2969910A">patented</a> method:
the cam encodes the <em>difference</em> between the desired function and a straight line.
For example, an exponential curve is shown below (blue), with a line (red) between the endpoints.
The height of the gray segment, the difference, specifies the radius of the cam (added to the cam's fixed minimum radius).
The point is that this difference goes to 0 at the extremes, so the cam will no longer have a discontinuity when it
wraps around.
Moreover, this technique significantly reduces the size of the value (i.e. the height of the gray region is
smaller than the height of the blue line), increasing the cam's accuracy.<span id="fnref:cam-shape"><a class="ref" href="#fn:cam-shape">5</a></span></p>
<p><a href="https://static.righto.com/images/bendix-mach/exponential.png"><img alt="An exponential curve (blue), linear curve (red), and the difference (gray)." class="hilite" height="216" src="https://static.righto.com/images/bendix-mach/exponential-w300.png" title="An exponential curve (blue), linear curve (red), and the difference (gray)." width="300" /></a><div class="cite">An exponential curve (blue), linear curve (red), and the difference (gray).</div></p>
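<p>The difference-from-a-line encoding can be expressed numerically. Here is a Python sketch using \( e^x \) on [0, 1]; the minimum cam radius is an arbitrary illustrative value:</p>

```python
import math

# Encode f(x) = e^x on [0, 1] as a cam profile: the cam stores only the
# difference between f and the straight line through f's endpoints, so
# the profile is zero at both ends and the cam wraps around smoothly.
X0, X1 = 0.0, 1.0
F0, F1 = math.exp(X0), math.exp(X1)

def line(x):
    """The straight line through the function's endpoints."""
    return F0 + (F1 - F0) * (x - X0) / (X1 - X0)

def cam_radius(x, r_min=1.0):
    """Cam profile: fixed minimum radius plus the difference term.
    (e^x is convex, so it lies below its chord and the term is >= 0.)"""
    return r_min + (line(x) - math.exp(x))

def follow(x, r_min=1.0):
    """Reconstruct f(x): the differential adds the follower's
    displacement back into the linear input."""
    return line(x) - (cam_radius(x, r_min) - r_min)

assert abs(cam_radius(X0) - cam_radius(X1)) < 1e-12  # no wrap-around jump
assert abs(follow(0.5) - math.exp(0.5)) < 1e-12      # function recovered
```

<p>The two assertions capture the two benefits from the text: the profile has the same radius at both ends, so there is no discontinuity, and adding the cam displacement back to the linear input recovers the original function.</p>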
<p>To make this work, the cam position must be added to the linear value to yield the result.
This is implemented by combining each cam with a differential gear; watch for the paired cams and differentials below.
As the diagram below shows, the input (23) drives the cam (30) and the
differential (25, 37-41). The follower (32) tracks the cam and provides a second input (35) to the differential.
The sum from the differential produces the desired function (26).</p>
<p><a href="https://static.righto.com/images/bendix-mach/differential-patent.png"><img alt="This diagram, from Patent 2969910, shows how the cam and follower are connected to a differential." class="hilite" height="312" src="https://static.righto.com/images/bendix-mach/differential-patent-w350.png" title="This diagram, from Patent 2969910, shows how the cam and follower are connected to a differential." width="350" /></a><div class="cite">This diagram, from <a href="https://patents.google.com/patent/US2969910A">Patent 2969910</a>, shows how the cam and follower are connected to a differential.</div></p>
<h3>The synchro outputs</h3>
<p>A synchro is an interesting device that can transmit a rotational position electrically over three wires.
In appearance, a synchro is similar to an electric motor, but its internal construction is different, as shown below.
Before digital systems, synchros were very popular for transmitting signals electrically through an aircraft.
For instance, a synchro could transmit an altitude reading to a cockpit display or a targeting system.
Two synchros at different locations have their stator windings connected together, while the rotor windings are driven with
AC.
Rotating the shaft of one synchro causes the other to rotate to the same position.<span id="fnref:synchro"><a class="ref" href="#fn:synchro">6</a></span></p>
<p><a href="https://static.righto.com/images/bendix-mach/synchro-cutaway.png"><img alt="Cross-section diagram of a synchro showing the rotor and stators." class="hilite" height="306" src="https://static.righto.com/images/bendix-mach/synchro-cutaway-w350.png" title="Cross-section diagram of a synchro showing the rotor and stators." width="350" /></a><div class="cite">Cross-section diagram of a synchro showing the rotor and stators.</div></p>
<p>For the CADC, most of the outputs are synchro signals, using compact synchros that are about 3 cm in length.
For improved resolution, many of the CADC outputs use two synchros: a coarse synchro and a fine synchro.
The two synchros are typically geared in an 11:1 ratio, so the fine synchro rotates 11 times as fast as the coarse
synchro.
Over the output range, the coarse synchro may turn 180°, providing the approximate output unambiguously, while the fine
synchro spins multiple times to provide more accuracy.</p>
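<p>The coarse/fine combination works much like a clock's hour and minute hands. Here is one way a receiving system could merge the two readings; the 11:1 ratio comes from the text, but the decoding logic and the tolerance to coarse-reading error are my own Python illustration:</p>

```python
def decode(coarse_deg, fine_deg, ratio=11):
    """Combine a coarse synchro reading (unambiguous but imprecise) with
    a fine synchro reading (precise but ambiguous, since the fine shaft
    makes `ratio` turns over the output range). The coarse reading
    selects which turn of the fine synchro applies; the fine reading
    supplies the precision."""
    turn = round((coarse_deg * ratio - fine_deg) / 360.0)
    return (turn * 360.0 + fine_deg) / ratio

# A true output angle of 123.456 degrees, with a noisy coarse reading:
true_angle = 123.456
fine = (true_angle * 11) % 360.0   # what the fine synchro reports
coarse = true_angle + 2.0          # coarse reading off by 2 degrees
print(decode(coarse, fine))        # recovers ~123.456 despite the coarse error
```

<p>Note that a coarse error of a couple of degrees doesn't matter: it only needs to be accurate enough to pick the correct turn of the fine synchro.</p>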
<h2>Examining the Mach section of the CADC</h2>
<p><a href="https://static.righto.com/images/bendix-mach/side.jpg"><img alt="Another view of the CADC." class="hilite" height="343" src="https://static.righto.com/images/bendix-mach/side-w600.jpg" title="Another view of the CADC." width="600" /></a><div class="cite">Another view of the CADC.</div></p>
<p>The Bendix CADC is constructed from modular sections.
In this blog post, I'm focusing on
the middle section, called the "Mach section" and indicated by the arrow above.
This section computes log static pressure, impact pressure, pressure ratio, and Mach number and provides these outputs
electrically as synchro signals.
It also provides the log pressure ratio and log static pressure to the rest of the CADC as shaft rotations.
The left section of the CADC computes values related to airspeed, air density, and temperature.<span id="fnref:left"><a class="ref" href="#fn:left">7</a></span>
The right section has the pressure sensors (the black domes), along with the servo mechanisms that control
them.</p>
<p>I had feared that any attempt at disassembly would result in tiny gears flying in every direction, but the CADC
was designed to be taken apart for maintenance.
Thus, I could remove the left section of the CADC for analysis.
Unfortunately, we lost the gear alignment between the sections and don't have the calibration instructions, so
the CADC no longer produces accurate results.</p>
<p>The diagram below shows the internal components of the Mach section after disassembly.
The synchros are in pairs to generate coarse and fine outputs; the coarse synchros can
be distinguished because they
have spiral anti-backlash springs installed.
These springs prevent wobble in the synchro and gear train as the gears change direction.
The gears and differentials are not visible from this angle as they are underneath the metal plate.
The Position Error Correction (PEC) subsystem has a motor to drive the shaft and a control transformer for feedback.
The Mach section has two D-sub connectors. The one on the right links the Mach section and pressure section to the front section
of the CADC. The Position Error Correction (PEC) servo amplifier board plugs into the left connector.
The static pressure and total pressure input lines have fittings so the lines can be disconnected from
the lines from the front of the CADC.<span id="fnref:connectors"><a class="ref" href="#fn:connectors">8</a></span></p>
<p><a href="https://static.righto.com/images/bendix-mach/mach-labeled.jpg"><img alt="The Mach section with components labeled." class="hilite" height="490" src="https://static.righto.com/images/bendix-mach/mach-labeled-w600.jpg" title="The Mach section with components labeled." width="600" /></a><div class="cite">The Mach section with components labeled.</div></p>
<p>The photo below shows the left section of the CADC. This section meshes with the Mach section shown above.
The two sections have parts at various heights, so they join in a complicated way.
Two gears receive the pressure signals \( log ~ P_t / P_s \) and \( log ~ P_s \) from the Mach section.
The third gear sends the log total temperature to the rest of the CADC.
The electrical connector (a standard 37-pin D-sub) supplies 115 V 400 Hz power to the Mach section and
pressure transducers
and passes synchro signals to the output connectors.</p>
<p><a href="https://static.righto.com/images/bendix-mach/back-labeled.jpg"><img alt="The left part of the CADC that meshes with the Mach section." class="hilite" height="480" src="https://static.righto.com/images/bendix-mach/back-labeled-w500.jpg" title="The left part of the CADC that meshes with the Mach section." width="500" /></a><div class="cite">The left part of the CADC that meshes with the Mach section.</div></p>
<h2>The position error correction servo loop</h2>
<!--
The Bendix CADC has two pneumatic inputs through tubes:
the static pressure and the total pressure.
It also receives the total temperature from a platinum temperature probe.
From these, it computes many outputs: true air speed, Mach number, log static pressure, differential pressure, air density, air density × the speed of sound, total temperature, and log true free air temperature.
-->
<p>The CADC receives two pressure inputs
and two pressure transducers convert the pressures into rotational positions, providing
the indicated static pressure
\( P_{si} \) and the total pressure \( P_t \) as shaft rotations to the rest of the CADC.
(I explained the pressure transducers in detail in the <a href="https://www.righto.com/2024/01/bendix-cadc-pressure-transducers.html">previous article</a>.)</p>
<p>There's one complication though.
The static pressure \( P_s \) is the atmospheric pressure outside the aircraft.
The problem is that the static pressure measurement is perturbed by the airflow around the aircraft, so
the measured pressure (called the indicated static pressure \( P_{si} \)) doesn't match the real pressure.
This is bad because a "static-pressure error manifests itself as errors in indicated airspeed, altitude, and Mach number to the pilot."<span id="fnref:pec"><a class="ref" href="#fn:pec">9</a></span></p>
<p>The solution is a correction factor called the Position Error Correction.
This factor gives the ratio between the real pressure \( P_s \) and the measured pressure
\( P_{si} \).
By applying this correction factor to the indicated (i.e. measured) pressure, the true pressure can be obtained.
Since this correction factor depends on the shape of the aircraft, it is generated outside the CADC by a
separate cylindrical unit called the Compensator, customized to the aircraft type.
The position error computation depends on two parameters:
the Mach number provided by the CADC and the angle of attack provided by an aircraft sensor.
The compensator determines the correction factor by using
a three-dimensional cam.
The vintage photo below shows the components inside the compensator.</p>
<p><a href="https://static.righto.com/images/bendix-mach/compensator.jpg"><img alt="&quot;Static Pressure and Angle of Attack Compensator Type X1254115-1 (Cover Removed)&quot; from Air Data Computer Mechanization." class="hilite" height="438" src="https://static.righto.com/images/bendix-mach/compensator-w350.jpg" title="&quot;Static Pressure and Angle of Attack Compensator Type X1254115-1 (Cover Removed)&quot; from Air Data Computer Mechanization." width="350" /></a><div class="cite">"Static Pressure and Angle of Attack Compensator Type X1254115-1 (Cover Removed)" from Air Data Computer Mechanization.</div></p>
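<p>The correction itself is a multiplication of the indicated pressure by the ratio, which in the CADC's log-domain gearing becomes an addition of shaft rotations. A minimal Python sketch; the pressure and ratio values are invented for illustration:</p>

```python
import math

def correct_static_pressure(p_si, pec_ratio):
    """Apply the correction factor: true pressure from indicated pressure."""
    return p_si * pec_ratio

p_si = 14.2   # indicated static pressure, arbitrary units
pec = 1.03    # hypothetical correction ratio for some Mach / angle of attack

direct = correct_static_pressure(p_si, pec)
# In log-domain gearing, the same multiplication is an addition of rotations:
log_domain = math.exp(math.log(p_si) + math.log(pec))
assert abs(direct - log_domain) < 1e-9
```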
<p>The correction factor is transmitted from the compensator to the CADC as a synchro signal over three wires.
To use this value, the CADC must convert the synchro signal to a shaft rotation.
The CADC uses a motorized servo loop that rotates the shaft until the shaft position matches
the angle specified by the synchro input.</p>
<p><a href="https://static.righto.com/images/bendix-mach/servo-diagram.png"><img alt="The servo loop ensures that the shaft position matches the input angle." class="hilite" height="226" src="https://static.righto.com/images/bendix-mach/servo-diagram-w500.png" title="The servo loop ensures that the shaft position matches the input angle." width="500" /></a><div class="cite">The servo loop ensures that the shaft position matches the input angle.</div></p>
<p>The key to the servo loop is a <em>control transformer</em>.
This device looks like a synchro and has five wires like a synchro, but its function is different.
Like the synchro motor, the control transformer has three stator wires that provide the angle input.
Unlike the synchro, the control transformer also uses the shaft position as an input, while the rotor winding generates an output voltage indicating the error.
This output voltage indicates the error between the control transformer's shaft position and the three-wire angle input.
The control transformer provides its error signal as a 400 Hz sine wave, with a larger signal indicating more error.<span id="fnref:phase"><a class="ref" href="#fn:phase">10</a></span></p>
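<p>An idealized model of the control transformer's behavior is sketched below in Python. It works with signal amplitudes only; real synchros involve transformer ratios and the 400 Hz carrier, which are omitted here:</p>

```python
import math

def stator_voltages(angle_deg, v_ref=1.0):
    """Amplitudes of the three stator signals a synchro transmitter
    produces for a given shaft angle: the reference scaled by the
    cosine of the angle relative to each winding (120 degrees apart)."""
    a = math.radians(angle_deg)
    return [v_ref * math.cos(a - math.radians(k * 120)) for k in range(3)]

def control_transformer_error(input_angle_deg, shaft_angle_deg):
    """Amplitude of the rotor output: each stator signal couples into
    the rotor by the sine of that winding's angle to the rotor. The
    sum works out to sin(shaft - input): zero when aligned, growing
    with misalignment."""
    stator = stator_voltages(input_angle_deg)
    r = math.radians(shaft_angle_deg)
    return sum(v * math.sin(r - math.radians(k * 120))
               for k, v in enumerate(stator)) * (2.0 / 3.0)

assert abs(control_transformer_error(30.0, 30.0)) < 1e-12  # aligned: no error
assert control_transformer_error(31.0, 30.0) != 0.0        # misaligned: error signal
```

<p>The error amplitude is proportional to the sine of the angle difference, so for small misalignments it is nearly proportional to the error itself, which is what the servo amplifier needs to drive the motor in the correct direction.</p>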
<p>The amplifier board (below) drives the motor in the appropriate direction to cancel out the error.
The power transformer in the upper left is the largest component, powering the amplifier board
from the CADC's 115-volt, 400 Hertz aviation power.
Below it are two transformer-like components; these are the magnetic amplifiers.
The relay in the lower-right corner switches the amplifier into test mode.
The rest of the circuitry consists of transistors, resistors, capacitors, and diodes.
The construction is completely different from modern printed circuit boards.
Instead, the amplifier uses point-to-point wiring
between plastic-insulated metal pegs.
Both sides of the board have components, with connections between the sides through the metal pegs.</p>
<p><a href="https://static.righto.com/images/bendix-mach/pec-amplifier.jpg"><img alt="The amplifier board for the position error correction." class="hilite" height="466" src="https://static.righto.com/images/bendix-mach/pec-amplifier-w500.jpg" title="The amplifier board for the position error correction." width="500" /></a><div class="cite">The amplifier board for the position error correction.</div></p>
<p>The amplifier board is implemented with a transistor amplifier driving two magnetic amplifiers, which
control the motor.<span id="fnref:schematic"><a class="ref" href="#fn:schematic">11</a></span>
(Magnetic amplifiers are an old technology that can amplify AC signals, allowing the relatively weak transistor
output to control a larger AC output.<span id="fnref:magamps"><a class="ref" href="#fn:magamps">12</a></span>)
The motor is a "Motor / Tachometer Generator" unit that also generates a voltage based on the motor's speed.
This speed signal provides negative feedback, limiting the motor speed as the error
becomes smaller and ensuring that the feedback loop doesn't overshoot.
The photo below shows how the amplifier board is mounted in the middle of the CADC, behind the static pressure tubing.</p>
<!--
The CADC is powered by standard avionics power of 115 volts AC, 400 hertz.
The servo circuit has a simple power supply running off this AC, using a multi-winding power
transformer.
One winding and diode produce DC for the transistor amplifier.
Other windings supply AC (controlled by the magnetic amplifiers) to power the motor, and AC for the
magnetic amplifier control signal.
The transformer ensures that the transducer circuitry is electrically isolated from other parts of the CADC
and the aircraft.
-->
<p><a href="https://static.righto.com/images/bendix-mach/side-with-board.jpg"><img alt="Side view of the CADC." class="hilite" height="310" src="https://static.righto.com/images/bendix-mach/side-with-board-w600.jpg" title="Side view of the CADC." width="600" /></a><div class="cite">Side view of the CADC.</div></p>
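The behavior of the servo loop can be sketched with a toy simulation. This is a crude second-order model, not the CADC's circuit; the gains, time step, and function names are arbitrary choices of mine for illustration:

```python
def run_servo(target, steps=2000, dt=0.01, k_err=8.0, k_tach=2.0):
    """Crude model of the servo loop: the motor is driven by the error
    signal, minus a tachometer term that damps the response."""
    shaft, speed = 0.0, 0.0
    for _ in range(steps):
        error = target - shaft                   # control transformer output
        drive = k_err * error - k_tach * speed   # amplifier minus tach feedback
        speed += drive * dt                      # motor speed responds to drive
        shaft += speed * dt                      # shaft follows the motor
    return shaft

# The shaft settles at the commanded angle:
assert abs(run_servo(90.0) - 90.0) < 0.1
```

With the tachometer term removed (k_tach = 0), the simulated shaft oscillates around the target instead of settling, which is why the motor/tachometer unit feeds the speed signal back into the amplifier.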
<h2>The equations</h2>
<p>Although the CADC looks like an inscrutable conglomeration of tiny gears, it is possible to trace out
the gearing and see exactly how it computes the air data functions.
With considerable effort, I have reverse-engineered the mechanisms to create the diagram below,
showing how each computation is broken down into mechanical steps.
Each line indicates a particular value, specified by a shaft rotation.
The ⊕ symbol indicates a differential gear, adding or subtracting its inputs to produce another value.
The cam symbol indicates a cam coupled to a differential gear.
Each cam computes either a specific function or an exponential, providing the value as a rotation.
At the right, the outputs are either shaft rotations to the rest of the CADC or synchro outputs.</p>
<p><a href="https://static.righto.com/images/bendix-mach/dataflow.png"><img alt="This diagram shows how the values are computed. The differential numbers are my own arbitrary numbers. Click for a larger version." class="hilite" height="331" src="https://static.righto.com/images/bendix-mach/dataflow-w800.png" title="This diagram shows how the values are computed. The differential numbers are my own arbitrary numbers. Click for a larger version." width="800" /></a><div class="cite">This diagram shows how the values are computed. The differential numbers are my own arbitrary numbers. Click for a larger version.</div></p>
<p>I'll go through each calculation briefly.</p>
<h3>log static pressure</h3>
<p>The static pressure is calculated by dividing the indicated static pressure by the pressure error
correction factor.
Since these values are all represented logarithmically, the division turns into a subtraction,
performed by a differential gear.
The output goes to two synchros, geared to provide coarse and fine outputs.<span id="fnref:coarse"><a class="ref" href="#fn:coarse">13</a></span></p>
<p>\[log ~ P_s = log ~ P_{si} - log ~ (P_{si} / P_s) \]</p>
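The log-domain trick can be checked numerically with a couple of lines of Python (the variable names are mine, not from the CADC documentation):

```python
import math

def log_static_pressure(p_si, correction):
    """Dividing by the correction factor (P_si/P_s) becomes a
    subtraction of logarithms, as performed by the differential gear."""
    return math.log(p_si) - math.log(correction)

# Subtracting the logs matches dividing the linear values:
assert math.isclose(log_static_pressure(29.92, 1.01),
                    math.log(29.92 / 1.01))
```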
<!-- https://apps.dtic.mil/sti/pdfs/ADA614715.pdf -->
<h3>Impact pressure</h3>
<p>The impact pressure is the pressure due to the aircraft's speed, the difference between the total pressure
and the static pressure.
To compute the impact pressure, the log pressure values are first converted to linear values by exponentiation,
performed by cams.
The linear pressure values are then subtracted by a differential gear.
Finally, the impact pressure is output through two synchros, coarse and fine in an 11:1 ratio.</p>
<p>\[ P_t - P_s = exp(log ~ P_t) - exp(log ~ P_s) \]</p>
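The same decomposition can be mirrored in Python to confirm it. Here the two exp() calls stand in for the cams and the subtraction for the differential (a sketch of the math; the CADC, of course, does all of this with shaft rotations):

```python
import math

def impact_pressure(log_p_t, log_p_s):
    """Two 'cams' exponentiate the log pressures, then a
    'differential' subtracts the linear values."""
    p_t = math.exp(log_p_t)   # cam: log total pressure -> linear
    p_s = math.exp(log_p_s)   # cam: log static pressure -> linear
    return p_t - p_s          # differential: Pt - Ps

assert math.isclose(impact_pressure(math.log(40.0), math.log(29.92)),
                    40.0 - 29.92)
```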
<h3>log pressure ratio</h3>
<p>The log pressure ratio \( P_t/P_s \) is the ratio of total pressure to static pressure.
This value is important because it is used to compute the Mach number, true airspeed, and log free air temperature.
The Mach number is computed in the Mach section as described below. The true airspeed and log free air temperature
are computed in the left section.
The left section receives the log pressure ratio as a rotation.
Since the left section and Mach section can be separated for maintenance, a direct shaft connection is not used.
Instead, each section has a gear and the gears mesh when the sections are joined.</p>
<p>Computing the log pressure ratio is straightforward. Since the log total pressure and log static pressure are
both available, subtracting the logs with a differential yields the desired value.
That is,</p>
<p>\[log ~ P_t/P_s = log ~ P_t - log ~ P_s \]</p>
<h3>Mach number</h3>
<p>The Mach number is defined in terms of \(P_t/P_s \), with separate cases for
subsonic and supersonic:<span id="fnref:equations"><a class="ref" href="#fn:equations">14</a></span></p>
<p>\[M<1:\]
\[~~~\frac{P_t}{P_s} = ( 1+.2M^2)^{3.5}\]</p>
<p>\[M > 1:\]</p>
<p>\[~~~\frac{P_t}{P_s} = \frac{166.9215M^7}{( 7M^2-1)^{2.5}}\]</p>
<p>Although these equations are very complicated, the solution is a function of the single
variable \(P_t/P_s\), so M can be computed with a single cam.
In other words, the difficult mathematics was done once, when the cam was designed and manufactured;
once the cam exists, computing M is easy, using the log pressure ratio computed earlier:</p>
<p>\[ M = f(log ~ P_t / P_s) \]</p>
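To see what the cam's profile encodes, the forward equations above can be inverted numerically. The bisection solver below is my stand-in for the cam (the constants come from the equations above; the iterative solver is emphatically not how the CADC works, since the cam stores the precomputed answer mechanically):

```python
import math

def pressure_ratio(m):
    """Pt/Ps as a function of Mach number, subsonic and supersonic cases."""
    if m <= 1:
        return (1 + 0.2 * m * m) ** 3.5
    return 166.9215 * m ** 7 / (7 * m * m - 1) ** 2.5

def mach(ratio, lo=0.0, hi=7.0):
    """Invert pressure_ratio by bisection; the CADC instead bakes this
    inverse function into the machined shape of a single cam."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if pressure_ratio(mid) < ratio:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# The inverse recovers the Mach number in both regimes:
assert math.isclose(mach(pressure_ratio(0.5)), 0.5, abs_tol=1e-6)
assert math.isclose(mach(pressure_ratio(2.0)), 2.0, abs_tol=1e-6)
```

Bisection works here because the pressure ratio increases monotonically with Mach number; the two formulas meet at Mach 1 (a ratio of about 1.893).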
<h2>Conclusions</h2>
<p>The CADC performs nonlinear calculations that seem way too complicated to solve with mechanical gearing.
But reverse-engineering the mechanism shows how the equations are broken down into steps that can be performed
with cams and differentials, using logarithms for multiplication and division.
The diagram below shows the complex gearing in the Mach section.
Each differential below corresponds to a differential in the earlier equation diagram.</p>
<p><a href="https://static.righto.com/images/bendix-mach/mach-gearing.jpg"><img alt="A closeup of the gears and cams in the Mach section. The differential for the pressure ratio is hidden in the middle." class="hilite" height="389" src="https://static.righto.com/images/bendix-mach/mach-gearing-w800.jpg" title="A closeup of the gears and cams in the Mach section. The differential for the pressure ratio is hidden in the middle." width="800" /></a><div class="cite">A closeup of the gears and cams in the Mach section. The differential for the pressure ratio is hidden in the middle.</div></p>
<p>Follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for more reverse engineering.
I'm also on Mastodon as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Thanks to Joe for providing the CADC. Thanks to Nancy Chen for obtaining a hard-to-find document for me.<span id="fnref:refs"><a class="ref" href="#fn:refs">15</a></span>
Marc Verdiell and Eric Schlaepfer are working on the CADC with me. CuriousMarc's video shows the CADC in action:
</p>
<p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/LQkDi3cjY3U?si=67_xTUBMCdxaQahX" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</p>
<!-- AiResearch F-14A air data computer with solid state pressure sensor https://www.google.com/books/edition/Aircraft_Yearbook/KnVGAAAAYAAJ?hl=en&gbpv=1&pg=PA85&printsec=frontcover
F-104 lawsuit over whether CADC is essential:
https://www.google.com/books/edition/United_States_Customs_Court_Reports/7p1ypJhUYukC?hl=en&gbpv=1&dq=electronic+air+data+computer&pg=PA331&printsec=frontcover
Bendix digital air data computer https://patentimages.storage.googleapis.com/ae/55/07/f72bf49c914d2c/US3564222.pdf
https://www.google.com/books/edition/Aircraft_Yearbook/KnVGAAAAYAAJ?hl=en&gbpv=1&dq=%22digital+air+data+computer%22&pg=RA1-PA190&printsec=frontcover
https://patents.google.com/patent/US3564222A/en
-->
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:series">
<p>My articles on the CADC are:
<ul>
<li><a href="https://www.righto.com/2023/02/bendix-central-air-data-computer-cadc.html">Overview</a>
<li><a href="https://www.righto.com/2023/10/bendix-cadc-reverse-engineering.html">Top section</a>
<li><a href="https://www.righto.com/2024/01/bendix-cadc-pressure-transducers.html">Pressure transducers</a>
</ul></p>
<p>There is a lot of overlap between the articles, so skip over parts that seem repetitive :-) <a class="footnote-backref" href="#fnref:series" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:static">
<p>The static air pressure can also be provided by holes in the side of the pitot tube; this is
the typical approach in fighter planes. <a class="footnote-backref" href="#fnref:static" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:gearing">
<p>Multiplying a rotation by a constant factor doesn't require a differential; it can be done simply with the ratio between
two gears. (If a large gear rotates a small gear, the small gear rotates faster according to the size ratio.)
Adding a constant to a rotation is even easier, just a matter of defining what shaft position indicates 0.
For this reason, I will ignore constants in the equations. <a class="footnote-backref" href="#fnref:gearing" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:differential">
<p>Strictly speaking, the output of the differential is the sum of the inputs divided by two.
I'm ignoring the factor of 2 because the gear ratios can easily cancel it out.
It's also arbitrary whether you think of the differential as adding or subtracting, since it depends on which rotation
direction is defined as positive. <a class="footnote-backref" href="#fnref:differential" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:cam-shape">
<p>The diagram below shows a typical cam function in more detail. The input is \(log~ dP/P_s\)
and the output is \(log~M / \sqrt{1+.2KM^2}\).
The small humped curve at the bottom is the cam correction.
Although the input and output functions cover a wide range, the difference that is encoded in the cam is much
smaller and drops to zero at both ends.</p>
<p><a href="https://static.righto.com/images/bendix-mach/cam-diagram.jpg"><img alt="This diagram, from Patent 2969910, shows how a cam implements a complicated function." class="hilite" height="439" src="https://static.righto.com/images/bendix-mach/cam-diagram-w350.jpg" title="This diagram, from Patent 2969910, shows how a cam implements a complicated function." width="350" /></a><div class="cite">This diagram, from <a href="https://patents.google.com/patent/US2969910A">Patent 2969910</a>, shows how a cam implements a complicated function.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:cam-shape" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:synchro">
<p>Internally, a synchro has a moving rotor winding and three fixed stator windings.
When AC is applied to the rotor, voltages are developed on the stator windings depending on the position of the rotor.
These voltages produce a torque that rotates the synchros to the same position.
In other words, the rotor receives power (26 V, 400 Hz in this case), while the three stator wires transmit the position.
The diagram below shows how a synchro is represented schematically, with rotor and stator coils.</p>
<p><a href="https://static.righto.com/images/bendix-mach/synchro-symbol.jpg"><img alt="The schematic symbol for a synchro." class="hilite" height="240" src="https://static.righto.com/images/bendix-mach/synchro-symbol-w250.jpg" title="The schematic symbol for a synchro." width="250" /></a><div class="cite">The schematic symbol for a synchro.</div></p>
<p>A control transformer has a similar structure, but the rotor winding provides an output, instead of being
powered. <a class="footnote-backref" href="#fnref:synchro" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:left">
<p>Specifically, the left part of the CADC computes
true airspeed, air density, total temperature, log true free
air temperature, and air density × speed of sound.
I discussed the left section in detail <a href="https://www.righto.com/2023/10/bendix-cadc-reverse-engineering.html">here</a>. <a class="footnote-backref" href="#fnref:left" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:connectors">
<p>From the outside, the CADC is a boring black cylinder, with no hint of the complex gearing inside.
The CADC is wired to the rest of the aircraft through round military connectors.
The front panel interfaces these connectors to the D-sub connectors used internally.
The two pressure inputs are the black cylinders at the bottom of the photo.</p>
<p><a href="https://static.righto.com/images/bendix-mach/exterior.jpg"><img alt="The exterior of the CADC. It is packaged in a rugged metal cylinder. It is sealed by a soldered metal band, so we needed a blowtorch to open it." class="hilite" height="470" src="https://static.righto.com/images/bendix-mach/exterior-w300.jpg" title="The exterior of the CADC. It is packaged in a rugged metal cylinder. It is sealed by a soldered metal band, so we needed a blowtorch to open it." width="300" /></a><div class="cite">The exterior of the CADC. It is packaged in a rugged metal cylinder. It is sealed by a soldered metal band, so we needed a blowtorch to open it.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:connectors" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:pec">
<p>The concepts of position error correction are described
<a href="https://apps.dtic.mil/sti/pdfs/ADA614715.pdf">here</a>. <a class="footnote-backref" href="#fnref:pec" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:phase">
<p>The phase of the signal is 0° or 180°, depending on the direction of the error.
In other words, the error signal is proportional to the driving AC signal in one direction and
flipped when the error is in the other direction.
This is important since it indicates which direction the motor should turn.
When the error is eliminated, the signal is zero. <a class="footnote-backref" href="#fnref:phase" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:schematic">
<p>I reverse-engineered the circuit board to create the schematic below for the amplifier.
The idea is that one magnetic amplifier or the other is selected, depending on the phase of the error
signal, causing the motor to turn counterclockwise or clockwise as needed.
To implement this, the magnetic amplifier control windings are connected to opposite phases of the 400 Hz power.
The transistor is connected to both magnetic amplifiers through diodes, so current will flow only if
the transistor pulls the winding low during the half-cycle that the winding is powered high.
Thus, depending on the phase of the transistor output, one winding or the other will be powered,
allowing that magnetic amplifier to pass AC to the motor.</p>
<p><a href="https://static.righto.com/images/bendix-mach/schematic.png"><img alt="This reverse-engineered schematic probably has a few errors. Click the schematic for a larger version." class="hilite" height="358" src="https://static.righto.com/images/bendix-mach/schematic-w600.png" title="This reverse-engineered schematic probably has a few errors. Click the schematic for a larger version." width="600" /></a><div class="cite">This reverse-engineered schematic probably has a few errors. Click the schematic for a larger version.</div></p>
<p>The CADC has four servo amplifiers: this one for pressure error correction, one for temperature, and two
for pressure.
The amplifiers have different types of inputs: the temperature input is the probe resistance, the pressure error correction uses an error voltage from the
control transformer, and the pressure inputs are voltages from the inductive pickups in the sensor.
The circuitry is roughly the same for each amplifier—a transistor amplifier driving two magnetic amplifiers—but the details are different.
The largest difference is that each pressure transducer amplifier drives two motors (coarse and fine) so each
has two transistor stages and four magnetic amplifiers. <a class="footnote-backref" href="#fnref:schematic" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:magamps">
<p>The basic idea of a magnetic amplifier is a controllable inductor.
Normally, the inductor blocks alternating current.
But applying a relatively small DC signal to a control winding causes the inductor to saturate, permitting
the flow of AC.
Since the magnetic amplifier uses a small signal to control a much larger signal, it provides amplification.</p>
<p>In the early 1900s, magnetic amplifiers were used in applications such as dimming lights.
Germany improved the technology in World War II, using magnetic amplifiers in ships, rockets, and trains.
The magnetic amplifier had a resurgence in the 1950s; the Univac Solid State computer used magnetic amplifiers
(rather than vacuum tubes or transistors) as its logic elements.
However, improvements in transistors made the magnetic amplifier obsolete except for specialized applications.
(See my IEEE Spectrum article on <a href="https://spectrum.ieee.org/the-vacuum-tubes-forgotten-rival">magnetic amplifiers</a> for more history of magnetic amplifiers.) <a class="footnote-backref" href="#fnref:magamps" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
<li id="fn:coarse">
<p>The CADC specification defines how the parameter values correspond to rotation angles of the synchros.
For instance, for the log static pressure synchros,
the CADC supports the parameter range 0.8099 to 31.0185 inches of mercury.
The spec defines the corresponding synchro outputs as 16,320° rotation of the fine synchro
and 175.48° rotation of the coarse synchro over this range.
The synchro null point corresponds to 29.92 inches of mercury (i.e. zero altitude).
The fine synchro is geared to rotate 93 times as fast as the coarse synchro, so it
rotates over 45 times during this range, providing higher resolution than
a single synchro would provide.
The other synchro pairs use a much smaller 11:1 ratio; presumably high accuracy of the static pressure
was important. <a class="footnote-backref" href="#fnref:coarse" title="Jump back to footnote 13 in the text">↩</a></p>
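Assuming the synchro angle is linear in the log of the pressure (consistent with the logarithmic representation used elsewhere in the CADC), the spec's numbers can be turned into a small conversion sketch. The mapping and function below are my reconstruction, not from the spec:

```python
import math

# Parameter range and total synchro travel, from the CADC spec:
P_MIN, P_MAX = 0.8099, 31.0185                 # inches of mercury
COARSE_TRAVEL, FINE_TRAVEL = 175.48, 16320.0   # degrees over the range

def synchro_angles(p):
    """Map a static pressure onto coarse and fine synchro rotations,
    assuming the angle is proportional to the log of the pressure."""
    frac = math.log(p / P_MIN) / math.log(P_MAX / P_MIN)
    return COARSE_TRAVEL * frac, FINE_TRAVEL * frac

coarse, fine = synchro_angles(29.92)           # zero-altitude pressure
assert math.isclose(fine / coarse, 16320.0 / 175.48)   # the ~93:1 gearing
```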
</li>
<li id="fn:equations">
<p>Although the CADC's equations may seem <em>ad hoc</em>, they can be derived from fluid dynamics principles.
These equations were standardized in the 1950s by various government organizations including the National
Bureau of Standards and NACA (the precursor of NASA). <a class="footnote-backref" href="#fnref:equations" title="Jump back to footnote 14 in the text">↩</a></p>
</li>
<li id="fn:refs">
<p>It was very difficult to find information about the CADC.
The official military specification is MIL-C-25653C(USAF). After searching everywhere, I was finally able to
get a copy from the Technical Reports & Standards unit of the Library of Congress.
The other useful document was in an obscure conference proceedings from 1958:
"Air Data Computer Mechanization" (Hazen), Symposium on the USAF Flight Control Data Integration Program, Wright Air Dev Center US Air Force, Feb 3-4, 1958, pp 171-194. <a class="footnote-backref" href="#fnref:refs" title="Jump back to footnote 15 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com0tag:blogger.com,1999:blog-6264947694886887540.post-75824736928598418802024-01-29T17:33:00.000-08:002024-02-03T09:58:58.445-08:00Reverse engineering standard cell logic in the Intel 386 processorThe 386 processor (1985) was Intel's most complex processor at the time, with 285,000 transistors.
Intel had scheduled 50 person-years to design the processor, but it was falling behind schedule.
The design team decided to automate chunks of the layout, developing "automatic place and route" software.<span id="fnref:oral-history"><a class="ref" href="#fn:oral-history">1</a></span>
This was a risky decision since if the software couldn't create a dense enough layout, the chip couldn't be manufactured.
But in the end, the 386 finished ahead of schedule, an almost unheard-of accomplishment.</p>
<p>In this article, I take a close look at the "standard cells" used in the 386, the logic blocks that were
arranged and wired by software.
Reverse-engineering these circuits shows how standard cells implement logic gates, latches,
and other components with CMOS transistors.
Modern integrated circuits still use standard cells, much smaller now, of course, but built from the same
principles.</p>
<p>The photo below shows the 386 die with the automatic-place-and-route regions highlighted in red.
These blocks of unstructured logic have cells arranged in rows, giving them a characteristic striped appearance.
In comparison,
functional blocks
such as the datapath on the left
and the microcode ROM in the lower right
were designed manually to optimize density and performance, giving them a more solid appearance.
As for other features on the chip,
the black circles around the border are bond wire connections that go to the chip's external pins.
The chip has two metal layers, a small number by modern
standards, but a jump from the single metal layer of earlier processors such as the 286. The metal appears white in larger areas, but
purplish where circuitry underneath roughens its surface.
For the most part, the underlying silicon and the polysilicon wiring on top are obscured by the metal layers.</p>
<p><a href="https://static.righto.com/images/386-stdcell/386-apr.jpg"><img alt="Die photo of the 386 processor with standard-cell logic highlighted in red." class="hilite" height="636" src="https://static.righto.com/images/386-stdcell/386-apr-w600.jpg" title="Die photo of the 386 processor with standard-cell logic highlighted in red." width="600" /></a><div class="cite">Die photo of the 386 processor with standard-cell logic highlighted in red.</div></p>
<p>Early processors in the 1970s were usually designed by manually laying out every transistor individually, fitting transistors
together like puzzle pieces to optimize their layout.
While this was tedious, it resulted in a highly dense layout.
Federico Faggin, designer of the popular Z80 processor, <a href="https://archive.computerhistory.org/resources/text/Oral_History/Zilog_Z80/102658073.05.01.pdf#page=20">describes</a> finding that the last few transistors wouldn't fit, so he had to erase
three weeks of work and start over.
The closeup of the resulting Z80 layout below
shows that each transistor has a different, complex shape, optimized to pack the transistors as tightly as possible.<span id="fnref:regularity"><a class="ref" href="#fn:regularity">2</a></span></p>
<p><a href="https://static.righto.com/images/386-stdcell/z80.jpg"><img alt="A closeup of transistors in the Zilog Z80 processor (1976). This chip is NMOS, not CMOS, which provides more layout flexibility. The metal and polysilicon layers have been removed to expose the underlying silicon. The lighter stripes over active silicon indicate where the polysilicon gates were. I think this photo is from the Visual 6502 project but I'm not sure." class="hilite" height="377" src="https://static.righto.com/images/386-stdcell/z80-w350.jpg" title="A closeup of transistors in the Zilog Z80 processor (1976). This chip is NMOS, not CMOS, which provides more layout flexibility. The metal and polysilicon layers have been removed to expose the underlying silicon. The lighter stripes over active silicon indicate where the polysilicon gates were. I think this photo is from the Visual 6502 project but I'm not sure." width="350" /></a><div class="cite">A closeup of transistors in the Zilog Z80 processor (1976). This chip is NMOS, not CMOS, which provides more layout flexibility. The metal and polysilicon layers have been removed to expose the underlying silicon. The lighter stripes over active silicon indicate where the polysilicon gates were. I think this photo is from the Visual 6502 project but I'm not sure.</div></p>
<!--
![A closeup of transistors in the Intel 8086 processor. This chip is NMOS, not CMOS, which provides more layout flexibility. The metal and polysilicon layers have been removed to expose the underlying silicon.](8086-transistors.jpg "w400")
-->
<p>Standard-cell logic is an alternative that is much easier than manual layout.<span id="fnref:gate-arrays"><a class="ref" href="#fn:gate-arrays">3</a></span>
The idea is to create a standard library of blocks (cells) to implement each type of gate, flip-flop, and other low-level
component.
To use a particular circuit, instead of arranging each transistor, you use the standard design.
Each cell has a fixed height but the width varies as needed, so the standard cells can be arranged in rows.
For example, the die photo below shows three cells in a row: a latch, a high-current inverter, and a second latch.
This region has 24 transistors in total with PMOS above and NMOS below.
Compare the orderly arrangement of these transistors with the Z80 transistors above.</p>
<p><a href="https://static.righto.com/images/386-stdcell/cells.jpg"><img alt="Some standard cell circuitry in the 386. I removed the metal and polysilicon to show the underlying silicon. The irregular blotches are oxide that wasn't fully removed, and can be ignored." class="hilite" height="128" src="https://static.righto.com/images/386-stdcell/cells-w350.jpg" title="Some standard cell circuitry in the 386. I removed the metal and polysilicon to show the underlying silicon. The irregular blotches are oxide that wasn't fully removed, and can be ignored." width="350" /></a><div class="cite">Some standard cell circuitry in the 386. I removed the metal and polysilicon to show the underlying silicon. The irregular blotches are oxide that wasn't fully removed, and can be ignored.</div></p>
<p>The space between rows is used as a "wiring channel" that holds the wiring between the cells.
The photo below zooms out to show four rows of standard cells (the dark bands) and the wiring in between.
The 386 uses three layers for this wiring: polysilicon and the upper metal layer (M2) for vertical segments
and the lower metal layer (M1) for horizontal segments.</p>
<p><a href="https://static.righto.com/images/386-stdcell/standard-cell.jpg"><img alt="Some standard-cell logic in the 386 processor." class="hilite" height="398" src="https://static.righto.com/images/386-stdcell/standard-cell-w600.jpg" title="Some standard-cell logic in the 386 processor." width="600" /></a><div class="cite">Some standard-cell logic in the 386 processor.</div></p>
<p>To summarize, with standard cell logic, the cells are obtained from the standard cell library as needed,
defining the transistor layout and the wiring inside the cell.
However, the locations of each cell (placing) need to be determined, as well as how to arrange the wiring
(routing).
As will be seen, placing and routing the cells can be done manually or automatically.</p>
<!--
The big problem with standard cells is that placing the cells and routing the wiring is a very difficult
optimization task.
Cells that are part of the same circuit should be placed close to each other to minimize wire length.
And the connections must be routed without wires blocking each other.
(This was especially difficult when chips only had one or two layers of metal.)
I find the dies of standard cell chips are less interesting because there aren't visible structures but
just a random sea of gates. https://riscv.org/wp-content/uploads/2018/05/14.15-14.40-FlorianZaruba_riscv_workshop-1.pdf#page=22
The Intel 386 processor was implemented with standard cell logic for the "random" logic and manual layout
for structured or critical circuitry.
("Random logic" is not literally random, of course, but refers to circuitry that appears random because
it doesn't have a higher-level structure like an ALU or memory, for instance.)
-->
<h2>Use of standard cells in the 386</h2>
<p>Fairly late in the design process, the 386 team decided to use automatic place and route for parts of the chip.
By using automatic place and route, 2,254 gates (consisting of over 10,000 devices) were placed and routed in seven weeks.
(These numbers are from a paper "Automatic Place and Route Used on the 80386", co-written by Pat Gelsinger, now
the CEO of Intel. I refer to this paper multiple times, so I'll call it <em>APR386</em> for convenience.<span id="fnref:apr386"><a class="ref" href="#fn:apr386">4</a></span>)
Automatic place and route was not only faster, but it avoided the errors that crept in when layout was performed
manually.<span id="fnref:compaction"><a class="ref" href="#fn:compaction">5</a></span></p>
<p>The "place" part of automatic place and route consists of determining the arrangement of the standard cells
into rows to minimize the distance between connected cells.
Running long wires between cells wastes space on the die, since you end up with a lot of unnecessary metal wiring.
But more importantly, long paths have higher resistance, slowing down the signals.
Placement is a difficult optimization problem that is <a href="https://dl.acm.org/doi/10.1145/103724.103725">NP-complete</a>.
Moreover, the task was made more complicated by weighting paths by importance and electrical characteristics,
classifying signals as "normal", "fast", or "critical".
Paths were also weighted to encourage the use of the thicker M2 metal layer rather than the lower M1 layer.</p>
<p>The 386 team solved the placement problem with a program called Timberwolf, developed by a Berkeley grad student.
As one member of the 386 team <a href="https://dl.acm.org/doi/10.1145/103724.103725">said</a>, "If management had known that we were using a tool by some grad student as a key part of the methodology, they would never have let us use it."
Timberwolf used a simulated annealing algorithm, based on a simulated temperature that decreased over time.
The idea is to randomly move cells around, trying to find better positions, but gradually tighten up
the moves as the "temperature" drops.
At the end, the result is close to optimal.
The purpose of the temperature is to avoid getting stuck in a local minimum by allowing "bad" changes at the
beginning, but then tightening up the changes as the algorithm progresses.</p>
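The flavor of the algorithm is easy to demonstrate in miniature. The sketch below is not Timberwolf, just the core simulated-annealing loop applied to a toy one-dimensional placement problem of my own construction:

```python
import math, random

def anneal_placement(cells, wire_len, steps=20000, t0=10.0):
    """Toy simulated annealing: randomly swap two cells, accept worse
    placements with probability exp(-delta/T), and cool over time."""
    order = list(cells)
    cost = wire_len(order)
    best, best_cost = order[:], cost
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9        # linear cooling schedule
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]   # propose a swap
        new_cost = wire_len(order)
        if new_cost <= cost or random.random() < math.exp((cost - new_cost) / t):
            cost = new_cost                       # accept the move
            if cost < best_cost:
                best, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]   # reject: undo the swap
    return best, best_cost

# Toy netlist: each pair of connected cells should end up close together.
nets = [(0, 1), (1, 2), (2, 3), (4, 5), (6, 7)]
def wire_len(order):
    pos = {cell: slot for slot, cell in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)

random.seed(0)
start = [3, 6, 1, 4, 7, 0, 5, 2]
placed, cost = anneal_placement(start, wire_len)
assert cost < wire_len(start)   # the annealed placement is shorter
```

Connected cells drift toward adjacent slots as the temperature falls. Timberwolf applied this same accept-worse-moves-with-decreasing-probability idea to the 386's rows of cells, with nets weighted by importance.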
<p>Once the cells were placed in their positions, the second step was "routing", generating the layout of all
the wiring.
A suitable commercial router was not available in 1984, so Intel developed its own.
As routing is a difficult problem (also NP-complete), they took an iterative heuristic approach, repeatedly routing until they found
the smallest channel height that would work.
(Thus, the wiring channels are different sizes as needed.)
Then they checked the R-C timing of all the signals to find any signals that were too slow.
Designers could boost the size of the associated drivers (using the variety of available standard cells) and
try the routing again.</p>
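<p>Intel's router was custom and its internals aren't public, but the idea of packing wires into the smallest workable channel can be illustrated with the classic left-edge algorithm, which greedily assigns each net's horizontal span to the lowest free track (a sketch that ignores vertical constraints between nets):</p>

```python
def left_edge_route(nets):
    """Classic left-edge channel routing: assign each net's horizontal
    span (left, right) to the lowest track that is free, returning the
    track assignment and the channel height (number of tracks)."""
    tracks = []                            # rightmost occupied column per track
    assignment = {}
    # process nets sorted by their left endpoint
    for name, (left, right) in sorted(nets.items(), key=lambda kv: kv[1]):
        for i, busy_until in enumerate(tracks):
            if left > busy_until:          # track i is free past this net's start
                tracks[i] = right
                assignment[name] = i
                break
        else:                              # no free track: open a new one
            tracks.append(right)
            assignment[name] = len(tracks) - 1
    return assignment, len(tracks)
```

<p>A real channel router must also respect vertical constraints (nets sharing a column) and split nets with doglegs; this sketch shows only the track-packing idea behind minimizing channel height.</p>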
<h2>Brief CMOS overview</h2>
<p>The 386 was the first processor in Intel's x86 line to be built with a technology called CMOS instead of
using NMOS.
Modern processors are all built from CMOS because CMOS uses much less power than NMOS.
CMOS is more complicated to construct, though, because it uses two types of transistors—NMOS and PMOS—so early processors were typically NMOS.
But by the mid-1980s, the advantages of switching to CMOS were compelling.</p>
<p>The diagram below shows how an NMOS transistor is constructed.
The transistor can be considered a switch between the source and drain, controlled by the gate.
The source and drain regions (green) consist of silicon doped with impurities to change its semiconductor properties, forming N+ silicon.
The gate consists of a layer of polysilicon (red), separated from the silicon by a very thin insulating oxide layer.
Whenever polysilicon crosses active silicon, a transistor is formed.
A PMOS transistor has similar construction except it swaps the N-type and P-type silicon, consisting of P+ regions in a substrate of N silicon.</p>
<p><a href="https://static.righto.com/images/386-stdcell/mosfet-n.jpg"><img alt="Diagram showing the structure of an NMOS transistor." class="hilite" height="231" src="https://static.righto.com/images/386-stdcell/mosfet-n-w400.jpg" title="Diagram showing the structure of an NMOS transistor." width="400" /></a><div class="cite">Diagram showing the structure of an NMOS transistor.</div></p>
<p>The NMOS and PMOS transistors are opposite in their construction and operation.
An NMOS transistor turns on when the gate is high, while a PMOS transistor turns on when the gate is low.
An NMOS transistor is best at pulling its output low, while a PMOS transistor is best at pulling its output high.
In a CMOS circuit, the transistors work as a team, pulling the output high or low as needed;
this is the "Complementary" in CMOS.
(The behavior of MOS transistors is complicated, so this description is simplified, just enough to understand
digital circuits.)</p>
<p>One complication is that NMOS transistors are built on
P-type silicon, while PMOS transistors are built on N-type silicon.
Since the silicon die itself is N silicon, the NMOS transistors need to be surrounded by
a tub or well of P silicon.<span id="fnref:tub"><a class="ref" href="#fn:tub">6</a></span> The cross-section diagram below shows how the NMOS transistor on the left is embedded in a well of P-type silicon.</p>
<p><a href="https://static.righto.com/images/386-stdcell/cmos-cross-section.jpg"><img alt="Simplified structure of the CMOS circuits." class="hilite" height="217" src="https://static.righto.com/images/386-stdcell/cmos-cross-section-w500.jpg" title="Simplified structure of the CMOS circuits." width="500" /></a><div class="cite">Simplified structure of the CMOS circuits.</div></p>
<p>For proper operation, the silicon that surrounds transistors needs to be connected to the appropriate voltage through
"tap" contacts.<span id="fnref:tap"><a class="ref" href="#fn:tap">7</a></span>
For PMOS transistors, the substrate is connected to power through the taps, while for NMOS transistors the well region is connected to ground through the taps.
The chip needs to have enough taps to keep the voltage from fluctuating too much; each standard cell typically
has a positive tap and a ground tap.</p>
<p>The actual structure of the integrated circuit is much more three-dimensional than the diagram above,
due to the thickness of the various layers.
The diagram below is a more accurate cross-section.
The 386 has two layers of metal: the lower metal layer (M1) in blue and the upper metal layer (M2) in purple.
Polysilicon is colored red, while the insulating oxide layers are gray.</p>
<p><a href="https://static.righto.com/images/386-stdcell/cross-section.jpg"><img alt="Cross-section of CHMOS III transistors. From A double layer metal CHMOS III technology, image colorized by me." class="hilite" height="259" src="https://static.righto.com/images/386-stdcell/cross-section-w600.jpg" title="Cross-section of CHMOS III transistors. From A double layer metal CHMOS III technology, image colorized by me." width="600" /></a><div class="cite">Cross-section of CHMOS III transistors. From <a href="https://www.semanticscholar.org/paper/A-double-layer-metal-CHMOS-III-technology-Smith-Sery/7f1eb575d0c029300c946166f43f46558b4ba1b0">A double layer metal CHMOS III technology</a>, image colorized by me.</div></p>
<p>This complicated three-dimensional structure makes it harder to interpret the microscope images.
Moreover, the two metal layers obscure the circuitry underneath.
I have removed various layers with acids for die photos, but even so, the images are harder to interpret than
those of simpler chips.
If the die photos look confusing, don't be surprised.</p>
<p>A logic gate in CMOS is constructed from NMOS and PMOS transistors working together.
The schematic below shows a NAND gate with two PMOS transistors in parallel above and two NMOS transistors
in series below.
If both inputs are high, the two NMOS transistors turn on, pulling the output low.
If either input is low, a PMOS transistor turns on, pulling the output high.
(Recall that NMOS and PMOS are opposites: a high voltage turns an NMOS transistor on while a low voltage turns
a PMOS transistor on.)
Thus, the CMOS circuit below produces the desired output for the NAND function.</p>
<p><a href="https://static.righto.com/images/386-stdcell/nand.jpg"><img alt="A CMOS NAND gate." class="hilite" height="269" src="https://static.righto.com/images/386-stdcell/nand-w200.jpg" title="A CMOS NAND gate." width="200" /></a><div class="cite">A CMOS NAND gate.</div></p>
<p>The diagram below shows how this NAND gate is implemented in the 386 as a standard cell.<span id="fnref:nand-layout"><a class="ref" href="#fn:nand-layout">9</a></span>
A lot is going on in this cell, but it boils down to four transistors, as in the schematic above.
The yellow region is the P-type silicon that forms the two PMOS transistors; the transistor gates are
where the polysilicon (red) crosses the yellow region.<span id="fnref:colors"><a class="ref" href="#fn:colors">8</a></span>
(The middle yellow region is the drain for both transistors; there is no discrete boundary between the transistors.)
Likewise, the two NMOS transistors are at the bottom, where the polysilicon (red) crosses the active silicon (green).
The blue lines indicate the metal wiring for the cell. I thinned these lines to make the diagram clearer; in
the actual cell, the metal lines are as thick as they can be without touching, so they cover most of the cell.
The black circles are contacts, connections between the metal and the silicon or polysilicon.
Finally, the well taps are the opposite type of silicon, connected to the underlying silicon well or substrate to keep it at the proper voltage.</p>
<p><a href="https://static.righto.com/images/386-stdcell/nand-cell.jpg"><img alt="A standard cell for NAND in the 386." class="hilite" height="428" src="https://static.righto.com/images/386-stdcell/nand-cell-w400.jpg" title="A standard cell for NAND in the 386." width="400" /></a><div class="cite">A standard cell for NAND in the 386.</div></p>
<p>Wiring to a cell's inputs and output takes place at the top or bottom of the cell, with wiring in the channels between rows of
cells.
The polysilicon input and output lines are thickened at the top and bottom of the cell to allow connections to the cell.
The wiring between cells can be done with either polysilicon or metal.
Typically the upper metal layer (M2) is used for vertical wiring, while the lower metal layer (M1) is used
for horizontal runs.
Since each standard cell only uses M1, vertical wiring (M2) can pass over cells.
Moreover, a cell's output can also use a vertical metal wire (M2) rather than the polysilicon shown.
The point is that there is a lot of flexibility in how the system can route wires between the cells.
The power and ground wires (M1) are horizontal so they can run from cell to cell and a whole row can be powered from the ends.</p>
<p>The photo below shows this NAND cell with the metal layers removed by acid, leaving the silicon and the polysilicon.
You can match the features in the photo with the diagram above. The polysilicon appears green due to thin-film
effects. At the bottom, two polysilicon lines are connected to the inputs.</p>
<p><a href="https://static.righto.com/images/386-stdcell/nand-poly.jpg"><img alt="Die photo of the NAND standard cell with the metal layers removed. The image isn't as clear as I would like, but it was very difficult to remove the metal without destroying the polysilicon." class="hilite" height="357" src="https://static.righto.com/images/386-stdcell/nand-poly-w180.jpg" title="Die photo of the NAND standard cell with the metal layers removed. The image isn't as clear as I would like, but it was very difficult to remove the metal without destroying the polysilicon." width="180" /></a><div class="cite">Die photo of the NAND standard cell with the metal layers removed. The image isn't as clear as I would like, but it was very difficult to remove the metal without destroying the polysilicon.</div></p>
<p>The photo below shows how the cell appears in the original die. The two metal layers are visible, but they hide the polysilicon and
silicon underneath.
The vertical metal stripes are the upper (M2) wiring while the
lower metal wiring (M1) makes up the standard cell.
It is hard to distinguish the two metal layers, which makes interpretation of the images difficult.
Note that the metal wiring is wide, almost completely covering the cell, with small gaps between wires.
The contacts are visible as dark circles.
It is hard to recognize the standard cells from the bare die, as the contact pattern
is the only distinguishing feature.</p>
<p><a href="https://static.righto.com/images/386-stdcell/nand-metal.jpg"><img alt="Die photo of the NAND standard cell showing the metal layer." class="hilite" height="391" src="https://static.righto.com/images/386-stdcell/nand-metal-w180.jpg" title="Die photo of the NAND standard cell showing the metal layer." width="180" /></a><div class="cite">Die photo of the NAND standard cell showing the metal layer.</div></p>
<p>One of the interesting features of the 386's standard cell library is that each type of logic gate is available in
multiple drive strengths.
That is, cells are available with small transistors, large transistors, or multiple transistors in parallel.
Because the wiring and the transistor gates have capacitance, a delay occurs when changing state.
Bigger transistors produce more current, so they can switch the values on a wire faster.
But there are two disadvantages to bigger transistors. First, they take up more space on the die.
But more importantly, bigger transistors have bigger gates with more capacitance, so their inputs take longer
to switch.
(In other words, increasing the transistor size speeds up the output but slows the input, so overall performance
could end up worse.)
Thus, the sizes of transistors need to be carefully balanced to achieve optimum performance.<span id="fnref:logical-effort"><a class="ref" href="#fn:logical-effort">10</a></span>
With a variety of sizes in the standard cell library, designers can make the best choices.</p>
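<p>This tradeoff can be illustrated with a toy RC delay model (all constants here are hypothetical, chosen only to show the shape of the tradeoff): making a driver <i>k</i> times larger divides its output resistance by <i>k</i> but multiplies its gate capacitance by <i>k</i>, loading the previous stage:</p>

```python
def chain_delay(load_cap, size2, r_unit=1.0, c_gate=1.0, c_wire=5.0):
    """Two-stage chain: a unit-sized driver feeds a driver scaled by
    'size2'. Scaling up stage 2 lowers its resistance (r_unit/size2)
    but raises its input capacitance (size2 * c_gate), so stage 1
    takes longer to switch it. Delay is modeled as the sum of the
    two stages' RC products."""
    stage1 = r_unit * (c_wire + size2 * c_gate)  # driving the bigger gate
    stage2 = (r_unit / size2) * load_cap         # bigger driver, faster output
    return stage1 + stage2

# sweep the drive strength of the second stage for a fixed heavy load
delays = {s: chain_delay(load_cap=64.0, size2=s) for s in (1, 2, 4, 8, 16, 32)}
```

<p>With these toy numbers, a mid-sized driver wins: both the smallest and the largest drivers are slower overall, which is exactly why the library offers a range of strengths rather than simply making everything big.</p>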
<p>The image below shows a small NAND gate. The design is the same as the one described earlier, but the
transistors are much smaller. (Note that there is one row of metal contacts instead of two or three.)
The transistor gates are about half as wide (measured vertically) so the NAND gate will produce about
half the output current.<span id="fnref:gates"><a class="ref" href="#fn:gates">11</a></span></p>
<p><a href="https://static.righto.com/images/386-stdcell/nand-small.jpg"><img alt="Die photo of a small NAND standard cell with the metal removed." class="hilite" height="334" src="https://static.righto.com/images/386-stdcell/nand-small-w154.jpg" title="Die photo of a small NAND standard cell with the metal removed." width="154" /></a><div class="cite">Die photo of a small NAND standard cell with the metal removed.</div></p>
<p>Since the standard cells are all the same height, the maximum size of a transistor is limited.
To provide a larger drive strength, multiple transistors can be used in parallel.
The NAND gate below uses 8 transistors, four PMOS and four NMOS, providing twice as much current.</p>
<p><a href="https://static.righto.com/images/386-stdcell/nand-large.jpg"><img alt="A large NAND gate as it appears on the die, with the metal removed. The left side is slightly obscured by some remaining oxide." class="hilite" height="361" src="https://static.righto.com/images/386-stdcell/nand-large-w240.jpg" title="A large NAND gate as it appears on the die, with the metal removed. The left side is slightly obscured by some remaining oxide." width="240" /></a><div class="cite">A large NAND gate as it appears on the die, with the metal removed. The left side is slightly obscured by some remaining oxide.</div></p>
<p>The diagram below shows the structure of the large NAND gate, essentially two NAND gates in parallel.
Note that input 1 must be provided separately to both halves by the routing outside the cell.
Input 2, on the other hand, only needs to be supplied to the cell once, since it is wired to both halves
inside the cell.</p>
<p><a href="https://static.righto.com/images/386-stdcell/nand-large-diagram.jpg"><img alt="A diagram showing the structure of the large NAND gate." class="hilite" height="421" src="https://static.righto.com/images/386-stdcell/nand-large-diagram-w400.jpg" title="A diagram showing the structure of the large NAND gate." width="400" /></a><div class="cite">A diagram showing the structure of the large NAND gate.</div></p>
<p>Inverters are also available in a variety of drive strengths, from very small to very large, as shown below.
The inverter on the left uses the smallest transistors, while the inverter on the right not only uses
large transistors but is constructed from six inverters in parallel.
One polysilicon input controls all the transistors.</p>
<p><a href="https://static.righto.com/images/386-stdcell/inverters.jpg"><img alt="A small inverter and a large inverter." class="hilite" height="320" src="https://static.righto.com/images/386-stdcell/inverters-w460.jpg" title="A small inverter and a large inverter." width="460" /></a><div class="cite">A small inverter and a large inverter.</div></p>
<p>A more complex standard cell is XOR. The diagram below shows an XOR cell with large drive current. (There are smaller XOR cells.)
As with the large NAND gate, the PMOS transistors are doubled up for more current. The multiple input connections
are handled by the routing outside the cell.
Since the NMOS transistors don't need to be doubled up, there is a lot of unused space in the lower part
of the cell.
The extra space is used for a very large tap contact, consisting of 24 contacts to ground the well.</p>
<p><a href="https://static.righto.com/images/386-stdcell/xor-diagram.jpg"><img alt="The structure of an XOR cell with large drive current." class="hilite" height="372" src="https://static.righto.com/images/386-stdcell/xor-diagram-w600.jpg" title="The structure of an XOR cell with large drive current." width="600" /></a><div class="cite">The structure of an XOR cell with large drive current.</div></p>
<p>XOR is a difficult gate to build with CMOS. The cell above implements it by combining a NOR gate and an
AND-NOR gate, as shown below. You can verify that if both inputs are 0 or both inputs are 1, the output is
forced low as desired.
In the layout above, the NOR gate is on the left, while the AND-NOR gate has the AND part on the right.
A metal wire down the center connects the NOR output to the AND-NOR input.
The need for two sub-gates is another reason why the XOR cell is so large.</p>
<p><a href="https://static.righto.com/images/386-stdcell/xor-schematic.jpg"><img alt="Schematic of the XOR cell." class="hilite" height="124" src="https://static.righto.com/images/386-stdcell/xor-schematic-w400.jpg" title="Schematic of the XOR cell." width="400" /></a><div class="cite">Schematic of the XOR cell.</div></p>
<p>I'll describe one more cell, the latch, which holds one bit and is controlled by a clock signal.
Latches are heavily used in the 386 whenever a signal needs to be remembered or a circuit needs to
be synchronous.
The 386 has multiple types of standard cell latches including latches with set or reset controls and latches with different drive strengths.
Moreover, two latches can be combined to form an edge-triggered flip-flop standard cell.</p>
<p>The schematic below shows the basic latch circuit, the most common type in the 386.
On the right, two inverters form a loop. This loop can stably hold a 0 or 1 value.
On the left, a PMOS transistor and an NMOS transistor form a transmission gate. If the clock is high, both
transistors will turn on and pass the input through. If the clock is low, both transistors will turn off and
block the input.
The trick to the latch is that one inverter is weak, producing just a small current. The consequence is that
the input can overpower the inverter output, causing the inverter loop to switch to the input value.
The result is that when the clock is high, the latch will pass the input value through to the output.
But when the clock is low, the latch will hold its previous value.
(The output is inverted with respect to the input, which is slightly inconvenient but reduces the size of the latch.)</p>
<p><a href="https://static.righto.com/images/386-stdcell/latch-schematic.jpg"><img alt="Schematic of a latch." class="hilite" height="171" src="https://static.righto.com/images/386-stdcell/latch-schematic-w350.jpg" title="Schematic of a latch." width="350" /></a><div class="cite">Schematic of a latch.</div></p>
<p>The standard cell layout of the latch (below) is complicated, but it corresponds to the schematic.
At the left are the PMOS and NMOS transistors that form the transmission gate.
In the center is the weak inverter, with its output to the left.
The weak transistors are in the middle; they are overlapped by a thick polysilicon region, creating a long
gate that produces a low current.<span id="fnref:current"><a class="ref" href="#fn:current">12</a></span>
At the right is the inverter that drives the output.
The layout of this circuit is clever, designed to make the latch as compact as possible.
For example, the two inverters share power and ground connections.
Notice how the two clock lines pass from top to bottom through gaps in the active silicon so each line only forms one transistor.
Finally, the metal line in the center connects the transmission gate outputs and the weak inverter
output to the other inverter's input, but asymmetrically at the top so the two inverters don't collide.</p>
<p><a href="https://static.righto.com/images/386-stdcell/latch-diagram.jpg"><img alt="The standard cell layout of a latch." class="hilite" height="428" src="https://static.righto.com/images/386-stdcell/latch-diagram-w550.jpg" title="The standard cell layout of a latch." width="550" /></a><div class="cite">The standard cell layout of a latch.</div></p>
<p>To summarize, I examined many (but not all) of the standard cells in the 386 and found about 70 different types of cells.
These included the typical logic gates with various drive strengths: inverters, buffers, XOR, XNOR, AND-NOR,
and 3- and 4-input logic gates.
There are also transmission gates including ones that default high or low, as well as multiplexers built
from transmission gates.
I found a few cells that were surprising such as dual inverters and a combination 3-input and 2-input NAND gate.
I suspect these consist of two standard cells that were merged together, since they seem too specialized to
be part of a standard cell library.</p>
<p>The APR386 paper showed six of the standard cells in the 386 with the diagram below.
The small and large inverters are the same as the ones described above, as is the
NAND gate NA2B.
The latch is similar to the one described above, but with larger transistors.
The APR386 paper also showed a block of standard cells, which I was able to locate in the 386.<span id="fnref:cell-block"><a class="ref" href="#fn:cell-block">13</a></span></p>
<p><a href="https://static.righto.com/images/386-stdcell/paper-cells.jpg"><img alt="Examples of standard cells, from APR386. The numbers are not defined but may indicate input and output capacitance. (Click for a larger version.)" class="hilite" height="228" src="https://static.righto.com/images/386-stdcell/paper-cells-w800.jpg" title="Examples of standard cells, from APR386. The numbers are not defined but may indicate input and output capacitance. (Click for a larger version.)" width="800" /></a><div class="cite">Examples of standard cells, from APR386. The numbers are not defined but may indicate input and output capacitance. (Click for a larger version.)</div></p>
<!--
![This diagram from APR386 shows the 386 die with the automatically placed blocks indicated. The names may correspond to bus interface unit (BA), control unit (CA), prefetch unit (KA), paging (PA), segment unit (SA), and protection test unit (TA).](standard-cells-in-386.jpg "w600")
-->
<h2>Intel's standard cell line</h2>
<p>Intel productized its standard cells around 1986
as a 1.5 µm library using Intel's CMOS technology (called CHMOS III).<span id="fnref:handbook"><a class="ref" href="#fn:handbook">14</a></span>
Although the library had over 100 cell types, it was very limited compared to the cells used inside the 386.
The library included logic gates, flip-flops, and latches as well as scalable registers, counters, and adders.
Most gates only came in one drive strength. Even inverters only came in "normal" and "high" drive strength.
I assume these cells are the same as the ones used in the 386, but I don't have proof.
The library also included larger devices such as a cell-compatible 80C51 microcontroller and PC peripheral chips such
as the 8259 programmable interrupt controller and the 8254 programmable interval timer.
I think these were re-implemented using standard cells.</p>
<p>Intel later produced a 1.0 µm library using CHMOS IV, for use "both by ASIC customers and Intel's internal chip designers." This library had a larger collection of drive strengths.
The 1.0 µm library included the 80C186 and associated peripheral chips.</p>
<!--
## Discussion
According to APR386, the standard cell blocks have an (inverse) density of 1.0 to 1.9 mil<sup>2</sup> per device,
generally increasing as the block gets larger.
In other words, the computerized layout works better for smaller blocks than larger blocks.
In comparison, the 386 as a whole uses .6 mils/device, considerably less.
The 1.5 µm 386 die was 104 mm<sup>2</sup> and had 285,000 transistor sites.
That works out to 0.6 mil<sup>2</sup> per transistor site.
(A transistor site is a spot that could have a transistor, so the transistor count of a ROM doesn't change
based on whether it holds a 0 or 1 bit.)
From APR386, the total area of APR circuitry is 15278 mils<sup>2</sup>.
The microcode ROM has 37×10 rows of 256 columns, 94720 transistor sites. (This is for the ROM itself,
not the associated decoder and buffer circuitry.)
Another interesting perspective is that the APR circuitry used 10,000 transistors out of 285,000, 3.5%
of the total transistor count, while the APR circuitry took 9.5% of the total area.
In comparison, the microcode ROM holds a whopping 33% of the 386's transistors but takes up just 5.2% of the die.
A ROM is pretty much a best case situation for packing transistors, one advantage of using microcode.
-->
<h2>Layout techniques in the 386</h2>
<p>In this section, I'll look at the active silicon regions, making the cells themselves more visible.
In the photos below, I dissolved the metal and polysilicon, leaving the active silicon.
(Ignore the irregular greenish shapes; these are oxide that wasn't fully removed.)</p>
<p>The photo below shows the silicon for three rows of standard cells using automatic place and route.
You can see the wide variety of standard cell widths, but the height of the cells is constant.
The transistor gates are visible as the darker vertical stripes across the silicon.
You may be able to spot the latch in each row, distinguished by the long, narrow transistors of the weak inverters.</p>
<p><a href="https://static.righto.com/images/386-stdcell/silicon1.jpg"><img alt="Three rows of standard cells that were automatically placed and routed." class="hilite" height="399" src="https://static.righto.com/images/386-stdcell/silicon1-w236.jpg" title="Three rows of standard cells that were automatically placed and routed." width="236" /></a><div class="cite">Three rows of standard cells that were automatically placed and routed.</div></p>
<p>In the first row, the larger PMOS transistors are on top, while the smaller NMOS transistors are below.
This pattern alternates from row to row, so the second row has the NMOS transistors on top and the third
row has the PMOS transistors on top.
The height of the wiring channel between the cells is variable, made as small as possible while fitting the
wiring.</p>
<p>The 386 also contains regions of standard cells that were apparently manually placed and routed, as shown in the photo below.
Using standard cells avoids the effort of laying out each transistor, so it is still easier than a fully
custom layout.
These cells are in rows, but the rows are now double rows with channels in between.
The density is higher, but routing the wires becomes more challenging.</p>
<p><a href="https://static.righto.com/images/386-stdcell/silicon2.jpg"><img alt="Three rows of standard cells that were manually placed and routed." class="hilite" height="410" src="https://static.righto.com/images/386-stdcell/silicon2-w347.jpg" title="Three rows of standard cells that were manually placed and routed." width="347" /></a><div class="cite">Three rows of standard cells that were manually placed and routed.</div></p>
<p>For critical circuitry such as the datapath, the layout of each transistor was optimized. The register file,
for example, has a very dense layout as shown below.
As you can see, the density is much higher than in the previous photos. (The three photos are at the same scale.)
Transistors are packed together with very little wasted space.
This makes the layout difficult since there is little room for wiring.
For this particular circuit, the lower metal layer (M1) runs vertically with signals for each bit while the upper metal layer (M2) runs
horizontally for power, ground, and control signals.<span id="fnref:regs"><a class="ref" href="#fn:regs">15</a></span></p>
<p><a href="https://static.righto.com/images/386-stdcell/silicon3.jpg"><img alt="Three rows of standard cells that were manually placed and routed." class="hilite" height="288" src="https://static.righto.com/images/386-stdcell/silicon3-w400.jpg" title="Three rows of standard cells that were manually placed and routed." width="400" /></a><div class="cite">Three rows of standard cells that were manually placed and routed.</div></p>
<p>The point of this is that the 386 uses a variety of different design techniques, from dense manual layout
to much faster automated layout.
Different techniques were used for different parts of the chip, based on how important it was to optimize.
For example, circuits in the datapath were typically repeated 32 times, once for each bit, so manual effort was worthwhile.
The most critical functional blocks were the microcode ROM (CROM), large PLAs, ALU, TLB (translation lookaside buffer), and the barrel shifter.<span id="fnref:prak"><a class="ref" href="#fn:prak">16</a></span></p>
<h2>Conclusions</h2>
<p>Standard cell logic and automatic place and route have a long history before the 386,
back to the early 1970s, so this isn't an Intel invention.<span id="fnref:standard-cell-history"><a class="ref" href="#fn:standard-cell-history">17</a></span>
Nonetheless, the 386 team deserves the credit for deciding to use this technology at a time when it
was a risky decision.
They needed to develop custom software for their placing and routing needs, so this wasn't a trivial undertaking.
This choice paid off and they completed the 386 ahead of schedule.
The 386 ended up being a huge success for Intel, moving the x86 architecture to 32 bits and defining the dominant computer
architecture for the rest of the 20th century.</p>
<p>If you're interested in standard cell logic, I wrote about <a href="https://www.righto.com/2021/03/reverse-engineering-standard-cell-logic.html">standard cell logic in an IBM chip</a>.
I plan to write more about the 386, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon occasionally as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Thanks to Pat Gelsinger and Roxanne Koester for providing helpful papers.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:oral-history">
<p>The decision to use automatic place and route is described on page 13 of the <a href="https://archive.computerhistory.org/resources/text/Oral_History/Intel_386_Design_and_Dev/102702019.05.01.acc.pdf#page=13">Intel 386 Microprocessor Design and Development Oral History Panel</a>, a very interesting document on the 386 with discussion from
some of the people involved in its development. <a class="footnote-backref" href="#fnref:oral-history" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:regularity">
<p>Circuits that had a high degree of regularity, such as the arithmetic/logic unit (ALU) or register storage were
typically constructed by manually laying out a block to implement a bit and then repeating the block
as needed.
Because a circuit was repeated 32 times for the 32-bit processor, the additional effort was worthwhile. <a class="footnote-backref" href="#fnref:regularity" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:gate-arrays">
<p>An alternative layout technique is the gate array, which doesn't provide as much flexibility as a standard cell approach.
In a gate array (sometimes called a master slice), the chip had a fixed array of transistors (and often resistors).
The chip could be customized for a particular application by designing the metal layer to connect the transistors
as needed. The density of the chip was usually poor, but gate arrays were much faster to design, so they
were advantageous for applications that didn't need high density or produced a relatively small volume of chips.
Moreover, manufacturing was much faster because the silicon wafers could be constructed in advance with
the transistor array and warehoused. Putting the metal layer on top for a particular application could
then be quick.
Similar gate arrays used a fixed arrangement of logic gates or flip-flops, rather than transistors.
Gate arrays date back to <a href="https://www.computerhistory.org/siliconengine/application-specific-integrated-circuits-employ-computer-aided-design/">1967</a>. <a class="footnote-backref" href="#fnref:gate-arrays" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:apr386">
<p>The full citation for the APR386 paper is "Automatic Place and Route Used on the 80386" by Joseph Krauskopf and Pat Gelsinger, Intel Technology Journal, Spring 1986.
I was unable to find it online. <a class="footnote-backref" href="#fnref:apr386" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:compaction">
<p>Once the automatic place and route process had finished, the mask designers performed some cleanup along with
compaction to
squeeze out wasted space, but this was a relatively minor amount of work.</p>
<p>While manual optimization has benefits, it can also be overdone.
When the manufacturing process improved, the 80386 moved from a 1.5 µm process to a 1 µm process.
The layout engineers took advantage
of this switch to optimize the standard cell circuitry, manually squeezing out some extra space.
Unfortunately, optimizing one block of a die doesn't necessarily make the die smaller, since the size
is constrained by the largest blocks.
The result is that the optimized 80386 has blocks of empty space at the bottom (visible as black rectangles)
and the standard-cell optimization didn't provide any overall benefit.
(As the Pentium Pro chief architect Robert Colwell explains, "Removing the state of Kansas does not make the perimeter of the United States any smaller.") <!-- Pentium Chronicles, p159 --></p>
<p><a href="https://static.righto.com/images/386-stdcell/shrink.jpg"><img alt="Comparison of the 1.5 µm die and the 1 µm die at the same scale. Photos courtesy of Antoine Bercovici." class="hilite" height="384" src="https://static.righto.com/images/386-stdcell/shrink-w600.jpg" title="Comparison of the 1.5 µm die and the 1 µm die at the same scale. Photos courtesy of Antoine Bercovici." width="600" /></a><div class="cite">Comparison of the 1.5 µm die and the 1 µm die at the same scale. Photos courtesy of Antoine Bercovici.</div></p>
<p>At least compaction went better for the 386 than for the Pentium.
Intel performed a compaction on the Pentium shortly before release, attempting to reduce the die size.
The engineers shrank the floating point divider, removing some lookup table cases that they proved
were unnecessary.
Unfortunately, the proof was wrong, resulting in floating point errors in a few cases.
This caused the infamous Pentium FDIV bug, a problem that became highly visible to the general public.
Replacing the flawed processors cost Intel $475 million.
And it turned out that shrinking the floating point divider had no effect on the overall die size.</p>
<p>Coincidentally, early models of the 386 had an integer multiplication bug, but Intel fixed this with
little cost or criticism. The 386 bug was an analog issue that only showed up unpredictably with
a combination of argument values, temperature, and manufacturing conditions. <a class="footnote-backref" href="#fnref:compaction" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:tub">
<p>This chip is built on a substrate of N-type silicon, with wells of P-type silicon for the NMOS transistors.
Chips can be built the other way around, starting with P-type silicon and putting wells of N-type silicon
for the PMOS transistors.
Another approach is the "twin-well" CMOS process, constructing wells for both NMOS and PMOS transistors. <a class="footnote-backref" href="#fnref:tub" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:tap">
<p>The bulk silicon voltage makes the boundary between a transistor and the bulk silicon act as a reverse-biased diode, so current can't flow
across the boundary.
Specifically, for a PMOS transistor, the N-silicon substrate is connected to the positive supply.
For an NMOS transistor, the P-silicon well
is connected to ground.
A P-N junction acts as a diode, with current flowing from P to N. But the substrate voltages put P at ground and N at +5, blocking any
current flow.
The result is that the bulk silicon can be considered an insulator, with current restricted to the N+ and P+ doped regions.
If this back bias gets reversed, for example, due to power supply fluctuations, current can flow through the
substrate.
This can result in "latch-up", a situation where the N and P regions act as parasitic NPN and PNP transistors
that latch into the "on" state. This shorts power and ground and can destroy the chip.
The point is that the substrate voltages are very important for the proper operation of the chip. <a class="footnote-backref" href="#fnref:tap" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:colors">
<p>I'm using the standard CMOS coloring scheme for my diagrams. I'm told that Intel uses a different color
scheme internally. <a class="footnote-backref" href="#fnref:colors" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:nand-layout">
<p>The schematic below shows the physical arrangement of the transistors for the NAND gate, in case it is unclear how to get from the layout to the logic gate circuit.
The power and ground lines are horizontal so power can pass from cell to cell when the cells are
connected in rows.
The gate's inputs and outputs are at the top and bottom of the cell, where they can be connected through
the wiring channels.
Even though the transistors are arranged horizontally, the PMOS transistors (top) are in parallel, while the NMOS transistors (bottom) are in series.</p>
<p><a href="https://static.righto.com/images/386-stdcell/nand-layout.jpg"><img alt="Schematic of the NAND gate as it is arranged in the standard cell." class="hilite" height="224" src="https://static.righto.com/images/386-stdcell/nand-layout-w300.jpg" title="Schematic of the NAND gate as it is arranged in the standard cell." width="300" /></a><div class="cite">Schematic of the NAND gate as it is arranged in the standard cell.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:nand-layout" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:logical-effort">
<p>The 1999 book <a href="https://amzn.to/3UxmgV5">Logical Effort</a> describes a methodology for maximizing the
performance of CMOS circuits by correctly sizing the transistors. <a class="footnote-backref" href="#fnref:logical-effort" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:gates">
<p>Unfortunately, the word "gate" is used for both transistor gates and logic gates, which can be confusing. <a class="footnote-backref" href="#fnref:gates" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:current">
<p>You might expect that these transistors would produce <em>more</em> current since they are larger than the
regular transistors. However, a transistor's current output is proportional to the gate width
divided by the gate length.
Thus, making the transistor wider increases the current, but making
the transistor longer decreases the current.
You can think of increased width as acting as multiple transistors in parallel.
Increased length, on the other hand, makes a longer path for current to get from the source to the drain,
weakening it. <a class="footnote-backref" href="#fnref:current" title="Jump back to footnote 12 in the text">↩</a></p>
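The width-to-length relationship above can be sketched numerically. This is a minimal illustration of my own (not from the article); the function name and values are hypothetical, and real MOSFET behavior involves many more factors than this first-order model:

```python
# First-order model: MOS drive current is proportional to W/L.
def relative_drive(width_um: float, length_um: float) -> float:
    """Drive strength relative to a square (W == L) transistor."""
    return width_um / length_um

# A transistor twice as wide drives twice the current...
assert relative_drive(2.0, 1.0) == 2.0
# ...but a transistor twice as long drives half the current,
# even though both are "bigger" than the reference transistor.
assert relative_drive(1.0, 2.0) == 0.5
```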
</li>
<li id="fn:cell-block">
<p>The APR386 paper discusses the standard-cell layout in detail.
It includes a plot of a block of standard-cell circuitry (below).</p>
<p><a href="https://static.righto.com/images/386-stdcell/apr-386.jpg"><img alt="A block of standard-cell circuitry from APR386." class="hilite" height="442" src="https://static.righto.com/images/386-stdcell/apr-386-w300.jpg" title="A block of standard-cell circuitry from APR386." width="300" /></a><div class="cite">A block of standard-cell circuitry from APR386.</div></p>
<p>After carefully studying the 386 die, I was able to
find the location of this block of circuitry (below).
The two regions match exactly; they look a bit different because the M1 metal layer (horizontal) doesn't show up in the plot above.</p>
<p><a href="https://static.righto.com/images/386-stdcell/apr-386-die-zoom.jpg"><img alt="The same block of standard cells on the 386 die." class="hilite" height="466" src="https://static.righto.com/images/386-stdcell/apr-386-die-zoom-w600.jpg" title="The same block of standard cells on the 386 die." width="600" /></a><div class="cite">The same block of standard cells on the 386 die.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:cell-block" title="Jump back to footnote 13 in the text">↩</a></p>
</li>
<li id="fn:handbook">
<p>Intel's CHMOS III standard cells are documented in <a href="https://bitsavers.org/components/intel/_dataBooks/1988_Introduction_to_Intel_Cell-Based_Design.pdf">Introduction to Intel Cell-Based Design</a> (1988).
The CHMOS IV library is discussed in <a href="https://doi.org/10.1109/ASIC.1990.186171">Design Methodology for a 1.0µ Cell-based Library Efficiently Optimized for Speed and Area</a>.
The paper <a href="https://doi.org/10.1109/ASIC.1990.186174">Validating an ASIC Standard Cell Library</a> covers both libraries. <a class="footnote-backref" href="#fnref:handbook" title="Jump back to footnote 14 in the text">↩</a></p>
</li>
<li id="fn:regs">
<p>For details on the 386's register file, see my <a href="https://www.righto.com/2023/11/reverse-engineering-intel-386.html">earlier article</a>. <a class="footnote-backref" href="#fnref:regs" title="Jump back to footnote 15 in the text">↩</a></p>
</li>
<li id="fn:prak">
<p>Source: "High Performance Technology Circuits and Packaging for the 80386", Jan Prak, Proceedings, ICCD Conference, Oct. 1986. <a class="footnote-backref" href="#fnref:prak" title="Jump back to footnote 16 in the text">↩</a></p>
</li>
<li id="fn:standard-cell-history">
<p>I'll provide more history on standard cells in this footnote.
RCA <a href="https://patents.google.com/patent/US3573488A">patented</a> a bipolar standard cell in 1971,
but this was a fixed arrangement of transistors and resistors, more of a gate array than a modern
standard cell.
Bell Labs researched standard cell layout techniques in the early 1970s, calling them Polycells, including
a <a href="https://dl.acm.org/doi/10.1145/62882.62886">1973 paper</a> by Brian Kernighan.
By 1979 <a href="http://www.bitsavers.org/pdf/xerox/parc/techReports/SSL-79-7_A_Guide_to_LSI_Implementation_Second_Edition.pdf">A Guide to LSI Implementation</a> discussed the standard cell approach and
it was described as well-known in <a href="https://patents.google.com/patent/US4319396A">this patent application</a>.
Even so, <a href="https://www.worldradiohistory.com/Archive-Electronics/80s/80/Electronics-1980-07-31.pdf#page=77">Electronics</a> called these design methods "futuristic" in 1980.</p>
<p>Standard cells became popular in the mid-1980s as faster computers and improved design software made
it practical to produce semi-custom designs that used standard cells.
Standard cells made it to the cover of <a href="http://www.bitsavers.org/magazines/Digital_Design/Digital_Design_V15_N08_198508.pdf">Digital Design</a> in August 1985, and the article inside described numerous vendors and products.
Companies like <a href="https://www.worldradiohistory.com/Archive-Electronics/80s/83/Electronics-1983-03-10.pdf#page=151">Zymos</a> and <a href="http://www.bitsavers.org/components/vti/tools/1988_VTI_Cell-Based_Design_Users_Guide.pdf">VLSI Technology</a> (VTI) focused on standard cells.
Traditional companies such as <a href="http://www.bitsavers.org/components/ti/_dataBooks/1986_TI_2-um_CMOS_Standard_Cell_Data_Book.pdf">Texas Instruments</a>, NCR, GE/RCA, <a href="https://www.worldradiohistory.com/Archive-Electronics/80s/87/Electronics-1987-06-25.pdf#page=82">Fairchild</a>, Harris, <a href="https://www.worldradiohistory.com/Archive-ITT/80s/ITT-1983-58-No-4.pdf">ITT</a>, and Thomson introduced lines of standard cell products in
the mid-1980s.
<!-- https://www.worldradiohistory.com/Archive-Electronics/80s/87/Electronics-1987-05-28.pdf -->
<!--
The mid-1980s saw enough interest in standard cells that a <a href="https://books.google.com/books/about/Gate_Array_and_Standard_Cell_IC_Vendor_D.html?id=0kwsAQAAIAAJ">vendor guide</a> was published.
--> <a class="footnote-backref" href="#fnref:standard-cell-history" title="Jump back to footnote 17 in the text">↩</a></p>
</li>
</ol>
</div>
<h2>Reverse engineering CMOS, illustrated with a vintage Soviet counter chip</h2>
<p>I recently came across an interesting die photo of a Soviet<span id="fnref:soviet"><a class="ref" href="#fn:soviet">1</a></span> chip, probably designed in the 1970s.
This article provides an introductory guide to reverse-engineering CMOS circuits, using this chip as an
example.
Although the chip looks like a tangle of lines at first,
its large features and simple layout make it possible to understand its circuits.
I'll first explain how to recognize the individual transistors.
Groups of transistors are connected in standard patterns to form CMOS gates, multiplexers, flip-flops, and other circuits.
Once these building blocks are understood, reverse-engineering the full chip becomes practical.
The chip turned out to be a 4-bit CMOS counter, a copy of
the Motorola <a href="https://www.onsemi.com/download/data-sheet/pdf/mc14516b-d.pdf">MC14516B</a>.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/die.jpg"><img alt="Die photo of the К561ИЕ11 chip on a wafer. Image courtesy of Martin Evtimov. Click this image (or any other) for a larger version." class="hilite" height="451" src="https://static.righto.com/images/soviet-cmos/die-w600.jpg" title="Die photo of the К561ИЕ11 chip on a wafer. Image courtesy of Martin Evtimov. Click this image (or any other) for a larger version." width="600" /></a><div class="cite">Die photo of the К561ИЕ11 chip on a wafer. Image courtesy of <a href="https://siliconpr0n.org/archive/doku.php?id=mevtimov:start">Martin Evtimov</a>. Click this image (or any other) for a larger version.</div></p>
<p>The photo above shows the tiny silicon die under a microscope.
Regions of the silicon are doped with impurities to change the silicon's electrical properties.
This doping also causes regions of the silicon to appear greenish or reddish, depending on how a region is doped.
(These color changes will turn out to be useful for reverse engineering.)
On top of the silicon, the whitish metal layer is visible, forming the chip's connections.
This chip uses metal-gate transistors, an old technology, so the metal layer also forms the gates of the transistors.
Around the outside of the chip, the 16 square bond pads connect the chip to the outside world.
When installed in a package, the die has tiny bond wires between the pads and the lead frame, the metal
structure that connects to the chip's pins.</p>
<p>According to the Russian datasheet,<span id="fnref:datasheet"><a class="ref" href="#fn:datasheet">2</a></span> the chip has 319 "elements", presumably counting the semiconductor devices. The chip has a handful of diodes
to protect the inputs,
so the total transistor count is a bit over 300.
This transistor count is nothing compared to a modern CMOS processor with tens of billions of transistors,
of course, but most of the circuit principles are the same.</p>
<h2>NMOS and PMOS transistors</h2>
<p>CMOS is a low-power logic family now used in almost all processors.<span id="fnref:cmos"><a class="ref" href="#fn:cmos">3</a></span>
CMOS (complementary MOS) circuitry uses two types of transistors, NMOS and PMOS, working together.
The diagram below shows how an NMOS transistor is constructed.
The transistor can be considered a switch between the source and drain, controlled by the gate.
The source and drain regions (red) consist of silicon doped with impurities to change its semiconductor properties, forming N+ silicon.
The gate consists of an aluminum layer, separated from the silicon by a very thin insulating oxide layer.<span id="fnref:metal-gate"><a class="ref" href="#fn:metal-gate">4</a></span>
(These three layers—Metal, Oxide, Semiconductor—give the MOS transistor its name.)
This oxide layer is an insulator, so there is essentially no current flow through the gate, one reason why
CMOS is a low-power technology.
However, the thin oxide layer is easily destroyed by static electricity, making MOS integrated circuits sensitive
to electrostatic discharge.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/mosfet-n.jpg"><img alt="Structure of an NMOS transistor." class="hilite" height="229" src="https://static.righto.com/images/soviet-cmos/mosfet-n-w400.jpg" title="Structure of an NMOS transistor." width="400" /></a><div class="cite">Structure of an NMOS transistor.</div></p>
<p>A PMOS transistor (below) has the opposite configuration from an NMOS transistor: the source and drain are doped to form P+ regions, while the
underlying bulk silicon is N-type silicon.
The doping process is interesting, but I'll leave the details to a footnote.<span id="fnref:doping"><a class="ref" href="#fn:doping">5</a></span></p>
<p><a href="https://static.righto.com/images/soviet-cmos/mosfet-p.jpg"><img alt="Structure of a PMOS transistor." class="hilite" height="230" src="https://static.righto.com/images/soviet-cmos/mosfet-p-w400.jpg" title="Structure of a PMOS transistor." width="400" /></a><div class="cite">Structure of a PMOS transistor.</div></p>
<p>The NMOS and PMOS transistors are opposite in their construction and operation; this is the "Complementary" in CMOS.
An NMOS transistor turns on when the gate is high, while a PMOS transistor turns on when the gate is low.
An NMOS transistor is best at pulling its output low, while a PMOS transistor is best at pulling its output high.
In a CMOS circuit, the transistors work as a team, pulling the output high or low as needed.
The behavior of MOS transistors is complicated, so this description is simplified, just enough to understand
digital circuits.</p>
<p>If you buy an MOS transistor from an electronics supplier, it comes as a package with three leads for the
source, gate, and drain. The source and drain are connected differently inside the package and are not interchangeable in
a circuit.
In an integrated circuit, however, the transistor is symmetrical and the source and drain are the same.
For that reason, I won't distinguish between the source and the drain in the following discussion.
I will use the symmetrical symbols below for NMOS and PMOS transistors; the inversion bubble on the PMOS gate
symbolizes that a low signal activates the PMOS transistor.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/symbols.png"><img alt="Symbols for NMOS and PMOS transistors." class="hilite" height="147" src="https://static.righto.com/images/soviet-cmos/symbols-w250.png" title="Symbols for NMOS and PMOS transistors." width="250" /></a><div class="cite">Symbols for NMOS and PMOS transistors.</div></p>
<p>One complication is that NMOS transistors are built on
P-type silicon, while PMOS transistors are built on N-type silicon.
Since the silicon die itself is N silicon, the NMOS transistors need to be surrounded by
a tub or well of P silicon.<span id="fnref:tub"><a class="ref" href="#fn:tub">6</a></span> The cross-section diagram below shows how the NMOS transistor on the right is embedded in the well of P-type silicon.
Constructing two transistor types with opposite behaviors makes manufacturing more complex, one reason why CMOS
took years to catch on.
CMOS was invented in 1963 at Fairchild Semiconductor, but RCA was the main proponent of CMOS, commercializing it
in the late 1960s. Although RCA produced a CMOS microprocessor in 1974, mainstream microprocessors didn't
switch to CMOS until the mid-1980s with chips such as the Motorola 68020 (1984) and the Intel 386 (1986).</p>
<p><a href="https://static.righto.com/images/soviet-cmos/cmos-cross-section.jpg"><img alt="Cross-section of CMOS transistors." class="hilite" height="207" src="https://static.righto.com/images/soviet-cmos/cmos-cross-section-w500.jpg" title="Cross-section of CMOS transistors." width="500" /></a><div class="cite">Cross-section of CMOS transistors.</div></p>
<p>For proper operation, the silicon that surrounds transistors needs to be connected to the appropriate voltage through
"tap" contacts.<span id="fnref:tap"><a class="ref" href="#fn:tap">7</a></span>
For PMOS transistors, the substrate is connected to power through the taps, while for NMOS transistors the well region is connected to ground through the taps.
When reverse-engineering, the taps can provide important clues, indicating which regions are NMOS and which are PMOS.
As will be seen below, these voltages are also important for understanding the circuitry of this chip.</p>
<p>The die photo below shows two transistors as they appear on the die.
The appearance of transistors varies between different integrated circuits, so a first step of
reverse engineering is determining how they look in a particular chip.
In this IC, a transistor gate can be distinguished by a large rectangular region over the silicon.
(In other metal-gate transistors, the gate often has a "bubble" appearance.)
The interactions between the metal wiring and the silicon can be distinguished by subtle differences.
For the most part, the metal wiring passes over the silicon, isolated by thick insulating oxide.
A contact between metal and silicon is recognizable by a smaller oval region that is slightly darker;
wires are connected to the transistor sources and drains below.
MOS transistors often don't have discrete boundaries; as will be seen later, the source of one transistor can overlap with the drain of another.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/transistors.jpg"><img alt="Two transistors on the die." class="hilite" height="378" src="https://static.righto.com/images/soviet-cmos/transistors-w350.jpg" title="Two transistors on the die." width="350" /></a><div class="cite">Two transistors on the die.</div></p>
<p>Distinguishing PMOS and NMOS transistors can be difficult.
On this chip, P-type silicon appears greenish, and N-type silicon appears reddish.
Thus, PMOS transistors appear as a green region surrounded by red, while NMOS is the opposite.
Moreover, PMOS transistors are generally larger than NMOS transistors because they are weaker.
Another way to distinguish them is by their connection in circuits.
As will be seen below, PMOS transistors in logic gates are connected to power while NMOS
transistors are connected to ground.</p>
<p>Metal-gate transistors are a very old technology, mostly replaced by silicon-gate transistors in the 1970s.
Silicon-gate circuitry uses an additional layer of polysilicon wiring. Moreover, modern ICs usually have more
than one layer of metal.
The metal-gate IC in this post is easier to understand than a modern IC, since there are fewer layers to analyze.
The CMOS principles are the same in modern ICs, but the layout will appear different.</p>
<h2>Implementing an inverter in CMOS</h2>
<p>The simplest CMOS gate is an inverter, shown below. Although basic, it illustrates most of the principles of
CMOS circuitry. The inverter is constructed from a PMOS transistor on top to pull the output high and an NMOS transistor below to pull the output low.
The input is connected to the gates of both transistors.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/inverter.png"><img alt="A CMOS inverter is constructed from a PMOS transistor (top) and an NMOS transistor (bottom)." class="hilite" height="229" src="https://static.righto.com/images/soviet-cmos/inverter-w300.png" title="A CMOS inverter is constructed from a PMOS transistor (top) and an NMOS transistor (bottom)." width="300" /></a><div class="cite">A CMOS inverter is constructed from a PMOS transistor (top) and an NMOS transistor (bottom).</div></p>
<p>Recall that an NMOS transistor is turned on by a high signal on the gate, while a PMOS transistor is the opposite,
turned
on by a low signal.
Thus, when the input is high, the NMOS transistor (bottom) turns on, pulling the output low.
When the input is low, the PMOS transistor (top) turns on, pulling the output high.
Notice how the transistors act in opposite (i.e. complementary) fashion.</p>
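The inverter's complementary behavior can be expressed as a tiny switch-level model. This is my own sketch (the names are made up, not from the article): each transistor is modeled as a switch, and the assertion captures the rule that exactly one of the two networks conducts for any input:

```python
# Switch-level model of a CMOS inverter.
def pmos_conducts(gate: int) -> bool:
    return gate == 0   # PMOS turns on when its gate is low

def nmos_conducts(gate: int) -> bool:
    return gate == 1   # NMOS turns on when its gate is high

def inverter(inp: int) -> int:
    pull_high = pmos_conducts(inp)   # PMOS connects output to power
    pull_low = nmos_conducts(inp)    # NMOS connects output to ground
    assert pull_high != pull_low     # exactly one transistor is on
    return 1 if pull_high else 0

assert inverter(0) == 1
assert inverter(1) == 0
```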
<p><a href="https://static.righto.com/images/soviet-cmos/inverter-active.png"><img alt="How the inverter functions." class="hilite" height="257" src="https://static.righto.com/images/soviet-cmos/inverter-active-w500.png" title="How the inverter functions." width="500" /></a><div class="cite">How the inverter functions.</div></p>
<p>An inverter on the die is shown below. The PMOS and NMOS transistors are indicated by red boxes and
the transistors are connected according to the schematics above.
The input is connected to the gates of the two transistors, which can be distinguished as larger metal rectangles.
On the right, two contacts connect the transistor drains to the output.
The power and ground connections are a bit different from most chips since the metal lines appear not to
go anywhere.
The short metal line labeled "power" connects the PMOS transistor's source to the substrate, the reddish silicon
that surrounds the transistor.
As described earlier, the substrate is connected to the chip's power.
Thus, the transistor receives its power through the substrate silicon.
This approach isn't optimal, due to the relatively high resistance of silicon, but it simplifies the wiring.
Similarly, the ground metal connects the NMOS transistor's source to the well that surrounds the transistor,
P-type silicon that appears green.
Since the well is grounded, the transistor has its ground connection.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/inverter-die.jpg"><img alt="An inverter on the die." class="hilite" height="408" src="https://static.righto.com/images/soviet-cmos/inverter-die-w350.jpg" title="An inverter on the die." width="350" /></a><div class="cite">An inverter on the die.</div></p>
<p>Some inverters look different from the layout above.
Many of the chip's inverters are constructed as two inverters in parallel to provide twice the output current.
This gives the inverter more "fan-out", the ability to drive the inputs of a larger number of gates.<span id="fnref:width"><a class="ref" href="#fn:width">8</a></span>
The diagram below shows a doubled inverter, which is essentially the previous inverter mirrored and copied,
with two PMOS transistors at the top and two NMOS transistors at the bottom.
Note that there is no explicit boundary between the paired transistors; their drains share the same silicon.
Consequently, each output contact is shared between two transistors, rather than being duplicated.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/double-inverter.jpg"><img alt="An inverter consisting of two inverters in parallel." class="hilite" height="238" src="https://static.righto.com/images/soviet-cmos/double-inverter-w350.jpg" title="An inverter consisting of two inverters in parallel." width="350" /></a><div class="cite">An inverter consisting of two inverters in parallel.</div></p>
<p>Another style of inverter drives the chip's output pins.
The output pins require high current to drive external circuitry. The chip uses much larger transistors
to provide this current.
Nonetheless, the output driver uses the same inverter circuit described earlier, with a PMOS transistor
to pull the output high and an NMOS transistor to pull the output low.
The photo below shows one of these output inverters on the die.
To fit the larger transistors into the available space, the transistors have a serpentine layout,
with the gate winding
between the source and the drain.
The inverter's output is connected to a bond pad.
When the die is mounted in a package, tiny bond wires connect the pads to the external pins.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/output-driver.jpg"><img alt="An output driver is an inverter, built with much larger transistors." class="hilite" height="343" src="https://static.righto.com/images/soviet-cmos/output-driver-w500.jpg" title="An output driver is an inverter, built with much larger transistors." width="500" /></a><div class="cite">An output driver is an inverter, built with much larger transistors.</div></p>
<h2>NOR and NAND gates</h2>
<p>Other logic gates are constructed using the same concepts as the inverter, but with additional transistors.
In a NOR gate,
the PMOS transistors on top are in series, so the output will be pulled high if all inputs are 0.
The NMOS transistors on the bottom are in parallel, so the output will be pulled low if any input is 1.
Thus, the circuit implements the NOR function.
Again, note the complementary action: the PMOS transistors pull the output high, while the NMOS transistors
pull the output low. Moreover, the PMOS transistors are in series, while the NMOS transistors are in parallel.
The circuit below is a 3-input NOR gate; different numbers of inputs are supported similarly.
(With just one input, the circuit becomes an inverter, as you might expect.)</p>
<p><a href="https://static.righto.com/images/soviet-cmos/nor3.png"><img alt="A 3-input NOR gate implemented in CMOS." class="hilite" height="301" src="https://static.righto.com/images/soviet-cmos/nor3-w300.png" title="A 3-input NOR gate implemented in CMOS." width="300" /></a><div class="cite">A 3-input NOR gate implemented in CMOS.</div></p>
<p>For any gate implementation, the output must be either pulled high by the PMOS side, or pulled low by the NMOS side.
If both happen simultaneously for some input, power and ground would be shorted, possibly destroying the chip. If neither
happens, the output would be floating, which is bad in a CMOS circuit.<span id="fnref:dynamic"><a class="ref" href="#fn:dynamic">9</a></span>
In the NOR gate above, you can see that for any input the output is always pulled either high or low as required.
Reverse engineering tip: if the output is not always pulled high or low, you probably made a mistake in either the PMOS circuit or
the NMOS circuit.<span id="fnref:exception"><a class="ref" href="#fn:exception">10</a></span></p>
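The NOR gate and the pulled-high-or-low sanity check can be combined into a small switch-level sketch. This is my own illustration (function names are hypothetical), assuming the series-PMOS/parallel-NMOS structure described above:

```python
from itertools import product

# Switch-level model of the 3-input CMOS NOR gate:
# series PMOS pull-up, parallel NMOS pull-down.
def nor3(a: int, b: int, c: int) -> int:
    pull_high = all(x == 0 for x in (a, b, c))  # PMOS in series: all must conduct
    pull_low = any(x == 1 for x in (a, b, c))   # NMOS in parallel: any can conduct
    # Reverse-engineering sanity check: exactly one network conducts;
    # otherwise the circuit would short (both) or float (neither).
    assert pull_high != pull_low
    return 1 if pull_high else 0

# Exhaustively verify the NOR truth table.
for a, b, c in product((0, 1), repeat=3):
    assert nor3(a, b, c) == (1 if (a, b, c) == (0, 0, 0) else 0)
```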
<p>The diagram below shows how a 3-input NOR gate appears on the die.<span id="fnref:nor3"><a class="ref" href="#fn:nor3">11</a></span>
The transistor gates are the thick vertical metal rectangles; PMOS transistors are on top and NMOS below.
The three PMOS transistors are in series between power on the left and the output connection on the right.
As with the inverter, the power and ground connections are wired to the bulk silicon, not to the chip's power and ground lines.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/nor3-die.jpg"><img alt="A 3-input NOR gate as it is implemented on the die. The &quot;extra&quot; PMOS transistor on the left is part of a different gate." class="hilite" height="399" src="https://static.righto.com/images/soviet-cmos/nor3-die-w350.jpg" title="A 3-input NOR gate as it is implemented on the die. The &quot;extra&quot; PMOS transistor on the left is part of a different gate." width="350" /></a><div class="cite">A 3-input NOR gate as it is implemented on the die. The "extra" PMOS transistor on the left is part of a different gate.</div></p>
<p>The layout of the NMOS transistors is more complicated because it is difficult to wire the transistors in parallel with just one layer of metal.
The output wire connects between the first and second transistors as well as to the third transistor.
An unusual feature is that the second and third NMOS transistors are connected to ground by
a horizontal line of doped silicon (the reddish "silicon path" indicated by the dotted line).
This silicon extends from the ground metal to the region between the two transistors.
Finally, note that the PMOS transistors are much larger than the NMOS transistors. This is both because PMOS
transistors are inherently less efficient and because transistors in series need to be lower resistance
to avoid degrading the output signal.
Reverse-engineering tip: It's often easier to recognize the transistors in series and then use that information
to determine which transistors must be in parallel.</p>
<p>A NAND gate is implemented by swapping the roles of the series and parallel transistors. That is, the PMOS transistors
are in parallel, while the NMOS transistors are in series.
For example, the schematic below shows a 4-input NAND gate.
If all inputs are 1, the NMOS transistors will pull the output low.
If any input is a 0, the corresponding PMOS transistor will pull the output high.
Thus, the circuit implements the NAND function.</p>
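<p>The complementary logic of these gates can be captured in a short behavioral model (a Python sketch of the truth-table behavior, not the transistor circuit; the function names are my own):</p>

```python
def nor(*inputs):
    # Parallel NMOS network: any high input pulls the output low.
    # Series PMOS network: all inputs must be low to pull the output high.
    return 0 if any(inputs) else 1

def nand(*inputs):
    # Series NMOS network: all inputs must be high to pull the output low.
    # Parallel PMOS network: any low input pulls the output high.
    return 0 if all(inputs) else 1
</p>

<p>Note how a series network becomes <code>all()</code> and a parallel network becomes <code>any()</code>; swapping the two networks turns NOR into NAND.</p>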
<p><a href="https://static.righto.com/images/soviet-cmos/nand4.png"><img alt="A 4-input NAND gate implemented in CMOS." class="hilite" height="348" src="https://static.righto.com/images/soviet-cmos/nand4-w280.png" title="A 4-input NAND gate implemented in CMOS." width="280" /></a><div class="cite">A 4-input NAND gate implemented in CMOS.</div></p>
<p>The diagram below shows a four-input NAND gate on the die.
In the bottom half, four NMOS transistors are in series, while in the top half, four PMOS transistors are in
parallel. (Note that the series and parallel transistors are switched compared to the NOR gate.)
As in the NOR gate, the power and ground are provided by metal connections to the bulk silicon (two connections
for the power).
The parallel PMOS circuit uses a "silicon path" (green) to connect
each transistor to the output without intersecting the metal.
In the middle, this silicon has a vertical metal line on top; this reduces the resistance of the silicon path.
The NMOS transistors are larger than the PMOS transistors in this case because the NMOS transistors are in series.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/nand4.jpg"><img alt="A four-input NAND gate as it appears on the die." class="hilite" height="389" src="https://static.righto.com/images/soviet-cmos/nand4-w500.jpg" title="A four-input NAND gate as it appears on the die." width="500" /></a><div class="cite">A four-input NAND gate as it appears on the die.</div></p>
<h2>Complex gates</h2>
<p>More complex gates such as AND-NOR (AND-OR-INVERT) can also be constructed in CMOS; these gates are commonly
used because they are no harder to build than NAND or NOR gates.
The schematic below shows an AND-NOR gate.
To understand its construction, look at the paths to ground through the NMOS transistors.
The first path is through A, B, and C. If these inputs are all high, the output is low, implementing the
AND-INVERT side of the gate. The second path is through D, which will pull the output low by itself, implementing
the OR-INVERT side of the gate.
You can verify that the PMOS transistors pull the output high in the necessary circumstances.
Observe that the D transistor is in series on the PMOS side and in parallel on the NMOS side, again showing
the complementary nature of these circuits.</p>
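<p>The pull-down and pull-up paths described above translate directly into a behavioral sketch (<code>and_nor</code> is my own name for this hypothetical model, not anything on the chip):</p>

```python
def and_nor(a, b, c, d):
    # NMOS side: a series path through A, B, C, plus a parallel path
    # through D; either conducting path pulls the output low.
    pull_down = (a and b and c) or d
    # PMOS side is the complement: A, B, C in parallel, in series with D.
    pull_up = (not a or not b or not c) and not d
    assert bool(pull_down) != bool(pull_up)  # exactly one network conducts
    return 0 if pull_down else 1
```

<p>The assertion checks the complementary-network property: for every input combination, exactly one of the two networks conducts, so the output is always driven and power never shorts to ground.</p>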
<p><a href="https://static.righto.com/images/soviet-cmos/and-nor.png"><img alt="An AND-NOR gate." class="hilite" height="296" src="https://static.righto.com/images/soviet-cmos/and-nor-w400.png" title="An AND-NOR gate." width="400" /></a><div class="cite">An AND-NOR gate.</div></p>
<p>The diagram below shows this AND-NOR gate on the die, with the four inputs
A, B, C, and D, corresponding to the schematic above.
This gate has a few tricky layout features.
The biggest surprise is that there is half of another gate (a 3-input NOR gate) in the middle of this gate.
Presumably, the designers found this arrangement efficient since the other gate also uses inputs A, B, and C. The output
of the other gate (D) is an input to the gate we're examining.
Ignoring the other gate, the AND-NOR gate has the NMOS transistors in the first column, on top of a reddish band,
and the PMOS transistors in the third column, on top of a greenish band.
Hopefully you can recognize the transistor gates, the large rectangles connected to A, B, C, and D.
Matching the schematic above, there are three NMOS transistors in series on the left, connected to A, B, and C,
as well as the D transistor providing a second path between ground and the output.
On the PMOS side, the A, B, and C transistors are in parallel, and then connected through the D transistor to
the output.
The green "silicon path" on the right provides the parallel connection from transistors A and B to transistors C and D.
Most of this path is covered by two long metal regions, reducing the resistance.
But in order to cross under wires B and C, the metal has a break where the green silicon provides the connection.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/and-nor.jpg"><img alt="An AND-NOR gate on the die." class="hilite" height="354" src="https://static.righto.com/images/soviet-cmos/and-nor-w600.jpg" title="An AND-NOR gate on the die." width="600" /></a><div class="cite">An AND-NOR gate on the die.</div></p>
<p>As with the other gates, the power is obtained by a connection to the bulk silicon, bridging the red and green regions.
If you look closely, there is a green band ("silicon path") down from the power connection and joining
the main green region between the B and C transistors, providing power to those transistors through the silicon.
The NMOS transistors, on the other hand, have ground connections at the top and bottom.
For this circuit, ground is supplied through solid metal wires at the top and the bottom, rather than a connection to the bulk silicon.</p>
<p>A few principles help when reverse-engineering logic gates.
First, because of the complementary nature of CMOS, the output must either be pulled high by the PMOS transistors
or pulled low by the NMOS transistors.
Thus, one group or the other must be activated for each possible input.
This implies that the same inputs must go to both the NMOS and PMOS transistors.
Moreover, the structures of the NMOS and PMOS circuits are complementary: where the NMOS transistors are parallel,
the PMOS transistors must be in series, and vice versa.
In the case of the AND-NOR circuit above, these principles are helpful.
For instance, you might not spot the "silicon paths", but since the PMOS half must be complementary to the NMOS
half, you know that those connections must exist.</p>
<p>Even complex gates can be reverse-engineered by breaking the NMOS transistors into series and parallel groups,
corresponding to AND and OR terms. Note that MOS transistors are inherently inverting, so a single gate will
always end with inversion. Thus, you can build an AND-OR-AND-NOR gate for instance, but you can't build
an AND gate as a single circuit.</p>
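<p>This series/parallel decomposition can be mechanized. The sketch below (my own notation, assuming the NMOS network has already been traced from the die) evaluates a nested series/parallel pull-down network and inverts the result to get the gate's output:</p>

```python
def conducts(net, inputs):
    """Does a series/parallel network of NMOS transistors conduct?"""
    kind, parts = net
    if kind == "input":          # a single transistor gated by one input
        return bool(inputs[parts])
    results = [conducts(p, inputs) for p in parts]
    return all(results) if kind == "series" else any(results)

def gate_output(nmos_net, inputs):
    # A conducting pull-down network pulls the output low; otherwise the
    # complementary PMOS network pulls it high.
    return 0 if conducts(nmos_net, inputs) else 1

# The AND-NOR gate above: (A series B series C) in parallel with D.
and_nor_net = ("parallel", [
    ("series", [("input", "A"), ("input", "B"), ("input", "C")]),
    ("input", "D"),
])
```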
<h2>Transmission gate</h2>
<p>Another key circuit is the <em>transmission gate</em>. This acts as a switch, either passing a signal through or blocking it.
The schematic below shows how a transmission gate is constructed from two transistors, an NMOS transistor and a PMOS transistor.
If the enable line is high (that is, the complemented enable on the PMOS gate is low), both transistors turn on, passing the input signal to the output.
The NMOS transistor primarily passes a low signal, while the PMOS transistor passes a high signal, so they
work together.
If the enable line is low, both transistors turn off, blocking the input signal.
The schematic symbol for a transmission gate is shown on the right.
Note that the transmission gate is bidirectional; it doesn't have a specific input and output.
Examining the surrounding circuitry usually reveals which side is the input and which side is the output.</p>
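<p>Behaviorally, a transmission gate is just a switch with a high-impedance "off" state, which a sketch can model with <code>None</code> (my convention for a floating output):</p>

```python
def transmission_gate(value, enable):
    # Enable high: both transistors conduct (the PMOS gate sees the
    # complemented, low, enable), so the input passes through.
    # Enable low: both transistors are off; the output floats.
    return value if enable else None
```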
<p><a href="https://static.righto.com/images/soviet-cmos/transmission-schematic.png"><img alt="A transmission gate is constructed from two transistors. The transistors and their gates are indicated. The schematic symbol is on the right." class="hilite" height="196" src="https://static.righto.com/images/soviet-cmos/transmission-schematic-w400.png" title="A transmission gate is constructed from two transistors. The transistors and their gates are indicated. The schematic symbol is on the right." width="400" /></a><div class="cite">A transmission gate is constructed from two transistors. The transistors and their gates are indicated. The schematic symbol is on the right.</div></p>
<p>The photo below shows how a transmission gate appears on the die.
It consists of a PMOS transistor at the top and an NMOS transistor at the bottom.
Both the enable signal and the complemented enable signal are used, one for the NMOS transistor's gate and
one for the PMOS transistor.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/transmission-die.jpg"><img alt="A transmission gate on the die, consisting of two transistors." class="hilite" height="277" src="https://static.righto.com/images/soviet-cmos/transmission-die-w300.jpg" title="A transmission gate on the die, consisting of two transistors." width="300" /></a><div class="cite">A transmission gate on the die, consisting of two transistors.</div></p>
<p>The inverter and transmission gate are both two-transistor circuits, but they can be easily distinguished
for reverse engineering.
One difference is that an inverter is connected to power and ground, while the transmission gate is unpowered.
Moreover, the inverter has one input, while the transmission gate has three inputs (counting the control lines).
In the inverter, both transistor gates have the same input, so one transistor turns on at a time.
In the transmission gate, however, the gates have opposite inputs, so the transistors turn on or off together.</p>
<p>One useful circuit that can be built from transmission gates is the multiplexer, a circuit that selects one of two (or more) inputs.
The multiplexer below selects either input <em>inA</em> or <em>inB</em> and connects it to the output, depending on whether the selection line <em>selA</em> is high or low respectively.
The multiplexer can be built from two transmission gates as shown. Note that the select lines are flipped on
the second transmission gate, so one transmission gate will be activated at a time.
Multiplexers with more inputs can be built by using more transmission gates with additional select lines.</p>
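<p>The two-transmission-gate multiplexer can be sketched the same way; because the control lines are complementary, exactly one gate drives the shared output (again a behavioral model with my own names, using <code>None</code> for a floating output):</p>

```python
def transmission_gate(value, enable):
    return value if enable else None  # None models a floating output

def mux(in_a, in_b, sel_a):
    a = transmission_gate(in_a, sel_a)      # enabled when selA is high
    b = transmission_gate(in_b, not sel_a)  # enabled when selA is low
    # The two outputs are wired together; only one gate drives at a time,
    # so there is never contention on the shared wire.
    return a if a is not None else b
```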
<p><a href="https://static.righto.com/images/soviet-cmos/mux.png"><img alt="Schematic symbol for a multiplexer and its implementation with two transmission gates." class="hilite" height="276" src="https://static.righto.com/images/soviet-cmos/mux-w300.png" title="Schematic symbol for a multiplexer and its implementation with two transmission gates." width="300" /></a><div class="cite">Schematic symbol for a multiplexer and its implementation with two transmission gates.</div></p>
<p>The die photo below shows a block of transmission gates consisting of six PMOS transistors and six NMOS transistors.
The labels on the metal lines will make more sense as the reverse engineering progresses.
Note that the metal layer provides much of the wiring for the circuit, but not all of it.
Much of the wiring is implicit, in the sense that neighboring transistors are
connected because the source of one transistor overlaps the drain of another.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/transmission-block-die.jpg"><img alt="A block of transistors implementing multiple transmission gates." class="hilite" height="327" src="https://static.righto.com/images/soviet-cmos/transmission-block-die-w600.jpg" title="A block of transistors implementing multiple transmission gates." width="600" /></a><div class="cite">A block of transistors implementing multiple transmission gates.</div></p>
<p>While this may look like an incomprehensible block of zig-zagging lines, tracing out the transistors will
reveal the circuitry (below).
The wiring in the schematic matches the physical
layout on the die, so the schematic is a bit of a mess.
With only a single layer of metal for wiring, the layout becomes convoluted to avoid crossing wires.
(The only wire crossing in this image is in the upper left for wire <em>X</em>; the signal uses a short stretch of
silicon to pass under the metal.)</p>
<p><a href="https://static.righto.com/images/soviet-cmos/transmission-block-schematic.png"><img alt="Schematic of the previous block of transistors." class="hilite" height="190" src="https://static.righto.com/images/soviet-cmos/transmission-block-schematic-w600.png" title="Schematic of the previous block of transistors." width="600" /></a><div class="cite">Schematic of the previous block of transistors.</div></p>
<p>Looking at the PMOS and NMOS transistors as pairs reveals that the circuit above is a chain of transmission gates (shown below).
It's not immediately obvious which wires are inputs and which wires are outputs, but it's a good guess that
pairs of transmission gates using the opposite control lines form a multiplexer. That is, inputs A and C
are multiplexed to output B, inputs C and E are multiplexed to output D, and so forth.
As will be seen, these transmission gates form multiplexers that are part of a flip-flop.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/mux-block-schematic.png"><img alt="The transistors form six transmission gates." class="hilite" height="117" src="https://static.righto.com/images/soviet-cmos/mux-block-schematic-w600.png" title="The transistors form six transmission gates." width="600" /></a><div class="cite">The transistors form six transmission gates.</div></p>
<h2>Latches and flip-flops</h2>
<p>Flip-flops and latches are important circuits, able to hold one bit and controlled by a clock signal.
Terminology is inconsistent, but I'll use <em>flip-flop</em> to refer to an edge-triggered device and <em>latch</em>
to refer to a level-triggered device.
That is, a flip-flop will grab its input at the moment the clock signal goes high (i.e. it uses the clock edge),
store it, and provide it as the output, called <em>Q</em> for historical reasons.
A latch, on the other hand, will take its input, store it, and output it as long as the clock is high
(i.e. it uses the clock level).
The latch is considered "transparent", since the input immediately appears on the output if the clock is high.</p>
<p>The distinction between latches and flip-flops may seem pedantic, but it is important.
Flip-flops will predictably update once
per clock cycle, while latches will keep updating as long as the clock is high.
By connecting the output of a flip-flop through an inverter back to the input, you can create a toggle flip-flop,
which will flip its state once per clock cycle, dividing the clock by two.
(This example will be important shortly.)
If you try the same thing with a transparent latch, it will oscillate: as soon as the output flips, it will
feed back to the latch input and flip again.</p>
<p>The schematic below shows how a latch can be implemented with transmission gates. When the clock is high,
the first transmission gate passes the input through to the inverters and the output.
When the clock is low, the second transmission gate creates a feedback loop for the inverters, so they
hold their value, providing the latch action.
Below, the same circuit is drawn with a multiplexer, which may be easier to understand:
either the input or the feedback is selected for the inverters.</p>
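<p>As a behavioral sketch (my own class, modeling the multiplexer view of the circuit): when the clock is high the latch is transparent, and when the clock is low the feedback path holds the stored bit:</p>

```python
class Latch:
    def __init__(self):
        self.q = 0
    def step(self, d, clock):
        # Clock high: the input transmission gate passes d (transparent).
        # Clock low: the feedback gate closes the inverter loop, holding q.
        if clock:
            self.q = d
        return self.q
```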
<p><a href="https://static.righto.com/images/soviet-cmos/latch-schematic.png"><img alt="A latch implemented from transmission gates. Below, the same circuit is shown with a multiplexer." class="hilite" height="385" src="https://static.righto.com/images/soviet-cmos/latch-schematic-w350.png" title="A latch implemented from transmission gates. Below, the same circuit is shown with a multiplexer." width="350" /></a><div class="cite">A latch implemented from transmission gates. Below, the same circuit is shown with a multiplexer.</div></p>
<p>An edge-triggered flip-flop can be created by combining two latches in a primary/secondary arrangement.
When the clock is low, the input will pass into the primary latch.
When the clock switches high, two things happen. The primary latch will hold the current value of the input.
Meanwhile, the secondary latch will start passing its input (the value from the primary latch) to its output,
and thus the flip-flop output.
The effect is that the flip-flop's output will be the value at the moment the clock goes high, and the
flip-flop is insensitive to changes at other times. (The primary latch's value can keep changing while the clock is
low, but this doesn't affect the flip-flop's output.)</p>
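<p>A behavioral sketch of the primary/secondary arrangement (my own class name; the transistor-level construction is as described above). The primary follows the input while the clock is low, and the secondary grabs the primary's held value when the clock goes high:</p>

```python
class DFlipFlop:
    def __init__(self):
        self.p = 0  # primary latch: transparent while the clock is low
        self.q = 0  # secondary latch: updated when the clock is high
    def step(self, d, clock):
        if not clock:
            self.p = d       # primary follows the input
        else:
            self.q = self.p  # secondary latches the value held at the edge
        return self.q

# Divide-by-two: feed the inverted output back to the input.
ff = DFlipFlop()
outputs = []
for _ in range(6):
    ff.step(1 - ff.q, 0)           # clock low: primary samples not-Q
    outputs.append(ff.step(0, 1))  # clock high: output toggles
# outputs == [1, 0, 1, 0, 1, 0]
```

<p>The loop demonstrates the toggle arrangement mentioned earlier: the output flips once per clock cycle, dividing the clock by two, with no oscillation because the primary is opaque while the secondary updates.</p>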
<p><a href="https://static.righto.com/images/soviet-cmos/flip-flop-schematic.png"><img alt="Two latches, combined to form a flip-flop." class="hilite" height="135" src="https://static.righto.com/images/soviet-cmos/flip-flop-schematic-w500.png" title="Two latches, combined to form a flip-flop." width="500" /></a><div class="cite">Two latches, combined to form a flip-flop.</div></p>
<p>The flip-flops in the counter chip are based on the above design, but they have two additional features.
First, the flip-flop can be loaded with a value under the control of a Preset Enable (PE) signal.
Second, the flip-flop can either hold its current value or toggle its value, under the control of a Toggle (T)
signal.
Implementing these features requires two more multiplexers in the primary latch as shown below.
The first multiplexer selects either the inverted or uninverted output to be fed back into the flip-flop, providing the selectable toggle action.
The second multiplexer is the latch's standard clocked multiplexer. The third multiplexer allows a
"preset" value to be loaded directly into the flip-flop, bypassing the clock.
(The preset value is inverted, since there are three inverters between the preset and the output.)
The secondary latch is the same as before, except it provides the inverted and non-inverted outputs as
feedback, allowing the flip-flop to either hold or toggle its value.
This circuit illustrates how more complex flip-flops can be created from the building blocks that we've seen.</p>
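<p>Putting the features together, the flip-flop's behavior (not its transistor-level layout) can be summarized in a sketch; the class and argument names are mine, following the Preset Enable (PE) and Toggle (T) signals described above:</p>

```python
class ToggleFlipFlop:
    def __init__(self):
        self.q = 0
        self.prev_clock = 0
    def step(self, clock, t, pe=0, preset=0):
        if pe:
            self.q = preset       # preset loads a value, bypassing the clock
        elif t and clock and not self.prev_clock:
            self.q = 1 - self.q   # toggle on the rising clock edge
        # with T low (and PE low), the flip-flop holds its value
        self.prev_clock = clock
        return self.q
```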
<p><a href="https://static.righto.com/images/soviet-cmos/toggle-flip-flop.png"><img alt="Schematic of the toggle flip-flop." class="hilite" height="172" src="https://static.righto.com/images/soviet-cmos/toggle-flip-flop-w600.png" title="Schematic of the toggle flip-flop." width="600" /></a><div class="cite">Schematic of the toggle flip-flop.</div></p>
<p>The gray letters in the schematic above match the earlier multiplexer diagram, showing how the
three multiplexers were implemented on the die.
The other multiplexer and the inverters are implemented in another block of circuitry.
I won't explain that circuitry in detail since it doesn't illustrate any new principles.</p>
<h2>Routing in silicon: cross-unders</h2>
<p>With just one metal layer for wiring, routing of signals on the chip is difficult and requires careful planning.
Even so, there are some cases where one signal must cross another.
This is accomplished by using silicon for a "cross-under", allowing a signal to pass underneath metal wiring.
These cross-unders are avoided unless necessary because silicon has much higher resistance than metal.
Moreover, the cross-under requires additional space on the die.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/crossunders.jpg"><img alt="Three cross-unders on the die." class="hilite" height="305" src="https://static.righto.com/images/soviet-cmos/crossunders-w500.jpg" title="Three cross-unders on the die." width="500" /></a><div class="cite">Three cross-unders on the die.</div></p>
<p>The images above show three cross-unders. In each one, signals are primarily routed in the metal layer,
but a signal passes under the metal using a doped silicon region (which appears green).
The first cross-under simply lets one signal cross under the second.
The second image shows a signal branching as well as crossing under two signals.
The third image shows a cross-under distributing a signal vertically to the upper and lower halves of the chip, while
crossing under multiple horizontal signals.
Note the small oval contact between the green silicon region and the horizontal metal line, connecting them.
It is easy to miss the small contact and think that the vertical signal is simply crossing under the horizontal
signal, rather than branching.</p>
<h2>About the chip</h2>
<p>The focus of this article is the CMOS reverse engineering process rather than this specific chip, but
I'll give a bit of information about the chip.
The die has the Cyrillic characters ИЕ11 at the top indicating that the chip is a К561ИЕ11 or К564ИЕ11.<span id="fnref:military"><a class="ref" href="#fn:military">12</a></span>
The Soviet Union came up with a <a href="https://en.wikipedia.org/wiki/Soviet_integrated_circuit_designation">standardized numbering system</a> for integrated circuits in 1968.
This system is much more helpful than the American system of semi-random part numbers.
In this part number, the 5 indicates a monolithic integrated circuit, while 61 or 64 is the <a href="https://commons.wikimedia.org/wiki/Category:564_series_integrated_circuits">series</a>, specifically commercial-grade or military-grade
clones of 4000 series CMOS logic.
The character И indicates a digital circuit, while ИЕ is a counter.
Thus, the part number systematically indicates that the integrated circuit is a CMOS counter.</p>
<p>The 561ИЕ11 turns out to be a copy of the Motorola MC14516 binary up/down counter.<span id="fnref:motorola"><a class="ref" href="#fn:motorola">13</a></span>
Conveniently, the Motorola datasheet provides a schematic (below).
I won't explain the schematic in detail, but a quick overview may be helpful.
The chip is a four-bit counter that can count up or down, and the heart of the chip is the four toggle flip-flops (red).
To count up, a flip-flop is toggled if there is a carry from the lower bits, while counting down toggles
a flip-flop if there is a borrow from the lower bits. (Much like base-10 long addition or subtraction.)
The AND/NOR gates at the bottom (blue) look complex, but they are just generating the toggle signal T:
toggle if the lower bits are all-1's and you're counting up, or if the lower bits are all-0's and you're
counting down.
The flip-flops can also be loaded in parallel from the P inputs.
Additional logic allows the chips to be cascaded to form arbitrarily large counters;
the carry-out pin of one chip is connected to the carry-in of the next.</p>
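<p>The counting rule is easy to check with a behavioral model (a sketch of the toggle logic only, not the gate-level netlist; <code>count_step</code> is my own name):</p>

```python
def count_step(count, up=True):
    """One clock of the 4-bit up/down counter, at the toggle level."""
    old = [(count >> i) & 1 for i in range(4)]
    new = []
    for i in range(4):
        lower = old[:i]
        # Toggle if all lower bits are 1 (carrying up) or all 0 (borrowing
        # down); bit 0 always toggles since it has no lower bits.
        toggle = all(lower) if up else not any(lower)
        new.append(old[i] ^ toggle)
    return sum(bit << i for i, bit in enumerate(new))
```

<p>Cascading chips corresponds to extending the all-1s (or all-0s) test across the lower chips through the carry-out pin.</p>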
<p><a href="https://static.righto.com/images/soviet-cmos/datasheet-labeled.png"><img alt="Logic diagram of the MC14516 up/down counter chip, from the datasheet." class="hilite" height="393" src="https://static.righto.com/images/soviet-cmos/datasheet-labeled-w600.png" title="Logic diagram of the MC14516 up/down counter chip, from the datasheet." width="600" /></a><div class="cite">Logic diagram of the MC14516 up/down counter chip, from the <a href="https://www.onsemi.com/download/data-sheet/pdf/mc14516b-d.pdf">datasheet</a>.</div></p>
<p>I've labeled the die photo below with the pin functions and the functional blocks.
Each quadrant of the chip handles one bit of the counter in a roughly symmetrical way.
This quadrant layout accounts for the pin arrangement which otherwise appears semi-random with bits 3 and 0
on one side and bits 2 and 1 on the other, with inputs and output pins jumbled together.
The toggle and carry logic is squeezed into the top and middle of the chip.
You may recognize the large inverters next to each output pin.
When reverse-engineering, look for large transistors next to pads to determine which pins are outputs.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/die-labeled.jpg"><img alt="The die with pins and functional blocks labeled." class="hilite" height="397" src="https://static.righto.com/images/soviet-cmos/die-labeled-w600.jpg" title="The die with pins and functional blocks labeled." width="600" /></a><div class="cite">The die with pins and functional blocks labeled.</div></p>
<h2>Conclusions</h2>
<p>This article has discussed the basic circuits that can be found in a CMOS chip.
Although the counter chip is old and simple, later chips use the same principles.
An important change in later chips is the introduction of silicon-gate transistors, which use polysilicon for the transistor gates and for
an additional wiring layer.
The circuits are the same, but you need to be able to recognize the polysilicon layer.
Many chips have more than one metal layer, which makes it very hard to figure out the wiring connections.
Finally, when the feature size approaches the wavelength of light, optical microscopes break down.
Thus, these reverse-engineering techniques are only practical up to a point.
Nonetheless, many interesting CMOS chips can be studied and reverse-engineered.</p>
<p>For more,
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Thanks to
<a href="https://siliconpr0n.org/archive/doku.php?id=mevtimov:start">Martin Evtimov</a> for providing the die photos.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:soviet">
<p>I'm not sure of the date and manufacturing location of the chip. I think the design is old, from
the Soviet Union.
(Motorola introduced the MC14516 around 1972 but I don't know when it was copied.)
The wafer is said to be scrap from a Ukrainian manufacturer so it may have been manufactured more recently.
The die has a symbol that might be a manufacturing logo, but nobody on <a href="https://twitter.com/kenshirriff/status/1741661696320766233">Twitter</a> could identify it.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/logo.png"><img alt="A symbol that appears on the die." class="hilite" height="275" src="https://static.righto.com/images/soviet-cmos/logo-w300.png" title="A symbol that appears on the die." width="300" /></a><div class="cite">A symbol that appears on the die.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:soviet" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:datasheet">
<p>For more about this chip, the Russian databook can be downloaded <a href="http://publ.lib.ru/ARCHIVES/N/NEFEDOV_Anatoliy_Vladimirovich/_Nefedov_A.V..html">here</a>; see Volume 5 page 501. <a class="footnote-backref" href="#fnref:datasheet" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:cmos">
<p>Early CMOS microprocessors include
the 8-bit <a href="https://en.wikipedia.org/wiki/RCA_1802">RCA 1802</a> COSMAC (1974) and
the 12-bit <a href="https://en.wikipedia.org/wiki/Intersil_6100">Intersil 6100</a> (1974).
The 1802 is said to be the <a href="https://spectrum.ieee.org/semiconductors/processors/chip-hall-of-fame-rca-cdp-1802">first CMOS microprocessor</a>.
Mainstream microprocessors didn't switch to CMOS until the mid-1980s. <a class="footnote-backref" href="#fnref:cmos" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:metal-gate">
<p>The chip in this article has metal-gate transistors, with aluminum forming the transistor gate.
These transistors were not as advanced as the silicon-gate transistors that were developed in the late 1960s.
Silicon gate technology was much better in several ways. First, silicon-gate transistors were smaller, faster, more reliable, and used lower voltages.
Second, silicon-gate chips have a layer of polysilicon wiring in addition to the metal wiring; this made chip layouts about twice as dense. <a class="footnote-backref" href="#fnref:metal-gate" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:doping">
<p>To produce N-type silicon, the silicon is doped with small amounts of an element such as phosphorus or arsenic.
In the periodic table, these elements are one column to the right of silicon so they have one "extra" electron.
The free electrons move through the silicon, carrying charge. Because electrons are negative, this type of
silicon is called N-type.
Conversely,
to produce P-type silicon, the silicon is doped with small quantities of an element such as boron.
Since boron is one column to the left of silicon in the periodic table, it has one fewer free electron.
A strange thing about semiconductor physics is that the missing electrons (called holes) can move around the
silicon much like electrons, but carrying positive charge.
Since the charge carriers are positive, this type of silicon is called P-type.
For various reasons, electrons carry charge better than holes, so NMOS transistors work better than PMOS transistors.
As a result, PMOS transistors need to be about twice the size of comparable NMOS transistors.
This quirk is useful for reverse engineering, since it can help distinguish NMOS and PMOS transistors.</p>
<p>The amount of doping required can be absurdly small, 20 atoms of
boron for every billion atoms of silicon in some cases.
A typical doping level for N-type silicon is 10<sup>15</sup> atoms of phosphorus or arsenic per cubic centimeter, which sounds
like a lot until you realize that pure silicon consists of 5×10<sup>22</sup> atoms per cubic centimeter.
A heavily doped P+ region might have 10<sup>20</sup> dopant atoms per cubic centimeter, one atom of boron per 500 atoms of silicon. (Doping levels are described <a href="https://users.encs.concordia.ca/~asim/COEN%20451/Lectures/W2_2_Slides.pdf#page=4">here</a>.) <a class="footnote-backref" href="#fnref:doping" title="Jump back to footnote 5 in the text">↩</a></p>
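<p>These ratios are easy to verify with a quick back-of-the-envelope check:</p>

```python
silicon_atoms = 5e22   # atoms of silicon per cubic centimeter
typical_n = 1e15       # typical N-type doping, atoms per cubic centimeter
heavy_p = 1e20         # heavily doped P+ region, atoms per cubic centimeter

print(typical_n / silicon_atoms * 1e9)  # dopant atoms per billion silicon atoms
print(silicon_atoms / heavy_p)          # silicon atoms per boron atom in P+
```

<p>The first ratio works out to 20 dopant atoms per billion silicon atoms and the second to one boron atom per 500 silicon atoms, matching the figures above.</p>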
</li>
<li id="fn:tub">
<p>This chip is built on a substrate of N-type silicon, with wells of P-type silicon for the NMOS transistors.
Chips can be built the other way around, starting with P-type silicon and putting wells of N-type silicon
for the PMOS transistors.
Another approach is the "twin-well" CMOS process, constructing wells for both NMOS and PMOS transistors. <a class="footnote-backref" href="#fnref:tub" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:tap">
<p>The bulk silicon voltage makes the boundary between a transistor and the bulk silicon act as a reverse-biased diode, so current can't flow
across the boundary.
Specifically, for a PMOS transistor, the N-silicon substrate is connected to the positive supply.
For an NMOS transistor, the P-silicon well
is connected to ground.
A P-N junction acts as a diode, with current flowing from P to N. But the substrate voltages put P at ground and N at +5, blocking any
current flow.
The result is that the bulk silicon can be considered an insulator, with current restricted to the N+ and P+ doped regions.
If this back bias gets reversed, for example, due to power supply fluctuations, current can flow through the
substrate.
This can result in "latch-up", a situation where the N and P regions act as parasitic NPN and PNP transistors
that latch into the "on" state. This shorts power and ground and can destroy the chip.
The point is that the substrate voltages are very important for proper operation of the chip. <a class="footnote-backref" href="#fnref:tap" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:width">
<p>Many inverters in this chip duplicate the transistors to increase the current output.
The same effect could be achieved with single transistors with twice the gate width. (That is,
twice the height in the diagrams.)
Because these transistors are arranged in uniform rows, doubling the transistor height would
mess up the layout, so using more transistors instead of changing the size makes sense. <a class="footnote-backref" href="#fnref:width" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:dynamic">
<p>Some chips use dynamic logic, in which case
it is okay to leave the gate floating, neither pulled high nor low.
Since the gate resistance is extremely high, the capacitance of a gate will hold its value (0 or 1) for
a short time.
After a few milliseconds, the charge will leak away, so dynamic logic must constantly refresh its signals
before they decay.</p>
<p>In general, the reason you don't want an intermediate voltage as the input to a CMOS circuit is that the voltage
might end up turning the PMOS transistor partially on while also turning the NMOS transistor partially on.
The result is high current flow from power to ground through the transistors. <a class="footnote-backref" href="#fnref:dynamic" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:exception">
<p>One of the complicated logic gates on the die didn't match the implementation I expected.
In particular, for some inputs, the output is neither pulled high nor low.
Tracing the source of these inputs reveals what is going on: the gate takes both a signal and its
complement as inputs.
Thus, some of the "theoretical" input combinations are not possible: the signal and its complement can't both be high or both be low.
The logic gate is optimized to ignore these cases, making the implementation simpler. <a class="footnote-backref" href="#fnref:exception" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:nor3">
<p>This schematic explains the physical layout of the 3-input NOR gate on the die, in case the wiring
isn't clear.
Note that the PMOS transistors are wired in series and the NMOS transistors are in parallel, even
though both types are physically arranged in rows.</p>
<p><a href="https://static.righto.com/images/soviet-cmos/nor3-schematic.png"><img alt="The 3-input NOR gate on the die. This schematic matches the physical layout." class="hilite" height="193" src="https://static.righto.com/images/soviet-cmos/nor3-schematic-w400.png" title="The 3-input NOR gate on the die. This schematic matches the physical layout." width="400" /></a><div class="cite">The 3-input NOR gate on the die. This schematic matches the physical layout.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:nor3" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:military">
<p>The commercial-grade chips and military-grade chips presumably use the same die, but are distinguished by the
level of testing. So we can't categorize the die as 561-series or 564-series. <a class="footnote-backref" href="#fnref:military" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
<li id="fn:motorola">
<p>Motorola introduced the MC14500 series in 1971 to fill holes in the CD4000 series.
For more about this series, see <a href="http://www.bitsavers.org/components/motorola/motorola_monitor_magazine/Motorola_Monitor_V10_N1_1972.pdf#page=16">A Strong Commitment to Complementary MOS</a>. <a class="footnote-backref" href="#fnref:motorola" title="Jump back to footnote 13 in the text">↩</a></p>
</li>
</ol>
</div>
<h1>Inside the mechanical Bendix Air Data Computer, part 3: pressure transducers</h1>
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js"></script>
<script>
MathJax = {
tex: {
inlineMath: [['$', '$'], ['\\(', '\\)']]
},
svg: {
fontCache: 'global'
},
chtml: { displayAlign: 'left' }
};
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
"HTML-CSS": { scale: 175}
});
</script>
<style type="text/css">
.MathJax {font-size: 1em !important}
</style>
<p>The Bendix Central Air Data Computer (CADC) is an electromechanical analog computer that uses
gears and cams for its mathematics. It was a key part of military planes such as the F-101 and the F-111 fighters,
computing
airspeed, Mach number, and other "air data".
This article reverse-engineers the two pressure transducers, on the right in the photo below.
It is part 3 of my series on the CADC.<span id="fnref:previous"><a class="ref" href="#fn:previous">1</a></span></p>
<p><a href="https://static.righto.com/images/bendix-transducer/bendix-side.jpg"><img alt="The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version." class="hilite" height="322" src="https://static.righto.com/images/bendix-transducer/bendix-side-w600.jpg" title="The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version." width="600" /></a><div class="cite">The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version.</div></p>
<p>Aircraft have determined airspeed from
air pressure for over a
<a href="https://patents.google.com/patent/US1290875A">century</a>.
A port in the side of the plane provides the static air pressure,<span id="fnref:static"><a class="ref" href="#fn:static">2</a></span> the air pressure outside the aircraft.
A pitot tube points forward and receives the "total" air pressure, a higher pressure due to the speed of the airplane
forcing air into the tube.
The airspeed can be determined from the ratio of these two pressures, while the altitude can be determined
from the static pressure.</p>
<p>But as you approach the speed of sound, the fluid dynamics of air change and the calculations become very complicated.
With the development of supersonic fighter planes in the 1950s, simple mechanical instruments were no longer
sufficient.
Instead, an analog computer calculated the "air data" (airspeed, air density, Mach number, and so forth) from the pressure measurements.
This computer then transmitted the air data electrically to the systems that needed it: instruments, weapons
targeting, engine control, and so forth.
Since the computer was centralized,
such a system was called a Central Air Data Computer or CADC, manufactured by Bendix and other companies.</p>
<p><a href="https://static.righto.com/images/bendix-transducer/gears.jpg"><img alt="A closeup of the numerous gears inside the CADC. Three differential gear mechanisms are visible." class="hilite" height="367" src="https://static.righto.com/images/bendix-transducer/gears-w600.jpg" title="A closeup of the numerous gears inside the CADC. Three differential gear mechanisms are visible." width="600" /></a><div class="cite">A closeup of the numerous gears inside the CADC. Three differential gear mechanisms are visible.</div></p>
<p>Each value in the Bendix CADC is indicated by the rotational position of a shaft.
Compact electric motors rotated the shafts, controlled by <a href="https://spectrum.ieee.org/the-vacuum-tubes-forgotten-rival">magnetic amplifier</a> servo loops.
Gears, cams, and differentials performed computations, with the results indicated by more rotations.
Devices called synchros converted the rotations to electrical outputs that controlled other aircraft systems.
The CADC is said to contain 46 synchros, 511 gears, 820 ball bearings, and a total of 2,781 major parts (but I haven't counted).
These components are crammed into a compact cylinder: 15 inches long and weighing 28.7 pounds.</p>
<p>The equations computed by the CADC are impressively complicated. For instance, one equation
computes the Mach number $M$ from the total pressure \( P_t \) and the static pressure \( P_s \):<span id="fnref:equations"><a class="ref" href="#fn:equations">3</a></span></p>
<p>\[~~~\frac{P_t}{P_s} = \frac{166.9215M^7}{(7M^2-1)^{2.5}}\]</p>
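<p>As a sanity check, this formula can be inverted numerically. The Python sketch below is purely illustrative (the CADC, of course, computed this mechanically): it finds the Mach number for a given pressure ratio by bisection, using the fact that the ratio increases monotonically with Mach number in supersonic flight.</p>

```python
# Invert the supersonic Mach relation Pt/Ps = 166.9215*M^7 / (7*M^2 - 1)^2.5
# (valid for M > 1). Illustrative only -- not how the CADC computed it.

def pressure_ratio(m):
    """Total-to-static pressure ratio for Mach number m (supersonic)."""
    return 166.9215 * m**7 / (7 * m**2 - 1)**2.5

def mach_from_ratio(ratio, lo=1.0, hi=10.0, tol=1e-9):
    """Solve pressure_ratio(m) = ratio by bisection; the ratio is monotonic in m."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pressure_ratio(mid) < ratio:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# At Mach 2, the pressure ratio is about 5.64; inverting recovers Mach 2.
print(round(mach_from_ratio(pressure_ratio(2.0)), 6))  # prints 2.0
```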
<p>It seems incredible that these functions could be computed mechanically, but three
techniques make this possible.
The fundamental mechanism is the differential gear, which adds or subtracts values.
Second, logarithms are used extensively, so multiplications and divisions become additions and subtractions performed by
a differential, while square roots are calculated by gearing down by a factor of 2.
Finally, specially-shaped cams implement functions: logarithm, exponential, and other one-variable functions.<span id="fnref:cams"><a class="ref" href="#fn:cams">4</a></span>
By combining these mechanisms, complicated functions can be computed mechanically.</p>
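<p>The logarithmic trick can be sketched in a few lines of Python, treating shaft angles as floating-point numbers (this is an analogy for the mechanism, not a model of it): encoding values as logarithms turns a differential's addition into multiplication, and a 2:1 gear reduction into a square root.</p>

```python
import math

# Analogy for the CADC's log-domain arithmetic: shaft rotations encode log(x).
def to_shaft(x):
    return math.log(x)           # value -> rotation

def differential(a, b):
    return a + b                 # a differential sums its two input rotations

def gear_down_2(a):
    return a / 2                 # a 2:1 gear reduction halves the rotation

def from_shaft(angle):
    return math.exp(angle)       # rotation -> value

product = from_shaft(differential(to_shaft(6.0), to_shaft(7.0)))  # 6 * 7
root = from_shaft(gear_down_2(to_shaft(49.0)))                    # sqrt(49)
print(round(product, 6), round(root, 6))  # prints 42.0 7.0
```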
<!-- Aerodynamics for Naval Aviators (1965) https://www.faa.gov/sites/faa.gov/files/regulations_policies/handbooks_manuals/aviation/00-80T-80.pdf -->
<h2>The pressure transducers</h2>
<p>In this article, I'm focusing on the pressure transducers and how they turn pressures into shaft rotations.
The CADC receives two pressure inputs: the total pressure \( P_t \) from the pitot tube, and
the static pressure \( P_s \) from the static pressure port.<span id="fnref:pec"><a class="ref" href="#fn:pec">5</a></span>
The CADC has two independent pressure transducer subsystems, one for total pressure and one for static pressure.
The two pressure transducers make up the right half of the CADC.
The copper pressure tube for the static pressure is visible on top of the CADC below. This tube feeds into the
black-domed pressure sensor at the right.
The gears, motors, and other mechanisms to the left of the pressure sensor domes generate shaft rotations
that are fed into the remainder of the CADC for calculations.</p>
<p><a href="https://static.righto.com/images/bendix-transducer/cadc-side.jpg"><img alt="Side view of the CADC." class="hilite" height="322" src="https://static.righto.com/images/bendix-transducer/cadc-side-w600.jpg" title="Side view of the CADC." width="600" /></a><div class="cite">Side view of the CADC.</div></p>
<p>The pressure transducer has a tricky job: it must measure tiny pressure changes, but it must also provide a
rotational signal that has enough torque to rotate all the gears in the CADC.
To accomplish this, the pressure transducer uses a servo loop that amplifies small pressure changes into accurate
rotations.
The diagram below provides an overview of the process.
The pressure input causes a small movement in the bellows diaphragm.
This produces a small shaft rotation that is detected by a sensitive inductive pickup.
This signal is amplified and drives a motor with enough power to turn the output shaft.
The motor is also geared to counteract the movement of the bellows.
The result is a feedback loop: the motor's rotation tracks the air pressure but provides much more torque.
An adjustable cam corrects for any error produced by irregularities in the diaphragm response.
This complete mechanism is implemented twice, once for each pressure input.</p>
<p><a href="https://static.righto.com/images/bendix-transducer/transducer-diagram.jpg"><img alt="This diagram shows the structure of the transducer. From &quot;Air Data Computer Mechanization.&quot;" class="hilite" height="327" src="https://static.righto.com/images/bendix-transducer/transducer-diagram-w500.jpg" title="This diagram shows the structure of the transducer. From &quot;Air Data Computer Mechanization.&quot;" width="500" /></a><div class="cite">This diagram shows the structure of the transducer. From "Air Data Computer Mechanization."</div></p>
<p>To summarize, as the pressure moves the diaphragm, the induction pick-up produces an error signal.
The motor is driven in the appropriate direction until the error signal becomes zero.
At this point, the output shaft rotation exactly matches the input pressure.
The advantage of the servo loop is that the diaphragm only needs to move the sensitive inductive pickup,
rather than driving the gears of the CADC, so the pressure reading is more accurate.</p>
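<p>The servo loop's behavior can be illustrated with a toy discrete-time model (my simplification, not the CADC's actual dynamics): the motor turns at a rate proportional to the error signal, so the output shaft converges on the input value.</p>

```python
# Toy model of the transducer's servo loop: motor speed proportional to error.
def servo_track(pressure_samples, gain=0.5):
    shaft = 0.0
    positions = []
    for p in pressure_samples:
        error = p - shaft        # inductive pickup: difference signal
        shaft += gain * error    # motor rotation driven by the amplified error
        positions.append(shaft)
    return positions

# A step change in pressure: the shaft settles at the new value.
track = servo_track([1.0] * 20)
print(round(track[0], 4), round(track[-1], 4))  # prints 0.5 1.0
```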
<p>In more detail, the process starts with connections from the aircraft's pitot tube and static pressure port
to the CADC.
The front of the CADC (below) has connections for the total pressure and the static pressure.
The CADC also has five round military connectors for electrical connections between the CADC and the rest
of the aircraft. (The outputs from the CADC are electrical, with synchros converting the shaft rotations into
electrical representations.)
Finally, a tiny time clock at the upper right keeps track of how many hours the CADC has been in operation,
so it can be maintained according to schedule.</p>
<p><a href="https://static.righto.com/images/bendix-transducer/panel.jpg"><img alt="The front panel of the CADC, showing the static pressure and total pressure connections at the bottom." class="hilite" height="502" src="https://static.righto.com/images/bendix-transducer/panel-w500.jpg" title="The front panel of the CADC, showing the static pressure and total pressure connections at the bottom." width="500" /></a><div class="cite">The front panel of the CADC, showing the static pressure and total pressure connections at the bottom.</div></p>
<p>The photo below shows the main components of the pressure transducer system.
At the upper left, the pressure line from the CADC's front panel goes to the pressure sensor,
airtight under a black dome.
The error signal from the sensor goes to the amplifier, which consists of three boards.
The amplifier's power transformer and magnetic amplifiers are the most visible components.
The amplifier drives the motors to the left. There are two motors controlled by the amplifier:
one for coarse adjustments and one for fine adjustments.
By using two motors, the CADC can respond rapidly to large pressure changes, while also accurately tracking
small pressure changes.
Finally, the output from the motor goes through the adjustable cam in the middle before providing the
feedback signal to the pressure sensor.
The output from the transducer to the rest of the CADC is a shaft on the left, but it is in the middle of
the CADC and isn't visible in the photo.</p>
<p><a href="https://static.righto.com/images/bendix-transducer/parts.jpg"><img alt="A closeup of the transducer, showing the main parts." class="hilite" height="528" src="https://static.righto.com/images/bendix-transducer/parts-w500.jpg" title="A closeup of the transducer, showing the main parts." width="500" /></a><div class="cite">A closeup of the transducer, showing the main parts.</div></p>
<h3>The pressure sensor</h3>
<p>Each pressure sensor is packaged in a black airtight dome and is fed from its associated pressure line.
Inside the sensor, two sealed metal bellows (below) expand or contract as the pressure changes.
The bellows are connected to opposite sides of a metal shaft, which rotates as the bellows expand or contract.
This shaft rotates an inductive pickup, providing the error signal.
The servo loop rotates a second shaft that counteracts the rotation of the first shaft; this shaft and
gears are also visible below.</p>
<p><a href="https://static.righto.com/images/bendix-transducer/bellows.jpg"><img alt="Inside the pressure transducer. The two disc-shaped bellows are connected to opposite sides of a shaft so the shaft rotates as the bellows expand or contract." class="hilite" height="484" src="https://static.righto.com/images/bendix-transducer/bellows-w500.jpg" title="Inside the pressure transducer. The two disc-shaped bellows are connected to opposite sides of a shaft so the shaft rotates as the bellows expand or contract." width="500" /></a><div class="cite">Inside the pressure transducer. The two disc-shaped bellows are connected to opposite sides of a shaft so the shaft rotates as the bellows expand or contract.</div></p>
<p>The end view of the sensor below shows the inductive pickup at the bottom, with colorful wires for the
input (400 Hz AC) and the output error signal.
The coil visible on the inductive pickup is an anti-backlash spring to ensure that the pickup doesn't wobble
back and forth.
The electrical pickup coil is inside the inductive pickup and isn't visible.</p>
<p><a href="https://static.righto.com/images/bendix-transducer/transducer-inside.jpg"><img alt="Inside the transducer housing, showing the bellows and inductive pickup." class="hilite" height="464" src="https://static.righto.com/images/bendix-transducer/transducer-inside-w400.jpg" title="Inside the transducer housing, showing the bellows and inductive pickup." width="400" /></a><div class="cite">Inside the transducer housing, showing the bellows and inductive pickup.</div></p>
<h2>The amplifier</h2>
<p>Each transducer feedback signal is amplified by three circuit boards centered around <a href="https://spectrum.ieee.org/the-vacuum-tubes-forgotten-rival">magnetic amplifiers</a>,
transformer-like amplifiers that were popular before high-power transistors came along.
The photo below shows how the amplifier boards are packed next to the transducers.
The boards are complex, filled with resistors, capacitors, germanium transistors, diodes, relays, and other components.</p>
<!--
![This end-on view of the CADC shows the pressure transducers, the black cylinders. Next to each pressure transducer is a complex amplifier consisting of multiple boards with transistors and other components. The magnetic amplifiers are the yellowish transformer-like components.](end-view.jpg "w500")
-->
<p><a href="https://static.righto.com/images/bendix-transducer/pressure-transducers.jpg"><img alt="The pressure transducers are the two black domes at the top. The circuit boards next to each pressure transducer are the amplifiers. The yellowish transformer-like devices with three windings are the magnetic amplifiers." class="hilite" height="535" src="https://static.righto.com/images/bendix-transducer/pressure-transducers-w450.jpg" title="The pressure transducers are the two black domes at the top. The circuit boards next to each pressure transducer are the amplifiers. The yellowish transformer-like devices with three windings are the magnetic amplifiers." width="450" /></a><div class="cite">The pressure transducers are the two black domes at the top. The circuit boards next to each pressure transducer are the amplifiers. The yellowish transformer-like devices with three windings are the magnetic amplifiers.</div></p>
<p>I reverse-engineered the boards and created the schematic below.
I'll discuss the schematic at a high level; click it for a larger version if you want to see the full circuitry.
The process starts with the inductive sensor (yellow), which provides the error input signal to the amplifier.
The first stage of the amplifier (blue) is a two-transistor amplifier and filter.
From there, the signal goes to two separate output amplifiers to drive the two motors: fine (purple) and coarse (cyan).</p>
<p><a href="https://static.righto.com/images/bendix-transducer/schematic.pdf"><img alt="Schematic of the servo amplifier, probably with a few errors. Click for a larger version." class="hilite" height="290" src="https://static.righto.com/images/bendix-transducer/schematic-labeled-w600.jpg" title="Schematic of the servo amplifier, probably with a few errors. Click for a larger version." width="600" /></a><div class="cite">Schematic of the servo amplifier, probably with a few errors. Click for a larger version.</div></p>
<p>The inductive sensor provides its error signal as a 400 Hz sine wave, with a larger signal indicating more error.
The phase of the signal is 0° or 180°, depending on the direction of the error.
In other words, the error signal is proportional to the driving AC signal in one direction and
flipped when the error is in the other direction.
This is important since it indicates which direction the motors should turn.
When the error is eliminated, the signal is zero.</p>
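<p>The idea that the signal's phase (0° or 180°) carries the sign of the error can be demonstrated with synchronous detection, sketched below. (This is an illustration of the principle; in the CADC, the phase comparison happens in the magnetic amplifier circuitry rather than by explicit multiplication.)</p>

```python
import math

# Multiply the 400 Hz error signal by the 400 Hz reference and average over
# one cycle: the result is a signed value whose sign reflects the phase.
def demodulate(amplitude, phase_deg, samples=1000):
    total = 0.0
    for n in range(samples):
        wt = 2 * math.pi * n / samples  # one full cycle of the 400 Hz wave
        ref = math.sin(wt)
        sig = amplitude * math.sin(wt + math.radians(phase_deg))
        total += sig * ref
    return 2 * total / samples

print(round(demodulate(1.0, 0), 3))    # in phase: prints 1.0
print(round(demodulate(1.0, 180), 3))  # opposite phase: prints -1.0
```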
<p>Each output amplifier consists of a transistor circuit driving two magnetic amplifiers.
Magnetic amplifiers are an old technology that can amplify AC signals, allowing the relatively weak transistor
output to control a larger AC output.
The basic idea of a magnetic amplifier is a controllable inductor.
Normally, the inductor blocks alternating current.
But applying a relatively small DC signal to a control winding causes the inductor to saturate, permitting
the flow of AC.
Since the magnetic amplifier uses a small signal to control a much larger signal, it provides amplification.</p>
<p>In the early 1900s, magnetic amplifiers were used in applications such as dimming lights.
Germany improved the technology in World War II, using magnetic amplifiers in ships, rockets, and trains.
The magnetic amplifier had a resurgence in the 1950s; the Univac Solid State computer used magnetic amplifiers
(rather than vacuum tubes or transistors) as its logic elements.
However, improvements in transistors made the magnetic amplifier obsolete except for specialized applications.
(See my IEEE Spectrum article on <a href="https://spectrum.ieee.org/the-vacuum-tubes-forgotten-rival">magnetic amplifiers</a> for more history of magnetic amplifiers.)</p>
<p>In the CADC, magnetic amplifiers control the AC power to the motors.
Two magnetic amplifiers are visible on top of the amplifier board stack, while two more are on the underside;
they are the yellow devices that
look like transformers.
(Behind the magnetic amplifiers, the power transformer is labeled "A".)</p>
<p><a href="https://static.righto.com/images/bendix-transducer/amplifier-stack.jpg"><img alt="One of the three-board amplifiers for the pressure transducer." class="hilite" height="530" src="https://static.righto.com/images/bendix-transducer/amplifier-stack-w500.jpg" title="One of the three-board amplifiers for the pressure transducer." width="500" /></a><div class="cite">One of the three-board amplifiers for the pressure transducer.</div></p>
<p>The transistor circuit generates the control signal to the magnetic amplifiers, and the output of the
magnetic amplifiers is the AC signal to the motors.
Specifically, the CADC uses two magnetic amplifiers for each motor.
One magnetic amplifier powers the motor to spin clockwise, while the other makes the motor spin counterclockwise.
The transistor circuit will pull one magnetic amplifier winding low;
the phase of the input signal controls which magnetic amplifier, and thus the motor direction.
(If the error input signal is zero, neither winding is pulled low, both magnetic amplifiers block AC, and the motor
doesn't turn.)<span id="fnref:phase"><a class="ref" href="#fn:phase">6</a></span>
The result of this is that the motor will spin in the correct direction based on the error input signal,
rotating the mechanism until the mechanical output position matches the input pressure.
The motors are "Motor / Tachometer Generator" units that also generate a voltage based on their speed.
This speed signal is fed into the transistor amplifier to provide negative feedback, limiting the motor speed as the error
becomes smaller and ensuring that the feedback loop doesn't overshoot.</p>
<p>The other servo loops in the CADC (temperature and position error correction) have one motor driver constructed from transistors and two magnetic amplifiers.
However, each pressure transducer has two motor drivers (and thus four magnetic amplifiers), one for fine adjustment and one for coarse adjustment.
This allows the servo loop to track the input pressure very closely, while also adjusting rapidly to larger
changes in pressure.
The coarse amplifier uses back-to-back diodes to block small changes; only input voltages larger than a diode drop
will pass through and energize the coarse amplifier.</p>
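<p>The fine/coarse division of labor can be summarized in a small model (with assumed values; the 0.6 V threshold stands in for the diode drop): the fine motor always sees the error, while the coarse motor only engages when the error exceeds the dead zone created by the back-to-back diodes.</p>

```python
DIODE_DROP = 0.6  # volts; hypothetical threshold from the back-to-back diodes

def motor_drives(error_volts, fine_gain=1.0, coarse_gain=10.0):
    """Return (fine, coarse) drive signals for an error voltage."""
    fine = fine_gain * error_volts
    if abs(error_volts) > DIODE_DROP:   # diodes conduct: coarse motor engages
        coarse = coarse_gain * error_volts
    else:
        coarse = 0.0                    # dead zone: fine motor only
    return fine, coarse

print(motor_drives(0.1))  # small error: prints (0.1, 0.0)
print(motor_drives(2.0))  # large error: prints (2.0, 20.0)
```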
<p>The CADC is powered by standard avionics power of 115 volts AC, 400 hertz.
Each pressure transducer amplifier has a simple power supply running off this AC, using a multi-winding power
transformer.
A center-tapped winding and full wave rectifier produces DC for the transistor amplifiers.
Other windings supply AC (controlled by the magnetic amplifiers) to power the motors, AC for the
magnetic amplifier control signals, and AC for the sensor.
The transformer ensures that the transducer circuitry is electrically isolated from other parts of the CADC
and the aircraft.
The power supply is indicated in red in the schematic above.</p>
<p>The schematic also shows test circuitry (blue).
One of the features of the CADC is that it can be set to two test configurations before flight to ensure that
the system is operating properly and is correctly calibrated.<span id="fnref:tests"><a class="ref" href="#fn:tests">7</a></span>
Two relays switch the pressure transducer to one of two test inputs, so its operation and calibration can be verified.
The test inputs are provided from an external board and a helical feedback potentiometer (<a href="https://www.beckman-foundation.org/about-foundation/inventions/helical-potentiometer-helipot/">Helipot</a>) that provides simulated sensor
input.</p>
<p>Getting the amplifiers to work was a challenge. Many of the capacitors in the CADC had deteriorated and failed, as shown below.
<a href="https://www.youtube.com/@CuriousMarc">Marc</a> went through the CADC boards and replaced the bad capacitors.
However, one of the pressure transducer boards still failed to work.
After much debugging, we discovered that one of the <em>new</em> capacitors had also failed.
Finally, after replacing that capacitor a second time, the CADC was operational.</p>
<p><a href="https://static.righto.com/images/bendix-transducer/bad-capacitors.jpg"><img alt="Some bad capacitors in the CADC. This is the servo amplifier for the temperature sensor." class="hilite" height="380" src="https://static.righto.com/images/bendix-transducer/bad-capacitors-w500.jpg" title="Some bad capacitors in the CADC. This is the servo amplifier for the temperature sensor." width="500" /></a><div class="cite">Some bad capacitors in the CADC. This is the servo amplifier for the temperature sensor.</div></p>
<h3>The mechanical feedback loop</h3>
<p>The amplifier boards energize two motors that rotate the output shaft,<span id="fnref:motors"><a class="ref" href="#fn:motors">8</a></span>
the coarse and fine motors.
The outputs from the coarse and fine motors are combined through a differential gear assembly that
sums its two input rotations.<span id="fnref:differential"><a class="ref" href="#fn:differential">9</a></span>
While the differential functions like the differential in a car, it is constructed differently, with a
<a href="https://en.wikipedia.org/wiki/Differential_(mechanical_device)#Spur-gear_design">spur-gear</a> design.
This compact arrangement of gears is about 1 cm thick and 3 cm in diameter.
The differential is mounted on a shaft along with three co-axial gears: two gears provide the inputs to the differential and the third
provides the output.
In the photo, the gears above and below the differential are the input gears.
The entire differential body rotates with the sum, connected to the output gear at the top through a concentric shaft.
The two thick gears inside the differential body are part of its mechanism.</p>
<p><a href="https://static.righto.com/images/bendix-transducer/differential-closeup.png"><img alt="A closeup of a differential mechanism." class="hilite" height="285" src="https://static.righto.com/images/bendix-transducer/differential-closeup-w300.png" title="A closeup of a differential mechanism." width="300" /></a><div class="cite">A closeup of a differential mechanism.</div></p>
<p>(Differential gear assemblies are also used as the mathematical components of the CADC, since a differential performs addition or subtraction.
Since most values in the CADC are expressed logarithmically, the differential computes multiplication and division when it
adds or subtracts its inputs.)</p>
<p>The CADC uses cams to correct for nonlinearities in the pressure sensors.
The cam consists of a warped metal plate.
As the gear rotates, a spring-loaded vertical follower moves according to the shape of the plate.
The differential gear assembly under the plate adds this value to the original input to obtain a
corrected value. (This differential implementation is different from the one described above.)
The output from the cam is fed into the pressure sensor, closing the feedback loop.</p>
<p><a href="https://static.righto.com/images/bendix-transducer/error-cam.jpg"><img alt="The corrector cam is adjusted to calibrate the output to counteract for variations in the bellows behavior." class="hilite" height="339" src="https://static.righto.com/images/bendix-transducer/error-cam-w400.jpg" title="The corrector cam is adjusted to calibrate the output to counteract for variations in the bellows behavior." width="400" /></a><div class="cite">The corrector cam is adjusted to calibrate the output to counteract for variations in the bellows behavior.</div></p>
<p>At the top, 20 screws can be rotated to adjust the shape of the cam plate and thus the correction factor.
These cams allow the CADC to be fine-tuned to maximize accuracy.
According to the spec, the required accuracy for pressure was "40 feet or 0.15 percent of attained altitude,
whichever is greater."</p>
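<p>One way to think of the corrector cam is as a calibration table: each of the 20 screws sets a correction value at one rotation angle, and the follower effectively interpolates between them. The sketch below models this idea (my analogy with hypothetical values, not the cam's actual geometry):</p>

```python
import bisect

def make_corrector(angles, corrections):
    """Return a function giving a linearly interpolated correction vs. angle."""
    def correct(angle):
        i = bisect.bisect_right(angles, angle)
        if i == 0:
            return corrections[0]
        if i == len(angles):
            return corrections[-1]
        a0, a1 = angles[i - 1], angles[i]
        c0, c1 = corrections[i - 1], corrections[i]
        return c0 + (c1 - c0) * (angle - a0) / (a1 - a0)
    return correct

# 20 "screws" evenly spaced over one revolution, all initially flat:
angles = [i * 360 / 19 for i in range(20)]
corrections = [0.0] * 20
corrections[5] = 0.8  # adjust one screw to cancel a local sensor error
cam = make_corrector(angles, corrections)
print(round(cam(angles[5]), 3))  # prints 0.8
```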
<h2>Conclusions</h2>
<!--
The Bendix CADC is an amazing example of mechanical computation. In this article, I have focused on
the pressure transducers, rather than the computation itself.
(For the mathematics, see my [previous article](https://www.righto.com/2023/10/bendix-cadc-reverse-engineering.html).)
-->
<p>The Bendix CADC was built at an interesting point in time, when computations could be done digitally or
analog, mechanically or electrically.
Because the inputs were analog and the desired outputs were analog, the decision was made to use an analog
computer for the CADC.
Moreover, transistors were available but their performance was limited.
Thus, the servo amplifiers are built from a combination of transistors and magnetic amplifiers.</p>
<p>Modern air data computers are digital but they are still larger than you might expect because
they need to handle physical pressure inputs.
While a drone can use a tiny 5mm <a href="https://www.te.com/commerce/DocumentDelivery/DDEController?Action=showdoc&DocId=Data+Sheet%7FMS5611-01BA03%7FB3%7Fpdf%7FEnglish%7FENG_DS_MS5611-01BA03_B3.pdf%7FCAT-BLPS0036">MEMS pressure sensor</a>,
air data computers for aircraft have higher requirements and typically use larger <a href="https://www.auxitrolweston.com/wp-content/uploads/2023/09/7881-VIBRATING-CYLINDER-PRESSURE-SENSOR.pdf">vibrating cylinder pressure sensors</a>.
Even so, at 45 mm long, the modern pressure sensor is dramatically smaller than the CADC's pressure transducer with its
metal-domed bellows sensor, three-board amplifier, motors, cam, and gear train.
Although the mechanical Bendix CADC seems primitive, this CADC was used by the Air Force until the 1980s.
I guess if the system worked, there was no reason to update it.</p>
<!-- https://www.aerocontact.com/public/img/aviaexpo/produits/catalogues/316/AC32-RVSM_SUPERSONIC-THOMMEN_3.pdf -->
<p>I plan to continue reverse-engineering the Bendix CADC,<span id="fnref:refs"><a class="ref" href="#fn:refs">10</a></span>
so follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Thanks to Joe for providing the CADC. Thanks to Nancy Chen for obtaining a hard-to-find document for me.
Marc Verdiell and Eric Schlaepfer are working on the CADC with me.</p>
<!-- AiResearch F-14A air data computer with solid state pressure sensor https://www.google.com/books/edition/Aircraft_Yearbook/KnVGAAAAYAAJ?hl=en&gbpv=1&pg=PA85&printsec=frontcover
F-104 lawsuit over whether CADC is essential:
https://www.google.com/books/edition/United_States_Customs_Court_Reports/7p1ypJhUYukC?hl=en&gbpv=1&dq=electronic+air+data+computer&pg=PA331&printsec=frontcover
Bendix digital air data computer https://patentimages.storage.googleapis.com/ae/55/07/f72bf49c914d2c/US3564222.pdf
https://www.google.com/books/edition/Aircraft_Yearbook/KnVGAAAAYAAJ?hl=en&gbpv=1&dq=%22digital+air+data+computer%22&pg=RA1-PA190&printsec=frontcover
https://patents.google.com/patent/US3564222A/en
-->
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:previous">
<p>My previous posts on the CADC provide an <a href="https://www.righto.com/2023/02/bendix-central-air-data-computer-cadc.html">overview</a> and reverse-engineering of the <a href="https://www.righto.com/2023/10/bendix-cadc-reverse-engineering.html">left side</a>.
Much of the background of this article is copied from the previous articles, if it looks familiar. <a class="footnote-backref" href="#fnref:previous" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:static">
<p>The static air pressure can also be provided by holes in the side of the pitot tube.
I couldn't find information indicating exactly how the planes with the CADC received static pressure. <a class="footnote-backref" href="#fnref:static" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:equations">
<p>Although the CADC's equations may seem <em>ad hoc</em>, they can be derived from fluid dynamics principles.
These equations were standardized in the 1950s by various government organizations including the National
Bureau of Standards and NACA (the precursor of NASA). <a class="footnote-backref" href="#fnref:equations" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:cams">
<p>The CADC also uses cams to implement functions such as logarithms, exponentials, and
complicated functions of one variable such as $M/\sqrt{1 + 0.2 M^2}$.
These cams have a completely different design from the corrector cams.
The function cams are fixed shape, unlike the adjustable corrector cams.
The function is encoded into the cam's shape during manufacturing, so implementing a hard-to-compute nonlinear function isn't a problem for the CADC.
The photo below shows a cam with the follower arm in front. As the cam rotates, the follower
moves in and out according to the cam's radius.
The pressure transducers do not use fixed cams, so I won't discuss them more in this article.</p>
<p><a href="https://static.righto.com/images/bendix-transducer/cam.jpg"><img alt="A cam inside the CADC implements a function." class="hilite" height="297" src="https://static.righto.com/images/bendix-transducer/cam-w400.jpg" title="A cam inside the CADC implements a function." width="400" /></a><div class="cite">A cam inside the CADC implements a function.</div></p>
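<p>For illustration, here is that function in Python: just the math that the cam's shape encodes, not a model of the mechanism (the function name is mine):</p>

```python
import math

def cam_function(mach):
    """The nonlinear function M / sqrt(1 + 0.2 M^2) that one CADC cam encodes.
    The cam's radius at each rotation angle represents the function's value,
    so the follower arm's position gives the result mechanically."""
    return mach / math.sqrt(1 + 0.2 * mach ** 2)

print(cam_function(1.0))  # 1/sqrt(1.2), approximately 0.9129
```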
<p><!-- --> <a class="footnote-backref" href="#fnref:cams" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:pec">
<p>The CADC also has an input for the "position error correction".
This input provides a correction factor because the measured static pressure may not
exactly match the real static pressure. The problem is that the static pressure is measured from a port on the
aircraft. Distortions in the airflow may cause errors in this measurement.
A separate box, the "compensator", determined the correction factor based on the angle of attack and fed it to the CADC
as a synchro signal.
The position error correction is applied in a separate section of the CADC, downstream from the transducers,
so I will ignore it for this article. <a class="footnote-backref" href="#fnref:pec" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:phase">
<p>A bit more explanation of the transistor circuit driving the magnetic amplifier.
The idea is that one magnetic amplifier or the other is selected, depending on the phase of the error
signal, causing the motor to turn counterclockwise or clockwise as needed.
To implement this, the magnetic amplifier control windings are connected to opposite phases of the 400 Hz power.
The transistor is connected to both magnetic amplifiers through diodes, so current will flow only if
the transistor pulls the winding low during the half-cycle that the winding is powered high.
Thus, depending on the phase of the transistor output, one winding or the other will be powered,
allowing that magnetic amplifier to pass AC to the motor. <a class="footnote-backref" href="#fnref:phase" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:tests">
<p>According to the specification, the CADC has simulated "low point" and "high point" test conditions.
The low point is 11,806 feet altitude, 1064 ft/sec true airspeed, Mach .994, total temperature 317.1 °K,
and density × speed of sound of 1.774 lb sec/ft<sup>3</sup>.
The high point is 50,740 feet altitude, 1917 ft/sec true airspeed, Mach 1.980, total temperature 366.6 °K,
and density × speed of sound of .338 lb sec/ft<sup>3</sup>. <a class="footnote-backref" href="#fnref:tests" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:motors">
<p>The motor part number is Bendix FV101-5A1. <a class="footnote-backref" href="#fnref:motors" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:differential">
<p>Strictly speaking, the output of the differential is the sum of the inputs divided by two.
I'm ignoring the factor of 2 because the gear ratios can easily cancel it out.
It's also arbitrary whether you think of the differential as adding or subtracting, since it depends on which rotation
direction is defined as positive. <a class="footnote-backref" href="#fnref:differential" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:refs">
<p>It was very difficult to find information about the CADC.
The official military specification is MIL-C-25653C(USAF). After searching everywhere, I was finally able to
get a copy from the Technical Reports &amp; Standards unit of the Library of Congress.
The other useful document was in an obscure conference proceedings from 1958:
"Air Data Computer Mechanization" (Hazen), Symposium on the USAF Flight Control Data Integration Program, Wright Air Dev Center US Air Force, Feb 3-4, 1958, pp 171-194. <a class="footnote-backref" href="#fnref:refs" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com3tag:blogger.com,1999:blog-6264947694886887540.post-48363352895775394052023-12-31T10:18:00.000-08:002023-12-31T10:18:58.596-08:00Interesting double-poly latches inside AMD's vintage LANCE Ethernet chip<p>I've studied a lot of chips from the 1970s and 1980s, so I usually know what to expect.
But an Ethernet chip from 1982 had something new: a strange layer of yellow wiring on the die.
After some study, I learned that the yellow wiring is a second layer of resistive polysilicon, used in
the chip's static storage cells and latches.</p>
<p><a href="https://static.righto.com/images/lance-poly/poly.jpg"><img alt="A closeup of the die of the LANCE chip. The metal has been removed to show the layers underneath." class="hilite" height="355" src="https://static.righto.com/images/lance-poly/poly-w350.jpg" title="A closeup of the die of the LANCE chip. The metal has been removed to show the layers underneath." width="350" /></a><div class="cite">A closeup of the die of the LANCE chip. The metal has been removed to show the layers underneath.</div></p>
<p>The die photo above shows a closeup of a latch circuit, with the diagonal yellow stripe in the middle.
For this photo, I removed the chip's metal layer so you can see the underlying circuitry.
The bottom layer, silicon, appears gray-purple under the microscope, with the active silicon regions slightly darker and bordered in black.
On top of the silicon, the pink regions are polysilicon, a special type of silicon.
Polysilicon has a critical role in the chip: when it crosses active silicon, polysilicon forms the gate of
a transistor. The circles are contacts between the metal layer and the underlying silicon or polysilicon.
So far, the components of the chip match most NMOS chips of that time.
But what about the bright yellow line crossing the circuit diagonally? That was new to me.
This second layer of polysilicon provides resistance. It crosses over the other layers, connecting to the silicon at each end through
a complex ring structure.</p>
<p>Why would you want high-resistance wiring in your digital chip? To understand this, let's first look at
how a bit can be stored.
An efficient way to store a bit is to connect two inverters in a loop, as shown below.
Each inverter sends the opposite value to the other inverter, so the circuit will be stable in two states,
holding one bit: a 1 or a 0.</p>
<p><a href="https://static.righto.com/images/lance-poly/inverters.jpg"><img alt="Two cross-coupled inverters can store either a 0 or a 1 bit." class="hilite" height="161" src="https://static.righto.com/images/lance-poly/inverters-w350.jpg" title="Two cross-coupled inverters can store either a 0 or a 1 bit." width="350" /></a><div class="cite">Two cross-coupled inverters can store either a 0 or a 1 bit.</div></p>
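<p>The loop's stability can be sketched as a logical model in Python (idealized, ignoring the analog behavior of real inverters):</p>

```python
def inverter(x):
    """An ideal logic inverter: 0 -> 1, 1 -> 0."""
    return 1 - x

def settle(a):
    """One pass around the cross-coupled loop: each inverter drives the other."""
    b = inverter(a)   # first inverter's output
    a = inverter(b)   # second inverter feeds back to the first
    return a, b

# Both states are stable: the loop reproduces whatever bit it holds.
print(settle(0))  # -> (0, 1): stores a 0
print(settle(1))  # -> (1, 0): stores a 1
```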
<p>But how do you store a new value into the inverter loop? There are a few techniques.
One is to use pass transistors to break the loop, allowing a new value to be stored.
In the schematic below, if the <em>hold</em> signal is activated, the transistor turns on, completing the loop.
But if <em>hold</em> is dropped and <em>load</em> is activated, a new value can be loaded from the input into the inverter loop.</p>
<p><a href="https://static.righto.com/images/lance-poly/pass-latch.png"><img alt="A latch, controlled by pass transistors." class="hilite" height="133" src="https://static.righto.com/images/lance-poly/pass-latch-w300.png" title="A latch, controlled by pass transistors." width="300" /></a><div class="cite">A latch, controlled by pass transistors.</div></p>
<p>An alternative is to use a weak inverter that produces a low-current output.
In this case, the input
signal can simply overpower the value produced by the inverter, forcing the loop into a new state.
The advantage of this circuit is that it
eliminates the "hold" transistor. However, a weak inverter turns out to be larger than a regular inverter,
negating much of the space saving.<span id="fnref:size"><a class="ref" href="#fn:size">1</a></span> (The Intel 386 processor uses this type of latch.)</p>
<p><a href="https://static.righto.com/images/lance-poly/weak-latch.png"><img alt="A latch using a weak inverter." class="hilite" height="87" src="https://static.righto.com/images/lance-poly/weak-latch-w300.png" title="A latch using a weak inverter." width="300" /></a><div class="cite">A latch using a weak inverter.</div></p>
<p>A third alternative, used in the Ethernet chip, is to use a resistor for the feedback, limiting the
current.<span id="fnref:power"><a class="ref" href="#fn:power">2</a></span>
As in the previous circuit, the input can overpower the low feedback current.
However, this circuit is more compact since it doesn't require a larger inverter.
The resistor doesn't require additional space since it can overlap the rest of the circuitry, as shown in the
photo at the top of the article.
The disadvantage is that manufacturing the die requires additional processing steps to create the resistive
polysilicon layer.</p>
<p><a href="https://static.righto.com/images/lance-poly/latch-resistor.png"><img alt="A latch using a resistor for feedback." class="hilite" height="86" src="https://static.righto.com/images/lance-poly/latch-resistor-w300.png" title="A latch using a resistor for feedback." width="300" /></a><div class="cite">A latch using a resistor for feedback.</div></p>
<p>In the Ethernet chip, this type of latch is used in many circuits. For example, shift registers are built
by connecting latches in sequence, controlled by the clock signals.
Latches are also used to create binary counters, with the latch value toggled when the lower bits produce a carry.</p>
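<p>The counter idea can be sketched as a toy model in Python, with ideal toggle latches (my illustration, not the chip's actual logic):</p>

```python
def count_up(bits):
    """Increment a binary counter built from toggle latches.
    bits[0] is the least-significant latch. The lowest latch toggles on every
    count; each higher latch toggles only when all the lower bits carry."""
    carry = True                    # the low latch always toggles
    for i in range(len(bits)):
        if carry:
            carry = bits[i] == 1    # toggling a 1 to 0 produces a carry
            bits[i] ^= 1            # toggle the latch
    return bits

counter = [0, 0, 0]
for _ in range(5):
    count_up(counter)
print(counter)  # -> [1, 0, 1], LSB first: binary 101 = 5
```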
<h2>The SRAM cell</h2>
<p>It would be overkill to create a separate polysilicon layer just for a few latches.
It turns out that the chip was constructed with AMD's "64K dynamic RAM process".
Dynamic RAM uses tiny capacitors to store data. In the late 1970s, dynamic RAM chips started using
a "double-poly" process with one layer of polysilicon to form the capacitors and a second layer of polysilicon
for transistor gates and wiring (<a href="https://www.righto.com/2020/11/reverse-engineering-classic-mk4116-16.html">details</a>).</p>
<p>The double-poly process was also useful for shrinking the size of static RAM.<span id="fnref:patent-cell"><a class="ref" href="#fn:patent-cell">3</a></span>
The Ethernet chip contains several blocks of storage buffers for various purposes. These blocks are implemented as static RAM,
including a 22×16 block, a 48×9 block, and a 16×7 block.
The photo below shows a closeup of some storage cells, showing how they are arranged in a regular grid.
The yellow lines of resistive polysilicon are visible in each cell.</p>
<p><a href="https://static.righto.com/images/lance-poly/storage.jpg"><img alt="A block of 28 storage cells in the chip. Some of the second polysilicon layer is damaged." class="hilite" height="363" src="https://static.righto.com/images/lance-poly/storage-w500.jpg" title="A block of 28 storage cells in the chip. Some of the second polysilicon layer is damaged." width="500" /></a><div class="cite">A block of 28 storage cells in the chip. Some of the second polysilicon layer is damaged.</div></p>
<p>A static RAM storage cell is roughly similar to the latch cell, with two inverters in
a loop to store each bit.
However, the storage is arranged in a grid: each row corresponds to a particular word, and each column
corresponds to one bit position within the words.
To select a word, a word select line is activated, turning on the pass transistors in that row.
Reading and writing the cell is done through a pair of bitlines; each bit has a bitline and a complemented bitline.
To read a word, the bits in the word are accessed through the bitlines.
To write a word, the new value and its complement are applied to the bitlines, forcing the inverters into
the desired state.
(The bitlines must be driven with a high-current signal that can overcome the signal from the inverters.)</p>
<p><a href="https://static.righto.com/images/lance-poly/cell-schematic.png"><img alt="Schematic of one storage cell." class="hilite" height="195" src="https://static.righto.com/images/lance-poly/cell-schematic-w300.png" title="Schematic of one storage cell." width="300" /></a><div class="cite">Schematic of one storage cell.</div></p>
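<p>Here's a toy behavioral model in Python of such a storage block, just to show the word/bitline organization (my illustration, not the chip's circuitry):</p>

```python
class SRAMBlock:
    """Toy model of a static RAM block: each row holds a word, each column a bit.
    Activating a word's select line exposes that row's cells on the bitlines."""
    def __init__(self, words, bits):
        self.cells = [[0] * bits for _ in range(words)]

    def read(self, word):
        # Word select turns on the pass transistors; the cells drive the bitlines.
        return list(self.cells[word])

    def write(self, word, value):
        # Driving the bitlines (and their complements) hard overpowers the
        # inverter loops, forcing each cell into the new state.
        self.cells[word] = list(value)

ram = SRAMBlock(words=22, bits=16)   # like the chip's 22x16 block
ram.write(3, [1, 0] * 8)
print(ram.read(3)[:4])  # -> [1, 0, 1, 0]
```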
<p>The diagram below shows the physical layout of one memory cell, consisting of two resistors and four transistors.
The black lines indicate the vertical metal wiring that was removed.
The schematic on the right corresponds to the physical arrangement of the circuit.
Each inverter is constructed from a transistor and a pull-up resistor, and the inverters are connected into a loop.
(The role of these resistors is completely different from the feedback resistors in the latch.)
The two transistors at the bottom are the pass transistors that provide access to the cell for reads or writes.</p>
<p><a href="https://static.righto.com/images/lance-poly/cell-labeled.jpg"><img alt="One static memory cell as it appears on the die, along with its schematic." class="hilite" height="408" src="https://static.righto.com/images/lance-poly/cell-labeled-w600.jpg" title="One static memory cell as it appears on the die, along with its schematic." width="600" /></a><div class="cite">One static memory cell as it appears on the die, along with its schematic.</div></p>
<p>The layout of this storage cell is highly optimized to minimize its area. Note that the yellow resistors take
almost no additional area, as they overlap other parts of the cell.
If constructed without resistors, each inverter would require an additional transistor, making the cell larger.</p>
<p>To summarize, although the double-poly process was introduced for DRAM capacitors, it can also be used for SRAM cell pull-up resistors.
Reducing the size of the SRAM cells was probably the motivation to use this process for the Ethernet chip, with
the latch feedback resistors a secondary benefit.</p>
<h2>The Am7990 LANCE Ethernet chip</h2>
<p>I'll wrap up with some background on the AMD Ethernet chip.
Ethernet was invented in 1973 at Xerox PARC and became a standard in 1980.
Ethernet was originally implemented with a board full of chips, mostly TTL.
By the early 1980s, companies such as Intel, Seeq, and AMD introduced chips to put most of the circuitry onto VLSI chips.
These chips reduced the complexity of Ethernet interface hardware, causing the price to drop from <a href="https://books.google.com/books?id=3kpkQKds0lAC&pg=PA4#v=onepage&q&f=false">$2000</a> to $1000.</p>
<p>The chip that I'm examining is AMD's Am7990 LANCE (Local Area Network Controller for Ethernet).
This chip implemented much of the functionality for Ethernet and "Cheapernet" (now known as 10BASE2 Ethernet).
The chip handles serial/parallel conversion, computes the 32-bit CRC checksum, manages collisions and backoff, and recognizes desired addresses.
The chip also provides DMA access for interfacing with a microcomputer.</p>
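<p>The frame check sequence the chip computes in hardware is the same CRC-32 that Ethernet still uses today. As an illustration, here's a bit-serial software rendition (the spirit of what serial hardware does one bit per clock, not the chip's gate-level design):</p>

```python
def crc32_ethernet(data: bytes) -> int:
    """Bit-serial CRC-32 as used for the Ethernet frame check sequence.
    The polynomial is 0x04C11DB7, processed here in reflected form
    (0xEDB88320), one bit at a time."""
    crc = 0xFFFFFFFF                    # preset the register to all ones
    for byte in data:
        crc ^= byte
        for _ in range(8):              # shift each bit through the register
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF             # final inversion

import binascii
assert crc32_ethernet(b"123456789") == binascii.crc32(b"123456789")
print(hex(crc32_ethernet(b"123456789")))  # -> 0xcbf43926, the CRC-32 check value
```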
<p>The chip doesn't handle everything, though.
It was designed to work with an Am7992 Serial Interface Adapter chip that encodes and decodes the bitstream
using Manchester encoding.
The third chip was the Am7996 transceiver that handled the low-level signaling and interfacing with the coaxial network cable, as well as
detecting collisions if two nodes transmitted at the same time.</p>
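<p>To sketch what the Serial Interface Adapter does, here's Manchester encoding in Python (a simplified logical model using the IEEE 802.3 convention, where a 0 is sent high-then-low and a 1 low-then-high; the real chip works with analog signal levels):</p>

```python
def manchester_encode(bits):
    """Manchester-encode a bit sequence (IEEE 802.3 convention): every bit
    becomes two half-bit levels with a guaranteed mid-bit transition, so the
    receiver can recover the clock from the data itself."""
    out = []
    for b in bits:
        out += [1, 0] if b == 0 else [0, 1]   # 0: high-then-low; 1: low-then-high
    return out

print(manchester_encode([1, 0, 1, 1]))  # -> [0, 1, 1, 0, 0, 1, 0, 1]
```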
<p>The LANCE chip is fairly complicated.
The die photo below shows the main functional blocks of the chip.
The chip is controlled by the large block of microcode ROM in the lower right.
The large dark rectangles are storage, implemented with the static RAM cells described above.
The chip has 48 pins, connected by tiny bond wires to the square pads around the edges of the die.</p>
<p><a href="https://static.righto.com/images/lance-poly/die-labeled.jpg"><img alt="Main functional blocks of the LANCE chip." class="hilite" height="597" src="https://static.righto.com/images/lance-poly/die-labeled-w600.jpg" title="Main functional blocks of the LANCE chip." width="600" /></a><div class="cite">Main functional blocks of the LANCE chip.</div></p>
<p>Thanks to Robert Garner for providing the AMD LANCE chip and information, thanks to a bunch of people on Twitter
for <a href="https://twitter.com/kenshirriff/status/1736444057503875516">discussion</a>, and
thanks to Bob Patel for providing the functional block labeling and other information.
For more,
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon occasionally as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:size">
<p>It may seem contradictory for a weak inverter to be larger than a regular inverter, since you'd expect
that the bigger the transistor, the stronger the signal.
It turns out, however, that creating a weak signal requires a larger transistor, due to how MOS transistors
are constructed.
The current from a transistor is proportional to the gate's width divided by the length.
Thus, to create a more powerful transistor, you increase the width.
But to create a weak transistor, you can't decrease the width because the minimum width is limited by
the manufacturing process. Thus, you need to increase the gate's length.
The result is that both stronger and weaker transistors are larger than "normal" transistors. <a class="footnote-backref" href="#fnref:size" title="Jump back to footnote 1 in the text">↩</a></p>
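<p>To make the proportionality concrete (illustrative numbers, not the actual process parameters):</p>

```python
def relative_drive(width, length):
    """An MOS transistor's current drive is proportional to gate width / length
    (to first order; constants of proportionality omitted)."""
    return width / length

minimum = 2.0  # assume some minimum feature size, in arbitrary units
normal = relative_drive(minimum, minimum)        # 1.0
strong = relative_drive(3 * minimum, minimum)    # wider gate: 3x the drive
weak = relative_drive(minimum, 3 * minimum)      # longer gate: 1/3 the drive
print(normal, strong, round(weak, 2))  # -> 1.0 3.0 0.33
```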
</li>
<li id="fn:power">
<p>You might worry that the feedback resistor will wastefully dissipate power.
However, the feedback current is essentially zero because NMOS transistor gates are effectively insulators.
Thus, the resistor only needs to pass enough current to charge or discharge the gate. <a class="footnote-backref" href="#fnref:power" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:patent-cell">
<p>An AMD patent describes the double-poly process as well as the static RAM cell; I'm not sure if this
is exactly the process used in the Ethernet chip, but I expect it is similar.
The diagram below shows the RAM cell with its two resistors.
The patent describes how the resistors and second layer of wiring are formed by a silicide/polysilicon ("inverted polycide") sandwich.
(The silicide is a low-resistance compound of tantalum and silicon or molybdenum and silicon.)
Specifically, the second layer consists of a buffer layer of polysilicon, a thicker silicide layer, and
another layer of polysilicon forming the low-resistance "sandwich".
Where resistance is desired, the bottom two layers of the "sandwich" are removed during fabrication to leave
just a layer of polysilicon. This polysilicon is then doped through implantation to give it the desired
resistance.</p>
<p><a href="https://static.righto.com/images/lance-poly/patent-cell.png"><img alt="The static RAM cell from patent 4569122, "Method of forming a low resistance quasi-buried contact"." class="hilite" height="277" src="https://static.righto.com/images/lance-poly/patent-cell-w500.png" title="The static RAM cell from patent 4569122, "Method of forming a low resistance quasi-buried contact"." width="500" /></a><div class="cite">The static RAM cell from patent 4569122, "Method of forming a low resistance quasi-buried contact".</div></p>
<p>The patent also describes using the second layer of polysilicon to provide a connection between silicon
and the main polysilicon layer.
Chips normally use a "buried contact" to connect silicon and polysilicon, but the patent describes how
putting the second layer of polysilicon on top reduces the alignment requirements for a low-resistance
contact.
I think this explains the yellow ring of polysilicon around all the silicon/polysilicon contacts in the
chip.
(These rings are visible in the die photo at the top of the article.)
Patent <a href="https://patents.google.com/patent/US4581815A">4581815</a> refines this process further.</p>
<p><!-- --> <a class="footnote-backref" href="#fnref:patent-cell" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com0tag:blogger.com,1999:blog-6264947694886887540.post-29711650909472944812023-12-19T22:33:00.000-08:002023-12-21T19:28:53.203-08:00The transparent chip inside a vintage Hewlett-Packard floppy drive<p>While repairing an eight-inch HP floppy drive, we found that the problem was a broken interface chip.
Since the chip was bad, I decapped it and took photos.
This chip is very unusual: instead of a silicon substrate, the chip is formed on
a base of sapphire, with silicon and metal wiring on top.
As a result, the chip is transparent as you can see from the gold "X" visible through the die in the photo below.</p>
<p><a href="https://static.righto.com/images/phi/transparent-die.jpg"><img alt="The PHI die as seen through an inspection microscope. Click this image (or any other) for a larger version." class="hilite" height="375" src="https://static.righto.com/images/phi/transparent-die-w500.jpg" title="The PHI die as seen through an inspection microscope. Click this image (or any other) for a larger version." width="500" /></a><div class="cite">The PHI die as seen through an inspection microscope. Click this image (or any other) for a larger version.</div></p>
<p>The chip is a custom HP chip from 1977 that provides an interface between HP's interface bus (HP-IB) and the
Z80 processor in the floppy drive controller.
HP designed this interface bus as a low-cost bus to connect test equipment, computers, and peripherals.
The chip, named PHI (Processor-to-HP-IB Interface), was used in multiple HP products.
It handles the bus protocol and buffered data between the
interface bus and a device's microprocessor.<span id="fnref:more"><a class="ref" href="#fn:more">1</a></span>
In this article, I'll take a look inside this "silicon-on-sapphire" chip, examine its metal-gate CMOS
circuitry, and explain how it works.</p>
<h2>Silicon-on-sapphire</h2>
<p>Most integrated circuits are formed on a silicon wafer.
Silicon-on-sapphire, on the other hand, starts with a sapphire substrate.
A thin layer of silicon is built up on the sapphire substrate to form the circuitry.
The silicon is N-type, and is converted to P-type where needed by ion implantation.
A metal wiring layer is created on top, forming the wiring as well as the metal-gate transistors.
The diagram below shows a cross-section of the circuitry.</p>
<p><a href="https://static.righto.com/images/phi/cross-section.jpg"><img alt="Cross-section from HP Journal, April 1977." class="hilite" height="396" src="https://static.righto.com/images/phi/cross-section-w700.jpg" title="Cross-section from HP Journal, April 1977." width="700" /></a><div class="cite">Cross-section from <a href="https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1977-04.pdf#page=5">HP Journal, April 1977</a>.</div></p>
<p>The important thing about silicon-on-sapphire is that silicon regions are separated from each other.
Since the sapphire substrate is an insulator, transistors are completely isolated, unlike a regular
integrated circuit. This reduces the capacitance between transistors, improving performance.
The insulation also prevents stray currents, protecting against latch-up and radiation.</p>
<p><a href="https://static.righto.com/images/phi/illuminated.jpg"><img alt="An HP MC2 die, illuminated from behind with fiber optics. From Hewlett-Packard Journal, April 1977." class="hilite" height="527" src="https://static.righto.com/images/phi/illuminated-w500.jpg" title="An HP MC2 die, illuminated from behind with fiber optics. From Hewlett-Packard Journal, April 1977." width="500" /></a><div class="cite">An HP MC<sup>2</sup> die, illuminated from behind with fiber optics. From <a href="http://hparchive.com/Journals/HPJ-1977-04.pdf">Hewlett-Packard Journal</a>, April 1977.</div></p>
<p>Silicon-on-sapphire integrated circuits date back to research in 1963 at Autonetics,
an innovative but now-forgotten avionics
company that produced guidance computers for the Minuteman ICBMs, among other things.
RCA developed silicon-on-sapphire integrated circuits in the 1960s and 1970s such as the <a href="http://www.cosmacelf.com/publications/data-sheets/cdp1821.pdf">CDP1821</a> silicon-on-sapphire 1K RAM.
HP used silicon-on-sapphire for multiple chips starting in 1977, such as the MC<sup>2</sup> Micro-CPU Chip.
HP also used SOS for the three-chip CPU in the <a href="https://www.cpushack.com/2016/11/26/hp-3000-series-33-16-bits-of-sapphire/">HP 3000 Amigo</a> (1978), but the system was a commercial failure.
The popularity of silicon-on-sapphire peaked in the early 1980s and
HP moved to <a href="https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1980-03.pdf#page=20">bulk silicon</a> integrated circuits for calculators such as the HP-41C.
Silicon-on-sapphire is still used in
various products, such as LEDs and RF applications, but is now mostly a niche technology.</p>
<!-- https://www.youtube.com/watch?v=p_ZoqOoW6sw&t=2s Why Silicon on Sapphire, Peregrine -->
<h2>Inside the PHI chip</h2>
<p>HP used an unusual package for the PHI chip. The chip is mounted on a ceramic substrate, protected by a ceramic cap.
The package has 48 gold fingers that
press into a socket. The chip is held into the socket by two metal spring clips.</p>
<p><a href="https://static.righto.com/images/phi/phi-package.jpg"><img alt="Package of the PHI chip, showing the underside. The package is flipped over when mounted in a socket." class="hilite" height="283" src="https://static.righto.com/images/phi/phi-package-w400.jpg" title="Package of the PHI chip, showing the underside. The package is flipped over when mounted in a socket." width="400" /></a><div class="cite">Package of the PHI chip, showing the underside. The package is flipped over when mounted in a socket.</div></p>
<p>Decapping the chip was straightforward, but more dramatic than I expected.
The chip's cap is attached with adhesive, which can be softened by heating.
Hot air wasn't sufficient, so we used a hot plate.
Eric tested the adhesive by poking it with an X-Acto knife, causing the cap to suddenly fly off with a loud "pop",
sending the blade flying through the air.
I was happy to be wearing safety glasses.</p>
<p><a href="https://static.righto.com/images/phi/decap.jpg"><img alt="Decapping the chip with a hotplate and hot air." class="hilite" height="300" src="https://static.righto.com/images/phi/decap-w400.jpg" title="Decapping the chip with a hotplate and hot air." width="400" /></a><div class="cite">Decapping the chip with a hotplate and hot air.</div></p>
<p>After decapping the chip, I created the high-resolution die photo below.
The metal layer is clearly visible as white lines, while the silicon is grayish and the sapphire appears purple.
Around the edge of the die, bond wires connect the chip's 48 external connections to the die.
Slightly upper left of center, a large regular rectangular block of circuitry provides 160 bits of storage: this
is two 8-word FIFO buffers, passing 10-bit words between the interface bus and a connected microprocessor.
The thick metal traces around the edges provide +12 volts, +5 volts, and ground to the chip.</p>
<p><a href="https://static.righto.com/images/phi/die.jpg"><img alt="Die photo of the PHI chip, created by stitching together microscope photos. Click for a much larger image." class="hilite" height="493" src="https://static.righto.com/images/phi/die-w600.jpg" title="Die photo of the PHI chip, created by stitching together microscope photos. Click for a much larger image." width="600" /></a><div class="cite">Die photo of the PHI chip, created by stitching together microscope photos. Click for a much larger image.</div></p>
<h2>Logic gates</h2>
<p>Circuitry on this chip has an unusual appearance due to the silicon-on-sapphire implementation as well as the
use of metal-gate transistors, but fundamentally the circuits are standard CMOS.
The photo below shows a block that implements an inverter and a NAND gate.
The sapphire substrate appears as dark purple. On top of this, the thick gray lines are the silicon.
The white metal on top connects the transistors. The metal can also form the gates of transistors when it
crosses silicon (indicated by letters).
Inconveniently, metal that contacts silicon, metal that crosses over silicon, and metal that forms a transistor
all appear very similar in this chip.
This makes it more difficult to determine the wiring.</p>
<p><a href="https://static.righto.com/images/phi/gate-diagram.jpg"><img alt="This diagram shows an inverter and a NAND gate on the die." class="hilite" height="363" src="https://static.righto.com/images/phi/gate-diagram-w300.jpg" title="This diagram shows an inverter and a NAND gate on the die." width="300" /></a><div class="cite">This diagram shows an inverter and a NAND gate on the die.</div></p>
<p>The schematic below shows how the gates are implemented, matching the photo above.
The metal lines at the top and bottom provide the power and ground rails respectively.
The inverter is formed from NMOS transistor A and PMOS transistor B; the output goes to transistors D and F.
The NAND gate is formed by NMOS transistors E and F in conjunction with PMOS transistors C and D.
The components of the NAND gate are joined at the square of metal, and then the output leaves through
silicon on the right.
Note that signals can only cross when one signal is in the silicon layer and one is in the metal layer.
With only two wiring layers, signals in the PHI chip must often meander to avoid crossings, wasting a lot
of space.
(This wiring is much more constrained than typical chips of the 1970s that also had a polysilicon layer, providing
three wiring layers in total.)</p>
<p><a href="https://static.righto.com/images/phi/gate-schematic.png"><img alt="This schematic shows how the inverter and a NAND gate are implemented." class="hilite" height="293" src="https://static.righto.com/images/phi/gate-schematic-w400.png" title="This schematic shows how the inverter and a NAND gate are implemented." width="400" /></a><div class="cite">This schematic shows how the inverter and a NAND gate are implemented.</div></p>
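<p>The complementary pull-up/pull-down structure of the NAND gate can be captured as a logical model in Python (my illustration; the letters refer to the transistors in the diagram above):</p>

```python
def cmos_nand(a, b):
    """Logical model of the CMOS NAND gate on the die: the series NMOS
    transistors (E, F) pull the output low only when both inputs are 1;
    the parallel PMOS transistors (C, D) pull it high when either input is 0."""
    pull_down = bool(a and b)            # both NMOS transistors conduct
    pull_up = (not a) or (not b)         # either PMOS transistor conducts
    assert pull_down != pull_up          # exactly one network drives the output
    return 0 if pull_down else 1

print([cmos_nand(a, b) for a in (0, 1) for b in (0, 1)])  # -> [1, 1, 1, 0]
```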
<h2>The FIFOs</h2>
<p>The PHI chip has two first-in-first-out buffers (FIFOs) that occupy a substantial part of the die.
Each FIFO holds 8 words of 10 bits, with one FIFO for data being read from the bus and the other for data
written to the bus.
These buffers help match the bus speed to the microprocessor speed, ensuring that data transmission is
as fast as possible.</p>
<p>Each bit of the FIFO is essentially a static RAM cell, as shown below.
Inverters A and B form a loop to hold a bit. Pass transistor C provides feedback so the inverter loop remains
stable.
To write a word, 10 bits are fed through vertical bit-in lines. A horizontal word write signal is activated
to select the word to update. This disables transistor C and turns on transistor D, allowing the new bit to
flow into the inverter loop.
To read a word, a horizontal word read line is activated, turning on pass transistor F.
This allows the bit in the cell to flow onto the vertical bit-out line, buffered by inverter E.
The two FIFOs have separate lines so they can be read and written independently.</p>
<p><a href="https://static.righto.com/images/phi/fifo-schematic.png"><img alt="One cell of the FIFO." class="hilite" height="217" src="https://static.righto.com/images/phi/fifo-schematic-w500.png" title="One cell of the FIFO." width="500" /></a><div class="cite">One cell of the FIFO.</div></p>
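<p>To make the read and write behavior concrete, here is a minimal Python model of an 8-word × 10-bit FIFO like the ones on the PHI die. This is my own sketch, not anything from HP: the word-write and word-read lines are reduced to pointer updates, and the per-bit cell details are abstracted away.</p>

```python
class Fifo:
    """Minimal model of one of the PHI chip's 8-word x 10-bit FIFOs."""
    DEPTH, WIDTH = 8, 10

    def __init__(self):
        self.cells = [0] * self.DEPTH  # each entry models one 10-bit word of cells
        self.read_ptr = 0
        self.write_ptr = 0
        self.count = 0

    def write(self, word):
        """Activating a word-write line latches the bit-in lines into one row."""
        assert 0 <= word < (1 << self.WIDTH), "words are 10 bits"
        assert self.count < self.DEPTH, "FIFO full"
        self.cells[self.write_ptr] = word
        self.write_ptr = (self.write_ptr + 1) % self.DEPTH
        self.count += 1

    def read(self):
        """Activating a word-read line drives one row onto the bit-out lines."""
        assert self.count > 0, "FIFO empty"
        word = self.cells[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.DEPTH
        self.count -= 1
        return word
```

<p>In the chip, the "pointers" correspond to the control logic on either side of the FIFO selecting which word line to activate; since the two FIFOs have separate read and write lines, two of these structures operate independently.</p>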
<p>The diagram below shows nine FIFO cells as they appear on the die. The red box indicates one cell, with
components labeled to match the schematic.
Cells are mirrored vertically and horizontally to increase the layout density.</p>
<p><a href="https://static.righto.com/images/phi/fifo-die.jpg"><img alt="Nine FIFO cells as they appear on the die." class="hilite" height="312" src="https://static.righto.com/images/phi/fifo-die-w600.jpg" title="Nine FIFO cells as they appear on the die." width="600" /></a><div class="cite">Nine FIFO cells as they appear on the die.</div></p>
<p>Control logic (not shown) to the left and right of the FIFOs manages the FIFOs.
This logic generates the appropriate read and write signals so data is
written to one end of the FIFO and read from the other end.</p>
<h2>The address decoder</h2>
<p>Another interesting circuit is the decoder that selects a particular register based on the address lines.
The PHI chip has eight registers, selected by three address lines. The decoder takes the address lines and
generates 16 control lines
(more or less), one to read from each register, and one to write to each register.</p>
<p><a href="https://static.righto.com/images/phi/decoder.jpg"><img alt="A die photo of the address decoder." class="hilite" height="440" src="https://static.righto.com/images/phi/decoder-w600.jpg" title="A die photo of the address decoder." width="600" /></a><div class="cite">A die photo of the address decoder.</div></p>
<p>The decoder has a regular matrix structure for efficient implementation.
Row lines are in pairs, with a line for each address bit input and its complement.
Each column corresponds to one output, with the transistors arranged so the column will be activated when
given the appropriate inputs.
At the top and bottom are inverters. These latch the incoming address bits, generate the complements, and
buffer the outputs.</p>
<p><a href="https://static.righto.com/images/phi/decoder-schematic.png"><img alt="Schematic of the decoder." class="hilite" height="501" src="https://static.righto.com/images/phi/decoder-schematic-w350.png" title="Schematic of the decoder." width="350" /></a><div class="cite">Schematic of the decoder.</div></p>
<p>The schematic above shows how the decoder operates. (I've simplified it to two inputs and two outputs.)
At the top, the address line goes through a latch formed from two inverters and a pass transistor.
The address line and its complement form two row lines; the other row lines are similar.
Each column has a transistor on one row line and a diode on the other, selecting the address for that column.
For instance, suppose a<sub>0</sub> is 1 and a<sub>n</sub> is 0.
This matches the first column since the transistor lines are low and the diode lines are high.
The PMOS transistors in the column will all turn on, pulling the input to the inverter high.
However, if any of the inputs are "wrong", the corresponding transistor will turn off, blocking the +12 volts.
Moreover, the output will be pulled low through the corresponding diode.
Thus, each column will be pulled high only if all the inputs match, and otherwise it will be pulled low.
Each column output controls one of the chip's registers, allowing that register to be accessed.</p>
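<p>The decoder's logic function can be summarized in a few lines of Python. This is just an illustrative model (the function and line numbering are mine, not HP's): three address bits plus a read/write strobe select one of the 16 one-hot control lines.</p>

```python
def decode(address, read):
    """Model of the PHI register decoder: 3 address bits plus a read/write
    strobe activate exactly one of 16 control lines.
    Lines 0-7 read registers 0-7; lines 8-15 write them."""
    assert 0 <= address < 8
    lines = [0] * 16
    lines[address + (0 if read else 8)] = 1
    return lines
```

<p>In the matrix implementation, each output column plays the role of one entry of <code>lines</code>, pulled high only when every row line matches its address pattern.</p>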
<h2>The HP-IB bus and the PHI chip</h2>
<p>The Hewlett-Packard Interface Bus (HP-IB) was designed in the early 1970s as
a low-cost bus for connecting diverse devices including instrument systems (such as a digital voltmeter or frequency counter), storage, and computers.
This bus became an IEEE standard in 1975, known as the IEEE-488 bus.<span id="fnref:pet"><a class="ref" href="#fn:pet">2</a></span>
The bus carries 8 bits in parallel, with handshaking between devices so slow devices can control the speed.</p>
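<p>As a rough sketch of how the handshake paces the bus, the Python model below illustrates the key property: because the handshake lines are wired-OR across all listeners, every byte transfer takes as long as the slowest listener needs. (The model is purely schematic; the real IEEE-488 handshake uses the open-collector DAV, NRFD, and NDAC lines.)</p>

```python
class Listener:
    """One bus listener; 'delay' models how slowly the device accepts data."""
    def __init__(self, delay):
        self.delay = delay
        self.received = []

    def accept(self, byte):
        self.received.append(byte)

def send(byte, listeners):
    """Talker-side handshake, schematically:
    1. Wait until every listener releases NRFD (Not Ready For Data).
    2. Assert DAV (Data Valid); each listener latches the byte.
    3. Release DAV once every listener releases NDAC (Not Data Accepted).
    The wired-OR wait means each transfer is paced by the slowest listener."""
    cycle_time = max(l.delay for l in listeners)
    for l in listeners:
        l.accept(byte)
    return cycle_time
```

<p>With one fast and one slow device on the bus, the transfer time is set entirely by the slow device, which is why a single slow peripheral could drag down the whole bus.</p>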
<p>In 1977, HP developed a chip, known as PHI (Processor to HP-IB Interface), to implement the bus protocol and
provide a microprocessor interface.
This chip not only simplified construction of a bus controller but also ensured that devices implemented
the protocol consistently.
The block diagram below shows the components of the PHI chip. It's not an especially complex chip, but it
isn't trivial either. I estimate that it has several thousand transistors.</p>
<p><a href="https://static.righto.com/images/phi/block-diagram.jpg"><img alt="Block diagram from HP Journal, July 1978." class="hilite" height="305" src="https://static.righto.com/images/phi/block-diagram-w700.jpg" title="Block diagram from HP Journal, July 1978." width="700" /></a><div class="cite">Block diagram from <a href="https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1978-07.pdf#page=16">HP Journal</a>, July 1978.</div></p>
<p>The die photo below shows some of the functional blocks of the PHI chip.
The microprocessor connected to the top pins, while the interface bus connected to the lower pins.</p>
<p><a href="https://static.righto.com/images/phi/die-labeled.jpg"><img alt="The PHI die with some functional blocks labeled." class="hilite" height="566" src="https://static.righto.com/images/phi/die-labeled-w700.jpg" title="The PHI die with some functional blocks labeled." width="700" /></a><div class="cite">The PHI die with some functional blocks labeled.</div></p>
<h2>Conclusions</h2>
<p><a href="https://static.righto.com/images/phi/top.jpg"><img alt="Top of the PHI chip, with the 1AA6-6004 part number. I'm not sure what the oval stamp at the top is, maybe a turtle?" class="hilite" height="304" src="https://static.righto.com/images/phi/top-w350.jpg" title="Top of the PHI chip, with the 1AA6-6004 part number. I'm not sure what the oval stamp at the top is, maybe a turtle?" width="350" /></a><div class="cite">Top of the PHI chip, with the 1AA6-6004 part number. I'm not sure what the oval stamp at the top is, maybe a turtle?</div></p>
<p>The PHI chip is interesting as an example of a "technology of the future" that didn't quite pan out.
HP put a lot of effort into silicon-on-sapphire chips, expecting that this would become an important
technology: dense, fast, and low power.
However, regular silicon chips turned out to be the winning technology and silicon-on-sapphire was relegated
to niche markets.</p>
<p>Comparing HP's silicon-on-sapphire chips to regular silicon chips at the time shows some advantages and disadvantages.
HP's MC<sup>2</sup> 16-bit processor (1977) used
silicon-on-sapphire technology, had 10,000 transistors, ran at 8 megahertz, and used 350 mW.
In comparison, the Intel 8086 (1978) was also a 16-bit processor, but implemented on regular silicon and using NMOS instead of CMOS. The 8086 had 29,000 transistors, ran at 5 megahertz (at first) and
used up to 2.5 watts. The sizes of the chips were almost identical: 34 mm<sup>2</sup> for the HP processor and
33 mm<sup>2</sup> for the Intel processor.
This illustrates that CMOS uses much less power than NMOS, one of the reasons that CMOS is now the dominant technology.
For the other factors, silicon-on-sapphire had a bit of a speed advantage but wasn't as dense.
Silicon-on-sapphire's main problem was its low yield and high cost.
Crystal incompatibilities between silicon and sapphire made manufacturing difficult; HP achieved a yield of 9%,
meaning 91% of the dies failed.</p>
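<p>A back-of-the-envelope calculation shows what a 9% yield means for cost. The wafer cost and gross die count below are made-up numbers purely for illustration; only the yield figure comes from the text.</p>

```python
# Effect of die yield on per-chip cost. These inputs are hypothetical;
# the article gives only the 9% yield figure.
wafer_cost = 1000.0    # hypothetical cost to process one wafer
dies_per_wafer = 100   # hypothetical gross die count per wafer

def cost_per_good_die(yield_fraction):
    """Each good die must carry the cost of all the failed dies on the wafer."""
    return wafer_cost / (dies_per_wafer * yield_fraction)

sos = cost_per_good_die(0.09)   # at 9% yield, each good die bears the cost of ~11 dies
bulk = cost_per_good_die(0.80)  # versus a healthier 80% yield
```

<p>With these illustrative numbers, the 9% yield makes each good die roughly nine times as expensive as it would be at 80% yield, which is why silicon-on-sapphire struggled to compete on price.</p>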
<p>The time period of the PHI chip is also interesting since interface buses were transitioning from
straightforward buses to high-performance buses with complex protocols.
Early buses could be implemented with simple integrated circuits, but as protocols became more complex,
custom interface chips became necessary.
(The <a href="https://en.wikipedia.org/wiki/MOS_Technology_6522">MOS 6522 Versatile Interface Adapter</a> chip (1977)
is another example, used in many home computers of the 1980s.)
But these interfaces were still simple enough that the interface chips didn't require microcontrollers,
using simple state machines instead.</p>
<p><a href="https://static.righto.com/images/phi/logo.jpg"><img alt="The HP logo on the die of the PHI chip." class="hilite" height="211" src="https://static.righto.com/images/phi/logo-w250.jpg" title="The HP logo on the die of the PHI chip." width="250" /></a><div class="cite">The HP logo on the die of the PHI chip.</div></p>
<p>For more,
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon occasionally as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Thanks to <a href="https://www.youtube.com/@CuriousMarc/featured">CuriousMarc</a> for providing the chip and to <a href="https://twitter.com/TubeTimeUS">TubeTimeUS</a> for help with decapping.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:more">
<p>More information:
The article <a href="https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=3383eb0297213d217c88723662c10d263b186a7b">What is Silicon-on-Sapphire</a> discusses the history and construction.
Details on the HP-IB bus are <a href="https://www.hp9845.net/9845/tutorials/hpib/">here</a>.
The <a href="http://www.hpmuseum.net/collection_document.php">HP 12009A HP-IB Interface Reference Manual</a> has information on the PHI chip and the protocol.
See also the PHI article from <a href="https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1978-07.pdf#page=16">HP Journal</a>, July 1978.
<a href="https://www.instagram.com/evilmonkeyzdesignz/reel/Cf6tJeWl-P4/">EvilMonkeyDesignz</a> also shows
a decapped PHI chip. <a class="footnote-backref" href="#fnref:more" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:pet">
<p>People with Commodore PET computers may recognize the IEEE-488 bus since peripherals such
as floppy disk drives were connected using the IEEE-488 bus.
The cables were generally expensive and harder to obtain than interface cables used by other computers.
The devices were also slow compared to other computers, although I think this was due to the hardware, not
the bus. <a class="footnote-backref" href="#fnref:pet" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com11tag:blogger.com,1999:blog-6264947694886887540.post-86473152083224410052023-12-15T22:36:00.000-08:002023-12-17T10:02:53.807-08:00Two interesting XOR circuits inside the Intel 386 processor<p>Intel's 386 processor (1985) was an important advance in the x86 architecture, not only moving to a 32-bit processor but also switching to a CMOS implementation.
I've been reverse-engineering parts of the 386 chip and came across two interesting and completely different
circuits that the 386 uses to implement an XOR gate: one uses standard-cell logic while the other uses pass-transistor logic.
In this article, I take a look at those circuits.</p>
<p><a href="https://static.righto.com/images/386-xor/386-die-labeled.jpg"><img alt="The die of the 386. Click this image (or any other) for a larger version." class="hilite" height="640" src="https://static.righto.com/images/386-xor/386-die-labeled-w600.jpg" title="The die of the 386. Click this image (or any other) for a larger version." width="600" /></a><div class="cite">The die of the 386. Click this image (or any other) for a larger version.</div></p>
<p>The die photo above shows the two metal layers of the 386 die. The polysilicon and silicon layers underneath
are mostly hidden by the metal.
The black dots around the edges are the bond wires connecting the die to the external pins.
The 386 is a complicated chip with 285,000 transistor sites. I've labeled the main functional blocks.
The datapath in the lower left does the actual computations, controlled by the microcode ROM in the lower right.</p>
<p>Despite the complexity of the 386, if you zoom in enough, you can see individual XOR gates.
The red rectangle at the top (below) is a shift register for the chip's self-test. Zooming in again shows the silicon for an XOR gate
implemented with pass transistors.
The purple outlines reveal active silicon regions, while the stripes are transistor gates.
The yellow rectangle zooms in on part of the standard-cell logic that controls the prefetch queue.
The closeup shows the silicon for an XOR gate implemented with two logic gates.
Counting the stripes shows that the first XOR gate is implemented with 8 transistors while the second uses 10 transistors. I'll explain below how these transistors are connected to form the XOR gates.</p>
<p><a href="https://static.righto.com/images/386-xor/die-zoom.jpg"><img alt="The die of the 386, zooming in on two XOR gates." class="hilite" height="489" src="https://static.righto.com/images/386-xor/die-zoom-w600.jpg" title="The die of the 386, zooming in on two XOR gates." width="600" /></a><div class="cite">The die of the 386, zooming in on two XOR gates.</div></p>
<h2>A brief introduction to CMOS</h2>
<p>CMOS circuits are used in almost all modern processors.
These circuits are built from two types of transistors: NMOS and PMOS.
These transistors can be viewed as switches between the <em>source</em> and <em>drain</em> controlled by the <em>gate</em>.
A high voltage on the gate of an NMOS transistor turns the transistor on, while a low voltage on the gate of
a PMOS transistor turns the transistor on.
An NMOS transistor is good at pulling the output low, while a PMOS transistor is good at pulling the output high.
Thus, NMOS and PMOS transistors are opposites in many ways; they are <em>complementary</em>, which is the "C" in CMOS.</p>
<p><a href="https://static.righto.com/images/386-xor/mosfet.jpg"><img alt="Structure of a MOS transistor. Although the transistor's name represents the Metal-Oxide-Semiconductor layers, modern MOS transistors typically use polysilicon instead of metal for the gate." class="hilite" height="207" src="https://static.righto.com/images/386-xor/mosfet-w400.jpg" title="Structure of a MOS transistor. Although the transistor's name represents the Metal-Oxide-Semiconductor layers, modern MOS transistors typically use polysilicon instead of metal for the gate." width="400" /></a><div class="cite">Structure of a MOS transistor. Although the transistor's name represents the Metal-Oxide-Semiconductor layers, modern MOS transistors typically use polysilicon instead of metal for the gate.</div></p>
<p>In a CMOS circuit, the NMOS and PMOS transistors work together, with the NMOS transistors pulling the output low
as needed while the PMOS transistors pull the output high.
By arranging the transistors in different ways, different logic gates can be created.
The diagram below shows a NAND gate constructed from two PMOS transistors (top) and two NMOS transistors (bottom).
If both inputs are high, the NMOS transistors turn on and pull the output low. But if either input is low, a PMOS transistor
will pull the output high. Thus, the circuit below implements a NAND gate.</p>
<p><a href="https://static.righto.com/images/386-xor/nand.png"><img alt="A NAND gate implemented in CMOS." class="hilite" height="177" src="https://static.righto.com/images/386-xor/nand-w160.png" title="A NAND gate implemented in CMOS." width="160" /></a><div class="cite">A NAND gate implemented in CMOS.</div></p>
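<p>This pull-up/pull-down behavior can be captured in a switch-level model. The short Python sketch below (an illustration of the idea, not production code) treats each transistor as a switch and checks that exactly one of the two complementary networks drives the output:</p>

```python
def nmos(gate):
    return gate == 1   # an NMOS transistor conducts when its gate is high

def pmos(gate):
    return gate == 0   # a PMOS transistor conducts when its gate is low

def nand(a, b):
    """Switch-level model of the CMOS NAND gate: two series NMOS transistors
    pull the output low; two parallel PMOS transistors pull it high."""
    pull_down = nmos(a) and nmos(b)  # series path: both must conduct
    pull_up = pmos(a) or pmos(b)     # parallel paths: either one suffices
    assert pull_down != pull_up      # complementary: exactly one network drives
    return 0 if pull_down else 1
```

<p>The assertion makes the "complementary" property explicit: for every input combination, exactly one network conducts, so the output is always driven and no current flows from power to ground at rest.</p>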
<p>Notice that NMOS and PMOS transistors have an inherent inversion: a high input produces a low (for NMOS) or a low input produces a high (for PMOS).
Thus, it is straightforward to produce logic circuits such as an inverter, NAND gate, NOR gate,
or an AND-OR-INVERT gate.
However, producing an XOR (exclusive-or) gate doesn't work with this approach:
an XOR gate produces a 1 if either input is high, but not both.<span id="fnref:and"><a class="ref" href="#fn:and">1</a></span>
The XNOR (exclusive-NOR) gate, the complement of XOR, also has this problem.
As a result, chips often have creative implementations of XOR gates.</p>
<h2>The standard-cell two-gate XOR circuit</h2>
<p>Parts of the 386 were implemented with standard-cell logic.
The idea of standard-cell logic is to build circuitry out of standardized building blocks that can be wired
by a computer program.
In earlier processors such as the 8086, each transistor was carefully positioned by hand to create a chip layout
that was as dense as possible.
This was a tedious, error-prone process since the transistors were fit together like puzzle pieces.
Standard-cell logic is more like building with LEGO. Each gate is implemented as a standardized block and the blocks are arranged
in rows, as shown below.
The space between the rows holds the wiring that connects the blocks.</p>
<p><a href="https://static.righto.com/images/386-xor/standard-cell.jpg"><img alt="Some rows of standard-cell logic in the 386 processor. This is part of the segment descriptor control circuitry." class="hilite" height="410" src="https://static.righto.com/images/386-xor/standard-cell-w500.jpg" title="Some rows of standard-cell logic in the 386 processor. This is part of the segment descriptor control circuitry." width="500" /></a><div class="cite">Some rows of standard-cell logic in the 386 processor. This is part of the segment descriptor control circuitry.</div></p>
<p>The advantage of standard-cell logic is that it is much faster to create a design since the process can be
automated.
The engineer described the circuit in terms of the logic gates and their connections.
A computer algorithm placed the blocks so related blocks are near each other.
An algorithm then routed the circuit, creating the wiring between the blocks.
These "place and route" algorithms are challenging since determining the best locations for the blocks and
packing the wiring as densely as possible is an extremely difficult optimization problem.
At the time, the algorithm took a day on a powerful IBM mainframe to compute the layout.
Nonetheless, the automated process was much faster than manual layout, cutting weeks off the development
time for the 386.
The downside is that the automated layout is less dense than manually optimized layout, with a lot more wasted
space.
(As you can see in the photo above, the density is low in the wiring channels.)
For this reason, the 386 used manual layout for circuits where a dense layout was important, such as the datapath.</p>
<p>In the 386, the standard-cell XOR gate is built by combining a NOR gate with an AND-NOR gate as shown below.<span id="fnref:xnor"><a class="ref" href="#fn:xnor">2</a></span>
(Although AND-NOR looks complicated, it is implemented as a single gate in CMOS.)
You can verify that if both inputs are 0, the NOR gate forces the output low, while if both inputs are 1, the AND gate forces the output low, providing the XOR functionality.</p>
<p><a href="https://static.righto.com/images/386-xor/xor-schematic.png"><img alt="Schematic of an XOR circuit." class="hilite" height="157" src="https://static.righto.com/images/386-xor/xor-schematic-w300.png" title="Schematic of an XOR circuit." width="300" /></a><div class="cite">Schematic of an XOR circuit.</div></p>
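<p>A quick truth-table check confirms the logic. The Python sketch below models the NOR + AND-NOR structure directly, with the AND-NOR written as a NOR whose first input is the AND of the two inputs:</p>

```python
def nor(a, b):
    return 1 - (a | b)

def xor_standard_cell(a, b):
    """The standard-cell XOR structure: a NOR gate feeding an AND-NOR gate.
    If both inputs are 0, the NOR output forces the result low; if both are 1,
    the AND term forces it low; otherwise the output is high."""
    return nor(a & b, nor(a, b))
```
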
<p>The photo below shows the layout of this XOR gate as a standard cell.
I have removed the metal and polysilicon layers to show the underlying silicon. The outlined regions are the
active silicon, with PMOS above and NMOS below. The stripes are the transistor gates, normally covered by polysilicon wires.
Notice that neighboring transistors are connected by shared silicon; there is no demarcation between the source
of one transistor and the drain of the next.</p>
<p><a href="https://static.righto.com/images/386-xor/xor-cell.jpg"><img alt="The silicon implementing the XOR standard cell. This image is rotated 180° from the layout on the die to put PMOS at the top." class="hilite" height="270" src="https://static.righto.com/images/386-xor/xor-cell-w300.jpg" title="The silicon implementing the XOR standard cell. This image is rotated 180° from the layout on the die to put PMOS at the top." width="300" /></a><div class="cite">The silicon implementing the XOR standard cell. This image is rotated 180° from the layout on the die to put PMOS at the top.</div></p>
<p>The schematic below corresponds to the silicon above. Transistors <em>a</em>, <em>b</em>, <em>c</em>, and <em>d</em> implement the first
NOR gate. Transistors <em>g</em>, <em>h</em>, <em>i</em>, and <em>j</em> implement the AND part of the AND-NOR gate.
Transistors <em>e</em> and <em>f</em>
implement the NOR input of the AND-NOR gate, fed from the first NOR gate.
The standard cell library is designed so all the cells are the same height
with a power rail at the top and a ground rail at the bottom. This allows the cells to "snap together" in rows.
The wiring inside the cell is implemented in polysilicon and the lower metal layer (M1), while the wiring between
cells uses the upper metal layer (M2) for vertical connections and lower metal (M1) for horizontal connections.
This strategy allows vertical wires to pass over the cells without interfering with the cell's wiring.</p>
<p><a href="https://static.righto.com/images/386-xor/xor-cell-schematic.png"><img alt="Transistor layout in the XOR standard cell." class="hilite" height="186" src="https://static.righto.com/images/386-xor/xor-cell-schematic-w500.png" title="Transistor layout in the XOR standard cell." width="500" /></a><div class="cite">Transistor layout in the XOR standard cell.</div></p>
<p>One important factor in a chip such as the 386 is optimizing the sizes of transistors.
If a transistor is too small, it will take too much time to switch its output line, reducing performance.
But if a transistor is too large, it will waste power as well as slowing down the circuit that is driving it.
Thus, the standard-cell library for the 386 includes several XOR gates of various sizes.
The diagram below shows a considerably larger XOR standard cell. The cell is the same height as the previous XOR
(as required by the standard cell layout), but it is much wider and the transistors inside the cell are taller.
Moreover, the PMOS side uses pairs
of transistors to double the current capacity. (NMOS has better performance than PMOS, so it doesn't require
doubled transistors.) Thus, there are 10 PMOS transistors and 5 NMOS transistors in this XOR cell.</p>
<p><a href="https://static.righto.com/images/386-xor/large-xor-cell.jpg"><img alt="A large XOR standard cell. This cell is also rotated from the die layout." class="hilite" height="232" src="https://static.righto.com/images/386-xor/large-xor-cell-w400.jpg" title="A large XOR standard cell. This cell is also rotated from the die layout." width="400" /></a><div class="cite">A large XOR standard cell. This cell is also rotated from the die layout.</div></p>
<h2>The pass transistor circuit</h2>
<p>Some parts of the 386 implement XOR gates completely differently, using <a href="https://en.wikipedia.org/wiki/Pass_transistor_logic">pass transistor logic</a>.
The idea of pass transistor logic is to use transistors as switches that pass inputs through to the output,
rather than using transistors as switches to pull the output high or low.
The pass transistor XOR circuit uses 8 transistors, compared with 10 for the previous circuit.<span id="fnref:advantages"><a class="ref" href="#fn:advantages">3</a></span></p>
<p>The die photo below shows a pass-transistor XOR circuit, highlighted in red.
Note that the surrounding circuitry is irregular and much more tightly packed than the standard-cell circuitry.
This circuit was laid out manually, producing a more optimized layout than standard cells.
It has four PMOS transistors at the top and four NMOS transistors at the bottom.</p>
<p><a href="https://static.righto.com/images/386-xor/xor-pass-die.jpg"><img alt="The pass-transistor XOR circuit on the die. The green regions are oxide that was not completely removed causing thin-film interference." class="hilite" height="376" src="https://static.righto.com/images/386-xor/xor-pass-die-w500.jpg" title="The pass-transistor XOR circuit on the die. The green regions are oxide that was not completely removed causing thin-film interference." width="500" /></a><div class="cite">The pass-transistor XOR circuit on the die. The green regions are oxide that was not completely removed causing thin-film interference.</div></p>
<p>The schematic below shows the heart of the circuit, computing the exclusive-NOR (XNOR) of X and Y with four pass transistors.
To understand the circuit, consider the four input cases for X and Y.
If X and Y are both 0, PMOS transistor <em>a</em> will turn on (because Y is low), passing 1
to the XNOR output.
(<span style="text-decoration:overline">X</span> is the complemented value of the X input.)
If X and Y are both 1, PMOS transistor <em>b</em> will turn on (because
<span style="text-decoration:overline">X</span>
is low), passing 1.
If X and Y are 1 and 0 respectively, NMOS transistor <em>c</em> will turn on (because X is high), passing 0.
If X and Y are 0 and 1 respectively, transistor <em>d</em> will turn on (because Y is high), passing 0.
Thus, the four transistors implement the XNOR function, with a 1 output if both inputs are the same.</p>
<p><a href="https://static.righto.com/images/386-xor/xnor-pass.png"><img alt="Partial implementation of XNOR with four pass transistors." class="hilite" height="198" src="https://static.righto.com/images/386-xor/xnor-pass-w300.png" title="Partial implementation of XNOR with four pass transistors." width="300" /></a><div class="cite">Partial implementation of XNOR with four pass transistors.</div></p>
<p>To make an XOR gate out of this requires two additional inverters.
The first inverter produces <span style="text-decoration:overline">X</span> from X.
The second inverter generates the XOR output by inverting the XNOR output.
The output inverter also has the important function of
buffering the output since the pass transistor output is weaker than the inputs.
Since each inverter takes two transistors, the complete XOR circuit uses 8 transistors.
The schematic below shows the full circuit. The <em>i1</em> transistors implement the input inverter and the <em>i2</em>
transistors implement the output inverter.
The layout of this schematic matches the earlier die photo.<span id="fnref:alternatives"><a class="ref" href="#fn:alternatives">5</a></span></p>
<p><a href="https://static.righto.com/images/386-xor/xor-pass.png"><img alt="Implementation of XOR with eight pass transistors." class="hilite" height="197" src="https://static.righto.com/images/386-xor/xor-pass-w500.png" title="Implementation of XOR with eight pass transistors." width="500" /></a><div class="cite">Implementation of XOR with eight pass transistors.</div></p>
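<p>The whole eight-transistor circuit can be checked with a switch-level model. The Python sketch below follows my reading of the schematic (the transistor roles are as described above, but which signal each source connects to is my interpretation); each conditional corresponds to one pass transistor conducting, and the assertion verifies that all conducting paths drive the XNOR node to the same value:</p>

```python
def xor_pass_transistor(x, y):
    """Switch-level sketch of the pass-transistor XOR circuit."""
    xbar = 1 - x                         # input inverter (the i1 transistors)
    drivers = []
    if y == 0:
        drivers.append(xbar)             # PMOS a: on when Y is low, passes x-bar
    if xbar == 0:
        drivers.append(y)                # PMOS b: on when x-bar is low, passes Y
    if x == 1:
        drivers.append(y)                # NMOS c: on when X is high, passes Y
    if y == 1:
        drivers.append(x)                # NMOS d: on when Y is high, passes X
    assert len(set(drivers)) == 1        # every conducting path agrees (no contention)
    xnor = drivers[0]
    return 1 - xnor                      # output inverter (the i2 transistors) buffers and inverts
```
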
<h2>Conclusions</h2>
<p>An XOR gate may seem like a trivial circuit, but there is more going on than you might expect.
I think it is interesting that there isn't a single solution for implementing XOR; even inside a
single chip, multiple approaches can be used.
(If you're interested in XOR circuits, I also looked at the <a href="https://www.righto.com/2013/09/understanding-z-80-processor-one-gate.html">XOR circuit in the Z80</a>.)
It's also reassuring to see that even for a complex chip such as the 386, the circuitry can be broken down into
logic gates and then understood at the transistor level.</p>
<p>I plan to write more about the 386, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon occasionally as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:and">
<p>You can't create an AND or OR gate directly from CMOS either, but this isn't usually a problem.
One approach is to create a NAND (or NOR) gate
and then follow it with an inverter, but this requires an "extra" inverter.
However, the inverter can often be eliminated by flipping the action of the next gate (using <a href="https://en.wikipedia.org/wiki/De_Morgan%27s_laws">De Morgan's laws</a>).
For example, if you have AND gates feeding into an OR gate, you can change the circuit to use NAND gates
feeding into a NAND gate, eliminating the inverters.
Unfortunately, flipping the logic levels doesn't help with XOR gates, since XNOR is just as hard to produce. <a class="footnote-backref" href="#fnref:and" title="Jump back to footnote 1 in the text">↩</a></p>
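<p>A quick exhaustive check of this De Morgan rewrite in Python:</p>

```python
def nand(a, b):
    return 1 - (a & b)

def check_demorgan_rewrite():
    """AND gates feeding an OR gate match NAND gates feeding a NAND gate."""
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                for d in (0, 1):
                    assert ((a & b) | (c & d)) == nand(nand(a, b), nand(c, d))
    return True
```
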
</li>
<li id="fn:xnor">
<p>The 386 also uses XNOR standard-cell gates. These are implemented with the "opposite" circuit from XOR, swapping the AND and OR gates:</p>
<p><a href="https://static.righto.com/images/386-xor/xnor-schematic.png"><img alt="Schematic of an XNOR circuit." class="hilite" height="174" src="https://static.righto.com/images/386-xor/xnor-schematic-w300.png" title="Schematic of an XNOR circuit." width="300" /></a><div class="cite">Schematic of an XNOR circuit.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:xnor" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:advantages">
<p>I'm not sure why some circuits in the 386 use standard logic for XOR while other circuits use pass transistor logic.
I suspect that the standard XOR is used when the XOR gate is part of a standard-cell logic circuit, while
the pass transistor XOR is used in hand-optimized circuits.
There may also be performance advantages to one over the other. <a class="footnote-backref" href="#fnref:advantages" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:inverter">
<p>The first inverter can be omitted in the pass transistor XOR circuit if the inverted input happens to be
available. In particular, if multiple XOR gates use the same input, one inverter can provide the inverted
input to all of them, reducing the per-gate transistor count. <a class="footnote-backref" href="#fnref:inverter" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:alternatives">
<p>The pass transistor XOR circuit uses different layouts in different parts of the 386, probably because hand
layout allows it to be optimized.
For instance, the instruction decoder uses the XOR circuit below. This circuit has four PMOS transistors
on the left and four NMOS transistors on the right.</p>
<p><a href="https://static.righto.com/images/386-xor/decoder-xor.jpg"><img alt="An XOR circuit from the instruction decoder." class="hilite" height="236" src="https://static.righto.com/images/386-xor/decoder-xor-w200.jpg" title="An XOR circuit from the instruction decoder." width="200" /></a><div class="cite">An XOR circuit from the instruction decoder.</div></p>
<p>The schematic shows the wiring of this circuit. Although the circuit is electrically the same as the
previous pass-transistor circuit, the layout is different. In the previous circuit, several of the
transistors were connected through their silicon, while this circuit has all the transistors separated
and arranged in columns.</p>
<p><a href="https://static.righto.com/images/386-xor/decoder-xor-schematic.png"><img alt="Schematic of the XOR circuit from the instruction decoder." class="hilite" height="320" src="https://static.righto.com/images/386-xor/decoder-xor-schematic-w250.png" title="Schematic of the XOR circuit from the instruction decoder." width="250" /></a><div class="cite">Schematic of the XOR circuit from the instruction decoder.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:alternatives" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
</ol>
</div>
<h2>Reverse engineering the barrel shifter circuit on the Intel 386 processor die</h2>
<p>The Intel 386 processor (1985) was a large step from the 286 processor, moving x86 to a 32-bit architecture.
The 386 also dramatically improved the performance of shift and rotate operations by adding a
"barrel shifter", a circuit that can shift by multiple bits in one step.
The die photo below shows the 386's barrel shifter, highlighted in the lower left and taking up a
substantial part of the die.</p>
<p><a href="https://static.righto.com/images/386-shift/386-die-labeled.jpg"><img alt="The 386 die with the main functional blocks labeled. Click this image (or any other) for a larger version." class="hilite" height="641" src="https://static.righto.com/images/386-shift/386-die-labeled-w600.jpg" title="The 386 die with the main functional blocks labeled. Click this image (or any other) for a larger version." width="600" /></a><div class="cite">The 386 die with the main functional blocks labeled. Click this image (or any other) for a larger version.</div></p>
<p>Shifting is a useful operation for computers, moving a binary value left or right by one or more bits.
Shift instructions can be used for multiplying or dividing by powers of two, and as part of more general multiplication
or division.
Shifting is also useful for extracting bit fields, aligning bitmap graphics, and many other tasks.<span id="fnref:barrel-shifter-history"><a class="ref" href="#fn:barrel-shifter-history">1</a></span></p>
<p>Barrel shifters require a significant amount of circuitry.
A common approach is to use a crossbar, a matrix of switches that can connect any input to any output.
By closing switches along a desired diagonal, the input bits are shifted.
The diagram below illustrates a 4-bit crossbar barrel shifter with inputs X (vertical) and outputs Y (horizontal).
At each point in the grid, a switch (triangle) connects a vertical input line to a horizontal output line.
Energizing the blue control line, for instance, passes the value through unchanged (X0 to Y0 and so forth).
Energizing the green control line rotates the value by one bit position (X0 to Y1 and so forth, with X3 wrapping around to Y0).
Similarly, the circuit can shift by 2 or 3 bits.
The shift control lines select the amount of shift. These lines run diagonally, which will be important later.</p>
<p><a href="https://static.righto.com/images/386-shift/crossbar.png"><img alt="A four-bit crossbar switch with inputs X and outputs Y. Image by Cmglee, CC BY-SA 3.0." class="hilite" height="300" src="https://static.righto.com/images/386-shift/crossbar-w400.png" title="A four-bit crossbar switch with inputs X and outputs Y. Image by Cmglee, CC BY-SA 3.0." width="400" /></a><div class="cite">A four-bit crossbar switch with inputs X and outputs Y. Image by <a href="https://en.wikipedia.org/wiki/Barrel_shifter#/media/File:Crossbar_barrel_shifter.svg">Cmglee</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0">CC BY-SA 3.0</a>.</div></p>
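<p>To make the crossbar's operation concrete, here's a little Python model (my own sketch, nothing from the actual hardware): energizing control line <code>k</code> closes the switches along one diagonal, connecting each input X[i] to output Y[(i + k) % 4].</p>

```python
def crossbar_rotate(x_bits, control):
    """Model of a 4-bit crossbar rotator: selecting control line k
    connects input X[i] to output Y[(i + k) % 4]."""
    n = len(x_bits)
    y_bits = [0] * n
    for i in range(n):
        y_bits[(i + control) % n] = x_bits[i]
    return y_bits

# Control line 0 passes the value through unchanged;
# control line 1 rotates by one position, with X3 wrapping around to Y0.
print(crossbar_rotate([1, 0, 0, 0], 0))  # [1, 0, 0, 0]
print(crossbar_rotate([1, 0, 0, 0], 1))  # [0, 1, 0, 0]
```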
<p>The main problem with a crossbar barrel shifter is that it takes a lot of hardware.
The 386's barrel shifter has a 64-bit input and a 32-bit output,<span id="fnref:64-bit"><a class="ref" href="#fn:64-bit">2</a></span>
so the approach above would require 2048 switches (64×32).
For this reason, the 386 uses a hybrid approach, as shown below. It has a 32×8 crossbar that can shift by 0 to 28 bits,
but only in multiples of 4, making the circuit much smaller.
The output from the crossbar goes to a second circuit that can shift by 0, 1, 2, or 3 bits.
The combined circuitry supports an arbitrary shift, but requires less hardware than a complete crossbar.
The inputs to the barrel shifter are two 32-bit values from the processor's register
file, stored in latches for use by the shifter.</p>
<p><a href="https://static.righto.com/images/386-shift/block-diagram.png"><img alt="Block diagram of the barrel shifter circuit." class="hilite" height="313" src="https://static.righto.com/images/386-shift/block-diagram-w150.png" title="Block diagram of the barrel shifter circuit." width="150" /></a><div class="cite">Block diagram of the barrel shifter circuit.</div></p>
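<p>The hybrid approach is easy to express as a Python sketch (my own model, not the chip's logic): the shift count splits into a coarse multiple of 4 for the crossbar and a fine 0-3 remainder for the second stage.</p>

```python
def two_stage_shift_right(value, count, width=32):
    """Shift right by count using a coarse stage (multiples of 4)
    followed by a fine stage (0-3), like the 386's hybrid shifter."""
    mask = (1 << width) - 1
    coarse = count & ~0x3          # 0, 4, 8, ..., 28: handled by the crossbar
    fine = count & 0x3             # 0 to 3: handled by the second stage
    partial = (value >> coarse) & mask
    return (partial >> fine) & mask

# Any count from 0 to 31 works; e.g. 13 = 12 + 1.
assert two_stage_shift_right(0xDEADBEEF, 13) == 0xDEADBEEF >> 13
```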
<p>The figure below shows how the shifter circuitry appears on the die; this image shows the two metal layers on
the die's surface.
The inputs from the register file are at the bottom, for bits 31 through 0.
Above that, the input latches hold the two 32-bit inputs for the shifter.
In the middle is the heart of the shift circuit, the crossbar matrix. This takes the two 32-bit
inputs and produces a 32-bit output.
The matrix is controlled by sloping polysilicon lines, driven by control circuitry on the right.
The matrix output
goes to the circuit that applies a shift of 0 to 3 positions.
Finally, the outputs exit at the top, where they go to other parts of the CPU.
The shifter performs right shifts, but as will be explained below, the same circuit is used for the
left shift instructions.</p>
<p><a href="https://static.righto.com/images/386-shift/shifter-die.jpg"><img alt="The barrel shifter circuitry as it appears on the die. I have cut out repetitive circuitry from the middle because the complete image is too wide to display clearly." class="hilite" height="468" src="https://static.righto.com/images/386-shift/shifter-die-w700.jpg" title="The barrel shifter circuitry as it appears on the die. I have cut out repetitive circuitry from the middle because the complete image is too wide to display clearly." width="700" /></a><div class="cite">The barrel shifter circuitry as it appears on the die. I have cut out repetitive circuitry from the middle because the complete image is too wide to display clearly.</div></p>
<h2>The barrel shifter crossbar matrix</h2>
<p>In this section, I'll describe the matrix part of the barrel shifter circuit.
The shift matrix takes 32-bit values <code>a</code> and <code>b</code>. Value <code>b</code> is shifted to the right, with
bits from <code>a</code> filling in at the left, producing a 32-bit output.
(As will be explained below, the output is actually 37 bits due to some complications, but ignore that for now.)
The shift count is a multiple of 4 from 0 to 28.</p>
<p>The diagram below illustrates the structure of the shift matrix. The two 32-bit inputs are provided at the bottom, interleaved, and run vertically.
The 32 output lines run horizontally.
The 8 control lines run diagonally, activating the switches (black dots) to connect
inputs and outputs.
(For simplicity, only 3 control lines are shown.)
For a shift of 0, control line 0 (red) is selected and the output is b<sub>31</sub>b<sub>30</sub>...b<sub>1</sub>b<sub>0</sub>.
(You can verify this by matching up inputs to outputs through the dots along the red line.)</p>
<p><a href="https://static.righto.com/images/386-shift/matrix-diagram.png"><img alt="Diagram of the shift matrix, showing three of the shift control lines." class="hilite" height="309" src="https://static.righto.com/images/386-shift/matrix-diagram-w600.png" title="Diagram of the shift matrix, showing three of the shift control lines." width="600" /></a><div class="cite">Diagram of the shift matrix, showing three of the shift control lines.</div></p>
<p>For a shift right of 4, the cyan control line is activated.
It can be seen that the output in this case is
a<sub>3</sub>a<sub>2</sub>a<sub>1</sub>a<sub>0</sub>b<sub>31</sub>b<sub>30</sub>...b<sub>5</sub>b<sub>4</sub>, shifting <code>b</code> to the right 4 bits and filling in four bits from <code>a</code> as desired.
For a shift of 28, the purple control line is activated, producing the output
a<sub>27</sub>...a<sub>0</sub>b<sub>31</sub>...b<sub>28</sub>.
Note that the control lines are spaced four bits apart, which is why the matrix only shifts by a multiple of 4.
Another important feature is that below the red diagonal, the b inputs are connected to the output, while above the
diagonal, the a inputs are connected to the output. (In other words, the black dots are shifted to the right
above the diagonal.)
This implements the 64-bit support, taking bits from <code>a</code> or <code>b</code> as appropriate.</p>
<p>Looking at the implementation on the die,
the vertical wires use the lower metal layer (metal 1) while the horizontal wires use the upper metal
layer (metal 2), so the wires don't intersect.
NMOS transistors are used as the switches to connect inputs and outputs.<span id="fnref:multiplexer"><a class="ref" href="#fn:multiplexer">4</a></span>
The transistors are controlled by diagonal wires constructed of polysilicon that form the transistor gates.
When a particular polysilicon wire is energized, it turns on the transistors along a diagonal line,
connecting those inputs and outputs.</p>
<p>The image below shows the left side of the matrix.<span id="fnref:full-matrix"><a class="ref" href="#fn:full-matrix">5</a></span>
The polysilicon control lines are the green horizontal lines stepping down to the right.
These control the transistors, which appear as columns of blue-gray squares next to the polysilicon lines.
The metal layers have been removed; the position of the lower metal 1 layer is visible in the vertical bluish
lines.</p>
<p><a href="https://static.righto.com/images/386-shift/matrix-die.jpg"><img alt="The left side of the matrix as it appears on the die." class="hilite" height="460" src="https://static.righto.com/images/386-shift/matrix-die-w600.jpg" title="The left side of the matrix as it appears on the die." width="600" /></a><div class="cite">The left side of the matrix as it appears on the die.</div></p>
<p>The diagram below shows four of these transistors in the shifter matrix. There are four circuitry layers involved.
The underlying silicon is pinkish gray; the active regions are the squares with darker borders.
Next is the polysilicon (green), which forms the control lines and the transistor gates.
The lower metal layer (metal 1) forms the blue vertical lines that connect to the transistors.<span id="fnref:metal"><a class="ref" href="#fn:metal">3</a></span>
The upper metal layer (metal 2) forms the horizontal bit output lines.
Finally, the small black dots are the vias that connect metal 1 and metal 2.
(The well taps are silicon regions connected to ground to prevent latch-up.)</p>
<p><a href="https://static.righto.com/images/386-shift/matrix-closeup.jpg"><img alt="Four transistors in the shifter matrix. The polysilicon and metal lines have been drawn in." class="hilite" height="261" src="https://static.righto.com/images/386-shift/matrix-closeup-w450.jpg" title="Four transistors in the shifter matrix. The polysilicon and metal lines have been drawn in." width="450" /></a><div class="cite">Four transistors in the shifter matrix. The polysilicon and metal lines have been drawn in.</div></p>
<p>To see how this works, suppose the upper polysilicon line is activated, turning on the top two transistors.
The two vertical bit-in lines (blue) will be connected through the transistors to the top two bit out
lines (purple), by way of the short light blue metal segments and the via (black dot).
However, if the lower polysilicon line is activated, the bottom two transistors will be turned on.
This will connect the bit-in lines to the fifth and sixth bit-out lines, four lines down from
the previous ones. Thus, successive polysilicon lines shift the connections down by four lines at a time,
so the shifts change in steps of 4 bit positions.</p>
<p>As mentioned earlier,
to support the 64-bit input, the transistors below the diagonal are connected to <code>b</code> input while the transistors
above the diagonal are connected to the <code>a</code> input.
The photo below shows the physical implementation:
the four upper transistors are shifted to the right by one wire width, so they connect to vertical <code>a</code> wires, while
the four lower transistors are connected to <code>b</code> wires.
(The metal wires were removed for this photo to show the transistors.)</p>
<p><a href="https://static.righto.com/images/386-shift/matrix-shifted.jpg"><img alt="This photo of the underlying silicon shows eight transistors. The top four transistors are shifted one position to the right. The irregular lines are remnants of other layers that I couldn't completely remove from the die." class="hilite" height="379" src="https://static.righto.com/images/386-shift/matrix-shifted-w300.jpg" title="This photo of the underlying silicon shows eight transistors. The top four transistors are shifted one position to the right. The irregular lines are remnants of other layers that I couldn't completely remove from the die." width="300" /></a><div class="cite">This photo of the underlying silicon shows eight transistors. The top four transistors are shifted one position to the right. The irregular lines are remnants of other layers that I couldn't completely remove from the die.</div></p>
<p>In the matrix, the output signals run horizontally.
In order for signals to exit the shifter from the top of the matrix, each horizontal output wire is connected to
a vertical output wire.
Meanwhile, other processor signals (such as the register write data) must also pass vertically through the shifter region.
The result is a complicated layout, packing everything together as tightly as possible.</p>
<h3>The precharge/keepers</h3>
<p>At the left and the right of the barrel shifter, repeated blocks of circuitry are visible.
These blocks contain precharge and <a href="https://en.wikipedia.org/wiki/Bus-holder">keeper</a> circuits to hold the value on one of the lines.
During the first clock phase, each horizontal bit line is precharged to +5 volts.
Next, the matrix is activated and horizontal lines may be pulled low.
If the line is not pulled low, the inverter and PMOS transistor will continuously pull the line high.
The inverter and transistor can be viewed as a bus keeper, essentially a weak latch to hold the
line in the 1 state.
The keeper uses relatively weak transistors, so the line can be pulled low
when the barrel shifter is activated.
The purpose of the keeper is to ensure that the line doesn't drift into a state between 0 and 1.
This is a bad situation with CMOS circuitry, since the pull-up and pull-down transistors could both turn on,
yielding a short circuit.</p>
<p><a href="https://static.righto.com/images/386-shift/keeper.png"><img alt="The precharge/keeper circuit" class="hilite" height="101" src="https://static.righto.com/images/386-shift/keeper-w300.png" title="The precharge/keeper circuit" width="300" /></a><div class="cite">The precharge/keeper circuit</div></p>
<p>The motivation behind this design is that implementing the matrix with "real" CMOS would require twice as
many transistors.
By implementing the matrix with NMOS transistors only, the size is reduced.
In a standard NMOS implementation, pull-up transistors would continuously pull the lines high, but this results in
fairly high power consumption.
Instead, the precharge circuit pulls the line high at the start.
But this results in dynamic logic, dependent on the capacitance of the circuit to hold the charge.
To avoid the charge leaking away, the keeper circuit keeps the line high until it is pulled low.
Thus, this circuit minimizes the area of the matrix as well as minimizing power consumption.</p>
<p>There are 37 keepers in total for the 37 output lines from the matrix.<span id="fnref:keeper-structure"><a class="ref" href="#fn:keeper-structure">6</a></span>
(The extra 5 lines will be explained below.)
The photo below shows one block of three keepers; the metal has been removed to show the silicon transistors
and some of the polysilicon (green).</p>
<p><a href="https://static.righto.com/images/386-shift/keeper.jpg"><img alt="One block of keeper circuitry, to the right of the shift matrix. This block has 12 transistors, supporting three bits." class="hilite" height="225" src="https://static.righto.com/images/386-shift/keeper-w400.jpg" title="One block of keeper circuitry, to the right of the shift matrix. This block has 12 transistors, supporting three bits." width="400" /></a><div class="cite">One block of keeper circuitry, to the right of the shift matrix. This block has 12 transistors, supporting three bits.</div></p>
<h2>The register latches</h2>
<p>At the bottom of the shift circuit, two latches hold the two 32-bit input values.
The 386 has multi-ported registers, so it can access two registers and write a third register at the same time.
This allows the shift circuit to load both values at the same time.
I believe that a value can also come from the 386's constant ROM, which is useful for providing 0, 1, or all-ones
to the shifter.</p>
<p>The schematic below shows the register latches for one bit of the shifter.
Starting at the bottom are the two inputs from the register file (one appears to be inverted for no good reason).
Each input is stored in a latch, using the standard 386 latch circuit.<span id="fnref:latch"><a class="ref" href="#fn:latch">7</a></span>
The latched input is gated by the clock and then goes through multiplexers allowing either value to be used
as either input to the shifter.
(The shifter takes two 32-bit inputs and this multiplexer allows the inputs to be swapped to the other
sides of the shifter.)
A second latch stage holds the values for the output; this latch is cleared during the first clock phase
and holds the desired value during the second clock phase.</p>
<p><a href="https://static.righto.com/images/386-shift/reg-latch.png"><img alt="Circuit for one bit of the register latch." class="hilite" height="584" src="https://static.righto.com/images/386-shift/reg-latch-w350.png" title="Circuit for one bit of the register latch." width="350" /></a><div class="cite">Circuit for one bit of the register latch.</div></p>
<p>The die photo below shows the register latch circuit, contrasting the metal layers (left) with the
silicon layer (right). The dark spots in the metal image are vias between the metal layers
or connections to the underlying silicon or polysilicon.
The metal layer is very dense with vertical wiring in the lower metal 1 layer and horizontal wiring in
the upper metal 2 layer.
The density of the chip seems to be constrained by the metal wiring more than the density of the transistors.</p>
<p><a href="https://static.righto.com/images/386-shift/reg.jpg"><img alt="One of the register latch circuits." class="hilite" height="481" src="https://static.righto.com/images/386-shift/reg-w350.jpg" title="One of the register latch circuits." width="350" /></a><div class="cite">One of the register latch circuits.</div></p>
<h2>The 0-3 shifter</h2>
<p>The shift matrix can only shift in steps of 4 bits. To support other shifts, a circuit at the top of the
shifter provides a shift of 0 to 3 bits. In conjunction, these circuits permit a shift by an arbitrary
amount.<span id="fnref:log"><a class="ref" href="#fn:log">8</a></span>
The schematic below shows the circuit.
A bit enters at the bottom. The first shift stage passes the bit through, or sends it one bit position to the right.
The second stage passes the bit through, or sends it two bit positions to the right.
Thus, depending on the control lines, each bit can be shifted by 0 to 3 positions to the right.
At the top, a transistor pulls the circuit low to initialize it; the NOR gate at the bottom does the same.
A keeper transistor holds the circuit low until a data bit pulls it high.</p>
<p><a href="https://static.righto.com/images/386-shift/output-shift.png"><img alt="One bit of the 0-3 shifter circuit." class="hilite" height="510" src="https://static.righto.com/images/386-shift/output-shift-w280.png" title="One bit of the 0-3 shifter circuit." width="280" /></a><div class="cite">One bit of the 0-3 shifter circuit.</div></p>
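<p>In Python terms (a sketch of the logic, not the circuit), the cascade works like a small logarithmic shifter: the low bit of the shift count controls the shift-by-1 stage and the next bit controls the shift-by-2 stage.</p>

```python
def shift_0_to_3(value, count, width=32):
    """Two cascaded stages: the first shifts right by 0 or 1,
    the second by 0 or 2, controlled by the two bits of the count."""
    mask = (1 << width) - 1
    stage1 = (value >> (count & 1)) & mask   # shift by 0 or 1
    stage2 = (stage1 >> (count & 2)) & mask  # shift by 0 or 2
    return stage2

assert shift_0_to_3(0xF0F0F0F0, 3) == 0xF0F0F0F0 >> 3
```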
<p>The diagram below shows the silicon implementation corresponding to two copies of the schematic above.
The shifters are implemented in pairs to slightly optimize the layout.
In particular, the two NOR gates are mirrored so the power connection can be shared.
This is a small optimization, but it illustrates that the 386 designers put a lot of work into making
the layout dense.</p>
<p><a href="https://static.righto.com/images/386-shift/shift03.jpg"><img alt="Two bits of the 0-3 shifter circuit as it appears on the die." class="hilite" height="499" src="https://static.righto.com/images/386-shift/shift03-w500.jpg" title="Two bits of the 0-3 shifter circuit as it appears on the die." width="500" /></a><div class="cite">Two bits of the 0-3 shifter circuit as it appears on the die.</div></p>
<h2>Complications</h2>
<p>As is usually the case with x86, there are a few complications. One complication is that the
shift matrix has 37 outputs, rather than the expected 32.
There are two reasons behind this.
First, the upper shifter will shift right by up to 3 positions, so it needs 3 extra bits.
Thus, the matrix needs to output bits 0 through 34 so three bits can be discarded.
Second, shift instructions usually produce a carry bit from the last bit shifted out of the word.
To support this, the shift matrix provides an extra bit at both ends for use as the carry.
The result is that the matrix produces 37 outputs, which can be viewed as bits -1 through 35.</p>
<p>Another complication is that the x86 instruction set supports shifts on bytes and 16-bit words as well
as 32-bit words.
If you put two 8-bit bytes into the shifter, there will be 24 unused bits in between, posing a problem for
the shifter.
The solution is that some of the diagonal control lines in the matrix are split on byte and word boundaries,
allowing an 8- or 16-bit value to be shifted independently.
For example, you can perform a 4-bit right shift on the right-hand byte, and a 28-bit right shift on the
left-hand byte.
This brings the two bytes together in the result, yielding the desired 4-bit right shift.
As a result, there are 18 diagonal control lines in the shifter (if I counted correctly), rather than the
expected 8 control lines.
This makes the circuitry to drive the control lines more complicated, as it must generate different
signals depending on the size of the operand.</p>
<h2>The control circuitry</h2>
<p>The control circuitry at the right of the shifter drives the diagonal polysilicon lines in the matrix, selecting
the desired shift.
It also generates control signals for the 0-3 shifter, selecting a shift-by-1 or shift-by-2 as necessary.
This circuitry operates under the control of the microcode, which tells it when to shift.
It gets the shift amount from the instruction or the <code>CL</code> register and generates the appropriate control
signals.</p>
<p>The distribution of control signals is more complex than you might expect.
If possible, the polysilicon diagonals are connected on the right of the matrix to the control circuitry,
providing a direct connection.
However, many of the diagonals do not extend all the way to the right, either because they start on the
left or because they are segmented for 8- or 16-bit values.
Some of these signals are transmitted through polysilicon lines that run underneath the matrix.
Others are transmitted through horizontal metal lines that run through the register latches.
(These latches don't use many horizontal lines, so there is available space to route other signals.)
These signals then ascend through the matrix at various points to connect with the polysilicon lines.
This shows that the routing of this circuitry is carefully optimized to make it as compact as possible.
Moreover, these "extra" lines disrupt the layout; the matrix is almost a regular pattern, but it has
small irregularities throughout.</p>
<h2>Implementing x86 shifts and rotates with the barrel shifter</h2>
<p>The x86 has a variety of shift and rotate instructions.<span id="fnref:shift-history"><a class="ref" href="#fn:shift-history">9</a></span> It is interesting to consider how they are
implemented using the barrel shifter, since it is not always obvious.
In this section, I'll discuss the instructions supported by the 386.</p>
<p>One important principle is that even though the circuitry shifts to the right, by changing the inputs
this can achieve a shift to the left.
To make this concrete, consider two input words <code>a</code> and <code>b</code>, with the shifter extracting the portion in red below.
(I'll use 8-bit examples instead of 32-bit here and below to keep the size manageable.)
The circuit shifts <code>b</code> to the right five bits, inserting bits from <code>a</code> at the left.
Alternatively, the result can be viewed as shifting <code>a</code> to the left three bits, inserting bits from <code>b</code> at the right.
Thus, the same result can be viewed as a right shift of <code>b</code> or a left shift of <code>a</code>.
This holds in general, with a 32-bit right shift by N bits equivalent to a left shift by 32-N bits,
depending on which word<span id="fnref:word"><a class="ref" href="#fn:word">10</a></span> you focus on.</p>
<p><span class="mono">a<sub>7</sub>a<sub>6</sub>a<sub>5</sub><span class="box">a<sub>4</sub>a<sub>3</sub>a<sub>2</sub>a<sub>1</sub>a<sub>0</sub>b<sub>7</sub>b<sub>6</sub>b<sub>5</sub></span>b<sub>4</sub>b<sub>3</sub>b<sub>2</sub>b<sub>1</sub>b<sub>0</sub></span></p>
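<p>A quick Python check makes the equivalence explicit (an 8-bit model of my own, matching the example above):</p>

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1

def right_shift_view(a, b, n):
    """The shifter output seen as b shifted right n bits, filled from a."""
    return ((b >> n) | (a << (WIDTH - n))) & MASK

def left_shift_view(a, b, n):
    """The same output seen as a shifted left n bits, filled from b."""
    return ((a << n) | (b >> (WIDTH - n))) & MASK

# A right shift of b by N produces the same bits as a left shift of a by 8-N.
a, b = 0b10110101, 0b01101100
assert right_shift_view(a, b, 5) == left_shift_view(a, b, 3)
```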
<h3>Double shifts</h3>
<p>The double-shift instructions (Shift Left Double (SHLD) and Shift Right Double (SHRD)) were new in the 386,
shifting two 32-bit values to produce a 32-bit result.
The last bit shifted out goes into the carry flag (CF).
These instructions map directly onto the behavior of the barrel shifter, so I'll start with them.</p>
<p><a href="https://static.righto.com/images/386-shift/doubleshifts.png"><img alt="Actions of the double shift instructions." class="hilite" height="288" src="https://static.righto.com/images/386-shift/doubleshifts-w400.png" title="Actions of the double shift instructions." width="400" /></a><div class="cite">Actions of the double shift instructions.</div></p>
<p>The examples below show how the shifter implements the <code>SHLD</code> and <code>SHRD</code> instructions; the shifter
output is highlighted in red. (These examples use an 8-bit source (<code>s</code>) and destination (<code>d</code>)
to keep them manageable.)
In either case, 3 bits of the source are shifted into the destination; shifting left or right
is just a matter of whether the destination is on the left or right.</p>
<p><code>SHLD 3</code>: <span class="mono">ddd<span class="box">dddddsss</span>sssss</span></p>
<p><code>SHRD 3</code>: <span class="mono">sssss<span class="box">sssddddd</span>ddd</span></p>
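<p>These map directly onto a funnel-shift model. In the 8-bit Python sketch below (my own model; the real shifter is of course 32 bits wide), SHRD shifts the destination right with the source filling in from the left, and SHLD is the same operation with the operands placed the other way around.</p>

```python
def funnel(a, b, count, width=8):
    """Shift b right by count; bits of a fill in at the left."""
    mask = (1 << width) - 1
    return ((((a & mask) << width) | (b & mask)) >> count) & mask

def shrd(d, s, n, w=8):
    """SHRD: d shifted right n bits, s supplies the bits at the left."""
    return funnel(s, d, n, w)

def shld(d, s, n, w=8):
    """SHLD: d shifted left n bits, s supplies the bits at the right.
    Implemented as a right shift by w - n with d on the left."""
    return funnel(d, s, w - n, w)

d, s = 0b11010010, 0b10110110
assert shrd(d, s, 3) == 0b11011010   # low 3 bits of s, then top 5 bits of d
assert shld(d, s, 3) == 0b10010101   # low 5 bits of d, then top 3 bits of s
```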
<h3>Shifts</h3>
<p>The basic shift instructions are probably the simplest.
Shift Arithmetic Left (SAL) and Shift Logical Left (SHL) are synonyms, shifting the destination to the left
and filling with zeroes. This can be accomplished by performing a shift with the word on the left and zeroes on the right.
Shift Logical Right (SHR) is the opposite, shifting to the right and filling with zeros. This can be
accomplished by putting the word on the right and zeroes on the left.
Shift Arithmetic Right (SAR) is a bit different. It fills with the sign bit, the top bit.
The purpose of this is to shift a signed number while preserving its sign.
It can be implemented by putting all zeroes or all ones on the left, depending on the sign bit.
Thus, the shift instructions map nicely onto the barrel shifter.</p>
<p><a href="https://static.righto.com/images/386-shift/shifts.png"><img alt="Actions of the shift instructions." class="hilite" height="249" src="https://static.righto.com/images/386-shift/shifts-w400.png" title="Actions of the shift instructions." width="400" /></a><div class="cite">Actions of the shift instructions.</div></p>
<p>The 8-bit examples below show how the shifter accomplishes the <code>SHL</code>, <code>SHR</code>, and <code>SAR</code> instructions.
The destination value <code>d</code> is loaded into one half of the shifter.
For <code>SAR</code>, the value's sign bit <code>s</code> is loaded into the other half of the shifter, while the other instructions
load <code>0</code> into the other half of the shifter.
The red box shows the output from the shifter, selected from the input.</p>
<p><code>SHL 3</code>: <span class="mono">ddd<span class="box">ddddd000</span>00000</span></p>
<p><code>SHR 3</code>: <span class="mono">00000<span class="box">000ddddd</span>ddd</span></p>
<p><code>SAR 3</code>: <span class="mono">sssss<span class="box">sssddddd</span>ddd</span></p>
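<p>In the same 8-bit funnel-shift model (a sketch of mine, not the actual implementation), the three shift instructions differ only in what is fed into the other half of the shifter:</p>

```python
def funnel(a, b, count, width=8):
    """Shift b right by count; bits of a fill in at the left."""
    mask = (1 << width) - 1
    return ((((a & mask) << width) | (b & mask)) >> count) & mask

def shl(d, n, w=8):
    """SHL/SAL: word on the left, zeroes fill in from the right."""
    return funnel(d, 0, w - n, w)

def shr(d, n, w=8):
    """SHR: word on the right, zeroes fill in from the left."""
    return funnel(0, d, n, w)

def sar(d, n, w=8):
    """SAR: word on the right, copies of the sign bit fill in from the left."""
    fill = (1 << w) - 1 if d & (1 << (w - 1)) else 0
    return funnel(fill, d, n, w)

assert shl(0b11010010, 3) == 0b10010000
assert shr(0b11010010, 3) == 0b00011010
assert sar(0b11010010, 3) == 0b11111010  # sign bit is 1, so ones fill in
```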
<h3>Rotates</h3>
<p>Unlike the shift instructions, the rotate instructions preserve all the bits. As bits shift off one end,
they fill in the other end, so the bit sequence rotates.
A rotate left or right is implemented by putting the same word on the left and right.</p>
<p><a href="https://static.righto.com/images/386-shift/rotates.png"><img alt="Actions of the rotate instructions." class="hilite" height="186" src="https://static.righto.com/images/386-shift/rotates-w400.png" title="Actions of the rotate instructions." width="400" /></a><div class="cite">Actions of the rotate instructions.</div></p>
<p>The shifter implements rotates as shown below, using the destination value as both shifter inputs.
A rotate left by N bits is implemented by rotating right by 32-N bits.</p>
<p><code>ROL 3</code>: <span class="mono">d<sub>7</sub>d<sub>6</sub>d<sub>5</sub><span class="box">d<sub>4</sub>d<sub>3</sub>d<sub>2</sub>d<sub>1</sub>d<sub>0</sub>d<sub>7</sub>d<sub>6</sub>d<sub>5</sub></span>d<sub>4</sub>d<sub>3</sub>d<sub>2</sub>d<sub>1</sub>d<sub>0</sub></span></p>
<p><code>ROR 3</code>: <span class="mono">d<sub>7</sub>d<sub>6</sub>d<sub>5</sub>d<sub>4</sub>d<sub>3</sub><span class="box">d<sub>2</sub>d<sub>1</sub>d<sub>0</sub>d<sub>7</sub>d<sub>6</sub>d<sub>5</sub>d<sub>4</sub>d<sub>3</sub></span>d<sub>2</sub>d<sub>1</sub>d<sub>0</sub></span></p>
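<p>In the same style of 8-bit sketch (mine, not the 386's logic), a rotate is simply the funnel shift with the destination on both sides, and a rotate left by n becomes a rotate right by 8-n:</p>

```python
def funnel(a, b, count, width=8):
    """Shift b right by count; bits of a fill in at the left."""
    mask = (1 << width) - 1
    return ((((a & mask) << width) | (b & mask)) >> count) & mask

def ror(d, n, w=8):
    """ROR: the same word on both sides of the shifter."""
    return funnel(d, d, n, w)

def rol(d, n, w=8):
    """ROL: a rotate left by n is a rotate right by w - n."""
    return funnel(d, d, (w - n) % w, w)

assert ror(0b11010010, 3) == 0b01011010
assert rol(0b11010010, 3) == 0b10010110
```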
<h3>Rotates through carry</h3>
<p>The rotate through carry instructions perform 33-bit rotates, rotating the value through the carry bit.
You might wonder how the barrel shifter can perform a 33-bit rotate, and the answer is that it can't.
Instead, the instruction takes multiple steps.
If you look at the instruction timings, the other shifts and rotates take three clock cycles. Rotating
through the carry, however, takes nine clock cycles, performing multiple steps
under the control of the microcode.</p>
<p><a href="https://static.righto.com/images/386-shift/rotatesc.png"><img alt="Actions of the rotate through carry instructions." class="hilite" height="185" src="https://static.righto.com/images/386-shift/rotatesc-w400.png" title="Actions of the rotate through carry instructions." width="400" /></a><div class="cite">Actions of the rotate through carry instructions.</div></p>
<style type="text/css">
.mono {font-family: monospace; font-size: 100%;}
.box {border: 2px solid red; padding: 2px 0;}
</style>
<p>Without looking at the microcode, I can only speculate how it takes place. One sequence would be to
get the top bits by putting zeroes in the right 32 bits and shifting. Next, get the bottom bits by putting
the carry bit in the left 32 bits and shifting one bit more. (That is, set the left 32-bit input to either the constant 0 or 1,
depending on the carry.) Finally, the result can be generated by ORing the two shifted values together.
The example below shows how an <code>RCL 3</code> could be implemented.
In the second step, the carry value <code>C</code> is loaded into the left side of the shifter, so it can get into
the result.
Note that bit <span class="mono">d<sub>5</sub></span> ends up in the carry bit, rather than the result.
The <code>RCR</code> instruction would be similar, but with the shift parameters adjusted accordingly.</p>
<p>First shift:
<span class="mono">d<sub>7</sub>d<sub>6</sub>d<sub>5</sub><span class="box">d<sub>4</sub>d<sub>3</sub>d<sub>2</sub>d<sub>1</sub>d<sub>0</sub>000</span>00000</span></p>
<p>Second shift:
<span class="mono">00<span class="box">00000Cd<sub>7</sub>d<sub>6</sub></span>d<sub>5</sub>d<sub>4</sub>d<sub>3</sub>d<sub>2</sub>d<sub>1</sub>d<sub>0</sub></span></p>
<p>Result from <code>OR</code>:
<span class="mono">d<sub>4</sub>d<sub>3</sub>d<sub>2</sub>d<sub>1</sub>d<sub>0</sub>Cd<sub>7</sub>d<sub>6</sub></span></p>
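<p>The speculated sequence can be written out as a Python sketch. To be clear, this is my own illustration of the two-pass-and-OR idea, not the actual microcode:</p>

```python
def rcl(value, carry, n, width=32):
    """Speculative model of rotate-left-through-carry as two barrel-shifter
    passes ORed together, following the steps described above."""
    mask = (1 << width) - 1
    # Step 1: zeroes in the right half of the shifter input; shift to
    # bring the top bits of the value into position.
    top = ((value << width) >> (width - n)) & mask
    # Step 2: the carry bit in the left half; shift one bit more to
    # bring the carry and the bottom bits into position.
    bottom = (((carry << width) | value) >> (width - n + 1)) & mask
    result = top | bottom
    # The bit rotated out of the value becomes the new carry.
    carry_out = (value >> (width - n)) & 1
    return result, carry_out
```

Running this with an 8-bit value and n=3 reproduces the <code>RCL 3</code> diagram above, with bit d<sub>5</sub> landing in the carry.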
<h2>Conclusions</h2>
<p>The shifter circuit illustrates how the rapidly increasing transistor counts in the 1980s allowed new features.
Programming languages make it easy to shift numbers with an expression such as <code>a>>5</code>. But it takes a lot of hardware in the processor
to perform these shifts efficiently.
The additional hardware of the 386's barrel shifter dramatically improved the performance of shifts and rotates compared to earlier
x86 processors.
I estimate that the barrel shifter requires about 2000 transistors,
about half the number of the entire 6502 processor (1975).
But by 1985, putting 2000 transistors into a feature was practical.
(In total, the 386 contains 285,000 transistors, a trivial number now, but a large number for the time.)</p>
<p>I plan to write more about the 386, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon occasionally as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:barrel-shifter-history">
<p>The earliest reference for a barrel shifter is often given as "A barrel switch design", Computer Design, 1972, but the idea of a barrel shifter goes back to at least 1964.
(The "barrel switch" name presumably comes from a physical barrel switch, a cylindrical multi-position switch such as a
car ignition.)
The CDC 6600 supercomputer (1964) had a 6-stage shifter able to shift up to 63 positions in one cycle
(<a href="https://www.bitsavers.org/pdf/cdc/cyber/books/DesignOfAComputer_CDC6600.pdf#page=85">details</a>);
it was called a "parallel shifting network" rather than a "barrel shifter".
A Burroughs <a href="https://patents.google.com/patent/US3411139A/en">patent</a> filed in 1965
describes a barrel switch "capable of performing logical switching operations in a single time involving any amount of binary information," so the technology is older.</p>
<p>Early microprocessors shifted by one bit position at a time.
Although the Intel 8086 provided instructions to shift by multiple bits at a time, this was implemented
internally by a microcode loop, so the more bits you shifted, the longer the instruction took: four clock cycles
per bit.
Shifting on the 286 was faster, taking one additional cycle for each bit position shifted.
<!-- page B-97 of Programmer's Reference Manual -->
<!-- page 2-66 of The 8086 Family User's Manual -->
The first ARM processor (ARM1, 1985) included a <a href="https://www.righto.com/2015/12/reverse-engineering-arm1-ancestor-of.html#:~:text=inverter%20is%20active.-,The%20barrel%20shifter,-The%20barrel%20shifter">32-bit barrel shifter</a>.
It was considerably simpler than the 386's design, following ARM's RISC philosophy. <a class="footnote-backref" href="#fnref:barrel-shifter-history" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:64-bit">
<p>The 386 <a href="http://www.bitsavers.org/components/intel/80386/231732-001_80386_Hardware_Reference_Manual_1986.pdf#page=29">Hardware Reference Manual</a> states that the 386 contains a 64-bit barrel shifter.
I find this description a bit inaccurate, since the output is only 32 bits, so the barrel shifter is
much simpler than a full 64-bit barrel shifter. <a class="footnote-backref" href="#fnref:64-bit" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:metal">
<p>The 386 has two layers of metal. The vertical lines are in the lower layer of metal (metal 1) while
the horizontal lines are in the upper layer of metal (metal 2).
Transistors can only connect to lower metal, so the connection between the horizontal line and
the transistor uses a short piece of lower metal to bridge the layers. <a class="footnote-backref" href="#fnref:metal" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:multiplexer">
<p>Each row of the matrix can be considered a multiplexer with 8 inputs, implemented by 8 pass transistors.
One of the eight transistors is
activated, passing that input to the output. <a class="footnote-backref" href="#fnref:multiplexer" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:full-matrix">
<p>The image below shows the full shift matrix. Click the image for a much larger view.</p>
<p><a href="https://static.righto.com/images/386-shift/full-matrix.jpg"><img alt="The matrix with the metal layer removed." class="hilite" height="100" src="https://static.righto.com/images/386-shift/full-matrix-w600.jpg" title="The matrix with the metal layer removed." width="600" /></a><div class="cite">The matrix with the metal layer removed.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:full-matrix" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:keeper-structure">
<p>The keepers are arranged with 6 blocks of 3 on the left and 6 blocks of 3 on the right, plus an
additional one at the bottom right. <a class="footnote-backref" href="#fnref:keeper-structure" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:latch">
<p>The standard latch in the 386 consists of two cross-coupled inverters forming a static circuit to hold a bit.
The input goes through a transmission gate (back-to-back NMOS and PMOS transistors) to the inverters.
One inverter is weak, so it can be overpowered by the input.
The 8086, in contrast, uses dynamic latches that depend on the gate capacitance to hold a bit. <a class="footnote-backref" href="#fnref:latch" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:log">
<p>Some shifters take the idea of combining shift circuits to the extreme.
If you combine a shift-by-one circuit, a shift-by-two circuit, a shift-by-four circuit, and so forth,
you end up with a logarithmic shifter: selecting the appropriate stages provides an arbitrary shift.
(This design was used in the CDC 6600.)
This design has the advantage of reducing the amount of circuitry since it uses log<sub>2</sub>(N) layers rather than N layers.
However, the logarithmic approach has performance disadvantages since the signals need to go through
more circuitry.
<a href="https://www.princeton.edu/~rblee/ELE572Papers/Fall04Readings/Shifter_Schulte.pdf">This paper</a> describes
various design alternatives for barrel shifters. <a class="footnote-backref" href="#fnref:log" title="Jump back to footnote 8 in the text">↩</a></p>
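<p>As an illustration, a logarithmic shifter can be modeled in a few lines of Python (a behavioral sketch, not any particular hardware design): each stage either shifts by a fixed power of two or passes the value through, selected by one bit of the shift amount.</p>

```python
def log_shift_right(value, n, width=32):
    """Logarithmic shifter: log2(width) stages, where stage k shifts by
    2**k if bit k of the shift amount n is set, else passes data through."""
    stage = 1
    while stage < width:
        if n & stage:          # this bit of n enables this stage
            value >>= stage
        stage <<= 1
    return value
```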
</li>
<li id="fn:shift-history">
<p>The basic rotate left and right instructions date back to the Datapoint 2200, the
<a href="https://www.righto.com/2023/08/datapoint-to-8086.html">ancestor</a> of the 8086 and x86.
The rotate left through carry and rotate right through carry instructions in x86 were added in the
Intel 8008 processor and carried over unchanged to the 8080.
The MOS 6502 had a different set of rotates and shifts: arithmetic shift left, rotate left,
logical shift right, and rotate right; the rotate instructions rotated through the carry.
The Z-80 had a more extensive set: rotates left and right, either through the carry or not,
shift left, shift right logical, shift right arithmetic, and
4-bit digit rotates left and right through two bytes.
The 8086's set of rotates and shifts was similar to the Z-80, except it didn't have the digit rotates.
The 8086 also supported shifting and rotating by multiple positions.
This illustrates that there isn't a "natural" set of shift and rotate instructions.
Instead, different processors supported different instructions, with complexity generally
increasing over time. <a class="footnote-backref" href="#fnref:shift-history" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:word">
<p>The x86 uses "word" to refer to a 16-bit value and "double word" or "dword" to refer to a 32-bit value.
I'm going to ignore the word/dword distinction. <a class="footnote-backref" href="#fnref:word" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com5tag:blogger.com,1999:blog-6264947694886887540.post-67260365697573032322023-11-30T08:52:00.000-08:002023-12-01T09:20:06.354-08:00Inside the Intel 386 processor die: the clock circuit<p>Processors are driven by a clock, which controls the timing of each step inside the chip.
In this blog post, I'll examine the clock-generation circuitry inside the Intel 386 processor.
Earlier processors such as the 8086 (1978) were simpler, using two clock phases internally.
The Intel 386 processor (1985) was a pivotal development for Intel as it moved x86 to CMOS (as well as
being the first 32-bit x86 processor).
The 386's CMOS circuitry required four clock signals.
An external crystal oscillator provided the 386 with a single clock signal and the 386's internal circuitry
generated four carefully-timed internal clock signals from the external clock.</p>
<p>The die photo below shows the Intel 386 processor with the clock generation circuitry and clock pad highlighted in red.
The heart of a processor is the datapath, the components that hold and process data. In the 386, these components are in the lower left: the ALU (Arithmetic/Logic Unit), a barrel shifter to shift data, and the registers. These components form regular rectangular blocks, 32 bits wide.
In the lower right is the microcode ROM, which breaks down machine instructions into micro-instructions, the low-level steps of the instruction.
Other parts of the chip prefetch and decode instructions, and handle memory paging and segmentation.
All these parts of the chip run under the control of the clock signals.</p>
<p><a href="https://static.righto.com/images/386-clock/386-die-labeled.jpg"><img alt="The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version." class="hilite" height="638" src="https://static.righto.com/images/386-clock/386-die-labeled-w600.jpg" title="The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version." width="600" /></a><div class="cite">The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version.</div></p>
<h2>A brief discussion of clock phases</h2>
<p>Many processors use a two-phase clock to control the timing of the internal processing steps.
The idea is that the two clock phases alternate: first phase 1 is high, and then phase 2 is high, as shown below.
During each clock phase, logic circuitry processes data.
A circuit called a "transparent latch" is used to hold data between steps.<span id="fnref:dynamic"><a class="ref" href="#fn:dynamic">2</a></span>
The concept of a latch is that when a latch's clock input is high, the input passes through the latch.
But when the latch's clock input is low, the latch remembers its previous value.
With two clock phases, alternating latches are active one at a time, so data passes through the circuit
step by step, under the control of the clock.</p>
<p><a href="https://static.righto.com/images/386-clock/8080-clock.png"><img alt="The two-phase clock signal used by the Intel 8080 processor. The 8080 uses asymmetrical clock signals, with phase 2 longer than phase 1. From the 8080 datasheet." class="hilite" height="91" src="https://static.righto.com/images/386-clock/8080-clock-w600.png" title="The two-phase clock signal used by the Intel 8080 processor. The 8080 uses asymmetrical clock signals, with phase 2 longer than phase 1. From the 8080 datasheet." width="600" /></a><div class="cite">The two-phase clock signal used by the Intel 8080 processor. The 8080 uses asymmetrical clock signals, with phase 2 longer than phase 1. From the <a href="https://deramp.com/downloads/intel/8080%20Data%20Sheet.pdf#page=12">8080 datasheet</a>.</div></p>
<p>The diagram below shows an abstracted model of the processor circuitry.
The combinational logic (i.e. the gate logic) is divided into two blocks, with latches between each block.
During clock phase 1, the first block of latches passes its input through to the output. Thus, values
pass through the first logic block, the first block of latches, and the second logic block, and then wait.</p>
<p><a href="https://static.righto.com/images/386-clock/clock-phases1.png"><img alt="Action during clock phase 1." class="hilite" height="152" src="https://static.righto.com/images/386-clock/clock-phases1-w500.png" title="Action during clock phase 1." width="500" /></a><div class="cite">Action during clock phase 1.</div></p>
<p>During clock phase 2 (below), the first block of latches stops passing data through and holds the previous values.
Meanwhile, the second block of latches passes its data through. Thus, the first logic block receives new
values and performs logic operations on them.
When the clock switches to phase 1, processing continues as in the first diagram.
The point of this is that processing takes place under the control of the clock, with values passed step-by-step
between the two logic blocks.<span id="fnref:flipflop"><a class="ref" href="#fn:flipflop">1</a></span></p>
<p><a href="https://static.righto.com/images/386-clock/clock-phases2.png"><img alt="Action during clock phase 2." class="hilite" height="169" src="https://static.righto.com/images/386-clock/clock-phases2-w500.png" title="Action during clock phase 2." width="500" /></a><div class="cite">Action during clock phase 2.</div></p>
<p>This circuitry puts some requirements on the clock timing. First, the clock phases must not overlap.
If both clocks are active at the same time, data will flow out of control around the loop, messing up the results.<span id="fnref:alu"><a class="ref" href="#fn:alu">3</a></span>
Moreover, because the two clock phases probably don't arrive at the exact same time (due to differences in
the wiring paths), a "dead zone" is needed between the two phases, an interval where both clocks are low, to
ensure that the clocks don't overlap even if there are timing skews.
Finally, the clock frequency must be slow enough that the logic has time to compute its result before the clock switches.</p>
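<p>The alternating-latch scheme can be modeled with a toy Python simulation (my own illustration, using trivial "increment" logic blocks) showing data stepping around the loop under the two phases:</p>

```python
class Latch:
    """Transparent latch: output follows the input while clk is high,
    holds the last value while clk is low."""
    def __init__(self):
        self.out = 0
    def update(self, clk, data):
        if clk:
            self.out = data
        return self.out

# Two latch stages clocked on alternate phases, with an increment as
# the "combinational logic" between them.
a, b = Latch(), Latch()
value = 0
for _ in range(4):
    a.update(clk=1, data=value + 1)           # phase 1: first latch transparent
    value = b.update(clk=1, data=a.out + 1)   # phase 2: second latch transparent
# Each two-phase clock cycle moves the data through both logic blocks,
# so after 4 cycles the value has been incremented 8 times.
```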
<p>Many processors such as the 8080, 6502, and 8086 used this type of two-phase clocking.
Early processors such as the 8008 (1972) and 8080 (1974) required complicated external circuitry to produce two asymmetrical clock phases.<span id="fnref:8008"><a class="ref" href="#fn:8008">4</a></span>
For the 8080, Intel produced a special clock generator chip (the <a href="https://deramp.com/downloads/intel/D8224.pdf">8224</a>) that produced the two clock signals according to the required timing.
The Motorola 6800 (1974) required two non-overlapping (but at least symmetrical) clocks, produced by the
<a href="https://deramp.com/swtpc.com/MP_A2/MC6875/MC6875.htm">MC6875</a> clock generator chip.
The MOS 6502 processor (1975) simplified clock generation by producing the two phases internally (<a href="https://wilsonminesco.com/6502primer/ClkGen.html">details</a>) from a single clock input.
This approach was used by most later processors.</p>
<p>An important factor is that
the Intel 386 processor was implemented with CMOS circuitry, rather than the NMOS transistors of many earlier
processors.
A CMOS chip uses both NMOS transistors (which turn on when the gate is high) and PMOS transistors (which turn on when the gate is low).<span id="fnref:latch"><a class="ref" href="#fn:latch">7</a></span>
Thus, the 386 requires an active-high clock signal and an active-low clock signal for each phase,<span id="fnref:phases"><a class="ref" href="#fn:phases">5</a></span> four
clock signals in total.<span id="fnref:four-phase"><a class="ref" href="#fn:four-phase">6</a></span>
In the rest of this article, I'll explain how the 386 generates these four clock signals.</p>
<h2>The clock circuitry</h2>
<p>The block diagram below shows the components of the clock generation circuitry.
Starting at the bottom, the input clock signal (<em>CLK2</em>, at twice the desired frequency) is divided by two to generate
two drive signals with opposite phases.
These signals go to the large driver circuits in the middle, which generate the two main clock signals (phase 1 and phase 2).
Each driver sends an "inhibit" signal to the other when active, ensuring that the phases don't overlap.
Each driver also sends signals to a smaller driver that generates the inverted clock signal.
The "enable" signal shapes the output to prevent overlap.
The four clock output signals are then distributed to all parts of the processor.</p>
<p><a href="https://static.righto.com/images/386-clock/overview.png"><img alt="Block diagram of the clock circuitry. The layout of the blocks matches their approximate physical arrangement." class="hilite" height="440" src="https://static.righto.com/images/386-clock/overview-w400.png" title="Block diagram of the clock circuitry. The layout of the blocks matches their approximate physical arrangement." width="400" /></a><div class="cite">Block diagram of the clock circuitry. The layout of the blocks matches their approximate physical arrangement.</div></p>
<p>The diagram below shows a closeup of the clock circuitry on the die.
The external clock signal enters the die at the clock pad in the lower right. The signal is clamped
by protection diodes and a resistor before passing to the divide-by-two logic, which
generates the two clock phases.
The four driver blocks generate the high-current clock pulses
that are transmitted to the rest of the chip by the four output lines at the left.</p>
<p><a href="https://static.righto.com/images/386-clock/clock-details.jpg"><img alt="Details of the clock circuitry. This image shows the two metal layers. At the right, bond wires are connected to the pads on the die." class="hilite" height="475" src="https://static.righto.com/images/386-clock/clock-details-w450.jpg" title="Details of the clock circuitry. This image shows the two metal layers. At the right, bond wires are connected to the pads on the die." width="450" /></a><div class="cite">Details of the clock circuitry. This image shows the two metal layers. At the right, bond wires are connected to the pads on the die.</div></p>
<h2>Input protection</h2>
<p>The 386 has a pin "<em>CLK2</em>" that receives the external clock signal. It is called <em>CLK2</em> because this signal
has twice the frequency of the 386's clock. The chip package connects the <em>CLK2</em> pin through a tiny bond wire (visible above) to the <em>CLK2</em> pad on the silicon die.
The <em>CLK2</em> input has two protection diodes, created from MOSFETs, as shown in the schematic below.
If the input goes below ground or above +5 volts, the corresponding diode will turn on and clamp the
excess voltage, protecting the chip.
The schematic below shows how the diodes are constructed from an NMOS transistor and a PMOS transistor.
The schematic corresponds to the physical layout of the circuit, so power is at the bottom and the
ground is at the top.</p>
<p><a href="https://static.righto.com/images/386-clock/input.png"><img alt="The input protection circuit. The left shows the physical circuit built from an NMOS transistor and a PMOS transistor, while the right shows the equivalent diode circuit." class="hilite" height="193" src="https://static.righto.com/images/386-clock/input-w220.png" title="The input protection circuit. The left shows the physical circuit built from an NMOS transistor and a PMOS transistor, while the right shows the equivalent diode circuit." width="220" /></a><div class="cite">The input protection circuit. The left shows the physical circuit built from an NMOS transistor and a PMOS transistor, while the right shows the equivalent diode circuit.</div></p>
<p>The diagram below shows the implementation of these protection diodes (i.e. transistors) on the die.
Each transistor is much larger than the typical transistors inside the 386, because these transistors must
be able to handle high currents.
Physically, each transistor consists of 12 smaller (but still relatively large) transistors in parallel,
creating the stripes visible in the image.
Each transistor block is surrounded by two guard rings, which I will explain in the next section.</p>
<p><a href="https://static.righto.com/images/386-clock/clock-pad.jpg"><img alt="This diagram shows the circuitry next to the clock pad." class="hilite" height="277" src="https://static.righto.com/images/386-clock/clock-pad-w500.jpg" title="This diagram shows the circuitry next to the clock pad." width="500" /></a><div class="cite">This diagram shows the circuitry next to the clock pad.</div></p>
<h2>Latch-up and the guard rings</h2>
<p>The phenomenon of "latch-up" is the hobgoblin of CMOS circuitry, able to destroy a chip.
Regions of the silicon die are doped with impurities to form N-type and P-type silicon.
The problem is that the N- and P-doped regions in a CMOS chip can act as parasitic NPN and PNP transistors.
In some circumstances, these transistors can turn on, shorting power and ground. Inconveniently, the transistors
latch into this state until the power is removed or the chip burns up.
The diagram below shows how the substrate, well, and source/drain regions can combine to act as unwanted transistors.<span id="fnref:scr"><a class="ref" href="#fn:scr">8</a></span></p>
<p><a href="https://static.righto.com/images/386-clock/latchup.png"><img alt="This diagram illustrates how the parasitic NPN and PNP transistors are formed in a CMOS chip. Note that the 386's construction is opposite from this diagram, with an N substrate and P well. Image by Deepon, CC BY-SA 3.0." class="hilite" height="229" src="https://static.righto.com/images/386-clock/latchup-w500.png" title="This diagram illustrates how the parasitic NPN and PNP transistors are formed in a CMOS chip. Note that the 386's construction is opposite from this diagram, with an N substrate and P well. Image by Deepon, CC BY-SA 3.0." width="500" /></a><div class="cite">This diagram illustrates how the parasitic NPN and PNP transistors are formed in a CMOS chip. Note that the 386's construction is opposite from this diagram, with an N substrate and P well. Image by <a href="https://commons.wikimedia.org/wiki/File:Latchup.png">Deepon</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0">CC BY-SA 3.0</a>.</div></p>
<p>Normally, P-doped substrate or wells are connected to ground and the N-doped substrate or wells are connected to +5 volts.
As a result, the regions act as reverse-biased diodes and no current flows through the substrate.
However, a voltage fluctuation or large current can disturb the reverse biasing and
the resulting current flow will turn on these parasitic transistors.
Unfortunately, these parasitic transistors drive each other in a feedback loop, so once they get started, they
will conduct more and more strongly and won't stop until the chip is powered down.
The risk of latch-up is highest with circuits connected to the unpredictable voltages of the outside world,
or high-current circuits that can cause power fluctuations.
The clock circuitry has both of these risks.</p>
<p>One way of protecting against latch-up is to put a guard ring around a potentially risky circuit.
This guard ring will conduct away the undesired substrate current before it can cause latch-up.
In the case of the 386, two concentric guard rings are used for additional protection.<span id="fnref:guard"><a class="ref" href="#fn:guard">9</a></span>
In the earlier die photo, these guard rings can be seen surrounding the transistors.
Guard rings will also play a part in the circuitry discussed below.</p>
<h2>Polysilicon resistor</h2>
<p>After the protection diodes, the clock signal passes through a polysilicon resistor, followed by another protection diode.
Polysilicon is a special form of silicon that is used for wiring and also forms the transistor gates.
The polysilicon layer sits on top of the base silicon; polysilicon has a moderate amount of resistance,
considerably more than metal, so it can be used as a resistor.</p>
<p>The image below shows the polysilicon resistor along with a protection diode.
This circuit provides additional protection against transients in the clock signal.<span id="fnref:resistor"><a class="ref" href="#fn:resistor">10</a></span>
This circuit is surrounded by two concentric guard rings for more latch-up protection.</p>
<p><a href="https://static.righto.com/images/386-clock/resistor.jpg"><img alt="The polysilicon resistor and associated diode." class="hilite" height="304" src="https://static.righto.com/images/386-clock/resistor-w400.jpg" title="The polysilicon resistor and associated diode." width="400" /></a><div class="cite">The polysilicon resistor and associated diode.</div></p>
<h2>The divide-by-two logic</h2>
<p>The input clock to the 386 runs at twice the frequency of the internal clock. The circuit below divides
the input clock by 2, producing complemented outputs.
This circuit consists of two set-reset latch stages, one driven by the input clock inverted and the second
driven by the input clock, so the circuit will update once per input clock cycle.
Since there are three inversions in the loop, the output will be inverted for each update, so it will cycle at
half the rate of the input clock.
The reset input is asymmetrical: when it is low, it will force the output low and the complemented output high.
Presumably, this ensures that the processor starts with the correct clock phase when exiting the reset state.</p>
<p><a href="https://static.righto.com/images/386-clock/divider2.png"><img alt="The divide-by-two circuit." class="hilite" height="234" src="https://static.righto.com/images/386-clock/divider2-w550.png" title="The divide-by-two circuit." width="550" /></a><div class="cite">The divide-by-two circuit.</div></p>
<p>I have numbered the gates above to match their physical locations below.
In this image, I have etched the chip down to the silicon so you can see the active silicon regions.
Each logic gate consists of PMOS transistors in the upper half and NMOS transistors in the lower half.
The thin stripes are the transistor gates; the two-input NAND gates have two PMOS transistors and two NMOS transistors, while the three-input NAND gates have three of each transistor.
The AND-NOR gates need to drive other circuits, so they use paralleled transistors and are much larger.
Each AND-NOR gate contains 12 PMOS transistors, four for each input, but uses only 9 NMOS transistors.
Finally, the inverter (7) inverts the input clock signal for this circuit.
The transistors in each gate are sized to maximize performance and minimize power consumption.
The two outputs from the divider then go through large inverters (not shown) that feed the driver circuits.<span id="fnref:logic"><a class="ref" href="#fn:logic">11</a></span></p>
<p><a href="https://static.righto.com/images/386-clock/divider-die.jpg"><img alt="The silicon for the divide-by-two circuit as it appears on the die." class="hilite" height="272" src="https://static.righto.com/images/386-clock/divider-die-w600.jpg" title="The silicon for the divide-by-two circuit as it appears on the die." width="600" /></a><div class="cite">The silicon for the divide-by-two circuit as it appears on the die.</div></p>
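<p>Behaviorally, the divider is just a toggle on each rising edge of the input clock. A Python sketch of this behavior (ignoring the gate-level set-reset details) shows the two complementary half-frequency outputs:</p>

```python
def divide_by_two(clk_in):
    """Behavioral model of the clock divider: the output toggles on each
    rising edge of the input clock, running at half the input frequency,
    alongside its complemented output."""
    state, prev = 0, 0
    out, out_n = [], []
    for c in clk_in:
        if c and not prev:      # rising edge of the input clock
            state ^= 1          # odd number of inversions in the loop: toggle
        out.append(state)
        out_n.append(1 - state)
        prev = c
    return out, out_n
```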
<h2>The drivers</h2>
<p>Because the clock signals must be transmitted to all parts of the die, large transistors are required to
generate the high-current pulses.
These large transistors, in turn, are driven by medium-sized transistors.
Additional driver circuitry ensures that the clock signals do not overlap.
There are four driver circuits in total.
The two larger, lower driver circuits generate the positive clock pulses.
These drivers control the two smaller, upper driver circuits that generate the inverted clock pulses.</p>
<p>First, I'll discuss the larger, positive driver circuit.
The core of the driver consists of the large PMOS transistor (1) to pull the output high, and the large NMOS
transistor (1) to pull the output low.
Each transistor is driven by two inverters (2/3 and 6/7 respectively).
The circuit also produces two signals to shape the outputs from the other drivers.
When the clock output is high, the "inhibit" signal goes to the other lower driver and inhibits that
driver from pulling its output high.<span id="fnref:inhibit"><a class="ref" href="#fn:inhibit">12</a></span>
This prevents overlap in the output between the two drivers.
When the clock output is low, an "enable" output goes to the inverted driver (discussed below) to enable its
output.
The transistor sizes and propagation delays in this circuit are carefully designed to shape the internal
clock pulses as needed.</p>
<p><a href="https://static.righto.com/images/386-clock/lower-driver-schematic2.png"><img alt="Schematic of the lower driver." class="hilite" height="214" src="https://static.righto.com/images/386-clock/lower-driver-schematic2-w500.png" title="Schematic of the lower driver." width="500" /></a><div class="cite">Schematic of the lower driver.</div></p>
<p>The diagram below shows how this driver is implemented on the die.
The left image shows the two metal layers. The right image shows the transistors on the underlying silicon.
The upper section holds PMOS transistors, while the lower section holds NMOS transistors. Because PMOS transistors
have poorer performance than NMOS transistors, they need to be larger, so the PMOS section is larger.
The transistors are numbered, corresponding to the schematic above.
Each transistor is physically constructed from multiple transistors in parallel.
The two guard rings are visible in the silicon, surrounding and separating the PMOS and NMOS regions.</p>
<p><a href="https://static.righto.com/images/386-clock/lower-driver-diagram.jpg"><img alt="One of the lower drivers. The left image shows metal while the right image shows silicon." class="hilite" height="431" src="https://static.righto.com/images/386-clock/lower-driver-diagram-w700.jpg" title="One of the lower drivers. The left image shows metal while the right image shows silicon." width="700" /></a><div class="cite">One of the lower drivers. The left image shows metal while the right image shows silicon.</div></p>
<p>The 386 has two layers of metal wiring. In this circuit, the top metal layer (M2) provides +5 for the PMOS
transistors, ground for the NMOS transistors, and receives the output, all through large rectangular regions.
The lower metal layer (M1) provides the physical source and drain connections to the transistors as well as
the wiring between the transistors. The pattern of the lower metal layer is visible in the left photo.
The dark circles are connections between the lower metal layer and the transistors or the upper metal layer.
The connections to the two guard rings are visible around the edges.</p>
<p>Next, I'll discuss the two upper drivers that provided the inverted clock signals.
These drivers are smaller, presumably because less circuitry needs the inverted clocks.
Each upper driver is controlled by enable and drive from the corresponding lower driver.
As before, two large transistors pull the output high or low, and are driven by inverters.
The enable input must be high for inverter 4 to go low.
Curiously, the enable input is wired to the output of inverter 4.
Presumably, this provides a bit of shaping to the signal.</p>
<p><a href="https://static.righto.com/images/386-clock/upper-driver-schematic2.png"><img alt="Schematic of the upper driver." class="hilite" height="237" src="https://static.righto.com/images/386-clock/upper-driver-schematic2-w450.png" title="Schematic of the upper driver." width="450" /></a><div class="cite">Schematic of the upper driver.</div></p>
<p>The layout (below) is roughly similar to the previous driver, but smaller.
The driver transistors (1) are arranged vertically rather than horizontally, so the metal 2 rectangle to get
the output is on the
left side rather than in the middle.
The transistor wiring is visible in the lower (metal 1) layer, running vertically through the circuit.
As before, two guard rings surround the PMOS and NMOS regions.</p>
<p><a href="https://static.righto.com/images/386-clock/upper-driver-diagram2.jpg"><img alt="One of the upper drivers. The left image shows metal while the right image shows silicon." class="hilite" height="473" src="https://static.righto.com/images/386-clock/upper-driver-diagram2-w500.jpg" title="One of the upper drivers. The left image shows metal while the right image shows silicon." width="500" /></a><div class="cite">One of the upper drivers. The left image shows metal while the right image shows silicon.</div></p>
<h2>Distribution</h2>
<p>Once the four clock signals have been generated, they are distributed to all parts of the chip.
The 386 has two metal layers. The top metal layer (M2) is thicker, so it has lower resistance and is
used for clock (and power) distribution where possible.
The clock signal will use the lower M1 metal layer when necessary to cross other M2 signals, as well as
for branch lines off the main clock lines.</p>
<p>The diagram below shows part of the clock distribution network; the four parallel clock lines are visible
similarly throughout the chip. The clock signal arrives at the upper right
and travels to the datapath circuitry on the left.
As you can see, the four clock lines are much wider than the thin signal lines; this width reduces the
resistance of the wiring, which reduces the RC (resistive-capacitive) delay of the signals.
The outlined squares at each branch are the vias, connections between the two metal
layers.
At the right, the incoming clock signals are in layer M1 and zig-zag to cross under other signals in M2.
The clock distribution scheme in the 386 is much simpler than in modern processors.</p>
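<p>To see why the width matters, here's a toy calculation. The sheet resistance and capacitance values are made up for illustration, and this simple model ignores the extra capacitance that a wider wire adds:</p>

```python
# Toy model of why wide clock lines help: wire resistance scales as
# length/width, so widening the wire cuts the lumped RC delay.
# (Illustrative numbers only; not actual 386 process parameters.)
def wire_rc_seconds(length_um, width_um, sheet_ohms=0.05, cap_ff_per_um=0.2):
    r = sheet_ohms * length_um / width_um     # R = Rs * L / W
    c = cap_ff_per_um * length_um * 1e-15     # C proportional to length
    return r * c

narrow = wire_rc_seconds(5000, 1)   # a 5 mm run, 1 µm wide
wide = wire_rc_seconds(5000, 4)     # the same run, 4x wider
print(round(narrow / wide, 1))      # 4.0: quadrupling width quarters the delay
```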
<p><a href="https://static.righto.com/images/386-clock/clock-distribution-diagram.jpg"><img alt="Part of the wiring for clock distribution. This image spans about 1/5 of the chip's width." class="hilite" height="173" src="https://static.righto.com/images/386-clock/clock-distribution-diagram-w600.jpg" title="Part of the wiring for clock distribution. This image spans about 1/5 of the chip's width." width="600" /></a><div class="cite">Part of the wiring for clock distribution. This image spans about 1/5 of the chip's width.</div></p>
<h2>Clocks in modern processors</h2>
<p>The 386's internal clock speed was simply the external clock divided by 2.
However, modern processors allow the clock speed to be adjusted to optimize performance or to overclock the chip.
This is implemented by an on-chip PLL (Phase-Locked Loop) that generates the internal clock from
a fixed external clock, multiplying the clock speed by a selectable multiplier.
Intel introduced a PLL in the 80486 processor, but the multiplier was fixed until the Pentium.
<p>The Intel 386's clock can go up to 40 megahertz.
Although this was fast for the time, <a href="https://www.trustedreviews.com/news/intels-new-core-i9-13900ks-hits-6-0ghz-and-is-the-fastest-cpu-ever-4295014">modern processors</a> are over two orders of magnitude faster,
so keeping the clock synchronized in a modern processor requires complex techniques.<span id="fnref:clock-techniques"><a class="ref" href="#fn:clock-techniques">13</a></span>
With fast clocks, even the speed of light becomes a constraint; at 6 GHz, light can travel just 5 centimeters during a clock cycle.</p>
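<p>The speed-of-light limit is easy to check with a quick calculation:</p>

```python
# Back-of-the-envelope: how far light travels in one clock cycle.
C = 299_792_458  # speed of light, m/s

for freq_hz in (40e6, 6e9):  # the 386's 40 MHz vs. a modern 6 GHz chip
    cycle_s = 1 / freq_hz
    print(f"{freq_hz / 1e9:g} GHz: {C * cycle_s * 100:.1f} cm per cycle")
```

At 40 MHz, light covers about 7.5 meters per cycle, so propagation time was a non-issue for the 386; at 6 GHz, the budget shrinks to 5 centimeters.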
<!-- The RC delay (due to wire capacitance) is an order of magnitude larger. 1 ns for a 15mm square die. CMOS VLSI Design p793 -->
<p>The problem is to ensure that the clock arrives at all circuits at the same time, minimizing "clock skew".
Modern processors can reduce the clock skew to a few picoseconds.
The clock is typically distributed by a "clock tree", where the clock is split into branches with each
branch buffered and the same length, so the delays nearly match.
One approach is an "H-tree", which distributes the clock through an H-shaped path. Each leg of the H branches into
a smaller H recursively, forming a space-filling fractal, as shown below.</p>
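<p>The recursive construction is simple to sketch. This fragment generates the leaf points of an idealized H-tree; by construction, the wire path from the root to every leaf has the same length, which is what equalizes the delays (real trees, like the PowerPC's, get distorted by layout constraints):</p>

```python
# Sketch of H-tree clock distribution: each recursion level places a
# smaller H at the four tips of the current one, so every leaf sits at
# the end of an equal-length wire path from the root.
def h_tree_leaves(x, y, half_span, depth):
    if depth == 0:
        return [(x, y)]
    s = half_span / 2
    leaves = []
    for dx, dy in ((-half_span, -half_span), (-half_span, half_span),
                   (half_span, -half_span), (half_span, half_span)):
        leaves += h_tree_leaves(x + dx, y + dy, s, depth - 1)
    return leaves

points = h_tree_leaves(0, 0, 8, 3)
print(len(points), len(set(points)))  # 64 distinct clock endpoints
```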
<p><a href="https://static.righto.com/images/386-clock/powerpc-clock.png"><img alt="Clock distribution in a PowerPC chip. The recursive H pattern is only approximate since other layout factors constrain the clock tree. From ISSCC 2000." class="hilite" height="404" src="https://static.righto.com/images/386-clock/powerpc-clock-w450.png" title="Clock distribution in a PowerPC chip. The recursive H pattern is only approximate since other layout factors constrain the clock tree. From ISSCC 2000." width="450" /></a><div class="cite">Clock distribution in a PowerPC chip. The recursive H pattern is only approximate since other layout factors constrain the clock tree. From <a href="https://doi.org/10.1109/ISSCC.2000.839705">ISSCC 2000</a>.</div></p>
<p>Delay circuitry can actively compensate for differences in path time.
A Delay-Locked Loop (DLL) circuit adds variable delays to counteract variations along different clock paths.
The Itanium used a clock distribution hierarchy with global, regional, and local distribution of the clock.
The main clock was distributed to eight regions that each deskewed the clock (in 8.5 ps steps) and drove a regional clock grid, keeping the clock skew under 28 ps.
The Pentium 4's complex distribution tree and skew compensation circuitry got clock skew below ±8 ps.</p>
<!--
For example, the clock skew in the Pentium 4 is [said](https://people.computing.clemson.edu/~mark/464/transistors.html) to be 51 picoseconds.
-->
<h1>Conclusions</h1>
<p>The 386's clock circuitry turned out to be more complicated than I expected, with a lot of subtlety and
complications.
However, examining the circuit illustrates several features of CMOS design, from latch circuits and
high-current drivers to guard rings and multi-phase clocks.
Hopefully you have found this interesting.</p>
<p>I plan to write more about the 386, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon occasionally as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.</p>
<p>Thanks to William Jones for discussing a couple of errors.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:flipflop">
<p>You might wonder why processors use transparent latches and two clock phases instead of using edge-triggered flip-flops and a single clock phase.
First, edge-triggered flip-flops take at least twice as many transistors as latches.
(An edge-triggered flip flop is often built from two latch stages.)
Second, the two-phase approach allows processing to happen twice per clock cycle, rather than once per
clock cycle.
This may allow a faster implementation with more pipelining. <a class="footnote-backref" href="#fnref:flipflop" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:dynamic">
<p>The transparent latch was implemented by a single pass transistor in processors such as the MOS 6502. When the transistor was on,
the input signal passed through.
But when the transistor was off, the former value was held by the transistor's gate capacitance.
Eventually the charge on the gate would leak away (like DRAM), so a minimum clock speed was required
for reliable operation. <a class="footnote-backref" href="#fnref:dynamic" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:alu">
<p>To see why having multiple stages active at once is bad, here's a simplified example.
Consider a circuit that increments the accumulator register.
In the first clock phase, the accumulator's value might go through the adder circuit. In the second clock
phase, the new value can be stored in the accumulator.
If both clock phases are high at the same time, the circuit will form a loop and the accumulator will get
incremented multiple times, yielding the wrong result.
Moreover, different parts of the adder probably have different delays, so the result is likely to be complete
garbage. <a class="footnote-backref" href="#fnref:alu" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:8008">
<p>To generate the clocks for the Intel 8008 processor,
the <a href="http://www.bitsavers.org/components/intel/MCS8/Intel_8008_8-Bit_Parallel_Central_Processing_Unit_Rev1_Apr72.pdf#page=23">suggested circuit</a> used four analog (<a href="https://pdf1.alldatasheet.com/datasheet-pdf/view/8090/NSC/9602.html">one-shot</a>) delays to generate the clock phases.
The 8008 and 8080 required asymmetrical clocks because the two blocks of logic took different
amounts of time to process their inputs. The asymmetrical clock minimized wasted time, improving performance.
(More discussion <a href="https://bread80.com/2022/11/07/designing-an-intel-8008-computer-part-1-power-clocks-and-signals/">here</a>.) <a class="footnote-backref" href="#fnref:8008" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:phases">
<p>You might think that the 386 could use two clock signals:
one latch could use phase 1 for NMOS and
phase 2 for PMOS, while the next stage is the other way around.
Unfortunately, that won't work because the two phases aren't exactly complements.
During the "dead time" when phase 1 and phase 2 are both low, the PMOS transistors
for both stages will turn on, causing problems. <a class="footnote-backref" href="#fnref:phases" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:four-phase">
<p>Even though the 80386 has four clock signals internally, there are really just two clock phases.
This is different from <a href="https://en.wikipedia.org/wiki/Four-phase_logic">four-phase logic</a>, a type of
logic that was used in the late 1960s in some MOS processor chips.
Four-phase logic was said to provide 10 times the density, 10 times the speed, and 1/10 the power consumption
of standard MOS logic techniques.
Designer Lee Boysel was a strong proponent of four-phase logic, forming the company Four Phase Systems and
building a processor from a small number of MOS chips.
Improvements in MOS circuitry in the 1970s (in particular <a href="https://en.wikipedia.org/wiki/Depletion-load_NMOS_logic">depletion-mode logic</a>) made four-phase logic obsolete. <a class="footnote-backref" href="#fnref:four-phase" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:latch">
<p>The clocking scheme in the 386 is closely tied to the latch circuit used in the processor, shown below.
This is a transparent latch:
when <em>enable</em> is high and the complemented <em>enable</em> is low, the input is passed
through to the output (inverted).
When <em>enable</em> is low and the complemented <em>enable</em> is high, the latch remembers the previous value.
The important factor is that the <em>enable</em> and complemented <em>enable</em> inputs must switch in lockstep.
(In comparison, earlier chips such as the 8086 used a dynamic latch built from one transistor that
used a single <em>enable</em> input.)</p>
<p><a href="https://static.righto.com/images/386-clock/latch.png"><img alt="The basic latch circuit used in the 386." class="hilite" height="178" src="https://static.righto.com/images/386-clock/latch-w500.png" title="The basic latch circuit used in the 386." width="500" /></a><div class="cite">The basic latch circuit used in the 386.</div></p>
<p>The circuit on the right shows the implementation of the 386 latch.
The two transistors on the left form a transmission gate: when both transistors are on, the input is passed
through, but when both transistors are off, the input is blocked.
Data storage is implemented through the two inverters connected in a loop.
The bottom inverter is "weak", generating a small output current. Because of this, its output will be overpowered
by the input, replacing the value stored in the latch.
This latch uses 6 transistors in total.</p>
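<p>As a behavioral sketch (not a circuit simulation), the latch's logic can be captured in a few lines of Python:</p>

```python
# Behavioral model of the transparent latch: when enable is high, the
# transmission gate passes the input to the inverter pair (inverted);
# when enable is low, the weak inverter loop holds the last value.
def latch(state, data_in, enable):
    if enable:
        return 1 - data_in   # transparent: output is the inverted input
    return state             # opaque: hold the stored value

q = latch(None, data_in=1, enable=True)   # transparent: q becomes 0
q = latch(q, data_in=0, enable=False)     # enable low: input is ignored
print(q)  # 0
```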
<p>The 386 uses several variants of the latch circuit, for instance with <em>set</em> or <em>reset</em> inputs, or multiplexers
to select multiple data inputs. <a class="footnote-backref" href="#fnref:latch" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:scr">
<p>The parasitic transistors responsible for latch-up can also be viewed as an SCR (silicon-controlled rectifier) or thyristor.
An SCR is a four-layer (PNPN) silicon device that is switched on by its gate and remains on until power is removed.
SCRs were popular in the 1970s for high-current applications, but have been replaced by transistors in many cases. <a class="footnote-backref" href="#fnref:scr" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:guard">
<p>The 386 uses two guard rings to prevent latch-up.
NMOS transistors are surrounded by an inner N+ guard ring connected to ground and an outer P+ guard ring connected
to +5.
The guard rings are reversed for PMOS transistors.
<a href="https://teamvlsi.com/2020/05/latch-up-prevention-in-cmos-design.html">This page</a> has a diagram showing
how the guard rings prevent latch-up. <a class="footnote-backref" href="#fnref:guard" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:resistor">
<p>The polysilicon resistor appears to be unique to the clock input. My hypothesis is that the <em>CLK2</em> signal
runs at a much higher frequency than other inputs (since it is twice the clock frequency), which raises
the risk of ringing or other transients.
If these transients go below ground, they could cause latch-up, motivating additional protection on the
clock input. <a class="footnote-backref" href="#fnref:resistor" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:logic">
<p>To keep the main article focused, I'll describe the inverters in this footnote.
The circuitry below is between the divider logic and the polysilicon resistor, and consists of six
inverters of various sizes. The large inverters 1 and 2 buffer the output from the divider to send to
the drivers.
Inverter 3 is a small inverter that drives larger inverter 4. I think this clock signal goes to the bus
interface logic, perhaps to ensure that communication with the outside world is synchronized with the
external clock, rather than the internal clock, which is shaped and perhaps slightly delayed.
The output of small inverter 5 appears to be unused. My hypothesis is that this is a "dummy" inverter
to match inverter 3 and ensure that both clock phases have identical circuitry. Otherwise, the load from
inverter 3 might make that phase switch slightly slower.</p>
<p><a href="https://static.righto.com/images/386-clock/inverter.jpg"><img alt="The inverters that buffer the divider's output." class="hilite" height="271" src="https://static.righto.com/images/386-clock/inverter-w600.jpg" title="The inverters that buffer the divider's output." width="600" /></a><div class="cite">The inverters that buffer the divider's output.</div></p>
<p>The final block of logic is shown below. This logic appears to take the chip reset signal from the reset
pin and synchronize it with the clock. The first three latches use the <em>CLK2</em> input as the clock, while
the last two latches use the internal clock.
Using the external reset signal directly would risk <a href="https://en.wikipedia.org/wiki/Metastability_(electronics)">metastability</a> because the reset signal could change asynchronously with respect to the rest of the system.
The latches ensure that the timing of the reset signal matches the rest of the system,
minimizing the risk of metastability.
The NAND gate generates a reset pulse that resets the divide-by-two counter to ensure that it starts in
a predictable state.</p>
<p><a href="https://static.righto.com/images/386-clock/reset-schematic.png"><img alt="The reset synchronizer. (Click for a larger image.)" class="hilite" height="96" src="https://static.righto.com/images/386-clock/reset-schematic-w750.png" title="The reset synchronizer. (Click for a larger image.)" width="750" /></a><div class="cite">The reset synchronizer. (Click for a larger image.)</div></p>
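<p>Behaviorally, the latch chain acts as a shift register that carries the sampled reset into the clock domain. The sketch below shows the delay through the chain; a purely digital model like this can't show the metastability resolution that justifies the real circuit:</p>

```python
# Behavioral sketch of a reset synchronizer: the asynchronous input is
# shifted through a chain of clocked latches, so the output changes only
# in step with the clock, several cycles after the input.
def synchronize(async_levels, stages=3):
    chain = [0] * stages
    synced = []
    for level in async_levels:
        chain = [level] + chain[:-1]   # each clock, values move one stage
        synced.append(chain[-1])       # output trails the input by 'stages'
    return synced

print(synchronize([1, 1, 1, 0, 0, 0]))  # [0, 0, 1, 1, 1, 0]
```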
<p><!-- --> <a class="footnote-backref" href="#fnref:logic" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:inhibit">
<p>The gate (2) that receives the inhibit signal is a bit strange, a cross between an inverter and a NAND gate.
The gate goes low if the clk' input is high, but goes high only if both inputs are low.
In other words, it acts like an inverter but the inhibit signal blocks the transition to the high output.
Instead, the output will "float" with its previous low value.
This will keep the driver's output low, ensuring that it doesn't overlap with the other driver's high output.</p>
<p>The upper driver has a similar gate (4), except the extra input (enable) is on the NMOS side so the
polarity is reversed. That is, the
enable input must be high in order for the inverter to go low. <a class="footnote-backref" href="#fnref:inhibit" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
<li id="fn:clock-techniques">
<p>An interesting 2004 presentation is <a href="https://web.stanford.edu/class/archive/ee/ee371/ee371.1066/lectures/Old/lect_08_2up.pdf">Clocking for High Performance Processors</a>.
A 2005 <a href="https://web.stanford.edu/class/ee380/Abstracts/ClockingInMicroprocessorsLecture2005-final.pdf">Intel presentation</a> also discusses clock distribution. <a class="footnote-backref" href="#fnref:clock-techniques" title="Jump back to footnote 13 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com4tag:blogger.com,1999:blog-6264947694886887540.post-65657626798149783172023-11-09T08:52:00.001-08:002023-11-09T09:53:59.015-08:00Reverse engineering the Intel 386 processor's register cell<p>The groundbreaking Intel 386 processor (1985) was the first 32-bit processor in the x86 line.
It has numerous internal registers: general-purpose registers, index registers, segment selectors, and
more specialized registers.
In this blog post, I look at the silicon die of the 386 and explain how some of these registers are
implemented at the transistor level.
The registers that I examined are implemented as static RAM, with each bit stored in a common 8-transistor circuit, known as "8T".
Studying this circuit shows the interesting layout techniques that Intel used to squeeze two storage cells together to minimize the space they require.</p>
<p>The diagram below shows the internal structure of the 386. I have marked the relevant registers with three red
boxes. Two sets of registers are in the segment descriptor cache, presumably holding cache entries, and one set is at the bottom of
the data path. Some of the registers at the bottom are 32 bits wide, while others are half as wide and
hold 16 bits. (There are more registers with different circuits, but I
won't discuss them in this post.)</p>
<p><a href="https://static.righto.com/images/386-reg/die-labeled-regs.jpg"><img alt="The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici." class="hilite" height="641" src="https://static.righto.com/images/386-reg/die-labeled-regs-w600.jpg" title="The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici." width="600" /></a><div class="cite">The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici.</div></p>
<h2>The 6T and 8T static RAM cells</h2>
<p>First, I'll explain how a 6T or 8T static cell holds a bit.
The basic idea behind a static RAM cell is to connect two inverters into a loop.
This circuit will be stable, with one inverter on and one inverter off, and each inverter supporting the other.
Depending on which inverter is on,
the circuit stores a 0 or a 1.</p>
<p><a href="https://static.righto.com/images/386-reg/inverter-loop.png"><img alt="Two inverters in a loop can store a 0 or a 1." class="hilite" height="121" src="https://static.righto.com/images/386-reg/inverter-loop-w250.png" title="Two inverters in a loop can store a 0 or a 1." width="250" /></a><div class="cite">Two inverters in a loop can store a 0 or a 1.</div></p>
<p>To write a new value into the circuit, two signals are fed in, forcing the inverters to the desired new values.
One inverter receives the new bit value, while the other inverter receives the complemented bit value.
This may seem like a brute-force way to update the bit, but it works.
The trick is that the inverters in the cell are small and weak, while the input signals are higher current,
able to overpower the inverters.<span id="fnref:flip"><a class="ref" href="#fn:flip">1</a></span>
The write data lines (called bitlines) are connected to the inverters by pass transistors.<span id="fnref:pass"><a class="ref" href="#fn:pass">2</a></span> When the pass transistors are on, the
signals on the write lines can pass through to the inverters. But when the pass transistors are off, the
inverters are isolated from the write lines.
Thus, the write control signal enables writing a new value to the inverters.
(This signal is called a wordline since it controls access to a word of storage.)
Since each inverter consists of two transistors<span id="fnref:inverter"><a class="ref" href="#fn:inverter">7</a></span>, the circuit below consists of six transistors,
forming the 6T storage cell.</p>
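<p>To make the write-by-overpowering idea concrete, here's a toy Python model of the cell's digital behavior (just the logic described above, not a circuit simulation):</p>

```python
# Toy model of a 6T static cell: two cross-coupled inverters plus pass
# transistors gated by the wordline.
class StaticCell:
    def __init__(self):
        self.q = 0          # the stored bit (one inverter's output)

    def write(self, bit, wordline):
        if wordline:        # pass transistors on: the strong bitline
            self.q = bit    # signals overpower the weak inverters
        # wordline low: the inverter loop keeps reinforcing the old value

    def read(self):
        return self.q

cell = StaticCell()
cell.write(1, wordline=True)
cell.write(0, wordline=False)   # wordline off, so the write is ignored
print(cell.read())              # 1
```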
<p><a href="https://static.righto.com/images/386-reg/simple-cell.png"><img alt="Adding pass transistors so the cell can be written." class="hilite" height="125" src="https://static.righto.com/images/386-reg/simple-cell-w350.png" title="Adding pass transistors so the cell can be written." width="350" /></a><div class="cite">Adding pass transistors so the cell can be written.</div></p>
<p>The 6T cell uses the same bitlines for reading and writing.
Adding two transistors creates the 8T circuit, which has the advantage that you can read one register
and write to another register at the same time. (I.e. the register file is two-ported.)
In the 8T cell below, two additional transistors (G and H) are used for reading.
Transistor G buffers the cell's value; it turns on if the inverter output is high, pulling the read output bitline low.<span id="fnref:precharge"><a class="ref" href="#fn:precharge">3</a></span>
Transistor H is a pass transistor that blocks this signal until a read is performed on this register;
it is controlled by a read wordline.</p>
<p><a href="https://static.righto.com/images/386-reg/cell-schematic.png"><img alt="Schematic of a storage cell. Each transistor is labeled with a letter." class="hilite" height="155" src="https://static.righto.com/images/386-reg/cell-schematic-w500.png" title="Schematic of a storage cell. Each transistor is labeled with a letter." width="500" /></a><div class="cite">Schematic of a storage cell. Each transistor is labeled with a letter.</div></p>
<p>To form registers (or memory), a grid is constructed from these cells.
Each row corresponds to a register, while each column corresponds to a bit position.
The horizontal lines are the wordlines, selecting which word to access, while the
vertical lines are the bitlines, passing bits in or out of the registers.
For a write, the vertical bitlines provide the 32 bits (along with their complements).
For a read, the vertical bitlines receive the 32 bits from the register.
A wordline is activated to read or write the selected register.</p>
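<p>Behaviorally, the grid acts as a small two-ported register file: one register can be read while another is written in the same cycle. A minimal sketch in Python (the register count and width here are arbitrary):</p>

```python
# Sketch of the 8T grid's behavior: separate read and write wordlines
# let one register be read while another is written in the same cycle.
class RegisterFile:
    def __init__(self, num_regs, width):
        self.regs = [0] * num_regs
        self.width = width

    def cycle(self, read_wordline, write_wordline=None, write_value=0):
        mask = (1 << self.width) - 1
        if write_wordline is not None:      # assert a write wordline
            self.regs[write_wordline] = write_value & mask
        return self.regs[read_wordline]     # bits leave on the read bitlines

rf = RegisterFile(num_regs=8, width=32)
rf.cycle(read_wordline=0, write_wordline=3, write_value=0xDEADBEEF)
print(hex(rf.cycle(read_wordline=3)))       # 0xdeadbeef
```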
<p><a href="https://static.righto.com/images/386-reg/grid.png"><img alt="Static memory cells (8T) organized into a grid." class="hilite" height="433" src="https://static.righto.com/images/386-reg/grid-w500.png" title="Static memory cells (8T) organized into a grid." width="500" /></a><div class="cite">Static memory cells (8T) organized into a grid.</div></p>
<h2>Silicon circuits in the 386</h2>
<p>Before showing the layout of the circuit on the die, I should give a bit of background on the technology used
to construct the 386.
The 386 was built with CMOS technology, with NMOS and PMOS transistors working together, an advance over the
earlier x86 chips that were built with NMOS transistors.
Intel called this CMOS technology CHMOS-III (complementary high-performance metal-oxide-silicon), with 1.5 µm features.
While Intel's earlier chips had a single metal layer, CHMOS-III provided two metal layers, making signal
routing much easier.</p>
<p>Because CMOS uses both NMOS and PMOS transistors, fabrication is more complicated.
In an MOS integrated circuit, a transistor is formed where a polysilicon wire crosses active silicon,
creating the transistor's gate.
A PMOS transistor is constructed directly on the silicon substrate (which is N-doped).
However, an NMOS transistor is the opposite, requiring a P-doped substrate.
This is created by forming a P well, a region
of P-doped silicon that holds NMOS transistors.
Each P well must be connected to ground; this is accomplished by connecting ground to specially-doped regions of the P well, called "well taps".</p>
<p>The diagram below shows a cross-section through two transistors, showing the layers of the chip.
There are four important layers: silicon (which has some regions doped to form active
silicon), polysilicon for wiring and transistors, and the two metal layers.
At the bottom is the silicon, with P or N doping; note the P-well for the NMOS transistor on the left.
Next is the polysilicon layer.
At the top are the two layers of metal, named M1 and M2.
Conceptually, the chip is constructed from flat layers, but the layers have a three-dimensional
structure influenced by the layers below.
The layers are separated by silicon dioxide ("ox") or silicon oxynitride<span id="fnref:oxynitride"><a class="ref" href="#fn:oxynitride">4</a></span>; the
oxynitride under M2 caused me considerable difficulty.</p>
<p><a href="https://static.righto.com/images/386-reg/process-cross-section.jpg"><img alt="A cross-section of circuitry formed with the CHMOS-III process. From A double layer metal CHMOS III technology." class="hilite" height="244" src="https://static.righto.com/images/386-reg/process-cross-section-w600.jpg" title="A cross-section of circuitry formed with the CHMOS-III process. From A double layer metal CHMOS III technology." width="600" /></a><div class="cite">A cross-section of circuitry formed with the CHMOS-III process. From <a href="https://doi.org/10.1109/IEDM.1984.190640">A double layer metal CHMOS III technology</a>.</div></p>
<p>The image below shows how circuitry appears on the die;<span id="fnref:layers"><a class="ref" href="#fn:layers">5</a></span>
I removed the metal layers to show the silicon and polysilicon that form transistors.
(As will be described below, this image shows two static cells, holding two bits.)
The pinkish and dark regions are active silicon, doped to take part in the circuits, while the "background" silicon
can be ignored.
The green lines are polysilicon lines on top of the silicon.
Transistors are the most important feature here: a transistor gate is formed when polysilicon crosses active
silicon, with the source and drain on either side.
The upper part of the image has PMOS transistors, while the lower part of the image has the P well that holds
NMOS transistors. (The well itself is not visible.)
In total, the image shows four PMOS transistors and 12 NMOS transistors.
At the bottom, the well taps connect the P well to ground.
Although the metal has been removed, the contacts between the lower metal layer (M1) and the silicon or
polysilicon are visible as faint circles.</p>
<p><a href="https://static.righto.com/images/386-reg/die-components.jpg"><img alt="A (heavily edited) closeup of the die." class="hilite" height="407" src="https://static.righto.com/images/386-reg/die-components-w600.jpg" title="A (heavily edited) closeup of the die." width="600" /></a><div class="cite">A (heavily edited) closeup of the die.</div></p>
<h2>Register layout in the 386</h2>
<p>Next, I'll explain the layout of these cells in the 386.
To increase the circuit density, two cells are put side-by-side, with a mirrored layout.
In this way, each row holds two interleaved registers.<span id="fnref:density"><a class="ref" href="#fn:density">6</a></span>
The schematic below shows the arrangement of the paired cells, matching the die image above.
Transistors A and B form the first inverter,<span id="fnref2:inverter"><a class="ref" href="#fn:inverter">7</a></span> while transistors C and D form the second
inverter.
Pass transistors E and F allow the bitlines to write the cell.
For reading, transistor G amplifies the signal while pass transistor H connects the selected bit to the
output.</p>
<p><a href="https://static.righto.com/images/386-reg/schematic-full.png"><img alt="Schematic of two static cells in the 386. The schematic approximately matches the physical layout." class="hilite" height="273" src="https://static.righto.com/images/386-reg/schematic-full-w700.png" title="Schematic of two static cells in the 386. The schematic approximately matches the physical layout." width="700" /></a><div class="cite">Schematic of two static cells in the 386. The schematic approximately matches the physical layout.</div></p>
<p>The left and right sides are approximately mirror images, with separate read and write control lines for
each half.
Because the control lines for the left and right sides are in different positions, the two sides have
some layout differences, in particular, the bulging loop on the right.
Mirroring the cells increases the density since the bitlines can be shared by the cells.</p>
<!--
Inconveniently, the read outputs for the two cells are at opposite ends, so the output on the left is connected
to the read bitline on the right.
-->
<p>The diagram below shows the various components on the die,
labeled to match the schematic above.
I've drawn the lower M1 metal wiring in blue, but omitted the M2 wiring (horizontal control lines, power, and ground). "Read crossover" indicates the connection from the read output on the left to the bitline on the right.
Black circles indicate vias between M1 and M2, green circles indicate contacts between silicon and M1, and
reddish circles indicate contacts between polysilicon and M1.</p>
<p><a href="https://static.righto.com/images/386-reg/cell-silicon.jpg"><img alt="The layout of two static cells. The M1 metal layer is drawn in blue; the horizontal M2 lines are not shown." class="hilite" height="396" src="https://static.righto.com/images/386-reg/cell-silicon-w600.jpg" title="The layout of two static cells. The M1 metal layer is drawn in blue; the horizontal M2 lines are not shown." width="600" /></a><div class="cite">The layout of two static cells. The M1 metal layer is drawn in blue; the horizontal M2 lines are not shown.</div></p>
<p>One more complication is that alternating registers (i.e. rows) are reflected vertically, as shown below.
This allows one horizontal power line to feed two rows, and similarly for a horizontal ground line.
This cuts the number of power/ground lines in half, making the layout more efficient.</p>
<p><a href="https://static.righto.com/images/386-reg/multi-cells.jpg"><img alt="Multiple storage cells." class="hilite" height="387" src="https://static.righto.com/images/386-reg/multi-cells-w500.jpg" title="Multiple storage cells." width="500" /></a><div class="cite">Multiple storage cells.</div></p>
<p>Having two layers of metal makes the circuitry considerably more difficult to reverse engineer. The
photo below (left) shows one of the static RAM cells as it appears under the microscope.
Although the structure of the metal layers is visible in the photograph, there is a lot of ambiguity.
It is difficult to distinguish the two layers of metal. Moreover, the metal completely hides the
polysilicon layer, not to mention the underlying silicon.
The large black circles are vias between the two metal layers.
The smaller faint circles are contacts between a metal layer and the underlying silicon or polysilicon.</p>
<p><a href="https://static.righto.com/images/386-reg/metal-layers.jpg"><img alt="One cell as it appears on the die, with a diagram of the upper (M2) and lower (M1) metal layers." class="hilite" height="452" src="https://static.righto.com/images/386-reg/metal-layers-w600.jpg" title="One cell as it appears on the die, with a diagram of the upper (M2) and lower (M1) metal layers." width="600" /></a><div class="cite">One cell as it appears on the die, with a diagram of the upper (M2) and lower (M1) metal layers.</div></p>
<p>With some effort, I determined the metal layers, which I show on the right:
M2 (upper) and M1 (lower).
By comparing the left and right images, you can see how the structure of the metal layers is somewhat visible.
I use black circles to
indicate vias between the layers, green circles to indicate contacts between M1 and silicon, and pink circles to indicate
contacts between M1 and polysilicon.
Note that both metal layers are packed as tightly as possible.
The layout of this circuit was highly optimized to minimize the area.
It is interesting to note that decreasing the size of the transistors wouldn't help with this circuit, since
the size is limited by the metal density.
This illustrates that a fabrication process must balance the size of the metal features, polysilicon features,
and silicon features since over-optimizing one won't help the overall chip density.</p>
<p>The photo below shows the bottom of the register file.
The "notch" makes the registers at the very bottom half-width:
4 half-width rows corresponding to eight 16-bit registers.
Since there are six 16-bit segment registers in the 386, I suspect these are the segment registers and
two mystery registers.</p>
<p><a href="https://static.righto.com/images/386-reg/split.jpg"><img alt="The bottom of the register file." class="hilite" height="186" src="https://static.righto.com/images/386-reg/split-w500.jpg" title="The bottom of the register file." width="500" /></a><div class="cite">The bottom of the register file.</div></p>
<p>I haven't been able to determine which registers in the 386 correspond to the other registers on the die.
In the segment descriptor circuitry, there are two rows of register cells with ten more rows below,
corresponding to 24 32-bit registers. These are presumably segment descriptors.
At the bottom of the datapath, there are 10 32-bit registers with the 8T circuit.
The 386's programmer-visible registers consist of eight general-purpose 32-bit registers (EAX, etc.).
The 386 has various control registers, test registers, and segmentation registers<span id="fnref:registers"><a class="ref" href="#fn:registers">8</a></span> that are
not well known.
The 8086 has a few registers for internal use that aren't visible to the programmer, so the 386 presumably
has even more invisible registers.
At this point, I can't narrow down the functionality.</p>
<h2>Conclusions</h2>
<p>It's interesting to examine how registers are implemented in a real processor.
There are plenty of descriptions of the 8T static cell circuit, but it turns out that the physical implementation
is more complicated than the theoretical description.
Intel put a lot of effort into optimizing this circuit, resulting in a dense block of circuitry.
By mirroring cells horizontally and vertically, the density could be increased further.</p>
<p>Reverse engineering one small circuit of the 386 turned out to be pretty tricky, so I don't plan to do
a complete reverse engineering.
The main difficulty is that the two layers of metal are hard to untangle.
Moreover, I lost most of the polysilicon when removing the metal.
Finally, it is hard to draw diagrams with four layers without the diagram turning into a mess,
but hopefully the diagrams made sense.</p>
<p>I plan to write more about the 386, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon occasionally as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:flip">
<p>Typically the write driver circuit generates a strong low on one of the bitlines,
flipping the corresponding inverter to a high output.
As soon as one inverter flips, it will force the other inverter into the right state.
To support this, the pullup transistors in the inverters are weaker than normal. <a class="footnote-backref" href="#fnref:flip" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:pass">
<p>The pass transistor passes its signal through or blocks it.
In CMOS, this is usually implemented with a transmission gate with an NMOS and a PMOS transistor in parallel.
The cell uses only the NMOS transistor, which makes it worse at passing a high signal, but substantially
reduces the size, a reasonable tradeoff for a storage cell. <a class="footnote-backref" href="#fnref:pass" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:precharge">
<p>The bitline is typically precharged to a high level for a read, and then the cell pulls the line low for
a 0.
This is more compact than including circuitry in each cell to pull the line high. <a class="footnote-backref" href="#fnref:precharge" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:oxynitride">
<p>One problem is that the 386 uses a layer of insulating silicon oxynitride as well as the
usual silicon dioxide. I was able to remove the oxynitride with boiling phosphoric acid, but this removed
most of the polysilicon as well. I'm still experimenting with the timing; 20 minutes of boiling was too long. <a class="footnote-backref" href="#fnref:oxynitride" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:layers">
<p>The image is an edited composite of multiple cells since the polysilicon was highly damaged when removing
the metal layers.
Unfortunately, I haven't found a process for the 386 to remove one layer of metal at a time.
As a result, reverse-engineering the 386 is much more difficult than earlier processors such as the 8086;
I have to look for faint traces of polysilicon and puzzle over what connections the circuit requires. <a class="footnote-backref" href="#fnref:layers" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:density">
<p>You might wonder why they put two cells side-by-side instead of simply cramming the cells together more tightly.
The reason for putting two cells in each row is presumably to match the size of each bit with the rest of
the circuitry in the datapath.
If the register circuitry is half the width of the ALU circuitry, a bunch of space will be wasted by the
wiring to line up each register bit with the corresponding ALU bit. <a class="footnote-backref" href="#fnref:density" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:inverter">
<p>A CMOS inverter is constructed from an NMOS transistor (which pulls the output low on a 1 input) and a PMOS transistor (which pulls the output high on a 0 input), as shown below.</p>
<p><a href="https://static.righto.com/images/386-reg/inverter.png"><img alt="A CMOS inverter." class="hilite" height="167" src="https://static.righto.com/images/386-reg/inverter-w120.png" title="A CMOS inverter." width="120" /></a><div class="cite">A CMOS inverter.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:inverter" title="Jump back to footnote 7 in the text">↩</a><a class="footnote-backref" href="#fnref2:inverter" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:registers">
<p>The 386 has multiple registers that are documented but not well known.
Chapter 4 of the <a href="http://www.bitsavers.org/components/intel/80386/230985-003_386DX_Microprocessor_Programmers_Reference_Manual_1990.pdf">386 Programmers Reference Manual discusses</a> various registers that are only relevant
to operating systems programmers.
These include the Global Descriptor Table Register (GDTR), Local Descriptor Table Register (LDTR),
Interrupt Descriptor Table Register (IDTR), and Task Register (TR).
There are four Control Registers CR0-CR3; CR0 controls coprocessor usage, paging, and a few other things.
The six Debug Registers for hardware breakpoints are named DR0-DR3, DR6, and DR7 (which suggests undocumented DR4 and DR5 registers).
The two Test Registers for TLB testing are named TR6 and TR7 (which suggests undocumented TR0-TR5 registers).
I expect that these registers are located near the relevant functional units, rather than part of the
processing datapath. <a class="footnote-backref" href="#fnref:registers" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com5tag:blogger.com,1999:blog-6264947694886887540.post-68418745134043724812023-10-31T08:39:00.007-07:002023-10-31T18:47:43.010-07:00Reverse-engineering Ethernet backoff on the Intel 82586 network chip's die<p>Introduced in 1973, Ethernet is the predominant way of wiring computers together.
Chips were soon introduced to handle the low-level aspects of Ethernet: converting data packets into bits,
implementing checksums, and handling network collisions.
In 1982, Intel announced the i82586 Ethernet LAN coprocessor chip, which went much further by offloading most of the
data movement from the main processor to an on-chip coprocessor.
Modern Ethernet networks handle a gigabit of data per second or more, but at the time, the Intel chip's support for 10 Mb/s
Ethernet put it on the cutting edge.
(Ethernet was surprisingly expensive, about $2000 at the time, but <a href="https://books.google.com/books?id=3kpkQKds0lAC&lpg=PA4&pg=PA4#v=onepage&q&f=false">expected to drop</a> under $1000 with the Intel chip.)
In this blog post, I focus on a specific part of the coprocessor chip: how it handles network collisions and implements
exponential backoff.</p>
<p>The die photo below shows the i82586 chip. This photo shows the metal layer on top of the chip, which hides the underlying
polysilicon wiring and silicon transistors.
Around the edge of the chip, square bond pads provide the link to the chip's 48 external pins.
I have labeled the function blocks based on my reverse engineering and published descriptions. The left side of the chip is called the "receive unit" and
handles the low-level networking,
with circuitry for the network transmitter and receiver.
The left side also contains low-level control and status registers.
The right side is called the "command unit" and interfaces to memory and the main processor.
The right side contains a simple processor controlled by a microinstruction ROM.<span id="fnref:microcode"><a class="ref" href="#fn:microcode">1</a></span>
Data is transmitted between the two halves of the chip by 16-byte FIFOs (first in, first out queues).</p>
<p><a href="https://static.righto.com/images/i82586s/die-labeled.jpg"><img alt="The die of the Intel 82586 with the main functional blocks labeled. Click this image (or any other) for a larger version." class="hilite" height="528" src="https://static.righto.com/images/i82586s/die-labeled-w600.jpg" title="The die of the Intel 82586 with the main functional blocks labeled. Click this image (or any other) for a larger version." width="600" /></a><div class="cite">The die of the Intel 82586 with the main functional blocks labeled. Click this image (or any other) for a larger version.</div></p>
<p>The 82586 chip is more complex than the typical Ethernet chip at the time.
It was designed to improve system performance by moving most of the Ethernet processing from the main processor to
the coprocessor, allowing the main processor and the coprocessor to operate in parallel.
The coprocessor provides four DMA channels to move data between memory and the network without the main processor's involvement.
The main processor and the coprocessor communicate through complex data structures<span id="fnref:structures"><a class="ref" href="#fn:structures">2</a></span> in shared memory: the main processor puts control blocks in memory
to tell the I/O coprocessor what to do, specifying the locations of transmit and receive buffers in memory.
In response, the I/O coprocessor puts status blocks in memory.
The processor onboard the 82586 chip allows the chip to handle these complex data structures in software.
Meanwhile, the transmission/receive circuitry on the left side of the chip uses dedicated circuitry to handle the low-level,
high-speed aspects of Ethernet. </p>
<h2>Ethernet and collisions</h2>
<p>A key problem with a shared network is how to prevent multiple computers from trying to send data on the network at the same
time.
Instead of a centralized control mechanism, Ethernet allows computers to transmit whenever they want.<span id="fnref:token-ring"><a class="ref" href="#fn:token-ring">3</a></span>
If two computers transmit at the same time, the "collision" is detected and the computers try again, hoping to
avoid a collision the next time.
Although this may sound inefficient, it turns out to work out remarkably well.<span id="fnref:csma"><a class="ref" href="#fn:csma">4</a></span>
To avoid a second collision, each computer waits a random amount of time before retransmitting the packet.
If a collision happens again (which is likely on a busy network), an exponential backoff algorithm is used, with each
computer waiting longer and longer after each collision.
This automatically balances the retransmission delay to minimize collisions and maximize throughput.</p>
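The retransmission rule described above is the Ethernet standard's truncated binary exponential backoff. As a small illustrative sketch (my own Python model of the algorithm, not the chip's implementation; the name `backoff_delay` is mine), after the nth collision a station picks a random delay from a range that doubles each time, capped at 2<sup>10</sup> slot times:

```python
import random

MAX_ATTEMPTS = 16  # the standard gives up and drops the packet after 16 attempts

def backoff_delay(attempt):
    """Delay (in 51.2-microsecond slot times) before retransmission
    attempt number `attempt`, using truncated binary exponential backoff."""
    if attempt > MAX_ATTEMPTS:
        raise RuntimeError("excessive collisions: packet dropped")
    k = min(attempt, 10)             # the delay range doubles, capped at 2**10
    return random.randrange(2 ** k)  # pick 0 .. 2**k - 1 slot times
```

The cap at 2<sup>10</sup> slots is why the chip's counter and mask registers are 10 bits wide.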
<p>I traced out a bunch of circuitry to determine how the exponential backoff logic is implemented.
To summarize, exponential backoff is implemented with a 10-bit counter to provide a pseudorandom number, a 10-bit mask register to get an exponentially sized
delay, and a delay counter to count down the delay.
I'll discuss how these are implemented, starting with the 10-bit counter.</p>
<h2>The 10-bit counter</h2>
<p>A 10-bit counter may seem trivial, but it still takes up a substantial area of the chip.
The straightforward way of implementing a counter is to hook up 10 latches as a "ripple counter".
The counter is controlled by a clock signal that indicates that the counter should increment.
The clock toggles the lowest bit of the counter.
If this bit flips from 1 to 0, the next higher bit is toggled.
The process is repeated from bit to bit, toggling a bit if there is a carry.
The problem with this approach is that the carry "ripples" through the counter.
Each bit is delayed by the lower bit, so the bits don't all flip at the same time.
This limits the speed of the counter as the top bit isn't settled until the carry has propagated through the nine lower bits.</p>
<p>The counter in the chip uses a different approach with additional circuitry to improve performance.
Each bit has logic to check if all the lower bits are ones. If so, the clock signal toggles the bit.
All the bits toggle at the same time, rapidly incrementing the counter in response to the clock signals.
The drawback of this approach is that it requires much more logic.</p>
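The "toggle if all lower bits are ones" rule can be modeled in a few lines of Python (an illustrative sketch of the logic, not a transcription of the chip's gates):

```python
def increment(bits):
    """Increment a counter held as a list of bits (bit 0 first).
    Each bit toggles exactly when all lower bits are 1, so every
    toggle happens on the same clock rather than rippling upward."""
    all_ones_below = True  # vacuously true for bit 0: it always toggles
    toggles = []
    for b in bits:
        toggles.append(all_ones_below)
        all_ones_below = all_ones_below and (b == 1)
    return [b ^ t for b, t in zip(bits, toggles)]
```

For example, incrementing 3 (bits `[1, 1, 0]`) toggles all three bits at once, producing 4 (bits `[0, 0, 1]`).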
<p>The diagram below shows how the carry logic is implemented.
The circuitry is optimized to balance speed and complexity.
In particular, bits are examined in groups of three, allowing some of the logic to be shared across multiple bits.
For instance, instead of using a 9-input gate to examine the nine lower bits, separate gates test bits 0-2 and 3-5.</p>
<p><a href="https://static.righto.com/images/i82586s/toggle-logic.jpg"><img alt="The circuitry to generate the toggle signals for each bit of the counter." class="hilite" height="555" src="https://static.righto.com/images/i82586s/toggle-logic-w350.jpg" title="The circuitry to generate the toggle signals for each bit of the counter." width="350" /></a><div class="cite">The circuitry to generate the toggle signals for each bit of the counter.</div></p>
<p>The implementation of the latches is also interesting.
Each latch is implemented with dynamic logic, using the circuit's capacitance to store each bit.
The input is connected to the output with two inverters.
When the clock is high, the transistor turns on, connecting the inverters in a loop that holds the value.
When the clock is low, the transistor turns off. However, the 0 or 1 value will still remain on the input to
the first inverter, held by the charge on the transistor's gate.
At this time, an input can be fed into the latch, overriding the old value.</p>
<p><a href="https://static.righto.com/images/i82586s/bit.jpg"><img alt="The basic dynamic latch circuit." class="hilite" height="148" src="https://static.righto.com/images/i82586s/bit-w300.jpg" title="The basic dynamic latch circuit." width="300" /></a><div class="cite">The basic dynamic latch circuit.</div></p>
<p>The latch has some additional circuitry to make it useful.
To toggle the latch, the output is inverted before feeding it back to the input. The toggle control signal selects
the inverted output through another pass transistor.
The toggle signal is only activated when the clock is low, ensuring that the
circuit doesn't repeatedly toggle, oscillating out of control.</p>
<p><a href="https://static.righto.com/images/i82586s/counter-bit.jpg"><img alt="One bit of the counter." class="hilite" height="196" src="https://static.righto.com/images/i82586s/counter-bit-w400.jpg" title="One bit of the counter." width="400" /></a><div class="cite">One bit of the counter.</div></p>
<p>The image below shows how the counter circuit is implemented on the die. I have removed the metal layer to show the underlying transistors; the circles are contacts where the metal was connected to the underlying silicon.
The pinkish regions are doped silicon. The pink-gray lines are polysilicon wiring. When polysilicon crosses doped silicon, it
creates a transistor.
The blue color swirls are not significant; they are bits of oxide remaining on the die.</p>
<p><a href="https://static.righto.com/images/i82586s/die-counter.jpg"><img alt="The counter circuitry on the die." class="hilite" height="764" src="https://static.righto.com/images/i82586s/die-counter-w400.jpg" title="The counter circuitry on the die." width="400" /></a><div class="cite">The counter circuitry on the die.</div></p>
<h2>The 10-bit mask register</h2>
<p>The mask register has a particular number of low bits set, providing a mask of length 0 to 10.
For instance, with 4 bits set, the mask register is 0000001111.
The mask register can be updated in two ways. First, it can be set to length 1-8 with a three-bit length input.<span id="fnref:length"><a class="ref" href="#fn:length">5</a></span>
Second, the mask can be lengthened by one bit, for example going from 0000001111 to 0000011111 (length 4 to 5).</p>
<p>The mask register is implemented with dynamic latches similar to the counter, but the inputs to the latches are different.
To load the mask to a particular length, each bit has logic to determine if the bit should be set based on the three-bit input.
For example, bit 3 is cleared if the specified length is 0 to 3, and set otherwise.
The lengthening feature is implemented by shifting the mask value to the left by one bit and inserting a 1 into the lowest bit.</p>
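The two update operations can be modeled as follows (my own Python sketch; the function names are mine, and the load behavior follows the rule that the mask length is the three-bit input plus one):

```python
MASK_BITS = 10

def load_mask(length_input):
    """Load the mask from the three-bit length input: the chip sets
    length_input + 1 low bits, giving masks of length 1 to 8."""
    return (1 << (length_input + 1)) - 1

def lengthen_mask(mask):
    """Shift the mask left one bit and insert a 1 at the bottom,
    e.g. 0b0000001111 -> 0b0000011111 (length 4 to 5)."""
    return ((mask << 1) | 1) & ((1 << MASK_BITS) - 1)
```

Note that the top two mask bits can only be reached by repeated lengthening, matching the die layout where bits 8 and 9 lack the mask-length logic.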
<p>The schematic below shows one bit of the mask register. At the center is a two-inverter latch as seen before.
When the clock is high, it holds its value. When the clock is low, the latch can be loaded with a new value.
The "shift" line causes the bit from the previous stage to be shifted in. The "load" line loads the mask bit generated from
the input length. The "reset" line clears the mask.
At the right is the NAND gate that applies the mask to the count and inverts the result.
As will be seen below, these NAND gates are unusually large.</p>
<p><a href="https://static.righto.com/images/i82586s/mask-stage.png"><img alt="One stage of the mask register." class="hilite" height="259" src="https://static.righto.com/images/i82586s/mask-stage-w500.png" title="One stage of the mask register." width="500" /></a><div class="cite">One stage of the mask register.</div></p>
<p>The logic to set a mask bit based on the length input is shown below.<span id="fnref:demorgan"><a class="ref" href="#fn:demorgan">6</a></span>
The three-bit "sel" input selects the mask length from 1 to 8 bits; note that the mask0 bit is always set while bits
8 and 9 are cleared.<span id="fnref:control"><a class="ref" href="#fn:control">7</a></span>
Each set of gates energizes the corresponding mask line for the appropriate inputs.</p>
<p><a href="https://static.righto.com/images/i82586s/mask-ctrl.png"><img alt="The control logic to enable mask bits based on length." class="hilite" height="602" src="https://static.righto.com/images/i82586s/mask-ctrl-w350.png" title="The control logic to enable mask bits based on length." width="350" /></a><div class="cite">The control logic to enable mask bits based on length.</div></p>
<p>The diagram below shows the mask register on the die. I removed the metal layer to show the underlying
silicon and polysilicon, so the transistors are visible.
On the left are the NAND gates that combine each bit of the counter with the mask. Note the large snake-like
transistors; these provide enough current to drive the signal over the long bus to
the delay counter register at the bottom of the chip.
Bit 0 of the mask is always set, so it doesn't have a latch. Bits 8 and 9 of the mask are only set by
shifting, not by selecting a mask length, so they don't have mask logic.<span id="fnref:state"><a class="ref" href="#fn:state">8</a></span></p>
<p><a href="https://static.righto.com/images/i82586s/mask-register.jpg"><img alt="The mask register on the die." class="hilite" height="764" src="https://static.righto.com/images/i82586s/mask-register-w400.jpg" title="The mask register on the die." width="400" /></a><div class="cite">The mask register on the die.</div></p>
<h2>The delay counter register</h2>
<p>To generate the pseudorandom exponential backoff, the counter register and the mask register are NANDed together.
This generates a number of the desired binary length, which is stored in the delay counter.
Note that the NAND operation inverts the result, making it negative.
Thus, as the delay counter counts up, it counts toward zero, reaching zero after the desired number of clock ticks.</p>
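The whole delay-generation step can be modeled in Python (an illustrative sketch, under the assumption that the delay counter simply increments each clock tick until it wraps to zero):

```python
def delay_ticks(counter, mask, bits=10):
    """NAND the free-running counter with the mask to load the delay
    counter, then count up until the value wraps around to zero.
    Returns how many clock ticks elapse before the delay expires."""
    loaded = ~(counter & mask) & ((1 << bits) - 1)  # bitwise NAND
    ticks, value = 0, loaded
    while value != 0:
        value = (value + 1) & ((1 << bits) - 1)
        ticks += 1
    return ticks  # equals ((counter & mask) + 1) % 2**bits
```

Because the NAND inverts the masked count, loading the complement and counting up to zero yields a delay of the masked count plus one, so a longer mask gives an exponentially larger range of possible delays.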
<p>The implementation of the delay counter is similar to the first counter, so I won't include a schematic.
However, the delay counter is attached to the register bus, allowing its value to be read by the chip's CPU.
Control lines allow the delay counter's value to pass onto the register bus.</p>
<p>The diagram below shows the locations of the counter, mask, and delay register on the die.
In this era, something as simple as a 10-bit register occupied a significant part of the die.
Also note the distance between the counter and mask and the delay register at the bottom of the chip.
The NAND gates for the counter and mask required large transistors to drive the signal across this large distance.</p>
<p><a href="https://static.righto.com/images/i82586s/die-labeled-backoff.jpg"><img alt="The die, with counter, mask, and delay register." class="hilite" height="446" src="https://static.righto.com/images/i82586s/die-labeled-backoff-w600.jpg" title="The die, with counter, mask, and delay register." width="600" /></a><div class="cite">The die, with counter, mask, and delay register.</div></p>
<h2>Conclusions</h2>
<p>The Intel Ethernet chip provides an interesting example of how a real-world circuit is implemented on a chip.
Exponential backoff is a key part of the Ethernet standard.
This chip implements backoff with a simple but optimized circuit.<span id="fnref:drawback"><a class="ref" href="#fn:drawback">9</a></span></p>
<p><a href="https://static.righto.com/images/i82586s/die-stripped-r.jpg"><img alt="A high-resolution image of the die with the metal removed. (Click for a larger version.) Some of the oxide layer remains, causing colored regions due to thin-film interference." class="hilite" height="511" src="https://static.righto.com/images/i82586s/die-stripped-r-w600.jpg" title="A high-resolution image of the die with the metal removed. (Click for a larger version.) Some of the oxide layer remains, causing colored regions due to thin-film interference." width="600" /></a><div class="cite">A high-resolution image of the die with the metal removed. (Click for a larger version.) Some of the oxide layer remains, causing colored regions due to thin-film interference.</div></p>
<p>For more chip reverse engineering,
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon occasionally as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Acknowledgments: Thanks to Robert Garner for providing the chip and questions.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:microcode">
<p>I think the on-chip processor is a very simple processor that doesn't match other Intel architectures.
It is described as executing microcode. I don't think this is microcode in the usual sense of machine
instructions being broken down into microcode.
Instead, I think the processor's instructions are primitive, single-clock instructions that are more
like micro-instructions than machine instructions. <a class="footnote-backref" href="#fnref:microcode" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:structures">
<p>The diagram below shows the data structures in shared memory for communication between the main processor and the coprocessor.
The Command List specifies the commands that the coprocessor should perform.
The Receive Frame area provides memory blocks for incoming network packets.</p>
<p><a href="https://static.righto.com/images/i82586s/memory-structures.jpg"><img alt="A diagram of the 82586 shared memory structures, from the 82586 datasheet." class="hilite" height="555" src="https://static.righto.com/images/i82586s/memory-structures-w600.jpg" title="A diagram of the 82586 shared memory structures, from the 82586 datasheet." width="600" /></a><div class="cite">A diagram of the 82586 shared memory structures, from the 82586 datasheet.</div></p>
<p>I think Intel was inspired by mainframe-style I/O channels, which moved I/O processing to separate processors
communicating through memory.
Another sign of Intel's attempts to move mainframe technology to microprocessors was the ill-fated iAPX 432 processor,
which Intel called a "micro-mainframe." (I discuss the iAPX 432 as part of <a href="https://www.righto.com/2023/07/the-complex-history-of-intel-i960-risc.html">this blog post</a>.)</p>
<p><!-- --> <a class="footnote-backref" href="#fnref:structures" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:token-ring">
<p>An alternative approach to networking is token-ring, where the computers in the network pass a token from machine to machine.
Only the machine with the token can send a packet on the network, ensuring collision-free transmission.
I looked inside an IBM token-ring chip in <a href="https://www.righto.com/2021/02/strange-chip-teardown-of-vintage-ibm.html">this post</a>. <a class="footnote-backref" href="#fnref:token-ring" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:csma">
<p>Ethernet's technique is called CSMA/CD (Carrier Sense Multiple Access with Collision Detection).
The idea of Carrier Sense is that the "carrier" signal on the network indicates that the network is in use.
Each computer on the network listens for the lack of carrier before transmitting, which avoids most collisions.
However, there is still a small chance of collision.
(In particular, the speed of light means that there is a delay on a long network
between when one computer starts transmitting and when a second computer can detect this transmission.
Thus, both computers can think the network is free while the other computer is transmitting.
This factor also imposes a maximum length on an Ethernet network segment: if the network is too long,
a computer can finish transmitting a packet before the collision occurs, and it won't detect the collision.)
Modern Ethernet has moved from the shared network to a star topology that avoids collisions. <a class="footnote-backref" href="#fnref:csma" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:length">
<p>The length of the mask is one more than the three-bit length input. For example, an input of 7 sets eight mask bits. <a class="footnote-backref" href="#fnref:length" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:demorgan">
<p>The mask generation logic is a bit tricky to understand. You can try various bit combinations to see how it works.
The logic is easier to understand if you apply De Morgan's law to change the NOR gates to AND gates, which also removes
the negation on the inputs. <a class="footnote-backref" href="#fnref:demorgan" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:control">
<p>The control line appears to enable or disable mask selection but its behavior is inexplicably negated on bit 1. <a class="footnote-backref" href="#fnref:control" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:state">
<p>The circuitry below the counter appears to be a state machine that is unrelated to the exponential backoff.
From reverse engineering, my hypothesis is that the counter is reused by the state machine: it both generates pseudorandom numbers for exponential backoff and times events when a packet is being received.
In particular, it has circuitry to detect when the counter reaches 9,
20, and 48, and takes actions at these values.</p>
<p>The state itself is held in numerous latches. The new state is computed by a PLA (Programmable Logic Array)
below and to the right of the counter along with numerous individual gates. <a class="footnote-backref" href="#fnref:state" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:drawback">
<p>One drawback of this exponential backoff circuit is that the pseudorandom numbers are completely synchronous.
If two network nodes happen to be in the exact same counter state when they collide, they will go through the same exponential
backoff delays, causing a collision every time.
While this may seem unlikely, it apparently happened occasionally during use.
The LANCE Ethernet chip from AMD used a different approach. Instead of running the pseudorandom counter from the highly accurate
quartz clock signal, the counter used an on-chip ring oscillator that was deliberately designed to be inaccurate.
This prevented two nodes from locking into inadvertent synchronization. <a class="footnote-backref" href="#fnref:drawback" title="Jump back to footnote 9 in the text">↩</a></p>
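<p>A toy Python model (mine, not the actual chip logic) shows why synchronous counters are a problem: if each node derives its retry delay from a free-running counter, two nodes holding the same counter state sample the same delay on every retry.</p>

```python
# Toy model of the synchronization problem. Each node derives its retry
# delay from a free-running counter sampled at the moment of a collision.
# Two nodes with identical counter state pick identical delays forever.

def backoff_delay(counter_value, attempt):
    # truncated binary exponential backoff: pick a slot in [0, 2^attempt)
    return counter_value % (2 ** min(attempt, 10))

def collisions(state_a, state_b, attempts=5):
    """Count how many of the retries collide (i.e., equal delays)."""
    count = 0
    for attempt in range(1, attempts + 1):
        delay_a = backoff_delay(state_a + attempt, attempt)
        delay_b = backoff_delay(state_b + attempt, attempt)
        if delay_a == delay_b:
            count += 1
    return count

print(collisions(42, 42))  # identical counter state: all 5 retries collide
print(collisions(42, 17))  # differing state: the nodes quickly separate
```

An inaccurate ring oscillator, in this model, amounts to guaranteeing that the two counter states drift apart.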
</li>
</ol>
</div>
<h2>Examining the silicon dies of the Intel 386 processor</h2>
<p>You might think of the Intel 386 processor (1985) as just an early processor in the x86
line, but the 386 was a critical turning point for modern computing in several ways.<span id="fnref:second-source"><a class="ref" href="#fn:second-source">1</a></span>
First, the 386 moved the x86 architecture to 32 bits, defining the dominant computing
architecture for the rest of the 20th century.
The 386 also established the overwhelming importance of x86, not just for Intel, but for the entire computer
industry.
Finally, the 386 ended IBM's control over the PC market, turning Compaq into the architectural
leader.</p>
<p>In this blog post, I look at die photos of the Intel 386 processor and explain what they reveal
about the history of the processor, such as the move from the 1.5 µm process to the
1 µm process.
You might expect that Intel simply made the same 386 chip at a smaller scale, but there were
substantial changes to the chip's layout, even some visible to the naked eye.<span id="fnref:keychains"><a class="ref" href="#fn:keychains">2</a></span>
I also look at why the 386 SL had over three times as many transistors as the other 386 versions.<span id="fnref:variants"><a class="ref" href="#fn:variants">3</a></span></p>
<p>The 80386 was a major advancement over the 286: it implemented a 32-bit architecture,
added more instructions, and supported 4-gigabyte segments.
The 386 is a complicated processor (by 1980s standards), with 285,000 transistors, ten times the number of the original 8086.<span id="fnref:transistors"><a class="ref" href="#fn:transistors">4</a></span>
The 386 has eight logical units
that are pipelined<span id="fnref:parallelism"><a class="ref" href="#fn:parallelism">5</a></span> and operate mostly autonomously.<span id="fnref:block-diagram"><a class="ref" href="#fn:block-diagram">6</a></span>
The diagram below shows the internal structure of the 386.<span id="fnref:die-diagram"><a class="ref" href="#fn:die-diagram">7</a></span></p>
<p><a href="https://static.righto.com/images/386-versions/386-die-labeled.jpg"><img alt="The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici." class="hilite" height="641" src="https://static.righto.com/images/386-versions/386-die-labeled-w600.jpg" title="The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici." width="600" /></a><div class="cite">The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici.</div></p>
<p>The heart of a processor is the datapath, the components that hold and process data.
In the 386, these components are in the lower left: the ALU (Arithmetic/Logic Unit), a barrel shifter to shift data, and the registers.
These components form regular rectangular blocks, 32 bits wide.
The datapath, along with the circuitry to the left that manages it, forms the Data Unit.
In the lower right is the microcode ROM, which breaks down machine instructions into
micro-instructions, the low-level steps of the instruction.
The microcode ROM, along with the microcode engine circuitry, forms the Control Unit.</p>
<p>The 386 has a complicated instruction format.
The Instruction Decode Unit breaks apart an instruction into its component parts
and generates a pointer to the microcode that implements the instruction.
The instruction queue holds three decoded instructions.
To improve performance, the Prefetch Unit reads instructions from memory before they are
needed, and stores them in the 16-byte prefetch queue.<span id="fnref:queue"><a class="ref" href="#fn:queue">8</a></span></p>
<!-- Introduction to the 80386 p52 -->
<p>The 386 implements segmented memory and virtual memory, with access protection.<span id="fnref:protection"><a class="ref" href="#fn:protection">9</a></span>
The Memory Management Unit
consists of the Segment Unit and the Paging Unit:
the Segment Unit translates a logical address to a linear address, while the Paging Unit
translates the linear address to a physical address.
The segment descriptor cache and page cache (TLB) hold data about segments and pages;
the 386 has no on-chip instruction or data cache.<span id="fnref:cache"><a class="ref" href="#fn:cache">10</a></span>
The Bus Interface Unit in the upper right handles communication between the 386 and the external
memory and devices.</p>
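<p>The two-stage translation can be sketched in a few lines of Python (illustrative only; the real descriptors, protection checks, and two-level page-table walk are more involved, and the addresses below are made up):</p>

```python
# Minimal sketch of the 386's two-stage address translation:
# Segment Unit: logical (segment, offset) -> linear address
# Paging Unit:  linear -> physical, via a (flattened) page table

PAGE_SIZE = 4096  # the 386 uses 4 KB pages

def segment_to_linear(base, limit, offset):
    if offset > limit:
        raise MemoryError("segment limit violation (protection fault)")
    return (base + offset) & 0xFFFFFFFF  # 32-bit linear address space

def linear_to_physical(page_table, linear):
    page = linear // PAGE_SIZE
    frame = page_table[page]          # in hardware, ideally a TLB hit
    return frame * PAGE_SIZE + (linear % PAGE_SIZE)

# hypothetical example: segment at base 0x10000, page 0x12 -> frame 0x345
page_table = {0x12: 0x345}
linear = segment_to_linear(base=0x10000, limit=0xFFFF, offset=0x2345)
print(hex(linear))                                  # 0x12345
print(hex(linear_to_physical(page_table, linear)))  # 0x345345
```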
<!--
The 386 can support thousands of segments, with the most recent segment information stored
in the [segment descriptor cache](https://web.archive.org/web/20220405211445/http://www.rcollins.org/ddj/Aug98/Aug98.html).
The 32-entry translation lookaside buffer (TLB or page cache) caches the translation information for
the most recently used pages.
El-Ayat p14 discusses the protection test unit.
-->
<p>Silicon dies are often labeled with the initials of the designers. The 386 DX, however,
has an unusually large number of initials. In the image below, I have enlarged the tiny initials so they are visible.
I think the designers put their initials next to the unit they worked on, but
I haven't been able to identify most of the names.<span id="fnref:et"><a class="ref" href="#fn:et">11</a></span></p>
<p><a href="https://static.righto.com/images/386-versions/386-initials.jpg"><img alt="The 386 die with the initials magnified." class="hilite" height="641" src="https://static.righto.com/images/386-versions/386-initials-w600.jpg" title="The 386 die with the initials magnified." width="600" /></a><div class="cite">The 386 die with the initials magnified.</div></p>
<h2>The shrink from 1.5 µm to 1 µm</h2>
<p>The original 386 was built on a process called CHMOS-III that had 1.5 µm features (specifically the gate channel length for a transistor).
Around 1987, Intel moved to an improved process called CHMOS-IV, with 1 µm features,
permitting a considerably smaller die for the 386.
However, shrinking the layout wasn't a simple mechanical process. Instead, many changes were
made to the chip, as shown in the comparison diagram below.
Most visibly, the Instruction Decode Unit and the Protection Unit in the center-right are
horizontal in the smaller die, rather than vertical.
The standard-cell logic (discussed later) is considerably more dense, probably due to
improved layout algorithms.
The data path (left) was highly optimized in the original so it remained essentially unchanged, but smaller.
One complication is that the bond pads around the border needed to remain the same size so bond wires could be attached.
To fit the pads around the smaller die, many of the pads are staggered.
Because different parts of the die shrank differently, the blocks no longer fit together as compactly, creating wasted space at the bottom of the die.
For some reason, the numerous initials on the original 386 die were removed.
Finally, the new die was labeled 80C386I with a copyright date of 1985, 1987; it is unclear what "C" and "I" indicate.</p>
<p><a href="https://static.righto.com/images/386-versions/shrink.jpg"><img alt="Comparison of the 1.5 µm die and the 1 µm die at the same scale. Photos courtesy of Antoine Bercovici." class="hilite" height="442" src="https://static.righto.com/images/386-versions/shrink-w700.jpg" title="Comparison of the 1.5 µm die and the 1 µm die at the same scale. Photos courtesy of Antoine Bercovici." width="700" /></a><div class="cite">Comparison of the 1.5 µm die and the 1 µm die at the same scale. Photos courtesy of Antoine Bercovici.</div></p>
<p>The change from 1.5 µm to 1 µm may not sound significant, but it reduced the die size by
60%.
This allowed more dies on a wafer, substantially dropping the manufacturing cost.<span id="fnref:wafer"><a class="ref" href="#fn:wafer">12</a></span>
The strategy of shrinking a processor to a new process before designing a new microarchitecture
for the process became Intel's <a href="https://retailedge.intel.com/content/pdf/asmo/201303art_computerpoweruser.pdf">tick-tock</a> strategy.</p>
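<p>The arithmetic behind the shrink is worth spelling out: area scales with the square of the linear feature size, so even a modest linear shrink compounds. A back-of-the-envelope check:</p>

```python
# Shrinking linear features from 1.5 um to 1 um scales die area by the
# square of the linear ratio.
linear_ratio = 1.0 / 1.5
area_ratio = linear_ratio ** 2
print(f"area scales to {area_ratio:.0%} of the original")   # 44%
print(f"ideal-scaling reduction: {1 - area_ratio:.0%}")     # 56%
# The roughly 60% reduction of the actual die is slightly better than
# pure scaling, consistent with the layout changes described above.
```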
<h2>The 386 SX</h2>
<p>In 1988, Intel introduced the 386 SX processor, the low-cost version of the 386,
with a 16-bit bus instead of a 32-bit bus.
(This is reminiscent of the 8088 processor with an 8-bit bus versus the 8086 processor
with a 16-bit bus.)
According to the <a href="https://archive.computerhistory.org/resources/access/text/2015/06/102702019-05-01-acc.pdf#page=25">386 oral history</a>,
the cost of the original 386 die decreased to the point where the chip's package cost about as
much as the die.
By reducing the number of pins, the 386 SX could be put in a one-dollar plastic package
and sold for a considerably reduced price.
The SX allowed Intel to segment the market, moving low-end customers <a href="https://books.google.com/books?id=fSMW_RgWwIgC&lpg=PT95&pg=PT94#v=onepage&q&f=false">from the 286</a> to the 386 SX, while preserving the
higher sales price of the original 386, now called the DX.<span id="fnref:286"><a class="ref" href="#fn:286">13</a></span>
<a href="https://www.nytimes.com/1988/06/21/science/personal-computers-new-chip-less-cost-for-power.html">In 1988</a>, Intel sold the 386 SX for $219, at least $100 less than the 386 DX.
A complete SX computer could be $1000 cheaper than a similar DX model.</p>
<!-- p25 Shrink of 386.-->
<!-- p25 claim that the 8-bit bus of the 8088 was what tipped the IBM PC processor choice to Intel -->
<p>For compatibility with older 16-bit peripherals, the original 386 was designed to support a mixture of 16-bit and 32-bit buses, dynamically
switching on a cycle-by-cycle basis if needed.
Because 16-bit support was built into the 386, the 386 SX didn't require much design work.
(Unlike the 8088, which required a redesign of the 8086's bus interface unit.)</p>
<p>The 386 SX was built at both 1.5 µm and 1 µm.
The diagram below compares the two sizes of the 386 SX die.
These photos may look identical to the 386 DX photos in the previous section,
but close examination shows a few differences.
Since the 386 SX uses fewer pins, it has fewer bond pads, eliminating the staggered pads of
the shrunk 386 DX.
There are a few differences at the bottom of the chip, with wiring in much of the 386 DX's
wasted space.</p>
<p><a href="https://static.righto.com/images/386-versions/dies-sx.jpg"><img alt="Comparison of two dies for the 386 SX. Photos courtesy of Antoine Bercovici." class="hilite" height="462" src="https://static.righto.com/images/386-versions/dies-sx-w700.jpg" title="Comparison of two dies for the 386 SX. Photos courtesy of Antoine Bercovici." width="700" /></a><div class="cite">Comparison of two dies for the 386 SX. Photos courtesy of Antoine Bercovici.</div></p>
<p>Comparing the two SX revisions,
the larger die is labeled "80P9"; Intel's internal name for the chip was "P9", using their
confusing series of <a href="https://www.righto.com/search?q=i960#fn:processor-numbers">P numbers</a>.
The shrunk die is labeled "80386SX", which makes more sense.
The larger die is copyright 1985, 1987, while the shrunk die (which should be newer) is copyright 1985 for some reason.
The larger die has mostly the same initials as the DX, with a few changes.
The shrunk die has about 21 sets of initials.</p>
<h2>The 386 SL die</h2>
<p>The 386 SL (1990) was a major extension to the 386, combining a 386 core and other functions on one chip to save power and space.
Named "SuperSet", it was designed to corner the notebook PC market.<span id="fnref:superset"><a class="ref" href="#fn:superset">14</a></span>
The 386 SL chip included an ISA bus controller, power management logic, a cache controller
for an external cache, and the main memory controller.</p>
<p>Looking at the die photo below, the 386 core itself takes up about 1/4 of the SL's die.
The 386 core is very close to the standard 386 DX, but there are a few visible differences.
Most visibly, the bond pads and pin drivers have been removed from the core.
There are also some circuitry changes. For instance, the 386 SL core supports the <a href="https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/060805.PDF">System Management
Mode</a>, which suspends normal execution, allowing power management and other low-level hardware
tasks to be performed outside the regular operating system.
System Management Mode is now a standard part of the x86 line, but it was introduced in the 386 SL.</p>
<!-- 386 SL system management patent https://patents.google.com/patent/US5175853A/en?oq=+US5175853A -->
<p><a href="https://static.righto.com/images/386-versions/386sl-die-labeled.jpg"><img alt="The 386 SL die with functional blocks labeled. Die photo courtesy of Antoine Bercovici." class="hilite" height="606" src="https://static.righto.com/images/386-versions/386sl-die-labeled-w600.jpg" title="The 386 SL die with functional blocks labeled. Die photo courtesy of Antoine Bercovici." width="600" /></a><div class="cite">The 386 SL die with functional blocks labeled. Die photo courtesy of Antoine Bercovici.</div></p>
<p>In total, the 386 SL contains 855,000 transistors,<span id="fnref:quickref"><a class="ref" href="#fn:quickref">15</a></span> over 3 times as many as the regular 386 DX.
The cache tag RAM takes up a lot of space and transistors.
The cache data itself is external; this on-chip circuitry just manages the cache.
The other new components are largely implemented with standard-cell logic (discussed below); this is visible as uniform stripes
of circuitry, most clearly in the ISA bus controller.</p>
<h2>A brief history of the 386</h2>
<p>From the modern perspective, it seems obvious for Intel to extend the x86 line from the
286 to the 386, while keeping backward compatibility.
But at the time, this path was anything but clear.
This history starts in the late 1970s, when Intel decided to build a "micromainframe" processor, an advanced 32-bit
processor for object-oriented programming that had objects, interprocess communication,
and memory protection implemented in the CPU.
This overly ambitious project fell behind schedule, so Intel created a stopgap processor to
sell until the micromainframe processor was ready.
This stopgap processor was the 16-bit 8086 processor (1978).</p>
<p>In 1981, IBM decided to use the Intel 8088 (an 8086 variant) in the IBM Personal Computer (PC),
but Intel did not realize the importance of this at the time.
Instead, Intel was focused on their
micromainframe processor, also released in 1981 as the iAPX 432, but this became
"one of the great disaster stories of modern computing" as the <a href="https://archive.nytimes.com/www.nytimes.com/library/tech/98/04/biztech/articles/05merced.html">New York Times</a> called it.
Intel then reimplemented the ideas of the
ill-fated iAPX 432 on top of a RISC architecture, creating the more successful <a href="https://www.righto.com/2023/07/the-complex-history-of-intel-i960-risc.html">i960</a>. </p>
<p>Meanwhile, things weren't going well at first for the 286 processor, the follow-on to the 8086<span id="fnref:186"><a class="ref" href="#fn:186">16</a></span>.
Bill Gates and others called its design "brain-damaged".
IBM was unenthusiastic about the 286 for their own reasons.<span id="fnref:ibm-286"><a class="ref" href="#fn:ibm-286">17</a></span>
As a result, the 386 project was a low priority for Intel and the 386 team felt that it was the
"stepchild"; internally, the 386 was pitched as another <a href="https://archive.computerhistory.org/resources/text/Oral_History/Intel_386_Business_Strategy/102701962.05.01.pdf#page=14">stopgap</a>, not Intel's "official" 32-bit processor.</p>
<p>Despite the lack of corporate enthusiasm, the 386 team came up with two proposals to
extend the 286 to a 32-bit architecture.
The first was a minimal approach to extend the existing registers and
address space to 32 bits.
The more ambitious proposal would add more registers and create a 32-bit instruction set that
was significantly different from the 8086's 16-bit instruction set.
At the time, the IBM PC was still relatively new, so the importance of the installed
base of software wasn't obvious; software compatibility was viewed as a "nice to have" feature rather than essential.
After much debate, the decision was made around the end of 1982 to go with the minimal proposal,
but supporting both segments and flat addressing, while keeping compatibility with the 286.</p>
<!--
A key design challenge was if it was possible to be compatible with the 8086's segment-based addressing, while
supporting the flat 32-bit address space that would support Unix well.
-->
<p>By 1984, though, the PC industry was booming and the 286 was proving to be a success.
This produced enormous political benefits for the 386 team, who saw the project change from
"stepchild" to "king". <!-- p17 -->
Intel introduced the 386 in 1985, which was otherwise
"a miserable year for Intel and the rest of the semiconductor industry,"
as Intel's <a href="https://www.intel.com/content/dam/doc/report/history-1985-annual-report.pdf">annual report</a> put it.
Due to an industry-wide business slowdown, Intel's net income "essentially disappeared."
Moreover, facing heavy competition from Japan, Intel dropped out of the DRAM business, a crushing blow
for a company that got its start in the memory industry.
Fortunately, the 386 would change everything.</p>
<p>Given IBM's success with the IBM PC, Intel was puzzled that IBM wasn't interested in the 386 processor, but IBM had a strategy of their own.<span id="fnref:ibm"><a class="ref" href="#fn:ibm">18</a></span>
By this time, the IBM PC was being cloned by many competitors, but IBM had a plan to regain
control of the PC architecture and thus the market: in 1987, IBM introduced the PS/2 line.
These new computers ran the OS/2 operating system instead of Windows and used the proprietary Micro Channel architecture.<span id="fnref:ps2"><a class="ref" href="#fn:ps2">19</a></span>
IBM used multiple engineering and legal strategies to make cloning the PS/2 slow, expensive, and risky,
so IBM expected they could take back the market from the clones.</p>
<!-- https://doi.org/10.1109/CMPCON.1988.4889 -->
<p>Compaq took the risky approach of ignoring IBM and following their own architectural direction.<span id="fnref:compaq"><a class="ref" href="#fn:compaq">20</a></span>
Compaq introduced the high-end Deskpro 386 line in <a href="https://www.nytimes.com/1986/09/10/business/company-news-compaq-introduces-more-powerful-pc.html">September 1986</a>, becoming the first major company to build 386-based computers.
An "executive" system, the Deskpro 386 model 40 had a 40-megabyte hard drive and sold for $6449 (over $15,000 in current dollars).
Compaq's <a href="https://www.nytimes.com/1987/09/20/business/the-executive-computer-compaq-s-gamble-on-an-advanced-chip-pays-off.html">gamble paid off</a>
and the Deskpro 386 was a rousing success.</p>
<!--
![The Compaq Deskpro 386. From a <a href="https://archive.org/details/personalcomputercompaq/page/n4/mode/1up">Compaq brochure</a>.](deskpro-386.jpg "w400")
-->
<p><a href="https://static.righto.com/images/386-versions/compaq-386.jpg"><img alt="The Compaq Deskpro 386 in front of the 386 processor (not to scale). From PC Tech Journal, 1987. Curiously, the die image of the 386 has been mirrored, as can be seen both from the positions of the microcode ROM and instruction decoder at the top as well as from the position of the cut corner of the package." class="hilite" height="548" src="https://static.righto.com/images/386-versions/compaq-386-w400.jpg" title="The Compaq Deskpro 386 in front of the 386 processor (not to scale). From PC Tech Journal, 1987. Curiously, the die image of the 386 has been mirrored, as can be seen both from the positions of the microcode ROM and instruction decoder at the top as well as from the position of the cut corner of the package." width="400" /></a><div class="cite">The Compaq Deskpro 386 in front of the 386 processor (not to scale). From <a href="https://archive.org/details/PC_Tech_Journal_vol05_n03/page/n51/mode/2up">PC Tech Journal</a>, 1987. Curiously, the die image of the 386 has been mirrored, as can be seen both from the positions of the microcode ROM and instruction decoder at the top as well as from the position of the cut corner of the package.</div></p>
<!-- BYTE Technical implications of PS/2 https://www.ardent-tool.com/docs/scans/BYTE_1987_10_Technical_Implications_of_PS2.pdf -->
<p>As for IBM, the PS/2 line was largely unsuccessful and failed to become the standard.
Rather than
regaining control over the PC,
"IBM lost control of the PC standard in 1987 when it introduced its PS/2 line of systems."<span id="fnref:logic"><a class="ref" href="#fn:logic">21</a></span>
IBM exited the PC market in <a href="https://www.smh.com.au/business/ibms-exit-from-pc-business-seen-as-turning-point-20041209-gdka6w.html">2004</a>, selling the business to Lenovo.
One slightly hyperbolic <a href="https://amzn.to/3ZyK7nq">book title</a> summed it up: "Compaq Ended IBM's PC Domination and Helped Invent Modern Computing".
The 386 was a huge moneymaker for Intel, leading to Intel's first billion-dollar quarter in 1990.
It cemented the importance of the x86 architecture, not just for Intel but for the entire
computing industry, dominating the market up to the present day.<span id="fnref:marketshare"><a class="ref" href="#fn:marketshare">22</a></span></p>
<h2>How the 386 was designed</h2>
<p>The design process of the 386 is interesting because it illustrates Intel's migration
to automated design systems and heavier use of simulation.<span id="fnref:design-and-test"><a class="ref" href="#fn:design-and-test">23</a></span>
At the time, Intel was behind the industry in its use of tools so the leaders of the 386
realized that more automation would be necessary to build a complex chip like the 386 on schedule.
By making a large investment in automated tools, the 386 team completed the design ahead of schedule.
Along with proprietary CAD tools, the team made heavy use of standard Unix tools such as <code>sed</code>, <code>awk</code>, <code>grep</code>, and <code>make</code> to manage the various design databases.</p>
<p>The 386 posed new design challenges compared to the previous 286 processor.
The 386 was much more complex, with twice the transistors.
But the 386 also used fundamentally different circuitry.
While the 286 and earlier processors were built from NMOS transistors, the 386 moved to
CMOS (the technology still used today). Intel's CMOS process was called CHMOS-III
(complementary high-performance metal-oxide-silicon) and had a feature size of 1.5 µm.
CHMOS-III was based on Intel's HMOS-III process (used for the 286), but extended to
CMOS. Moreover, the CHMOS process provided two layers of metal instead of one, changing
how signals were routed on the chip and requiring new design techniques.</p>
<p>The diagram below shows a cross-section through a CHMOS-III circuit, with an NMOS
transistor on the left and a PMOS transistor on the right.
Note the jagged three-dimensional topography that is formed as layers cross each other
(unlike modern polished wafers).
This resulted in the
"<a href="https://archive.computerhistory.org/resources/access/text/2015/06/102702019-05-01-acc.pdf#page=21">forbidden gap</a>" problem that caused difficulty for the 386 team.
Specifically, second-layer metal (M2) could be close to the first-layer metal (M1) or far apart,
but an in-between distance would cause problems: the forbidden gap.
If the metal layer crossed in the "forbidden gap", the metal could crack and whiskers of metal
would touch, causing the chip to fail.
These problems reduced the yield of the 386.</p>
<p><a href="https://static.righto.com/images/386-versions/process-cross-section.jpg"><img alt="A cross-section of circuitry formed with the CHMOS-III process. From A double layer metal CHMOS III technology." class="hilite" height="203" src="https://static.righto.com/images/386-versions/process-cross-section-w500.jpg" title="A cross-section of circuitry formed with the CHMOS-III process. From A double layer metal CHMOS III technology." width="500" /></a><div class="cite">A cross-section of circuitry formed with the CHMOS-III process. From <a href="https://doi.org/10.1109/IEDM.1984.190640">A double layer metal CHMOS III technology</a>.</div></p>
<!-- -->
<p>The design of the 386 proceeded both top-down, starting with the architecture definition,
and bottom-up, designing standard cells and other basic circuits at the transistor level.
The processor's microcode, the software that controlled the chip, was a fundamental component.
It was designed with two CAD tools: an assembler and microcode rule checker.
The high-level design of the chip (register-level RTL)
was created and refined until clock-by-clock and phase-by-phase timing were represented.
The RTL was programmed in <a href="http://www.bitsavers.org/pdf/xidak/mainsail_v12/Mainsail_Tutorial_Volume_1_Mar89.pdf">MAINSAIL</a>, a portable Algol-like language based
on SAIL (Stanford Artificial Intelligence Language).
Intel used a proprietary simulator called Microsim to simulate the RTL, stating that
full-chip RTL simulation was "the single most important simulation model of the 80386".</p>
<p>The next step was to convert this high-level design into a detailed logic design, specifying
the gates and other circuitry
using Eden, a proprietary schematics-capture system.
Simulating the logic design required
a dedicated IBM 3083 mainframe, which checked the logic design against the
RTL simulations.
Next, the circuit design phase created the transistor-level design.
The chip layout was performed on <a href="https://www.shapr3d.com/history-of-cad/applicon">Applicon</a> and Eden graphics systems.
The layout started with critical blocks such as the ALU and barrel shifter.
To meet the performance requirements, the TLB (translation lookaside
buffer) for the paging mechanism required a creative design, as did the binary adders.</p>
<p><a href="https://static.righto.com/images/386-versions/standard-cell-colored.jpg"><img alt="Examples of standard cells used in the 386. From "Automatic Place and Route Used on the 80386" by Joseph Krauskopf and Pat Gelsinger. I have added color." class="hilite" height="205" src="https://static.righto.com/images/386-versions/standard-cell-colored-w700.jpg" title="Examples of standard cells used in the 386. From "Automatic Place and Route Used on the 80386" by Joseph Krauskopf and Pat Gelsinger. I have added color." width="700" /></a><div class="cite">Examples of standard cells used in the 386. From "Automatic Place and Route Used on the 80386" by Joseph Krauskopf and Pat Gelsinger, Intel Technology Journal spring 1986. I have added color.</div></p>
<p>The "random" (unstructured) logic was implemented with standard cells, rather than the
transistor-by-transistor design of earlier processors.
The idea of standard cells is to have fixed blocks of circuitry (above) for logic gates, flip-flops,
and other basic functions.<span id="fnref:standard-cell-library"><a class="ref" href="#fn:standard-cell-library">24</a></span>
These cells are arranged in rows by software to implement the specified logic description.
The space between the rows is used as a wiring channel for connections between the cells.
The disadvantage of a standard cell layout is that it generally takes up more space than
an optimized hand-drawn layout, but it is much faster to create and easier to modify.</p>
<p>These standard cells are visible in the die as regular rows of circuitry.
Intel used the <a href="https://ee.sharif.edu/~asic/References/Physical%20Design%20Papers/timberwolf-P2.pdf">TimberWolf</a> automatic placement and routing package, which used simulated annealing to
optimize the placement of cells.
TimberWolf was built by a Berkeley <a href="http://www.aycinena.com/index2/index3/archive/uw%20-%20sechen.html">grad student</a>; one 386 engineer said,
"If management had known that we were using a tool by some grad
student as the key part of the methodology, they would never have let us use it."
Automated layout was a new thing at Intel; using it improved the schedule, but
the lower density raised the risk that the chip would be too large.</p>
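<p>To give a feel for the simulated-annealing idea, here is a toy placer in the spirit of TimberWolf (a rough sketch of mine, not the real algorithm; TimberWolf's cost function, move set, and cooling schedule were far more sophisticated). Cells occupy slots in a row, nets connect pairs of cells, and the cost is total wire length. A swap that worsens the cost is still accepted with probability exp(-delta/T), which lets the placement escape local minima as the temperature T cools:</p>

```python
import math
import random

def wirelength(positions, nets):
    # cost: total distance spanned by all nets
    return sum(abs(positions[a] - positions[b]) for a, b in nets)

def place(num_cells, nets, steps=20000, t0=10.0, cooling=0.9995, seed=0):
    rng = random.Random(seed)
    pos = list(range(num_cells))
    rng.shuffle(pos)                      # start from a scrambled placement
    cost, t = wirelength(pos, nets), t0
    for _ in range(steps):
        i, j = rng.randrange(num_cells), rng.randrange(num_cells)
        pos[i], pos[j] = pos[j], pos[i]   # propose swapping two cells
        new_cost = wirelength(pos, nets)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            cost = new_cost               # accept the move (maybe uphill)
        else:
            pos[i], pos[j] = pos[j], pos[i]  # reject: undo the swap
        t *= cooling                      # cool the temperature
    return pos, cost

# A chain of nets: the optimum puts connected cells in adjacent slots,
# for a total wire length of 9.
nets = [(k, k + 1) for k in range(9)]
final_pos, final_cost = place(10, nets)
print(final_cost)  # at or near the optimum of 9
```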
<p><a href="https://static.righto.com/images/386-versions/standard-cells.jpg"><img alt="Standard cells in the 386. Each row consists of numerous standard cells packed together. Each cell is a simple circuit such as a logic gate or flip flop. The wide wiring channels between the rows hold the wiring that connects the cells. This block of circuitry is in the bottom center of the chip." class="hilite" height="308" src="https://static.righto.com/images/386-versions/standard-cells-w350.jpg" title="Standard cells in the 386. Each row consists of numerous standard cells packed together. Each cell is a simple circuit such as a logic gate or flip flop. The wide wiring channels between the rows hold the wiring that connects the cells. This block of circuitry is in the bottom center of the chip." width="350" /></a><div class="cite">Standard cells in the 386. Each row consists of numerous standard cells packed together. Each cell is a simple circuit such as a logic gate or flip flop. The wide wiring channels between the rows hold the wiring that connects the cells. This block of circuitry is in the bottom center of the chip.</div></p>
<p>The data path consists of the registers, ALU (Arithmetic Logic Unit), barrel shifter,
and multiply/divide unit that process the 32-bit data.
Because the data path is critical to the performance of the system,
it was laid out by hand using a CALMA system.
The designers could optimize the layout, taking advantage of regularities in the circuitry,
optimizing the shape and size of each transistor and fitting them together like puzzle pieces.
The data path is visible on the left side of the die, forming orderly 32-bit-wide rectangles
in contrast to the tangles of logic next to it.</p>
<p>Once the transistor-level layout was complete,
Intel's Hierarchical Connectivity Verification System checked that the final layout matched
the schematics and adhered to the process design rules.
The 386 set an Intel speed record, taking just 11 days from completing the layout to "tapeout",
when the chip data is sent on magnetic tape to the mask fabrication company.
(The tapeout team was led by Pat Gelsinger, who later became CEO of Intel.)
After the glass masks were created using
an electron-beam process, Intel's "Fab 3" in Livermore (the first to wear the <a href="https://www.elivermore.com/photos/intel_fab3.htm">bunnysuits</a>) produced the 386 silicon wafers.</p>
<p>Chip designers like to claim that their chip worked the first time, but that was not
the case for the 386. <!-- p19 -->
When the team received the first silicon for the 386, they ran a trivial do-nothing test
program, "NoOp, NoOp, Halt", and it failed.
Fortunately, they found a small fix to a PLA (Programmable Logic Array). Rather than create new masks, they were able to
patch the existing mask with ion milling and get new wafers quickly. <!-- p20 -->
These wafers worked well enough that they could start the long cycles of debugging and fixing.</p>
<p>Once the processor was released, the problems weren't over.<span id="fnref:bugs"><a class="ref" href="#fn:bugs">25</a></span>
Some early 386 processors had a <a href="https://groups.google.com/g/comp.arch/c/qm7FVR_YV4A/m/VKR47yLoUdEJ">32-bit multiply problem</a>, where some arguments would
unpredictably produce the wrong results under particular temperature/voltage/frequency conditions.
(This is unrelated to the famous <a href="https://en.wikipedia.org/wiki/Pentium_FDIV_bug">Pentium FDIV bug</a> that cost Intel $475 million.)
The root cause was a layout problem, not a logic problem; they didn't allow enough margin to
handle the worst case data in combination with manufacturing process and environment factors.
This tricky problem didn't show up in simulation or chip verification, but was only found in
stress testing.
Intel sold the faulty processors, but marked them as only valid for 16-bit software, while marking the
good processors with a double sigma, as seen below.<span id="fnref:xbts"><a class="ref" href="#fn:xbts">26</a></span>
This led to embarrassing headlines such as <a href="https://archive.org/details/bub_gb_mDsEAAAAMBAJ/page/n5/mode/1up?view=theater">Some 386 Systems Won't Run 32-Bit Software, Intel Says</a>.
The multiply bug also caused a shortage of 386 chips in
<a href="https://books.google.com/books?id=u5dYmhF7jc4C&lpg=PA96&pg=PA96#v=onepage&q&f=false">1987</a>
and <a href="https://books.google.com/books?id=Av0xuNrEJakC&lpg=RA6-PT4&pg=RA6-PT4#v=onepage&q&f=false">1988</a> as Intel redesigned the chip to fix the bug.
Overall, the 386's issues probably weren't any worse than those of other processors, and the problems were soon
forgotten.</p>
<p><a href="https://static.righto.com/images/386-versions/steppings.jpg"><img alt="Bad and good versions of the 386. Note the labels on the bottom line. Photos (L), (R) by Thomas Nguyen, (CC BY-SA 4.0)." class="hilite" height="257" src="https://static.righto.com/images/386-versions/steppings-w500.jpg" title="Bad and good versions of the 386. Note the labels on the bottom line. Photos (L), (R) by Thomas Nguyen, (CC BY-SA 4.0)." width="500" /></a><div class="cite">Bad and good versions of the 386. Note the labels on the bottom line. Photos (<a href="https://commons.wikimedia.org/wiki/File:Intel_A80386-16_16_bit_SW_Only.jpg">L</a>), (<a href="https://commons.wikimedia.org/wiki/File:Intel_A80386-16_%CE%A3%CE%A3.jpg">R</a>) by Thomas Nguyen</a>, (<a href="https://creativecommons.org/licenses/by-sa/4.0/deed.en">CC BY-SA 4.0</a>).</div></p>
<h2>Conclusions</h2>
<p><a href="https://static.righto.com/images/386-versions/wall.jpg"><img alt="A 17-foot tall plot of the 386. The datapath is on the left and the microcode is in the lower right. It is unclear if this is engineering work or an exhibit at MOMA. Image spliced together from the 1985 annual report." class="hilite" height="621" src="https://static.righto.com/images/386-versions/wall-w600.jpg" title="A 17-foot tall plot of the 386. The datapath is on the left and the microcode is in the lower right. It is unclear if this is engineering work or an exhibit at MOMA. Image spliced together from the 1985 annual report." width="600" /></a><div class="cite">A 17-foot tall plot of the 386. The datapath is on the left and the microcode is in the lower right. It is unclear if this is engineering work or an exhibit at MOMA. Image spliced together from the <a href="https://www.intel.com/content/dam/doc/report/history-1985-annual-report.pdf">1985 annual report</a>.</div></p>
<p>The 386 processor was a key turning point for Intel.
Intel's previous processors sold very well, but this was largely due to heavy marketing
("<a href="https://www.computerhistory.org/collections/catalog/102746836">Operation Crush</a>") and the
good fortune to be selected for the IBM PC.
Intel was technologically behind the competition, especially Motorola.
Motorola had introduced the 68000 processor in 1979, starting a powerful line of
(more-or-less) 32-bit processors.
Intel, on the other hand, lagged with the "brain-damaged" 16-bit 286 processor in 1982.
Intel was also slow with the transition to CMOS; Motorola had moved to CMOS in 1984 with the 68020.</p>
<p>The 386 provided the necessary technological boost for Intel, moving to a 32-bit architecture,
transitioning to CMOS, and fixing the 286's memory model and multitasking limitations, while
maintaining compatibility with the earlier x86 processors.
The overwhelming success of the 386 solidified the dominance of the x86 and Intel, and put
other processor manufacturers on the defensive.
Compaq used the 386 to take over PC architecture leadership from IBM, leading to the success of Compaq, Dell, and other
companies, while IBM eventually departed the PC market entirely.
Thus, the 386 had an outsized effect on the computer industry, shaping the winners and losers for decades.</p>
<!--
Moreover, as a result of the 386, IBM lost control of the IBM PC architecture to Compaq.
-->
<p>I plan to write more about the 386, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I'm also on Mastodon occasionally as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Acknowledgements: The die photos are courtesy of Antoine Bercovici; you should follow him on Twitter as <a href="https://twitter.com/Siliconinsid">@Siliconinsid</a>.<span id="fnref:s-specs"><a class="ref" href="#fn:s-specs">27</a></span>
Thanks to Pat Gelsinger and Roxanne Koester for providing helpful papers.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:second-source">
<p>The 386 also changed the industry because Intel abandoned the standard practice of <a href="https://archive.computerhistory.org/resources/text/Oral_History/Intel_386_Business_Strategy/102701962.05.01.pdf">second sourcing</a> (allowing other companies to manufacture a chip).
AMD, for example, had been a second source for the 286.
But Intel decided to keep production of the 386 to themselves.
Intel ended up licensing the 386 to IBM, though, as the <a href="https://en.wikipedia.org/wiki/IBM_386SLC">IBM 386SLC</a>. Despite the name, this was the 386 SX, not the 386 SL. <a class="footnote-backref" href="#fnref:second-source" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:keychains">
<p>Intel made various keychains containing the 386 die, as shown at <a href="https://www.cpu-world.com/Memorabilia/80386/index.html">CPU World</a>.
If you know where to look, it is easy to distinguish the variants.
In particular, look at the instruction decoders above the microcode and see if they are
oriented vertically (pre-shrink 386) or horizontally (post-shrink 386). <a class="footnote-backref" href="#fnref:keychains" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:variants">
<p>The naming of the 386 versions is a bit of a mess. The 386 started as the 80386
and later the i386.
The 80386SX was introduced in 1988; this is the version with a 16-bit bus.
The "regular" 386 was then renamed the DX to distinguish it from the SX.
There are several other versions of the 386 that I won't discuss here, such as the EX, CXSB, and 80376. See <a href="https://en.wikipedia.org/wiki/I386#Models_and_variants">Wikipedia</a> for details.</p>
<p>Confusingly, the 486 also used the SX and DX names, but in a different way.
The 486 DX was the original that included a floating-point unit,
while floating-point was disabled in the 486 SX.
Thus, in both cases "DX" was the full chip, while "SX" was the low-cost version,
but the removed functionality was entirely different.</p>
<p>Another complication is that a 386DX chip will have a marking like "SX217",
but this has nothing to do with the 386 SX.
SX217 is an Intel <a href="https://en.wikichip.org/wiki/intel/s-spec">S-Specification</a> number,
which specifies the particular <a href="https://www.intel.com/content/www/us/en/support/articles/000057218/processors.html">stepping</a> of the processor,
indicating a manufacturing change or if a feature has been fixed or removed. <a class="footnote-backref" href="#fnref:variants" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:transistors">
<p>Counting transistors isn't as straightforward as you might think.
For example, a ROM may have a transistor for a 1 bit and no transistor for a 0 bit.
Thus, the number of transistors depends on the data stored in the ROM.
Likewise, a PLA has transistors present or absent in a grid, depending on the desired
logic functions.
For this reason, transistor counts are usually the number of "transistor sites",
locations that could have a transistor, even if a transistor is not physically present.
In the case of the 386, it has 285,000 transistor sites and 181,000 actual transistors
(<a href="https://doi.org/10.1109/MDT.1987.295165">source</a>), so over 100,000 reported transistors
don't actually exist.</p>
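<p>The counting convention can be illustrated with a toy sketch (hypothetical code, not anything from Intel): in a ROM that stores a 1 bit as a transistor and a 0 bit as a gap, the site count is fixed by the array size, while the physical count depends on the stored data.</p>

```python
def rom_transistor_counts(rom_bits):
    """Toy model of transistor counting in a ROM: every bit position is a
    'transistor site', but only the 1 bits are physical transistors."""
    sites = len(rom_bits)      # locations that could hold a transistor
    physical = sum(rom_bits)   # transistors actually present (the 1 bits)
    return sites, physical

# An 8-bit ROM word storing 10110100 has 8 sites but only 4 transistors.
sites, physical = rom_transistor_counts([1, 0, 1, 1, 0, 1, 0, 0])
```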
<p>I'll also point out that most sources claim 275,000 transistors for the 386.
My assumption is that 285,000 is the more accurate number (since this source
distinguishes between transistor sites and physical transistors), while 275,000 is
the rounded number. <a class="footnote-backref" href="#fnref:transistors" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:parallelism">
<p>The 386's independent, pipelined functional units provide a significant performance
improvement, and
the pipeline can have up to eight instructions in progress at one time. <!-- gelsinger -->
For instance, the 386's microcode engine permits some overlap between the end of one instruction and the beginning
of the next, an overlap that speeds up the processor by about 9%. <!-- el-ayat1985 -->
But note that instructions are still executed sequentially,
taking multiple clocks per instruction,
so it is nothing like the superscalar execution introduced in the Pentium. <a class="footnote-backref" href="#fnref:parallelism" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:block-diagram">
<p>The diagram of the 386 die shows eight functional units. It can be compared to the
block diagram below, which shows how the units are interconnected.</p>
<p><a href="https://static.righto.com/images/386-versions/386-block-diagram.jpg"><img alt="Block diagram of the 386. From The Intel 80386—Architecture and Implementation." class="hilite" height="387" src="https://static.righto.com/images/386-versions/386-block-diagram-w500.jpg" title="Block diagram of the 386. From The Intel 80386—Architecture and Implementation." width="500" /></a><div class="cite">Block diagram of the 386. From <a href="https://doi.org/10.1109/MM.1985.304507">The Intel 80386—Architecture and Implementation</a>.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:block-diagram" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:die-diagram">
<p>My labeled die diagram combines information from two Intel papers:
<a href="https://doi.org/10.1109/10.1109/MM.1985.304507">The Intel 80386—Architecture and Implementation</a> and
<a href="https://doi.org/10.1109/MDT.1987.295165">Design and Test of the 80386</a>.
The former paper describes the eight functional units. The latter paper provides
more details, but only shows six functional units.
(The Control Unit and Data Unit are combined into the Execution Unit, while the
Protection Test Unit is dropped as an independent unit.)
Interestingly, the second paper is by Patrick Gelsinger, who is now CEO of Intel.
Pat Gelsinger also wrote "80386 Tapeout - Giving Birth to an Elephant", which says there are
nine functional units. I don't know what the ninth unit is, maybe the substrate bias generator?
In any case, the count of functional units is flexible.</p>
<p><a href="https://static.righto.com/images/386-versions/gelsinger-bio.jpg"><img alt="Patrick Gelsinger's biography from his 80386 paper." class="hilite" height="344" src="https://static.righto.com/images/386-versions/gelsinger-bio-w300.jpg" title="Patrick Gelsinger's biography from his 80386 paper." width="300" /></a><div class="cite">Patrick Gelsinger's biography from his 80386 paper.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:die-diagram" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:queue">
<p>The 386 has a 16-byte prefetch queue, but apparently only 12 bytes are used due to
a pipeline bug (<a href="https://web.archive.org/web/20230103235532/http://www.rcollins.org/secrets/PrefetchQueue.html">details</a>). <a class="footnote-backref" href="#fnref:queue" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:protection">
<p>Static checks for access violations are performed by the Protection Test Unit, while dynamic checks are performed by the Segment Unit and the Paging Unit. <a class="footnote-backref" href="#fnref:protection" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:cache">
<p>The 386 was originally supposed to have an on-chip cache, but there wasn't room and the cache was dropped in the middle of the project.
As it was, the 386 die barely fit into the lithography machine's field of view. <a class="footnote-backref" href="#fnref:cache" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:et">
<p>It kind of looks like the die has the initials ET next to a telephone.
Could this be a reference to the movie E.T. and its catchphrase "E.T. phone home"?
"SEC" must be senior mask designer Shirley Carter. "KF" is engineer Kelly Fitzpatrick.
"PSR" is probably Paul S. Ries who designed the 386's paging unit. <a class="footnote-backref" href="#fnref:et" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:wafer">
<p>I think that Intel used a 6" (150mm) wafer for the 386. With a 10mm×10mm die, about
128 chips would fit on a wafer.
But with a 6mm×6.5mm die, about 344 would fit on a wafer, over 2.5 times as many.
(See <a href="https://www.silicon-edge.co.uk/j/index.php/resources/die-per-wafer">Die per wafer estimator</a>.) <a class="footnote-backref" href="#fnref:wafer" title="Jump back to footnote 12 in the text">↩</a></p>
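<p>The arithmetic can be sketched with a first-order estimate (my approximation, not the linked estimator's exact method, which also accounts for scribe lines and edge exclusion, so its numbers differ somewhat):</p>

```python
import math

def dies_per_wafer(wafer_mm, die_w_mm, die_h_mm):
    """First-order die-per-wafer estimate: usable wafer area divided by
    die area, minus a correction for partial dies lost at the wafer edge."""
    die_area = die_w_mm * die_h_mm
    wafer_area = math.pi * (wafer_mm / 2) ** 2
    edge_loss = math.pi * wafer_mm / math.sqrt(2 * die_area)
    return int(wafer_area / die_area - edge_loss)

big = dies_per_wafer(150, 10, 10)    # original 10mm x 10mm die
small = dies_per_wafer(150, 6, 6.5)  # shrunk 6mm x 6.5mm die
# The shrink yields well over 2.5 times as many dies per wafer.
```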
</li>
<li id="fn:286">
<p>The 286 remained popular compared to the 386, probably due to its lower price. It wasn't
until 1991 that the number of 386 units sold exceeded that of the 286 (<a href="https://archive.computerhistory.org/resources/access/text/2013/04/102723256-05-01-acc.pdf#page=326">source</a>).
Intel's revenue for the 386 was much, much higher than for the 286 though
(<a href="https://archive.computerhistory.org/resources/access/text/2013/04/102723256-05-01-acc.pdf#page=112">source</a>). <a class="footnote-backref" href="#fnref:286" title="Jump back to footnote 13 in the text">↩</a></p>
</li>
<li id="fn:superset">
<p>The "SuperSet" consisted of the 386 SL along with the 82360SL peripheral I/O chip.
The I/O chip contained various ISA bus peripherals, taking the place of multiple chips such
as the 8259 that dated back to the 8080 processor.
The I/O chip included DMA controllers, timers, interrupt controllers, a real time clock,
serial ports, and a parallel port.
It also had a hard disk interface, a floppy disk controller, and a keyboard controller. <a class="footnote-backref" href="#fnref:superset" title="Jump back to footnote 14 in the text">↩</a></p>
</li>
<li id="fn:quickref">
<p>The 386 SL transistor count is from the <a href="https://www.intel.com/pressroom/kits/quickreffam.htm">Intel Microprocessor Quick Reference Guide</a>,
which contains information on most of Intel's processors. <a class="footnote-backref" href="#fnref:quickref" title="Jump back to footnote 15 in the text">↩</a></p>
</li>
<li id="fn:186">
<p>The 186 processor doesn't fit cleanly into the sequence of x86 processors.
Specifically, the 186 is an incompatible side-branch, rather than something in the 286, 386, 486 sequence.
The 186 was essentially an 8086 that included additional functionality (clock generator, interrupt controller, timers, etc.) to make it more suitable for an embedded system.
The 186 was used in some personal computers, but it was incompatible with the IBM PC so it wasn't very popular. <a class="footnote-backref" href="#fnref:186" title="Jump back to footnote 16 in the text">↩</a></p>
</li>
<li id="fn:ibm-286">
<p>IBM didn't want to use the 286 because they were <a href="https://archive.computerhistory.org/resources/text/Oral_History/Intel_386_Business_Strategy/102701962.05.01.pdf#page=17">planning</a> to reverse-engineer the 286 and make their own 16-megahertz CMOS version.
This was part of IBM's plan to regain control of the PC architecture with the PS/2.
Intel told IBM that "the fastest path to a 16-megahertz CMOS 286 is the 386 because it
is CMOS and 16-megahertz", but IBM continued on their own 286 path.
Eventually, IBM gave up and used Intel's 286 in the PS/2. <a class="footnote-backref" href="#fnref:ibm-286" title="Jump back to footnote 17 in the text">↩</a></p>
</li>
<li id="fn:ibm">
<p>IBM might have been reluctant to support the 386 processor because of the risk of
<a href="https://dfarq.homeip.net/compaq-deskpro-386/#4">cutting into</a> sales of IBM's mid-range 4300 mainframe line.
An IBM 4381-2 system ran at about 3.3 <a href="http://www.roylongbottom.org.uk/mips.htm#anchorIBM6">MIPS</a>
and cost $500,000, about the same MIPS performance as a 386/16 system for under $10,000.
The systems aren't directly comparable, of course, but many customers
could use the 386 for a fraction of the price.
IBM's sales of 4300 and other systems <a href="https://www.nytimes.com/1987/02/11/business/where-ibm-hurts-the-most.html">declined sharply</a>
in 1987, but the decline was blamed on DEC's VAX systems.</p>
<p><a href="https://static.righto.com/images/386-versions/ibm-4381.jpg"><img alt="An IBM 4381 system. The 4381 processor is the large cabinet to the left of the terminals. The cabinets at the back are probably IBM 3380 disk drives. From an IBM 4381 brochure." class="hilite" height="330" src="https://static.righto.com/images/386-versions/ibm-4381-w500.jpg" title="An IBM 4381 system. The 4381 processor is the large cabinet to the left of the terminals. The cabinets at the back are probably IBM 3380 disk drives. From an IBM 4381 brochure." width="500" /></a><div class="cite">An IBM 4381 system. The 4381 processor is the large cabinet to the left of the terminals. The cabinets at the back are probably IBM 3380 disk drives. From an <a href="https://bitsavers.org/pdf/ibm/4381/G510-0947-0_IBMs_4381_Processor_Family_Jan85.pdf#page=2">IBM 4381 brochure</a>.</div></p>
<p><!-- Decline in 4300 sales in 1986 https://www.nytimes.com/1987/02/11/business/where-ibm-hurts-the-most.html
Concerns about 386 cutting into IBM's higher-end sales https://www.infoworld.com/article/2638538/intel-s-disruptive-move-with-the-386.html
--> <a class="footnote-backref" href="#fnref:ibm" title="Jump back to footnote 18 in the text">↩</a></p>
</li>
<li id="fn:ps2">
<p>The most lasting influence of the PS/2 was the round purple and green keyboard and mouse ports that were
used by most PCs until USB obsoleted them.
The PS/2 ports are still available on some motherboards and gaming computers.</p>
<p><a href="https://static.righto.com/images/386-versions/ps2-connectors.jpg"><img alt="The PS/2 keyboard and mouse ports on the back of a Gateway PC." class="hilite" height="139" src="https://static.righto.com/images/386-versions/ps2-connectors-w250.jpg" title="The PS/2 keyboard and mouse ports on the back of a Gateway PC." width="250" /></a><div class="cite">The PS/2 keyboard and mouse ports on the back of a Gateway PC.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:ps2" title="Jump back to footnote 19 in the text">↩</a></p>
</li>
<li id="fn:compaq">
<p>When Compaq introduced their 386-based system, "they warned IBM that it has but six months
to announce a similar machine or be supplanted as the market's standard setter."
(<a href="https://www.pcjs.org/blog/2019/06/18/">source</a>).
Compaq turned out to be correct. <a class="footnote-backref" href="#fnref:compaq" title="Jump back to footnote 20 in the text">↩</a></p>
</li>
<li id="fn:logic">
<p>The quote is from <a href="https://www.google.com/books/edition/Computer_Structure_and_Logic/soQbBQAAQBAJ?hl=en&gbpv=1&pg=PT54&printsec=frontcover">Computer Structure and Logic</a>. <a class="footnote-backref" href="#fnref:logic" title="Jump back to footnote 21 in the text">↩</a></p>
</li>
<li id="fn:marketshare">
<p>Whenever I mention x86's domination of the computing market, people bring up ARM, but
ARM has a lot more market share in people's minds than in actual numbers.
One research firm says that ARM has 15% of the laptop market share in 2023, expected to increase to 25% by 2027. (Surprisingly, Apple only has 90% of the ARM laptop market.)
In the server market, just an estimated 8% of CPU shipments in 2023 were ARM.
See <a href="https://www.counterpointresearch.com/arm-based-pcs-to-nearly-double-market-share-by-2027/">Arm-based PCs to Nearly Double Market Share by 2027</a>
and <a href="https://www.digitimes.com/news/a20230217VL209/amd-arm-digitimes-research-intel-meet-the-analyst.html">Digitimes</a>.
(Of course, mobile phones are almost entirely ARM.) <a class="footnote-backref" href="#fnref:marketshare" title="Jump back to footnote 22 in the text">↩</a></p>
</li>
<li id="fn:design-and-test">
<p>Most of my section on the 386 design process is based on <a href="https://doi.org/10.1109/MDT.1987.295165">Design and Test of the 80386</a>.
The <a href="https://archive.computerhistory.org/resources/access/text/2015/06/102702019-05-01-acc.pdf">386 oral history</a> also provides information on the design process.
The article <a href="https://doi.org/10.1109/MSSC.2010.937691">Such a CAD!</a>
also describes Intel's CAD systems.
Amusingly, I noticed that one of its figures (below)
used a photo of the 386SL instead of the 386DX, with the result that the text is
completely wrong.
For instance, what it calls the microcode ROM is the cache tag RAM.</p>
<p><a href="https://static.righto.com/images/386-versions/such-a-cad.jpg"><img alt="Erroneous description of the 386 layout. I put an X through it so nobody reuses it." class="hilite" height="552" src="https://static.righto.com/images/386-versions/such-a-cad-w450.jpg" title="Erroneous description of the 386 layout. I put an X through it so nobody reuses it." width="450" /></a><div class="cite">Erroneous description of the 386 layout. I put an X through it so nobody reuses it.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:design-and-test" title="Jump back to footnote 23 in the text">↩</a></p>
</li>
<li id="fn:standard-cell-library">
<p>Intel has published a guide to their <a href="http://www.bitsavers.org/components/intel/_dataBooks/1988_Introduction_to_Intel_Cell-Based_Design.pdf">1.5 micron CHMOS III cell library</a>.
I assume this is the same standard-cell library that was used for the logic in the 386.
The library provided over 150 logic functions. It also provided
cell-based versions of the Intel 80C51 microcontroller and various Intel support chips such as the
82C37A DMA controller, the 82C54 interval timer, and the 82C59 interrupt controller.</p>
<p><a href="https://static.righto.com/images/386-versions/82360sl-die-photo.png"><img alt="Die photo of the 82360SL ISA Peripheral I/O Chip, from the 386 SL Data Book." class="hilite" height="372" src="https://static.righto.com/images/386-versions/82360sl-die-photo-w300.png" title="Die photo of the 82360SL ISA Peripheral I/O Chip, from the 386 SL Data Book." width="300" /></a><div class="cite">Die photo of the 82360SL ISA Peripheral I/O Chip, from the <a href="http://www.bitsavers.org/components/intel/80386/240814-005_386SL_Data_Book_Jul92.pdf#page=2">386 SL Data Book</a>.</div></p>
<p>Interestingly, the 386 SL's Peripheral I/O chip (the 82360SL) included the functionality
of these support chips.
Standard-cell construction is visible as the stripes in the die photo (above).
Moreover, the layout of the
die shows separated blocks, probably corresponding to each embedded chip.
I expect that Intel designed standard-cell versions of the controller chips to embed in the
I/O chip and then added the chips to the standard-cell library since they were available. <a class="footnote-backref" href="#fnref:standard-cell-library" title="Jump back to footnote 24 in the text">↩</a></p>
</li>
<li id="fn:bugs">
<p>For an example of the problems that could require a new stepping of the 386, see <a href="https://books.google.com/books?id=VMJcvUyoqJcC&lpg=PA10&dq=intel%2080386dx&pg=PA10#v=onepage&q&f=false">Intel backs off 80386 claims but denies chip recast needed</a> (1986).
It discusses multitasking issues with the 386, with Intel calling them "minor imperfections" that could
cause "little glitches", while others suggested that the chip would need replacement.
The bugs fixed in each stepping of the 386 are documented <a href="https://www.pcjs.org/documents/manuals/intel/80386/">here</a>. <a class="footnote-backref" href="#fnref:bugs" title="Jump back to footnote 25 in the text">↩</a></p>
</li>
<li id="fn:xbts">
<p>One curiosity about the 386 is the <code>IBTS</code> and <code>XBTS</code> <a href="http://www.bitsavers.org/components/intel/80386/231746-001_Introduction_to_the_80386_Apr86.pdf#page=177">instructions</a>.
The Insert Bit String and Extract Bit String instructions were implemented in the
early 386 processors, but then removed in the B1 stepping.
It's interesting that the bit string instructions were removed in the B1 stepping, the same stepping that fixed
the 32-bit multiplication bug.
Intel said that
they were removed "in order to use the area of the chip previously occupied for other microcircuitry" (<a href="https://archive.org/details/pc-magazine-programmers-technical-reference-1992/page/409/mode/1up">source</a>).
I wonder if Intel fixed the multiplication bug in microcode,
and needed to discard the bit string operations to free up enough microcode space.
Intel reused these opcodes in the 486 for the <code>CMPXCHG</code> instruction, but that caused conflicts
with old 386 programs, so Intel changed the 486 opcodes in the B stepping.
<!-- https://books.google.com/books?id=FeIuiOQN-nEC&pg=PT330&dq=%22ibts%22+%22xbts%22&hl=en&sa=X&ved=2ahUKEwjXzO2OoMmBAxUeEFkFHTIYA6IQ6AF6BAgEEAI#v=onepage&q=%22ibts%22%20%22xbts%22&f=false -->
<!-- errata https://books.google.com/books?id=tSLe3yMjc-AC&pg=PT190&dq=oct+1991&hl=en&sa=X&ved=2ahUKEwiyypHeosmBAxVyrokEHRc5B644ChC7BXoECAUQBg#v=onepage&q=errata&f=false --> <a class="footnote-backref" href="#fnref:xbts" title="Jump back to footnote 26 in the text">↩</a></p>
</li>
<li id="fn:s-specs">
<p>Since Antoine photographed many different 386 chips, I could correlate the S-Specs with the
layout changes. I'll summarize the information here, in case anyone happens to want it.
The larger DX layout is associated with SX213 and SX215.
(Presumably the two differ somehow, but I couldn't spot any difference in the photographs.)
The shrunk DX layout is associated with SX217, SX218, SX366, and SX544.
The 386 SL image is SX621. <a class="footnote-backref" href="#fnref:s-specs" title="Jump back to footnote 27 in the text">↩</a></p>
</li>
</ol>
</div>
Reverse-engineering the mechanical Bendix Central Air Data Computer<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js"></script>
<script>
MathJax = {
tex: {
inlineMath: [['$', '$'], ['\\(', '\\)']]
},
svg: {
fontCache: 'global'
},
chtml: { displayAlign: 'left' }
};
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
"HTML-CSS": { scale: 175}
});
</script>
<style type="text/css">
.MathJax {font-size: 1em !important}
</style>
<p>How did fighter planes in the 1950s perform calculations before compact digital computers were available?
The Bendix Central Air Data Computer (CADC) is an electromechanical analog computer that used
gears and cams for its mathematics. It was used in military planes such as the F-101 and the F-111 fighters, and the B-58 bomber to compute
airspeed, Mach number, and other "air data".</p>
<p><a href="https://static.righto.com/images/bendix-top/bendix-side.jpg"><img alt="The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version." class="hilite" height="322" src="https://static.righto.com/images/bendix-top/bendix-side-w600.jpg" title="The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version." width="600" /></a><div class="cite">The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version.</div></p>
<p>Aircraft have determined airspeed from
air pressure for over a
<a href="https://patents.google.com/patent/US1290875A">century</a>.
A port in the side of the plane provides the static air pressure,<span id="fnref:static"><a class="ref" href="#fn:static">1</a></span> the air pressure outside the aircraft.
A pitot tube points forward and receives the "total" air pressure, a higher pressure due to the speed of the airplane
forcing air into the tube.
The airspeed can be determined from the ratio of these two pressures, while the altitude can be determined
from the static pressure.</p>
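<p>For subsonic flight, the pressure-to-airspeed relationship is the standard compressible-flow formula for air (specific-heat ratio 1.4). As a sketch (background aerodynamics, not taken from the CADC documentation):</p>

```python
def mach_from_pressures(p_total, p_static):
    """Subsonic Mach number from total and static pressure, using the
    standard compressible-flow relation Pt/Ps = (1 + 0.2*M^2)^3.5
    for air (specific-heat ratio 1.4). Valid below Mach 1."""
    ratio = p_total / p_static
    return (5.0 * (ratio ** (2.0 / 7.0) - 1.0)) ** 0.5
```

With no pressure difference between the pitot tube and the static port, the computed Mach number is zero; as the plane speeds up, the total pressure rises and the ratio increases.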
<p>But as you approach the speed of sound, the fluid dynamics of air changes and the calculations become very complicated.
With the development of supersonic fighter planes in the 1950s, simple mechanical instruments were no longer
sufficient.
Instead, an analog computer calculated the "air data" (airspeed, air density, Mach number, and so forth) from the pressure measurements.
This computer then transmitted the air data electrically to the systems that needed it: instruments, weapons
targeting, engine control, and so forth.
Since the computer was centralized, the
system was called a Central Air Data Computer or CADC, manufactured by Bendix and other companies.</p>
<p><a href="https://static.righto.com/images/bendix-top/gears.jpg"><img alt="A closeup of the numerous gears inside the CADC. Three differential gear mechanisms are visible." class="hilite" height="367" src="https://static.righto.com/images/bendix-top/gears-w600.jpg" title="A closeup of the numerous gears inside the CADC. Three differential gear mechanisms are visible." width="600" /></a><div class="cite">A closeup of the numerous gears inside the CADC. Three differential gear mechanisms are visible.</div></p>
<p>Each value in the CADC is indicated by the rotational position of a shaft.
Compact electric motors rotate the shafts, controlled by <a href="https://spectrum.ieee.org/the-vacuum-tubes-forgotten-rival">magnetic amplifier</a> servos.
Gears, cams, and differentials perform computations, with the results indicated by more rotations.
Devices called synchros convert the rotations to electrical outputs that control other aircraft systems.
The CADC is said to contain 46 synchros, 511 gears, 820 ball bearings, and a total of 2,781 major parts (but I haven't counted).
These components are crammed into a compact cylinder: 15 inches long and weighing 28.7 pounds.</p>
<p>The equations computed by the CADC are impressively complicated. For instance, one equation is:<span id="fnref:equations"><a class="ref" href="#fn:equations">2</a></span></p>
<p>\[~~~\frac{P_t}{P_s} = \frac{166.9215M^7}{( 7M^2-1)^{2.5}}\]</p>
<p>It seems incredible that these functions could be computed mechanically, but three
techniques make this possible.
The fundamental mechanism is the differential gear, which adds or subtracts values.
Second, logarithms are used extensively, so multiplications and divisions become additions and subtractions performed by
a differential, while square roots are calculated by gearing down by a factor of 2.
Finally, specially-shaped cams implement functions: logarithm, exponential, and functions specific to the application.
By combining these mechanisms, complicated functions can be computed mechanically, as I will explain below.</p>
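To make the log-domain trick concrete, here is a tiny Python sketch (purely illustrative; the CADC is mechanical and contains no code) of how adding, subtracting, and halving logarithms yields multiplication, division, and square roots:

```python
import math

# Each shaft position represents the logarithm of a value, so a
# differential (which adds or subtracts rotations) performs multiplication
# or division, and gearing a rotation down by 2 performs a square root.

def multiply(x, y):
    return math.exp(math.log(x) + math.log(y))   # differential adds logs

def divide(x, y):
    return math.exp(math.log(x) - math.log(y))   # differential subtracts logs

def square_root(x):
    return math.exp(math.log(x) / 2)             # 2:1 gear ratio halves the log
```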
<!-- Aerodynamics for Naval Aviators (1965) https://www.faa.gov/sites/faa.gov/files/regulations_policies/handbooks_manuals/aviation/00-80T-80.pdf -->
<h3>The differential</h3>
<p>The differential gear assembly is the mathematical component of the CADC, as it performs addition or subtraction.
The differential takes two input rotations and produces an output rotation that is the sum or difference of these rotations.<span id="fnref:differential"><a class="ref" href="#fn:differential">3</a></span>
Since most values in the CADC are expressed logarithmically, the differential computes multiplication and division when it
adds or subtracts its inputs.</p>
<p><a href="https://static.righto.com/images/bendix-top/differential-closeup.png"><img alt="A closeup of a differential mechanism." class="hilite" height="332" src="https://static.righto.com/images/bendix-top/differential-closeup-w350.png" title="A closeup of a differential mechanism." width="350" /></a><div class="cite">A closeup of a differential mechanism.</div></p>
<p>While the differential functions like the differential in a car, it is constructed differently, with a
<a href="https://en.wikipedia.org/wiki/Differential_(mechanical_device)#Spur-gear_design">spur-gear</a> design.
This compact arrangement of gears is about 1 cm thick and 3 cm in diameter.
The differential is mounted on a shaft along with three co-axial gears: two gears provide the inputs to the differential and the third
provides the output.
In the photo, the gears above and below the differential are the input gears.
The entire differential body rotates with the sum, connected to the output gear at the top through a concentric shaft.
(In practice, any of the three gears can be used as the output.)
The two thick gears inside the differential body are part of the mechanism.</p>
<p>Note that multiplying a rotation by a constant factor doesn't require a differential; it can be done simply with the ratio between
two gears. (If a large gear rotates a small gear, the small gear rotates faster according to the size ratio.)
Adding a constant to a rotation is even easier, just a matter of defining what shaft position indicates 0.
For this reason, I will ignore constants in the equations.</p>
<h3>The cams</h3>
<p>The CADC uses cams to implement various functions. Most importantly, cams compute logarithms and exponentials.
Cams also implement complicated functions of one variable such as \( {M}/{\sqrt{1 + .2 M^2}} \).
The function is encoded into the cam's shape during manufacturing, so a hard-to-compute nonlinear function isn't a problem for the CADC.
The photo below shows a cam with the follower arm in front. As the cam rotates, the follower
moves in and out according to the cam's radius.</p>
<p><a href="https://static.righto.com/images/bendix-top/cam.jpg"><img alt="A cam inside the CADC implements a function." class="hilite" height="444" src="https://static.righto.com/images/bendix-top/cam-w500.jpg" title="A cam inside the CADC implements a function." width="500" /></a><div class="cite">A cam inside the CADC implements a function.</div></p>
<p>However, the shape of the cam doesn't provide the function directly, as you might expect.
The main problem with the straightforward approach is the discontinuity when the cam wraps around, which could catch the follower.
For example, if the cam implemented an exponential directly, its radius would spiral exponentially and there would be
a jump back to the starting value when it wraps around.</p>
<p>Instead, the CADC uses a clever <a href="https://patents.google.com/patent/US2969910A">patented</a> method:
the cam encodes the <em>difference</em> between the desired function and a straight line.
For example, an exponential curve is shown below (blue), with a line (red) between the endpoints.
The height of the gray segment, the difference, specifies the radius of the cam (added to the cam's fixed minimum radius).
The point is that this difference goes to 0 at the extremes, so the cam will no longer have a discontinuity when it
wraps around.
Moreover, this technique significantly reduces the size of the value (i.e. the height of the gray region is
smaller than the height of the blue line), increasing the cam's accuracy.<span id="fnref:cam-shape"><a class="ref" href="#fn:cam-shape">5</a></span></p>
<p><a href="https://static.righto.com/images/bendix-top/exponential.png"><img alt="An exponential curve (blue), linear curve (red), and the difference (gray)." class="hilite" height="216" src="https://static.righto.com/images/bendix-top/exponential-w300.png" title="An exponential curve (blue), linear curve (red), and the difference (gray)." width="300" /></a><div class="cite">An exponential curve (blue), linear curve (red), and the difference (gray).</div></p>
<p>To make this work, the cam position must be added to the linear value to yield the result.
This is implemented by combining each cam with a differential gear that performs the addition or subtraction.<span id="fnref:differential-cam"><a class="ref" href="#fn:differential-cam">4</a></span>
As the diagram below shows, the input (23) drives the cam (30) and the
differential (25, 37-41). The follower (32) tracks the cam and provides a second input (35) to the differential.
The sum from the differential produces the desired function (26).</p>
<p><a href="https://static.righto.com/images/bendix-top/differential-patent.png"><img alt="This diagram, from Patent 2969910, shows how the cam and follower are connected to a differential." class="hilite" height="312" src="https://static.righto.com/images/bendix-top/differential-patent-w350.png" title="This diagram, from Patent 2969910, shows how the cam and follower are connected to a differential." width="350" /></a><div class="cite">This diagram, from <a href="https://patents.google.com/patent/US2969910A">Patent 2969910</a>, shows how the cam and follower are connected to a differential.</div></p>
<h3>Pressure inputs</h3>
<!--
The Bendix CADC has two pneumatic inputs through tubes:
the static pressure and the total pressure.
It also receives the total temperature from a platinum temperature probe.
From these, it computes many outputs: true air speed, Mach number, log static pressure, differential pressure, air density, air density × the speed of sound, total temperature, and log true free air temperature.
-->
<p>The CADC receives two pressure inputs from the pitot tube.<span id="fnref:pec"><a class="ref" href="#fn:pec">6</a></span>
Inside the CADC, two pressure transducers convert the pressures into rotational positions.
Each pressure transducer contains a pair of bellows that
expand and contract as the applied pressure changes.
The pressure transducer has a tricky job: it must measure tiny pressure changes, but it must also provide a
rotational signal that has enough torque to rotate all the gears in the CADC.
To accomplish this, each pressure transducer uses a servo loop that drives a motor, controlled by a feedback loop.
Cams and differentials convert the rotation into logarithmic values, providing
the static pressure as
\( log \; P_s \) and the pressure ratio as \( log \; ({P_t}/{P_s}) \) to the rest of the CADC.</p>
<h3>The synchro outputs</h3>
<p>A synchro is an interesting device that can transmit a rotational position electrically over three wires.
In appearance, a synchro is similar to an electric motor, but its internal construction is different, as shown below.
Before digital systems, synchros were very popular for transmitting signals electrically through an aircraft.
For instance, a synchro could transmit an altitude reading to a cockpit display or a targeting system.
Two synchros at different locations have their stator windings connected together, while the rotor windings are driven with
AC.
Rotating the shaft of one synchro causes the other to rotate to the same position.<span id="fnref:synchro"><a class="ref" href="#fn:synchro">7</a></span></p>
<p><a href="https://static.righto.com/images/bendix-top/synchro-cutaway.png"><img alt="Cross-section diagram of a synchro showing the rotor and stators." class="hilite" height="306" src="https://static.righto.com/images/bendix-top/synchro-cutaway-w350.png" title="Cross-section diagram of a synchro showing the rotor and stators." width="350" /></a><div class="cite">Cross-section diagram of a synchro showing the rotor and stators.</div></p>
<p>For the CADC, most of the outputs are synchro signals, using compact synchros that are about 3 cm in length.
For improved resolution, some of the CADC outputs use two synchros: a coarse synchro and a fine synchro.
The two synchros are typically geared in an 11:1 ratio, so the fine synchro rotates 11 times as fast as the coarse
synchro.
Over the output range, the coarse synchro may turn 180°, providing the approximate output, while the fine
synchro spins multiple times to provide more accuracy.</p>
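The coarse/fine scheme can be sketched in Python (with made-up numbers; the actual calibration of each output is defined in the CADC specification, see footnote 12):

```python
# The coarse synchro covers the full output range in one partial turn,
# while the fine synchro, geared 11:1, wraps around multiple times.
RATIO = 11
COARSE_SPAN = 180.0          # degrees of coarse rotation over the full range

def encode(fraction):
    """Map a value in [0, 1] to (coarse, fine) shaft angles in degrees."""
    coarse = fraction * COARSE_SPAN
    fine = (fraction * COARSE_SPAN * RATIO) % 360.0
    return coarse, fine

def decode(coarse, fine):
    """Use coarse to pick the fine synchro's turn count, then use fine."""
    approx = coarse * RATIO                  # approximate fine-shaft degrees
    turns = round((approx - fine) / 360.0)   # whole turns of the fine synchro
    return (turns * 360.0 + fine) / (COARSE_SPAN * RATIO)
```

Because the final answer comes from the fine synchro, the coarse synchro only needs to be accurate enough to resolve which turn the fine synchro is on.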
<h2>Examining the left section of the CADC</h2>
<p><a href="https://static.righto.com/images/bendix-top/side.jpg"><img alt="Another view of the CADC." class="hilite" height="343" src="https://static.righto.com/images/bendix-top/side-w600.jpg" title="Another view of the CADC." width="600" /></a><div class="cite">Another view of the CADC.</div></p>
<p>The Bendix CADC is constructed from modular sections.
The right section has the pressure transducers (the black domes), along with the servo mechanisms that control
them.
The middle section is the "Mach section".
In this blog post, I'm focusing on the left section of the CADC, which
computes true airspeed, air density, total temperature, log true free
air temperature, and air density × speed of sound.
I had feared that any attempt at disassembly would result in tiny gears flying in every direction, but the CADC
was designed to be taken apart for maintenance.
Thus, I could remove the left section of the CADC for analysis.</p>
<p>The diagram below shows the side that connects to the aircraft.<span id="fnref:connectors"><a class="ref" href="#fn:connectors">8</a></span>
The various synchros generate the outputs.
Some of the synchros have spiral anti-backlash springs installed.
These springs prevent wobble in the synchro and gear train as the gears change direction.
Three of the exponential cams are visible.
The differentials and gears are between the two metal plates, so they are not visible from this angle.</p>
<p><a href="https://static.righto.com/images/bendix-top/front-labeled.jpg"><img alt="The front of the CADC has multiple output synchros with anti-backlash springs." class="hilite" height="556" src="https://static.righto.com/images/bendix-top/front-labeled-w600.jpg" title="The front of the CADC has multiple output synchros with anti-backlash springs." width="600" /></a><div class="cite">The front of the CADC has multiple output synchros with anti-backlash springs.</div></p>
<p>Attached to the right side is the temperature transducer, a modular wedge that implements
a motorized servo loop to convert the temperature input to a rotation.
The servo amplifier consists of three boards of electronic components, including transistors and magnetic amplifiers to drive the motor.
The large red potentiometer provides feedback for the servo loop.
A flexible cam with 20 adjustment screws allows the transducer to be tuned to eliminate nonlinearities or other sources of error.
I'll describe this module in more detail in another post.<span id="fnref:temperature-servo"><a class="ref" href="#fn:temperature-servo">9</a></span></p>
<p>The photo below shows the other side of the section. This communicates with the rest of the CADC through the electrical connector and
three gears that mesh with gears in the other section.
Two gears receive the pressure signals \( P_t / P_s \) and \(P_s\) from the pressure transducer subsystem.
The third gear sends the log total temperature to the rest of the CADC.
The electrical connector (a standard 37-pin D-sub) supplies 120 V 400 Hz power to the rest of the CADC and
passes synchro signals from the rest of the CADC to the output connectors.</p>
<p><a href="https://static.righto.com/images/bendix-top/back-labeled.jpg"><img alt="This side of the section interfaces with the rest of the CADC." class="hilite" height="480" src="https://static.righto.com/images/bendix-top/back-labeled-w500.jpg" title="This side of the section interfaces with the rest of the CADC." width="500" /></a><div class="cite">This side of the section interfaces with the rest of the CADC.</div></p>
<h2>The equations</h2>
<p>Although the CADC looks like an inscrutable conglomeration of tiny gears, it is possible to trace out
the gearing and see exactly how it computes the air data functions.
With considerable effort, I have reverse-engineered the mechanisms to create the diagram below,
showing how each computation is broken down into mechanical steps.
Each line indicates a particular value, specified by a shaft rotation.
The ⊕ symbol indicates a differential gear, adding or subtracting its inputs to produce another value.
The cam symbol indicates a cam coupled to a differential gear.
Each cam computes either a specific function or an exponential, providing the value as a rotation.
At the right, the rotations are converted to outputs, either by synchros or a potentiometer.
This diagram abstracts out the physical details of the gears. In particular, scaling by constants or
reversing the rotation (subtraction versus addition) are not shown.</p>
<p><a href="https://static.righto.com/images/bendix-top/dataflow.png"><img alt="This diagram shows how the values are computed. The differential numbers are my own arbitrary numbers. Click for a larger version." class="hilite" height="395" src="https://static.righto.com/images/bendix-top/dataflow-w800.png" title="This diagram shows how the values are computed. The differential numbers are my own arbitrary numbers. Click for a larger version." width="800" /></a><div class="cite">This diagram shows how the values are computed. The differential numbers are my own arbitrary numbers. Click for a larger version.</div></p>
<p>I'll go through each calculation briefly.</p>
<h3>Total temperature</h3>
<p>The external temperature is an important input to the CADC since it affects the air density.
A platinum temperature probe provides a resistance that varies with temperature.
The resistance is converted to rotation by the temperature transducer, described earlier.
The definition of temperature is a bit complicated, though.
The temperature outside the aircraft is called the true free air temperature, T.
However, the temperature probe measures a higher temperature, called the indicated total air temperature, T<sub>i</sub>.
The reason for this discrepancy is that when the aircraft is moving at high speed, the air transfers kinetic energy to the temperature probe, heating it up.</p>
<p><a href="https://static.righto.com/images/bendix-top/d15.jpg"><img alt="The differential and cam D15." class="hilite" height="358" src="https://static.righto.com/images/bendix-top/d15-w500.jpg" title="The differential and cam D15." width="500" /></a><div class="cite">The differential and cam D15.</div></p>
<p>The temperature transducer provides the log of the total temperature as a rotation.
At the top of the equation diagram,
cam and differential D15 simply take the exponential of this value to determine the total temperature.
This rotates the shaft of a synchro to produce the total temperature as an electrical output.
As shown above, the D15 cam is attached to the differential by a shaft passing through the metal plate.
The follower rotates according to the cam radius, turning the follower gear which meshes with the differential input.
The result from the differential is the total temperature.</p>
<h3>Log free air temperature</h3>
<p>A more complicated task of the CADC is to compute the true free air temperature from the measured total temperature.
Free air temperature, T, is defined by the formula below, which compensates for the additional
heating due to the aircraft's speed. \(T_i\) is the indicated total temperature, M is the Mach number and
K is a temperature probe constant.<span id="fnref:k"><a class="ref" href="#fn:k">10</a></span></p>
<p>\[ T = \frac {T_i} {1 + .2 K M^2 } \]</p>
<p>The diagram below shows the cams, differentials, gear trains, and synchro that compute \(log \; T\).
First, cam D11 computes
\( log \; (1 + .2 K M^2 ) \).
Although that expression is complicated, the key is that it is a function of one variable (M).
Thus, it can be computed by cam D11, carefully shaped for this function and attached to differential D11.
Differential D10 adds the log total temperature (from the temperature transducer) to produce the desired result.
The indicated servo outputs this value to other aircraft systems.
(Note that the output is a logarithm; it is not converted to a linear value.)<span id="fnref:log-temp"><a class="ref" href="#fn:log-temp">11</a></span>
This value is also fed (via gears) into the calculations of three more equations, below.</p>
<p><a href="https://static.righto.com/images/bendix-top/d10.jpg"><img alt="The components that compute log free air temperature. D12 is not part of this equation." class="hilite" height="502" src="https://static.righto.com/images/bendix-top/d10-w500.jpg" title="The components that compute log free air temperature. D12 is not part of this equation." width="500" /></a><div class="cite">The components that compute log free air temperature. D12 is not part of this equation.</div></p>
<h3>Air density</h3>
<p>Air density is computed from the static pressure and true temperature:</p>
<p>\[ \rho = C_1 \frac{P_s} {T} \]</p>
<p>It is calculated using logarithms. D16 subtracts the log temperature from the log pressure and
cam D20 takes the exponential.</p>
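In Python, the two steps look like this (a sketch; the constant C1 and the units are arbitrary here, since in the CADC constants are absorbed into gear ratios):

```python
import math

def air_density(log_ps, log_t, c1=1.0):
    # Differential D16 subtracts log T from log Ps; cam D20 exponentiates.
    return c1 * math.exp(log_ps - log_t)
```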
<h3>True airspeed</h3>
<p>True airspeed is computed from the Mach number and the total temperature according to the following formula:</p>
<p>\[V = 38.94 M \frac{\sqrt{T_i}}{\sqrt{1+.2KM^2}}\]</p>
<p>Substituting the true free air temperature simplifies the formula to the equation implemented in the CADC:</p>
<p>\[V = 38.94 M \sqrt{T} \]</p>
<p>This is computed logarithmically. First, cam and differential D12 compute \(log \; M\) from the pressure ratio.<span id="fnref:m-function"><a class="ref" href="#fn:m-function">13</a></span>
Next differential D19 adds half the log temperature to multiply by the square root.
Exponential cam D13 removes the logarithms, producing the final result. (The constant 38.94 is an important part of the
equation, but is easily implemented with gear ratios.)
The output goes to two synchros, geared to provide coarse and fine outputs.<span id="fnref:coarse"><a class="ref" href="#fn:coarse">12</a></span></p>
<p><a href="https://static.righto.com/images/bendix-top/d14.jpg"><img alt="These components compute true airspeed and air density × speed of sound.
Note the large gear driving the coarse synchro and the small gear driving the fine synchro. This causes the fine
synchro to rotate at 11 times the speed of the coarse synchro." class="hilite" height="447" src="https://static.righto.com/images/bendix-top/d14-w650.jpg" title="These components compute true airspeed and air density × speed of sound.
Note the large gear driving the coarse synchro and the small gear driving the fine synchro. This causes the fine
synchro to rotate at 11 times the speed of the coarse synchro." width="650" /></a><div class="cite">These components compute true airspeed and air density × speed of sound.
Note the large gear driving the coarse synchro and the small gear driving the fine synchro. This causes the fine
synchro to rotate at 11 times the speed of the coarse synchro.</div></p>
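Numerically, the log-domain airspeed steps can be sketched in Python (illustrative only; the 38.94 constant suggests knots with T in kelvins, but the units are incidental to the mechanism):

```python
import math

def true_airspeed(log_mach, log_t):
    # Differential D19 adds half the log temperature (a square root);
    # exponential cam D13 then removes the logarithm. The factor 38.94
    # is implemented with gear ratios in the real CADC.
    return 38.94 * math.exp(log_mach + 0.5 * log_t)
```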
<h3>Air density × speed of sound</h3>
<p>Air density × speed of sound<span id="fnref:density-speed"><a class="ref" href="#fn:density-speed">14</a></span> is given by the formula</p>
<p>\[ \rho \cdot a = C_2 \frac {P_s} {\sqrt{T}} \]</p>
<p>The calculation is almost the same as the air density calculation.
Differential D18 subtracts <em>half</em> the log temperature from the log pressure and then cam D14 computes
the exponential.
Unlike the other values, this output rotates the shaft of a 1 kΩ potentiometer (above), changing its resistance.
I don't know why this particular value is output as a resistance rather than a synchro angle.</p>
<h2>Conclusions</h2>
<p>The CADC performs nonlinear calculations that seem way too complicated to solve with mechanical gearing.
But reverse-engineering the mechanism shows how the equations are broken down into steps that can be performed
with cams and differentials, using logarithms for multiplication, division, and square roots.
I'll point out that reverse engineering the CADC is not as easy as you might expect.
It is difficult to tell which gears are in contact, especially when they are buried in the middle of the CADC.
I did much of the reverse engineering by rotating one differential to see which other gears turn, but usually most of the gears turned
due to the circuitous interconnections.<span id="fnref:reverse-engineering"><a class="ref" href="#fn:reverse-engineering">15</a></span></p>
<p>By the late 1960s, as fighter planes became more advanced and computer technology improved, digital processors replaced the gears in air data computers.
Garrett AiResearch's ILAAS air data computer (1967) was the <a href="https://worldradiohistory.com/Archive-Electronics/60s/67/Electronics-1967-10-16.pdf">first all-digital unit</a>.
Other digital systems were Bendix's ADC-1000 Digital Air Data Computer (1967) which was "designed to solve all air data computations at a rate of
75 times per second",
Conrac's 3-pound solid-state air data computer (1967),
Honeywell's <a href="https://worldradiohistory.com/Archive-Electronics/60s/69/Electronics-1969-03-17.pdf#page=127">Digital Air Data System</a> (1968),
and the LSI-based Garrett AiResearch <a href="https://en.wikipedia.org/wiki/F-14_CADC">F-14 CADC</a> (1970).
Nonetheless, the gear-based Bendix CADC provides an interesting reverse-engineering challenge as well as a look at the forgotten era of analog computing.</p>
<p>For more background on the CADC, see my <a href="https://www.righto.com/2023/02/bendix-central-air-data-computer-cadc.html">overview article on the CADC</a>.
I plan to continue reverse-engineering the Bendix CADC and get it operational,<span id="fnref:refs"><a class="ref" href="#fn:refs">16</a></span>
so follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I've also started experimenting with Mastodon as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Thanks to Joe for providing the CADC. Thanks to Nancy Chen for obtaining a hard-to-find document for me.
Marc Verdiell and Eric Schlaepfer are working on the CADC with me.</p>
<!-- AiResearch F-14A air data computer with solid state pressure sensor https://www.google.com/books/edition/Aircraft_Yearbook/KnVGAAAAYAAJ?hl=en&gbpv=1&pg=PA85&printsec=frontcover
F-104 lawsuit over whether CADC is essential:
https://www.google.com/books/edition/United_States_Customs_Court_Reports/7p1ypJhUYukC?hl=en&gbpv=1&dq=electronic+air+data+computer&pg=PA331&printsec=frontcover
Bendix digital air data computer https://patentimages.storage.googleapis.com/ae/55/07/f72bf49c914d2c/US3564222.pdf
https://www.google.com/books/edition/Aircraft_Yearbook/KnVGAAAAYAAJ?hl=en&gbpv=1&dq=%22digital+air+data+computer%22&pg=RA1-PA190&printsec=frontcover
https://patents.google.com/patent/US3564222A/en
-->
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:static">
<p>The static air pressure can also be provided by holes in the side of the pitot tube.
I couldn't find information indicating exactly how the planes with the CADC received static pressure. <a class="footnote-backref" href="#fnref:static" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:equations">
<p>Although the CADC's equations may seem <em>ad hoc</em>, they can be derived from fluid dynamics principles.
These equations were standardized in the 1950s by various government organizations including the National
Bureau of Standards and NACA (the precursor of NASA). <a class="footnote-backref" href="#fnref:equations" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:differential">
<p>Strictly speaking, the output of the differential is the sum of the inputs divided by two.
I'm ignoring the factor of 2 because the gear ratios can easily cancel it out.
It's also arbitrary whether you think of the differential as adding or subtracting, since it depends on which rotation
direction is defined as positive. <a class="footnote-backref" href="#fnref:differential" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:differential-cam">
<p>The cam value will be added or subtracted, depending on whether the function is concave or convex.
This is a simple matter of gearing when the values are fed into the differential.
Matching the linear segment to the function is also done with gearing that scales the input value appropriately. <a class="footnote-backref" href="#fnref:differential-cam" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:cam-shape">
<p>The diagram below shows a typical cam function in more detail. The input is \(log~ dP/P_s\)
and the output is \(log~M / \sqrt{1+.2KM^2}\).
The small humped curve at the bottom is the cam correction.
Although the input and output functions cover a wide range, the difference that is encoded in the cam is much
smaller and drops to zero at both ends.</p>
<p><a href="https://static.righto.com/images/bendix-top/cam-diagram.jpg"><img alt="This diagram, from Patent 2969910, shows how a cam implements a complicated function." class="hilite" height="439" src="https://static.righto.com/images/bendix-top/cam-diagram-w350.jpg" title="This diagram, from Patent 2969910, shows how a cam implements a complicated function." width="350" /></a><div class="cite">This diagram, from <a href="https://patents.google.com/patent/US2969910A">Patent 2969910</a>, shows how a cam implements a complicated function.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:cam-shape" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:pec">
<p>The CADC also has an input for the "position error correction", which I will ignore in this post.
This input provides a correction factor because the measured static pressure may not
exactly match the real static pressure. The problem is that the static pressure is measured from a port on the
aircraft. Distortions in the airflow may cause errors in this measurement.
A separate box, the "compensator", determined the correction factor based on the angle of attack and fed it to the CADC
as a synchro signal. <a class="footnote-backref" href="#fnref:pec" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:synchro">
<p>Internally, a synchro has a moving rotor winding and three fixed stator windings.
When AC is applied to the rotor, voltages are developed on the stator windings depending on the position of the rotor.
These voltages produce a torque that rotates the synchros to the same position.
In other words, the rotor receives power (26 V, 400 Hz in this case), while the three stator wires transmit the position.
The diagram below shows how a synchro is represented schematically, with rotor and stator coils.</p>
<p><a class="footnote-backref" href="#fnref:synchro" title="Jump back to footnote 7 in the text">↩</a><a href="https://static.righto.com/images/bendix-top/synchro-symbol.jpg"><img alt="The schematic symbol for a synchro." class="hilite" height="240" src="https://static.righto.com/images/bendix-top/synchro-symbol-w250.jpg" title="The schematic symbol for a synchro." width="250" /></a><div class="cite">The schematic symbol for a synchro.</div></p>
</li>
<li id="fn:connectors">
<p>The CADC is wired to the rest of the aircraft through round military connectors.
The front panel interfaces these connectors to the D-sub connectors used internally.
The two pressure inputs are the black cylinders at the bottom of the photo.</p>
<p><a href="https://static.righto.com/images/bendix-top/exterior.jpg"><img alt="The exterior of the CADC. It is packaged in a rugged metal cylinder." class="hilite" height="470" src="https://static.righto.com/images/bendix-top/exterior-w300.jpg" title="The exterior of the CADC. It is packaged in a rugged metal cylinder." width="300" /></a><div class="cite">The exterior of the CADC. It is packaged in a rugged metal cylinder.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:connectors" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:temperature-servo">
<p>I don't have a blog post on the temperature module yet, but I have a <a href="https://twitter.com/kenshirriff/status/1673795616491667456">description</a> on Twitter and a <a href="https://twitter.com/kenshirriff/status/1678073863723454464">video</a>. <a class="footnote-backref" href="#fnref:temperature-servo" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:k">
<p>The constant K depends on the recovery factor of the temperature probe.
This compensates for a probe where not all of the air's kinetic energy gets transferred
to the probe.
The 1958 description says that with "modern total temperature probes available today", the
K factor can be considered to be 1. <a class="footnote-backref" href="#fnref:k" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:log-temp">
<p>The CADC specification says that it provides the log true free air temperature from -80° to +70° C.
Obviously, the log won't work for a negative value, so I assume this is the log of the absolute temperature in kelvins (K). <a class="footnote-backref" href="#fnref:log-temp" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:coarse">
<p>The CADC specification defines how the parameter values correspond to rotation angles of the synchros.
For instance, for the airspeed synchros, the CADC supports the airspeed range 104.3 to 1864.7 knots.
The coarse and fine outputs are geared in an 11:1 ratio, so the fine synchro will
rotate multiple times over the range to provide more accuracy.
Over this range, the coarse synchro rotates from -18.94° to +151.42° and the
fine synchro rotates from -208.29° to +1665.68°, with 0° corresponding to 300 knots. <a class="footnote-backref" href="#fnref:coarse" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
<li id="fn:m-function">
<p>The Mach function is defined in terms of \(P_t/P_s \), with separate cases for
subsonic and supersonic:</p>
<p>\[M<1:\]
\[~~~\frac{P_t}{P_s} = ( 1+.2M^2)^{3.5}\]</p>
<p>\[M > 1:\]</p>
<p>\[~~~\frac{P_t}{P_s} = \frac{166.9215M^7}{( 7M^2-1)^{2.5}}\]</p>
<p>Although these equations are very complicated, the solution is a function of one
variable \(P_t/P_s\) so M can be computed with a single cam.
In other words, the mathematics needed to be done when the CADC was manufactured,
but once the cam exists, computing M is trivial. <a class="footnote-backref" href="#fnref:m-function" title="Jump back to footnote 13 in the text">↩</a></p>
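<p>For the subsonic branch, the inverse has a closed form; a small Python sketch (derived by inverting the subsonic equation above, not taken from the CADC documentation):</p>

```python
import math

def mach_subsonic(pt_over_ps):
    # Invert Pt/Ps = (1 + .2 M^2)^3.5 for M < 1. The cam's shape encodes
    # this inverse (in logarithmic form) once and for all.
    return math.sqrt(5.0 * (pt_over_ps ** (1.0 / 3.5) - 1.0))
```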
</li>
<li id="fn:density-speed">
<p>I'm not sure why the CADC computes air density times speed of sound.
I couldn't find any useful aircraft characteristics that depend on this value, but there must be something.
In acoustics and audio, this product is useful as the "air impedance", but I couldn't determine the relevance for aviation. <a class="footnote-backref" href="#fnref:density-speed" title="Jump back to footnote 14 in the text">↩</a></p>
</li>
<li id="fn:reverse-engineering">
<p>While reverse-engineering this system, I have gained more appreciation for the engineering involved.
Converting complicated equations to gearing is a remarkable feat.
But also remarkable is designing the CADC as a three-dimensional object that can be built, disassembled, and repaired,
long before any sort of 3-D modeling was available.
It must have been a puzzle to figure out where to position each differential. Each differential had three gears driving it,
which had to mesh with gears from other differentials.
There wasn't much flexibility in the gear dimensions, since the gear ratios had to be correct and the number of teeth on
each gear had to be an integer.
Moreover, it is impressive how tightly the gears are packed together without conflicting with each other. <a class="footnote-backref" href="#fnref:reverse-engineering" title="Jump back to footnote 15 in the text">↩</a></p>
</li>
<li id="fn:refs">
<p>It was very difficult to find information about the CADC.
The official military specification is MIL-C-25653C(USAF). After searching everywhere, I was finally able to
get a copy from the Technical Reports &amp; Standards unit of the Library of Congress.
The other useful document was in an obscure conference proceedings from 1958:
"Air Data Computer Mechanization" (Hazen), Symposium on the USAF Flight Control Data Integration Program, Wright Air Dev Center US Air Force, Feb 3-4, 1958, pp 171-194. <a class="footnote-backref" href="#fnref:refs" title="Jump back to footnote 16 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com7tag:blogger.com,1999:blog-6264947694886887540.post-48303161791082086092023-09-30T09:03:00.004-07:002023-10-08T11:48:37.339-07:00How flip-flops are implemented in the Intel 8086 processorA key concept for a processor is the management of "state", information that persists
over time.
Much of a computer is built from logic gates, such as NAND or NOR gates, but logic gates
have no notion of time.
Processors also need a way to hold values, along with a mechanism to move from
step to step in a controlled fashion.
This is the role of "sequential logic", where the output depends on what happened before.
Sequential logic usually operates off a clock signal,<span id="fnref:asynchronous"><a class="ref" href="#fn:asynchronous">1</a></span>
a sequence of regular pulses that controls the timing of the computer.
(If you have a 3.2 GHz processor, for instance, that number is the clock frequency.)</p>
<p>A circuit called the flip-flop is a fundamental building block for sequential logic.
A flip-flop can hold one bit of state, a "0" or a "1", changing its value when the
clock changes.
Flip-flops are a key part of processors, with multiple roles.
Several flip-flops can be combined to form a register, holding a value.
Flip-flops are also used to build "state machines", circuits that move from step to step
in a controlled sequence.
A flip-flop can also delay a signal, holding it from one clock cycle to the next.</p>
<p>Intel introduced the groundbreaking 8086 microprocessor in 1978, starting the x86 architecture
that is widely used today.
In this blog post, I take a close look at the flip-flops in the 8086: what they do and
how they are implemented.
In particular, I will focus on the dynamic flip-flop, which holds its
value using capacitance, much like DRAM.<span id="fnref:cross-coupled"><a class="ref" href="#fn:cross-coupled">2</a></span>
Many of these flip-flops use a somewhat unusual "enable" input, which allows the flip-flop
to hold its value for multiple clock cycles.</p>
<p><a href="https://static.righto.com/images/8086-flipflop/die-labeled.jpg"><img alt="The 8086 die under the microscope, with the main functional blocks.
I count 184 flip-flops with enable and 53 without enable.
Click this image (or any other) for a larger version." class="hilite" height="691" src="https://static.righto.com/images/8086-flipflop/die-labeled-w700.jpg" title="The 8086 die under the microscope, with the main functional blocks.
I count 184 flip-flops with enable and 53 without enable.
Click this image (or any other) for a larger version." width="700" /></a><div class="cite">The 8086 die under the microscope, with the main functional blocks.
I count 184 flip-flops with enable and 53 without enable.
Click this image (or any other) for a larger version.</div></p>
<p>The die photo above shows the silicon die of the 8086.
In this image, I have removed the metal and polysilicon layers to show the silicon transistors underneath.
The colored squares indicate the flip-flops: blue flip-flops have an <code>enable</code> input, while red lack <code>enable</code>.
Flip-flops are used throughout the processor for a variety of roles.
Around the edges, they hold the state for output pins.
The control circuitry makes heavy use of flip-flops for various state machines, such as
moving through the "T states" that control the bus cycle.
The "<a href="https://www.righto.com/2023/01/the-8086-processors-microcode-pipeline.html#:~:text=a%20memory%20access.-,The%20loader,-To%20decode%20and">loader</a>" uses a state machine to start each instruction.
The instruction register, along with some special-purpose registers (<a href="https://www.righto.com/2023/03/8086-register-codes.html">N, M</a>, and X) are built with
flip-flops.
Other flip-flops track the instructions in the prefetch queue.
The microcode engine uses flip-flops to hold the current microcode address as well as to
latch the 21-bit output from the microcode ROM.
The ALU (Arithmetic/Logic Unit) uses flip-flops to hold the status flags,
temporary input values, and information on the operation.</p>
<style type="text/css">
.bar {text-decoration:overline;}
</style>
<h2>The flip-flop circuit</h2>
<p>In this section, I'll explain how the flip-flop circuits work, starting with a basic D flip-flop.
The D flip-flop (below) takes a data input (D) and stores that value, 0 or 1.
The output is labeled <code>Q</code>, while the inverted output is called <code class="bar">Q</code> (Q-bar).
This flip-flop is "edge triggered", so the storage happens
on the edge when the clock changes from low to high.<span id="fnref:polarity"><a class="ref" href="#fn:polarity">4</a></span>
Except at this transition, the input can change without affecting the output.</p>
<p><a href="https://static.righto.com/images/8086-flipflop/d-flip-flop2.png"><img alt="The symbol for a D flip-flop." class="hilite" height="108" src="https://static.righto.com/images/8086-flipflop/d-flip-flop2-w150.png" title="The symbol for a D flip-flop." width="150" /></a><div class="cite">The symbol for a D flip-flop.</div></p>
<p>The 8086 implements most of its flip-flops dynamically, using <a href="https://en.wikipedia.org/wiki/Flip-flop_(electronics)#Edge-triggered_dynamic_D_storage_element">pass transistor logic</a>.
That is, the capacitance of the wiring (in particular the transistor gate) holds the 0 or 1 state.
The dynamic implementation is more compact than the typical static flip-flop implementation,
so it is often used in processors.
However, the charge on the capacitance will eventually leak away, just like DRAM (dynamic RAM).
Thus, the clock must keep going or the values will be lost.<span id="fnref:minimum-clock-speed"><a class="ref" href="#fn:minimum-clock-speed">3</a></span>
This behavior is different from a typical flip-flop chip, which will hold its value until the
next clock, whether that is a microsecond later or a day later.</p>
<p>The D flip-flop is built from two latch<span id="fnref:latch"><a class="ref" href="#fn:latch">5</a></span> stages, each consisting of a pass transistor and an inverter.<span id="fnref:intel"><a class="ref" href="#fn:intel">6</a></span>
The first pass transistor passes the input value through while the clock is low. When the clock switches high,
the first pass transistor turns off and isolates the inverter from the input, but the value persists due to the capacitance (blue arrow).
Meanwhile, the second pass transistor switches on, passing the value from the first inverter through the second
inverter to the output.
Similarly, when the clock switches low, the second transistor switches off but the value is held by capacitance
at the green arrow.
(The circuit does not need an explicit capacitor; the wiring has enough capacitance to hold
the value.)
Thus, the output holds the value of the <code>D</code> input that was present at the moment when the clock switched from low to high.
Any other changes to the <code>D</code> input do not affect the output.</p>
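<p>The two-stage behavior can be sketched as a little Python model (my own illustration, not Intel's circuit): each stored attribute stands in for the charge on a wire, and the two clock phases copy values through the pass transistors.</p>

```python
class DynamicDFF:
    """Behavioral model of the two-stage dynamic D flip-flop."""
    def __init__(self):
        self.stage1 = 0  # charge held at the first inverter's input
        self.stage2 = 0  # charge held at the second inverter's input

    def tick(self, clk, d):
        if clk == 0:
            # Clock low: the first pass transistor conducts, sampling D.
            self.stage1 = d
        else:
            # Clock high: the second pass transistor conducts; the first
            # stage is isolated and holds its value by "capacitance".
            self.stage2 = 1 - self.stage1   # through the first inverter
        return 1 - self.stage2              # Q, through the second inverter

ff = DynamicDFF()
ff.tick(0, 1)        # clock low: D=1 is sampled into the first stage
q = ff.tick(1, 0)    # clock rises: Q becomes 1; the new D value is ignored
```

Only the value present when the clock goes high reaches the output, matching the edge-triggered behavior described above.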
<p><a href="https://static.righto.com/images/8086-flipflop/d-flip-flop-schematic.png"><img alt="Schematic of a D flip-flop built from pass transistor logic." class="hilite" height="103" src="https://static.righto.com/images/8086-flipflop/d-flip-flop-schematic-w350.png" title="Schematic of a D flip-flop built from pass transistor logic." width="350" /></a><div class="cite">Schematic of a D flip-flop built from pass transistor logic.</div></p>
<p>The basic flip-flop can be modified by adding an "enable" input that enables or blocks the clock.<span id="fnref:enable"><a class="ref" href="#fn:enable">7</a></span>
When the <code>enable</code> input is high, the flip-flop records the <code>D</code> input on the clock edge as before, but when the <code>enable</code> input is low, the flip-flop holds its previous value.
The <code>enable</code> input allows the flip-flop to hold its value for an arbitrarily long period of time.</p>
<p><a href="https://static.righto.com/images/8086-flipflop/enable-flip-flop.png"><img alt="The symbol for the D flip-flop with enable." class="hilite" height="114" src="https://static.righto.com/images/8086-flipflop/enable-flip-flop-w150.png" title="The symbol for the D flip-flop with enable." width="150" /></a><div class="cite">The symbol for the D flip-flop with enable.</div></p>
<p>The enable flip-flop is constructed from a D flip-flop by feeding the flip-flop's output back to the input as shown below.
When the <code>enable</code> input is 0, the multiplexer selects the current <code>Q</code> output as the new flip-flop D input,
so the flip-flop retains its previous value.
But when the <code>enable</code> input is 1, the multiplexer selects the new <code>D</code> value.
(You can think of the <code>enable</code> input as selecting "hold" versus "load".)</p>
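<p>At the behavioral level, the feedback multiplexer reduces to a one-line "load or hold" choice. A minimal sketch (mine, abstracting away the transistors):</p>

```python
class EnableDFF:
    """Behavioral sketch of the enable flip-flop: a multiplexer feeds
    Q back to D when enable is low, so each clock edge refreshes the
    old value instead of loading a new one."""
    def __init__(self):
        self.q = 0

    def clock_edge(self, d, enable):
        # "load" when enabled, "hold" (recirculate Q) when not
        self.q = d if enable else self.q
        return self.q

ff = EnableDFF()
ff.clock_edge(1, enable=1)   # load a 1
ff.clock_edge(0, enable=0)   # D=0 is ignored; Q stays 1
```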
<p><a href="https://static.righto.com/images/8086-flipflop/enable-flip-flop-diagram.png"><img alt="Block diagram of a flip-flop with an enable input." class="hilite" height="141" src="https://static.righto.com/images/8086-flipflop/enable-flip-flop-diagram-w280.png" title="Block diagram of a flip-flop with an enable input." width="280" /></a><div class="cite">Block diagram of a flip-flop with an enable input.</div></p>
<p>The multiplexer is implemented with two more pass transistors, as shown on the left below.<span id="fnref:mux"><a class="ref" href="#fn:mux">8</a></span>
When <code>enable</code> is low, the upper pass transistor
switches on, passing the current <code>Q</code> output back to the input. When <code>enable</code> is high, the lower pass transistor switches on,
passing the <code>D</code> input through to the flip-flop.
The schematic below also shows how the inverted <code>Q'</code> output is provided by the first inverter.
The circuit "cheats" a bit; since the inverted output bypasses the second transistor, this output can
change before the clock edge.</p>
<p><a href="https://static.righto.com/images/8086-flipflop/enable-flip-flop-schematic.png"><img alt="Schematic of a flip-flop with an enable input." class="hilite" height="199" src="https://static.righto.com/images/8086-flipflop/enable-flip-flop-schematic-w600.png" title="Schematic of a flip-flop with an enable input." width="600" /></a><div class="cite">Schematic of a flip-flop with an enable input.</div></p>
<p>The flip-flops often have a <code>set</code> or <code>clear</code> input, setting the flip-flop high or low.
This input is typically connected to the processor's "reset" line, ensuring that the flip-flops are initialized
to the proper state when the processor is started.
The symbol below shows a flip-flop with a <code>clear</code> input.</p>
<p><a href="https://static.righto.com/images/8086-flipflop/clear-flip-flop.png"><img alt="The symbol for the D flip-flop with enable and clear inputs." class="hilite" height="141" src="https://static.righto.com/images/8086-flipflop/clear-flip-flop-w150.png" title="The symbol for the D flip-flop with enable and clear inputs." width="150" /></a><div class="cite">The symbol for the D flip-flop with enable and clear inputs.</div></p>
<p>To support the clear function, a NOR gate replaces the inverter as shown below (red).
When the <code>clear</code> input is high, it forces the output from the NOR gate to be low.
Note that the <code>clear</code> input is asynchronous, changing the <code>Q</code> output immediately. The inverted <code class="bar">Q</code> output, however, doesn't change until
<code class="bar">clk</code> is high and the output cycles around.
A similar modification implements a <code>set</code> input that forces the flip-flop high: a NOR gate replaces the first inverter.</p>
<p><a href="https://static.righto.com/images/8086-flipflop/clear-flip-flop-schematic.png"><img alt="This schematic shows the circuitry for the clear flip-flop." class="hilite" height="232" src="https://static.righto.com/images/8086-flipflop/clear-flip-flop-schematic-w600.png" title="This schematic shows the circuitry for the clear flip-flop." width="600" /></a><div class="cite">This schematic shows the circuitry for the clear flip-flop.</div></p>
<h2>Implementing a flip-flop in silicon</h2>
<p>The diagram below shows two flip-flops as they appear on the die.
The bright gray regions are doped silicon, the bottom layer of the chip.
The brown lines are polysilicon, a layer on top of the silicon. When polysilicon crosses doped silicon, a transistor is formed with a polysilicon gate.
The black circles are vias (connections) to the metal layer.
The metal layer on top provides wiring between the transistors.
I removed the metal layer with acid to make the underlying circuitry visible.
Faint purple lines remain on the die, showing where the metal wiring was.</p>
<p><a href="https://static.righto.com/images/8086-flipflop/flip-flop-die.jpg"><img alt="Two flip-flops on the 8086 die." class="hilite" height="274" src="https://static.righto.com/images/8086-flipflop/flip-flop-die-w700.jpg" title="Two flip-flops on the 8086 die." width="700" /></a><div class="cite">Two flip-flops on the 8086 die.</div></p>
<p>Although the two flip-flops have the same circuitry, their layouts on the die are completely different.
In the 8086, each transistor was carefully shaped and positioned to make the layout compact, so the layout depends on the surrounding
logic and the connections.
This is in contrast to modern standard-cell layout, which uses a standard layout for each block (logic gate,
flip-flop, etc.) and puts the cells in orderly rows.
(Intel moved to standard-cell wiring for much of the logic in the 386 processor since it is much faster to create
a standard-cell design than to perform manual layout.)</p>
<h2>Conclusions</h2>
<p>The flip-flop with <code>enable</code> input is a key part of the 8086, appearing throughout the processor.
However, the <code>enable</code> input is a fairly obscure feature for a flip-flop component;
most flip-flop chips have a clock input, but not an enable.<span id="fnref:chips"><a class="ref" href="#fn:chips">9</a></span>
Many FPGA and ASIC synthesis libraries, though, provide it, under the name
"D flip-flop with enable" or "D flip-flop with clock enable".</p>
<p>I plan to write more on the 8086, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I've also started experimenting with Mastodon recently as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>
so you can follow me there too.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:asynchronous">
<p>Some early computers were asynchronous, such as von Neumann's IAS machine (1952) and its
numerous descendants.
In this machine, there was no centralized clock. Instead, a circuit such as an adder
would send a pulse to the next circuit when it was done, triggering the next circuit in
sequence.
Thus, instruction execution would ripple through the computer.
Although almost all later computers are synchronous, there is active research into
asynchronous computing which is potentially faster and lower power. <a class="footnote-backref" href="#fnref:asynchronous" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:cross-coupled">
<p>I'm focusing on the dynamic flip-flops in this article, but I'll mention that
the 8086 has a few latches built from cross-coupled NOR gates. Most 8086 registers use
cross-coupled inverters (static memory cells) rather than flip-flops to hold bits.
I explained the 8086 processor's registers in <a href="https://www.righto.com/2020/07/the-intel-8086-processors-registers.html">this article</a>. <a class="footnote-backref" href="#fnref:cross-coupled" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:minimum-clock-speed">
<p>Dynamic circuitry is why the 8086 and many other processors have minimum clock speeds:
if the clock is too slow, signals will fade away.
For the 8086, the datasheet specifies a maximum clock period of 500 ns, corresponding to
a minimum clock speed of 2 megahertz.
The CMOS version of the Z80 processor, however, was designed so the clock could be slowed or even stopped. <a class="footnote-backref" href="#fnref:minimum-clock-speed" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:polarity">
<p>Some flip-flops in the 8086 use the inverted clock, so they transition when the clock switches from high to low.
Thus, there are two sets of transitions in the 8086 for each clock cycle. <a class="footnote-backref" href="#fnref:polarity" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:latch">
<p>The terminology gets confusing between flip-flops and latches, which sometimes refer to the same thing and sometimes different things.
The term "latch" is often used for a flip-flop that operates on the clock level, not the clock edge.
That is, when the clock input is high, the input passes through, and when the clock input is low, the value is retained.
Confusingly, the clock for a latch is often called "enable".
This is different from the <code>enable</code> input that I'm discussing, which is separate from the clock. <a class="footnote-backref" href="#fnref:latch" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:intel">
<p>I asked an Intel chip engineer if they designed the circuitry in the 8086 era
in terms of flip-flops.
He said that they typically designed the circuitry in terms of the underlying pass transistors and
gates, rather than using the flip-flop as a fundamental building block. <a class="footnote-backref" href="#fnref:intel" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:enable">
<p>You might wonder why the clock and enable are separate inputs. Why couldn't you just AND them together so when <code>enable</code> is low,
it will block the clock and the flip-flop won't transition?
That mostly works, but three factors make it a bad idea.
First, the idea of using a clock is so everything changes state at the same time. If you start putting gates in the clock
path, the clock gets a bit delayed and shifts the timing.
If the delay is too large, the input value might change before the flip-flop can latch it.
Thus, putting gates in the clock path is frowned upon.
The second factor is that combining the clock and enable signals risks race conditions.
For instance, suppose that the <code>enable</code> input goes low and high while the clock remains high. If you AND the two signals together,
this will yield a spurious clock edge, causing the flip-flop to latch its input a second time.
Finally, if you block the clock for too long, a dynamic flip-flop will lose its value.
(Note that the flip-flop circuit used in the 8086 will refresh its value on each clock even if the <code>enable</code> input is held low for a long period of time.) <a class="footnote-backref" href="#fnref:enable" title="Jump back to footnote 7 in the text">↩</a></p>
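<p>The race condition in the second factor is easy to see with a toy trace (my illustration): ANDing the clock with an enable that dips low while the clock is high manufactures an extra rising edge.</p>

```python
clk    = [1, 1, 1, 1]    # the clock sits high the whole time
enable = [1, 0, 1, 1]    # enable pulses low and back high
gated  = [c & e for c, e in zip(clk, enable)]   # naive clk AND enable

# The 0 -> 1 step in `gated` is a spurious clock edge: the flip-flop
# would latch its input a second time even though clk never fell.
spurious_edges = sum(b and not a for a, b in zip(gated, gated[1:]))
```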
</li>
<li id="fn:mux">
<p>A multiplexer can be implemented with logic gates. However, it is more compact to implement
it with pass transistors. The pass transistor implementation takes four transistors
(two fewer if the inverted enable signal is already available).
A logic gate implementation would take about nine transistors: an AND-OR-INVERT gate, an inverter
on the output, and an inverter for the enable signal. <a class="footnote-backref" href="#fnref:mux" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:chips">
<p>The common <a href="https://www.ti.com/lit/ds/symlink/sn74ls74a.pdf">7474</a> is a typical TTL flip-flop
that does not have an <code>enable</code> input.
Chips with an enable are rarer, such as the <a href="https://microchip.ua/raznoe/74f377.pdf">74F377</a>.
Strangely, one manufacturer of the <a href="https://www.onsemi.com/pdf/datasheet/mc74hc377a-d.pdf">74HC377</a> shows the
enable as affecting the output; I think they simply messed up the schematic in the datasheet since it contradicts the function table.</p>
<p>Some examples of standard-cell libraries with enable flip-flops:
<a href="https://www.infineon.com/dgdl/Infineon-Component_D_Flip_Flop_w_Enable_V1.0-Software%20Module%20Datasheets-v01_00-EN.pdf?fileId=8ac78c8c7d0d8da4017d0e858df6165e">Cypress SoC</a>,
<a href="https://www.cl.cam.ac.uk/research/srg/han/ACS-P35/documents/90nm-cell.pdf#page=179">Faraday standard cell library</a>,
<a href="http://wla.berkeley.edu/~cs150/fa01/labs/project/Xilinx_Libraries.pdf#page=272">Xilinx Unified Libraries</a>,
<a href="https://www.infineon.com/cms/en/design-support/tools/sdk/psoc-software/psoc-4-components/d-flip-flop-w-enable/">Infineon PSoC 4 Components</a>,
<a href="https://www.bitsavers.org/components/intel/_dataBooks/Introduction_to_Intel_Cell-Based_Design_1988.pdf#page=103">Intel's CHMOS-III cell library</a> (probably used for the 386 processor),
and
<a href="https://www.intel.com/content/www/us/en/programmable/quartushelp/17.0/hdl/prim/prim_file_dffe.htm">Intel Quartus FPGA</a>. <a class="footnote-backref" href="#fnref:chips" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com7tag:blogger.com,1999:blog-6264947694886887540.post-43230090943311628342023-08-12T09:26:00.008-07:002023-08-26T15:55:37.351-07:00Tracing the roots of the 8086 instruction set to the Datapoint 2200 minicomputer<style>
.hilite {cursor:zoom-in}
a:link img.hilite, a:visited img.hilite {color: #fff;}
a:hover img.hilite {color: #f66;}
</style>
<p>The Intel 8086 processor started the x86 architecture that is still extensively used today.
The 8086 has some quirky characteristics: it is little-endian, has a parity flag, and uses explicit I/O instructions
instead of just memory-mapped I/O.
It has four 16-bit registers that can be split into 8-bit registers, but only one that
can be used for memory indexing.
Surprisingly, the reason for these characteristics and more
is compatibility with a computer dating back before the creation of the microprocessor:
the Datapoint 2200, a minicomputer with a processor built out of TTL chips.
In this blog post, I'll look in detail at how the Datapoint 2200 led to the architecture of Intel's modern processors, step by step through the 8008, 8080, and 8086 processors.</p>
<!--
Some questions about the x86 architecture can be answered by looking at the historical roots of the 8086.
For instance, why is the 8086 little-endian?
Why does the 8086 have a parity flag?
Why does the 8086 provide explicit I/O instructions instead of just using memory-mapped I/O?
Why does the 8086 have four 16-bit registers that can be split into 8-bit registers?
Why does the 8086 use the BX register for memory indexing, but not AX, CX, or DX?
The answers to these questions can be found in a computer dating back before the creation of the microprocessor:
the Datapoint 2200, a minicomputer with a processor built out of TTL chips.
In this blog post, I'll look in detail at how the Datapoint 2200 led to the architecture of Intel's modern processors.
-->
<h2>The Datapoint 2200</h2>
<p>In the late 1960s, 80-column IBM punch cards were the primary way of entering data into computers, although
CRT terminals were growing in popularity.
The Datapoint 2200 was designed as a low-cost terminal that could replace a keypunch, with a squat CRT display
the size of a punch card.
By putting some processing power into the Datapoint 2200, it could perform data validation and other tasks,
making data entry more efficient.
Even though the Datapoint 2200 was typically used as an
<a href="https://worldradiohistory.com/Archive-Electronics/70s/70/Electronics-1970-08-31.pdf#page=116">intelligent terminal</a>,
it was really a desktop
<a href="https://archive.org/details/computerworld6197unse50/page/n12/mode/1up">minicomputer</a> with a "unique combination of powerful computer, display, and dual cassette drives."
Although now mostly forgotten, the Datapoint 2200 was the origin of the 8-bit microprocessor, as I'll explain below.</p>
<p><a href="https://static.righto.com/images/8086-datapoint/datapoint2200.jpg"><img alt="The Datapoint 2200 computer (Version II)." class="hilite" height="366" src="https://static.righto.com/images/8086-datapoint/datapoint2200-w500.jpg" title="The Datapoint 2200 computer (Version II)." width="500" /></a><div class="cite">The Datapoint 2200 computer (Version II).</div></p>
<p>The memory storage of the Datapoint 2200 had a large impact on its architecture and thus the architecture of today's computers.
In the 1960s and early 1970s, magnetic core memory was the dominant form of computer storage.
It consisted of tiny ferrite rings, threaded into grids, with each ring storing one bit.
Magnetic core storage was bulky and relatively expensive, though.
Semiconductor RAM was new and very expensive; Intel's first product in 1969 was a RAM chip called the 3101, which held
just 64 bits and cost $99.50.
To minimize storage costs, the Datapoint 2200 used an alternative: MOS shift-register memory.
The Intel 1405 shift-register memory chip provided much more storage than RAM chips at a much lower cost (512 bits for $13.30).<span id="fnref:shift-register"><a class="ref" href="#fn:shift-register">1</a></span></p>
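<p>Working out the cost per bit from those two prices (my arithmetic) shows just how lopsided the comparison was:</p>

```python
ram_cost_per_bit = 99.50 / 64      # Intel 3101 RAM: $99.50 for 64 bits
shift_cost_per_bit = 13.30 / 512   # Intel 1405 shift register: $13.30 for 512 bits

# roughly a 60x price advantage per bit for shift-register storage
advantage = ram_cost_per_bit / shift_cost_per_bit
print(advantage)
```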
<p><a href="https://static.righto.com/images/8086-datapoint/intel1405.jpg"><img alt="Intel 1405 shift-register memory chips in metal cans, in the Datapoint 2200." class="hilite" height="373" src="https://static.righto.com/images/8086-datapoint/intel1405-w500.jpg" title="Intel 1405 shift-register memory chips in metal cans, in the Datapoint 2200." width="500" /></a><div class="cite">Intel 1405 shift-register memory chips in metal cans, in the Datapoint 2200.</div></p>
<p>The big problem with shift-register memory is that it is sequential: the bits come out one at a time, in the same order you put them in.
This wasn't a problem when executing instructions sequentially, since the memory provided each instruction as it was needed.
For a random access, though, you need to wait until the bits circulate around and you get the one you want, which is very slow.
To minimize the number of memory accesses, the Datapoint 2200 had seven registers, a relatively large number of registers for the time.<span id="fnref:registers"><a class="ref" href="#fn:registers">2</a></span>
The registers were called A, B, C, D, E, H, and L, and these names had a lasting impact on Intel processors.</p>
<p>Another consequence of shift-register memory was that the Datapoint 2200 was a serial computer, operating on one bit
at a time as the shift-register memory provided it, using a 1-bit ALU.
To handle arithmetic operations, the ALU needed to start with the lowest bit so it could process carries.
Likewise, a 16-bit value (such as a jump target) needed to start with the lowest bit.
This resulted in a little-endian architecture, with the low byte first.
The little-endian architecture has remained in Intel processors to the present.</p>
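<p>That byte order is simple to demonstrate with modern tools (a present-day illustration, of course, not Datapoint code):</p>

```python
import struct

# A 16-bit value such as 0x1234 is stored low byte first, the order in
# which the serial ALU needed the bits so carries could propagate.
encoded = struct.pack('<H', 0x1234)   # '<' = little-endian, 'H' = 16 bits
print(encoded.hex())                  # '3412': the low byte 0x34 comes first
```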
<p>Since the Datapoint 2200 was designed before the creation of the microprocessor, its processor was built from a board of TTL chips (as was typical for minicomputers at the time).
The diagram below shows the processor board with the chips categorized by function.
The board has a separate chip for each 8-bit register (<code>B</code>, <code>C</code>, <code>D</code>, etc.) and separate chips for control flags (<code>Z</code>, carry, etc.).
The Arithmetic/Logic Unit (ALU) takes about 18 chips, while instruction decoding is another 18 chips.
Because every feature required more chips, the designers of the Datapoint 2200 were strongly motivated to make
the instruction set as simple as possible.
This was necessary since
the Datapoint 2200 was a low-cost device, renting for just <a href="https://archive.org/details/sim_computerworld_1971-07-14_5_28/page/18/mode/1up">$148</a> a month.
In contrast, the popular PDP-8 minicomputer rented for <a href="http://bitsavers.org/magazines/Computers_And_Automation/197101.pdf#page=55">$500 a month</a>.</p>
<p><a href="https://static.righto.com/images/8086-datapoint/datapoint-board.jpg"><img alt="The Datapoint 2200 processor board with registers, flags, and other blocks labeled. Click this image (or any other) for a larger version." class="hilite" height="448" src="https://static.righto.com/images/8086-datapoint/datapoint-board-w600.jpg" title="The Datapoint 2200 processor board with registers, flags, and other blocks labeled. Click this image (or any other) for a larger version." width="600" /></a><div class="cite">The Datapoint 2200 processor board with registers, flags, and other blocks labeled. Click this image (or any other) for a larger version.</div></p>
<p>One way that the Datapoint 2200 simplified the hardware was by
creating a large set of instructions by combining simpler pieces in an orthogonal way.
For instance, the Datapoint 2200 has 64 ALU instructions that apply one of eight ALU operations to one of the eight registers.
This requires a small amount of hardware—eight ALU circuits and a circuit to select the register—but
provides a large number of instructions.
Another example is the register-to-register move instructions. Specifying one of eight source registers and one of
eight destination registers provides a large, flexible set of instructions to move data.</p>
<p>The Datapoint 2200's instruction format was designed around this principle, with groups of three bits specifying a
register. A common TTL chip could decode the group of three bits and activate the desired circuit.<span id="fnref:decoder"><a class="ref" href="#fn:decoder">3</a></span>
For instance, a data move instruction had the bit pattern <code>11DDDSSS</code> to move a byte from the specified source (SSS) to the specified destination (DDD).
(Note that this bit pattern maps onto three octal digits very nicely since the source and destination are separate digits.<span id="fnref:octal"><a class="ref" href="#fn:octal">4</a></span>)</p>
<p>One unusual feature of the Datapoint instruction set is that a memory access was just like a register access.
That is, an instruction could specify one of the seven physical registers or could specify a memory access (<code>M</code>),
using the identical instruction format.
One consequence of this is that you couldn't include a memory address in an instruction.
Instead, memory could only be accessed by first loading the address into the <code>H</code> and <code>L</code> registers, which held the high and
low byte of the address respectively.<span id="fnref:hl"><a class="ref" href="#fn:hl">5</a></span>
This is very unusual and inconvenient, since a memory access took three instructions: two to load the <code>H</code> and <code>L</code> registers and one to access memory as the <code>M</code> "register".
The advantage was that it simplified the instruction set and the decoding logic, saving chips and thus reducing
the system cost.
This decision also had lasting impact on Intel processors and how they access memory.</p>
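<p>The <code>H</code>/<code>L</code> addressing scheme can be sketched as follows (a simplified illustration with a Python dictionary standing in for memory; the register and memory values are made up):</p>

```python
def read_m(regs, memory):
    """Read the pseudo-register M: the byte at address (H << 8) | L."""
    return memory[(regs["H"] << 8) | regs["L"]]


# Reading the byte at address 0x12F3 takes three instructions on the 2200:
# load H with 0x12, load L with 0xF3, then access M as if it were a register.
regs = {"H": 0x12, "L": 0xF3}
memory = {0x12F3: 0x42}
value = read_m(regs, memory)
```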
<p>The table below shows the Datapoint 2200's instruction set in an octal table showing the 256 potential opcodes.<span id="fnref:octaltable"><a class="ref" href="#fn:octaltable">6</a></span>
I have roughly classified the instructions as arithmetic/logic (purple), control-flow (blue), data movement (green),
input/output (orange), and miscellaneous (yellow).
Note how the orthogonal instruction format produces large blocks of related instructions.
The instructions in the lower right (green) load (<code>L</code>) a value from a source to a destination.
(The no-operation <code>NOP</code> and <code>HALT</code> instructions are special cases.<span id="fnref:nop"><a class="ref" href="#fn:nop">7</a></span>)
In the upper-left are Load operations (<code>LA</code>, etc.) that use an "immediate" byte, a data byte that follows the instruction.
They use the same <code>DDD</code> code to specify the destination register, reusing that circuitry.</p>
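<p>As a hypothetical illustration of this field reuse, the immediate loads can be decoded with the same bit extraction as the register moves (the <code>00DDD110</code> bit pattern is read off the table below):</p>

```python
def decode_load_immediate(opcode):
    """Decode a Datapoint 2200 immediate load, bit pattern 00DDD110.
    The DDD destination field occupies the same bit positions as in the
    register-move instructions, so the same decoding circuit serves both."""
    assert opcode & 0b11000111 == 0b00000110, "not an immediate load"
    return "L" + "ABCDEHLM"[(opcode >> 3) & 0b111]
```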
<style type="text/css">
table.dp {border-collapse: collapse; font-size: 80%;}
table.dp th, table.dp td {text-align: center;}
table.dp td {border: 1px solid #ccc; padding: 3px; min-width: 4em;}
table.dp th:nth-child(10) {border-left: 2px solid #444}
table.dp td:nth-child(10) {border-left: 2px solid #444}
table.dp tr:nth-child(10) {border-top: 2px solid #444}
table.dp td.mov {background: #dfa;}
table.dp td.arith {background: #edf;}
table.dp td.ctrl {background: #def;}
table.dp td.misc {background: #ff9;}
table.dp td.io {background: #fd9;}
table.dp td.undoc {background: #f44;}
</style>
<table class="dp">
<tr><th> </th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th></tr>
<tr><th>0</th><td class="misc">HALT</td><td class="misc">HALT</td><td class="arith">SLC</td><td class="ctrl">RFC</td><td class="arith">AD</td><td> </td><td class="mov">LA</td><td class="ctrl">RETURN</td><td class="ctrl">JFC</td><td class="io">INPUT</td><td class="ctrl">CFC</td><td> </td><td class="ctrl">JMP</td><td> </td><td class="ctrl">CALL</td><td> </td></tr>
<tr><th>1</th><td> </td><td> </td><td class="arith">SRC</td><td class="ctrl">RFZ</td><td class="arith">AC</td><td> </td><td class="mov">LB</td><td> </td><td class="ctrl">JFZ</td><td> </td><td class="ctrl">CFZ</td><td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><th>2</th><td> </td><td> </td><td> </td><td class="ctrl">RFS</td><td class="arith">SU</td><td> </td><td class="mov">LC</td><td> </td><td class="ctrl">JFS</td><td class="io">EX ADR</td><td class="ctrl">CFS</td><td class="io">EX STATUS</td><td> </td><td class="io">EX DATA</td><td> </td><td class="io">EX WRITE</td></tr>
<tr><th>3</th><td> </td><td> </td><td> </td><td class="ctrl">RFP</td><td class="arith">SB</td><td> </td><td class="mov">LD</td><td> </td><td class="ctrl">JFP</td><td class="io">EX COM1</td><td class="ctrl">CFP</td><td class="io">EX COM2</td><td> </td><td class="io">EX COM3</td><td> </td><td class="io">EX COM4</td></tr>
<tr><th>4</th><td> </td><td> </td><td> </td><td class="ctrl">RTC</td><td class="arith">ND</td><td> </td><td class="mov">LE</td><td> </td><td class="ctrl">JTC</td><td> </td><td class="ctrl">CTC</td><td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><th>5</th><td> </td><td> </td><td> </td><td class="ctrl">RTZ</td><td class="arith">XR</td><td> </td><td class="mov">LH</td><td> </td><td class="ctrl">JTZ</td><td class="io">EX BEEP</td><td class="ctrl">CTZ</td><td class="io">EX CLICK</td><td> </td><td class="io">EX DECK1</td><td> </td><td class="io">EX DECK2</td></tr>
<tr><th>6</th><td> </td><td> </td><td> </td><td class="ctrl">RTS</td><td class="arith">OR</td><td> </td><td class="mov">LL</td><td> </td><td class="ctrl">JTS</td><td class="io">EX RBK</td><td class="ctrl">CTS</td><td class="io">EX WBK</td><td> </td><td> </td><td> </td><td class="io">EX BSP</td></tr>
<tr><th>7</th><td> </td><td> </td><td> </td><td class="ctrl">RTP</td><td class="arith">CP</td><td> </td><td> </td><td> </td><td class="ctrl">JTP</td><td class="io">EX SF</td><td class="ctrl">CTP</td><td class="io">EX SB</td><td> </td><td class="io">EX REWND</td><td> </td><td class="io">EX TSTOP</td></tr>
<tr><th>0</th><td class="arith">ADA</td><td class="arith">ADB</td><td class="arith">ADC</td><td class="arith">ADD</td><td class="arith">ADE</td><td class="arith">ADH</td><td class="arith">ADL</td><td class="arith">ADM</td><td class="misc">NOP</td><td class="mov">LAB</td><td class="mov">LAC</td><td class="mov">LAD</td><td class="mov">LAE</td><td class="mov">LAH</td><td class="mov">LAL</td><td class="mov">LAM</td></tr>
<tr><th>1</th><td class="arith">ACA</td><td class="arith">ACB</td><td class="arith">ACC</td><td class="arith">ACD</td><td class="arith">ACE</td><td class="arith">ACH</td><td class="arith">ACL</td><td class="arith">ACM</td><td class="mov">LBA</td><td class="mov">LBB</td><td class="mov">LBC</td><td class="mov">LBD</td><td class="mov">LBE</td><td class="mov">LBH</td><td class="mov">LBL</td><td class="mov">LBM</td></tr>
<tr><th>2</th><td class="arith">SUA</td><td class="arith">SUB</td><td class="arith">SUC</td><td class="arith">SUD</td><td class="arith">SUE</td><td class="arith">SUH</td><td class="arith">SUL</td><td class="arith">SUM</td><td class="mov">LCA</td><td class="mov">LCB</td><td class="mov">LCC</td><td class="mov">LCD</td><td class="mov">LCE</td><td class="mov">LCH</td><td class="mov">LCL</td><td class="mov">LCM</td></tr>
<tr><th>3</th><td class="arith">SBA</td><td class="arith">SBB</td><td class="arith">SBC</td><td class="arith">SBD</td><td class="arith">SBE</td><td class="arith">SBH</td><td class="arith">SBL</td><td class="arith">SBM</td><td class="mov">LDA</td><td class="mov">LDB</td><td class="mov">LDC</td><td class="mov">LDD</td><td class="mov">LDE</td><td class="mov">LDH</td><td class="mov">LDL</td><td class="mov">LDM</td></tr>
<tr><th>4</th><td class="arith">NDA</td><td class="arith">NDB</td><td class="arith">NDC</td><td class="arith">NDD</td><td class="arith">NDE</td><td class="arith">NDH</td><td class="arith">NDL</td><td class="arith">NDM</td><td class="mov">LEA</td><td class="mov">LEB</td><td class="mov">LEC</td><td class="mov">LED</td><td class="mov">LEE</td><td class="mov">LEH</td><td class="mov">LEL</td><td class="mov">LEM</td></tr>
<tr><th>5</th><td class="arith">XRA</td><td class="arith">XRB</td><td class="arith">XRC</td><td class="arith">XRD</td><td class="arith">XRE</td><td class="arith">XRH</td><td class="arith">XRL</td><td class="arith">XRM</td><td class="mov">LHA</td><td class="mov">LHB</td><td class="mov">LHC</td><td class="mov">LHD</td><td class="mov">LHE</td><td class="mov">LHH</td><td class="mov">LHL</td><td class="mov">LHM</td></tr>
<tr><th>6</th><td class="arith">ORA</td><td class="arith">ORB</td><td class="arith">ORC</td><td class="arith">ORD</td><td class="arith">ORE</td><td class="arith">ORH</td><td class="arith">ORL</td><td class="arith">ORM</td><td class="mov">LLA</td><td class="mov">LLB</td><td class="mov">LLC</td><td class="mov">LLD</td><td class="mov">LLE</td><td class="mov">LLH</td><td class="mov">LLL</td><td class="mov">LLM</td></tr>
<tr><th>7</th><td class="arith">CPA</td><td class="arith">CPB</td><td class="arith">CPC</td><td class="arith">CPD</td><td class="arith">CPE</td><td class="arith">CPH</td><td class="arith">CPL</td><td class="arith">CPM</td><td class="mov">LMA</td><td class="mov">LMB</td><td class="mov">LMC</td><td class="mov">LMD</td><td class="mov">LME</td><td class="mov">LMH</td><td class="mov">LML</td><td class="misc">HALT</td></tr>
</table>
<p>The lower-left quadrant (purple) has the bulk of the ALU instructions.
These instructions have a regular, orthogonal structure making the instructions easy to decode: each row specifies the operation while each column
specifies the source.
This is due to the instruction structure:
eight bits in the pattern <code>10AAASSS</code>, where the <code>AAA</code> bits specified the
ALU operation and the <code>SSS</code> bits specified the register source.
The three-bit ALU code specifies the operations
Add, Add with Carry, Subtract, Subtract with Borrow, logical AND, logical XOR, logical OR,
and Compare.
This list is important because it defined the fundamental ALU operations for later Intel processors.<span id="fnref:alu"><a class="ref" href="#fn:alu">8</a></span>
In the upper-left are ALU operations that use an "immediate" byte.
These instructions use the same <code>AAA</code> bit pattern to select the ALU operation, reusing the decoding hardware.
Finally, the shift instructions <code>SLC</code> and <code>SRC</code> are implemented as special cases outside the pattern.</p>
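<p>The same style of decoding applies to the ALU instructions; here is a hypothetical Python sketch, with the operation mnemonics taken from the table above:</p>

```python
ALU_OPS = ["AD", "AC", "SU", "SB", "ND", "XR", "OR", "CP"]  # AAA codes 0-7
REGS = "ABCDEHLM"                                           # SSS codes 0-7


def decode_alu(opcode):
    """Decode a Datapoint 2200 ALU instruction, bit pattern 10AAASSS."""
    assert (opcode >> 6) == 0b10, "not an ALU instruction"
    return ALU_OPS[(opcode >> 3) & 0b111] + REGS[opcode & 0b111]


print(decode_alu(0o245))  # NDH: AND register H into the accumulator
```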
<p>The upper columns contain conditional instructions in blue—Return, Jump, and Call.
The eight conditions test the four status flags (Carry, Zero, Sign, and Parity) for either True or False.
(For example, <code>JFZ</code> Jumps if the Zero flag is False.)
A 3-bit field selects the condition, allowing it to be easily decoded in hardware.
The parity flag is somewhat unusual: parity is surprisingly expensive to compute in hardware,
but since the Datapoint 2200 operated as a terminal, parity checking was important.</p>
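<p>To illustrate both points, a hypothetical Python sketch: the first function reads the 3-bit condition field off a conditional-return opcode (one true/false bit plus two bits selecting the flag, matching the layout in the table above), and the second shows why the parity flag is costly, folding the byte down with the same XOR tree that hardware must implement across all eight bits. (The even-parity sense is an assumption based on the 8008/8080 descendants, which set the flag when the byte has an even number of 1 bits.)</p>

```python
FLAGS = "CZSP"  # carry, zero, sign, parity


def decode_cond_return(opcode):
    """Decode a Datapoint 2200 conditional return, bit pattern 00tff011:
    t = test for true (1) or false (0), ff = which flag to test."""
    assert opcode & 0b11000111 == 0b00000011, "not a conditional return"
    sense = "T" if (opcode >> 5) & 1 else "F"
    return "R" + sense + FLAGS[(opcode >> 3) & 0b11]


def parity(byte):
    """Even-parity flag: 1 if the byte contains an even number of 1 bits.
    The XOR folding mirrors the tree of XOR gates needed in hardware."""
    byte ^= byte >> 4
    byte ^= byte >> 2
    byte ^= byte >> 1
    return (~byte) & 1
```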
<p>The Datapoint 2200 has an input instruction as well as many output instructions for a variety of specific hardware tasks
(orange, labeled <code>EX</code> for external).
Typical operations are <code>STATUS</code> to get I/O status, <code>BEEP</code> and <code>CLICK</code> to make sound, and <code>REWIND</code> to rewind the tape.
As a result of this decision to use separate I/O instructions, Intel processors still use I/O instructions operating in
an I/O space, different from processors such as the MOS 6502 and the Motorola 68000 that used memory-mapped I/O.</p>
<p>To summarize, the Datapoint 2200 has a fairly large number of instructions, but they are generated from about a dozen simple patterns that are easy to decode.<span id="fnref:risc"><a class="ref" href="#fn:risc">9</a></span>
By combining orthogonal bit fields (e.g. 8 ALU operations multiplied by 8 source registers), 64 instructions
can be generated from one underlying pattern.</p>
<h2>Intel 8008</h2>
<p>The Intel 8008 was created as a clone of the Datapoint 2200 processor.<span id="fnref:4004"><a class="ref" href="#fn:4004">10</a></span>
Around the end of 1969, the Datapoint company talked with
Intel and Texas Instruments about the possibility of replacing the processor board with a single chip.
Even though the microprocessor didn't exist at this point, both companies said they could create such a chip.
Texas Instruments was first with a chip called the TMX 1795 that they <a href="https://archive.org/details/bitsavers_ElectronicignV19N1219710610_71671761/page/n37/mode/2up">advertised</a> as a "CPU on a chip".
Slightly later, Intel produced the 8008 microprocessor.
Both chips copied the Datapoint 2200's instruction set architecture with minor changes.</p>
<p><a href="https://static.righto.com/images/8086-datapoint/Intel_C8008.jpg"><img alt="The Intel 8008 chip in its 18-pin package. The small number of pins hampered the performance of the 8008, but Intel was hesitant to even go to the 18-pin package. Photo by Thomas Nguyen, (CC BY-SA 4.0)." class="hilite" height="216" src="https://static.righto.com/images/8086-datapoint/Intel_C8008-w350.jpg" title="The Intel 8008 chip in its 18-pin package. The small number of pins hampered the performance of the 8008, but Intel was hesitant to even go to the 18-pin package. Photo by Thomas Nguyen, (CC BY-SA 4.0)." width="350" /></a><div class="cite">The Intel 8008 chip in its 18-pin package. The small number of pins hampered the performance of the 8008, but Intel was hesitant to even go to the 18-pin package. Photo by <a href="https://commons.wikimedia.org/wiki/File:Intel_C8008.jpg">Thomas Nguyen</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0/deed.en">(CC BY-SA 4.0)</a>.</div></p>
<p>By the time the chips were completed, however, the Datapoint corporation had lost interest in the chips. They were designing a much faster
version of the Datapoint 2200 with improved TTL chips (including the well-known <a href="https://www.righto.com/2017/01/die-photos-and-reverse-engineering.html">74181 ALU chip</a>).
Even the original Datapoint 2200 model was faster than the Intel 8008 processor, and the Version II was over 5 times faster,<span id="fnref:version2"><a class="ref" href="#fn:version2">11</a></span>
so moving to a single-chip processor would be a step backward.</p>
<p>Texas Instruments unsuccessfully tried to find a customer for their TMX 1795 chip and ended up abandoning the chip.
Intel, however, marketed the 8008 as an 8-bit microprocessor, essentially creating the microprocessor industry.
In my view, Intel's biggest innovation with the microprocessor wasn't creating a single-chip CPU,
but creating the microprocessor as a product category: a general-purpose processor along with everything customers
needed to take advantage of it.
Intel put an enormous amount of effort into making microprocessors a success:
from documentation and customer training to <a href="https://en.wikipedia.org/wiki/Intellec">Intellec</a> development systems,
from support chips to software tools such as assemblers, compilers, and operating systems.</p>
<p>The table below shows the opcodes of the 8008.
For the most part, the 8008 copies the Datapoint 2200, with identical instructions that have identical opcodes (in color).
There are a few additional instructions (shown in white), though.
Intel designer Ted Hoff realized that increment and decrement instructions (<code>IN</code> and <code>DC</code>) would be very useful for loops.
There are two additional bit-rotate instructions (<code>RAL</code> and <code>RAR</code>) as well as the "missing" <code>LMI</code> (Load Immediate to Memory)
instruction.
The <code>RST</code> (restart) instructions act as short call instructions to fixed addresses for
interrupt handling.
Finally, the 8008 turned the Datapoint 2200's device-specific I/O instructions into 32 generic I/O instructions.</p>
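<p>The <code>RST</code> instructions pack their target address into the opcode itself. A hypothetical sketch (the <code>00AAA101</code> bit pattern is read off the table below; the eight-byte spacing of the handlers is standard 8008 behavior):</p>

```python
def rst_target(opcode):
    """Target of an 8008 RST instruction, bit pattern 00AAA101:
    a one-byte call to one of eight fixed addresses 0, 8, ..., 56,
    spaced 8 bytes apart to hold short interrupt handlers."""
    assert opcode & 0b11000111 == 0b00000101, "not an RST instruction"
    return ((opcode >> 3) & 0b111) << 3
```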
<table class="dp">
<tr><th> </th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th></tr>
<tr><th>0</th><td class="misc">HLT</td><td class="misc">HLT</td><td class="arith">RLC</td><td class="ctrl">RFC</td><td class="arith">ADI</td><td class="new">RST 0</td><td class="mov">LAI</td><td class="ctrl">RET</td><td class="ctrl">JFC</td><td class="io">INP 0</td><td class="ctrl">CFC</td><td class="io">INP 1</td><td class="ctrl">JMP</td><td class="io">INP 2</td><td class="ctrl">CAL</td><td class="io">INP 3</td></tr>
<tr><th>1</th><td class="new">INB</td><td class="new">DCB</td><td class="arith">RRC</td><td class="ctrl">RFZ</td><td class="arith">ACI</td><td class="new">RST 1</td><td class="mov">LBI</td><td> </td><td class="ctrl">JFZ</td><td class="io">INP 4</td><td class="ctrl">CFZ</td><td class="io">INP 5</td><td> </td><td class="io">INP 6</td><td> </td><td class="io">INP 7</td></tr>
<tr><th>2</th><td class="new">INC</td><td class="new">DCC</td><td class="new">RAL</td><td class="ctrl">RFS</td><td class="arith">SUI</td><td class="new">RST 2</td><td class="mov">LCI</td><td> </td><td class="ctrl">JFS</td><td class="io">OUT 8</td><td class="ctrl">CFS</td><td class="io">OUT 9</td><td> </td><td class="io">OUT 10</td><td> </td><td class="io">OUT 11</td></tr>
<tr><th>3</th><td class="new">IND</td><td class="new">DCD</td><td class="new">RAR</td><td class="ctrl">RFP</td><td class="arith">SBI</td><td class="new">RST 3</td><td class="mov">LDI</td><td> </td><td class="ctrl">JFP</td><td class="io">OUT 12</td><td class="ctrl">CFP</td><td class="io">OUT 13</td><td> </td><td class="io">OUT 14</td><td> </td><td class="io">OUT 15</td></tr>
<tr><th>4</th><td class="new">INE</td><td class="new">DCE</td><td> </td><td class="ctrl">RTC</td><td class="arith">NDI</td><td class="new">RST 4</td><td class="mov">LEI</td><td> </td><td class="ctrl">JTC</td><td class="io">OUT 16</td><td class="ctrl">CTC</td><td class="io">OUT 17</td><td> </td><td class="io">OUT 18</td><td> </td><td class="io">OUT 19</td></tr>
<tr><th>5</th><td class="new">INH</td><td class="new">DCH</td><td> </td><td class="ctrl">RTZ</td><td class="arith">XRI</td><td class="new">RST 5</td><td class="mov">LHI</td><td> </td><td class="ctrl">JTZ</td><td class="io">OUT 20</td><td class="ctrl">CTZ</td><td class="io">OUT 21</td><td> </td><td class="io">OUT 22</td><td> </td><td class="io">OUT 23</td></tr>
<tr><th>6</th><td class="new">INL</td><td class="new">DCL</td><td> </td><td class="ctrl">RTS</td><td class="arith">ORI</td><td class="new">RST 6</td><td class="mov">LLI</td><td> </td><td class="ctrl">JTS</td><td class="io">OUT 24</td><td class="ctrl">CTS</td><td class="io">OUT 25</td><td> </td><td class="io">OUT 26</td><td> </td><td class="io">OUT 27</td></tr>
<tr><th>7</th><td> </td><td> </td><td> </td><td class="ctrl">RTP</td><td class="arith">CPI</td><td class="new">RST 7</td><td class="new">LMI</td><td> </td><td class="ctrl">JTP</td><td class="io">OUT 28</td><td class="ctrl">CTP</td><td class="io">OUT 29</td><td> </td><td class="io">OUT 30</td><td> </td><td class="io">OUT 31</td></tr>
<tr><th>0</th><td class="arith">ADA</td><td class="arith">ADB</td><td class="arith">ADC</td><td class="arith">ADD</td><td class="arith">ADE</td><td class="arith">ADH</td><td class="arith">ADL</td><td class="arith">ADM</td><td class="misc">NOP</td><td class="mov">LAB</td><td class="mov">LAC</td><td class="mov">LAD</td><td class="mov">LAE</td><td class="mov">LAH</td><td class="mov">LAL</td><td class="mov">LAM</td></tr>
<tr><th>1</th><td class="arith">ACA</td><td class="arith">ACB</td><td class="arith">ACC</td><td class="arith">ACD</td><td class="arith">ACE</td><td class="arith">ACH</td><td class="arith">ACL</td><td class="arith">ACM</td><td class="mov">LBA</td><td class="mov">LBB</td><td class="mov">LBC</td><td class="mov">LBD</td><td class="mov">LBE</td><td class="mov">LBH</td><td class="mov">LBL</td><td class="mov">LBM</td></tr>
<tr><th>2</th><td class="arith">SUA</td><td class="arith">SUB</td><td class="arith">SUC</td><td class="arith">SUD</td><td class="arith">SUE</td><td class="arith">SUH</td><td class="arith">SUL</td><td class="arith">SUM</td><td class="mov">LCA</td><td class="mov">LCB</td><td class="mov">LCC</td><td class="mov">LCD</td><td class="mov">LCE</td><td class="mov">LCH</td><td class="mov">LCL</td><td class="mov">LCM</td></tr>
<tr><th>3</th><td class="arith">SBA</td><td class="arith">SBB</td><td class="arith">SBC</td><td class="arith">SBD</td><td class="arith">SBE</td><td class="arith">SBH</td><td class="arith">SBL</td><td class="arith">SBM</td><td class="mov">LDA</td><td class="mov">LDB</td><td class="mov">LDC</td><td class="mov">LDD</td><td class="mov">LDE</td><td class="mov">LDH</td><td class="mov">LDL</td><td class="mov">LDM</td></tr>
<tr><th>4</th><td class="arith">NDA</td><td class="arith">NDB</td><td class="arith">NDC</td><td class="arith">NDD</td><td class="arith">NDE</td><td class="arith">NDH</td><td class="arith">NDL</td><td class="arith">NDM</td><td class="mov">LEA</td><td class="mov">LEB</td><td class="mov">LEC</td><td class="mov">LED</td><td class="mov">LEE</td><td class="mov">LEH</td><td class="mov">LEL</td><td class="mov">LEM</td></tr>
<tr><th>5</th><td class="arith">XRA</td><td class="arith">XRB</td><td class="arith">XRC</td><td class="arith">XRD</td><td class="arith">XRE</td><td class="arith">XRH</td><td class="arith">XRL</td><td class="arith">XRM</td><td class="mov">LHA</td><td class="mov">LHB</td><td class="mov">LHC</td><td class="mov">LHD</td><td class="mov">LHE</td><td class="mov">LHH</td><td class="mov">LHL</td><td class="mov">LHM</td></tr>
<tr><th>6</th><td class="arith">ORA</td><td class="arith">ORB</td><td class="arith">ORC</td><td class="arith">ORD</td><td class="arith">ORE</td><td class="arith">ORH</td><td class="arith">ORL</td><td class="arith">ORM</td><td class="mov">LLA</td><td class="mov">LLB</td><td class="mov">LLC</td><td class="mov">LLD</td><td class="mov">LLE</td><td class="mov">LLH</td><td class="mov">LLL</td><td class="mov">LLM</td></tr>
<tr><th>7</th><td class="arith">CPA</td><td class="arith">CPB</td><td class="arith">CPC</td><td class="arith">CPD</td><td class="arith">CPE</td><td class="arith">CPH</td><td class="arith">CPL</td><td class="arith">CPM</td><td class="mov">LMA</td><td class="mov">LMB</td><td class="mov">LMC</td><td class="mov">LMD</td><td class="mov">LME</td><td class="mov">LMH</td><td class="mov">LML</td><td class="misc">HLT</td></tr>
</table>
<h2>Intel 8080</h2>
<p>The 8080 improved the 8008 in many ways, focusing on speed and ease of use, and resolving customer issues with the 8008.<span id="fnref:faggin"><a class="ref" href="#fn:faggin">12</a></span>
Customers had criticized the 8008 for its small memory capacity, low speed, and difficult hardware interfacing.
The 8080 increased memory capacity from 16K to 64K and
was over an order of magnitude faster than the 8008.
The 8080 also moved to a 40-pin package that made interfacing easier, but the 8080 still required a large number of
support chips to build a working system.</p>
<p>Although the 8080 was widely used in embedded systems, it is more famous for its use in
the first generation of home computers, boxes such as the Altair and IMSAI.
Famed chip designer Federico Faggin <a href="https://archive.org/details/byte-magazine-1992-03/page/n175/mode/2up">said</a> that the
8080 really created the microprocessor; the 4004 and 8008 suggested it, but the 8080 made it real.<span id="fnref:usable"><a class="ref" href="#fn:usable">13</a></span></p>
<p><a href="https://static.righto.com/images/8086-datapoint/Altair_8800.jpg"><img alt="Altair 8800 computer on display at the Smithsonian. Photo by Colin Douglas, (CC BY-SA 2.0)." class="hilite" height="250" src="https://static.righto.com/images/8086-datapoint/Altair_8800-w500.jpg" title="Altair 8800 computer on display at the Smithsonian. Photo by Colin Douglas, (CC BY-SA 2.0)." width="500" /></a><div class="cite">Altair 8800 computer on display at the Smithsonian. Photo by <a href="https://commons.wikimedia.org/wiki/File:Altair_8800,_Smithsonian_Museum_(white_background).jpg">Colin Douglas</a>, <a href="https://creativecommons.org/licenses/by-sa/2.0/deed.en">(CC BY-SA 2.0)</a>.</div></p>
<p>The table below shows the instruction set for the 8080.
The 8080 was designed to be compatible with 8008 assembly programs after a simple translation process;
the instructions have been shifted around and the names have changed.<span id="fnref:8080-shift"><a class="ref" href="#fn:8080-shift">15</a></span>
The instructions from the Datapoint 2200 (colored) form the majority of the 8080's instruction set.
The instruction set was expanded by adding some 16-bit support, allowing register pairs (<code>BC</code>, <code>DE</code>, <code>HL</code>) to be used as 16-bit registers for
double add, 16-bit increment and decrement, and 16-bit memory transfers.
Many of the new instructions in the 8080 may seem like contrived special cases—
for example, <code>SPHL</code> (Load <code>SP</code> from <code>HL</code>) and <code>XCHG</code> (Exchange <code>DE</code> and <code>HL</code>)—
but they made accesses to memory easier.
The I/O instructions from the 8008 have been condensed to just <code>IN</code> and <code>OUT</code>, opening up room for new instructions.</p>
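<p>The 16-bit register-pair behavior can be sketched in Python; the snippet below models the <code>DAD</code> (double add) instruction under the standard 8080 semantics, where only the carry flag is affected (the register values here are made up):</p>

```python
def dad(regs, hi, lo):
    """Sketch of the 8080 DAD instruction: add a register pair
    (e.g. B/C) into the HL pair, updating only the carry flag."""
    hl = (regs["H"] << 8) | regs["L"]
    addend = (regs[hi] << 8) | regs[lo]
    total = hl + addend
    regs["H"], regs["L"] = (total >> 8) & 0xFF, total & 0xFF
    regs["CY"] = 1 if total > 0xFFFF else 0


regs = {"H": 0x12, "L": 0x34, "B": 0x00, "C": 0xCC, "CY": 0}
dad(regs, "B", "C")  # HL becomes 0x1234 + 0x00CC = 0x1300, no carry
```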
<table class="dp">
<tr><th> </th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th></tr>
<tr><th>0</th><td class="misc">NOP</td><td class="new">LXI B</td><td class="new">STAX B</td><td class="new">INX B</td><td class="na">INR B</td><td class="na">DCR B</td><td class="mov">MVI B</td><td class="arith">RLC</td><td class="mov">MOV B,B</td><td class="mov">MOV B,C</td><td class="mov">MOV B,D</td><td class="mov">MOV B,E</td><td class="mov">MOV B,H</td><td class="mov">MOV B,L</td><td class="mov">MOV B,M</td><td class="mov">MOV B,A</td></tr>
<tr><th>1</th><td> </td><td class="new">DAD B</td><td class="new">LDAX B</td><td class="new">DCX B</td><td class="na">INR C</td><td class="na">DCR C</td><td class="mov">MVI C</td><td class="arith">RRC</td><td class="mov">MOV C,B</td><td class="mov">MOV C,C</td><td class="mov">MOV C,D</td><td class="mov">MOV C,E</td><td class="mov">MOV C,H</td><td class="mov">MOV C,L</td><td class="mov">MOV C,M</td><td class="mov">MOV C,A</td></tr>
<tr><th>2</th><td> </td><td class="new">LXI D</td><td class="new">STAX D</td><td class="new">INX D</td><td class="na">INR D</td><td class="na">DCR D</td><td class="mov">MVI D</td><td class="na">RAL</td><td class="mov">MOV D,B</td><td class="mov">MOV D,C</td><td class="mov">MOV D,D</td><td class="mov">MOV D,E</td><td class="mov">MOV D,H</td><td class="mov">MOV D,L</td><td class="mov">MOV D,M</td><td class="mov">MOV D,A</td></tr>
<tr><th>3</th><td> </td><td class="new">DAD D</td><td class="new">LDAX D</td><td class="new">DCX D</td><td class="na">INR E</td><td class="na">DCR E</td><td class="mov">MVI E</td><td class="na">RAR</td><td class="mov">MOV E,B</td><td class="mov">MOV E,C</td><td class="mov">MOV E,D</td><td class="mov">MOV E,E</td><td class="mov">MOV E,H</td><td class="mov">MOV E,L</td><td class="mov">MOV E,M</td><td class="mov">MOV E,A</td></tr>
<tr><th>4</th><td> </td><td class="new">LXI H</td><td class="new">SHLD</td><td class="new">INX H</td><td class="na">INR H</td><td class="na">DCR H</td><td class="mov">MVI H</td><td class="new">DAA</td><td class="mov">MOV H,B</td><td class="mov">MOV H,C</td><td class="mov">MOV H,D</td><td class="mov">MOV H,E</td><td class="mov">MOV H,H</td><td class="mov">MOV H,L</td><td class="mov">MOV H,M</td><td class="mov">MOV H,A</td></tr>
<tr><th>5</th><td> </td><td class="new">DAD H</td><td class="new">LHLD</td><td class="new">DCX H</td><td class="na">INR L</td><td class="na">DCR L</td><td class="mov">MVI L</td><td class="new">CMA</td><td class="mov">MOV L,B</td><td class="mov">MOV L,C</td><td class="mov">MOV L,D</td><td class="mov">MOV L,E</td><td class="mov">MOV L,H</td><td class="mov">MOV L,L</td><td class="mov">MOV L,M</td><td class="mov">MOV L,A</td></tr>
<tr><th>6</th><td> </td><td class="new">LXI SP</td><td class="new">STA</td><td class="new">INX SP</td><td class="na">INR M</td><td class="na">DCR M</td><td class="mov">MVI M</td><td class="new">STC</td><td class="mov">MOV M,B</td><td class="mov">MOV M,C</td><td class="mov">MOV M,D</td><td class="mov">MOV M,E</td><td class="mov">MOV M,H</td><td class="mov">MOV M,L</td><td class="misc">HLT</td><td class="mov">MOV M,A</td></tr>
<tr><th>7</th><td> </td><td class="new">DAD SP</td><td class="new">LDA</td><td class="new">DCX SP</td><td class="na">INR A</td><td class="na">DCR A</td><td class="mov">MVI A</td><td class="new">CMC</td><td class="mov">MOV A,B</td><td class="mov">MOV A,C</td><td class="mov">MOV A,D</td><td class="mov">MOV A,E</td><td class="mov">MOV A,H</td><td class="mov">MOV A,L</td><td class="mov">MOV A,M</td><td class="mov">MOV A,A</td></tr>
<tr><th>0</th><td class="arith">ADD B</td><td class="arith">ADD C</td><td class="arith">ADD D</td><td class="arith">ADD E</td><td class="arith">ADD H</td><td class="arith">ADD L</td><td class="arith">ADD M</td><td class="arith">ADD A</td><td class="ctrl">RNZ</td><td class="new">POP B</td><td class="ctrl">JNZ</td><td class="ctrl">JMP</td><td class="ctrl">CNZ</td><td class="new">PUSH B</td><td class="arith">ADI</td><td class="na">RST 0</td></tr>
<tr><th>1</th><td class="arith">ADC B</td><td class="arith">ADC C</td><td class="arith">ADC D</td><td class="arith">ADC E</td><td class="arith">ADC H</td><td class="arith">ADC L</td><td class="arith">ADC M</td><td class="arith">ADC A</td><td class="ctrl">RZ</td><td class="ctrl">RET</td><td class="ctrl">JZ</td><td> </td><td class="ctrl">CZ</td><td class="ctrl">CALL</td><td class="arith">ACI</td><td class="na">RST 1</td></tr>
<tr><th>2</th><td class="arith">SUB B</td><td class="arith">SUB C</td><td class="arith">SUB D</td><td class="arith">SUB E</td><td class="arith">SUB H</td><td class="arith">SUB L</td><td class="arith">SUB M</td><td class="arith">SUB A</td><td class="ctrl">RNC</td><td class="new">POP D</td><td class="ctrl">JNC</td><td class="io">OUT</td><td class="ctrl">CNC</td><td class="new">PUSH D</td><td class="arith">SUI</td><td class="na">RST 2</td></tr>
<tr><th>3</th><td class="arith">SBB B</td><td class="arith">SBB C</td><td class="arith">SBB D</td><td class="arith">SBB E</td><td class="arith">SBB H</td><td class="arith">SBB L</td><td class="arith">SBB M</td><td class="arith">SBB A</td><td class="ctrl">RC</td><td> </td><td class="ctrl">JC</td><td class="io">IN</td><td class="ctrl">CC</td><td> </td><td class="arith">SBI</td><td class="na">RST 3</td></tr>
<tr><th>4</th><td class="arith">ANA B</td><td class="arith">ANA C</td><td class="arith">ANA D</td><td class="arith">ANA E</td><td class="arith">ANA H</td><td class="arith">ANA L</td><td class="arith">ANA M</td><td class="arith">ANA A</td><td class="ctrl">RPO</td><td class="new">POP H</td><td class="ctrl">JPO</td><td class="new">XTHL</td><td class="ctrl">CPO</td><td class="new">PUSH H</td><td class="arith">ANI</td><td class="na">RST 4</td></tr>
<tr><th>5</th><td class="arith">XRA B</td><td class="arith">XRA C</td><td class="arith">XRA D</td><td class="arith">XRA E</td><td class="arith">XRA H</td><td class="arith">XRA L</td><td class="arith">XRA M</td><td class="arith">XRA A</td><td class="ctrl">RPE</td><td class="new">PCHL</td><td class="ctrl">JPE</td><td class="new">XCHG</td><td class="ctrl">CPE</td><td> </td><td class="arith">XRI</td><td class="na">RST 5</td></tr>
<tr><th>6</th><td class="arith">ORA B</td><td class="arith">ORA C</td><td class="arith">ORA D</td><td class="arith">ORA E</td><td class="arith">ORA H</td><td class="arith">ORA L</td><td class="arith">ORA M</td><td class="arith">ORA A</td><td class="ctrl">RP</td><td class="new">POP PSW</td><td class="ctrl">JP</td><td class="new">DI</td><td class="ctrl">CP</td><td class="new">PUSH PSW</td><td class="arith">ORI</td><td class="na">RST 6</td></tr>
<tr><th>7</th><td class="arith">CMP B</td><td class="arith">CMP C</td><td class="arith">CMP D</td><td class="arith">CMP E</td><td class="arith">CMP H</td><td class="arith">CMP L</td><td class="arith">CMP M</td><td class="arith">CMP A</td><td class="ctrl">RM</td><td class="new">SPHL</td><td class="ctrl">JM</td><td class="new">EI</td><td class="ctrl">CM</td><td> </td><td class="arith">CPI</td><td class="na">RST 7</td></tr>
</table>
<p>The 8080 also moved the stack to external memory, rather than using an internal fixed special-purpose stack as in the 8008 and
Datapoint 2200.
This allowed <code>PUSH</code> and <code>POP</code> instructions to put register data on the stack.
Interrupt handling was also improved by adding
the Enable Interrupt and Disable Interrupt instructions (<code>EI</code> and <code>DI</code>).<span id="fnref:mos"><a class="ref" href="#fn:mos">14</a></span></p>
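<p>The externalized stack can be sketched as follows (standard 8080 <code>PUSH</code>/<code>POP</code> semantics: the stack lives in ordinary memory and grows downward, with the high byte stored first; the addresses here are made up):</p>

```python
def push(state, hi, lo):
    """8080-style PUSH: store a register pair on the downward-growing stack."""
    state["SP"] = (state["SP"] - 1) & 0xFFFF
    state["mem"][state["SP"]] = hi
    state["SP"] = (state["SP"] - 1) & 0xFFFF
    state["mem"][state["SP"]] = lo


def pop(state):
    """8080-style POP: return (hi, lo) from the top of the stack."""
    lo = state["mem"][state["SP"]]
    state["SP"] = (state["SP"] + 1) & 0xFFFF
    hi = state["mem"][state["SP"]]
    state["SP"] = (state["SP"] + 1) & 0xFFFF
    return hi, lo


state = {"SP": 0x2400, "mem": {}}
push(state, 0xAB, 0xCD)  # SP drops to 0x23FE
```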
<h2>Intel 8085</h2>
<p>The Intel 8085 was designed as a "mid-life kicker" for the 8080, providing incremental improvements while maintaining compatibility.
From the hardware perspective, the 8085 was much easier to use than the 8080.
While the 8080 required three voltages, the 8085 required a single 5-volt power supply (represented by the "5" in the part number).
Moreover, the 8085 eliminated most of the support chips required with the 8080; a working 8085 computer could be built with
just three chips.
Finally, the 8085 provided additional hardware functionality: better interrupt support and serial I/O.</p>
<p><a href="https://static.righto.com/images/8086-datapoint/8085.jpg"><img alt="The Intel 8085, like the 8080 and the 8086, was packaged in a 40-pin DIP. Photo by Thomas Nguyen, (CC BY-SA 4.0)." class="hilite" height="166" src="https://static.righto.com/images/8086-datapoint/8085-w400.jpg" title="The Intel 8085, like the 8080 and the 8086, was packaged in a 40-pin DIP. Photo by Thomas Nguyen, (CC BY-SA 4.0)." width="400" /></a><div class="cite">The Intel 8085, like the 8080 and the 8086, was packaged in a 40-pin DIP. Photo by <a href="https://commons.wikimedia.org/wiki/File:Intel_C8085.jpg">Thomas Nguyen</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0/deed.en">(CC BY-SA 4.0)</a>.</div></p>
<p>On the software side, the 8085 is curious: 12 instructions were added to the instruction set (finally using every opcode), but all but two were hidden
and left undocumented.<span id="fnref:8085-undoc"><a class="ref" href="#fn:8085-undoc">16</a></span>
Moreover, the 8085 added two new condition codes, but these were also hidden.
This situation occurred because the 8086 project started up in 1976, near the release of the 8085 chip.
Intel wanted the 8086 to be compatible (to some extent) with the 8080 and 8085, but providing new instructions in the 8085
would make compatibility harder.
It was too late to remove the instructions from the 8085 chip, so Intel did the next best thing and removed them
from the documentation. These instructions are shown in red in the table below.
Only the new SIM and RIM instructions were documented, since they were necessary to use
the 8085's new interrupt and serial I/O features.</p>
<table class="dp">
<tr><th> </th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th></tr>
<tr><th>0</th><td class="misc">NOP</td><td class="na">LXI B</td><td class="na">STAX B</td><td class="na">INX B</td><td class="na">INR B</td><td class="na">DCR B</td><td class="mov">MVI B</td><td class="na">RLC</td><td class="mov">MOV B,B</td><td class="mov">MOV B,C</td><td class="mov">MOV B,D</td><td class="mov">MOV B,E</td><td class="mov">MOV B,H</td><td class="mov">MOV B,L</td><td class="mov">MOV B,M</td><td class="mov">MOV B,A</td></tr>
<tr><th>1</th><td class="undoc">DSUB</td><td class="na">DAD B</td><td class="na">LDAX B</td><td class="na">DCX B</td><td class="na">INR C</td><td class="na">DCR C</td><td class="mov">MVI C</td><td class="na">RRC</td><td class="mov">MOV C,B</td><td class="mov">MOV C,C</td><td class="mov">MOV C,D</td><td class="mov">MOV C,E</td><td class="mov">MOV C,H</td><td class="mov">MOV C,L</td><td class="mov">MOV C,M</td><td class="mov">MOV C,A</td></tr>
<tr><th>2</th><td class="undoc">ARHL</td><td class="na">LXI D</td><td class="na">STAX D</td><td class="na">INX D</td><td class="na">INR D</td><td class="na">DCR D</td><td class="mov">MVI D</td><td class="na">RAL</td><td class="mov">MOV D,B</td><td class="mov">MOV D,C</td><td class="mov">MOV D,D</td><td class="mov">MOV D,E</td><td class="mov">MOV D,H</td><td class="mov">MOV D,L</td><td class="mov">MOV D,M</td><td class="mov">MOV D,A</td></tr>
<tr><th>3</th><td class="undoc">RDEL</td><td class="na">DAD D</td><td class="na">LDAX D</td><td class="na">DCX D</td><td class="na">INR E</td><td class="na">DCR E</td><td class="mov">MVI E</td><td class="na">RAR</td><td class="mov">MOV E,B</td><td class="mov">MOV E,C</td><td class="mov">MOV E,D</td><td class="mov">MOV E,E</td><td class="mov">MOV E,H</td><td class="mov">MOV E,L</td><td class="mov">MOV E,M</td><td class="mov">MOV E,A</td></tr>
<tr><th>4</th><td class="new">RIM</td><td class="na">LXI H</td><td class="na">SHLD</td><td class="na">INX H</td><td class="na">INR H</td><td class="na">DCR H</td><td class="mov">MVI H</td><td class="na">DAA</td><td class="mov">MOV H,B</td><td class="mov">MOV H,C</td><td class="mov">MOV H,D</td><td class="mov">MOV H,E</td><td class="mov">MOV H,H</td><td class="mov">MOV H,L</td><td class="mov">MOV H,M</td><td class="mov">MOV H,A</td></tr>
<tr><th>5</th><td class="undoc">LDHI</td><td class="na">DAD H</td><td class="na">LHLD</td><td class="na">DCX H</td><td class="na">INR L</td><td class="na">DCR L</td><td class="mov">MVI L</td><td class="na">CMA</td><td class="mov">MOV L,B</td><td class="mov">MOV L,C</td><td class="mov">MOV L,D</td><td class="mov">MOV L,E</td><td class="mov">MOV L,H</td><td class="mov">MOV L,L</td><td class="mov">MOV L,M</td><td class="mov">MOV L,A</td></tr>
<tr><th>6</th><td class="new">SIM</td><td class="na">LXI SP</td><td class="na">STA</td><td class="na">INX SP</td><td class="na">INR M</td><td class="na">DCR M</td><td class="mov">MVI M</td><td class="na">STC</td><td class="mov">MOV M,B</td><td class="mov">MOV M,C</td><td class="mov">MOV M,D</td><td class="mov">MOV M,E</td><td class="mov">MOV M,H</td><td class="mov">MOV M,L</td><td class="misc">HLT</td><td class="mov">MOV M,A</td></tr>
<tr><th>7</th><td class="undoc">LDSI</td><td class="na">DAD SP</td><td class="na">LDA</td><td class="na">DCX SP</td><td class="na">INR A</td><td class="na">DCR A</td><td class="mov">MVI A</td><td class="na">CMC</td><td class="mov">MOV A,B</td><td class="mov">MOV A,C</td><td class="mov">MOV A,D</td><td class="mov">MOV A,E</td><td class="mov">MOV A,H</td><td class="mov">MOV A,L</td><td class="mov">MOV A,M</td><td class="mov">MOV A,A</td></tr>
<tr><th>0</th><td class="arith">ADD B</td><td class="arith">ADD C</td><td class="arith">ADD D</td><td class="arith">ADD E</td><td class="arith">ADD H</td><td class="arith">ADD L</td><td class="arith">ADD M</td><td class="arith">ADD A</td><td class="ctrl">RNZ</td><td class="na">POP B</td><td class="ctrl">JNZ</td><td class="ctrl">JMP</td><td class="ctrl">CNZ</td><td class="na">PUSH B</td><td class="arith">ADI</td><td class="na">RST 0</td></tr>
<tr><th>1</th><td class="arith">ADC B</td><td class="arith">ADC C</td><td class="arith">ADC D</td><td class="arith">ADC E</td><td class="arith">ADC H</td><td class="arith">ADC L</td><td class="arith">ADC M</td><td class="arith">ADC A</td><td class="ctrl">RZ</td><td class="ctrl">RET</td><td class="ctrl">JZ</td><td class="undoc">RSTV</td><td class="ctrl">CZ</td><td class="ctrl">CALL</td><td class="arith">ACI</td><td class="na">RST 1</td></tr>
<tr><th>2</th><td class="arith">SUB B</td><td class="arith">SUB C</td><td class="arith">SUB D</td><td class="arith">SUB E</td><td class="arith">SUB H</td><td class="arith">SUB L</td><td class="arith">SUB M</td><td class="arith">SUB A</td><td class="ctrl">RNC</td><td class="na">POP D</td><td class="ctrl">JNC</td><td class="io">OUT</td><td class="ctrl">CNC</td><td class="na">PUSH D</td><td class="arith">SUI</td><td class="na">RST 2</td></tr>
<tr><th>3</th><td class="arith">SBB B</td><td class="arith">SBB C</td><td class="arith">SBB D</td><td class="arith">SBB E</td><td class="arith">SBB H</td><td class="arith">SBB L</td><td class="arith">SBB M</td><td class="arith">SBB A</td><td class="ctrl">RC</td><td class="undoc">SHLX</td><td class="ctrl">JC</td><td class="io">IN</td><td class="ctrl">CC</td><td class="undoc">JNK</td><td class="arith">SBI</td><td class="na">RST 3</td></tr>
<tr><th>4</th><td class="arith">ANA B</td><td class="arith">ANA C</td><td class="arith">ANA D</td><td class="arith">ANA E</td><td class="arith">ANA H</td><td class="arith">ANA L</td><td class="arith">ANA M</td><td class="arith">ANA A</td><td class="ctrl">RPO</td><td class="na">POP H</td><td class="ctrl">JPO</td><td class="na">XTHL</td><td class="ctrl">CPO</td><td class="na">PUSH H</td><td class="arith">ANI</td><td class="na">RST 4</td></tr>
<tr><th>5</th><td class="arith">XRA B</td><td class="arith">XRA C</td><td class="arith">XRA D</td><td class="arith">XRA E</td><td class="arith">XRA H</td><td class="arith">XRA L</td><td class="arith">XRA M</td><td class="arith">XRA A</td><td class="ctrl">RPE</td><td class="na">PCHL</td><td class="ctrl">JPE</td><td class="na">XCHG</td><td class="ctrl">CPE</td><td class="undoc">LHLX</td><td class="arith">XRI</td><td class="na">RST 5</td></tr>
<tr><th>6</th><td class="arith">ORA B</td><td class="arith">ORA C</td><td class="arith">ORA D</td><td class="arith">ORA E</td><td class="arith">ORA H</td><td class="arith">ORA L</td><td class="arith">ORA M</td><td class="arith">ORA A</td><td class="ctrl">RP</td><td class="na">POP PSW</td><td class="ctrl">JP</td><td class="na">DI</td><td class="ctrl">CP</td><td class="na">PUSH PSW</td><td class="arith">ORI</td><td class="na">RST 6</td></tr>
<tr><th>7</th><td class="arith">CMP B</td><td class="arith">CMP C</td><td class="arith">CMP D</td><td class="arith">CMP E</td><td class="arith">CMP H</td><td class="arith">CMP L</td><td class="arith">CMP M</td><td class="arith">CMP A</td><td class="ctrl">RM</td><td class="na">SPHL</td><td class="ctrl">JM</td><td class="na">EI</td><td class="ctrl">CM</td><td class="undoc">JK</td><td class="arith">CPI</td><td class="na">RST 7</td></tr>
</table>
<h2>Intel 8086</h2>
<p>Following the 8080, Intel intended to revolutionize microprocessors with a 32-bit "micro-mainframe", the iAPX 432.
This extremely complex processor implemented objects, memory management, interprocess communication, and fine-grained memory
protection in hardware.
The iAPX 432 was too ambitious and the project fell behind schedule, leaving Intel vulnerable against
competitors such as Motorola and Zilog.
Intel quickly threw together a 16-bit processor as a stopgap until the iAPX 432 was ready; to show its continuity with the 8-bit processor line,
this processor was called the 8086.
The iAPX 432 ended up being <a href="https://archive.nytimes.com/www.nytimes.com/library/tech/98/04/biztech/articles/05merced.html">one of the great disaster stories of modern computing</a> and quietly disappeared.</p>
<p>The "stopgap" 8086 processor, however, started the x86 architecture that changed the history of Intel.
The 8086's victory was powered by the IBM PC,
designed in 1981 around the Intel 8088, a variant of the 8086 with a cheaper 8-bit bus.
The IBM PC was a rousing success, defining the modern computer and making Intel's fortune.
Intel produced a succession of more powerful chips that extended the 8086: 286, 386, 486, Pentium, and so on,
leading to the current x86 architecture.</p>
<p><a href="https://static.righto.com/images/8086-datapoint/Ibm_pc_5150.jpg"><img alt="The original IBM PC used the Intel 8088 processor, a variant of the 8086 with an 8-bit bus. Photo by Ruben de Rijcke, (CC BY-SA 3.0)." class="hilite" height="391" src="https://static.righto.com/images/8086-datapoint/Ibm_pc_5150-w400.jpg" title="The original IBM PC used the Intel 8088 processor, a variant of the 8086 with an 8-bit bus. Photo by Ruben de Rijcke, (CC BY-SA 3.0)." width="400" /></a><div class="cite">The original IBM PC used the Intel 8088 processor, a variant of the 8086 with an 8-bit bus. Photo by <a href="https://commons.wikimedia.org/wiki/File:Ibm_pc_5150.jpg">Ruben de Rijcke</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0/deed.en">(CC BY-SA 3.0)</a>.</div></p>
<p>The 8086 was a major change from the 8080/8085, jumping from an 8-bit architecture to a 16-bit architecture
and expanding from 64K of memory to 1 megabyte.
Nonetheless, the 8086's architecture is closely related to the 8080.
The designers of the 8086 wanted it to be compatible with the 8080/8085, but the difference was too wide
for binary compatibility or even assembly-language compatibility.
Instead, the 8086 was designed so a program could translate 8080 assembly language to 8086 assembly language.<span id="fnref:conv86"><a class="ref" href="#fn:conv86">17</a></span>
To accomplish this, each 8080 register had a corresponding 8086 register and most
8080 instructions had corresponding 8086 instructions.</p>
<p>The 8086's instruction set was designed with a new concept, the "ModR/M" byte, which usually follows the
opcode byte.
The ModR/M byte specifies the memory addressing mode
and the register (or registers) to use, allowing that information to be moved out of the opcode.
For instance, where the 8080 had a quadrant of 64 opcodes to move from register to register, the 8086 has a
single move instruction, with the ModR/M byte specifying the source and destination registers.
(The move instruction, however, has variants to handle byte vs. word operations, moves to or from memory,
and so forth, so the 8086 ends up with a few move opcodes.)
The ModR/M byte preserves the Datapoint 2200's concept of using the same instruction for memory and register
operations, but allows a memory address to be provided in the instruction.</p>
<p>The 8086 also cleans up some of the historical baggage in the instruction set, freeing up space in the precious
256 opcodes for new instructions.
The conditional call and return instructions were eliminated, while the conditional jumps were expanded.
The 8008's <code>RST</code> (Restart) instructions were eliminated, replaced by interrupt vectors.</p>
<p>The 8086 extended its registers to 16 bits and added several new registers.
An Intel patent (below) shows that the 8086's registers were originally called <code>A</code>, <code>B</code>, <code>C</code>, <code>D</code>, <code>E</code>, <code>H</code>, and <code>L</code>,
matching the Datapoint 2200.
The <code>A</code> register was extended to the 16-bit <code>XA</code> register, while the
<code>BC</code>, <code>DE</code>, and <code>HL</code> registers were used unchanged.
When the 8086 was released, these registers were renamed to <code>AX</code>, <code>CX</code>, <code>DX</code>, and <code>BX</code> respectively.<span id="fnref:bx"><a class="ref" href="#fn:bx">18</a></span>
In particular, the <code>HL</code> register was renamed to <code>BX</code>; this is why <code>BX</code> can specify a memory address in the ModR/M byte, but
<code>AX</code>, <code>CX</code>, and <code>DX</code> can't.</p>
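<p>The register correspondence makes the kind of translation described above almost mechanical. Here's a sketch (in Python, illustrative — not Intel's actual CONV86 tool) of the byte-register mapping that follows from the pairings above, applied to a <code>MOV</code> instruction:</p>

```python
# Register mapping an 8080-to-8086 translator could apply, following the
# correspondence described above (a sketch, not Intel's CONV86).
REG_MAP = {
    "A": "AL",              # accumulator → low byte of AX
    "B": "CH", "C": "CL",   # BC pair → CX
    "D": "DH", "E": "DL",   # DE pair → DX
    "H": "BH", "L": "BL",   # HL pair → BX (HL renamed to BX)
    "M": "[BX]",            # memory via HL → memory addressed through BX
}

def translate_mov(line):
    """Translate an 8080 'MOV dst,src' line into 8086 syntax."""
    op, args = line.split(" ", 1)
    dst, src = args.split(",")
    return f"{op} {REG_MAP[dst]},{REG_MAP[src]}"

print(translate_mov("MOV A,M"))  # → MOV AL,[BX]
```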
<p><a href="https://static.righto.com/images/8086-datapoint/patent-regs.png"><img alt="A patent diagram showing the 8086's registers with their original names. (MP, IJ, and IK are now known as BP, SI, and DI.) From patent US4449184." class="hilite" height="205" src="https://static.righto.com/images/8086-datapoint/patent-regs-w220.png" title="A patent diagram showing the 8086's registers with their original names. (MP, IJ, and IK are now known as BP, SI, and DI.) From patent US4449184." width="220" /></a><div class="cite">A patent diagram showing the 8086's registers with their original names. (MP, IJ, and IK are now known as BP, SI, and DI.) From <a href="https://patents.google.com/patent/US4449184A/en">patent US4449184</a>.</div></p>
<!--
The 8086's instructions have other similarities with the earlier processors.
The instructions have the same octal format, with fields of 3 bits specifying registers or ALU operations.
The eight core ALU operations are the same as the previous processors, going back to the Datapoint 2200.
-->
<p>The table below shows the 8086's instruction set, with "b", "w", and "i" indicating byte (8-bit), word (16-bit), and immediate
instructions.
The Datapoint 2200 instructions (colored) are all still supported.
The number of Datapoint instructions looks small
because the ModR/M byte collapses groups of old opcodes into a single new one.
This opened up space in the opcode table, though, allowing the 8086 to have many new instructions as well as
16-bit instructions.<span id="fnref:8086-instructions"><a class="ref" href="#fn:8086-instructions">19</a></span></p>
<table class="dp">
<tr><th> </th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th></tr>
<tr><th>0</th><td class="arith">ADD b</td><td class="na">ADD w</td><td class="arith">ADD b</td><td class="na">ADD w</td><td class="arith">ADD bi</td><td class="na">ADD wi</td><td class="na">PUSH ES</td><td class="na">POP ES</td><td class="na">INC AX</td><td class="na">INC CX</td><td class="na">INC DX</td><td class="na">INC BX</td><td class="na">INC SP</td><td class="na">INC BP</td><td class="na">INC SI</td><td class="na">INC DI</td></tr>
<tr><th>1</th><td class="arith">OR b</td><td class="na">OR w</td><td class="arith">OR b</td><td class="na">OR w</td><td class="arith">OR bi</td><td class="na">OR wi</td><td class="na">PUSH CS</td><td> </td><td class="na">DEC AX</td><td class="na">DEC CX</td><td class="na">DEC DX</td><td class="na">DEC BX</td><td class="na">DEC SP</td><td class="na">DEC BP</td><td class="na">DEC SI</td><td class="na">DEC DI</td></tr>
<tr><th>2</th><td class="arith">ADC b</td><td class="na">ADC w</td><td class="arith">ADC b</td><td class="na">ADC w</td><td class="arith">ADC bi</td><td class="na">ADC wi</td><td class="na">PUSH SS</td><td class="na">POP SS</td><td class="na">PUSH AX</td><td class="na">PUSH CX</td><td class="na">PUSH DX</td><td class="na">PUSH BX</td><td class="na">PUSH SP</td><td class="na">PUSH BP</td><td class="na">PUSH SI</td><td class="na">PUSH DI</td></tr>
<tr><th>3</th><td class="arith">SBB b</td><td class="na">SBB w</td><td class="arith">SBB b</td><td class="na">SBB w</td><td class="arith">SBB bi</td><td class="na">SBB wi</td><td class="na">PUSH DS</td><td class="na">POP DS</td><td class="na">POP AX</td><td class="na">POP CX</td><td class="na">POP DX</td><td class="na">POP BX</td><td class="na">POP SP</td><td class="na">POP BP</td><td class="na">POP SI</td><td class="na">POP DI</td></tr>
<tr><th>4</th><td class="arith">AND b</td><td class="na">AND w</td><td class="arith">AND b</td><td class="na">AND w</td><td class="arith">AND bi</td><td class="na">AND wi</td><td class="na">ES:</td><td class="na">DAA</td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><th>5</th><td class="arith">SUB b</td><td class="na">SUB w</td><td class="arith">SUB b</td><td class="na">SUB w</td><td class="arith">SUB bi</td><td class="na">SUB wi</td><td class="na">CS:</td><td class="na">DAS</td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><th>6</th><td class="arith">XOR b</td><td class="na">XOR w</td><td class="arith">XOR b</td><td class="na">XOR w</td><td class="arith">XOR bi</td><td class="na">XOR wi</td><td class="na">SS:</td><td class="na">AAA</td><td class="na">JO</td><td class="na">JNO</td><td class="ctrl">JB</td><td class="ctrl">JNB</td><td class="ctrl">JZ</td><td class="ctrl">JNZ</td><td class="na">JBE</td><td class="na">JA</td></tr>
<tr><th>7</th><td class="arith">CMP b</td><td class="na">CMP w</td><td class="arith">CMP b</td><td class="na">CMP w</td><td class="arith">CMP bi</td><td class="na">CMP wi</td><td class="na">DS:</td><td class="na">AAS</td><td class="ctrl">JS</td><td class="ctrl">JNS</td><td class="ctrl">JPE</td><td class="ctrl">JPO</td><td class="na">JL</td><td class="na">JGE</td><td class="na">JLE</td><td class="na">JG</td></tr>
<tr><th>0</th><td class="na">GRP1 b</td><td class="na">GRP1 w</td><td class="na">GRP1 b</td><td class="na">GRP1 w</td><td class="na">TEST b</td><td class="na">TEST w</td><td class="na">XCHG b</td><td class="na">XCHG w</td><td> </td><td> </td><td class="ctrl">RET</td><td class="na">RET</td><td class="na">LES</td><td class="na">LDS</td><td class="mov">MOV b</td><td class="na">MOV w</td></tr>
<tr><th>1</th><td class="mov">MOV b</td><td class="na">MOV w</td><td class="mov">MOV b</td><td class="na">MOV w</td><td class="na">MOV sr</td><td class="na">LEA</td><td class="na">MOV sr</td><td class="na">POP</td><td> </td><td> </td><td class="na">RETF</td><td class="na">RETF</td><td class="na">INT 3</td><td class="na">INT</td><td class="na">INTO</td><td class="na">IRET</td></tr>
<tr><th>2</th><td class="misc">NOP</td><td class="na">XCHG CX</td><td class="na">XCHG DX</td><td class="na">XCHG BX</td><td class="na">XCHG SP</td><td class="na">XCHG BP</td><td class="na">XCHG SI</td><td class="na">XCHG DI</td><td class="arith">Shift b</td><td class="na">Shift w</td><td class="na">Shift b</td><td class="na">Shift w</td><td class="na">AAM</td><td class="na">AAD</td><td> </td><td class="na">XLAT</td></tr>
<tr><th>3</th><td class="na">CBW</td><td class="na">CWD</td><td class="ctrl">CALL</td><td class="na">WAIT</td><td class="na">PUSHF</td><td class="na">POPF</td><td class="na">SAHF</td><td class="na">LAHF</td><td class="na">ESC 0</td><td class="na">ESC 1</td><td class="na">ESC 2</td><td class="na">ESC 3</td><td class="na">ESC 4</td><td class="na">ESC 5</td><td class="na">ESC 6</td><td class="na">ESC 7</td></tr>
<tr><th>4</th><td class="mov">MOV AL,M</td><td class="na">MOV AX,M</td><td class="mov">MOV M,AL</td><td class="na">MOV M,AX</td><td class="na">MOVS b</td><td class="na">MOVS w</td><td class="na">CMPS b</td><td class="na">CMPS w</td><td class="na">LOOPNZ</td><td class="na">LOOPZ</td><td class="na">LOOP</td><td class="na">JCXZ</td><td class="io">IN b</td><td class="na">IN w</td><td class="io">OUT b</td><td class="na">OUT w</td></tr>
<tr><th>5</th><td class="na">TEST b</td><td class="na">TEST w</td><td class="na">STOS b</td><td class="na">STOS w</td><td class="na">LODS b</td><td class="na">LODS w</td><td class="na">SCAS b</td><td class="na">SCAS w</td><td class="na">CALL</td><td class="na">JMP</td><td class="na">JMP</td><td class="na">JMP</td><td class="na">IN b</td><td class="na">IN w</td><td class="na">OUT b DX</td><td class="na">OUT w DX</td></tr>
<tr><th>6</th><td class="mov">MOV AL,i</td><td class="mov">MOV CL,i</td><td class="mov">MOV DL,i</td><td class="mov">MOV BL,i</td><td class="na">MOV AH,i</td><td class="mov">MOV CH,i</td><td class="mov">MOV DH,i</td><td class="mov">MOV BH,i</td><td class="na">LOCK</td><td> </td><td class="na">REPNZ</td><td class="na">REPZ</td><td class="misc">HLT</td><td class="na">CMC</td><td class="na">GRP3a</td><td class="na">GRP3b</td></tr>
<tr><th>7</th><td class="na">MOV AX,i</td><td class="na">MOV CX,i</td><td class="na">MOV DX,i</td><td class="na">MOV BX,i</td><td class="na">MOV SP,i</td><td class="na">MOV BP,i</td><td class="na">MOV SI,i</td><td class="na">MOV DI,i</td><td class="na">CLC</td><td class="na">STC</td><td class="na">CLI</td><td class="na">STI</td><td class="na">CLD</td><td class="na">STD</td><td class="na">GRP4</td><td class="na">GRP5</td></tr>
</table>
<p>The 8086 has a 16-bit flags register, shown below, but the low byte remained compatible with the 8080.
The four highlighted flags (sign, zero, parity, and carry) are the ones originating in the Datapoint 2200.</p>
<p><a href="https://static.righto.com/images/8086-datapoint/8086-flags.jpg"><img alt="The flag word of the 8086 contains the original Datapoint 2200 flags." class="hilite" height="66" src="https://static.righto.com/images/8086-datapoint/8086-flags-w700.jpg" title="The flag word of the 8086 contains the original Datapoint 2200 flags." width="700" /></a><div class="cite">The flag word of the 8086 contains the original Datapoint 2200 flags.</div></p>
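<p>That low-byte compatibility can be spelled out as bit positions. The sketch below (Python, illustrative) uses the flag bit layout shared by the 8080 flag byte and the low byte of the 8086 FLAGS register:</p>

```python
# Flag bit positions in the low byte of the 8086 FLAGS register; these match
# the 8080's flag byte, which is how the low byte stayed compatible.
CF, PF, AF, ZF, SF = 0, 2, 4, 6, 7  # carry, parity, auxiliary carry, zero, sign

def flags_set(flags):
    """Return the names of the flags set in an 8-bit flag value."""
    names = {"C": CF, "P": PF, "A": AF, "Z": ZF, "S": SF}
    return [n for n, bit in names.items() if (flags >> bit) & 1]

print(flags_set(0b10000001))  # → ['C', 'S']
```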
<h2>Modern x86 and x86-64</h2>
<p>The modern x86 architecture has extended the 8086 to a 32-bit architecture (IA-32) and a 64-bit architecture (x86-64<span id="fnref:ia64"><a class="ref" href="#fn:ia64">20</a></span>), but the Datapoint features remain.
At startup, an x86 processor runs in "<a href="https://en.wikipedia.org/wiki/Real_mode">real mode</a>", which operates
like the original 8086.
More interesting is 64-bit mode, which has some major architectural changes.
In 64-bit mode, the 8086's general-purpose registers are extended to sixteen 64-bit registers
(and <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html">soon</a> to be 32 registers).
However, the original Datapoint registers are special and can still be accessed as byte registers within the
corresponding 64-bit register; these are highlighted in the table below.<span id="fnref:aliasing"><a class="ref" href="#fn:aliasing">21</a></span></p>
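<p>One way to picture this aliasing is as byte-wide views into a single 64-bit value. The sketch below (Python, illustrative) models the key property: writing the low byte (e.g. <code>AL</code> within <code>RAX</code>) leaves the other 56 bits untouched.</p>

```python
# Model of x86-64 byte-register aliasing: AL is bits 0-7 of RAX, and an
# 8-bit write leaves the rest of the 64-bit register unchanged.
class Reg64:
    MASK64 = (1 << 64) - 1

    def __init__(self, value=0):
        self.value = value & self.MASK64

    def read_low_byte(self):      # e.g. reading AL out of RAX
        return self.value & 0xFF

    def write_low_byte(self, b):  # e.g. MOV AL, imm8: upper 56 bits survive
        self.value = (self.value & ~0xFF) | (b & 0xFF)

rax = Reg64(0x1122334455667788)
rax.write_low_byte(0xAB)
print(hex(rax.value))  # → 0x11223344556677ab
```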
<p><a href="https://static.righto.com/images/8086-datapoint/64-bit-regs.jpg"><img alt="General purpose registers in x86-64. From Intel Software Developer's Manual." class="hilite" height="145" src="https://static.righto.com/images/8086-datapoint/64-bit-regs-w700.jpg" title="General purpose registers in x86-64. From Intel Software Developer's Manual." width="700" /></a><div class="cite">General purpose registers in x86-64. From <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html">Intel Software Developer's Manual</a>.</div></p>
<p>The flag register of the 8086 was extended to 32 bits or 64 bits in x86. As the diagram below shows,
the original Datapoint 2200 status flags are still there (highlighted in yellow).</p>
<p><a href="https://static.righto.com/images/8086-datapoint/64-bit-flags.jpg"><img alt="The 32-bit and 64-bit flags of x86 contain the original Datapoint 2200 registers. From Intel Software Developer's Manual." class="hilite" height="335" src="https://static.righto.com/images/8086-datapoint/64-bit-flags-w500.jpg" title="The 32-bit and 64-bit flags of x86 contain the original Datapoint 2200 registers. From Intel Software Developer's Manual." width="500" /></a><div class="cite">The 32-bit and 64-bit flags of x86 contain the original Datapoint 2200 registers. From <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html">Intel Software Developer's Manual</a>.</div></p>
<p>The instruction set in x86 has been extended from the 8086, mostly through prefixes, but
the instructions from the Datapoint 2200 are still there.
The ModR/M byte was changed in 32-bit mode so the <code>BX</code> (originally <code>HL</code>) register is no longer special
when accessing memory (although it's still special with 16-bit addressing, until Intel removes that in the
upcoming <a href="https://en.wikipedia.org/wiki/X86-64#x86-S">x86-S</a> simplification).
I/O ports still exist in x86, although they are viewed as more of a legacy feature: modern I/O devices
typically use memory-mapped I/O instead of I/O ports.
To summarize, fifty years later, x86-64 is slowly moving away from some of the Datapoint 2200 features, but they are still there.</p>
<h2>Conclusions</h2>
<p>The modern x86 architecture is descended from the Datapoint 2200's architecture.
Because there is backward-compatibility at each step, you should theoretically be able to take a Datapoint 2200 binary, disassemble it to 8008 assembly, automatically translate it
to 8080 assembly, automatically convert it to 8086 assembly, and then run it on a modern x86 processor.
(The I/O devices would be different and cause trouble, of course.)</p>
<p>The Datapoint 2200's complete instruction set, its flags, and its little-endian architecture have persisted into current
processors.
This shows the critical importance of backward compatibility to customers.
While Intel keeps attempting to create new architectures (iAPX 432, i960, i860, Itanium), customers would rather
stay on a compatible architecture.
Remarkably, Intel has managed to move from 8-bit computers to 16, 32, and 64 bits, while keeping systems mostly compatible.
As a result, design decisions made for the Datapoint 2200 over 50 years ago are still impacting modern computers.
Will processors still have the features of the Datapoint 2200 another fifty years from now? I wouldn't be surprised.<span id="fnref:refs"><a class="ref" href="#fn:refs">22</a></span></p>
<!--
The x86 architecture has changed dramatically over the years, expanded to 32-bit and now 64-bit operation.
The architecture provides modes for compatibility with the 8086, including the features derived from the Datapoint 2200.
In 64-bit mode, some features have been generalized; for instance, the BX register (formerly HL register) is no longer
special for memory accesses.
Although the registers are now 64 bits, the original Datapoint 2200 8-bit registers are still accessible as pieces
of the larger registers.
-->
<p>Thanks to Joe Oberhauser for suggesting this topic.
I plan to write more on the 8086, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I've also started experimenting with Mastodon recently as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>
so you can follow me there too.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:shift-register">
<p>Shift-register memory was also used in the <a href="https://en.wikipedia.org/wiki/TV_Typewriter">TV Typewriter</a> (1973)
and the display storage of the Apple I (1976).
However, dynamic RAM (DRAM) rapidly dropped in price, making shift-register memory obsolete by the mid-1970s.
(I wrote about the Intel 1405 shift register memory in detail in <a href="https://www.righto.com/2014/12/inside-intel-1405-die-photos-of-shift.html">this article</a>.) <a class="footnote-backref" href="#fnref:shift-register" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:registers">
<p>For comparison, the popular PDP-8 minicomputer had just two main registers: the accumulator and a multiplier-quotient register; instructions typically operated on the accumulator and a memory location.
The Data General Nova, a minicomputer released in 1969, had four accumulator / index registers.
Mainframes generally had many more registers; the IBM System/360 (1964), for instance, had 16 general registers and
four floating-point registers. <a class="footnote-backref" href="#fnref:registers" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:decoder">
<p>On the hardware side, instructions were decoded with BCD-to-decimal decoder chips (type <a href="https://www.ti.com/lit/ds/sdls109/sdls109.pdf">7442</a>).
These decoders normally decoded a 4-bit BCD value into one of 10 output lines.
In the Datapoint 2200, they decoded a 3-bit value into one of 8 output lines, and the other two lines were ignored.
This allowed the high-bit line to be used as a selection line; if it was set, none of the 8 outputs would be active. <a class="footnote-backref" href="#fnref:decoder" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:octal">
<p>These bit patterns map cleanly onto octal, so the opcodes are clearest when specified in octal.
This octal structure has persisted in Intel processors including modern x86 processors.
Unfortunately, Intel invariably specifies the opcodes in hexadecimal rather than octal, which obscures
the underlying structure.
This structure is described in detail in <a href="https://gist.github.com/seanjensengrey/f971c20d05d4d0efc0781f2f3c0353da">The 80x86 is an Octal Machine</a>. <a class="footnote-backref" href="#fnref:octal" title="Jump back to footnote 4 in the text">↩</a></p>
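<p>As a small illustration of that octal structure, here's a sketch (Python, illustrative) that decodes an 8080 <code>MOV</code> opcode directly from its octal digits:</p>

```python
# 8080 MOV instructions have the octal form 1DS: the top digit 1 selects the
# quadrant, the middle digit is the destination, and the low digit the source.
REGS = ["B", "C", "D", "E", "H", "L", "M", "A"]

def decode_mov(opcode):
    assert (opcode >> 6) == 0o1, "not in the MOV quadrant"
    return f"MOV {REGS[(opcode >> 3) & 0o7]},{REGS[opcode & 0o7]}"

print(decode_mov(0o170))  # → MOV A,B  (0o170 is 0x78 in hex)
```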
</li>
<li id="fn:hl">
<p>It is unusual for an instruction set to require memory addresses to be loaded into
a register in order to access memory.
This technique was common in microcode, where memory addresses were loaded into the Memory Address Register (<code>MAR</code>).
As <a href="https://news.ycombinator.com/reply?id=37102930">pwg</a> pointed out, the CDC mainframes (e.g. 6600) had special address registers; when you changed an address register, the specified memory location was automatically read into, or written from, the corresponding operand register.
</p>
<p>At first, I thought that serial memory might motivate the use of an address register, but I don't think there's a connection.
Most likely, the Datapoint 2200 used these techniques to create a simple, orthogonal instruction set that was easy to
decode, and they weren't particularly concerned with performance. <a class="footnote-backref" href="#fnref:hl" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:octaltable">
<p>The instruction tables in this article are different from most articles, because I use octal instead of hexadecimal.
(Displaying an octal-based instruction in a hexadecimal table obscures much of the underlying structure.)
To display the table in octal, I break it into four quadrants based on the top octal digit of a three-digit opcode: 0, 1, 2, or 3.
The digit 0-7 along the left is the middle octal digit and the digit along the top is the low octal digit. <a class="footnote-backref" href="#fnref:octaltable" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:nop">
<p>The regular pattern of Load instructions is broken by the <code>NOP</code> and <code>HALT</code> instructions.
All the register-to-register load instructions along the diagonal accomplish nothing since they move a
register to itself, but only the first one is explicitly called <code>NOP</code>.
Moving a memory location to itself doesn't make sense, so its opcode is assigned the <code>HALT</code> instruction.
Note that the all-0's opcode and the all-1's opcode are both <code>HALT</code> instructions.
This is useful since it can stop execution if the program tries executing uninitialized memory. <a class="footnote-backref" href="#fnref:nop" title="Jump back to footnote 7 in the text">↩</a></p>
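As a sketch in Python (assuming the 8008-style load encoding <code>11 DDD SSS</code> with register codes A=0 through M=7, as documented for the 8008):

```python
# Datapoint 2200 / 8008-style load instruction: 11 DDD SSS.
REGS = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'H': 5, 'L': 6, 'M': 7}

def load_opcode(dst, src):
    return 0o300 | (REGS[dst] << 3) | REGS[src]

# The first diagonal entry (load A into A) is the documented NOP,
# and the memory-to-memory case is the all-1's opcode, assigned to HALT.
assert load_opcode('A', 'A') == 0o300
assert load_opcode('M', 'M') == 0o377
```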
</li>
<li id="fn:alu">
<p>You might think that Datapoint and Intel used the same ALU operations simply because they are the obvious
set of 8 operations.
However, if you look at other processors around that time, they use a wide variety of ALU operations.
Similarly, the status flags in the Datapoint 2200 aren't the obvious set; systems with
four flags typically used Sign, Carry, Zero, and Overflow (not Parity).
Parity is surprisingly expensive to implement on a standard processor, but (as Philip Freidin pointed out) parity is cheap on a serial processor like the Datapoint 2200.
Intel processors didn't provide an Overflow flag until the 8086; even the 8080 didn't have it although
the Motorola 6800 and MOS 6502 did.
The 8085 implemented an overflow flag (<code>V</code>) but it was left undocumented. <a class="footnote-backref" href="#fnref:alu" title="Jump back to footnote 8 in the text">↩</a></p>
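To sketch why parity is nearly free on a bit-serial machine: the result streams past one bit at a time, so a single extra flip-flop toggled by each 1 bit accumulates the parity as a side effect. A Python model of that idea (not the 2200's actual circuit):

```python
def serial_parity(value, width=8):
    """Model of a serial parity flag: one flip-flop toggles on each 1 bit
    as the result streams past, least-significant bit first."""
    flag = 0
    for i in range(width):
        bit = (value >> i) & 1
        flag ^= bit  # a single XOR gate and flip-flop in hardware
    return flag  # 1 if the value has an odd number of 1 bits

assert serial_parity(0b10110100) == 0  # four 1 bits: even parity
assert serial_parity(0b10110101) == 1  # five 1 bits: odd parity
```

On a parallel processor, by contrast, computing parity requires a tree of XOR gates across all eight result bits at once, which is why the flag is comparatively expensive there.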
</li>
<li id="fn:risc">
<p>You might wonder if the Datapoint 2200 (and 8008) could be considered RISC processors since they have simple, easy-to-decode instruction sets.
I think it is a mistake to try to wedge every processor into the RISC or CISC categories (Reduced Instruction Set Computer or Complex Instruction Set Computer).
In particular, the Datapoint 2200 wasn't designed with the RISC philosophy (make a processor more powerful by simplifying the instruction set), its instruction set architecture is very different from RISC chips, and its implementation is different from RISC chips.
Similarly, it wasn't designed with a CISC philosophy (make a processor more powerful by narrowing the semantic gap with high-level languages) and it doesn't look like a CISC chip.</p>
<p>So where does that leave the Datapoint 2200?
In "<a href="https://www.researchgate.net/publication/259577568_RISC_Back_to_the_future">RISC: Back to the future?</a>", famed computer architect Gordon Bell uses the term MISC (Minimal Instruction Set Computer)
to describe the architecture of simple, early computers and microprocessors
such as the Manchester Mark I (1948), the PDP-8 minicomputer (1966), and the Intel 4004 (1971).
Computer architecture evolved from these early hardwired "simple computers" to microprogrammed processors,
processors with cache, and hardwired, pipelined processors.
"Minimal Instruction Set Computer" seems like a good description of the Datapoint 2200, since it is
about the smallest, simplest processor that could get the job done. <a class="footnote-backref" href="#fnref:risc" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:4004">
<p>Many people think that the Intel 8008 is an extension of the 4-bit Intel 4004 processor, but they are completely unrelated aside
from the part numbers.
The Intel 4004 is a 4-bit processor designed to implement a calculator for a company called Busicom.
Its architecture is completely different from the 8008. In particular, the 4004 is a "Harvard architecture" system, with
data storage and instruction storage completely separate.
The 4004 also has a fairly strange instruction set, designed for calculators.
For instance, it has a special instruction to convert a keyboard scan code to binary.
The 4004 team and the 8008 team at Intel had many people in common, however, so the two chips have physical layouts (floorplans) that are
very similar. <a class="footnote-backref" href="#fnref:4004" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:version2">
<p>In this article, I'm focusing on the Datapoint 2200 Version I.
Any time I refer to the Datapoint 2200, I mean Version I specifically.
The Version II has an expanded instruction set, but it was expanded in an entirely different direction
from the Intel 8080, so it's not relevant to this post.
The <a href="https://bitsavers.org/pdf/datapoint/2200/2200_Reference_Manual.pdf">Version II</a> is interesting, however,
since it provides a perspective of how the Intel 8080 could have developed in an "alternate universe". <a class="footnote-backref" href="#fnref:version2" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:faggin">
<p>Federico Faggin wrote <a href="https://archive.org/details/byte-magazine-1992-03/page/n169/mode/2up">The Birth of the Microprocessor</a>
in Byte Magazine, March 1992.
This article describes in some detail the creation of the 8008 and 8080.</p>
<p>The <a href="http://archive.computerhistory.org/resources/text/Oral_History/Intel_8080/102658123.05.01.pdf">Oral History of the 8080</a>
discusses many of the problems with the 8008 and how the 8080 addressed them. (See page 4.)
Masatoshi Shima, one of the architects of the 4004, described five problems with the 8008:
It was slow because it used two clock cycles per state. It had no general-purpose stack and was weak with interrupts.
It had limited memory and I/O space.
The instruction set was primitive, with only 8-bit data, limited addressing, and a single address pointer register.
Finally, the system bus required a lot of interface circuitry. (See page 7.) <a class="footnote-backref" href="#fnref:faggin" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
<li id="fn:usable">
<p>The 8080 is often said to be the "first truly usable microprocessor".
Supposedly the source of this quote is <a href="https://www.computerworld.com/article/2532590/forgotten-pc-history-the-true-origins-of-the-personal-computer.html">Forgotten PC history</a>, but the statement doesn't appear there.
I haven't been able to find the original source of this statement, so let me know.
In any case, I don't think that statement is particularly accurate, as the Motorola 6800 was "truly usable" and
came out before the Intel 8080.</p>
<p>The 8080 was first in one important way, though: it was Intel's first microprocessor that was designed with feedback
from customers.
Both the 4004 and the 8008 were custom chips for a single company.
The 8080, however, was based on extensive customer feedback about the flaws in the 8008 and what features customers
wanted. The <a href="http://archive.computerhistory.org/resources/text/Oral_History/Intel_8080/102658123.05.01.pdf#page=20">8080 oral history</a> discusses this in more detail. <a class="footnote-backref" href="#fnref:usable" title="Jump back to footnote 13 in the text">↩</a></p>
</li>
<li id="fn:mos">
<p>The 8008 was built with PMOS circuitry, while the 8080 was built with NMOS.
This may seem like a trivial difference, but NMOS provided much superior performance.
NMOS became the standard microprocessor technology until the rise of CMOS in the 1980s, which combined
NMOS and PMOS transistors to dramatically reduce power consumption.</p>
<p>Another key hardware improvement was that the 8080 used a 40-pin package, compared to the 18-pin package of the 8008.
Intel had long followed the "religion" of small 16-pin packages, and only reluctantly moved to
18 pins (as in the 8008).
However, by the time the 8080 was introduced, Intel recognized the utility of industry-standard 40-pin packages.
The additional pins made the 8080 much easier to interface to a system.
Moreover, the 8080's 16-bit address bus supported four times the memory of the 8008's 14-bit address bus.
(The 40-pin package was still small for the time; some companies
used 50-pin or 64-pin packages for microprocessors.) <a class="footnote-backref" href="#fnref:mos" title="Jump back to footnote 14 in the text">↩</a></p>
</li>
<li id="fn:8080-shift">
<p>The 8080 is not binary-compatible with the 8008 because almost all the instructions were shifted to different opcodes.
One important but subtle change was that the 8 register/memory codes were reordered to start with <code>B</code> instead of <code>A</code>.
The motivation is that this gave registers in a 16-bit register pair (<code>BC</code>, <code>DE</code>, or <code>HL</code>) codes that differ only
in the low bit.
This makes it easier to specify a register pair with a two-bit code. <a class="footnote-backref" href="#fnref:8080-shift" title="Jump back to footnote 15 in the text">↩</a></p>
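The pairing falls out of a simple shift, as this Python sketch shows (register codes from the 8080 datasheet):

```python
# 8080 register codes after the reordering: B=0, C=1, D=2, E=3, H=4, L=5, M=6, A=7.
REG_CODE = {'B': 0, 'C': 1, 'D': 2, 'E': 3, 'H': 4, 'L': 5, 'M': 6, 'A': 7}

def pair_code(reg):
    # The two registers of a pair differ only in the low bit,
    # so dropping that bit yields the two-bit pair code.
    return REG_CODE[reg] >> 1

assert pair_code('B') == pair_code('C') == 0  # BC
assert pair_code('D') == pair_code('E') == 1  # DE
assert pair_code('H') == pair_code('L') == 2  # HL
```

With the original A-first ordering, the members of a pair would not have shared a common two-bit prefix, so instructions like <code>PUSH</code> and <code>INX</code> could not have encoded a pair so cheaply.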
</li>
<li id="fn:8085-undoc">
<p>Stan Mazor (one of the creators of the 4004 and 8080) <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5430762">explained</a> that
the 8085 removed 10 of the 12 new instructions because "they would burden the 8086 instruction set."
Because the decision came near the 8085's release, they would "leave all 12 instructions on the already designed 8085
CPU chip, but document and announce only two of them" since modifying a CPU is hard but modifying a CPU's paper
reference manual is easy.</p>
<p>Several of the Intel 8086 engineers provided a similar explanation in <a href="https://stevemorse.org/8086history/8086history.pdf">Intel Microprocessors: 8008 to 8086</a>:
While the 8085 provided the new <code>RIM</code> and <code>SIM</code> instructions,
"several other instructions that had been contemplated were not made available
because of the software ramifications and the compatibility constraints they would place
on the forthcoming 8086."</p>
<p>For more information on the 8085's undocumented instructions, see
<a href="https://cdn.hackaday.io/files/1766537557921952/UnDoc8085Instructions.pdf">Unspecified 8085 op codes enhance programming</a>.
The two new condition flags were <code>V</code> (2's complement overflow) and <code>X5</code> (underflow on decrement or overflow on increment).
The opcodes were <code>DSUB</code> (double (i.e. 16-bit) subtraction),
<code>ARHL</code> (arithmetic shift right of <code>HL</code>),
<code>RDEL</code> (rotate <code>DE</code> left through carry),
<code>LDHI</code> (load <code>DE</code> with <code>HL</code> plus an immediate byte),
<code>LDSI</code> (load DE with <code>SP</code> plus an immediate byte),
<code>RSTV</code> (restart on overflow),
<code>LHLX</code> (load <code>HL</code> indirect through <code>DE</code>),
<code>SHLX</code> (store <code>HL</code> indirect through <code>DE</code>),
<code>JX5</code> (jump on <code>X5</code>),
and <code>JNX5</code> (jump on not <code>X5</code>). <a class="footnote-backref" href="#fnref:8085-undoc" title="Jump back to footnote 16 in the text">↩</a></p>
</li>
<li id="fn:conv86">
<p>Conversion from 8080 assembly code to 8086 assembly code was performed with a tool called
<a href="http://www.bitsavers.org/pdf/intel/ISIS_II/9800642-02_MCS-86_Assembly_Language_Converter_Feb80.pdf">CONV86</a>.
Each line of 8080 assembly code was converted to the corresponding line (or sometimes a few lines) of 8086
assembly code.
The program wasn't perfect, so it was expected that the user would need to do some manual editing.
In particular, CONV86 couldn't handle self-modifying code, where the program changed its own instructions.
(Nowadays, self-modifying code is almost never used, but it was more common in the 1970s in order to make
code smaller and get more performance.)
CONV86 also didn't handle the 8085's <code>RIM</code> and <code>SIM</code> instructions, recommending a rewrite if code used these
instructions heavily.</p>
<p>Writing programs in 8086 assembly code manually was better, of course, since the program could take advantage
of the 8086's new features. Moreover, a program converted by CONV86 might be 25% larger, due to the 8086's
use of two-byte instructions and inefficiencies in the conversion. <a class="footnote-backref" href="#fnref:conv86" title="Jump back to footnote 17 in the text">↩</a></p>
</li>
<li id="fn:bx">
<p>This renaming is why the instruction set has the registers in the order <code>AX</code>, <code>CX</code>, <code>DX</code>, <code>BX</code>, rather than in alphabetical
order as you might expect.
The other factor is that Intel decided that <code>AX</code>, <code>BX</code>, <code>CX</code>, and <code>DX</code> corresponded to Accumulator, Base, Count, and Data,
so they couldn't assign the names arbitrarily. <a class="footnote-backref" href="#fnref:bx" title="Jump back to footnote 18 in the text">↩</a></p>
</li>
<li id="fn:8086-instructions">
<p>A few notes on how the 8086's instructions relate to the earlier machines, since the ModR/M byte and 8- vs. 16-bit
instructions make things a bit confusing.
For an instruction like <code>ADD</code>, I have three 8-bit opcodes highlighted: an add <em>to</em> memory/register, an add <em>from</em> memory/register,
and an immediate add.
The neighboring unhighlighted opcodes are the corresponding 16-bit versions.
Likewise, for <code>MOV</code>, I have highlighted the 8-bit moves to/from a register/memory. <a class="footnote-backref" href="#fnref:8086-instructions" title="Jump back to footnote 19 in the text">↩</a></p>
</li>
<li id="fn:ia64">
<p>Since the x86's 32-bit architecture is called IA-32, you might expect that IA-64 would be the 64-bit architecture.
Instead, IA-64 is the completely different architecture used in the ill-fated Itanium.
IA-64 was supposed to replace IA-32, despite being completely incompatible.
Since AMD was cut out of IA-64, AMD developed their own 64-bit extension of the existing x86 architecture and called it AMD64.
Customers flocked to this architecture while the Itanium languished.
Intel reluctantly copied the AMD64 architecture, calling it Intel 64. <a class="footnote-backref" href="#fnref:ia64" title="Jump back to footnote 20 in the text">↩</a></p>
</li>
<li id="fn:aliasing">
<p>The x86 architecture allows byte access to certain parts of the larger registers (accessing <code>AL</code>, <code>AH</code>, etc.) as well as word and larger accesses.
These partial-width reads and writes to registers make the implementation of the processor harder due to
register renaming.
The problem is that writing to part of a register means that the register's value is a combination of the old
and new values.
The Register Alias Table in the P6 architecture deals with this by adding a size field to each entry.
If you write a short value and then read a longer value, the pipeline stalls to figure out the right value.
Moreover, some 16-bit code uses the two 8-bit parts of a register as independent registers.
To support this, the Register Alias Table keeps separate entries for the high and low byte.
(For details, see the book <a href="https://amzn.to/3QBIjrr">Modern Processor Design</a>, in particular the
chapter on Intel's P6 Microarchitecture.)
The point of this is that obscure features of the Datapoint 2200 (such as <code>H</code> and <code>L</code> acting as a combined
register) can cause implementation difficulties 50 years later. <a class="footnote-backref" href="#fnref:aliasing" title="Jump back to footnote 21 in the text">↩</a></p>
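A toy Python model (not Intel's implementation) of why partial writes are troublesome: after writing <code>AL</code>, the value of <code>AX</code> depends on both the old and the new data, which is what forces the renaming hardware to track operand sizes:

```python
class X86Regs:
    """Toy model of x86 partial-register aliasing: AL and AH are
    the low and high bytes of the 16-bit AX register."""
    def __init__(self):
        self.ax = 0

    def write_ax(self, v): self.ax = v & 0xFFFF
    def write_al(self, v): self.ax = (self.ax & 0xFF00) | (v & 0xFF)
    def write_ah(self, v): self.ax = (self.ax & 0x00FF) | ((v & 0xFF) << 8)
    def read_ax(self): return self.ax

r = X86Regs()
r.write_ax(0x1234)
r.write_al(0x99)
# A later 16-bit read must merge the old high byte with the new low byte;
# in an out-of-order core those two bytes may live in different
# physical registers, so the pipeline must stall or merge them.
assert r.read_ax() == 0x1299
```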
</li>
<li id="fn:refs">
<p>Some miscellaneous references:
For a detailed history of the Datapoint 2200, see <a href="https://amzn.to/43Rckq6">Datapoint: The Lost Story of the Texans Who Invented the Personal Computer Revolution</a>.
The <a href="https://archive.computerhistory.org/resources/access/text/2012/07/102657982-05-01-acc.pdf">8008 oral history</a> provides
a lot of interesting information on the development of the 8008.
For another look at the Datapoint 2200 and instruction sets, see
<a href="https://bread80.com/2022/07/09/comparing-datapoint-2200-8008-8080-and-z80-instruction-sets/">Comparing Datapoint 2200, 8008, 8080 and Z80 Instruction Sets</a>. <a class="footnote-backref" href="#fnref:refs" title="Jump back to footnote 22 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com15tag:blogger.com,1999:blog-6264947694886887540.post-21954353119346563382023-08-05T09:39:00.000-07:002023-08-05T09:39:41.415-07:00A close look at the 8086 processor's bus hold circuitry<style>
.hilite {cursor:zoom-in}
a:link img.hilite, a:visited img.hilite {color: #fff;}
a:hover img.hilite {color: #f66;}
</style>
<p>The Intel 8086 microprocessor (1978) revolutionized computing by founding the x86 architecture that continues to this day.
One of the lesser-known features of the 8086 is the "hold" functionality, which
allows an external device to temporarily take control of the system's bus.
This feature was most important for supporting the 8087 math coprocessor chip, which was an option on the IBM PC;
the 8087 used the bus hold so it could interact with the system without conflicting with the 8086 processor.</p>
<p>This blog post explains in detail how the bus hold feature is implemented in the processor's logic.
(Be warned that this post is a detailed look at a somewhat obscure feature.)
I've also found some apparently undocumented characteristics of the 8086's hold acknowledge circuitry, designed
to make signal transitions faster on the shared control lines.</p>
<p>The die photo below shows the main functional blocks of the 8086 processor.
In this image, the metal layer on top of the chip is visible, while the silicon and polysilicon underneath are obscured.
The 8086 is partitioned into a Bus Interface Unit (upper) that handles bus traffic, and an Execution Unit (lower) that
executes instructions.
The two units operate mostly independently, which will turn out to be important. The Bus Interface Unit handles read and write operations as requested
by the Execution Unit.
The Bus Interface Unit also prefetches instructions that the Execution Unit uses when it needs them.
The hold control circuitry is highlighted in the upper right; it takes a
nontrivial amount of space on the chip.
The square pads around the edge of the die are connected by tiny bond wires to the chip's 40 external pins.
I've labeled the <code>MN/MX</code>, <code>HOLD</code>, and <code>HLDA</code> pads; these are the relevant signals for this post.</p>
<p><a href="https://static.righto.com/images/8086-hold/die-labeled.jpg"><img alt="The 8086 die under the microscope, with the main functional blocks and relevant pins labeled. Click this image (or any other) for a larger version." class="hilite" height="670" src="https://static.righto.com/images/8086-hold/die-labeled-w700.jpg" title="The 8086 die under the microscope, with the main functional blocks and relevant pins labeled. Click this image (or any other) for a larger version." width="700" /></a><div class="cite">The 8086 die under the microscope, with the main functional blocks and relevant pins labeled. Click this image (or any other) for a larger version.</div></p>
<h2>How bus hold works</h2>
<p>In an 8086 system, the processor communicates with memory and I/O devices over a bus consisting of address and data lines along with
various control signals.
For high-speed data transfer, it is useful for an I/O device to send data directly to memory, bypassing the processor; this is called DMA (Direct Memory Access).
Moreover, a co-processor such as the 8087 floating point unit may need to read data from memory.
The bus hold feature supports these operations: it is a mechanism for the 8086 to give up control of the bus, letting another
device use the bus to communicate with memory.
Specifically, an external device requests a bus hold; the 8086 stops driving signals onto the bus and acknowledges
the request.
The other device can now use the bus. When the other device is done, it signals the 8086, which then resumes its regular bus
activity.</p>
<p>Most things in the 8086 are more complicated than you might expect, and the bus hold feature is no exception, largely
due to the 8086's minimum and maximum modes.
The 8086 can be designed into a system in one of two ways—minimum mode and maximum mode—that redefine the meanings of the 8086's external pins.
Minimum mode is designed for simple systems and gives the control pins straightforward meanings such as
indicating a read versus a write.
Minimum mode provides bus signals similar to those of the earlier 8080 microprocessor, making migration to the 8086 easier.
On the other hand, maximum mode is designed for sophisticated, multiprocessor systems and encodes the control signals to provide
richer system information.</p>
<p>In more detail, minimum mode is selected if the <code>MN/MX</code> pin is wired high, while maximum mode is selected if the <code>MN/MX</code> pin
is wired low.
Nine of the chip's pins have different meanings depending on the mode, but only two pins are relevant to
this discussion.
In minimum mode, pin 31 has the function <code>HOLD</code>, while pin 30 has the function <code>HLDA</code> (Hold Acknowledge).
In maximum mode, pin 31 has the function <code>RQ/GT0'</code>, while pin 30 has the function <code>RQ/GT1'</code>.</p>
<p>I'll start by explaining how a hold operation works in minimum mode.
When an external device wants to use the bus, it pulls the <code>HOLD</code> pin high.
At the end of the current bus cycle, the 8086 acknowledges the hold request by pulling <code>HLDA</code> high.
The 8086 also puts its bus output pins into "tri-state" mode, in effect disconnecting them electrically from the bus.
When the external device is done, it pulls <code>HOLD</code> low and the 8086 regains control of the bus.
Don't worry about the details of the timing below; the key point is that a device pulls <code>HOLD</code> high and the
8086 responds by pulling <code>HLDA</code> high.</p>
<p><a href="https://static.righto.com/images/8086-hold/hold.png"><img alt="This diagram shows the HOLD/HLDA sequence. From iAPX 86,88 User's Manual, Figure 4-14." class="hilite" height="154" src="https://static.righto.com/images/8086-hold/hold-w400.png" title="This diagram shows the HOLD/HLDA sequence. From iAPX 86,88 User's Manual, Figure 4-14." width="400" /></a><div class="cite">This diagram shows the HOLD/HLDA sequence. From <a href="http://www.bitsavers.org/components/intel/_dataBooks/1981_iAPX_86_88_Users_Manual.pdf">iAPX 86,88 User's Manual</a>, Figure 4-14.</div></p>
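The handshake above can be sketched as a tiny state machine. This is a behavioral model, not the 8086's circuit; details like waiting for the end of the current bus cycle are collapsed into a single clock step:

```python
class MinModeHold:
    """Behavioral sketch of the minimum-mode HOLD/HLDA handshake."""
    def __init__(self):
        self.hlda = False         # hold-acknowledge output pin
        self.bus_floating = False # True when bus outputs are tri-stated

    def clock(self, hold):
        """One clock: sample the HOLD input, update HLDA and the bus drivers."""
        if hold and not self.hlda:
            # Request seen: acknowledge it and float the bus.
            self.hlda = True
            self.bus_floating = True
        elif not hold and self.hlda:
            # Request dropped: reclaim the bus.
            self.hlda = False
            self.bus_floating = False

cpu = MinModeHold()
cpu.clock(hold=True)
assert cpu.hlda and cpu.bus_floating          # 8086 grants the bus
cpu.clock(hold=False)
assert not cpu.hlda and not cpu.bus_floating  # 8086 resumes bus activity
```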
<p>The 8086's maximum mode is more complex, allowing two other devices to share the bus by using a priority-based scheme.
Maximum mode uses two bidirectional signals, <code>RQ/GT0</code> and <code>RQ/GT1</code>.<span id="fnref:active-low"><a class="ref" href="#fn:active-low">2</a></span>
When a device wants to use the bus, it issues a pulse on one of the signal lines, pulling it low.
The 8086 responds by pulsing the same line.
When the device is done with the bus, it issues a third pulse to inform the 8086.
The <code>RQ/GT0</code> line has higher priority than <code>RQ/GT1</code>, so if two devices request the bus at the same time,
the <code>RQ/GT0</code> device wins and the <code>RQ/GT1</code> device needs to wait.<span id="fnref:priority"><a class="ref" href="#fn:priority">1</a></span>
Keep in mind that the <code>RQ/GT</code> lines are bidirectional: the 8086 and the external device both use the same line
for signaling.</p>
<p><a href="https://static.righto.com/images/8086-hold/rqgt.png"><img alt="This diagram shows the request/grant sequence. From iAPX 86,88 User's Manual, Figure 4-16." class="hilite" height="124" src="https://static.righto.com/images/8086-hold/rqgt-w500.png" title="This diagram shows the request/grant sequence. From iAPX 86,88 User's Manual, Figure 4-16." width="500" /></a><div class="cite">This diagram shows the request/grant sequence. From <a href="http://www.bitsavers.org/components/intel/_dataBooks/1981_iAPX_86_88_Users_Manual.pdf">iAPX 86,88 User's Manual</a>, Figure 4-16.</div></p>
<!--
One tricky thing is that maximum mode and minimum mode use the same pins, even though the signaling
pattern is completely different.
Pin #31 is called `HOLD` in minimum mode and `RQ/GT0` in maximum mode, while pin #30 is
called `HLDA` in minimum mode and `RQ/GT1` in maximum mode.
-->
<p>The bus hold does not completely stop the 8086.
The hold operation stops the Bus Interface Unit, but the Execution Unit will continue executing instructions
until it needs to perform a read or write, or it empties the prefetch queue.
Specifically, the hold signal blocks the Bus Interface Unit from starting a memory cycle and
blocks an instruction prefetch from starting.</p>
<h2>Bus sharing and the 8087 coprocessor</h2>
<p>Probably the most common use of the bus hold feature was to support the Intel 8087 math coprocessor.
The <a href="https://www.righto.com/2018/08/inside-die-of-intels-8087-coprocessor.html">8087 coprocessor</a>
greatly improved the performance of floating-point operations, making them up to 100 times faster.
As well as floating-point arithmetic, the 8087 supported trigonometric operations, logarithms and powers.
The 8087's architecture became part of later Intel processors, and the 8087's instructions are still a part of today's x86 computers.<span id="fnref:8087"><a class="ref" href="#fn:8087">3</a></span></p>
<p>The 8087 had its own registers and didn't have access to the 8086's registers.
Instead, the 8087 could transfer values to and from the system's main memory.
Specifically, the 8087 used the <code>RQ/GT</code> mechanism (maximum mode) to take control of the bus if
the 8087 needed to transfer operands to or from memory.<span id="fnref:8087-rq"><a class="ref" href="#fn:8087-rq">4</a></span>
The 8087 could be installed as an option on the original IBM PC, which is why the IBM PC used maximum mode.</p>
<!--
The idea behind the 8087 coprocessor is that when the 8086 encounters an 8087 instruction (an "Escape" code),
the 8087 runs the instruction instead of the 8086.
However, this is more complicated than you might expect.
Because the 8086 prefetches instructions, the 8087 doesn't directly know what instruction the 8086 is running.
Instead, the 8087 has to maintain its own copy of the 8086 prefetch queue and keep track of when the
8086 uses a byte from the queue or flushes the queue.
(The 8086 provides queue status pins with this information.)
-->
<!--
For details, see the [5150 Technical Reference Manual](https://www.minuszerodegrees.net/manuals/IBM_5150_Technical_Reference_6025005_AUG81.pdf) page 2-3.
-->
<h2>The enable flip-flop</h2>
<p>The circuit is built from six flip-flops. The flip-flops are a bit different from typical D flip-flops, so I'll discuss
the flip-flop behavior before explaining the circuit.</p>
<p>A flip-flop can store a single bit, 0 or 1. Flip flops are very important in the 8086 because they hold information (state)
in a stable way, and they synchronize the circuitry with the processor's clock.
A common type of flip-flop is the D flip-flop, which takes a data input (D) and stores that value.
In an edge-triggered flip-flop, this storage happens
on the edge when the clock changes state from low to high.<span id="fnref:polarity"><a class="ref" href="#fn:polarity">5</a></span>
(Except at this transition, the input can change without affecting the output.)
The output is called <code>Q</code>, while the inverted output is called Q-bar.</p>
<p><a href="https://static.righto.com/images/8086-hold/enable-flip-flop.png"><img alt="The symbol for the D flip-flop with enable." class="hilite" height="114" src="https://static.righto.com/images/8086-hold/enable-flip-flop-w150.png" title="The symbol for the D flip-flop with enable." width="150" /></a><div class="cite">The symbol for the D flip-flop with enable.</div></p>
<p>Many of the 8086's flip-flops, including the ones in the hold circuit, have an "enable" input.
When the enable input is high, the flip-flop records the D input, but when the enable input is low, the flip-flop keeps its previous value.
Thus, the enable input allows the flip-flop to hold its value for more than one clock cycle.
The enable input is very important to the functioning of the hold circuit, as it is used to control when the circuit
moves to the next state.</p>
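The behavior of this flip-flop can be modeled in a few lines of Python (a behavioral sketch, not the 8086's transistor-level latch):

```python
class EnableDFlipFlop:
    """D flip-flop with an enable input: on each rising clock edge,
    it captures D only if enable is high; otherwise it holds its value."""
    def __init__(self):
        self.q = 0

    def clock_edge(self, d, enable):
        if enable:
            self.q = d
        return self.q

ff = EnableDFlipFlop()
ff.clock_edge(d=1, enable=True)
assert ff.q == 1
ff.clock_edge(d=0, enable=False)  # enable low: the old value is held
assert ff.q == 1
ff.clock_edge(d=0, enable=True)   # enable high: the new value is captured
assert ff.q == 0
```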
<h2>How bus hold is implemented (minimum mode)</h2>
<p>I'll start by explaining how the hold circuitry works in minimum mode.
To review, in minimum mode the external device requests a hold through the <code>HOLD</code> input, keeping the input high for the duration of
the request.
The 8086 responds by pulling the hold acknowledge <code>HLDA</code> output high for the duration of the hold.</p>
<p>In minimum mode, only three of the six flip-flops are relevant.
The diagram below is highly simplified to show the essential behavior.
(The full schematic is in the footnotes.<span id="fnref:schematic"><a class="ref" href="#fn:schematic">6</a></span>)
At the left is the <code>HOLD</code> signal, the request from the external device.</p>
<p><a href="https://static.righto.com/images/8086-hold/simplified1.png"><img alt="Simplified diagram of the circuitry for minimum mode." class="hilite" height="164" src="https://static.righto.com/images/8086-hold/simplified1-w600.png" title="Simplified diagram of the circuitry for minimum mode." width="600" /></a><div class="cite">Simplified diagram of the circuitry for minimum mode.</div></p>
<p>When a <code>HOLD</code> request comes in, the first flip-flop is activated, and remains activated for the duration of the request.
The second flip-flop waits if any condition is blocking the hold request: a <code>LOCK</code> instruction, an unaligned
memory access, and so forth.
When the <code>HOLD</code> can proceed, the second flip-flop is enabled and it latches the request.
The second flip-flop controls the internal hold signal, causing the 8086 to stop further bus activity.
The third flip-flop is then activated when the current bus cycle (if any) completes; when it latches the request,
the hold is "official".
The third flip-flop drives the external <code>HLDA</code> (Hold Acknowledge) pin, indicating that the bus is free.
This signal also clears the bus-enabled latch (elsewhere in the 8086), putting the appropriate pins into floating tri-state mode.
The key point is that the flip-flops control the timing of the internal hold and the external <code>HLDA</code>, moving to the
next step as appropriate.</p>
<p>When the external device signals an end to the hold by pulling the <code>HOLD</code> pin low, the process reverses.
The three flip-flops return to their idle state in sequence.
The second flip-flop clears the internal hold signal, restarting bus activity.
The third flip-flop clears the <code>HLDA</code> pin.<span id="fnref:tristate"><a class="ref" href="#fn:tristate">7</a></span></p>
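Chaining three enable flip-flops gives the sequencing described above. In this simplified Python model, <code>hold_ok</code> stands in for "no LOCK instruction, no unaligned access, etc." and <code>bus_idle</code> for "current bus cycle complete":

```python
def hold_sequencer_step(state, hold_pin, hold_ok, bus_idle):
    """One clock of a simplified three-flip-flop hold chain.
    state is (ff1, ff2, ff3): ff1 tracks the HOLD request,
    ff2 is the internal hold signal, ff3 drives the HLDA pin."""
    ff1, ff2, ff3 = state
    new_ff1 = hold_pin                 # tracks the external HOLD request
    new_ff2 = ff1 if hold_ok else ff2  # latches it once nothing blocks it
    new_ff3 = ff2 if bus_idle else ff3 # acknowledges once the bus is free
    return (new_ff1, new_ff2, new_ff3)

# The request ripples through one flip-flop per clock:
s = (0, 0, 0)
s = hold_sequencer_step(s, hold_pin=1, hold_ok=True, bus_idle=True)  # (1, 0, 0)
s = hold_sequencer_step(s, hold_pin=1, hold_ok=True, bus_idle=True)  # (1, 1, 0)
s = hold_sequencer_step(s, hold_pin=1, hold_ok=True, bus_idle=True)  # (1, 1, 1)
assert s == (1, 1, 1)  # internal hold active, HLDA asserted
```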
<h2>How bus hold is implemented (maximum mode)</h2>
<p>The implementation of maximum mode is tricky because it uses the same circuitry as minimum mode, but the behavior
is different in several ways.
First, minimum mode and maximum mode use opposite polarities: a hold is requested by pulling <code>HOLD</code> high in minimum mode versus pulling a request line low in maximum mode.
Moreover, in minimum mode, a request on the <code>HOLD</code> pin triggers a response on the opposite pin (<code>HLDA</code>), while in maximum mode, the request and response are on the same pin.
Finally, using the same pin for the request and grant signals requires the pin to act as both an input and an output, with tricky electrical properties.</p>
<p>In maximum mode, the top three flip-flops handle the request and grant on line 0, while the bottom three
flip-flops handle line 1.
At a high level, these flip-flops behave roughly the same as in the minimum mode case,
with the first flip-flop tracking the hold request,
the second flip-flop activated when the hold is "approved", and the third flip-flop activated when the bus
cycle completes.
An <code>RQ 0</code> input will generate a <code>GT 0</code> output, while an <code>RQ 1</code> input will generate
a <code>GT 1</code> output.
The diagram below is highly simplified, but illustrates the overall behavior.
Keep in mind that <code>RQ 0</code>, <code>GT 0</code>, and <code>HOLD</code> use the same physical pin, as do <code>RQ 1</code>, <code>GT 1</code>, and <code>HLDA</code>.</p>
<p><a href="https://static.righto.com/images/8086-hold/simplified2.png"><img alt="Simplified diagram of the circuitry for maximum mode." class="hilite" height="407" src="https://static.righto.com/images/8086-hold/simplified2-w550.png" title="Simplified diagram of the circuitry for maximum mode." width="550" /></a><div class="cite">Simplified diagram of the circuitry for maximum mode.</div></p>
<p>In more detail, the first flip-flop converts the pulse request input into a steady signal.
This is accomplished by configuring the first flip-flop to toggle on when the request pulse
is received and toggle off when the end-of-request pulse is received.<span id="fnref:feedback"><a class="ref" href="#fn:feedback">10</a></span>
The toggle action is implemented by feeding the output pulse back to the input, inverted (A);
since the flip-flop is enabled by the <code>RQ</code> input, the flip-flop holds its value until an input pulse.
One tricky part is that the acknowledge pulse must not toggle the flip-flop.
This is accomplished by using the output signal to block the toggle.
(To keep the diagram simple, I've just noted the "block" action rather than showing the logic.)</p>
<p>As before, the second flip-flop is blocked until the hold is "authorized" to proceed.
However, the circuitry is more complicated since it must prioritize the two request lines and ensure that only one
hold is granted at a time.
If <code>RQ0</code>'s first flip-flop is active, it blocks the enable of <code>RQ1</code>'s second flip-flop (B).
Conversely, if <code>RQ1</code>'s second flip-flop is active, it blocks the enable of <code>RQ0</code>'s second flip-flop (C).
Note the asymmetry, blocking on <code>RQ0</code>'s <em>first</em> flip-flop and <code>RQ1</code>'s <em>second</em> flip-flop.
This enforces the priority of <code>RQ0</code> over <code>RQ1</code>, since an <code>RQ0</code> request blocks <code>RQ1</code> but only an <code>RQ1</code> "approval" blocks
<code>RQ0</code>.</p>
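<p>The asymmetric gating can be captured in a few lines. This is my simplified model of the blocking terms (B) and (C), not the exact gate structure on the die:</p>

```python
def second_ff_enables(rq0_ff1, rq0_ff2, rq1_ff1, rq1_ff2, hold_ok):
    """Sketch of the priority gating: an RQ0 *request* (first flip-flop)
    blocks RQ1, but only an RQ1 *approval* (second flip-flop) blocks RQ0,
    giving RQ0 priority."""
    enable_rq0 = hold_ok and not rq1_ff2   # (C): RQ1 approval blocks RQ0
    enable_rq1 = hold_ok and not rq0_ff1   # (B): RQ0 request blocks RQ1
    return enable_rq0, enable_rq1
```

<p>With simultaneous requests, only <code>RQ0</code>'s second flip-flop is enabled; but once <code>RQ1</code> has been approved, a later <code>RQ0</code> request can no longer displace it.</p>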
<p>When the second flip-flop is activated in either path, it triggers the internal hold signal (D).<span id="fnref:hold-out"><a class="ref" href="#fn:hold-out">8</a></span>
As before, the hold request is latched into the third flip-flop when any existing memory cycle completes.
When the hold request is granted, a pulse is generated (E) on the corresponding <code>GT</code> pin.<span id="fnref:pulse"><a class="ref" href="#fn:pulse">9</a></span></p>
<p>The same circuitry is used for minimum mode and maximum mode, although the above diagrams show differences between
the two modes. How does this work?
Essentially, logic gates are used to change the behavior between minimum mode and maximum mode as required.
For the most part, the circuitry works the same, so only a moderate amount of logic is required to make
the same circuitry work for both.
On the schematic, the signal <code>MN</code> is active during minimum mode, while <code>MX</code> is active during maximum mode,
and these signals control the behavior.</p>
<h3>The "hold ok" circuit</h3>
<p>As usually happens with the 8086, there are a bunch of special cases when different features interact.
One special case is if a bus hold request comes in while the 8086 is acknowledging an interrupt.
In this case, the interrupt takes priority and the bus hold is not processed until the interrupt acknowledgment
is completed.
A second special case is if the bus hold occurs while the 8086 is halted. In this case, the 8086 issues
a second HALT indicator at the end of the bus hold.
Yet another special case is the 8086's <code>LOCK</code> prefix, which locks the use of the bus for the following instruction,
so a bus hold request is not honored until the locked instruction has completed.
Finally, the 8086 performs an unaligned word access to memory by breaking it into two 8-bit bus cycles;
these two cycles can't be interrupted by a bus hold.</p>
<p>In more detail, the "hold ok" circuit determines at each cycle if a hold could proceed. There are several conditions under which the hold can proceed:</p>
<ul>
<li>The bus cycle is <code>T2</code>, except if an unaligned bus operation is taking place (i.e. a word split into two byte operations), or
<li>A memory cycle is not active and a microcode memory operation is not taking place, or
<li>A memory cycle is not active and a hold is currently active.
</ul>
<p>The first case occurs during bus (memory) activity, where a hold request before cycle T2 will be handled at the end of that cycle.
The second case allows a hold if the bus is inactive. But if microcode is performing a memory operation, the hold will be delayed, even if the request is just starting.
The third case is the opposite of the other two: it enables the flip-flop so a hold request can be dropped.
(This ensures that the hold request can still be dropped in the corner case where a hold starts and then the microcode makes a memory
request, which will be blocked by the hold.)</p>
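<p>These three conditions translate directly into a predicate. The parameter names here are my own shorthand for the states described above:</p>

```python
def hold_ok(t_state, unaligned, bus_cycle_active, ucode_mem_op, hold_active):
    """Sketch of the "hold ok" conditions; t_state is the current bus
    T-state (None when the bus is idle)."""
    # Case 1: in T2 of a bus cycle, unless it's half of an unaligned
    # word access (two back-to-back byte cycles that can't be split).
    if t_state == "T2" and not unaligned:
        return True
    # Case 2: bus idle and the microcode isn't requesting memory.
    if not bus_cycle_active and not ucode_mem_op:
        return True
    # Case 3: bus idle and a hold is already active, keeping the
    # flip-flop enabled so the hold request can be *dropped*.
    if not bus_cycle_active and hold_active:
        return True
    return False
```

<p>Note how case 3 covers the corner case in the text: even if the microcode is stuck waiting on a blocked memory request, an active hold keeps the flip-flop enabled so the hold can still end.</p>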
<p><a href="https://static.righto.com/images/8086-hold/hold-ok-circuit.png"><img alt="The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear." class="hilite" height="118" src="https://static.righto.com/images/8086-hold/hold-ok-circuit-w400.png" title="The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear." width="400" /></a><div class="cite">The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear.</div></p>
<p>An instruction with the <code>LOCK</code> prefix causes the bus to be locked against other devices for the duration of the instruction.
Thus, a hold cannot be granted while the instruction is running. This is implemented through a separate path.
This logic is between the output of the first (request) flip-flop and the second (accepted) flip-flop, tied into the <code>LOCK</code> signal.
Conceptually, it seems that the <code>LOCK</code> signal should block <code>hold-ok</code> and thus block the second (accepted) flip-flop from
being enabled.
But instead, the <code>LOCK</code> signal blocks the data path, unless the request has already been granted.
I think the motivation is to allow dropping of a hold request to proceed uninterrupted.
In other words, <code>LOCK</code> prevents a hold from being accepted, but it doesn't prevent a hold from being dropped, and it
was easier to implement this in the data path.</p>
<h3>The pin drive circuitry</h3>
<p>The circuitry for the <code>HOLD/RQ0/GT0</code> and <code>HLDA/RQ1/GT1</code> pins is somewhat complicated, since they are used for both input and output.
In minimum mode, the <code>HOLD</code> pin is an input, while the <code>HLDA</code> pin is an output.
In maximum mode, both pins act as inputs, with a low-going pulse from an external device starting or stopping a hold.
But the 8086 also issues pulses to grant the hold.
Pull-up resistors inside the 8086 ensure that the pins remain high (idle) when unused.
Finally, an undocumented active pull-up system restores a pin to a high state after it is pulled low, providing faster
response than the resistor.</p>
<p>The schematic below shows the heart of the tri-state output circuit.
Each pin is connected to two large output MOSFETs, one to drive the pin high and one to drive the pin low.
The transistors have separate control lines; if both control lines are low, both transistors are off and the pin
floats in the "tri-state" condition.
This permits the pin to be used as an input, driven by an external device.
The pull-up resistor keeps the pin in a high state.</p>
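<p>The pin behavior can be summarized in a short model. This is a logical abstraction of the two output MOSFETs and the pull-up, not a circuit simulation:</p>

```python
def pin_level(drive_high, drive_low, external_drive=None):
    """Sketch of the tri-state pin: two output MOSFETs plus a pull-up.
    Returns the pin level as 0 or 1."""
    # Turning on both transistors would short the supply to ground.
    assert not (drive_high and drive_low), "invalid: both MOSFETs on"
    if drive_high:
        return 1
    if drive_low:
        return 0
    # Both transistors off: the pin floats (tri-state), so an external
    # device can drive it; otherwise the pull-up keeps it high.
    return external_drive if external_drive is not None else 1
```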
<p><a href="https://static.righto.com/images/8086-hold/pin-output.png"><img alt="The tri-state output circuit for each hold pin." class="hilite" height="231" src="https://static.righto.com/images/8086-hold/pin-output-w300.png" title="The tri-state output circuit for each hold pin." width="300" /></a><div class="cite">The tri-state output circuit for each hold pin.</div></p>
<p>The diagram below shows how this circuitry looks on the die.
In this image, the metal and polysilicon layers have been removed with acid to show the underlying doped
silicon regions. The thin white stripes are transistor gates where polysilicon wiring crossed the silicon.
The black circles are vias that connected the silicon to the metal on top.
The empty regions at the right are where the metal pads for <code>HOLD</code> and <code>HLDA</code> were.
Next to the pads are the large transistors to pull the outputs high or low.
Because the outputs require much higher current than internal signals, these transistors are much
larger than logic transistors.
They are composed of several transistors placed in parallel, resulting in the parallel stripes.
The small pullup resistors are also visible.
For efficiency, these resistors are actually depletion-mode transistors, specially doped to act as constant-current sources.</p>
<p><a href="https://static.righto.com/images/8086-hold/output-die.jpg"><img alt="The HOLD/HLDA pin circuitry on the die." class="hilite" height="496" src="https://static.righto.com/images/8086-hold/output-die-w500.jpg" title="The HOLD/HLDA pin circuitry on the die." width="500" /></a><div class="cite">The HOLD/HLDA pin circuitry on the die.</div></p>
<p>At the left, some of the circuitry is visible.
The large output transistors are driven by "superbuffers" that provide more current than regular NMOS buffers.
(A superbuffer uses separate transistors to pull the signal high and low, rather than using a pull-up to pull
the signal high as in a standard NMOS gate.)
The small transistors are the pass transistors that gate output signals according to the clock.
The thick rectangles are crossovers, allowing the vertical metal wiring (no longer visible) to cross over
a horizontal signal in the silicon layer.
The 8086 has only a single metal layer, so the layout requires a crossover if signals will otherwise
intersect.
Because silicon's resistance is much higher than metal's resistance, the crossover is relatively wide
to reduce the resistance.</p>
<p>The problem with a pull-up resistor is that it is relatively slow when connected to a line with high capacitance.
You essentially end up with a resistor-capacitor delay circuit, as the resistor slowly charges the line and
brings it up to a high level.
To get around this, the 8086 has an active drive circuit to pulse the RQ/GT lines high to pull them back from a low level.
This circuit pulls the line high one cycle after the 8086 drops it low for a grant acknowledge.
This circuit also pulls the line high after the external device pulls it low.<span id="fnref:pullup"><a class="ref" href="#fn:pullup">11</a></span>
(The schematic for this circuit is in the footnotes.)
The curious thing is that I couldn't find this behavior documented in the datasheets. The datasheets describe
the internal pull-up resistor, but don't mention that the 8086 actively pulls the lines high.<span id="fnref:driver"><a class="ref" href="#fn:driver">12</a></span></p>
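<p>To see why the resistor alone is slow, consider the RC time constant. The component values below are hypothetical (they don't come from the datasheet), chosen only to show the order of magnitude:</p>

```python
import math

def rc_rise_time(r_ohms, c_farads, v_fraction=0.7):
    """Time for an RC pull-up to charge a line to v_fraction of Vcc:
    t = -RC * ln(1 - v_fraction)."""
    return -r_ohms * c_farads * math.log(1 - v_fraction)

# Hypothetical values: a 20 kilohm internal pull-up charging 15 pF of
# pin and trace capacitance to a ~70%-of-Vcc logic-high threshold.
t = rc_rise_time(20e3, 15e-12)
```

<p>With these assumed values the rise takes roughly 360 ns, longer than a full 200 ns clock cycle on a 5 MHz 8086, which is why an active pull-up pulse restores the line so much faster.</p>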
<h2>Conclusions</h2>
<p>The hold circuitry was a key feature of the 8086, largely because it was necessary for the 8087 math coprocessor chip.
The hold circuitry seems like it should be straightforward, but there are many corner cases in this circuitry:
it interacts with unaligned memory accesses, the <code>LOCK</code> prefix, and minimum vs. maximum modes.
As a result, it is fairly complicated.</p>
<p>Personally, I find the hold circuitry somewhat unsatisfying to study, with few fundamental concepts but a lot of special-case logic.
The circuitry seems overly complicated for what it does.
Much of the complexity is probably due to the wildly different behavior of the pins between minimum and maximum mode.
Intel should have simply used a larger package (like the Motorola 68000) rather than re-using pins to support different
modes, as well as using the same pin for a request and a grant.
It's impressive, however, that the same circuitry was made to work for both minimum and maximum modes, despite
the completely different signals used to request and grant holds.
This circuitry must have been a nightmare for Intel's test engineers, trying to ensure that the circuitry performed properly
when there were so many corner cases and potential race conditions.</p>
<p>I plan to write more on the 8086, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I've also started experimenting with Mastodon recently as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>
and Bluesky as <a href="https://staging.bsky.app/profile/righto.com">@righto.com</a> so you can follow me there too.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:priority">
<p>The timing of priority between <code>RQ0</code> and <code>RQ1</code> is left vague in the documentation.
In practice, even if <code>RQ1</code> is requested first, a later <code>RQ0</code> can still preempt it until the point that <code>RQ1</code> is internally
granted (i.e. the second flip-flop is activated).
This happens before the hold is externally acknowledged, so it's not obvious to the user at what point priority no longer applies. <a class="footnote-backref" href="#fnref:priority" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:active-low">
<p>The <code>RQ/GT0</code> and <code>RQ/GT1</code> signals are active-low. These signals should have an overbar to indicate this,
but it makes the page formatting ugly :-) <a class="footnote-backref" href="#fnref:active-low" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:8087">
<p>Modern x86 processors still support the 8087 (x87) instruction set.
Starting with the 80486DX, the floating point unit was included on the CPU die, rather than as an external coprocessor.
The x87 instruction set used a stack-based model, which made it harder to parallelize.
To mitigate this, Intel introduced <a href="https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions">SSE</a> in 1999, a different set of
floating point instructions that worked on an independent register set.
The x87 instructions are now considered mostly obsolete and are <a href="https://www.realworldtech.com/physx87/4/">deprecated</a> in <a href="https://learn.microsoft.com/en-us/windows/win32/dxtecharts/sixty-four-bit-programming-for-game-developers#porting-applications-to-64-bit-platforms">64-bit Windows</a>. <a class="footnote-backref" href="#fnref:8087" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:8087-rq">
<p>The 8087 provides another <code>RQ/GT</code> input line for an external device. Thus, two external devices
can still be used in a system with an 8087.
That is, although the 8087 uses up one of the 8086's two <code>RQ/GT</code> lines, the 8087 provides another one, so
there are still two lines available.
The 8087 has logic to combine its bus requests and external bus requests into a single <code>RQ/GT</code> line to the 8086. <a class="footnote-backref" href="#fnref:8087-rq" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:polarity">
<p>Confusingly, some of the flip-flops in the hold circuit transistion when the clock goes high, while others
use the inverted clock signal and transition when the clock goes low.
Moreover, the flip-flops are inconsistent about how they treat the data. In each group of three flip-flops, the first flip-flop is active-high, while the remaining flip-flops are active-low.
For the most part, I'll ignore this in the discussion.
You can look at the schematic if you want to know the details. <a class="footnote-backref" href="#fnref:polarity" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:schematic">
<p>The schematics below shows my reverse-engineered schematic for the hold circuitry.
I have partitioned the schematic into the hold logic and the output driver circuitry.
This split matches the physical partitioning of the circuitry on the die.</p>
<p>In the first schematic, the upper part handles <code>HOLD</code> and <code>request0</code>, while the lower part handles
<code>request1</code>. There is some circuitry in the middle to handle the common enabling and to generate the
internal hold signal.
I won't explain the circuitry in much detail, but there are a few things I want to point out.
First, even though the hold circuit seems like it should be simple, there are a lot of gates connected in complicated ways.
Second, although there are many inverters, NAND, and NOR gates, there are also complex gates such as
AND-NOR, OR-NAND, AND-OR-NAND, and so forth.
These are implemented as single gates.
Due to how gates are constructed from NMOS transistors, it is just as easy to build a hierarchical gate as
a single gate. (The last step must be inversion, though.)
The XOR gates are more complex; they are constructed from a NOR gate and an AND-NOR gate.</p>
<p><a href="https://static.righto.com/images/8086-hold/schematic1.png"><img alt="Schematic of the hold circuitry. Click this image (or any other) for a larger version." class="hilite" height="391" src="https://static.righto.com/images/8086-hold/schematic1-w700.png" title="Schematic of the hold circuitry. Click this image (or any other) for a larger version." width="700" /></a><div class="cite">Schematic of the hold circuitry. Click this image (or any other) for a larger version.</div></p>
<p>The schematic below shows the output circuits for the two pins. These circuits are similar, but have a few
differences because only the bottom one is used as an output (<code>HLDA</code>) in minimum mode.
Each circuit has two inputs: what the current value of the pin is, and what the desired value of the pin is.</p>
<p><a href="https://static.righto.com/images/8086-hold/schematic2.png"><img alt="Schematic of the pin output circuits." class="hilite" height="386" src="https://static.righto.com/images/8086-hold/schematic2-w700.png" title="Schematic of the pin output circuits." width="700" /></a><div class="cite">Schematic of the pin output circuits.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:schematic" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:tristate">
<p>Interestingly, the external pins aren't taken out of tri-state mode immediately when the <code>HLDA</code> signal is dropped.
Instead, the 8086's bus drivers are re-enabled when a bus cycle starts, which is slightly later.
The bus circuitry has a separate flip-flop to manage the enable/disable state, and the start of a bus cycle
is what re-enables the bus.
This is another example of behavior that the documentation leaves ambiguous. <a class="footnote-backref" href="#fnref:tristate" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:hold-out">
<p>There's one more complication for the <code>hold-out</code> signal. If a hold is granted on one line, a request comes in on
the other line, and then the hold is released on the first line, the desired behavior is for the bus to remain
in the hold state as the hold switches to the second line.
However, because of the way a hold on line 1 blocks a hold on line 0, the <code>GT1</code> second flip-flop will drop a cycle before
the <code>GT0</code> second flip-flop is activated.
This would cause <code>hold-out</code> to drop for a cycle and the 8086 could start unwanted bus activity.
To prevent this case, the <code>hold-out</code> line is also activated if there is an <code>RQ0</code> request and <code>RQ1</code> is granted.
This condition seems a bit random but it covers the "gap".
I have to wonder if Intel planned the circuit this way, or they added the extra test as a bug fix.
(The asymmetry of the priority circuit causes this problem to occur only going from a hold on line 1 to line 0, not the
other direction.) <a class="footnote-backref" href="#fnref:hold-out" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:pulse">
<p>The pulse-generating circuit is a bit tricky.
A pulse signal is generated if the request has been accepted, has not been granted, and will be granted on the next clock (i.e. no memory request is active so the flip-flop is enabled).
(Note that the pulse will be one cycle wide, since granting the request on the next cycle will cause the conditions to be no longer satisfied.)
This provides the pulse one clock cycle before the flip-flop makes it "official".
Moreover, the signals come from the inverted Q outputs from the flip-flops, which are updated half a clock cycle earlier.
The result is that the pulse is generated 1.5 clock cycles before the flip-flop output.
Presumably the point of this is to respond to hold requests faster, but it seems overly complicated. <a class="footnote-backref" href="#fnref:pulse" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:feedback">
<p>The request pulse is required to be one clock cycle wide. The feedback loop shows why:
if the request is longer than one clock cycle, the first flip-flop will repeatedly toggle on and off,
resulting in unexpected behavior. <a class="footnote-backref" href="#fnref:feedback" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:pullup">
<p>The details of the active pull-up circuitry don't make sense to me.
First it XORs the state of the pin with the desired state of the pin and uses this signal to control a
multiplexer, which generates the pull-up action based on other gates.
The result of all this ends up being simply NAND, implemented with excessive gates.
Another issue is that minimum mode blocks the active pull-up, which makes sense.
But there are additional logic gates so minimum mode can affect the value going into the multiplexer,
which gets blocked in minimum mode, so that logic seems wasted.
There are also two separate circuits to block pull-up during reset.
My suspicion is that the original logic accumulated bug fixes and redundant logic wasn't removed.
But it's possible that the implementation is doing something clever that I'm just missing. <a class="footnote-backref" href="#fnref:pullup" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:driver">
<p>My analysis of the <code>RQ/GT</code> lines being pulled high is based on simulation.
It would be interesting to verify this behavior on a physical 8086 chip.
By measuring the current out of the pin, the pull-up pulses should be visible. <a class="footnote-backref" href="#fnref:driver" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com2tag:blogger.com,1999:blog-6264947694886887540.post-55162051246400221202023-07-15T23:28:00.013-07:002023-09-19T10:27:16.494-07:00Undocumented 8086 instructions, explained by the microcode<style>
.hilite {cursor:zoom-in}
a:link img.hilite, a:visited img.hilite {color: #fff;}
a:hover img.hilite {color: #f66;}
</style>
<style>
pre.microcode {font-family: courier, fixed; padding: 10px; background-color: #f5f5f5; display:inline-block;border:none;}
pre.microcode span {color: green; font-style:italic; font-family: sans-serif; font-size: 90%;}
</style>
<p>What happens if you give the Intel 8086 processor an instruction that doesn't exist?
A modern microprocessor (80186 and later) will generate an exception, indicating that an illegal instruction
was executed.
However, early microprocessors didn't include the circuitry to detect illegal instructions, since the chips didn't have
transistors to spare. Instead, these processors would do <em>something</em>,
but the results weren't specified.<span id="fnref:6502"><a class="ref" href="#fn:6502">1</a></span></p>
<p>The 8086 has a number of undocumented instructions.
Most of them are simply duplicates of regular instructions, but a few have unexpected behavior, such as revealing the
values of internal, hidden registers.
In the 8086, most instructions are implemented in microcode, so examining the 8086's microcode can explain why these instructions
behave the way they do.</p>
<p>The photo below shows the 8086 die under a microscope, with the important functional blocks labeled. The metal layer is visible, while the underlying silicon and polysilicon wiring is mostly hidden.
The microcode ROM and the microcode address decoder are in the lower right.
The Group Decode ROM (upper center) is also important, as it performs the first step of instruction decoding.</p>
<p><a href="https://static.righto.com/images/8086-ad-undoc/die-labeled.jpg"><img alt="The 8086 die under a microscope, with main functional blocks labeled. Click on this image (or any other) for a larger version." class="hilite" height="589" src="https://static.righto.com/images/8086-ad-undoc/die-labeled-w600.jpg" title="The 8086 die under a microscope, with main functional blocks labeled. Click on this image (or any other) for a larger version." width="600" /></a><div class="cite">The 8086 die under a microscope, with main functional blocks labeled. Click on this image (or any other) for a larger version.</div></p>
<h2>Microcode and 8086 instruction decoding</h2>
<p>You might think that machine instructions are the basic steps that a computer performs.
However, instructions usually require multiple steps inside the processor.
One way of expressing these multiple steps is through microcode, a technique dating back to 1951.
To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.
In other words, microcode forms another layer between the machine instructions and the hardware.
The main advantage of microcode is that it turns the processor's control logic into a programming task instead of a difficult logic design task.</p>
<p>The 8086's <a href="https://www.righto.com/2022/11/how-8086-processors-microcode-engine.html">microcode ROM</a> holds 512 micro-instructions, each 21 bits wide.
Each micro-instruction performs two actions in parallel. First is a move between a source and a destination, typically registers.
Second is an operation that can range from an arithmetic (ALU) operation to a memory access.
The diagram below shows the structure of a 21-bit micro-instruction, divided into six types.</p>
<p><a href="https://static.righto.com/images/8086-ad-undoc/microcode-format.jpg"><img alt="The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?" class="hilite" height="203" src="https://static.righto.com/images/8086-ad-undoc/microcode-format-w700.jpg" title="The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?" width="700" /></a><div class="cite">The encoding of a micro-instruction into 21 bits. Based on <a href="https://digitalcommons.law.scu.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1031&context=chtlj">NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?</a></div></p>
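<p>As a rough illustration, a 21-bit micro-instruction can be split into its move and action parts. The 5-bit source and destination widths follow the diagram, but the exact bit positions here are my assumption for illustration; the interpretation of the action bits varies with the instruction type.</p>

```python
def decode_microinstruction(word):
    """Sketch: split a 21-bit micro-instruction into a move (source and
    destination registers) and a type-dependent action field."""
    assert 0 <= word < (1 << 21), "micro-instructions are 21 bits"
    source = (word >> 16) & 0x1F   # 5 bits: register to move from
    dest   = (word >> 11) & 0x1F   # 5 bits: register to move to
    action = word & 0x7FF          # 11 bits: type-dependent operation
    return source, dest, action
```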
<p>When executing a machine instruction, the 8086 performs a decoding step.
Although the 8086 is a 16-bit processor, its instructions are based on bytes. In most cases, the first byte specifies the
opcode, which may be followed by additional instruction bytes.
In other cases, the byte is a "prefix" byte, which changes the behavior of the following instruction.
The first byte is analyzed
by something called the <a href="https://www.righto.com/2023/05/8086-processor-group-decode-rom.html">Group Decode ROM</a>.
This circuit categorizes the first byte of the instruction into about 35 categories that control how the instruction is
decoded and executed.
One category is "1-byte logic"; this indicates a one-byte instruction or prefix that is simple and implemented by logic circuitry in the 8086.
Instructions in this category are handled directly by logic circuitry, so microcode is not involved;
the remaining instructions are implemented in microcode.
Many of these instructions are in the "two-byte ROM" category indicating that the instruction has a second byte
that also needs to be decoded by microcode.
This second byte, called the ModR/M byte, specifies the memory addressing mode or registers that the instruction uses.</p>
<p>The next step is the microcode's address decoder circuit, which determines where to start executing microcode based on
the opcode.
Conceptually, you can think of the microcode as stored in a ROM, indexed by the instruction opcode and a few sequence bits.
However, since many instructions can use the same microcode, it would be inefficient to store duplicate copies of these routines.
Instead, the microcode address decoder permits multiple instructions to reference the same entries in the ROM.
This decoding circuitry is similar to a PLA (Programmable Logic Array): it matches bit patterns to determine a particular starting point.
This turns out to be important for undocumented instructions, since an undocumented opcode often matches the pattern for a "real" instruction, making the undocumented instruction an alias.</p>
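<p>A toy model shows how this aliasing arises. The pattern below is hypothetical, not one of the decoder's actual entries; the point is that any opcode matching the pattern, documented or not, starts the same microcode routine.</p>

```python
def pla_match(opcode, pattern):
    """Sketch of PLA-style address decoding: pattern is a string of
    '0', '1', and 'X' (don't-care) bits, most significant bit first."""
    bits = format(opcode, "08b")
    return all(p in ("X", b) for p, b in zip(pattern, bits))
```

<p>For instance, with the hypothetical pattern <code>"100010XX"</code>, opcodes <code>0x88</code> through <code>0x8B</code> all match, so all four would execute the same microcode even if only some were documented.</p>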
<p>The 8086 has several internal registers that are invisible to the programmer but are used by the microcode.
Memory accesses use the Indirect (<code>IND</code>) and Operand (<code>OPR</code>) registers; the <code>IND</code> register holds the address in the segment,
while the <code>OPR</code> register holds the data value that is read or written.
Although these registers are normally not accessible by the programmer, some undocumented instructions provide access to these registers, as will be described later.</p>
<p>The Arithmetic/Logic Unit (ALU) performs arithmetic, logical, and shift operations in the 8086.
The ALU uses three internal registers: <code>tmpA</code>, <code>tmpB</code>, and <code>tmpC</code>. An ALU operation requires two micro-instructions.
The first micro-instruction specifies the operation (such as <code>ADD</code>) and the temporary register that holds one argument (e.g. <code>tmpA</code>);
the second argument is always in <code>tmpB</code>.
A following micro-instruction can access the ALU result through the pseudo-register <code>Σ</code> (sigma).</p>
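<p>The two-step ALU pattern can be sketched in a few lines; this is a behavioral model of the interface described above, not the ALU's internals:</p>

```python
class ALUSketch:
    """Sketch of the 8086 ALU interface: one micro-instruction sets the
    operation and the temporary register holding the first argument;
    tmpB always supplies the second argument; a later micro-instruction
    reads the result from the pseudo-register sigma."""
    def __init__(self):
        self.regs = {"tmpA": 0, "tmpB": 0, "tmpC": 0}
        self.op = None
        self.first = None

    def set_op(self, op, first_reg):
        # First micro-instruction: e.g. set_op("ADD", "tmpA")
        self.op, self.first = op, first_reg

    @property
    def sigma(self):
        # Second micro-instruction reads the result through sigma.
        a, b = self.regs[self.first], self.regs["tmpB"]
        return {"ADD": (a + b) & 0xFFFF,
                "SUB": (a - b) & 0xFFFF,
                "AND": a & b}[self.op]
```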
<h3>The ModR/M byte</h3>
<p>A fundamental part of the 8086 instruction format is the ModR/M byte, a byte that specifies addressing for many instructions.
The 8086 has a variety of addressing modes, so the ModR/M byte is somewhat complicated.
Normally it specifies one memory address and one register. The memory address is specified through one of eight addressing
modes (below) along with an optional 8- or 16-bit displacement in the instruction.
Instead of a memory address, the ModR/M byte can also specify a second register.
For a few opcodes, the ModR/M byte selects what instruction to execute rather than a register.</p>
<p><a href="https://static.righto.com/images/8086-ad-undoc/modrm.png"><img alt="The 8086's addressing modes. From the MCS-86 Assembly Language Reference Guide." class="hilite" height="220" src="https://static.righto.com/images/8086-ad-undoc/modrm-w250.png" title="The 8086's addressing modes. From the MCS-86 Assembly Language Reference Guide." width="250" /></a><div class="cite">The 8086's addressing modes. From the <a href="http://bitsavers.org/components/intel/8086/9800749-1_MCS-86_Assembly_Language_Reference_Guide_Oct78.pdf">MCS-86 Assembly Language Reference Guide</a>.</div></p>
<p>The implementation of the ModR/M byte plays an important role in the behavior of undocumented instructions.
Support for this byte is implemented in both microcode and hardware.
The various memory address modes above are implemented by microcode subroutines, which compute the appropriate memory address and
perform a read if necessary.
The subroutine leaves the memory address in the <code>IND</code> register, and if a read is performed, the value is in the <code>OPR</code> register.</p>
<p>The hardware hides the ModR/M byte's selection of memory versus register, by making the value available through the pseudo-register <code>M</code>, while the second register is available through <code>N</code>.
Thus, the microcode for an instruction doesn't need to know if the value was in memory or a register, or which register was selected.
The Group Decode ROM examines the first byte of the instruction to determine if a ModR/M byte is present, and if a read
is required.
If the ModR/M byte specifies memory, the Translation ROM determines which micro-subroutines to call before handling the
instruction itself.
For more on the ModR/M byte, see my post on <a href="https://www.righto.com/2023/02/8086-modrm-addressing.html">Reverse-engineering the ModR/M addressing microcode</a>.</p>
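<p>A minimal decoder sketch (in Python; the field layout follows the documentation, but the helper names are mine) shows how the ModR/M byte splits into its three fields and selects an addressing mode:</p>

```python
# rm-field base/index combinations for memory addressing modes
BASE_INDEX = ["BX+SI", "BX+DI", "BP+SI", "BP+DI", "SI", "DI", "BP", "BX"]

def decode_modrm(byte):
    mod = byte >> 6          # bits 7-6: addressing mode
    reg = (byte >> 3) & 7    # bits 5-3: register (or opcode extension)
    rm  = byte & 7           # bits 2-0: register or memory expression
    if mod == 3:
        return mod, reg, f"register {rm}"            # register-to-register
    if mod == 0 and rm == 6:
        return mod, reg, "direct 16-bit address"     # special case: no BP base
    disp = {0: "", 1: "+disp8", 2: "+disp16"}[mod]
    return mod, reg, BASE_INDEX[rm] + disp

print(decode_modrm(0x47))   # mod=01 reg=000 rm=111 → (1, 0, 'BX+disp8')
```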
<h2>Holes in the opcode table</h2>
<p>The first byte of the instruction is a value from <code>00</code> to <code>FF</code> in hex.
Almost all of these opcode values correspond to documented 8086 instructions, but there are a few exceptions, "holes" in the opcode table.
The table below shows the 256 first-byte opcodes for the 8086, from hex <code>00</code> to <code>FF</code>. Valid opcodes for the 8086 are in white;
the colored opcodes are undefined and interesting to examine.
Orange, yellow, and green opcodes were given meaning in the 80186, 80286, and 80386 respectively.
The purple opcode is unusual: it was implemented in the 8086 and later processors but not documented.<span id="fnref:prefixes"><a class="ref" href="#fn:prefixes">2</a></span>
In this section, I'll examine the microcode for these opcode holes.</p>
<p><a href="https://static.righto.com/images/8086-ad-undoc/opcodes.png"><img alt="This table shows the 256 opcodes for the 8086, where the white ones are valid instructions. Click for a larger version." class="hilite" height="453" src="https://static.righto.com/images/8086-ad-undoc/opcodes-w450.png" title="This table shows the 256 opcodes for the 8086, where the white ones are valid instructions. Click for a larger version." width="450" /></a><div class="cite">This table shows the 256 opcodes for the 8086, where the white ones are valid instructions. Click for a larger version.</div></p>
<h3><code>D6</code>: <code>SALC</code></h3>
<p>The opcode <code>D6</code> (purple above) performs a well-known but undocumented operation that is typically called <code>SALC</code>, for Set AL to Carry.
This instruction sets the <code>AL</code> register to 0 if the carry flag is 0, and sets the <code>AL</code> register to <code>FF</code> if the carry flag is 1.
The curious thing about this undocumented instruction is that it exists in all x86 CPUs, but Intel didn't mention it until 2017.
Intel probably put this instruction into the processor deliberately as a <a href="https://en.wikipedia.org/wiki/Fictitious_entry#Copyright_traps">copyright trap</a>.
The idea is that if a company created a copy of the 8086 processor and the processor included the <code>SALC</code> instruction, this
would prove that the company had copied Intel's microcode and thus had potentially violated Intel's copyright on the microcode.
This came to light when NEC created improved versions of the 8086, the NEC V20 and V30 microprocessors, and was sued by Intel.
Intel analyzed NEC's microcode but was disappointed to find that NEC's chip did not include the hidden instruction, showing
that NEC hadn't copied the microcode.<span id="fnref:magic-instruction"><a class="ref" href="#fn:magic-instruction">3</a></span>
Although a Federal judge <a href="https://www.nytimes.com/1989/02/08/business/intel-loses-copyright-case-to-nec.html">ruled</a> in 1989 that NEC hadn't infringed
Intel's copyright, the 5-year trial ruined NEC's market momentum.</p>
<p>The <code>SALC</code> instruction is implemented with three micro-instructions, shown below.<span id="fnref:microcode"><a class="ref" href="#fn:microcode">4</a></span>
The first micro-instruction jumps if the carry (<code>CY</code>) is set.
If not, the next instruction moves 0 to the AL register. <code>RNI</code> (Run Next Instruction) ends the microcode execution,
causing the next machine instruction to run.
If the carry was set, all-ones (i.e. <code>FF</code> hex) is moved to the <code>AL</code> register and RNI ends the microcode sequence.</p>
<pre class="microcode">
JMPS CY 2 <span><b>SALC</b>: jump on carry</span>
ZERO → AL RNI <span>Move 0 to AL, run next instruction</span>
ONES → AL RNI <span><b>2:</b>Move FF to AL, run next instruction</span>
</pre>
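<p>In Python, the architectural effect of <code>SALC</code> (as opposed to the microcode itself) is simply:</p>

```python
def salc(cf):
    """Undocumented D6 (SALC): return the new AL value from the carry flag."""
    return 0xFF if cf else 0x00

print(hex(salc(True)))   # → 0xff
print(hex(salc(False)))  # → 0x0
```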
<h3><code>0F</code>: <code>POP CS</code></h3>
<p>The <code>0F</code> opcode is the first hole in the opcode table.
The 8086 has instructions to push and pop the four segment registers, except opcode <code>0F</code> is undefined where <code>POP CS</code> should be.
This opcode performs <code>POP CS</code> successfully, so the question is why it is left undefined.
The reason is that <code>POP CS</code> is essentially useless and doesn't do what you'd expect, so Intel presumably decided it was best
left undocumented.</p>
<p>To understand why <code>POP CS</code> is useless, I need to step back and explain the 8086's segment registers.
The 8086 has a 20-bit address space, but 16-bit registers.
To make this work, the 8086 has the concept of segments: memory is accessed in 64K chunks called segments, which are positioned
in the 1-megabyte address space.
Specifically, there are four segments: Code Segment, Stack Segment, Data Segment, and Extra Segment,
with four segment registers that define the start of the segment: <code>CS</code>, <code>SS</code>, <code>DS</code>, and <code>ES</code>.</p>
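<p>The segment calculation itself is simple: the 16-bit segment register is shifted left four bits and added to the 16-bit offset, yielding a 20-bit physical address. A quick sketch in Python (function name mine):</p>

```python
def physical_address(segment, offset):
    """8086 physical address: (segment << 4) + offset, truncated to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF

print(hex(physical_address(0x1234, 0x5678)))  # → 0x179b8
print(hex(physical_address(0xFFFF, 0x0010)))  # wraps around to 0x0
```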
<p>An inconvenient part of segment addressing is that if you want to access more than 64K, you need to change the segment register.
So you might push the data segment register, change it temporarily so you can access a new part of memory, and then pop the old data segment
register value off the stack.
This would use the <code>PUSH DS</code> and <code>POP DS</code> instructions.
But why not <code>POP CS</code>?</p>
<p>The 8086 executes code from the code segment, with the instruction pointer (<code>IP</code>) tracking the location in the code segment.
The main problem with <code>POP CS</code> is that it changes the code segment, but not the instruction pointer, so now you are executing
code at the old offset in a new segment.
Unless you line up your code extremely carefully, the result is that you're jumping to an unexpected place in memory.
(Normally, you want to change <code>CS</code> and the instruction pointer at the same time, using a <code>CALL</code> or <code>JMP</code> instruction.)</p>
<p>The second problem with <code>POP CS</code> is prefetching.
For efficiency, the 8086 prefetches instructions before they are needed, storing them in an 8-byte prefetch queue.
When you perform a jump, for instance, the microcode flushes the prefetch queue so execution will continue with the
new instructions, rather than the old instructions.
However, the instructions that pop a segment register don't flush the prefetch buffer.
Thus, <code>POP CS</code> not only jumps to an unexpected location in memory, but it will execute an unpredictable number of instructions
from the old code path.</p>
<p>The <code>POP segment register</code> microcode below packs a lot into three micro-instructions.
The first micro-instruction pops a value from the stack.
Specifically, it moves the stack pointer (<code>SP</code>) to the Indirect (<code>IND</code>) register.
The Indirect register is an internal register, invisible to the programmer, that holds the address offset for memory
accesses.
The first micro-instruction also performs a memory read (<code>R</code>) from the stack segment (<code>SS</code>) and then increments <code>IND</code>
by 2 (<code>P2</code>, plus 2).
The second micro-instruction moves <code>IND</code> to the stack pointer, updating the stack pointer with the new value.
It also tells the microcode engine that this micro-instruction is the next-to-last (<code>NXT</code>) and the next machine instruction
can be started.
The final micro-instruction moves the value read from memory to the appropriate segment register and runs the next instruction.
Specifically, reads and writes put data in the internal <code>OPR</code> (Operand) register.
The hardware uses the register <code>N</code> to indicate the register specified by the instruction.
That is, the value will be stored in the <code>CS</code>, <code>DS</code>, <code>ES</code>, or <code>SS</code> register, depending on the bit pattern in the instruction.
Thus, the same microcode works for all four segment registers.
This is why <code>POP CS</code> works even though <code>POP CS</code> wasn't explicitly implemented in the microcode; it uses the common code.</p>
<pre class="microcode">
SP → IND R SS,P2 <span><b>POP sr</b>: read from stack, compute IND plus 2</span>
IND → SP NXT <span>Put updated value in SP, start next instruction.</span>
OPR → N RNI <span>Put stack value in specified segment register</span>
</pre>
<p>But why does <code>POP CS</code> run this microcode in the first place?
The microcode to execute is selected based on the instruction, but multiple instructions can execute the same microcode.
You can think of the address decoder as pattern-matching on the instruction's bit patterns, where some of the bits can be ignored.
In this case, the <code>POP sr</code> microcode above is run by any instruction with the bit pattern 000??111, where a question mark
can be either a 0 or a 1.
You can verify that this pattern matches <code>POP ES</code> (<code>07</code>), <code>POP SS</code> (<code>17</code>), and <code>POP DS</code> (<code>1F</code>).
However, it also matches <code>0F</code>, which is why the <code>0F</code> opcode runs the above microcode and performs <code>POP CS</code>.
In other words, to make <code>0F</code> do something other than <code>POP CS</code> would require additional circuitry, so it was easier to
leave the action implemented but undocumented.</p>
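<p>The pattern match is easy to verify in Python (a sketch; the mask and value encode the 000??111 pattern):</p>

```python
def matches_pop_sr(opcode):
    """True if the opcode matches the microcode dispatch pattern 000??111."""
    return (opcode & 0b11100111) == 0b00000111

print([hex(op) for op in range(0x100) if matches_pop_sr(op)])
# the four matches: POP ES (07), POP CS (0F, undocumented), POP SS (17), POP DS (1F)
```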
<h3><code>60</code>-<code>6F</code>: conditional jumps</h3>
<p>One whole row of the opcode table is unused: values <code>60</code> to <code>6F</code>.
These opcodes simply act the same as <code>70</code> to <code>7F</code>, the conditional jump instructions.</p>
<p>The conditional jumps use the following microcode.
It fetches the jump offset from the instruction prefetch queue (<code>Q</code>) and puts the value into the ALU's <code>tmpBL</code> register,
the low byte of the <code>tmpB</code> register.
It tests the condition in the instruction (<code>XC</code>) and jumps to the <code>RELJMP</code> micro-subroutine if satisfied.
The <code>RELJMP</code> code (not shown) updates the program counter to perform the jump.</p>
<pre class="microcode">
Q → tmpBL <span><b>Jcond cb:</b> Get offset from prefetch queue</span>
JMP XC RELJMP <span>Test condition, if true jump to RELJMP routine</span>
RNI <span>No jump: run next instruction</span>
</pre>
<p>This code is executed for any instruction matching the bit pattern <code>011?????</code>, i.e. anything from <code>60</code> to <code>7F</code>.
The condition is specified by the four low bits of the instruction.
The result is that any instruction <code>60</code>-<code>6F</code> is an alias for the corresponding conditional jump <code>70</code>-<code>7F</code>.</p>
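<p>The aliasing can be expressed directly (a Python sketch; the function name is mine): any opcode matching <code>011?????</code> takes its condition from the low four bits, so each <code>6x</code> opcode behaves like the corresponding <code>7x</code> conditional jump.</p>

```python
def documented_alias(opcode):
    """Map an opcode in 0x60-0x7F to its documented conditional-jump form."""
    assert 0x60 <= opcode <= 0x7F
    return 0x70 | (opcode & 0x0F)   # only the low four bits (the condition) matter

print(hex(documented_alias(0x65)))  # → 0x75, i.e. JNZ/JNE
```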
<h3><code>C0</code>, <code>C8</code>: <code>RET/RETF imm</code></h3>
<p>These undocumented opcodes act like a return instruction, specifically <code>RET imm16</code> (<a href="https://www.os2museum.com/wp/undocumented-8086-opcodes-part-i/">source</a>).
Specifically, the instruction <code>C0</code> is the same as <code>C2</code>, near return, while <code>C8</code> is the same as <code>CA</code>, far return.</p>
<p>The microcode below is executed for the instruction bits <code>1100?0?0</code>, so it is executed for <code>C0</code>, <code>C2</code>, <code>C8</code>, and <code>CA</code>.
It gets two bytes from the instruction prefetch queue (<code>Q</code>) and puts them in the <code>tmpA</code> register.
Next, it calls <code>FARRET</code>, which performs either a near return (popping <code>PC</code> from the stack) or a far return (popping <code>PC</code> and <code>CS</code>
from the stack). Finally, it adds the original argument to the <code>SP</code>, equivalent to popping that many bytes.</p>
<pre class="microcode">
Q → tmpAL ADD tmpA <span><b>RET/RETF iw:</b> Get word from prefetch, set up ADD</span>
Q → tmpAH CALL FARRET <span>Call Far Return micro-subroutine</span>
IND → tmpB <span>Move SP (in IND) to tmpB for ADD</span>
Σ → SP RNI <span>Put sum in Stack Pointer, end</span>
</pre>
<p>One tricky part is that the <code>FARRET</code> micro-subroutine examines bit 3 of the instruction to determine whether it does a near
return or a far return.
This is why documented instruction <code>C2</code> is a near return and <code>CA</code> is a far return.
Since <code>C0</code> and <code>C8</code> run the same microcode, they will perform the same actions, a near return and a far return respectively.</p>
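<p>The bit-3 selection is trivial to model (a Python sketch):</p>

```python
def is_far_return(opcode):
    """FARRET examines bit 3 of the opcode: set means far return."""
    return bool(opcode & 0b1000)

for op in (0xC0, 0xC2, 0xC8, 0xCA):
    print(hex(op), "far" if is_far_return(op) else "near")
```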
<h3><code>C1</code>: <code>RET</code></h3>
<p>The undocumented <code>C1</code> opcode is identical to the documented <code>C3</code>, near return instruction.
The microcode below is executed for instruction bits <code>110000?1</code>, i.e. <code>C1</code> and <code>C3</code>.
The first micro-instruction reads from the Stack Pointer, incrementing <code>IND</code> by 2.
Prefetching is suspended and the prefetch queue is flushed, since execution will continue at a new location.
The Program Counter is updated with the value from the stack, read into the <code>OPR</code> register.
Finally, the updated address is put in the Stack Pointer and execution ends.</p>
<pre class="microcode">
SP → IND R SS,P2 <span><b>RET: </b> Read from stack, increment by 2</span>
SUSP <span>Suspend prefetching</span>
OPR → PC FLUSH <span>Update PC from stack, flush prefetch queue</span>
IND → SP RNI <span>Update SP, run next instruction</span>
</pre>
<h3><code>C9</code>: <code>RET</code></h3>
<p>The undocumented <code>C9</code> opcode is identical to the documented <code>CB</code>, far return instruction.
This microcode is executed for instruction bits <code>110010?1</code>, i.e. <code>C9</code> and <code>CB</code>, so <code>C9</code> is identical to <code>CB</code>.
The microcode below simply calls the <code>FARRET</code> micro-subroutine to pop the Program Counter and CS register.
Then the new value is stored into the Stack Pointer.
One subtlety is that <code>FARRET</code> looks at bit 3 of the instruction to switch between a near return and a far return, as
described earlier.
Since <code>C9</code> and <code>CB</code> both have bit 3 set, they both perform a far return.</p>
<pre class="microcode">
CALL FARRET <span><b>RETF:</b> call FARRET routine</span>
IND → SP RNI <span>Update stack pointer, run next instruction</span>
</pre>
<h3><code>F1</code>: <code>LOCK</code> prefix</h3>
<p>The final hole in the opcode table is <code>F1</code>.
This opcode is different because it is implemented in logic rather than microcode.
The Group Decode ROM indicates that <code>F1</code> is a prefix, one-byte logic, and LOCK.
The Group Decode outputs are the same as <code>F0</code>, so <code>F1</code> also acts as a <code>LOCK</code> prefix.</p>
<h2>Holes in two-byte opcodes</h2>
<p>For most of the 8086 instructions, the first byte specifies the instruction.
However, the 8086 has a few instructions where the second byte specifies the instruction: the <code>reg</code> field of the ModR/M byte provides an opcode extension that selects the instruction.<span id="fnref:extension"><a class="ref" href="#fn:extension">5</a></span>
These fall into four categories which Intel labeled "Immed", "Shift", "Group 1", and "Group 2", corresponding to opcodes <code>80</code>-<code>83</code>, <code>D0</code>-<code>D3</code>,
<code>F6</code>-<code>F7</code>, and <code>FE</code>-<code>FF</code>.
The table below shows how the second byte selects the instruction.
Note that "Shift", "Group 1", and "Group 2" all have gaps, resulting in undocumented values.</p>
<p><a href="https://static.righto.com/images/8086-ad-undoc/groups.jpg"><img alt="Meaning of the reg field in two-byte opcodes. From MCS-86 Assembly Language Reference Guide." class="hilite" height="129" src="https://static.righto.com/images/8086-ad-undoc/groups-w600.jpg" title="Meaning of the reg field in two-byte opcodes. From MCS-86 Assembly Language Reference Guide." width="600" /></a><div class="cite">Meaning of the reg field in two-byte opcodes. From <a href="http://bitsavers.org/components/intel/8086/9800749-1_MCS-86_Assembly_Language_Reference_Guide_Oct78.pdf">MCS-86 Assembly Language Reference Guide</a>.</div></p>
<p>These sets of instructions are implemented in two completely different ways.
The "Immed" and "Shift" instructions run microcode in the standard way, selected by the first byte.
For a typical arithmetic/logic instruction such as <code>ADD</code>, bits 5-3 of the first instruction byte are latched into the <code>X</code> register to indicate
which ALU operation to perform.
The microcode specifies a generic ALU operation, while the <code>X</code> register controls whether the operation is an <code>ADD</code>, <code>SUB</code>, <code>XOR</code>, or
so forth.
However, the Group Decode ROM indicates that for the special "Immed" and "Shift" instructions, the <code>X</code> register latches the bits
from the <em>second</em> byte.
Thus, when the microcode executes a generic ALU operation, it ends up with the one specified in the second byte.<span id="fnref:alu"><a class="ref" href="#fn:alu">6</a></span></p>
<p>The "Group 1" and "Group 2" instructions (<code>F6</code>-<code>F7</code>, <code>FE</code>-<code>FF</code>), however, run different microcode for each instruction.
Bits 5-3 of the second byte replace bits 2-0 of the instruction before executing the microcode.
Thus, <code>F6</code> and <code>F7</code> act as if they are opcodes in the range <code>F0</code>-<code>F7</code>, while <code>FE</code> and <code>FF</code> act as if they are opcodes in the range <code>F8</code>-<code>FF</code>.
Thus, each instruction specified by the second byte can have its own microcode, unlike the "Immed" and "Shift" instructions.
The trick that makes this work is that all the "real" opcodes in the range <code>F0</code>-<code>FF</code> are implemented in logic, not microcode,
so there are no collisions.</p>
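<p>The replacement of the low opcode bits by the ModR/M <code>reg</code> field can be sketched in Python (function name mine; the <code>reg</code> field is bits 5-3 of the second byte):</p>

```python
def effective_opcode(opcode, modrm):
    """Bits 5-3 of the ModR/M byte replace bits 2-0 of the opcode."""
    reg = (modrm >> 3) & 0b111
    return (opcode & 0xF8) | reg

print(hex(effective_opcode(0xF6, 0b00_001_000)))  # F6/1 dispatches as 0xf1
print(hex(effective_opcode(0xFE, 0b00_111_000)))  # FE/7 dispatches as 0xff
```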
<h3>The hole in "Shift": <code>SETMO</code>, <code>D0</code>..<code>D3/6</code></h3>
<p>There is a "hole" in the list of shift operations when the second byte has the bits <code>110</code> (6).
(This is typically expressed as <code>D0/6</code> and so forth; the value after the slash is the opcode-selection bits in the ModR/M byte.)
Internally, this value selects the ALU's <code>SETMO</code> (Set Minus One) operation, which simply returns <code>FF</code> or <code>FFFF</code>, for a byte or word operation respectively.<span id="fnref:setmo"><a class="ref" href="#fn:setmo">7</a></span></p>
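<p>As a sketch, the operation itself is trivial (Python; the name <code>SETMO</code> follows the text above):</p>

```python
def setmo(word_op):
    """SETMO (Set Minus One): all-ones for the operand size, ignoring inputs."""
    return 0xFFFF if word_op else 0xFF

print(hex(setmo(False)))  # byte operation → 0xff
print(hex(setmo(True)))   # word operation → 0xffff
```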
<p>The microcode below is executed for the bit pattern <code>1101000?</code> (<code>D0</code> and <code>D1</code>).
The first instruction gets the value from the <code>M</code> register and sets up the ALU to do whatever operation was
specified in the instruction (indicated by <code>XI</code>).
Thus, the same microcode is used for all the "Shift" instructions, including <code>SETMO</code>.
The result is written back to <code>M</code>. If no writeback to memory is required (<code>NWB</code>), then <code>RNI</code> runs the next instruction, ending
the microcode sequence.
However, if the result is going to memory, then the last line writes the value to memory.</p>
<pre class="microcode">
M → tmpB XI tmpB, NXT <span><b>rot rm, 1</b>: get argument, set up ALU</span>
Σ → M NWB,RNI F <span>Store result, maybe run next instruction</span>
W DS,P0 RNI <span>Write result to memory</span>
</pre>
<p>The D2 and D3 instructions (1101001?) perform a variable number of shifts, specified by the <code>CL</code> register, so they use different microcode (below).
This microcode loops the number of times specified by <code>CL</code>, but the control flow is a bit tricky to avoid shifting if
the initial counter value is 0.
The code sets up the ALU to pass the counter (in <code>tmpA</code>) unmodified the first time (<code>PASS</code>) and jumps to <b>4</b>, which
updates the counter and sets up the ALU for the shift operation (<code>XI</code>).
If the counter is not zero, it jumps back to <b>3</b>, which performs the previously-specified shift and sets up
the ALU to decrement the counter (<code>DEC</code>).
This time, the code at <b>4</b> decrements the counter.
The loop continues until the counter reaches zero. The microcode stores the result as in the previous microcode.</p>
<pre class="microcode">
ZERO → tmpA <span><b>rot rm,CL</b>: 0 to tmpA</span>
CX → tmpAL PASS tmpA <span>Get count to tmpAL, set up ALU to pass through</span>
M → tmpB JMPS 4 <span>Get value, jump to loop (4)</span>
Σ → tmpB DEC tmpA F <span><b>3</b>: Update result, set up decrement of count</span>
Σ → tmpA XI tmpB <span><b>4</b>: update count in tmpA, set up ALU</span>
JMPS NZ 3 <span>Loop if count not zero</span>
tmpB → M NWB,RNI <span>Store result, maybe run next instruction</span>
W DS,P0 RNI <span>Write result to memory</span>
</pre>
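<p>The net effect of this loop can be modeled in Python (a sketch only; the real microcode interleaves the decrement and the shift through the ALU as described above):</p>

```python
def shift_cl(value, count, step):
    """Model of the D2/D3 loop: the count passes through the ALU unmodified
    on the first iteration, so a count of zero performs no shift at all."""
    tmpA, tmpB = count & 0xFF, value
    while tmpA != 0:     # JMPS NZ 3: keep looping while the counter is nonzero
        tmpB = step(tmpB)  # one shift/rotate step, supplied as a function
        tmpA -= 1          # DEC tmpA
    return tmpB

shl8 = lambda v: (v << 1) & 0xFF      # one 8-bit left-shift step
print(shift_cl(1, 3, shl8))           # shift 1 left three times → 8
```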
<h3>The hole in "group 1": <code>TEST</code>, <code>F6/1</code> and <code>F7/1</code></h3>
<p>The <code>F6</code> and <code>F7</code> opcodes are in "group 1", with the specific instruction specified by bits 5-3 of the second byte.
The second-byte table showed a hole for the <code>001</code> bit sequence.
As explained earlier, these bits replace the low-order bits of the instruction, so <code>F6</code> with 001 is processed as if it were
the opcode <code>F1</code>.
The microcode below matches against instruction bits <code>1111000?</code>, so <code>F6/1</code> and <code>F7/1</code> have the same effect as <code>F6/0</code> and <code>F7/0</code> respectively,
that is, the byte and word <code>TEST</code> instructions.</p>
<p>The microcode below gets one or two bytes from the prefetch queue (<code>Q</code>); the <code>L8</code> condition tests if the operation is
an 8-bit (i.e. byte) operation and, if so, skips the second micro-instruction.
The third micro-instruction ANDs the argument and the fetched value.
The condition flags (<code>F</code>) are set based on the result, but the result itself is discarded.
Thus, the <code>TEST</code> instruction tests a value against a mask, seeing if any bits are set.</p>
<pre class="microcode">
Q → tmpBL JMPS L8 2 <span><b>TEST rm,i:</b> Get byte, jump if operation length = 8</span>
Q → tmpBH <span>Get second byte from the prefetch queue</span>
M → tmpA AND tmpA, NXT <span><b>2:</b> Get argument, AND with fetched value</span>
Σ → no dest RNI F <span>Discard result but set flags.</span>
</pre>
<p>I explained the processing of these "Group 1" instructions in more detail in my <a href="https://www.righto.com/2022/11/how-8086-processors-microcode-engine.html">microcode article</a>.</p>
<h3>The hole in "group 2": <code>PUSH</code>, <code>FE/7</code> and <code>FF/7</code></h3>
<p>The <code>FE</code> and <code>FF</code> opcodes are in "group 2", which has a hole for the <code>111</code> bit sequence in the second byte.
After replacement, this will be processed as the <code>FF</code> opcode, which matches the pattern <code>1111111?</code>.
In other words, the instruction will be processed the same as the <code>110</code> bit pattern, which is <code>PUSH</code>.
The microcode gets the Stack Pointer and sets up the ALU to decrement it by 2.
The decremented value is written to <code>SP</code> and <code>IND</code>. Finally, the register value is written to stack memory.</p>
<!--
For some reason, the [8086 undocumented instructions](https://en.wikipedia.org/wiki/X86_instruction_listings#Undocumented_x86_instructions) page on Wikipedia doesn't list FE-FF.
-->
<pre class="microcode">
SP → tmpA DEC2 tmpA <span><b>PUSH rm</b>: set up decrement SP by 2</span>
Σ → IND <span>Decremented SP to IND</span>
Σ → SP <span>Decremented SP to SP</span>
M → OPR W SS,P0 RNI <span>Write the value to memory, done</span>
</pre>
<h3><code>82</code> and <code>83</code> "Immed" group</h3>
<p>Opcodes <code>80</code>-<code>83</code> are the "Immed" group, performing one of eight arithmetic operations, specified in the ModR/M byte.
The four opcodes differ in the size of the values: opcode <code>80</code> applies an 8-bit immediate value to an 8-bit register, <code>81</code> applies a 16-bit
value to a 16-bit register, <code>82</code> applies an 8-bit value to an 8-bit register, and <code>83</code> applies an 8-bit value to a 16-bit register.
Opcode <code>82</code> is in a strange position: <a href="https://en.wikipedia.org/wiki/X86_instruction_listings#Undocumented_instructions_that_are_widely_available_across_many_x86_CPUs_include">some sources</a> say it is undocumented, yet it shows up in some Intel documentation as a valid bit combination (e.g. below).
Note that <code>80</code> and <code>82</code> both perform the 8-bit to 8-bit action, so the <code>82</code> opcode is redundant.</p>
<p><a href="https://static.righto.com/images/8086-ad-undoc/adc.png"><img alt="ADC is one of the instructions with opcode 80-83. From the 8086 datasheet, page 27." class="hilite" height="34" src="https://static.righto.com/images/8086-ad-undoc/adc-w600.png" title="ADC is one of the instructions with opcode 80-83. From the 8086 datasheet, page 27." width="600" /></a><div class="cite">ADC is one of the instructions with opcode 80-83. From the <a href="https://www.electro-tech-online.com/datasheets/8086_intel.pdf">8086 datasheet</a>, page 27.</div></p>
<p>The microcode below is used for all four opcodes.
If the ModR/M byte specifies memory, the appropriate micro-subroutine is called to compute the effective address in <code>IND</code>,
and fetch the byte or word into <code>OPR</code>.
The first two instructions below get the two immediate data bytes from the prefetch queue; for an 8-bit operation, the second byte
is skipped.
Next, the second argument <code>M</code> is loaded into tmpA and the desired ALU operation (<code>XI</code>) is configured.
The result <code>Σ</code> is stored into the specified register <code>M</code> and the operation may terminate with <code>RNI</code>.
But if the ModR/M byte specified memory, the following write micro-operation saves the value to memory.</p>
<pre class="microcode">
Q → tmpBL JMPS L8 2 <span><b>alu rm,i</b>: get byte, test if 8-bit op</span>
Q → tmpBH <span>Maybe get second byte</span>
M → tmpA XI tmpA, NXT <span><b>2</b>: Get argument, set up ALU operation</span>
Σ → M NWB,RNI F <span>Save result, update flags, done if no memory writeback</span>
W DS,P0 RNI <span>Write result to memory if needed</span>
</pre>
<p>The tricky part of this is the <code>L8</code> condition, which tests if the operation is 8-bit.
You might expect bit 0 of the instruction to select between a byte and a word operation in a nice, orthogonal way, but the 8086 has a bunch of special cases.
The Group Decode ROM creates a signal indicating if bit 0 should be used as the byte/word bit.
But it generates a second signal indicating that an instruction should be forced to operate on bytes, for instructions
such as <code>DAA</code> and <code>XLAT</code>.
Another Group Decode ROM signal indicates that bit 3 of the instruction should select byte or word; this
is used for the <code>MOV</code> instructions with opcodes Bx.
Yet another Group Decode ROM signal indicates that inverted bit 1 of the instruction should select byte or word;
this is used for a few opcodes, including <code>80</code>-<code>87</code>.</p>
<p>The important thing here is that for the opcodes under discussion (<code>80</code>-<code>83</code>), the <code>L8</code> micro-condition uses <em>both</em> bits 0 and 1
to determine if the instruction is 8 bits or not.
The result is that only opcode <code>81</code> is considered 16-bit by the <code>L8</code> test, so it is the only one that uses two immediate bytes
from the instruction.
However, the register operations use only bit 0 to select a byte or word transfer.
The result is that opcode <code>83</code> has the unusual behavior of using an 8-bit immediate operand with a 16-bit register.
In this case, the 8-bit value is sign-extended to form a 16-bit value. That is, the top bit of the 8-bit value fills
the entire upper half of the 16-bit value,
converting an 8-bit signed value to a 16-bit signed value (e.g. -1 is <code>FF</code>, which becomes <code>FFFF</code>).
This makes sense for arithmetic operations, but not much sense for logical operations.</p>
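<p>The sign extension performed by opcode <code>83</code> can be sketched as (Python; function name mine):</p>

```python
def sign_extend8(value):
    """Extend an 8-bit immediate to 16 bits by replicating the top bit."""
    return (value | 0xFF00) if (value & 0x80) else value

print(hex(sign_extend8(0xFF)))  # → 0xffff (-1 as a byte stays -1 as a word)
print(hex(sign_extend8(0x7F)))  # → 0x7f  (positive values are unchanged)
```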
<p>Intel documentation is inconsistent about which opcodes are listed for which instructions.
Intel opcode maps generally define opcodes <code>80</code>-<code>83</code>.
However, lists of specific instructions show opcodes <code>80</code>, <code>81</code>, and <code>83</code> for arithmetic operations but only <code>80</code> and <code>81</code> for logical operations.<span id="fnref:immed"><a class="ref" href="#fn:immed">8</a></span>
That is, Intel omits the redundant <code>82</code> opcode as well as omitting logic operations that perform sign-extension (<code>83</code>).</p>
<h3>More <code>FE</code> holes</h3>
<p>For the "group 2" instructions, the <code>FE</code> opcode performs a byte operation while <code>FF</code> performs a word operation.
Many of these operations don't make sense for bytes: <code>CALL</code>, <code>JMP</code>, and <code>PUSH</code>.
(The only instructions supported for <code>FE</code> are <code>INC</code> and <code>DEC</code>.) But what happens if you use the unsupported instructions?
The remainder of this section examines those cases and shows that the results are not useful.</p>
<h4><code>CALL</code>: <code>FE/2</code></h4>
<p>This instruction performs an indirect subroutine call within a segment, reading the target address from the memory location specified by the ModR/M byte.</p>
<p>The microcode below is a bit convoluted because the code falls through into the shared <code>NEARCALL</code> routine, so there is
some unnecessary register movement.
Before this microcode executes, the appropriate ModR/M micro-subroutine will read the target address from memory.
The code below copies the destination address from <code>M</code> to <code>tmpB</code> and stores it into the PC later in the code
to transfer execution.
The code suspends prefetching, corrects the PC to cancel the offset from prefetching, and flushes the prefetch queue.
Finally, it decrements the SP by two and writes the old PC to the stack.</p>
<pre class="microcode">
M → tmpB SUSP <span><b>CALL rm</b>: read value, suspend prefetch</span>
SP → IND CORR <span>Get SP, correct PC</span>
PC → OPR DEC2 tmpC <span>Get PC to write, set up decrement</span>
tmpB → PC FLUSH <span><b>NEARCALL</b>: Update PC, flush prefetch</span>
IND → tmpC <span>Get SP to decrement</span>
Σ → IND <span>Decremented SP to IND</span>
Σ → SP W SS,P0 RNI <span>Update SP, write old PC to stack</span>
</pre>
<p>This code will mess up in two ways when executed as a byte instruction.
First, when the destination address is read from memory, only a byte will be read, so the destination address will be corrupted.
(I think that the behavior here depends on the bus hardware. The 8086 will ask for a byte from memory but will
read the word that is placed on the bus.
Thus, if memory returns a word, this part may operate correctly.
The 8088's behavior will be different because of its 8-bit bus.)
The second issue is writing the old PC to the stack because only a byte of the PC will be written.
Thus, when the code returns from the subroutine call, the return address will be corrupt.</p>
<h4><code>CALL</code>: <code>FE/3</code></h4>
<p>This instruction performs an indirect subroutine call between segments, reading the target address from the memory location specified by the ModR/M byte.</p>
<pre class="microcode">
IND → tmpC INC2 tmpC <span><b>CALL FAR rm</b>: set up IND+2</span>
Σ → IND R DS,P0 <span>Read new CS, update IND</span>
OPR → tmpA DEC2 tmpC <span>New CS to tmpA, set up SP-2</span>
SP → tmpC SUSP <span><b>FARCALL</b>: Suspend prefetch</span>
Σ → IND CORR <span><b>FARCALL2</b>: Update IND, correct PC</span>
CS → OPR W SS,M2 <span>Push old CS, decrement IND by 2</span>
tmpA → CS PASS tmpC <span>Update CS, set up for NEARCALL</span>
PC → OPR JMP NEARCALL <span>Continue with NEARCALL</span>
</pre>
<p>As in the previous <code>CALL</code>, this microcode will fail in multiple ways when executed in byte mode.
The new CS and PC addresses will be read from memory as bytes, which may or may not work.
Only a byte of the old CS and PC will be pushed to the stack.</p>
<h4><code>JMP</code>: <code>FE/4</code></h4>
<p>This instruction performs an indirect jump within a segment, reading the target address from the memory location specified by the ModR/M byte.
The microcode is short, since the ModR/M micro-subroutine does most of the work.
I believe this will have the same problem as the previous <code>CALL</code> instructions, that it will attempt to read a byte from
memory instead of a word.</p>
<pre class="microcode">
SUSP <span><b>JMP rm</b>: Suspend prefetch</span>
M → PC FLUSH RNI <span>Update PC with new address, flush prefetch, done</span>
</pre>
<h4><code>JMP</code>: <code>FE/5</code></h4>
<p>This instruction performs an indirect jump between segments, reading the new PC and CS values from the memory location specified by the ModR/M byte.
The ModR/M micro-subroutine reads the new PC address. This microcode increments <code>IND</code> and suspends prefetching.
It updates the PC, reads the new CS value from memory, and updates the CS.
As before, the reads from memory will read bytes instead of words, so this code will not meaningfully work in byte mode.</p>
<pre class="microcode">
IND → tmpC INC2 tmpC <span><b>JMP FAR rm</b>: set up IND+2</span>
Σ → IND SUSP <span>Update IND, suspend prefetch</span>
tmpB → PC R DS,P0 <span>Update PC, read new CS from memory</span>
OPR → CS FLUSH RNI <span>Update CS, flush prefetch, done</span>
</pre>
<h4><code>PUSH</code>: <code>FE/6</code></h4>
<p>This instruction pushes the register or memory value specified by the ModR/M byte.
It decrements the SP by 2 and then writes the value to the stack.
It writes only one byte to the stack but decrements the SP by 2,
so a byte of stale stack data ends up on the stack next to the data byte.</p>
<pre class="microcode">
SP → tmpA DEC2 tmpA <span><b>PUSH rm</b>: Set up SP decrement </span>
Σ → IND <span>Decremented value to IND</span>
Σ → SP <span>Decremented value to SP</span>
M → OPR W SS,P0 RNI <span>Write the data to the stack</span>
</pre>
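<p>The mismatch between the 2-byte stack adjustment and the 1-byte write can be sketched with a toy Python model (the helper name is mine, and I'm assuming the single byte lands at the new SP address):</p>

```python
def push_byte_fe6(mem, sp, value):
    # Byte form of PUSH rm: SP still drops by 2 (DEC2 in the microcode),
    # but the bus write is only one byte wide.
    sp = (sp - 2) & 0xFFFF
    mem[sp] = value & 0xFF   # one byte written to the stack
    # mem[sp + 1] is untouched: stale data remains there
    return sp

mem = {0x0FFE: 0xAA, 0x0FFF: 0xBB}   # old stack contents
sp = push_byte_fe6(mem, 0x1000, 0x42)
assert sp == 0x0FFE
assert mem[0x0FFE] == 0x42           # our byte
assert mem[0x0FFF] == 0xBB           # stale byte still on the stack
```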
<h2>Undocumented instruction values</h2>
<p>The next category of undocumented instructions is where the first byte indicates a valid instruction, but
there is something wrong with the second byte.</p>
<h3><code>AAM</code>: ASCII Adjust after Multiply</h3>
<p>The <code>AAM</code> instruction is a fairly obscure one, designed to support binary-coded decimal
arithmetic (BCD).
After multiplying two BCD digits, you end up with a binary value between 0 and <code>81</code> (0×0 to 9×9).
If you want a BCD result, the <code>AAM</code> instruction converts this binary value to BCD, for instance splitting <code>81</code> into the
decimal digits 8 and 1, where the upper digit is <code>81</code> divided by <code>10</code>, and the lower digit is <code>81</code> modulo <code>10</code>.</p>
<p>The interesting thing about <code>AAM</code> is that the 2-byte instruction is <code>D4</code> <code>0A</code>. You might notice that hex <code>0A</code> is <code>10</code>, and this
is not a coincidence.
There wasn't an easy way to get the value <code>10</code> in the microcode, so instead they made the instruction
provide that value in the second byte.
The undocumented (but well-known) part is that if you provide a value other than <code>10</code>, the instruction will convert the binary input into
digits in that base. For example, if you provide 8 as the second byte, the instruction returns the value divided by 8
and the value modulo 8.</p>
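<p>The effect of an arbitrary second byte can be modeled in a few lines of Python (the <code>aam</code> helper is illustrative, not Intel's circuitry; note that a zero divisor would cause a divide error on real hardware):</p>

```python
def aam(al, base=10):
    # AAM takes its divisor from the instruction's second byte.
    # base=10 is the documented form; any other byte works the same way.
    ah, al = divmod(al, base)
    return ah & 0xFF, al & 0xFF

assert aam(81) == (8, 1)       # documented: binary 81 -> BCD digits 8, 1
assert aam(81, 8) == (10, 1)   # undocumented base 8: 81 = 10*8 + 1
```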
<p>The microcode for <code>AAM</code>, below, sets up the registers, calls
the <code>CORD</code> (Core Division) micro-subroutine to perform the division,
and then puts the results into <code>AH</code> and <code>AL</code>.
In more detail, the <code>CORD</code> routine divides <code>tmpA/tmpC</code> by <code>tmpB</code>, putting the <em>complement</em> of the quotient in <code>tmpC</code> and leaving the remainder in <code>tmpA</code>.
(If you want to know how CORD works internally, see my <a href="https://www.righto.com/2023/04/reverse-engineering-8086-divide-microcode.html">division post</a>.)
The important step is that the <code>AAM</code> microcode gets the divisor from the prefetch queue (<code>Q</code>).
After calling <code>CORD</code>, it sets up the ALU to perform a 1's complement of <code>tmpC</code> and puts the result (<code>Σ</code>) into <code>AH</code>.
It sets up the ALU to pass <code>tmpA</code> through unchanged, puts the result (<code>Σ</code>) into <code>AL</code>, and updates the flags accordingly (<code>F</code>).</p>
<pre class="microcode">
Q → tmpB <span><b>AAM:</b> Move byte from prefetch to tmpB</span>
ZERO → tmpA <span>Move 0 to tmpA</span>
AL → tmpC CALL CORD <span>Move AL to tmpC, call CORD.</span>
COM1 tmpC <span>Set ALU to complement</span>
Σ → AH PASS tmpA, NXT <span>Complement AL to AH</span>
Σ → AL RNI F <span>Pass tmpA through ALU to set flags</span>
</pre>
<p>The interesting thing is why this code has undocumented behavior.
The 8086's microcode only has support for the constants 0 and all-1's (<code>FF</code> or <code>FFFF</code>), but the microcode needs to divide by <code>10</code>.
One solution would be to implement an additional micro-instruction and more circuitry to provide the constant <code>10</code>, but every
transistor was precious back then.
Instead, the designers took the approach of simply putting the number <code>10</code> as the second byte of the instruction and loading the
constant from there.
Since the <code>AAM</code> instruction is not used very much, making the instruction two bytes long wasn't much of a drawback.
But if you put a different number in the second byte, that's the divisor the microcode will use.
(Of course you could add circuitry to verify that the number is <code>10</code>, but then the implementation is no longer simple.)</p>
<p>Intel could have documented the full behavior, but that creates several problems.
First, Intel would be stuck supporting the full behavior into the future.
Second, there are corner cases to deal with, such as divide-by-zero.
Third, testing the chip would become harder because all these cases would need to be tested.
Fourth, the documentation would become long and confusing.
It's not surprising that Intel left the full behavior undocumented.</p>
<h3><code>AAD</code>: ASCII Adjust before Division</h3>
<p>The <code>AAD</code> instruction is analogous to <code>AAM</code> but used for BCD division.
In this case, you want to divide a two-digit BCD number by something, where the BCD digits are in <code>AH</code> and <code>AL</code>.
The <code>AAD</code> instruction converts the two-digit BCD number to binary by computing <code>AH</code>×<code>10</code>+<code>AL</code>, before you perform
the division.</p>
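<p>As with <code>AAM</code>, the conversion (and its undocumented generalization to other bases) is easy to express in Python; this <code>aad</code> helper is my own sketch, not the microcode itself:</p>

```python
def aad(ah, al, base=10):
    # AAD computes AL = AH*base + AL (mod 256) and clears AH.
    # base comes from the instruction's second byte; 10 is the documented value.
    return 0, (ah * base + al) & 0xFF

assert aad(8, 1) == (0, 81)      # BCD digits 8,1 -> binary 81
assert aad(2, 3, 16) == (0, 35)  # undocumented base 16: 2*16 + 3
```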
<p>The microcode for <code>AAD</code> is shown below. The microcode sets up the registers, calls the multiplication micro-subroutine
<code>CORX</code> (Core Times), and
then puts the results in <code>AH</code> and <code>AL</code>.
In more detail, the multiplier comes from the instruction prefetch queue <code>Q</code>.
The <code>CORX</code> routine multiplies <code>tmpC</code> by <code>tmpB</code>, putting the result in <code>tmpA/tmpC</code>.
Then the microcode adds the low BCD digit (<code>AL</code>) to the product (<code>tmpB + tmpC</code>), putting the sum (<code>Σ</code>) into <code>AL</code>,
clearing <code>AH</code> and setting the status flags <code>F</code> appropriately.</p>
<p>One interesting thing is that the second-last micro-instruction jumps to <code>AAEND</code>, which is the last
micro-instruction of the <code>AAM</code> microcode above.
By reusing the micro-instruction from <code>AAM</code>, the microcode is one micro-instruction shorter, but
the jump adds one cycle to the execution time.
(The CORX routine is used for integer multiplication; I discuss the internals in <a href="https://www.righto.com/2023/03/8086-multiplication-microcode.html">this post</a>.)</p>
<pre class="microcode">
Q → tmpC <span><b>AAD:</b> Get byte from prefetch queue.</span>
AH → tmpB CALL CORX <span>Call CORX</span>
AL → tmpB ADD tmpC <span>Set ALU for ADD</span>
ZERO → AH JMP AAEND <span>Zero AH, jump to AAEND</span>
...
Σ → AL RNI F <span><b>AAEND:</b> Sum to AL, done.</span>
</pre>
<p>As with <code>AAM</code>, the constant <code>10</code> is provided in the second byte of the instruction.
The microcode accepts any value here, but values other than <code>10</code> are undocumented.</p>
<h3><code>8C</code>, <code>8E</code>: MOV sr</h3>
<p>The opcodes <code>8C</code> and <code>8E</code> perform a <code>MOV</code> register to or from the specified segment register, using the register specification
field in the ModR/M byte.
There are four segment registers and three selection bits, so an invalid segment register can be specified.
However, the hardware that decodes the register number ignores instruction bit 5 for a segment register. Thus,
specifying a segment register 4 to 7 is the same as specifying a segment register 0 to 3.
For more details, see my article on <a href="https://www.righto.com/2023/03/8086-register-codes.html">8086 register codes</a>.</p>
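<p>In other words, the hardware effectively masks off the high bit of the 3-bit field. A one-line Python sketch of the decoding (function name mine, using the standard 8086 segment-register encoding ES=0, CS=1, SS=2, DS=3):</p>

```python
def decode_segment_reg(reg_bits):
    # Instruction bit 5 (the high bit of the 3-bit reg field) is ignored
    # for segment registers, so values 4-7 alias onto 0-3.
    return ["ES", "CS", "SS", "DS"][reg_bits & 0b011]

assert decode_segment_reg(0b110) == decode_segment_reg(0b010) == "SS"
```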
<h2>Unexpected <code>REP</code> prefix</h2>
<h3><code>REP IMUL</code> / <code>IDIV</code></h3>
<p>The <code>REP</code> prefix is used with string operations to cause the operation to be repeated across a block of memory.
However, if you use this prefix with an <code>IMUL</code> or <code>IDIV</code> instruction, it has the unexpected behavior
of negating the product or the quotient (<a href="https://www.reenigne.org/blog/8086-microcode-disassembled/">source</a>).</p>
<p>The reason for this behavior is that the string operations use an internal flag called <code>F1</code> to indicate that a <code>REP</code>
prefix has been applied.
The multiply and divide code reuses this flag to track the sign of the input values, toggling <code>F1</code> for each negative value.
If <code>F1</code> is set, the value at the end is negated. (This handles "two negatives make a positive.")
The consequence is that the <code>REP</code> prefix puts the flag in the 1 state when the multiply/divide starts, so the computed sign
will be wrong at the end and the result is the negative of the expected result.
The microcode is fairly complex, so I won't show it here; I explain it in detail in <a href="https://www.righto.com/2023/03/8086-multiplication-microcode.html">this blog post</a>.</p>
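<p>The sign-handling logic can be summarized with a small Python model (a sketch of the <code>F1</code> flag's role only, not the full multiply microcode):</p>

```python
def imul_sign(a, b, rep_prefix=False):
    # F1 starts set if a REP prefix preceded the instruction;
    # each negative operand toggles it; a set F1 negates the result.
    f1 = rep_prefix
    if a < 0:
        f1 = not f1
    if b < 0:
        f1 = not f1
    result = abs(a) * abs(b)
    return -result if f1 else result

assert imul_sign(3, -4) == -12                   # normal signed multiply
assert imul_sign(3, -4, rep_prefix=True) == 12   # REP negates the product
```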
<h3><code>REP RET</code></h3>
<p><a href="https://en.wikipedia.org/wiki/X86_instruction_listings#Undocumented_x86_instructions">Wikipedia</a> lists
<code>REP RET</code> (i.e. <code>RET</code> with a <code>REP</code> prefix) as a way to implement a two-byte return instruction.
This is kind of trivial; the <code>RET</code> microcode (like almost every instruction) doesn't use the <code>F1</code> internal flag,
so the <code>REP</code> prefix has no effect.</p>
<h3><code>REPNZ MOVS/STOS</code></h3>
<p><a href="https://en.wikipedia.org/wiki/X86_instruction_listings#Undocumented_x86_instructions">Wikipedia</a> mentions that
the use of the <code>REPNZ</code> prefix (as opposed to <code>REPZ</code>) is undefined with string operations other than <code>CMPS/SCAS</code>.
An internal flag called <code>F1Z</code> distinguishes between the <code>REPZ</code> and <code>REPNZ</code> prefixes.
Only <code>CMPS/SCAS</code> use this flag; the other string instructions ignore it, so for them <code>REPZ</code> and <code>REPNZ</code>
behave identically.
I wrote about string operations in more detail in <a href="https://www.righto.com/2023/04/8086-microcode-string-operations.html">this post</a>.</p>
<h2>Using a register instead of memory</h2>
<p>Some instructions are documented as requiring a memory operand. However, the ModR/M byte can specify a register.
The behavior in these cases can be highly unusual, providing access to hidden registers.
Examining the microcode shows how this happens.</p>
<h3><code>LEA reg</code></h3>
<p>Many instructions have a ModR/M byte that indicates the memory address that the instruction should use, perhaps through
a complicated addressing mode.
The <code>LEA</code> (Load Effective Address) instruction is different: it doesn't access the memory location but returns the address itself.
The undocumented part is that the ModR/M byte can specify a register instead of a memory location. In that case,
what does the <code>LEA</code> instruction do? Obviously it can't return the address of a register, but it needs to return something.</p>
<p>The behavior of <code>LEA</code> is explained by how the 8086 handles the ModR/M byte.
Before running the microcode corresponding to the instruction, the microcode engine calls a short micro-subroutine
for the particular addressing mode.
This micro-subroutine puts the desired memory address (the effective address) into the <code>tmpA</code> register.
The effective address is copied to the <code>IND</code> (Indirect) register and the value is loaded from memory if needed.
On the other hand, if the ModR/M byte specified a register instead of memory, no micro-subroutine is called.
(I explain ModR/M handling in more detail in <a href="https://www.righto.com/2023/02/8086-modrm-addressing.html">this article</a>.)</p>
<p>The microcode for <code>LEA</code> itself is just one line. It stores the effective address in the <code>IND</code> register into the specified destination register, indicated by <code>N</code>.
This assumes that the appropriate ModR/M micro-subroutine was called before this code, putting the effective address into <code>IND</code>.</p>
<pre class="microcode">
IND → N RNI <span><b>LEA</b>: store IND register in destination, done</span>
</pre>
<p>But if a register was specified instead of a memory location, no ModR/M micro-subroutine gets called.
Instead, the <code>LEA</code> instruction will return whatever value was left
in <code>IND</code> from before, typically the previous memory location that was accessed.
Thus, <code>LEA</code> can be used to read the value of the <code>IND</code> register, which is normally hidden from the programmer.</p>
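<p>A toy Python model makes the stale-<code>IND</code> behavior concrete (the class and method names are mine; this models only the one relevant mechanism, not the real chip):</p>

```python
class Cpu:
    def __init__(self):
        self.ind = 0  # hidden Indirect (address) register

    def mem_access(self, addr):
        self.ind = addr  # memory accesses latch their address into IND

    def lea(self, modrm_memory, effective_addr=None):
        if modrm_memory:
            self.ind = effective_addr  # ModR/M micro-subroutine runs
        return self.ind                # microcode: IND -> N

cpu = Cpu()
cpu.mem_access(0x1234)                 # e.g. a prior MOV AX, [0x1234]
assert cpu.lea(modrm_memory=False) == 0x1234   # register form leaks old IND
assert cpu.lea(modrm_memory=True, effective_addr=0x10) == 0x10  # normal LEA
```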
<h3><code>LDS reg</code>, <code>LES reg</code></h3>
<p>The <code>LDS</code> and <code>LES</code> instructions load a far pointer from memory into the specified segment register and general-purpose register.
The microcode below assumes that the appropriate ModR/M micro-subroutine has set up <code>IND</code> and read the first value into <code>OPR</code>.
The microcode updates the destination register, increments <code>IND</code> by 2, reads the next value, and updates <code>DS</code>.
(The microcode for <code>LES</code> is a copy of this, but updates <code>ES</code>.)</p>
<pre class="microcode">
OPR → N <span><b>LDS</b>: Copy OPR to dest register</span>
IND → tmpC INC2 tmpC <span>Set up incrementing IND by 2</span>
Σ → IND R DS,P0 <span>Update IND, read next location</span>
OPR → DS RNI <span>Update DS</span>
</pre>
<p>If the <code>LDS</code> instruction specifies a register instead of memory, a micro-subroutine will not be called, so <code>IND</code> and <code>OPR</code>
will have values from a previous instruction.
<code>OPR</code> will be stored in the destination register, while the <code>DS</code> value will be read from the address <code>IND+2</code>.
Thus, these instructions provide a mechanism to access the hidden <code>OPR</code> register.</p>
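<p>The same kind of toy model illustrates the <code>OPR</code> leak (again, my own sketch with invented names, modeling just the registers involved):</p>

```python
class Cpu:
    def __init__(self):
        self.ind = 0   # hidden address register
        self.opr = 0   # hidden data register

    def mem_read(self, mem, addr):
        self.ind, self.opr = addr, mem[addr]  # reads latch address and data
        return self.opr

    def lds_reg(self, mem):
        # Register form of LDS: the ModR/M micro-subroutine is skipped,
        # so the destination gets stale OPR and DS loads from IND+2.
        dest = self.opr                      # OPR -> N
        ds = mem[(self.ind + 2) & 0xFFFF]    # R DS,P0 at address IND+2
        return dest, ds

mem = {0x0100: 0x1234, 0x0102: 0x5678}
cpu = Cpu()
cpu.mem_read(mem, 0x0100)        # some earlier instruction's memory read
assert cpu.lds_reg(mem) == (0x1234, 0x5678)
```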
<h3><code>JMP FAR rm</code></h3>
<p>The <code>JMP FAR rm</code> instruction normally jumps to the far address stored in memory at the location indicated by the ModR/M byte.
(That is, the ModR/M byte indicates where the new PC and CS values are stored.)
But, as with <code>LEA</code>, the behavior is undocumented if the ModR/M byte specifies a register, since a register doesn't hold
a four-byte value.</p>
<p>The microcode explains what happens.
As with <code>LEA</code>, the code expects a micro-subroutine to put the address into the <code>IND</code> register.
In this case, the micro-subroutine also loads the value at that address (i.e. the destination <code>PC</code>) into <code>tmpB</code>.
The microcode increments <code>IND</code> by 2 to point to the <code>CS</code> word in memory and reads that into <code>CS</code>.
Meanwhile, it updates the <code>PC</code> with <code>tmpB</code>.
It suspends prefetching and flushes the queue, so instruction fetching will restart at the new address.</p>
<pre class="microcode">
IND → tmpC INC2 tmpC <span><b>JMP FAR rm</b>: set up to add 2 to IND</span>
Σ → IND SUSP <span>Update IND, suspend prefetching</span>
tmpB → PC R DS,P0 <span>Update PC with tmpB. Read new CS from specified address</span>
OPR → CS FLUSH RNI <span>Update CS, flush queue, done</span>
</pre>
<p>If you specify a register instead of memory, the micro-subroutine won't get called.
Instead, the program counter will be loaded with whatever value was in <code>tmpB</code> and the <code>CS</code> segment register will
be loaded from the memory location two bytes after the location that <code>IND</code> was referencing.
Thus, this undocumented use of the instruction gives access to the otherwise-hidden <code>tmpB</code> register.</p>
<h2>The end of undocumented instructions</h2>
<p>Microprocessor manufacturers soon realized that undocumented instructions were a problem, since
programmers find them and often use them.
This creates an issue for future processors, or even revisions of the current processor:
if you eliminate an undocumented instruction, previously-working code that used the instruction will break,
and it will seem like the new processor is faulty.</p>
<p>The solution was for processors to detect undocumented instructions and prevent them from executing.
By the early 1980s, processors had enough transistors (thanks to Moore's law) that they could include
the circuitry to block unsupported instructions.
In particular, the 80186/80188 and the 80286 generated a trap of type 6 when an unused opcode was executed,
blocking use of the instruction.<span id="fnref:186"><a class="ref" href="#fn:186">9</a></span>
This trap is also known as #UD (Undefined instruction trap).<span id="fnref:fault"><a class="ref" href="#fn:fault">10</a></span></p>
<h2>Conclusions</h2>
<p>The 8086, like many early microprocessors, has undocumented instructions but no traps to stop them from executing.<span id="fnref:references"><a class="ref" href="#fn:references">11</a></span>
For the 8086, these fall into several categories.
Many undocumented instructions simply mirror existing instructions.
Some instructions are implemented but not documented for one reason or another, such as <code>SALC</code> and <code>POP CS</code>.
Other instructions can be used outside their normal range, such as <code>AAM</code> and <code>AAD</code>.
Some instructions are intended to work only with a memory address, so specifying a register can have
strange effects such as revealing the values of the hidden <code>IND</code> and <code>OPR</code> registers.</p>
<p>Keep in mind that my analysis is based on transistor-level simulation and examining the microcode; I haven't verified the behavior on a
physical 8086 processor. Please let me know if you see any errors in my analysis or undocumented instructions that I have
overlooked.
Also note that the behavior could change between different versions of the 8086; in particular, some versions by different manufacturers
(such as the NEC V20 and V30) are known to be different.</p>
<p>I plan to write more about the 8086, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I've also started experimenting with Mastodon recently as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>
and Bluesky as <a href="https://staging.bsky.app/profile/righto.com">@righto.com</a> so you can follow me there too.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:6502">
<p>The 6502 processor, for instance, has illegal instructions with various effects, including causing the processor to hang.
The article <a href="https://www.pagetable.com/?p=39">How MOS 6502 illegal opcodes really work</a> describes in detail
how the instruction decoding results in various illegal opcodes. Some of these opcodes put the internal bus into a floating
state, so the behavior is electrically unpredictable. <a class="footnote-backref" href="#fnref:6502" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:prefixes">
<p>The 8086 used up almost all the single-byte opcodes, which made it difficult to extend the instruction set.
Most of the new instructions for the 386 or later are multi-byte opcodes, either using <code>0F</code> as a prefix or
reusing the earlier REP prefix (<code>F3</code>).
Thus, the x86 instruction set is less efficient than it could be, since many single-byte opcodes were "wasted"
on hardly-used instructions such as BCD arithmetic, forcing newer instructions to be multi-byte. <a class="footnote-backref" href="#fnref:prefixes" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:magic-instruction">
<p>For details on the "magic instruction" hidden in the 8086 microcode, see <a href="https://digitalcommons.law.scu.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1031&context=chtlj">NEC v. Intel: Will Hardware Be Drawn into the
Black Hole of Copyright Editors</a> page 49.
I haven't found anything stating that <code>SALC</code> was the hidden instruction, but this is the only undocumented instruction that
makes sense as something deliberately put into the microcode.
The court case is complicated since NEC had a licensing agreement with Intel, so I'm skipping lots of details.
See <a href="http://jolt.law.harvard.edu/articles/pdf/v03/03HarvJLTech209.pdf">NEC v. Intel: Breaking new ground in the law of copyright</a> for more. <a class="footnote-backref" href="#fnref:magic-instruction" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:microcode">
<p>The microcode listings are based on Andrew Jenner's <a href="https://www.reenigne.org/blog/8086-microcode-disassembled/">disassembly</a>.
I have made some modifications to (hopefully) make it easier to understand. <a class="footnote-backref" href="#fnref:microcode" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:extension">
<p>Specifying the instruction through the ModR/M reg field may seem a bit random, but there's a reason for this.
A typical instruction such as <code>ADD</code> has two arguments specified by the ModR/M byte.
But other instructions such as shift instructions or <code>NOT</code> only take one argument.
For these instructions, the ModR/M <code>reg</code> field would be wasted if it specified a register.
Thus, using the <code>reg</code> field to specify instructions that only use one argument makes the instruction set more efficient. <a class="footnote-backref" href="#fnref:extension" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:alu">
<p>Note that "normal" ALU operations are specified by bits 5-3 of the instruction; in order these are <code>ADD</code>, <code>OR</code>, <code>ADC</code>, <code>SBB</code>, <code>AND</code>, <code>SUB</code>, <code>XOR</code>,
and <code>CMP</code>.
These are exactly the same ALU operations that the "Immed" group performs, specified by bits 5-3 of the second byte.
This illustrates how the same operation selection mechanism (the <code>X</code> register) is used in both cases.
Bit 6 of the instruction switches between the set of arithmetic/logic instructions and the set of shift/rotate instructions. <a class="footnote-backref" href="#fnref:alu" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:setmo">
<p>As far as I can tell, SETMO isn't used by the microcode.
Thus, I think that SETMO wasn't deliberately implemented in the ALU,
but is a consequence of how the ALU's control logic is implemented.
That is, all the even entries are left shifts and the odd entries are right shifts, so operation 6 activates the
left-shift circuitry. But it doesn't match a specific left shift operation, so the ALU doesn't get configured for a
"real" left shift.
In other words, the behavior of this instruction is due to how the ALU handles a case that it wasn't specifically designed
to handle.</p>
<p>This function is implemented in the ALU somewhat similarly to a left shift.
However, instead of passing each input bit to the left, the bit from the right is passed to the left.
That is, the input to bit 0 is shifted left to all of the bits of the result.
By setting this bit to 1, all bits of the result are set, yielding the minus 1 result. <a class="footnote-backref" href="#fnref:setmo" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:immed">
<p>This footnote provides some references for the "Immed" opcodes.
The <a href="https://www.electro-tech-online.com/datasheets/8086_intel.pdf">8086 datasheet</a>
has an opcode map showing opcodes <code>80</code> through <code>83</code> as valid.
However, in the listings of individual instructions it only shows <code>80</code> and <code>81</code> for logical instructions (i.e. bit 1 must be 0),
while it shows <code>80</code>-<code>83</code> for arithmetic instructions.
The modern <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html">Intel 64 and IA-32 Architectures Software Developer's Manual</a> is also contradictory.
Looking at the instruction reference for <code>AND</code> (Vol <code>2A</code> 3-<code>78</code>), for instance,
shows opcodes <code>80</code>, <code>81</code>, and <code>83</code>, explicitly labeling <code>83</code> as sign-extended.
But the opcode map (Table A-2 Vol <code>2D</code> A-7) shows <code>80</code>-<code>83</code> as defined except for <code>82</code> in <code>64</code>-bit mode.
The instruction bit diagram (Table B-<code>13</code> Vol <code>2D</code> B-7) shows <code>80</code>-<code>83</code> valid for the arithmetic and logical instructions. <a class="footnote-backref" href="#fnref:immed" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:186">
<p>The 80286 was more thorough about detecting undefined opcodes than the 80186, even taking into account the
differences in instruction set.
The 80186 generates a trap when <code>0F</code>, <code>63</code>-<code>67</code>, <code>F1</code>, or <code>FFFF</code> is executed.
The 80286 generates invalid opcode exception number 6 (#UD) on any undefined opcode, handling the following cases:
<ul>
<li> The first byte of an instruction is completely invalid (e.g., 64H).
<li> The first byte indicates a 2-byte opcode and the second byte is invalid (e.g., 0F followed by
0FFH).
<li> An invalid register is used with an otherwise valid opcode (e.g., MOV CS,AX).
<li> An invalid opcode extension is given in the REG field of the ModR/M byte (e.g., 0F6H /1).
<li> A register operand is given in an instruction that requires a memory operand (e.g., LGDT AX).
</ul>
<!-- http://www.bitsavers.org/components/intel/80186/210911-001_iAPX86_88_186_188_Programmers_Reference_1983.pdf page 5-3 -->
<!-- page B-11 of http://bitsavers.trailing-edge.com/components/intel/80286/210498-005_80286_and_80287_Programmers_Reference_Manual_1987.pdf --> <a class="footnote-backref" href="#fnref:186" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:fault">
<p>In modern x86 processors, most undocumented instructions cause faults. However, there are still a few undocumented instructions
that don't fault.
These may be for internal use or corner cases of documented instructions.
For details, see <a href="https://www.youtube.com/watch?v=KrksBdWcZgQ">Breaking the x86 Instruction Set</a>, a video from Black Hat 2017. <a class="footnote-backref" href="#fnref:fault" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:references">
<p>Several sources have discussed undocumented 8086 opcodes before.
The article <a href="https://www.os2museum.com/wp/undocumented-8086-opcodes-part-i/">Undocumented 8086 Opcodes</a> describes undocumented opcodes in detail.
<a href="https://en.wikipedia.org/wiki/X86_instruction_listings#Undocumented_x86_instructions">Wikipedia</a> has a list of
undocumented x86 instructions.
The book <a href="https://archive.org/details/undocumentedpc0000vang/page/58/mode/2up?view=theater">Undocumented PC</a> discusses
undocumented instructions in the 8086 and later processors.
This <a href="https://retrocomputing.stackexchange.com/questions/20031/undocumented-instructions-in-x86-cpu-prior-to-80386">StackExchange Retrocomputing</a> post describes undocumented instructions.
These <a href="https://news.ycombinator.com/item?id=34960243">Hacker News comments</a> discuss some undocumented instructions.
There are other sources with more myth than fact, claiming that the 8086 treats undocumented instructions as NOPs, for instance. <a class="footnote-backref" href="#fnref:references" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com15tag:blogger.com,1999:blog-6264947694886887540.post-27958138646115790362023-07-08T09:14:00.003-07:002023-09-19T16:35:40.194-07:00Reverse-engineering the 8086 processor's address and data pin circuits<style>
.hilite {cursor:zoom-in}
a:link img.hilite, a:visited img.hilite {color: #fff;}
a:hover img.hilite {color: #f66;}
</style>
<p>The Intel 8086 microprocessor (1978) started the x86 architecture that continues to this day.
In this blog post, I'm focusing on a small part of the chip: the address and data pins that connect the chip to
external memory and I/O devices.
In many processors, this circuitry is straightforward, but it is complicated in the 8086 for two reasons.
First, Intel decided to package the 8086 as a 40-pin DIP, which didn't provide enough pins for all the functionality.
Instead, the 8086 multiplexes address, data, and status.
In other words, a pin can have multiple roles, providing an address bit at one time and a data bit at another time.</p>
<p>The second complication is that the 8086 has a 20-bit address space (due to its infamous segment registers), while the
data bus is 16 bits wide.
As will be seen, the "extra" four address bits have more impact than you might expect.
To summarize, 16 pins, called AD0-AD15, provide 16 bits of address and data.
The four remaining address pins (A16-A19) are multiplexed for use as status pins,
providing information about what the processor is doing for use by other parts of the system.
You might expect that the 8086 would thus have two types of pin circuits, but it turns out that there are four
distinct circuits, which I will discuss below.</p>
<p><a href="https://static.righto.com/images/8086-ad-pins/die-labeled.jpg"><img alt="The 8086 die under the microscope, with the main functional blocks and address pins labeled. Click this image (or any other) for a larger version." class="hilite" height="623" src="https://static.righto.com/images/8086-ad-pins/die-labeled-w700.jpg" title="The 8086 die under the microscope, with the main functional blocks and address pins labeled. Click this image (or any other) for a larger version." width="700" /></a><div class="cite">The 8086 die under the microscope, with the main functional blocks and address pins labeled. Click this image (or any other) for a larger version.</div></p>
<p>The microscope image above shows the silicon die of the 8086.
In this image, the metal layer on top of the chip is visible, while the silicon and polysilicon underneath are obscured.
The square pads around the edge of the die are connected by tiny bond wires to the chip's 40 external pins.
The 20 address pins are labeled: Pins AD0 through AD15 function as
address and data pins. Pins A16 through A19 function as address pins and status pins.<span id="fnref:ad"><a class="ref" href="#fn:ad">1</a></span>
The circuitry that controls the pins is highlighted in red.
Two internal buses are important for this discussion: the 20-bit AD bus (green) connects the AD pins to the rest of the CPU,
while the 16-bit C bus (blue) communicates with the registers.
These buses are connected through a circuit that can swap the byte order or shift the value.
(The lines on the diagram are simplified; the real wiring twists and turns to fit the layout.
Moreover, the C bus (blue) has its bits spread across the width of the register file.)</p>
<h2>Segment addressing in the 8086</h2>
<p>One goal of the 8086 design was to maintain backward compatibility with the earlier 8080 processor.<span id="fnref:compatibility"><a class="ref" href="#fn:compatibility">2</a></span>
This had a major impact on the 8086's memory design, resulting in the much-hated segment registers.
The 8080 (like most of the 8-bit processors of the early 1970s) had a 16-bit address space, able to access 64K (65,536 bytes) of memory,
which was plenty at the time.
But due to the exponential growth in memory capacity described by Moore's Law, it was clear that the 8086 needed to
support much more. Intel decided on a 1-megabyte address space, requiring 20 address bits.
But Intel wanted to keep the 16-bit memory addresses used by the 8080.</p>
<p>The solution was to break memory into segments. Each segment was 64K long, so a 16-bit offset was sufficient to access memory
in a segment.
The segments were allocated in a 1-megabyte address space, with the result that you could access a megabyte of memory, but
only in 64K chunks.<span id="fnref:pointers"><a class="ref" href="#fn:pointers">3</a></span>
Segment addresses were also 16 bits, but were shifted left by 4 bits (multiplied by 16) to support the 20-bit address space.</p>
<p>Thus, every memory access in the 8086 required a computation of the physical address.
The diagram below illustrates this process: the logical address consists of the segment base address and the offset within the segment.
The 16-bit segment register was shifted 4 bits and added to the 16-bit offset to yield the 20-bit physical memory address.</p>
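The arithmetic is easy to express in code. Here's a minimal Python sketch of the computation (the function name is mine, not Intel's):

```python
def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment register and 16-bit offset into a
    20-bit physical address: shift the segment left 4 bits and add."""
    return ((segment << 4) + offset) & 0xFFFFF  # wraps at 1 megabyte

# segment 0x1234, offset 0x0022 -> 0x12340 + 0x0022 = 0x12362
assert physical_address(0x1234, 0x0022) == 0x12362
```

Note the masking: a segment near the top of memory plus a large offset wraps around to the bottom of the 1-megabyte address space.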
<p><a href="https://static.righto.com/images/8086-ad-pins/physical-address-generation.jpg"><img alt="The segment register and the offset are added to create a 20-bit physical address. From iAPX 86,88 User's Manual, page 2-13." class="hilite" height="260" src="https://static.righto.com/images/8086-ad-pins/physical-address-generation-w500.jpg" title="The segment register and the offset are added to create a 20-bit physical address. From iAPX 86,88 User's Manual, page 2-13." width="500" /></a><div class="cite">The segment register and the offset are added to create a 20-bit physical address. From <a href="http://www.bitsavers.org/components/intel/_dataBooks/1981_iAPX_86_88_Users_Manual.pdf">iAPX 86,88 User's Manual</a>, page 2-13.</div></p>
<p>This address computation was not performed by the regular ALU (Arithmetic/Logic Unit), but by a separate adder that
was devoted to address computation.
The address adder is visible in the upper-left corner of the die photo.
I will discuss the address adder in more detail below.</p>
<h2>The AD bus and the C Bus</h2>
<p>The 8086 has multiple internal buses to move bits internally, but the relevant ones are the AD bus and the C bus.
The AD bus is a 20-bit bus that connects the 20 address/data pins to the internal circuitry.<span id="fnref:patent"><a class="ref" href="#fn:patent">4</a></span>
A 16-bit bus called the C bus provides the connection between
the AD bus, the address adder and some of the registers.<span id="fnref:registers"><a class="ref" href="#fn:registers">5</a></span>
The diagram below shows the connections.
The AD bus can be connected to the 20 address pins through latches. The low 16 pins can also be used for data input, while the upper 4 pins
can also be used for status output.
The address adder performs the 16-bit addition necessary for segment arithmetic. Its output is shifted left by four bits
(i.e. it has four 0 bits appended), producing the 20-bit result.
The inputs to the adder are provided by registers, a constant ROM that holds small constants such as +1 or -2, or the C bus.</p>
<p><a href="https://static.righto.com/images/8086-ad-pins/buses.jpg"><img alt="My reverse-engineered diagram showing how the AD bus and the C bus interact with the address pins." class="hilite" height="311" src="https://static.righto.com/images/8086-ad-pins/buses-w350.jpg" title="My reverse-engineered diagram showing how the AD bus and the C bus interact with the address pins." width="350" /></a><div class="cite">My reverse-engineered diagram showing how the AD bus and the C bus interact with the address pins.</div></p>
<p>The shift/crossover circuit provides the interface between these two buses, handling the 20-bit to 16-bit conversion. The buses can be connected in three ways: direct, crossover, or shifted.<span id="fnref:swapping"><a class="ref" href="#fn:swapping">6</a></span>
The direct mode connects the 16 bits of the C bus to the lower 16 bits of the address/data pins.
This is the standard mode for transferring data between the 8086's internal circuitry and the data pins.
The crossover mode performs the same connection but swaps the bytes. This is typically used for unaligned memory accesses, where the low memory byte corresponds to
the high register byte, or vice versa.
The shifted mode shifts the 20-bit AD bus value four positions to the right.
In this mode, the 16-bit output from the address adder goes to the 16-bit C bus.
(The shift is necessary to counteract the 4-bit shift applied to the address adder's output.)
Control circuitry selects the right operation for the shift/crossover circuit at the right time.<span id="fnref:shift"><a class="ref" href="#fn:shift">7</a></span></p>
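The three modes can be summarized in a small Python sketch (my own illustration; the function name and string-valued mode selector are invented, and the real circuit is wiring, not arithmetic):

```python
def shift_crossover(value: int, mode: str) -> int:
    """Model the three connection modes between the C bus and AD bus."""
    if mode == "direct":
        # Pass the 16-bit C-bus value straight through to AD0-AD15.
        return value & 0xFFFF
    if mode == "crossover":
        # Same connection, but with the two bytes swapped
        # (used for unaligned memory accesses).
        return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)
    if mode == "shifted":
        # Shift the 20-bit AD-bus value right 4 bits onto the 16-bit
        # C bus, undoing the address adder's 4-bit left shift.
        return (value >> 4) & 0xFFFF
    raise ValueError(f"unknown mode: {mode}")
```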
<p>Two of the registers are invisible to the programmer but play an important role in memory accesses.
The <code>IND</code> (Indirect) register specifies the memory address; it holds the 16-bit memory offset in a segment.
The <code>OPR</code> (Operand) register holds the data value.<span id="fnref:prefetch"><a class="ref" href="#fn:prefetch">9</a></span>
The <code>IND</code> and <code>OPR</code> registers are not accessed directly by the programmer; the microcode for a machine instruction moves the appropriate
values to these registers prior to the write.</p>
<h2>Overview of a write cycle</h2>
<p>I hesitate to present a timing diagram, since I may scare off my readers,
but the 8086's communication is designed around a four-step bus cycle.
The diagram below shows simplified timing for a write cycle, when the 8086 writes to memory or an I/O device.<span id="fnref:timing"><a class="ref" href="#fn:timing">8</a></span>
The external bus activity is organized as four states, each one clock cycle long: T1, T2, T3, T4.
These T states are very important since they control what happens on the bus.
During T1, the 8086 outputs the address on the pins. During the T2, T3, and T4 states, the 8086 outputs the data word on the pins.
The important part for this discussion is that the pins are multiplexed depending on the T-state: the pins provide the address during T1 and data during
T2 through T4.</p>
<p><a href="https://static.righto.com/images/8086-ad-pins/write-cycle.jpg"><img alt="A typical write bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16." class="hilite" height="130" src="https://static.righto.com/images/8086-ad-pins/write-cycle-w700.jpg" title="A typical write bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16." width="700" /></a><div class="cite">A typical write bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.</div></p>
<p>There are two undocumented T states that are important to the bus cycle.
The physical address is computed in the two clock cycles before T1 so the address will be available in T1.
I give these "invisible" T states the names TS (start) and T0.</p>
<h2>The address adder</h2>
<p>The operation of the address adder is a bit tricky since the 16-bit adder must generate a 20-bit physical address.
The adder has two 16-bit inputs: the B input is connected to the upper registers via the B bus, while the C input is connected to the C bus.
The segment register value is transferred over the B bus to the adder during the second half
of the TS state (that is, two clock cycles before the bus cycle becomes externally visible during T1).
Meanwhile, the address offset is transferred over the C bus to the adder, but the adder's C input shifts the value four bits to the right,
discarding the four low bits. (As will be explained later, the pin driver circuits latch these bits.)
The adder's output is shifted left four bits and transferred to the AD bus during the second half of T0.
This produces the upper 16 bits of the 20-bit physical memory address.
This value is latched into the address output flip-flops at the start of T1, putting the computed address on the pins.
To summarize, the 20-bit address is generated by storing the 4 low-order bits during T0 and then the 16 high-order sum bits
during T1.</p>
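The split between the latched low bits and the 16-bit sum can be modeled in Python. This is my own sketch of the arithmetic, not the hardware mechanism (the real circuit uses latches and a shifted adder, not full-word operations), but it shows why discarding the low 4 offset bits before the addition is safe: the segment contributes nothing to those bits, so no carry is lost.

```python
def address_on_pins(segment: int, offset: int) -> int:
    """Assemble the 20-bit address the way the pin circuits do:
    the low 4 offset bits are latched directly (during T0), while the
    upper 16 bits come from the 16-bit adder's segment + offset sum."""
    low4 = offset & 0xF                          # latched before the addition
    sum16 = (segment + (offset >> 4)) & 0xFFFF   # 16-bit address adder
    return (sum16 << 4) | low4                   # 20-bit physical address
```

For any segment and offset, this matches the usual `(segment << 4) + offset` formula modulo 1 megabyte.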
<p>The address adder is not needed for segment arithmetic during T1 and T2.
To improve performance, the 8086 uses the adder during this idle time to increment or decrement memory addresses.
For instance, after popping a word from the stack, the stack pointer needs to be incremented by 2.
The address adder can do this increment "for free" during T1 and T2, leaving the ALU available for other operations.<span id="fnref:pipelining"><a class="ref" href="#fn:pipelining">10</a></span>
Specifically, the adder updates the memory address in <code>IND</code>, incrementing it or decrementing it as appropriate.
First, the <code>IND</code> value is transferred over the B bus to the adder during the second half of T1.
Meanwhile, a constant (-3 to +2) is loaded from the Constant ROM and transferred to the adder's C input.
The output from the adder is transferred to the AD bus during the second half of T2.
As before, the output is shifted four bits to the left. However, the shift/crossover circuit between the AD bus and the C bus
is configured to shift four bits to the right, canceling the adder's shift.
The result is that the C bus gets the 16-bit sum from the adder, and this value is stored in the <code>IND</code> register.<span id="fnref:predecrement"><a class="ref" href="#fn:predecrement">11</a></span>
For more information on the implementation of the address adder, see my <a href="https://www.righto.com/2020/08/reverse-engineering-adder-inside-intel.html">previous blog post</a>.</p>
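The pointer update itself is simple 16-bit wraparound arithmetic; here's a hedged Python sketch (the function name is mine):

```python
def update_ind(ind: int, constant: int) -> int:
    """Model the pipelined pointer update: during T1-T2 the address adder
    adds a small constant from the constant ROM (-3 to +2) to the IND
    register, wrapping at 16 bits since offsets stay within a 64K segment."""
    return (ind + constant) & 0xFFFF

# e.g. after popping a word from the stack, the offset is incremented by 2
```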
<h2>The pin driver circuit</h2>
<p>Now I'll dive down to the hardware implementation of an output pin.
When the 8086 chip communicates with the outside world, it needs to provide relatively high currents.
The tiny logic transistors can't provide enough current, so the chip needs to use large output transistors.
To fit the large output transistors on the die, they are constructed of multiple wide transistors in parallel.<span id="fnref:ratio"><a class="ref" href="#fn:ratio">12</a></span>
Moreover, the drivers use a somewhat unusual "superbuffer" circuit with two transistors: one to pull the output high, and one to pull the output low.<span id="fnref:superbuffer"><a class="ref" href="#fn:superbuffer">13</a></span></p>
<p>The diagram below shows the transistor structure for one of the output pins (AD10), consisting of three
parallel transistors between the output and +5V, and five parallel transistors between the output and ground.
The die photo on the
left shows the metal layer on top of the die. This shows the power and ground wiring and the connections to
the transistors.
The photo on the right shows the die with the metal layer removed, showing the underlying silicon and the
polysilicon wiring on top.
A transistor gate is formed where a polysilicon wire crosses the doped silicon region.
Combined, the +5V transistors are equivalent to about 60 typical transistors, while the ground transistors are
equivalent to about 100 typical transistors.
Thus, these transistors provide substantially more current to the output pin.</p>
<p><a href="https://static.righto.com/images/8086-ad-pins/output-transistor.jpg"><img alt="Two views of the output transistors for a pin. The first shows the metal layer, while the second shows the polysilicon and silicon." class="hilite" height="402" src="https://static.righto.com/images/8086-ad-pins/output-transistor-w800.jpg" title="Two views of the output transistors for a pin. The first shows the metal layer, while the second shows the polysilicon and silicon." width="800" /></a><div class="cite">Two views of the output transistors for a pin. The first shows the metal layer, while the second shows the polysilicon and silicon.</div></p>
<h3>Tri-state output driver</h3>
<p>The output circuit for an address pin uses a tri-state buffer, which allows the output to be disabled
by putting it into a high-impedance "tri-state" configuration.
In this state, the output is not pulled high or low but is left floating.
This capability allows the pin to be used for data input.
It also allows external devices to take control of the bus, for instance, to perform
DMA (direct memory access).</p>
<p>The pin is driven by two large MOSFETs, one to pull the output high and one to pull it low.
(As described earlier, each large MOSFET is physically multiple transistors in parallel, but I'll ignore that for now.)
If both MOSFETs are off, the output floats, neither on nor off.</p>
<p><a href="https://static.righto.com/images/8086-ad-pins/output-circuit.jpg"><img alt="Schematic diagram of a &quot;typical&quot; address output pin." class="hilite" height="230" src="https://static.righto.com/images/8086-ad-pins/output-circuit-w400.jpg" title="Schematic diagram of a &quot;typical&quot; address output pin." width="400" /></a><div class="cite">Schematic diagram of a "typical" address output pin.</div></p>
<p>The tri-state output is implemented by driving the MOSFETs with two "superbuffer"<span id="fnref:and"><a class="ref" href="#fn:and">15</a></span> AND gates.
If the <code>enable</code> input is low, both AND gates produce a low output and both output transistors are off.
On the other hand, if <code>enable</code> is high, one AND gate will be on and one will be off.
The desired output value is loaded into a flip-flop to hold it,<span id="fnref:clock"><a class="ref" href="#fn:clock">14</a></span>
and the flip-flop turns one of the output transistors on, driving the output pin high or low as appropriate.
(Conveniently, the flip-flop provides the data output Q and the inverted data output <span style="text-decoration:overline">Q</span>.)
Generally, the address pin outputs are enabled for T1-T4 of a write but only during T1 for a read.<span id="fnref:enable"><a class="ref" href="#fn:enable">16</a></span></p>
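The driver's logic can be sketched in a few lines of Python (my own model; `None` stands in for the floating high-impedance state):

```python
def tristate_driver(q: bool, enable: bool):
    """Model the tri-state pin driver: two AND gates drive the pull-up
    and pull-down MOSFETs. With enable low, both transistors are off and
    the pin floats; otherwise exactly one transistor conducts."""
    pull_up = enable and q          # gate of the high-side MOSFET (from Q)
    pull_down = enable and not q    # gate of the low-side MOSFET (from Q-bar)
    if pull_up:
        return 1
    if pull_down:
        return 0
    return None  # high impedance: pin left floating
```

Note that the two gate signals can never both be true, so the pull-up and pull-down transistors never fight each other.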
<p>In the remainder of the discussion, I'll use the tri-state buffer symbol below, rather than showing the implementation of the buffer.</p>
<p><a href="https://static.righto.com/images/8086-ad-pins/output-simplified.jpg"><img alt="The output circuit, expressed with a tri-state buffer symbol." class="hilite" height="137" src="https://static.righto.com/images/8086-ad-pins/output-simplified-w350.jpg" title="The output circuit, expressed with a tri-state buffer symbol." width="350" /></a><div class="cite">The output circuit, expressed with a tri-state buffer symbol.</div></p>
<h3>AD4-AD15</h3>
<p>Pins AD4-AD15 are "typical" pins, avoiding the special behavior of the top and bottom pins, so I'll discuss them first.
The behavior of these pins is that the value on the AD bus is latched by the circuit and then put on the output pin
under the control of the <code>enable</code> signal.
The circuit has three parts: a multiplexer to select the output value, a flip-flop to hold the output value, and a tri-state driver to
provide the high-current output to the pin.
In more detail, the multiplexer selects either the value on the AD bus or the current output from the flip-flop.
That is, the multiplexer can either load a new value into the flip-flop or hold the existing value.<span id="fnref:implementation"><a class="ref" href="#fn:implementation">17</a></span>
The flip-flop latches the input value on the falling edge of the clock, passing it to the output driver.
If the enable line is high, the output driver puts this value on the corresponding address pin.</p>
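The multiplexer-latch-driver chain can be sketched as a tiny Python class (my own model; the real circuit is a dynamic latch clocked on the falling edge, not software state):

```python
class ADPin:
    """Model of an AD4-AD15 pin circuit: a multiplexer either loads the
    AD-bus value into the flip-flop or recirculates the held value; a
    tri-state driver puts the held value on the pin when enabled."""

    def __init__(self):
        self.q = 0  # flip-flop state

    def clock(self, ad_bus_bit: int, load: bool):
        # On the falling clock edge, latch the multiplexer's output:
        # either the new AD-bus bit (load) or the current value (hold).
        self.q = ad_bus_bit if load else self.q

    def pin(self, enable: bool):
        # None represents the floating (tri-stated) pin.
        return self.q if enable else None
```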
<!-- datasheet: output low tested at 2.0mA, output high tested at -400 microamps -->
<p><a href="https://static.righto.com/images/8086-ad-pins/ad415.jpg"><img alt="The output circuit for AD4-AD15 has a latch to hold the desired output value, an address or data bit." class="hilite" height="129" src="https://static.righto.com/images/8086-ad-pins/ad415-w400.jpg" title="The output circuit for AD4-AD15 has a latch to hold the desired output value, an address or data bit." width="400" /></a><div class="cite">The output circuit for AD4-AD15 has a latch to hold the desired output value, an address or data bit.</div></p>
<p>For a write, the circuit latches the address value on the bus during the second half of T0 and puts it on the pins during T1.
During the second half of the T1 state, the data word is transferred from the <code>OPR</code> register over the C bus to the AD bus and loaded
into the AD pin latches.
The word is transferred from the latches to the pins during T2 and held for the remainder of the bus cycle.</p>
<h3>AD0-AD3</h3>
<p>The four low address bits have a more complex circuit because these address bits are latched from the bus before the address adder computes its sum, as described earlier.
The memory offset (before the segment addition) will be on the C bus during the second half of TS and is loaded into the lower
flip-flop. This flip-flop delays these bits for one clock cycle and then they are loaded into the upper flip-flop.
Thus, these four pins pick up the offset prior to the addition, while the other pins get the result of the segment addition.</p>
<p><a href="https://static.righto.com/images/8086-ad-pins/ad03.jpg"><img alt="The output circuit for AD0-AD3 has a second latch to hold the low address bits before the address adder computes the sum." class="hilite" height="174" src="https://static.righto.com/images/8086-ad-pins/ad03-w500.jpg" title="The output circuit for AD0-AD3 has a second latch to hold the low address bits before the address adder computes the sum." width="500" /></a><div class="cite">The output circuit for AD0-AD3 has a second latch to hold the low address bits before the address adder computes the sum.</div></p>
<p>For data, the AD0-AD3 pins transfer data directly from the AD bus to the pin latch, bypassing the delay that was used to get the address bits.
That is, the AD0-AD3 pins have two paths: the delayed path used for addresses during T0 and the direct path otherwise used for data.
Thus, the multiplexer has three inputs: two for these two paths and a third loop-back input to hold the flip-flop value.</p>
<h3>A16-A19: status outputs</h3>
<p>The top four pins (A16-A19) are treated specially, since they are not used for data.
Instead, they provide processor status during T2-T4.<span id="fnref:status"><a class="ref" href="#fn:status">18</a></span> The pin latches for these
pins are loaded with the address during T0 like the other pins, but loaded with status instead of data during T1.
The multiplexer at the input to the latch selects the address bit during T0 and the status bit during T1, and
holds the value otherwise.
The schematic below shows how this is implemented for A16, A17, and A19.</p>
<p><a href="https://static.righto.com/images/8086-ad-pins/ad1619.jpg"><img alt="The output circuit for AD16, AD17, and AD19 selects either an address output or a status output." class="hilite" height="115" src="https://static.righto.com/images/8086-ad-pins/ad1619-w400.jpg" title="The output circuit for AD16, AD17, and AD19 selects either an address output or a status output." width="400" /></a><div class="cite">The output circuit for AD16, AD17, and AD19 selects either an address output or a status output.</div></p>
<p>Address pin A18 is different because it indicates the current status of the interrupt enable flag bit.
This status is updated every clock cycle, unlike the other pins.
To implement this, the pin has a different circuit that isn't latched,
so the status can be updated continuously.
The clocked transistors act as "pass transistors", passing the signal through when active.
When a pass transistor is turned off, the following logic gate holds the previous value due to the capacitance of the
wiring.
Thus, the pass transistors provide a way of holding the value through the clock cycle.
The flip-flops are implemented with pass transistors internally, so in a sense the circuit below is a flip-flop
that has been "exploded" to provide a second path for the interrupt status.</p>
<p><a href="https://static.righto.com/images/8086-ad-pins/ad18.jpg"><img alt="The output circuit for AD18 is different from the rest so the I flag status can be updated every clock cycle." class="hilite" height="162" src="https://static.righto.com/images/8086-ad-pins/ad18-w540.jpg" title="The output circuit for AD18 is different from the rest so the I flag status can be updated every clock cycle." width="540" /></a><div class="cite">The output circuit for AD18 is different from the rest so the I flag status can be updated every clock cycle.</div></p>
<h2>Reads</h2>
<p>A memory or I/O read also uses a 4-state bus cycle, slightly different from the write cycle.
During T1, the address is provided on the pins, the same as for a write.
After that, however, the output circuits are tri-stated so they float, allowing the external memory to put data on the bus.
The read data on the pin is put on the AD bus at the start of the T4 state.
From there, the data passes through the crossover circuit to the C bus. Normally the 16 data bits pass straight through to
the C bus, but the bytes will be swapped if the memory access is unaligned.
From the C bus, the data is written to the <code>OPR</code> register, a byte or a word as appropriate.
(For an instruction prefetch, the word is written to a prefetch queue register instead.)</p>
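The byte-routing decision on a read can be sketched in Python (my own illustration of the crossover circuit's role on the read path; the function name is invented):

```python
def read_data_to_c_bus(pin_word: int, unaligned: bool) -> int:
    """Route a 16-bit word from the AD pins to the C bus. On an unaligned
    access the crossover swaps the bytes, so the low memory byte lands in
    the high register byte and vice versa."""
    if unaligned:
        return ((pin_word & 0xFF) << 8) | ((pin_word >> 8) & 0xFF)
    return pin_word & 0xFFFF
```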
<p><a href="https://static.righto.com/images/8086-ad-pins/read-cycle.jpg"><img alt="A typical read bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16." class="hilite" height="139" src="https://static.righto.com/images/8086-ad-pins/read-cycle-w600.jpg" title="A typical read bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16." width="600" /></a><div class="cite">A typical read bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.</div></p>
<p>To support data input on the AD0-AD15 pins, they have a circuit to buffer the input data and transfer it to the AD bus.
The incoming data bit is buffered by the two inverters and sampled when the clock is high.
If the enable' signal is low, the data bit is transferred to the AD bus when the clock is low.<span id="fnref:read-enable"><a class="ref" href="#fn:read-enable">19</a></span>
The two MOSFETs act as a "superbuffer", providing enough current for the fairly long AD bus.
I'm not sure what the capacitor accomplishes, maybe avoiding a race condition if the data pin changes just as the clock goes low.<span id="fnref:race"><a class="ref" href="#fn:race">20</a></span></p>
<p><a href="https://static.righto.com/images/8086-ad-pins/read-circuit.jpg"><img alt="Schematic of the input circuit for the data pins." class="hilite" height="134" src="https://static.righto.com/images/8086-ad-pins/read-circuit-w500.jpg" title="Schematic of the input circuit for the data pins." width="500" /></a><div class="cite">Schematic of the input circuit for the data pins.</div></p>
<p>This circuit has a second role, precharging the AD bus high when the clock is low, if there's no data.
Precharging a bus is fairly common in the 8086 (and other NMOS processors) because NMOS transistors are better at pulling a
line low than pulling it high. Thus, it's often faster to precharge a line high before it's needed and then pull it low for a 0.<span id="fnref:adder"><a class="ref" href="#fn:adder">21</a></span></p>
<p>Since pins A16-A19 are not used for data, they operate the same for reads as for writes: providing address bits and then status.</p>
<h2>The pin circuit on the die</h2>
<p>The diagram below shows how the pin circuitry appears on the die. The metal wiring has been removed to show the silicon and polysilicon.
The top half of the image is the input circuitry, reading a data bit from the pin and feeding it to the AD bus.
The lower half of the image is the output circuitry, reading an address or data bit from the AD bus and amplifying it for output
via the pad.
The light gray regions are doped, conductive silicon. The thin tan lines are polysilicon, which forms transistor gates where it crosses doped silicon.</p>
<p><a href="https://static.righto.com/images/8086-ad-pins/pin-labeled.jpg"><img alt="The input/output circuitry for an address/data pin. The metal layer has been removed to show the underlying silicon and polysilicon. Some crystals have formed where the bond pad was." class="hilite" height="482" src="https://static.righto.com/images/8086-ad-pins/pin-labeled-w600.jpg" title="The input/output circuitry for an address/data pin. The metal layer has been removed to show the underlying silicon and polysilicon. Some crystals have formed where the bond pad was." width="600" /></a><div class="cite">The input/output circuitry for an address/data pin. The metal layer has been removed to show the underlying silicon and polysilicon. Some crystals have formed where the bond pad was.</div></p>
<h2>A historical look at pins and timing</h2>
<p>The number of pins on Intel chips has grown exponentially, more than a factor of 100 in 50 years.
In the early days, Intel management was convinced that a 16-pin package was large enough for any integrated circuit.
As a result, the Intel 4004 processor (1971) was crammed into a 16-pin package.
Intel chip designer Federico Faggin<span id="fnref:faggin"><a class="ref" href="#fn:faggin">22</a></span> describes 16-pin packages as a completely silly requirement that was throwing away
performance,
but the "God-given 16 pins" was like a religion at Intel.
When Intel was forced to use 18 pins by the 1103 memory chip, it "was like the sky had dropped from heaven"
and he had "never seen so many long faces at Intel."
Although the 8008 processor (1972) was able to use 18 pins, this low pin count still harmed performance by forcing pins to be used for multiple
purposes.</p>
<p>The Intel 8080 (1974) had a larger, 40-pin package that allowed it to have 16 address pins and 8 data pins.
Intel stuck with this size for the 8086, even though competitors used larger packages with more pins.<span id="fnref:ti"><a class="ref" href="#fn:ti">23</a></span>
As processors became more complex, the 40-pin package became infeasible and the pin count rapidly expanded:
the 80286 processor (1982) had a 68-pin package, while the
i386 (1985) had 132 pins; the i386 needed many more pins because it had a 32-bit data bus and a 24- or 32-bit address bus.
The i486 (1989) went to 196 pins while the original Pentium had 273 pins.
Nowadays, a modern <a href="https://www.intel.com/content/www/us/en/products/sku/232167/intel-core-i913900ks-processor-36m-cache-up-to-6-00-ghz/specifications.html">Core i9 processor</a> uses the <a href="https://en.wikipedia.org/wiki/LGA_1700">FCLGA1700</a> socket with a whopping 1700 contacts.</p>
<p>Looking at the history of Intel's bus timing, the 8086's complicated memory timing goes back to the Intel 8008 processor (1972). Instruction execution in the 8008 went through
a specific sequence of timing states; each clock cycle was assigned a particular state number.
Memory accesses took three cycles:
the address was sent to memory during states T1 and T2, half of the address at a time since there were only 8 address pins.
During state T3, a data byte was either transmitted to memory or read from memory.
Instruction execution took place during T4 and T5.
State signals from the 8008 chip indicated which state it was in.</p>
<!-- http://www.bitsavers.org/components/intel/MCS8/Intel_8008_8-Bit_Parallel_Central_Processing_Unit_Rev1_Apr72.pdf -->
<p>The 8080 used an even more complicated timing system.
An instruction consisted of one to five "machine cycles", numbered M1 through M5, where each machine cycle corresponded to
a memory or I/O access. Each machine cycle consisted of three to five states, T1 through T5, similar to the 8008 states.
The 8080 had 10 different types of machine cycle such as instruction fetch, memory read, memory write, stack read or write,
or I/O read or write. The status bits indicated the type of machine cycle.
The 8086 kept the T1 through T4 memory cycle. Because the 8086 decoupled instruction prefetching from execution, it no
longer had explicit M machine cycles. Instead, it used status bits to indicate 8 types of bus activity such as instruction
fetch, read data, or write I/O.</p>
<h2>Conclusions</h2>
<p>Well, the address pins are another subject that I thought would be straightforward to explain but turned out to be surprisingly
complicated.
Many of the 8086's design decisions combine in the address pins: segmented addressing, backward compatibility, and the small 40-pin package.
Moreover, because memory accesses are critical to performance, Intel put a lot of effort into this circuitry.
Thus, the pin circuitry is tuned for particular purposes, especially pin A18 which is different from all the rest.</p>
<p>There is a lot more to say about memory accesses and how the 8086's Bus Interface Unit performs them.
The process is very complicated, with interacting state machines for memory operation and instruction prefetches, as well
as handling unaligned memory accesses.
I plan to write more, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I've also started experimenting with Mastodon recently as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>
and Bluesky as <a href="https://staging.bsky.app/profile/righto.com">@righto.com</a> so you can follow me there too.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:ad">
<p>In the discussion, I'll often call all the address pins "AD" pins for simplicity, even though pins 16-19 are not
used for data. <a class="footnote-backref" href="#fnref:ad" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:compatibility">
<p>The 8086's compatibility with the 8080 was somewhat limited since the 8086 had a different instruction set.
However, Intel provided a conversion program called <a href="http://www.bitsavers.org/pdf/intel/ISIS_II/9800642-02_MCS-86_Assembly_Language_Converter_Feb80.pdf">CONV86</a>
that could convert 8080/8085 assembly code into 8086 assembly code that would usually work after minor editing.
The 8086 was designed to make this process straightforward, with a mapping from the 8080's registers onto the
8086's registers, along with a mostly-compatible instruction set. (There were a few 8080 instructions that would
be expanded into multiple 8086 instructions.)
The conversion worked for straightforward code, but didn't work well with tricky, self-modifying code, for instance. <a class="footnote-backref" href="#fnref:compatibility" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:pointers">
<p>To support the 8086's segment architecture, programmers needed to deal with "near" and "far" pointers.
A near pointer consisted of a 16-bit offset and could access 64K in a segment.
A far pointer consisted of a 16-bit offset along with a 16-bit segment address. By modifying the segment register on each access,
the full megabyte of memory could be accessed. The drawbacks were that far pointers were twice as big and were slower. <a class="footnote-backref" href="#fnref:pointers" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:patent">
<p>The <a href="https://patents.google.com/patent/US4449184A">8086 patent</a> provides a detailed architectural diagram of the 8086.
I've extracted part of the diagram below.
In most cases the diagram is accurate, but its description of the C bus doesn't match the real chip.
There are some curious differences between the patent diagram and the actual implementation of the 8086,
suggesting that the data pins were reorganized between the patent and the completion of the 8086.
The diagram
shows the address adder (called the Upper Adder) connected to the C bus, which is connected to the address/data
pins.
In particular, the patent shows the data pins multiplexed with the high address pins, while the low
address pins A3-A0 are multiplexed with three status signals.
The actual implementation of the 8086 is the other way around, with the data pins multiplexed with the low address
pins while the high address pins A19-A16 are multiplexed with the status signals.
Moreover, the patent doesn't show anything corresponding to what I call the AD bus; I made up that name.
The moral is that while patents can be very informative, they can also be misleading.</p>
<p><a href="https://static.righto.com/images/8086-ad-pins/patent-diagram.png"><img alt="A diagram from patent US4449184 showing the connections to the address pins. This diagram does not match the actual chip. The diagram also shows the old segment register names: RC, RD, RS, and RA became CS, DS, SS, and ES." class="hilite" height="530" src="https://static.righto.com/images/8086-ad-pins/patent-diagram-w400.png" title="A diagram from patent US4449184 showing the connections to the address pins. This diagram does not match the actual chip. The diagram also shows the old segment register names: RC, RD, RS, and RA became CS, DS, SS, and ES." width="400" /></a><div class="cite">A diagram from patent US4449184 showing the connections to the address pins. This diagram does not match the actual chip. The diagram also shows the old segment register names: RC, RD, RS, and RA became CS, DS, SS, and ES.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:patent" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:registers">
<p>The C bus is connected to the PC, OPR, and IND registers, as well as the prefetch queue, but is not
connected to the segment registers.
Two other buses (the ALU bus and the B bus) provide access to the segment registers. <a class="footnote-backref" href="#fnref:registers" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:swapping">
<p>Swapping the bytes on the data pins is required in a few cases.
The 8086 has a 16-bit data bus, so transfers are usually a 16-bit word, copied directly between memory and a register.
However, the 8086 also allows 8-bit operations, in which case either the top half or bottom half of the word is
accessed. Loading an 8-bit value from the top half of a memory word into the bottom half of a register uses the crossover
circuit.
Another case is performing a 16-bit access to an "unaligned" address, that is, an odd address so the word crosses
the normal word boundaries.
From the programmer's perspective, an unaligned access is permitted (unlike many RISC processors), but the hardware
converts this access into two 8-bit accesses, so the bus itself never handles an unaligned access.</p>
<p>The 8086 has the ability to access a single memory byte out of a word, either for a byte operation or for
an unaligned word operation.
This behavior has some important consequences on the address pins.
In particular, the low address pin AD0 doesn't behave like the rest of the address pins due to the special handling
of odd addresses.
Instead, this pin indicates which half of the word to transfer.
The AD0 line is low (0) when the lower portion of the bus transfers a byte.
Another pin, <span style="text-decoration: overline">BHE</span> (Bus High Enable) has a similar role for the upper
half of the bus: it is low (0) if a byte is transferred over D15-D8.
(Keep in mind that the 8086 is little-endian, so the low byte of the word is first in memory, at the even address.)</p>
<p><style type="text/css">
table.a0 { border-collapse: collapse}
table.a0 .overline { text-decoration: overline;}
table.a0 th,td {padding: 5px;}
table.a0 th:nth-child(1) {border-right: 1px solid #ccc;}
table.a0 td:nth-child(1) {border-right: 1px solid #ccc;}
</style></p>
<p>The following table summarizes how <span style="text-decoration:overline">BHE</span> and A0 work together to select a byte or word.
When accessing a byte at an odd address, A0 is odd as you might expect.</p>
<p><table class="a0">
<tr style="border-bottom: 1px solid #ccc;"><th>Access type</th><th class="overline">BHE</th><th>A0</th></tr>
<tr><td>Word</td><td>0</td><td>0</td></tr>
<tr><td>Low byte</td><td>1</td><td>0</td></tr>
<tr><td>High byte</td><td>0</td><td>1</td></tr>
</table> <a class="footnote-backref" href="#fnref:swapping" title="Jump back to footnote 6 in the text">↩</a></p>
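<p>The table's logic can be sketched as a small function (my illustration; real hardware derives these signals from the bus control logic). Note that <span style="text-decoration:overline">BHE</span> is active-low, so 1 means "inactive":</p>

```python
# Sketch: deriving /BHE and A0 for a byte or aligned-word transfer,
# following the table in the footnote. /BHE is active-low (1 = inactive).
def bus_enables(addr, is_word):
    a0 = addr & 1
    if is_word:
        # An unaligned word is split by the hardware into two byte cycles,
        # so an aligned word is assumed here.
        assert a0 == 0
        return {"BHE": 0, "A0": 0}   # word: both bus halves enabled
    if a0 == 0:
        return {"BHE": 1, "A0": 0}   # low byte (even address), D7-D0
    return {"BHE": 0, "A0": 1}       # high byte (odd address), D15-D8

assert bus_enables(0x1000, True) == {"BHE": 0, "A0": 0}
assert bus_enables(0x1001, False) == {"BHE": 0, "A0": 1}
```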
</li>
<li id="fn:shift">
<p>The <code>cbus-adbus-shift</code> signal is activated during <code>T2</code>, when a memory index is being updated, either the instruction pointer or the <code>IND</code> register. The address adder is used to update the register and the shift undoes the 4-bit left shift applied to the adder's output.
The shift is also used for the <code>CORR</code> micro-instruction, which corrects the instruction pointer to account for prefetching.
The <code>CORR</code> micro-instruction generates a "fake" short bus cycle in which the constant ROM and the address adder are used during T0.
I discuss the <code>CORR</code> micro-instruction in more detail in <a href="https://www.righto.com/2023/01/inside-8086-processors-instruction.html">this post</a>. <a class="footnote-backref" href="#fnref:shift" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:timing">
<p>I've made the timing diagram somewhat idealized so actions line up with the clock.
In the real datasheet, all the signals are skewed by various amounts so the timing is more complicated.
Moreover, if the memory device is slow, it can insert "wait" states between
T3 and T4. (Cheap memory was slower and would need wait states.)
Moreover, actions don't exactly line up with the clock.
I'm also omitting various control signals.
The datasheet has pages of timing constraints on exactly when
signals can change. <a class="footnote-backref" href="#fnref:timing" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:prefetch">
<p>Instruction prefetches don't use the <code>IND</code> and <code>OPR</code> registers.
Instead, the address is specified by the Instruction Pointer (or Program Counter), and the data is stored directly into
one of the instruction prefetch registers. <a class="footnote-backref" href="#fnref:prefetch" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:pipelining">
<p>A single memory operation takes six clock cycles: two preparatory cycles to compute the address before the four visible cycles.
However, if multiple memory operations are performed, the operations are overlapped to achieve a degree of pipelining.
Specifically, the address calculation for the next memory operation takes place during the last two clock cycles of the current
memory operation, saving two clock cycles.
That is, for consecutive bus cycles, T3 and T4 of one cycle overlap with TS and T0 of the next, so the next memory address is computed during T3 and T4 of the current bus cycle.
This pipelining improves performance, compared to taking 6 clock cycles for each bus cycle. <a class="footnote-backref" href="#fnref:pipelining" title="Jump back to footnote 10 in the text">↩</a></p>
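<p>The clock-cycle arithmetic works out as follows (a sketch of the overlap described above):</p>

```python
# Sketch: each bus cycle needs 6 clocks in isolation (2 preparatory +
# 4 visible), but back-to-back cycles overlap the 2-clock address
# calculation with T3/T4 of the previous cycle.
def clocks_for_bus_cycles(n):
    if n == 0:
        return 0
    return 2 + 4 * n  # 2 clocks up front, then 4 clocks per bus cycle

assert clocks_for_bus_cycles(1) == 6   # a lone bus cycle: 6 clocks
assert clocks_for_bus_cycles(3) == 14  # vs. 18 without pipelining
```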
</li>
<li id="fn:predecrement">
<p>The <code>POP</code> operation is an example of how the address adder updates a memory pointer.
In this case, the stack address is moved from the Stack Pointer to the <code>IND</code> register in order to perform the memory read.
As part of the read operation, the <code>IND</code> register is incremented by 2. The address is then moved from the <code>IND</code> register to
the Stack Pointer.
Thus, the address adder not only performs the segment arithmetic, but also computes the new value for the <code>SP</code> register.</p>
<p>Note that the increment/decrement of the <code>IND</code> register happens after the memory operation.
For stack operations, the SP must be decremented before a <code>PUSH</code> and incremented after a <code>POP</code>.
The adder cannot perform a predecrement, so the <code>PUSH</code> instruction uses the ALU (Arithmetic/Logic Unit) to perform the decrement. <a class="footnote-backref" href="#fnref:predecrement" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:ratio">
<p>The current from an MOS transistor is proportional to the width of the gate divided by the length (the W/L ratio).
Since the minimum gate length is set by the manufacturing process, increasing the width of the gate
(and thus the overall size of the transistor) is how the transistor's current is increased. <a class="footnote-backref" href="#fnref:ratio" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
<li id="fn:superbuffer">
<p>Using one transistor to pull the output high and one to pull the output low is normal for CMOS gates, but it is unusual for
NMOS chips like the 8086.
A normal NMOS gate only has an active transistor to pull the output low and uses a <a href="https://en.wikipedia.org/wiki/Depletion-load_NMOS_logic">depletion-mode</a> transistor to provide a weak pull-up current, similar to a pull-up resistor.
I discuss superbuffers in more detail <a href="https://www.righto.com/2022/11/the-unusual-bootstrap-drivers-inside.html#:~:text=Basic%20NMOS%20circuits">here</a>. <a class="footnote-backref" href="#fnref:superbuffer" title="Jump back to footnote 13 in the text">↩</a></p>
</li>
<li id="fn:clock">
<p>The flip-flop is controlled by the inverted clock signal, so the output will change when the clock goes low.
Meanwhile, the <code>enable</code> signal is dynamically latched by a MOSFET, also controlled by the inverted clock.
(When the clock goes high, the previous value will be retained by the gate capacitance of the inverter.) <a class="footnote-backref" href="#fnref:clock" title="Jump back to footnote 14 in the text">↩</a></p>
</li>
<li id="fn:and">
<p>The superbuffer AND gates are constructed on the same principle as the regular superbuffer, except with two
inputs.
Two transistors in series pull the output high if both inputs are high.
Two transistors in parallel pull the output low if either input is low.
The low-side transistors are driven by inverted signals. I haven't drawn these signals on the schematic to
simplify it.</p>
<p>The superbuffer AND gates use large transistors, but not as large as the output transistors, providing
an intermediate amplification stage between the small internal signals and the large external signals.
Because of the high capacitance of the large output transistors, they need to be driven with larger signals.
There's a lot of theory behind how transistor sizes should be scaled for maximum performance, described in
the book <a href="https://amzn.to/42BHeTz">Logical Effort</a>.
Roughly speaking, for best performance when scaling up a signal, each stage should be about 3 to 4 times as large as the previous
one, so a fairly large number of stages are used (page 21).
The 8086 simplifies this with two stages, presumably giving up a bit of performance in exchange for keeping the drivers smaller
and simpler. <a class="footnote-backref" href="#fnref:and" title="Jump back to footnote 15 in the text">↩</a></p>
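<p>The rule of thumb above can be sketched numerically (my illustration of the scaling argument, not a calculation from the book):</p>

```python
# Sketch: with each driver stage ~3-4x larger than the previous one,
# the number of stages needed to drive a load C_out from a source of
# size C_in is roughly log base fanout of (C_out / C_in).
import math

def stages_needed(c_in, c_out, fanout=4):
    return math.ceil(math.log(c_out / c_in, fanout))

# Driving a load 64x the source with 4x steps takes about 3 stages;
# doing it in just 2 stages (as the 8086 does) means larger, slower steps.
assert stages_needed(1, 64) == 3
```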
</li>
<li id="fn:enable">
<p>The enable circuitry has some complications. For instance, I think the address pins will be enabled if a
cycle was going to be T1 for a prefetch but then got preempted by a memory operation.
The bus control logic is fairly complicated. <a class="footnote-backref" href="#fnref:enable" title="Jump back to footnote 16 in the text">↩</a></p>
</li>
<li id="fn:implementation">
<p>The multiplexer is implemented with pass transistors, rather than gates. One of the pass transistors is turned on to pass
that value through to the multiplexer's output.
The flip-flop is implemented with two pass transistors and two inverters in alternating order.
The first pass transistor is activated by the clock and the second by the complemented clock.
When a pass transistor is off, its output is held by the gate capacitance of the inverter, somewhat like dynamic RAM.
This is one reason that the 8086 has a minimum clock speed: if the clock is too slow, these capacitively-held values
will drain away. <a class="footnote-backref" href="#fnref:implementation" title="Jump back to footnote 17 in the text">↩</a></p>
</li>
<li id="fn:status">
<p>The status outputs on the address pins are defined as follows:
<br/>
A16/S3, A17/S4: these two status lines indicate which relocation register is being used for the memory access,
i.e. the stack segment, code segment, data segment, or alternate segment.
Theoretically, a system could use a different memory bank for each segment and increase the total memory capacity to 4 megabytes.
<br/>
A18/S5: indicates the status of the interrupt enable bit. In order to provide the most up-to-date value, this pin has a different
circuit. It is updated
at the beginning of each clock cycle, so it can change during a bus cycle.
The motivation for this is presumably so peripherals can determine immediately if the interrupt enable status changes.
<br/>
A19/S6: the documentation calls this a status output, even though it always outputs a status of 0. <a class="footnote-backref" href="#fnref:status" title="Jump back to footnote 18 in the text">↩</a></p>
</li>
<li id="fn:read-enable">
<p>For a read, the enable signal is activated at the end of T3 and the beginning of T4 to transfer the data value to the AD bus.
The signal is gated by the READY pin, so the read doesn't happen until the external device is ready.
The 8086 will insert Tw wait states in that case. <a class="footnote-backref" href="#fnref:read-enable" title="Jump back to footnote 19 in the text">↩</a></p>
</li>
<li id="fn:race">
<p>The datasheet says that a data value must be held steady for 10 nanoseconds (TCLDX) after the clock goes low at the start of T4. <a class="footnote-backref" href="#fnref:race" title="Jump back to footnote 20 in the text">↩</a></p>
</li>
<li id="fn:adder">
<p>The design of the AD bus is a bit unusual since the adder will put a value on the AD bus when the clock is high, while the data pin will put a value
on the AD bus when the clock is low (while otherwise precharging it when the clock is low).
Usually the bus is precharged during one clock phase and all users of the bus pull it low (for a 0) during the other phase. <a class="footnote-backref" href="#fnref:adder" title="Jump back to footnote 21 in the text">↩</a></p>
</li>
<li id="fn:faggin">
<p>Federico Faggin's oral history is <a href="http://archive.computerhistory.org/resources/text/Oral_History/Faggin_Federico/Faggin_Federico_1_2_3.oral_history.2004.102658025.pdf">here</a>. The relevant part is on pages 55 and 56. <a class="footnote-backref" href="#fnref:faggin" title="Jump back to footnote 22 in the text">↩</a></p>
</li>
<li id="fn:ti">
<p>The Texas Instruments TMS9900 (1976) used a 64-pin package for instance, as did the Motorola 68000 (1979). <a class="footnote-backref" href="#fnref:ti" title="Jump back to footnote 23 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com6tag:blogger.com,1999:blog-6264947694886887540.post-27313075965214368512023-07-01T09:32:00.008-07:002023-09-27T19:05:36.319-07:00The complex history of the Intel i960 RISC processor<style>
.hilite {cursor:zoom-in}
a:link img.hilite, a:visited img.hilite {color: #fff;}
a:hover img.hilite {color: #f66;}
</style>
<style>
pre.microcode {font-family: courier, fixed; padding: 10px; background-color: #f5f5f5; display:inline-block;border:none;}
pre.microcode span {color: green; font-style:italic; font-family: sans-serif; font-size: 90%;}
</style>
<p>The Intel i960 was a remarkable 32-bit processor of the 1990s with a confusing set of versions.
Although it is now mostly forgotten (outside the many people who used it as an embedded processor), it has a complex history.
It had a shot at being Intel's flagship processor until x86 overshadowed it.
Later, it was the world's best-selling RISC processor.
One variant was a <b>33</b>-bit processor with a decidedly non-RISC object-oriented instruction set;
it became a military standard and was used in the F-22 fighter plane.
Another version powered Intel's short-lived Unix servers.
In this blog post, I'll take a look at the history of the i960, explain its different variants, and examine silicon dies.
This chip has a lot of mythology and confusion (especially on <a href="https://en.wikipedia.org/wiki/Intel_i960">Wikipedia</a>), so I'll
try to clear things up.</p>
<h2>Roots: the iAPX 432</h2>
<p><a href="https://static.righto.com/images/960-overview/intel432.jpg"><img alt="&quot;Intel 432&quot;: Cover detail from Introduction to the iAPX 432 Architecture." class="hilite" height="215" src="https://static.righto.com/images/960-overview/intel432-w450.jpg" title="&quot;Intel 432&quot;: Cover detail from Introduction to the iAPX 432 Architecture." width="450" /></a><div class="cite">"Intel 432": Cover detail from <a href="http://www.bitsavers.org/components/intel/iAPX_432/171821-001_Introduction_to_the_iAPX_432_Architecture_Aug81.pdf">Introduction to the iAPX 432 Architecture</a>.</div></p>
<p>The ancestry of the i960 starts in 1975, when Intel set out to design a "micro-mainframe",
a revolutionary processor that would bring the power of mainframe computers to microprocessors.
This project, eventually called the iAPX 432, was a huge leap in features and complexity.
Intel had just released the popular 8080 processor in 1974, an 8-bit processor that kicked off the hobbyist computer era with computers
such as the Altair and IMSAI.
However, 8-bit microprocessors were toys compared to 16-bit minicomputers like the PDP-11, let alone mainframes like the 32-bit IBM System/370.
Most companies were gradually taking minicomputer and mainframe features and putting them into microprocessors, but
Intel wanted to leapfrog to a mainframe-class 32-bit processor.
The processor would make programmers much more productive by bridging the "semantic gap" between high-level languages and simple processors, implementing many features directly into the processor.</p>
<!-- https://www.google.com/books/edition/Journal_of_Pascal_and_Ada/d0UsAQAAIAAJ?hl=en&gbpv=1&bsq=ada+dominant+programming+language&dq=ada+dominant+programming+language&printsec=frontcover -->
<!-- https://books.google.com/books?id=1mQDa_UuMbYC&pg=PA51&dq=ada+dominant+programming+language&hl=en&sa=X&ved=2ahUKEwjkx535sLX_AhUfFlkFHemmCtoQ6AF6BAgFEAI#v=onepage&q=ada%20dominant%20programming%20language&f=false -->
<!--
The [June 1981](https://www.computer.org/csdl/magazine/co/1981/06) issue of IEEE Computer was devoted to highly positive coverage of the Ada language.
-->
<p>The <a href="https://homes.cs.washington.edu/~levy/capabook/Chapter9.pdf">432 processor</a> included memory management, process management, and interprocess communication.
These features were traditionally part of the operating system, but Intel built them in the processor,
calling this the "<a href="http://www.bitsavers.org/components/intel/iAPX_432/171821-001_Introduction_to_the_iAPX_432_Architecture_Aug81.pdf">Silicon Operating System</a>".
The processor was also one of the first to implement the new IEEE 754 floating-point standard, still in use by most processors.
The 432 also had support for fault tolerance and multi-processor systems.
One of the most unusual features of the 432 was that instructions weren't byte aligned. Instead, instructions were between 6 and 321 bits long,
and you could jump into the middle of a byte.
Another unusual feature was that the 432 was a stack-based machine, pushing and popping values on an in-memory stack, rather than
using general-purpose registers.</p>
<p>The 432 provided hardware support for object-oriented programming, built around
an unforgeable object pointer called an Access Descriptor.
Almost every structure in a 432 program and in the system itself is a separate object.
The processor provided fine-grain security and access control by
checking every object access to ensure that the user had permission and was not exceeding the bounds of the object.
This made buffer overruns and related classes of bugs impossible, unlike on modern processors.</p>
<!-- Ada and iAPX 432 developed in parallel https://books.google.com/books?id=DT4EAAAAMBAJ&pg=PA39&dq=iapx+432+ada&hl=en&sa=X&ved=2ahUKEwiV2citlqr_AhXWlWoFHfE9AyUQuwV6BAgKEAY#v=onepage&q=iapx%20432%20ada&f=false -->
<p><a href="https://static.righto.com/images/960-overview/iapx432.jpg"><img alt="This photo from the Intel 1981 annual report shows Intel's 432-based development computer and three of the engineers." class="hilite" height="469" src="https://static.righto.com/images/960-overview/iapx432-w350.jpg" title="This photo from the Intel 1981 annual report shows Intel's 432-based development computer and three of the engineers." width="350" /></a><div class="cite">This photo from the Intel 1981 annual report shows Intel's 432-based development computer and three of the engineers.</div></p>
<p>The new, object-oriented Ada language was the primary programming language for the 432.
The US Department of Defense developed the Ada language in the late 1970s and early 1980s to provide a common language for
embedded systems, using the latest ideas from object-oriented programming.
Proponents expected Ada to become the dominant computer language for the 1980s and beyond.
In 1979, Intel realized that Ada was a good target for the iAPX 432, since they had similar object and task models.
Intel <a href="https://www.computer.org/csdl/magazine/co/1981/06/01667402/13rRUB7a19L">decided to</a>
"establish itself as an early center of Ada technology by using the language as the
primary development and application language for the
new iAPX 432 architecture."
The iAPX 432's operating system (<a href="http://www.bitsavers.org/components/intel/iAPX_432/172103-002_iMAX_432_Reference_Manual_May82.pdf">iMAX 432</a>) and other software were written in Ada, using one of the first Ada compilers.</p>
<p>Unfortunately, the iAPX 432 project was way too ambitious for its time.
After a couple of years of slow progress, Intel realized that they needed a stopgap processor to counter competitors such as Zilog and Motorola.
Intel quickly designed a 16-bit processor that they could sell until the 432 was ready.
This processor was the Intel 8086 (1978), which lives on in the x86 architecture used by most computers today.
Critically, the importance of the 8086 was not recognized at the time.
In 1981, IBM selected Intel's 8088 processor (a version of the 8086 with an 8-bit bus) for the IBM PC.
In time, the success of the IBM PC and compatible systems led to Intel's dominance of the microprocessor market, but in 1981
Intel viewed the IBM PC as just another design win.
As Intel VP Bill Davidow later said, "We knew it was an important win. We didn't realize it was the only win."</p>
<!-- Creating the Digital Future, p27 -->
<p><a href="https://static.righto.com/images/960-overview/ibm-pc.jpg"><img alt="Caption: IBM chose Intel's high performance 8088 microprocessor as the central processing unit for the IBM Personal Computer, introduced in 1981. Seven Intel peripheral components are also integrated into the IBM Personal Computer. From Intel's 1981 annual report." class="hilite" height="328" src="https://static.righto.com/images/960-overview/ibm-pc-w350.jpg" title="Caption: IBM chose Intel's high performance 8088 microprocessor as the central processing unit for the IBM Personal Computer, introduced in 1981. Seven Intel peripheral components are also integrated into the IBM Personal Computer. From Intel's 1981 annual report." width="350" /></a><div class="cite">Caption: IBM chose Intel's high performance 8088 microprocessor as the central processing unit for the IBM Personal Computer, introduced in 1981. Seven Intel peripheral components are also integrated into the IBM Personal Computer. From <a href="https://www.intel.com/content/www/us/en/history/history-1981-annual-report.html">Intel's 1981 annual report</a>.</div></p>
<p>Intel finally released the iAPX 432 in 1981.
Intel's <a href="https://www.intel.com/content/www/us/en/history/history-1981-annual-report.html">1981 annual report</a> shows the
importance of the 432 to Intel.
A section titled "The Micromainframe™ Arrives" enthusiastically described the iAPX 432 and how it would "open the door to applications not previously feasible".
To Intel's surprise, the iAPX 432 ended up as "one of the great disaster stories of modern computing" as
the New York Times <a href="https://archive.nytimes.com/www.nytimes.com/library/tech/98/04/biztech/articles/05merced.html">put it</a>.
The processor was so complicated that it was split across two very large chips:<span id="fnref:size"><a class="ref" href="#fn:size">1</a></span> one
to decode instructions and a second to execute them.
Delivered years behind schedule, the micro-mainframe had dismal performance, much worse than competitors and even the stopgap 8086.<span id="fnref:432"><a class="ref" href="#fn:432">2</a></span>
Sales were minimal and the 432 quietly dropped out of sight.</p>
<p><a href="https://static.righto.com/images/960-overview/432-dies.jpg"><img alt="My die photos of the two chips that make up the iAPX 432 General Data Processor. Click for a larger version." class="hilite" height="289" src="https://static.righto.com/images/960-overview/432-dies-w600.jpg" title="My die photos of the two chips that make up the iAPX 432 General Data Processor. Click for a larger version." width="600" /></a><div class="cite">My die photos of the two chips that make up the iAPX 432 General Data Processor. Click for a larger version.</div></p>
<h2>Intel picks a 32-bit architecture (or two, or three)</h2>
<p>In 1982, Intel still didn't realize the importance of the x86 architecture.
The follow-on 186 and 286 processors were released but without much success at first.<span id="fnref:286"><a class="ref" href="#fn:286">3</a></span>
Intel was working on the 386, a 32-bit successor to the 286, but their main customer IBM was very unenthusiastic.<span id="fnref:ibm286"><a class="ref" href="#fn:ibm286">4</a></span>
Support for the 386 was so weak that the 386 team worried that the project might be dead.<span id="fnref:386-oral-history"><a class="ref" href="#fn:386-oral-history">5</a></span>
Meanwhile, the 432 team continued their work.
Intel also had a third processor design in the works, a 32-bit VAX-like processor codenamed P4.<span id="fnref:processor-numbers"><a class="ref" href="#fn:processor-numbers">6</a></span></p>
<p>Intel recognized that developing three unrelated 32-bit processors was impractical and formed a task force to develop a
Single High-End Architecture (SHEA).
The task force didn't achieve a single architecture, but they decided to
merge the 432 and the P4 into a processor codenamed the P7, which would become the i960.
They also decided to continue the 386 project.
(Ironically, in 1986, Intel started yet another 32-bit processor, the unrelated <a href="https://spectrum.ieee.org/intel-i860">i860</a>, bringing
the number of 32-bit architectures back to three.)</p>
<p>At the time, the 386 team felt that they were treated as the
"stepchild" while the P7 project was the focus of Intel's attention.
This would change as the sales of x86-based personal computers climbed and money poured into Intel.
The 386 team would soon transform from stepchild to king.<span id="fnref2:386-oral-history"><a class="ref" href="#fn:386-oral-history">5</a></span></p>
<h2>The first release of the i960 processor</h2>
<p>Meanwhile,
the 1980 paper <a href="https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf">The case for the Reduced Instruction Set Computer</a>
proposed a revolutionary new approach for computer architecture: building Reduced Instruction Set Computers (RISC) instead of
Complex Instruction Set Computers (CISC).
The paper argued that the trend toward increasing complexity was doing more harm than good.
Instead, since "every transistor is precious" on a VLSI chip, the instruction set should be simplified,
only adding features that quantitatively improved performance.</p>
<p>The RISC approach became very popular in the 1980s.
Processors that followed the RISC philosophy generally converged on an approach with
32-bit easy-to-decode instructions,
a load-store architecture (separating computation instructions from instructions that accessed memory),
straightforward instructions that executed in one clock cycle,
and implementing instructions directly rather than through microcode.</p>
<p>The P7 project combined the RISC philosophy and the ideas from the 432 to create Intel's first RISC chip, originally called
the 80960<span id="fnref:80960"><a class="ref" href="#fn:80960">7</a></span> and later the i960.
The chip, announced in 1988, was significant enough for coverage in the <a href="https://www.nytimes.com/1988/04/06/business/company-news-a-new-family-of-intel-chips.html">New York Times</a>.
Analysts said that the chip was marketed as an embedded controller to avoid stealing sales from the 80386.
However, Intel's claimed motivation was the size of the embedded market;
Intel chip designer Steve McGeady <a href="https://archive.org/details/198902ByteMagazineVol1402PersonalWorkstationMacSE/page/n13/mode/2up?q=embedded+80960+29000">said at the time</a>, "I'd rather put an 80960 in every antiskid braking system than in every Sun workstation."
Nonetheless, Intel also used the i960 as a workstation processor, as will be described in the next section.</p>
<p>The block diagram below shows the microarchitecture of the original i960 processors.
The microarchitecture of the i960 followed most (but not all) of the common RISC design:
a large register set, mostly one-cycle instructions,
a load/store architecture, simple instruction formats, and a pipelined architecture.
The Local Register Cache contains four sets of the 16 local registers. These "<a href="https://en.wikipedia.org/wiki/Register_window">register windows</a>" allow the registers to be switched during
function calls without the delay of saving registers to the stack.
The micro-instruction ROM and sequencer hold microcode for complex instructions; microcode is highly unusual for a RISC processor.
The chip's Floating Point Unit<span id="fnref:fp-80387"><a class="ref" href="#fn:fp-80387">8</a></span> and Memory Management Unit are advanced features for the time.</p>
<p><a href="https://static.righto.com/images/960-overview/arch-xa.jpg"><img alt="The microarchitecture of the i960 XA. FPU is Floating Point Unit. IEU is Instruction Execution Unit. MMU is Memory Management Unit. From the 80960 datasheet." class="hilite" height="288" src="https://static.righto.com/images/960-overview/arch-xa-w600.jpg" title="The microarchitecture of the i960 XA. FPU is Floating Point Unit. IEU is Instruction Execution Unit. MMU is Memory Management Unit. From the 80960 datasheet." width="600" /></a><div class="cite">The microarchitecture of the i960 XA. <i>FPU</i> is Floating Point Unit. <i>IEU</i> is Instruction Execution Unit. <i>MMU</i> is Memory Management Unit. From the <a href="http://www.bitsavers.org/components/intel/i960/271159-001_80960XA_Advance_Information_Oct90.pdf">80960 datasheet</a>.</div></p>
<p>It's interesting to compare the i960 to the 432: the programmer-visible architectures are completely different, while the instruction
sets are almost identical.<span id="fnref:instruction-set-note"><a class="ref" href="#fn:instruction-set-note">9</a></span>
Architecturally, the 432 is a stack-based machine with no registers, while the i960 is a load-store machine with many registers.
Moreover, the 432 had complex variable-length instructions, while the i960 uses simple fixed-length load-store instructions.
At the low level, the instruction encodings differ due to the extreme architectural differences between the processors,
but otherwise, the operations themselves are remarkably similar, modulo some name changes.</p>
<p>The key to understanding the i960 family is that there are four architectures, ranging from a straightforward RISC processor to
a 33-bit processor implementing the 432's complex instruction set and object model.<span id="fnref:myers"><a class="ref" href="#fn:myers">10</a></span>
Each architecture adds additional functionality to the previous one:</p>
<ul>
<li>
The Core architecture consists of a "RISC-like" core.
<li>
The Numerics architecture extends Core with floating-point.
<li>
The Protected architecture extends Numerics with paged memory management, Supervisor/User protection, string instructions,
process scheduling, interprocess communication for the OS, and symmetric multiprocessing.
<li>
The Extended architecture extends Protected with object addressing/protection and interprocess communication for applications.
This architecture used an extra tag bit, so registers, the bus, and memory were 33 bits wide instead of 32.
</ul>
<p>These four versions were sold as the KA (Core), KB (Numerics), MC (Protected), and XA (Extended).
The KA chip cost $174 and the KB version cost $333, while the MC was aimed at the military market and cost a whopping $2400.
The most advanced chip (XA) was, at first, kept proprietary for use by BiiN (discussed below), but was later sold to the military.
The military versions weren't secret, but it is very hard to find documentation on them.<span id="fnref:extended"><a class="ref" href="#fn:extended">11</a></span></p>
<!-- https://www.google.com/books/edition/Circuit_Cellar_Ink/NoBVAAAAYAAJ?hl=en&gbpv=1&bsq=%22mc%22 -->
<!-- https://books.google.com/books?id=ADoEAAAAMBAJ&pg=PT43&dq=intel+%2280960%22&hl=en&sa=X&ved=2ahUKEwjIwIzbwab_AhWjSDABHUlEBS8Q6AF6BAgBEAI#v=onepage&q=intel%20%2280960%22&f=false -->
<p>The strangest thing about these four architectures is that the chips were <em>identical</em>, using the same die.
In other words, the simple Core chip included all the circuitry for floating point, memory management, and objects; these features just weren't used.<span id="fnref:features"><a class="ref" href="#fn:features">12</a></span>
The die photo below shows the die, with the main functional
units labeled.
Around the edge of the die are the bond pads that connect the die to the external pins.
Note that the right half of the chip has almost no bond pads. As a result, the packaged IC had many unused pins.<span id="fnref:no-connection"><a class="ref" href="#fn:no-connection">13</a></span></p>
<p><a href="https://static.righto.com/images/960-overview/80960MC-labeled.jpg"><img alt="The i960 KA/KB/MC/XA with the main functional blocks labeled. Click this image (or any other) for a larger version. Die image courtesy of Antoine Bercovici. Floorplan from The 80960 microprocessor architecture." class="hilite" height="603" src="https://static.righto.com/images/960-overview/80960MC-labeled-w600.jpg" title="The i960 KA/KB/MC/XA with the main functional blocks labeled. Click this image (or any other) for a larger version. Die image courtesy of Antoine Bercovici. Floorplan from The 80960 microprocessor architecture." width="600" /></a><div class="cite">The i960 KA/KB/MC/XA with the main functional blocks labeled. Click this image (or any other) for a larger version. Die image courtesy of Antoine Bercovici. Floorplan from <a href="https://archive.org/details/80960microproces0000myer/page/15/mode/1up">The 80960 microprocessor architecture</a>.</div></p>
<p>One advanced feature of the i960 is register scoreboarding, visible in the upper-left corner of the die.
The idea is that loading a register from memory is slow, so to improve performance, the processor executes the following instructions
while the load completes, rather than waiting.
Of course, an instruction can't be executed if it uses a register that is being loaded, since the value isn't there.
The solution is a "scoreboard" that tracks which registers are valid and which are still being loaded, and blocks an instruction
if the register isn't ready.
The i960 could handle up to three outstanding reads, providing a significant performance gain.</p>
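<p>The scoreboard logic can be sketched in a few lines of Python (a simplified model of the idea, not the i960's actual circuitry): a pending bit per register blocks dependent instructions, and a counter limits how many loads are in flight.</p>

```python
# A toy model of register scoreboarding (illustrative only): each register
# has a "pending" bit, set when a load is issued and cleared when the data
# arrives. A following instruction may issue only if none of the registers
# it uses is still pending.

class Scoreboard:
    MAX_OUTSTANDING = 3            # the i960 allowed three outstanding reads

    def __init__(self, num_regs=16):
        self.pending = [False] * num_regs
        self.outstanding = 0

    def issue_load(self, dest):
        """Start a load into register `dest`; stall if too many in flight."""
        if self.outstanding == self.MAX_OUTSTANDING:
            return False           # must wait for a load to complete
        self.pending[dest] = True
        self.outstanding += 1
        return True

    def load_complete(self, dest):
        self.pending[dest] = False
        self.outstanding -= 1

    def can_issue(self, regs):
        """A following instruction may issue if its registers are ready."""
        return not any(self.pending[r] for r in regs)

sb = Scoreboard()
sb.issue_load(3)                  # load into r3 starts; execution continues
print(sb.can_issue([1, 2]))       # True: doesn't touch r3
print(sb.can_issue([2, 3]))       # False: r3 not ready, instruction stalls
sb.load_complete(3)
print(sb.can_issue([2, 3]))       # True
```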
<p>The most complex i960 architecture is the Extended architecture, which provides the object-oriented system.
This architecture is designed around an unforgeable pointer called an Access Descriptor that provides protected access to
an object.
What makes the pointer unforgeable is that it is 33 bits long with an extra bit that indicates an Access Descriptor.
You can't set this bit with a regular 32-bit instruction. Instead, an Access Descriptor can only be created with a special
privileged instruction, "Create AD".<span id="fnref:objects-432"><a class="ref" href="#fn:objects-432">14</a></span></p>
<p><a href="https://static.righto.com/images/960-overview/object-diagram.jpg"><img alt="An Access Descriptor is a pointer to an object table. From BiiN Object Computing." class="hilite" height="249" src="https://static.righto.com/images/960-overview/object-diagram-w550.jpg" title="An Access Descriptor is a pointer to an object table. From BiiN Object Computing." width="550" /></a><div class="cite">An Access Descriptor is a pointer to an object table. From <a href="http://www.bitsavers.org/pdf/biin/BiiN_Object_Computing.pdf">BiiN Object Computing</a>.</div></p>
<p>The diagram above shows how objects work.
The 33-bit Access Descriptor (AD) has its tag bit set to 1, indicating that it is a valid Access Descriptor.
The Rights field controls what actions can be performed by this object reference.
The AD's Object Index references the Object Table that holds information about each object.
In particular, the Base Address and Size define the object's location in memory and ensure that an access cannot exceed
the bounds of the object.
The Type Definition defines the various operations that can be performed on the object.
Since this is all implemented by the processor at the instruction level, it provides strict security.</p>
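<p>Here is a rough software model of how an Access Descriptor gates an access, following the diagram above. (The field widths and object-table layout are my illustrative assumptions, not the actual i960 XA encoding.) The key point is that the tag bit, the Rights field, and the bounds check must all pass before the access proceeds.</p>

```python
# Sketch of an Access Descriptor (AD) gating an object access. Field
# widths are illustrative assumptions, not the real i960 XA encoding.

TAG = 1 << 32                      # the 33rd bit: set only on a valid AD

def make_ad(rights, obj_index):
    """Models the privileged "Create AD" instruction: only the processor
    can set the tag bit; ordinary 32-bit arithmetic cannot."""
    return TAG | (rights << 24) | obj_index

# A tiny object table: index -> (base address, size).
object_table = {7: (0x4000, 0x100)}

READ = 0b001                       # one bit of the Rights field

def load_via_ad(ad, offset):
    if not (ad & TAG):
        raise PermissionError("not an Access Descriptor (tag bit clear)")
    rights = (ad >> 24) & 0xFF
    if not (rights & READ):
        raise PermissionError("AD does not grant read access")
    base, size = object_table[ad & 0xFFFFFF]
    if offset >= size:
        raise IndexError("access exceeds object bounds")
    return base + offset           # the address actually read

ad = make_ad(READ, 7)
print(hex(load_via_ad(ad, 0x10)))  # → 0x4010: within bounds, allowed
forged = ad & 0xFFFFFFFF           # a 32-bit value cannot carry the tag bit
# load_via_ad(forged, 0) raises PermissionError: the forgery is rejected
```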
<h2>Gemini and BiiN</h2>
<p>The i960 was heavily influenced by a partnership called Gemini and then BiiN.
In 1983, near the start of the i960 project, Intel formed a partnership with Siemens to build high-performance fault-tolerant servers.
In this partnership, Intel would provide the hardware while Siemens developed the software.
This partnership allowed Intel to move beyond the chip market to the potentially lucrative systems market, while adding powerful systems to
Siemens' product line.
The Gemini team contained many of the people from the 432 project and wanted to continue the 432's architecture.
Gemini worked closely with the developers of the i960 to ensure the new processor would meet their needs; both teams worked
in the same building at Intel's Jones Farm site in Oregon.</p>
<p><a href="https://static.righto.com/images/960-overview/biin60.jpg"><img alt="The BiiN 60 system. From BiiN 60 Technical Overview." class="hilite" height="440" src="https://static.righto.com/images/960-overview/biin60-w400.jpg" title="The BiiN 60 system. From BiiN 60 Technical Overview." width="400" /></a><div class="cite">The BiiN 60 system. From <a href="http://bitsavers.org/pdf/biin/BiiN_60_Technical_Overview_Oct88.pdf">BiiN 60 Technical Overview</a>.</div></p>
<p>In 1988, shortly after the announcement of the i960 chips, the Intel/Siemens partnership was spun off into a company called BiiN.<span id="fnref:biin"><a class="ref" href="#fn:biin">15</a></span>
BiiN announced two high-performance, fault-tolerant, multiprocessor systems.
These systems used the i960 XA processor<span id="fnref:xa"><a class="ref" href="#fn:xa">16</a></span> and took full advantage of the object-oriented model
and other features provided by its Extended architecture.
The <a href="https://books.google.com/books?id=pDsEAAAAMBAJ&pg=PA35">BiiN 20</a> was designed for departmental computing and cost $43,000 to $80,000. It supported 50 users (connected by terminals) on one 5.5-MIPS i960 processor.
The larger BiiN 60 handled up to 1000 terminals and cost $345,000 to $815,000.
The Unix-compatible BiiN operating system (BiiN/OS) and utilities were written in 2 million lines of Ada code.</p>
<p>BiiN described many <a href="https://books.google.com/books?id=pFcxi38ScwAC&pg=PP10">potential markets</a> for these systems: government, factory automation, financial services, on-line transaction processing,
manufacturing, and health care.
Unfortunately, as <a href="https://www.cs.drexel.edu/~jjohnson/2012-13/fall/cs281/resources/Embedded_Processors_(ExtremeTech).pdf">ExtremeTech put it</a>, "the market for fault-tolerant Unix workstations was approximately nil."
BiiN was shut down in 1989, just 15 months after its creation, as profitability kept receding further into the future.
BiiN earned the nickname "Billions invested in Nothing"; the actual investment was 1700 person-years and $430 million.</p>
<!-- https://www.google.com/books/edition/Entrepreneurship_and_Innovation_in_Secon/OgUJJPuDvfQC?hl=en&gbpv=1&bsq=960 -->
<!--
![The BiiN 20 system. From <a href="https://archive.org/details/bub_gb_pDsEAAAAMBAJ_2/page/n33/mode/2up">InfoWorld</a>.](biin20.jpg "w300")
-->
<!-- BiiN is discussed in detail in https://ieeexplore.ieee.org/document/63665 -->
<!--
![The process control block. From <a href="http://www.bitsavers.org/components/intel/i960/271081-001_80960MC_Programmers_Reference_Manual_Jul88.pdf">80960MC Programmer's Reference Manual</a>.](process-control-block.jpg "w400")
-->
<!--
You might argue that these instructions are essentially subroutines in ROM, built up from RISC instructions.
For the most part, that's true.
However, the process scheduling operations are tied into an internal processor timer, so there is some hardware support.
-->
<!--
[PC Magazine](https://books.google.com/books?id=oabZ_SN3dxEC&pg=PA51&hl=en&sa=X&ved=2ahUKEwjE5Zf31cb-AhUMkYkEHZt2C_wQ6AF6BAgBEAI#v=onepage&q&f=false)
states that the 386 team moved to the i960 after the 386 project (so presumably around 1985).
Wikipedia [states](https://en.wikipedia.org/wiki/Intel_i960#End_of_development) that the i960 team was moved to the Pentium Pro
team in 1990.
Glenn Hinton i960CA then Pentium Pro senior architect. https://www.anandtech.com/show/16438/new-intel-ceo-making-waves-rehiring-retired-cpu-architects
Gurbir Singh and Nitin Sarangdhar designed the i960 bus and moved to Pentium Pro team.
-->
<h2>The superscalar i960 CA</h2>
<p>One year after the first i960, Intel released the groundbreaking i960 CA.
This chip was the world's first superscalar microprocessor, able to execute more than one instruction per clock cycle.
The chip had three execution units that could operate in parallel:
an integer execution unit, a multiply/divide unit, and an address generation unit that could also do integer arithmetic.<span id="fnref:ca"><a class="ref" href="#fn:ca">17</a></span>
To keep the execution units busy, the i960 CA's instruction sequencer examined four instructions at once and determined which ones could be issued in parallel without
conflict.
It could issue two instructions and a branch each clock cycle, using branch prediction to speculatively execute branches out of order.</p>
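<p>A toy model shows the flavor of this issue logic (a simplification of the idea, not the real i960 CA sequencer): scan a window of four instructions and issue, in one cycle, the longest prefix with no register or execution-unit conflicts.</p>

```python
# Toy sketch of superscalar instruction issue: examine a window of decoded
# instructions and group them into cycles, stopping a group when an
# instruction reads a register written earlier in the same group or needs
# an execution unit that is already busy. (Illustrative only.)

def issue_groups(program, window=4):
    """program: list of (unit, dest_reg, src_regs). Returns per-cycle groups."""
    groups, i = [], 0
    while i < len(program):
        group, busy_units, written = [], set(), set()
        for unit, dest, srcs in program[i:i + window]:
            if unit in busy_units or any(s in written for s in srcs):
                break                      # conflict: end the issue group
            group.append((unit, dest, srcs))
            busy_units.add(unit)
            written.add(dest)
        groups.append(group)
        i += len(group)
    return groups

prog = [
    ("integer", "r1", ["r2", "r3"]),   # r1 = r2 + r3
    ("muldiv",  "r4", ["r5", "r6"]),   # independent: issues the same cycle
    ("integer", "r7", ["r1", "r4"]),   # needs r1 and r4: waits a cycle
]
print([len(g) for g in issue_groups(prog)])   # → [2, 1]
```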
<p><a href="https://static.righto.com/images/960-overview/80960CA-labeled.jpg"><img alt="The i960 CA die, with functional blocks labeled. Photo courtesy of Antoine Bercovici. Functional blocks from the datasheet." class="hilite" height="457" src="https://static.righto.com/images/960-overview/80960CA-labeled-w700.jpg" title="The i960 CA die, with functional blocks labeled. Photo courtesy of Antoine Bercovici. Functional blocks from the datasheet." width="700" /></a><div class="cite">The i960 CA die, with functional blocks labeled. Photo courtesy of Antoine Bercovici. Functional blocks from <a href="http://datasheets.chipdb.org/Intel/80960/PRODBREF/272211_3.PDF">the datasheet</a>.</div></p>
<p>Following the CA, several other superscalar variants were produced:
the CF had more cache, the military MM implemented the Protected architecture (memory management and a floating point unit),
and the military MX implemented the Extended architecture (object-oriented).</p>
<p>The image below shows the 960 MX die with the main functional blocks labeled.
(I think the MM and MX used the same die but I'm not sure.<span id="fnref:mm"><a class="ref" href="#fn:mm">18</a></span>)
Like the i960 CA, this chip has multiple functional units that can be operated in parallel for its superscalar execution.
Note the wide buses between various blocks, allowing high internal bandwidth.
The die was too large for the optical projection of the mask, with the result that the corners of the circuitry needed to be rounded off.</p>
<p><a href="https://static.righto.com/images/960-overview/die-labeled.jpg"><img alt="The i960MX die with the main functional blocks labeled. This is a die photo I took, with labels based on my reverse engineering." class="hilite" height="699" src="https://static.righto.com/images/960-overview/die-labeled-w600.jpg" title="The i960MX die with the main functional blocks labeled. This is a die photo I took, with labels based on my reverse engineering." width="600" /></a><div class="cite">The i960MX die with the main functional blocks labeled. This is a die photo I took, with labels based on my reverse engineering.</div></p>
<p>The block diagram of the i960 MX shows the complexity of the chip and how it is designed for parallelism.
The register file is the heart of the chip. It is multi-ported so up to 6 registers can be accessed at the same time.
Note the multiple, 256-bit wide buses between the register file and the various functional units.
The chip has two buses: a high-bandwidth Backside Bus between the chip and its external cache and private memory;
and a New Local Bus, which runs at half the speed and connects the chip to main memory and I/O.
For highest performance, the chip's software would access its private memory over the high-speed bus, while using the
slower bus for I/O and shared memory accesses.</p>
<p><a href="https://static.righto.com/images/960-overview/960mx-block-diagram.jpg"><img alt="A functional block diagram of the i960 MX. From Intel Military and Special Projects Handbook, 1993." class="hilite" height="405" src="https://static.righto.com/images/960-overview/960mx-block-diagram-w700.jpg" title="A functional block diagram of the i960 MX. From Intel Military and Special Projects Handbook, 1993." width="700" /></a><div class="cite">A functional block diagram of the i960 MX. From Intel Military and Special Projects Handbook, 1993.</div></p>
<h2>Military use and the JIAWG standard</h2>
<p>The i960 had a special role in the US military.
In 1987, the military <a href="https://people.cs.kuleuven.be/~dirk.craeynest/ada-belgium/archive/ase/ase02_01/docs/pol_hist/policy/3405-1.txt">mandated</a> the use of Ada as the single, common computer programming language for Defense computer resources in most cases.<span id="fnref:ada"><a class="ref" href="#fn:ada">19</a></span>
In 1989, the military created the JIAWG standard, which selected two 32-bit instruction set architectures for military avionics.
These architectures were the i960's Extended architecture (implemented by the i960 XA) and the MIPS architecture (based on a RISC project at Stanford).<span id="fnref:standard"><a class="ref" href="#fn:standard">20</a></span>
The superscalar i960 MX processor described earlier soon became a popular JIAWG-compliant processor, since it had higher performance than the XA.</p>
<!-- Ada seems destined to become the dominant programming language of the 1980s. -->
<!-- https://www.computer.org/csdl/magazine/co/1981/06/01667393/13rRUB7a13Y -->
<!--
I should point out that you didn't need to use an Extended architecture chip to run Ada; there were Ada compilers
available for all the i960 processors, although without the hardware enforcement that the Extended architecture processors provided.
-->
<!-- http://intel-vintage-developer.eu5.org/DESIGN/I960/DEVTOOLS/I960.HTM -->
<p>Hughes designed a modular avionics processor that used the i960 XA and later the MX.
A dense module called the HAC-32 contained two i960 MX processors, 2 MB of RAM, and an I/O controller in a 2"×4" multi-chip module, slightly bigger than a credit card.
This module had bare dies bonded to the substrate, maximizing the density.
In the photo below, the two largest dies are the i960 MX while the numerous gray rectangles are memory chips.
This module was used in the F-22's Common Integrated Processor, the RAH-66 Comanche helicopter (which was canceled), the F/A-18's Stores Management Processor
(the computer that controls attached weapons), and the AN/ALR-67 radar computer.</p>
<p><a href="https://static.righto.com/images/960-overview/hughes.jpg"><img alt="The Hughes HAC-32. From Avionics Systems Design." class="hilite" height="413" src="https://static.righto.com/images/960-overview/hughes-w600.jpg" title="The Hughes HAC-32. From Avionics Systems Design." width="600" /></a><div class="cite">The Hughes HAC-32. From <a href="https://archive.org/details/avionicsystemsde0000newp/page/40/mode/2up?q=hughes&view=theater">Avionics Systems Design</a>.</div></p>
<p>The military market is difficult due to the long timelines of military projects, unpredictable volumes, and the risk
of cancellations.
In the case of the F-22 fighter plane, the project started in 1985 when the Air Force sent out proposals for a new Advanced Tactical Fighter.
Lockheed built a YF-22 prototype, first flying it in 1990.
The Air Force selected the YF-22 over the competing YF-23 in 1991 and the project moved to full-scale development.
During this time, at least three generations of processors became obsolete.
In particular, the i960MX was out of production by the time the F-22 first flew in <a href="https://www.militaryaerospace.com/computers/article/16710716/f22-avionics-designers-rely-on-obsolescent-electronics-but-plan-for-future-upgrades">1997</a>.
At one point, the military had to pay Intel $22 million to restart the i960 production line.
In 2001, the Air Force started a switch to the PowerPC processor, and finally the plane entered military service in 2005.
The F-22 illustrates how the fast-paced obsolescence of processors is a big problem for decades-long military projects.</p>
<!--
The i960 production line was permanently [shut down](https://www.gao.gov/assets/gao-04-391.pdf) in January 2004 after the Air Force made its last purchase of 820 chips.
-->
<p><a href="https://static.righto.com/images/960-overview/cip.jpg"><img alt="The Common Integrated Processor for the F-22, presumably with i960 MX chips inside. It is the equivalent of two Cray supercomputers and was the world's most advanced, high-speed computer system for a fighter aircraft. Source: NARA/Hughes Aircraft Co./T.W. Goosman." class="hilite" height="325" src="https://static.righto.com/images/960-overview/cip-w500.jpg" title="The Common Integrated Processor for the F-22, presumably with i960 MX chips inside. It is the equivalent of two Cray supercomputers and was the world's most advanced, high-speed computer system for a fighter aircraft. Source: NARA/Hughes Aircraft Co./T.W. Goosman." width="500" /></a><div class="cite">The Common Integrated Processor for the F-22, presumably with i960 MX chips inside. It is the equivalent of two Cray supercomputers and was the world's most advanced, high-speed computer system for a fighter aircraft. Source: <a href="https://nara.getarchive.net/media/a-close-up-view-of-the-common-integrated-processor-which-was-developed-for-704456">NARA/Hughes Aircraft Co./T.W. Goosman</a>.</div></p>
<p>Intel charged thousands of dollars for each i960 MX and each F-22 contained a cluster of 35 i960 MX processors, so the military
market was potentially lucrative. The Air Force originally planned to buy 750 planes, but cut this down to just 187, which must
have been a blow to Intel.
As for the Comanche helicopter, the Army planned to buy 1200 of them, but the program was canceled entirely after building two prototypes.
The point is that the military market is risky and low volume even in the best circumstances.<span id="fnref:tredennick"><a class="ref" href="#fn:tredennick">21</a></span>
In 1998, Intel decided to <a href="https://www.militaryaerospace.com/computers/article/16710194/intel-set-to-quit-military-business">leave</a> the military business entirely, joining AMD and Motorola.</p>
<!--
The i960 MX was also used in the [MLRS rocket system](https://apps.dtic.mil/sti/pdfs/ADA295248.pdf) while its [ATACMS missile](https://archive.org/details/DTIC_ADA298330/page/n272/mode/1up) used
dual i960s.
-->
<p>Foreign militaries also made use of the i960. In 2008 a businessman was sentenced to <a href="https://www.justice.gov/archive/opa/pr/2008/June/08-ag-540.html">35 months in prison</a> for illegally exporting hundreds of i960 chips into India for use in the radar for the Tejas Light Combat Aircraft.</p>
<!-- http://archive.indianexpress.com/news/exhal-staffer-pleads-guilty-in-us-to-illegal-exports-for-indian-missile-programme/284734/0 -->
<!-- https://www.globalsecurity.org/military/world/india/lca-design.htm -->
<h2>i960: the later years</h2>
<p>By 1990, the i960 was selling well, but the landscape at Intel had changed.
The 386 processor was enormously successful, due to the Compaq Deskpro 386 and other systems, leading to Intel's first
billion-dollar quarter.
The 8086 had started as a stopgap processor to fill a temporary marketing need, but
now the x86 was Intel's moneymaking engine.
As part of a reorganization, the i960 project was transferred to Chandler, Arizona.
Much of the i960 team in Oregon moved to the newly-formed Pentium Pro team, while others ended up on the 486 DX2 processor.
This wasn't the end of the i960, but the intensity had reduced.</p>
<p>To reduce system cost, Intel produced versions of the i960 that had a 16-bit bus, although the processor was 32 bits internally.
(This is the same approach that Intel used with the 8088 processor, a version of the 8086 processor with an 8-bit bus instead
of 16.)
The i960 SB implemented the "Numerics" architecture, that is, it included a floating-point unit.
Looking at the die below, we can see that the SB design is rather "lazy", simply the previous die (KA/KB/MC/XA) with a thin layer of circuitry around
the border to implement the 16-bit bus.
Even though the SB didn't support memory management or objects, Intel didn't remove that circuitry.
The process was reportedly moved from 1.5 microns to 1 micron, shrinking the die to 270 mils square.</p>
<!-- "Intel adds low-end, high-end 960 processors" from Microprocessor Reports -->
<p><a href="https://static.righto.com/images/960-overview/960SB.jpg"><img alt="Comparison of the original i960 die and the i960 SB. Photos courtesy of Antoine Bercovici." class="hilite" height="391" src="https://static.righto.com/images/960-overview/960SB-w700.jpg" title="Comparison of the original i960 die and the i960 SB. Photos courtesy of Antoine Bercovici." width="700" /></a><div class="cite">Comparison of the original i960 die and the i960 SB. Photos courtesy of Antoine Bercovici.</div></p>
<p>The next chip, the i960 SA, was the 16-bit-bus "Core" architecture, without floating point.
The SA was based on the SB but Intel finally removed unused functionality from the die, making the die about 24% smaller.
The diagram below shows how the address translation, translation lookaside buffer, and floating point unit were removed,
along with much of the microcode (yellow).
The instruction cache tags (purple), registers (orange), and execution unit (green) were moved to fit into the
available space.
The left half of the chip remained unchanged.
The driver circuitry around the edges of the chip was also tightened up, saving a bit of space.</p>
<p><a href="https://static.righto.com/images/960-overview/SB-SA.jpg"><img alt="This diagram compares the SB and SA chips. Photos courtesy of Antoine Bercovici." class="hilite" height="383" src="https://static.righto.com/images/960-overview/SB-SA-w700.jpg" title="This diagram compares the SB and SA chips. Photos courtesy of Antoine Bercovici." width="700" /></a><div class="cite">This diagram compares the SB and SA chips. Photos courtesy of Antoine Bercovici.</div></p>
<p>Intel introduced the high-performance Hx family around 1994.
This family was superscalar like the CA/CF, but the Hx chips also had a faster clock, much more cache, and additional functionality such
as timers and a guarded memory unit.
The Jx family was introduced as the midrange, cost-effective line, faster and better than the original chips but not superscalar like the Hx.
Intel attempted to move the i960 into the I/O controller market with the Rx family and the VH.<span id="fnref:vh"><a class="ref" href="#fn:vh">23</a></span>
This was part of Intel's Intelligent Input/Output specification (I2O), which was a failure overall.</p>
<p>For a while, the i960 was a big success in the marketplace and was used in many products. Laser printers and graphical terminals were key applications, both taking
advantage of the i960's high speed to move pixels.
The i960 was the world's best-selling RISC chip in <a href="https://www.intc.com/filings-reports/all-sec-filings/content/0000050863-95-000004/10-K.txt">1994</a>.
However, without focused development, the performance of the i960 fell behind the competition, and its market share rapidly dropped.</p>
<p><a href="https://static.righto.com/images/960-overview/market-share.jpg"><img alt="Market share of embedded RISC processors. From ExtremeTech." class="hilite" height="211" src="https://static.righto.com/images/960-overview/market-share-w350.jpg" title="Market share of embedded RISC processors. From ExtremeTech." width="350" /></a><div class="cite">Market share of embedded RISC processors. From <a href="https://www.cs.drexel.edu/~jjohnson/2012-13/fall/cs281/resources/Embedded_Processors_(ExtremeTech).pdf">ExtremeTech</a>.</div></p>
<p>By the late 1990s, the i960 was described with terms such as "aging", "venerable", and "medieval".
In 1999, <a href="https://websrv.cecs.uci.edu/~papers/mpr/MPR/19990510/130601.pdf">Microprocessor Report</a> described the situation:
"The i960
survived on cast-off semiconductor processes two to three
generations old; the i960CA is still built in a 1.0-micron process (perhaps by little old ladies with X-Acto knives)."<span id="fnref:rubylith"><a class="ref" href="#fn:rubylith">22</a></span></p>
<p>One of the strongest competitors was DEC's powerful StrongARM processor design, a descendant of the ARM chip.
Even Intel's top-of-the-line i960HT
<a href="https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/100201.pdf">fared pitifully</a>
against the StrongARM, with worse cost, performance, and power consumption.
In 1997, DEC sued Intel, claiming that the Pentium infringed ten of DEC's patents.
As part of the complex but mutually-beneficial <a href="https://www.nytimes.com/1997/10/28/business/intel-and-digital-settle-lawsuit-and-make-deal.html">1997 settlement</a>, Intel obtained rights to the StrongARM chip.
As Intel turned its embedded focus from i960 to StrongARM,
<a href="https://techmonitor.ai/technology/intel_speeds_up_the_i960_chip_by_adding_more_cache">one writer</a> wrote, "Things are looking somewhat bleak for Intel Corp's ten-year-old i960 processor."
The i960 limped on for another decade until
Intel officially <a href="https://www.theregister.com/2006/05/18/intel_cans_386_486_960_cpus/">ended production</a> in 2007.</p>
<!--
Ironically, the i960 outlasted the StrongARM, which Intel [abandoned](https://www.infoworld.com/article/2678798/intel-puts-strongarm-on-death-row.html) in 2004 for the follow-on XScale architecture.
-->
<!--
[Fred Pollack](https://ieeexplore.ieee.org/author/37724570200), manager of the i960 architecture, became architecture
manager of the Pentium Pro in 1990.
Glenn Hinton, who led the superscalar i960 CA design, moved to the Pentium Pro team.
Gurbir Singh and Nitin Sarangdhar, who designed the i960 bus, also moved to Pentium Pro team.
Randy Steck, who (I think) designed the i960 MM, became project leader for the Pentium Pro.
Bob Bentley moved from validation manager of the i960 to Director of pre-silicon validation for the Pentium Pro in 1992.
The 486DX2 team was formed from engineeres from the i960 according to the book [Entrepreneurship and Innovation in Second Tier Regions](https://www.google.com/books/edition/Entrepreneurship_and_Innovation_in_Secon/OgUJJPuDvfQC?hl=en&gbpv=1&pg=PA102&printsec=frontcover).
-->
<!--
The i960 didn't end at that point of course; new engineers were brought in create the Hx and Hx processors, such as
Richard Brunner who joined Intel from DEC in 1992.
-->
<!-- https://www.linkedin.com/in/randy-steck-430821/details/experience/ -->
<h2>RISC or CISC?</h2>
<p>The i960 challenges the definitions of RISC and CISC processors.<span id="fnref:risc"><a class="ref" href="#fn:risc">24</a></span>
It is generally considered a RISC processor, but its architect
says "RISC techniques were used for high performance, CISC techniques for ease of use."<span id="fnref:upr"><a class="ref" href="#fn:upr">25</a></span>
John Mashey of MIPS described it as on the RISC/CISC border<span id="fnref:mashey"><a class="ref" href="#fn:mashey">26</a></span>
while Steve Furber (co-creator of ARM) <a href="https://amzn.to/3JyCoix">wrote</a> that it "includes many RISC ideas, but it is not a simple chip" with "many
complex instructions which make recourse to microcode" and a design that "is more reminiscent of a complex,
mainframe architecture than a simple, pipelined RISC."
And they were talking about the i960 KB with the simple Numerics architecture, not the complicated Extended architecture!</p>
<p>Even the basic Core architecture has many non-RISC-like features.
It has microcoded instructions that take multiple cycles (such as integer multiplication),
numerous addressing modes<span id="fnref:addressing"><a class="ref" href="#fn:addressing">27</a></span>, and unnecessary instructions (e.g. AND NOT as well as NOT AND).
It also has a large variety of datatypes, even more than the 432:
integer (8, 16, 32, or 64 bit), ordinal (8, 16, 32, or 64 bit),
decimal digits, bit fields, triple-word (96 bits), and quad-word (128 bits).
The Numerics architecture adds floating-point reals (32, 64, or 80 bit) while the Protected architecture adds byte strings
with decidedly CISC-like instructions to act on them.<span id="fnref:string"><a class="ref" href="#fn:string">28</a></span></p>
<p>When you get to the Extended architecture with objects, process management, and interprocess communication instructions, the
large instruction set seems obviously CISC.<span id="fnref:instruction-set"><a class="ref" href="#fn:instruction-set">29</a></span> (The instruction set is essentially the same as the 432's, and the 432 is an extremely CISC processor.)
You could argue that the i960 Core architecture is RISC and the Extended architecture is CISC, but the problem is that they are identical chips.</p>
<p>Of course, it doesn't really matter if the i960 is considered RISC, CISC, or CISC instructions running on a RISC core.
But the i960 shows that RISC and CISC aren't as straightforward as they might seem.</p>
<h2>Summary</h2>
<p>The i960 chips can be confusing since there are four architectures, along with scalar vs. superscalar, and multiple families over time.
I've made the table below to summarize the i960 family and the approximate dates.
The upper entries are the scalar families while the lower entries are superscalar.
The columns indicate the four architectural variants; although the i960 started with four variants, eventually Intel
focused on only the Core.
Note that each "x" family represents multiple chips.</p>
<style type="text/css">
table.i960 {border-collapse: collapse;}
table.i960 th, table.i960 td {padding: 0 5px;}
table.i960 td {border-top: 1px solid #888;}
table.i960 tr.ul td {border-top: none;}
table.i960 th, table.i960 td {border-left: 1px solid #888;}
table.i960 th:first-child, table.i960 td:first-child {border-left: none !important; border-right: none;}
table.i960 tr.brk {border-top: 2px solid #888;}
</style>
<table class="i960">
<tr><th>Core</th><th>Numerics</th><th>Protected</th><th>Extended</th><th></th></tr>
<tr class="brK"><td>KA</td><td>KB</td><td>MC</td><td>XA</td><td>Original (1988)</td></tr>
<tr><td>SA</td><td>SB</td><td> </td><td> </td><td>Entry level, 16-bit data bus (1991)</td></tr>
<tr><td>Jx</td><td> </td><td> </td><td> </td><td>Midrange (1993-1998)</td></tr>
<tr><td>Rx,VH</td><td> </td><td> </td><td> </td><td>I/O interface (1995-2001)</td></tr>
<tr class="brk"><td>CA,CF</td><td> </td><td>MM</td><td>MX</td><td>Superscalar (1989-1992)</td></tr>
<tr><td>Hx</td><td> </td><td> </td><td> </td><td>Superscalar, higher performance (1994)</td></tr>
</table>
<!--
I should point out that the Berkeley RISC paper considers one justification for an instruction to be that it is "unsynthesizable",
that you can't replace it with multiple simpler instructions
This condition is met by the instructions for unforgeable object pointers, since they need to be implemented by the
processor. (That is, the instructions are needed for security and can't be replaced by other instructions, just like
a Supervisor Call instruction.)
Thus, you could argue that these instructions satisfy the RISC philosophy.
The Berkeley RISC paper also "allows" instructions that provide a strong performance benefit such as floating point,
although putting floating point in a RISC chip was controversial with many people.
-->
<p>Although the i960 is now mostly forgotten, it was an innovative processor for the time.
The first generation was Intel's first RISC chip, but pushed the boundary of RISC with many CISC-like features.
The i960 XA literally set the standard for military computing, selected by the JIAWG as the military's architecture.
The i960 CA provided a performance breakthrough with its superscalar architecture.
But Moore's Law means that competitors can rapidly overtake a chip, and the i960 ended up as history.</p>
<p>Thanks to Glen Myers, Kevin Kahn, Steven McGeady, and others from Intel for answering my questions about the i960. Thanks to Prof. Paul Lubeck for obtaining documentation for me.
I plan to write more, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I've also started experimenting with Mastodon recently as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>
and Bluesky as <a href="https://staging.bsky.app/profile/righto.com">@righto.com</a> so you can follow me there too.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:size">
<p>The 432 used two chips for the processor and a third chip for I/O. At the time, these were said to be
"three of the largest integrated circuits in history."
The first processor chip contained more than 100,000 devices, making it "one of the densest VLSI circuits to have been fabricated so far."
The article also says that the 432 project "was the largest investment in a single program that Intel has ever made."
See <a href="https://worldradiohistory.com/Archive-Electronics/80s/81/Electronics-1981-02-24.pdf">Ada determines architecture of 32-bit microprocessor</a>, Electronics, Feb 24, 1981, pages 119-126, a very detailed article on the 432 written by the lead engineer and the team's manager. <a class="footnote-backref" href="#fnref:size" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:432">
<p>The performance problems of the iAPX 432 were revealed by a student project at Berkeley, <a href="https://dl.acm.org/doi/pdf/10.1145/641542.641545">A performance evaluation of the Intel iAPX 432</a>, which compared its performance with the VAX-11/780, Motorola 68000, and Intel 8086.
Instead of providing mainframe performance, the 432 had a fraction of the performance of the competing systems.
Another interesting paper is <a href="https://dl.acm.org/doi/10.1145/45059.214411">Performance effects of architectural complexity in the Intel 432</a>, which examines in detail what the 432 did wrong.
It concludes that the 432 could have been significantly faster, but would still have been slower than its contemporaries.
An author of the paper was Robert Colwell, who was later hired by Intel and designed the highly-successful Pentium Pro architecture. <a class="footnote-backref" href="#fnref:432" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:286">
<p>You might expect the 8086, 186, and 286 processors to form a nice progression, but it's a bit more complicated.
The 186 and 286 processors were released at the same time.
The 186 essentially took the 8086 and several support chips and integrated them onto a single die.
The 286, on the other hand, extended the 8086 with memory management.
However, its segment-based memory management was a bad design, using ideas from the Zilog MMU, and wasn't popular.
The 286 also had a protected mode, so multiple processes could be isolated from each other.
Unfortunately, protected mode had some serious problems.
Bill Gates famously called the 286 "brain-damaged"
echoing PC Magazine editor <a href="https://archive.org/details/PCMAG/PC-Mag-1989-05-30/page/96/mode/2up">Bill Machrone</a>
and writer <a href="https://books.google.com/books?id=HpsOD9ZeqScC&lpg=PA62&pg=PA62#v=onepage&q&f=false">Jerry Pournelle</a>,
who both wanted credit for originating the phrase.</p>
<p>By 1984, however, the 286 was Intel's star due to growing sales of IBM PCs and compatibles that used the chip.
Intel's <a href="https://www.intel.com/content/www/us/en/history/history-1984-annual-report.html">1984 annual report</a> featured
"The Story of the 286", a glowing 14-page tribute to the 286. <a class="footnote-backref" href="#fnref:286" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:ibm286">
<p>Given IBM's success with the IBM PC, Intel was puzzled that IBM wasn't interested in the 386 processor.
It turned out that IBM had a plan to regain control of the PC so they could block out competitors that were manufacturing IBM PC compatibles.
IBM planned to reverse-engineer Intel's 286 processor and build its own version. The computers would run the OS/2
operating system instead of Windows and use the proprietary Micro Channel architecture.
However, the reverse-engineering project failed and IBM eventually moved to the Intel 386 processor.
The IBM PS/2 line of computers, released in 1987, followed the rest of the plan.
However, the PS/2 line was largely unsuccessful; rather than regaining control over the PC, IBM ended up losing control
to companies such as Compaq and Dell.
(For more, see <a href="https://amzn.to/3phNbGW">Creating the Digital Future</a>, page 131.) <a class="footnote-backref" href="#fnref:ibm286" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:386-oral-history">
<p>The 386 team created an <a href="http://archive.computerhistory.org/resources/text/Oral_History/Intel_386_Design_and_Dev/102702019.05.01.acc.pdf">oral history</a>
that describes the development of the 386 in detail. Pages 5, 6, and 19 are most relevant to this post. <a class="footnote-backref" href="#fnref:386-oral-history" title="Jump back to footnote 5 in the text">↩</a><a class="footnote-backref" href="#fnref2:386-oral-history" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:processor-numbers">
<p>You might wonder why the processor was codenamed P4, since logically P4 should indicate the 486.
Confusingly, Intel's processor codenames were not always sequential and they sometimes reused numbers.
The numbers apparently started with P0, the codename for the Optimal Terminal Kit, a processor that didn't get beyond early planning.
P5 was used for the 432, P4 for the planned follow-on, P7 for the i960, P10 for the i960 CA, and P12 for the i960 MX.
(Apparently they thought that x86 wouldn't ever get to P4.)</p>
<p>For the x86 processors, P1 through P6 indicated the 186, 286, 386, 486, 586 (Pentium), and Pentium Pro as you'd expect.
(The Pentium used a <a href="https://en.wikipedia.org/wiki/Pentium_(original)#Cores_and_steppings">variety</a> of codes for various
versions, such as P54C, P24T, and P55C; I don't understand the pattern behind these.)
For some reason, the i386SX was the P9 and the i486SX was the <a href="https://www.cpu-world.com/forum/viewtopic.php?t=26657">P23</a> and the i486DX2 was the P24.
The Pentium 4 Willamette was the first new microarchitecture (NetBurst) since P6 so it was going to be P7,
but Itanium took the P7 codename, so Willamette
became P68. After that, processors were named after geographic features, avoiding the issues with numeric codenames.</p>
<p>Other types of chips used different letter prefixes.
The 387 numeric coprocessor was the N3.
The i860 RISC processor was originally the N10, a numeric co-processor.
The follow-on i860 XP was the N11.
Support chips for the 486 included the C4 cache chip and the unreleased I4 interrupt controller. <a class="footnote-backref" href="#fnref:processor-numbers" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:80960">
<p>At the time, Intel had a family of 16-bit embedded microcontrollers called <a href="https://en.wikipedia.org/wiki/Intel_MCS-96">MCS-96</a> featuring the 8096.
The 80960 name was presumably chosen to imply continuity with the 8096 16-bit microcontrollers (MCS-96), even though
the 8096 and the 80960 are completely different. (I haven't been able to confirm this, though.)
Intel started calling the chip the "i960" around 1989.
(Intel's chip branding is inconsistent: from 1987 to 1991, Intel's annual reports called the 386 processor the
80386, the 386, the Intel386, and the i386. I suspect their trademark lawyers were dealing with the problem that numbers
couldn't be trademarked, which was the motivation for the "Pentium" name rather than 586.)</p>
<p>Note that the i860 processor is completely unrelated to the i960 despite the similar numbers.
They are both 32-bit RISC processors, but are architecturally unrelated.
The i860 was targeted at high-performance workstations, while the i960 was targeted at embedded applications.
For details on the i860, see <a href="https://spectrum.ieee.org/intel-i860">The first million-transistor chip</a>. <a class="footnote-backref" href="#fnref:80960" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:fp-80387">
<p>The Intel 80387 floating-point coprocessor chip used the same floating-point unit as the i960. The diagram below shows
the 80387; compare the floating-point unit in the lower right corner with the matching floating-point unit in the i960 KA
or SB die photo.</p>
<p><a href="https://static.righto.com/images/960-overview/80387.jpg"><img alt="The 80837 floating-point coprocessor with the main functional blocks labeled. Die photo courtesy of Antoine Bercovici. 80387 floor plan from The 80387 and its applications." class="hilite" height="533" src="https://static.righto.com/images/960-overview/80387-w500.jpg" title="The 80837 floating-point coprocessor with the main functional blocks labeled. Die photo courtesy of Antoine Bercovici. 80387 floor plan from The 80387 and its applications." width="500" /></a><div class="cite">The 80837 floating-point coprocessor with the main functional blocks labeled. Die photo courtesy of Antoine Bercovici. 80387 floor plan from <a href="https://doi.org/10.1109/MM.1987.304880">The 80387 and its applications</a>.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:fp-80387" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:instruction-set-note">
<p>I compared the instruction sets of the 432 and i960 and
the i960 Extended instruction set seems about
as close as you could get to the 432 while drastically changing the underlying architecture.
If you dig into the details of the object models, there are some differences.
Some instructions also have different names but the same function. <a class="footnote-backref" href="#fnref:instruction-set-note" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:myers">
<p>The first i960 chips were described in detail in the 1988 book <a href="https://archive.org/details/80960microproces0000myer">The 80960 microprocessor architecture</a> by Glenford Myers (who was responsible for the 80960 architecture at Intel) and David Budde (who
managed the VLSI development of the 80960 components).
This book discussed three levels of architecture (Core, Numerics, and Protected).
The book referred to the fourth level, the Extended architecture (XA), calling it "a proprietary higher level of the
architecture developed for use by Intel in system products" and did not discuss it further.
These "system products" were the systems being developed at BiiN. <a class="footnote-backref" href="#fnref:myers" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:extended">
<p>I could find very little documentation on the Extended architecture.
The <a href="http://www.bitsavers.org/components/intel/i960/271159-001_80960XA_Advance_Information_Oct90.pdf">80960XA datasheet</a>
provides a summary of the instruction set.
The i960 MX datasheet provides a similar summary; it is in the Intel Military and Special Products databook, which I
found after much difficulty.
The best description I could find is in the 400-page <a href="http://bitsavers.org/pdf/biin/BiiN_CPU_Architecture_Reference_Man_Jul88.pdf">BiiN CPU architecture reference manual</a>.
Intel has other documents that I haven't been able to find anywhere:
i960 MM/MX Processor Hardware Reference Manual, and Military i960 MM/MX Superscalar Processor.
(If you have one lying around, let me know.)</p>
<p>The <a href="http://datasheets.chipdb.org/Intel/80960/specupdt/27286802.PDF">80960MX Specification Update</a> mentions a few things
about the MX processor. My favorite is that if you take the arctan of a value greater than 32768, the processor may lock
up and require a hardware reset. Oops.
The update also says that the product is sold in die and wafer form only, i.e. without packaging.
Curiously, earlier documentation said the chip was packaged in a 348-pin ceramic PGA package (with 213 signal pins and 122 power/ground pins).
I guess Intel ended up only supporting the bare die, as in the Hughes HAC-32 module. <a class="footnote-backref" href="#fnref:extended" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:features">
<p>According to people who worked on the project, there were not even any bond wire changes or blown fuses to distinguish the chips
for the four different architectures.
It's possible that Intel used <a href="https://en.wikipedia.org/wiki/Product_binning">binning</a>, selling dies as a lower
architecture if, for example, the floating point unit failed testing.
Moreover, the military chips presumably had much more extensive testing, checking the military temperature range for instance. <a class="footnote-backref" href="#fnref:features" title="Jump back to footnote 12 in the text">↩</a></p>
</li>
<li id="fn:no-connection">
<p>The original i960 chips (KA/KB/MC/XA) have a large number of pins that are not connected (marked NC on the <a href="http://www.bitsavers.org/components/intel/i960/271159-001_80960XA_Advance_Information_Oct90.pdf">datasheet</a>). This has led to
suspicious theorizing, even on <a href="https://en.wikipedia.org/wiki/Intel_i960#80960MX,_80960MC">Wikipedia</a>, that these pins were left unconnected to control access to various features.
This is false for two reasons. First, checking the datasheets shows that all four chips have the same pinout; there are no
pins connected only in the more advanced versions.
Second, looking at the packaged chip (below) explains why so many pins are unconnected: much of the chip has no bond pads,
so there is nothing to connect the pins to.
In particular, the right half of the die has only four bond pads for power.
This is an unusual chip layout, but presumably
the chip's internal buses made it easier to put all the connections at the left. The downside is that the package is
more expensive due to the wasted pins, but I expect that BiiN wasn't concerned about a few extra dollars for the package.</p>
<p><a href="https://static.righto.com/images/960-overview/Intel i80960MCacav.jpg"><img alt="The i960 MC die, bonded in its package. Photo courtesy of Antoine Bercovici." class="hilite" height="400" src="https://static.righto.com/images/960-overview/Intel i80960MCacav-w400.jpg" title="The i960 MC die, bonded in its package. Photo courtesy of Antoine Bercovici." width="400" /></a><div class="cite">The i960 MC die, bonded in its package. Photo courtesy of Antoine Bercovici.</div></p>
<p>But you might wonder: the simple KA uses 32 bits and the complex XA uses 33 bits, so surely there must be another pin for the 33rd bit.
It turns out that pin F3 is called CACHE on the KA and CACHE/TAG on the XA.
The pin indicates if an access is cacheable, but the XA uses the pin during a different clock cycle to indicate whether
the 32-bit word is data or an access descriptor (unforgeable pointer).</p>
<p>So how does the processor know if it should use the 33-bit object mode or plain 32-bit mode?
There's a processor control word called Processor Controls, that includes a Tag Enable bit. If this bit is set, the processor
uses the 33rd bit (the tag bit) to distinguish Access Descriptors from data.
If the bit is clear, the distinction is disabled and the processor runs in 32-bit mode.
(See <a href="http://bitsavers.org/pdf/biin/BiiN_CPU_Architecture_Reference_Man_Jul88.pdf">BiiN CPU Architecture Reference Manual</a>
section 16.1 for details.) <a class="footnote-backref" href="#fnref:no-connection" title="Jump back to footnote 13 in the text">↩</a></p>
</li>
<li id="fn:objects-432">
<p>The 432 and the i960 both had unforgeable object references, the Access Descriptor.
However, the two processors implemented Access Descriptors in completely different ways, which is kind of interesting.
The i960 used a 33rd bit as a Tag bit to distinguish an Access Descriptor from a regular data value.
Since the user didn't have access to the Tag bit, the user couldn't create or modify Access Descriptors.
The 432, on the other hand, used standard 32-bit words.
To protect Access Descriptors, each object was divided into two parts, each protected by a length field.
One part held regular data, while the other part held Access Descriptors.
The 432 had separate instructions to access the two parts of the object, ensuring that regular instructions could
not tamper with Access Descriptors. <a class="footnote-backref" href="#fnref:objects-432" title="Jump back to footnote 14 in the text">↩</a></p>
</li>
<li id="fn:biin">
<p>The name "BiiN"
was developed by Lippincott & Margulies, a top design firm.
The name was designed for a strong logo, as well as referencing binary code (so it was pronounced as "bine").
Despite this pedigree, "BiiN" was called one of the worst-sounding names in the computer industry, see <a href="https://books.google.com/books?id=w3Tf3RomuEsC&lpg=PP63&pg=PP63#v=onepage&q&f=false">Losing the Name Game</a>. <a class="footnote-backref" href="#fnref:biin" title="Jump back to footnote 15 in the text">↩</a></p>
</li>
<li id="fn:xa">
<p>Some sources say that BiiN used the i960 MX, not the XA, but they are confused.
A <a href="https://doi.org/10.1109/CMPCON.1990.63666">paper from BiiN</a> states that BiiN used the 80960 XA.
(Sadly, BiiN was so short-lived that the papers introducing the BiiN system also include its demise.)
Moreover, BiiN shut down in 1989 while the i960 MX was introduced in 1990, so the timeline doesn't work. <a class="footnote-backref" href="#fnref:xa" title="Jump back to footnote 16 in the text">↩</a></p>
</li>
<li id="fn:ca">
<p>The superscalar i960 architecture is described in detail in
<a href="https://doi.org/10.1109/CMPCON.1990.63681">The i960CA SuperScalar implementation of the 80960 architecture</a>
and
<a href="https://doi.org/10.1016/0141-9331(90)90111-8">Inside Intel's i960CA superscalar processor</a>
while the military MM version is described in
<a href="https://doi.org/10.1109/CMPCON.1991.128774">Performance enhancements in the superscalar i960MM embedded microprocessor</a>. <a class="footnote-backref" href="#fnref:ca" title="Jump back to footnote 17 in the text">↩</a></p>
</li>
<li id="fn:mm">
<p>I don't have a die photo of the i960 MM, so I'm not certain of the relationship
between the MM and the MX.
The published MM die size is approximately the same as the MX. The MM block diagram matches the MX, except using 32 bits
instead of 33.
Thus, I think the MM uses the MX die, ignoring the Extended features, but I can't confirm this. <a class="footnote-backref" href="#fnref:mm" title="Jump back to footnote 18 in the text">↩</a></p>
</li>
<li id="fn:ada">
<p>The military's Ada mandate remained in place for a decade until it was <a href="https://people.cs.kuleuven.be/~dirk.craeynest/ada-belgium/archive/ase/ase02_01/docs/pol_hist/oasd497.shtml">eliminated</a> in 1997.
Ada continues to be used by the military and other applications that require high reliability, but by now C++ has mostly replaced it. <a class="footnote-backref" href="#fnref:ada" title="Jump back to footnote 19 in the text">↩</a></p>
</li>
<li id="fn:standard">
<p>The military standard was decided by the Joint Integrated Avionics Working Group, known as JIAWG.
Earlier, in 1980, the military established a 16-bit computing standard, MIL-STD-1750A. The 1750A standard defined a new architecture,
and numerous companies implemented 1750A-compatible processors.
Many systems used 1750A processors and overall it was more successful than the JIAWG standard. <a class="footnote-backref" href="#fnref:standard" title="Jump back to footnote 20 in the text">↩</a></p>
</li>
<li id="fn:tredennick">
<p>Chip designer and curmudgeon Nick Tredennick described the market for Intel's 960MX processor:
"Intel invested considerable money and effort in the design of the 80960MX processor, for which, at the time of implementation,
the only known application was the YF-22 aircraft.
When the only prototype of the YF-22 crashed, the application volume for the 960MX actually went to zero; but even if the
program had been successful, Intel could not have expected to sell more than a few thousand processors for that application." <a class="footnote-backref" href="#fnref:tredennick" title="Jump back to footnote 21 in the text">↩</a></p>
</li>
<li id="fn:rubylith">
<p>In the early 1970s, chip designs were created by cutting large sheets of Rubylith film with X-Acto knives.
Of course, that technology was long gone by the time of the i960.</p>
<p><a href="https://static.righto.com/images/960-overview/rubylith.png"><img alt="Intel photo of two women cutting Rubylith." class="hilite" height="391" src="https://static.righto.com/images/960-overview/rubylith-w500.png" title="Intel photo of two women cutting Rubylith." width="500" /></a><div class="cite">Intel photo of two women cutting Rubylith.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:rubylith" title="Jump back to footnote 22 in the text">↩</a></p>
</li>
<li id="fn:vh">
<p>The Rx I/O processor chips combined a Jx processor core with a PCI bus interface and other hardware.
The RM and RN versions were introduced in 2000 with a hardware XOR engine for RAID disk array parity calculations.
The i960 VH (1998) was similar to Rx, but had only one PCI bus, no APIC bus, and was based on the JT core.
The 80303 (<a href="https://www.edn.com/intel-pushes-i960-to-the-max-with-meaty-80303-device/">2000</a>) was the end of the i960 I/O
processors.
The 80303 was given a numeric name instead of an i960 name because Intel was transitioning from i960 to XScale
at the time.
The numeric name makes it look like a smooth transition from the 80303 (i960) I/O processor to the XScale-based I/O processors such as the 80333.
The 803xx chips were also called IOP3xx (I/O Processor); some were chipsets with a separate XScale processor chip and an I/O companion chip.
<!-- <a href="https://datasheet.octopart.com/GC80303-S-L57T-Intel-datasheet-147318.pdf">80303 datasheet</a> --> <a class="footnote-backref" href="#fnref:vh" title="Jump back to footnote 23 in the text">↩</a></p>
</li>
<li id="fn:risc">
<p>Although the technical side of RISC vs. CISC is interesting, what I find most intriguing is the "social history" of
RISC:
how did a computer architecture issue from the 1980s become a topic that people still vigorously argue over 40 years later?
I see several factors that keep the topic interesting:
<ul>
<li>RISC vs. CISC has a large impact on not only computer architects but also developers and users.
<li>The topic is simple enough that everyone can have an opinion. It's also vague enough that nobody agrees on definitions, so there's lots to argue about.
<li>There are winners and losers, but no resolution.
RISC sort of won in the sense that almost all new instruction set architectures have been RISC.
But CISC has won commercially with the victory of x86 over SPARC, PowerPC, Alpha, and other RISC contenders.
But ARM dominates mobile and is moving into personal computers through Apple's new processors.
If RISC had taken over in the 1980s as expected, there wouldn't be anything to debate. But x86 has prospered despite the efforts of everyone (including Intel) to move beyond it.
<li>RISC vs. CISC takes on a "personal identity" aspect.
For instance, if you're an "Apple" person, you're probably going to be cheering for ARM and RISC.
But nobody cares about branch prediction strategies or caching.
</ul></p>
<p>My personal opinion is that it is a mistake to consider RISC and CISC as objective, binary categories.
(Arguing over whether ARM or the 6502 is really RISC or CISC is like arguing over whether <a href="https://twitter.com/matttomic/status/859117370455060481/photo/1.">a hotdog is a sandwich</a>.)
RISC is more of a social construct, a design philosophy/ideology that leads to a general kind of instruction set architecture,
which in turn leads to various implementation techniques.</p>
<p>Moreover, I view RISC vs. CISC as mostly irrelevant since the 1990s due to convergence between RISC and CISC architectures.
In particular, the Pentium Pro (1995) decoded CISC instructions into "RISC-like" micro-operations that are executed by a superscalar
core, surprising people by achieving RISC-like performance from a CISC processor.
This has been viewed as a victory for CISC, a victory for RISC, nothing to do with RISC, or an indication that RISC and CISC have converged. <a class="footnote-backref" href="#fnref:risc" title="Jump back to footnote 24 in the text">↩</a></p>
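<p>To make the convergence point concrete, here is a toy sketch of the decoding idea (the instruction names and micro-op format are invented for illustration; real x86 decoders are nothing this simple): a single CISC-style instruction with a memory operand is split into RISC-like steps that a fast core can execute.</p>

```python
# Toy illustration of CISC-to-micro-op decoding in the style the Pentium Pro
# popularized: one "complex" instruction becomes a sequence of simple steps.
def decode(instr):
    """Split a CISC-style memory-operand ADD into RISC-like micro-ops."""
    op, dest, addr = instr
    if op == "add_mem":  # reg <- reg + memory[addr], one complex instruction
        return [("load", "tmp", addr),   # micro-op 1: read memory into a temp
                ("add", dest, "tmp")]    # micro-op 2: register-register add
    return [instr]  # simple instructions map to a single micro-op

uops = decode(("add_mem", "r1", 0x2000))
# uops == [("load", "tmp", 0x2000), ("add", "r1", "tmp")]
```
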
</li>
<li id="fn:upr">
<p>The quote is from Microprocessor Report April 1988, "Intel unveils radical new CPU family", reprinted in "Understanding RISC Microprocessors". <a class="footnote-backref" href="#fnref:upr" title="Jump back to footnote 25 in the text">↩</a></p>
</li>
<li id="fn:mashey">
<p>John Mashey of MIPS wrote an interesting article "CISCs are Not RISCs, and Not Converging Either" in the
March 1992 issue of Microprocessor Report, extending a <a href="https://yarchive.net/comp/risc_definition.html">Usenet thread</a>.
It looks at multiple quantitative factors of various processors and finds a sharp line between CISC processors and most RISC processors.
The i960, Intergraph Clipper, and (probably) ARM, however, were "truly on the RISC/CISC border, and, in fact, are often described
that way." <a class="footnote-backref" href="#fnref:mashey" title="Jump back to footnote 26 in the text">↩</a></p>
</li>
<li id="fn:addressing">
<p>The i960 datasheet lists an extensive set of addressing modes, more than typical for a RISC chip:<ul>
<li>12-bit offset
<li>32-bit offset
<li>Register-indirect
<li>Register + 12-bit offset
<li>Register + 32-bit offset
<li>Register + index-register×scale-factor
<li>Register×scale-factor + 32-bit displacement
<li>Register + index-register×scale-factor + 32-bit displacement
</ul>
where the scale-factor is 1, 2, 4, 8, or 16.</p>
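<p>As a sketch of how the most general mode combines these pieces (illustrative Python, not Intel's specification; the function name is mine):</p>

```python
# Illustrative sketch of the i960's most general addressing mode:
#   register + index-register*scale-factor + 32-bit displacement,
# with the result wrapping to 32 bits.
MASK32 = 0xFFFFFFFF

def effective_address(base, index, scale, displacement):
    """Compute a scaled-index effective address with 32-bit wraparound."""
    assert scale in (1, 2, 4, 8, 16), "i960 allows only these scale factors"
    return (base + index * scale + displacement) & MASK32

# Example: base register 0x1000, index 3 scaled by 8, displacement 0x20
addr = effective_address(0x1000, 3, 8, 0x20)   # addr == 0x1038
```

<p>The simpler modes in the list above are just special cases of this calculation with the index or displacement omitted.</p>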
<p>See the <a href="https://media.digikey.com/pdf/Data%20Sheets/Intel%20PDFs/80960KA.pdf">80960KA embedded 32-bit microprocessor</a> datasheet for more information. <a class="footnote-backref" href="#fnref:addressing" title="Jump back to footnote 27 in the text">↩</a></p>
</li>
<li id="fn:string">
<p>The i960 MC has string instructions that move, scan, or fill a string of bytes with a specified length.
These are similar to the x86 string operations, but these are very unusual for a RISC processor. <a class="footnote-backref" href="#fnref:string" title="Jump back to footnote 28 in the text">↩</a></p>
</li>
<li id="fn:instruction-set">
<p>The iAPX 432 instruction set is described in detail in chapter 10 of the
<a href="http://www.bitsavers.org/components/intel/iAPX_432/171860-004_iAPX_432_General_Data_Processor_Architecture_Reference_Manual_Feb84.pdf">iAPX 432 General Data Processor Architecture Reference Manual</a>; the instructions are called "operators".
The i960 Protected instruction set is listed in the <a href="http://www.bitsavers.org/components/intel/i960/271081-001_80960MC_Programmers_Reference_Manual_Jul88.pdf">80960MC Programmer's Reference Manual</a>
while the i960 Extended instruction set is described in the
<a href="http://bitsavers.org/pdf/biin/BiiN_CPU_Architecture_Reference_Man_Jul88.pdf">BiiN CPU architecture reference manual</a>.</p>
<p>The table below shows the instruction set for the Extended architecture, the full
set of object-oriented instructions.
The instruction set includes typical RISC instructions (data movement, arithmetic, logical, comparison, etc), floating point instructions (for the Numerics architecture),
process management instructions (for the Protected architecture), and the Extended object instructions (Access Descriptor operations).
The "Mixed" instructions handle 33-bit values that can be either a tag (object pointer) or regular data.
Note that many of these instructions have separate opcodes for different datatypes, so the complete instruction set is larger than
this list, with about 240 opcodes.</p>
<p><a class="footnote-backref" href="#fnref:instruction-set" title="Jump back to footnote 29 in the text">↩</a><a href="https://static.righto.com/images/960-overview/instruction-set.png"><img alt="The Extended instruction set, from the i960 XA datasheet. Click for a larger version." class="hilite" height="778" src="https://static.righto.com/images/960-overview/instruction-set-w600.png" title="The Extended instruction set, from the i960 XA datasheet. Click for a larger version." width="600" /></a><div class="cite">The Extended instruction set, from the i960 XA datasheet. Click for a larger version.</div></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com24tag:blogger.com,1999:blog-6264947694886887540.post-74533590985711627612023-05-13T15:42:00.004-07:002023-05-14T11:38:06.398-07:00The Group Decode ROM: The 8086 processor's first step of instruction decoding<style>
.hilite {cursor:zoom-in}
a:link img.hilite, a:visited img.hilite {color: #fff;}
a:hover img.hilite {color: #f66;}
</style>
<style>
pre.microcode {font-family: courier, fixed; padding: 10px; background-color: #f5f5f5; display:inline-block;border:none;}
pre.microcode span {color: green; font-style:italic; font-family: sans-serif; font-size: 90%;}
</style>
<p>A key component of any processor is instruction decoding: analyzing a numeric opcode and figuring out
what actions need to be taken.
The Intel 8086 processor (1978) has a complex instruction set, making instruction decoding a challenge.
The first step in decoding an 8086 instruction is something called the Group Decode ROM, which categorizes
instructions into about 35 types that control how the instruction is decoded and executed.
For instance, the Group Decode ROM determines if an instruction is executed in hardware or in microcode.
It also indicates how the instruction is structured: if the instruction has a bit specifying a byte or word operation,
if the instruction has a byte that specifies the addressing mode, and so forth.</p>
<p><a href="https://static.righto.com/images/8086-group/die-labeled.jpg"><img alt="The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version." class="hilite" height="633" src="https://static.righto.com/images/8086-group/die-labeled-w600.jpg" title="The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version." width="600" /></a><div class="cite">The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.</div></p>
<p>The diagram above shows the position of the Group Decode ROM on the silicon die, as well as other key functional blocks.
The 8086 chip is partitioned into a Bus Interface Unit that communicates with external components such as memory,
and the Execution Unit that executes instructions.
Machine instructions are fetched from memory by the Bus Interface Unit and stored in the prefetch queue registers,
which hold 6 bytes of instructions.
To execute an instruction, the queue bus transfers an instruction byte from the prefetch queue to the instruction register, under control of a state machine called the Loader.
Next, the Group Decode ROM categorizes the instruction according to its structure.
In most cases, the machine instruction is implemented in low-level microcode. The instruction byte is transferred
to the Microcode Address Register, where the Microcode Address Decoder selects the appropriate microcode routine
that implements the instruction.
The microcode provides the micro-instructions that control the Arithmetic/Logic Unit (ALU), registers, and other
components to execute the instruction.</p>
<p>In this blog post, I will focus on a small part of this process: how the Group Decode ROM decodes instructions.
Be warned that this post gets down into the weeds, so you might want to start with one of my higher-level
posts, such as <a href="https://www.righto.com/2022/11/how-8086-processors-microcode-engine.html">how the 8086's microcode engine works</a>.</p>
<!--
According to patent [4449184](https://patents.google.com/patent/US4449184A),
"A group decode ROM has its
inputs coupled to the instruction register. The group
decode ROM generates a plurality of group decode
signals which are indicative of the genera of the single
byte and multiple byte instructions being received and
decoded by the lower control means."
-->
<h2>Microcode</h2>
<p>Most instructions in the 8086 are implemented in microcode.
Most people think of machine instructions as the basic steps that a computer performs.
However, many processors have another layer of software underneath: microcode.
With microcode, instead of building the CPU's control circuitry from complex logic gates, the control logic is largely replaced with code.
To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.</p>
<p>Microcode is only used if the Group Decode ROM indicates that the instruction is implemented in microcode.
In that case, the microcode address register is loaded with the instruction and the address decoder selects
the appropriate microcode routine.
However, there's a complication. If the second byte of the instruction is a Mod R/M byte, the Group Decode ROM
indicates this and causes a memory addressing micro-subroutine to be called.</p>
<p>Some simple instructions are implemented entirely in hardware and don't use microcode.
These are known as 1-byte logic instructions (1BL) and are also indicated by the Group Decode ROM.</p>
<h2>The Group Decode ROM's structure</h2>
<p>The Group Decode ROM takes an 8-bit instruction as input, along with an interrupt signal.
It produces 15 outputs that control how the instruction is handled.
In this section I'll discuss the physical implementation of the Group Decode ROM; the various outputs
are discussed in a later section.</p>
<p>Although the Group Decode ROM is called a ROM, its implementation is really a PLA (Programmable Logic Array),
two levels of highly-structured logic gates.<span id="fnref:rom"><a class="ref" href="#fn:rom">1</a></span>
The idea of a PLA is to create two levels of NOR gates, each in a grid.
This structure has the advantages that it implements the logic densely and is easy to modify.
Although physically two levels of NOR gates, a PLA can be thought of as an <code>AND</code> layer followed by an <code>OR</code> layer.
The <code>AND</code> layer matches particular bit patterns and then the <code>OR</code> layer combines multiple values from the first
layer to produce arbitrary outputs.</p>
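<p>The two-plane structure can be modeled in a few lines of Python. This is a toy model with made-up plane contents, not the 8086's actual wiring, but it shows how two NOR planes act as an AND layer followed by an OR layer:</p>

```python
# A toy PLA model (illustrative contents, not the 8086's actual planes).
# Each AND-plane column is physically a NOR gate: its output is 1 only if
# every input line wired to it is 0. Wiring a column to a mix of true and
# complemented input lines makes it match one specific bit pattern.

def nor(inputs):
    return 0 if any(inputs) else 1

def pla(bits, and_plane, or_plane):
    # and_plane: one list of taps per column; a tap (i, want) wires the
    # column so it sees a 0 exactly when input bit i equals want.
    columns = [nor([bits[i] ^ want for i, want in taps]) for taps in and_plane]
    # or_plane: each output row ORs together the listed columns (in the
    # hardware, another NOR plane followed by an inversion).
    return [int(any(columns[c] for c in cols)) for cols in or_plane]

# Two columns matching the patterns "10" and "0X" on two input bits,
# with one output row per column:
print(pla([1, 0], [[(0, 1), (1, 0)], [(0, 0)]], [[0], [1]]))  # [1, 0]
```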
<p><a href="https://static.righto.com/images/8086-group/rom-labeled.jpg"><img alt="The Group Decode ROM. This photo shows the metal layer on top of the die." class="hilite" height="443" src="https://static.righto.com/images/8086-group/rom-labeled-w600.jpg" title="The Group Decode ROM. This photo shows the metal layer on top of the die." width="600" /></a><div class="cite">The Group Decode ROM. This photo shows the metal layer on top of the die.</div></p>
<p>Since the output values are highly structured, a PLA implementation is considerably more efficient than a ROM, since in a sense
it combines multiple entries.
In the case of the Group Decode ROM, using a ROM structure would require 256 columns (one for each 8-bit instruction pattern),
while the PLA implementation requires just 36 columns, about 1/7 the size.</p>
<p>The diagram below shows how one column of the Group Decode ROM is wired in the "AND" plane.
In this die photo, I removed the metal layer with acid to reveal the polysilicon and silicon underneath.
The vertical lines show where the metal line for ground and the column output had been.
The basic idea is that each column implements a NOR gate, with a subset of the input lines selected as inputs to the
gate.
The pull-up resistor at the top pulls the column line high by default. But if any of the selected inputs are high,
the corresponding transistor turns on, connecting the column line to ground and pulling it low.
Thus, this implements a NOR gate.
However, it is more useful to think of it as an AND of the complemented inputs (via <a href="https://en.wikipedia.org/wiki/De_Morgan%27s_laws">De Morgan's Law</a>):
if all the inputs are "correct", the output is high.
In this way, each column matches a particular bit pattern.</p>
<p><a href="https://static.righto.com/images/8086-group/column-labeled.jpg"><img alt="Closeup of a column in the Group Decode ROM." class="hilite" height="575" src="https://static.righto.com/images/8086-group/column-labeled-w280.jpg" title="Closeup of a column in the Group Decode ROM." width="280" /></a><div class="cite">Closeup of a column in the Group Decode ROM.</div></p>
<p>The structure of the ROM is implemented through the silicon doping pattern, which is visible above.
A transistor is formed where a polysilicon wire crosses a doped silicon region: the polysilicon forms the gate, turning the transistor on or off.
At each intersection point, a transistor can be created or not, depending on the doping pattern.
If a particular transistor is created, then the corresponding input must be 0 to produce a high output.</p>
<p>At the top of the diagram above, the column outputs are switched from the metal layer to polysilicon wires and become the inputs to the upper "OR"
plane.
This plane is implemented in a similar fashion as a grid of NOR gates.
The plane is rotated 90 degrees, with the inputs vertical and each row forming an output.</p>
<h2>Intermediate decoding in the Group Decode ROM</h2>
<p>The first plane of the Group Decode ROM categorizes instructions into 36 types based on the instruction bit pattern.<span id="fnref:36"><a class="ref" href="#fn:36">2</a></span>
The table below shows the 256 instruction values, colored according to their categorization.<span id="fnref:octal"><a class="ref" href="#fn:octal">3</a></span>
For instance, the first blue block consists of the 32 ALU instructions
corresponding to the bit pattern <code>00XXX0XX</code>, where <code>X</code> indicates that the bit can be 0 or 1.
These instructions are all decoded and executed in a similar way.
Almost all instructions have a single category, that is, they activate a single column line in the Group Decode ROM. However, a few instructions activate two lines and have two colors below.</p>
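<p>Each column's pattern matching amounts to a mask-and-compare on the opcode. The sketch below shows the idea with three of the roughly 36 patterns; the patterns come from the 8086 opcode map, but this is a simplified subset, not the ROM's exact contents:</p>

```python
# Simplified sketch of the first-plane categorization as mask/value
# matching (three example patterns out of ~36; 'X' bits are 0 in the mask).

CATEGORIES = [
    # (mask, value, name): a match means opcode & mask == value.
    (0b11000100, 0b00000000, "ALU reg/mem op (00XXX0XX)"),
    (0b11100111, 0b00100110, "segment prefix (001XX110)"),
    (0b11110000, 0b10110000, "MOV immediate to register (1011XXXX)"),
]

def categorize(opcode):
    return [name for mask, value, name in CATEGORIES
            if opcode & mask == value]

print(categorize(0x00))  # ADD r/m,r falls in the ALU block
print(categorize(0x26))  # ES: segment prefix
print(categorize(0xB8))  # MOV AX,imm
```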
<p><a href="https://static.righto.com/images/8086-group/grid.png"><img alt="Grid of 8086 instructions, colored according to the first level of the Group Decode Rom." class="hilite" height="600" src="https://static.righto.com/images/8086-group/grid-w600.png" title="Grid of 8086 instructions, colored according to the first level of the Group Decode Rom." width="600" /></a><div class="cite">Grid of 8086 instructions, colored according to the first level of the Group Decode Rom.</div></p>
<p>Note that the instructions do not have arbitrary numeric opcodes, but are assigned in a way that makes decoding simpler.
Because these blocks correspond to bit patterns, there is little flexibility.
One of the challenges of instruction set design for early microprocessors was to assign numeric values to the opcodes
in a way that made decoding straightforward.
It's a bit like a jigsaw puzzle, fitting the instructions into the 256 available values, while making them easy to decode. </p>
<h2>Outputs from the Group Decode ROM</h2>
<p>The Group Decode ROM has 15 outputs, one for each row of the upper half.
In this section, I'll briefly discuss these outputs and their roles in the 8086.
For an interactive exploration of these signals, see <a href="https://righto.com/8086/groupRom.html">this page</a>,
which shows the outputs that are triggered by each instruction.</p>
<p>Out 0 indicates an <code>IN</code> or <code>OUT</code> instruction.
This signal controls the M/IO (S2) status line, which distinguishes between a memory read/write and an I/O read/write.
Apart from this, memory and I/O accesses are basically the same.</p>
<p>Out 1 indicates (inverted) that the instruction has a Mod R/M byte and performs a read/modify/write on its argument. This signal is used by the Translation ROM when dispatching
an address handler (<a href="https://www.righto.com/2023/02/8086-modrm-addressing.html">details</a>).
(This signal distinguishes between, say, <code>ADD [AX],BX</code> and <code>MOV [AX],BX</code>.
The former both reads and writes <code>[AX]</code>, while the latter only writes to it.)</p>
<p>Out 2 indicates a "group 3/4/5" opcode, an instruction where the second byte specifies the particular instruction,
and thus decoding needs to wait for the second byte.
This controls the loading of the microcode address register.</p>
<p>Out 3 indicates an instruction prefix (segment, <code>LOCK</code>, or <code>REP</code>).
This causes the next byte to be decoded as a new instruction, while blocking interrupt handling.</p>
<p>Out 4 indicates (inverted) a two-byte ROM instruction (2BR), i.e. an instruction that is handled by the microcode ROM but
requires the second byte for decoding.
This is an instruction with a Mod R/M byte.
This signal controls the loader, indicating that it needs to fetch the second byte.
This signal is almost the same as output 1 with a few differences.</p>
<p>Out 5 specifies the top bit for an ALU operation. The 8086 uses a 5-bit field to specify an ALU operation.
If not specified explicitly by the microcode, the field uses bits 5 through 3 of the opcode.
(These bits distinguish, say, an <code>ADD</code> instruction from <code>AND</code> or <code>SUB</code>.)
This control line sets the top bit of the ALU field for instructions such as <code>DAA</code>, <code>DAS</code>, <code>AAA</code>, <code>AAS</code>, <code>INC</code>, and <code>DEC</code> that fall into a different set from the "regular" ALU instructions.</p>
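<p>For the regular ALU instructions, the three bits 5-3 map directly to the operation. A quick sketch of that extraction (the 3-bit encoding below is the standard 8086 one; Out 5 supplies the extra top bit for the other group of instructions):</p>

```python
# Bits 5-3 of the opcode select the operation for the "regular"
# ALU instructions (opcode pattern 00XXX0XX).
ALU_OPS = ["ADD", "OR", "ADC", "SBB", "AND", "SUB", "XOR", "CMP"]

def alu_op(opcode):
    return ALU_OPS[(opcode >> 3) & 0b111]

print(alu_op(0x00), alu_op(0x28), alu_op(0x30))  # ADD SUB XOR
```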
<p>Out 6 indicates an instruction that sets or clears a condition code directly: <code>CLC</code>, <code>STC</code>, <code>CLI</code>, <code>STI</code>, <code>CLD</code>, or <code>STD</code> (but not <code>CMC</code>). This signal is used by the flag circuitry to update the condition code.</p>
<p>Out 7 indicates an instruction that uses the <code>AL</code> or <code>AX</code> register, depending on the instruction's size bit.
(For instance <code>MOVSB</code> vs <code>MOVSW</code>.)
This signal is used by the register selection circuitry, the <code>M</code> register specifically.</p>
<p>Out 8 indicates a <code>MOV</code> instruction that uses a segment register.
This signal is used by the register selection circuitry, the <code>N</code> register specifically.</p>
<p>Out 9 indicates the instruction has a <code>d</code> bit, where bit 1 of the instruction swaps the source and destination.
This signal is used by the register selection circuitry, swapping the roles of the <code>M</code> and <code>N</code> registers according to the <code>d</code> bit.</p>
<p>Out 10 indicates a one-byte logic (1BL) instruction, a one-byte instruction that is implemented in logic, not microcode. These instructions are the prefixes, <code>HLT</code>, and the condition-code instructions.
This signal controls the loader, causing it to move to the next instruction.</p>
<p>Out 11 indicates instructions where bit 0 is the byte/word indicator. This signal controls the register handling
and the ALU functionality.</p>
<p>Out 12 indicates an instruction that operates only on a byte: <code>DAA</code>, <code>DAS</code>, <code>AAA</code>, <code>AAS</code>, <code>AAM</code>, <code>AAD</code>, and <code>XLAT</code>.
This signal operates in conjunction with the previous output to select a byte versus word.</p>
<p>Out 13 forces the instruction to use a byte argument if instruction bit 1 is set, overriding the regular byte/word pattern. Specifically, it forces the <code>L8</code> (length 8 bits) condition
for the <code>JMP</code> direct-within-segment and the ALU instructions that are immediate with sign extension (<a href="https://www.righto.com/2023/02/how-8086-processor-determines-length-of.html">details</a>).</p>
<p>Out 14 allows a carry update; its absence prevents the carry from being updated by the <code>INC</code> and <code>DEC</code> operations.
This signal is used by the flag circuitry.</p>
<h3>Columns</h3>
<p>Most of the Group Decode ROM's column signals are used to derive the outputs listed above.
However, some column outputs are also used as control signals directly. These are listed below.</p>
<p>Column 10 indicates an immediate <code>MOV</code> instruction. These instructions use instruction bit 3 (rather than bit 1) to select byte versus word, because the three low bits specify the register.
This signal affects the <code>L8</code> condition described earlier and also causes the <code>M</code> register selection to be converted from a word register to a byte register if necessary.</p>
<p>Column 12 indicates an instruction with bits 5-3 specifying the ALU instruction.
This signal causes the <code>X</code> register to be loaded with
the bits in the instruction that specify the ALU operation. (To be precise, this signal prevents the <code>X</code> register
from being reloaded from the second instruction byte.)</p>
<p>Column 13 indicates the <code>CMC</code> (Complement Carry) instruction. This signal is used by the flags circuitry to complement the carry flag (<a href="https://www.righto.com/2023/02/silicon-reverse-engineering-intel-8086.html">details</a>).</p>
<p>Column 14 indicates the <code>HLT</code> (Halt) instruction. This signal stops instruction processing by blocking the instruction queue.</p>
<p>Column 31 indicates a <code>REP</code> prefix. This signal causes the REPZ/NZ latch to be loaded with instruction bit 0 to
indicate if the prefix is <code>REPNZ</code> or <code>REPZ</code>. It also sets the <code>REP</code> latch.</p>
<p>Column 32 indicates a segment prefix. This signal loads the segment latches with the desired segment type.</p>
<p>Column 33 indicates a <code>LOCK</code> prefix. It sets the <code>LOCK</code> latch, locking the bus.</p>
<p>Column 34 indicates a <code>CLI</code> instruction. This signal immediately blocks interrupt handling to avoid an interrupt between the <code>CLI</code> instruction and when the interrupt flag bit is cleared.</p>
<h2>Timing</h2>
<p>One important aspect of the Group Decode ROM is that its outputs are not instantaneous.
It takes a clock cycle to get the outputs from the Group Decode ROM.
In particular, when instruction decoding starts, the timing signal <code>FC</code> (First Clock) is activated to indicate the first clock
cycle. However, the Group Decode ROM's outputs are not available until the Second Clock <code>SC</code>.</p>
<p>One consequence of this is that even the simplest instruction (such as a flag operation) takes two clock cycles, as does a prefix.
The problem is that even though the instruction could be performed in one clock cycle, it takes two clock cycles
for the Group Decode ROM to determine that the instruction only needs one cycle.
This illustrates how a complex instruction format impacts performance.</p>
<p>The <code>FC</code> and <code>SC</code> timing signals are generated by a state machine called the Loader.
These signals may seem trivial, but there are a few complications.
First, the prefetch queue may run empty, in which case the <code>FC</code> and/or <code>SC</code> signal is delayed until the prefetch queue has a byte available.
Second, to increase performance, the 8086 can start decoding an instruction during the last clock cycle of the previous instruction.
Thus, if the microcode indicates that there is one cycle left, the Loader can proceed with the next instruction.
Likewise, for a one-byte instruction implemented in hardware (one-byte logic or 1BL), the loader proceeds
as soon as possible.</p>
<p>The diagram below shows the timing of an <code>ADD</code> instruction. Each line is half of a clock cycle.
Execution is pipelined: the instruction is fetched during the first clock cycle (First Clock).
During Second Clock, the Group Decode ROM produces its output. The microcode address register also generates
the micro-address for the instruction's microcode.
The microcode ROM supplies a micro-instruction during the third clock cycle and execution of the micro-instruction
takes place during the fourth clock cycle.</p>
<p><a href="https://static.righto.com/images/8086-group/diagram-labeled.jpg"><img alt="This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro"." class="hilite" height="395" src="https://static.righto.com/images/8086-group/diagram-labeled-w750.jpg" title="This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro"." width="750" /></a><div class="cite">This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro".</div></p>
<p>The Group Decode ROM's outputs during Second Clock control the decoding.
Most importantly, the <code>ADD imm</code> instruction uses microcode; it is not a one-byte logic instruction (<code>1BL</code>).
Moreover, it does not have a Mod R/M byte, so it does not need two bytes for decoding (<code>2BR</code>).
For a <code>1BL</code> instruction, microcode execution would be blocked and the next instruction would be immediately fetched.
On the other hand, for a <code>2BR</code> instruction, the loader would tell the prefetch queue that it was done with the
second byte during the second half of Second Clock.
Microcode execution would be blocked during the third cycle and the fourth cycle would execute a microcode
subroutine to determine the memory address.</p>
<p>For more details, see my article on the <a href="https://www.righto.com/2023/01/the-8086-processors-microcode-pipeline.html">8086 pipeline</a>.</p>
<h2>Interrupts</h2>
<p>The Group Decode ROM takes the 8 bits of the instruction as inputs, but it has an additional input indicating that
an interrupt is being handled.
This signal blocks most of the Group Decode ROM outputs.
This prevents the current instruction's outputs from interfering with interrupt handling.
I wrote about the 8086's interrupt handling in detail <a href="https://www.righto.com/2023/02/8086-interrupt.html">here</a>, so I won't go into more detail in this post.</p>
<h2>Conclusions</h2>
<p>The Group Decode ROM indicates one of the key differences between CISC processors (Complex Instruction Set Computer) such as the 8086 and the RISC processors (Reduced Instruction Set Computer) that became popular a few years later.
A RISC instruction set is designed to make instruction decoding very easy, with a small number of uniform instruction forms.
On the other hand, the 8086's CISC instruction set was designed for compactness and high code density.
As a result, instructions are squeezed into the available opcode space.
Although there is a lot of structure to the 8086 opcodes, this structure is full of special cases and any patterns only apply to a subset of the instructions.
The Group Decode ROM brings some order to this chaotic jumble of instructions, and the number of outputs
from the Group Decode ROM is a measure of the instruction set's complexity.</p>
<p>The 8086's instruction set was extended over the decades to become the x86 instruction set in use today.
During that time, more layers of complexity were added to the instruction set.
Now, an x86 instruction can be up to 15 bytes long with multiple prefixes.
Some prefixes change the register encoding or indicate a completely different instruction set such as <code>VEX</code> (Vector Extensions) or <code>SSE</code> (Streaming SIMD Extensions).
Thus, x86 instruction decoding is very difficult, especially when trying to decode multiple instructions in parallel.
This has an impact in modern systems, where x86 processors typically have 4 complex instruction decoders while Apple's ARM processors have 8 simpler decoders; this is <a href="https://debugger.medium.com/why-is-apples-m1-chip-so-fast-3262b158cba2">said</a> to give Apple a performance benefit.
Thus, architectural decisions from 45 years ago are still impacting the performance of modern processors.</p>
<p>I've written numerous <a href="https://www.righto.com/search/label/8086">posts on the 8086</a> so far and
plan to continue reverse-engineering the 8086 die so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I've also started experimenting with Mastodon recently as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Thanks to Arjan Holscher for suggesting this topic.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:rom">
<p>You might wonder what the difference is between a ROM and a PLA.
Both of them produce arbitrary outputs for a set of inputs.
Moreover, you can replace a PLA with a ROM or vice versa.
Typically a ROM has all the input combinations decoded, so it has a separate row for each input value, i.e. 2<sup>N</sup> rows.
So you can think of a ROM as a fully-decoded PLA.</p>
<p>Some ROMs are partially decoded, allowing identical rows to be combined and reducing the size of the ROM.
This technique is used in the 8086 microcode, for instance.
A partially-decoded ROM is fairly similar to a PLA, but the technical distinction is that a ROM has
only one output row active at a time, while a PLA can have multiple output rows active and the results are
OR'd together.
(This definition is from <a href="https://amzn.to/3pzucaS">The Architecture of Microprocessors</a> p117.)</p>
<p>The Group Decode ROM, however, has a few cases where multiple rows are active at the same time
(for instance the segment register <code>POP</code> instructions).
Thus, the Group Decode ROM is technically a PLA and not a ROM.
This distinction isn't particularly important, but you might find it interesting. <a class="footnote-backref" href="#fnref:rom" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:36">
<p>The Group Decode ROM has 38 columns, but two columns (11 and 35) are unused. Presumably, these were provided as spares
in case a bug fix or modification required additional decoding. <a class="footnote-backref" href="#fnref:36" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:octal">
<p>Like the 8008 and 8080, the 8086's instruction set was designed around a 3-bit octal structure.
Thus, the 8086 instruction set makes much more sense if viewed in octal instead of hexadecimal.
The table below shows the instructions with an octal organization. Each 8×8 block uses the two low
octal digits, while the four large blocks are positioned according to the top octal digit (labeled).
As you can see, the instruction set has a lot of structure that is obscured in the usual hexadecimal table.</p>
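<p>The octal view is just a 2-3-3 split of the opcode bits, which a couple of lines of Python make concrete:</p>

```python
def octal_digits(opcode):
    # Split an 8-bit opcode into 2 + 3 + 3 bits: its three octal digits.
    return opcode >> 6, (opcode >> 3) & 7, opcode & 7

# MOV r/m,reg is hex 0x88 but octal 210; the two low octal digits line
# up with the 3-bit fields that the hexadecimal view splits awkwardly.
print(oct(0x88), octal_digits(0x88))  # 0o210 (2, 1, 0)
```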
<p><a href="https://static.righto.com/images/8086-group/grid-octal.jpg"><img alt="The 8086 instruction set, put in a table according to the octal opcode value." class="hilite" height="647" src="https://static.righto.com/images/8086-group/grid-octal-w600.jpg" title="The 8086 instruction set, put in a table according to the octal opcode value." width="600" /></a><div class="cite">The 8086 instruction set, put in a table according to the octal opcode value.</div></p>
<p>For details on the octal structure of the 8086 instruction set, see <a href="https://gist.github.com/seanjensengrey/f971c20d05d4d0efc0781f2f3c0353da">The 80x86 is an Octal Machine</a>. <a class="footnote-backref" href="#fnref:octal" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com3tag:blogger.com,1999:blog-6264947694886887540.post-56939173469665092042023-04-08T08:01:00.000-07:002023-04-08T08:01:52.481-07:00Reverse-engineering the division microcode in the Intel 8086 processor<style>
.hilite {cursor:zoom-in}
a:link img.hilite, a:visited img.hilite {color: #fff;}
a:hover img.hilite {color: #f66;}
</style>
<style>
pre.microcode {font-family: courier, fixed; padding: 10px; background-color: #f5f5f5; display:inline-block;border:none;}
pre.microcode span {color: green; font-style:italic; font-family: sans-serif; font-size: 90%;}
</style>
<p>While programmers today take division for granted, most microprocessors in the 1970s could only add and subtract — division required a slow and tedious loop implemented in assembly code.
One of the nice features of the Intel 8086 processor (1978) was
that it provided machine instructions for integer multiplication and division.
Internally, the 8086 still performed a loop, but the loop was implemented in microcode: faster and transparent to
the programmer.
Even so, division was a slow operation, about 50 times slower than addition.</p>
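<p>To see the kind of loop involved, here is a generic restoring shift-and-subtract division in Python. It captures the idea the microcode implements, subtraction and shifts repeated once per bit, but it is an illustration, not the literal microcode:</p>

```python
# Generic restoring shift-and-subtract division: one trial subtraction
# per quotient bit, which is why division takes ~50x longer than an add.
def divide(dividend, divisor, bits=16):
    assert divisor != 0
    quotient, rem = 0, 0
    for i in range(bits - 1, -1, -1):
        rem = (rem << 1) | ((dividend >> i) & 1)  # shift in the next dividend bit
        if rem >= divisor:                        # trial subtraction succeeds
            rem -= divisor
            quotient |= 1 << i                    # record a 1 quotient bit
    return quotient, rem

print(divide(1000, 7))  # (142, 6)
```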
<p>I recently examined <a href="https://www.righto.com/2023/03/8086-multiplication-microcode.html">multiplication in the 8086</a>, and
now it's time to look at the division microcode.<span id="fnref:microcode"><a class="ref" href="#fn:microcode">1</a></span>
(There's a lot of overlap with the multiplication post so apologies for any deja vu.)
The die photo below shows the chip under a microscope.
I've labeled the key functional blocks; the ones that are important to this post are darker.
At the left, the ALU (Arithmetic/Logic Unit) performs the arithmetic operations at the heart of division: subtraction and shifts.
Division also uses a few special hardware features: the X register, the F1 flag, and a loop counter.
The microcode ROM at the lower right controls the process.</p>
<p><a href="https://static.righto.com/images/8086-div/die-labeled.jpg"><img alt="The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version." class="hilite" height="592" src="https://static.righto.com/images/8086-div/die-labeled-w600.jpg" title="The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version." width="600" /></a><div class="cite">The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.</div></p>
<h2>Microcode</h2>
<p>Like most instructions, the division routines in the 8086 are implemented in microcode.
Most people think of machine instructions as the basic steps that a computer performs.
However, many processors have another layer of software underneath: microcode.
With microcode, instead of building the CPU's control circuitry from complex logic gates, the control logic is largely replaced with code.
To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.
This is especially useful for a machine instruction such as division, which performs many steps in a loop.</p>
<!--
The 8086 uses a hybrid approach: although it uses microcode, much of the instruction functionality is implemented with gate logic.
This approach removed duplication from the microcode and kept the microcode small enough for 1978 technology.
In a sense, the microcode is parameterized.
For instance, the microcode can specify a generic Arithmetic/Logic Unit (ALU) operation and a generic register.
The gate logic examines the instruction to determine which specific operation to perform and the appropriate register.
-->
<p>Each micro-instruction in the 8086 is encoded into 21 bits as shown below.
Every micro-instruction moves data from a source register to a destination register, each specified with 5 bits.
The meaning of the remaining bits depends on the type field and can be anything from an ALU operation to a memory read or write to
a change of microcode control flow.
Thus, an 8086 micro-instruction typically does two things in parallel: the move and the action.
For more about 8086 microcode, see my <a href="https://www.righto.com/2022/11/how-8086-processors-microcode-engine.html">microcode blog post</a>.</p>
<p><a href="https://static.righto.com/images/8086-div/microcode-format.jpg"><img alt="The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?" class="hilite" height="203" src="https://static.righto.com/images/8086-div/microcode-format-w700.jpg" title="The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?" width="700" /></a><div class="cite">The encoding of a micro-instruction into 21 bits. Based on <a href="https://digitalcommons.law.scu.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1031&context=chtlj">NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?</a></div></p>
<p>A few details of the ALU (Arithmetic/Logic Unit) operations are necessary to understand the division microcode.
The ALU has three temporary registers that are invisible to the programmer: tmpA, tmpB, and tmpC.
An ALU operation takes its first argument from the specified temporary register, while the second argument always comes from tmpB.
An ALU operation requires two micro-instructions.
The first micro-instruction specifies the ALU operation and source register, configuring the ALU. For instance, <code>ADD tmpA</code> configures the ALU to add tmpA to the default second argument, tmpB.
In the next micro-instruction (or a later one), the ALU result can be accessed through a register called <code>Σ</code> (Sigma) and moved to another register.</p>
<p>The carry flag plays a key role in division.
The carry flag is one of the programmer-visible status flags that is set by arithmetic operations, but it is also used
by the microcode.
For unsigned addition, the carry flag is set if there is a carry out of the word (or byte).
For subtraction, the carry flag indicates a borrow, and is set if the subtraction requires a borrow.
Since a borrow results if you subtract a larger number from a smaller number, the carry flag also indicates
the "less than" condition.
The carry flag (and the other status flags) are only updated if the micro-instruction contains the <code>F</code> bit.</p>
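<p>A small Python sketch illustrates this dual role of the carry flag (the <code>sub16</code> helper is my own name, not anything in the 8086): a 16-bit subtraction sets the carry (borrow) exactly when the unsigned first operand is smaller.</p>

```python
def sub16(a, b):
    """Model of 16-bit subtraction: the carry flag records the borrow."""
    result = (a - b) & 0xFFFF  # wrap to 16 bits, like the ALU
    carry = 1 if a < b else 0  # borrow out, i.e. unsigned "less than"
    return result, carry

# A borrow occurs only when subtracting a larger number from a smaller one:
assert sub16(0x1234, 0x5678) == (0xBBBC, 1)  # borrow set: 0x1234 < 0x5678
assert sub16(0x5678, 0x1234) == (0x4444, 0)  # no borrow
```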
<p>The <code>RCL</code> (Rotate through Carry, Left) micro-instruction is heavily used in the division microcode.<span id="fnref:rcl"><a class="ref" href="#fn:rcl">2</a></span>
This operation shifts the bits in a 16-bit word, similar to the <code><<</code> bit-shift operation in high-level languages, but with an additional feature.
Instead of discarding the bit on the end, that bit is moved into the carry flag. Meanwhile, the bit formerly in the carry flag moves into the word.
You can think of this as rotating the bits while treating the carry flag as a 17th bit of the word.
(The <code>RCL</code> operation can also act on a byte.)</p>
<p><a href="https://static.righto.com/images/8086-div/lrcy.jpg"><img alt="The rotate through carry left micro-instruction." class="hilite" height="51" src="https://static.righto.com/images/8086-div/lrcy-w300.jpg" title="The rotate through carry left micro-instruction." width="300" /></a><div class="cite">The rotate through carry left micro-instruction.</div></p>
<p>These shifts perform an important part of the division process since shifting can be viewed as multiplying or dividing by two.
<code>RCL</code> also provides a convenient way to move the most-significant bit to the carry flag, where it can be tested for a conditional jump.
(This is important because the top bit is used as the sign bit.)
Another important property is that performing <code>RCL</code> on a lower word and then <code>RCL</code> on an upper word will perform a 32-bit shift, since
the high bit of the lower word will be moved into the low bit of the upper word via the carry bit.
Finally, the shift moves the quotient bit from the carry into the register.</p>
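<p>A rough Python model of <code>RCL</code> shows how chaining two rotates produces a 32-bit shift (the names <code>rcl16</code> and <code>shift32</code> are mine, not Intel's):</p>

```python
def rcl16(word, carry):
    """Rotate left through carry: the old carry enters bit 0,
    and bit 15 becomes the new carry (a 17-bit rotation)."""
    return ((word << 1) | carry) & 0xFFFF, (word >> 15) & 1

def shift32(high, low, carry_in=0):
    """Shift a 32-bit value (high:low) left by one, as the microcode does:
    RCL the low word first, then RCL the high word."""
    low, carry = rcl16(low, carry_in)   # low word's bit 15 moves into the carry...
    high, carry = rcl16(high, carry)    # ...and from the carry into high's bit 0
    return high, low, carry             # carry out is high's old top bit

assert rcl16(0x8000, 0) == (0x0000, 1)          # top bit moved into the carry
assert shift32(0x8000, 0x8001) == (0x0001, 0x0002, 1)
```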
<h2>Binary division</h2>
<p>The division process in the 8086 is similar to grade-school long division, except in binary instead of decimal.
The diagram below shows the process: dividing 67 (the dividend) by 9 (the divisor) yields the quotient 7 at the top and the remainder 4 at the bottom.
Long division is easier in binary than decimal because you don't need to guess the right quotient digit.
Instead, at each step you either subtract the divisor (appropriately shifted)
or subtract nothing.
Note that although the divisor is 4 bits in this example, the subtractions use 5-bit values.
The need for an "extra" bit in division will be important in the discussion below; 16-bit division needs a 17-bit value.</p>
<style type="text/css">
table.divdia {border-collapse: collapse; font-family: "courier-new", courier, monospace; padding: 5px;}
table.divdia tr.top {border-bottom: 1px solid #ccc;}
table.divdia tr.sum {border-top: 1px solid #ccc;}
table.divdia .dim {color: #aaa;}
table.divdia td.l {border-left: 2px solid #444;}
table.divdia td.t {border-top: 2px solid #444;}
table.divdia td.ul {border-bottom: 1px solid #888;}
</style>
<table class="divdia">
<tr><td colspan=8> </td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>1</td><td>0</td><td>0</td><td>1</td><td class="t l">0</td><td class="t">1</td><td class="t">0</td><td class="t">0</td><td class="t">0</td><td class="t">0</td><td class="t">1</td><td class="t">1</td></tr>
<tr><td colspan=4> </td><td>-</td><td class="ul dim">0</td><td class="ul dim">0</td><td class="ul dim">0</td><td class="ul dim">0</td>
<tr><td colspan=5> </td><td>1</td><td>0</td><td>0</td><td>0</td><td class="u">0</td></tr>
<tr><td colspan=5> </td><td>-</td><td class="ul">1</td><td class="ul">0</td><td class="ul">0</td><td class="ul">1</td>
<tr><td colspan=6> </td><td>0</td><td>1</td><td>1</td><td>1</td><td class="u">1</td></tr>
<tr><td colspan=6> </td><td>-</td><td class="ul">1</td><td class="ul">0</td><td class="ul">0</td><td class="ul">1</td>
<tr><td colspan=7> </td><td>0</td><td>1</td><td>1</td><td>0</td><td class="u">1</td></tr>
<tr><td colspan=7> </td><td>-</td><td class="ul">1</td><td class="ul">0</td><td class="ul">0</td><td class="ul">1</td>
<tr><td colspan=8> </td><td>0</td><td>1</td><td>0</td><td>0</td></tr>
</table>
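<p>The long-division steps in the diagram above can be expressed in a few lines of Python (a textbook sketch of the algorithm, not the 8086's microcode; the function name is mine):</p>

```python
def long_divide(dividend, divisor, bits):
    """Binary long division: at each step, subtract the shifted divisor
    if it fits, or subtract nothing; each choice yields one quotient bit."""
    quotient, remainder = 0, dividend
    for i in reversed(range(bits)):
        quotient <<= 1
        if remainder >= (divisor << i):  # does the shifted divisor fit?
            remainder -= divisor << i
            quotient |= 1                # quotient bit is 1 if we subtracted
    return quotient, remainder

assert long_divide(67, 9, 4) == (7, 4)   # the example above: quotient 7, remainder 4
```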
<p>Instead of shifting the divisor to the right each step, the 8086's algorithm shifts the quotient and the current dividend
to the left each step.
This trick reduces the register space required.
Dividing a 32-bit number (the dividend) by a 16-bit number yields a 16-bit result, so it seems like you'd need four 16-bit registers in
total.
The trick is that after each step, the 32-bit dividend gets one bit smaller, while the result gets one bit larger.
Thus, the dividend and the result can be packed together into 32 bits. At the end, what's left of the dividend is
the 16-bit remainder. The table below illustrates this process for a sample dividend (blue) and quotient (green).<span id="fnref:table"><a class="ref" href="#fn:table">3</a></span>
At the end, the 16-bit blue value is the remainder.</p>
<style type="text/css">
table.div {border-collapse: collapse; border: 1px solid #888; font-family: "courier-new", courier, monospace; padding: 5px;}
table.div th {border-bottom: 1px solid #444;}
table.div th:nth-child(1) {border-left: 1px solid #444;}
table.div th:nth-child(3) {border-left: 1px solid #444;}
table.div th:nth-child(4) {padding: 0 4px;}
table.div tr:nth-child(2) {border-bottom: 1px solid #444;}
table.div td:nth-child(8) {border-right: 1px solid #444;}
table.div td:nth-child(16) {border-right: 2px solid #444;}
table.div td:nth-child(24) {border-right: 1px solid #444;}
table.div td.a {background-color: #ddf;}
table.div td.b {background-color: #cfc;}
</style>
<table class="div">
<tr style="font-family: sans-serif; color: blue"><th colspan=16>dividend</th><th colspan=16 style="font-family:sans-serif; color: green">quotient</th></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">1</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">0</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">0</td><td class="b">1</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">0</td><td class="b">1</td><td class="b">1</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">0</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">0</td><td class="b">0</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">0</td><td class="b">0</td><td class="b">1</td></tr>
<tr><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">1</td><td class="b">0</td><td class="b">1</td><td class="b">1</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">1</td></tr>
</table>
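<p>This packing trick can be sketched in Python, with the shrinking dividend and the growing quotient sharing one 32-bit value just as in the table rows (a model under my own naming, not the actual microcode):</p>

```python
def packed_divide(dividend, divisor):
    """Divide a 32-bit dividend by a 16-bit divisor, keeping the shrinking
    dividend and the growing quotient packed in a single 32-bit value."""
    assert 0 < divisor <= 0xFFFF and (dividend >> 16) < divisor
    acc = dividend
    for _ in range(16):
        acc <<= 1                      # dividend shifts left; bit 0 opens up
        if (acc >> 16) >= divisor:     # 17-bit compare of the upper part
            acc -= divisor << 16       # subtract the divisor from the top
            acc |= 1                   # record a 1 quotient bit at the bottom
    return acc & 0xFFFF, acc >> 16     # (quotient, remainder)

assert packed_divide(100000, 7) == (14285, 5)
```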
<h2>The division microcode</h2>
<p>The 8086 has four division instructions to handle signed and unsigned division of byte and word operands.
I'll start by describing the microcode for the unsigned word division instruction <code>DIV</code>, which divides a 32-bit dividend by a 16-bit divisor.
The dividend is supplied in the AX and DX registers while the divisor is specified by the source operand.
The 16-bit quotient is returned in AX and the 16-bit remainder in DX.
For a divide-by-zero, or if the quotient is larger than 16 bits, a type 0 "divide error" interrupt is generated.</p>
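<p>The register-level behavior can be summarized with a short Python model (the names <code>div_word</code> and <code>DivideError</code> are mine; the 8086 itself raises the type 0 interrupt rather than an exception):</p>

```python
class DivideError(Exception):
    """Stands in for the 8086's type 0 "divide error" interrupt."""

def div_word(ax, dx, src):
    """Unsigned 16-bit DIV: divides DX:AX by src.
    Returns the new (AX, DX), i.e. (quotient, remainder)."""
    dividend = (dx << 16) | ax
    if src == 0 or dividend // src > 0xFFFF:
        raise DivideError("quotient too large, or divide by zero")
    return dividend // src, dividend % src

assert div_word(ax=0x8086, dx=0x0001, src=0x0002) == (0xC043, 0x0000)
```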
<h3><code>CORD</code>: The core division routine</h3>
<p>The <code>CORD</code> microcode subroutine below implements the long-division algorithm for all division instructions; I think <code>CORD</code> stands for Core Divide.
At entry, the arguments are in the ALU temporary registers:
tmpA/tmpC hold the 32-bit dividend, while tmpB holds the 16-bit divisor.
(I'll explain the configuration for byte division later.)
Each cycle of the loop shifts the values and then potentially subtracts the divisor.
One bit is appended to the quotient to indicate whether the
divisor was subtracted or not.
At the end of the loop, whatever is left of the dividend is the remainder.</p>
<p>Each micro-instruction specifies a register move on the left, and an action on the right.
The moves transfer words between the visible registers and the ALU's temporary registers, while the actions are mostly ALU operations or control flow.
As is usually the case with microcode, the details are tricky.
The first three lines below check if the division will overflow or divide by zero.
The code compares tmpA and tmpB by subtracting tmpB, discarding the result, but setting the status flags (<code>F</code>).
If the upper word of the dividend is greater than or equal to the divisor, the division will overflow, so execution jumps to <code>INT0</code> to generate
a divide error interrupt.<span id="fnref:interrupts"><a class="ref" href="#fn:interrupts">4</a></span> (This handles both the case where the quotient is too large to fit and the divide-by-0 case.)
The number of loops in the division algorithm is controlled by a special-purpose loop counter.
The <code>MAXC</code> micro-instruction initializes the counter to 7 or 15, for a byte or word divide instruction respectively.</p>
<pre class="microcode">
move action
SUBT tmpA <span><b>CORD:</b> set up compare</span>
Σ → no dest MAXC F <span>compare, set up counter, update flags</span>
JMP NCY INT0 <span>generate interrupt if overflow</span>
RCL tmpC <span><b>3:</b> main loop: left shift tmpA/tmpC</span>
Σ → tmpC RCL tmpA <span></span>
Σ → tmpA SUBT tmpA <span>set up compare/subtract</span>
JMPS CY 13 <span>jump if top bit of tmpA was set</span>
Σ → no dest F <span>compare, update flags</span>
JMPS NCY 14 <span>jump for subtract</span>
JMPS NCZ 3 <span>test counter, loop back to 3</span>
RCL tmpC <span><b>10:</b> done:</span>
Σ → tmpC <span>shift last bit into tmpC</span>
Σ → no dest RTN <span>done: get top bit, return</span>
RCY <span><b>13:</b> reset carry</span>
Σ → tmpA JMPS NCZ 3 <span><b>14:</b> subtract, loop</span>
JMPS 10 <span>done, goto 10</span>
</pre>
<p>The main loop starts at <em>3</em>.
The tmpC and tmpA registers are shifted left. This has two important side effects. First, the old carry bit (which holds the latest
quotient bit) is shifted into the bottom of tmpC. Second, the top bit of tmpA is shifted into the carry bit;
this provides the necessary "extra" bit for the subtraction below.
Specifically, if the carry (the "extra" tmpA bit) is set, tmpB can be subtracted, which is accomplished by jumping to <em>13</em>.
Otherwise, the code compares tmpA and tmpB by
subtracting tmpB, discarding the result, and updating the flags (<code>F</code>).
If there was no borrow/carry (tmpA ≥ tmpB), execution jumps to <em>14</em> to subtract.
Otherwise, the internal loop counter is decremented and control flow goes back to the top of the loop if not done
(<code>NCZ</code>, Not Counter Zero).
If the loop is done, tmpC is rotated left to pick up the last quotient bit from the carry flag.
Then a second rotate of tmpC is performed but the result is discarded; this puts the top bit of tmpC into the carry flag for
use later in <code>POSTIDIV</code>. Finally, the subroutine returns.</p>
<p>The subtraction path is <em>13</em> and <em>14</em>, which subtract tmpB from tmpA by storing the result (Σ) in tmpA.
This path resets the carry flag for use as the quotient bit.
As in the other path, the loop counter is decremented and tested (<code>NCZ</code>) and execution either continues back at <em>3</em>
or finishes at <em>10</em>.</p>
<p>One complication is that the carry bit is the opposite of the desired quotient bit.
Specifically, if tmpA < tmpB, the comparison generates a borrow so the carry flag is set to 1.
In this case, the desired quotient bit is 0 and no subtraction takes place.
But if tmpA ≥ tmpB, the comparison does not generate a borrow (so the carry flag is set to 0), the code subtracts tmpB,
and the desired quotient bit is 1.
Thus, tmpC ends up holding the <em>complement</em> of the desired result; this is fixed later.</p>
<p>The microcode is carefully designed to pack the divide loop into a small number of micro-instructions.
It uses the registers and the carry flag in tricky ways, using the carry flag to hold the top bit of tmpA,
the comparison result, and the generated quotient bit.
This makes the code impressively dense but tricky to understand.</p>
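<p>Putting these tricks together, here is my attempt at a Python reconstruction of the <code>CORD</code> loop (a sketch based on the microcode above, not Intel's code; I'm not modeling the flag hardware exactly):</p>

```python
def cord(tmpA, tmpC, tmpB, bits=16):
    """Model of CORD: tmpA:tmpC holds the dividend, tmpB the divisor.
    Quotient bits accumulate in tmpC in complemented form."""
    mask = (1 << bits) - 1
    if tmpA >= tmpB:
        raise ValueError("overflow")    # the INT0 path: quotient won't fit
    carry = 1                           # borrow left over from the initial compare
    for _ in range(bits):
        # RCL tmpC then RCL tmpA: a left shift through the carry
        tmpC, carry = ((tmpC << 1) | carry) & mask, tmpC >> (bits - 1)
        tmpA, carry = ((tmpA << 1) | carry) & mask, tmpA >> (bits - 1)
        if carry:                       # "extra" 17th bit set: always subtract
            tmpA = (tmpA - tmpB) & mask
            carry = 0                   # complemented quotient bit: 0
        elif tmpA >= tmpB:              # compare: no borrow, so subtract
            tmpA -= tmpB
            carry = 0
        else:
            carry = 1                   # borrow: complemented quotient bit: 1
    tmpC = ((tmpC << 1) | carry) & mask # shift the last quotient bit into tmpC
    return tmpC ^ mask, tmpA            # COM1 gives the true quotient; tmpA is the remainder

assert cord(0x23, 0x45, 0x34, bits=8) == (0xAD, 0x21)   # 0x2345 / 0x34
assert cord(0x0001, 0x0000, 0x0003) == (0x10000 // 3, 0x10000 % 3)
```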
<h3>The top-level division microcode</h3>
<p>Now I'll pop up a level and take a look at the top-level microcode (below) that implements the <code>DIV</code> and <code>IDIV</code> machine instructions.
The first three instructions load tmpA, tmpC, and tmpB from the specified registers.
(The <code>M</code> register refers to the source specified in the instruction, either a register or a memory location.)
Next, the <code>X0</code> condition tests bit 3 of the instruction, which in this case distinguishes <code>DIV</code> from <code>IDIV</code>.
For signed division (<code>IDIV</code>), the microcode calls <code>PREIDIV</code>, which I'll discuss below.
Next, the <code>CORD</code> micro-subroutine discussed above is called to perform the division.</p>
<pre class="microcode">
DX → tmpA <span><b>iDIV rmw:</b> load tmpA, tmpC, tmpB </span>
AX → tmpC RCL tmpA <span>set up RCL left shift operation</span>
M → tmpB CALL X0 PREIDIV <span>set up integer division if IDIV</span>
CALL CORD <span>call CORD to perform division </span>
COM1 tmpC <span>set up to complement the quotient </span>
DX → tmpB CALL X0 POSTIDIV <span>get original dividend, handle IDIV</span>
Σ → AX NXT <span>store updated quotient</span>
tmpA → DX RNI <span>store remainder, run next instruction</span>
</pre>
<p>As discussed above, the quotient in tmpC needs to be 1's-complemented, which is done with <code>COM1</code>.
For <code>IDIV</code>, the micro-subroutine <code>POSTIDIV</code> sets the signs of the results appropriately.
The results are stored in the <code>AX</code> and <code>DX</code> registers.
The <code>NXT</code> micro-operation indicates that the next micro-instruction is the last one, directing the microcode engine to start fetching the next machine instruction. Finally, <code>RNI</code> directs the microcode engine to run the next machine instruction.</p>
<h2>8-bit division</h2>
<p>The 8086 has separate opcodes for 8-bit division.
The 8086 supports many instructions with byte and word versions, using 8-bit or 16-bit arguments respectively.
In most cases, the byte and word instructions use the same microcode, with the ALU and register hardware using bytes or words based on the instruction.
In the case of division,
the shift micro-operations act on tmpA and tmpC as 8-bit registers rather than 16-bit registers.
Moreover, the <code>MAXC</code> micro-operation initializes the internal counter to 7 rather than 15.
Thus, the same <code>CORD</code> microcode is used for word and byte division, but the number of loops and the specific
operations are changed by the hardware.</p>
<p>The diagram below shows the tmpA and tmpC registers during each step of dividing 0x2345 by 0x34.
Note that the registers are treated as 8-bit registers.
The dividend (blue) steadily shrinks as the quotient (green) takes its place.
At the end, the remainder is 0x21 (blue) and the quotient is 0xad, the complement of the green value.</p>
<table class="div">
<tr style="font-family: sans-serif"><th colspan=16>tmpA</th><th colspan=16 style="font-family:sans-serif">tmpC</th></tr>
<tr><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="b">0</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="b">0</td><td class="b">1</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="b">0</td><td class="b">1</td><td class="b">0</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="b">0</td><td class="b">1</td><td class="b">0</td><td class="b">1</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="b">0</td><td class="b">1</td><td class="b">0</td><td class="b">1</td><td class="b">0</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">1</td><td class="b">0</td><td class="b">1</td><td class="b">0</td><td class="b">1</td><td class="b">0</td><td class="b">0</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">1</td><td class="b">0</td><td class="b">1</td><td class="b">0</td><td class="b">1</td><td class="b">0</td><td class="b">0</td><td class="b">1</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">0</td><td class="a">1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td class="b">0</td><td class="b">1</td><td class="b">0</td><td class="b">1</td><td class="b">0</td><td class="b">0</td><td class="b">1</td><td class="b">0</td></tr>
</table>
<p>Although the <code>CORD</code> routine is shared for byte and word division, the top-level microcode is different.
In particular, the byte and word division instructions use different registers, requiring microcode changes.
The microcode below is the top-level code for byte division. It is almost the same as the microcode above, except it
uses the top and bottom bytes of the accumulator (<code>AH</code> and <code>AL</code>) rather than the <code>AX</code> and <code>DX</code> registers.</p>
<pre class="microcode">
AH → tmpA <span><b>iDIV rmb:</b> get arguments</span>
AL → tmpC RCL tmpA <span>set up RCL left shift operation</span>
M → tmpB CALL X0 PREIDIV <span>handle signed division if IDIV</span>
CALL CORD <span>call CORD to perform division</span>
COM1 tmpC <span>complement the quotient</span>
AH → tmpB CALL X0 POSTIDIV <span>handle signed division if IDIV</span>
Σ → AL NXT <span>store quotient</span>
tmpA → AH RNI <span>store remainder, run next instruction</span>
</pre>
<h2>Signed division</h2>
<p>The 8086 (like most computers) represents signed numbers using a format called two's complement.
While a regular byte holds a number from 0 to 255, a signed byte holds a number from -128 to 127.
A negative number is formed by flipping all the bits (known as the one's complement) and then adding 1, yielding the two's complement value.
For instance, +5 is <code>0x05</code> while -5 is <code>0xfb</code>.
(Note that the top bit of a number is set for a negative number; this is the sign bit.)
The nice thing about two's complement numbers is that the same addition and subtraction operations work on both signed and unsigned values.
Unfortunately, this is not the case for signed multiplication and division.</p>
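To make the flip-the-bits-and-add-1 relationship concrete, here is a minimal Python sketch (the function name is mine, not anything in the 8086):

```python
def negate_byte(n):
    """Two's complement negation of an 8-bit value: take the one's
    complement (flip all bits), add 1, and keep the low 8 bits."""
    return ((n ^ 0xFF) + 1) & 0xFF

assert negate_byte(0x05) == 0xFB          # +5 -> -5
assert negate_byte(0xFB) == 0x05          # -5 -> +5
assert negate_byte(0x05) & 0x80 == 0x80   # -5 has the sign bit (bit 7) set
```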
<p>The 8086 has separate <code>IDIV</code> (Integer Divide) instructions to perform signed division.
The 8086 performs signed division by converting the arguments to positive values, performing unsigned division, and then
negating the results if necessary.
As shown earlier, signed and unsigned division both use the same top-level microcode and the microcode conditionally calls some subroutines for
signed division.
These additional subroutines cause a significant performance penalty, making signed division over 20 cycles slower than unsigned division.
I will discuss those micro-subroutines below.</p>
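The overall strategy can be modeled in a few lines of Python (a sketch of the approach, not the microcode itself; the function and variable names are mine):

```python
def idiv(dividend, divisor):
    """Model of the 8086 approach to signed division: make both arguments
    positive, divide unsigned, then negate the results where needed."""
    negative_quotient = (dividend < 0) != (divisor < 0)  # like toggling F1 per negative argument
    q, r = divmod(abs(dividend), abs(divisor))           # unsigned division
    if negative_quotient:
        q = -q
    if dividend < 0:            # the remainder takes the sign of the dividend
        r = -r
    return q, r

assert idiv(-27, 7) == (-3, -6)
assert idiv(27, -7) == (-3, 6)
assert idiv(-27, -7) == (3, -6)   # two negatives make a positive quotient
```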
<h3><code>PREIDIV</code></h3>
<p>The first subroutine for signed division is <code>PREIDIV</code>, performing preliminary operations for integer division.
It converts the two arguments, stored in tmpA/tmpC and tmpB, to positive values.
It keeps track of the signs using an internal flag called <code>F1</code>, toggling this flag for each negative argument.
This conveniently handles the rule that two negatives make a positive since complementing the <code>F1</code> flag twice will clear it.
The point of this is that the main division code (<code>CORD</code>) only needs to handle unsigned division.</p>
<p>The microcode below implements <code>PREIDIV</code>.
First it tests if tmpA is negative, but
the 8086 does not have a microcode condition to directly test the sign of a value.
Instead, the microcode determines if a value is negative by shifting the value left, which moves the top (sign) bit into the carry flag.
The conditional jump (<code>NCY</code>) then tests if the carry is clear, jumping if the value is non-negative.
If tmpA is negative, execution falls through to negate the first argument.
This is tricky because the argument is split between the tmpA and tmpC registers.
The two's complement operation (<code>NEG</code>) is applied to the low word, while either the two's complement or the one's complement (<code>COM1</code>) is applied to
the upper word, depending on the carry, for mathematical reasons.<span id="fnref:neg"><a class="ref" href="#fn:neg">5</a></span>
The <code>F1</code> flag is complemented to keep track of the sign.
(The multiplication process reuses most of this code, starting at the <code>NEGATE</code> entry point.)</p>
<pre class="microcode">
Σ → no dest <span><b>PREIDIV:</b> shift tmpA left</span>
JMPS NCY 7 <span>jump if non-negative</span>
NEG tmpC <span><b>NEGATE:</b> negate tmpC</span>
Σ → tmpC COM1 tmpA F <span>maybe complement tmpA</span>
JMPS CY 6
NEG tmpA <span>negate tmpA if there's no carry</span>
Σ → tmpA CF1 <span><b>6:</b> toggle F1 (sign)</span>
RCL tmpB <span><b>7:</b> test sign of tmpB</span>
Σ → no dest NEG tmpB <span>maybe negate tmpB</span>
JMPS NCY 11 <span>skip if tmpB positive</span>
Σ → tmpB CF1 RTN <span>else negate tmpB, toggle F1 (sign)</span>
RTN <span><b>11:</b> return</span>
</pre>
<p>The next part of the code, starting at <em>7</em>, negates tmpB (the divisor) if it is negative. Since the divisor is a single
word, this code is simpler.
As before, the <code>F1</code> flag is toggled if tmpB is negative.
At the end, both arguments (tmpA/tmpC and tmpB) are positive, and <code>F1</code> indicates the sign of the result.</p>
<h3><code>POSTIDIV</code></h3>
<p>After computing the result, the <code>POSTIDIV</code> routine is called for signed division.
The routine first checks for a signed overflow and raises a divide-by-zero interrupt if so.
Next, the routine negates the quotient and remainder if necessary.<span id="fnref:signs"><a class="ref" href="#fn:signs">6</a></span></p>
<p>In more detail, the <code>CORD</code> routine left the top bit of tmpC (the complemented quotient) in the carry flag.
Now, that bit is tested. If the carry bit is 0 (<code>NCY</code>), then the top bit of the quotient is 1 so the quotient is too big to fit in a signed value.<span id="fnref:overflow"><a class="ref" href="#fn:overflow">7</a></span>
In this case, the <code>INT0</code> routine is executed to trigger a type 0 interrupt, indicating a divide overflow.
(This is a rather roundabout way of testing the quotient, relying on a carry bit that was set in a previous subroutine.)</p>
<pre class="microcode">
JMP NCY INT0 <span><b>POSTIDIV:</b> if overflow, trigger interrupt</span>
RCL tmpB <span>set up rotate of tmpB</span>
Σ → no dest NEG tmpA <span>get sign of tmpB, set up negate of tmpA</span>
JMPS NCY 5 <span>skip if tmpB non-negative</span>
Σ → tmpA <span>otherwise negate tmpA (remainder)</span>
INC tmpC <span><b>5:</b> set up increment</span>
JMPS F1 8 <span>test sign flag, skip if set</span>
COM1 tmpC <span>otherwise set up complement</span>
CCOF RTN <span><b>8:</b> clear carry and overflow flags, return</span>
</pre>
<p>Next, tmpB (the divisor) is rotated to see if it is negative.
(The caller loaded tmpB with the original divisor, replacing the dividend that was in tmpB previously.)
If the divisor is negative, tmpA (the remainder) is negated.
This implements the 8086 rule that the sign of the remainder matches the sign of the divisor.</p>
<p>The quotient handling is a bit tricky. Recall that tmpC holds the complemented quotient.
The <code>F1</code> flag is set if the result should be negative. In that case, the complemented quotient needs to be incremented
by 1 (<code>INC</code>) to convert from 1's complement to 2's complement.
On the other hand, if the quotient should be positive, 1's-complementing tmpC (<code>COM1</code>) will yield the desired positive
quotient.
In either case, the ALU is configured in <code>POSTIDIV</code>, but the result will be stored back in the main routine.</p>
<p>Finally, the <code>CCOF</code> micro-operation clears the carry and overflow flags.
Curiously, the 8086 documentation declares that the status flags are undefined following <code>IDIV</code>, but the microcode
explicitly clears the carry and overflow flags.
I assume that the flags were cleared in analogy with <code>MUL</code>, but then Intel decided that this wasn't useful so they
didn't document it. (Documenting this feature would obligate them to provide the same functionality in later x86 chips.)</p>
<h2>The hardware for division</h2>
<p>For the most part, the 8086 uses the regular ALU addition and shifts for the division algorithm. Some special hardware
features provide assistance.
In this section, I'll look at this hardware.</p>
<h3>Loop counter</h3>
<p>The 8086 has a 4-bit loop counter for multiplication and division. This counter starts at 7 for byte division and 15 for word division,
based on the low bit of the opcode.
This loop counter allows the microcode to decrement the counter, test for the end, and perform a conditional branch in one micro-operation.
The counter is implemented with four flip-flops, along with logic to compute the value after decrementing by one.
The <code>MAXC</code> (Maximum Count) micro-instruction sets the counter to 7 or 15 for byte or word operations respectively.
The <code>NCZ</code> (Not Counter Zero) micro-instruction has two actions. First, it performs a conditional jump if the counter is nonzero.
Second, it decrements the counter.</p>
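The counter's behavior can be modeled as follows (the class and method names are hypothetical; this is a behavioral sketch, not the flip-flop logic):

```python
class LoopCounter:
    """Model of the 8086's 4-bit internal loop counter."""
    def __init__(self):
        self.count = 0

    def maxc(self, word_op):
        """MAXC: initialize to 15 for a word operation, 7 for a byte operation."""
        self.count = 15 if word_op else 7

    def ncz(self):
        """NCZ: report whether the jump is taken (counter nonzero),
        then decrement, wrapping as a 4-bit value."""
        taken = self.count != 0
        self.count = (self.count - 1) & 0xF
        return taken

c = LoopCounter()
c.maxc(word_op=False)
iterations = 1
while c.ncz():          # loop body runs once more each time the jump is taken
    iterations += 1
assert iterations == 8  # a byte operation makes 8 passes in total
```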
<h3>The F1 flag</h3>
<p>Signed multiplication and division use an internal flag called <code>F1</code><span id="fnref:f1"><a class="ref" href="#fn:f1">8</a></span> to keep track of the sign.
The <code>F1</code> flag is toggled by microcode through the <code>CF1</code> (Complement F1) micro-instruction.
The <code>F1</code> flag is implemented with a flip-flop, along with a multiplexer to select the value. It is cleared when a new instruction starts,
set by a <code>REP</code> prefix, and toggled by the <code>CF1</code> micro-instruction.
The diagram below shows how the F1 latch and the loop counter appear on the die. In this image, the metal layer has been removed, showing the
silicon and the polysilicon wiring underneath.</p>
<p><a href="https://static.righto.com/images/8086-div/counter.jpg"><img alt="The counter and F1 latch as they appear on the die. The latch for the REP state is also here." class="hilite" height="308" src="https://static.righto.com/images/8086-div/counter-w600.jpg" title="The counter and F1 latch as they appear on the die. The latch for the REP state is also here." width="600" /></a><div class="cite">The counter and F1 latch as they appear on the die. The latch for the REP state is also here.</div></p>
<h3>X register</h3>
<p>The division microcode uses an internal register called the <code>X</code> register to distinguish between the <code>DIV</code> and <code>IDIV</code> instructions.
The <code>X</code> register is a 3-bit register that holds the ALU opcode, indicated by bits 5–3 of the instruction.<span id="fnref:x-reg"><a class="ref" href="#fn:x-reg">9</a></span>
Since the instruction is held in the Instruction Register, you might wonder why a separate register is required.
The motivation is that some opcodes specify the type of ALU operation in the second byte of the instruction, the ModR/M byte, bits 5–3.<span id="fnref:opcode"><a class="ref" href="#fn:opcode">10</a></span>
Since the ALU operation is sometimes specified in the first byte and sometimes in the second byte, the <code>X</code> register was added to handle
both these cases.</p>
<p>For the most part, the <code>X</code> register indicates which of the eight standard ALU operations is selected (<code>ADD</code>, <code>OR</code>, <code>ADC</code>, <code>SBB</code>, <code>AND</code>, <code>SUB</code>, <code>XOR</code>, <code>CMP</code>).
However, a few instructions use bit 0 of the <code>X</code> register to distinguish between other pairs of instructions.
For instance, it distinguishes between <code>MUL</code> and <code>IMUL</code>, <code>DIV</code> and <code>IDIV</code>, <code>CMPS</code> and <code>SCAS</code>, <code>MOVS</code> and <code>LODS</code>, or <code>AAA</code> and <code>AAS</code>.
While these instruction pairs may appear to have arbitrary opcodes, they have been carefully assigned
so the microcode can distinguish them.</p>
<p>The implementation of the <code>X</code> register is straightforward, consisting of three flip-flops to hold the three bits of the instruction.
The flip-flops are loaded from the prefetch queue bus during First Clock and during Second Clock for appropriate instructions, as the
instruction bytes travel over the bus.
Testing bit 0 of the <code>X</code> register with the <code>X0</code> condition is supported by the microcode condition evaluation circuitry, so it can be used for conditional jumps in the microcode.</p>
<h2>Algorithmic and historical context</h2>
<p>As you can see from the microcode, division is a complicated and relatively slow process.
On the 8086, division takes up to 184 clock cycles to perform all the microcode steps.
(In comparison, two registers can be added in 3 clock cycles.)
Multiplication and division both loop over the bits, performing repeated addition or subtraction respectively.
But division requires a decision (subtract or not?) at each step, making it even slower, about half the speed of multiplication.<span id="fnref:history"><a class="ref" href="#fn:history">11</a></span></p>
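The bit-at-a-time loop with its per-step decision can be sketched as standard restoring division (a simplified model, without the 8086's quotient-complementing quirk):

```python
def divide_restoring(dividend, divisor, bits=16):
    """Restoring division: shift in one dividend bit per step, and subtract
    the divisor if it fits (quotient bit 1); otherwise leave the partial
    remainder alone (quotient bit 0)."""
    quotient = 0
    remainder = 0
    for i in range(bits - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:    # the "subtract or not?" decision each step
            remainder -= divisor
            quotient |= 1
    return quotient, remainder

assert divide_restoring(100, 7) == (14, 2)
assert divide_restoring(0xFFFF, 0x0100) == (0xFF, 0xFF)
```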
<p>Various algorithms have been developed to speed up division.
Rather than performing long division one bit at a time, you can do long division in, say, base 4, producing two quotient
bits in each step.
As with decimal long division, the tricky part is figuring out what digit to select. The <a href="https://en.wikipedia.org/wiki/Division_algorithm#SRT_division">SRT algorithm</a> (1957) uses a small
lookup table to estimate the quotient digit from a few bits of the divisor and dividend.
The clever part is that the selected digit doesn't need to be exactly right at each step; the algorithm will self-correct
after a wrong "guess".
The Pentium processor (1993) famously had a <a href="https://math.mit.edu/~edelman/homepage/papers/pentiumbug.pdf">floating point division bug</a>
due to a few missing values in the SRT table. This bug cost Intel $475 million to replace the faulty processors.</p>
<p>Intel's x86 processors steadily improved divide performance. The 80286 (1982) performed a word divide in 22 clocks, about
6 times as fast as the 8086.
In the <a href="https://www.cubawiki.com.ar/images/b/b3/Orga2_paper_penyn.pdf">Penryn</a> architecture (2007), Intel upgraded from
Radix-4 to Radix-16 division.
Rather than having separate integer and floating-point hardware, integer divides were handled through the floating point divider.
Although modern Intel processors have greatly improved multiplication and division compared to the 8086, division is still a relatively slow operation.
While a Tiger Lake (2020) processor can perform an integer multiplication every clock cycle (with a latency of 3 cycles),
division is much slower and can only be done once every 6–10 clock cycles (<a href="https://agner.org/optimize/">details</a>).</p>
<p>I've written numerous <a href="https://www.righto.com/search/label/8086">posts on the 8086</a> so far and
plan to continue reverse-engineering the 8086 die so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I've also started experimenting with Mastodon recently as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:microcode">
<p>My microcode analysis is based on Andrew Jenner's <a href="https://www.reenigne.org/blog/8086-microcode-disassembled/">8086 microcode disassembly</a>. <a class="footnote-backref" href="#fnref:microcode" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:rcl">
<p>The 8086 patent and Andrew Jenner's microcode use the name <code>LRCY</code> (Left Rotate through Carry) instead of <code>RCL</code>.
I figure that <code>RCL</code> will be more familiar to people because of the corresponding machine instruction. <a class="footnote-backref" href="#fnref:rcl" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:table">
<p>In the dividend/quotient table, the tmpA register is on the left and the tmpC register is on the right.
The example shows 0x0f00ff00 divided by 0x0ffc, yielding the remainder 0x0030 (blue) and the quotient 0xf04c (green).
(The green bits are the complement of the quotient due to the 8086's implementation.) <a class="footnote-backref" href="#fnref:table" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:interrupts">
<p>I described the 8086's interrupt circuitry in detail in <a href="https://www.righto.com/2023/02/8086-interrupt.html">this post</a>. <a class="footnote-backref" href="#fnref:interrupts" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:neg">
<p>The negation code is a bit tricky because the result is split across two words.
In most cases, the upper word is bitwise complemented. However, if the lower word is zero, then the upper word is negated (two's complement).
I'll demonstrate with 16-bit values to keep the examples small.
The number 257 (0x0101) is negated to form -257 (0xfeff). Note that the upper byte is the one's complement (0x01 vs 0xfe) while the lower
byte is two's complement (0x01 vs 0xff).
On the other hand, the number 256 (0x0100) is negated to form -256 (0xff00). In this case, the upper byte is the two's complement (0x01 vs 0xff)
and the lower byte is also the two's complement (0x00 vs 0x00).</p>
<p>(Mathematical explanation: the two's complement is formed by taking the one's complement and adding 1. In most cases, there won't be a carry from
the low byte to the upper byte, so the upper byte will remain the one's complement. However, if the low byte is 0, the complement is 0xff and
adding 1 will form a carry. Adding this carry to the upper byte yields the two's complement of that byte.)</p>
<p>To support multi-word negation, the 8086's <code>NEG</code> instruction clears the carry flag if the operand is 0, and otherwise sets the carry flag.
(This is the opposite of the above because subtractions (including <code>NEG</code>) treat the carry flag as a borrow flag, with the opposite meaning.)
The microcode <code>NEG</code> operation has identical behavior to the machine instruction, since it is used to implement the machine instruction.</p>
<p>Thus to perform a two-word negation, the microcode negates the low word (tmpC) and updates the flags (<code>F</code>).
If the carry is set, the one's complement is applied to the upper word (tmpA). But if the carry is cleared, the two's complement is applied to tmpA. <a class="footnote-backref" href="#fnref:neg" title="Jump back to footnote 5 in the text">↩</a></p>
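The two-word negation rule can be checked with a short Python model (the function name is mine; the words correspond to the tmpC/tmpA split):

```python
def negate_32(lo, hi):
    """Negate a 32-bit value split across two 16-bit words, as the microcode
    does: NEG the low word; if the low word was nonzero (borrow produced),
    one's-complement the high word, otherwise two's-complement it."""
    new_lo = (-lo) & 0xFFFF
    if lo != 0:
        new_hi = hi ^ 0xFFFF        # one's complement (COM1)
    else:
        new_hi = (-hi) & 0xFFFF     # two's complement (NEG)
    return new_lo, new_hi

# 0x0000_0101 (257) negates to 0xFFFF_FEFF (-257): high word is one's-complemented
assert negate_32(0x0101, 0x0000) == (0xFEFF, 0xFFFF)
# 0x0001_0000 (65536) negates to 0xFFFF_0000: high word is two's-complemented
assert negate_32(0x0000, 0x0001) == (0x0000, 0xFFFF)
```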
</li>
<li id="fn:signs">
<p>There is a bit of ambiguity with the quotient and remainder of negative numbers.
For instance, consider -27 ÷ 7. -27 = 7 × -3 - 6 = 7 × -4 + 1. So you could consider the
result to be a quotient of -3 and remainder of -6, or a quotient of -4 and a remainder of 1.
The 8086 uses the rule that the remainder will have the same sign as the dividend, so the first result would be used.
The advantage of this rule is that you can perform unsigned division and adjust the signs afterward:
<br/>27 ÷ 7 = quotient 3, remainder 6.
<br/>-27 ÷ 7 = quotient -3, remainder -6.
<br/>27 ÷ -7 = quotient -3, remainder 6.
<br/>-27 ÷ -7 = quotient 3, remainder -6.</p>
<p>This rule is known as truncating division, but some languages use different approaches such as
floored division, rounded division, or Euclidean division.
<a href="https://en.wikipedia.org/wiki/Modulo#Variants_of_the_definition">Wikipedia</a> has details. <a class="footnote-backref" href="#fnref:signs" title="Jump back to footnote 6 in the text">↩</a></p>
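The difference between the variants is easy to see in Python, whose <code>//</code> operator floors rather than truncates (the helper function is mine):

```python
def div_truncating(a, b):
    """8086-style truncating division: the quotient rounds toward zero,
    and the remainder takes the sign of the dividend."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q, a - q * b

assert div_truncating(-27, 7) == (-3, -6)   # the 8086's rule
assert (-27 // 7, -27 % 7) == (-4, 1)       # Python's floored division differs
```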
</li>
<li id="fn:overflow">
<p>The signed overflow condition is slightly stricter than necessary.
For a word division, the 16-bit quotient is restricted to the range -32767 to 32767.
However, a 16-bit signed value can take on the values -32768 to 32767.
Thus, a quotient of -32768 fits in a 16-bit signed value even though the 8086 considers it an error.
This is a consequence of the 8086 performing unsigned division and then updating the sign if necessary. <a class="footnote-backref" href="#fnref:overflow" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:f1">
<p>The internal <code>F1</code> flag is also used to keep track of a <code>REP</code> prefix for use with a string operation.
I discussed string operations and the <code>F1</code> flag in <a href="https://www.righto.com/2023/04/8086-microcode-string-operations.html">this post</a>. <a class="footnote-backref" href="#fnref:f1" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:x-reg">
<p>Curiously, the <a href="https://patents.google.com/patent/US4449184A/">8086 patent</a> states that the <code>X</code> register is a 4-bit register holding bits 3–6 of
the byte (col. 9, line 20). But looking at the die, it is a 3-bit register holding bits 3–5 of the byte. <a class="footnote-backref" href="#fnref:x-reg" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:opcode">
<p>Some instructions are specified by bits 5–3 in the ModR/M byte rather than in the first opcode byte.
The motivation is to avoid wasting bits for instructions that use a ModR/M byte but don't need a register specification.
For instance, consider the instruction <code>ADD [BX],0x1234</code>. This instruction uses a ModR/M byte to specify the memory address.
However, because it uses an immediate operand, it does not need the register specification normally provided by bits 5–3 of the ModR/M byte.
This frees up the bits to specify the instruction.
From one perspective, this is an ugly hack, while from another perspective it is a clever optimization. <a class="footnote-backref" href="#fnref:opcode" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:history">
<p>Even the earliest computers such as ENIAC (1945) usually supported multiplication and division.
However, early microprocessors did not provide multiplication and division instructions due to the
complexity of these instructions.
Instead, the programmer would need to write an assembly code loop, which was very slow.
Early microprocessors often had binary-coded decimal instructions that could perform additions and subtractions in decimal.
One motivation for these instructions was that converting between binary and decimal was extremely slow due to the
need for multiplication and division. Instead, it was easier and faster to keep the values as decimal if that was how
they were displayed.</p>
<p>The Texas Instruments TMS9900 (1976) was one of the first microprocessors with multiplication and division instructions.
Multiply and divide instructions remained somewhat controversial on RISC (Reduced Instruction-Set Computer) processors due
to the complexity of these instructions.
The early ARM processors, for instance, did not support multiplication and division.
Multiplication was added to ARMv2 (1986) but most ARM processors still don't have integer division.
The popular open-source RISC-V architecture (2015) doesn't include integer multiply and divide by default, but
provides them as an optional "M" extension.</p>
<p>The 8086's algorithm is designed for simplicity rather than speed.
It is a "restoring" algorithm that checks before subtracting to ensure that the current term is always positive.
This can require two ALU operations (comparison and subtraction) per cycle.
A slightly more complex approach is a "nonrestoring" algorithm that subtracts even if it yields a negative term, and then
adds during a later loop iteration. <a class="footnote-backref" href="#fnref:history" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
</ol>
</div>
<h1>The microcode and hardware in the 8086 processor that perform string operations</h1>
<style>
.hilite {cursor:zoom-in}
a:link img.hilite, a:visited img.hilite {color: #fff;}
a:hover img.hilite {color: #f66;}
</style>
<style>
pre.microcode {font-family: courier, fixed; padding: 10px; background-color: #f5f5f5; display:inline-block;border:none;}
pre.microcode span {color: green; font-style:italic; font-family: sans-serif; font-size: 90%;}
</style>
<p>Intel introduced the 8086 microprocessor in 1978. This processor ended up being hugely influential, setting the path
for the x86 architecture that is extensively used today.
One interesting feature of the 8086 was instructions that can efficiently
operate on blocks of memory up to 64K bytes long.<span id="fnref:history"><a class="ref" href="#fn:history">1</a></span>
These instructions rapidly copy, compare, or scan data and are known as "string" instructions.<span id="fnref:string"><a class="ref" href="#fn:string">2</a></span></p>
<p>In this blog post, I explain string operations in the 8086, analyze the microcode that it used, and discuss the hardware
circuitry that helped it out.
My analysis is based on reverse-engineering the 8086 from die photos. The photo below shows the chip under a microscope.
I've labeled the key functional blocks; the ones that are important to this post are darker.
Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top
and an Execution Unit (EU) below.
The BIU handles memory accesses, while the Execution Unit (EU) executes instructions.
The microcode ROM at the lower right controls the process.</p>
<!--
This post is part of my [series](https://www.righto.com/search/label/8086) on the internals of the 8086.
-->
<p><a href="https://static.righto.com/images/8086-str/die-labeled.jpg"><img alt="The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version." class="hilite" height="624" src="https://static.righto.com/images/8086-str/die-labeled-w600.jpg" title="The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version." width="600" /></a><div class="cite">The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.</div></p>
<h2>Segments and addressing</h2>
<p>Before I get into the details of the string instructions, I need to give a bit of background on how the 8086 accesses
memory through segments.
Earlier microprocessors such as the Intel 8080 (1974) used 16 bits to specify a memory address, allowing a maximum of 64K of memory.
This memory capacity is absurdly small by modern standards, but at the time when a 4K memory board cost hundreds of dollars, this limit
was not a problem.
However, due to Moore's Law and the exponential growth in memory capacity, the follow-on 8086 processor needed to support more memory.
At the same time, the 8086 needed to use 16-bit registers for backward compatibility with the 8080.</p>
<p>The much-reviled solution was to create a 1-megabyte (20-bit) address space consisting of 64K segments, with a 16-bit address
specifying a position within the segment.
In more detail, the memory address was specified by a 16-bit offset address along with a particular 16-bit segment register selecting a segment.
The segment register's value was shifted by 4 bits to give the segment's 20-bit base address. The 16-bit offset address was added,
yielding a 20-bit memory address.
This gave the processor a 1-megabyte address space, although only 64K could be accessed without changing a segment register.
The 8086 had four segment registers so it could use multiple segments at the same time: the Code Segment, Data Segment, Stack Segment, and Extra Segment.</p>
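The segment arithmetic above amounts to a shift and an add (the function name is mine):

```python
def physical_address(segment, offset):
    """Compute a 20-bit 8086 physical address: the 16-bit segment register
    is shifted left 4 bits and added to the 16-bit offset."""
    return ((segment << 4) + offset) & 0xFFFFF   # the address wraps at 1 megabyte

assert physical_address(0x1234, 0x5678) == 0x179B8
assert physical_address(0xFFFF, 0x0010) == 0x00000   # wraparound at the top of memory
```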
<p>The 8086 chip is split into
two processing units: the Bus Interface Unit (BIU) that handles segments and memory accesses, and the Execution Unit (EU) that executes instructions.
The Execution Unit is what comes to mind when you think of a processor: it has most of the registers, the arithmetic/logic unit (ALU), and the microcode that implements instructions.
The Bus Interface Unit interacts with memory and other external systems, performing the steps necessary to read and write memory.</p>
<p>Among other things, the Bus Interface Unit has a separate adder for address calculation; this adds the segment register to the base address to
determine the final memory address.
Every memory access uses the address adder at least once to add the segment base and offset.
The address adder is also used to increment the program counter.
Finally, the address adder increments and decrements the index registers used for block operations.
This will be discussed in more detail below.</p>
<h2>Microcode in the 8086</h2>
<p>Most people think of machine instructions as the basic steps that a computer performs.
However, many processors (including the 8086) have another layer of software underneath: microcode.
With microcode, instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code.
To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.
This provides a considerable performance improvement for the block operations, which require many steps in a loop.
Performing this loop in microcode is considerably faster than writing the loop in assembly code.</p>
<!--
The 8086 uses a hybrid approach: although it uses microcode, much of the instruction functionality is implemented with gate logic.
This approach removed duplication from the microcode and kept the microcode small enough for 1978 technology.
In a sense, the microcode is parameterized.
For instance, the microcode can specify a generic Arithmetic/Logic Unit (ALU) operation and a generic register.
The gate logic examines the instruction to determine which specific operation to perform and the appropriate register.
-->
<p>A micro-instruction in the 8086 is encoded into 21 bits as shown below.
Every micro-instruction specifies a move operation from a source register to a destination register, each specified with 5 bits.
The meaning of the remaining bits depends on the type field and can be anything from an ALU operation to a memory read or write to
a change of microcode control flow.
Thus, an 8086 micro-instruction typically does two things in parallel: the move and the action.
For more about 8086 microcode, see my <a href="https://www.righto.com/2022/11/how-8086-processors-microcode-engine.html">microcode blog post</a>.</p>
<p><a href="https://static.righto.com/images/8086-str/microcode-format.jpg"><img alt="The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?" class="hilite" height="203" src="https://static.righto.com/images/8086-str/microcode-format-w700.jpg" title="The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?" width="700" /></a><div class="cite">The encoding of a micro-instruction into 21 bits. Based on <a href="https://digitalcommons.law.scu.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1031&context=chtlj">NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?</a></div></p>
<p>I'll explain the behavior of an ALU micro-operation since it is important for string operations.
The Arithmetic/Logic Unit (ALU) is the heart of the processor, performing addition, subtraction, and logical operations.
The ALU has three temporary input registers that are invisible to the programmer: tmpA, tmpB, and tmpC.
An ALU operation takes its first argument from any temporary register, while the second argument always comes from tmpB.
Performing an ALU operation requires two micro-instructions.
The first micro-instruction specifies the ALU operation and source register, configuring the ALU. For instance, <code>ADD tmpA</code> configures the ALU to add the tmpA register to the default tmpB register.
In the next micro-instruction (or a later one), the ALU result can be accessed through a special register called <code>Σ</code> (SIGMA) and moved to another register.</p>
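To make this two-step sequence concrete, here is a small Python sketch, my own model rather than Intel's circuitry: one call configures the operation and first-operand register (like <code>ADD tmpA</code>), and a later read of Σ yields the result.

```python
# Sketch of the 8086's two-step ALU micro-operation sequence.
# This is an illustrative model, not Intel's implementation.
class ALUSketch:
    def __init__(self):
        self.tmpA = self.tmpB = self.tmpC = 0  # hidden temporary registers
        self.op = None    # configured operation
        self.src = None   # configured first-operand register

    def configure(self, op, src):
        """Model a micro-instruction like 'ADD tmpA': select the
        operation and the first operand; nothing is computed yet."""
        self.op, self.src = op, src

    @property
    def sigma(self):
        """Model reading the Σ register in a later micro-instruction."""
        a = getattr(self, self.src)
        ops = {"ADD": a + self.tmpB,   # second operand is always tmpB
               "SUBT": a - self.tmpB,
               "DEC": a - 1}
        return ops[self.op] & 0xFFFF   # 16-bit result

alu = ALUSketch()
alu.tmpA, alu.tmpB = 5, 7
alu.configure("ADD", "tmpA")   # micro-instruction 1: ADD tmpA
result = alu.sigma             # micro-instruction 2: Σ → destination
```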
<p>I'll also explain the memory read and write micro-operations.
A memory operation uses two internal registers: <code>IND</code> (Indirect) holds the memory address, while <code>OPR</code> (Operand) holds the word that is read or written.
A typical memory micro-instruction for a read is <code>R DS,BL</code>.
This causes the Bus Interface Unit to compute the memory address by adding the Data Segment (<code>DS</code>) to the <code>IND</code> register
and then perform the read.
The Bus Interface Unit determines if the instruction is performing a byte operation or a word operation and reads a byte or
word as appropriate, going through the necessary bus cycles.
The <code>BL</code> option<span id="fnref:bl"><a class="ref" href="#fn:bl">3</a></span> causes the Bus Interface Unit to update the <code>IND</code> register as appropriate,<span id="fnref2:bl"><a class="ref" href="#fn:bl">3</a></span>
incrementing or decrementing it by 1 or 2 depending on the Direction Flag and the
size of the access (byte or word).
All of this complexity happens in the hardware of the Bus Interface Unit and is invisible to the microcode.
The tradeoff is that this simplifies the microcode but makes the chip's hardware considerably more complicated.</p>
<h2>The string move instruction</h2>
<p>The 8086 has five types of string instructions, operating on blocks of memory:
<code>MOVS</code> (Move String), <code>CMPS</code> (Compare Strings), <code>SCAS</code> (Scan String), <code>LODS</code> (Load String), and <code>STOS</code> (Store String).
Each instruction operates on a byte or word, but by using a <code>REP</code> prefix, the operation can be repeated for up to 64k bytes,
controlled by a counter.
Conditional repetition prefixes can also terminate the loop early, based on the result of a comparison.
The string instructions provide a flexible way to operate on blocks of memory, much faster than a loop written in assembly code.</p>
<p>The <code>MOVS</code> (Move String) operation copies one memory region to another.
The <code>CMPS</code> (Compare Strings) operation compares two memory blocks and sets the status flags, indicating whether
one string is greater than, less than, or equal to the other.
The <code>SCAS</code> (Scan String) operation scans memory, looking for a particular value.
The <code>LODS</code> (Load String) operation moves an element into the accumulator, generally as part of a more complex loop.
Finally, <code>STOS</code> (Store String) stores the accumulator value, either to initialize a block of memory or as part of a more complex loop.<span id="fnref:morse"><a class="ref" href="#fn:morse">4</a></span></p>
<p>Like many 8086 instructions, each string instruction has two opcodes: one that operates on bytes and one that operates on words.
One of the interesting features of the 8086 is that the same microcode implements the byte and word instructions, while the
hardware takes care of the byte- or word-sized operations as needed.
Another interesting feature of the string operations is that they can go forward through memory, incrementing the pointers,
or they can go backward, decrementing the pointers. A special processor flag, the Direction Flag, indicates the direction: 0 for incrementing
and 1 for decrementing.
Thus, there are four possibilities for stepping through memory, part of the flexibility of the string operations.</p>
<p>The flowchart below shows the complexity of these instructions. I won't walk through the flowchart here;
suffice it to say that a lot is going on, and all of this functionality is implemented by the microcode.</p>
<p><a href="https://static.righto.com/images/8086-str/flowchart.png"><img alt="This flowchart shows the operation of a string instruction. From The 8086 Family Users Manual, fig 2-33." class="hilite" height="803" src="https://static.righto.com/images/8086-str/flowchart-w400.png" title="This flowchart shows the operation of a string instruction. From The 8086 Family Users Manual, fig 2-33." width="400" /></a><div class="cite">This flowchart shows the operation of a string instruction. From <a href="http://bitsavers.org/components/intel/8086/9800722-03_The_8086_Family_Users_Manual_Oct79.pdf">The 8086 Family Users Manual</a>, fig 2-33.</div></p>
<p>I'll start by explaining the <code>MOVS</code> (Move String) instruction, which moves (copies) a block of memory.
Before executing this instruction, the registers should be configured so
the <code>SI</code> Source Index register points to the first block, the <code>DI</code> Destination Index register points to the second block,
and the <code>CX</code> Count register holds the number of bytes or words to move.
The basic action of the <code>MOVS</code> instruction reads a byte (or word) from the <code>SI</code> address and updates <code>SI</code>, writes the value
to the <code>DI</code> address and updates <code>DI</code>, and decrements the <code>CX</code> counter.</p>
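The register-level behavior can be sketched in Python. This is a simplified model of the programmer-visible semantics, not the microcode: memory is a flat byte list and segment registers are ignored.

```python
def rep_movs(mem, si, di, cx, direction_flag=0, word=False):
    """Sketch of REP MOVS semantics: copy CX elements from SI to DI,
    stepping both pointers up (DF=0) or down (DF=1) by the element
    size. 'mem' is a flat byte list; segments are omitted."""
    size = 2 if word else 1
    step = size * (-1 if direction_flag else 1)
    while cx:
        mem[di:di + size] = mem[si:si + size]  # read from SI, write to DI
        si += step                             # update SI
        di += step                             # update DI
        cx -= 1                                # decrement the count
    return si, di, cx

mem = list(b"hello...")
si, di, cx = rep_movs(mem, 0, 5, 3)   # copy 3 bytes from offset 0 to 5
```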
<p>The microcode block below is executed for the <code>MOVS</code> (and <code>LODS</code>) instructions.
There's a lot happening in this microcode with a variety of control paths, so it's a bit tricky to
understand, but let's see how it goes.
Each micro-instruction has a register-to-register move on the left and an action on the right, happening in parallel.
The first micro-instruction handles the <code>REP</code> prefix, if any; let's assume for now that there's no prefix so it is skipped.
Next is the read from memory, which requires the memory address to be in the <code>IND</code> register.
Thus, the micro-instruction moves <code>SI</code> to <code>IND</code> and starts the read cycle (<code>R DS,BL</code>).
When the read completes, the updated <code>IND</code> register is moved back to <code>SI</code>, updating that register.
Meanwhile, <code>X0</code> tests the opcode and jumps to "8" for <code>LODS</code>.
The <code>MOVS</code> path falls through, getting the address from the <code>DI</code> register and writing to memory the value that we just read.
The updated <code>IND</code> register is moved to <code>DI</code> while another conditional jump goes to "7" if there's no <code>REP</code> prefix.
Micro-instruction 7 performs an <code>RNI</code> (Run Next Instruction), which ends the microcode and causes the next
machine instruction to be decoded.
As you can see, microcode is very low-level.</p>
<pre class="microcode">
move action
CALL F1 RPTS <span><b>MOVS/LODS:</b> handle REP if active</span>
SI → IND R DS,BL <span><b>1:</b> Read byte/word from SI</span>
IND → SI JMPS X0 8 <span>test instruction bit 3: jump if LODS</span>
DI → IND W DA,BL <span>MOVS path: write to DI</span>
IND → DI JMPS NF1 7 <span><b>4:</b> run next instruction if not REP</span>
Σ → tmpC JMP INT RPTI <span><b>5:</b> handle any interrupt</span>
tmpC → CX JMPS NZ 1 <span>update CX, loop if not zero</span>
RNI <span><b>7:</b> run next instruction</span>
OPR → M JMPS F1 5 <span><b>8:</b> LODS path: store AL/AX, jump back if REP</span>
RNI <span>run next instruction</span>
</pre>
<p>Now let's look at the case with a <code>REP</code> prefix, causing the instruction to loop.
The first step is to test if the count register <code>CX</code> is zero, and bail out of the loop if so.
In more detail, the <code>REP</code> prefix sets an internal flag called <code>F1</code>. The first micro-instruction for <code>MOVS</code> above conditionally calls the <code>RPTS</code> subroutine if <code>F1</code> is set.
The <code>RPTS</code> subroutine below is a bit tricky.
First, it moves the count in <code>CX</code> to the ALU's temporary C register. It also configures the ALU to pass tmpC through unchanged.
The next move discards the ALU result Σ, but as a side effect, sets a flag if the value is zero.
This micro-instruction also configures the ALU to perform <code>DEC tmpC</code>, but the decrement doesn't happen yet.
Next, if the value is nonzero (<code>NZ</code>), the microcode execution jumps to 10 and returns from the microcode subroutine,
continuing execution of the <code>MOVS</code> code described above.
On the other hand, if <code>CX</code> is zero, execution falls through to <code>RNI</code> (Run Next Instruction), which terminates execution of
the <code>MOVS</code> instruction.</p>
<pre class="microcode">
CX → tmpC PASS tmpC <span><b>RPTS:</b> test CX</span>
Σ → no dest DEC tmpC <span>Set up decrement for later</span>
JMPS NZ 10 <span>Jump to 10 if CX not zero</span>
RNI <span>If 0, run next instruction</span>
RTN <span><b>10:</b> return</span>
</pre>
<p>If execution returns to the <code>MOVS</code> microcode, it will execute as described earlier until the <code>NF1</code> test below.
With a <code>REP</code> prefix, the test fails and microcode execution falls through.
The next micro-instruction performs <code>Σ → tmpC</code>, which puts the ALU result into tmpC.
The ALU was configured back in the <code>RPTS</code> subroutine to decrement tmpC, which holds the count from <code>CX</code>, so the
result is that <code>CX</code> is decremented, put into tmpC, and then put back into <code>CX</code> in the next micro-instruction.
It seems like a roundabout way to decrement the counter, but that's microcode.
Finally, if the value is nonzero (<code>NZ</code>), microcode execution jumps back to 1 (near the top of the <code>MOVS</code> code earlier), repeating the whole process.
Otherwise, <code>RNI</code> ends processing of the instruction.
Thus, the <code>MOVS</code> instruction repeats until <code>CX</code> is zero.
In the next section, I'll explain how <code>JMP INT RPTI</code> handles an interrupt.</p>
<pre class="microcode">
IND → DI JMPS NF1 7 <span><b>4:</b> run next instruction if not REP</span>
Σ → tmpC JMP INT RPTI <span><b>5:</b> handle any interrupt</span>
tmpC → CX JMPS NZ 1 <span>update CX, loop if not zero</span>
RNI <span><b>7:</b> run next instruction</span>
</pre>
<p>The <code>NZ</code> (not zero) condition tests a special 16-bit zero flag, not the standard zero status flag.
This allows the microcode to test the count without disturbing the programmer-visible zero flag.</p>
<h3>Interrupts</h3>
<p>Interrupts pose a problem for the string operations.
The idea behind interrupts is that the computer can be interrupted during processing to handle a high-priority task,
such as an I/O device that needs servicing. The processor stops its current task, executes the interrupt handling code,
and then returns to the original task.
The 8086 processor normally completes the instruction that it is executing before handling the
interrupt, so it can continue from a well-defined state.
However, a string operation can perform up to 64k moves, which could take a large fraction of a second.<span id="fnref:memory"><a class="ref" href="#fn:memory">5</a></span>
If the 8086 waited for the string operation to complete, interrupt handling would be way too slow and could lose network packets or disk data, for instance.</p>
<p>The solution is that a string instruction can be interrupted in the middle of the instruction, unlike most instructions.
The string instructions are designed to use registers in a way that allows the instruction to be restarted.
The idea is that the <code>CX</code> register holds the current count, while the <code>SI</code> and <code>DI</code> registers hold the current memory
pointers, and these registers are updated as the instruction progresses. If the instruction is interrupted it can simply
continue where it left off.
After the interrupt, the 8086 restarts the string operation by backing the program counter up by two bytes
(one byte for the <code>REP</code> prefix and one byte for the string opcode).
This causes the interrupted string operation to be re-executed, continuing where it left off.</p>
<p>If there is an interrupt, the <code>RPTI</code> microcode routine below is called to update the program counter.
Updating the program counter is harder than you'd expect because the 8086 prefetches instructions.
The idea is that while the memory bus is idle, instructions are read from memory into a prefetch queue.
Then, when an instruction is needed, the processor can (hopefully) get the instruction immediately from the prefetch
queue instead of waiting for a memory access.
As a result, the program counter in the 8086 points to the memory address of the next instruction to <em>fetch</em>, not the
next instruction to <em>execute</em>.
To get the "real" program counter value, prefetching is first suspended (<code>SUSP</code>). Then the <code>PC</code> value is corrected (<code>CORR</code>) by subtracting the
length of the prefetch queue. At this point, the <code>PC</code> points to the next instruction to execute.</p>
<pre class="microcode">
tmpC → CX SUSP <span><b>RPTI:</b> store CX</span>
CORR <span>correct PC</span>
PC → tmpB DEC2 tmpB
Σ → PC FLUSH RNI <span>PC -= 2, end instruction</span>
</pre>
<p>At last, the microcode gets to the purpose of this subroutine: the <code>PC</code> is decremented by 2 (<code>DEC2</code>) using the ALU.
The prefetch queue is flushed and restarted and the <code>RNI</code> micro-operation terminates the microcode and runs the next instruction.
Normally this would execute the instruction from the new program counter value (which now points to the string operation).
However, since there is an interrupt pending, the interrupt will take place instead, and the interrupt handler will
execute.
After the interrupt handler finishes, the interrupted string operation will be re-executed, continuing where it left off.</p>
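The fix-up arithmetic can be sketched as a small Python function. This is my summary of the two steps above; the assumption is simply that the prefetch queue length is known.

```python
def rpti_new_pc(fetch_pc, prefetch_len):
    """Sketch of the RPTI program-counter fix-up: CORR subtracts the
    prefetch queue length to get the address of the next instruction
    to *execute*, then DEC2 backs up over the REP prefix and the
    string opcode so the instruction restarts."""
    corrected = (fetch_pc - prefetch_len) & 0xFFFF  # CORR step
    return (corrected - 2) & 0xFFFF                 # DEC2 tmpB, Σ → PC

# e.g. fetch PC 0x0105 with 3 prefetched bytes: the next instruction to
# execute is at 0x0102, so restart the string instruction at 0x0100.
new_pc = rpti_new_pc(0x0105, 3)
```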
<p>There's another complication, of course.
An 8086 instruction can have multiple prefixes attached,
for example using a segment register prefix to access a different segment.
The approach of backing up two bytes will only execute the
last prefix, ignoring any others, so if you have two prefixes, the instruction doesn't get restarted correctly.
The 8086 documentation describes this unfortunate behavior.
Apparently a comprehensive solution (e.g. counting the prefixes or providing a buffer to hold prefixes during an interrupt)
was impractical for the 8086. I think this was fixed in the 80286.</p>
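A toy model makes the failure mode concrete. The instruction stream below is hypothetical, treating each prefix and opcode as one byte: backing up two bytes resumes at the <code>REP</code> prefix and silently drops the segment override.

```python
# Hypothetical one-byte-per-element instruction stream: a segment
# override prefix, a REP prefix, then the MOVS opcode.
stream = ["ES:", "REP", "MOVSB"]

opcode_index = 2                      # interrupted while running MOVSB
next_index = opcode_index + 1         # PC points past the opcode
resume_index = next_index - 2         # back up two bytes to restart
resumed = stream[resume_index:]       # the ES: prefix is lost
```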
<!-- p3-31 of the 8086/88/186/188 manual states this problem. http://www.bitsavers.org/components/intel/80186/210911-001_iAPX86_88_186_188_Programmers_Reference_1983.pdf But the 286 manual doesn't mention it. -->
<h3>The remaining string instructions</h3>
<p>I'll discuss the microcode for the other string operations briefly.
The <code>LODS</code> instruction loads from memory into the accumulator. It uses the same microcode routine as <code>MOVS</code>; the code
below is the same code discussed earlier.
However, the path through the microcode is different for <code>LODS</code> since the <code>JMPS X0 8</code> conditional jump will be taken.
(This tests bit 3 of the opcode, which is set for <code>LODS</code>.)
At step 8, a value has been read from memory and is in the <code>OPR</code> (Operand) register.
This micro-instruction moves the value from <code>OPR</code> to the accumulator (represented by <code>M</code> for complicated reasons<span id="fnref:m"><a class="ref" href="#fn:m">6</a></span>).
If there is a repeat prefix, the microcode jumps back to the previous flow (5). Otherwise, <code>RNI</code> runs the next instruction.
Thus, <code>LODS</code> shares almost all its microcode with <code>MOVS</code>, making the microcode more compact at the cost of slowing it
down slightly due to the conditional jumps.</p>
<pre class="microcode">
move action
CALL F1 RPTS <span><b>MOVS/LODS:</b> handle REP if active</span>
SI → IND R DS,BL <span><b>1:</b> Read byte/word from SI</span>
IND → SI JMPS X0 8 <span>test instruction bit 3: jump if LODS</span>
DI → IND W DA,BL <span>MOVS path: write to DI</span>
IND → DI JMPS NF1 7 <span><b>4:</b> run next instruction if not REP</span>
Σ → tmpC JMP INT RPTI <span><b>5:</b> handle any interrupt</span>
tmpC → CX JMPS NZ 1 <span>update CX, loop if not zero</span>
RNI <span><b>7:</b> run next instruction</span>
OPR → M JMPS F1 5 <span><b>8:</b> LODS path: store AL/AX, jump back if REP</span>
RNI <span>run next instruction</span>
</pre>
<p>The <code>STOS</code> instruction is the opposite of <code>LODS</code>, storing the accumulator value into memory.
The microcode (below) is essentially the second half of the <code>MOVS</code> microcode.
The memory address in <code>DI</code> is moved to the <code>IND</code> register and the value in the accumulator is moved to the <code>OPR</code> register
to set up the write operation. (As with <code>LODS</code>, the <code>M</code> register indicates the accumulator.<span id="fnref2:m"><a class="ref" href="#fn:m">6</a></span>)
The <code>CX</code> register is decremented using the ALU.</p>
<pre class="microcode">
DI → IND CALL F1 RPTS <span><b>STOS:</b> if REP prefix, test if done</span>
M → OPR W DA,BL <span><b>1:</b> write the value to memory</span>
IND → DI JMPS NF1 5 <span>Quit if not F1 (repeat)</span>
Σ → tmpC JMP INT RPTI <span>Jump to RPTI if interrupt</span>
tmpC → CX JMPS NZ 1 <span>Loop back if CX not zero</span>
RNI <span><b>5:</b> run next instruction</span>
</pre>
<p>The <code>CMPS</code> instruction compares strings, while the <code>SCAS</code> instruction looks for a zero or non-zero value, depending on the prefix.
They share the microcode routine below, with the <code>X0</code> condition testing bit 3 of the instruction to select the path.
The difference is that <code>CMPS</code> reads the comparison character
from <code>SI</code>, while <code>SCAS</code> compares against the character in the accumulator.
The comparison itself is done by subtracting the two values and discarding the result. The <code>F</code> bit in the micro-instruction causes the processor's status flags to
be updated with the result of the subtraction, indicating less than, equal, or greater than.</p>
<pre class="microcode">
CALL F1 RPTS <span><b>CMPS/SCAS:</b> if RPT, quit if done</span>
M → tmpA JMPS X0 5 <span><b>1:</b>accum to tmpA, jump if SCAS</span>
SI → IND R DS,BL <span>CMPS path, read from SI to tmpA</span>
IND → SI <span>update SI</span>
OPR → tmpA <span>fallthrough</span>
DI → IND R DA,BL <span><b>5:</b> both: read from DI to tmpB</span>
OPR → tmpB SUBT tmpA <span>subtract to compare</span>
Σ → no dest DEC tmpC F <span>update flags, set up DEC</span>
IND → DI JMPS NF1 12 <span>return if not RPT</span>
Σ → CX JMPS F1ZZ 12 <span>update CX, exit if condition</span>
Σ → tmpC JMP INT RPTI <span>if interrupt, jump to RPTI</span>
JMPS NZ 1 <span>loop if CX ≠ 0</span>
RNI <span><b>12:</b> run next instruction</span>
</pre>
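The compare step can be sketched in Python, modeling only the zero and carry flags that the subtraction produces (the real ALU also sets sign, overflow, and other flags).

```python
def cmps_step(a, b, word=False):
    """Sketch of one CMPS/SCAS comparison: subtract the two values,
    discard the numeric result, and keep only status flags (here just
    ZF and CF)."""
    mask = 0xFFFF if word else 0xFF
    diff = (a - b) & mask
    zf = int(diff == 0)   # zero flag: operands equal
    cf = int(a < b)       # carry (borrow): first operand below second
    return zf, cf

zf, cf = cmps_step(ord("a"), ord("b"))   # 'a' < 'b': unequal, borrow set
```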
<p>One tricky part about the scan and compare instructions is that you can either repeat while the values are equal or while they are unequal,
with the <code>REPE</code> or <code>REPNE</code> prefixes respectively. Rather than implementing this two-part condition in microcode, the <code>F1ZZ</code> condition above
tests the right condition depending on the prefix.</p>
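As a sketch, assuming <code>F1ZZ</code> is simply the zero flag XORed with the <code>F1Z</code> latch and that a true result exits the loop, one test handles both prefixes:

```python
def f1zz_exit(zero_flag, f1z):
    """Sketch of the F1ZZ condition: exit the repeat loop when
    ZF XOR F1Z is 1. F1Z is loaded from opcode bit 0: 0 for REPNE,
    1 for REPE."""
    return (zero_flag ^ f1z) == 1

# REPE (f1z=1) keeps looping while elements are equal (ZF=1):
assert not f1zz_exit(1, 1) and f1zz_exit(0, 1)
# REPNE (f1z=0) keeps looping while elements differ (ZF=0):
assert not f1zz_exit(0, 0) and f1zz_exit(1, 0)
```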
<h2>Hardware support</h2>
<p>Although the 8086 uses microcode to implement instructions, it also uses a considerable amount of hardware to
simplify the microcode.
This hybrid approach was necessary in order to fit the microcode into the small ROM capacity available in 1978.<span id="fnref:microcode"><a class="ref" href="#fn:microcode">7</a></span>
This section discusses some of the hardware circuitry in the 8086 that supports the string operations.</p>
<h3>Implementing the <code>REP</code> prefixes</h3>
<p>Instruction prefixes, including <code>REPNZ</code> and <code>REPZ</code>, are executed in hardware rather than microcode.
The first step of instruction decoding, before microcode starts, is the Group Decode ROM.
This ROM categorizes instructions into various groups.
For instructions that are categorized as prefixes, the signal from the Group Decode ROM
delays any interrupts (because you don't want an interrupt between the prefix and the instruction)
and starts the next instruction without executing microcode.
The Group Decode ROM also outputs a <code>REP</code> signal specifically for these two prefixes.
This signal causes the <code>F1</code> latch to be loaded with 1, indicating a <code>REP</code> prefix.
(This latch is also used during multiplication to track the sign.)
This signal also causes the <code>F1Z</code> latch to be loaded with bit 0 of the instruction, which is 0 for <code>REPNZ</code> and 1 for <code>REPZ</code>.
The microcode uses these latches to determine the appropriate behavior of the string instruction.</p>
<h3>Updating <code>SI</code> and <code>DI</code>: the Constant ROM</h3>
<p>The <code>SI</code> and <code>DI</code> index registers are updated during each step to point to the next element.
This update is more complicated than you might expect, though, since
the registers are incremented or decremented based on the Direction Flag.
Moreover, the step size, 1 or 2, varies for a byte or word operation.
Another complication is unaligned word accesses, using an odd memory address to access a word.
The 8086's bus can only handle aligned words, so an unaligned word access is split into two byte accesses, incrementing
the address after the first access.
If the operation is proceeding downward, the address then needs to be decremented by 3 (not 2) at the end to
cancel out this increment.
The point is that updating the index registers is not trivial but requires an adjustment anywhere between -3 and +2, depending
on the circumstances.</p>
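The adjustment logic can be sketched as a Python function. This is my reconstruction from the description above, not the actual Constant ROM contents, and the upward unaligned case is inferred by symmetry.

```python
def ind_final_adjust(word, direction_flag, odd_address=False):
    """Sketch of the pointer adjustment applied at the END of a string
    memory access. An unaligned word access has already incremented
    IND by 1 between its two byte accesses."""
    if not word:
        return -1 if direction_flag else 1   # byte: step by 1
    if odd_address:
        # Split word access: finish with +1 going up, or -3 going down
        # to cancel the mid-access increment and reach the prior word.
        return -3 if direction_flag else 1
    return -2 if direction_flag else 2       # aligned word: step by 2
```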
<p>The Bus Interface Unit performs these updates automatically, without requiring the microcode to implement the addition or subtraction.
The arithmetic is not performed by the regular ALU (Arithmetic/Logic Unit) but by the special adder dedicated to addressing
arithmetic.
The increment or decrement value is supplied by a special ROM called the Constant ROM, located next to the adder.
The Constant ROM (shown below) is implemented as a PLA (programmable logic array), a two-level structured arrangement of gates.
The first level (bottom) selects the desired constant, while the second level (middle) generates the bits of the constant: three bits plus a sign bit.
The constant ROM is also used for correcting the program counter value as described earlier.</p>
<p><a href="https://static.righto.com/images/8086-str/constant-die.jpg"><img alt="The Constant ROM, highlighted on the die. The correction constants are used to correct the PC." class="hilite" height="451" src="https://static.righto.com/images/8086-str/constant-die-w400.jpg" title="The Constant ROM, highlighted on the die. The correction constants are used to correct the PC." width="400" /></a><div class="cite">The Constant ROM, highlighted on the die. The correction constants are used to correct the PC.</div></p>
<h3>Condition testing</h3>
<p>The microcode supports conditional jumps based on 16 conditions. Several of these conditions are designed to support the
string operations.
To test if a <code>REP</code> prefix is active, microcode uses the <code>F1</code> test, which tests the <code>F1</code> latch.
The <code>REPZ</code> and <code>REPNZ</code> prefixes loop while the zero flag is 1 or 0 respectively.
This somewhat complicated test is supported in microcode by the <code>F1ZZ</code> condition, which evaluates the zero flag XOR the <code>F1Z</code> latch. Thus, it tests for a nonzero result with <code>REPZ</code> (<code>F1Z=1</code>) and a zero result with <code>REPNZ</code> (<code>F1Z=0</code>), ending the loop in either case.</p>
<p>Looping happens as long as the <code>CX</code> register is nonzero. This is tested in microcode with the <code>NZ</code> (Not Zero) condition.
A bit surprisingly, this test doesn't use the standard zero status flag, but a separate latch that tracks if an ALU result is zero.
(I call this the <code>Z16</code> flag since it tests the 16-bit value, unlike the regular zero flag which tests either a byte or word.)
The <code>Z16</code> flag is only used by the microcode and is invisible to the programmer.
The motivation behind this separate flag is so the string operations can leave the visible zero flag unchanged.<span id="fnref:z16"><a class="ref" href="#fn:z16">8</a></span></p>
<p>Another important conditional jump is <code>X0</code>, which tests bit 3 of the instruction.
This condition distinguishes between the <code>MOVS</code> and <code>LODS</code> instructions, which differ in bit 3, and similarly for
<code>CMPS</code> versus <code>SCAS</code>.
The test uses the <code>X</code> register which stores part of the instruction during decoding.
Note that the opcodes aren't arbitrarily assigned to instructions like <code>MOVS</code> and <code>LODS</code>. Instead, the opcodes
are carefully assigned so the instructions can share microcode but be distinguished by <code>X0</code>.
Finally, the string operation microcode also uses the <code>INT</code> condition, which tests if an interrupt is pending.</p>
<p>The conditions are evaluated by the condition PLA (Programmable Logic Array, a grid of gates), shown below.
The four condition bits from the micro-instruction, along with their complements, are fed into the columns.
The PLA has 16 rows, one for each condition.
Each row is a NOR gate matching one bit combination (i.e. selecting a condition) and the corresponding signal value to
test.
Thus, if a particular condition is specified and is satisfied, that row will be 1.
The 16 row outputs are combined by the 16-input NOR gate at the left.
Thus, if the specified condition is satisfied, this output will be 0, and if the condition is unsatisfied, the
output will be 1.
This signal controls the jump or call micro-instruction:
if the condition is satisfied, the new micro-address is loaded into the microcode address register.
If the condition is not satisfied, the microcode proceeds sequentially.
I discuss the 8086's conditional circuitry in more detail in <a href="https://www.righto.com/2023/01/reverse-engineering-conditional-jump.html">this post</a>.</p>
<p><a href="https://static.righto.com/images/8086-str/condition-pla.jpg"><img alt="The condition PLA evaluates microcode conditionals." class="hilite" height="467" src="https://static.righto.com/images/8086-str/condition-pla-w300.jpg" title="The condition PLA evaluates microcode conditionals." width="300" /></a><div class="cite">The condition PLA evaluates microcode conditionals.</div></p>
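A loose software analogy of the condition evaluation: the 4-bit condition field selects one of 16 signals, and the jump is taken when the selected signal is true. The condition names below are illustrative, not Intel's complete assignment.

```python
# Sketch of microcode condition selection (illustrative names only).
CONDITIONS = ["F1", "NZ", "X0", "INT", "F1ZZ", "NF1"] + \
             [f"cond{i}" for i in range(6, 16)]   # remaining slots

def condition_met(cond_bits, signals):
    """Select one of 16 condition signals by the 4-bit condition field;
    the conditional jump or call is taken when the signal is true."""
    return bool(signals.get(CONDITIONS[cond_bits], False))

taken = condition_met(1, {"NZ": True})   # NZ satisfied: jump taken
```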
<h2>Conclusions</h2>
<!--
These string instructions live on in the modern x86 architecture, mostly the same except they also support operation on double words or quad words (i.e. 32- or 64-bit operands).[^ins]
[^ins]:
The 80186 added block instructions to support block I/O: `INS` and `OUTS`.
-->
<p>Hopefully you have found this close examination of microcode interesting.
Microcode is implemented at an even lower level than assembly code, so it can be hard to understand.
Moreover, the microcode in the 8086 was carefully optimized to make it compact, so it is even more obscure.</p>
<p>One of the big computer architecture debates of the 1980s was "<a href="https://en.wikipedia.org/wiki/Reduced_instruction_set_computer">RISC vs CISC</a>", pitting Reduced Instruction Set Computers against Complex Instruction Set Computers.
Looking at the 8086 in detail has given me more appreciation for the issues in a CISC processor such as the 8086.
The 8086's string instructions are an example of the complex instructions in the 8086 that reduced the "semantic gap"
between assembly code and high-level languages and minimized code size.
While these instructions are powerful, their complexity spreads through the chip, requiring additional hardware features
described above. These instructions also caused considerable complications for interrupt handling, including prefix-handling
bugs that weren't fixed until later processors.</p>
<p>I've written multiple <a href="https://www.righto.com/search/label/8086">posts on the 8086</a> so far and
plan to continue reverse-engineering the 8086 die, so
follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I've also started experimenting with Mastodon recently as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:history">
<p>Block move instructions didn't originate with the 8086.
The IBM System/360 series of mainframes had an extensive set of block instructions providing moves, compare, logical operations
(AND, OR, Exclusive OR), character translation, formatting, and decimal arithmetic. These operations supported blocks
of up to 256 characters.</p>
<p>The Z80 processor (1976) had block instructions to move and compare blocks of data. The Z80 supported ascending and
descending movements, but used separate instructions instead of a direction flag like the 8086. <a class="footnote-backref" href="#fnref:history" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:string">
<p>The "string" operations process arbitrary memory bytes or words.
Despite the name, these instructions are not specific to zero-terminated strings or any other string format. <a class="footnote-backref" href="#fnref:string" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:bl">
<p>The <code>BL</code> value in the micro-instruction indicates that the <code>IND</code> register should be incremented or decremented by 1 or 2
as appropriate.
I'm not sure what <code>BL</code> stands for in the microcode. The patent says "BL symbolically represents a
two bit code which causes external logic to examine the
byte or word line and the direction flag in PSW register
to generate, according to random logic well known
to the art, the address factor required." So perhaps "Byte Logic"? <a class="footnote-backref" href="#fnref:bl" title="Jump back to footnote 3 in the text">↩</a><a class="footnote-backref" href="#fnref2:bl" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:morse">
<p>The designer of the 8086 instruction set, Steve Morse, discusses the motivation behind the string operations in his book <a href="https://amzn.to/3K2r92L">The 8086/8088 primer</a>.
These instructions were designed to be flexible and support a variety of use cases.
The <code>XLAT</code> (Translate) and <code>JCXZ</code> (Jump if <code>CX</code> Zero) instructions were designed to work well with the string instructions.</p>
<p>The implementation of string instructions is discussed in detail in the <a href="https://patents.google.com/patent/US4449184A">8086 patent</a>, section 13 onward. <a class="footnote-backref" href="#fnref:morse" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:memory">
<p>A string operation could perform 64k moves, each of which consists of a read and a write, yielding 128k memory operations.
I think that if the memory accesses are unaligned, i.e. a word access to an odd address, then each byte of the word
needs to be accessed separately. So I think you could get up to 256k memory accesses.
Each memory operation takes at least 4 clock cycles, more if the memory is slow and has wait states.
So one string instruction could take over a million clock cycles. <a class="footnote-backref" href="#fnref:memory" title="Jump back to footnote 5 in the text">↩</a></p>
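<p>The worst-case arithmetic in this estimate can be checked directly:</p>

```python
# Upper bound on clock cycles for one 8086 string instruction, per the estimate above.
moves = 0x10000              # CX = 0 wraps around to 65,536 iterations
accesses = moves * 2         # each move is a read plus a write: 128k operations
accesses *= 2                # unaligned words split into two byte accesses: 256k
total_cycles = accesses * 4  # at least 4 clock cycles per memory access
print(total_cycles)          # 1048576
```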
</li>
<li id="fn:m">
<p>You might wonder why the register <code>M</code> indicates the accumulator, and the explanation is a bit tricky.
The microcode uses 5-bit register specifications to indicate the source and destination for a data move.
Registers can be specified explicitly, such as <code>AX</code> or <code>BX</code>, or a byte register such as <code>AL</code> or an
internal register such as <code>IND</code> or <code>tmpA</code>.
However, the microcode can also specify a generic source or destination register with <code>M</code> or <code>N</code>.
The motivation is that the 8086 has a lot of operations that use an arbitrary source and destination register, for
instance <code>ADD AX, BX</code>. Rather than making the microcode figure out which registers to use for these instructions,
the hardware decodes the register fields from the instruction and substitutes the appropriate registers for <code>M</code> and <code>N</code>.
This makes the microcode much simpler.</p>
<p>But why does the <code>LODS</code> microcode use the <code>M</code> register instead of <code>AX</code> when this instruction only works with the accumulator?
The microcode takes advantage of another clever feature of the <code>M</code> and <code>N</code> registers. The hardware looks at the instruction to
determine if it is a byte or word instruction, and performs an 8-bit or 16-bit transfer accordingly.
If the <code>LODS</code> microcode were hardcoded for the accumulator, the microcode would need separate paths for <code>AX</code> and <code>AL</code>,
the full accumulator and the lower byte of the accumulator.</p>
<p>The final piece of the puzzle is how the hardware knows to use the accumulator for the string instructions when they
don't explicitly specify a register.
The first step of instruction decoding is the Group Decode ROM, which categorizes instructions into various groups.
One group is "instructions that use the accumulator". The string operations are categorized in this group, which causes
the hardware to use the accumulator when the <code>M</code> register is specified.
(Other instructions in this group include the immediate ALU operations, I/O operations, and accumulator moves.)</p>
<p>I discussed the 8086's register codes in more detail <a href="https://www.righto.com/2023/03/8086-register-codes.html">here</a>. <a class="footnote-backref" href="#fnref:m" title="Jump back to footnote 6 in the text">↩</a><a class="footnote-backref" href="#fnref2:m" title="Jump back to footnote 6 in the text">↩</a></p>
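<p>For reference, the substitution the hardware performs follows the standard 8086 register encoding, where the instruction's W bit selects a word or byte register for the same 3-bit field; a small illustrative table in Python:</p>

```python
# The 8086's 3-bit reg field names a word or byte register depending on the
# instruction's W bit; M and N let the microcode defer this decoding to hardware.
WORD_REGS = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]
BYTE_REGS = ["AL", "CL", "DL", "BL", "AH", "CH", "DH", "BH"]

def decode_reg(field, w):
    return WORD_REGS[field] if w else BYTE_REGS[field]

print(decode_reg(0b000, 1))  # AX
print(decode_reg(0b000, 0))  # AL
```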
</li>
<li id="fn:microcode">
<p>The 8086's microcode ROM was small: 512 words of 21 bits. In comparison, the VAX 11/780 minicomputer (1977)
had 5120 words of 96 bits, over 45 times as large. <a class="footnote-backref" href="#fnref:microcode" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:z16">
<p>The internal <code>Z16</code> zero flag is mostly used by the string operations. It is also used by the <code>LOOP</code> iteration-control instructions and the shift
instructions that take a shift count. <a class="footnote-backref" href="#fnref:z16" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriff http://www.blogger.com/profile/08097301407311055124 noreply@blogger.com 10 tag:blogger.com,1999:blog-6264947694886887540.post-8868607841305902549 2023-03-22T09:45:00.000-07:00 2023-03-22T09:45:54.962-07:00 Reverse-engineering the Globus INK, a Soviet spaceflight navigation computer<style>
.hilite {cursor:zoom-in}
</style>
<p>One of the most interesting navigation instruments onboard Soyuz spacecraft was the Globus INK,<span id="fnref:ink"><a class="ref" href="#fn:ink">1</a></span> which used a rotating globe
to indicate the spacecraft's position above the Earth.
This electromechanical analog computer used an elaborate system of gears, cams, and differentials
to compute the spacecraft's position.
The globe rotates in two dimensions: it spins end-over-end to indicate the spacecraft's orbit, while
the globe's hemispheres rotate to follow the Earth's daily rotation about its axis.<span id="fnref:day"><a class="ref" href="#fn:day">2</a></span>
The spacecraft's position above the Earth is marked by the fixed crosshairs on the plastic dome.
The Globus also has latitude and longitude dials next to the globe to show the position numerically, while the light/shadow dial below the globe indicates when the
spacecraft will enter or leave the Earth's shadow.</p>
<p><a href="https://static.righto.com/images/globus-gearing/globus.jpg"><img alt="The INK-2S &quot;Globus&quot; space navigation indicator." class="hilite" height="457" src="https://static.righto.com/images/globus-gearing/globus-w600.jpg" title="The INK-2S &quot;Globus&quot; space navigation indicator." width="600" /></a><div class="cite">The INK-2S "Globus" space navigation indicator.</div></p>
<p>Opening up the Globus reveals that it is packed with complicated gears and mechanisms.
It's amazing that this mechanical technology was used from the 1960s into the 21st century.
But what are all those gears doing? How can orbital functions be implemented with gears?
To answer these questions, I reverse-engineered the Globus and traced out its system of gears.</p>
<p><a href="https://static.righto.com/images/globus-gearing/opened-end.jpg"><img alt="The Globus with the case removed, showing the complex gearing inside." class="hilite" height="496" src="https://static.righto.com/images/globus-gearing/opened-end-w500.jpg" title="The Globus with the case removed, showing the complex gearing inside." width="500" /></a><div class="cite">The Globus with the case removed, showing the complex gearing inside.</div></p>
<p>The diagram below summarizes my analysis.
The Globus is an analog computer that represents values by rotating shafts by particular amounts.
These rotations control the globe
and the indicator dials.
The flow of these rotational signals is shown by the lines on the diagram.
The computation is based around addition, performed by ten differential gear assemblies.
On the diagram, each "⨁" symbol indicates one of these differential gear assemblies.
Other gears connect the components while scaling the signals through various gear ratios.
Complicated functions are implemented with three specially-shaped cams.
In the remainder of this blog post, I will break this diagram down into functional blocks and explain how the Globus operates.</p>
<p><a href="https://static.righto.com/images/globus-gearing/diagram.png"><img alt="This diagram shows the interconnections of the gear network in the Globus." class="hilite" height="370" src="https://static.righto.com/images/globus-gearing/diagram-w700.png" title="This diagram shows the interconnections of the gear network in the Globus." width="700" /></a><div class="cite">This diagram shows the interconnections of the gear network in the Globus.</div></p>
<p>For all its complexity, though, the functionality of the Globus is pretty limited. It only handles a fixed orbit at a specific angle, and treats the
orbit as circular.
The Globus does not have any navigation input such as an inertial measurement unit (IMU).
Instead, the cosmonauts configured the Globus by turning knobs to set the spacecraft's initial position and orbital period.
From there, the Globus simply projected the current position of
the spacecraft forward, essentially <a href="https://en.wikipedia.org/wiki/Dead_reckoning">dead reckoning</a>.</p>
<p><a href="https://static.righto.com/images/globus-gearing/gears.jpg"><img alt="A closeup of the gears inside the Globus." class="hilite" height="374" src="https://static.righto.com/images/globus-gearing/gears-w500.jpg" title="A closeup of the gears inside the Globus." width="500" /></a><div class="cite">A closeup of the gears inside the Globus.</div></p>
<h2>The globe</h2>
<p>On seeing the Globus, one might wonder how the globe is rotated.
It may seem that the globe must be free-floating so it can rotate in two axes.
Instead, a clever mechanism attaches the globe to the unit.
The key is that the globe's equator is a solid piece of metal that rotates around the horizontal axis of the unit.
A second gear mechanism inside the globe rotates the globe around the North-South axis.
The two rotations are controlled by concentric shafts that are fixed to the unit.
Thus, the globe has two rotational degrees of freedom, even though it is attached at both ends.</p>
<p>The photo below shows the frame that holds and controls the globe.
The dotted axis is fixed horizontally in the unit and rotations are fed through the two gears at the left.
One gear rotates the globe and frame around the dotted axis, while the gear train causes the globe to rotate around the
vertical polar axis (while the equator remains fixed).</p>
<p><a href="https://static.righto.com/images/globus-gearing/axis.jpg"><img alt="The axis of the globe is at 51.8° to support that orbital inclination." class="hilite" height="331" src="https://static.righto.com/images/globus-gearing/axis-w500.jpg" title="The axis of the globe is at 51.8° to support that orbital inclination." width="500" /></a><div class="cite">The axis of the globe is at 51.8° to support that orbital inclination.</div></p>
<p>The angle above is 51.8°, which is very important: this is the inclination of the standard Soyuz orbit.
As a result, simply rotating the globe around the dotted line causes the crosshair to trace the orbit.<span id="fnref:orbit"><a class="ref" href="#fn:orbit">3</a></span>
Rotating the two halves of the globe around the poles yields the different paths over the Earth's surface
as the Earth rotates.
An important consequence of this design is that the Globus only supports a circular orbit at a fixed angle.</p>
<h2>Differential gear mechanism</h2>
<p>The primary mathematical element of the Globus is the differential gear mechanism, which can perform addition or subtraction.
A differential gear takes two rotations as inputs and produces the (scaled) sum of the rotations as the output.
The photo below shows one of the differential mechanisms.
In the middle, the spider gear assembly (red box) consists of two bevel gears that can spin freely on a vertical shaft.
The spider gear assembly as a whole is attached to a horizontal shaft, called the spider shaft.
At the right, the spider shaft is attached to a spur gear (a gear with straight-cut teeth).
The spider gear assembly, the spider shaft, and the spider's spur gear rotate together as a unit.</p>
<p><a href="https://static.righto.com/images/globus-gearing/differential-diagram.jpg"><img alt="Diagram showing the components of a differential gear mechanism." class="hilite" height="384" src="https://static.righto.com/images/globus-gearing/differential-diagram-w500.jpg" title="Diagram showing the components of a differential gear mechanism." width="500" /></a><div class="cite">Diagram showing the components of a differential gear mechanism.</div></p>
<p>At the left and right are two end gear assemblies (yellow).
The end gear is a bevel gear with angled teeth to mesh with the spider gears.
Each end gear is locked to a spur gear and these gears spin freely on the horizontal spider shaft.
In total, there are three spur gears: two connected to the end gears and one connected to the spider assembly.
In the diagrams, I'll use the symbol below to represent the differential gear assembly: the end gears are symmetric on the top and bottom, with the
spider shaft on the side.
Any of the three spur gears can be used as an output, with the other two serving as inputs.</p>
<p><a href="https://static.righto.com/images/globus-gearing/differential-symbol.jpg"><img alt="The symbol for the differential gear assembly." class="hilite" height="117" src="https://static.righto.com/images/globus-gearing/differential-symbol-w120.jpg" title="The symbol for the differential gear assembly." width="120" /></a><div class="cite">The symbol for the differential gear assembly.</div></p>
<p>To understand the behavior of the differential, suppose the two end gears are driven in the same direction at the same rate, say upwards.<span id="fnref:directions"><a class="ref" href="#fn:directions">4</a></span>
These gears will push on the spider gears and rotate the spider gear assembly, with the entire differential rotating
as a fixed unit.
On the other hand, suppose the two end gears are driven in opposite directions.
In this case, the spider gears will spin on their shaft, but the spider gear assembly will remain stationary.
In either case, the spider gear assembly motion is the average of the two end gear rotations, that is, the sum of the two rotations divided by 2.
(I'll ignore the factor of 2 since I'm ignoring all the gear ratios.)
If the operation of the differential is still confusing, <a href="https://youtu.be/mQhmmTX5f9Y?t=68">this vintage Navy video</a> has a detailed explanation.</p>
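<p>The averaging behavior is simple enough to model directly; a toy sketch (ignoring gear ratios, as above):</p>

```python
# Toy model of a differential gear assembly: the spider shaft turns at the
# average of the two end-gear rotations, so it adds (scaled) rotations.
def differential(end_a, end_b):
    return (end_a + end_b) / 2

print(differential(10, 10))   # 10.0: both ends together, the unit rotates as one
print(differential(10, -10))  # 0.0: opposite ends, the spider assembly stays put
```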
<h2>The controls and displays</h2>
<p>The diagram below shows the controls and displays of the Globus.
The rotating globe is the centerpiece of the unit. Its plastic cover has a crosshair that represents the spacecraft's position above the Earth's surface.
Surrounding the globe itself are dials that show the longitude, latitude, and the time before entering light and shadow.
The cosmonauts manually initialize the globe position with the concentric globe rotation knobs: one rotates the globe along the orbital path
while the other rotates the hemispheres.
The mode switch at the top selects between the landing position mode, the standard Earth orbit mode, and turning off the unit.
The orbit time adjustment configures the orbital time period in minutes while
the orbit counter below it counts the number of orbits.
Finally, the landing point angle sets the distance to the landing point in degrees of orbit.</p>
<p><a href="https://static.righto.com/images/globus-gearing/globus-labeled.jpg"><img alt="The Globus with the controls labeled." class="hilite" height="466" src="https://static.righto.com/images/globus-gearing/globus-labeled-w700.jpg" title="The Globus with the controls labeled." width="700" /></a><div class="cite">The Globus with the controls labeled.</div></p>
<h2>Computing the orbit time</h2>
<p>The primary motion of the Globus is the end-over-end rotation of the globe showing the movement of the spacecraft in orbit.
The orbital motion is powered by a solenoid at the top of the Globus that receives pulses once a second and advances a ratchet wheel (<a href="https://twitter.com/kenshirriff/status/1628118949421363201">video</a>).<span id="fnref:torque"><a class="ref" href="#fn:torque">5</a></span>
This wheel is connected to a complicated cam and differential system to provide the orbital motion.</p>
<p><a href="https://static.righto.com/images/globus-gearing/orbit-solenoid.jpg"><img alt="The orbit solenoid (green) has a ratchet that rotates the gear to the right. The shaft connects it to differential gear assembly 1 at the bottom right." class="hilite" height="404" src="https://static.righto.com/images/globus-gearing/orbit-solenoid-w500.jpg" title="The orbit solenoid (green) has a ratchet that rotates the gear to the right. The shaft connects it to differential gear assembly 1 at the bottom right." width="500" /></a><div class="cite">The orbit solenoid (green) has a ratchet that rotates the gear to the right. The shaft connects it to differential gear assembly 1 at the bottom right.</div></p>
<p>Each orbit takes about 92 minutes, but the orbital time can be adjusted by a few minutes in steps of 0.01 minutes<span id="fnref:time"><a class="ref" href="#fn:time">6</a></span>
to account for changes in altitude. The Globus is surprisingly inflexible and this is the only orbital parameter that can be adjusted.<span id="fnref:parameters"><a class="ref" href="#fn:parameters">7</a></span>
The orbital period is adjusted by the three-position orbit time switch, which points to the minutes, tenths, or hundredths.
Turning the central knob adjusts the indicated period dial.</p>
<p>The problem is how to generate the variable orbital rotation speed from the fixed speed of the solenoid.
The solution is a special cam, shaped like a cone with a spiral cross-section.
Three followers ride on the cam, so as the cam rotates, the follower is pushed outward and rotates on its shaft.
If the follower is near the narrow part of the cam, it moves over a small distance and has a small rotation.
But if the follower is near the wide part of the cam, it moves a larger distance and has a larger rotation.
Thus, by moving the follower to a particular point on the cam, the rotational speed of the follower is selected.
One follower adjusts the speed based on the minutes setting, while the other two handle the tenths and hundredths of a minute.</p>
<p><a href="https://static.righto.com/images/globus-gearing/cone-diagram.jpg"><img alt="A diagram showing the orbital speed control mechanism. The cone has three followers, but only two are visible from this angle. The &quot;transmission&quot; gears are moved in and out by the outer knob to select which follower is adjusted by the inner knob." class="hilite" height="533" src="https://static.righto.com/images/globus-gearing/cone-diagram-w600.jpg" title="A diagram showing the orbital speed control mechanism. The cone has three followers, but only two are visible from this angle. The &quot;transmission&quot; gears are moved in and out by the outer knob to select which follower is adjusted by the inner knob." width="600" /></a><div class="cite">A diagram showing the orbital speed control mechanism. The cone has three followers, but only two are visible from this angle. The "transmission" gears are moved in and out by the outer knob to select which follower is adjusted by the inner knob.</div></p>
<p>Of course, the cam can't spiral out forever.
Instead, at the end of one revolution, its cross-section drops back sharply to the starting diameter.
This causes the follower to snap back to its original position.
To prevent this from jerking the globe backward, the follower is connected to the differential gearing via a slip clutch and ratchet.
Thus, when the follower snaps back, the ratchet holds the drive shaft stationary.
The drive shaft then continues its rotation as the follower starts cycling out again.
Each shaft output is accordingly a (mostly) smooth rotation at a speed that depends on the position of the follower.</p>
<p><a href="https://static.righto.com/images/globus-gearing/diagram-orbit.png"><img alt="A cam-based system adjusts the orbital speed using three differential gear assemblies." class="hilite" height="353" src="https://static.righto.com/images/globus-gearing/diagram-orbit-w500.png" title="A cam-based system adjusts the orbital speed using three differential gear assemblies." width="500" /></a><div class="cite">A cam-based system adjusts the orbital speed using three differential gear assemblies.</div></p>
<p>The three adjustment signals are scaled by gear ratios to provide the appropriate contribution to the rotation.
As shown above, the adjustments are added to the solenoid output by three differentials to generate the orbit rotation signal, output from differential 3.<span id="fnref:loop"><a class="ref" href="#fn:loop">8</a></span>
This signal also drives the odometer-like orbit counter on the front of the Globus.
The diagram below shows how the components are arranged, as viewed from the back.</p>
<p><a href="https://static.righto.com/images/globus-gearing/diagram-back2.jpg"><img alt="A back view of the Globus showing the orbit components." class="hilite" height="584" src="https://static.righto.com/images/globus-gearing/diagram-back2-w600.jpg" title="A back view of the Globus showing the orbit components." width="600" /></a><div class="cite">A back view of the Globus showing the orbit components.</div></p>
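<p>In effect, the three followers act like digit dials on the period setting. As a rough numerical sketch (my model, not from the unit's documentation): with a 92.00-minute period and one solenoid pulse per second, each pulse must advance the orbit by a small fixed angle:</p>

```python
# Rough model: degrees of orbit per 1-second solenoid pulse, for a period set
# as minutes + tenths + hundredths (the three cam followers' contributions).
def orbit_rate(minutes, tenths, hundredths):
    period_min = minutes + 0.1 * tenths + 0.01 * hundredths
    return 360.0 / (period_min * 60.0)   # degrees of orbit per second

print(round(orbit_rate(92, 0, 0), 4))   # 0.0652 degrees per pulse
```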
<h2>Displaying the orbit rotation</h2>
<p>Since the Globus doesn't have any external position input such as inertial guidance, it must be initialized by the cosmonauts.
A knob on the front of the Globus provides manual adjustment of the orbital position.
Differential 4 adds the knob signal to the orbit output discussed above.</p>
<p><a href="https://static.righto.com/images/globus-gearing/diagram-orbit2.png"><img alt="The orbit controls drive the globe's motion." class="hilite" height="334" src="https://static.righto.com/images/globus-gearing/diagram-orbit2-w700.png" title="The orbit controls drive the globe's motion." width="700" /></a><div class="cite">The orbit controls drive the globe's motion.</div></p>
<p>The Globus has a "landing point" mode where the globe is rapidly rotated through a fraction of an orbit to indicate where the spacecraft would land
if the retro-rockets were fired.
Turning the mode switch caused the globe to rotate until the landing position was under the crosshairs
and the cosmonauts could evaluate the suitability of this landing site.
This mode is implemented with a landing position motor that provides the rapid rotation. This motor also rotates the globe back to the orbital position.
The motor is driven through an electronics board with relays and a transistor, controlled by limit switches.
I discussed the electronics in a <a href="https://www.righto.com/2023/03/reverse-engineering-electronics-in.html">previous post</a> so I won't go into more
details here.
The landing position motor feeds into the orbit signal through differential 5, producing the final orbit signal.</p>
<p><a href="https://static.righto.com/images/globus-gearing/landing-motor.jpg"><img alt="The landing position motor and its associated gearing. The motor speed is geared down and then fed through a worm gear (upper center)." class="hilite" height="382" src="https://static.righto.com/images/globus-gearing/landing-motor-w400.jpg" title="The landing position motor and its associated gearing. The motor speed is geared down and then fed through a worm gear (upper center)." width="400" /></a><div class="cite">The landing position motor and its associated gearing. The motor speed is geared down and then fed through a worm gear (upper center).</div></p>
<p>The orbit signal from differential 5 is used in several ways.
Most importantly, the orbit signal provides the end-over-end rotation of the globe to indicate the spacecraft's travel in orbit.
As discussed earlier, this is accomplished by rotating the globe's metal frame around the horizontal axis.
The orbital signal also rotates a potentiometer to provide an electrical indication of the orbital position to other spacecraft systems.</p>
<h2>The light/shadow indicator</h2>
<p>Docking a spacecraft is a tricky endeavor, best performed in daylight, so it is useful to know how much time remains until the spacecraft
enters the Earth's shadow. The light/shadow dial under the globe provides this information.
This display consists of two nested wheels. The outer wheel is white and has two quarters removed.
Through these gaps, the partially-black inner wheel is exposed, which can be adjusted to show 0% to 50% dark.
This display is rotated by the orbital signal, turning half a revolution per orbit.
As the spacecraft orbits, this dial shows the light/shadow transitions and the time until the next transition.<span id="fnref:scale"><a class="ref" href="#fn:scale">9</a></span></p>
<p><a href="https://static.righto.com/images/globus-gearing/light-shadow-dials.jpg"><img alt="The light/shadow indicator, viewed from the underside of the Globus. The shadow indicator has been set to 35% shadow. Near the hub, a pin restricts motion of the inner wheel relative to the outer wheel." class="hilite" height="348" src="https://static.righto.com/images/globus-gearing/light-shadow-dials-w500.jpg" title="The light/shadow indicator, viewed from the underside of the Globus. The shadow indicator has been set to 35% shadow. Near the hub, a pin restricts motion of the inner wheel relative to the outer wheel." width="500" /></a><div class="cite">The light/shadow indicator, viewed from the underside of the Globus. The shadow indicator has been set to 35% shadow. Near the hub, a pin restricts motion of the inner wheel relative to the outer wheel.</div></p>
<p>You might expect the orbit to be in the dark 50% of the time, but because the spacecraft is about 200 km above the Earth's surface,
it will sometimes be illuminated when the surface of the Earth underneath is dark.<span id="fnref:iss"><a class="ref" href="#fn:iss">10</a></span> In the ground track below, the dotted
part of the track is where the spacecraft is in the Earth's shadow; this is considerably less than 50%.
Also note that the end of the orbit doesn't match up with the beginning, due to the Earth's rotation during the orbit.</p>
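<p>A simple geometric model shows why the dark fraction is well under 50%: treat the Earth's shadow as a cylinder and the orbit as a circle passing through it (my simplification, not from the Globus documentation):</p>

```python
import math

# Fraction of a circular orbit inside the Earth's cylindrical shadow, for an
# orbit lying in the sun-Earth plane (a deliberately simplified model).
def shadow_fraction(altitude_km, earth_radius_km=6371.0):
    r = earth_radius_km + altitude_km
    half_angle = math.degrees(math.asin(earth_radius_km / r))
    return half_angle / 180.0

print(round(shadow_fraction(200) * 100))   # 42 percent of the orbit in shadow
```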
<p><a href="https://static.righto.com/images/globus-gearing/orbitdisplay4.png"><img alt="Ground track of an Apollo-Soyuz Test Project orbit, corresponding to this Globus. Image courtesy of heavens-above.com." class="hilite" height="300" src="https://static.righto.com/images/globus-gearing/orbitdisplay4-w600.png" title="Ground track of an Apollo-Soyuz Test Project orbit, corresponding to this Globus. Image courtesy of heavens-above.com." width="600" /></a><div class="cite">Ground track of an Apollo-Soyuz Test Project orbit, corresponding to this Globus. Image courtesy of <a href="https://www.heavens-above.com/orbit.aspx?satid=08032">heavens-above.com</a>.</div></p>
<h2>The latitude indicator</h2>
<p>The latitude indicator to the left of the globe shows the spacecraft's latitude. The map above shows how the latitude oscillates between
51.8°N and 51.8°S, corresponding to the launch inclination angle.
Even though the path around the globe is a straight (circular) line, the orbit appears roughly sinusoidal when projected onto the map.<span id="fnref:sinusoid"><a class="ref" href="#fn:sinusoid">11</a></span>
The exact latitude is a surprisingly complicated function of the orbital position.<span id="fnref:formulas"><a class="ref" href="#fn:formulas">12</a></span>
This function is implemented by a cam that is attached to the globe. The varying radius of the cam corresponds to the function.
A follower tracks the profile of the cam and rotates the latitude display wheel accordingly, providing the non-linear motion.</p>
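<p>The complicated function is standard spherical trigonometry: for a circular orbit inclined at angle <i>i</i>, sin(latitude) = sin(<i>i</i>)·sin(<i>u</i>), where <i>u</i> is the angle traveled along the orbit from the ascending node. A quick sketch of this formula (my derivation, not read off the cam's profile):</p>

```python
import math

# Latitude as a function of the angle u along an orbit inclined at 51.8 degrees:
# sin(lat) = sin(i) * sin(u), the function that the cam's radius encodes.
def latitude(u_deg, inclination_deg=51.8):
    u = math.radians(u_deg)
    i = math.radians(inclination_deg)
    return math.degrees(math.asin(math.sin(i) * math.sin(u)))

print(round(latitude(90), 1))    # 51.8: peak latitude equals the inclination
print(round(latitude(180), 6))   # 0.0: back over the equator
```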
<p><a href="https://static.righto.com/images/globus-gearing/latitude-cam.jpg"><img alt="A cam is attached to the globe and rotates with the globe." class="hilite" height="353" src="https://static.righto.com/images/globus-gearing/latitude-cam-w400.jpg" title="A cam is attached to the globe and rotates with the globe." width="400" /></a><div class="cite">A cam is attached to the globe and rotates with the globe.</div></p>
<h2>The Earth's rotation</h2>
<p>The second motion of the globe is the Earth's daily rotation around its axis, which I'll call the Earth rotation.
The Earth rotation is fed into the globe through the outer part of a concentric shaft, while the orbital rotation is provided through the
inner shaft.
The Earth rotation is transferred through three gears to the equatorial frame, where an internal mechanism rotates the hemispheres.
There's a complication, though:
if the globe's orbital shaft turns while the Earth rotation shaft remains stationary, the frame will rotate, causing the
gears to turn and the hemispheres to rotate.
In other words, keeping the hemispheres stationary requires the Earth shaft to rotate with the orbit shaft.</p>
<p><a href="https://static.righto.com/images/globus-gearing/globe-closeup.jpg"><img alt="A closeup of the gear mechanisms that drive the Globus, showing the concentric shafts that control the two rotations." class="hilite" height="464" src="https://static.righto.com/images/globus-gearing/globe-closeup-w400.jpg" title="A closeup of the gear mechanisms that drive the Globus, showing the concentric shafts that control the two rotations." width="400" /></a><div class="cite">A closeup of the gear mechanisms that drive the Globus, showing the concentric shafts that control the two rotations.</div></p>
<p>The Globus solves this problem by adding the orbit rotation to the Earth rotation, as shown in the diagram below, using differentials 7 and 8.
Differential 8 adds the normal orbit rotation, while differential 7 adds the orbit rotation due to the landing motor.<span id="fnref:landing"><a class="ref" href="#fn:landing">14</a></span></p>
<p><a href="https://static.righto.com/images/globus-gearing/diagram-earth.png"><img alt="The mechanism to compute the Earth's rotation around its axis." class="hilite" height="391" src="https://static.righto.com/images/globus-gearing/diagram-earth-w350.png" title="The mechanism to compute the Earth's rotation around its axis." width="350" /></a><div class="cite">The mechanism to compute the Earth's rotation around its axis.</div></p>
<p>The Earth motion is generated by
a second solenoid (below) that is driven with one pulse per second.<span id="fnref:solenoid"><a class="ref" href="#fn:solenoid">13</a></span>
This motion is simpler than the orbit motion because it has a fixed rate.
The "Earth" knob on the front of the Globus permits manual rotation around the Earth's axis. This signal is combined with the solenoid signal by differential 6.
The sum from the three differentials is fed into the globe, rotating the hemispheres around their axis.</p>
<p><a href="https://static.righto.com/images/globus-gearing/earth-solenoid.jpg"><img alt="This solenoid, ratchet, and gear on the underside of the Globus drive the Earth rotation." class="hilite" height="286" src="https://static.righto.com/images/globus-gearing/earth-solenoid-w350.jpg" title="This solenoid, ratchet, and gear on the underside of the Globus drive the Earth rotation." width="350" /></a><div class="cite">This solenoid, ratchet, and gear on the underside of the Globus drive the Earth rotation.</div></p>
<p>The solenoid and differentials are visible from the underside of the Globus. The diagram below labels these components as well as
other important components.</p>
<p><a href="https://static.righto.com/images/globus-gearing/diagram-underside.jpg"><img alt="The underside of the Globus." class="hilite" height="665" src="https://static.righto.com/images/globus-gearing/diagram-underside-w700.jpg" title="The underside of the Globus." width="700" /></a><div class="cite">The underside of the Globus.</div></p>
<h2>The longitude display</h2>
<p><a href="https://static.righto.com/images/globus-gearing/longitude-cam.jpg"><img alt="The longitude cam and the followers that track its radius." class="hilite" height="383" src="https://static.righto.com/images/globus-gearing/longitude-cam-w350.jpg" title="The longitude cam and the followers that track its radius." width="350" /></a><div class="cite">The longitude cam and the followers that track its radius.</div></p>
<p>The longitude display is more complicated than the latitude display because it depends on both the Earth's rotation and the orbit rotation.
Unlike the latitude, the longitude doesn't oscillate; it increases continuously.
The longitude increases by 360° every orbit according to a complicated formula describing the projection of the orbit onto the globe.
Most of the time, the increase is small, but when the track passes near the poles, the longitude changes rapidly.
The Earth's rotation contributes a smaller but steady negative change to the longitude.</p>
<p><a href="https://static.righto.com/images/globus-gearing/diagram-earth2.png"><img alt="The computation of the longitude." class="hilite" height="245" src="https://static.righto.com/images/globus-gearing/diagram-earth2-w220.png" title="The computation of the longitude." width="220" /></a><div class="cite">The computation of the longitude.</div></p>
<p>The diagram above shows how the longitude is computed by combining the Earth rotation with the orbit rotation.
Differential 9 adds the linear effect of the orbit on longitude (360° per orbit) and subtracts the effect of the Earth's rotation (360° per day).
The nonlinear effect of the orbit is computed by a cam that is rotated by the orbit signal. The shape of the cam is picked up and fed into differential 10,
computing the longitude that is displayed on the dial. The differentials, cam, and dial are visible from the back of the Globus (below).</p>
<p><a href="https://static.righto.com/images/globus-gearing/diagram-back.jpg"><img alt="A closeup of the differentials from the back of the Globus." class="hilite" height="621" src="https://static.righto.com/images/globus-gearing/diagram-back-w450.jpg" title="A closeup of the differentials from the back of the Globus." width="450" /></a><div class="cite">A closeup of the differentials from the back of the Globus.</div></p>
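<p>To make the division of labor concrete, here's a small Python model (my own sketch; the 92-minute period is an illustrative value, not a Globus setting) that splits the longitude into the three contributions the mechanism combines: the linear 360°-per-orbit term and the Earth term from differential 9, plus the bounded nonlinear correction supplied by the cam through differential 10.</p>

```python
import math

def longitude_terms(t_min, T=92.0, incl_deg=51.8):
    """Split the longitude into the three contributions the gearing combines.
    T (orbital period in minutes) is an illustrative value, not a Globus setting."""
    i = math.radians(incl_deg)
    phase = 2 * math.pi * t_min / T
    linear = 360.0 * t_min / T               # differential 9: 360 deg per orbit
    earth = -360.0 * t_min / (24 * 60)       # differential 9: minus the Earth's rotation
    # The cam encodes the gap between the true projected longitude and the
    # linear term; it is bounded (about +/-14 deg at 51.8 deg inclination)
    # and repeats every orbit, which is why a rotating cam can produce it.
    true_lon = math.degrees(math.atan2(math.cos(i) * math.sin(phase), math.cos(phase)))
    cam = ((true_lon - linear + 180.0) % 360.0) - 180.0   # added by differential 10
    return linear, earth, cam
```

<p>The cam term stays small and periodic, which is exactly the kind of function a shaped cam can generate mechanically; the unbounded growth comes only from the plain gearing of the linear terms.</p>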
<p>The time-lapse video below demonstrates the behavior of the rotating displays.
The latitude display on the left oscillates between 51.8°N and 51.8°S.
The longitude display at the top advances at a changing rate, moving slowly near the equator and accelerating near the poles.
The light/shadow display at the bottom rotates at a constant speed, completing half a revolution (one light/shadow cycle) per orbit.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/nDHtJy9cpC0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h2>Conclusions</h2>
<p>The Globus INK is a remarkable piece of machinery, an analog computer that calculates orbits through an intricate
system of gears, cams, and differentials.
It provided astronauts with a high-resolution, full-color display of the spacecraft's position, way beyond what
an electronic space computer could provide in the 1960s.</p>
<p>The drawback of the Globus is that its functionality is limited.
Its parameters must be manually configured: the spacecraft's starting position, the orbital speed, the light/shadow regions, and the landing angle.
It doesn't take any external guidance inputs, such as an IMU (inertial measurement unit), so it's not particularly accurate.
Finally, it only supports a circular orbit at a fixed angle.
While a more modern digital display lacks the physical charm of a rotating globe, the digital solution provides
much more capability.</p>
<p>I recently wrote blog posts providing a <a href="https://www.righto.com/2023/01/inside-globus-ink-mechanical-navigation.html">Globus overview</a>
and <a href="https://www.righto.com/2023/03/reverse-engineering-electronics-in.html">the Globus electronics</a>.
Follow me on Twitter <a href="https://twitter.com/kenshirriff">@kenshirriff</a> or <a href="http://www.righto.com/feeds/posts/default">RSS</a> for updates.
I've also started experimenting with Mastodon recently as <a href="https://oldbytes.space/@kenshirriff">@kenshirriff@oldbytes.space</a>.
Many thanks to Marcel for providing the Globus.
I worked on this with CuriousMarc, so check out his <a href="https://www.youtube.com/@CuriousMarc/videos">Globus videos</a>.</p>
<h2>Notes and references</h2>
<div class="footnote">
<ol>
<li id="fn:ink">
<p>In Russian, the name for the device is "Индикатор Навигационный Космический," abbreviated as ИНК (INK). This translates to "space navigation indicator,"
but I'll use the more descriptive nickname "Globus" (i.e. globe).
The Globus has a long history, dating back to the beginnings of Soviet crewed spaceflight. The first version was simpler and had the Russian acronym ИМП (IMP).
Development of the IMP started in <a href="https://web.mit.edu/slava/space/essays/essay-tiapchenko1.htm">1960</a> for the Vostok (1961)
and Voskhod (1964) spaceflights.
The more complex INK model (described in this blog post) was created for the Soyuz flights, starting in 1967.
The landing position feature is the main improvement of the INK model.
The Soyuz-TMA (2002) upgraded to the Neptun-ME system which used digital display screens and abandoned the Globus. <a class="footnote-backref" href="#fnref:ink" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:day">
<p>According to <a href="https://astronaut.ru/bookcase/article/article152.htm?reload_coolmenus">this document</a>,
one revolution of the globe relative to the axis of daily rotation occurs in a time equal to a sidereal day,
taking into account the precession of the orbit relative to the earth's axis, caused by the asymmetry of the Earth's gravitational field.
(A sidereal day is approximately 4 minutes shorter than a regular 24-hour day. The difference is that the sidereal day is
relative to the fixed stars, rather than relative to the Sun.) <a class="footnote-backref" href="#fnref:day" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:orbit">
<p>To see how the angle between the poles and the globe's rotation results in the desired orbital inclination, consider two limit cases.
First, suppose the angle between is 90°. In this case, the globe is "straight" with the equator horizontal.
Rotating the globe along the horizontal axis, flipping the poles end-over-end, will cause the crosshair to
trace a polar orbit, giving the expected inclination of 90°.
On the other hand, suppose the angle is 0°. In this case, the globe is "sideways" with the equator vertical.
Rotating the globe will cause the crosshair to remain over the equator, corresponding to an equatorial orbit
with 0° inclination. <a class="footnote-backref" href="#fnref:orbit" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:directions">
<p>There is a bit of ambiguity when describing the gear motions.
If the end gears are rotating upwards when viewed from the front, the gears are both rotating clockwise when viewed from the right, so
I'm referring to them as rotating in the same direction.
But if you view each gear from its own side, the gear on the left is turning counterclockwise, so from that perspective they are turning
in opposite directions. <a class="footnote-backref" href="#fnref:directions" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:torque">
<p>The solenoids are important since they provide all the energy to drive the globe.
One of the problems with gear-driven analog computers is that each gear and shaft has a bit of friction and loses a bit of torque,
and there is nothing to amplify the signal along the way.
Thus, the 27-volt solenoids need to provide enough force to run the entire system. <a class="footnote-backref" href="#fnref:torque" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:time">
<p>The orbital time can be adjusted between 86.85 minutes and 96.85 minutes according to <a href="https://astronaut.ru/bookcase/article/article152.htm?reload_coolmenus">this detailed page</a> that describes the Globus in Russian. <a class="footnote-backref" href="#fnref:time" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:parameters">
<p>The Globus is manufactured for a particular orbital inclination, in this case 51.8°. It assumes a circular orbit, so it does not account
for any orbital variations or for maneuvering in orbit. <a class="footnote-backref" href="#fnref:parameters" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:loop">
<p>The outputs from the orbit cam are fed into the overall orbit rotation, which drives the orbit cam.
This may seem like an "infinite loop" since the outputs from the cam turn the cam itself.
However, the outputs from the cam are only a small part of the overall orbit rotation, so the feedback remains stable rather than running away. <a class="footnote-backref" href="#fnref:loop" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:scale">
<p>The scales on the light/shadow display are a bit confusing. The inner scale (blue) is measured in percentage of an orbit, up to 100%.
The fixed outer scale (red) measures minutes, indicating how many minutes until the spacecraft enters or leaves shadow.
The spacecraft completes 100% of an orbit in about 90 minutes, so the scales almost, but not quite, line up.
The wheel is driven by the orbit mechanism and turns half a revolution per orbit.</p>
<p><a href="https://static.righto.com/images/globus-gearing/light-dark.jpg"><img alt="The light and shadow indicator is controlled by two knobs." class="hilite" height="175" src="https://static.righto.com/images/globus-gearing/light-dark-w500.jpg" title="The light and shadow indicator is controlled by two knobs." width="500" /></a><div class="cite">The light and shadow indicator is controlled by two knobs.</div></p>
<p><!-- --> <a class="footnote-backref" href="#fnref:scale" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:iss">
<p>The International Space Station illustrates how an orbiting spacecraft is illuminated more than 50% of the time due to its altitude.
You can often see the ISS illuminated in the nighttime sky close to sunset and sunrise
(<a href="https://spotthestation.nasa.gov/">link</a>). <a class="footnote-backref" href="#fnref:iss" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
<li id="fn:sinusoid">
<p>The ground track on the map is roughly, but not exactly, sinusoidal.
As the orbit swings further from the equator, the track deviates more from a pure sinusoid.
The shape will depend, of course, on the rectangular map projection.
For more information, see <a href="https://math.stackexchange.com/questions/1843444/how-do-great-circles-project-on-the-mercator-projection">this StackExchange post</a>. <a class="footnote-backref" href="#fnref:sinusoid" title="Jump back to footnote 11 in the text">↩</a></p>
</li>
<li id="fn:formulas">
<p>To get an idea of how the latitude and longitude behave,
consider a polar orbit with a 90° angle of inclination, one that goes up a line of longitude, crosses the North Pole,
and comes down the opposite line of longitude.
Now, shift the orbit away from the poles a bit, while keeping it a great circle.
The spacecraft will go up, nearly along a constant line of longitude, with the latitude increasing steadily. As the spacecraft reaches the peak of its orbit
near the North Pole, it will fall a bit short of the Pole but will still rapidly cross over to the other side.
During this phase, the spacecraft rapidly crosses many lines of longitude (which are close together near the Pole) until it reaches the opposite
line of longitude.
Meanwhile, the latitude stops increasing short of 90° and then starts dropping.
On the other side, the process repeats, with the longitude nearly constant while the latitude drops at a roughly constant rate.</p>
<p>The latitude and longitude are generated by complicated trigonometric functions.
The latitude is given by arcsin(sin i * sin (2πt/T)), while the longitude is given by λ = arctan (cos i * tan(2πt/T)) + Ωt + λ<sub>0</sub>,
where t is the spaceship's flight time starting at the equator, i is the angle of inclination (51.8°),
T is the orbital period, Ω is the angular velocity of the Earth's rotation, and λ<sub>0</sub> is the longitude of the ascending node. <a class="footnote-backref" href="#fnref:formulas" title="Jump back to footnote 12 in the text">↩</a></p>
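<p>As a sanity check on these formulas, here's a short Python sketch. The 92-minute period and zero starting longitude are illustrative values, not Globus settings, and I've written the Earth term with a negative sign to match the steady westward drift described in the main text (the footnote's Ω absorbs the sign convention).</p>

```python
import math

def ground_track(t, T=92 * 60, incl_deg=51.8, lon_node=0.0):
    """Latitude and longitude (degrees) of a circular orbit's ground track,
    t seconds after the ascending-node equator crossing."""
    i = math.radians(incl_deg)
    phase = 2 * math.pi * t / T                 # angle traveled along the orbit
    lat = math.degrees(math.asin(math.sin(i) * math.sin(phase)))
    # atan2 (rather than plain arctan) keeps the longitude in the correct quadrant
    orbit_lon = math.degrees(math.atan2(math.cos(i) * math.sin(phase), math.cos(phase)))
    omega = 360.0 / 86164.0                     # Earth's rotation, deg per sidereal second
    return lat, (orbit_lon - omega * t + lon_node) % 360.0

# A quarter orbit after the equator crossing, the latitude peaks at the inclination.
lat, lon = ground_track(92 * 60 / 4)            # lat is 51.8
```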
</li>
<li id="fn:solenoid">
<p>An important function of the gears is to scale the rotations as needed by using different gear ratios.
For the most part, I'm ignoring the gear ratios, but the Earth rotation gearing is interesting.
The gear driven by the solenoid has 60 teeth, so it rotates exactly once per minute.
This gear drives a shaft with a very small gear on the other end with 15 teeth. This gear meshes with a much larger gear with approximately 75 teeth,
which will thus rotate once every 5 minutes. The other end of that shaft has a gear with approximately 15 teeth, meshed with a large gear
with approximately 90 teeth. This divides the rate by 6, yielding a rotation every 30 minutes.
The sequence of gears and shafts continues, until the rotation is reduced to once per day.
(The tooth counts are approximate because the gears are partially obstructed inside the Globus, making counting difficult.) <a class="footnote-backref" href="#fnref:solenoid" title="Jump back to footnote 13 in the text">↩</a></p>
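<p>The reduction chain can be checked with a quick calculation. Since the tooth counts above are approximate, this is a sketch of the idea rather than the exact Globus ratios:</p>

```python
from fractions import Fraction

# Each stage is (driving teeth, driven teeth); the output slows by driving/driven.
stages = [(15, 75),   # 1/5: once per minute -> once per 5 minutes
          (15, 90)]   # 1/6: once per 5 minutes -> once per 30 minutes
revs_per_minute = Fraction(1)          # the 60-tooth solenoid gear: 1 rev per minute
for driving, driven in stages:
    revs_per_minute *= Fraction(driving, driven)
minutes_per_rev = 1 / revs_per_minute  # 30 minutes per revolution after two stages
```

<p>Continuing the same pattern of small-gear-into-large-gear stages eventually reduces the rate to one revolution per day.</p>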
</li>
<li id="fn:landing">
<p>There's a potential simplification when canceling out the orbital shaft rotation from the Earth rotation.
If the orbit motion were taken from differential 5 instead of differential 4, the landing motor effect would get added automatically,
eliminating the need for differential 7.
I think the landing motor motion was added separately so the mechanism could account for the Earth's rotation during the landing descent. <a class="footnote-backref" href="#fnref:landing" title="Jump back to footnote 14 in the text">↩</a></p>
</li>
</ol>
</div>
Ken Shirriffhttp://www.blogger.com/profile/08097301407311055124noreply@blogger.com2