Improvements to the Xerox Alto Mandelbrot drop runtime from 1 hour to 9 minutes

Last week I wrote a Mandelbrot set program for the Xerox Alto, which took an hour to generate the fractal. The point of this project was to learn how to use the Alto's bitmapped display, not to make the fastest Mandelbrot set, so I wasn't concerned that this 1970s computer took so long to run. Even so, readers had detailed suggestions for performance improvements, so I figured I should test out these ideas. The results were much better than I expected, dropping the execution time from 1 hour to 9 minutes.

The Mandelbrot set, generated by a Xerox Alto computer.

The Alto was a revolutionary computer designed at Xerox PARC in 1973 to investigate personal computing. It introduced high-resolution bitmapped displays, the GUI, Ethernet and laser printers to the world, among other things. For my program I used BCPL, the primary Alto programming language and a precursor to C. In the photo above, the Alto computer is in the lower cabinet. The black box is the 2.5 megabyte disk drive. The Alto's unusual portrait display and an early optical mouse are on top.

Easy optimization: mirror the Mandelbrot set

The Mandelbrot set is a famous fractal, generated by a simple algorithm. Each point on the plane represents a complex number c. You repeatedly iterate the complex function f(z) = z^2 + c. If the value diverges to infinity, the point is outside the set. Otherwise, the point is inside the set and the pixel is set to black.
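
In practice you can't iterate forever: the program runs a bounded number of iterations and treats an absolute value of 2 or more as proof that the point escapes. Here's a minimal sketch of that escape-time test, written in C for clarity (the actual program is BCPL and uses fixed-point arithmetic, described below; the iteration limit here is an arbitrary choice for the sketch):

/* Escape-time test for one point c = cr + ci*i. Doubles are used here for
   clarity; the Alto version uses 16-bit fixed point instead. */
#define MAX_ITER 20   /* arbitrary limit for this sketch */

static int in_mandelbrot(double cr, double ci) {
    double zr = 0.0, zi = 0.0;
    for (int n = 0; n < MAX_ITER; n++) {
        double zr2 = zr * zr, zi2 = zi * zi;
        if (zr2 + zi2 >= 4.0)      /* |z| >= 2: the orbit diverges, so c is outside */
            return 0;
        double t = zr2 - zi2 + cr; /* z = z^2 + c, split into real and imaginary parts */
        zi = 2.0 * zr * zi + ci;
        zr = t;
    }
    return 1;                      /* never escaped within the limit: plot it black */
}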

Since the Mandelbrot set is symmetrical around the X axis, a simple optimization is to draw both halves at the same time, cutting the time in half. (This only helps if you're drawing the whole set, not if you're zoomed in.) I implemented this, bringing the time down to about 30 minutes. The image below shows the mirrored appearance midway through computation.

Drawing both halves of the Mandelbrot set simultaneously doubles the performance.
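
To illustrate the mirroring, here is a small C sketch that sets a pixel and its mirror image in a 1-bit-per-pixel frame buffer. The geometry (38 sixteen-bit words per scan line, 808 lines, the axis of symmetry at mid-screen, and the most significant bit as the leftmost pixel) is an assumption for this sketch, not code from the actual program:

#include <stdint.h>

#define WORDS_PER_LINE 38   /* 38 words x 16 pixels = 608 pixels per scan line (assumed) */
#define HEIGHT 808          /* Alto portrait display height */

/* Set pixel (x, y) and the mirrored pixel in the bottom half of the image. */
static void set_pixel_mirrored(uint16_t *frame, int x, int y) {
    uint16_t bit = 0x8000u >> (x & 15);
    frame[y * WORDS_PER_LINE + x / 16] |= bit;                /* top half */
    frame[(HEIGHT - 1 - y) * WORDS_PER_LINE + x / 16] |= bit; /* mirrored bottom half */
}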

Improving the code

Embedded programmer Julian Skidmore had detailed comments on how to speed up my original code. I tried his changes and the time dropped from 30 to 24 minutes. Some of his changes were straightforward - calculating the pixel address in memory incrementally instead of using multiplication, and simplifying the loop counting. But one of his changes illustrates how primitive the Alto's instruction set is.

Quick background: my Mandelbrot program is implemented in BCPL, a programming language that was a precursor to C. The program is compiled to Nova machine code, the instruction set used by the popular Data General Nova 16-bit minicomputer. The Alto implements the Nova instruction set through microcode.

Since the Xerox Alto doesn't support floating point, I used 16-bit fixed point arithmetic: 4 bits to the left of the decimal point and 12 bits to the right. After multiplying two fixed point numbers with integer multiplication, the 32-bit result must be divided by 2^12 to restore the decimal point location. Usually if you're dividing by a power of 2, it's faster to do a bit shift. That's what I originally did, in the code below. (In BCPL, % is the OR operation, not modulo. ! is array indexing.)

let x2sp = (x2!0 lshift 4) % (x2!1 rshift 12)

Unfortunately this turns out to be inefficient for a couple of reasons. Modern processors usually have a barrel shifter, so you can efficiently shift a word by as many bits as you want. The Alto's instruction set, however, only shifts by one bit. So to right shift by 12 bits, the compiled code calls an assembly subroutine (RSH) that does 12 separate shift instructions, much slower than the single instruction I expected. The second problem is that the instruction set (surprisingly) doesn't have a bitwise-OR instruction, so the OR operation is implemented in another subroutine (IOR).1 I took Julz's suggestion and used the MulDiv assembly-language function to multiply two numbers and divide by 4096, instead of shifting. It's still not fast (since the Alto doesn't have hardware multiply and divide), but at least it reduces the number of instructions executed.
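
For reference, here is a small C sketch of the 4.12 fixed-point multiply that the MulDiv call performs: the 32-bit product of two 4.12 numbers has 24 fraction bits, so dividing by 4096 restores the format. This just illustrates the arithmetic; it isn't the actual BCPL or assembly code.

#include <stdint.h>

typedef int16_t fix4_12;   /* 4 integer bits, 12 fraction bits; 1.0 is 1 << 12 */

/* Multiply two 4.12 fixed-point values. The 32-bit intermediate keeps all the
   bits; dividing by 2^12 = 4096 puts the binary point back where it belongs,
   which is the job MulDiv does on the Alto in a single call. */
static fix4_12 fixmul(fix4_12 a, fix4_12 b) {
    int32_t product = (int32_t)a * (int32_t)b;
    return (fix4_12)(product / 4096);
}

For example, fixmul((fix4_12)(1 << 12), (fix4_12)(1 << 12)) evaluates to 1 << 12, i.e. 1.0 * 1.0 = 1.0 in this representation.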

Shutting off the display

One way to speed up the Alto is to shut off the display.2 I tried this and the time dropped from 24 minutes to 9 minutes, a remarkable improvement. Why does turning off the display make such a big difference?

One of the unusual design features of the Alto is that it performed many tasks in software that are usually performed in hardware, giving the Alto more flexibility. (As Alan Kay put it, "Hardware is just software crystallized early.") For instance, the CPU is responsible for copying data between memory and the disk or Ethernet interface. The CPU also periodically scans memory to refresh the dynamic RAM. These tasks are implemented in microcode, and the hardware switches between tasks as necessary, preempting low priority tasks to perform higher-priority tasks. Executing a user program has the lowest priority and runs only when there's nothing more important to be done.

All this task management was done in hardware, not by the operating system. The Xerox Alto doesn't use a microprocessor chip, but instead has a CPU built out of three boards of simple TTL chips. The photo below shows one of the CPU boards: the control board, which implements the microcode tasks and controls what the CPU is doing. It has PROMs to hold the microcode, 64-bit RAM chips (yes, just 64 bits) to remember what each task is doing, and more chips to determine which task has the highest priority.

Control board for the Xerox Alto. Part of the CPU, this board holds microcode and handles microcode tasks.

The task that affects the Mandelbrot program is the display task: to display pixels on the screen, the CPU must move the pixels for each scan line from RAM to the display board, 30 times a second, over and over. During this time the CPU can't run any program instructions, so there's a large performance impact just from displaying pixels on the screen, and not using the display makes the program run much, much faster. I still set the Mandelbrot pixels in memory, though; when the program is done, I update the display pointer, causing the set to appear on the display immediately. The Mandelbrot set still ends up on screen; you just don't see it as it gets drawn.
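
The structure of the trick looks roughly like the C sketch below. Both function names are hypothetical placeholders, not real Alto calls: set_display_list() stands in for whatever updates the display list described in note 2, and draw_mandelbrot() for the rendering loop.

#include <stdint.h>

extern void set_display_list(uint16_t *bitmap);  /* placeholder, not a real Alto API */
extern void draw_mandelbrot(uint16_t *bitmap);   /* placeholder for the rendering loop */

void render_then_show(uint16_t *bitmap) {
    set_display_list(0);       /* empty display list: the display task has nothing to fetch */
    draw_mandelbrot(bitmap);   /* render into off-screen memory at full CPU speed */
    set_display_list(bitmap);  /* point the display at the finished bitmap */
}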

Microcode: the final frontier

The hardest way to optimize performance on the Alto is to write your own microcode. The Alto includes special microcode RAM, letting you extend the instruction set with new instructions. This feature was used by programs that required optimized graphics such as the Bravo text editor and the Draw program. Games such as Missile Command and Pinball also used microcode for better performance. Writing the Mandelbrot set code in microcode would undoubtedly improve performance. However, Alto microcode is pretty crazy, so I'm not going to try a microcode Mandelbrot.

Conclusion

After writing a Mandelbrot program for the Xerox Alto, I received many suggestions for performance improvements. By implementing these suggestions, the time to generate the Mandelbrot set dropped from one hour to 9 minutes, a remarkable improvement. The biggest speedup came from turning off the display during computation; just putting static pixels on the screen uses up a huge amount of the processing power. My improved Mandelbrot code is on GitHub.

My goal with the Mandelbrot was to learn how to use the Alto's high-resolution display from a BCPL program. Using what I learned with the Mandelbrot, I wrote a program to display images; an example is below.3

The Xerox Alto displaying an image of the Xerox Alto displaying...

Notes and references

  1. The Alto has an AND machine instruction but not an OR instruction, so the OR operation is performed by complementing one argument, ANDing it with the other, and then using the add-complement instruction to add back the original value; in effect, a OR b is computed as (a AND NOT b) + b, which works because the two terms have no bits in common. (A quick C check of this identity appears after these notes.)

  2. Strictly speaking, I left the display on; it just wasn't displaying anything. The Alto uses a complex display system with a display list pointing to arbitrarily-sized blocks of pixels. (For instance, when displaying text, each block is a separate text line. Scrolling the screen just involves updating the list pointers, not moving actual data.) Thus, I could set the display list to NULL while rendering the Mandelbrot into memory. Then when the Mandelbrot was done, I simply updated the display list to make the image appear. 

  3. The recursive picture-in-picture effect is known as the Droste effect. After making this picture, I learned that generating the Droste effect on old computers is apparently a thing.
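
As promised in note 1, here is a quick C check of the OR-from-AND identity: since (a AND NOT b) and b never have a set bit in common, adding them gives the same result as ORing them.

#include <assert.h>
#include <stdint.h>

/* Exhaustively verify (a & ~b) + b == (a | b) over all pairs of 8-bit values. */
int main(void) {
    for (uint32_t a = 0; a < 256; a++)
        for (uint32_t b = 0; b < 256; b++)
            assert(((a & ~b) + b) == (a | b));
    return 0;
}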

15 comments:

colinstu said...

Nice work!

Now.... someone who's really feeling up to it, needs to program this in microcode! Curious to see how fast it would be.

Franc said...

Awesome. Fun fact. The Droste effect is named after the tins of Dutch Droste cacao featuring a nurse carrying a tray with a tin of cacao with a nurse .....
Comes in boxes these days but still good stuff for a cold winter :)

Anthill Beetle said...

If you use symmetry, you can just as well use the fact that it is continuous and only trace points along the boundary. It needs some math to be done properly, but it should improve speed dramatically.

(Although even flood fill was a nontrivial problem on old machines.)

Ed said...

There's a boundary-tracing Mandelbrot for the 6502-based BBC Micro here.

Renato Romano said...

Wow! Good job!
Shutting off the display reminds me of the ZX81 times... :-)

Larry Stewart said...

Actually the Alto does floating point. I know this because I wrote the microcode for it :)
That was for the Mesa language environment. For BCPL, there was a subroutine library for floating point and possibly the microcode was backported.

The canonical way to do this sort of thing as of about 1981 was to farm out the computation to idle Altos on the Ethernet. See the paper by Jon Hupp and John Shoch. Or, if you were lucky enough to have one, run it on a Dorado...

Ken Shirriff said...

Hi Larry! Thanks for your comments. Did you write the FLOAT library? We don't have Mesa running yet - I heard that it needs two drives. As for the Hupp and Shoch worm, do you know if the source code is available anywhere? It would be cool to get that running on a few Altos.

Larry Stewart said...

I didn't write the BCPL float library. Possibly Ed Fiala? I wrote the IEEE single precision microcode and Mesa trap code to handle the hard cases (gradual underflow, etc.) I remember Ed took over the code I wrote and added underflow to 0 because he thought gradual underflow was too slow. Regarding the worm code, I don't know of any, but Paul McJones' archive at CHM might have some, or John Shoch is at Alloy Ventures I think.

Snial said...

Hello Ken,

Thanks for publishing my suggested improvements to the code - it's really exciting to think that I actually have some code that's been run on a real Alto!

I wrote a Mac Plus version in Think C and added a poor-man's Floyd-Steinberg dithering algorithm. It's equivalent to adding this line after the for h loop:

for h = 0 to 37 do // horizontal word count
  [
  let dither=0

and replace the "if n eq 0 then [" sequence with:

dither=dither+(20-n)
if dither ge 20 then [
  dither=dither-20
  v!0=v!0 + b // set the pixel
  v2!0 = v!0
]

-cheers from Julz

Snial said...

Hi colinstu,

I'd be amazed if someone actually tried to convert the algorithm into microcode, but it's too tempting not to think about it. I don't know how much you understand about the Xerox Alto microarchitecture - Ken's posts are really helpful and the Alto Hardware Manual contains the complete details.

It's important to remember, of course, that the BCPL compiler has its own overhead, and typically an assembler programmer would be able to improve upon the performance by a factor of 3 (if we trade size for speed). So, this would get us down from 48 minutes for a conventional render to 16 minutes.

The microcode version could be much faster if we assume that all the key variables can be held in fast microcode RAM. We can't avoid the cost of multiplication, which is going to be about 10 to 14µs (= 64c to 84c based on my reading of the existing microcode), but in microcode we can go back to using shifts for fixed-point math correction, and 16 shifts would take at least 16c * 0.17µs = 2.72µs (though I think 32c is more likely).

The microcode architecture can't perform register to register calculations in one cycle, because it can only really transfer registers to one ALU input per cycle and it can't read and write to a RAM register in the same cycle. So, a calculation like:

C=A+B

would take 3 cycles: load A into T, then B+T into L, then C = L. However, a more complex function such as C=A+B+D could probably be done in just 4 or 5 cycles: A->T, B+T->L, L->T and D+T->L, L->C.

With this and the knowledge that branches are delayed, we can estimate the basic loop time:

A line like this: let x2 =(x ls 0) ? -x,x would be: X->L; L<<1, ALUCY; -X->L,:NONEGX (:NEGX would be the true branch, which would then jump to NONEGX anyway). That's 3 or 4 cycles.

Then a multiplication and shift would be about 80+32. So that's 116c in total.

The y2 multiplication is then another 116c.

"if x2+ y2 ge 16384" is pretty tricky. I would do x2->T; y2+T->L; then shift and test for carry twice. That's about 5 cycles including the delayed branching.

Computing the xor for the sign test is 3 cycles (the ALU supports xor directly).
The abs functions will take another 4 cycles each => 8c.
The multiplication and shift is another 112c.
The sign correction is probably another 4 cycles.
The final calculations for x and y are 5 cycles each = 10.
n=n-1 is 2 cycles, because I think decrement is possible directly via the ALU; the microcode can test the result at the same time and perform the jump so that completes it.

This makes the total = 376c per iteration (about 64µs).

My Mac Plus version requires 420259 iterations, which scaled up to the Alto screen size is about 760469 iterations; 760469 iterations x 64µs gives about 48.6s.

The Alto itself takes 66% of CPU, so that's really: 146s or nearly 2.5 minutes. But I've only optimised the inner loop because that's what will make most of the difference. The screen update in assembler or BCPL would still involve maybe 10 Nova instructions per pixel + maybe another 10 per black pixel and at 7.5µs per instruction that's another 27.36s. So the total is 173s or nearly 3 minutes (without tricks like mirroring).

So, the upshot is that microcode makes a difference, but it's not magic; the program would still be measured in minutes!

-cheers from Julz

Anonymous said...

I am sure you know that x*x-y*y can be computed as (x-y)(x+y) ?

Mark Jeronimus said...

For the large Mandelbrot image you can precompute the cardioid and first bulb for another significant speed-up. Here's some BASIC code:

A = <real part of the pixel's complex number>
B = <imaginary part of the pixel's complex number>
ITER = 0
U = 4 * (A * A + B * B): V = U - 2 * A + 1 / 4
IF U + 8 * A + 15 / 4 < 0 THEN GOTO REPEAT
IF V - SQR(V) + 2 * A - 1 / 2 < 0 THEN GOTO REPEAT

<calculate the orbit here>

REPEAT: <draw the pixel>

Mark Jeronimus said...

The site ate the meta-code inside angle brackets :( (A, B) is the complex number of the pixel. REPEAT is where you draw the pixel. Before REPEAT is where you calculate the orbit.

Anonymous said...

I'm sort of surprised that they didn't look at the Nova instruction set and try to improve on it a bit, as it seems really poorly suited to a machine that has pushing pixels as one of its major responsibilities.

Alan Kay said...

@anonymous Basically, no one at Parc wanted to use either the Nova emulator or BCPL -- the whole idea and plan was to bootstrap very high level languages "of the future" in the present with the help of the Alto microcode to emulate future HW. Several strategies/paths were adopted.

For example, the Learning Research Group opted to gut it out writing in machine (Nova) code and then did a Smalltalk VM in microcode.

CSL had a similar route for a variety of planned languages, but also opted to do considerable utility work in BCPL, and this included some practical apps such as the BRAVO word processor.

We expected a new *personal supercomputer* would appear every few years but this got delayed, and this led to optimizations and some kludgery instead of cleaner speed. For example, Smalltalk-76 is slow only because it swaps objects from the slow disk instead of having them in RAM.