The anatomy of a micro benchmark

Posted by André on 2014-08-04.

Almost two years ago I demonstrated a 10MHz accelerator card for the Commodore PET at the Classic Computing exhibition. For that presentation I wrote a small demo program that draws a 3-D cube on the PET's screen - first at 1 MHz, then at 10 MHz. As I was concentrating on the 10MHz speedup, I didn't have enough time to look deeply into the demo program itself, so people asked why it was so slow at 1MHz compared to other demos. I finally found the time to have a look at this, so I'll use it as an example of a performance optimization job. Read on to learn how a proper approach to performance optimization helps fix performance problems even in an old Commodore BASIC program.

The demo program is a small BASIC program that draws a tilted, rotating 3-D cube on the screen. Due to time constraints I decided to do many parts in BASIC, and only do the line drawing in machine code. The resulting code was this draw2.basic. You can see the demo of that program here:

Note that to make it more interesting to look at, the cube is tilted by a fixed angle around the axis that is perpendicular to the screen. In the demo it then rotates around the vertical axis. Looking at the demo you can see that it indeed looks rather slow if you're used to other demos. One reason surely is that most of it actually runs in BASIC. But looking at the code now, I in fact immediately found two places where unnecessary code paths were taken - missing RETURNs made one routine fall through into another, unnecessary routine! But, even though it was tempting, it is important not to start hacking away directly. The first step is to:

Establish your baseline!

The very first step should be to define what you want to improve - and to measure it! If you don't already have some kind of timing measurement framework in your code, now is the time to introduce it! Make sure you have repeatable tests that show stable timing results. If the timing results are not stable, you may have other issues, which I will not go into here.

In my case of course I didn't have such a timing framework. But luckily Commodore BASIC has a nifty feature, the "TI" variable (see for example http://www.zimmers.net/cbmpics/cbm/PETx/basicref.html ). It counts the number of interrupts since the computer started up, and those interrupts occur with a frequency of about 60Hz. So I decided to try TI as the time base for now - but this decision still had to be validated.

So the next question was where in the program to measure, and how to make the test repeatable. The program first sets up four points of the tilted cube (the other four points can be mirrored through the coordinate origin), using a number of trigonometric functions. Then the demo starts and iterates over the angle around the vertical axis, with 32 steps for a full 360 degrees. The demo switches between slow (1MHz) and fast (10MHz) operation for effect, and continues endlessly. To get the measurements, I disabled this switch and stayed at 1MHz, and I made it run and measure exactly one full rotation.

Within each of the 32 rotational iterations, first the sine and cosine of the rotation angle are computed, and then, using these values, the four known cube points are rotated in absolute (3-D) coordinate space. In true good programming practice, the coordinates are stored in arrays (the closest thing to structures that Commodore BASIC has) and the computation is generic, called separately for each point with a new index into the coordinate array. The computation is thus basically a generic matrix-vector multiplication. After rotating the points, they are z-scaled and projected to 2-D screen coordinates. In this process the other four cube points are also computed, by mirroring the coordinates of the known four points. This routine is again called four times as a subroutine, with indexes pointing to the right coordinates in the coordinate arrays. Then the 2-D coordinates are "poked" into specific memory locations where the drawing routine picks them up. I decided to take timing measurements around each of these operations.

Each timer counts the interrupts that occur during its operation; they arrive every 1/60th of a second, and I will call them "ticks" here. On each rotational iteration the times were added up per operation, to give a total for the full rotation of the cube. At first I had of course gotten the computation of the differences wrong, but that revealed itself because repeated test runs did not return the same (or similar) results; after fixing this I had the first stable results.
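
As an illustration of what such a measurement looks like, here is a minimal sketch using TI; the line numbers, subroutine targets and timer names are illustrative and not the actual demo code:

  100 t0=ti:gosub 3300:t3=t3+ti-t0 : rem rotate the points, accumulate ticks in t3
  110 t0=ti:gosub 3400:t4=t4+ti-t0 : rem scale and project, accumulate ticks in t4
  120 rem after the full rotation, t3/60 and t4/60 give seconds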

t1=48, t2=166, t3=1029, t4=1247, t5=732, t6=238, sum=3460 ticks or ~57 seconds

What a surprise! What had been assumed to be the culprit, e.g. in the comments to the above video, were the trigonometric functions - but those were in fact rather fast, at only 166 ticks. The real problem was the coordinate transformation! Introducing these measurement points and the baseline saved me from unnecessarily trying to optimize the sine and cosine functions, at least for now. Instead I was able to focus on the really slow parts...

(Perceived) Low-hanging fruit

The first thing I did was of course remove the parts of the code that I had already identified as being called unnecessarily. This resulted in a small drop to

t1=43, t2=172, t3=1028, t4=1004, t5=732, t6=133, sum=3112 ticks or ~52 seconds

Note that I only show an example measurement. In the real testing I took at least two measurements each time to check whether the measurements were stable and what the variance was. Values typically changed by a few ticks (between 0 and 2, sometimes 5) plus or minus, but the sums of all ticks were always very close to each other. The values are also "reasonable" in the sense that each iteration, at least for the slower operations, added 10+ ticks to the total, so the quantization jitter wasn't too bad and, as the sums showed, largely cancelled out. There was also no overflow or underflow to fear, so TI was accepted as the measurement basis.

Another thing I thought would be low-hanging fruit was that I was doing a float division in each of the scaling operations. I replaced the variable containing the division factor with its inverse, and replaced the division with a multiplication - but that did not in fact change much - it even made t4 slower by about 40 ticks (I only notice now that I didn't actually revert this change, so it's included in the following measurements).
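
For illustration, with made-up variable names ("f" for the divisor, "fi" for its inverse) rather than the demo's actual ones, the change is essentially from

  110 sz=r/f

to

  100 fi=1/f : rem computed once, outside the loop
  110 sz=r*fi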

This is also something to note: in some cases you may wish for more measurements, but don't get them in time. The real world isn't ideal, and even if you have general measurements in place, that one specific measurement is sometimes still missing and may take too long to get. In these cases you may have to improvise based on your experience. Here I have had good experience with using switches to turn an optimization on or off. If the optimization does not work out, you can at least switch back and stay at the same level.
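
Such a switch can be a plain flag variable that selects the old or the new code path during testing, so a change that doesn't pay off can be backed out immediately. A sketch (the flag "oz" and the line numbers are made up; Commodore BASIC 2.0 has no ELSE, hence the GOTO):

  100 oz=1                       : rem optimization switch: 0 = old path, 1 = new path
  110 if oz=1 then sz=r*fi:goto 130
  120 sz=r/f
  130 rem continue with either result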

BASIC...

For the next optimizations we need to look a bit more into how BASIC works. Commodore BASIC is in fact a Microsoft BASIC (except that Commodore managed to get full rights from MS). So it is a line-based, interpreted language. Variable names are only significant to two characters, and the memory map has program code, variables, arrays, free space, and string storage at the end of free memory growing downward. There are none of the optimizations you implicitly assume these days: no hash-map lookup for variable or function names, no compile-time computations, no automatic loop unrolling, etc. Nope, none of that. Everything is computed as it is interpreted. Even a long comment (a REM line) takes more cycles the longer the comment is, because BASIC has to scan the comment to find the beginning of the next line.
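
The REM effect is easy to demonstrate with TI itself; a small experiment one could run (line numbers and loop count are arbitrary):

  100 t0=ti
  110 for i=1 to 1000
  120 rem this comment is scanned on every single pass through the loop
  130 next i
  140 print ti-t0;"ticks with the rem inside the loop"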

So while loop unrolling is done on modern machines so as not to defeat the branch prediction algorithms, in Commodore BASIC it is done to avoid searching for the target line number of a GOTO or GOSUB... Avoiding the four GOSUBs for calling the subroutines in t3 and t4 shed about 30 ticks from each of those timers again...
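
As a sketch of what this unrolling means (using a made-up subroutine at line 1000, not the demo's actual line numbers), the change is essentially from

  200 p=0:gosub 1000
  210 p=1:gosub 1000
  220 p=2:gosub 1000
  230 p=3:gosub 1000

to writing the body of the subroutine out four times in place, so the interpreter never has to search the program for line 1000 (and back) at runtime.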

Failing to find another generic optimization opportunity, I looked back at the trigonometric functions in t2. I replaced them with an array lookup - which indeed lowered t2 by about 114 ticks (from about 168 to 54). So while this in itself is a nice speedup, in light of the still 1000+ ticks for each of the coordinate transformations, it is not much.
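
Such a lookup table can be built once during initialization; a minimal sketch assuming the demo's 32 rotation steps (the array names "sn" and "cn" are made up):

  100 dim sn(31),cn(31):w=6.28319/32
  110 for i=0 to 31:sn(i)=sin(i*w):cn(i)=cos(i*w):next i
  120 rem later, in the rotation loop for step i:
  130 s=sn(i):c=cn(i)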

Then, looking again at how BASIC works, I thought: what happens to the array memory when a new variable is declared? Yes, BASIC moves that memory out of the way, which can take quite some time. So if the first iteration contains such a "heavy" operation... bugger, I had forgotten to handle the "warmup" phase!

Remember: always think about and perform a warmup phase in a performance test, to get everything initialized before you take the real measurements (even in such seemingly static environments as Commodore BASIC...)
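
In a setup like this, a warmup can be as simple as one throwaway rotation before the tick counters are zeroed; a sketch with made-up line numbers and subroutine target:

  100 gosub 3000              : rem one full rotation as warmup, results discarded
  110 t1=0:t2=0:t3=0:t4=0:t5=0:t6=0
  120 gosub 3000              : rem the measured rotation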

In this case, however, it did not help. Code analysis confirmed that the arrays were in fact allocated after the last of the scalar variables, so there was no need to move them around, or at least it happened so rarely that it didn't hurt.

Good Bye generic code

So far I had only tried generic optimizations, i.e. the coordinate transformations would still have worked with any other kind of tilt. But now I felt I had to dive into more specific optimizations.

So the next step was to look at the cube point coordinates that I fed into the matrix-vector multiplication as an array of coordinates. That array held only three distinct values: the sine and cosine of the tilt angle, and zero. Removing the coordinate array, using two variables holding the sine and cosine values instead, and dropping the zero terms fixes the tilt to the currently chosen direction. However, this alone shed over 500 ticks from the matrix-vector multiplication time t3, almost halving it!

The rotation was also still "generic" in that the rotation matrix was fully populated and used; it could have rotated around any axis. But the rotation axis in the demo is also fixed - so a number of entries in the rotation matrix are zero, and the rest again are only the sine or cosine of the rotation angle. Removing the array representing the rotation matrix, dropping the zero entries, and just using two variables again removed about 200 ticks from t3!

I.e. going from

  3310 p=0:rem rotate offset 0 in cb to rc
  3311 rc(p)  =cb(p)*ro(0)+cb(p+1)*ro(1)+cb(p+2)*ro(2)
  3312 rc(p+1)=cb(p)*ro(3)+cb(p+1)*ro(4)+cb(p+2)*ro(5)
  3313 rc(p+2)=cb(p)*ro(6)+cb(p+1)*ro(7)+cb(p+2)*ro(8)

to

  3311 rc(0)=a2*b2+r*b1
  3312 rc(1)=a1
  3313 rc(2)=-a2*b1+r*b2

reduces the number of ticks for t3 by 715 from 1040 down to 325 ticks!

In the next step, the scaling and projection, there is still a large number of ticks. In the unrolled loop code from above, some of the indexes into the coordinate arrays are still being computed. Replacing these computations with the constant values now known thanks to the unrolling, i.e. going from

  3441 pc=3*p:p1=2*p:p2=8+p1
  3442 sz=rc(pc+2)*sf
  3443 sc(p1+0)=xo+rc(pc+0)*(1+sz):sc(p1+1)=yo+rc(pc+1)*(1+sz)

to

  3442 sz=rc(8)*sf
  3443 sc(4)=xo+rc(6)*(1+sz):sc(5)=yo+rc(7)*(1+sz)

sheds about 390 ticks from t4.

Limits

Going further and replacing the (1+sz) in the above lines with variables does not help. Also, replacing the array that carries the rotated coordinates from step 3 to step 4 with plain variables makes steps 3 and 4 faster on the one hand, but it also makes step 5 slower!

Now we have come to a point where it seems the variable lookup in BASIC - which is a sequential scan of the list of BASIC variables - has a large effect. This is shown by the last change:

   2 print"loading...":poke 52,0: poke 53,7*16:clr: rem make place
  -3 gosub 14000
  +3 p=0:gosub 14000
   4 print"done initializing"

This simple change initializes the pointer variable "p", which is heavily used in step 5, as the very first variable. Thus it is found quickly when needed, while every other variable lookup has to check one more variable. This is also the reason why step 5 became slower above - replacing the single coordinate array variable with 12 plain (non-array) variables made lookups of the "p" variable that much slower. So while this single change slows t3 and t4 by a small margin (~3 ticks, almost within the measurement error), it makes t5 about 210 ticks faster!

Optimization Result

After all the optimizations, the timing finally gives

Original:
  t1=48, t2=166, t3=1029, t4=1247, t5=732, t6=238, sum=3460 ticks or ~57 seconds
Optimized:
  t1=44, t2=62,  t3=314,  t4=659,  t5=582, t6=130, sum=1791 ticks or ~30 seconds

I.e. a speedup of about 48% or a factor of 1.9. The final demo program can be found here: draw12.basic.

And here's a demo video of the faster version.

In summary, this is probably about what you can get with the BASIC-based approach. Of course you could still do some micro-optimizations, remove a REM here and inline code there, but in the end you will be working against the inner workings of BASIC itself. You need a certain minimum number of variables, and when every variable lookup takes a small but noticeable amount longer with each new variable, it gets really hard. You could order the variable initializations (like the "p" initialization above), but that is rather tedious work and will always weigh one function against another. So it would probably be best to implement the matrix multiplication and the scaling/projection in machine language, probably using integer values instead of floats (in BASIC, integer variables don't make sense because they are converted back to floats for any computation anyway).

Conclusion

What do we learn from this:

First of all, it is absolutely important to establish a baseline and a means of measuring what you want to optimize. If you can't measure it, how do you know your optimization worked? So you have to establish a repeatable, stable test scenario that gives consistent results and shows you where to look for optimization potential. In fact I think this is so important that any good software development project should have such a framework built in from the start.

Once you have your measurements, drill down to find where the problem is. It could be repeated execution of expensive functions (even if a function needs only a single millisecond, it gets expensive when it is called thousands of times...), redundant executions, or in some cases overly generic code. You often will not have measurements down to the exact point where the problem is, and you will often not get that specific measurement in time for, say, the production release date. But based on your existing measurements you can still drill down with much more focus and use your experience in fixing performance problems.

Do not be surprised if a problem isn't in the area where you thought it would be. Performance problems can sometimes be really hard. But with your measurements as a guide, even those can be tackled.