A Fast Scaling Blitter


A recent neuroscience project [*] required drawing first into a small (anywhere from 8x8 to 150x150) pixel map, then blitting a scaled-up copy to the screen. The scaling was always by an integral factor (from 2X to as much as 50X), and resulted in a final size on screen on the order of 300x300. Moreover, it was absolutely critical that we be able to do this at video rates (67 fps), never missing a frame, even when some additional computation (e.g., generation of pseudorandom numbers) was required between frames. Both the source buffer and the screen used 32-bit color, and the stimulus computer was a Macintosh G3.

The CopyBits function of the MacOS toolbox has been extensively optimized, and its performance on straight (unscaled and unmasked) pixel copies is very hard to beat. However, when asked to scale the result, e.g. copying a small pixel map to a large area of the screen, its performance degrades severely. Our initial implementation used CopyBits to do the scaling, but it proved to be too slow in many cases. A faster scaling blitter was needed.

Blitters Evaluated

Four different blitters were evaluated at a variety of source buffer sizes and scaling factors. The blitters were as follows:

  1. CopyBits: the standard CopyBits function, used directly from source buffer to screen.

  2. Direct: a custom blitter which scales the source buffer directly into video RAM, 8 bytes (2 pixels) at a time.

  3. DirectMem: similar to Direct, but this custom blitter scales the source buffer into an offscreen GWorld, then calls on CopyBits to copy (without scaling) the offscreen GWorld to the screen.

  4. RAVE: a custom blitter adapted from John Stiles' SNES9X code, which uses RAVE, QuickDraw 3D, and 3D acceleration hardware to perform the scaling.
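To make the custom blitters more concrete, here is a minimal sketch in portable C of integer-factor scaling by pixel replication, the basic operation behind Direct and DirectMem. It is not the code that was benchmarked: the function name scale_blit and its parameters are hypothetical, strides are given in pixels rather than row bytes, and none of the tuning (8-byte stores, loop unrolling) is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scale a 32-bit pixel map by an integer factor using pixel replication.
   src_stride and dst_stride are row lengths in pixels (hypothetical
   parameters; a real pixel map would report row bytes instead). */
void scale_blit(const uint32_t *src, size_t src_w, size_t src_h, size_t src_stride,
                uint32_t *dst, size_t dst_stride, size_t scale)
{
    assert(scale >= 1);
    assert(dst_stride >= src_w * scale);

    for (size_t sy = 0; sy < src_h; sy++) {
        const uint32_t *src_row = src + sy * src_stride;
        uint32_t *first = dst + (sy * scale) * dst_stride;

        /* Expand one source row horizontally: each pixel is written
           `scale` times in a row. */
        uint32_t *out = first;
        for (size_t sx = 0; sx < src_w; sx++) {
            uint32_t pixel = src_row[sx];
            for (size_t k = 0; k < scale; k++)
                *out++ = pixel;
        }

        /* Replicate the expanded row vertically, scale - 1 more times. */
        for (size_t k = 1; k < scale; k++)
            memcpy(first + k * dst_stride, first,
                   src_w * scale * sizeof(uint32_t));
    }
}
```

Pointing dst at video RAM corresponds to the Direct approach, and pointing it at an offscreen buffer that is later copied to the screen corresponds to DirectMem.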

Evaluation Conditions

A number of timings were run on each blitter. The source buffer was square, and varied from 8x8 to 256x256 pixels. The final (onscreen) size varied from 120x120 to 512x512 pixels, and was always an integer multiple of the source buffer size. Each case was run both with the destination aligned on an 8-byte boundary and with it misaligned; that is, with the first pixel on screen in an even (aligned) or odd (misaligned) column.
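The even/odd column rule follows from the pixel size: at 4 bytes per pixel, every second column starts on an 8-byte boundary. A small sketch of the check, assuming (this is not stated above) that each screen row itself begins on an 8-byte boundary; the function name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the pixel at `column` starts on an 8-byte boundary,
   given the address of the start of its row and 32-bit (4-byte) pixels. */
int column_is_8byte_aligned(uintptr_t row_base_address, size_t column)
{
    uintptr_t pixel_address = row_base_address + column * sizeof(uint32_t);
    return (pixel_address % 8) == 0;
}
```

With an aligned row base, columns 0, 2, 4, ... pass the check and columns 1, 3, 5, ... fail it, which matches the aligned and misaligned cases described above.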

Tests involved performing 100 or 1000 scaling copies of the source buffer to the screen, measuring the elapsed time with UpTime, and calculating a frame rate. Note that while in the actual application there would be no point in blitting faster than the display's refresh rate, in these tests the framerate was measured independent of the screen rate to get a feel for absolute speed. A higher framerate means faster blitting, which allows more time for other computations.
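The arithmetic behind these figures is simple, and can be sketched as follows. The function names are hypothetical, and the elapsed time is taken as a plain number of seconds rather than the value UpTime actually returns.

```c
#include <assert.h>

/* Frame rate achieved by `copies` scaling blits in `elapsed_seconds`. */
double frames_per_second(unsigned copies, double elapsed_seconds)
{
    assert(elapsed_seconds > 0.0);
    return copies / elapsed_seconds;
}

/* Milliseconds left over in each display frame after the blit itself.
   A negative result means the blitter cannot keep up with the display. */
double spare_ms_per_frame(double blit_fps, double display_hz)
{
    assert(blit_fps > 0.0 && display_hz > 0.0);
    return 1000.0 / display_hz - 1000.0 / blit_fps;
}
```

For example, 1000 copies in 10 seconds is 100 fps. On a 67 Hz display, a blitter that measures 134 fps uses half of each frame, leaving roughly 7.5 ms for other computation.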

All tests were performed on a 1998 Power Macintosh G3, 300 MHz, with 128 MB built-in RAM plus 1 MB of virtual memory, 1 MB backside L2 cache, and equipped with a Rage II (on the motherboard) and a RagePro 3D accelerator card. The system software was MacOS 8.5.1. The display was a Sony Trinitron set to 1152 x 870 32-bit pixels, except for some RAVE tests which required reducing the resolution to 832 x 624.


Results of the benchmarks for the aligned cases (i.e., when the destination buffer began on an 8-byte boundary) are shown at right. "Problem Size" refers to the width of a square destination buffer; for each destination size, a variety of source sizes were measured. See the raw data if more detail is needed.

The frames per second of most blitters increased substantially as the final (onscreen) size decreased. The exception was the RAVE blitter, the framerate of which was fairly constant over a wide range of output sizes. None of the blitters were significantly affected by the source size (except for the unscaled case, i.e., source and output the same size).

The nearly constant performance of the RAVE blitter produces a strange result. At small output sizes, 256x256 or less, the RAVE blitter was the worst performer, yielding about 75 fps. But at larger output sizes, over 400x400 or so, the RAVE blitter's performance (45-60 fps) had degraded less than the others, making it the best of the four.

The other three blitters produced more consistent results: DirectMem always achieved the fastest framerate, followed by Direct, with standard CopyBits coming in last. On average, DirectMem was over twice as fast as CopyBits for scaling blits. The Direct blitter was significantly slower when the destination was not aligned on an 8-byte boundary; the other blitters were not affected by alignment. (Nonaligned results are not shown.)

A closer look at the larger problem sizes (right) shows the improved performance of RAVE when the output is greater than 400x400 pixels. In this regime, it outperforms the other blitters by about 15 fps. However, note that even RAVE performance drops at the largest problem sizes, where output is 512x512 and input is 256x256 (leftmost point of graph). Its performance here is hardly better than Direct and DirectMem. This suggests that the superior performance of RAVE may be restricted to a fairly limited area of the problem space.


For our purposes, the approach benchmarked as "DirectMem" (blue in the graphs above) provides the best scaling blitter over a variety of problem sizes. For any output under 400x400 pixels, it is superior to any other method tested, and over twice as fast as CopyBits. (But please see the Epilogue for a very important update.)

This approach does require that the scale factor be a multiple of two. This is an acceptable constraint for our intended use. Unlike Direct, it is not affected by the alignment of the destination buffer.
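One plausible reading of this constraint is that the blitter writes 8 bytes (two 32-bit pixels) at a time, so each source pixel must expand to a whole number of pixel pairs. The sketch below illustrates that idea in portable C. It is an assumption about the technique, not the ScaleBlit source, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Expand one source row horizontally by an even factor, storing 8 bytes
   (two identical 32-bit pixels) per write. dst_row must have room for
   src_w * scale pixels. */
void expand_row_pairs(const uint32_t *src_row, size_t src_w,
                      uint32_t *dst_row, size_t scale)
{
    assert(scale >= 2 && scale % 2 == 0);

    unsigned char *out = (unsigned char *)dst_row;
    for (size_t sx = 0; sx < src_w; sx++) {
        /* Both halves hold the same pixel, so byte order does not matter. */
        uint64_t pair = ((uint64_t)src_row[sx] << 32) | src_row[sx];

        /* scale / 2 pair-writes cover `scale` output pixels. */
        for (size_t k = 0; k < scale / 2; k++) {
            memcpy(out, &pair, sizeof pair); /* typically one 8-byte store */
            out += sizeof pair;
        }
    }
}
```

With an odd factor, the last write for each source pixel would have to mix two different pixels in one 8-byte value, which is the case this simple loop avoids.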

The performance of RAVE at small blits was disappointing. Probably RAVE was redrawing the entire screen on each refresh, even when only a small area was changing. It may be possible to direct the RAVE library to redraw only part of the screen, and this may result in superior performance throughout the problem space. This avenue was not pursued further for several reasons:

  1. Use of RAVE is likely to cause future compatibility problems, now that Apple has embraced OpenGL as its official 3D API.

  2. The RAVE code was considerably more complex than other approaches.

  3. Use of RAVE imposed additional constraints on the accessible problems; e.g., the source buffer height and width must be either a power of 2, or padded to a power of 2.

However, it is worth noting that RAVE brings some unique advantages as well, such as the possibility of scaling smoothly (i.e. with anti-aliasing). For some developers, these advantages and the possible increased speed may outweigh the disadvantages listed above.


The "DirectMem" blitter has been renamed to ScaleBlit, and is available here:

Usage notes are in the ScaleBlit.h header file. Note that while the source file itself is C++, the interface is C, and it may be called from any language capable of using a standard C interface. This code is public domain. But if you find any bugs or have other suggestions for improvement, please let me know.

* This work was performed in the Chichilnisky Lab at the Salk Institute.


Epilogue, or Getting There the Hard Way

After the above tests were completed and posted, I made the blitter above a little more general by checking for cases where the scale factor is not divisible by 2. In this case, the code now calls CopyBits to do the scaling -- to the offscreen GWorld, just like my fancy blitter -- and then, as usual, another CopyBits (with no scaling) to the screen.

The result? CopyBits under these conditions is nearly as fast as the fancy custom blitter, even after adding cacheline clearing, loop unrolling, etc.

Revised Conclusion

If you need to do scaling in a hurry, you don't need a fancy blitter -- just use two CopyBits calls:

  1. First, CopyBits with scaling from the source buffer to an offscreen GWorld appropriate for the destination size.

  2. Then, CopyBits (without scaling) from this offscreen GWorld to the screen.

The result is about twice as fast as CopyBits with scaling direct to the screen, basically equivalent to the DirectMem blitter in the graphs above.
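The two-stage structure can be sketched in portable C as below. This is only an analogy: plain loops and memcpy stand in for the two CopyBits calls, a caller-supplied array stands in for the offscreen GWorld, and all names are hypothetical. No toolbox calls are shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stage 1: scale src (src_w x src_h) by an integer factor into a tightly
   packed offscreen buffer of (src_w * scale) x (src_h * scale) pixels. */
static void scale_to_offscreen(const uint32_t *src, size_t src_w, size_t src_h,
                               uint32_t *off, size_t scale)
{
    size_t out_w = src_w * scale;
    for (size_t y = 0; y < src_h * scale; y++)
        for (size_t x = 0; x < out_w; x++)
            off[y * out_w + x] = src[(y / scale) * src_w + (x / scale)];
}

/* Stage 2: unscaled, row-by-row copy of the offscreen buffer to the
   screen, with its top-left corner at (left, top). */
static void copy_to_screen(const uint32_t *off, size_t w, size_t h,
                           uint32_t *screen, size_t screen_stride,
                           size_t left, size_t top)
{
    for (size_t y = 0; y < h; y++)
        memcpy(screen + (top + y) * screen_stride + left,
               off + y * w, w * sizeof(uint32_t));
}

/* screen_stride is the screen row length in pixels; `off` must hold
   (src_w * scale) * (src_h * scale) pixels. */
void two_stage_blit(const uint32_t *src, size_t src_w, size_t src_h, size_t scale,
                    uint32_t *off, uint32_t *screen, size_t screen_stride,
                    size_t left, size_t top)
{
    assert(scale >= 1);
    assert(left + src_w * scale <= screen_stride);
    scale_to_offscreen(src, src_w, src_h, off, scale);
    copy_to_screen(off, src_w * scale, src_h * scale,
                   screen, screen_stride, left, top);
}
```

The point of the split is that only the second stage touches the screen, and that stage is a straight copy, the case where CopyBits performs best.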

Last Updated: 6/21/99 . . . . . . webmaster@strout.net