CacheKiller - array size effects in cache performance

In online flame wars people debate the speed of processors and computer languages often using single-program benchmarks that show e.g. a 30% or even a 2x performance difference. But dfferences of this magnitude and larger can be caused by simply by data structure/cache interaction - which means that simply rearranging the data layout in the program (or perhaps using a different compiler) can change results.

The program cachekiller.c is a dramatic example. This program is in concept reading "pixels" from an "image" of size e.g. 4096x4096, doing a 1d filter, and writing the result to an output "image" of the same size.

On a SGI workstation, changing the image size from 4096 to 4095 resulted in a performance increase of nearly 20x. The crude explanation is that the in reading a pixel, the program causes a particular area of the image to be read into the cache; when it writes the corresponding output pixel, the same locations in the cache are accessed, so the input pixels are flushed and the corresponding region of the output image is cached. Not only is the cache not being used effectively, but it is being refilled twice for each processed pixel.

This particular benchmark does not cripple some alternate cache architectures, such 'set associative'(?)), but other interactions of this sort (but hopefully smaller magnitude) probably are found in many programs.

Here are the original results, showing a 17x hit for the R4400 processor:

          XRES=YRES: 512   724   1023    1024    2048
  SS10/41            0.2   0.4            1.0     3.9
  HP730                          0.8      2.9
  HP735                          0.6      2.7
  DEC3000/400                    1.4      1.4
  SGI R3000@33                            2.0
> SGI R4000@50                   ~0.5     7.4  
> SGI R4400@100                  ~0.5     8.6  
  DEC R3000@25MHz   0.66  1.35   2.71     2.71   10.92
  486DX2@66MHz      0.33  0.71   1.42     1.42    5.75  

This program was posted on the usenet where it generated some discussion and was run on a variety of machines. John Mashey's (lead designer of the MIPS chips) post is quoted below.

Article: 3207 of comp.benchmarks
Newsgroups: comp.benchmarks,comp.arch
From: (John R. Mashey)
Subject: Re: cache.c (was: killer.c, the cache thrasher) [LONG]
Sender: (Net News)
Organization: Silicon Graphics, Inc.
Date: Mon, 21 Jun 1993 02:50:26 GMT

(Commentary on discussion running in comp.benchmarks, started by
people using cache.c (from killer.c) regarding surprises of performance
jumps on SGI CRIMSON when using array sizes that cause cache-clashing;
crossposted to comp.arch since relevant there).


1)  Given 2 systems A, and B, with noticably different memory systems,
it is usually easy to contruct a pair of benchmarks that will prove both
that A is much faster than B, and B much faster than A.  Sometimes,
all it takes is minor differences to let you construct such cases.

2) Suppose A is a high-performance micro-based system (like a
CRIMSON, a SS10/41, HP735, Alpha, RS/6000, etc), i.e. systems built for
high performance across a large range of programs.  IF you are careful
in how you choose your benchmark, you can usually prove these systems are
slower than a 16Mhz 68020 or 386 for example. [and by the way, some points
in the cache.c benchmark are just like this.]

3) There are properties of system design and oeprating system design that:
	a) Give good performance in a wide range of real circumstances.
	b) Give relatively predictable performance.
but that:
	c) Can be made to look far worse than other choices, if you pick
	the benchmark instance carefully, even though those other choices
	are often worse in normal practice.

4) Higher-performance system designs are *much* more likely to be capable
of wild variations of performance, compared to simple single-board-computers
with uncached (or minimal-cached) slower CPUs.  In general, if you engineer
a benchmark to have the maximally worst cache behavior it can get,
you can almost always beat a fast machine with a simple SBC.  AS an
extreme case, consider vector supercomputers, which can be *really* fast,
or not at all fast, depending on your choice of benchmark.

5) This does not say that surprise performance peaks/dips are *good*;
they aren't ... but computers are designed with different mixes of
cost, performance, and performance degradation characteristics.
Direct-mapped caches (which are quite common) inherently have some
weird cases, and they are not at all hard to find.

PROGRAM: The inner loop of the program is:

typdef unsigned char BUF;
static void filterF(in1,out1)
  BUF *in1,*out1;
    int y;
    register int i0,i1,i2;
    register int x;
    register BUF *in,*out;
    in = in1;
    out = out1;
    for( y=0; y < YRES; y++ ) {
	i0 = (int)in[0];
        i1 = (int)in[1];
        /* ignore boundary pixels, over/underflow for this benchmark */
	for( x=1; x < XRES-1; x++ ) {
	    i2 = (int)in[x+1];
	    out[x] = (BUF)( (i0 + (2*i1) + i2) / 4 );
	    i0 = i1; i1 = i2;
        in += XRES;
        out += XRES;

} /*filterF*/

For various sizes of XRES==YRES (usually near powers of 2), 
	a) The 2 input arrays are initiliazed.
	b) The filterF function is called 3 times.
	c) The usual malloc calls will line up the 2 arrays contiguously,
	which of course exciting effects for some systems on power-of-2
The overall effect of this is:
	a) Fill up two arrays, byte by byte
	b) Time dominated by:
		read a byte from one array, compute, store result in
		second array, a byte at a time.
	c) If the 2 arrays consistently line up on top of each other in
	a direct-mapped cache, there is massive cache-thrashing.
	Some postings alluded to testing multiplies (including early-outs)
	and divides.  I suspect most compilers use shifts, so I doubt this
	measures multiplies and divides.  It certain does measure cache


The following is an updated version of tables posted by (John Lewis),  and others
 where # is XRES=YRES, and time in secs:

SYS	511	512	513	1023	1024	1025	2047	2048	2049
CRIM	  0.2	 0.3	 0.2	 0.8	 7.3*	 0.9	 3.7	33.4*	 3.4 D
INDIGO4K  0.2	 0.3	 0.2	 0.8	 9.4*	 0.8	 3.2	37.9*	 3.2 D
IN4K-fix  0.2	 0.2	 0.2	 0.8	 0.8	 0.8	 3.3	 3.2	 3.2 D

DEC-400	  0.4	 0.4	 0.4	 1.4	 1.4	 1.5	 5.8	 5.9	 5.8 D
DEC 400	  0.2	 0.2	 0.2	 3.0	 1.1	 1.2	12.1	 4.7	 4.8 D
DEC 400	  0.4	 0.4	 0.4	 3.3	 3.3	 3.3	13.4	13.4	13.4 D

IBM 320	  0.8	 0.8	 0.6	 2.7	 2.5	 2.5	14.6	13.6	13.1 SA	
IBM 550	  0.3	 0.3	 0.3	 1.1	 1.1	 1.1	 4.3	 4.6	 4.3 SA

SUN SS1+  1.2	 1.2	 1.3	 4.3	 5.9	 4.5	17.7	21.4	18.0 D
SUN IPX	  0.6	 0.6	 0.6	 2.3	 3.9	 2.3	 9.2	15.7	 9.2 D
SS10-41		 0.2			 1.0			 3.9	     D

HP 720	  0.3	 0.7	 0.3	 1.1	 2.7*	 1.0	 4.2	10.8*	 4.2 D
HP 735	  0.1	 0.6*	 0.1	 0.6	 2.7*	 0.6	 2.4	11.1*	 2.6 D
HP 735	  0.1	 0.7*	 0.1	 0.6	 2.7*	 0.6	 2.2	10.8*	 2.2 D

Gwy486-66 0.3	 0.3	 0.3	 1.3	 1.4	 1.3	 5.5	 5.5	 5.5 SA?
Gwy486-66 0.3	 0.3	 0.3	 1.2	 1.2	 1.3	 5.0	 5.0	 5.0 SA?

DEC 3100  0.9	 0.9	 0.9	 3.7	 3.7	 3.7	14.8	14.8	16.1 D

ODDITIES marked with *.
D = direct-mapped (largest cache), SA = set-associative largest cache
1) HP 735 is usually the fastest;  CRIMSON & INDIGO-4K are usually next,
with oddities at 1024 and 2048 (which started the posting).
IN4K-FIX adds 128 byte to the malloc sizes ... and eliminates the odd
peaks (and would do so for the HPs as well, I'd guess.)
NOTE: if people think there is something "wrong" with R4000 and HP 735
cache designs, please note that on *most* instances of this benchmark,
those are the two fastest ....


a) R4000-based systems: use L1: on-chipsplit (I cache+D cache), 8KB,
direct-mapped caches, with write-back data cache, 
plus L2: joint (I+D) 1MB off-chip cache, direct-mapped.
Primary cache is virtual-index, physical tag; L2 is physical-physical.
Line-sizes vary, as do memory latencies and interleaving; typically,
though, using L1 cache linesize = 32 bytes and L2 size = 128 bytes would
be representative.

b) The R4000-based system use an OS-scheme (I think, at least RISC/os did
this, and I think IRIX does) where the OS tries to statistically
map pages in a similar way for the same executable, i.e.,
(over-simplification): if cache has N pages, and virtual page number (VPN)
is X (mod N), then the physical page frame number is X (mod N), usually.
(There are more complex versions where there is an offset that depends
upon the process number, i.e., VPN of X yields PFN of X+Y (mod N).
In general, the idea is that if VPN pages A and B map into the same
place in a cache on run 1, the same thing happens on run 2.
This is done to get better timing consistency (and generally better
average performance) than creating random mappings from VPN -> PFN,
and was put in because of really bizarre cases, like somebody's overnight
ECAD run occasionally taking 2X longer, and *not* running over night.

OF COURSE, for this particular benchmark, it would be better to use
more random mappings from virtual->physical, and I believe this is
done on the SS10-41 and DEC Alpha machines.  FOR THIS BENCHMARK, the
R4000 oddities would generally disappear if the allocation were more

c) What is this benchmark doing on an R4000?

	For XRES=YRES near 256, all is well: *both* arrays fit into
	the 1MB secondary cache, and once they are initialized, they
	stay there for the duration, and all is wonderful.

	For the 1024 or 2048 cases, the most typical inner loop case is:
	(recall, linesize of 32 for L1, 128 for L2, so there are 4
	L1 cache lines per L2 cache line)

	1) Load byte from *in*
		1a) Cache miss in L1 dcache, want to replace
		1b) L1 dcache line found to be dirty [out]
		1c) Write back dirty dcache line (32 bytes) to L2 cache [out]
		1d) Go to L2 cache, cache miss there as well, want to replace
		1e) L2 cache line found to be dirty [out]
		1f) Check each of 3 more L1 cache lines to see if they
		are dirty and also part of the L2 cache line about to be
		written back to memory.  IF so, write each one of them
		back to the L2 cache line.  If belong to L2 cache line,
		but not dirty, invalidate them to avoid stale entries.
		1g) Write the dirty L2 cache line back to memory [out].
		1h) Read 128-byte L2 cache line into L2 cache.
		1i) Read the approrpiate quarter (32 bytes) into L1 dcache.
		1j) complete the read of one byte.
	(you are talking 10s or 100s of cycles here, including memory latency;
	of course, some of this is done out of order and in parallel,
 	but these are the thigns that go on...)
	2) Do the computation (a few cycles).
	3) Do the store of one byte to *out*
		3a) Cache miss in L1 dcache (since *in* is there)
		3d) Go to L2 cache, cache miss there as well
		3f) Check the other 3 entries that might belong to the L2 cache
		line, and if found, invalidate them. 
	 	3h) Read 128-byte L2 cache line into L2 cache.
		3i) Read the appropriate quarter (32 bytes) into L1 dcache.
		3j) Do the store byte [on an SMP machine with cache coherency,
			like an R4400 in a Challenge, there will likely be
			snooping transactions besides.]
	THEN, DO ALL THIS AGAIN FOR THE NEXT BYTE, and the out array entry
	(most of the time) has just replaced the *in* array entry, so
	there is constant thrashing.

	SO: in the worst case, each byte in the input array causes 2
	128-byte reads from memory, and one 128-byte writeback to memory ....

	Within R4000/R4000 systems, if this is going on, you are better with
	the smaller machines (like Indigo4000), where the memory latency for
	a SBC is lower than for the bigger machines, especially an SMP
	Challenge machine.  I.e., memory bandwidth doesn't really help you much,
	but lower total latency does.

	For the off-sizes, there are still cache misses (since the 2 1MB
	arrays don't fit into the 1MB cache together) ... but the caches
	work as intended, i.e., you at least get to read 128 bytes of
	*in* (with one memory read)  and write 128 bytes of out (with one
	128-byte read, and later a 128-byte writeback).
	As can be seen from the numbers:
		512:  .2 seconds, almost no cache misses
		1024: .8 seconds (if do FIX version, which has L2 cache
			misses, but sensible ones, and is 4X more operations,
			i.e., cache-miss processing time is negligble)
		1024:9.4 secs (on my Indigo) with bad case ... so the bad
			behavior costs 8.6 seconds, or 10X more cycles...
	This is part of why the HP735s (which use 1-level caches) are faster:
	in general, it is faster to *miss* in a 1-level cache than a 2-level
	It is also clear why the simpler machines seem more consistent:
	the cycle-count penalties are not so high as they are for the faster
	machines; after all, DRAM is about the same.
	Also, it is clear that set-associative caches help the situation,
	at least for consistency.
	In many wasy, this behavior is related to that found in supercomputers,
	where memory interleave is a power of 2, and people take care to
	avoid array sizes that are powers of 2...


1) R4000 machines: you can make the above worse:
	a) Use 2 stores to different arrays that conflict, this adds
	and extra main memory writeback for every store-byte.
	b) If you work hard at it,  you can manage to get I-cache
	conflicts as well.

2) Alpha machine:  uses L1 cache that is write-thru, not write-back,
	otherwise like R4000.
	a) Use byte-stores a lot.
	b) Use 2 bytes stores to array elements that clash, forcing
	cache-miss-writebacks all of the time.

3) HP PA,DS 3100, Indigo3000, etc.
	a) Pick benchmarks that fit well in larger caches, but don't
	in these (they're smaller than some of the others).
	b) There are some benchmarks that work OK in a joint (I+D) cache,
	that don't work as well in a split I, D cache, where each is half
	the joint size.  For example, big code, small data, or small code,
	big data, if sized right, can make a split cache look bad.
	On the other hand, if you can construct cases where the I & D
	accesses clash well in a joint cache (but, of course, can't clash
	in split one), then you can make split caches look really good.

4) Sun SS10/41
	a) Uses L2 direct-mapped cache (I think), so same tricks work.
	b) Do deep, quick function calls, which don't bother anyone else,
	but trash the register windows.

5) IBM RS/6000 (for example, one with 64KB data cache)
	a) The total cache size is smaller than most, (but 4-set-associative).
	So, write a benchmark that make repeated stores to  8 locations
	16KB apart, causing high miss rate.
	Same technique works for SS10/30's 16KB 4-set cache, but need only
	4KB apart.
	b) Run bigger codes.

6) Any machine with write-thru 1-level caches  [Indigo, SS1]
	a) Generate well-chosen sets of partial-word writes (usually slower
	than fullword writes) and reads to cause page-mode DRAM switching
	on every access...  Although write-thru caches are simpler, it's
	almost always possible to have a program that essentially never
	accesses main memory on a write-back amchine, but accesses it often
	on a write-thru machine.

7) Machines with longer cache lines
	a) Use store-bytes, as above, to make bandwidth useless.
	b) To make a machine with (say) 128-byte linesize look a lot
	worse than one with 64-byte linesize, try:

	store byte into (address)
	store byte into (address+cachesize+64)
	The 128-byte linesize will cause clashes ... the 64-byte one won't.

8) Machines with shorter cache lines
	a) Use benchmarks that read memory, so that bandwidth is relatively
	more important.

1) Direct-mapped caches can have odd behavior around corner-cases;
in fact, caches in gneeral can have odd behavior.  If caches don't
work for a specific problem, then they're worse than useless,
2) Still, a lot of high-performance systems use caches, even direct-mapped
3) There is no one best way to do things across all cases, and if you
know the memory system parameters, you can usually prove all sorts of
things ... none of which mean much, which is why people design
computers for desired statistical patterns of behavior, not by
looking at individual benchmarks alone.

-john mashey    DISCLAIMER: 
DDD:    415-390-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 7U-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311