The program cachekiller.c is a dramatic example. Conceptually, the program reads "pixels" from an "image" of size e.g. 4096x4096, applies a 1-d filter, and writes the result to an output "image" of the same size.
On an SGI workstation, changing the image size from 4096 to 4095 resulted in a performance increase of nearly 20x. The crude explanation is that reading a pixel causes a particular region of the input image to be loaded into the cache; when the program writes the corresponding output pixel, the same cache locations are accessed, so the input pixels are flushed and the corresponding region of the output image is cached instead. Not only is the cache not being used effectively, it is being refilled twice for each processed pixel.
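To see why the two images collide, note how a direct-mapped cache chooses a line: the line index is just a slice of the address bits, so two addresses that differ by an exact multiple of the cache size always map to the same line. The following minimal sketch illustrates the index arithmetic; the 1MB cache and 128-byte lines match the R4000 secondary cache discussed below, but the base address is made up:

#include <stdio.h>

#define CACHE_SIZE (1024 * 1024)   /* assumed 1MB direct-mapped cache */
#define LINE_SIZE  128             /* assumed 128-byte cache lines    */
#define NLINES     (CACHE_SIZE / LINE_SIZE)

/* Line index in a direct-mapped cache: the address bits above the */
/* line offset, modulo the number of lines.                        */
static unsigned long line_index(unsigned long addr)
{
    return (addr / LINE_SIZE) % NLINES;
}

int main(void)
{
    unsigned long image = 1024UL * 1024UL;  /* bytes in a 1024x1024 image */
    unsigned long in  = 0x10000000UL;       /* hypothetical base of in[]  */
    unsigned long out = in + image;         /* out[] allocated right after */
    unsigned long i;

    for (i = 0; i < 3; i++) {
        /* in+i and out+i differ by exactly CACHE_SIZE, so the two   */
        /* indices printed below are always equal: every read of     */
        /* in[i] evicts the line out[i] lives in, and vice versa.    */
        printf("pixel %lu: in line %lu, out line %lu\n",
               i, line_index(in + i), line_index(out + i));
    }
    return 0;
}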
This particular benchmark does not cripple some alternate cache architectures, such as set-associative caches, but other interactions of this sort (though hopefully of smaller magnitude) are probably found in many programs.
Here are the original results, showing a 17x hit for the R4400 processor:
XRES=YRES:         512   724   1023  1024  2048
SS10/41            0.2   0.4         1.0   3.9
HP730                          0.8   2.9
HP735                          0.6   2.7
DEC3000/400                    1.4   1.4
SGI R3000@33                         2.0
> SGI R4000@50                 ~0.5  7.4
> SGI R4400@100                ~0.5  8.6
DEC R3000@25MHz    0.66  1.35  2.71  2.71  10.92
486DX2@66MHz       0.33  0.71  1.42  1.42  5.75
This program was posted on Usenet, where it generated some discussion and was run on a variety of machines. A post by John Mashey (lead designer of the MIPS chips) is quoted below.
Article: 3207 of comp.benchmarks
Newsgroups: comp.benchmarks,comp.arch
From: mash@mash.wpd.sgi.com (John R. Mashey)
Subject: Re: cache.c (was: killer.c, the cache thrasher) [LONG]
Sender: news@odin.corp.sgi.com (Net News)
Organization: Silicon Graphics, Inc.
Date: Mon, 21 Jun 1993 02:50:26 GMT

(Commentary on discussion running in comp.benchmarks, started by people
using cache.c (from killer.c) regarding surprises of performance jumps on
SGI CRIMSON when using array sizes that cause cache-clashing; crossposted
to comp.arch since relevant there).

NOBODY SHOULD BE SURPRISED AT SOME OF THE RESULTS POSTED IN THIS
SEQUENCE: HERE ARE A FEW RULES, AND WHY THEY ARE TRUE:

1) Given 2 systems A and B, with noticeably different memory systems, it
is usually easy to construct a pair of benchmarks that will prove both
that A is much faster than B, and B much faster than A. Sometimes, all it
takes is minor differences to let you construct such cases.

2) Suppose A is a high-performance micro-based system (like a CRIMSON, a
SS10/41, HP735, Alpha, RS/6000, etc), i.e., systems built for high
performance across a large range of programs. If you are careful in how
you choose your benchmark, you can usually prove these systems are slower
than a 16MHz 68020 or 386, for example. [And by the way, some points in
the cache.c benchmark are just like this.] THIS DOES NOT PROVE THAT A
16MHZ 68020 IS FASTER IN GENERAL.

3) There are properties of system design and operating system design that:
   a) Give good performance in a wide range of real circumstances.
   b) Give relatively predictable performance.
but that:
   c) Can be made to look far worse than other choices, if you pick the
   benchmark instance carefully, even though those other choices are
   often worse in normal practice.

4) Higher-performance system designs are *much* more likely to be capable
of wild variations of performance, compared to simple single-board
computers with uncached (or minimally-cached) slower CPUs. In general, if
you engineer a benchmark to have the maximally worst cache behavior it
can get, you can almost always beat a fast machine with a simple SBC. As
an extreme case, consider vector supercomputers, which can be *really*
fast, or not at all fast, depending on your choice of benchmark.

5) This does not say that surprise performance peaks/dips are *good*;
they aren't ... but computers are designed with different mixes of cost,
performance, and performance-degradation characteristics. Direct-mapped
caches (which are quite common) inherently have some weird cases, and
they are not at all hard to find.

PROGRAM:

The inner loop of the program is:

typedef unsigned char BUF;

static void filterF(in1, out1)
BUF *in1, *out1;
{
    int y;
    register int i0, i1, i2;
    register int x;
    register BUF *in, *out;

    in = in1;
    out = out1;
    for (y = 0; y < YRES; y++) {
        i0 = (int)in[0];
        i1 = (int)in[1];
        /* ignore boundary pixels, over/underflow for this benchmark */
        for (x = 1; x < XRES-1; x++) {
            i2 = (int)in[x+1];
            out[x] = (BUF)((i0 + (2*i1) + i2) / 4);
            i0 = i1;
            i1 = i2;
        }
        in += XRES;
        out += XRES;
    }
} /*filterF*/

For various sizes of XRES==YRES (usually near powers of 2),
   a) The 2 input arrays are initialized.
   b) The filterF function is called 3 times.
   c) The usual malloc calls will line up the 2 arrays contiguously,
   which of course has exciting effects for some systems on power-of-2
   sizes.

The overall effect of this is:
   a) Fill up two arrays, byte by byte.
   b) Time dominated by: read a byte from one array, compute, store the
   result in the second array, a byte at a time.
   c) If the 2 arrays consistently line up on top of each other in a
   direct-mapped cache, there is massive cache-thrashing.
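Point c) is easy to check directly. The small probe below is a sketch
under the assumption of a typical malloc that places consecutive
allocations nearly back to back; the 1MB cache size is illustrative, and
the pointer-to-integer casts are a conventional diagnostic rather than
strictly portable C:

#include <stdio.h>
#include <stdlib.h>

#define XRES 1024
#define YRES 1024
#define CACHESIZE (1024 * 1024)   /* assumed L2 size, as in the text */

typedef unsigned char BUF;

int main(void)
{
    /* Allocate the two images just as the benchmark does. */
    BUF *in  = malloc((size_t)XRES * YRES);
    BUF *out = malloc((size_t)XRES * YRES);
    long delta;

    if (in == NULL || out == NULL)
        return 1;

    /* For power-of-2 image sizes, consecutive mallocs typically sit  */
    /* almost exactly one image (here, one cache size) apart, so the  */
    /* distance mod the cache size is nearly 0, and in[i] and out[i]  */
    /* collide in a direct-mapped cache.                              */
    delta = (long)((unsigned long)out - (unsigned long)in);
    printf("arrays are %ld bytes apart; %ld mod cache size\n",
           delta, delta % CACHESIZE);

    free(in);
    free(out);
    return 0;
}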
NOTE: Some postings alluded to testing multiplies (including early-outs)
and divides. I suspect most compilers use shifts, so I doubt this
measures multiplies and divides. It certainly does measure cache
effects...

PERFORMANCE DATA

The following is an updated version of tables posted by
zilla@ccrl.nj.nec.com (John Lewis) and others, where # is XRES=YRES and
time is in seconds:

SYS        511  512  513  1023 1024  1025 2047 2048  2049
CRIM       0.2  0.3  0.2  0.8  7.3*  0.9  3.7  33.4* 3.4   D
INDIGO4K   0.2  0.3  0.2  0.8  9.4*  0.8  3.2  37.9* 3.2   D
IN4K-fix   0.2  0.2  0.2  0.8  0.8   0.8  3.3  3.2   3.2   D
DEC-400    0.4  0.4  0.4  1.4  1.4   1.5  5.8  5.9   5.8   D
DEC 400    0.2  0.2  0.2  3.0  1.1   1.2  12.1 4.7   4.8   D
DEC 400    0.4  0.4  0.4  3.3  3.3   3.3  13.4 13.4  13.4  D
IBM 320    0.8  0.8  0.6  2.7  2.5   2.5  14.6 13.6  13.1  SA
IBM 550    0.3  0.3  0.3  1.1  1.1   1.1  4.3  4.6   4.3   SA
SUN SS1+   1.2  1.2  1.3  4.3  5.9   4.5  17.7 21.4  18.0  D
SUN IPX    0.6  0.6  0.6  2.3  3.9   2.3  9.2  15.7  9.2   D
SS10-41         0.2            1.0             3.9         D
HP 720     0.3  0.7  0.3  1.1  2.7*  1.0  4.2  10.8* 4.2   D
HP 735     0.1  0.6* 0.1  0.6  2.7*  0.6  2.4  11.1* 2.6   D
HP 735     0.1  0.7* 0.1  0.6  2.7*  0.6  2.2  10.8* 2.2   D
Gwy486-66  0.3  0.3  0.3  1.3  1.4   1.3  5.5  5.5   5.5   SA?
Gwy486-66  0.3  0.3  0.3  1.2  1.2   1.3  5.0  5.0   5.0   SA?
DEC 3100   0.9  0.9  0.9  3.7  3.7   3.7  14.8 14.8  16.1  D

ODDITIES marked with *.
D = direct-mapped (largest cache), SA = set-associative (largest cache)

COMMENTS:

1) The HP 735 is usually the fastest; CRIMSON & INDIGO-4K are usually
next, with oddities at 1024 and 2048 (which started the posting).
IN4K-fix adds 128 bytes to the malloc sizes ... and eliminates the odd
peaks (and would do so for the HPs as well, I'd guess). NOTE: if people
think there is something "wrong" with the R4000 and HP 735 cache designs,
please note that on *most* instances of this benchmark, those are the two
fastest ....

2) SO, WHAT'S GOING ON HERE:

a) R4000-based systems use:
   L1: on-chip split (I-cache + D-cache), 8KB, direct-mapped caches, with
   a write-back data cache, plus
   L2: joint (I+D) 1MB off-chip cache, direct-mapped.
The primary cache is virtual-index, physical-tag; the L2 is
physical-physical. Line sizes vary, as do memory latencies and
interleaving; typically, though, an L1 cache line size of 32 bytes and an
L2 line size of 128 bytes would be representative.

b) The R4000-based systems use an OS scheme (I think; at least RISC/os
did this, and I think IRIX does) where the OS tries to statistically map
pages in a similar way for the same executable, i.e.
(over-simplification): if the cache has N pages, and the virtual page
number (VPN) is X (mod N), then the physical page frame number is
X (mod N), usually. (There are more complex versions where there is an
offset that depends upon the process number, i.e., a VPN of X yields a
PFN of X+Y (mod N).) In general, the idea is that if VPN pages A and B
map into the same place in a cache on run 1, the same thing happens on
run 2. This is done to get better timing consistency (and generally
better average performance) than creating random mappings from
VPN -> PFN, and was put in because of really bizarre cases, like
somebody's overnight ECAD run occasionally taking 2X longer, and *not*
finishing overnight. OF COURSE, for this particular benchmark, it would
be better to use more random mappings from virtual -> physical, and I
believe this is done on the SS10-41 and DEC Alpha machines. FOR THIS
BENCHMARK, the R4000 oddities would generally disappear if the allocation
were more random.

c) What is this benchmark doing on an R4000?
For XRES=YRES near 256, all is well: *both* arrays fit into the 1MB
secondary cache, and once they are initialized, they stay there for the
duration, and all is wonderful.

For the 1024 or 2048 cases, the most typical inner-loop case is (recall,
line size of 32 for L1, 128 for L2, so there are 4 L1 cache lines per L2
cache line):

1) Load byte from *in*:
   1a) Cache miss in L1 dcache, want to replace.
   1b) L1 dcache line found to be dirty [out].
   1c) Write back dirty dcache line (32 bytes) to L2 cache [out].
   1d) Go to L2 cache, cache miss there as well, want to replace.
   1e) L2 cache line found to be dirty [out].
   1f) Check each of 3 more L1 cache lines to see if they are dirty and
       also part of the L2 cache line about to be written back to
       memory. If so, write each one of them back to the L2 cache line.
       If they belong to the L2 cache line but are not dirty, invalidate
       them to avoid stale entries.
   1g) Write the dirty L2 cache line back to memory [out].
   1h) Read the 128-byte L2 cache line into the L2 cache.
   1i) Read the appropriate quarter (32 bytes) into the L1 dcache.
   1j) Complete the read of one byte.
   (You are talking 10s or 100s of cycles here, including memory latency;
   of course, some of this is done out of order and in parallel, but
   these are the things that go on...)

2) Do the computation (a few cycles).

3) Do the store of one byte to *out*:
   3a) Cache miss in L1 dcache (since *in* is there).
   3d) Go to L2 cache, cache miss there as well.
   3f) Check the other 3 entries that might belong to the L2 cache line,
       and if found, invalidate them.
   3h) Read the 128-byte L2 cache line into the L2 cache.
   3i) Read the appropriate quarter (32 bytes) into the L1 dcache.
   3j) Do the store byte. [On an SMP machine with cache coherency, like
       an R4400 in a Challenge, there will likely be snooping
       transactions besides.]

THEN, DO ALL THIS AGAIN FOR THE NEXT BYTE, and the *out* array entry
(most of the time) has just replaced the *in* array entry, so there is
constant thrashing. SO: in the worst case, each byte in the input array
causes 2 128-byte reads from memory, and one 128-byte writeback to
memory ....

Within R4000/R4400 systems, if this is going on, you are better off with
the smaller machines (like Indigo4000), where the memory latency for an
SBC is lower than for the bigger machines, especially an SMP Challenge
machine. I.e., memory bandwidth doesn't really help you much, but lower
total latency does.

For the off-sizes, there are still cache misses (since the two 1MB arrays
don't fit into the 1MB cache together) ... but the caches work as
intended, i.e., you at least get to read 128 bytes of *in* (with one
memory read) and write 128 bytes of *out* (with one 128-byte read, and
later a 128-byte writeback). As can be seen from the numbers:

512:  0.2 seconds, almost no cache misses.
1024: 0.8 seconds (the FIX version, which has L2 cache misses, but
      sensible ones, and does 4X more operations, i.e., cache-miss
      processing time is negligible).
1024: 9.4 secs (on my Indigo) with the bad case ... so the bad behavior
      costs 8.6 seconds, or 10X more cycles...

This is part of why the HP 735s (which use 1-level caches) are faster: in
general, it is faster to *miss* in a 1-level cache than a 2-level one. It
is also clear why the simpler machines seem more consistent: the
cycle-count penalties are not so high as they are for the faster
machines; after all, DRAM is about the same. Also, it is clear that
set-associative caches help the situation, at least for consistency.
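The IN4K-fix mentioned in the comments is easy to reproduce. Here is a
minimal sketch, assuming the 128-byte pad Mashey quotes (the function
name is made up, and any pad that shifts the arrays off exact
cache-size alignment would do):

#include <stdlib.h>

#define XRES 1024
#define YRES 1024
#define PAD  128        /* IN4K-fix: 128 extra bytes per malloc */

typedef unsigned char BUF;

/* Padding each allocation shifts out[] relative to in[] by one L2    */
/* cache line, so in[i] and out[i] no longer map to the same line.    */
/* The arrays still conflict as the scan crosses line boundaries, but */
/* that costs one writeback/refill per 128-byte line instead of one   */
/* per byte: the "sensible" miss behavior of the off-size runs.       */
static BUF *alloc_image(void)
{
    return (BUF *)malloc((size_t)XRES * YRES + PAD);
}

int main(void)
{
    BUF *in  = alloc_image();
    BUF *out = alloc_image();
    if (in == NULL || out == NULL)
        return 1;
    /* ... initialize and call filterF() as before ... */
    free(in);
    free(out);
    return 0;
}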
In many ways, this behavior is related to that found in supercomputers,
where memory interleave is a power of 2, and people take care to avoid
array sizes that are powers of 2...

PATHOLOGICAL BENCHMARKS FOR VARIOUS MACHINES:

1) R4000 machines: you can make the above worse:
   a) Use 2 stores to different arrays that conflict; this adds an extra
   main-memory writeback for every store-byte.
   b) If you work hard at it, you can manage to get I-cache conflicts as
   well.

2) Alpha machines: the L1 cache is write-through, not write-back;
otherwise like the R4000.
   a) Use byte-stores a lot.
   b) Use 2 byte-stores to array elements that clash, forcing
   cache-miss writebacks all of the time.

3) HP PA, DS 3100, Indigo3000, etc.:
   a) Pick benchmarks that fit well in larger caches, but don't in these
   (they're smaller than some of the others).
   b) There are some benchmarks that work OK in a joint (I+D) cache that
   don't work as well in a split I, D cache, where each is half the
   joint size. For example, big code, small data, or small code, big
   data, if sized right, can make a split cache look bad. On the other
   hand, if you can construct cases where the I & D accesses clash well
   in a joint cache (but, of course, can't clash in a split one), then
   you can make split caches look really good.

4) Sun SS10/41:
   a) Uses an L2 direct-mapped cache (I think), so the same tricks work.
   b) Do deep, quick function calls, which don't bother anyone else, but
   trash the register windows.

5) IBM RS/6000 (for example, one with a 64KB data cache):
   a) The total cache size is smaller than most (but 4-set-associative).
   So, write a benchmark that makes repeated stores to 8 locations 16KB
   apart, causing a high miss rate. The same technique works for the
   SS10/30's 16KB 4-set cache, but the locations need only be 4KB apart.
   b) Run bigger codes.

6) Any machine with write-through 1-level caches [Indigo, SS1]:
   a) Generate well-chosen sets of partial-word writes (usually slower
   than full-word writes) and reads to cause page-mode DRAM switching on
   every access... Although write-through caches are simpler, it's
   almost always possible to have a program that essentially never
   accesses main memory on a write-back machine, but accesses it often
   on a write-through machine.

7) Machines with longer cache lines:
   a) Use store-bytes, as above, to make bandwidth useless.
   b) To make a machine with (say) a 128-byte line size look a lot worse
   than one with a 64-byte line size, try (see the sketch after this
   list):
        store byte into (address)
        store byte into (address + cachesize + 64)
   The 128-byte line size will cause clashes ... the 64-byte one won't.

8) Machines with shorter cache lines:
   a) Use benchmarks that read memory, so that bandwidth is relatively
   more important.

SUMMARY:

1) Direct-mapped caches can have odd behavior around corner cases; in
fact, caches in general can have odd behavior. If caches don't work for
a specific problem, then they're worse than useless, AND THIS SHOULD NOT
BE A SURPRISE.

2) Still, a lot of high-performance systems use caches, even
direct-mapped ones.

3) There is no one best way to do things across all cases, and if you
know the memory system parameters, you can usually prove all sorts of
things ... none of which mean much, which is why people design computers
for desired statistical patterns of behavior, not by looking at
individual benchmarks alone.

--
-john mashey    DISCLAIMER:
UUCP: mash@sgi.com  DDD: 415-390-3090  FAX: 415-967-8496
USPS: Silicon Graphics 7U-005, 2011 N. Shoreline Blvd,
      Mountain View, CA 94039-7311
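As an illustration of trick 7b above, here is a minimal sketch (the
cache size and iteration count are assumptions, and whether the clash
actually occurs also depends on how virtual pages are mapped, as
discussed earlier). The two addresses sit one cache size plus 64 bytes
apart: with 128-byte lines they map to the same direct-mapped line and
evict each other on every iteration, while with 64-byte lines they land
in different lines and both stay cached:

#include <stdio.h>
#include <stdlib.h>

#define CACHESIZE (1024 * 1024)   /* assumed direct-mapped cache size */
#define ITERS     10000000L

int main(void)
{
    /* Allocate enough to hold both conflicting addresses. */
    unsigned char *buf = malloc(CACHESIZE + 128);
    unsigned char *a, *b;
    long i;

    if (buf == NULL)
        return 1;

    a = buf;
    b = buf + CACHESIZE + 64;   /* one cache size plus half a 128-byte line */

    /* With 128-byte lines, a and b have the same line index (64 < 128),  */
    /* so each pair of stores forces two misses; with 64-byte lines they  */
    /* map to adjacent lines and stay cached after the first iteration.   */
    for (i = 0; i < ITERS; i++) {
        *a = (unsigned char)i;
        *b = (unsigned char)i;
    }

    printf("%d %d\n", a[0], b[0]);   /* keep the stores live */
    free(buf);
    return 0;
}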