Running the Tests
=================

All the tests are executed using the "Run" script in the top-level directory.

The simplest way to generate results is with the command:
    ./Run

This will run a standard "index" test (see "The BYTE Index" below), and
save the report in the "results" directory, with a filename like
    hostname-2007-09-23-01
An HTML version is also saved.

If your system has more than one CPU, the tests will be run twice -- once
with a single copy of each test running at once, and once with N copies,
where N is the number of CPUs.

Since the tests are based on constant time (variable work), a run usually
takes about 29 minutes.  If both single-processing and multi-processing
runs are done, each will take that long, for a total just under an hour.

============================================================================

Detailed Usage
==============

The Run script takes a number of options which you can use to customise a
test, and you can specify the names of the tests to run.  The full usage
is:

    Run [ -q | -v ] [-i <count> ] [-c <n> [-c <n> ...]] [test ...]

The option flags are:

  -q            Run in quiet mode.
  -v            Run in verbose mode.
  -i <count>    Run <count> iterations for each test -- slower tests
                use <count> / 3, but at least 1.  Defaults to 10 (3 for
                slow tests).
  -c <n>        Run <n> copies of each test in parallel.

The -c option can be given multiple times; for example:

    ./Run -c 1 -c 4

will run a single-streamed pass, then a 4-streamed pass.

The remaining non-flag arguments are taken to be the names of tests to run.
The default is to run "index".  See "Tests" below.

When running the tests, you may want to go to single-user mode, so that
randomly-waking background processes don't mess things up too much.  On
Linux, go to the first text console (Ctrl-Alt-F1) and as root do
"init 1".  Run the tests, then reboot.

============================================================================

Tests
=====

The following individual tests are available.
Note that not all of these are used in the default "index" run; see
"Interpreting the Results" below.

  dhry2reg          Dhrystone 2 using register variables
  whetstone-double  Double-Precision Whetstone
  syscall           System Call Overhead
  pipe              Pipe Throughput
  context1          Pipe-based Context Switching
  spawn             Process Creation
  execl             Execl Throughput
  fstime-w          File Write 1024 bufsize 2000 maxblocks
  fstime-r          File Read 1024 bufsize 2000 maxblocks
  fstime            File Copy 1024 bufsize 2000 maxblocks
  fsbuffer-w        File Write 256 bufsize 500 maxblocks
  fsbuffer-r        File Read 256 bufsize 500 maxblocks
  fsbuffer          File Copy 256 bufsize 500 maxblocks
  fsdisk-w          File Write 4096 bufsize 8000 maxblocks
  fsdisk-r          File Read 4096 bufsize 8000 maxblocks
  fsdisk            File Copy 4096 bufsize 8000 maxblocks
  shell1            Shell Scripts (1 concurrent)
                    (runs "looper 60 multi.sh 1")
  shell8            Shell Scripts (8 concurrent)
                    (runs "looper 60 multi.sh 8")
  shell16           Shell Scripts (16 concurrent)
                    (runs "looper 60 multi.sh 16")
  short             Arithmetic Test (short)
                    (this is arith.c configured for "short" variables;
                    ditto for the ones below)
  int               Arithmetic Test (int)
  long              Arithmetic Test (long)
  float             Arithmetic Test (float)
  double            Arithmetic Test (double)
  arithoh           Arithoh (huh?)
  C                 C Compiler Throughput
                    (runs "looper 60 $cCompiler cctest.c")
  dc                Dc: sqrt(2) to 99 decimal places
                    (runs "looper 30 dc < dc.dat", using your system's
                    copy of "dc")
  hanoi             Recursion Test -- Tower of Hanoi

The following pseudo-test names are aliases for combinations of other
tests:

  arithmetic        Runs arithoh, short, int, long, float, double,
                    and whetstone-double
  dhry              Alias for dhry2reg
  dhrystone         Alias for dhry2reg
  whets             Alias for whetstone-double
  whetstone         Alias for whetstone-double
  load              Runs shell1, shell8, and shell16
  misc              Runs C, dc, and hanoi
  speed             Runs the arithmetic and system groups
  oldsystem         Runs execl, fstime, fsbuffer, fsdisk, pipe, context1,
                    spawn, and syscall
  system            Runs oldsystem plus shell1, shell8, and shell16
  fs                Runs fstime-w, fstime-r, fstime, fsbuffer-w,
                    fsbuffer-r, fsbuffer, fsdisk-w, fsdisk-r, and fsdisk
  shell             Runs shell1, shell8, and shell16

  index             Runs the tests which constitute the official index:
                    the oldsystem group, plus dhry2reg, whetstone-double,
                    shell1, and shell8
                    See "The BYTE Index" below for more information.
  all               Runs all tests

============================================================================

The BYTE Index
==============

The purpose of this test is to provide a basic indicator of the performance
of a Unix-like system; hence, multiple tests are used to test various
aspects of the system's performance.  These test results are then compared
to the scores from a baseline system to produce an index value, which is
generally easier to handle than the raw scores.  The entire set of index
values is then combined to make an overall index for the system.

Since 1995, the baseline system has been "George", a SPARCstation 20-61
with 128 MB RAM, a SPARC Storage Array, and Solaris 2.3, whose ratings were
set at 10.0.  (So a system which scores 520 is 52 times faster than this
machine.)
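
The index arithmetic can be sketched as follows.  This is a minimal
illustration, not the Run script's own code.  It assumes that each index
value is the raw score scaled in proportion to the baseline's raw score,
and that the overall index is the geometric mean of the per-test values.
The text here does not state the combining rule, but the geometric mean of
the twelve single-copy values in the sample table under "Multiple CPUs"
below comes out within a few tenths of the reported 678.2.  All function
names are made up for the example.

```python
import math

BASELINE_RATING = 10.0  # rating assigned to the baseline system, per the text


def index_value(raw_score, baseline_raw_score):
    """Scale a raw score against the baseline's raw score (assumed rule)."""
    return BASELINE_RATING * raw_score / baseline_raw_score


def speedup_over_baseline(index):
    """How many times faster than the baseline an index value implies."""
    return index / BASELINE_RATING


def overall_index(index_values):
    """Combine per-test index values with a geometric mean (assumed rule)."""
    logs = [math.log(v) for v in index_values]
    return math.exp(sum(logs) / len(logs))


# Single-copy index values copied from the sample table under "Multiple CPUs".
single = [562.5, 320.0, 450.4, 759.4, 535.8, 1261.8,
          481.0, 326.8, 917.2, 1064.9, 1567.7, 944.2]

print("speedup for index 520:", speedup_over_baseline(520.0))
print("overall index (single):", round(overall_index(single), 1))
```

A geometric mean keeps one unusually high or low test from dominating the
overall figure, which may be why it fits the sample data so closely.
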
Since the numbers are really only useful in a relative sense, there's no
particular reason to update the base system, so for the sake of consistency
it's probably best to leave it alone.  George's scores are in the file
"pgms/index.base"; this file is used to calculate the index scores for any
particular run.

Over the years, various changes have been made to the set of tests in the
index.  Although there is a desire for a consistent baseline, various tests
have been determined to be misleading, and have been removed; and a few
alternatives have been added.  These changes are detailed in the README,
and should be borne in mind when looking at old scores.

A number of tests are included in the benchmark suite which are not part of
the index, for various reasons; these tests can of course be run manually.
See "Tests" above.

============================================================================

Multiple CPUs
=============

If your system has multiple CPUs, the default behaviour is to run the
selected tests twice -- once with one copy of each test program running at
a time, and once with N copies, where N is the number of CPUs.  (You can
override this with the "-c" option; see "Detailed Usage" above.)  This is
designed to allow you to assess:

 - the performance of your system when running a single task
 - the performance of your system when running multiple tasks
 - the gain from your system's implementation of parallel processing

The results, however, need to be handled with care.
Here are the results of two runs on a dual-processor system, one in
single-processing mode, one dual-processing:

  Test                    Single     Dual   Gain
  --------------------    ------   ------   ----
  Dhrystone 2              562.5   1110.3    97%
  Double Whetstone         320.0    640.4   100%
  Execl Throughput         450.4    880.3    95%
  File Copy 1024           759.4    595.9   -22%
  File Copy 256            535.8    438.8   -18%
  File Copy 4096          1261.8   1043.4   -17%
  Pipe Throughput          481.0    979.3   104%
  Pipe-based Switching     326.8   1229.0   276%
  Process Creation         917.2   1714.1    87%
  Shell Scripts (1)       1064.9   1566.3    47%
  Shell Scripts (8)       1567.7   1709.9     9%
  System Call Overhead     944.2   1445.5    53%
  --------------------    ------   ------   ----
  Index Score:             678.2   1026.2    51%

As expected, the heavily CPU-dependent tasks -- dhrystone, whetstone,
execl, pipe throughput, process creation -- show close to 100% gain when
running 2 copies in parallel.

The Pipe-based Context Switching test measures context switching overhead
by sending messages back and forth between 2 processes.  I don't know why
it shows such a huge gain with 2 copies (i.e. 4 processes total) running,
but it seems to be consistent on my system.  I think this may be an issue
with the SMP implementation.

The System Call Overhead test shows a lesser gain, presumably because it
uses a lot of CPU time in single-threaded kernel code.  The shell scripts
test with 8 concurrent processes shows almost no gain (9%) -- because the
test itself runs 8 scripts in parallel, it's already using both CPUs, even
when the benchmark is run in single-stream mode.  The same test with one
process per copy shows a real gain.

The filesystem throughput tests show a loss, instead of a gain, when
multi-processing.  That there's no gain is to be expected, since the tests
are presumably constrained by the throughput of the I/O subsystem and the
disk drive itself; the drop in performance is presumably down to the
increased contention for resources, and perhaps greater disk head movement.

So what tests should you use, how many copies should you run, and how
should you interpret the results?
Well, that's up to you, since it depends on what it is you're trying to
measure.

Implementation
--------------

The multi-processing mode is implemented at the level of test iterations.
During each iteration of a test, N slave processes are started using
fork().  Each of these slaves executes the test program using fork() and
exec(), reads and stores the entire output, times the run, and prints all
the results to a pipe.  The Run script reads the pipes for each of the
slaves in turn to get the results and times.  The scores are added, and
the times averaged.

The result is that each test program has N copies running at once.  They
should all finish at around the same time, since they run for constant
time.

If a test program itself starts off K multiple processes (as with the
shell8 test), then the effect will be that there are N * K processes
running at once.  This is probably not very useful for testing multi-CPU
performance.

============================================================================

Interpreting the Results
========================

Interpreting the results of these tests is tricky, and totally depends on
what you're trying to measure.

For example, are you trying to measure how fast your CPU is?  Or how good
your compiler is?  Because these tests are all recompiled using your host
system's compiler, the performance of the compiler will inevitably impact
the performance of the tests.  Is this a problem?  If you're choosing a
system, you probably care about its overall speed, which may well depend
on how good its compiler is; so including that in the test results may be
the right answer.  But you may want to ensure that the right compiler is
used to build the tests.

On the other hand, with the vast majority of Unix systems being x86 / PC
compatibles, running Linux and the GNU C compiler, the results will tend
to be more dependent on the hardware; but the versions of the compiler and
OS can make a big difference.
(I measured a 50% gain between SUSE 10.1 and OpenSUSE 10.2 on the same
machine.)  So you may want to make sure that all your test systems are
running the same version of the OS; or at least publish the OS and
compiler versions with your results.  Then again, it may be compiler
performance that you're interested in.

The C test is very dubious -- it tests the speed of compilation.  If you're
running the exact same compiler on each system, OK; but otherwise, the
results should probably be discarded.  A slower compilation doesn't say
anything about the speed of your system, since the compiler may simply be
spending more time to super-optimise the code, which would actually make
it faster.

This will be particularly true on architectures like IA-64 (Itanium etc.)
where the compiler spends huge amounts of effort scheduling instructions
to run in parallel, with a resultant significant gain in execution speed.

Some tests are even more dubious in terms of host-dependency -- for
example, the "dc" test uses the host's version of dc (a calculator
program).  The version of this which is available can make a huge
difference to the score, which is why it's not in the index group.  Read
through the release notes for more on these kinds of issues.

Another age-old issue is that of the benchmarks being too trivial to be
meaningful.  With compilers getting ever smarter, and performing more
wide-ranging flow path analyses, the danger of parts of the benchmarks
simply being optimised out of existence is always present.

All in all, the "index" tests (see above) are designed to give a reasonable
measure of overall system performance; but the results of any test run
should always be used with care.
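
One way to follow the advice above about publishing OS and compiler
versions is to record them alongside each run.  The following is a minimal
Python sketch and is not part of the benchmark suite.  The function names
and the default compiler name "cc" are assumptions; substitute whichever
compiler was actually used to build the tests.

```python
import platform
import shutil
import subprocess


def compiler_version(compiler="cc"):
    """Return the first line of `<compiler> --version`, or None if unavailable."""
    path = shutil.which(compiler)
    if path is None:
        return None
    try:
        result = subprocess.run([path, "--version"], capture_output=True,
                                text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None
    # Some compilers print their version banner to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else None


def environment_record(compiler="cc"):
    """Collect the details worth publishing next to a set of results."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "compiler": compiler_version(compiler),
    }


if __name__ == "__main__":
    for key, value in environment_record().items():
        print(f"{key}: {value}")
```

Saving this output next to the report in the "results" directory would
make later comparisons between systems easier to interpret.
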