Last time I had started to finally get to grips with the system hanging issues, having found out much of the problem came down to the SMP kernel issues related to the on-board SCSI that are still prevalent within NetBSD releases. I was fortunate that I’d been given a chance to try out a patch that made SMP much more stable (although not perfect). This gave me essentially 4 different configuration options. After thinking about it, I decided it would probably be prudent to make some measurements to hopefully determine what the best way to go is.
I have three Mbus modules (pictured above), a dual CPU SuperSparc @ 50Mhz, a single CPU SuperSparc @60Mhz and a single CPU HyperSparc @ 90Mhz. The clock speeds can be a little misleading as there is a little more to each module. The SuperSparc modules each come with 1Mb per CPU of cache where as the HyperSparc has only 256Kb, and the dual CPU module runs on a slower Mbus @40Mhz whilst the other two run at 50Mhz. Additionally the rough guide to Mbus modules, an essential site for anyone with a sun machine like mine, suggested that the SuperSparc CPUs would actually perform better on a per clock basis. Given all this it’s not really clear which the best performers will be. From here on in I’ll abreviate SuperSparc to SS and HyperSparc to HS
Today we’re going to look at the results of some of the intensive benchmarks I’ve put the modules through, and at the end the best choice of configuration given the hardware I have on hand. All the tests are run with the same OS (NetBSD 7.1) and hardware with the exception of the Mbus modules under test.
The first set of benchmarks are aimed at measuring basic CPU speed. The benchmarks I’ve used are Dhrystone (version 2.1), Whetstone and both the double and single precision versions of the linpack benchmark. These tests are measuring single threaded performance of the modules.
Just looking at these charts it’s obvious that the HS is the fastest of the three modules. Given its higher clock speed that is to be expected, but it also attained higher scores per clock for all the tests except whetstone. The linpack tests show a large difference with the HS running about 12% faster per clock for double precision and about 22% faster per clock for the single. The Dhrystone test showed a much more subdued advantage, only running about 7% faster per clock. The Whetstone test showed the HS was slower, doing floating point arithmetic by about 11% slower per clock cycle.
Both SS modules performed about the same relative to their clock rate, which indicates the Mbus speed wasn’t a large factor in these tests, and that the data size was likely smaller than that of the L2 cache (1Mb). I would have expected the dual 50Mhz module to be slower in single threaded tasks as the Mbus is slowed to 40Mhz (as opposed to 50Mhz the others use).
I’m not sure how I feel about the results here, the data set size for the tests was almost certainly too small to even exceed the capacity of the HSs 256Kb cache. I’m not sure what to make of the linpack results, but the dhrystone and whetstone results seem to indicate the HS core is better at integer and string operations and the SS core is better at floating point.
I selected the next benchmark because it offered speed measurements over a range of data sizes. The Sieve of Eratosthenes is a simple algorithm for finding prime numbers within a finite numerical space. Rather than explain it myself look here on Wikipedia for more details. One of it’s key features is that it is quite hard on a CPU’s memory bandwidth, and it’s use of the cache is quite sub-optimal. I omitted testing the 50Mhz SS module.
The results are quite interesting. The HS enjoys an advantage of about 14% per clock when the data set fits within it’s cache, but suffers quite a performance drop once the data set gets larger. Despite being 30Mhz slower the SS is faster for data sets small enough for its cache but too large to fit in the HSs cache. I suspect this gap would be widest at just below 1Mb data size, but the program didn’t allow control over that. The worst data point shows the HS as 44% slower per clock. This is quite surprising, as the SS is not much faster than the Mbus speed (only 10Mhz faster) I didn’t expect the advantage in that data size to be so large. After 1Mb data size is exceeded, the HS starts to catch up again, but the data points don’t get large enough to know if it ever achieves equal relative performance again. I’d imagine that once the data is large enough both modules would perform close to the same as memory bandwidth becomes the limiting factor.
-
-
The next benchmark is similar in that there is measurement over a range of data sizes, but the algorithm is significantly different. The algorithm used is heapsort, a relatively efficient sorting algorithm used in many places. You can find more details here on Wikipedia. One of it’s characteristics is that it is much more cache friendly. Again I omitted testing the dual 50Mhz SS.
Looking at the graphs this test really requires some points at larger data sizes. I can only really guess, but I’d imagine that the performance would eventually converge given that memory bandwidth would eventually become the dominant factor. The previous test indicates that there may even be a window in which the SS performs better, but without actual data we will never know.

Given that I’ll be using this machine as a desktop workstation I ran a benchmark known as x11perf. It simply tests the maximum speed of components of the X11 protocol. It’s often known just as X for short, and is basically the software that unix systems use to interface to video displays.The chart shows performance relative to the dual 50Mhz SS (the yellow line represents it). A 2 is twice and fast, and 0.5 is half as fast. Each point on the X axis is a test, like line drawing for instance, there are so many tests (over 300) it wasn’t practical to separate and chart them individually. Out of interest I ran the dual 50Mhz SS with a MP kernel to see if it made any appreciable difference.
There are some quite interesting features of this chart. Firstly you’ll notice that both the faster modules have tests that are significantly slower than the dual SS (30-35% slower at worst). This is because those tests are CPU bound, and with a dual CPU module both the X server and client can have a whole CPU to itself. Typically those tests involve little actual drawing to screen, like plotting points.
In general the dual 50Mhz SS is slower than the faster modules. The SS @ 60Mhz is about 1.15 times faster on average and the HS is 1.75 times faster on average. The HS is in general the best on the raw performance numbers, with some odd exceptions. Some tests seem to favour the SS @ 60Mhz, which would be down to cache size.
Relative to their clock speed, the 60Mhz SS does better than the HS, but I’d imagine this would be due to the SBus limiting the maximum through put to the frame buffer. The SBus only runs @ 25Mhz so is almost certainly going to slow down a faster CPU when drawing.

The last and final test is one called Ramspeed. It’s basically designed to measure the memory bandwidth. I opted for the more general integer and floating point tests over the specific reading and writing tests as they are more likely to represent a computational load. There are 4 tests, Copy creates two buffers and copies data from one to the other, Scale creates two buffers and copies data from one to the other, but scales the number by some constant, finally Triad creates 3 buffers and adds two of them together (scaling one by a constant factor) and storing the result in the third buffer. All buffers are the same size. The tests I’ve chosen only test with buffers that are 32Mb in size, so much larger than the caches of either of the modules. You can select the buffer size and some tests available in the program test a range of sizes.
The results are pretty bad for the HS, it achieves slightly better speed only for the copy operations, which shouldn’t be surprising as the Mbus should be a limiting factor. However for the other tests the SS performs quite a bit better, so much in fact I ran the tests many times just to make sure. This would appear to be down to the memory and cache architecture of the modules, not just the cache size, although that is certainly playing an important role in the HS failing to perform. The HS does have significantly smaller L1 cache only having a 8k instruction cache versus a 20k Instruction and 16k data L1 cache in the SS core.
Having now spent a couple of weeks testing these modules I think we’re starting to get a picture of what these chips can do relative to each other. The HS is clearly faster as long as any data isn’t larger than its cache. The SS on the other hand isn’t as fast at it’s peak, largely due to a lower clock speed, but handles larger data sets significantly better. The X11 test showed that it is quite beneficial to have multiple CPUs in a workstation, even if only for basic X11 applications. However it also shows the HS being quite a good choice. I think the tests also show there was some merit to the idea that the SS modules performed better relative to their clock speed, but it also shows this is highly dependent on the work load.
So what am I going with and what would I recommend. With the hardware I have I’ll use the HS @ 90 for running the machine as a workstation as that makes it snappier to use in general. The flip side is that if I were to use the machine for a computational load, such as compiling a number of packages, number cruching, or a basic server the two SS modules would almost certainly perform much better as long as the job could be divided between the CPUs. Even the SS @ 60Mhz has a good chance of doing computation better on it’s own. The HS on it’s own is disadvantaged by not being able to multi-task as well, I have noticed that X is in general less responsive when the machine is under load (compared to both SS modules together), so a second HS module would probably be a nice addition in the future.
If money was no object and I could have any parts at all, both Ross and Sun had decent offerings. The fastest SS is 85-90Mhz, two of these would certainly be quite fast. However I’d imagine they probably wouldn’t be as fast as any pair of HS modules over 125Mhz. So in the end the HS modules would be the way to go if you had access to anything. As it stands, looking around online it’s actually really hard to find faster modules for a reasonable price. Among the SS modules those over 60Mhz are quite expensive and largely not available. The HS parts have a similar problem, but you can get 90Mhz – 133Mhz parts at fairly decent prices, although faster modules still command a high price, and slower modules wouldn’t be worth it. Again with what’s available the HS seems the way to go.
I’ve tried to be as thorough as possible, but if you want to see the raw data and gnumeric spreadsheet with calculations and charts you can find them here.
Recent Comments