A Quantitative Performance Evaluation of SCI Memory Hierarchies

University of Edinburgh, Department of Computer Science

Roberto Andre Hexsel

October 1994

The Scalable Coherent Interface (SCI) is an IEEE standard that defines a hardware platform for scalable shared-memory multiprocessors. SCI consists of three parts. The first is a set of physical interfaces that defines board sizes, wiring and network clock rates. The second is a communication protocol based on unidirectional point to point links. The third defines a cache coherence protocol based on a full directory that is distributed amongst the cache and memory modules. The cache controllers keep track of the copies of a given datum by maintaining them in a doubly linked list. SCI can scale up to 65520 nodes.

This dissertation contains a quantitative performance evaluation of an SCI-connected multiprocessor that assesses both the communication and cache coherence subsystems. The simulator is driven by reference streams generated as a by-product of the execution of "real" programs. The workload consists of three programs from the SPLASH suite and three parallel loops.

The simplest topology supported by SCI is the ring. It was found that, for the hardware and software simulated, the largest efficient ring size is between eight and sixteen nodes and that raw network bandwidth seen by processing elements is limited at about 80Mbytes/s. This is because the network saturates when link traffic reaches 600-700Mbytes/s. These levels of link traffic only occur for two poorly designed programs. The other four programs generate low traffic and their execution speed is not limited by interconnect nor cache coherence protocol. An analytical model of the multiprocessor is used to assess the cost of some frequently occurring cache coherence protocol operations. In order to build large systems, networks more sophisticated than rings must be used. The performance of SCI meshes and cubes is evaluated for systems of up to 64 nodes. As with rings, processor throughput is also limited by link traffic for the same two poorly designed programs. Cubes are 10-15% faster than meshes for programs that generate high levels of network traffic. Otherwise, the differences are negligible. No significant relationship between cache size and network dimensionality was found.

Full text.