Originally Published MEM Fall 2001
EMBEDDED SYSTEMS
Multiprocessing Architectures for Medical Imaging Systems
The memory and I/O bandwidths of multicomputers make them the necessary choice for the most sophisticated imaging applications.
Iain Goddard and Mikael Taveniku
Medical imaging systems of increasing sophistication require ever more computing power. Multi-detector-row multislice helical computed tomography (CT) and digital cardiology systems are examples. The fulfillment of Moore's Law has meant that more-powerful computers, appearing particularly in the form of symmetric multiprocessor (SMP) boards, have become available at sizes and prices to meet some of the requirements of these systems. But still, the most demanding of today's medical imaging applications require a combination of throughput bandwidth and computational power that exceeds the capabilities of off-the-shelf SMP boards, even those populated with extremely high-speed chips.
The multicomputer is an alternative to the SMP architecture. It can extend processing performance considerably and cost-effectively. This article weighs the relative utility of the SMP and multicomputer processing options for medical imaging systems in light of the requirements of these systems.
Requirements of Medical Imaging Applications
Many medical imaging systems today have astounding image-generation capabilities that depend on the provision of processing systems offering high levels of power and scalability. Consider the requirements of multislice CT and digital cardiology.
Multislice CT. In 1974, computed tomography images were made using an x-ray source that was moved around the patient. These earliest systems could compute one image of 6400 pixels in 57 minutes.
By 1987, the x-ray source had come to be mounted on a continuously rotating machine called a gantry, and fans of attenuated x-ray signals were collected at a rate of 1000 per rotation. By moving the table on which the patient lay through the rotating gantry at the same time, the operator could acquire a helical, or spiral, scanning pattern. This technique provided great freedom in the spacing of output images, which could be computed at a resolution of 0.25 million pixels in 8 seconds.
The first multislice CT systems had been introduced by 2000. These systems were capable of acquiring four helices simultaneously at faster rotation rates. They typically reconstructed images at arbitrary spacing, at a rate of two per second. The image sets produced could exceed 150 MByte per patient. Not long from now, eight-image-per-second systems will be the norm.
Three algorithmic portions of the multislice CT application are particularly interesting: the (tangential) projection filtering, done in the frequency domain; the (longitudinal) so-called Z-filtering, done in the spatial domain; and the backprojection, traditionally done by special hardware.
Projection filtering involves forward and inverse real fast-Fourier transforms (FFTs) of projection vectors and multiplication of the transformed vector by a real filter kernel. The input and output data are typically about 1024 16-bit elements per projection. The projection may be floated and padded to 2048 elements before applying the FFT. The processing must be done at a rate of 8000 projections per rotation, or 16,000 per second. Each projection is 2 Kbyte, so an input/output (I/O) bandwidth of 64 MByte/sec is necessary for this module. Internally, each projection is filtered in about 75 microseconds on a 400-MHz processor and will require memory bandwidth of about 128 MByte/sec.
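The filtering step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used in any particular scanner: it uses a plain complex radix-2 FFT written from scratch rather than the optimized real FFT a vendor library would supply, and the names `fft` and `filter_projection`, as well as the kernel layout (one real gain per frequency bin of the padded vector), are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.

    The inverse transform is left unnormalized; the caller divides by n.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def filter_projection(samples, kernel, padded_len=2048):
    """Float and zero-pad one projection, then filter it in the frequency domain.

    samples: integer detector readings (about 1024 per projection in the text).
    kernel:  padded_len real gains, one per frequency bin (assumed layout).
    """
    # "Float and pad": convert to floating point and extend with zeros.
    padded = [float(s) for s in samples] + [0.0] * (padded_len - len(samples))
    spectrum = fft(padded)
    # Multiply the transformed vector by the real filter kernel.
    filtered = [s * g for s, g in zip(spectrum, kernel)]
    restored = fft(filtered, inverse=True)
    # Normalize the inverse transform, keep the real part, drop the padding.
    return [v.real / padded_len for v in restored[:len(samples)]]
```

Called with 1024 samples and the default `padded_len=2048`, the function mirrors the vector sizes quoted above. A kernel of all ones leaves the projection unchanged, which makes a convenient sanity check.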
To produce the 1000 interpolated projections for eight-image-per-second reconstruction, the Z-filtering module needs about 32 floating-point operations (FLOPs) per point, or 250 MFLOPS. The module consumes this input at a rate of about 32 MByte/sec and produces output at about 16 MByte/sec, for an I/O bandwidth of 48 MByte/sec. Internally, there are eight fetches and one store for each output sample, and the memory bandwidth required is about 292 MByte/sec.
In the backprojection module, 1000 projections of 1024 16-bit samples are transformed into one 512 x 512-pixel image (formed from 16-bit pixels). To achieve a reconstruction rate of eight images per second requires I/O bandwidth for 16 MByte/sec of input and 4 MByte/sec of output. Internally, the algorithm requires perhaps 32 operations per pixel per projection, or 8 Giga OP (GOP) per image and thus 64 GOP per second (GOPS) in total. The read-modify-write cycle for each pixel ultimately moves 8000 bytes per pixel between memory and the processor; interpolating the projection samples for each pixel moves an additional 4000 bytes. For eight images per second, of 0.25 million pixels each, memory bandwidth of 24 GByte/sec is required.
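The backprojection figures can be reproduced with simple arithmetic. The sketch below takes its inputs from the paragraph above; the 4-byte accumulator per pixel and the two 16-bit samples fetched per interpolation are assumptions chosen because they reproduce the quoted 8000 and 4000 bytes per pixel, not details stated in the text.

```python
# Inputs quoted in the text.
PROJECTIONS = 1000
SAMPLES_PER_PROJECTION = 1024
BYTES_PER_SAMPLE = 2
IMAGE_SIDE = 512
BYTES_PER_PIXEL = 2
IMAGES_PER_SEC = 8
OPS_PER_PIXEL_PER_PROJECTION = 32

# Assumptions (not stated in the text).
ACCUMULATOR_BYTES = 4          # running sum held per pixel during backprojection
SAMPLES_PER_INTERPOLATION = 2  # neighbouring samples fetched per pixel

pixels = IMAGE_SIDE * IMAGE_SIDE

# I/O bandwidth: projections in, finished images out.
input_bytes_per_sec = (PROJECTIONS * SAMPLES_PER_PROJECTION
                       * BYTES_PER_SAMPLE * IMAGES_PER_SEC)
output_bytes_per_sec = pixels * BYTES_PER_PIXEL * IMAGES_PER_SEC

# Computation load.
ops_per_image = OPS_PER_PIXEL_PER_PROJECTION * pixels * PROJECTIONS
ops_per_sec = ops_per_image * IMAGES_PER_SEC

# Memory traffic per pixel, summed over all projections.
rmw_bytes_per_pixel = 2 * ACCUMULATOR_BYTES * PROJECTIONS  # one read + one write
interp_bytes_per_pixel = (SAMPLES_PER_INTERPOLATION * BYTES_PER_SAMPLE
                          * PROJECTIONS)
memory_bytes_per_sec = ((rmw_bytes_per_pixel + interp_bytes_per_pixel)
                        * pixels * IMAGES_PER_SEC)

MIB, GIB = 2 ** 20, 2 ** 30
print(f"input   : {input_bytes_per_sec / MIB:.1f} MByte/sec")
print(f"output  : {output_bytes_per_sec / MIB:.1f} MByte/sec")
print(f"compute : {ops_per_sec / GIB:.1f} GOPS")
print(f"memory  : {memory_bytes_per_sec / GIB:.1f} GByte/sec")
```

The results land close to the rounded values in the text; small differences depend on whether a megabyte is counted as 10^6 or 2^20 bytes.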
Digital Cardiology. Digital cardiology systems are the result of combining aspects of x-ray fluoroscopy and digital subtraction angiography. These systems can capture, enhance, and display x-ray images of the heart and its blood vessels at rates up to 30 frames per second. Any one system will provide a range of image sizes, pixel bit-depths, and frame rates to fit the rated system bandwidth. One system requirement is minimal latency, defined as no more than 50 milliseconds for the processing pipeline. In the immediate future, these systems will provide 2048 x 2048-pixel images formed of 2-byte pixels at rates of 30 frames per second, or 500 MByte/sec of system I/O bandwidth.
For most of the imaging chain, the real-time data stream is delivered to memory, a read-modify-write operation performs subtraction from a mask or bilinear interpolative zoom, and the data stream is forwarded to the next stage of the pipeline by direct memory access. This results in a memory bandwidth of about 11.3 GByte/sec for any particular pipeline stage. Processing load is about 64 operations per pixel for the entire pipeline, with the exception of the edge-enhancement stage.
The edge-enhancement module of the pipeline has a different model. Although the I/O bandwidth, memory bandwidth, and latency requirements are the same as for the other modules, the computation load is much heavier. The process is a high-pass filter applied in the spatial domain using a convolution kernel that will be 13 x 13 pixels in size. The kernel may not be symmetric in the traditional sense, so each output pixel will be the result of as many as 330 operations, or 40 GOPS for a 2048 x 2048-pixel image size at 30 frames per second.
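To make the convolution load concrete, here is a minimal sketch of a direct spatial-domain convolution together with the operation count it implies. The function names are hypothetical, the pure-Python loop is for illustration only, and counting one multiply and one add per kernel tap is an assumption about how the operations in the text are tallied.

```python
def convolve2d(image, kernel):
    """Direct 2-D convolution over the 'valid' region (no border padding).

    image and kernel are lists of rows; no kernel symmetry is exploited.
    """
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    # The kernel is flipped, as convolution (not correlation) requires.
                    acc += image[y + j][x + i] * kernel[kh - 1 - j][kw - 1 - i]
            row.append(acc)
        out.append(row)
    return out


def ops_per_output_pixel(k):
    """k*k multiplies plus k*k - 1 adds for a non-symmetric k x k kernel."""
    return 2 * k * k - 1


def sustained_ops_per_sec(ops_per_pixel, side, frames_per_sec):
    """Operations per second for a side x side image stream."""
    return ops_per_pixel * side * side * frames_per_sec


# A 13 x 13 kernel gives 337 operations per pixel by this count, the same
# order as the "as many as 330" quoted in the text.
print(ops_per_output_pixel(13))
# Using the text's 330 operations per pixel at 2048 x 2048 and 30 frames/sec:
print(sustained_ops_per_sec(330, 2048, 30) / 1e9, "GOPS")
```

The second figure comes out a little above 40 GOPS, consistent with the rounded value given above.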
Table I summarizes application requirements for computed tomography and digital cardiology imaging.
| Modality (2002) | Benchmark | Operations per Point | Memory Bandwidth | I/O Bandwidth |
| Multislice computed tomography | 8 GFLOPS + Backprojector: 64 GOPS | Z-filter: 32 (floating-point); T-filter: 50 (floating-point); Backprojector: 32,000 | Z-filter: 292 MByte/sec; T-filter: 128 MByte/sec; Backprojector: 24 GByte/sec | Z-filter: 48 MByte/sec; T-filter: 64 MByte/sec; Backprojector: 20 MByte/sec |
| Digital cardiology | Image processing: 8 GOPS + Convolution: 40 GOPS | Image processing: 64; Convolution: 330 | Image processing: 11.3 GByte/sec; Convolution: 1 GByte/sec | System: 500 MByte/sec |
Table I. Processing requirements for two medical imaging applications.
The Symmetric Multiprocessor
The term symmetric multiprocessor is generally used to signify digital machines whose processors share a common memory space. The concept designated by the term also involves an assumption about memory architecture, that is, that all memory is equally (symmetrically) accessible from each central processing unit (cpu) in the system. In reality, however, symmetrical memory access is not usually achieved in anything but very small machines with four to eight processors. Figure 1 diagrams such a system. Four processors are connected by a shared cpu-bus attached to a bridge that serves as a memory and system controller. The processors share the memory, and it does not matter which processor accesses what part of the memory space.
Figure 1. Architecture of an SMP single-board computer.
The important assumption about an SMP computer is that the programmer can view the machine as a set of computing and I/O resources all connected to the same memory space. The program can then be constructed such that it does not have to take into account where the data (or text) are in relation to the computing resources.
When suppliers scale up their SMP machines, the processor-bus and memory bandwidths become severe bottlenecks. In the past, SMP suppliers have done several things to boost the capabilities of their computer modules. One is to create multiple memory banks. Having multiple memory buses increases memory bandwidth. Today's memory bus technology provides between 1 and 1.6 GByte/sec of available bus bandwidth.
Another technique has been to create multiple cpu buses. This increases cpu-to-memory controller bandwidth. Current cpu-bus technology (Pentium III and PowerPC 74xx) now supplies roughly 1 GByte/sec of external cpu-bus bandwidth. High-end workstation cpus deliver up to 4 GByte/sec and have complicated bus technology.
Also, the number of memory and bus controllers can be increased. The memory controller becomes a bottleneck when cpu buses and memory buses are multiplied. So instead, multiple memory controllers are connected by an interconnect network.
These techniques allow a larger number of cpus to be connected. However, the machine architecture then no longer matches the programming model of a single shared memory and system bus. The computer supplier must consequently add hardware to retain the appearance of an SMP model and must contend with the fact that not all memory is now equally close to the cpu. The distance to memory increases with the machine size. Today, it seems as though this model scales to somewhere between 16 and 32 cpus, at the price of very high and increasing hardware costs.
The Multicomputer
The SMP programming model is also transcended by the cache-coherent nonuniform-memory-access (ccNUMA) architecture. The idea here is to expose to the programmer the fact that memory is not equally accessible from every node (nonuniform memory access). The programming model is still a single shared memory space; however, the programmer can control the locality of the data so that it can be close to the cpu that needs it most.
The good thing about this architecture is that the machine can scale more easily than the pure SMP machine. This is a mixed programming model: the programmer can program the machine as if it were an SMP, but when performance is needed, the machine can be programmed with a distributed-memory programming model.
The multicomputer is a further simplification of the ccNUMA model. In its architecture, the cache-coherency protocol is dropped from the machine. This significantly simplifies the computer hardware, making the machine very inexpensive while yet maximizing network and processor performance.
From the model of a ccNUMA machine, the multicomputer keeps the multiple-memory controller, multiple cpu buses, and the tight coupling between the nodes; that is, it features a shared memory, but has no cache-coherency hardware. As shown in Figure 2, the multicomputer has distributed memory located at the cpu nodes, each node with its own memory controller and interface to a scalable crossbar network. In addition to processing and memory nodes, the network interfaces naturally to I/O devices, standard I/O bridges, bridges to expansion modules (e.g., a peripheral component interconnect (PCI) mezzanine card, or PMC), and the extended network itself.
Figure 2. Architecture of a multicomputer.
The programming model is now based on a distributed-memory model; the option to use a shared-memory model is preserved, but exercising it carries a heavy performance penalty. This is the so-called distributed shared memory (DSM) multicomputer architecture. Programming in this case takes into account both data and program locality; that is, the programmer needs to make sure that the data are local to the cpu that needs them. Since the programmer is in control of the data flow, it is relatively easy to provide high-performance interfaces to special-purpose hardware and I/O devices.
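The distributed-memory style can be illustrated with a toy simulation. Everything in this sketch is hypothetical: `Node` and `Network` are stand-ins invented for the example rather than any vendor's API, the nodes run one after another instead of in parallel, and the mask subtraction is borrowed from the cardiology pipeline described earlier. The point is only that each node computes on data in its own memory and that every transfer is an explicit, countable copy.

```python
import copy


class Node:
    """One hypothetical cpu node with private local memory."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.local = {}


class Network:
    """Stand-in for the crossbar: every transfer is an explicit copy."""

    def __init__(self):
        self.bytes_moved = 0

    def send(self, src, dst, key, bytes_per_element=2):
        data = src.local[key]
        dst.local[key] = copy.deepcopy(data)  # no memory is shared between nodes
        self.bytes_moved += sum(len(row) for row in data) * bytes_per_element


def subtract_mask(image, mask, n_nodes=4):
    """Split the image into row bands, one per node, and subtract the mask.

    Returns the result image and the number of bytes copied over the network.
    """
    net = Network()
    host = Node("host")
    workers = [Node(i) for i in range(n_nodes)]
    band = (len(image) + n_nodes - 1) // n_nodes  # rows per node, rounded up
    result = []
    for i, worker in enumerate(workers):
        lo, hi = i * band, (i + 1) * band
        # The programmer decides which rows live on which node.
        host.local["img"], host.local["mask"] = image[lo:hi], mask[lo:hi]
        net.send(host, worker, "img")
        net.send(host, worker, "mask")
        # The computation touches only the worker's local memory.
        worker.local["out"] = [
            [p - m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(worker.local["img"], worker.local["mask"])
        ]
        net.send(worker, host, "out")
        result.extend(host.local["out"])
    return result, net.bytes_moved
```

Because the copies are explicit, the data traffic of a given partitioning can be read directly from `bytes_moved`, which is the kind of control over data flow the paragraph above describes.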
Table II outlines the essential technical characteristics of three platform architectures available for use in medical imaging systems: two types of SMP and a pure DSM multicomputer. The perceived, as well as actual, advantages and disadvantages of the symmetric multiprocessor architecture and the multicomputer for such use are laid out in Table III.
| Characteristic | Pure DSM | Two-by-Two SMP | Four-Way SMP |
| Memory bandwidth | 4x + 33% (2.8 GByte/sec) | 2x (1.2 GByte/sec) | 1x (0.6 GByte/sec) |
| Computation rate | 8 GFLOPS | 8 GFLOPS | 8 GFLOPS |
| Network | Integrated | PMC | PMC |
| Network bandwidth | 500 MByte/sec | 200 MByte/sec | 200 MByte/sec |
| Standard I/O | External board; Fiber I/O | PCI over PMC; PCI | PCI over PMC |
Table II. Technical characteristics of two SMP architectures and the distributed shared memory (DSM) multicomputer architecture compared.
| Architecture | Perception | Advantages | Disadvantages |
| SMP | Easy to use. Easy to debug. | Price kept down by economies of scale. Scales well for small applications. Easy to program with thread replication, if suited to application. Off-the-shelf tools and open-source OS available. | Not well suited to real-time control. Nonscalable memory bandwidth. Difficult to adapt to other than symmetric-shared-memory application model. Manufacturer must do all the system integration to application builders. |
| Multicomputer | Hard to program. Hard to debug. | Simple, inexpensive hardware. Scales well. Maximizes processor and memory performance. Gives programmer control over data locality, data paths, and topology. Comes with field-support teams, system integration services, and customization resources. Better system mean time between failures (MTBF). | Programmer needs to take into account location of data. Data must be copied between processor nodes if to be shared. |
Table III. The two basic computing options for medical imaging systems compared, showing how each tends to be perceived and the benefits and limitations of each.
Making the Choice of SMP or Multicomputer
The medical equipment manufacturer must weigh many different characteristics of the multiprocessor computer platform in choosing what sort of computer to integrate into an imaging system. In addition to considering the business fit between itself and the supplier, the manufacturer must ponder a number of technical questions:
- Are there multiple operating modes? If so, how will the multiple modes be managed?
- What is the required memory bandwidth per cpu? How many processor operations per data point are needed?
- How large will the application grow to be? Will processing power need to be added in the future, and will increases in computing capability that accord with Moore's Law be sufficient and timely?
- How much data movement is involved? Is that liable to change in the future?
- What about technology insertion: can new hardware be added without changing the application?
- What about system integration? Where is the support for solving system problems?
Generally speaking, if these questions have easy answers and the problems to overcome are simple, an SMP platform may be adequate. If some of these questions expose trouble spots, then a multicomputer may be indicated.
Note that cpu-based CT processing uses perhaps 450 MByte/sec of memory bandwidth for 2 GFLOPS of Z- and T-filtering (see Table I). One has to wonder how much memory bandwidth the remaining 6 GFLOPS of processing will use, and where it will come from. Certainly not from a pool of shared memory with a total bandwidth of perhaps 600 MByte/sec.
Two notable requirements in the case of digital cardiology are stringent real-time latency levels and high bandwidths for both memory and I/O. The memory bandwidth required, 11.3 GByte/sec, is unquestionably higher than that supplied by most SMP systems.
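The comparison in the last two paragraphs can be written out as a small calculation. The per-board memory bandwidths come from Table II and the requirements from Table I; the function names are hypothetical, 1 GByte is taken as 1000 MByte, and the board count assumes that bandwidth adds up linearly across boards with nothing else consuming it, which is an idealization rather than a claim about any real system.

```python
import math

# Memory bandwidth per four-processor board, in MByte/sec (Table II).
BOARD_MEMORY_BW = {
    "Pure DSM": 2800,
    "Two-by-Two SMP": 1200,
    "Four-Way SMP": 600,
}


def boards_needed(required_mbyte_per_sec, board):
    """Smallest number of boards whose summed bandwidth covers the requirement."""
    return math.ceil(required_mbyte_per_sec / BOARD_MEMORY_BW[board])


def share_of_one_board(required_mbyte_per_sec, board):
    """Fraction of a single board's memory bandwidth the requirement consumes."""
    return required_mbyte_per_sec / BOARD_MEMORY_BW[board]


# CT filtering only: Z-filter plus T-filter memory bandwidth (Table I).
ct_filtering = 292 + 128
# One digital cardiology pipeline stage: 11.3 GByte/sec (Table I).
cardiology_stage = 11300

for board in BOARD_MEMORY_BW:
    print(board,
          f"CT filtering uses {share_of_one_board(ct_filtering, board):.0%} of one board;",
          f"cardiology stage needs {boards_needed(cardiology_stage, board)} board(s)")
```

Under these assumptions the CT filters alone take most of a four-way SMP board's memory bandwidth before the remaining 6 GFLOPS of processing is counted, and the cardiology requirement exceeds any single board in Table II.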
In addition, both of these imaging systems have processing stages that are very conducive to special-purpose hardware. This hardware must be integrated into the system along the data path, with more-general vector processing both upstream and downstream. This means that system integration problems will arise and will have to be solved.
Reasons for Selecting an SMP Solution. An SMP machine is a good fit in a highly cost-sensitive application environment with a low memory bandwidth and a small number of processors. This means an application for which two or four SMP processors provide adequate computing power and in which the algorithm has a low ratio of memory access to computing cycles. Furthermore, the system I/O bandwidth and the bandwidth between boards must be low enough for the PCI backplane to sustain it.
Business considerations favor an SMP-board approach if the application builders can do their own system integration. Usually, time-to-market pressure must be low for such an approach to be supportable. In addition, the application builders would have access to an open-source OS, or VxWorks from WindRiver (Alameda, CA), and the corresponding Linux tools or WindView (WindRiver), whichever they may prefer.
In terms of networking to achieve higher performance levels, the builders must be able to accept cabling between systems and extra PMC modules, and must tolerate a less-efficient parallel application programming interface (API). Less-than-ultimate speed must be satisfactory, in other words.
Reasons for Selecting a Multicomputer. Embedded multicomputers have a very high memory bandwidth, that value being the aggregate of the bandwidths of all the distributed memory systems. It can be as high as 2.8 GByte/sec per four-processor board. Libraries of processing functions are highly optimized for this architecture and use the hardware fully.
The multicomputer supplier provides a vertically integrated platform covering everything from high-performance algorithms through network API and real-time development tools and diagnostics. Other things that distinguish the multicomputer supplier from the SMP supplier are field support with on-site technical personnel and customization/integration services.
Networking is integrated into the multicomputer's components, with crossbars mounted directly on the boards. The interface with this embedded network is at each node controller. More than 0.5 GByte/sec of off-board bandwidth provides ultimate networking speed for scalable system performance.
Conclusion
On the basis of filtering demands only, data in Tables I and II suggest that a two-by-two SMP system might be adequate for multislice CT. This does not address the requirements of integrating the backprojection module, which will undoubtedly consist of special-purpose hardware, nor those of real-time control of the system by the OS.
An examination of the tables also suggests that only a DSM-type multicomputer would be adequate to handle a digital cardiology application, owing to the memory bandwidth requirement. As is the case with the CT backprojector, the cardiology edge-enhancement convolution module, which will undoubtedly consist of special-purpose hardware, must be integrated with the rest of the system resources. A multicomputer has the means to maximize bandwidth and minimize latency between the general-purpose and special-purpose processing modules in both cardiology and tomography systems. The stringent real-time latency demands and the high system throughput bandwidth required by digital cardiology also point to a multicomputer as the necessary platform.
The imaging chain of high-performance medical imaging systems often requires a multicomputer. In some cases, an SMP architecture may be appropriate, or even preferable; it depends on the specific application model. A multicomputer may be preferable for technical reasons, such as the memory bandwidth required by an algorithm and the need for real-time control of the data flow. There can also be business reasons for selecting a multicomputer, such as the system integration services furnished by the supplier. The important thing is that the medical system manufacturer asks the questions raised in this article in order to determine what computer architecture is best suited to solve a problem.
Iain Goddard, PhD, is a systems consulting engineer at Mercury Computer Systems (Chelmsford, MA), where he tracks medical imaging technology and develops medical product specifications. Mikael Taveniku, PhD, is a senior systems engineer with Mercury Computer Systems' product planning group, where he works with future product lines.
Copyright © 2001 Medical Electronics Manufacturing





