Medical Electronics Manufacturing
Medical Electronics Manufacturing Fall 1998
Digital Signal Processing
System Architecture Determines Performance of Multiprocessor Applications
For multiprocessor implementation, the ability of the system's performance to scale as processors are added is a critical and often overlooked factor that may limit system performance.
Joseph A. Sgro and Paul C. Stanton
Implementing successful, cost-effective real-time digital signal processor (DSP)-based medical applications on multiprocessor systems requires that designers carefully specify both latency and throughput requirements, as well as select appropriate processors and system architecture. In many applications, either latency or throughput can be the primary requirement to consider. For diagnostic ultrasound, for example, latency is critical because the technician uses the output image in real time to properly position the transducer. By contrast, for computed axial tomography (CAT) scanners, output latency is less critical, although it is desirable to produce an output set in less than 1 minute.
Scalable subsystems are critical to DSP performance in medical applications.
In a single-processor system, latency determines the peak computational performance needed, whereas throughput determines the aggregate computational performance required. In a multiprocessor system, latency determines how many processors must cooperate on a single data set, and throughput determines the total number of processors required.
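As a rough illustration of this sizing rule, the sketch below estimates both processor counts. It assumes ideal linear speedup, which real systems rarely achieve, and the function name and workload figures are hypothetical rather than taken from a measured system.

```python
def size_system(single_cpu_us, latency_budget_us, datasets_per_second):
    """Estimate processor counts, assuming ideal linear speedup (an assumption)."""
    # Latency: processors that must cooperate on one data set to meet the budget.
    per_dataset = -(-single_cpu_us // latency_budget_us)          # ceiling division
    # Throughput: processors needed to keep up with the arrival rate.
    total = -(-(single_cpu_us * datasets_per_second) // 1_000_000)
    # The system needs at least one full cooperating group.
    return per_dataset, max(total, per_dataset)

# Hypothetical workload: 40 ms of work per data set, 10 ms latency budget,
# 150 data sets per second.
group, total = size_system(40_000, 10_000, 150)
print(group, total)
```

Here latency fixes the size of each cooperating group, and throughput fixes the total number of processors.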
The ability of the system's performance to scale as processors are added is a critical factor that may limit system performance in a multiprocessor system. Medical DSP applications are commonly implemented on Texas Instruments' TMS-series processors or Analog Devices' SHARC-series processors. This article will concentrate on the latter.
Currently available DSP cards offer a choice between two classes of architectures: the cluster bus architecture and the local memory architecture. A third architecture, the mesh design, is also available, but it is often poorly suited to demanding medical applications because of limited input/output (I/O) bandwidth and poor data scatter and gather performance.
Figure 1. Cluster bus design.
In the cluster architecture, all processors, global I/O, and global memory are connected to a common cluster bus (see Figure 1). The main drawback to the cluster design is that all processors must contend for access to the common bus, a situation that results in performance degradation during memory- and I/O-intensive operations. In the worst case, where the application is memory limited, a single processor performs the same as or even better than two or more processors.
Figure 2. Dual-ported local memory (DPLM) design.
In a private memory design, each processor has its own memory from which it can read and write without contending for memory being used by other processors. For a dual-ported local memory (DPLM) design, each processor is provided with local dual-ported memory that also connects to a common data bus (see Figure 2). In contrast to the cluster design, the performance of the eight-processor DPLM architecture does not degrade as more processors are added, because the available memory bandwidth per processor remains constant. For memory-limited applications, performance improves with additional processors because the total memory bandwidth scales linearly with the number of processors.
Similar considerations apply to the demands for inputting measurement data and outputting results. Because I/O contends for memory along with the processors in the cluster architecture, it further limits the total bus bandwidth available for data processing. In the private-memory design, the global bus must have bandwidth equal to twice the input data rate plus the output data rate (to accommodate input, forwarding to private memories, and output). Once this requirement is met, adding processors does not change the loading of the global memory bus. Typically, a separate control processor or DMA engine controls the transfer of data from the global bus to private-processor memory.
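The global-bus requirement for the private-memory design can be written out directly. This is a minimal sketch of the rule stated above; the function names and the example rates are hypothetical.

```python
def global_bus_requirement(input_mb_s, output_mb_s):
    """Global bus load in a private-memory design, in Mbyte/sec.

    Input data crosses the bus twice (into global memory, then forwarded to
    private memory) and output data crosses it once.
    """
    return 2 * input_mb_s + output_mb_s

def bus_is_sufficient(bus_mb_s, input_mb_s, output_mb_s):
    """True when the global bus can carry the I/O traffic."""
    return bus_mb_s >= global_bus_requirement(input_mb_s, output_mb_s)

# Hypothetical rates: 50 Mbyte/sec in, 20 Mbyte/sec out, on a 160-Mbyte/sec bus.
print(global_bus_requirement(50, 20), bus_is_sufficient(160, 50, 20))
```

The processor count does not appear in the calculation, which is why adding processors does not change the loading of the global bus.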
Clearly, designers must carefully consider I/O and memory bandwidth in selecting an optimal architecture. Some architectures fail because a common memory bus fails to deliver data to computational resources fast enough, critically limiting application performance. Private memory, on the other hand, can obviate this problem. Two real-world applications illustrate the performance obtainable with DPLM architecture. The first is an ultrasound application in which latency is critical; the other is a real-time medical imaging reconstruction application. Both were coded taking advantage of parallel operation wherever possible.
Ultrasound. Diagnostic ultrasound is an example of an application with a tight latency requirement. Satisfactory results are critically dependent on accurate placement of the ultrasound probe by the technician. To facilitate probe placement, the ultrasound system must process the data it collects and render an image without perceptible delay.
A typical ultrasound machine acquires data at a rate of 15 million 12-bit samples per second on four channels, giving an aggregate data rate of 120 Mbyte/sec. To input the data into global memory and read the data into the processor would require 240 Mbyte/sec, which exceeds the 160-Mbyte/sec bandwidth of the SHARC processor. Hence, two isolated buses (that is, two DSP boards) would be required to provide sufficient bandwidth. The performance of fast Fourier transforms (FFTs) is critical for analyzing ultrasound data. A single processor requires 102 microseconds to read the data and write back the FFT result. A 1-K FFT requires about 410 microseconds, which means that as many as four processors can operate on a cluster bus without interference, for a total throughput of 9756 FFTs/sec. Additional processors would not improve performance, and processing would become limited by the bandwidth of the bus.
The SHARC II (ADSP 21160) has a data bus bandwidth of 320 Mbyte/sec; hence, a single global bus can be used. The 120-Mbyte/sec data input rate leaves 200 Mbyte/sec available for processing. However, this processor takes 52 microseconds to read the data and output the results and can perform a 1-K FFT in about 82 microseconds.
Accordingly, two processors would contend for bus bandwidth, and the second processor would only improve performance by a factor of 1.6. If a third processor were added, performance would not improve. With two processors, the throughput is 19,531 FFTs/sec. Although the SHARC II provides five times the peak performance of the original SHARC, it is only able to improve performance by a factor of 2 on a cluster bus.
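The transfer times quoted above are consistent with moving a 1-K block of complex single-precision samples across the bus twice, once in and once out. The sketch below works through that arithmetic. The 8-byte sample size and the use of 1 Mbyte = 10^6 bytes are assumptions, not figures stated in the article.

```python
POINTS = 1024            # 1-K FFT
BYTES_PER_SAMPLE = 8     # assumption: complex value stored as two 32-bit floats

def transfer_time_us(bus_mbyte_per_sec):
    """Time to read one FFT block and write the result back, in microseconds."""
    bytes_moved = 2 * POINTS * BYTES_PER_SAMPLE    # read plus write
    return bytes_moved / bus_mbyte_per_sec         # bytes / (Mbyte/sec) = microseconds

def max_uncontended_cpus(fft_us, transfer_us):
    """Processors that can share the bus before their transfers collide."""
    return int(fft_us // transfer_us)

sharc1 = transfer_time_us(160)
sharc2 = transfer_time_us(320)
print(sharc1, max_uncontended_cpus(410, sharc1))
print(sharc2, max_uncontended_cpus(82, sharc2))
```

Under these assumptions the transfer takes 102.4 microseconds on the original SHARC and 51.2 microseconds on the SHARC II. A bus saturated at one transfer every 51.2 microseconds delivers about 19,531 FFTs/sec; rounding the transfer time to 52 microseconds gives about 19,230.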
Figure 3. Number of FFTs/sec obtainable for DPLM and cluster architectures for the SHARC I and the SHARC II.
In a private-memory design, FFT data are read and written in private memory, preventing competition for memory on the bus. Improvement in performance scales linearly with the number of processors. One ADSP 21060 processor provides 2439 FFTs/sec, whereas eight give 19,512 FFTs/sec. Similarly, for the ADSP 21160, one processor provides 12,195 FFTs/sec, and eight give 97,560 FFTs/sec (see Table I and Figure 3). As the table indicates, after the bus becomes saturated, adding processors does not improve the cluster architecture's performance. DPLM architecture, on the other hand, scales the bus bandwidth by adding local buses along with the additional processors.
| CPUs | SHARC I Cluster | SHARC II Cluster | SHARC I DPLM | SHARC II DPLM |
| --- | --- | --- | --- | --- |
| 1 | 2439 | 12,195 | 2439 | 12,195 |
| 2 | 4878 | 19,230 | 4878 | 24,390 |
| 4 | 9756 | 19,230 | 9756 | 48,780 |
| 8 | 9756 | 19,230 | 19,512 | 97,560 |
Table I. FFTs/sec versus number of processors.
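The shape of Table I can be approximated with a simple bound: throughput grows with the number of processors until the shared bus saturates. The following is a simplified model, not the authors' calculation. It assumes transfers overlap fully with computation and uses the FFT and transfer times quoted in the text.

```python
def cluster_rate(cpus, fft_us, bus_us):
    """FFTs/sec on a shared bus: compute-bound until the bus saturates."""
    compute_bound = cpus * 1e6 / fft_us    # every processor busy computing
    bus_bound = 1e6 / bus_us               # one block transfer at a time
    return min(compute_bound, bus_bound)

def dplm_rate(cpus, fft_us):
    """FFTs/sec with private memory: no shared-bus limit in this model."""
    return cpus * 1e6 / fft_us

for cpus in (1, 2, 4, 8):
    print(cpus,
          round(cluster_rate(cpus, 410, 102)),   # SHARC I cluster
          round(cluster_rate(cpus, 82, 52)),     # SHARC II cluster
          round(dplm_rate(cpus, 410)),           # SHARC I DPLM
          round(dplm_rate(cpus, 82)))            # SHARC II DPLM
```

The model agrees with Table I to within about 1%. The small differences come from rounding in the published figures.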
Figure 4. CAT scan data collection.
Computed Axial Tomography. The CAT example uses a reconstruction region of a square array of 1024 x 1024 pixels centered on the rotation axis of the x-ray head. Data are collected from the x-ray head as radial scans, with four channels of 4096 values for each of 360 angles. The data collected occupy 4 x 4096 x 360 x 2, which equals 11,796,480 bytes. The data are moved into memory at an average rate of 5 Mbyte/sec (see Figure 4).
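The data volume can be checked with a few lines of arithmetic. The transfer-time estimate is an addition for illustration and takes 1 Mbyte as 10^6 bytes, which is an assumption.

```python
CHANNELS = 4
VALUES_PER_SCAN = 4096
ANGLES = 360
BYTES_PER_VALUE = 2

data_bytes = CHANNELS * VALUES_PER_SCAN * ANGLES * BYTES_PER_VALUE
# Assumption: 5 Mbyte/sec means 5 * 10**6 bytes per second.
transfer_seconds = data_bytes / 5e6
print(data_bytes, round(transfer_seconds, 2))
```

At the stated average rate, moving one full scan into memory takes a little under 2.4 seconds.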
The scan-data processor uses the following five steps to reconstruct the pixel densities:
1. Calculation of the density estimate from the output of the four detectors at the end of each beam path.
2. For back projection, conversion of the fan-shaped polar form of the data to the square array of pixels in the reconstruction region. This is accomplished by summing the density along each ray into each pixel the ray encounters, with linear interpolation being used when the ray passes between two pixels. Four channels of data are used to estimate the actual density along the beam path to compensate for the normal beam hardening that occurs as x-rays pass through an object.
3. Application of a 2-D FFT to the image array.
4. Application of a weighting factor based on the distance of each pixel from the axis of rotation. This compensates for the greater number of rays passing through pixels closer to the axis of rotation.
5. Application of an inverse 2-D FFT to the whole array.
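The linear interpolation used in step 2 can be illustrated in one dimension. The following is a minimal sketch, not the authors' implementation: when a ray crosses a pixel row at a fractional position, its density is split between the two neighboring pixels in proportion to how close each one is. The function name and the four-pixel row are hypothetical.

```python
import math

def deposit(row, position, density):
    """Add `density` to `row` at fractional `position` using linear interpolation."""
    left = math.floor(position)
    frac = position - left
    # The nearer pixel receives the larger share.
    if 0 <= left < len(row):
        row[left] += density * (1.0 - frac)
    if frac > 0.0 and 0 <= left + 1 < len(row):
        row[left + 1] += density * frac

row = [0.0] * 4
deposit(row, 1.25, 8.0)   # ray passes a quarter of the way from pixel 1 to pixel 2
print(row)                # [0.0, 6.0, 2.0, 0.0]
```

A full back projection repeats this deposit for every ray and every row it crosses, which is why step 2 dominates the memory traffic.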
Parallelization of the Algorithm. To improve performance, the computation is divided between processors by splitting the reconstruction region into horizontal strips. The first processor gets the first strip, and each processor thereafter gets succeeding strips. The algorithm is inherently linear, allowing a natural division between processors.
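The strip decomposition can be sketched as follows. The function name is hypothetical, and spreading leftover rows over the first strips is one reasonable choice rather than a detail given in the article.

```python
def strip_bounds(rows, processors):
    """Split `rows` image rows into contiguous horizontal strips, one per processor."""
    base, extra = divmod(rows, processors)
    bounds, start = [], 0
    for p in range(processors):
        # Spread any remainder over the first `extra` strips.
        height = base + (1 if p < extra else 0)
        bounds.append((start, start + height))
        start += height
    return bounds

# First two strips of a 1024-row image on eight processors.
print(strip_bounds(1024, 8)[:2])   # [(0, 128), (128, 256)]
```

With six processors the 1024 rows do not divide evenly, so four strips get 171 rows and two get 170.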
Cluster Mode Architecture Analysis
The size of global memory and the precomputed tables make the use of static random access memory (SRAM) prohibitive in terms of cost, physical size, and power requirements. This analysis assumes that global memory is dynamic random access memory (DRAM) accessible in two clocks per access (that is, one wait state). The reconstructed image, however, is kept in SRAM, accessible in one clock per access (no wait states). This analysis assumes a cluster of six processors, a limitation imposed by the design of the SHARC.
A single SHARC can access memory at 160 Mbyte/sec (40 megafloats/sec). When six processors compete for global memory, the average access rate per processor falls to 6.67 megafloats/sec for SRAM and 3.33 megafloats/sec for DRAM.
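These per-processor rates follow from dividing the bus bandwidth evenly among the processors. The sketch below assumes 4-byte floats and equal sharing of the bus; the function name is hypothetical.

```python
BUS_MBYTE_PER_SEC = 160
BYTES_PER_FLOAT = 4       # assumption: 32-bit floats

def per_cpu_megafloats(cpus, clocks_per_access):
    """Average memory access rate per processor on a shared bus, in megafloats/sec."""
    peak = BUS_MBYTE_PER_SEC / BYTES_PER_FLOAT    # 40 megafloats/sec
    return peak / cpus / clocks_per_access

print(round(per_cpu_megafloats(6, 1), 2))   # SRAM, no wait states: 6.67
print(round(per_cpu_megafloats(6, 2), 2))   # DRAM, one wait state: 3.33
```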
In the density estimation step, one processor broadcasts calibration data to all SHARCs in the cluster, and each SHARC accesses a single data sensor. In the back-projection step, data and table values are broadcast to each SHARC by one of the processors in the cluster. Broadcasting these data reduces the bus loading by 30%; however, processor access to the image data still loads the bus significantly. Compared to a single processor, a six-processor cluster can therefore improve performance by only about 50%.
For the inverse 2-D FFT, each processor reads rows of data from SRAM and performs the FFT calculation in internal memory. This step is limited by both computation time and the time required to twice read and write the data in the image array. Restricted memory bandwidth limits the improvement in performance to a factor of 2.3. Similar performance considerations apply to the 2-D FFT.
| Step | SHARC I Time (sec) | SHARC II Time (sec) |
| --- | --- | --- |
| 1 | 2.13 | 1.07 |
| 2 | 122.34 | 61.17 |
| 3 | 0.21 | 0.05 |
| 4 | 0.11 | 0.04 |
| 5 | 0.21 | 0.05 |
| Total | 125.00 | 62.38 |
Table II. Cluster mode performance.
The pixel-weighting step is limited by the bandwidth of the bus; its duration is determined by the time required for all processors to read data and table values from shared memory. Additional processors, therefore, do not improve performance over that obtainable from a single processor. Table II shows the time required to execute the algorithm on a six-processor cluster mode system, based upon these considerations, for SHARC I (21060) and SHARC II (21160) processors.
DPLM Architecture Analysis
DPLM architecture optimizes memory use and availability. It is not limited to six processors in a cluster, and all processors can operate in parallel without contending for a shared bus during the back-projection calculation. The reconstructed image is kept in video random access memory (VRAM), reducing the interference between processors during the computation's back-projection phase. When the raw data are passed to the processors, they are broadcast to all the processors by the DMA controller.
When the ray table is broadcast to the processors, it is sent along with the raw data for each angle, reducing the memory requirements at each processor node. DPLM architecture can operate in single instruction multiple data (SIMD) mode, reducing the 2-D FFT time significantly. When estimating the pixel density, the raw data and the coefficient table are sent by the DMA controller into the SHARC array. This step is limited by the transfer time required by the DMA controller. For the back-projection step, the DMA controller broadcasts data and table values to the processor array. Each processor operates on 1/8 of the image in private VRAM. This allows each processor to access memory at 20 megafloats/sec. Although this step is memory-bandwidth limited, execution on DPLM architecture is much faster than on cluster architecture, because eight private buses operating in parallel deliver far more bandwidth than one common bus.
The 2-D FFT step uses the SIMD version of the FFT. The array is processed in rows. The resulting computed data are passed through the link ports in parallel and then processed in columns. Including the time used to move data to and from VRAM, the entire operation requires 0.083 seconds.
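The row-then-column procedure described above is the standard row-column decomposition of a 2-D transform. The sketch below shows the idea on a tiny array. It uses a naive DFT in place of an optimized SIMD FFT, and the transpose stands in for the exchange over the link ports; neither reflects the actual SHARC code. Each row transform is independent, which is what allows the rows to be divided among processors.

```python
import cmath

def dft(values):
    """Naive 1-D discrete Fourier transform (stand-in for an optimized FFT)."""
    n = len(values)
    return [sum(values[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def transpose(matrix):
    """Swap rows and columns; stands in for the inter-processor data exchange."""
    return [list(col) for col in zip(*matrix)]

def dft2(image):
    """2-D transform by the row-column method: rows, transpose, rows, transpose."""
    rows_done = [dft(row) for row in image]                  # process in rows
    cols_done = [dft(col) for col in transpose(rows_done)]   # then in columns
    return transpose(cols_done)

# The transform of a unit impulse at the origin is flat: every output equals 1.
impulse = [[1, 0, 0, 0]] + [[0, 0, 0, 0] for _ in range(3)]
spectrum = dft2(impulse)
print(all(abs(v - 1) < 1e-9 for row in spectrum for v in row))   # True
```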
Figure 5. Time to complete rendering of the CAT scan image for the dual-ported local memory (DPLM) and cluster architectures for the SHARC I and the SHARC II.
Pixel weighting by the radial position is limited by the time required to pass the radius table data to VRAM and the time needed to access VRAM in processing the table. The inverse 2-D FFT has the same characteristics as the forward 2-D FFT, but it takes longer because the final image must also be passed back to global DRAM. Table III and Figure 5 indicate the execution times for the eight-processor SHARC I and SHARC II DPLM architecture.
| Step | SHARC I Total Time (sec) | SHARC II Total Time (sec) |
| --- | --- | --- |
| 1 | 0.264 | 0.132 |
| 2 | 24.400 | 12.200 |
| 3 | 0.096 | 0.048 |
| 4 | 0.013 | 0.006 |
| 5 | 0.148 | 0.072 |
| Total | 24.921 | 12.458 |
Table III. Dual-ported local memory (DPLM) performance.
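The two tables can be compared step by step. The sketch below copies the per-step times from Tables II and III and sums the rows itself instead of using the printed totals; the function and variable names are hypothetical.

```python
# Per-step times in seconds, steps 1 to 5, from Tables II and III.
cluster = {"SHARC I": [2.13, 122.34, 0.21, 0.11, 0.21],
           "SHARC II": [1.07, 61.17, 0.05, 0.04, 0.05]}
dplm = {"SHARC I": [0.264, 24.400, 0.096, 0.013, 0.148],
        "SHARC II": [0.132, 12.200, 0.048, 0.006, 0.072]}

def summarize(cpu):
    """Overall speedup of DPLM over cluster, and back projection's share of DPLM time."""
    cluster_total = sum(cluster[cpu])
    dplm_total = sum(dplm[cpu])
    return {
        "speedup": cluster_total / dplm_total,
        "back_projection_share": dplm[cpu][1] / dplm_total,   # step 2
    }

for cpu in cluster:
    result = summarize(cpu)
    print(cpu, round(result["speedup"], 2), round(result["back_projection_share"], 3))
```

For both processors the eight-processor DPLM system comes out roughly five times faster than the six-processor cluster, and back projection accounts for about 98% of the DPLM run time.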
Conclusion
These examples illustrate that where memory bandwidth limits performance, the DPLM architecture operates many times faster than the standard cluster mode architecture. This can be attributed to two key differences: the DPLM architecture does not inherently limit the number of processors, and each processor operates on data in private memory isolated from the other processors. If processors are added, the DPLM architecture provides nearly linear improvement in performance. In contrast, even if it were possible to add processors to the cluster design, performance would not improve because the common bus is completely saturated by data accesses.
Joseph A. Sgro is CEO and director of research and development and Paul C. Stanton is president and director of engineering for Alacron (Nashua, NH).



