|Home | Articles | Forum | Glossary | Books|
The processing of information is as old as the human race. The technology associated with it dates back to before the earliest form of writing to the use of counting beads and to the time of cave drawings. However, the inventions of printing (1049), the electronic binary coding of data (1837), and more recently radio communications (1891) and the electronic digital computer (1946) have increased the speed of generation and processing of data to such an extent that information and control technology no longer is concerned just with automatic processes replacing manual ones but provides the opportunity to do entirely new things. The developments during the 50 or so years since the invention of the transistor (1948) have helped designers introduce products that can hear, talk, and even detect the motion of objects. The late 1990s have seen developments related to automatic processes that could replace human sensory and cognitive processes as well as manipulative ones.
Early generations of 4- and 8-bit CISC processors have evolved into 16-, 32-, and 64-bit components with CISC or RISC architectures. Digital signal processors can be considered special cases of RISC architecture or sometimes parallel developments of CISC systems to tackle real-time signal processing needs. Over the past several decades, the field of digital signal processing has grown from a theoretical infancy to a powerful practical tool and matured into an economical yet successful technology.
At its early stages, audio and the many other familiar signals in the same frequency band have appeared as a magnet for DSP development. The 1970s saw the implementation of special signal processing algorithms for filters and fast Fourier transforms by means of digital hardware developed for the purpose.
Early sequential program DSPs are described by Jones and Watson (1990).
In the late 1990s, the market for DSPs is generated mostly by wireless, multimedia, and similar applications. As per industry estimates, by the year 2001, the market for DSPs is expected to grow to about $9.1 billion (Schneiderman, 1996). Currently, communications represent more than half of the applications for DSPs (Schneiderman, 1996). This section provides an essential guide for designers to understand the DSPs and briefly compares microprocessors and DSPs.
2. What Is a DSP?
A digital signal processor accepts one or more discrete-time inputs, xi(n), and produces one or more items of output, yl (n), for n = .... - 1, 0, 1, 2 ..... and I = 1 ..... N, as depicted in FIG. 1(a). The input could represent appropriately sampled (and analog-to-digital converted) values of continuous time signals of interest, which are processed in the discrete-time domain to produce output in discrete time that could then be converted to continuous time, if necessary. The operation of the digital signal processor on the input samples could be linear or nonlinear, time invariant or time varying, depending on the application of interest.
The samples of the signal are quantized to a finite number of bits, and this word length can be either fixed or variable within the processor. Signal processors operate on millions of samples per second, require large memory bandwidth, and are computationally very demanding, often requiring as many as a few hundred operations on each sample processed. These real-time capabilities are beyond the capabilities of conventional microprocessors and mainframe computers. A practical example of voice processing by a DSP is shown in FIG. 1 (b). Signal processors can be either programmable or of a dedicated nature.
Programmable signal processors allow flexibility of implementation of a variety of algorithms that can use the same computational kernel, while dedicated signal processors are hardwired to a specific algorithm or class of algorithms. Dedicated processors often are faster than or dissipate less power than general purpose programmable processors, although this is not always the case.
Digital signal processors traditionally have been optimized to compute the finite impulse response convolutions (sum of products), infinite impulse response recursive filtering, and fast Fourier transform-type (butterfly) operations that typically characterize most signal processing algorithms. They also include interfaces to external data ports for real-time operation. It is interesting to note that one of the earliest digital computers, ENIAC, included characteristics of a DSP (Marven and Ewers, 1994).
3. Comparison Between a Microprocessor and a DSP
Following the preceding section's discussion of microprocessors and microcontrollers, we can compare the microprocessors and DSPs. General architectures for computers and single-chip microcomputers fall into two categories. The architectures for the first significant electromechanical computer had separate memory spaces for the program and the data, so that both could be accessed simultaneously. This is known as a Harvard architecture, having been developed in the late 1930s by Howard Aiken, a physicist at Harvard University. The Harvard Mark 1 computer became operational in 1944.
The first general purpose electronic computer was probably the ENIAC (electronic numerical integrator and calculator) built during 1943-1946 at the University of Pennsylvania. The architecture was similar to that of the Harvard Mark 1 with separate program and data memories. Due to the complexity of two separate memory systems, Harvard architecture has not proven popular in general purpose computer and microcomputer design.
A consultant to the ENIAC project, John von Neumann, a Hungarian-born mathematician, is widely recognized as the creator of a different, very significant architecture, published by Burks, Goldstine, and von Neumann (1946; reprinted in Bell and Newell, 1971). The so-called von Neumann architecture set the standard for developments in computer systems over the next 40 years and more. The idea was very simple, based on two main premises: that there is no intrinsic difference between instructions and data and that instructions can be partitioned into two major fields containing the operation command and the address of the operand (data to be operated on); therefore, a single memory space could contain both instructions and data.
Common general purpose microprocessors, such as the Motorola 68000 family and the Intel i86 family, share the von Neumann architecture. These and other general purpose microprocessors also have other characteristics typical of most computers over the past 40 years. The basic computational blocks are an arithmetic logic unit and a shifter. Operations such as add, move, and subtract are performed easily in a very few clock cycles. Complex instructions such as multiply and divide are built up from a series of simple shift, add, and subtract operations. Devices of this type are known as complex instruction set computers.
CISC devices have multiply instructions, but this will simply execute a series of microcode instructions which are hard coded in on-chip ROM. The micro-coded multiply operation therefore takes many clock cycles.
FIG. 2 compares the basic differences between traditional microprocessor architecture and typical DSP architecture. Real-time digital signal processing applications require many calculations of the form
A = BC + D (eq. 1)
This simple equation involves a multiplication operation and an addition operation. Because of its slow multiplication, a CISC microcomputer is not very efficient at calculating it. We need a machine that can multiply and add in just one clock cycle. For this, we need a different approach to computer architecture.
Many embedded applications are well defined in scope and require only a few calculations to be performed, but they require very fast processing. Examples of such applications are digital compression of images, compact disc players, and digital telephones. In addition to these computation-intensive functions demanding the continuous processing, the processor has to perform comparatively simple functions such as menu control for satellite TV, selection of tracks for CD players, or number processing in a digital PBX, all of which require significantly less processing power.
In such applications, computation-intensive functions such as digital filtering and data compression require continuous signal processing, which requires multiplication, addition, subtraction, and other mathematical functions. While RISC processor architectures could be optimized to handle these situations by incorporating cache memory, direct access internal registers, and the like, DSP systems provide more computation-intensive functions such as fast Fourier transforms, convolutions, and digital filters. Particularly in a DSP-based system, such tasks should be performed on a real-time basis, as per FIG. 3. This indicates that the sample period and computational latency are becoming key parameters.
3.1 The Importance of the Sample Period and Latency in the DSP World
The sample period (the time interval between the arrival of successive samples of the input signal) depends on the technology employed in the processor. The time interval between the arrival of input and the departure of the corresponding output sample is the computational latency of the processor. To ensure the stability of the input ports, the output samples have to depart at the same sample period as the input samples. In signal processing applications, the minimum sample period that can be achieved often is more important than the latency of the circuit. Once the first output sample emerges, successive samples will emerge at the sample period rate, hiding the effects of a large latency of circuit operation. This makes sense because typical signal processing applications deal with a few million samples of data in every second of operation. For details on the relationship between these two parameters, see Madisetti (1995).
Other important measures are the area of the VLSI implementation and its power dissipation. These directly contribute to the cost of a DSP chip. One or more of these measures usually is optimized at the cost of others. These trade-offs again depend on the application. For instance, signal processors for portable communication require low power consumption combined with small size, usually at the cost of an increased sample period and latency.
3.2 The Merging of Microprocessors and DSPs
Diverse, high-volume applications such as cell phones, disk drives, antilocking brakes, modems, and fax machines require both microprocessor and DSP capability. This requirement has led many microprocessor vendors to build in DSP functionality. In some cases, such as in Siemens' Tricore architecture (Levy, 1998a), the functional merging is so complete that it is difficult to determine whether to consider the device a DSP or a microprocessor. At the other extreme, some vendors claim that their microprocessors have high-performance DSP capability, when in fact they have added only a "simple" 16 × 16-bit multiplication instruction.
4. Filtering Applications and the Evolution of DSP Architecture
Digital signal processing techniques are based on mathematical concepts familiar to most engineers. From these basic ideas spring the myriad applications of DSP, including fast Fourier transform, linear prediction, nonlinear filtering, and decimation and interpolation (see FIG. 4). One of the most common signal processing functions is linear filtering. High-pass, low-pass, and bandpass filters, which traditionally are analog designs, can be constructed with DSP techniques.
To build a linear filter using digital methods, a continuous-time input signal, Xc(t), is sampled to produce a sequence of numbers, x(n) = Xc(nT). This sequence is transformed by a discrete-time system m that is, a computational algorithm--into an output sequence of numbers, y(n). Finally, a continuous-time output signal, yc(t), is reconstructed from the sequence y(n). The essentials of filtering and sampling as applied to the world of DSP were discussed in section 3.
4.1 Digital Filters
Digital filters for many years have been the most common application of digital signal processors. Digital design, of any kind, ensures repeatability. Two other significant advantages accrue with respect to filters. First, it is possible to reprogram the DSP and drastically alter the filter's gain or phase response. For example, we can reprogram a system from low pass to high pass without throwing away the existing hardware. Second, we can update the filter coefficients while the program is running; that is, build "adaptive" filters. The two basic forms of digital filter, the finite impulse response (FIR) filter and the infinite impulse response (IIR) filter, are explained next. The initial descriptions are based on a low-pass filter. It is very easy to change low-pass filters to other types: high pass, bandpass, and so forth.
4.1.1 Finite Impulse Response Filter (FIR)
The mechanics of the basic FIR filter algorithm are straightforward. The blocks labeled z -1 in FIG. 5 are unit delay operators; their output is a copy of the input sample delayed by one sample period. A series of storage elements (usually memory locations) are used to simulate series of these delay elements (called a delay line). The FIR filter is constructed from a series of taps. Each tap includes a multiplication operation and an accumulation operation. At any given time, n 1 of the most recent input samples resides in the delay line, where n is the number of taps in the filter. Input samples are designated xk; the first input sample is x1 the next is x2, and so on. Each time a new input sample arrives, the previously stored samples are shifted one place to the right along the delay line and a new output sample is computed by multiplying the newly arrived sample and each of the previously stored input samples by the corresponding coefficient.
In the figure, coefficients are represented as Cn where n is the coefficient number.
The results of'each multiplication are summed together to form the new output sample, yn. Later we discuss how DSPs are designed to help implement these.
4.1.2 Infinite Impulse Response Filter (IIR)
The other basic form of digital filter is the infinite impulse response filter.
A simple form of this is shown in FIG. 6. Using the same notations as for the FIR, we can see that:
Take the math for granted m it is just relatively simple substitution.
Therefore, the transfer function is given by
From equation (2) we can see that each output, y(n), is dependent on the input value, x(n), and two previous outputs, y(n- 1) and y(n- 2). Taking this one step at a time, let us assume that there were no previous input samples before n = 0, then ...
We already can see that any output depends on all the previous inputs and we could go on, but the equation just gets longer. An alternative way of expressing this is to say that each output depends on an infinite number of inputs. This is why this filter type is called an infinite impulse response.
If we look again at FIG. 6, the filter actually is a series of feedback loops, and as with any such design, we know that, under certain conditions, it may become unstable. Although instability is possible with an IIR design, it has the advantage that, for the same roll-off rate, it requires fewer taps than FIR filters. This means that, if we are limited in the processor resources available to perform a desired function, we may have to use an IIR. We just have to be careful to design a stable filter. More advanced forms of these filters are discussed with simple explanation in Marven and Ewers (1994).
4.2 Filter Implementation in DSPs
To explain the filter implementation, let us take the case of a first-order recursive filter. A signal flow graph or signal flow diagram is a convenient representation of a signal processing algorithm. Consider the first-order recursive filter shown in FIG. 7(a). The sequential computations involved are not clearly evident in the signal flow graph, since it appears as if all the operations can be evaluated at the same time. However, operations have to follow a certain precedence to preserve correct operation. It is also not clear where the data operands and coefficients are stored prior to their utilization in the computation. A more convenient mode of description would be the one in FIG. 7(b), which shows the storage locations for each operand and the sequence of computations in terms of micro-operations at the register-transfer level (RTL) ordered in time from left to right. We assume that the state variable u(n - 1) is stored in the data memory (DM) at location D1, while the coefficient C1 stored in a coefficient memory (CM) at location C1. Both these operands are fetched and multiplied and the result is added to the input sample, x(n), and the sum is stored in a temporary location, T1. Then, another multiplication is performed using coefficient C2 and the product is added to the contents of T1. The final result is the output y(n). The new variable v(n) is stored in memory location D1. One may wonder why temporary location T1 has been used. Temporary locations such as T1 often provide a longer word length (or precision) than the word length of the memory.
Repeated sums of products, as required in this example, quickly can exceed the dynamic range provided by the word length. Temporary locations provide the additional bits required to offset the deleterious effects of overflow. One also can observe that, in this example, the multiplier and adder operate in tandem and the second coefficient multiplication can utilize the same multiplier when the input sample is being added. Thus, only one multiplier and one adder are required as arithmetic units. One data memory location, two coefficient memory locations, and one temporary storage register are required for correct operation of the filter.
The specification of the sequence of micro-operations required to perform the computation is called programming in assembler.
From the preceding discussion, any candidate signal processor architecture for the IIR filter needs a coefficient memory, a data memory, temporary registers for storage, a multiplier, an adder, and interconnection. In addition, address must be calculated for the memories as well as interpretation (or decoding) of the instruction (obtained from the program memory). The coefficient memory and the program memory can be combined into one memory (the program memory). Nothing can be written into this read-only memory (ROM). Data can be written and read from the random-access data memory (RAM). The architecture shown in FIG. 8 is a suitable candidate architecture for this application. The program counter and the index registers are used in computing the addresses of the next instruction and the coefficients. The instruction is decoded by the instruction register (IR), where the address of the data is calculated using the adder and the base index register provided with the data memory. The program bus and the data bus are separate from each other, as are the program and data memories. This separation of data and program memories and buses characterizes the Harvard architecture for digital signal processors. The shifter is provided to allow incorporation of multiple word lengths within the data path (the multiplier and the adder) and the data and program buses. The T1 register is configured as a higher-precision accumulator. Input samples are read in from the input buffer and written into the output buffer. The DSP can interact with a host computer via the external interface. In FIG. 8, the integers represent the number of bits carried on each bus. For a detailed account of digital filters, see Jones and Watson (1990, Chapter 7). The inherent advantages of digital filters are these:
1. They can be made to have no insertion loss.
2. Linear phase characteristics are possible.
3. Filter coefficients easily are changed to enable adaptive performance.
4. Frequency response characteristics can be made to approximate closely to the ideal.
5. They do not drift.
6. Performance accuracy can be controlled by the designer.
7. They can handle very low-frequency signals.
4.3 DSP Architecture
The simplest processor memory structure is a single bank of memory, which the processor accesses through a single set of address and data lines, as shown in FIG. 9. This structure, which is common among non-DSP processors, is often considered a von Neumann architecture. Both program instructions and data are stored in the single memory. In the simplest (and most common) case, the processor can make one access (either a read or a write) to memory during each instruction cycle.
If we consider programming a simple von Neumann architecture machine to implement the example FIR filter algorithm, the shortcomings of the architecture become immediately apparent. Even if the processor's data path is capable of completing a multiply-accumulate operation in one instruction cycle, it will take four instruction cycles for the processor to actually perform the multiply-accumulate operation, since the four memory accesses outlined previously must proceed sequentially, with each memory access taking one instruction cycle.
This is one reason why conventional processors often do not perform well on DSP-intensive applications and why designers of DSP processors have developed a wide range of alternatives to the von Neumann architecture, which we explore next.
The previous discussions indicate that parallel memories are preferred in DSP applications. In most DSPs, Harvard architecture coexists with data pipelines and instruction processors in a very efficient manner. The systems with specific addressing modes for signal processing applications could be best described as special instruction set computers (SISC). SISC architecture is characterized by a memory-oriented special purpose instruction set.
4.3.1 Basic Harvard Architecture
Harvard architecture refers to a memory structure in which the processor is connected to two independent memory banks via two independent sets of buses.
In the original Harvard architecture, one memory bank holds program instructions and the other holds data. Commonly, this concept is extended slightly to allow one bank to hold program instructions and data, while the other bank holds data only. This "modified" Harvard architecture is shown in FIG. 10. The key advantage of the Harvard architecture is that two memory accesses can be made during any one instruction cycle. Thus, the four memory accesses required for the example FIR filter can be completed in two instruction cycles. This type of memory architecture is used in many DSP families including the Analog Devices ADSP21xx.
4.3.2 SISC Architecture
While microprocessors are based on register-oriented architecture, signal processors have memory-oriented architectures. Multiple memories for both program and data have been present even in the first-generation DSPs such as TMS320C10. Modern DSPs have as many as six parallel memories for the use of the instruction or the data processors. External memory is as easily accessible as internal memory. In addition, a rich set of addressing modes tailored for signal processing applications also are provided. We describe the architecture representative of SISC computers and expect that future generations of SISC computers will have communication primitives as part of the standard instruction set. The basic instruction cycle is a unit of time measurement in the context of signal processing architectures, in some sense, the average time required to execute an ALU instruction. The basic instruction cycle is further divided into subcycles (usually two to four). The memory cycle time is that required to access one operand from the memory. The high-memory bandwidth requirement in SISC computers can be met by either providing for memories with very low-memory cycle times or multiple memories with relatively slower cycle times. Typically, an instruction cycle is twice as long as a memory cycle for on-chip memory (and equal to the memory cycle for external memory). Clearly, this facilitates the use of operand fetch and execution pipelines of two-operand instructions with on-chip data memories. If parallel data memories are provided, then the total number of memory cycles per instruction cycle is increased. The total number of memory cycles possible within a single basic instruction cycle is defined as the demand ratio (Kogge, 1981) for a SISC machine. Higher demand ratios lead to a higher throughput of instructions (eqn. 5):
Demand ratio = [(Basic cycle time) x (Number of memories)] / Memory cycle time
4.3.3 Multiple Access Memory-Based Architecture
As discussed, Harvard architecture achieves multiple memory accesses per instruction cycle by using multiple, independent memory banks connected to the processor data path via independent buses. While a number of DSP processors use this approach, there are other ways to achieve multiple memory accesses per instruction cycle. These include using fast memories that support multiple, sequential accesses per instruction cycle over a single set of buses and using "multiported" memories that allow multiple concurrent memory accesses over two or more independent sets of buses.
Achieving increased memory access capacity by use of multiported memory is becoming popular with the development of memory technology. A multiported memory has multiple independent sets of address and data connections, allowing multiple independent memory access to proceed in parallel. The most common type of multiported memory is the dual-ported variety, which provides two simultaneous accesses. However, triple- and even quadruple-ported varieties sometimes are used. Multiported memories dispense with the need to arrange data among multiple, independent memory banks to achieve maximum performance. The key disadvantage of multiported memories is that they are much more costly (in terms of chip area) to implement than standard, single-ported memories. Some DSP processors combine a modified Harvard architecture with the use of multiported memories. The memory architecture shown in FIG. 11, for example, includes a single-ported program memory with a dual-ported data memory. This arrangement provides one program memory access and two data memory accesses per instruction word and is used in the Motorola DSP561xx processors. For a more-detailed discussion of these techniques, see Lapsley et al. (1997).
4.4 Modifications to Harvard Architecture
The basic Harvard Architecture can be modified into six different types.
This discussion is beyond the scope of the section and for details, see Lee (1988, 1989).
5. Special Addressing Modes
In addition to general addressing modes used in microprocessor systems, several special addressing modes are used in DSPs, including circular addressing and bit reversed addressing. For a comprehensive discussion on addressing modes, see Lapsley et al. (1997), as only circular addressing and bit reversed addressing are discussed here.
5.1 Circular Addressing
Many DSP applications need to manage data buffers. A data buffer is a section of memory used to store data that arrive from an off-chip source or a previous computation until the processor is ready to process the data. In realtime systems, where dynamic memory allocation is prohibitively expensive, the programmer usually must determine the maximum amount of data that a given buffer must hold and set aside a portion of memory for that buffer. The buffers generally use a first-in-first-out (FIFO) protocol, meaning that data values are read out of the buffer in the order in which they arrived.
In managing the movement of data into and out of the buffer, the programmer maintains two pointers, which are stored in registers or in memory: a read pointer and a write pointer. The read pointer points to (that is, contains the address of)
the memory location containing the next data value to arrive, as illustrated in FIG. 12. Each time a read or write operation is performed, the read or write pointer is advanced and the programmer must check to see whether the pointer has reached the last location in the buffer. When the pointer reaches the end of the buffer, it is reset to point to the first location in the buffer. Checking whether the pointer has reached the end of the buffer after each buffer operation and resetting it if it has is time consuming. For systems that use buffers extensively, this linear addressing can cause a significant performance bottleneck.
To address this bottleneck, many DSPs have a special addressing capability that allows them, after each buffer address calculation, to automatically check whether the pointer has reached the end of the buffer and reset it at the buffer start location if necessary. This capability is called modulo addressing or circular addressing.
The term modulo refers to modulo arithmetic, where numbers are limited to a specific range. This is similar to the arithmetic used in a clock, which is based on a 12-hour cycle. When the result of a calculation exceeds the maximum value, it is adjusted by repeatedly subtracting from it the maximum representable value until the result lies within the specified range. For example, four hours after 10 o' clock is 2 o'clock (14 modulo 12). When modulo address arithmetic is in effect, read and write pointers (address registers) are updated using pre- or post-increment register-indirect addressing (Lapsley et al., 1997). The processor's address generation unit performs modulo arithmetic when new address values are computed, creating the appearance of a circular memory layout, as illustrated in FIG. 11(b). Modulo address arithmetic eliminates the need for the programmer to check the read and write pointers to see whether they have reached the end of the buffer and reset them once they have reached the end. This results in much faster buffer operations and makes modulo addressing a valuable capability for many applications.
In most real-time signal processing applications, such as those found in filtering, the input is an infinite stream of data samples. These samples are placed in "windows" and used in filtering applications. For instance, a sliding window of N data samples is used by an FIR filter with N taps. The data samples simulate a tapped-delay line and the oldest sample is written over by the most recent sample. The filter coefficients and the data samples are written into two circular buffers. Then, they are multiplied and accumulated to form the output sample result, which is stored. The address pointer for the data buffer is updated and the samples appear shifted by one sample period, the oldest data being written out and the most recent data is written into that location.
5.2 Bit-Reversed Addressing
Perhaps the most unusual of addressing modes, bit-reversed addressing is used only in very specialized circumstances. Some DSP applications make heavy use of the fast Fourier transform (FFT) algorithm. The FFT is a fast algorithm for transforming a time-domain signal into its frequency-domain representation and vice versa (Oppenheim and Schafer, 1988; Kularatna, 1996, Chapter 9). However, the FFT has the disadvantage that it either takes its input or leaves its output in a scrambled order. This dictates that the data be rearranged to or from natural order at some point.
The scrambling required depends on the particular variation of the FFT. The radix-2 implementation of an FFT, a very common form, requires reordering of a particularly simple nature, bit-reversed ordering. The term bit reversed refers to the observation that, if the output values from a binary counter are written in reverse order (that is, least significant bit first), the resulting sequence of counter output values will match the scrambled sequence of the FFT output data. This phenomenon is illustrated in FIG. 13.
Because the FFT is an important algorithm in many DSP applications, many DSP processors include special hardware in their address generation units to facilitate generating bit-reversed address sequences for unscrambling FFT results.
For example, the Analog Devices ADSP-210xx provides a bit-reverse mode, which is enabled by setting a bit in a control register. When the processor is in the bit-reverse mode, the output of one of its address registers is bit reversed before being applied to the memory address bus.
An alternative approach to implementing bit-reversed addressing is the use of reverse-carry arithmetic. With reverse-carry arithmetic, the address generation unit reverses the direction in which carry bits propagate when an increment is added to the value in an address register. If reverse-carry arithmetic is enabled in the AGU and the programmer supplies the base address and increment value in bit-reversed order, then the resulting addresses will be in bit-reversed order.
Reverse-carry arithmetic is provided in the AT&T DSP32xx, for example.