
What is the 3DLABS Media Processor?
The 3DLABS Media Processor is a highly parallel, fully programmable
floating point array combined with dual ARM® processor cores and
integrated peripheral support to provide a fully balanced solution for
accelerating both application and media intensive tasks with maximum
MIPS per mW.
What is special about it?
The Media Processor has an unusually flexible, programmable media-processing
engine. All processing is done with this engine and the ARM® CPUs;
there are no special units dedicated to specific functions. This architecture
allows us to squeeze the most performance from the lowest cost and the
lowest power consumption.
How does it do that?
A fixed function unit can only do one thing. A 3D graphics unit can
only do 3D graphics. If you don’t need 3D graphics, that part of
the chip sits idle, and you are paying for it even though you don’t
need it. If you don’t want 3D graphics from the Media Processor,
you simply don’t load the program. You can use the whole chip for
video decode if that’s what you want.
When a chip is designed with fixed function units, someone has to decide
how much of the chip to devote to each unit, and that limits its performance.
For example, a chip designed for both video decode and 3D graphics has
to assign some of the die to each function. The video decode performance
will depend on how much area is given to it, so for a given cost, adding
3D graphics reduces the video performance. We don’t have that
problem with our Media Processor; when you run video decode you get
the whole chip, then when you run 3D graphics you get the whole chip.
What if I want to run video and 3D at the same time?
The engine is multi-tasking, just as the CPUs are, so video decode and
3D graphics (and audio, 2D graphics, physics engines and anything else
you can think of) can be time-sliced on the engine. In that regard,
we’ve designed the engine to be more like a CPU than a traditional
graphics chip or DSP.
So is the engine just a very fast CPU?
No, it’s very different. General purpose CPUs are known to use a lot
of power on media tasks. Our engine is built from clusters
of parallel units. The Media Processor has three clusters; future chips
will have different numbers. Each cluster can work on a different task,
and each cluster can multi-task. This gives us great flexibility in
how work is distributed and lets us tackle very different types of problems.
Multi-tasking is a great way to distribute a resource but can cause
problems in real time systems. In those circumstances we can lock a
job to a cluster while the other jobs multi-task on the remaining clusters.
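As an illustration only, the idea of locking one job to a cluster while the others time-slice on the rest can be sketched in a few lines of Python. This is not 3DLABS code: the `schedule` function, the job names, and the round-robin policy are all assumptions made for the example. Only the cluster count of three comes from the text above.

```python
from collections import deque

NUM_CLUSTERS = 3  # three clusters per the FAQ; everything else is hypothetical


def schedule(jobs, locked=None, num_clusters=NUM_CLUSTERS, slices=6):
    """Return {cluster: [job per time slice]} for a toy round-robin scheduler.

    jobs   -- names of jobs that share the unlocked clusters by time-slicing
    locked -- optional {cluster_index: job_name} reserving a whole cluster
    """
    locked = locked or {}
    timeline = {c: [] for c in range(num_clusters)}
    shared = [c for c in range(num_clusters) if c not in locked]
    queue = deque(jobs)
    for _ in range(slices):
        # A locked cluster runs its job in every slice, so its timing is fixed.
        for cluster, job in locked.items():
            timeline[cluster].append(job)
        # The jobs at the front of the queue get the shared clusters this slice.
        running = list(queue)[:len(shared)]
        for i, cluster in enumerate(shared):
            timeline[cluster].append(running[i] if i < len(running) else None)
        # Round-robin: the jobs that just ran move to the back of the queue.
        queue.rotate(-len(running))
    return timeline


if __name__ == "__main__":
    plan = schedule(["audio", "2d", "physics"], locked={0: "video"}, slices=3)
    for cluster, slots in plan.items():
        print(cluster, slots)
```

In this sketch the video job owns cluster 0 in every slice, so its rate does not depend on how many other jobs are queued, while the other three jobs rotate through clusters 1 and 2.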
Isn’t that a challenge to program?
No, we provide different ways to use the engine. At its simplest the
programmer just writes a normal program to run on an ARM® CPU. The
program calls libraries that we provide that control processing on the
media engine. The programmer doesn’t even know the engine is there.
At the other extreme, a programmer can take full control, hand-code highly
optimised routines for the media engine and decide which clusters run which program.
In between these extremes, programmers can mix their code with ours,
queue tasks to be executed and have the workload automatically distributed.
What is in each cluster?
A cluster is a group of single instruction multiple data (SIMD) processing
elements. SIMD is a well-known method for packing a lot of processing
into a small space. The trick is to make that processing useful. There
have been far too many SIMD designs that could do one thing very well
but failed to deliver across a range of applications. Our design takes
the efficiency of SIMD and combines it with new ideas to produce something
uniquely flexible.
That’s a big claim, can you back it up?
Well, with the same engine we can decode high definition H.264 video,
then process an FFT, then run Bayer interpolation on data from an image
sensor and do it all with world-class efficiency. If you understand
those algorithms you’ll know that they have very different compute
loads and data access patterns.
What performance does the Media Processor have?
A summary of performance is given here.
Why do you support floating point?
Many engineers used to traditional DSPs think of floating point as something
for bloated applications on power hungry CPUs. In fact, floating point
can be quicker than integer because you don’t have to worry about
overflow and renormalising. You write less code, it’s easier to
understand, and quicker to get working. But that’s only true if
the hardware is done well, and we have a long history of floating point
design. For our media processor we put a lot of effort into designing
a really tight, efficient unit that runs floating point as quickly as
integer. There’s no drop in performance for floating point so
it can be used wherever it’s needed. And we use it all over the
place. Not only in obvious places like 3D graphics but also image processing,
physics simulations, audio processing, and general signal processing.
We support IEEE single precision floating point, and a special 16-bit
format that’s great for high dynamic range image processing.
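To make the point about overflow and renormalising concrete, here is a minimal sketch in Python that compares a 16-bit fixed-point multiply and add with the floating point equivalent. The Q15 format and all function names are assumptions chosen for the example; they do not describe the Media Processor's instruction set.

```python
Q = 15  # assumed Q15 format: 1 sign bit, 15 fractional bits
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def saturate(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def to_q15(x):
    """Convert a float in [-1, 1) to a saturated Q15 integer."""
    return saturate(int(round(x * (1 << Q))))


def q15_mul(a, b):
    """Fixed-point multiply: the product must be renormalised and clamped."""
    product = a * b      # the raw product carries 30 fractional bits
    product >>= Q        # renormalise back to 15 fractional bits
    return saturate(product)


def q15_add(a, b):
    """Fixed-point add: the sum can leave the 16-bit range, so clamp it."""
    return saturate(a + b)


def float_mac(a, b, c):
    """Floating point multiply-accumulate: no manual scaling or clamping."""
    return a * b + c


if __name__ == "__main__":
    half, three_quarters = to_q15(0.5), to_q15(0.75)
    print(q15_mul(half, half) / (1 << Q))                      # 0.25
    print(q15_add(three_quarters, three_quarters) / (1 << Q))  # clipped below 1.0
    print(float_mac(0.5, 0.5, 0.75))                           # 1.0
```

The fixed-point path needs a shift after every multiply and a range check after every operation, and the sum of 0.75 and 0.75 is silently clipped. The floating point path is a single expression and keeps the exact result, within the much larger range of the format.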
Surely something like a colour processing pipeline for an image sensor is best done as a fixed function?
Show me two image processing engineers who agree on a pipeline. Everyone
has different needs and if we hardcode the algorithms we please no one.
Take Bayer interpolation; there is continuous research into the best
way to extract data from a sensor. Even if the research stopped today,
different algorithms are best for different circumstances. The algorithm
you want to use when displaying a live preview of the scene should be
most concerned with using minimum power. The algorithm you use when
capturing an image to save should focus on quality. We just run a different
program and get the best of both worlds.
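The preview versus capture trade-off can be sketched as two interchangeable routines. This is a hypothetical example, not our production code: it assumes an RGGB sensor layout, an image of at least 2x2 samples stored as a list of rows, a half-resolution quad merge for preview, and plain bilinear interpolation for capture. Real pipelines would likely use more sophisticated algorithms for both.

```python
def channel_at(y, x):
    """Colour of the sensor sample at (y, x) for an assumed RGGB layout."""
    if y % 2 == 0:
        return "R" if x % 2 == 0 else "G"
    return "G" if x % 2 == 0 else "B"


def demosaic_preview(raw):
    """Cheap path: merge each 2x2 quad into one RGB pixel (half resolution)."""
    out = []
    for y in range(0, len(raw) - 1, 2):
        row = []
        for x in range(0, len(raw[0]) - 1, 2):
            r = raw[y][x]
            g = (raw[y][x + 1] + raw[y + 1][x]) / 2  # average the two greens
            b = raw[y + 1][x + 1]
            row.append((r, g, b))
        out.append(row)
    return out


def demosaic_capture(raw):
    """Quality path: full-resolution bilinear interpolation."""
    h, w = len(raw), len(raw[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            pixel = []
            for ch in "RGB":
                if channel_at(y, x) == ch:
                    pixel.append(float(raw[y][x]))  # keep the measured sample
                    continue
                # Average the same-colour samples in the clipped 3x3 window.
                vals = [raw[j][i]
                        for j in range(max(0, y - 1), min(h, y + 2))
                        for i in range(max(0, x - 1), min(w, x + 2))
                        if channel_at(j, i) == ch]
                pixel.append(sum(vals) / len(vals))
            row.append(tuple(pixel))
        out.append(row)
    return out


def demosaic(raw, mode):
    """Pick the routine by use case: 'preview' or 'capture'."""
    return demosaic_preview(raw) if mode == "preview" else demosaic_capture(raw)
```

The preview routine produces a quarter of the output pixels with a handful of operations each, while the capture routine does more work per pixel at full resolution. On a programmable engine, switching between them is a matter of loading a different routine.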
Why do you have two ARM® 9 CPUs? Why not one ARM® 11?
One obvious benefit is that two ARM® 9 processors give more MIPS
for fewer milliwatts than a single ARM® 11, but this is not the
only benefit.
The twin CPUs allow separation of interrupt driven events and applications.
In real time systems there is a conflict between the application that
you want to run and all the events such as network traffic and disk
activity that have to be handled on an interrupt basis. The randomness
of the background tasks makes it difficult to guarantee the speed of
the primary application so some headroom has to be left in the system.
Because we have two CPUs we can use one to handle the interrupt related
tasks and one to run the real time application. When our customers ship
a system and say it will run video at a particular rate, they can be
sure it will keep doing that even if another event arrives part way
through the movie. The system is more stable, tuning is easier, our
customers have shorter time to market, and we don’t have to run
the CPU faster than necessary so power consumption is lower.
What techniques have you used to reduce power consumption?
Energy efficiency starts with the fundamental architecture. How does
data move through the system? How can it be kept on chip? How do you
control power when the system is busy and when it’s idle? Our
architecture has been shaped by the way we tackled these questions,
and we didn’t only look at what is happening today but what we
need to do for future silicon processes as their characteristics change.
One of the most important things we have done is to provide multiple
voltage domains that allow us to set different voltages as demand for
performance changes. We can do this dynamically without rebooting. We
can also completely power off parts of the chip while the rest keeps
running.