This is taken from a Watch Impress article in Japanese; below is an English translation of part of the article:
...The structure in which a processor core has shared registers accessible from other cores is also seen in network processors and the like. For what it's worth, AGEIA co-founder Manju Hegde was previously the CTO of a network processor manufacturer.
Memory Architecture Without Cache Hierarchy
One of the biggest characteristics of the PPU is its memory architecture. It has a 128-bit interface to external memory, but no internal cache memory.
"We don't have a cache hierarchy of any kind. This is very important, because a traditional cache is not suitable for physics," says Hegde.
The PPU has no structure like a CPU cache, which is kept synchronized with external memory by set-associative hardware and updated automatically. The reason is that physics simulation has little data locality, so according to AGEIA a cache hierarchy is more trouble than it's worth.
"In CPU and GPU, data has locality. But in physics not, as it has to do random access to many objects. Data structures are totally different" says Nadeem Mohammad, who moved from a GPU vendor to AGEIA.
Still, the PPU does have large internal memory. Instead of a cache it has various internal memories, organized so that transfers between internal and external memory are explicit and programmable.
The patent describes memories such as the dual-bank Inter-Engine Memory (IEM) connected to the VPUs, a multi-purpose Scratch Pad Memory (SPM), and a DME Instruction Memory (DIM) that queues instructions. Hegde suggested that the memories described in the patent exist in the actual implementation, saying "they are probably included" in the PPU.
Among those memories, the IEM is used in a way that resembles a traditional data cache. According to the patent, the DME explicitly loads the data set required by the processing units into the IEM. Unlike a cache, the IEM allows low-latency access and can apparently implement a large number of I/O ports; as a result, it can achieve huge internal memory bandwidth.
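As a rough illustration of the explicit-transfer idea from the patent, here is a minimal sketch: a controller (the DME, in AGEIA's terminology) stages a working set into low-latency on-chip memory (the IEM), the processing units work on the local copy, and the result is written back. Plain memcpy stands in for the hardware transfer, and the function name and buffer size are assumptions for illustration only, not the PPU's actual programming interface.

```c
#include <string.h>
#include <stddef.h>

#define IEM_WORDS 1024                 /* assumed per-bank working-set size */
static float iem_bank[IEM_WORDS];      /* stands in for one IEM bank */

void process_block(float *external, size_t offset, size_t count)
{
    if (count > IEM_WORDS) count = IEM_WORDS;

    /* "DME" stage-in: explicit and program-controlled, no automatic cache fill */
    memcpy(iem_bank, external + offset, count * sizeof(float));

    /* "VPU" work on the low-latency local copy */
    for (size_t i = 0; i < count; ++i)
        iem_bank[i] *= 0.99f;          /* placeholder computation */

    /* explicit stage-out back to external memory */
    memcpy(external + offset, iem_bank, count * sizeof(float));
}
```

The design trade-off is that the programmer (or driver) must know the working set in advance, which the article argues is acceptable for physics because the data set for each simulation step can be planned explicitly.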
"One of the important factors in a physics architecture is it requires huge on-chip memory bandwidth. Our PPU has 2Tb(Tera-bit)/sec on-chip memory bandwidth," says Hedge.
In short, removing complicated cache control made it possible that PPU has L2-cache size internal memory with L1-cache latency and huge bandwidth, and it's suitable for physics algorithms according to them.
Cell-like Global Structure of PPU
From this abstract view of the PPU architecture, you will immediately notice the commonality with Cell. Both are parallel processors with large floating-point processing resources, have no cache hierarchy, and manage data transfers between memories in software.
If you map the PPU Control Engine (PCE), the RISC core in the PPU, to the PPE (Power Processor Element), the PowerPC core in Cell, and the Vector Processing Engine (VPE), the PPU's data processing engine, to the SPE (Synergistic Processor Element), Cell's data processor, the two almost correspond to each other.
In both architectures, one RISC core handles global control while many vector data processors process data in parallel. On the similarity between the architectures, Hegde said:
"If you look at the very high level they are very alike. Both of them are huge parallel engines, have floating point processing units, and control each internal memory. But the difference is also big. For example, Cell does internal data transfer by a ring bus (so it has limited bandwidth). On the other hand, our architecture has far higher (internal data) bandwidth.
But it's also true that Cell is relatively suitable architecture for physics processing. In PS3, the GPU is "GeForce 7900+" architecture, but it has Cell. So in PS3 it can do physics in a PPU-like architecture (Cell), not in GPU."
Looking at the PPU architecture like this, you can imagine AGEIA has a relatively good affinity in PS3 library development.
The current PPU has a transistor count of 125M and is manufactured on TSMC's 0.13µm process. In chip size it is GeForce FX 5800 (NV30) class, and the process is the same; the die is about 182 mm², a bit smaller than NV30. It's not far from reality to think of it as an NV30-class chip dedicated to physics.
Comparing it with GPUs also lets you imagine the VPU configuration. A GPU in the 120M-transistor class has 6-12 programmable shaders. Unlike a GPU, the PPU doesn't handle textures and the like, so its processing units should be simpler. It is therefore estimated that the current AGEIA PPU has at most 16 VPUs.