How to go fast without speculating… Maybe. Perhaps.

Kunle Olukotun didn’t like systems that wasted their time stalled on loads and branches. He and his team at Afara Websystems therefore designed a SPARC processor that did work without waits. It became the Sun T1.

Speed without speculating

The basic idea is to have more decoders than ALUs, so you can have lots of threads competing for an ALU. If, for example, thread 0 comes to a load, it will stall, so on the next instruction thread 1 gets the ALU and runs… until it stalls and thread 2 gets the ALU. Ditto for thread 3, and then control goes back to thread 0, which has completed its multi-cycle fetch from cache and is ready to proceed once more.

That is the basic idea of the Sun T-series processors, and it works without speculation.
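
To make that concrete, here is a minimal sketch of the switch-on-stall, round-robin scheduling in Python. The thread mixes, the three-cycle cache latency, and the single shared ALU are all invented for illustration; the real T1 pipeline is more involved.

    # Barrel-style scheduling: one ALU, several threads, and a switch away
    # from any thread the moment it stalls on a load. All latencies and
    # instruction mixes are illustrative, not the T1's actual numbers.
    from collections import deque

    CACHE_LATENCY = 3  # pretend a cache hit takes 3 cycles to return

    def run(threads, cycles):
        """threads: one instruction list each; 'load' stalls, 'alu' doesn't."""
        ready = deque(range(len(threads)))   # threads able to issue
        stalled = {}                         # thread id -> cycles left to wait
        pc = [0] * len(threads)
        for cycle in range(cycles):
            for t in list(stalled):          # wake threads whose loads finished
                stalled[t] -= 1
                if stalled[t] == 0:
                    del stalled[t]
                    ready.append(t)
            if not ready:                    # only if *every* thread is stalled
                print(f"cycle {cycle}: ALU idle")
                continue
            t = ready.popleft()              # round-robin pick
            op = threads[t][pc[t] % len(threads[t])]
            pc[t] += 1
            print(f"cycle {cycle}: thread {t} issues {op}")
            if op == "load":
                stalled[t] = CACHE_LATENCY   # stall; don't requeue yet
            else:
                ready.append(t)              # back of the queue

    run([["alu", "load"], ["alu", "alu"], ["load", "alu"], ["alu", "alu"]], 12)

Run it and the trace shows the ALU issuing nearly every cycle even while individual threads sit out their cache latency.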

The strength is that the ALUs are never waiting for work. The weakness is that individual threads still have to wait for data to come from cache.

You can improve on that

Now imagine it isn’t entire ALUs that are the available resources, but individual ALU components, like adders. Now the scenario becomes

  • thread 0 stalls
  • thread 1 gets an adder
  • thread 2 gets a compare (really a subtracter)
  • thread 3 gets a branch unit, and will probably need to wait in the next cycle
  • thread 4 gets an adder
  • thread 5 gets an FPU

… and so on. Each cycle, the hardware assigns as many ALU components as it has available to threads, all of which can run. Only the stalled threads are waiting, and they don’t need ALU bits to do that.
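
As a sketch of that per-cycle allocation (the unit counts and the opcode-to-unit mapping below are invented, not taken from any real design):

    # One cycle of handing individual ALU components to ready threads.
    # Unit counts and the opcode -> unit mapping are made up for illustration.
    UNITS = {"adder": 2, "subtracter": 1, "branch": 1, "fpu": 1}
    NEEDS = {"add": "adder", "cmp": "subtracter", "beq": "branch", "fmul": "fpu"}

    def issue_cycle(ready_threads):
        """ready_threads: (thread_id, opcode) pairs for non-stalled threads.
        Returns those that got a unit this cycle; the rest simply wait."""
        free = dict(UNITS)
        issued = []
        for tid, op in ready_threads:
            unit = NEEDS.get(op)
            if unit and free[unit] > 0:
                free[unit] -= 1
                issued.append((tid, op, unit))
        return issued

    # Thread 0 is stalled on its load, so it isn't in the ready list.
    print(issue_cycle([(1, "add"), (2, "cmp"), (3, "beq"), (4, "add"), (5, "fmul")]))
    # All five issue: two adders, the subtracter, the branch unit, the FPU.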

Now more threads can run at the same time, the ALU components are (probabilistically) all busy, and we have increased capacity. But individual threads are still waiting for cache.

Do I feel lucky?

In principle, we could allocate two adders to thread 5, one doing the current instruction and another doing a subsequent, non-dependent instruction. It’s not speculative, but it is out-of-order. That makes some threads twice as fast when doing non-interacting calculations. Allocate it three adders and it’s three times as fast.

If we’re prepared to have more ALU components than decoders, and to decode deeply enough that we are likely to find lots of non-dependent instructions, then we can be executing multiple instructions at once in multiple streams, and probabilistically get startlingly better performance.
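
Here is a sketch of how a decode window might be mined for such a group, checking register hazards so that nothing issued is speculative. The three-register instruction format and the hazard rules are simplified assumptions, not any particular machine’s logic.

    # Pick up to max_units mutually independent instructions from one
    # thread's decode window. Instruction format: (dest, src1, src2).
    def independent_group(window, max_units=2):
        group, written, pending_reads = [], set(), set()
        for i, (dest, s1, s2) in enumerate(window):
            hazard = (s1 in written or s2 in written   # RAW: source not ready yet
                      or dest in written               # WAW: earlier write pending
                      or dest in pending_reads)        # WAR: earlier read pending
            if not hazard:
                group.append(i)
                if len(group) == max_units:
                    break
            else:
                pending_reads.update((s1, s2))         # a skipped instruction
                                                       # still reads these later
            written.add(dest)
        return group

    window = [("r1", "r2", "r3"),   # r1 = r2 + r3
              ("r4", "r1", "r5"),   # reads r1: must wait for the first
              ("r6", "r7", "r8")]   # independent: can share the cycle
    print(independent_group(window))  # -> [0, 2]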

I can see a new kind of optimizing compiler, too: one which tries to group non-dependent instructions together, to allow more out-of-order operations.
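
A toy version of such a pass might simply bucket instructions by dependence depth, so each bucket holds mutually independent work the hardware could issue together. This is a sketch under simplified, SSA-like assumptions (each register written once), not any real compiler’s algorithm.

    # Group instructions by dependence depth; everything in one bucket is
    # mutually independent. Format: (dest, src1, src2), each dest unique.
    def group_by_depth(instrs):
        depth_of = {}                    # register -> depth of its producer
        buckets = []
        for i, (dest, s1, s2) in enumerate(instrs):
            d = 1 + max(depth_of.get(s1, -1), depth_of.get(s2, -1))
            depth_of[dest] = d
            while len(buckets) <= d:
                buckets.append([])
            buckets[d].append(i)
        return buckets

    prog = [("r1", "a", "b"),    # depth 0
            ("r2", "c", "d"),    # depth 0: independent of the first
            ("r3", "r1", "r2"),  # depth 1: needs both results
            ("r4", "e", "f")]    # depth 0
    print(group_by_depth(prog))  # -> [[0, 1, 3], [2]]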

 

2 thoughts on “How to go fast without speculating… Maybe. Perhaps.”

  1. I’ve always thought that the Niagara design was onto something. The Xerox PARC Alto did something like this too, I think. And the Tera design, if I remember correctly.

    I don’t think that splitting an adder from a subtracter makes sense, since they share so much logic. It’s almost as cheap to make two adder/subtracters as an adder and a subtracter.

    The internal paths for data (let’s call them buses, but that’s not quite correct) are apparently often more expensive than the logic units.

    It’s all non-intuitive to a programmer.

    I’ve thought for more than 40 years that something like functional programming yields a program that is more amenable to “instruction level parallelism”. But I’ve never done anything about that. Optimizing compilers have gone that way with SSA.

