2 – Limits on parallel execution

We learn programming by being introduced to the simplest concept of an algorithm: an ordered set of actions that, when executed in sequential order, solves a problem by doing the calculations related to each action. The key point here is that some of us fix that sequential order idea and never forget it. However, not all action in an algorithm have to executed in the given order to achieve the same result.

Some operations are not related to each other (reading and writing different variables), which make it possible to execute them in any relative order, or even at the same time. Every program (and related problem) has some inherent parallelism. But how further can we go in exploiting it?

2.1 – Amdahl’s law

Gene Amdahl [1] formulated a now classic result over how much performance one can gain by exploiting all the parallelism existent in a given algorithm.  We start by accounting for all the instruction of a program in two groups: P the possibly-parallel ones; and S the sequential part. Even if we had an unlimited number of processors, there is a limit on how much speedup (how fast the parallel version is compared to the sequential one) we can achieve. This limit is expressed by the following formulae, with N being the number of available cores:

speedup = (S + P)/(S + P/N)

Notice that even with an infinite number of cores, the maximum speedup of a program that is half sequential will be at most 2. So, how would behave the same program in a recent dual or quad-core machine? Given that the programmer insert all the multithreading spice to it (more on this later), the maximum speedup would be in a quad-core (remember that in the example S = P):

speedup = (S + P)/(S + P/4) -> (S + S)/(S + S/4) -> 2*S/( (5*S)/4 ) -> 8/5

See that even in a quad-core machine, as expected, the speedup will be less than 2. Also remember that we are not considering any overhead or other issues in parallelizing the P instructions. This limit is sometimes depressing but it shows that parallelism is not easy, even if you do know what to do.

2.2 – Gustafson’s law

But the previous results do not say it is impossible to do more work with more cores. Amdahl’s law just accounts for the speedup for a specific program implementation. We still can do a completely different program, or solve different problems. This is what John Gustafson exposed in his 1988 paper [2] and was also explained by Sutter [3], where he proposes other aproaches such as:

  • Use more data in the parallel parts, solving a bigger problem (Gustafson’s approach);
  • Do different work that is parallel such as solving problems that were not possible before (aka real-time ray tracing).

What is being said is that the available power of the extra cores can (and should) be used. A key thing is that this power will not come for free, and not all programs can be refactored to use it. In the next section we’ll be looking at how the hardware has been engineered to implicitly exploit (or expose) paralellism.

3 – Practical issues

The recent trend in multicore systems is not based on new parallel models. It is instead a convergence of decades of research in the field where several different approaches were experimented.From a programmers point of view, two concerns are important: data and functions, so one can classify the available parallelism expressed in programs based on these two aspects [4]:

  • Functional parallelism – expressed by the divide and conquer approach of most programming paradigms. It is usually small, sparce and irregular;
  • Data parallelism – comes from data-structures that allow for the parallel manipulation on them. It is usually regular and massive.

However, not all available parallelism can be exploited in practice, so one must account for the utilized parallelism at runtime. Part of the available functional parallelism can be exploited implicitelly by techniques that try to squeeze the most of each piece of (expressed) sequential code. Most of these techniques can be though of as ILP (instruction level parallelism). ILP accounts fine grained, often machine level, instructions that do not have a mandatory sequential order to execute. The good news is that (good) compilers and also hardware are capable of tackling most of these cases succesfuly. The bad news is that this form of parallelism does not allow for really noticeable speedups.

The bread and butter really comes from coarse grained functional (programm or thread level) and data parallelism. Since this forms or parallelism are explicit (the programmer need to express it directly) one will have to learn about memory models, plataforms and libraries to describe data structures and algorithms in parallel fashion.This is where our jorney into parallel architectures really begins. We start out by classifying types of machines based on how they expose data and instructions to the programmer using a (somehow) well accepted taxonomy given by Flynn in 1966 (taken from a parallel programming instructional tutorial [5]):

  • SISD – single instruction with single data is the basic model for sequential machines;
  • SIMD – single instruction and multiple data. Processors usually execute the same instructions in lock-step. This the model for vector machines and processor arrays. Until recently, graphic processors (GPU) could be classified in this category;
  • MISD – multiple instruction, single data. This model is of little use and few machines were ever built around it. A possible application would be to apply different criptoanalysis algorithms over the same message (data)
  • MIMD – multiple instruction, multiple data. This model acconts for almost all parallel machines nowadays, such as SMP (simetric multiprocessors), multicores (kind of SMP), massive parallel machines (most clusters), GRIDs, and modern GPUs;

Again, from the programmers point of view, there is another layer on top of this one that have to be taken into consideration: the memory model. Given that we have more than one processor and some ammount of available memory, the memory model accounts for how much of it is directly accessible from each processor. Shared memory architectures allow for all processors to access all memory addresses directly. Still, this access can be uniform (UMA – uniform memory access), where there is no latency or bandwidth difference for each processor. In this category are SMP and multicore machines. Non uniform access (NUMA) provide a direct addressing for the whole memory but latency and bandwidth vary depending on data locality, such as CPU accessing its own RAM and VideoRAM from the GPU. Figure 1 shows an ilustration of this pattern.

Shared memory (taken from [5])

Figure 1: Shared memory (taken from [5])

Distributed memory models are all those architectures where there is no direct addressing for all available memory. Most clusters and GRIDs are into this category. Figure 2 ilustrates this model.

Distributed memory (taken from [5])

Figure 2: Distributed memory (taken from [5])

Some real world scenarios, however, are not strictly into only one of these models. Most clusters now use SMP or multicore machines, ending up as a hybrid architecture. Also the IBM Cell Processor (Sony Playstation 3) has one main processor that addresses all available memory in NUMA pattern, while each synergistic processor accesses only its own memory (distributed model). Figure 3 shows this practical scenario.

Hybrid memory access (taken from [5])

Figure 3: Hybrid memory (taken from [5])

It is also important to mention that while some software platform (as we will see in the next part) hide these architecture details from the programmer (such as allowing for global addressing in distributed architecture, or message passing in shared memory), knowing the instrisics of the hardware platform is still mandatory if one wants to squeeze the most performance out of it.

In the next part we will be discussing programming models and languages to exploit exposed parallelism. Some subjective propositions will be put on team and individual instruction approaches to learning parallelism.

References

[1] Gene Amdahl. “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities” (AFIPS Conference Proceedings, Vol. 30, AFIPS Press, April 1967).

[2] John Gustafson. “Reevaluating Amdahl’s Law” (Communications of the ACM, 31(5), 1988).

[3] Herb Sutter. “Break Amdahl’s law! – here is a law you should break early and often” (Dr. Dobb’s Journal, 2008).

[4] Eugene Vinod Rebello. “Parallel architecture lecture notes” (IC-UFF, 2008).

[5] Blaise Barney. “Introduction to parallel computing” (HPC – Lawrence Livermore National Laboratory, 2007).

It’s been a while since my last post. I hope this July will be plenty of new ones. I’ll start by talking about a subject that’s been in every game developer around (at least the programmers): parallel programming!

This text will be span across several posts. Since its contents are also part of a college homework, there will be a continuation for sure. Today I’m going to talk about the motivations behind the need for parallel programming, both in software and hardware.

1 – Introduction

The mainstream computer hardware industry has recently moved to a race towards parallelism. The reasons behind this move are mainly related to the inability to dissipate the increasing heat generated inside processor cores with growing clock rates. At the same time, there is the synchronization problem between processor components. At very high speeds, it is difficult to guarantee that data sent through paths of approximately the same size arrive together at the destination.

This “new” computing trend is being part of a market hype, and the media is full of nice stories about the benefits of multicore CPUs and virtualization techniques. What is not being told is that the life of programmers and systems designers has already changed. From enterprise developers to game programmers, everybody will have to face new paradigms to still develop software that takes advantage of new hardware. Herb Sutter [1] has called this the concurrent revolution and said that “the free lunch is over”, meaning that no software will get better performance by just upgrading the hardware anymore.

You could think that I will now start a Java tutorial on how to use multithreading to exploit multicore parallelism with JMonkeyEngine, or how a specific game engine used a fancy technique to gain more performance. It is not that simple. Parallelism is an old research subject and even today its bads and goods are not fully understanded. So, I’ll instead start a voyage through the concepts and techniques used to exploit the maximum performance out of the problem being solved. That is it: instead of the software, sometimes is better to re-think the whole problem and its solution.

The next part will explain some concepts and rules about the parallelism that is achievable in software. Part III will explain how hardware (and also compilers) have been developed to squeeze the maximum performance out of (quasi) sequential software and how much room has been left to the developer (this is the new trend). Part IV will analyse current mainstream programming paradigm and platforms (named object oriented programming) against this new trend and instigate the reader to explore other languages and paradigms. I hope this text will be a good reading for those interested in the subject.

References

[1] – Sutter, Herb. A Fundamental Turn Toward Concurrency in Software. In Dr. Dobbs Journal. March 1, 2005.

Hello new sertao3D!

June 15, 2008

This is my new blog… Just moved from blogger. In a few days I’ll be publishing a post about game music and another one about tile based flash games.