4 – Programming Models

Given that current architecture trends already converged to MIMD machines with shared (SMP and multicores) or hybrid (GRIDs and clusters) memory models, how does this affect the way programmers do their work? Until the beggining of this century, all this concepts and concerns were of little interest for the mainstream enterprise programmer (code monkey). This lack of interest is mostly related to the fact that if the software was slow, in the next couple of years new technology should arise to enable it to run smoothly.

There is a common misconception about the well known Moore’s law, who envisioned that every 18 months, the transistor count of a given core would double. This “rule” is still valid, but the issue is that these new transistors are (more and more) necessary to deal with the increasing complexity of the processors. And also, higher clock rates increase the power consumption and, consequently, the generated heat. These issues have forced the mainstream CPU industry to expose coarse grained parallelism (current multicore systems). Until recently, these mainstream CPUs only explored ILP (instruction level) in their superscalar design. In this part we explain the programming models that are available (today) to deal with this coarse grained parallelism.

We will consider three parallel programming models, namely:

  • Threads
  • Message passing
  • Data-parallel

These models are abstractions over CPU and memory architectures discussed so far and are not tied to any of them. However, it is important to remember basic assumptions in order to be able to get the most out of it.

4.1 – Threads

Threads are (almost) independent execution paths inside a program. It is common to relate them with subroutines inside a given algorithm. Given two subroutines that are not dependent to one another (operate over different data), we can safely run them in parallel. Threads share the same address space, so one have to be extra careful when assuming that they are independent. This is also one of the strengths of the model, since this shared view of memory makes the communication between threads easier (global variables).

Even in the case where two threads operate (partially) over the same data, it is possible to benefit from parallel execution in those parts that are not accessing this shared state. It is known that race conditions, situations in which two or more threads are operating over the same data (communication), can lead to unpredictible results. To avoid them, several hardware and software constructs (TSL instructions, semaphores, monitors, locks) have been developed over the last decades. Even though, the fact that these constructs are not as easy to grasp as sequential algorithms (also the lack of a standard model) contributes to little adoption of these models besides very small groups (several software development teams make good use of threads, but not mainstream code-monkeys).

4.2 – Message passing

The message passing model provides for a single and well accepted construct for communication between distributed units of work. These units (processes) can be spread over a cluster, GRID or even one single node. To communicate, processes use two primitives: send and receive. Each send dispatched by one process instance must have a matching receive in order to complete.

To the programmer, message passing is exposed by a set of library functions for the most usual programming languages (such as Fortran, C or Java). In recent years, the most common of these libraries has become MPI (message passing interface). MPI (and to some extent message passing) is well accepted by the high performance computing (HPC) community but is relatively unknown outside these research groups.

4.3 – Data parallel

Data parallel accounts for domains where data (information) is provided as regular sets with considerable size. Tasks operate (depending on the hardware architecture) concurrently on the smae data set, but over different partitions. Notice that on shared memory architectures, the whole data set is accessible to all task, while in distributed memory the partitions can be spread over the nodes (see that the programming model is suitable for different architectures).

In practice, not all problems (domains) map well to this model, but tho one that do are very importante, such as weather simulation, financial analysis, computer graphics, among others. It is also interesting to notice that in recent years the computer games market lead to the development of specialized SIMD (and later MIMD) hardware (GPUs) whose programming model is data parallel. This hardware is now suitable for general purpose computing.

Is interesting to notice that these models can be used at the same time in hybrid fashion. For instance, one can use distributed MPI processes, each one with multiple threads manipulating data-parallel GPUs to solve one single (and huge) problem. However, all these performance-wise wonders seem to be distance from the mainstream programmer. Lets now take a look at how we can solve problems with parallel thinking.

5 – Parallel problem solving

Ideally, we would create sequential algorithms and tools would identify oportunities for parallelism and reorganize the code acordingly, making it benefit from the new architectures. This is the approach of hardware that provides ILP (instruction level parallelism) and some (good) compilers. However, there are very important caveats:

  • Depending on the case, performance could even degrade;
  • Limited to a small subset of code (fine grained – close instruction reordering and/or loop unrolling);
  • Much less flexible than manual parallelization.

There is no substitute for a good understanding of the domain yet, specially for problems that can benefit from coarse grained parallelism such as data-parallel or independent chunks of code/data communicatin with message passing. The first step is always to examinate the problem and the current algorithm used to solve it, paying spetial attention to possible data and functional decomposition.

Certain problems allow data decomposition, where the information can be split between discrete “data-chunks” that are independent and suitable for parallel processing in SIMD fashion. Figure 4 shows a skematic representation of this kind of decomposition.

Domain decomposition

Figure 4: Domain (data) decomposition (taken from [5])

Functional decomposition accounts for partitioning the tasks that need to be acomplished to solve the problem. The gains come from subtasks that are independent (operate over different data) and those that depend on data-streams from other subtasks, both relying on MIMD architecture to acomplish parallelism.

The later case is related to the producer-consumer classic syncronization problem, and is parallelizable in two ways: a) while the first task is operating over the second data-element of the stream, the second task is already working over the first element of the stream. It is also important to see that if each different task operates over independent chunks of the stream, they can be replicated (SIMD – several copies of the same task over multiple chunks of data), creating a hybrid approach. Figure 5 shows a pipeline example (not hybrid).

Pipelined tasks over data-stream

Figure 5: Pipelined tasks over data-stream

These decomposition approaches are imagined and developed at high abstraction layers but their implementation still relies on the programming models and related libraries such as threads (POSIX Threads, Java Threads), message passing (PVM, MPI) or data-parallel models (High Performance Fortran, GPGPU techniques).

One last note on this topic about programming models to discuss about a recent trend with GPGPU programming. Graphics hardware company NVidia has recently released a library (<a href=”http://cuda.nvidia.com”>CUDA</a>) to ease the task of developing parallel programs to run on its hardware (AMD/ATI also released similar library). The interesting thing about the CUDA library is that, although the hardware architecture and memory model are better suitable to a data-parallel approach, NVidia chose to base the library constructs and terminology upon the concept of threads. The reasons behind this may be related to the fact that the data-parallel model is even more obscure (for the mainstream programmer) than threads.

6 – Final thoughts

It’s been already 6 years since the last “big” leap in raw clock increase in the mainstream CPU market and most software developers are still not aware of this. It is also interesting to notice how the average customer did not account for the fact that his brand new PC is not faster than the old one (sometimes it is, just based on the fact that the old one was infested with malware – and as the new one will also become, its multicores will take care of the load).

But the market still pushes for more complex computing, specially in the enterprise and scientific research fields. The new generation of programmers have a (now even more important) choice: to think parallel or not to think parallel. Several enterprise software categories are already benefiting from multicore processors, such as databases, web servers and virtualized operating systems. But they were already production-grade multithreaded software before this recent trend (mostly because of isolation and SMP architectures).

Those who develop for the web or run virtualized data-centers may be saying that for them there is no need to learn new paradigms, but maybe they are just missing the point. Multicore systems are enabling them to do the same work as they did before with less power consumption per server but this is just more of the same. Interesting questions that came to my mind:

  • What can I do now that I could not do before?
  • Who else (competition) is thinking about this?
  • Am I risking to loose customers to these (parallel) people?

As a last remark, I’ll leave other interesting questions (and further readings) for the reader. These are related to language and tools to ease (with some overhead, but who said garbage collection is/was rubish?) the burden on parallel programming:

  • Will functional languages with immutable objects (evidences here for *Lisp and Erlang) replace current procedural and object oriented ones? (operations over immutable objects are inherently parallelizable)
  • What about specialized tools (with accompaining restrictive design patterns) such as Google’s MapReuce framework (or its opensource incarnation Apache Hadoop)?

I hope you enjoyed the (kind of long) reading and that it was able to fire at least some curiosity light on the subject.

2 – Limits on parallel execution

We learn programming by being introduced to the simplest concept of an algorithm: an ordered set of actions that, when executed in sequential order, solves a problem by doing the calculations related to each action. The key point here is that some of us fix that sequential order idea and never forget it. However, not all action in an algorithm have to executed in the given order to achieve the same result.

Some operations are not related to each other (reading and writing different variables), which make it possible to execute them in any relative order, or even at the same time. Every program (and related problem) has some inherent parallelism. But how further can we go in exploiting it?

2.1 – Amdahl’s law

Gene Amdahl [1] formulated a now classic result over how much performance one can gain by exploiting all the parallelism existent in a given algorithm.  We start by accounting for all the instruction of a program in two groups: P the possibly-parallel ones; and S the sequential part. Even if we had an unlimited number of processors, there is a limit on how much speedup (how fast the parallel version is compared to the sequential one) we can achieve. This limit is expressed by the following formulae, with N being the number of available cores:

speedup = (S + P)/(S + P/N)

Notice that even with an infinite number of cores, the maximum speedup of a program that is half sequential will be at most 2. So, how would behave the same program in a recent dual or quad-core machine? Given that the programmer insert all the multithreading spice to it (more on this later), the maximum speedup would be in a quad-core (remember that in the example S = P):

speedup = (S + P)/(S + P/4) -> (S + S)/(S + S/4) -> 2*S/( (5*S)/4 ) -> 8/5

See that even in a quad-core machine, as expected, the speedup will be less than 2. Also remember that we are not considering any overhead or other issues in parallelizing the P instructions. This limit is sometimes depressing but it shows that parallelism is not easy, even if you do know what to do.

2.2 – Gustafson’s law

But the previous results do not say it is impossible to do more work with more cores. Amdahl’s law just accounts for the speedup for a specific program implementation. We still can do a completely different program, or solve different problems. This is what John Gustafson exposed in his 1988 paper [2] and was also explained by Sutter [3], where he proposes other aproaches such as:

  • Use more data in the parallel parts, solving a bigger problem (Gustafson’s approach);
  • Do different work that is parallel such as solving problems that were not possible before (aka real-time ray tracing).

What is being said is that the available power of the extra cores can (and should) be used. A key thing is that this power will not come for free, and not all programs can be refactored to use it. In the next section we’ll be looking at how the hardware has been engineered to implicitly exploit (or expose) paralellism.

3 – Practical issues

The recent trend in multicore systems is not based on new parallel models. It is instead a convergence of decades of research in the field where several different approaches were experimented.From a programmers point of view, two concerns are important: data and functions, so one can classify the available parallelism expressed in programs based on these two aspects [4]:

  • Functional parallelism – expressed by the divide and conquer approach of most programming paradigms. It is usually small, sparce and irregular;
  • Data parallelism – comes from data-structures that allow for the parallel manipulation on them. It is usually regular and massive.

However, not all available parallelism can be exploited in practice, so one must account for the utilized parallelism at runtime. Part of the available functional parallelism can be exploited implicitelly by techniques that try to squeeze the most of each piece of (expressed) sequential code. Most of these techniques can be though of as ILP (instruction level parallelism). ILP accounts fine grained, often machine level, instructions that do not have a mandatory sequential order to execute. The good news is that (good) compilers and also hardware are capable of tackling most of these cases succesfuly. The bad news is that this form of parallelism does not allow for really noticeable speedups.

The bread and butter really comes from coarse grained functional (programm or thread level) and data parallelism. Since this forms or parallelism are explicit (the programmer need to express it directly) one will have to learn about memory models, plataforms and libraries to describe data structures and algorithms in parallel fashion.This is where our jorney into parallel architectures really begins. We start out by classifying types of machines based on how they expose data and instructions to the programmer using a (somehow) well accepted taxonomy given by Flynn in 1966 (taken from a parallel programming instructional tutorial [5]):

  • SISD – single instruction with single data is the basic model for sequential machines;
  • SIMD – single instruction and multiple data. Processors usually execute the same instructions in lock-step. This the model for vector machines and processor arrays. Until recently, graphic processors (GPU) could be classified in this category;
  • MISD – multiple instruction, single data. This model is of little use and few machines were ever built around it. A possible application would be to apply different criptoanalysis algorithms over the same message (data)
  • MIMD – multiple instruction, multiple data. This model acconts for almost all parallel machines nowadays, such as SMP (simetric multiprocessors), multicores (kind of SMP), massive parallel machines (most clusters), GRIDs, and modern GPUs;

Again, from the programmers point of view, there is another layer on top of this one that have to be taken into consideration: the memory model. Given that we have more than one processor and some ammount of available memory, the memory model accounts for how much of it is directly accessible from each processor. Shared memory architectures allow for all processors to access all memory addresses directly. Still, this access can be uniform (UMA – uniform memory access), where there is no latency or bandwidth difference for each processor. In this category are SMP and multicore machines. Non uniform access (NUMA) provide a direct addressing for the whole memory but latency and bandwidth vary depending on data locality, such as CPU accessing its own RAM and VideoRAM from the GPU. Figure 1 shows an ilustration of this pattern.

Shared memory (taken from [5])

Figure 1: Shared memory (taken from [5])

Distributed memory models are all those architectures where there is no direct addressing for all available memory. Most clusters and GRIDs are into this category. Figure 2 ilustrates this model.

Distributed memory (taken from [5])

Figure 2: Distributed memory (taken from [5])

Some real world scenarios, however, are not strictly into only one of these models. Most clusters now use SMP or multicore machines, ending up as a hybrid architecture. Also the IBM Cell Processor (Sony Playstation 3) has one main processor that addresses all available memory in NUMA pattern, while each synergistic processor accesses only its own memory (distributed model). Figure 3 shows this practical scenario.

Hybrid memory access (taken from [5])

Figure 3: Hybrid memory (taken from [5])

It is also important to mention that while some software platform (as we will see in the next part) hide these architecture details from the programmer (such as allowing for global addressing in distributed architecture, or message passing in shared memory), knowing the instrisics of the hardware platform is still mandatory if one wants to squeeze the most performance out of it.

In the next part we will be discussing programming models and languages to exploit exposed parallelism. Some subjective propositions will be put on team and individual instruction approaches to learning parallelism.

References

[1] Gene Amdahl. “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities” (AFIPS Conference Proceedings, Vol. 30, AFIPS Press, April 1967).

[2] John Gustafson. “Reevaluating Amdahl’s Law” (Communications of the ACM, 31(5), 1988).

[3] Herb Sutter. “Break Amdahl’s law! – here is a law you should break early and often” (Dr. Dobb’s Journal, 2008).

[4] Eugene Vinod Rebello. “Parallel architecture lecture notes” (IC-UFF, 2008).

[5] Blaise Barney. “Introduction to parallel computing” (HPC – Lawrence Livermore National Laboratory, 2007).

It’s been a while since my last post. I hope this July will be plenty of new ones. I’ll start by talking about a subject that’s been in every game developer around (at least the programmers): parallel programming!

This text will be span across several posts. Since its contents are also part of a college homework, there will be a continuation for sure. Today I’m going to talk about the motivations behind the need for parallel programming, both in software and hardware.

1 – Introduction

The mainstream computer hardware industry has recently moved to a race towards parallelism. The reasons behind this move are mainly related to the inability to dissipate the increasing heat generated inside processor cores with growing clock rates. At the same time, there is the synchronization problem between processor components. At very high speeds, it is difficult to guarantee that data sent through paths of approximately the same size arrive together at the destination.

This “new” computing trend is being part of a market hype, and the media is full of nice stories about the benefits of multicore CPUs and virtualization techniques. What is not being told is that the life of programmers and systems designers has already changed. From enterprise developers to game programmers, everybody will have to face new paradigms to still develop software that takes advantage of new hardware. Herb Sutter [1] has called this the concurrent revolution and said that “the free lunch is over”, meaning that no software will get better performance by just upgrading the hardware anymore.

You could think that I will now start a Java tutorial on how to use multithreading to exploit multicore parallelism with JMonkeyEngine, or how a specific game engine used a fancy technique to gain more performance. It is not that simple. Parallelism is an old research subject and even today its bads and goods are not fully understanded. So, I’ll instead start a voyage through the concepts and techniques used to exploit the maximum performance out of the problem being solved. That is it: instead of the software, sometimes is better to re-think the whole problem and its solution.

The next part will explain some concepts and rules about the parallelism that is achievable in software. Part III will explain how hardware (and also compilers) have been developed to squeeze the maximum performance out of (quasi) sequential software and how much room has been left to the developer (this is the new trend). Part IV will analyse current mainstream programming paradigm and platforms (named object oriented programming) against this new trend and instigate the reader to explore other languages and paradigms. I hope this text will be a good reading for those interested in the subject.

References

[1] – Sutter, Herb. A Fundamental Turn Toward Concurrency in Software. In Dr. Dobbs Journal. March 1, 2005.

Games and OOP

Only a few domains map to a programming paradigm so directly as computer games and object orientation. A Game class matches the concept of a virtual world where several different instances of a GameObject class reside. The simulation usually consists of a loop inside the Game class with three main goals: collect and interpret user or network input; update each GameObject instance based on input and/or a simulation step; draw visible game objects on the output device.

Inheritance vs. Composition in game objects (Deja-vu):

To represent different types of simulated objects, the programmer usually create subclasses of GameObject. There are problems with the inheritance approach since objects from different hierarquies sometimes have common functionality and characteristics. This is a common issue in object oriented software design and replacing inheritance by composition is strongly recomended [check previous posts]. The GameObject class then consists only of an unique identifier, some common attributes all game objects have such as position and orientation and, most important, a collection of components. Each class implementing the GameComponent interface represents a different aspect/behavior of a GameObject such as visuals, physics, AI or health, depending on game being implemented. These components are highly flexible and easier to maintain while also maximize code reuse with many of them being useful in several different game genres.

What is the problem with inter-dependent components?

Composition also has its drawbacks, mostly when there’s some coupling between components. For example, an AI component commonly depends on the existance of a health component to decide actions to take and also on a physics component to apply movements to. These dependencies are normally solved directly by the programmer, as shown in the following Java code, part of an AIComponent class, adapted from C++ example Cris Stoy put in his excelent article in Game Programmin Gems 6:

1 public void update() {

2 GameObject owner = getOwner();

3 HealthComponent health = owner.getComponent(HealthComponent.class);

4 if (health != null) {

5 // take appropriate actions

6 }

7 }

Aparently there’s nothing wrong with this code, but a closer inspection shows that:

a) lines 2-4 have nothing to do with the expected game logic of an AI update, being just boilerplate code;

b) if no instance of HealthComponent is initialized for this particular GameObject, no proper AI action (line 5) will ever be taken, making it hard to debug.

How to circunvent that?

What I think is a good idea is the use of dependency injection [check Martin Fowler's article] to automatically take care of the coupling between game components and safe initialization. The programmer will not need to manually check for dependencies or their nullability. It is easier to concentrate on the game logic to implement by replacing the above code with this:

1 @Inject(nullable=false)
2 private HealthComponent health;
3 public void update() {
4 // take appropriate actions
5 }

The above code is smaller, considerably cleaner and also safer since our framework will stop initialization and show an error log if there is no HealthComponent initialized for the GameObject that owns this AI component. With this approach we believe one can write better code and achieve faster prototyping which is always desired in game production pipelines.

In the next post more details about how GCore uses dependency injection to safely initialize game objects and its components…