New Demo Version - 0.1.9

September 3, 2008 by erickpassos

New demo version, adding the following features:

  • Life system (finish 3rd or up to continue, no more lives -> game over);
  • Much improved GUI with nice buttons and messages;
  • Improved graphics and displacement of obstacles;
  • Powerups now ON;
  • Difficulty Levels (need balance, but already there);

The link is the same, from the graphic designr website:

Sperm Racer - V0.1.9

Here, some screenshots:

New main menu - difficulty levels

New main menu - difficulty levelsPowerup pickers

Powerup pickers

Powerup pickers

Finally the demo…

September 1, 2008 by erickpassos

Here is the link for the first playable demo of Sperm Racer:

http://www.pedrothiago.com/spermracer.html

Please wait a bit before clicking on the start button, the fisrt level takes some time to load (I’ll fix this on the next release).

Plus, we have some new artwork of the powerups and current screenshots:

Sperm Racer Powerups

Sperm Racer Powerups

Main Menu background

Main Menu background

Development screenshot form today

Development screenshot form today

Sperm Racer

August 31, 2008 by erickpassos

No demo yet, only on monday,

But here is a new screenshot:

New GUI

New GUI

Sperm Racer - Screenshot and development update

August 27, 2008 by erickpassos

Good news everyone!

Sperm racer development is going well and has nearly reached beta stage. We will publish a playable version of the first race “track” very soon. Here is a updated in-game screenshot:

As noticed from the above image, we have a lot of new features implemented, such as:

  • Sperm animations;
  • HUD with stamina, health, and a RADAR;
  • Trail bubles - particles;
  • Positions;
  • Track wayspoints;
  • Other sperms AI;
  • Power-ups;
  • The first racing “track”;
  • Start count-down;
  • Several other scripts…

In the next week we will be finishing some issues and release the first beta for testing. We also promise to give more details on development stuff.

Sperm Racer

August 3, 2008 by erickpassos

Someday one has to grow up. After spending 2+ years learning about game engine development, now it is time to try a different thing. I joined forces with my friend Pedro who is a graphics designer to participate in the Unity3D Game Contest.

For those who don’t know it yet, Unity3D is a fantastic game engine which brings together a visual approach to game development and the best data-driven object oriented programming practices I’ve seen in any such tool. It’s IDE/Editor run on Max OSX only, but the target for the deployed game can be OSX executable, dashborad widget, Windows executable, web based game or event for the Wii console (with a special license of course). The price tag for this tool is very affordable for indie developers and so for now on I’ll be sharing my spare time between Unity3D and JMonkeyEngine.

But now, lets talk about the game. It a racing game (my favorite genre), but a very unusual one. It’s a game about the race for life, the sperm race. It features unique game mechanics and a fast pace action that hopefully will please a lot of casual gamers out there. Follows a brief explanation of the game mechanics and some early concept artwork.

Mechanics…

You’ll race a single sperm by quickly waving its tail sideways to provide the impulse needed to swing your way to victory. You do this by alternating the LEFT and RIGHT keys or by using an analog or digital gamepad. To drive your way through the body you’ll have to learn how to turn and slide just by waving the tail.

Early Sperm Sketches - by Pedro Thiago

Apart from the basic swimm mechanics, it will be also possible to quickly jump sideways to avoid an obstacle; use the tail to beat your adversaries; and perform small loops to get back on the track (in case you miss a powerup).

You’ll race against several Ai controlled sperms and the courses will be full of static and dynamic obstacle such as wounds, tumors, viruses and bacterias. Each unwanted encounter will affect your sperm differently, maybe slowing you down, maybe losing power-ups.

The racing sperm is not an indestructible half-cell, it has finite levels of life and stamina which directly affect its speed. To make thing more interesting, you’ll be given the chance to chose between male (Y) and female (X) sperms, each one with characteristic levels of stamina and speed: X sperms are know for being impetuously fast but get tired very quickly while X sperms are not as hurry but don’t give up soon.

The power-ups planed are all going to act for just some moments and are grouped in the following types:

  • Shields (against viruses, bacteria or other sperm fight moves);
  • Extra speed;
  • Extra stamina;
  • Alternative fight move/gun;

The race courses…

The sperm race will take place inside the female reproductive aparatus in the expected order. The first “track” goal will be to reach the uterus, and only the very first sperms will be granted on the next one. The Uterus course will be the second and the “Falloppius Tube” will be a fast paced spring race to the end, that reserve some surprises to the player. All courses will have different scenarios, chalenges and pace. In the next posts I’ll be updating the development status of the game and publishing (as possible) playable demos of the first course, which you can have a very early preview here:

The first course

The first course

Parallel Musings Part III - Languages and Tools

July 1, 2008 by erickpassos

4 - Programming Models

Given that current architecture trends already converged to MIMD machines with shared (SMP and multicores) or hybrid (GRIDs and clusters) memory models, how does this affect the way programmers do their work? Until the beggining of this century, all this concepts and concerns were of little interest for the mainstream enterprise programmer (code monkey). This lack of interest is mostly related to the fact that if the software was slow, in the next couple of years new technology should arise to enable it to run smoothly.

There is a common misconception about the well known Moore’s law, who envisioned that every 18 months, the transistor count of a given core would double. This “rule” is still valid, but the issue is that these new transistors are (more and more) necessary to deal with the increasing complexity of the processors. And also, higher clock rates increase the power consumption and, consequently, the generated heat. These issues have forced the mainstream CPU industry to expose coarse grained parallelism (current multicore systems). Until recently, these mainstream CPUs only explored ILP (instruction level) in their superscalar design. In this part we explain the programming models that are available (today) to deal with this coarse grained parallelism.

We will consider three parallel programming models, namely:

  • Threads
  • Message passing
  • Data-parallel

These models are abstractions over CPU and memory architectures discussed so far and are not tied to any of them. However, it is important to remember basic assumptions in order to be able to get the most out of it.

4.1 - Threads

Threads are (almost) independent execution paths inside a program. It is common to relate them with subroutines inside a given algorithm. Given two subroutines that are not dependent to one another (operate over different data), we can safely run them in parallel. Threads share the same address space, so one have to be extra careful when assuming that they are independent. This is also one of the strengths of the model, since this shared view of memory makes the communication between threads easier (global variables).

Even in the case where two threads operate (partially) over the same data, it is possible to benefit from parallel execution in those parts that are not accessing this shared state. It is known that race conditions, situations in which two or more threads are operating over the same data (communication), can lead to unpredictible results. To avoid them, several hardware and software constructs (TSL instructions, semaphores, monitors, locks) have been developed over the last decades. Even though, the fact that these constructs are not as easy to grasp as sequential algorithms (also the lack of a standard model) contributes to little adoption of these models besides very small groups (several software development teams make good use of threads, but not mainstream code-monkeys).

4.2 - Message passing

The message passing model provides for a single and well accepted construct for communication between distributed units of work. These units (processes) can be spread over a cluster, GRID or even one single node. To communicate, processes use two primitives: send and receive. Each send dispatched by one process instance must have a matching receive in order to complete.

To the programmer, message passing is exposed by a set of library functions for the most usual programming languages (such as Fortran, C or Java). In recent years, the most common of these libraries has become MPI (message passing interface). MPI (and to some extent message passing) is well accepted by the high performance computing (HPC) community but is relatively unknown outside these research groups.

4.3 - Data parallel

Data parallel accounts for domains where data (information) is provided as regular sets with considerable size. Tasks operate (depending on the hardware architecture) concurrently on the smae data set, but over different partitions. Notice that on shared memory architectures, the whole data set is accessible to all task, while in distributed memory the partitions can be spread over the nodes (see that the programming model is suitable for different architectures).

In practice, not all problems (domains) map well to this model, but tho one that do are very importante, such as weather simulation, financial analysis, computer graphics, among others. It is also interesting to notice that in recent years the computer games market lead to the development of specialized SIMD (and later MIMD) hardware (GPUs) whose programming model is data parallel. This hardware is now suitable for general purpose computing.

Is interesting to notice that these models can be used at the same time in hybrid fashion. For instance, one can use distributed MPI processes, each one with multiple threads manipulating data-parallel GPUs to solve one single (and huge) problem. However, all these performance-wise wonders seem to be distance from the mainstream programmer. Lets now take a look at how we can solve problems with parallel thinking.

5 - Parallel problem solving

Ideally, we would create sequential algorithms and tools would identify oportunities for parallelism and reorganize the code acordingly, making it benefit from the new architectures. This is the approach of hardware that provides ILP (instruction level parallelism) and some (good) compilers. However, there are very important caveats:

  • Depending on the case, performance could even degrade;
  • Limited to a small subset of code (fine grained - close instruction reordering and/or loop unrolling);
  • Much less flexible than manual parallelization.

There is no substitute for a good understanding of the domain yet, specially for problems that can benefit from coarse grained parallelism such as data-parallel or independent chunks of code/data communicatin with message passing. The first step is always to examinate the problem and the current algorithm used to solve it, paying spetial attention to possible data and functional decomposition.

Certain problems allow data decomposition, where the information can be split between discrete “data-chunks” that are independent and suitable for parallel processing in SIMD fashion. Figure 4 shows a skematic representation of this kind of decomposition.

Domain decomposition

Figure 4: Domain (data) decomposition (taken from [5])

Functional decomposition accounts for partitioning the tasks that need to be acomplished to solve the problem. The gains come from subtasks that are independent (operate over different data) and those that depend on data-streams from other subtasks, both relying on MIMD architecture to acomplish parallelism.

The later case is related to the producer-consumer classic syncronization problem, and is parallelizable in two ways: a) while the first task is operating over the second data-element of the stream, the second task is already working over the first element of the stream. It is also important to see that if each different task operates over independent chunks of the stream, they can be replicated (SIMD - several copies of the same task over multiple chunks of data), creating a hybrid approach. Figure 5 shows a pipeline example (not hybrid).

Pipelined tasks over data-stream

Figure 5: Pipelined tasks over data-stream

These decomposition approaches are imagined and developed at high abstraction layers but their implementation still relies on the programming models and related libraries such as threads (POSIX Threads, Java Threads), message passing (PVM, MPI) or data-parallel models (High Performance Fortran, GPGPU techniques).

One last note on this topic about programming models to discuss about a recent trend with GPGPU programming. Graphics hardware company NVidia has recently released a library (<a href=”http://cuda.nvidia.com”>CUDA</a>) to ease the task of developing parallel programs to run on its hardware (AMD/ATI also released similar library). The interesting thing about the CUDA library is that, although the hardware architecture and memory model are better suitable to a data-parallel approach, NVidia chose to base the library constructs and terminology upon the concept of threads. The reasons behind this may be related to the fact that the data-parallel model is even more obscure (for the mainstream programmer) than threads.

6 - Final thoughts

It’s been already 6 years since the last “big” leap in raw clock increase in the mainstream CPU market and most software developers are still not aware of this. It is also interesting to notice how the average customer did not account for the fact that his brand new PC is not faster than the old one (sometimes it is, just based on the fact that the old one was infested with malware - and as the new one will also become, its multicores will take care of the load).

But the market still pushes for more complex computing, specially in the enterprise and scientific research fields. The new generation of programmers have a (now even more important) choice: to think parallel or not to think parallel. Several enterprise software categories are already benefiting from multicore processors, such as databases, web servers and virtualized operating systems. But they were already production-grade multithreaded software before this recent trend (mostly because of isolation and SMP architectures).

Those who develop for the web or run virtualized data-centers may be saying that for them there is no need to learn new paradigms, but maybe they are just missing the point. Multicore systems are enabling them to do the same work as they did before with less power consumption per server but this is just more of the same. Interesting questions that came to my mind:

  • What can I do now that I could not do before?
  • Who else (competition) is thinking about this?
  • Am I risking to loose customers to these (parallel) people?

As a last remark, I’ll leave other interesting questions (and further readings) for the reader. These are related to language and tools to ease (with some overhead, but who said garbage collection is/was rubish?) the burden on parallel programming:

  • Will functional languages with immutable objects (evidences here for *Lisp and Erlang) replace current procedural and object oriented ones? (operations over immutable objects are inherently parallelizable)
  • What about specialized tools (with accompaining restrictive design patterns) such as Google’s MapReuce framework (or its opensource incarnation Apache Hadoop)?

I hope you enjoyed the (kind of long) reading and that it was able to fire at least some curiosity light on the subject.

Parallel Musings Part II - Foundation Theory

June 30, 2008 by erickpassos

2 - Limits on parallel execution

We learn programming by being introduced to the simplest concept of an algorithm: an ordered set of actions that, when executed in sequential order, solves a problem by doing the calculations related to each action. The key point here is that some of us fix that sequential order idea and never forget it. However, not all action in an algorithm have to executed in the given order to achieve the same result.

Some operations are not related to each other (reading and writing different variables), which make it possible to execute them in any relative order, or even at the same time. Every program (and related problem) has some inherent parallelism. But how further can we go in exploiting it?

2.1 - Amdahl’s law

Gene Amdahl [1] formulated a now classic result over how much performance one can gain by exploiting all the parallelism existent in a given algorithm.  We start by accounting for all the instruction of a program in two groups: P the possibly-parallel ones; and S the sequential part. Even if we had an unlimited number of processors, there is a limit on how much speedup (how fast the parallel version is compared to the sequential one) we can achieve. This limit is expressed by the following formulae, with N being the number of available cores:

speedup = (S + P)/(S + P/N)

Notice that even with an infinite number of cores, the maximum speedup of a program that is half sequential will be at most 2. So, how would behave the same program in a recent dual or quad-core machine? Given that the programmer insert all the multithreading spice to it (more on this later), the maximum speedup would be in a quad-core (remember that in the example S = P):

speedup = (S + P)/(S + P/4) -> (S + S)/(S + S/4) -> 2*S/( (5*S)/4 ) -> 8/5

See that even in a quad-core machine, as expected, the speedup will be less than 2. Also remember that we are not considering any overhead or other issues in parallelizing the P instructions. This limit is sometimes depressing but it shows that parallelism is not easy, even if you do know what to do.

2.2 - Gustafson’s law

But the previous results do not say it is impossible to do more work with more cores. Amdahl’s law just accounts for the speedup for a specific program implementation. We still can do a completely different program, or solve different problems. This is what John Gustafson exposed in his 1988 paper [2] and was also explained by Sutter [3], where he proposes other aproaches such as:

  • Use more data in the parallel parts, solving a bigger problem (Gustafson’s approach);
  • Do different work that is parallel such as solving problems that were not possible before (aka real-time ray tracing).

What is being said is that the available power of the extra cores can (and should) be used. A key thing is that this power will not come for free, and not all programs can be refactored to use it. In the next section we’ll be looking at how the hardware has been engineered to implicitly exploit (or expose) paralellism.

3 - Practical issues

The recent trend in multicore systems is not based on new parallel models. It is instead a convergence of decades of research in the field where several different approaches were experimented.From a programmers point of view, two concerns are important: data and functions, so one can classify the available parallelism expressed in programs based on these two aspects [4]:

  • Functional parallelism - expressed by the divide and conquer approach of most programming paradigms. It is usually small, sparce and irregular;
  • Data parallelism - comes from data-structures that allow for the parallel manipulation on them. It is usually regular and massive.

However, not all available parallelism can be exploited in practice, so one must account for the utilized parallelism at runtime. Part of the available functional parallelism can be exploited implicitelly by techniques that try to squeeze the most of each piece of (expressed) sequential code. Most of these techniques can be though of as ILP (instruction level parallelism). ILP accounts fine grained, often machine level, instructions that do not have a mandatory sequential order to execute. The good news is that (good) compilers and also hardware are capable of tackling most of these cases succesfuly. The bad news is that this form of parallelism does not allow for really noticeable speedups.

The bread and butter really comes from coarse grained functional (programm or thread level) and data parallelism. Since this forms or parallelism are explicit (the programmer need to express it directly) one will have to learn about memory models, plataforms and libraries to describe data structures and algorithms in parallel fashion.This is where our jorney into parallel architectures really begins. We start out by classifying types of machines based on how they expose data and instructions to the programmer using a (somehow) well accepted taxonomy given by Flynn in 1966 (taken from a parallel programming instructional tutorial [5]):

  • SISD - single instruction with single data is the basic model for sequential machines;
  • SIMD - single instruction and multiple data. Processors usually execute the same instructions in lock-step. This the model for vector machines and processor arrays. Until recently, graphic processors (GPU) could be classified in this category;
  • MISD - multiple instruction, single data. This model is of little use and few machines were ever built around it. A possible application would be to apply different criptoanalysis algorithms over the same message (data)
  • MIMD - multiple instruction, multiple data. This model acconts for almost all parallel machines nowadays, such as SMP (simetric multiprocessors), multicores (kind of SMP), massive parallel machines (most clusters), GRIDs, and modern GPUs;

Again, from the programmers point of view, there is another layer on top of this one that have to be taken into consideration: the memory model. Given that we have more than one processor and some ammount of available memory, the memory model accounts for how much of it is directly accessible from each processor. Shared memory architectures allow for all processors to access all memory addresses directly. Still, this access can be uniform (UMA - uniform memory access), where there is no latency or bandwidth difference for each processor. In this category are SMP and multicore machines. Non uniform access (NUMA) provide a direct addressing for the whole memory but latency and bandwidth vary depending on data locality, such as CPU accessing its own RAM and VideoRAM from the GPU. Figure 1 shows an ilustration of this pattern.

Shared memory (taken from [5])

Figure 1: Shared memory (taken from [5])

Distributed memory models are all those architectures where there is no direct addressing for all available memory. Most clusters and GRIDs are into this category. Figure 2 ilustrates this model.

Distributed memory (taken from [5])

Figure 2: Distributed memory (taken from [5])

Some real world scenarios, however, are not strictly into only one of these models. Most clusters now use SMP or multicore machines, ending up as a hybrid architecture. Also the IBM Cell Processor (Sony Playstation 3) has one main processor that addresses all available memory in NUMA pattern, while each synergistic processor accesses only its own memory (distributed model). Figure 3 shows this practical scenario.

Hybrid memory access (taken from [5])

Figure 3: Hybrid memory (taken from [5])

It is also important to mention that while some software platform (as we will see in the next part) hide these architecture details from the programmer (such as allowing for global addressing in distributed architecture, or message passing in shared memory), knowing the instrisics of the hardware platform is still mandatory if one wants to squeeze the most performance out of it.

In the next part we will be discussing programming models and languages to exploit exposed parallelism. Some subjective propositions will be put on team and individual instruction approaches to learning parallelism.

References

[1] Gene Amdahl. “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities” (AFIPS Conference Proceedings, Vol. 30, AFIPS Press, April 1967).

[2] John Gustafson. “Reevaluating Amdahl’s Law” (Communications of the ACM, 31(5), 1988).

[3] Herb Sutter. “Break Amdahl’s law! - here is a law you should break early and often” (Dr. Dobb’s Journal, 2008).

[4] Eugene Vinod Rebello. “Parallel architecture lecture notes” (IC-UFF, 2008).

[5] Blaise Barney. “Introduction to parallel computing” (HPC - Lawrence Livermore National Laboratory, 2007).

Parallel Musings Part I - Motivations

June 30, 2008 by erickpassos

It’s been a while since my last post. I hope this July will be plenty of new ones. I’ll start by talking about a subject that’s been in every game developer around (at least the programmers): parallel programming!

This text will be span across several posts. Since its contents are also part of a college homework, there will be a continuation for sure. Today I’m going to talk about the motivations behind the need for parallel programming, both in software and hardware.

1 - Introduction

The mainstream computer hardware industry has recently moved to a race towards parallelism. The reasons behind this move are mainly related to the inability to dissipate the increasing heat generated inside processor cores with growing clock rates. At the same time, there is the synchronization problem between processor components. At very high speeds, it is difficult to guarantee that data sent through paths of approximately the same size arrive together at the destination.

This “new” computing trend is being part of a market hype, and the media is full of nice stories about the benefits of multicore CPUs and virtualization techniques. What is not being told is that the life of programmers and systems designers has already changed. From enterprise developers to game programmers, everybody will have to face new paradigms to still develop software that takes advantage of new hardware. Herb Sutter [1] has called this the concurrent revolution and said that “the free lunch is over”, meaning that no software will get better performance by just upgrading the hardware anymore.

You could think that I will now start a Java tutorial on how to use multithreading to exploit multicore parallelism with JMonkeyEngine, or how a specific game engine used a fancy technique to gain more performance. It is not that simple. Parallelism is an old research subject and even today its bads and goods are not fully understanded. So, I’ll instead start a voyage through the concepts and techniques used to exploit the maximum performance out of the problem being solved. That is it: instead of the software, sometimes is better to re-think the whole problem and its solution.

The next part will explain some concepts and rules about the parallelism that is achievable in software. Part III will explain how hardware (and also compilers) have been developed to squeeze the maximum performance out of (quasi) sequential software and how much room has been left to the developer (this is the new trend). Part IV will analyse current mainstream programming paradigm and platforms (named object oriented programming) against this new trend and instigate the reader to explore other languages and paradigms. I hope this text will be a good reading for those interested in the subject.

References

[1] - Sutter, Herb. A Fundamental Turn Toward Concurrency in Software. In Dr. Dobbs Journal. March 1, 2005.

Hello new sertao3D!

June 15, 2008 by erickpassos

This is my new blog… Just moved from blogger. In a few days I’ll be publishing a post about game music and another one about tile based flash games.

Game Design - Slinger

April 1, 2008 by erickpassos

I’ve decided to evolve a game design, called Slinger and use this blog as an experimental repository for the ideas, as did Danc in his excelent Lost Garden. The design for slinger will be posted as a series and downloadable prototypes will be accessible from here as well.

The series will consist of incremental and iteractive essays about the game:

1 - Introduction and basic game mecanics (april 01, 2008)
2 - Reward system and difficulty management (april 02, 2008)
3 - Setting ideas (april, 03, 2008)
4..N - Iteractive reviews of the design (during april 2008)
N+1 - Post Morten (when finished…)

Introduction

Slinger will be an action game where the player takes the role of a boy equiped with a sling. Armed with the sling and rocks, he has to defend his farm’s animals againts all sorts of threats.

Player mecanics
Slinger is a 3rd person view 3D game where the player uses the keys to move the character and the mouse to turn it. Moving around the area there will be animals, that tend to run away from him, and other scene elements, like fences, barns, trees, etc.

To throw rocks the player uses the mouse. By holding the left-button, the sling stay in armed position. The player stretches the sling’s rubber by pulling back the mouse. The rock is thrown when the players releases the button. While the sling is hold armed, the player is free to move with the keys and “aim” by moving the mouse from side to side, remembering that the sling’s power is controlled by the disntance he moves the mouse back.

Animals behavior
Animals in Slinger will behave as in nature, moving as groups, and being predators and preys. The preys will try to avoid the predators and the player, while predators will try to attack the preys (when hungry) and avoid the player (except when they decide that he’s also a prey).

Final thoughts
I’ve decided to start with the basic mechanics because it’s the first thing I must implement and test. If the basic mechanics is not fun, nothing else will be. In the next post I’ll write about some reward system ideas to make the game enjoyable in the long run (not so long, since the game is intended for casual players).

Fell free to comment on the idea…