Apple's CHUD Tools, INTEL and Noble Ape


APPLE's CHUD TOOLS, INTEL and NOBLE APE

Mac Developer topics covered - Threading, Altivec and SSE

BACKGROUND, SOURCE and THANKS

In early 2003 I was contacted by two (now former) Apple engineers, Nathan Slingerland and Sanjay Patel, about demonstrating Noble Ape at Apple's Worldwide Developers Conference (WWDC). Since that time, Apple has distributed a version of Noble Ape in the CHUD Tools with every new Mac it sells, and the CHUD Tools are also available for download from Apple's site. Noble Ape is demonstrated by Apple at their Developer Kitchens and at a number of other public and private displays of Apple's hardware technology. It was also demonstrated at Apple's WWDC 2006 by INTEL engineers.

Apple and INTEL's display of Noble Ape has been a huge boost to the development. It provides a great deal of street credibility to the Simulation. I get a number of positive emails from Mac developers who have used aspects of the Noble Ape in their own development. It is great that the metaphoric monkey project can give back so much to the Mac development community.

Nathan Slingerland and Sanjay Patel have been outstanding in their contributions to the Noble Ape development.

More recently, Noble Ape has gratefully received the assistance of Rick Altherr, Eric Miller and Bryan Follis at Apple, and Justin L. Landon, Michael Lewis, Michael W. Yi, Pallavi Mehrotra and Phil Kerly at INTEL, with the management oversight of Myke Smith at Apple.

Many thanks also need to go to Dr. Ernest Prabhakar, Apple's Open Source Project Manager, who contributed his 15% Friends and Family discount to assist with the purchase of an INTEL Mac mini for Noble Ape development.

Finally, I would like to thank George Warner at Apple who invited me to Cupertino in 1998 to display some Noble Ape tech and continues to provide occasional assistance with implementation/conversion issues.

Hats off to all these great folk!

This document has been written to bridge the delta between the version of the Simulation Apple distributes and the current Noble Ape development source. It identifies how to get the current development source to compile as the Noble Ape version Apple distributes, with only a small, localized amount of new code.

At the time of writing this document, the version of the Simulation Apple distributes is still based on the 2003 source. This document outlines how you can get the features seen in the Apple distributed source with the current source code of the Simulation.

The latest released version of the Noble Ape Simulation source code is available from the Simulation page.

THREADING

The initial Apple contribution to the Noble Ape development came in two parts - threading and Altivec vector code. Apple's contribution took a very broad view of threading and divided the apes down the middle putting the two groups in separate threads on separate processors (where applicable). This optimization was a "no-brainer" and I quickly redeveloped this idea for the general release of the Noble Ape Simulation on the Mac.

With the introduction of scripting in the Simulation, and the lack of thread-safe predictability that came with it, I decided to move the threading code very tightly around the brain simulation component, removing threading from eating, moving, sleeping etc. This fitted in well with the vector contribution from Apple, which was also focused solely on the brain simulation component.

Threading is on by default in the Simulation when compiled for Mac. It will automatically be used to divide the brain simulation between two processors (if available) on Mac OS X. The CHUD version of the Simulation allows threading to be explicitly selected through menu items.
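
The two-way split described above can be sketched with POSIX threads. This is only an illustration, not the Simulation's own threading code: NUM_APES, BRAIN_SIZE, brain_job, brain_cycle_stub and cycle_all_brains are hypothetical names, and the stub simply increments each brain cell in place of the real brain cycle.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_APES   8    /* hypothetical population size */
#define BRAIN_SIZE 64   /* hypothetical, far smaller than a real brain */

/* One unit of work: cycle the brains in the half-open range [first, last). */
typedef struct {
    unsigned char (*brains)[BRAIN_SIZE];
    size_t first;
    size_t last;
} brain_job;

/* Stand-in for the real brain cycle: increments every cell of one brain. */
static void brain_cycle_stub(unsigned char *brain)
{
    for (size_t i = 0; i < BRAIN_SIZE; i++)
        brain[i] = (unsigned char)(brain[i] + 1);
}

/* Thread entry point: cycle every brain in the job's range. */
static void *brain_worker(void *arg)
{
    brain_job *job = (brain_job *)arg;
    for (size_t a = job->first; a < job->last; a++)
        brain_cycle_stub(job->brains[a]);
    return NULL;
}

/* Split the brains down the middle: the upper half goes to a second
   thread, the lower half stays on the calling thread. Returns 0 on success. */
static int cycle_all_brains(unsigned char brains[][BRAIN_SIZE], size_t count)
{
    pthread_t second;
    brain_job lower = { brains, 0, count / 2 };
    brain_job upper = { brains, count / 2, count };

    if (pthread_create(&second, NULL, brain_worker, &upper) != 0) {
        /* No second thread available: do all the work on this thread. */
        brain_worker(&lower);
        brain_worker(&upper);
        return 0;
    }
    brain_worker(&lower);
    return pthread_join(second, NULL);
}
```

Because each brain is touched by exactly one thread, the two halves need no locking. That independence is what makes the brain simulation a convenient place to confine the threading.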

ALTIVEC and SSE

The largest code contribution from Apple to the Simulation came with the Altivec and SSE developed code. This code allows a number of brain simulations to be calculated at the same time by stacking the vector mathematics in 8 or 16 calculation pipelines. The original (and CHUD tool released) version of this code had two separate vector optimization levels. They were divided to show the progressive levels of optimization that a developer may go through when converting their code to vector mathematics.

Questions from Tom Barbalet and Answers from Sanjay Patel of Apple
2-3 July 2003 (link)
Q: Could you please give an overview of the Vector code?
A: The vector code is composed of the following steps: unpack the brain data into 16-bit fields, compute POS_BRAIN_LH_OPT and POS_BRAIN_UH_OPT, compute both BRFUNC_AWAKE and BRFUNC_ASLEEP, select either awake or asleep based on the awake variable, store that value into br_tmp, update change_clate if needed, and update the change array if needed.
Q: Could you please expand on the even/odd combination method? Does this produce bottle-necks? Is it better to maintain the same-number-of-bits mathematics throughout?
A: The even/odd multiplies are necessary because a vector is a fixed (128-bit) quantity, but we want to multiply 8 16-bit elements. We can't create a 256-bit vector, so we must operate on half of the vector at a time. Obviously, this cuts down on the throughput of the code. You might ask, "if we're operating on 16 8-bit elements in parallel with Altivec, then how come this function didn't get 16x faster?" The reason is that your algorithm requires more than 16-bits of precision in the intermediate stages of calculation. We could almost double the performance if the algorithm only needed to keep 16-bits of precision because Altivec offers 16-bit multiplies for high and low halves of a multiply. I experimented with this for over a day, but I couldn't find any way to preserve all of the significant bits that Noble Ape requires.
Q: What are the major improvements in the Vector Optimised code?
A: 2 big improvements: (1) we vectorized the entire function, not just the middle loop. This is, of course, a relatively simple thing to do since all of the loops are identical apart from bounds checks that define what element of the brain to load. (2) I made the computation of the awake and asleep sides of the equation more efficient. I cut the number of multiplies down from 20 to 12 by conditionalizing the constants used in the math rather than conditionalizing the equations themselves.
Q: If you had more time, what would you look to test/improve in the Vector Optimised code?
A: I'd look at rescheduling the brain data loads, using cache hints, and compiler flags. Maybe you could unroll the loop and get some more speedup? I don't see any way to remove any more of the computation (unless you can change the algorithm a bit to need only 16-bits of precision in the intermediate calcs).
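
The even/odd point in the answers above can be modelled in plain scalar C. This is a sketch of the arithmetic only, not Apple's vector code: mul_even and mul_odd are hypothetical helpers that mimic what the Altivec vec_mule and vec_mulo operations do with eight signed 16-bit elements, each producing four 32-bit products.

```c
#include <stdint.h>

#define LANES16 8   /* eight 16-bit elements fill one 128-bit vector */
#define LANES32 4   /* four 32-bit elements fill one 128-bit vector  */

/* Multiply elements 0, 2, 4, 6 and keep the full 32-bit products. */
static void mul_even(const int16_t a[LANES16], const int16_t b[LANES16],
                     int32_t out[LANES32])
{
    for (int i = 0; i < LANES32; i++)
        out[i] = (int32_t)a[2 * i] * (int32_t)b[2 * i];
}

/* Multiply elements 1, 3, 5, 7 and keep the full 32-bit products. */
static void mul_odd(const int16_t a[LANES16], const int16_t b[LANES16],
                    int32_t out[LANES32])
{
    for (int i = 0; i < LANES32; i++)
        out[i] = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
}
```

Eight 16-bit inputs produce eight 32-bit results, which is 256 bits of output. Two 128-bit result vectors are therefore needed, and the multiply has to be issued twice, once for each half of the elements.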

One of the shortfalls of the vector code was that it was more assembly-language-like in its format and thus substantially longer in line-count and less readable to the novice user. With the addition of SSE vector code for Apple's progression to INTEL processors, the amount of code and the general level of confusion made me look at options for simplifying the code in the released version of the Simulation.

My solution to the volume of new vector code was to remove the partially optimized code, which seemed less necessary in the Altivec/SSE contrast code example. In addition, I created a series of porting macros and typedefs. Using macros not only reduced the volume of code shared between the two vector types (Altivec and SSE), it also reduced the general code and the need for duplicate and extended comments. The macros also helped keep the vector code in one-to-one correspondence with the scalar code. It was easy to follow down the scalar code alongside the vector code and get a sense of what the vector functions were doing. This implicit commenting property was very useful.

sim/core/brain.c - source example of the vector macros

#ifdef APPLE_CHUD

#ifdef __ALTIVEC__

typedef  vector unsigned char  n_vbyte;
typedef  vector signed short   n_vshort;
typedef  vector unsigned int   n_vbyte4;
typedef  vector signed int     n_vint;

#define  SET_VSHORT(num)      (vector signed short)(num)
#define  SET_VBYTE4(num)      (vector unsigned int)(num)
#define  SET_ZEROED           (vector unsigned char)(0)

#define  UNPACK_HI16(v, n)    (vector signed short) vec_mergeh((n), (v))
#define  UNPACK_LO16(v, n)    (vector signed short) vec_mergel((n), (v))

#define  VECT_ADD16(a, b)      vec_add((a), (b))

#define  VECT_SRA32(a, b)      vec_sra((a), (b))
#define  VECT_SUB32(a, b)      vec_sub((a), (b))
#define  VECT_ADD32(a, b)      vec_add((a), (b))

#define  VECT_PACK(a, b)      (vector unsigned char) vec_pack((a), (b))

#define  VECT_BRAIN_MULTIPLY(a,b,c,d, e,f,g,h)               \
                              (e) = vec_mule((a), (b));      \
                              (f) = vec_mulo((a), (b));      \
                              (g) = vec_mule((c), (d));      \
                              (h) = vec_mulo((c), (d))
#endif

#ifdef __SSE2__

typedef  __m128i               n_vbyte;
typedef  __m128i               n_vshort;
typedef  __m128i               n_vbyte4;
typedef  __m128i               n_vint;

#define  SET_VSHORT(num)       _mm_set1_epi16((signed short)(num))
#define  SET_VBYTE4(num)       _mm_set_epi32(((num)>>24)&255, ((num)>>16)&255, ((num)>>8)&255, (num)&255)
#define  SET_ZEROED            _mm_setzero_si128()

#define  UNPACK_HI16(v,n)      _mm_unpackhi_epi16((v), (n))
#define  UNPACK_LO16(v,n)      _mm_unpacklo_epi16((v), (n))

#define  VECT_ADD16(a,b)       _mm_add_epi16((a), (b))

#define  VECT_SRA32(a,b)       _mm_sra_epi32((a), (b))
#define  VECT_SUB32(a,b)       _mm_sub_epi32((a), (b))
#define  VECT_ADD32(a,b)       _mm_add_epi32((a), (b))

#define  VECT_PACK(a,b)        _mm_packus_epi16((b), (a))

#define  VECT_BRAIN_MULTIPLY(a,b,c,d, e,f,g,h)                             \
                               tmp1 = _mm_mulhi_epi16((a), (b));           \
                               tmp2 = _mm_mullo_epi16((a), (b));           \
                               tmp3 = _mm_mulhi_epi16((c), (d));           \
                               tmp4 = _mm_mullo_epi16((c), (d));           \
                              (e) = _mm_unpackhi_epi16 (tmp2, tmp1);       \
                              (f) = _mm_unpacklo_epi16 (tmp2, tmp1);       \
                              (g) = _mm_unpackhi_epi16 (tmp4, tmp3);       \
                              (h) = _mm_unpacklo_epi16 (tmp4, tmp3)
#endif

The brain simulation for SSE versus scalar was tested and confirmed 1:1 in March 2007. The folks at INTEL have taken on a leadership role for this project. Special thanks again to Justin L. Landon, Michael W. Yi and Pallavi Mehrotra at INTEL. Michael in particular provided exact 1:1 testing.
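
The SSE form of VECT_BRAIN_MULTIPLY above builds each 32-bit product from a high 16-bit half (_mm_mulhi_epi16) and a low 16-bit half (_mm_mullo_epi16), then interleaves the halves with the unpack operations. A scalar model of that recombination is sketched below to show why it matches an ordinary 32-bit multiply. The helper names mul_hi16, mul_lo16 and widen_product are hypothetical and are not part of the Simulation source.

```c
#include <stdint.h>

/* Upper 16 bits of the signed 32-bit product, as a raw bit pattern. */
static uint16_t mul_hi16(int16_t a, int16_t b)
{
    uint32_t p = (uint32_t)((int32_t)a * (int32_t)b);
    return (uint16_t)(p >> 16);
}

/* Lower 16 bits of the signed 32-bit product, as a raw bit pattern. */
static uint16_t mul_lo16(int16_t a, int16_t b)
{
    uint32_t p = (uint32_t)((int32_t)a * (int32_t)b);
    return (uint16_t)(p & 0xFFFFu);
}

/* Place the low half below the high half, as the unpack step does on a
   little-endian machine, to rebuild the full 32-bit product. The final
   cast assumes a two's complement int32_t. */
static int32_t widen_product(int16_t a, int16_t b)
{
    uint32_t bits = ((uint32_t)mul_hi16(a, b) << 16) | mul_lo16(a, b);
    return (int32_t)bits;
}
```

A check of this kind, comparing the recombined value against a direct scalar multiply, is in the same spirit as the 1:1 testing mentioned above, although the testing method actually used is not described here.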

DEFINES and COMPILATION

There are a number of defines used through the Simulation to add and remove features in the compiled version of the Simulation. The three most relevant to this document are APPLE_CHUD, THREADED and BRAIN_HASH.

APPLE_CHUD - default undefined
sim/core/core.h

THREADED - default defined
sim/gui/gui.h

BRAIN_HASH - default undefined
sim/gui/gui.h

To show the optimization advantage of threading and vector processing, the Apple version of the Simulation requires additional menu items to switch on and off threading and vector processing. Defining APPLE_CHUD puts these menus into effect and activates the vector specific code which isn't currently used in the released version of the Simulation.

Threading can be turned off and on in the Mac compiled version of the Simulation through defining or undefining THREADED.
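
As a minimal sketch of how such defines gate features at compile time, the fragment below mirrors the defaults listed above. Only the three define names come from the Simulation. The helper functions are hypothetical and exist purely to make the effect of each define visible.

```c
/* Feature switches, mirroring the defaults listed above. */
#define THREADED            /* default defined   */
/* #define APPLE_CHUD */    /* default undefined */
/* #define BRAIN_HASH */    /* default undefined */

/* Hypothetical helper: 1 when threading is compiled in. */
static int threading_compiled_in(void)
{
#ifdef THREADED
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: 1 when the CHUD menus and vector code are compiled in. */
static int chud_features_compiled_in(void)
{
#ifdef APPLE_CHUD
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: 1 when the brain hash display is compiled in. */
static int brain_hash_compiled_in(void)
{
#ifdef BRAIN_HASH
    return 1;
#else
    return 0;
#endif
}
```

Uncommenting APPLE_CHUD or BRAIN_HASH, or removing THREADED, changes what is compiled without touching the code that uses the feature.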

BRAIN HASH

Changes to the brain simulation part of the Simulation are notoriously slippery to track. The brain simulation features a large number of mathematical calculations designed to propagate information through the three dimensional brain array. The solution to tracking changes in this fluid mathematics was to apply a hash function to resolve all the values of the brain at a particular time to a single number.

The brain hash is currently used to test and verify changes to the scalar version or either vector version of the Simulation, as well as across the variety of platforms the Simulation is run on. In the future, it will be used to track deltas between the Apple SSE, the Linux SSE and the Windows SSE implementations as well as the normal scalar versions.

To ease testing, when BRAIN_HASH is defined the Simulation shows a small string of numbers and letters to represent the brain hash value at a particular time. The interval between brain hash changes is sufficiently long that the user should be able to write these numbers and letters on a piece of paper. This was the original high-tech method of verifying that the brain hash values matched one-to-one through changes.
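
The idea of resolving a whole brain to one short, writable string can be sketched as follows. This is not the hash the Simulation uses: the sketch swaps in the well-known 32-bit FNV-1a hash purely for illustration, and brain_hash_sketch and brain_hash_string are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* 32-bit FNV-1a over every byte of the brain (an assumed stand-in hash). */
static uint32_t brain_hash_sketch(const unsigned char *brain, size_t size)
{
    uint32_t hash = 2166136261u;        /* FNV-1a 32-bit offset basis */
    for (size_t i = 0; i < size; i++) {
        hash ^= brain[i];
        hash *= 16777619u;              /* FNV-1a 32-bit prime */
    }
    return hash;
}

/* Format the hash as eight letters and digits, short enough to copy by hand. */
static void brain_hash_string(uint32_t hash, char out[9])
{
    snprintf(out, 9, "%08lX", (unsigned long)hash);
}
```

Two builds that start from the same brain contents and apply the same mathematics yield the same string, while a difference in even one cell changes it. That is the property the pencil-and-paper comparison relies on.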

Another critical point in getting one-to-one brain hash values is maintaining the same conditions in the Simulation. This can be achieved by saving a Simulation variables file. I would recommend hacking this file down to four or five apes at most. The Simulation can be paused, the variable file loaded and the Simulation can then resume running. Note down the letters and numbers the Simulation displays in the Brain window.

Run a different version of the Simulation with the same BRAIN_HASH defined. You may consider running the Altivec version versus the scalar version, or perhaps the SSE version versus the Altivec version, or Linux versus Windows. With the same variable file, the same brain hash results should be produced. If they aren't, there is some difference in the brain simulation.

Tom Barbalet, 21 October 2005.
Revised 24 March 2006, 25 March 2007, 18 June 2008.
