32-bit vs 64-bit Performance Under Mac OS X
One of the assertions I keep hearing is that, with the transition from 32-bit to 64-bit computing, applications will receive an increase in performance “for free”, since processors will be working with 64 bits at a time instead of 32 (provided, of course, that applications are recompiled for 64-bit platforms).
Now that Geekbench is available for both 32- and 64-bit processors on Mac OS X, I thought I’d see if that assertion is correct. I’ve compared performance on the three 64-bit processors currently available for Mac OS X: the PowerPC G5, the Intel Xeon, and the Intel Core 2 Duo.
Setup
Here are the configurations of the test machines:
Mac Pro
- Intel Xeon 5150 @ 2.66GHz
- 2048MB RAM
- Mac OS X 10.4.7 (Build 8K1124)
- Geekbench 2006 (Build 208)
iMac (Late 2006)
- Intel Core 2 Duo @ 2.0GHz
- 1024MB RAM
- Mac OS X 10.4.7 (Build 8K1106)
- Geekbench 2006 (Build 208)
Power Mac G5
- PowerPC G5 @ 2.0GHz (two processors)
- 1024MB RAM
- Mac OS X 10.4.7 (Build 8J135)
- Geekbench 2006 (Build 208)
I’ve only included the scores for single-threaded tests (since I think they’re the most relevant when comparing 32- and 64-bit performance on the same machine). I’m using the baseline score (where a score of 100 is equivalent to the performance of a Power Mac G5 at 1.6GHz), where higher is better. I’ve also computed the 64-bit score for each machine as a percentage of the machine’s 32-bit score.
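As a minimal sketch of how the percentage column in the tables is derived, the helper below divides the 64-bit score by the 32-bit score. The function name `relative_score` is made up for illustration and is not part of Geekbench.

```python
def relative_score(score_32: float, score_64: float) -> float:
    """Return the 64-bit score as a percentage of the 32-bit score (one decimal)."""
    return round(score_64 / score_32 * 100, 1)


# Overall scores for the Mac Pro and the Power Mac G5, as reported in this post.
print(relative_score(344.8, 365.2))  # Mac Pro
print(relative_score(154.9, 140.1))  # Power Mac G5
```

A value above 100 means the 64-bit build scored higher than the 32-bit build on the same machine; a value below 100 means it scored lower.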
Mac Pro Performance
Overall Score
| | Mac Pro (32-bit) | Mac Pro (64-bit) |
|---|---|---|
| Overall Score | 344.8 | 365.2 (105.9%) |
Integer Performance
| Benchmark | Mac Pro (32-bit) | Mac Pro (64-bit) |
|---|---|---|
| Emulate 6502 single-threaded scalar | 162.8 | 218.0 (133.9%) |
| Blowfish single-threaded scalar | 232.8 | 205.2 (88.1%) |
| bzip2 Compress single-threaded scalar | 223.5 | 277.4 (124.1%) |
| bzip2 Decompress single-threaded scalar | 251.9 | 300.4 (119.3%) |
Floating Point Performance
| Benchmark | Mac Pro (32-bit) | Mac Pro (64-bit) |
|---|---|---|
| Mandelbrot single-threaded scalar | 179.9 | 180.0 (100.1%) |
| Dot Product single-threaded scalar | 362.0 | 364.1 (100.6%) |
| Dot Product single-threaded vector | 153.5 | 127.2 (82.9%) |
| JPEG Compress single-threaded scalar | 161.0 | 195.7 (121.6%) |
| JPEG Decompress single-threaded scalar | 154.9 | 199.9 (129.1%) |
Memory Performance
| Benchmark | Mac Pro (32-bit) | Mac Pro (64-bit) |
|---|---|---|
| Read Sequential single-threaded scalar | 354.3 | 356.7 (100.7%) |
| Write Sequential single-threaded scalar | 631.3 | 423.1 (67.0%) |
| Stdlib Allocate single-threaded scalar | 279.0 | 357.1 (128.0%) |
| Stdlib Write single-threaded scalar | 124.3 | 116.1 (93.4%) |
| Stdlib Copy single-threaded scalar | 234.9 | 252.1 (107.3%) |
Stream Performance
| Benchmark | Mac Pro (32-bit) | Mac Pro (64-bit) |
|---|---|---|
| Stream Copy single-threaded scalar | 199.4 | 204.2 (102.4%) |
| Stream Copy single-threaded vector | 197.1 | 198.2 (100.6%) |
| Stream Scale single-threaded scalar | 217.6 | 208.7 (95.9%) |
| Stream Scale single-threaded vector | 196.4 | 195.6 (99.6%) |
| Stream Add single-threaded scalar | 188.4 | 210.7 (111.8%) |
| Stream Add single-threaded vector | 184.7 | 218.2 (118.1%) |
| Stream Triad single-threaded scalar | 147.0 | 210.0 (142.9%) |
| Stream Triad single-threaded vector | 183.7 | 173.9 (94.7%) |
Mac Pro Summary
Overall performance in 64-bit mode is about 6% higher than overall performance in 32-bit mode. However, a number of benchmarks were slower in 64-bit mode than in 32-bit mode (like the Blowfish and Write Sequential benchmarks).
iMac Performance
Overall Score
| | iMac (32-bit) | iMac (64-bit) |
|---|---|---|
| Overall Score | 205.2 | 221.5 (107.9%) |
Integer Performance
| Benchmark | iMac (32-bit) | iMac (64-bit) |
|---|---|---|
| Emulate 6502 single-threaded scalar | 122.3 | 164.0 (134.1%) |
| Blowfish single-threaded scalar | 175.0 | 153.5 (87.7%) |
| bzip2 Compress single-threaded scalar | 168.5 | 209.7 (124.5%) |
| bzip2 Decompress single-threaded scalar | 212.9 | 227.7 (107.0%) |
Floating Point Performance
| Benchmark | iMac (32-bit) | iMac (64-bit) |
|---|---|---|
| Mandelbrot single-threaded scalar | 135.1 | 135.1 (100.0%) |
| Dot Product single-threaded scalar | 271.5 | 273.2 (100.6%) |
| Dot Product single-threaded vector | 113.2 | 115.0 (101.6%) |
| JPEG Compress single-threaded scalar | 120.7 | 147.2 (122.0%) |
| JPEG Decompress single-threaded scalar | 116.1 | 154.8 (133.3%) |
Memory Performance
| Benchmark | iMac (32-bit) | iMac (64-bit) |
|---|---|---|
| Read Sequential single-threaded scalar | 308.4 | 307.5 (99.7%) |
| Write Sequential single-threaded scalar | 416.7 | 439.5 (105.5%) |
| Stdlib Allocate single-threaded scalar | 208.6 | 273.4 (131.1%) |
| Stdlib Write single-threaded scalar | 104.3 | 104.7 (100.4%) |
| Stdlib Copy single-threaded scalar | 218.9 | 221.1 (101.0%) |
Stream Performance
| Benchmark | iMac (32-bit) | iMac (64-bit) |
|---|---|---|
| Stream Copy single-threaded scalar | 170.7 | 178.1 (104.3%) |
| Stream Copy single-threaded vector | 161.0 | 159.1 (98.8%) |
| Stream Scale single-threaded scalar | 183.2 | 176.0 (96.1%) |
| Stream Scale single-threaded vector | 160.8 | 160.4 (99.8%) |
| Stream Add single-threaded scalar | 159.0 | 192.5 (121.1%) |
| Stream Add single-threaded vector | 176.2 | 179.0 (101.6%) |
| Stream Triad single-threaded scalar | 159.7 | 187.8 (117.6%) |
| Stream Triad single-threaded vector | 141.7 | 148.9 (105.1%) |
iMac Summary
Despite the fact that the Core 2 Duo and the Xeon share the same underlying architecture, the Core 2 Duo gains more from 64-bit mode than the Xeon does; overall performance for the Core 2 Duo is up about 8% (compared to about 6% for the Xeon). Plus, the only benchmark that was significantly slower in 64-bit mode was the Blowfish benchmark.
Power Mac Performance
Overall Score
| | Power Mac G5 (32-bit) | Power Mac G5 (64-bit) |
|---|---|---|
| Overall Score | 154.9 | 140.1 (90.4%) |
Integer Performance
| Benchmark | Power Mac G5 (32-bit) | Power Mac G5 (64-bit) |
|---|---|---|
| Emulate 6502 single-threaded scalar | 125.1 | 100.0 (79.9%) |
| Blowfish single-threaded scalar | 124.7 | 89.0 (71.4%) |
| bzip2 Compress single-threaded scalar | 156.5 | 110.8 (70.8%) |
| bzip2 Decompress single-threaded scalar | 108.7 | 106.0 (97.5%) |
Floating Point Performance
| Benchmark | Power Mac G5 (32-bit) | Power Mac G5 (64-bit) |
|---|---|---|
| Mandelbrot single-threaded scalar | 125.2 | 129.8 (103.7%) |
| Dot Product single-threaded scalar | 112.3 | 112.8 (100.4%) |
| Dot Product single-threaded vector | 125.5 | 42.0 (33.5%) |
| JPEG Compress single-threaded scalar | 122.0 | 105.9 (86.8%) |
| JPEG Decompress single-threaded scalar | 129.6 | 107.5 (82.9%) |
Memory Performance
| Benchmark | Power Mac G5 (32-bit) | Power Mac G5 (64-bit) |
|---|---|---|
| Read Sequential single-threaded scalar | 133.9 | 130.1 (97.2%) |
| Write Sequential single-threaded scalar | 145.9 | 161.2 (110.5%) |
| Stdlib Allocate single-threaded scalar | 101.9 | 93.2 (91.5%) |
| Stdlib Write single-threaded scalar | 129.7 | 131.1 (101.1%) |
| Stdlib Copy single-threaded scalar | 134.2 | 124.7 (92.9%) |
Stream Performance
| Benchmark | Power Mac G5 (32-bit) | Power Mac G5 (64-bit) |
|---|---|---|
| Stream Copy single-threaded scalar | 132.9 | 127.9 (96.2%) |
| Stream Copy single-threaded vector | 129.2 | 122.8 (95.0%) |
| Stream Scale single-threaded scalar | 129.9 | 127.5 (98.2%) |
| Stream Scale single-threaded vector | 129.6 | 131.1 (101.2%) |
| Stream Add single-threaded scalar | 127.4 | 129.3 (101.5%) |
| Stream Add single-threaded vector | 130.8 | 140.7 (107.6%) |
| Stream Triad single-threaded scalar | 134.5 | 129.7 (96.4%) |
| Stream Triad single-threaded vector | 137.1 | 139.5 (101.8%) |
Power Mac Summary
Overall performance is down 10% in 64-bit mode. Hardly any tests are appreciably faster in 64-bit mode, and several are noticeably slower (such as most of the integer tests, as well as the dot product test).
Conclusion
It turns out the assertion that software runs faster in 64-bit mode than 32-bit mode is both correct and incorrect; Geekbench runs faster in 64-bit mode on Intel-based Macs, but slower on PowerPC-based Macs. I find this incredibly surprising.
On Intel-based Macs, most of the benchmarks that are slower in 64-bit mode are benchmarks that perform bit operations on 32-bit integers, where the compiler has to emit extra instructions to preserve the semantics of 32-bit arithmetic while using 64-bit registers.
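To make the idea of “preserving 32-bit semantics in a wider register” concrete, here is a small sketch in Python, whose integers are unbounded and so behave like an arbitrarily wide register. The helper names are hypothetical, and this only illustrates the arithmetic; it is not a claim about the instructions GCC actually emits for Geekbench.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add32(a: int, b: int) -> int:
    """Unsigned add with 32-bit wraparound: the result is truncated to 32 bits."""
    return (a + b) & MASK32


def add64(a: int, b: int) -> int:
    """Unsigned add in a 64-bit register: the carry out of bit 31 is kept."""
    return (a + b) & MASK64


# A 32-bit cipher expects the carry to be thrown away. A plain 64-bit add
# keeps it, so the result must be truncated back to 32 bits afterwards.
print(hex(add32(0xFFFFFFFF, 1)))
print(hex(add64(0xFFFFFFFF, 1)))
```

The extra masking step in `add32` stands in for whatever truncation or extension the compiler has to perform when 32-bit values live in 64-bit registers.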
However, extra instructions don’t explain the surprising performance hit PowerPC-based Macs experience in 64-bit mode. I haven’t had a chance to investigate it, but compiler quality could be a factor; the 64-bit PowerPC is a somewhat exotic platform, and GCC might not be generating great code for it.
I don’t think the performance hit in 64-bit mode on PowerPC-based Macs is really something to be concerned about; I think that when 64-bit applications become mainstream, most users will have switched to Intel-based Macs (where 64-bit performance isn’t a concern).
Update
There’s an interesting comment over on MacSlash suggesting why 64-bit performance (compared to 32-bit performance) is better on x86 than PPC:
As someone who used to work at AMD which designed the x86-64 architecture:
- 16 integer pipe registers versus 8 in 32 bit mode (of which 6 get used)
- Carefully designed CISC so that 64-bit mode takes only 10% more space than 32 bit mode. This is important because the main bottleneck in modern systems is memory speed (hence the constant increase in cache sizes)
PowerPC:
- no increase in registers
- much larger code size increase, although I can’t find exact figures.

Intel processors in 64-bit mode allow Apple to use more registers, so function parameters are passed in processor registers rather than on the memory stack. That alone is a significant performance boost for latency-bound, tightly looped integer programs.
However, 64-bit pointers take up twice as much processor cache space as 32-bit pointers. Data should not be declared 64-bit unless double precision is required, as it would also take up more cache space. So the large caches are helpful, and programmers need to be mindful of their data declarations.
Of course double precision calculations are significantly faster.
The results are exactly what I would expect. In general, going to 64 bits slows things down, since you have twice as much data to deal with and most of the time that data is unused (most integers used in your average calculation are less than 4,294,967,296). There is an exception for the Intel architecture, though, since going to 64-bits buys you some extra registers because of the different ISA, so you get an effect that is not strictly due to the jump to 64-bit wide registers.
I don’t know how much these tests really matter. OS X isn’t 64-bit yet, so we don’t know how it has been optimized for such processors. Most software isn’t optimized for such a thing either. While I understand the cache issues and such, 64-bit will make your computer more powerful IF the software can take advantage of it. Benchmarking is extremely general and doesn’t give real world numbers; only predictions.
Onyx,
You are mistaken. OS X has been a 64-bit OS since the G5 shipped.
-jcr
John C. Randolph, you are mistaken. Mac OS X is not a 64bit OS.
Yes, it is able to run 64bit binaries without GUI using the libSystem.dylib.
But the kernel and the drivers are not 64 bit (they are universal binary “ppc, i386”). So it is not a 64bit OS.
The only file in the whole system being 64bit is the /usr/lib/libSystem.B.dylib (this file is universal binary “ppc, ppc64, i386, x86_64”).
On the PowerPC it makes no difference if the OS is 64 or 32 bit.
Quote from the IBM PPC970 documentation:
But how about Intel?
Quote from the Intel documentation:
Since the device drivers in Mac OS X Tiger on the Intel Macs with EM64T are unchanged from the ones used on the 32bit Intel Macs (all are still i386) and the kernel is executed in the same memory space as the drivers (it is also still i386), it is kind of a riddle how Mac OS X is even executing x86_64 code.
Maybe Apple will explain someday how they manage to make x86_64 binaries run on the Mac Pro or iMac Core 2 Duo running Tiger even though the OS is not 64bit.
OS X is 64-bit at the kernel level, but wasn’t so at the frameworks level. That is changing with 10.5. Everything will be 64-bit enabled.
John C. Randolph is a former Apple software engineer, FYI. Although the majority of Apple-supplied GUI libraries and frameworks remain 32-bit prior to Leopard 10.5, he is technically correct in his statement in that G5 can run in 64-bit mode on Tiger. It’s also likely that the kernel will still run in 32-bit mode on Leopard, because the memory management semantics are no different than Tiger’s, for supporting a 64-bit process address space.
coolfactor
Mac OS X is not 64-bit at the kernel level.
If it was 64bit, it wouldn’t be running on the Yonah (the Core processor in the first Intel Macs). The Yonah only uses the i386 instruction set. A 64bit kernel would use a different instruction set, the x86_64. The Yonah is not able to execute the 64bit (x86_64) instructions the kernel would be using if it was 64bit.
Apart from that the kernel doesn’t even contain any x86_64 code.
Although I don’t know the exact details, I would hypothesize that both the PowerPC and the x86_64 processors are able to run the code of a single process in 64-bit mode, removing the need for the kernel to be 64-bit itself.
OS X is a 64-bit OS in that it is 64-bit-aware and capable at the libc layer; being aware and able to put the processor in a mode in which it can run 64-bit code does not necessarily need a fully 64-bit kernel. (This I can say for sure as OS X does it, which means it is possible.)
Peter is absolutely right. This ability to switch between 32-bit and 64-bit mode at the CPU level is one of the advantages of the PowerPC 970 architecture.
And yes, xnu has been 32-bit since its inception. I too would like to know what the final word is on the EM64T Mactels though. I figured for sure Apple would have to be offering a 64-bit kernel on those machines, as I would think that’s required to support 64-bit in userland. Peter seems to be saying though, that they’re running a 32-bit xnu? Can someone who owns one of these machines verify this? I’ve often wondered if this limitation in IA was one of the reasons Apple decided to delay the move to 64-bit, even though they had 64-bit hardware back in 2003, with the introduction of the G5.
XNU absolutely has to support EM64T. If this weren’t the case, how could the kernel service system calls from 64-bit processes? Assuming XNU works either with a SYSENTER or an “int x” instruction, how could it access the dataspace of the calling process? How would write() work? How would brk() work? What would happen when a 64-bit process had a page fault at address 00001000_00000000, and data needed to be swapped in?
The OS X kernel must be 64-bit. I have 6.5 GB of RAM in my PowerMac G5 (first revision). A 32-bit OS cannot handle more than 2 GB of RAM. The kernel addresses the RAM, therefore the kernel is 64-bit.
Uhm there are many methods to address >2gig of ram with a 32bit kernel… NT does it, novell does it….. the method name escapes me.. pae mainly?
AWE: Address Windowing Extensions
PAE: Physical Address Extension
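For a sense of scale, the arithmetic behind address widths can be sketched as follows. The 36-bit figure is the physical address width that PAE provides on x86; it is used here only as an example and says nothing about how xnu manages memory on the G5.

```python
def addressable_gib(address_bits: int) -> float:
    """Bytes addressable with the given number of address bits, in GiB."""
    return 2 ** address_bits / 2 ** 30


for bits in (32, 36):
    print(f"{bits}-bit addresses: {addressable_gib(bits):g} GiB")
```

So a flat 32-bit address reaches 4 GiB, while 36-bit physical addresses reach 64 GiB even though each process still sees a 32-bit virtual address space.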
Max
Obviously there is something very wrong with the dot product test under ppc64. As well, profiling GeekBench with a tool like Shark will probably let you properly optimize the app for 64-bit systems. But that’s not really necessary, at least on the PPC.
Right, the PPC970 can natively handle 32 and 64 bit. It’s the bridge between the POWER 3/4 lines and the PowerPC 7410.
Wrong, the PPC is not exotic. It’s in every damned new car on the road, minus 2 or 3. It’s a cockroach!
In a good way, though.
The thing is you can’t compare apples to oranges, sic. You never could. The Intel designs are not the IBM/Freescale mantra, as it were.
Please, they don’t sell the Calgon near me, so if you would then geek out better.
The next thing I don’t want to read is how MySQL is better than PostgreSQL or vice versa.
The premise of “purpose built” really needs to become the chant. The tools at hand are powerful and elegant even if not in that order at all times.
I mean really, we are talking about how software performs on hardware.
Regardless, the conversation is good. I had to look up a few things. I like that.
ahhh so this is where the real geeks hang
It is my understanding that the OS X 10.4 kernel is 64-bit but the Aqua user interface is all 32-bit. This means that you can compile and link programs in 64-bit mode only if you are not linking to any of the user interface. Somewhere I saw a discussion of a way to fake a 64-bit program with a GUI by having two separate programs that work together, one 32-bit to do the GUI and one 64-bit for whatever.
I have seen all kinds of contradictory information on 64 bit mode with Intel processors. I have seen it stated that 10.4.7 added support for 64 bit mode in new Mac Pro and iMac systems. But, I have also seen it stated that pointers are still 4 bytes on the new iMacs. Obviously, these systems are 64 bit capable. But, unless the kernel runs in 64 bit mode, it’s not gonna run 64 bit code.
Anyone have a new Mac Pro / iMac / or MacBook Pro? What does sizeof() say for long integers and pointers?
Also, I was surprised to see that the new Core 2 Duo MacBook Pros had absolutely NO mention of 64 bit support in any of the Apple information. I wonder why that is..
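The sizeof() question above is easy to check on any machine. As a rough sketch, Python’s standard `ctypes` module reports the C type sizes for the build of the interpreter that runs it, so the numbers describe that process and not necessarily the kernel.

```python
import ctypes
import struct

# Sizes, in bytes, of a few C types as seen by this Python build.
sizes = {
    "int": ctypes.sizeof(ctypes.c_int),
    "long": ctypes.sizeof(ctypes.c_long),
    "pointer": ctypes.sizeof(ctypes.c_void_p),
}

for name, size in sizes.items():
    print(f"sizeof({name}) = {size}")

# struct's native "P" format is another way to ask for the pointer size.
print("64-bit process" if struct.calcsize("P") == 8 else "32-bit process")
```

A process with 8-byte pointers is running 64-bit code, whatever the kernel itself was compiled as.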
I believe the issue has to do with CISC vs. RISC. In a CISC instruction set with variable length instructions, you can (as a processor vendor) add a new instruction to load 64 bit values into memory locations and registers with one instruction, using probably one clock cycle. In RISC, the instruction length is fixed. For PPC, instructions are 32 bits long, and only 16 bits are available for an immediate value.
To load a 32 bit immediate value into a register, you have to use two instructions to load the 32 bits, 16 bits at a time.
To load a 64 bit immediate value into a register, you have to use two instructions to load the lower 32 bits of the register (16 bits at a time), then another instruction to load those bits into the higher 32 bits, then two more instructions to load the lower 32 bits again. That’s 5 instructions to make up a 64 bit value in a register compared to 2 instructions for the 32 bit case.
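The two sequences described above can be emulated in a few lines. This is a simplified sketch: the mnemonics in the step lists (`lis`, `ori`, `sldi`, `oris`) are the usual PowerPC names for these operations, the function names are invented for illustration, and the sign extension that `lis` performs on real hardware is ignored.

```python
MASK64 = (1 << 64) - 1


def load_imm32(value: int):
    """Build a 32-bit constant from two 16-bit immediates."""
    high, low = (value >> 16) & 0xFFFF, value & 0xFFFF
    reg = high << 16          # lis: place 16 bits in the upper half
    reg |= low                # ori: OR in the lower 16 bits
    return reg, ["lis", "ori"]


def load_imm64(value: int):
    """Build a 64-bit constant from four 16-bit immediates and one shift."""
    highest, higher, high, low = [(value >> s) & 0xFFFF for s in (48, 32, 16, 0)]
    reg = highest << 16       # lis
    reg |= higher             # ori
    reg = (reg << 32) & MASK64  # sldi: move those 32 bits into the upper half
    reg |= high << 16         # oris
    reg |= low                # ori
    return reg, ["lis", "ori", "sldi", "oris", "ori"]


for loader, constant in ((load_imm32, 0xDEADBEEF), (load_imm64, 0x0123456789ABCDEF)):
    reg, steps = loader(constant)
    print(f"{hex(reg)} built in {len(steps)} instructions: {' '.join(steps)}")
```

Each step corresponds to one fixed-width instruction carrying at most 16 bits of immediate data, which is where the five-versus-two count comes from.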
It looks to me a case of making 64 bit possible, but optimizing for 32 bit. Perfectly reasonable for a time when the PC world was still trying to get everyone on board with 32 bit software. Now that the industry is shifting to 64 bit, though, it’s a design philosophy that’s come back to haunt them.
Source: http://www-128.ibm.com/developerworks/linux/library/l-ppc/