Nifty features of the ARM architecture
While sorting out GHC’s support for architectures with weakly-ordered memory subsystems I peeked at GCC’s lowering of the __sync_bool_compare_and_swap
operation [^1] on ARM. This reminded me of a few features of the architecture that I quite like and felt compelled to write about.
My test involved looking at this program,
$ cat hi.c
#include <stdint.h>
int hello(uint32_t *x) {
return __sync_bool_compare_and_swap(x, 0, 1);
}
$ arm-linux-gnueabihf-gcc -c hi.c
This produced the following assembler:
Function prelude:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
6: 6078 str r0, [r7, #4]
8: 687b ldr r3, [r7, #4]
a: 2201 movs r2, #1
Issue full barrier:
c: f3bf 8f5b dmb ish (1)
Load value and claim exclusive access:
10: e853 1f00 ldrex r1, [r3] (2)
Compare, loop if necessary:
14: 2900 cmp r1, #0
16: d103 bne.n 20 <hello+0x20>
Store value and possibly cede exclusive access:
18: e843 2000 strex r0, r2, [r3] (2)
Ensure that store succeeded, loop if necessary:
1c: 2800 cmp r0, #0
1e: d1f7 bne.n 10 <hello+0x10>
Issue full barrier:
20: f3bf 8f5b dmb ish (1)
Function epilogue:
24: bf0c ite eq (3)
26: 2301 moveq r3, #1
28: 2300 movne r3, #0
2a: 4618 mov r0, r3
2c: 370c adds r7, #12
2e: 46bd mov sp, r7
30: f85d 7b04 ldr.w r7, [sp], #4
34: 4770 bx lr
There are a few things here that I found interesting, parenthesized above and discussed in detail below.
1. Partitioned memory barriers
dmb
is ARM’s Data Memory Barrier instruction. What is interesting is the options accepted by this instruction:
SY
: Full system barrierST
: Store barrierISH
: Full barrier within inner shareable domainISHST
: Store barrier within inner shareable domainNSH
: Full barrier out to point of unificationNSHST
: Store barrier out to point of unificationOSH
: Full barrier out to outer shareable domainOSHST
: Store barrier out to outer shareable domain
The amount of granularity embodied by these options is quite forward-thinking, especially given how much freedom it leaves with the microarchitecture designer:
inner/outer shareable domain: This allows a large system to be broken into multiple independently-coherent domains. For instance, a large machine may have several paritions, each running its own supervisor (e.g. hypervisor or kernel). The
DMB ISH
instruction provides an efficient way for a supervisor to issue a barrier within the “system” on which it runs.point of unification: This is defined to be the point where the processor’s instruction and data caches are guaranteed to be coherent with its TLB walk logic. This depends upon the system design but is often the L2 cache.
This is handy for implementing JIT compilers and other self-modifying code as it provides a flush mechanism which ensures that code modified by the local core can be executed.
2. Exclusive load/store
For someone used to IA-32’s rather boring atomic primitives, ARM is a breath of fresh air. IA-32 provides a handful of instructions implementing a selection of atomic operations (e.g. atomic add, subtract, AND, OR, etc.) via the LOCK
prefix. By constrast, ARM provides a far more flexible mechanism consisting of two instructions:
ldrex
: This is similar to the typicalldr
instruction, loading an address (given in a register) from memory into a destination register. However, it also lays a claim of “exclusive” access to the loaded region (typically a cacheline, I believe).strex
: This is stores a value from a register to a memory location previously claimed withldrex
. If the value is has changed since the location was claimed then the store is aborted and a failure result is returned.
These collectively provide the necessary pieces for a compare-and-swap operation, but allowing significantly more flexibility over a dedicated CAS instruction. For instance, one might do non-trivial computation on the loaded value before storing an updated value.
3. If/then/else instruction
If/then/else flow control is incredibly common in typical programs. ARM has a nifty it
instruction which predicates the execution of up to three instructions according to a given condition.
For instance, consider the program:
// Assume:
// uint64_t x, y, z;
// uint64_t *p;
if (x == 0) {
p[0] = x;
p[1] = y;
} else {
p[0] = z
}
this can be lowered as the following svelte instruction sequence:
cmp r0, #0 # x == 0?
ittte eq
str r0, [r7] # p[0] = x;
str r1, [r7, #8] # p[1] = y;
str r3, [r7] # p[0] = z;
...
Here the it
instruction conditional on the eq
condition code and is followed by three condition switches (tte
; where t
should be read as “then” (the condition holds) and e
as “else” (the condition does not hold)). This is a remarkably efficient encoding for this common idiom.
whether the operation implies a full memory barrier. Unfortunately it seems to be a pattern that memory ordering operations are woefully underdocumented despite their subtlety.