Nifty features of the ARM architecture

2019-06-06 | tagged with

memory barriers
ARM
compilers

While sorting out GHC’s support for architectures with weakly-ordered memory subsystems I peeked at GCC’s lowering of the __sync_bool_compare_and_swap operation [^1] on ARM. This reminded me of a few features of the architecture that I quite like and felt compelled to write about.

My test involved looking at this program,

$ cat hi.c
#include <stdint.h>
int hello(uint32_t *x) {
	return __sync_bool_compare_and_swap(x, 0, 1);
}
$ arm-linux-gnueabihf-gcc -c hi.c

This produced the following assembler:

Function prelude:
   0:	b480      	push	{r7}
   2:	b083      	sub	sp, #12
   4:	af00      	add	r7, sp, #0
   6:	6078      	str	r0, [r7, #4]
   8:	687b      	ldr	r3, [r7, #4]
   a:	2201      	movs	r2, #1
Issue full barrier:
   c:	f3bf 8f5b 	dmb	ish                  (1)
Load value and claim exclusive access:
  10:	e853 1f00 	ldrex	r1, [r3]           (2)
Compare, loop if necessary:
  14:	2900      	cmp	r1, #0
  16:	d103      	bne.n	20 <hello+0x20>
Store value and possibly cede exclusive access:
  18:	e843 2000 	strex	r0, r2, [r3]       (2)
Ensure that store succeeded, loop if necessary:
  1c:	2800      	cmp	r0, #0
  1e:	d1f7      	bne.n	10 <hello+0x10>
Issue full barrier:
  20:	f3bf 8f5b 	dmb	ish                  (1)
Function epilogue:
  24:	bf0c      	ite	eq                   (3)
  26:	2301      	moveq	r3, #1
  28:	2300      	movne	r3, #0
  2a:	4618      	mov	r0, r3
  2c:	370c      	adds	r7, #12
  2e:	46bd      	mov	sp, r7
  30:	f85d 7b04 	ldr.w	r7, [sp], #4
  34:	4770      	bx	lr

There are a few things here that I found interesting, parenthesized above and discussed in detail below.

1. Partitioned memory barriers

dmb is ARM’s Data Memory Barrier instruction. What is interesting is the options accepted by this instruction:

SY: Full system barrier
ST: Store barrier
ISH: Full barrier within inner shareable domain
ISHST: Store barrier within inner shareable domain
NSH: Full barrier out to point of unification
NSHST: Store barrier out to point of unification
OSH: Full barrier out to outer shareable domain
OSHST: Store barrier out to outer shareable domain

The amount of granularity embodied by these options is quite forward-thinking, especially given how much freedom it leaves with the microarchitecture designer:

inner/outer shareable domain: This allows a large system to be broken into multiple independently-coherent domains. For instance, a large machine may have several paritions, each running its own supervisor (e.g. hypervisor or kernel). The DMB ISH instruction provides an efficient way for a supervisor to issue a barrier within the “system” on which it runs.
point of unification: This is defined to be the point where the processor’s instruction and data caches are guaranteed to be coherent with its TLB walk logic. This depends upon the system design but is often the L2 cache.

This is handy for implementing JIT compilers and other self-modifying code as it provides a flush mechanism which ensures that code modified by the local core can be executed.

2. Exclusive load/store

For someone used to IA-32’s rather boring atomic primitives, ARM is a breath of fresh air. IA-32 provides a handful of instructions implementing a selection of atomic operations (e.g. atomic add, subtract, AND, OR, etc.) via the LOCK prefix. By constrast, ARM provides a far more flexible mechanism consisting of two instructions:

ldrex: This is similar to the typical ldr instruction, loading an address (given in a register) from memory into a destination register. However, it also lays a claim of “exclusive” access to the loaded region (typically a cacheline, I believe).
strex: This is stores a value from a register to a memory location previously claimed with ldrex. If the value is has changed since the location was claimed then the store is aborted and a failure result is returned.

These collectively provide the necessary pieces for a compare-and-swap operation, but allowing significantly more flexibility over a dedicated CAS instruction. For instance, one might do non-trivial computation on the loaded value before storing an updated value.

3. If/then/else instruction

If/then/else flow control is incredibly common in typical programs. ARM has a nifty it instruction which predicates the execution of up to three instructions according to a given condition.

For instance, consider the program:

// Assume:
//   uint64_t x, y, z;
//   uint64_t *p;

if (x == 0) {
  p[0] = x;
  p[1] = y;
} else {
  p[0] = z
}

this can be lowered as the following svelte instruction sequence:

cmp r0, #0         # x == 0?
ittte eq
str r0, [r7]       # p[0] = x;
str r1, [r7, #8]   # p[1] = y;
str r3, [r7]       # p[0] = z;
...

Here the it instruction conditional on the eq condition code and is followed by three condition switches (tte; where t should be read as “then” (the condition holds) and e as “else” (the condition does not hold)). This is a remarkably efficient encoding for this common idiom.

whether the operation implies a full memory barrier. Unfortunately it seems to be a pattern that memory ordering operations are woefully underdocumented despite their subtlety.