Temporal Fuzzing II: Barriers, CAS and Locks

Part of a Series:

In Process Memory Models

In the previous post, we looked at a simple shared memory concurrency problem. Then we simulated it using Temper, my Clojure library for temporal fuzzing.

In this post, we’re going to explore shared memory concurrency further. We’ll compare Intel and ARM, and introduce locks, CAS and memory barriers.

ARM vs Intel, and the Sequential Consistency ideal

Simple rules describe which memory accesses can be executed out of order on each architecture. This is often called a memory model, although the term is overloaded.1

Here is a chart that describes common memory models.

Sequential consistency is included for reference. It disallows reordering entirely, although operations from different threads executing concurrently will still interleave.

Intel allows loads to be reordered before stores, letting the processor fetch data ahead of its use as an optimization.

ARM allows reordering more or less everything with everything. Its memory model is so weak it breaks causality - more on this later.

                             ARM             Intel              SC
Loads before stores?          x                x
Loads before loads?           x
Stores before stores?         x
Stores before loads?          x

(x = reordering permitted)

Note that two memory operations to the same address are never reordered.

This is encoded in Temper as an order-valid? function. It determines whether an instruction a may be reordered before a preceding instruction b.

(defn order-valid? [model a b]
  (cond

    ; On Intel, reads can be reordered before writes
    (= :intel model)
    (and (= :read a)
         (= :write b))

    ; On ARM, anything can be reordered with anything,
    ; because your life wasn't complicated enough
    (= :arm model)
    (and
      (#{:read :write} a)
      (#{:read :write} b))

    ; Default for sequential consistency - no rearranging
    :default
    false))
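
For example, evaluating a few pairs at the REPL gives:

(order-valid? :intel :read :write)   ; => true - loads may pass stores
(order-valid? :intel :write :write)  ; => false
(order-valid? :arm :write :write)    ; => :write (truthy - anything goes)
(order-valid? :sc :read :write)      ; => false - no reordering at all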

A reordering being permissible does not guarantee it takes place. Programs that run correctly on a weaker memory model will always work correctly on a stronger memory model that provides additional guarantees.

Porting a program from a stronger to a weaker memory model is much more challenging. Apple’s introduction of the M1 processor included emulation of Intel programs, accomplished by just-in-time translation of x64 code into ARM assembly. To prevent the memory model difference from causing issues, the processor reportedly included a switch to swap to the Intel memory model.2

Compare and Swap

Compare and Swap, or CAS, is an atomic memory operation that can be used for synchronizing threads.

It operates as a standard write, with an extra guard. The memory location and its expected contents are both given. Before the write occurs, the location is first read and checked against the expected value. If they differ, the operation does not succeed. A success flag is returned. Crucially, this happens atomically.3

def compare_and_swap( addr, expected, new ):
  if memory[ addr ] == expected:
    memory[ addr ] = new
    return True
  else:
    return False
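
If you want to experiment with these semantics, Clojure ships the same primitive for its atoms as compare-and-set!, returning a success flag in exactly the same way:

(def a (atom 0))

(compare-and-set! a 0 1)  ; => true, a now holds 1
(compare-and-set! a 0 2)  ; => false, a holds 1, not the expected 0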

This operation can be composed to build other primitives. For example, atomic increment, looping until the CAS operation succeeds:

def atomic_increment( addr ):
  while True:
    old = memory[ addr ]
    if compare_and_swap( addr, old, old + 1 ):
      return
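
The same retry loop can be written against Clojure's atoms - this is essentially what the built-in swap! does under the hood:

(defn atomic-increment! [a]
  (loop []
    (let [old @a]
      ; Another thread may have written between our read and the CAS;
      ; if so, the CAS fails and we try again
      (when-not (compare-and-set! a old (inc old))
        (recur)))))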

We’re going to use this as a motivating example, simulating it with Temper.

Atomic Increment

A standard increment has a race condition. If you run two threads side by side:

x = x + 1
x = x + 1

Updates to x will regularly be lost:

Thread A fetches 0
Thread B fetches 0
Thread A writes (0 + 1)
Thread B writes (0 + 1)

This can be ported to Temper, and viewed in its visualisation mode. Note the final contents in memory when cycling through orderings:

; Run twice side by side
(defn increment-x []
  (mem/write-val :x (inc @(mem/read-val :x))))

Atomic Increment via CAS

Porting atomic increment to Temper is relatively straightforward. With the exception of our memory operations mem/read-val and mem/cas, the logic is pure Clojure.

The @a operator in Clojure is shorthand for (deref a). All memory operations return a promise, which allows the engine to execute them out of order. A deref blocks until the promise has been fulfilled.
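
Outside Temper, the same machinery looks like this with Clojure's built-in promises:

(def p (promise))

(future (deliver p 42))  ; fulfilled later, from another thread
@p                       ; blocks until delivered, then returns 42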

(defn atomic-increment [loc]
  (loop []
    (let [v (mem/read-val loc)]
      (when-not @(mem/cas loc @v (inc @v))
        (recur)))))

Setting up two simple threads that call this function, we can view the execution path with Temper’s visualisation mode. Sometimes the CAS fails, because the value of :x has been updated by another thread. When this happens, a comment is generated and the transaction is retried.

Note the value of :x in memory. When both threads are finished, the value is the same. Regardless of thread interleaving and retries, updates are never lost.



;; Run twice side by side
(atomic-increment :x)        

Memory Barriers

Memory barriers are used to prevent implicit reordering of memory operations by a CPU. They are also known as memory fences (Intel), or data synchronization barriers (ARM). We’ll use memfence() to represent the operation generically.

If you’re running a single thread, the reordering of instructions will never change the result of the program. But as we saw in the previous post, threads running alongside it will appear to perform their operations in random order.

This led to a surprising result, where both threads print 0 on both Intel and ARM:

Thread A:          Thread B:
x = 1              y = 1
print( y )         print( x )



Simulated in Temper, this looks like:

(defn thread-a []
  (mem/write-val :x 1)
  (mem/read-val :y))

(defn thread-b []
  (mem/write-val :y 1)
  (mem/read-val :x))


A memory fence prevents memory operations from being reordered across it (where otherwise allowed):

a = 1
b = 2
memfence()
c = 3
d = 4

Assume our memory model allows writes to be reordered with writes, such as on ARM.

In this example, the stores to a and b can swap places, and the stores to c and d can swap places. But the stores to c and d can never occur before both stores to a and b are complete.
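
In Temper's vocabulary, the same fragment looks like this (a sketch - :a through :d are just arbitrary addresses):

(defn fenced-writes []
  (mem/write-val :a 1)
  (mem/write-val :b 2)
  (mem/memfence)          ; stores to :c and :d cannot pass this point
  (mem/write-val :c 3)
  (mem/write-val :d 4))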

Fixing the mystery result of our example problem is as simple as adding a memfence:

Thread A:          Thread B:
x = 1              y = 1
memfence()         memfence()
print( y )         print( x )

On both ARM and Intel, we’re now back to the land of sequential consistency:

[0 1] ; Thread A executed first
[1 0] ; Thread B executed first
[1 1] ; Interleaved execution


This can be simulated in Temper. Note that after the introduction of memfence(), the order within each thread becomes fixed. Different results are caused only by thread interleavings.

(defn thread-a []
  (mem/write-val :x 1)
  (mem/memfence)
  (mem/read-val :y))

(defn thread-b []
  (mem/write-val :y 1)
  (mem/memfence)
  (mem/read-val :x))

Locks

Edge cases that depend on exact execution orders are a potent source of bugs. They’re also very hard to reproduce. Sometimes the act of trying to reproduce them prevents them from happening, a phenomenon commonly known as a Heisenbug.

Take the interleaved execution case, where both threads print 1. Even if it doesn’t break anything today, it is good form to remove it. Otherwise it will persist as a rare path of execution, infrequently tested and never thought about.4

To do this, we’re going to introduce locks to our library.

In Temper, locks are stored in the global state. They are implemented by overriding can-execute?, which determines whether a pending instruction can be executed next:

(cond
  ; A lock instruction can only execute if the
  ; address is not already locked
  (= m :lock)
  (not (get-in store [:locks k]))

  ; Everything else is free to execute
  :default true)

The logic is simple. If you try to take a lock on an address, and there’s already a lock, the instruction will remain in the pending queue.

The execution logic of a lock delivers a success message to the instruction’s promise, and updates the locks table. Unlock reverses this action.

(do
  ; Fulfil the instruction's promise with a success message
  (deliver p :locked)
  ; Record the lock in the global locks table
  (assoc-in store [:locks k] true))
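
By symmetry, the unlock branch would look something like this (a sketch - whether Temper sets the entry to false or removes it entirely is an internal detail):

(do
  (deliver p :unlocked)
  (assoc-in store [:locks k] false))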

Note that this implementation does not support re-entrant locks, where the same thread is able to take a lock on the same resource twice.
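
Supporting re-entrancy would mean tracking ownership rather than a boolean. A hypothetical sketch, assuming each pending instruction carried the id of the thread that issued it:

; Hypothetical - not part of Temper as written
(defn can-take-lock? [store k thread-id]
  (let [owner (get-in store [:locks k])]
    (or (nil? owner)           ; the lock is free
        (= owner thread-id)))) ; or we already hold it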



Our modified Temper program with locks is as follows - note we’ve removed the memfence. Let’s run it using Intel’s memory model. It gives us the expected result:

(defn lock-thread-a []
  (mem/lock :lck)
  (let [va (mem/write-val :x 1)
        vb (mem/read-val :y)]
    (mem/unlock :lck)
    vb))

(defn lock-thread-b []
  (mem/lock :lck)
  (let [va (mem/write-val :y 1)
        vb (mem/read-val :x)]
    (mem/unlock :lck)
    vb))


There is some subtlety here.

Removing the memfence means the processor is free to execute the loads and stores in any order it wants. Were execution interleaved, there would no longer be a memory fence to prevent the [0 0] result. But we’re preventing the interleaving by blocking execution in the first place.

In an out of order environment, we need to be careful about what we mean by blocking execution. If loads or stores are reordered around locks, they may provide little to no protection.

On Intel, lock operations implicitly give you a memory barrier. On ARM, there’s no equivalent instruction: it’s up to the programmer to build a lock in software, and if you neglect to include a memory barrier when building it, the lock will not have one.
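
To make this concrete, here is a sketch of a software spinlock built from the primitives in this post - the barriers are the part that's easy to forget. (This is illustrative only; the next post builds a CAS lock properly. It assumes the lock word starts at 0.)

(defn spin-lock [loc]
  (loop []
    ; Try to flip the lock word from 0 (free) to 1 (held)
    (when-not @(mem/cas loc 0 1)
      (recur)))
  ; Without this barrier, the critical section's reads and writes
  ; could be reordered before the lock acquisition
  (mem/memfence))

(defn spin-unlock [loc]
  ; Make the critical section's writes visible before releasing
  (mem/memfence)
  (mem/write-val loc 0))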

In Temper, this can be simulated via the :lock-memfence property. Disable it, and the execution of this example code proceeds as though you don’t have locks at all.

Note that the locks themselves are functioning properly, and only one thread enters the critical section at a time. The error comes entirely from instruction reordering within each thread.5



; Same program executed with {:lock-memfence false}

(defn lock-thread-a []
  (mem/lock :lck)
  (let [va (mem/write-val :x 1)
        vb (mem/read-val :y)]
    (mem/unlock :lck)
    vb))

(defn lock-thread-b []
  (mem/lock :lck)
  (let [va (mem/write-val :y 1)
        vb (mem/read-val :x)]
    (mem/unlock :lck)
    vb))



The results can be summed up as:

[0 1] ; Thread A executed first
[1 0] ; Thread B executed first
[1 1] ; Writes reordered before locks
[0 0] ; Reads reordered before both locks and writes

Did I mention that reasoning about concurrent execution is hard?

Up Next

In the next section, we’re going to build some simple lockless queues. We’ll also write code that works on Intel but not ARM, and see how ARM’s memory model breaks causality. Finally, we’ll implement a lock using CAS, and attempt to replicate the most bizarre warning in ARM’s memory model document.

After that, we’re going to try to use Temper to do something completely different - simulate transaction isolation in SQL databases. Stay tuned!


  1. Wikipedia’s article on memory ordering is very good. If I hadn’t written Temper, these blog posts would probably be just a link to that. ↩︎

  2. A Twitter thread here has some interesting details. It’s just some person on the internet without references, but so am I. ↩︎

  3. Again, Wikipedia’s article on CAS is very good. It points out I should probably have called my variant Compare and Set. ↩︎

  4. Code Coverage for Concurrency argues that coverage metrics need to be expanded to measure concurrent execution. I couldn’t agree more. ↩︎

  5. If this seems silly, see this article by Preshing, replicating a broken mutex on a real iPhone. I haven’t left the warm comfort of my REPL. ↩︎