Temporal Fuzzing I: Memory Models

Part of a Series:

Concurrency

Reasoning about concurrent execution is hard.

Let’s start with some pseudocode. Assume all memory is initialized to 0, and each of these two threads is spawned very close to the other in random order and execute in parallel.

What will each print?

x = 1
print( y )
y = 1
print( x )

Please do take some time to think this through.

The most natural answer is that either Thread A or Thread B starts first and runs to completion, with its twin executing in turn. This yields two possible outputs:

[0 1] ; Thread A executed first
[1 0] ; Thread B executed first

Where [A B] is the output for each thread, with one possibility per line.

If you’ve had your morning coffee, you’ve probably spotted a third possibility. The execution of both threads overlaps, and the first line from both Thread A and Thread B execute first, setting both x = 1 and y = 1 .

By the time the printing occurs, both will output 1 . So there are three possible outcomes:

[0 1] ; Thread A executed first
[1 0] ; Thread B executed first
[1 1] ; Interleaved execution

If you haven’t seen a problem like this before and figured that out, that’s genuinely impressive.

Before you go patting yourself on the back too much though, there’s just one problem.

This is not the correct answer.

It relies on an implicit assumption of sequential consistency, where each thread executes its lines in order, and any modifications made are immediately and globally visible.

The correct answer depends on a question - what is your memory model?

Take the pseudocode above, translate it to the language of your choice that implements mutable shared memory. Then run it on whatever device you’re reading this on. You will occasionally get a fourth result:

[0 1] ; Thread A executed first
[1 0] ; Thread B executed first
[1 1] ; Interleaved execution
[0 0] ; ?!

Memory Models

Modern processors do not execute instructions in order, and will aggressively load values from memory before they’re needed.

There is a strong contract that this will never break single threaded execution. With shared memory concurrency, all bets are off.

In the case where both threads print 0 , each thread has aggressively loaded the value it intends to print before executing the store, and the execution of both threads have also overlapped. The programs have been implicitly transformed into:

print( y )
x = 1
print( x )
y = 1

Intel provides a strong memory model with good guarantees, with a single caveat - “Loads may be reordered with older stores to different locations but not with older stores to the same location”.1 That is exactly what is happening here.

ARM’s memory model offers much less protection. Their equivalent document may alarm. 2

Temporal Fuzzing

I hope I’ve convinced you that analyzing a four line program that touches shared memory is not trivial.

So what hope do we have building something larger? What happens when want to write programs that interact with more complicated concurrency models, such as databases, disks that can fail, or networks that can go down or partition?

What happens when you want to write a distributed database that doesn’t lose data, even when a disk fails during a network partition while rebalancing shards?

This is the motivation behind a new Clojure library I’m developing, tentatively called Temper.3 It simulates multiple threads of execution, reordering and interleaving operations in any valid way as defined by the pluggable concurrency model.

To circle back to our motivating example, we can implement two threads and execute it both under ARM and Intel memory models, as well under sequential consistency, which never reorders operations.

(defn thread-a []
  (mem/write-val :x 1)
  (mem/read-val :y))

(defn thread-b []
  (mem/write-val :y 1)
  (mem/read-val :x))

(st/multi-execute (Memory. {:model :intel}) {} [thread-a thread-b])
(st/multi-execute (Memory. {:model :arm}) {} [thread-a thread-b])
(st/multi-execute (Memory. {:model :sc}) {} [thread-a thread-b])

Each test is run 1000 times, and the frequency of each outcome is listed.

; Intel result
[0 1] ; 235 of 1000
[1 1] ; 268 of 1000
[1 0] ; 254 of 1000
[0 0] ; 243 of 1000
; ARM result
[0 0] ; 243 of 1000
[1 0] ; 230 of 1000
[1 1] ; 270 of 1000
[0 1] ; 257 of 1000
; Sequential Consistency result
[1 0] ; 242 of 1000
[1 1] ; 513 of 1000
[0 1] ; 245 of 1000

Implementation

Threads are standard Clojure code. Each read or write to the central data store immediately returns a promise. It also generates an event, which is added to a pending queue:

[0 :write :x 1]
[0 :read :y]
[1 :write :y 1]
[1 :read :x]

When all threads are parked and waiting for the result of a promise, the execution engine uses the active memory model to determine which events are legal to execute next. For instance, two writes on the same thread cannot have their order shuffled on Intel, but they can on ARM.

Executing an event modifies the global store, and returns the result of the operation to the promise. The thread blocking on that promise can now proceed.

This execution model flattens the probability distribution, and makes edge cases trigger often. In a real world test, [0 0] is a result that may trigger a few times every million runs on an Intel chip. In the above simulation, its probability is 25%.

A new model can be implemented using this high level interface:

(defprotocol Model
  ; Updates store with the result of executing msg
  (apply-message [_ msg store])
  ; When fuzzing orders, can fst be reordered before lst?
  (order-valid? [_ fst lst])
  ; Can msg be executed given the state of store,
  ; used for locks and transactions
  (can-execute? [_ msg store]))

While so far we’ve focused on in process shared memory concurrency, the same technique can be used to simulate multiple processes accessing a disk, programs communicating over a network, database transactions at different isolation levels and more.

Interactive Visualisation

After execution, a Temper simulation can be dumped and visualised offline. The interleaved execution of each thread is shown, as well as the state of memory at any point in time. Each line of memory represents the state after the execution of the instruction to its left.

It’s important to note the reordering of instructions within each thread. Local reordering together with thread interleaving determine the program output.

Press the Shuffle button to fetch the next execution trace.

(defn thread-a []
  (mem/write-val :x 1)
  (mem/read-val :y))

(defn thread-b []
  (mem/write-val :y 1)
  (mem/read-val :x))

Up Next

In the next section, we’re going to use this system to dive deeper into in process concurrency, including modeling locks and memory fences.

After that we can move on to the good stuff!


  1. Intel’s memory ordering is here ↩︎

  2. ARM’s equivalent document is here ↩︎

  3. Named in the sense of temper your expectations. This stuff is hard. ↩︎