Concurrent Programming

Lock-free programming often results in slower programs at low CPU counts. Lock-free algorithms offer better scalability (w.r.t. the number of cores). But you need to pay something to achieve that, and what you pay is single-thread performance / performance on machines with only a dozen CPUs or so (depending on the problem and amount of contention).

Lock-free programming can be very, very difficult.

Lectures

Synchronization without locks and Concurrent Data Structures — CS 4435 - CS 9624, UWaterloo. This one has similar content to Charles Leiserson’s lecture below.

Videos

CppCon 2017: Fedor Pikus “C++ atomics, from basic to advanced. What do they really do?” - YouTube
17. Synchronization Without Locks - YouTube, by Charles Leiserson under MIT 6.172 Performance Engineering of Software Systems on OCW, 2018.

Udemy

Modern C++ Concurrency in Depth ( C++17/20) | Udemy; purchased Sept 19, 2024.

Concepts

A set of operations can be considered atomic when two conditions are met:

Until the entire set of operations completes, no other process can know about the changes being made (invisibility); and
If any of the operations fail then the entire set of operations fails, and the state of the system is restored to the state it was in before any of the operations began.

C++ and Beyond 2012: Herb Sutter - atomic Weapons 1 of 2 - YouTube; the C++ memory model and modern hardware.

Memory Consistency Models

memory consistency model: how memory operations behave in the parallel computer system. This is a contract between programmer and system, wherein the system guarantees that if the programmer follows the rules, memory will be consistent and the results of memory operations will be predictable.

Causal consistency is weaker than sequential consistency, which requires that all nodes see all writes in the same order

Sequential consistency was defined by Leslie Lamport (1979) for concurrent programming, as follows: the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

The sequence of instructions as defined by a processor’s program are interleaved with the corresponding sequences defined by the other processors’s programs to produce a global linear order of all instructions.

Mutual Exclusion principal

Mutual exclusion (often abbreviated to mutex) algorithms are used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable, by pieces of code called critical sections.

A critical section is a piece of code where a process or thread accesses a common resource.

The synchronization of access to those resources is an acute problem because a thread can be stopped or started at any time.

Mutexes are enough to handle >90% of all cases. You should resort to lock-free algos only if you know that you need them to improve scalability (or because you need the lock-freedom for some other reason, e.g., to keep worst-case time bounded in real time programs). It also depends on your system and application. If you’re targeting a quad-socket NUMA system, looking at lock-free algos makes a lot more sense than when you’re targeting a laptop. via

Dekker’s algorithm guarantees mutual exclusion, freedom from deadlock, and freedom from starvation.

Peterson’s algorithm is another mutual exclusion mechanism that allows two processes to share a single-use resource without conflict, using only shared memory for communication.

While Peterson’s original formulation worked with only two processes, the algorithm can be generalized for more than two, which makes it more powerful than Dekker’s algorithm.

atomic read-modify-write instruction

test-and-set

The test-and-set instruction is an instruction used to write to a memory location and return its old value as a single atomic (i.e. non-interruptible) operation.

If multiple processes may access the same memory, and if a process is currently performing a test-and-set, no other process may begin another test-and-set until the first process is done.

Compare and Swap ( CAS )

is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response (this variant is often called compare-and-set), or by returning the value read from the memory location (not the value written to it). Compare and Swap - YouTub by Rob Edwards at SDSU.

there a subset of lock-free algorithms, called wait-free algorithms, the lock-free algorithms do not give real-time guarantees.

Concurrent data structures

Concurrent data structures are designed to handle simultaneous access and modification by multiple threads or processes in a concurrent computing environment. In such environments, where multiple tasks or threads are executed in parallel, traditional data structures might encounter issues like race conditions, deadlocks, and data corruption if not appropriately managed. Concurrent data structures aim to address these challenges and allow multiple threads to work with the data structure concurrently with correctness.

Mark Moir and Nir Shavit - Concurrent Data Structures ( 32 page PDF ) is a good text book material.

See ACM review article (Shavit, 2011)

Concurrent linked lists — Lock-free or non-blocking implementations ensure that insertions, deletions, and traversals can be performed concurrently without causing conflicts.
Concurrent hash tables — Techniques like fine-grained locking or lock-free approaches are used to manage access to individual buckets or entries.
Concurrent Queues and Stacks — Lock-free queues, such as Michael-Scott queues, are commonly used for scenarios where multiple threads need to enqueue and dequeue items.
Concurrent Trees — Concurrent versions of trees, such as binary search trees and AVL trees, are designed to enable concurrent insertion, deletion, and searching operations without causing conflicts or inconsistencies.
Concurrent sets and maps — They use partitioning, sharding, or lock-free strategies to handle concurrent operations, among other techniques.

Michael Scott Queues

(Michael & Scott, 1996) — paper ( PDF)

Video lecture Michael Scott — Nonblocking data structures. Part 1. - YouTube, part 2

Michael, M. M., & Scott, M. L. (1996). Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, 267–275.

Shavit, N. (2011). Data Structures in the Multicore Age. Communications of The Acm, 54(3), 76–84. https://doi.org/10.1145/1897852.1897873

btbytes.com