I’m learning about CPU virtualization and the Limited Direct Execution model. Part of CPU virtualization is allowing multiple processes to share a CPU and run concurrently. A context switch occurs when the operating system pauses one process and resumes another, which requires saving the state of the first process and restoring that of the second. This overhead can significantly impact system performance, so measuring it helps us understand the cost of multitasking.
After the syscall time homework, I also got a task to measure how much a context switch costs¹. The book mentions that LMbench uses the following approach to measure it:
- run two processes on a single CPU
- set up two UNIX pipes between them
- the first process writes to the first pipe and waits to read from the second
- the operating system blocks the first process since it is waiting on I/O
- the second process reads from the first pipe, then writes to the second pipe
- repeat
Unfortunately, I haven’t found a reliable way to enforce the “run two processes on a single CPU” requirement on macOS: setting thread affinity failed (a sketch of the attempt is below), and I couldn’t find another surefire way to keep both processes on the same CPU core. This limitation could affect measurement accuracy, since cross-core context switches have different overhead than intra-core switches.
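For reference, the affinity attempt looked roughly like this. It is a minimal sketch (not the exact code I ran) that calls the Mach thread_policy_set API through hand-written bindings, so it needs libc, e.g. zig run -lc affinity.zig (the file name is just a placeholder):

const std = @import("std");

// Minimal hand-written bindings for the Mach thread-affinity API on macOS.
const integer_t = c_int;
const kern_return_t = c_int;
const mach_port_t = c_uint;
const thread_affinity_policy_data_t = extern struct { affinity_tag: integer_t };
const THREAD_AFFINITY_POLICY: c_uint = 4;

extern "c" fn mach_thread_self() mach_port_t;
extern "c" fn thread_policy_set(
    thread: mach_port_t,
    flavor: c_uint,
    policy_info: [*]integer_t,
    count: c_uint,
) kern_return_t;

pub fn main() void {
    // Hint that this thread belongs to affinity tag 1; threads sharing a tag
    // are supposed to be scheduled close together.
    var policy = thread_affinity_policy_data_t{ .affinity_tag = 1 };
    const kr = thread_policy_set(
        mach_thread_self(),
        THREAD_AFFINITY_POLICY,
        @ptrCast(&policy),
        1, // THREAD_AFFINITY_POLICY_COUNT
    );
    // On Apple Silicon the kernel rejects this policy, so kr comes back non-zero.
    std.debug.print("thread_policy_set returned {d}\n", .{kr});
}

Even where it is supported, THREAD_AFFINITY_POLICY is only a scheduling hint for grouping threads, not a hard pin to a specific core.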
To address this, I also tested on a FreeBSD VM configured with a single core, which naturally satisfies the single-CPU requirement. Here is the code in Zig that follows the procedure above:
pub fn main() !void {
    const pipe1 = try pipe();
    const pipe2 = try pipe();

    const pid = try fork();
    if (pid == 0) {
        // Child: the ping-pong partner that reads from pipe1 and writes to pipe2.
        try read_write(pipe1, pipe2);
        std.posix.exit(0);
    }

    // Parent: drives the ping-pong and times it.
    const duration_ns = try write_read(pipe1, pipe2);
    // Each iteration forces two context switches (parent -> child -> parent),
    // hence the division by iterations * 2; dividing by 1000 converts ns to μs.
    const avg_us = @as(f64, @floatFromInt(duration_ns)) / (iterations * 2 * 1000.0);
    _ = std.posix.waitpid(pid, 0);
    print("Context switch estimate: {d:.2}μs (includes pipe overhead)\n", .{avg_us});
}

fn read_write(pipe1: [2]fd_t, pipe2: [2]fd_t) !void {
    var buffer = [1]u8{'x'};
    // Keep only the ends this process uses: pipe1's read end and pipe2's write end.
    defer close(pipe1[0]);
    close(pipe1[1]);
    close(pipe2[0]);
    defer close(pipe2[1]);
    for (0..iterations) |_| {
        _ = try read(pipe1[0], &buffer);
        _ = try write(pipe2[1], &buffer);
    }
}

fn write_read(pipe1: [2]fd_t, pipe2: [2]fd_t) !i128 {
    var buffer = [1]u8{'x'};
    // Keep only the ends this process uses: pipe1's write end and pipe2's read end.
    close(pipe1[0]);
    defer close(pipe1[1]);
    defer close(pipe2[0]);
    close(pipe2[1]);

    const start_ns = std.time.nanoTimestamp();
    for (0..iterations) |_| {
        _ = try write(pipe1[1], &buffer);
        _ = try read(pipe2[0], &buffer);
    }
    const end_ns = std.time.nanoTimestamp();
    return end_ns - start_ns;
}

const std = @import("std");
const print = std.debug.print;
const pipe = std.posix.pipe;
const read = std.posix.read;
const write = std.posix.write;
const close = std.posix.close;
const fork = std.posix.fork;
const fd_t = std.posix.fd_t;

const iterations = 100_000;
The core of the measurement is the pair of calls on either side of the fork check: the child runs read_write, reading from pipe1 and writing to pipe2, while the parent runs write_read, writing to pipe1 and reading from pipe2, with the timing measurement wrapped around its entire loop.
The results are somewhat interesting. On my Apple M4 Max machine (with FreeBSD running in a single-core VM on the same host):
- macOS: 2.18μs
- FreeBSD²: 0.66μs
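To put the macOS number in context: 2.18μs per switch over 100,000 iterations with two switches each means the timed loop took roughly 436 ms in total, i.e. 436,000,000 ns / (100,000 × 2 × 1,000) ≈ 2.18μs.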
The C version of the program produced similar results on macOS and FreeBSD.
Environment for these runs:
- Host: macOS on Apple M4 Max; STDIN was a TTY
- Guest: FreeBSD 14.3-RELEASE under QEMU with HVF²
How to run
zig run measure_context_switch.zig
Note that I also ran with optimizations (see the command below), but the numbers did not change. Compiled with Zig 0.16.0-dev.253+4314c9653.
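If you want to repeat the optimized runs, passing an optimize mode to zig run should be enough; ReleaseFast is one reasonable choice:

zig run -O ReleaseFast measure_context_switch.zig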
1. Operating Systems: Three Easy Pieces. Available at: https://pages.cs.wisc.edu/~remzi/OSTEP/ (accessed January 22, 2025)
2. For the FreeBSD guest on Apple Silicon, you can see the command here