Page faults cannot happen when the user-space scheduler is running,
otherwise we may hit deadlock conditions: a kthread may need to run to
resolve the page fault, but the user-space scheduler is waiting on the
page fault to be resolved => deadlock.
We solved this problem (mostly) in commit 9708a80 ("scx_userland: use a
custom memory allocator to prevent page faults"), introducing a custom
allocator for the user-space scheduler that operates on a pre-allocated
mlocked memory buffer, but there is an exception that can still trigger
page faults: kcompactd.
When memory compaction is enabled, specifically with
vm.compact_unevictable_allowed=1 (which is often the default in many
distributions), kcompactd regularly attempts to compact all memory
zones, such that free memory is available in contiguous blocks where
feasible, including unevictable memory as well.
In the event that kcompactd remaps pages within the user-space
scheduler's address space, it can lead to page faults, resulting in a
potential deadlock.
To prevent this from happening automatically set
vm.compact_unevictable_allowed=0 when the scheduler is loaded and
restore the previous value when the scheduler in unloaded. In this way
we can prevent kcompactd from touching the unevictable memory associated
to the user-space scheduler.
Keep in mind that this is not a full bullet proof solution: something
else in the system may still set vm.compact_unevictable_allowed=1 while
the scheduler is running, re-enabling the risk of deadlock.
Ideally we would need a way to mark the user-space scheduler memory as
"really unevictable", or a proper kernel ABI to instruct kcompactd to
exclude certain tasks (or better, cgroups) from its proactive memory
compaction actions, but since then, this seems to be the best way to
mitigate this issue.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
This directory contains schedulers with user space rust components.
The README in each scheduler directory provides some background and describes
the types of workloads or scenarios they're designed to accommodate. For more
details on any of these schedulers, please see the header comment in their
main.rs or *.bpf.c files.