Set CONFIG_PROVE_LOCKING=y in the default virtme-ng config to enable
lockdep and additional lock debugging features, which can help catch
lock-related kernel issues early.
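For reference, a minimal sketch of what this amounts to, assuming the CI
keeps its virtme-ng kernel config fragment in a file like the one below
(the actual file name/path in the repository may differ):
```
# Append the lockdep option to the (hypothetical) virtme-ng config fragment.
cat >> ci-kernel.config << 'EOF'
CONFIG_PROVE_LOCKING=y
EOF
```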
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
add retries to kernel clone step
The times we most care about cloning a fresh upstream tree (i.e. some fix
was just pushed) are also when things are likely to be flakiest (i.e. no
caches anywhere), so retry up to 10 times, with roughly 3 minutes of total
wait spread across the tries, when cloning the kernel source.
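A sketch of the retry shape, assuming a plain `git clone` step and roughly
3 minutes of wait spread across the 10 attempts (the real workflow step,
repository URL and exact delays may differ):
```
# Hypothetical retry wrapper around the kernel clone step.
for attempt in $(seq 1 10); do
    git clone --depth 1 "$KERNEL_REPO_URL" linux && break
    echo "clone attempt $attempt failed, retrying in 18s..."
    sleep 18
done
```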
Enabling SCHED_MC in the kernel used for testing allows us to
potentially run more complex tests, simulating different CPU topologies
and accessing that topology data through the in-kernel scheduler's
information.
This can be useful as we add more topology awareness logic to the
sched_ext core (e.g., in the built-in idle CPU selection policy).
Therefore add this option to the default .config (and also fix a missing
newline at the end of the file).
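For illustration, the change boils down to appending a line like this to
the shared config fragment (the file name here is assumed):
```
# Hypothetical config fragment name; the real one lives in the repo.
echo 'CONFIG_SCHED_MC=y' >> sched-ext.config
```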
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
`--disable-topology` currently defaults to `false` (i.e. topology awareness
enabled). Change this so that topology awareness is enabled by default on
hardware that may benefit from it (multiple NUMA nodes or LLCs) and disabled
on hardware that does not.
This is a slightly noisy change as we have to move ownership of the newly
mutable layer specs into the `Scheduler` object (previously they were a
borrow). We don't have a `Topology` object to make the default decision from
until `Scheduler::init`, and I think this is because of the possibility of
CPU hotplug. We therefore have to clone the `Vec<LayerSpec>` each time, as
it is potentially mutable.
Test plan:
- CI. Updated to be explicit about topology in both cases.
Single NUMA multi-LLC machine:
```
$ scx_layered --run-example
...
13:34:01 [INFO] Topology awareness not specified, selecting enabled based on
hardware
...
$ scx_layered --run-example --disable-topology=true
...
13:33:41 [INFO] Disabling topology awareness
...
$ scx_layered --run-example -t
...
13:33:15 [INFO] Disabling topology awareness
...
$ scx_layered --run-example --disable-topology=false
# none of the above messages present
```
Single NUMA single LLC machine:
```
$ scx_layered --run-example
15:33:10 [INFO] Topology awareness not specified, selecting disabled based on
hardware
```
Enable the `caching-build` workflow on `merge_group` events. This is required
for merge queues, which should make it possible to merge lots of PRs while
still getting full CI testing. Currently `lint` is the only required check,
but we will add more in the future.
* Fix a couple of misc errors in build scripts.
* Tweak scripts/kconfigs to make bpftrace work.
* Update how CI caching works to make builds faster (~6 minute turnaround
  time).
* Update CI config to generate per-scheduler debug archives with guest
  dmesg/scheduler stdout, guest stdout, bpftrace script output, and
  veristat output.
* Update build scripts to accept the following:
  * VNG RW -- write to the host filesystem (better caching, logging).
* For stress tests in particular (via ini config):
  * QEMU opts -- to facilitate reproducing bugs (e.g. high core count).
  * bpftrace scripts -- specify bpftrace scripts to run during stress
    tests.
* make CI nicer
Replace the `build-scheds` and `merged` workflows with `caching-build`, and
rename `caching-build` to `build-and-test`.
This should make the CI reports on PRs nice and specific (i.e. at a glance,
know what passes and what fails).
It also keeps PR CI jobs up to date (as folks edit things) and has them all
use a single config (Ubuntu 24.04, etc.).
* prevent untar permission errors from causing cache misses
I think PRs are becoming harder to write because all parts of a CI job
are cancelled when one part fails.
I am also starting to see that we have enough largely disjoint moving
pieces that there will often be one failing stress tests at any given time.
To address this, make CI always run all stress tests.
Use `cargo fmt` with a specific nightly toolchain in CI to enforce formatting.
Globally format these files while the diff is still small so we can stay on
top of it.
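The check itself is essentially the standard rustfmt gate, roughly as
follows (the pinned nightly version is omitted here on purpose):
```
# Formatting check, assuming rustfmt from a pinned nightly toolchain.
cargo +nightly fmt --all -- --check
```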
Test plan:
- CI lint check passes.
Split build and test jobs to reduce CI turnaround time and make it clear
what is failing when something fails.
Also add virtiofsd to the dependencies to make test compilation faster
(most test time is compilation) and remove all forced 9p usage.
The right option to set the number of CPUs in vng is `-p`, not `-c`.
Change the command to use the long-format options `--cpu` and `--memory`
for better clarity.
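For example, the invocation now spells the resources out explicitly; the
values and the guest command below are illustrative, and the exact way CI
passes the command to vng may differ:
```
# 8 CPUs and 4G of guest memory are example values only.
vng --cpu 8 --memory 4G -- ./run-tests.sh
```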
This fixes issue #625.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
This commit fixes the Rust tests and configures CI to run them on every
commit. It also sets up CI to run them in a timely manner by caching
dependencies and splitting jobs.
Now that we have a sane build environment with meson we don't need to
force sequential builds anymore (--jobs=1) and we can just run regular
builds without having to worry about excessive parallelization.
See commit 43950c6 ("build: Use workspace to group rust sub-projects").
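In other words, CI can now drive a plain parallel build, roughly as below
(build directory name illustrative):
```
meson setup build
meson compile -C build    # no --jobs=1 cap needed anymore
```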
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The veristat diff action isn't working properly so remove it for now.
Instead, run veristat (non diff mode) as part of the build-scheds
workflow. This should still provide some useful data for BPF program
related stats.
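Conceptually the new step just points veristat at the compiled BPF objects;
the object path below is an assumption about the build layout:
```
# Run veristat in its default (non-diff) mode over the schedulers' BPF
# objects (path is a placeholder for wherever the .bpf.o files end up).
veristat "$BUILD_DIR"/*.bpf.o
```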
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Add a github action to run
[`veristat`](https://github.com/libbpf/veristat) during PRs and merges.
This uses an [action cache](https://github.com/actions/cache) to store the
veristat values when a PR is merged.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
The kernel from kernel.org lacks the necessary .config chunk required
by the build via virtme-ng.
To fix this, include the kernel .config chunk directly in the scx
repository and use it from here.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
This change adds stress-ng as a load test for schedulers when running in
CI. It will run stress-ng while schedulers are being tested with a
reasonable amount of work. At the end of the run the stress-ng metrics
are collected for later analysis. However, since the tests run inside a VM,
the results may not be very robust.
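Something along these lines, with the exact stressors and durations left to
the CI config (the flags below are standard stress-ng options; the values
are illustrative):
```
# Generate a reasonable amount of CPU load while a scheduler is under
# test, then print a brief metrics summary for later analysis.
stress-ng --cpu 4 --timeout 30s --metrics-brief
```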
We're failing in CI with:
/usr/include/string.h:26:10: fatal error: 'bits/libc-header-start.h'
file not found
26 | #include <bits/libc-header-start.h>
This header is apparently shipped with the gcc-multilib Ubuntu package,
according to a Stack Overflow post. Let's see if this fixes CI.
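If that is indeed the case, the fix is just one more package in the CI
setup step:
```
sudo apt-get install -y gcc-multilib
```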
Signed-off-by: David Vernet <void@manifault.com>
During the build meson attempts to distribute the workload of multiple
sub-projects across all available CPUs and to parallelize each build within
those projects, resulting in up to NxN tasks being generated.
This can overload the CI systems, leading to failures (see for example
issue #202).
To mitigate this, always use --jobs=1 during the CI run, which serializes
the build of the sub-projects and restricts the level of parallelism to N.
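Concretely, the CI build step becomes something like the following (the
actual invocation may be wrapped in a script):
```
# Serialize the sub-project builds; each sub-project can still
# parallelize its own compilation internally.
meson compile -C build --jobs=1
```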
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Instead, clone the libbpf repo at a specific hash during setup.
This fixes an issue whereby submodules are not included in the tarball
and therefore won't be updated/fetched during setup after extracting
the tarball.
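Roughly as follows (the real pinned hash lives in the setup script and is
only a placeholder here):
```
git clone https://github.com/libbpf/libbpf.git
git -C libbpf checkout "$LIBBPF_PINNED_HASH"   # placeholder for the pinned commit
```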
We're basically always running two CI jobs: one for a remote push, and
another for when a PR is opened. These are essentially measuring the
same thing, so let's save CI bandwidth and just do a PR run. This will
hopefully make things a bit less noisy as well.
Signed-off-by: David Vernet <void@manifault.com>
Instead of downloading a precompiled sched-ext enabled kernel from the
Ubuntu ppa, fetch the latest kernel directly from the sched-ext git
repository and recompile it on-the-fly using virtme-ng.
This allows us to get rid of the Ubuntu ppa dependency, takes potential
Ubuntu-specific patches out of the equation, and ensures that all the
schedulers are tested with the most up-to-date sched-ext kernel (which
should also help to detect potential kernel-related issues in advance).
The downside is that the CI runs will take a bit longer now, because we
are recompiling the kernel from scratch. However, the kernel built with
virtme-ng is relatively quick to compile and includes all the sched-ext
features required for testing.
It's worth noting that this method aligns with the current sched-ext
kernel CI, where we test only the in-kernel schedulers (as intended).
This change allows us to extend the test coverage, using the same kernel to
also test the schedulers included in this repository.
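At a high level the workflow now does something like the following; the vng
flags are an assumption about how the CI drives the build:
```
# Fetch the latest sched-ext kernel tree and build a test kernel with
# virtme-ng (vng --build compiles the kernel in the current directory).
git clone --depth 1 https://github.com/sched-ext/sched_ext.git linux
cd linux
vng --build
```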
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Temporarily switch to the unstable sched-ext ppa, so that we can resume
testing with the new kernel API.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Improve build portability by including asm-generic/errno.h, instead of
linux/errno.h.
The difference between these two headers can be summarized as follows:
- asm-generic/errno.h contains generic error code definitions that are
intended to be common across different architectures,
- linux/errno.h includes architecture-specific error codes and
provides additional (or overrides) error code definitions based on
the specific architecture where the code is compiled.
Considering the architecture-independent nature of scx, the advantages
of being able to use architecture-specific error codes are marginal or
negligible (and we should probably discourage using them).
Moving towards asm-generic/errno.h, however, allows the removal of
cross-compilation dependencies (such as the gcc-multilib package in
Debian/Ubuntu) and improves code portability across various architectures
and distributions.
This also allows us to remove a symlink hack from the GitHub workflow.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Use virtme-ng to run the schedulers after they're built; virtme-ng allows
us to pick an arbitrary sched-ext enabled kernel and run it while
virtualizing the entire user-space root filesystem, so we can basically
execute the recompiled schedulers inside that kernel.
This should allow us to catch potential run-time issues early (both in the
kernel and in the schedulers).
The sched-ext kernel is taken from the Ubuntu ppa (ppa:arighi/sched-ext)
at the moment, since it is the easiest / fastest way to get a
precompiled sched-ext kernel to run inside the Ubuntu 22.04 testing
environment.
The schedulers are tested using the new meson target "test_sched", the
specific actions are defined in meson-scripts/test_sched.
By default each test has a timeout of 30 sec after virtme-ng completes the
boot (which should be enough to initialize the scheduler and run it for a
few seconds), while the total lifetime of the virtme-ng guest is set to
60 sec; after that the guest is killed (this allows us to catch potential
kernel crashes / hangs).
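A sketch of the per-scheduler step, assuming the inner timeout bounds the
scheduler run and the outer one bounds the whole guest (the scheduler
binary path and the exact vng invocation are placeholders):
```
# Outer limit: kill the whole guest after 60s (catches kernel hangs).
# Inner limit: stop the scheduler itself 30s after boot completes.
timeout --foreground 60 vng -- timeout 30 "$SCHED_BINARY"
```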
If a single scheduler fails the test, the entire "test_sched" action
will be interrupted and the overall test result will be considered a
failure.
At the moment scx_layered is excluded from the tests, because it
requires a special configuration (we should probably pre-generate a
default config in the workflow actions and change the scheduler to use
the default config if it's executed without any argument).
Moreover, scx_flatcg is also temporarily excluded from the tests,
because of these known issues:
- https://github.com/sched-ext/scx/issues/49
- https://github.com/sched-ext/sched_ext/pull/101
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Andrea pointed out that we can and should be using Ubuntu 22.04.
Unfortunately it still doesn't ship some of the deps we need like
clang-17, but it does at least ship virtme-ng, so it's good for us to use
it so that we can actually test running the schedulers in a virtme-ng VM
once it supports being run in Docker.
Also, update the job to run on pushes, and not just when a PR is opened.
Suggested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: David Vernet <void@manifault.com>
When Ubuntu ships with sched_ext, we can also maybe test loading the
schedulers (not sure if the runners can run as root though). For now, we
should at least have a CI job that lets us verify that the schedulers
can _build_. To that end, this patch adds a basic CI action that builds
the schedulers.
As is, this is a bit brittle in that we're having to manually download
and install a few dependencies. I don't see a better way for now without
hosting our own runners with our own containers, but that's a bigger
investment. For now, hopefully this will get us _some_ coverage.
Signed-off-by: David Vernet <void@manifault.com>