Commit Graph

123 Commits

Author SHA1 Message Date
Tejun Heo
76934f3aab scx_rusty: Convert to scx_stats
This allows scx_rusty to avoid generating excessive logs for statistics
while still allowing detailed monitoring on demand.
2024-08-22 19:44:12 -10:00
Tejun Heo
16c07a5cd9 scx_rusty: Don't reset bpf_stats, remember prev states and calculate delta
This will ease transition to scx_stats.
2024-08-22 13:02:23 -10:00
Tejun Heo
13fa48a871 scx_rusty: Separate out stats generation and formatting
to prepare for scx_stats conversion.
2024-08-22 10:03:10 -10:00
Tejun Heo
b4564520e5 scx_rusty: Simplify Stats structs and take id out of the structs
to prepare for scx_stats conversion. While at it, make some cosmetic
changes.
2024-08-22 08:45:33 -10:00
Tejun Heo
4834dec684 scx_rusty: Move stats structs to stats.rs and rename for consistency 2024-08-21 22:04:38 -10:00
Tejun Heo
5cf4212330 Revert "rusty: Integrate stats with the metrics framework"
This reverts commit 83373b1f4e in prepration
for converting to scx_stats.
2024-08-20 21:59:25 -10:00
Tejun Heo
516a7590db scx_rusty: Revert log_recorder conversion
scx_rusty will be converted to scx_stats in a similar fashin with
scx_layered. Undo log_recorder conversion in preparation.
2024-08-20 21:59:20 -10:00
Tejun Heo
1da249f063 scx_utils::topology: Always use NR_CPU_IDS and NR_CPUS_POSSIBLE
Always use the LazyLock versions and drop the counterparts from Topology.
2024-08-20 21:57:56 -10:00
Tejun Heo
f7c193e528 scx_utils, scx_rusty: Minor updates to version handling
- Update scx_utils/build.rs so that 12 char SHA1 is generated instead of
  full one.

- Add --version to scx_rusty. Use custom one as we don't want to use the
  default cargo version one.
2024-08-20 21:03:05 -10:00
Tejun Heo
8f786be08f scx_rusty: cargo fmt 2024-08-20 21:03:05 -10:00
Tejun Heo
4440567949 scx_rusty: Update Cargo.lock 2024-08-20 21:03:05 -10:00
I Hsin Cheng
5d85937842 scx_rusty: Fix typo
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-08-16 22:03:59 +08:00
Tejun Heo
c16b48d7b2 scheds/rust: Include Cargo.lock in the repo
Binary packages are expected to include Cargo.lock in the repo so that the
produced binaries match across different builds.
2024-08-15 23:08:35 -10:00
I Hsin Cheng
15b40de408 scx_rusty: Fix logical error when filtering tasks
The logic of tasks filtering were moved from find_first_candidate() into
a vector filter operation in commit 1c3b563. However, it was forgotten
to transfer the logic with "NOT" since now .filter() will populate the
tasks we want, rather than .skip_while() which was throwing unwanted
tasks out.

That's why the logic here should be reverse so we won't take kworker or
migrated tasks into considerations.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-08-10 22:56:20 +08:00
Tejun Heo
63c4a0191f
Merge branch 'main' into topic/inlined-skeleton-members 2024-08-08 14:23:37 -10:00
Tejun Heo
cd6a4d72c7 Bump versions for 1.0.2 release 2024-08-08 14:10:16 -10:00
Tejun Heo
7c3ffe96e1 Unify crate dependency versions
Different sub-projects are using different versions for the same crates.
Synchronize them to the latest.
2024-08-08 13:26:47 -10:00
David Vernet
5401876430
Revert "rusty: Rework deadline as a signed sum" 2024-07-25 14:50:45 -05:00
David Vernet
09536aa15d
Merge pull request #309 from sched-ext/rusty_improved_dl
rusty: Rework deadline as a signed sum
2024-07-25 13:44:54 -05:00
David Vernet
c1ad602ce5
rusty: Transfer latency priority between CPU-intensive and interactive tasks
In some scenarios, a CPU-intensive task may be on the critical path for
interactive workloads. For example, you may have a game with CPU-intensive
tasks that are crunching the logic for the game, and that's required for the
game to proceed without being choppy.

To support such workflows, this change adds logic to allow a non-interactive
task to inherit the lower (i.e. stronger) latency priority of another task if
it wakes or is woken by that task.

Signed-off-by: David Vernet <void@manifault.com>
2024-07-25 11:55:40 -05:00
David Vernet
933ea9baa1
rusty: Rework deadline as a signed sum
Currently, a task's deadline is computed as its vtime + a scaled function of
its average runtime (with its deadline being scaled down if it's more
interactive). This makes sense intuitively, as we do want an interactive task
to have an earlier deadline, but it also has some flaws.

For one thing, we're currently ignoring duty cycle when determining a task's
deadline. This has a few implications. Firstly, because we reward tasks with
higher waker and blocked frequencies due to considering them to be part of a
work chain, we implicitly penalize tasks that rarely ever use the CPU because
those frequencies are low. While those tasks are likely not part of a work
chain, they also should get an interactivity boost just by pure virtue of not
using the CPU very often. This should in theory be addressed by vruntime, but
because we cap the amount of vtime that a task can accumulate to one slice, it
may not be adequately reflected after a task runs for the first time.

Another problem is that we're minimizing a task's deadline if it's interactive,
but we're also not really penalizing a task that's a super CPU hog by
increasing its deadline. We sort of do a bit by applying a higher niceness
which gives it a higher deadline for a lower weight, but its somewhat minimal
considering that we're using niceness, and that the best an interactive task
can do is minimize its deadline to near zero relative to its vtime.

What we really want to do is "negatively" scale an interactive task's deadline
with the same magnitude as we "positively" scale a CPU-hogging task's deadline.
To do this, we make two major changes to how we compute deadline:

1. Instead of using niceness, we now instead use our own straightforward
   scaling factor. This was chosen arbitrarily to be a scaling by 1000, but we
   can and should improve this in the future.

2. We now create a _signed_ linear latency priority factor as a sum of the
   three following inputs:
   - Work-chain factor (log_2 of product of blocked freq and waker freq)
   - Inverse duty cycle factor (log_2 of the inverse of a task's duty cycle --
     higher duty cycle means lower factor)
   - Average runtime factor (Higher avg runtime means higher average runtime
     factor)

We then compute the latency priority as:

	lat_prio := Average runtime factor - (work-chain factor + duty cycle factor)

This gives us a signed value that can be negative. With this, we can compute a
non-negative weight value by calculating a weight from the absolute value of
lat_prio, and use this to scale slice_ns. If lat_prio is negative we calculate
a task's deadline as its vtime MINUS its scaled slice_ns, and if it's positive,
it's the task's vtime PLUS scaled slice_ns.

This ends up working well because you get a higher weight both for highly
interactive tasks, and highly CPU-hogging / non-interactive tasks, which lets
you scale a task's deadline "more negatively" for interactive tasks, and "more
positively" for the CPU hogs.

With this change, we get a significant improvement in FPS. On a 7950X, if I run
the following workload:

	$ stress-ng -c $((8 * $(nproc)))

1. I get 60 FPS when playing Stellaris (while time is progressing at max
   speed), whereas EEVDF gets 6-7 FPS.

2. I get ~15-40 FPS while playing Civ6, whereas EEVDF seems to get < 1 FPS. The
   Civ6 benchmark doesn't even start after over 4 minutes in the initial frame
   with EEVDF, but gets us 13s / turn with rusty.

3. It seems that EEVDF has improved with Terraria in v6.9. It was able to
   maintain ~30-55 FPS, as opposed to the ~5-10FPS we've seen in the past.
   rusty is still able to maintain a solid 60-62FPS consistently with no
   problem, however.
2024-07-25 11:55:03 -05:00
Daniel Müller
98af514972 scx_rusty: Simplify LoadBalancer::populate_tasks_by_load()
Simplify LoadBalancer::populate_tasks_by_load() by cutting out the
heap allocation bits, by moving mutable accesses in front of immutable
ones. Because multiple immutable accesses (between bss and rodata) do
not conflict, we don't need the intermediate PID storage.

Signed-off-by: Daniel Müller <deso@posteo.net>
2024-07-23 13:59:26 -07:00
I Hsin Cheng
2525b94af4 scx_rusty: Remove unused variable
Remove unused variable "has_preferred_dom".

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-17 20:30:17 +08:00
I Hsin Cheng
bf2f0fbf35 scx_rusty: Remove skip_while in find_first_candidate
Followed commit 1c3b563, move the checking of task.migrated.get() into
the vector filter. In this way, we can remove the skip_while() call in
find_first_candidate().

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-17 20:27:12 +08:00
Daniel Müller
565aec3662 rust: Update libbpf-rs & libbpf-cargo to 0.24
Update libbpf-rs & libbpf-cargo to 0.24. Among other things, generated
skeletons now contain directly accessible map and program objects, no
longer necessitating the use of accessor methods. As a result, the risk
for mutability conflicts is reduced greatly.

Signed-off-by: Daniel Müller <deso@posteo.net>
2024-07-16 11:48:52 -07:00
Daniel Hodges
27122a8a00 scx_rusty: refactor mempolicy handling bpf code and load balancing
This change refactors some of the helper methods for getting the
preferred node for tasks using mempolicy. The load balancing logic in
try_find_move_task is updated to allow for a filter, which is used to
filter for tasks with a preferred mempolicy.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-16 09:40:00 -07:00
Daniel Hodges
43a263aa75 scx_rusty: Use preferred node mask with balancer
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-16 08:11:19 -07:00
Daniel Hodges
bab6e9523c scx_rusty: Add mempolicy checks to rusty
This change makes scx_rusty mempolicy aware. When a process uses
set_mempolicy it can change NUMA memory preferences and cause
performance issues when tasks are scheduled on remote NUMA nodes. This
change modifies task_pick_domain to use the new helper method that
returns the preferred node id.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
2024-07-16 08:11:19 -07:00
I Hsin Cheng
1c3b563caf scx_rusty: Pre-check task domain mask with pull domain mask
Instead of performing domain mask checking inside
"find_first_candidate()" every time, check whether the tasks within push
domain are abled to run on pull domain by performing the mask check at
vector generation stage.

This way can also avoid repeated computation generated by the same
(task, pull_dom) pair as they'll try to check whether the pull domain is
in the task domain mask.

Also since whether a task is a kworker won't change in time, we can
perform the check earlier and put it in the filter, too.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-07-16 21:48:06 +08:00
Tejun Heo
51334b5c4d Bump versions for 1.0.1 release 2024-07-15 13:21:52 -10:00
Tejun Heo
761ec142ce Bump most versions to 1.0.0
sched_ext is about to be merged upstream. There are some compatibility
breaking changes and we're making the current sched_ext/for-6.11
1edab907b57d ("sched_ext/scx_qmap: Pick idle CPU for direct dispatch on
!wakeup enqueues") the baseline.

Tag everything except scx_mitosis as 1.0.0. As scx_mitosis is still in early
development and is currently temporarily disabled, only the patchlevel is
bumped.
2024-07-12 11:34:14 -10:00
Andrea Righi
d76551bbd3 scx_rusty: fix stats map initialization
The stats map in scx_rusty is a BPF_MAP_TYPE_PERCPU_ARRAY, with its size
determined by num_possible_cpus(). Initializing it with nr_cpu_ids() can
result in errors such as:

 Error: Failed to zero stat

 Caused by:
     number of values 6 != number of cpus 8

Fix by using num_possible_cpus() to initialize it.

Fixes: 263e02f6 ("rusty: Use nr_cpu_ids instead of nr_cpus_possible")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-30 17:37:14 +02:00
Andrea Righi
cf4883fbf8 meson: introduce serialize build option
With commit 5d20f89a ("scheds-rust: build rust schedulers in sequence"),
schedulers are now built serially one after the other to prevent meson
and cargo from forking NxN parallel tasks.

However, this change has made building a single scheduler much more
cumbersome, due to the chain of dependencies.

For example, building scx_rusty using the specific meson target would
still result in all schedulers being built, because they all depend on
each other.

To address this issue, introduce the new meson build option
`serialize=true|false` (default is false).

This option allows to disable the schedulers' build chain, restoring the
old behavior.

With this option enabled, it is now possible to build just a single
scheduler, parallelizing the cargo build properly, without triggering
the build of the others. Example:

  $ meson setup build -Dbuildtype=release -Dserialize=false
  $ meson compile -C build scx_rusty

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
2024-06-28 10:17:37 +02:00
David Vernet
fe3ce64a9b
Revert "scx_rusty: Refactor ridx assignment in populate_tasks_by_load" 2024-06-26 17:35:22 -04:00
David Vernet
d42bae4fcf
rusty: Print build ID when rusty is loaded
When someone is testing schedulers, we often have to ask what version
the scheduler is running as. Now that we can access the build ID from
rust schedulers, let's update scx_rusty to print the build ID when rusty
first starts running.

This results in output such as the following:

```
[void@maniforge scx]$ rusty
19:04:26 [INFO] Running scx_rusty (build ID: 0.8.1-g2043d2537f37c8d75753bb65eb75bca965067564 x86_64-unknown-linux-gnu/debug)
19:04:26 [INFO] NUMA[00] mask= 0b11111111111111111111111111111111
19:04:26 [INFO]   DOM[00] mask= 0b00000000111111110000000011111111
19:04:26 [INFO]   DOM[01] mask= 0b11111111000000001111111100000000
19:04:26 [INFO] Rusty scheduler started!
```

Signed-off-by: David Vernet <void@manifault.com>
2024-06-25 11:44:46 -05:00
David Vernet
9d9ece11aa
Merge pull request #384 from jfernandez/log-recorder
scx_utils: Add log_recorder module for metrics-rs
2024-06-25 11:43:37 -05:00
Jose Fernandez
e5984ed016
scx_utils: Add log_recorder module for metrics-rs
This change adds a new module to the scx_utils crate that provides a
log recorder for metrics-rs. The log recorder will log all metrics to
the console at a configurable interval in an easy to read format. Each
metric type will be displayed in a separate section. Indentation will
be used to show the hierarchy of the metrics. This results in a more
verbose output, but it is easier to read and understand.

scx_rusty was updated to use the log recorder and all explicit metric
logging was removed.

Counters will show the total count and the rate of change per second.
Counters with an additional label, like `type` in
`dispatched_tasks_total` in rusty, will show the count, rate, and
percentage of the total count.

Counters:
  dispatched_tasks_total: 65559 [1344.8/s]
    prev_idle: 44963 (68.6%) [966.5/s]
    wsync_prev_idle: 15696 (23.9%) [317.3/s]
    direct_dispatch: 2833 (4.3%) [35.3/s]
    dsq: 1804 (2.8%) [21.3/s]
    wsync: 262 (0.4%) [4.3/s]
    direct_greedy: 1 (0.0%) [0.0/s]
    pinned: 0 (0.0%) [0.0/s]
    greedy_idle: 0 (0.0%) [0.0/s]
    greedy_xnuma: 0 (0.0%) [0.0/s]
    direct_greedy_far: 0 (0.0%) [0.0/s]
    greedy_local: 0 (0.0%) [0.0/s]
  dl_clamped_total: 1290 [20.3/s]
  dl_preset_total: 514 [1.0/s]
  kick_greedy_total: 6 [0.3/s]
  lb_data_errors_total: 0 [0.0/s]
  load_balance_total: 0 [0.0/s]
  repatriate_total: 0 [0.0/s]
  task_errors_total: 0 [0.0/s]

Gauges will show the last set value:

Gauges:
  slice_length_us: 20000.00

Histograms will show the average, min, and max. The histogram will be
reset after each log interval to avoid memory leaks, since the data
structure that holds the samples is unbounded.

Histograms:
  cpu_busy_pct: avg=1.66 min=1.16 max=2.16
  load_avg node=0: avg=0.31 min=0.23 max=0.39
  load_avg node=0 dom=0: avg=0.31 min=0.23 max=0.39
  processing_duration_us: avg=297.50 min=296.00 max=299.00

Signed-off-by: Jose Fernandez <josef@netflix.com>
2024-06-24 18:45:02 -06:00
David Vernet
8059acb634
Merge pull request #381 from vax-r/rusty_dom_load_status_check
scx_rusty: Pull domain status check
2024-06-24 17:54:54 -05:00
I Hsin Cheng
eab234a74f scx_rusty: Refactor ridx assignment in populate_tasks_by_load
Origin assignment of the variable ridx is equivalent to comparing
between "ridx" and "wids - MAX_PIDS". Using u64 max library helper
function to perform the comparison and provide better readability.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-23 21:58:51 +08:00
I Hsin Cheng
84b9ac4dce scx_rusty: Pull domain status check
Check whether the BalanceState of pull_dom.load inside function
try_find_move_task is actually the variant NeedsPull. It'll perform task
migration in abit more conservative manner when the system is under high
loading situation.

Experiments are performed when the system is compiling linux kernel and
undergoing a large amount of I/O operation at the same time using fio.

The result showns that before the modification, there're 12,6617 times
of task migrations system wide. After the modification, there're 11,5419
times of task migrations system wide.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-23 21:38:23 +08:00
David Vernet
5038f54701
Merge pull request #377 from jfernandez/metrics-rs
rusty: Integrate stats with the metrics framework
2024-06-21 15:23:20 -05:00
David Vernet
263e02f644
rusty: Use nr_cpu_ids instead of nr_cpus_possible
In scx_rusty, we're currently using topo.nr_cpus_possible() to determine
how many possible CPU IDs we could have on the system. scx_rusty already
accounts for offlined CPUs, so to properly support systems whose
disabled CPUs may be in the middle of the range of possible CPU IDs,
let's instead use topo.nr_cpu_ids().

Signed-off-by: David Vernet <void@manifault.com>
2024-06-21 12:57:19 -05:00
Jose Fernandez
83373b1f4e
rusty: Integrate stats with the metrics framework
We need a layer of indirection between the stats collection and their
output destinations. Currently, stats are only printed to stdout. Our
goal is to integrate with various telemetry systems such as Prometheus,
StatsD, and custom metric backends like those used by Meta and Netflix.
Importantly, adding a new backend should not require changes to the
existing stats code.

This patch introduces the `metrics` [1] crate, which provides a
framework for defining metrics and publishing them to different
backends.

The initial implementation includes the `dispatched_tasks_count`
metric, tagged with `type`. This metric increments every time a task is
dispatched, emitting the raw count instead of a percentage. A monotonic
counter is the most suitable metric type for this use case, as
percentages can be calculated at query time if needed. Existing logged
metrics continue to print percentages and remain unchanged.

A new flag, `--enable-prometheus`, has been added. When enabled, it
starts a Prometheus endpoint on port 9000 (default is false). This
endpoint allows metrics to be charted in Prometheus or Grafana
dashboards.

Future changes will migrate additional stats to this framework and add
support for other backends.

[1] https://metrics.rs/

Signed-off-by: Jose Fernandez <josef@netflix.com>
2024-06-21 10:18:44 -06:00
I Hsin Cheng
94e3616c02 scx_rusty: Refactor lookup operation for new_domc in task_set_domain
Modify the execution sequence before lookup operation for new_domc. If
new_dom_id == NO_DOM_FOUND, lookup operation for new_domc is definitely
going to fail so we don't have to wait until we found that new_domc is
NULL, clearing of cpumask and return operation should be done directly
in that case.

Plus we should avoid using try_lookup_dom_ctx outside the context of
lookup_dom_ctx, as it can keep the interface's consistency.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
2024-06-18 12:58:17 +08:00
David Vernet
0184444285
Merge pull request #366 from sched-ext/task_set_domain_global
rusty: Make dom_xfer_task() a global prog
2024-06-17 14:43:45 -05:00
David Vernet
dfe0ffb312
Merge pull request #347 from sched-ext/rusty_cleanup
rusty: Clean up some logic in rusty
2024-06-17 14:26:53 -05:00
David Vernet
7985ee556e
rusty: Clean up dispatch logic
The rusty dispatch logic is a bit unnecessarily convoluted. Let's clean it up
so that we're just comparing dom ids rather than iterating over arrays nested
inside of pcpu context.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-17 14:24:30 -05:00
David Vernet
87aa86845d
rusty: Refactor + slightly improve wake_sync
Right now, the SCX_WAKE_SYNC logic in rusty is very primitive. We only check to
see if the waker CPU's runqueue is empty, and then migrate the wakee there if
so. We'll want to expand this to be more thorough, such as:

- Checking to see if prev_cpu and waker_cpu share the same LLC when determining
  where to migrate
- Check for whether SCX_WAKE_SYNC migration helps load imbalance between cores
- ...

Right now all of that code is just a big blob in the middle of
rusty_select_cpu(). Let's pull it into its own function to improve readability,
and also add some logic to stay on prev_cpu if it shares an LLC with the waker.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-17 14:24:29 -05:00
David Vernet
fed66fa571
rusty: Make dom_xfer_task() a global prog
It seems that task_set_domain() is nearly at the point where it can
cause the verifier to get confused and think that it's exceeding the
number of available instructions per program. I've seen this a number of
times when making small changes to task_set_domain(), and it's once
again happened @vax-r (I-Hsin Cheng) made a small cleanup change to
rusty in https://github.com/sched-ext/scx/pull/362.

To avoid this, let's just make dom_xfer_task() a separate global program
so that the verifier doens't have to worry about branch pruning, etc
depending on what the caller does. This should hopefully make
task_set_domain() (and its callers) much less brittle.

Signed-off-by: David Vernet <void@manifault.com>
2024-06-17 14:22:26 -05:00
Tejun Heo
7c9aedaefe compat: Drop __COMPAT_scx_bpf_switch_all()
In preparation of upstreaming, let's set the min version requirement at the
released v6.9 kernels. Drop __COMPAT_scx_bpf_switch_call(). The open helper
macros now check the existence of SCX_OPS_SWITCH_PARTIAL and abort if not.
2024-06-15 20:03:37 -10:00