mirror of
https://github.com/sched-ext/scx.git
synced 2024-12-12 11:37:18 +00:00
0048f8dd38
Add some info on `perf` to the developer guide and link from the main readme. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
# Developer Guide
## eBPF
The scheduling logic for sched_ext schedulers is written in eBPF (BPF). For
high-level documentation, refer to the kernel docs and the eBPF docs.

- [kernel documentation](https://docs.kernel.org/bpf/index.html)
- [eBPF docs](https://ebpf-docs.dylanreimerink.nl/)

When working on schedulers, the following documentation is particularly useful,
because schedulers use a combination of BPF cpumasks, helper functions, kfuncs,
and maps to implement scheduling logic.

- [BPF maps](https://docs.kernel.org/bpf/maps.html)
- [BPF helper functions](https://man7.org/linux/man-pages/man7/bpf-helpers.7.html)
- [kfuncs](https://docs.kernel.org/bpf/kfuncs.html)
- [BPF cpumasks](https://docs.kernel.org/bpf/cpumasks.html)

The [kernel BPF tests](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf)
are also a useful source of examples of BPF functionality.

## Scheduling
The [kernel scheduling docs](https://docs.kernel.org/scheduler/index.html)
provide a high-level overview of the existing scheduler subsystem. The kernel
docs cover topics such as deadline scheduling, realtime scheduling, and the
interaction of schedulers with other system resources.

A scheduler that scales beyond a single core eventually needs to implement a
load balancing algorithm, and calculating the load between scheduling domains
is a difficult problem. sched_ext has a common crate for calculating weights
between scheduling domains. See the `infeasible` crate in `rust/scx_utils/src`
for the implementation.

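To make the load calculation problem concrete, the following sketch sums task weights per scheduling domain and reports how far each domain is from the average. It is a deliberately naive model and not the algorithm used by the `infeasible` crate; the domain names, weights, and function names are hypothetical.

```python
from statistics import mean


def domain_loads(tasks):
    """Sum task weights per domain; `tasks` is a list of (domain, weight) pairs."""
    loads = {}
    for domain, weight in tasks:
        loads[domain] = loads.get(domain, 0) + weight
    return loads


def imbalance(loads):
    """Return each domain's load minus the average load across all domains."""
    avg = mean(loads.values())
    return {domain: load - avg for domain, load in loads.items()}


# Hypothetical workload: two domains, weights in arbitrary units.
tasks = [("dom0", 100), ("dom0", 100), ("dom0", 200), ("dom1", 100)]
loads = domain_loads(tasks)
print(loads)
print(imbalance(loads))
```

A positive value marks a domain that carries more than its share and is a candidate to push tasks away; a negative value marks a candidate to pull tasks. A real load balancer also has to account for task affinity, migration cost, and weights that cannot be satisfied on the available CPUs.
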
## Useful Tools
### `perf`

The Linux `perf` tool has a subcommand for profiling scheduling, `perf sched`.
The interface is text-driven, but it can provide various timeline views and
aggregations of scheduler events. The following is an example of using
`perf sched` to get a timeline histogram with additional scheduling metrics.

```
$ perf sched record
$ perf sched timehist -Vw --state
time cpu 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef0 task name wait time sch delay run time state
[tid/pid] (msec) (msec) (msec)
--------------- ------ --------------------------------------------------------------------------------- ------------------------------ --------- --------- --------- -----
960264.500659 [0000] perf[1635250] awakened: migration/0[19]
960264.500680 [0000] s perf[1635250] 0.000 0.000 0.000 D
960264.500683 [0000] migration/0[19] awakened: perf[1635250]
960264.500809 [0001] perf[1635250] awakened: migration/1[24]
960264.500814 [0001] s perf[1635250] 0.000 0.000 0.000 D
960264.500816 [0001] migration/1[24] awakened: perf[1635250]
960264.500824 [0001] s migration/1[24] 0.000 0.005 0.009 S
960264.502403 [0001] i <idle> 0.000 0.000 1.579 I
960264.502418 [0001] s HTTPSrvExec39[3403538/3403436] 0.000 0.000 0.014 S
960264.506002 [0001] i <idle> 0.014 0.000 3.583 I
960264.506045 [0001] s CfgrIO0[13302/13094] 0.000 0.000 0.043 S
960264.506763 [0001] swapper awakened: chef-client[1629157]
960264.506767 [0001] i <idle> 0.043 0.000 0.721 I
960264.506784 [0001] s chef-client[1629157] 0.000 0.003 0.017 S
960264.507622 [0001] i <idle> 0.017 0.000 0.837 I
960264.507806 [0001] mcrcfg-fci[1635235/1635080] awakened: GlobalCPUThread[1635186/1635080
960264.507937 [0001] mcrcfg-fci[1635235/1635080] awakened: FalconClientThr[1635187/1635080
960264.507996 [0001] mcrcfg-fci[1635235/1635080] awakened: CfgrIO0[1635185/1635080]
960264.508007 [0001] s mcrcfg-fci[1635235/1635080] 0.000 0.000 0.384 S
960264.508079 [0001] i <idle> 0.384 0.000 0.071 I
960264.508100 [0001] ThriftSrv.N2104[1635036/2683498 awakened: IOThreadPool0[2685229/2683498]
960264.508108 [0001] s ThriftSrv.N2104[1635036/2683498 0.000 0.000 0.029 S
960264.508638 [0001] i <idle> 0.029 0.000 0.529 I
960264.508655 [0001] ThriftSrv.N2104[1635036/2683498 awakened: ThriftIO70[2683693/2683498]
```

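Because the `timehist` output is plain text, it can be post-processed with a small script. The sketch below sums the run time column per task, using a few lines copied from the output above. The regular expression and the function name are assumptions based on the layout shown here, not a stable `perf` interface; task names that contain spaces would need extra handling.

```python
import re
from collections import defaultdict

# Matches summary lines: timestamp, [cpu], optional CPU-visual marker and task,
# then wait time, sch delay, run time (all msec) and the task state.
SUMMARY = re.compile(
    r"^\s*\d+\.\d+\s+\[(?P<cpu>\d+)\]\s+(?P<rest>\S.*?)\s+"
    r"(?P<wait>\d+\.\d+)\s+(?P<delay>\d+\.\d+)\s+(?P<run>\d+\.\d+)\s+\S+\s*$"
)


def run_time_by_task(lines):
    """Return total run time in msec per task name, skipping wakeup lines."""
    totals = defaultdict(float)
    for line in lines:
        match = SUMMARY.match(line)
        if match is None:
            continue  # header, separator, or "awakened:" line
        # The task is the last token before the numbers; drop the [tid/pid] suffix.
        task = match.group("rest").split()[-1].split("[")[0]
        totals[task] += float(match.group("run"))
    return dict(totals)


sample = [
    "960264.500816 [0001] migration/1[24] awakened: perf[1635250]",
    "960264.500824 [0001] s migration/1[24] 0.000 0.005 0.009 S",
    "960264.502403 [0001] i <idle> 0.000 0.000 1.579 I",
    "960264.502418 [0001] s HTTPSrvExec39[3403538/3403436] 0.000 0.000 0.014 S",
    "960264.506002 [0001] i <idle> 0.014 0.000 3.583 I",
]

totals = run_time_by_task(sample)
for task, msec in sorted(totals.items()):
    print(f"{task}: {msec:.3f} msec")
```

The same approach works for the wait time and sch delay columns by reading the `wait` or `delay` group instead of `run`.
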
### `bpftool`
[`bpftool`](https://github.com/libbpf/bpftool) contains many utilities for
interacting with the BPF subsystem and BPF programs. If you need to know
which BPF programs, maps, or iterators are loaded on a system, `bpftool` can
provide this information.

Listing BPF maps:
```
$ sudo bpftool map list
11: hash_of_maps name cgroup_hash flags 0x0
key 8B value 4B max_entries 2048 memlock 172992B
pids systemd(1)
```

Listing `struct_ops`:
```
$ sudo bpftool struct_ops list
21381: layered sched_ext_ops
```

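One practical use of the `struct_ops` listing is checking which sched_ext scheduler is currently loaded. The sketch below parses text in the format shown above; the function names are hypothetical and the three-column layout is assumed from this example. For scripting, `bpftool` also has a `--json` option, which is usually a more robust input than the plain text output.

```python
def parse_struct_ops(output):
    """Parse lines of the form '<id>: <name> <type>' into dictionaries."""
    entries = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[0].endswith(":"):
            continue  # not an entry line in the assumed format
        ident = parts[0][:-1]
        if not ident.isdigit():
            continue
        entries.append({"id": int(ident), "name": parts[1], "type": parts[2]})
    return entries


def sched_ext_schedulers(output):
    """Return the names of loaded struct_ops entries of type sched_ext_ops."""
    return [e["name"] for e in parse_struct_ops(output) if e["type"] == "sched_ext_ops"]


# Text copied from the `bpftool struct_ops list` example above.
listing = "21381: layered sched_ext_ops\n"
print(sched_ext_schedulers(listing))
```
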
### `retsnoop`
[`retsnoop`](https://github.com/anakryiko/retsnoop) is a BPF-based tool for
tracing the Linux kernel. It is very useful for understanding the flow of
kernel functions, which helps when BPF verification issues are encountered. The
following example shows how the verifier's `do_check_common` function can be
traced.

```
$ sudo retsnoop -e 'do_check*' -a ':kernel/bpf/*.c' -T
07:55:28.049718 -> 07:55:28.049797 TID/PID 270611/270611 (bpftool/bpftool):

FUNCTION CALL TRACE RESULT DURATION
--------------------------------- --------- --------
→ do_check_common
    → init_func_state
        ↔ tnum_const [0] 2.084us
    ← init_func_state [void] 6.648us
    ↔ tnum_const [0] 2.662us
    → do_check
        ↔ mark_reg_unknown [void] 2.251us
        ↔ tnum_const [0] 2.421us
        ↔ reg_bounds_sanity_check [0] 2.049us
        ↔ check_reference_leak [0] 2.014us
        → check_return_code
            ↔ mark_reg_read [0] 2.212us
        ← check_return_code [0] 6.531us
        ↔ pop_stack [-ENOENT] 2.099us
    ← do_check [0] 34.822us
    ↔ pop_stack [-ENOENT] 2.167us
← do_check_common [0] 76.413us

entry_SYSCALL_64_after_hwframe+0x4b (entry_SYSCALL_64 @ arch/x86/entry/entry_64.S:130:0)
do_syscall_64+0x6a (arch/x86/entry/common.c:0:0)
__x64_sys_bpf+0x18 (kernel/bpf/syscall.c:5792:1)
. __se_sys_bpf (kernel/bpf/syscall.c:5792:1)
. __do_sys_bpf (kernel/bpf/syscall.c:5794:9)
__sys_bpf+0x27e (kernel/bpf/syscall.c:0:9)
bpf_prog_load+0x593 (kernel/bpf/syscall.c:2908:6)
bpf_check+0x1066 (kernel/bpf/verifier.c:21608:8)
. do_check_main (kernel/bpf/verifier.c:20938:8)
76us [0] do_check_common+0x552 (kernel/bpf/verifier.c:20856:9)
! 2us [-ENOENT] pop_stack
```

### `bpftrace`
[`bpftrace`](https://github.com/bpftrace/bpftrace) is a high-level tracing
language for BPF. When working with sched_ext, `bpftrace` programs can be used
to understand scheduler run queue latency as well as other scheduler internals.
See the `scripts` dir for examples.

### `stress-ng`
[`stress-ng`](https://github.com/ColinIanKing/stress-ng) can be used to
generate synthetic load on a system. It can generate different types of load,
including CPU-bound, fork-heavy, NUMA, cache-heavy, and more.

### `veristat`
[`veristat`](https://github.com/libbpf/veristat) is a tool that provides
statistics from the BPF verifier for BPF programs. It can also be used to
compare verification stats across runs. This is useful when trying to optimize
BPF programs for their instruction count.

### `turbostat`
[`turbostat`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/power/x86/turbostat)
is a tool for inspecting CPU frequency as well as power utilization. When
optimizing schedulers for energy performance, `turbostat` can be used to
understand the energy required per operation.

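The energy-per-operation figure is simple arithmetic once an average power reading and an operation count are available. The sketch below assumes the average package power has been read from `turbostat` (for example from its `PkgWatt` column) over a benchmark of known duration; the function name and all numbers are hypothetical.

```python
def joules_per_op(avg_watts, seconds, operations):
    """Energy per operation: average power (W) times duration (s), divided by ops."""
    if operations <= 0:
        raise ValueError("operations must be positive")
    return avg_watts * seconds / operations


# Hypothetical benchmark: 45 W average package power for 10 s, 1.5M operations.
energy = joules_per_op(avg_watts=45.0, seconds=10.0, operations=1_500_000)
print(f"{energy * 1e6:.1f} microjoules per operation")
```

Running the same benchmark under two schedulers and comparing the resulting values gives a rough view of which one does the same work for less energy, provided the workload and measurement window are identical.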