# Developer Guide ## eBPF The scheduling logic for sched_ext schedulers is written in eBPF (BPF). For high level documentation the kernel docs should be referenced. - [kernel documentation](https://docs.kernel.org/bpf/index.html) - [eBPF docs](https://ebpf-docs.dylanreimerink.nl/) - [rustdocs and sched_ext for-next docs](https://sched-ext.github.io/scx/) When working on schedulers the following documentation is rather useful as schedulers will use a combination of BPF cpumasks, helper functions, kfuncs and maps for scheduling logic. - [BPF maps](https://docs.kernel.org/bpf/maps.html) - [bpf helper functions](https://man7.org/linux/man-pages/man7/bpf-helpers.7.html) - [kfuncs](https://docs.kernel.org/bpf/kfuncs.html) - [BPF cpumasks](https://docs.kernel.org/bpf/cpumasks.html) The [kernel BPF tests](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf) are also a useful source of examples of BPF functionality. ## Scheduling The [kernel scheduling docs](https://docs.kernel.org/scheduler/index.html) provide a high level overview of the existing scheduler subsystem. The kernel docs cover various topics such as deadline scheduling, realtime scheduling and the interaction of schedulers with other system resources. When schedulers are written to scale beyond more than a single core eventually the scheduler needs to implement a load balancing algorithm. Calculating the load between scheduling domains becomes a difficult problem. sched_ext has a common crate for calculating weights between scheduling domains. See the `infeasible` crate in `rust/scx_utils/src` for the implementation. ## Rust We use `cargo fmt` to ensure consistency in our Rust code. This runs on PRs in the CI and will fail with a patch if your code doesn't match. We currently need a nightly version of Rust to format so have pinned this for consistency. To run locally (with rustup) run: $ rustup install nightly-2024-09-10 $ cargo +nightly-2024-09-10 fmt ## Useful Tools ### [Perfetto](https://perfetto.dev/) [Perfetto](https://perfetto.dev/) is a profiling and trace visualization platform. It can be used to view scheduling data, which is useful for understanding scheduling decisions. The [`sched_ftrace.py`](scripts/sched_ftrace.py) script can be used to generate a ftrace compatible with Perfetto. ``` $ sudo ./scripts/sched_ftrace.py > sched.ftrace ``` The output of the script can then be loaded into the perfetto UI: ![perfetto](https://github.com/user-attachments/assets/23e18bd4-8016-40e7-8b49-d2be8ef62f1b) ### perf The linux `perf` tool has a subcommand for profiling scheduling `perf sched`. The interface is text driven, but is able to provide various timeline views and aggregations of scheduler events. The following is an example of using `perf sched` to get a timeline histogram with additional scheduling metrics. ``` $ perf sched record $ perf sched timehist -Vw --state time cpu 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef0 task name wait time sch delay run time state [tid/pid] (msec) (msec) (msec) --------------- ------ --------------------------------------------------------------------------------- ------------------------------ --------- --------- --------- ----- 960264.500659 [0000] perf[1635250] awakened: migration/0[19] 960264.500680 [0000] s perf[1635250] 0.000 0.000 0.000 D 960264.500683 [0000] migration/0[19] awakened: perf[1635250] 960264.500809 [0001] perf[1635250] awakened: migration/1[24] 960264.500814 [0001] s perf[1635250] 0.000 0.000 0.000 D 960264.500816 [0001] migration/1[24] awakened: perf[1635250] 960264.500824 [0001] s migration/1[24] 0.000 0.005 0.009 S 960264.502403 [0001] i 0.000 0.000 1.579 I 960264.502418 [0001] s HTTPSrvExec39[3403538/3403436] 0.000 0.000 0.014 S 960264.506002 [0001] i 0.014 0.000 3.583 I 960264.506045 [0001] s CfgrIO0[13302/13094] 0.000 0.000 0.043 S 960264.506763 [0001] swapper awakened: chef-client[1629157] 960264.506767 [0001] i 0.043 0.000 0.721 I 960264.506784 [0001] s chef-client[1629157] 0.000 0.003 0.017 S 960264.507622 [0001] i 0.017 0.000 0.837 I 960264.507806 [0001] mcrcfg-fci[1635235/1635080] awakened: GlobalCPUThread[1635186/1635080 960264.507937 [0001] mcrcfg-fci[1635235/1635080] awakened: FalconClientThr[1635187/1635080 960264.507996 [0001] mcrcfg-fci[1635235/1635080] awakened: CfgrIO0[1635185/1635080] 960264.508007 [0001] s mcrcfg-fci[1635235/1635080] 0.000 0.000 0.384 S 960264.508079 [0001] i 0.384 0.000 0.071 I 960264.508100 [0001] ThriftSrv.N2104[1635036/2683498 awakened: IOThreadPool0[2685229/2683498] 960264.508108 [0001] s ThriftSrv.N2104[1635036/2683498 0.000 0.000 0.029 S 960264.508638 [0001] i 0.029 0.000 0.529 I 960264.508655 [0001] ThriftSrv.N2104[1635036/2683498 awakened: ThriftIO70[2683693/2683498] ``` ### `bpftool` [`bpftool`](https://github.com/libbpf/bpftool) contains many utilities for interacting with the BPF subsystem and BPF programs. If you need to know what BPF programs, maps, iterators are loaded on a system `bpftool` will provide all this information. Listing BPF maps: ``` $ sudo bpftool map list 11: hash_of_maps name cgroup_hash flags 0x0 key 8B value 4B max_entries 2048 memlock 172992B pids systemd(1) ``` Listing `struct_ops`: ``` $ sudo bpftool struct_ops list 21381: layered sched_ext_ops ``` ### `retsnoop` [`retsnoop`](https://github.com/anakryiko/retsnoop) is a BPF tool for tracing linux. It is very useful if you are trying to understand the flow of kernel functions. This can be useful when BPF verification issues are encountered. The following example shows how the verifier `do_check_common` function can be traced. ``` $ sudo retsnoop -e 'do_check*' -a ':kernel/bpf/*.c' -T 07:55:28.049718 -> 07:55:28.049797 TID/PID 270611/270611 (bpftool/bpftool): FUNCTION CALL TRACE RESULT DURATION --------------------------------- --------- -------- → do_check_common → init_func_state ↔ tnum_const [0] 2.084us ← init_func_state [void] 6.648us ↔ tnum_const [0] 2.662us → do_check ↔ mark_reg_unknown [void] 2.251us ↔ tnum_const [0] 2.421us ↔ reg_bounds_sanity_check [0] 2.049us ↔ check_reference_leak [0] 2.014us → check_return_code ↔ mark_reg_read [0] 2.212us ← check_return_code [0] 6.531us ↔ pop_stack [-ENOENT] 2.099us ← do_check [0] 34.822us ↔ pop_stack [-ENOENT] 2.167us ← do_check_common [0] 76.413us entry_SYSCALL_64_after_hwframe+0x4b (entry_SYSCALL_64 @ arch/x86/entry/entry_64.S:130:0) do_syscall_64+0x6a (arch/x86/entry/common.c:0:0) __x64_sys_bpf+0x18 (kernel/bpf/syscall.c:5792:1) . __se_sys_bpf (kernel/bpf/syscall.c:5792:1) . __do_sys_bpf (kernel/bpf/syscall.c:5794:9) __sys_bpf+0x27e (kernel/bpf/syscall.c:0:9) bpf_prog_load+0x593 (kernel/bpf/syscall.c:2908:6) bpf_check+0x1066 (kernel/bpf/verifier.c:21608:8) . do_check_main (kernel/bpf/verifier.c:20938:8) 76us [0] do_check_common+0x552 (kernel/bpf/verifier.c:20856:9) ! 2us [-ENOENT] pop_stack ``` ### `bpftrace` [`bpftrace`](https://github.com/bpftrace/bpftrace) is a high level tracing language for BPF. When working with sched_ext `bpftrace` programs can be used for understanding scheduler run queue latency as other scheduler internals. See the `scripts` dir for examples. ### `stress-ng` For generating synthetic load on a system [`stress-ng`](https://github.com/ColinIanKing/stress-ng) can be used. `stress-ng` can generate different types of load on the system including cpu bound, fork heavy, NUMA, cache heavy and more. ### `veristat` [`veristat`](https://github.com/libbpf/veristat) is a tool to provide statics from the BPF verifier for BPF programs. It can also be used to compare verification stats across runs. This is useful when trying to optimize BPF programs for their instruction count. ### `turbostat` [`turbostat`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/power/x86/turbostat) is a tool for inspecting CPU frequency as well as power utilization. When optimizing schedulers for energy performance `turbostat` can be used to understand the energy required per operation.