scx/DEVELOPER_GUIDE.md
2024-10-02 22:29:20 -04:00

Developer Guide

eBPF

The scheduling logic for sched_ext schedulers is written in eBPF (BPF). For high-level documentation, refer to the kernel docs.

Schedulers typically combine BPF cpumasks, helper functions, kfuncs, and maps in their scheduling logic, so the kernel documentation for each of these is useful when working on schedulers.

The kernel BPF tests are also a useful source of examples of BPF functionality.

Scheduling

The kernel scheduling docs provide a high-level overview of the existing scheduler subsystem. They cover topics such as deadline scheduling, realtime scheduling, and the interaction of schedulers with other system resources.

A scheduler that scales beyond a single core eventually needs a load balancing algorithm, and calculating the load between scheduling domains is a difficult problem. sched_ext has a common crate for calculating weights between scheduling domains. See the infeasible module in rust/scx_utils/src for the implementation.
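To see why weights complicate load calculation, consider a naive proportional share: a task's entitlement is its weight divided by the total weight, multiplied by the number of CPUs. A single task can occupy at most one CPU at a time, so a sufficiently heavy task can be entitled to more capacity than it is able to use. The sketch below flags such tasks. It is a simplified illustration only: the function names are hypothetical, and this is not the algorithm implemented in scx_utils.

```rust
/// Entitled CPU capacity (in CPUs) for each task under a naive proportional
/// share: weight / total_weight * nr_cpus.
/// Hypothetical helper for illustration; not part of scx_utils.
fn entitled_cpus(weights: &[f64], nr_cpus: usize) -> Vec<f64> {
    let total: f64 = weights.iter().sum();
    if total <= 0.0 {
        return vec![0.0; weights.len()];
    }
    weights
        .iter()
        .map(|w| w / total * nr_cpus as f64)
        .collect()
}

/// Indices of tasks whose entitlement exceeds the one CPU that a single task
/// can actually occupy at any moment.
fn infeasible_tasks(weights: &[f64], nr_cpus: usize) -> Vec<usize> {
    entitled_cpus(weights, nr_cpus)
        .iter()
        .enumerate()
        .filter(|(_, share)| **share > 1.0)
        .map(|(idx, _)| idx)
        .collect()
}

fn main() {
    // One heavy task and two light tasks on a 2-CPU system (made-up weights).
    let weights = [10000.0, 100.0, 100.0];
    for (idx, share) in entitled_cpus(&weights, 2).iter().enumerate() {
        println!("task {idx}: entitled to {share:.3} CPUs");
    }
    println!("infeasible task indices: {:?}", infeasible_tasks(&weights, 2));
}
```

Any load sum that takes the heavy task's weight at face value overstates the load of the domain it runs in, which is one reason a load balancer cannot simply add up weights.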

Rust

We use cargo fmt to keep our Rust code consistent. CI runs it on PRs and fails with a patch if your code doesn't match. Formatting currently requires a nightly version of Rust, so we have pinned a specific version for consistency. To format locally (with rustup), run:

$ rustup install nightly-2024-09-10
$ cargo +nightly-2024-09-10 fmt

Useful Tools

perf

The Linux perf tool has a subcommand for profiling scheduling: perf sched. The interface is text-driven, but it provides various timeline views and aggregations of scheduler events. The following example uses perf sched to get a timeline histogram with additional scheduling metrics.

$ perf sched record
$ perf sched timehist -Vw --state
           time    cpu  0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef0  task name                       wait time  sch delay   run time  state
                                                                                                           [tid/pid]                          (msec)     (msec)     (msec)       
--------------- ------  ---------------------------------------------------------------------------------  ------------------------------  ---------  ---------  ---------  -----
  960264.500659 [0000]                                                                                     perf[1635250]                                                    awakened: migration/0[19]
  960264.500680 [0000]  s                                                                                  perf[1635250]                       0.000      0.000      0.000      D                                 
  960264.500683 [0000]                                                                                     migration/0[19]                                                  awakened: perf[1635250]
  960264.500809 [0001]                                                                                     perf[1635250]                                                    awakened: migration/1[24]
  960264.500814 [0001]   s                                                                                 perf[1635250]                       0.000      0.000      0.000      D                                 
  960264.500816 [0001]                                                                                     migration/1[24]                                                  awakened: perf[1635250]
  960264.500824 [0001]   s                                                                                 migration/1[24]                     0.000      0.005      0.009      S                                 
  960264.502403 [0001]   i                                                                                 <idle>                              0.000      0.000      1.579      I                                 
  960264.502418 [0001]   s                                                                                 HTTPSrvExec39[3403538/3403436]      0.000      0.000      0.014      S                                 
  960264.506002 [0001]   i                                                                                 <idle>                              0.014      0.000      3.583      I                                 
  960264.506045 [0001]   s                                                                                 CfgrIO0[13302/13094]                0.000      0.000      0.043      S                                 
  960264.506763 [0001]                                                                                     swapper                                                          awakened: chef-client[1629157]
  960264.506767 [0001]   i                                                                                 <idle>                              0.043      0.000      0.721      I                                 
  960264.506784 [0001]   s                                                                                 chef-client[1629157]                0.000      0.003      0.017      S                                 
  960264.507622 [0001]   i                                                                                 <idle>                              0.017      0.000      0.837      I                                 
  960264.507806 [0001]                                                                                     mcrcfg-fci[1635235/1635080]                                      awakened: GlobalCPUThread[1635186/1635080
  960264.507937 [0001]                                                                                     mcrcfg-fci[1635235/1635080]                                       awakened: FalconClientThr[1635187/1635080
  960264.507996 [0001]                                                                                     mcrcfg-fci[1635235/1635080]                                       awakened: CfgrIO0[1635185/1635080]
  960264.508007 [0001]   s                                                                                 mcrcfg-fci[1635235/1635080]          0.000      0.000      0.384      S                                  
  960264.508079 [0001]   i                                                                                 <idle>                               0.384      0.000      0.071      I                                  
  960264.508100 [0001]                                                                                     ThriftSrv.N2104[1635036/2683498                                   awakened: IOThreadPool0[2685229/2683498]
  960264.508108 [0001]   s                                                                                 ThriftSrv.N2104[1635036/2683498      0.000      0.000      0.029      S                                  
  960264.508638 [0001]   i                                                                                 <idle>                               0.029      0.000      0.529      I                                  
  960264.508655 [0001]                                                                                     ThriftSrv.N2104[1635036/2683498                                   awakened: ThriftIO70[2683693/2683498]
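
The timehist output is plain text, so it can be post-processed to answer questions such as which task consumed the most run time. The following is a minimal sketch that sums the run time column per task. The helper name is hypothetical, and the parsing assumes the column layout shown above and task names without spaces.

```rust
use std::collections::HashMap;

/// Sums the "run time" column (msec) per task from `perf sched timehist`
/// text output. Hypothetical helper; assumes the last four columns are
/// wait time, sch delay, run time and state, preceded by the task name.
fn run_time_by_task(output: &str) -> HashMap<String, f64> {
    let mut totals: HashMap<String, f64> = HashMap::new();
    for line in output.lines() {
        // Wakeup events carry no timing columns.
        if line.contains("awakened:") {
            continue;
        }
        let tok: Vec<&str> = line.split_whitespace().collect();
        // Need at least: task, wait, delay, run, state.
        if tok.len() < 5 {
            continue;
        }
        let n = tok.len();
        // Header and separator rows fail to parse and are skipped.
        let timings: Vec<f64> = tok[n - 4..n - 1]
            .iter()
            .filter_map(|t| t.parse().ok())
            .collect();
        if timings.len() != 3 {
            continue;
        }
        *totals.entry(tok[n - 5].to_string()).or_insert(0.0) += timings[2];
    }
    totals
}

fn main() {
    // A few rows in the same shape as the output above.
    let sample = "\
  960264.502403 [0001]   i   <idle>                           0.000   0.000   1.579   I
  960264.502418 [0001]   s   HTTPSrvExec39[3403538/3403436]   0.000   0.000   0.014   S
  960264.506002 [0001]   i   <idle>                           0.014   0.000   3.583   I
  960264.506763 [0001]       swapper                          awakened: chef-client[1629157]
";
    let mut totals: Vec<_> = run_time_by_task(sample).into_iter().collect();
    // Largest run time first.
    totals.sort_by(|a, b| b.1.total_cmp(&a.1));
    for (task, msec) in totals {
        println!("{task:<32} {msec:>10.3} ms");
    }
}
```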

bpftool

bpftool contains many utilities for interacting with the BPF subsystem and BPF programs. If you need to know which BPF programs, maps, or iterators are loaded on a system, bpftool can provide this information.

Listing BPF maps:

$ sudo bpftool map list
11: hash_of_maps  name cgroup_hash  flags 0x0
        key 8B  value 4B  max_entries 2048  memlock 172992B
        pids systemd(1)

Listing struct_ops:

$ sudo bpftool struct_ops list 
21381: layered         sched_ext_ops                   

retsnoop

retsnoop is a BPF-based tool for tracing the Linux kernel. It is very useful when you are trying to understand the flow of kernel functions, for example when you encounter BPF verification issues. The following example shows how the verifier's do_check_common function can be traced.

$ sudo retsnoop -e 'do_check*' -a ':kernel/bpf/*.c' -T
07:55:28.049718 -> 07:55:28.049797 TID/PID 270611/270611 (bpftool/bpftool):

FUNCTION CALL TRACE                 RESULT     DURATION
---------------------------------   ---------  --------
→ do_check_common                                      
    → init_func_state                                  
        ↔ tnum_const                [0]         2.084us
    ← init_func_state               [void]      6.648us
    ↔ tnum_const                    [0]         2.662us
    → do_check                                         
        ↔ mark_reg_unknown          [void]      2.251us
        ↔ tnum_const                [0]         2.421us
        ↔ reg_bounds_sanity_check   [0]         2.049us
        ↔ check_reference_leak      [0]         2.014us
        → check_return_code                            
            ↔ mark_reg_read         [0]         2.212us
        ← check_return_code         [0]         6.531us
        ↔ pop_stack                 [-ENOENT]   2.099us
    ← do_check                      [0]        34.822us
    ↔ pop_stack                     [-ENOENT]   2.167us
← do_check_common                   [0]        76.413us

                    entry_SYSCALL_64_after_hwframe+0x4b  (entry_SYSCALL_64 @ arch/x86/entry/entry_64.S:130:0)
                    do_syscall_64+0x6a                   (arch/x86/entry/common.c:0:0)                       
                    __x64_sys_bpf+0x18                   (kernel/bpf/syscall.c:5792:1)                       
                    . __se_sys_bpf                       (kernel/bpf/syscall.c:5792:1)                       
                    . __do_sys_bpf                       (kernel/bpf/syscall.c:5794:9)                       
                    __sys_bpf+0x27e                      (kernel/bpf/syscall.c:0:9)                          
                    bpf_prog_load+0x593                  (kernel/bpf/syscall.c:2908:6)                       
                    bpf_check+0x1066                     (kernel/bpf/verifier.c:21608:8)                     
                    . do_check_main                      (kernel/bpf/verifier.c:20938:8)                     
    76us [0]        do_check_common+0x552                (kernel/bpf/verifier.c:20856:9)                     
!    2us [-ENOENT]  pop_stack                                                                                

bpftrace

bpftrace is a high-level tracing language for BPF. When working with sched_ext, bpftrace programs can be used to understand scheduler run queue latency as well as other scheduler internals. See the scripts dir for examples.

stress-ng

stress-ng can be used to generate synthetic load on a system. It supports many types of load, including CPU-bound, fork-heavy, NUMA, cache-heavy, and more.

veristat

veristat is a tool that provides statistics from the BPF verifier for BPF programs. It can also be used to compare verification stats across runs. This is useful when trying to reduce the instruction count of BPF programs.

turbostat

turbostat is a tool for inspecting CPU frequency as well as power consumption. When optimizing schedulers for energy efficiency, turbostat can be used to understand the energy required per operation.
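
Energy per operation can be estimated as average power multiplied by elapsed time, divided by the number of operations completed. The sketch below shows the arithmetic. It assumes you have already obtained an average package power reading in watts (for example from turbostat's PkgWatt column, where the hardware reports it) and an operation count from your own benchmark. The function name and the numbers are illustrative only.

```rust
/// Energy per operation in joules: average power (W) * elapsed time (s) / ops.
/// Returns None when no operations completed.
/// Hypothetical helper for illustration.
fn joules_per_op(avg_watts: f64, elapsed_secs: f64, ops: u64) -> Option<f64> {
    if ops == 0 {
        return None;
    }
    Some(avg_watts * elapsed_secs / ops as f64)
}

fn main() {
    // Placeholder inputs, not real measurements.
    let (avg_watts, elapsed_secs, ops) = (50.0, 10.0, 1_000u64);
    match joules_per_op(avg_watts, elapsed_secs, ops) {
        Some(j) => println!("{j:.3} J per operation"),
        None => println!("no operations completed"),
    }
}
```

Comparing this figure between two schedulers under the same workload gives a rough view of which one does more work for the same energy.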