Commit eea5422546 ("libdrgn: make Linux kernel stack unwinding more
robust") overlooked that if the task is running in userspace, the stack
pointer in PRSTATUS obviously won't match the kernel stack pointer.
Let's bite the bullet and use the PID. If the race shows up in practice,
we can try to come up with another workaround.
Declaring a local vector or hash table and separately initializing it
with vector_init()/hash_table_init() is annoying. Add macros that can be
used as initializers.
This exposes several places where the C89 style of placing all
declarations at the beginning of a block is awkward. I adopted this
style from the Linux kernel, which uses C89 and thus requires this
style. I'm now convinced that it's usually nicer to declare variables
where they're used. So let's officially adopt the style of mixing
declarations and code (and ditch the blank line after declarations) and
update the functions touched by this change.
drgn has a couple of issues unwinding stack traces for kernel core
dumps:
1. It can't unwind the stack for the idle task (PID 0), which commonly
appears in core dumps.
2. It uses the PID in PRSTATUS, which is racy and can't actually be
trusted.
The solution for both of these is to look up the PRSTATUS note by CPU
instead of PID.
For the live kernel, drgn refuses to unwind the stack of tasks in the
"R" state. However, the "R" state is running *or runnable*, so in the
latter case, we can still unwind the stack. The solution for this is to
look at on_cpu for the task instead of the state.
drgn was originally my side project, but for awhile now it's also been
my work project. Update the copyright headers to reflect this, and add a
copyright header to various files that were missing it.
Now that we can walk page tables, we can use it in a memory reader that
reads kernel memory via the kernel page table. This means that we don't
need libkdumpfile for ELF vmcores anymore (although I'll keep the
functionality around until this code has been validated more).
There are a few big use cases for this in drgn:
* Helpers for accessing memory in the virtual address space of userspace
tasks.
* Removing the libkdumpfile dependency for vmcores.
* Handling gaps in the virtual address space of /proc/kcore (cf. #27).
I dragged my feet on implementing this because I thought it would be
more complicated, but the page table layout on x86-64 isn't too bad.
This commit implements page table walking using a page table iterator
abstraction. The first thing we'll add on top of this will be a helper
for reading memory from a virtual address space, but in the future it'd
also be possible to export the page table iterator directly.
Before Linux v4.11, /proc/kcore didn't have valid physical addresses, so
it's currently not possible to read from physical memory on old kernels.
However, if we can figure out the address of the direct mapping, then we
can determine the corresponding physical addresses for the segments and
add them.
We treat core dumps with all zero p_paddrs as not having valid physical
addresses. However, it is theoretically possible for a kernel core dump
to only have one segment which legitimately has a p_paddr of 0 (e.g., if
it only has a segment for the direct mapping, although note that this
isn't currently possible on x86, as Linux on x86 reserves PFN 0 for the
BIOS [1]).
If the core dump has a VMCOREINFO note, then it is either a vmcore,
which has valid physical addresses, or it is /proc/kcore with Linux
kernel commit 23c85094fe18 ("proc/kcore: add vmcoreinfo note to
/proc/kcore") (in v4.19), so it must also have Linux kernel commit
464920104bf7 ("/proc/kcore: update physical address for kcore ram and
text") (in v4.11) (ignoring the possibility of a franken-kernel which
backported the former but not the latter). Therefore, treat core dumps
with a VMCOREINFO note as having valid physical addresses.
1: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/setup.c?h=v5.6#n678
I've found that I do this manually a lot (e.g., when digging through a
task's stack). Add shortcuts for reading unsigned integers and a note
for how to manually read other formats.
The internal _page_offset() helper gets the value of PAGE_OFFSET, but
the fallback when KASLR is disabled has been out of date since Linux
v4.20 and never handled 5-level page tables. Additionally, it makes more
sense as part of the Linux kernel (formerly vmcoreinfo) object finder so
that it's cleanly accessible outside of drgn internals.
Jay reported that the default language detection was happening too early
and not finding "main". We need to make sure to do it after the DWARF
index is actually populated. The problem with that is that it makes
error reporting much harder, as we don't want to return a fatal error
from drgn_program_set_language_from_main() if we actually succeeded in
loading debug info. That means we probably need to ignore errors in
drgn_program_set_language_from_main(). To reduce the surface area where
we'd be failing, let's get the language directly from the DWARF index.
This also allows us to avoid setting the language if it's actually
unknown (information which is lost by the time we convert it to a
drgn_object in the current code).
For operations where we don't have a type available, we currently fall
back to C. Instead, we should guess the language of the program and use
that as the default. The heurisitic implemented here gets the language
of the CU containing "main" (except for the Linux kernel, which is
always C). In the future, we should allow manually overriding the
automatically determined language.
UTS_RELEASE is currently only accessible once debug info is loaded with
prog.load_debug_info(main=True). This makes it difficult to get the
release, find the appropriate vmlinux, then load the found vmlinux. We
can add vmcoreinfo_object_find as part of set_core_dump(), which makes
it possible to do the following:
prog = drgn.Program()
prog.set_core_dump(core_dump_path)
release = prog['UTS_RELEASE'].string_()
vmlinux_path = find_vmlinux(release)
prog.load_debug_info([vmlinux_path])
The only downside is that this ends up using the default definition of
char rather than what we would get from the debug info, but that
shouldn't be a big problem.
Instead of having two internal variants (drgn_find_symbol_internal() and
drgn_program_find_symbol_in_module()), combine them into the former and
add a separate drgn_error_symbol_not_found() for translating the static
error to the user-facing one. This makes things more flexible for the
next change.
For live userspace processes, we add a single [0, UINT64_MAX) memory
file segment for /proc/$pid/mem. Of course, not every address in that
range is valid; reading from an invalid address returns EIO. We should
translate this to a DRGN_ERROR_FAULT instead of DRGN_ERROR_OS, but only
for /proc/$pid/mem.
If we only want debugging information for vmlinux and not kernel
modules, it'd be nice to only load the former. This adds a load_main
parameter to drgn_program_load_debug_info() which specifies just that.
For now, it's only implemented for the Linux kernel. While we're here,
let's make the paths parameter optional for the Python bindings.
Add a helper to get the state of a task (e.g., 'R', 'S', 'D'). This will
be used to make sure that a task is not running when getting a stack
trace, so implement it in libdrgn.
vmcores include a NT_PRSTATUS note for each CPU containing the PID of
the task running on that CPU at the time of the crash and its registers.
We can use that to unwind the stack of the crashed tasks.
Currently, we close the Elf handle in drgn_set_core_dump() after we're
done with it. However, we need the Elf handle in
userspace_report_debug_info(), so we reopen it temporarily. We will also
need it to support getting stack traces from core dumps, so we might as
well keep it open. Note that we keep it even if we're using libkdumpfile
because libkdumpfile doesn't seem to have an API to access ELF notes.
Currently, the interface between the DWARF index, libdwfl, and the code
which finds and reports vmlinux/kernel modules is spaghetti. The DWARF
index tracks Dwfl_Modules via their userdata. However, despite
conceptually being owned by the DWARF index, the reporting code reports
the Dwfl_Modules and sets up the userdata. These Dwfl_Modules and
drgn_dwfl_module_userdatas are messy to track and pass between the
layers.
This reworks the architecture so that the DWARF index owns the Dwfl
instance and files are reported to the DWARF index; the DWARF index
takes care of reporting to libdwfl internally. In addition to making the
interface for the reporter much cleaner, this improves a few things as a
side-effect:
- We now deduplicate on build ID in addition to path.
- We now skip searching for vmlinux and/or kernel modules if they were
already indexed.
- We now support compressed ELF files via libdwelf.
- We can now load default debug info at the same time as additional
debug info.
vmcores don't include program headers for special memory regions like
vmalloc and percpu. Instead, we need to walk the kernel page table to
map those addresses. Luckily, libkdumpfile already does that. So, if
drgn was built with libkdumpfile support, use it for ELF vmcores. Also
add an environment variable to override this behavior.
Closes#15.
Now that we have the bundled version of elfutils, build it from libdrgn
and link to it. We can also get rid of the elfutils version checks from
the libdrgn code.
For stack trace support, we'll need to have some architecture-specific
functionality. drgn's current notion of an architecture doesn't actually
include the instruction set architecture. This change expands it to a
"platform", which includes the ISA as well as the existing flags.
Now that we're not overloading the name "symbol", we can define struct
drgn_symbol as a symbol table entry. For now, this is very minimal: it's
just a name, address, and size. We can then add a way to find the symbol
for a given address, drgn_program_find_symbol(). For now, this is only
supported through the actual ELF symbol tables. However, in the future,
we can probably support adding "symbol finders".
struct drgn_symbol doesn't really represent a symbol; it's just an
object which hasn't been fully initialized (see c2be52dff0 ("libdrgn:
rename object index to symbol index"), it used to be called a "partial
object"). For stack traces, we're going to have a notion of a symbol
that more closely represents an ELF symbol, so let's get rid of the
temporary struct drgn_symbol representation and just return an object
directly.
We don't need to get the DWARF index at the time we get the Dwfl handle,
so get rid of drgn_program_get_dwarf(), add drgn_program_get_dwfl(), and
create the DWARF index right before we update in a new function,
drgn_program_update_dwarf_index().
libdwfl is the elfutils "DWARF frontend library". It has high-level
functionality for looking up symbols, walking stack traces, etc. In
order to use this functionality, we need to report our debugging
information through libdwfl. For userspace programs, libdwfl has a much
better implementation than drgn for automatically finding debug
information from a core dump or PID. However, for the kernel, libdwfl
has a few issues:
- It only supports finding debug information for the running kernel, not
vmcores.
- It determines the vmlinux address range by reading /proc/kallsyms,
which is slow (~70ms on my machine).
- If separate debug information isn't available for a kernel module, it
finds it by walking /lib/modules/$(uname -r)/kernel; this is repeated
for every module.
- It doesn't find kernel modules with names containing both dashes and
underscores (e.g., aes-x86_64).
Luckily, drgn already solved all of these problems, and with some
effort, we can keep doing it ourselves and report it to libdwfl.
The conversion replaces a bunch of code for dealing with userspace core
dump notes, /proc/$pid/maps, and relocations.
This converts several open-coded dynamic arrays to the new common vector
implementation:
- drgn_lexer stack
- Array dimension array for DWARF parsing
- drgn_program_read_c_string()
- DWARF index directory name hashes
- DWARF index file name hashes
- DWARF index abbreviation table
- DWARF index shard entries
Since we currently don't parse DWARF macro information, there's no easy
way to get the value PAGE_SIZE and friends in drgn. However, vmcoreinfo
contains the value of PAGE_SIZE, so let's add a special symbol finder
that returns that.
Currently, if we don't get vmcoreinfo from /proc/kcore, and we can't get
it from /sys/kernel/vmcoreinfo, then we manually determine the kernel
release and KASLR offset. This has a couple of issues:
1. We look for vmlinux to determine the KASLR offset, which may not be
in a standard location.
2. We might want to start using other information from vmcoreinfo which
can't be determined as easily.
Instead, we can get the virtual address of vmcoreinfo from
/proc/kallsyms and read it directly from there.
kernel_module_iterator_next() can also fail in
open_loaded_kernel_modules(), so handle it in the same way that we
currently handle kernel_module_iterator_init().
/proc/kcore contains segments which don't have a valid physical address,
which it indicates with a p_paddr of -1. Skip those segments, otherwise
we got an overflow error from the memory reader.
The current array-based memory reader has a bug in the following
scenario:
prog.add_memory_segment(0xffff0000, 128, ...)
# This should replace a subset of the first segment.
prog.add_memory_segment(0xffff0020, 32, ...)
# This moves the first segment back to the front of the array.
prog.read(0xffff0000, 32)
# This finds the first segment instead of the second segment.
prog.read(0xffff0032, 32)
Fix it by using the newly-added splay tree. This also splits up the
virtual and physical memory segments into separate trees.
Currently, we load debug information for every kernel module that we
find under /lib/modules/$(uname -r)/kernel. This has a few issues:
1. Distribution kernels have lots of modules (~3000 for Fedora and
Debian).
a) This can exceed the default soft limit on the number of open file
descriptors.
b) The mmap'd debug information can trip the overcommit heuristics
and cause OOM kills.
c) It can take a long time to parse all of the debug information.
2. Not all modules are under the "kernel" directory; some distros also
have an "extra" directory.
3. The user is not made aware of loaded kernel modules that don't have
debug information available.
So, instead of walking /lib/modules, walk the list of loaded kernel
modules and look up their debugging information.
Currently, programs can be created for three main use-cases: core dumps,
the running kernel, and a running process. However, internally, the
program memory, types, and symbols are pluggable. Expose that as a
callback API, which makes it possible to use drgn in much more creative
ways.
Similar to "libdrgn: make memory reader pluggable with callbacks", we
want to support custom type indexes (imagine, e.g., using drgn to parse
a binary format). For now, this disables the dwarf index tests; we'll
have a better way to test them later, so let's not bother adding more
test scaffolding.
I've been planning to make memory readers pluggable (in order to support
use cases like, e.g., reading a core file over the network), but the
C-style "inheritance" drgn uses internally is awkward as a library
interface; it's much easier to just register a callback. This change
effectively makes drgn_memory_reader a mapping from a memory range to an
arbitrary callback. As a bonus, this means that read callbacks can be
mixed and matched; a part of memory can be in a core file, another part
can be in the executable file, and another part could be filled from an
arbitrary buffer.
Currently, we deduplicate files for userspace mappings manually.
However, to prepare for adding symbol files at runtime, move the
deduplication to DWARF index. In the future, we probably want to
deduplicate based on build ID, as well.
Older versions of Clang generate a call to __muloti4() for
__builtin_mul_overflow() with mixed signed and unsigned types. However,
Clang doesn't link to compiler-rt by default. Work around it by making
all of our calls to __builtin_mul_overflow() use unsigned types only.
1: https://bugs.llvm.org/show_bug.cgi?id=16404
The current mixed Python/C implementation works well, but it has a
couple of important limitations:
- It's too slow for some common use cases, like iterating over large
data structures.
- It can't be reused in utilities written in other languages.
This replaces the internals with a new library written in C, libdrgn. It
includes Python bindings with mostly the same public interface as
before, with some important improvements:
- Types are now represented by a single Type class rather than the messy
polymorphism in the Python implementation.
- Qualifiers are a bitmask instead of a set of strings.
- Bit fields are not considered a separate type.
- The lvalue/rvalue terminology is replaced with reference/value.
- Structure, union, and array values are better supported.
- Function objects are supported.
- Program distinguishes between lookups of variables, constants, and
functions.
The C rewrite is about 6x as fast as the original Python when using the
Python bindings, and about 8x when using the C API directly.
Currently, the exposed API in C is fairly conservative. In the future,
the memory reader, type index, and object index APIs will probably be
exposed for more flexibility.