vmcores include a NT_PRSTATUS note for each CPU containing the PID of
the task running on that CPU at the time of the crash and its registers.
We can use that to unwind the stack of the crashed tasks.
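As a rough illustration of the note layout this relies on, the sketch below walks a raw PT_NOTE payload and pulls the PID out of each NT_PRSTATUS descriptor. It is standalone Python, not libdrgn code: iter_notes(), prstatus_pids(), and make_note() are hypothetical names, and the pr_pid offset of 32 assumes the x86-64 struct elf_prstatus layout and little-endian data.

```python
import struct

NT_PRSTATUS = 1  # note type carrying process status (PID, registers)
PR_PID_OFFSET = 32  # assumed offset of pr_pid in x86-64 struct elf_prstatus


def _align4(n):
    """Round n up to the 4-byte alignment used inside ELF notes."""
    return (n + 3) & ~3


def iter_notes(data):
    """Yield (name, type, desc) for each note in a little-endian payload."""
    offset = 0
    while offset + 12 <= len(data):
        namesz, descsz, note_type = struct.unpack_from("<III", data, offset)
        offset += 12
        name = data[offset:offset + namesz].rstrip(b"\0")
        offset += _align4(namesz)
        desc = data[offset:offset + descsz]
        offset += _align4(descsz)
        yield name, note_type, desc


def prstatus_pids(data):
    """Return the PID recorded in each NT_PRSTATUS note, in file order."""
    pids = []
    for name, note_type, desc in iter_notes(data):
        if name == b"CORE" and note_type == NT_PRSTATUS:
            (pid,) = struct.unpack_from("<i", desc, PR_PID_OFFSET)
            pids.append(pid)
    return pids


def make_note(name, note_type, desc):
    """Build one note; only used to fabricate input for experimenting."""
    name_z = name + b"\0"
    header = struct.pack("<III", len(name_z), len(desc), note_type)
    return (header + name_z.ljust(_align4(len(name_z)), b"\0")
            + desc.ljust(_align4(len(desc)), b"\0"))
```

A real unwinder would also read the register set that follows in the same descriptor; that part is omitted here.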
Currently, we close the Elf handle in drgn_set_core_dump() after we're
done with it. However, we need the Elf handle in
userspace_report_debug_info(), so we reopen it temporarily. We will also
need it to support getting stack traces from core dumps, so we might as
well keep it open. Note that we keep it even if we're using libkdumpfile
because libkdumpfile doesn't seem to have an API to access ELF notes.
We'd like to be able to look up tasks by PID from libdrgn, but those
helpers are written in Python. Translate them to C and add some thin
bindings so we can use the same implementation from Python.
There are some cases where we want to read an integer regardless of its
signedness, so drgn_object_read_signed() and drgn_object_read_unsigned()
are cumbersome to use, and drgn_object_read_value() is too permissive.
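To show the kind of helper this motivates, here is a standalone Python sketch (not the libdrgn API; read_integer() and its "kind" strings are hypothetical). The signedness comes from the value's type rather than from the caller, and non-integer kinds are rejected.

```python
def read_integer(raw, kind, little_endian=True):
    """Decode raw bytes as an integer whose signedness comes from `kind`.

    `kind` is "signed" or "unsigned"; anything else (e.g. "float") is
    rejected, unlike a fully generic value reader.
    """
    if kind not in ("signed", "unsigned"):
        raise TypeError(f"not an integer: {kind}")
    byteorder = "little" if little_endian else "big"
    return int.from_bytes(raw, byteorder, signed=(kind == "signed"))
```

The caller gets one code path for both signed and unsigned values, but still gets an error for anything that isn't an integer.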
Currently, the only information available from a stack frame is the
program counter. Eventually, we'd like to add support for getting
arguments and local variables, but that will require more work. In the
meantime, we can at least get the values of other registers. A
determined user can read the assembly for the code they're debugging and
derive the values of variables from the registers.
Rebase the existing patches and add the patches which extend the libdwfl
stack frame interface.
Based on:
  47780c9e elflint, readelf: enhance error diagnostics
With the following patches:
  configure: Add --disable-programs
  configure: Add --disable-shared
  configure: Fix -D_FORTIFY_SOURCE=2 check when CFLAGS contains -Wno-error
  libcpu: compile i386_lex.c with -Wno-implicit-fallthrough
  libdwfl: don't bother freeing frames outside of dwfl_thread_getframes
  libdwfl: only use thread->unwound for initial frame
  libdwfl: add interface for attaching to/detaching from threads
  libdwfl: cache Dwfl_Module and Dwarf_Frame for Dwfl_Frame
  libdwfl: add interface for evaluating DWARF expressions in a frame
In order to retrieve registers from stack traces, we need to know what
registers are defined for a platform. This adds a small DSL for defining
registers for an architecture. The DSL is parsed by an awk script that
generates the necessary tables, lookup functions, and enum definitions.
It's annoying to do obj.type_.size, and that doesn't even work for every
type. Add sizeof() that does the right thing whether it's given a Type
or Object.
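A minimal sketch of the dispatch, using hypothetical stand-in Type and Object classes rather than the real drgn ones. It assumes the types without a usable size are incomplete ones, modeled here as size None.

```python
class Type:
    """Hypothetical stand-in for a type; size is None if incomplete."""

    def __init__(self, name, size=None):
        self.name = name
        self.size = size


class Object:
    """Hypothetical stand-in for an object that carries its type."""

    def __init__(self, type_):
        self.type_ = type_


def sizeof(type_or_obj):
    """Return the size in bytes of a Type, or of an Object's type."""
    if isinstance(type_or_obj, Object):
        type_ = type_or_obj.type_
    else:
        type_ = type_or_obj
    if type_.size is None:
        raise TypeError(f"cannot get size of incomplete type {type_.name}")
    return type_.size
```

With this, sizeof(obj) and sizeof(obj.type_) give the same answer, and a type with no size raises a clear error.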
For the following source code:
    int arr[] = {};
GCC emits the following DWARF:
    DWARF section [ 4] '.debug_info' at offset 0x40:
     [Offset]
     Compilation unit at offset 0:
     Version: 4, Abbreviation section offset: 0, Address size: 8, Offset size: 4
     [ b]  compile_unit         abbrev: 1
           producer             (strp) "GNU C17 9.2.0 -mtune=generic -march=x86-64 -g"
           language             (data1) C99 (12)
           name                 (strp) "test.c"
           comp_dir             (strp) "/home/osandov"
           stmt_list            (sec_offset) 0
     [ 1d]   array_type           abbrev: 2
             type                 (ref4) [ 34]
             sibling              (ref4) [ 2d]
     [ 26]     subrange_type        abbrev: 3
               type                 (ref4) [ 2d]
               upper_bound          (sdata) -1
     [ 2d]   base_type            abbrev: 4
             byte_size            (data1) 8
             encoding             (data1) signed (5)
             name                 (strp) "ssizetype"
     [ 34]   base_type            abbrev: 5
             byte_size            (data1) 4
             encoding             (data1) signed (5)
             name                 (string) "int"
     [ 3b]   variable             abbrev: 6
             name                 (string) "arr"
             decl_file            (data1) test.c (1)
             decl_line            (data1) 1
             decl_column          (data1) 5
             type                 (ref4) [ 1d]
             external             (flag_present) yes
             location             (exprloc)
              [ 0] addr .bss+0 <arr>
Note the DW_AT_upper_bound of -1. We end up parsing this as UINT64_MAX
and returning a "DW_AT_upper_bound is too large" error. It appears that
GCC is simply emitting the array length minus one, so let's treat these
as having a length of zero.
Fixes #19.
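The special case can be sketched in standalone Python as follows. This is an illustration of the idea, not the libdrgn code: array_length() and the "invalid value" message are hypothetical, and the sketch assumes the parser sees the raw 64-bit value plus whether the attribute used a signed form such as sdata.

```python
UINT64_MAX = 2**64 - 1


def array_length(upper_bound, signed_form):
    """Compute an array dimension's length from DW_AT_upper_bound.

    `upper_bound` is the raw 64-bit value; `signed_form` says whether
    the attribute was encoded with a signed form.
    """
    if signed_form and upper_bound >= 2**63:
        value = upper_bound - 2**64  # reinterpret as two's complement
        if value == -1:
            return 0  # "length minus one" for an empty array
        raise ValueError("DW_AT_upper_bound has invalid value")
    if upper_bound >= UINT64_MAX:
        raise ValueError("DW_AT_upper_bound is too large")
    return upper_bound + 1
```

An upper bound of -1 in a signed form now yields a zero-length array, while a genuinely huge unsigned bound still reports the "too large" error.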
There are a few places (e.g., Program.symbol(), Program.read()) where it
makes sense to accept, e.g., a drgn.Object with integer type. Replace
index_arg() with a converter function and use it everywhere that we use
the "K" format for PyArg_Parse*.
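The real change is a C converter for PyArg_Parse*, but the behavior can be sketched at the Python level. The example below is a hypothetical stand-in: to_uint64() and IntObject are made-up names, and it assumes an integer-typed object exposes its value through __index__().

```python
import operator


class IntObject:
    """Hypothetical stand-in for an object with integer type."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value


def to_uint64(arg):
    """Convert an int-like argument to an int in [0, 2**64)."""
    value = operator.index(arg)  # TypeError for floats, strings, etc.
    if not 0 <= value < 2**64:
        raise OverflowError("value does not fit in 64 bits")
    return value
```

A plain int and an IntObject are both accepted, while a float is rejected instead of being silently truncated.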
I got the error messages for DW_AT_upper_bound and DW_AT_count
backwards; fix them. Also fix the condition for word + 1 overflowing
dimension->length to be word >= UINT64_MAX. (Dwarf_Word is uint64_t so
this is kind of silly, but at least it documents the intent.)
PlatformFlags and PrimitiveType got squashed into one string because of
a missing comma, and execscript was never added. Fix both and add some
test cases.
Commit c0bc72b0ea ("libdrgn: use splay tree for memory reader")
changed the signature of add_memory_segment(), but I didn't update the
example that used it. Fix it, and while we're here make it take the
device on the command line.
Sometimes, I'd like to see all of the missing debug info errors rather
than just the first 5. Allow setting this through the
DRGN_MAX_DEBUG_INFO_ERRORS environment variable.
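A small sketch of how such an override might be read, in standalone Python. The default of 5 comes from the description above; max_debug_info_errors() is a hypothetical name, and clamping negative values to zero and falling back to the default on unparsable input are assumptions of this sketch.

```python
import os

DEFAULT_MAX_ERRORS = 5


def max_debug_info_errors(environ=os.environ):
    """Return the error limit, honoring DRGN_MAX_DEBUG_INFO_ERRORS if set."""
    value = environ.get("DRGN_MAX_DEBUG_INFO_ERRORS")
    if value is None:
        return DEFAULT_MAX_ERRORS
    try:
        return max(int(value), 0)
    except ValueError:
        # Assumption: ignore unparsable values rather than failing.
        return DEFAULT_MAX_ERRORS
```

Passing the environment as a parameter keeps the function easy to exercise without touching the real process environment.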
Make the error message more concise, and reorder the sections so that we
check the most obviously-named section (.debug_info) first and least
important section (.debug_line) last.
Currently, the interface between the DWARF index, libdwfl, and the code
which finds and reports vmlinux/kernel modules is spaghetti. The DWARF
index tracks Dwfl_Modules via their userdata. However, despite
conceptually being owned by the DWARF index, the reporting code reports
the Dwfl_Modules and sets up the userdata. These Dwfl_Modules and
drgn_dwfl_module_userdatas are messy to track and pass between the
layers.
This reworks the architecture so that the DWARF index owns the Dwfl
instance and files are reported to the DWARF index; the DWARF index
takes care of reporting to libdwfl internally. In addition to making the
interface for the reporter much cleaner, this improves a few things as a
side-effect:
- We now deduplicate on build ID in addition to path.
- We now skip searching for vmlinux and/or kernel modules if they were
already indexed.
- We now support compressed ELF files via libdwelf.
- We can now load default debug info at the same time as additional
debug info.
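The deduplication in the first item can be sketched like this. It is a hypothetical Python model of the idea, not the C implementation: DebugInfoIndex and report() are made-up names, and the sketch assumes a file may have no build ID at all.

```python
class DebugInfoIndex:
    """Hypothetical sketch of an index that skips already-reported files."""

    def __init__(self):
        self._paths = set()
        self._build_ids = set()

    def report(self, path, build_id=None):
        """Record a file; return True if it is new, False if a duplicate."""
        if path in self._paths:
            return False
        if build_id is not None and build_id in self._build_ids:
            return False
        self._paths.add(path)
        if build_id is not None:
            self._build_ids.add(build_id)
        return True
```

The same file reached through two different paths is indexed only once, because the second report matches on build ID.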
It's undefined behavior to pass NULL to memcmp() even if the length is
zero. See also commit a17215e984 ("libdrgn: dwarf_index: fix memcpy()
undefined behavior").
vmcores don't include program headers for special memory regions like
vmalloc and percpu. Instead, we need to walk the kernel page table to
map those addresses. Luckily, libkdumpfile already does that. So, if
drgn was built with libkdumpfile support, use it for ELF vmcores. Also
add an environment variable to override this behavior.
Closes #15.
I was always doing the copy based on a comment in the equivalent
setuptools code [1]:
    # Always copy, even if source is older than destination, to ensure
    # that the right extensions for the current Python/platform are
    # used.
On closer inspection, this isn't relevant as of PEP 3149 [2], since
extensions are tagged with the ABI version. Let's avoid the unnecessary
copies.
While we're here, let's restore self.inplace just to be safe.
1: e00f4a87ad
2: https://www.python.org/dev/peps/pep-3149/
I didn't want to use BUILT_SOURCES before because that would break make
$TARGET. But now that doesn't work anyway because we're using SUBDIRS,
so we might as well use BUILT_SOURCES.
Now that we have the bundled version of elfutils, build it from libdrgn
and link to it. We can also get rid of the elfutils version checks from
the libdrgn code.
Based on:
  c950e8a9 config: Fix spec file, add manpages and new GFDL license.
With the following patches:
  configure: Add --disable-programs
  configure: Add --disable-shared
  configure: Fix -D_FORTIFY_SOURCE=2 check when CFLAGS contains -Wno-error
  libcpu: compile i386_lex.c with -Wno-implicit-fallthrough
The plan is to stop relying on the distribution's version of elfutils
and instead ship our own. This gives us freedom to assume that we're
using the latest version and even ship our own patches (starting with a
few build system improvements). More details are in
scripts/update-elfutils.sh, which was used to generate this commit.
Currently, we have a special Makefile target to output the files for a
libdrgn source tarball, and we use that for setuptools. However, the
next change is going to import elfutils, and it'd be a pain to add the
same thing for the elfutils sources. Instead, let's just use git
ls-files for everything. The only difference is that source
distributions won't have the autoconf/automake output.
I started with drgn_elf_relocator as a separate interface to parallelize
by relocation. However, the final result is parallelized by file, which
means that it can be done as part of the main read_cus() loop. Get rid
of the elf_relocator interface and do it in dwarf_index.c instead. This
means that if/when libdwfl gets faster at ELF relocations, we can rip
out the relocation code without any other changes.
We're too inconsistent with how we use these for them to be useful (and
it's impossible to distinguish between a format error and some other
error from libelf/libdw/libdwfl), so let's just get rid of them and make
it all DRGN_ERROR_OTHER/Exception.
For now, we only support stack traces for the Linux kernel (at least
v4.9) on x86-64, and we only support getting the program counter and
corresponding function symbol from each stack frame.
For stack trace support, we'll need to have some architecture-specific
functionality. drgn's current notion of an architecture doesn't actually
include the instruction set architecture. This change expands it to a
"platform", which includes the ISA as well as the existing flags.
Now that we're not overloading the name "symbol", we can define struct
drgn_symbol as a symbol table entry. For now, this is very minimal: it's
just a name, address, and size. We can then add a way to find the symbol
for a given address, drgn_program_find_symbol(). For now, this is only
supported through the actual ELF symbol tables. However, in the future,
we can probably support adding "symbol finders".
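To make the lookup concrete, here is a standalone Python sketch of finding the symbol that contains an address. It models a symbol as just a name, address, and size, as described above; SymbolTable and find() are hypothetical names, and the binary search over a sorted list is one possible approach, not necessarily what libdrgn does.

```python
import bisect
from collections import namedtuple

Symbol = namedtuple("Symbol", ["name", "address", "size"])


class SymbolTable:
    """Hypothetical address-sorted symbol table."""

    def __init__(self, symbols):
        self._symbols = sorted(symbols, key=lambda s: s.address)
        self._addresses = [s.address for s in self._symbols]

    def find(self, address):
        """Return the symbol whose [address, address + size) range matches."""
        # Last symbol starting at or before the requested address.
        i = bisect.bisect_right(self._addresses, address) - 1
        if i >= 0:
            sym = self._symbols[i]
            if address < sym.address + sym.size:
                return sym
        raise LookupError(f"no symbol contains address {address:#x}")
```

An address in the gap between two symbols raises LookupError rather than being attributed to the preceding symbol.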
struct drgn_symbol doesn't really represent a symbol; it's just an
object which hasn't been fully initialized (see c2be52dff0 ("libdrgn:
rename object index to symbol index"); it used to be called a "partial
object"). For stack traces, we're going to have a notion of a symbol
that more closely represents an ELF symbol, so let's get rid of the
temporary struct drgn_symbol representation and just return an object
directly.
Currently, finders indicate a non-fatal lookup error by setting the type
member to NULL. This won't work when we replace the symbol finder with
an object finder (which shouldn't modify the object on failure).
Instead, use a static error for this purpose.
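The pattern translates to Python roughly as a shared sentinel. This is an analogy for the C approach, not a port of it: NOT_FOUND and find_object() are hypothetical names, and finders here are plain callables that either return a result or return the sentinel.

```python
# One shared instance means "not found here; try the next finder".
NOT_FOUND = object()


def find_object(finders, name):
    """Ask each finder in turn and return the first real result."""
    for finder in finders:
        result = finder(name)
        if result is not NOT_FOUND:
            return result
    raise LookupError(f"could not find {name!r}")
```

Because the sentinel is compared by identity, a finder can signal a non-fatal miss without modifying any output argument, and any exception it raises still propagates as a fatal error.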
Currently, repr() of structure and union types goes arbitrarily deep
(except for cycles). However, for lots of real-world types, this is
easily deeper than Python's recursion limit, so we can't get a useful
repr() at all:
>>> repr(prog.type('struct task_struct'))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
RecursionError: maximum recursion depth exceeded while getting the repr of an object
Instead, only print one level of structure and union types.
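A minimal sketch of the one-level approach, using a hypothetical StructType class rather than drgn's real type objects. The output format is made up for illustration; the point is that nested structure types are shown by name only.

```python
class StructType:
    """Hypothetical struct type: a tag plus (member_type, name) pairs."""

    def __init__(self, tag, members=()):
        self.tag = tag
        self.members = list(members)

    def short_repr(self):
        """Name-only form used when this type appears as a member."""
        return f"struct {self.tag}"

    def __repr__(self):
        # Expand this type's own members, but never recurse into them.
        parts = []
        for member_type, name in self.members:
            if isinstance(member_type, StructType):
                shown = member_type.short_repr()
            else:
                shown = repr(member_type)
            parts.append(f"({shown}, {name!r})")
        return f"StructType({self.tag!r}, [{', '.join(parts)}])"
```

Since __repr__() never calls repr() on a nested StructType, the recursion depth is constant no matter how deep or cyclic the type graph is.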
A couple of times, I've broken help(drgn) by formatting a function
signature in a way that the inspect module doesn't understand (namely,
it crashes on Enum default arguments). Let's add a simple test that the
documentation at least renders.
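Such a test can be as small as the sketch below, which renders a module's documentation with pydoc and only checks that it produced text. The json module is a stand-in so the example runs anywhere; the real test would render drgn instead, and the class and method names are hypothetical.

```python
import pydoc
import unittest


class TestDocs(unittest.TestCase):
    def test_render(self):
        # Stand-in module; the real test would import and render drgn.
        import json

        # render_doc() raises if inspect chokes on any signature.
        text = pydoc.render_doc(json)
        self.assertIn("json", text)
```

This doesn't check the content of the documentation, only that generating it doesn't blow up.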
We don't need to get the DWARF index at the time we get the Dwfl handle,
so get rid of drgn_program_get_dwarf(), add drgn_program_get_dwfl(), and
create the DWARF index right before we update in a new function,
drgn_program_update_dwarf_index().