The current page table walker will on average read around half of the
entire page table for each level. This is inefficient, especially when
debugging a remote target which may have a low bandwidth connection to
the debugger. Address this by only reading one PTE per level.
I've only done the aarch64 page table walker because that's all that I
needed, but in principle the other page table walkers could work in a
similar way.
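As a rough illustration of the idea (not the actual libdrgn code), the sketch below walks a made-up four-level table and issues one 8-byte read per level. The table layout, the `read_u64` callback, the valid bit, and the address mask are all simplifying assumptions chosen for the example.

```python
from typing import Callable, Dict, List

PAGE_SHIFT = 12        # assumed 4 KiB pages
BITS_PER_LEVEL = 9     # assumed 512 entries per table
LEVELS = 4             # assumed four levels
ADDR_MASK = 0x0000FFFFFFFFF000  # simplified output-address bits


def translate(read_u64: Callable[[int], int], root: int, virt: int) -> int:
    """Translate virt by reading exactly one entry per level."""
    table = root
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        index = (virt >> shift) & ((1 << BITS_PER_LEVEL) - 1)
        entry = read_u64(table + 8 * index)  # one entry, not the whole table
        if not entry & 1:                    # bit 0 = valid (simplified)
            raise LookupError(f"address {virt:#x} is not mapped")
        table = entry & ADDR_MASK
    return table | (virt & ((1 << PAGE_SHIFT) - 1))


# Toy "remote" memory: 8-byte words keyed by address, with a read log.
memory: Dict[int, int] = {
    0x1000: 0x2000 | 1,      # top level, index 0
    0x2000: 0x3000 | 1,      # next level, index 0
    0x3000: 0x4000 | 1,      # next level, index 0
    0x4000 + 8: 0x5000 | 1,  # last level, index 1
}
read_log: List[int] = []


def read_u64(addr: int) -> int:
    read_log.append(addr)
    return memory.get(addr, 0)


phys = translate(read_u64, 0x1000, 0x1234)
```

With this toy layout, the whole translation costs four small reads, which is the property that matters on a low-bandwidth link.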
Signed-off-by: Peter Collingbourne <pcc@google.com>
This reverts commit 3fc72a92ea. GCC 13
seems to be generating DW_TAG_unspecified_type DWARF entries that older
versions of pahole don't support.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
While we're doing a mass rebuild for only a handful of kernels, we might
as well rebuild with the latest compiler.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Kernel versions 5.11-5.13 are hanging in the CI. It feels like a QEMU
bug, but there's a workaround we can apply to the kernel. The details
are in the new patch.
See #365.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
This reverts commit 747e02857d (except for
the test improvements). Peter Collingbourne noticed that the change I
used to test the performance of reading a single PTE at a time [1]
didn't cache higher level entries. Keeping that caching makes the
regression I was worried about negligible. So, there's no reason to add
the extra complexity of the hint.
1: https://github.com/osandov/drgn/pull/312#issuecomment-1754082129
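To make the caching point concrete, here is a minimal sketch (not the libdrgn implementation) of a walker that reads one entry per level but remembers the last entry it read at each level. The class name, table layout, valid bit, and address mask are assumptions made up for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

PAGE_SHIFT = 12        # assumed 4 KiB pages
BITS_PER_LEVEL = 9     # assumed 512 entries per table
LEVELS = 4             # assumed four levels
ADDR_MASK = 0x0000FFFFFFFFF000  # simplified output-address bits


class CachingWalker:
    """Toy walker that remembers the last entry it read at each level."""

    def __init__(self, read_u64: Callable[[int], int], root: int) -> None:
        self.read_u64 = read_u64
        self.root = root
        # cache[level] = (virt >> shift of that level, entry value)
        self.cache: List[Optional[Tuple[int, int]]] = [None] * LEVELS

    def translate(self, virt: int) -> int:
        table = self.root
        for level in range(LEVELS - 1, -1, -1):
            shift = PAGE_SHIFT + level * BITS_PER_LEVEL
            key = virt >> shift
            cached = self.cache[level]
            if cached is not None and cached[0] == key:
                entry = cached[1]                 # no memory read needed
            else:
                index = key & ((1 << BITS_PER_LEVEL) - 1)
                entry = self.read_u64(table + 8 * index)
                self.cache[level] = (key, entry)
            if not entry & 1:                     # bit 0 = valid (simplified)
                raise LookupError(f"address {virt:#x} is not mapped")
            table = entry & ADDR_MASK
        return table | (virt & ((1 << PAGE_SHIFT) - 1))


# Toy memory: three upper-level tables and four mapped pages.
memory: Dict[int, int] = {0x1000: 0x2001, 0x2000: 0x3001, 0x3000: 0x4001}
for i in range(4):
    memory[0x4000 + 8 * i] = (0x10000 + i * 0x1000) | 1

read_log: List[int] = []


def read_u64(addr: int) -> int:
    read_log.append(addr)
    return memory.get(addr, 0)


walker = CachingWalker(read_u64, 0x1000)
results = [walker.translate(i * 0x1000) for i in range(4)]
```

In this toy setup, the first translation reads four entries and each following page reads only the last-level entry, so a sequential read of adjacent pages pays roughly one extra read per page.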
Signed-off-by: Omar Sandoval <osandov@osandov.com>
- Fix indentation that used seven spaces instead of a tab.
- Use //-style comments.
- Put "imports" first.
- Call after setting up all other types so that future changes can set
up aliases referring to those types.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Sphinx normally makes type names in annotations links to the
documentation for that type, but this doesn't work for type aliases
(like drgn.Path). See sphinx-doc/sphinx#10785. Add a workaround inspired
by adafruit/circuitpython#8236.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Commit 1b47b866b4 ("libdrgn: go back to trusting PRSTATUS PID")
introduced a check so that we only use the PRSTATUS note for stack
unwinding if its PID field matches the PID of the task we're trying to
unwind. This was intended to detect when a CPU was in the middle of a
context switch during a crash. However, it has caused more trouble than
it's worth. For example:
1. It's broken on s390x, which puts the CPU number + 1 in the PID field.
See also commit 7cb3e99b23 ("libdrgn: program: find crashed task
with cpu_curr() instead of find_task()").
2. It's broken on dumps from QEMU's dump-guest-memory command, which
also puts the CPU number + 1 in the PID field. See issue #356.
3. It breaks the reasonable assumption that
prog.stack_trace(cpu_curr(prog, cpu)) always gets the stack trace for
a given CPU. This slowed down an internal scheduler investigation.
It doesn't even really accomplish what it's trying to do, since the
check itself is racy. Let's remove it (which is not as simple as it
sounds, since we were also using it in lieu of a proper on_cpu fallback
for !SMP kernels).
Signed-off-by: Omar Sandoval <osandov@osandov.com>
There's no reason to go through the trouble of checking the task_struct
if we were given a PRSTATUS note; it must be a thread that was running
at the time of the core dump. Refactor drgn_get_initial_registers() so
that we can use PRSTATUS earlier.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
In places where we're missing support for an architecture or live
processes, return a NOT_IMPLEMENTED error instead of an INVALID_ARGUMENT
error.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
The weird mismatch of r1 and the program counter was fixed in Linux v6.5
and backported to several stable branches. Unfortunately, we have to
break the kernel version checking ban to handle that.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Maple trees have been around and used for VMAs for almost a year now
(since Linux 6.1). Finally add helpers and tests for them.
Closes #261.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
This partially reverts commit 002b63b437 ("packit: disable ELN builds
until fedora-eln/eln#165 is fixed"). That ELN issue was fixed, but it's
still broken on s390x because of fedora-eln/eln#170.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
This makes the cpumask tests a little more thorough, as now the online
mask will be different from the possible and present masks. It also
makes the cpulist discontiguous in most cases (since you usually can't
offline CPU 0).
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Factor these out of the for_each_{online,possible,present}_cpu()
helpers. These are mainly so that we can test cpumask_to_cpulist(), but
they're also useful in their own right.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
In some cases (for example, getting the affinity of an IRQ), it is better
to have an easily understandable list of CPUs corresponding to a given
cpumask.
This helper converts a given cpumask to a string, such that the string
represents the ranges of CPUs that are present in the given mask.
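The range formatting can be sketched in plain Python as follows. This is only an illustration of the output format, not the helper added by this commit; `cpus_to_cpulist` and `mask_bits` are hypothetical names, and the mask is modeled as a plain integer rather than a kernel cpumask.

```python
from typing import Iterable, List


def cpus_to_cpulist(cpus: Iterable[int]) -> str:
    """Format CPU numbers as a range list such as "0-3,5,7-8"."""
    ranges: List[List[int]] = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu        # extend the current run
        else:
            ranges.append([cpu, cpu])  # start a new run
    return ",".join(
        str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in ranges
    )


def mask_bits(mask: int) -> List[int]:
    """Return the indices of the set bits in an integer bitmask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


cpulist = cpus_to_cpulist(mask_bits(0b110101111))
```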
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
The replacements of * with [0], -> with ., and & with address_of_() are
documented in the Object class docstring, but they're important enough
that we should mention them in the user guide. Also expand the
documentation of __getitem__ and __getattribute__ to mention this.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
We're determining which callbacks to use for the Dwfl handle too early,
before the program flags are set. Instead of creating it later, shift
the flag checks to the callback itself.
Fixes: c85dd74f3e ("libdrgn: embed drgn_debug_info in drgn_program")
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Rather than the indirect link between verbiage in the docs and getting
dependencies, just include the URL. Indent the related modules for
clarity.
Signed-off-by: Alex Gartrell <alexgartrell@gmail.com>
The callables for object and type finders may return None -- and they
must do this in the case of a lookup failure. Update the type
annotations and docstrings to reflect this.
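The contract can be pictured with a small standalone sketch. The `Finder` alias, `make_dict_finder()`, and `lookup()` below are hypothetical stand-ins that do not use drgn's real types; they only show a finder returning None on a miss so that the caller can try the next one.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical finder signature: name -> value, or None if not found.
Finder = Callable[[str], Optional[int]]


def make_dict_finder(table: Dict[str, int]) -> Finder:
    def finder(name: str) -> Optional[int]:
        return table.get(name)  # dict.get() returns None on a miss
    return finder


def lookup(finders: List[Finder], name: str) -> int:
    """Ask each finder in turn; None means "not mine, try the next one"."""
    for finder in finders:
        result = finder(name)
        if result is not None:
            return result
    raise LookupError(f"could not find {name!r}")


finders = [make_dict_finder({"a": 1}), make_dict_finder({"b": 2})]
```

Annotating the return type as Optional makes the "return None on failure" requirement visible to type checkers instead of leaving it implicit.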
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
This is a surprising place where file references can be hiding that I've
run into before. There are some beginnings of an lsof-like script here.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
It's difficult to automatically detect calling an invalid, non-NULL
pointer when getting a stack trace. This manually recreates what we do
for calls to NULL since commit 412ce956b0 ("libdrgn: x86_64: unwind
call when pc is 0"). This was used to debug the issue fixed by "net:
tcp: fix crashes trying to free half-baked MTU probes" [1].
1: https://lore.kernel.org/all/20231010173651.3990234-1-kuba@kernel.org/T/
Signed-off-by: Omar Sandoval <osandov@osandov.com>
I missed this easy conversion back in commit ee51244dc1 ("libdrgn: add
_cleanup_free_ scope guard, no_cleanup_ptr(), and return_ptr()").
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Commit b16dad8a36 ("libdrgn: support SHT_REL relocations") added our
own implementation of SHT_REL relocations, which are used by Arm and
i386. However, it failed to remove the check that skips over all
non-SHT_RELA sections, so we've been falling back to the (slow) libdwfl
implementation this whole time.
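The shape of the bug, as I understand it, is a leftover section filter. The sketch below is Python rather than the C in libdrgn, and the section list and function names are invented for illustration; only the SHT_RELA and SHT_REL values come from the ELF specification.

```python
from typing import List, Tuple

SHT_RELA = 4  # relocation entries with explicit addends
SHT_REL = 9   # relocation entries without addends

Section = Tuple[str, int]  # (name, sh_type), simplified


def handled_before(sections: List[Section]) -> List[str]:
    # Leftover check: everything that isn't SHT_RELA is skipped.
    return [name for name, sh_type in sections if sh_type == SHT_RELA]


def handled_after(sections: List[Section]) -> List[str]:
    # Both relocation section types reach our own implementation.
    return [
        name for name, sh_type in sections if sh_type in (SHT_RELA, SHT_REL)
    ]


sections = [(".text", 1), (".rel.debug_info", SHT_REL), (".rela.debug_info", SHT_RELA)]
```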
Fixes: b16dad8a36 ("libdrgn: support SHT_REL relocations")
Signed-off-by: Omar Sandoval <osandov@osandov.com>
These aren't needed since commit c4a122ead6 ("libdrgn: dwarf_info:
scalably index all DIEs per name").
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Upcoming tests will need to combine flags.
Fixes: 104a14781d ("tests: test compressed debug sections")
Signed-off-by: Omar Sandoval <osandov@osandov.com>
The log levels are DRGN_LOG_CRITICAL and DRGN_LOG_ERROR, not
DRGN_LOG_CRIT and DRGN_LOG_ERR. These macros aren't being used anywhere
yet, so it wasn't caught before.
Fixes: c1a2792e6a ("libdrgn: add simple logging framework")
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Peter Collingbourne reported that the over-reading we do in the AArch64
page table iterator uses too much bandwidth for remote targets. His
original proposal in #312 was to change the page table iterator to only
read one entry per level. However, this would regress large reads that
do end up using the additional entries (in particular when the target is
/proc/kcore, which has a high latency per read but also high enough
bandwidth that the over-read is essentially free).
We can get the best of both worlds by informing the page table iterator
how much we expect to need (at the cost of some additional complexity in
this admittedly already pretty complex code). Requiring an accurate end
would limit the flexibility of the page table iterator and be more
error-prone, so let's make it a non-binding hint.
Add the hint and use it in the AArch64 page table iterator to only read
as many entries as necessary. Also extend the test case for large page
table reads to test this better.
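One way to picture how a non-binding hint bounds the read at each level is the sketch below. It is not the code in this commit; `entries_to_read()`, the 4 KiB page size, and the 512-entry tables are assumptions for the example.

```python
PAGE_SHIFT = 12                        # assumed 4 KiB pages
BITS_PER_LEVEL = 9                     # assumed 512 entries per table
ENTRIES_PER_TABLE = 1 << BITS_PER_LEVEL


def entries_to_read(virt: int, hint_len: int, level: int) -> int:
    """Number of entries to read from the table at `level` (0 = last level).

    hint_len is how many bytes the caller expects to translate starting
    at virt. It is only a hint: a short hint costs extra reads later,
    and a long one costs some over-reading now.
    """
    shift = PAGE_SHIFT + level * BITS_PER_LEVEL
    first = (virt >> shift) & (ENTRIES_PER_TABLE - 1)
    last_addr = virt + max(hint_len, 1) - 1       # always cover virt itself
    span = (last_addr >> shift) - (virt >> shift) + 1
    # Never read past the end of the current table.
    return min(span, ENTRIES_PER_TABLE - first)
```

For example, under these assumptions a 1-byte hint reads a single entry per level, while a 1 GiB hint starting at a table boundary reads the full 512-entry last-level table.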
Signed-off-by: Omar Sandoval <osandov@osandov.com>
arch_x86_64.c is often used as a reference when implementing support
for other architectures, so make sure it uses our latest best practices.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
In my branch for the module API (#332), I want to log an error without
any additional context. Passing an empty format string causes a
"zero-length gnu_printf format string" warning from GCC, and passing
NULL crashes in vsnprintf().
Empty format strings are totally valid, but NULL clearly isn't, so
annotate the format parameter as non-NULL and disable
-Wformat-zero-length.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Currently, when drgn is used to debug a running program, we assume it to
be running on the local machine. However, with remote debugging, this will
no longer be the case. To accommodate remote debugging, introduce a flag
DRGN_PROGRAM_IS_LOCAL, and use it to decide whether to use /sys/module.
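The decision can be sketched with a stand-in flag enum. `ProgramFlags` below is a local toy enum, not drgn's own, and the exact set of flags required before touching /sys/module is an assumption of the example.

```python
import enum


class ProgramFlags(enum.Flag):
    """Stand-in for the program flags; values are arbitrary."""
    IS_LINUX_KERNEL = enum.auto()
    IS_LIVE = enum.auto()
    IS_LOCAL = enum.auto()


def can_use_sys_module(flags: ProgramFlags) -> bool:
    # /sys/module on this machine only describes the target if the target
    # is the running kernel of this same machine.
    wanted = (
        ProgramFlags.IS_LINUX_KERNEL
        | ProgramFlags.IS_LIVE
        | ProgramFlags.IS_LOCAL
    )
    return (flags & wanted) == wanted


local = ProgramFlags.IS_LINUX_KERNEL | ProgramFlags.IS_LIVE | ProgramFlags.IS_LOCAL
remote = ProgramFlags.IS_LINUX_KERNEL | ProgramFlags.IS_LIVE
```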
Signed-off-by: Peter Collingbourne <pcc@google.com>