For variables which are predeclared, GCC generates a DW_TAG_variable DIE
with DW_AT_name and DW_AT_declaration as well as a DW_TAG_variable DIE
without DW_AT_name but with DW_AT_specification pointing to the
declaration DIE. We should index the latter, not the former. This has
two benefits: we can skip indexing variable declaration DIEs, which
contribute a lot of duplicate hash table insertions, and we can always
get the address of a variable from the DW_AT_location of the indexed DIE
instead of having to parse the symbol table.
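
A minimal sketch of that indexing rule. The Die record and
index_variables() are hypothetical stand-ins, not the project's actual
data structures: declaration DIEs are skipped, and a defining DIE with
no name of its own borrows the name of the DIE its DW_AT_specification
points to.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Die:
    # Hypothetical, simplified stand-in for a parsed DW_TAG_variable DIE.
    offset: int
    name: Optional[str] = None           # DW_AT_name
    declaration: bool = False            # DW_AT_declaration
    specification: Optional[int] = None  # DW_AT_specification (DIE offset)
    location: Optional[int] = None       # DW_AT_location, reduced to an address


def index_variables(dies: List[Die]) -> Dict[str, List[Die]]:
    by_offset = {die.offset: die for die in dies}
    index: Dict[str, List[Die]] = {}
    for die in dies:
        if die.declaration:
            # Skip declarations; the defining DIE is indexed instead.
            continue
        name = die.name
        if name is None and die.specification is not None:
            # The definition is unnamed; take the name from its declaration.
            name = by_offset[die.specification].name
        if name is not None:
            index.setdefault(name, []).append(die)
    return index


# Made-up example data: one declaration and its unnamed definition.
dies = [
    Die(offset=0x10, name="counter", declaration=True),
    Die(offset=0x40, specification=0x10, location=0xFFFF0000),
]
index = index_variables(dies)
```

Only the defining DIE ends up under "counter", so its address can be
read straight from the indexed entry.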
A name and tag are not always enough to uniquely identify a type or
variable. For example, "struct workspace" in the Linux kernel can refer
to one of at least three types; fs/btrfs/{lzo,zlib,zstd}.c each have
their own struct workspace type. We can, however, also differentiate
DIEs on the file they were declared in.
The naive thing to do would be to include the file name as a string in
the hash table entry. However, that means we must allocate and
canonicalize each path in the line number program header and pay an
extra cache miss plus string comparison when adding a new entry.
We can get rid of the cache miss and string comparison if we instead map
the file name to a unique identifier. The foolproof way to do this would
be to create another big hash table of file names and use the hash table
entry index as the unique identifier. However, for this, we'd still need
to allocate and canonicalize each path as well as worry about another big
hash table.
Once we observe that we can get away with "almost certainly unique"
instead of "truly unique" identifiers, the next logical step is to just
use a hash of the file name as the identifier. With a 64-bit hash and
the ~50k files in the kernel, the probability of any collision is on the
order of 1 in 10 billion (birthday bound: about n^2 / 2^65 for n files).
Even in the extremely unlikely event that there is a collision,
it only matters if the files with colliding names also have colliding
DIEs, which brings things pretty close to the realm of impossibility.
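
The estimate can be checked with the birthday approximation. This is a
sketch, not the project's code: the choice of BLAKE2b and the helper
names collision_probability() and file_name_id() are assumptions made
for illustration, since the hash function actually used is not stated
here.

```python
import hashlib
from math import expm1


def collision_probability(num_items: int, hash_bits: int) -> float:
    # Birthday approximation: p ~= 1 - exp(-n(n-1)/2 / 2^bits).
    pairs = num_items * (num_items - 1) / 2
    return -expm1(-pairs / 2.0**hash_bits)


def file_name_id(path: str) -> int:
    # Assumed hash: 64-bit BLAKE2b digest of the path, read as an integer.
    digest = hashlib.blake2b(path.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")


p = collision_probability(50_000, 64)
print(f"collision probability: about 1 in {1 / p:.2e}")
print(hex(file_name_id("fs/btrfs/lzo.c")))
```

For 50,000 names and 64 bits the result is between 1 in 10 billion and
1 in 20 billion, which matches the order of magnitude quoted above.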
After this change, DwarfIndex.find() returns a list of DIEs matching the
name and tag. The callers will be updated to use the list in upcoming
changes.
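
To illustrate why find() now has to return a list, here is a toy index
keyed on name and tag whose entries also carry the file identifier.
ToyIndex, Entry, and the integer file identifiers are hypothetical and
only model the idea; they are not the real DwarfIndex layout.

```python
from collections import defaultdict
from typing import DefaultDict, List, NamedTuple, Tuple


class Entry(NamedTuple):
    # Hypothetical index entry.
    file_id: int     # identifier derived from the declaring file's name
    die_offset: int  # where the DIE lives in the debug info


class ToyIndex:
    def __init__(self) -> None:
        self._entries: DefaultDict[Tuple[str, str], List[Entry]] = defaultdict(list)

    def add(self, name: str, tag: str, file_id: int, die_offset: int) -> None:
        bucket = self._entries[(name, tag)]
        # Same name, tag, and file: treat it as a duplicate, keep the first.
        if all(entry.file_id != file_id for entry in bucket):
            bucket.append(Entry(file_id, die_offset))

    def find(self, name: str, tag: str) -> List[Entry]:
        # Every DIE matching the name and tag, one per declaring file.
        return list(self._entries.get((name, tag), []))


toy = ToyIndex()
toy.add("workspace", "struct", file_id=1, die_offset=0x100)  # lzo.c
toy.add("workspace", "struct", file_id=2, die_offset=0x200)  # zlib.c
toy.add("workspace", "struct", file_id=3, die_offset=0x300)  # zstd.c
toy.add("workspace", "struct", file_id=1, die_offset=0x400)  # duplicate
matches = toy.find("workspace", "struct")
```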
ProgramObject.__dir__(), member_(), and container_of_() all check that
the relevant type is a struct or union, but they all need to allow
typedefs of structs or unions, as well.
We're currently also doing this check in __getattr__(), which is
unnecessary overhead in the common case. We can instead do the check
only in the error case, when handling the exception.
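
A sketch of both ideas, using made-up type classes rather than the
project's own: typedefs are stripped before looking at the type, and
the "is this a struct or union?" check only runs once the lookup has
already failed.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class StructType:
    # Hypothetical struct/union type: member name -> byte offset.
    name: str
    members: Dict[str, int]


@dataclass
class TypedefType:
    name: str
    type: "Type"


@dataclass
class IntType:
    name: str


Type = Union[StructType, TypedefType, IntType]


def underlying_type(type_: Type) -> Type:
    # Follow typedef chains down to the real type.
    while isinstance(type_, TypedefType):
        type_ = type_.type
    return type_


def member_offset(type_: Type, member: str) -> int:
    real = underlying_type(type_)
    try:
        # Common case: no explicit type check, just do the lookup.
        return real.members[member]
    except (AttributeError, KeyError):
        # Error case: only now work out what went wrong.
        if not isinstance(real, StructType):
            raise TypeError(f"{type_.name!r} is not a struct or union") from None
        raise AttributeError(f"{real.name!r} has no member {member!r}") from None


list_head = StructType("struct list_head", {"next": 0, "prev": 8})
list_t = TypedefType("list_t", list_head)
prev_offset = member_offset(list_t, "prev")
```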
The standard library rlcompleter doesn't support expressions involving
an item lookup (e.g., x[0] or x['foo']). This is a pain for the drgn
CLI, because it's common to use prog['variable'] and want to
autocomplete it. Instead of using the standard library rlcompleter,
implement our own, cleaned-up version of it with the ability to handle
expressions containing [key]. rlcompleter already allows for arbitrary
__getattr__() calls, and __getitem__() isn't any different.
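
A rough sketch of the idea, not the completer that was actually
implemented: attr_matches() and its regular expression are illustrative
assumptions. The expression before the final dot may contain [key]
parts with integer or simple quoted-string keys; it is evaluated, and
the attributes of the result are offered as completions. As with
rlcompleter, evaluating the expression can run arbitrary code.

```python
import re
from typing import Any, Dict, List

# A name followed by any number of .attr or [key] parts, then a final
# dot and the (possibly empty) attribute prefix to complete.
_EXPR_RE = re.compile(
    r"""(\w+(?:\.\w+|\[(?:\d+|'[^']*'|"[^"]*")\])*)\.(\w*)$"""
)


def attr_matches(text: str, namespace: Dict[str, Any]) -> List[str]:
    match = _EXPR_RE.match(text)
    if not match:
        return []
    expr, prefix = match.groups()
    try:
        # May call __getattr__() and __getitem__() on user objects.
        obj = eval(expr, {"__builtins__": {}}, namespace)
    except Exception:
        return []
    return sorted(
        f"{expr}.{name}"
        for name in dir(obj)
        # Hide underscore names unless the user typed an underscore.
        if name.startswith(prefix)
        and (prefix.startswith("_") or not name.startswith("_"))
    )


class Obj:
    foo = 1
    foobar = 2
    baz = 3


namespace = {"prog": {"variable": Obj()}, "xs": [Obj()]}
completions = attr_matches("prog['variable'].fo", namespace)
```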
This is a big change that makes EnumType have a compatible integer type
member instead of copying the fields. It ends up touching a lot of code,
but it also fixes a bunch of static typing errors.
Converting an lvalue to an operand has to do a little bit more than
remove qualifiers:
- Convert array types to pointer types
- Convert function types to pointer types
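
The conversion can be sketched as follows. The Type record and
operand_type() are hypothetical and much simpler than real type
objects; they only show the three cases.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Type:
    # Hypothetical minimal type representation.
    kind: str                        # "int", "pointer", "array", "function", ...
    name: str = ""
    target: Optional["Type"] = None  # element, pointee, or return type
    qualifiers: FrozenSet[str] = frozenset()


def operand_type(type_: Type) -> Type:
    if type_.kind == "array":
        # Array of T becomes pointer to T.
        return Type("pointer", target=type_.target)
    if type_.kind == "function":
        # Function becomes pointer to that function type.
        return Type("pointer", target=type_)
    # Everything else just loses its top-level qualifiers.
    return replace(type_, qualifiers=frozenset())


int_type = Type("int", "int")
const_int = Type("int", "int", qualifiers=frozenset({"const"}))
int_array = Type("array", target=int_type)
func = Type("function", target=int_type)
converted = [operand_type(t) for t in (const_int, int_array, func)]
```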
Some types don't actually have to go through find_dwarf_type(), so they
can be handled in the common code. This allows us to add a MockTypeIndex
to the tests.
The rules are really subtle and not completely specified, so hopefully
this covers all of the corner cases... This will be used for
ProgramObject operators.
Instead, take callbacks for looking up variables and reading memory.
While we're at it, get rid of TypeFactory and instead implement its
methods as functions taking a DwarfIndex.
Most of drgn.dwarf is not performance-sensitive, and the part that is
(DwarfIndex) can use some extra tuning, which is easier to do in C than
in Cython.
The lldwarf/drgn.dwarf split wasn't working out too well, and moving all
of drgn.dwarf into lldwarf (by rewriting it into C) would be way too
much work. Instead, use Cython, which results in a parser which is just
as fast but with much cleaner code overall. It also turns out lldwarf
wasn't doing GC right, so the switch also fixed that.
Resolving parameters, variables with function scope, and global
variables should work. This is just the variable resolution, no fetching
yet, but a bunch of refactors snuck in here, so I'm committing it all now.
The parsing library shouldn't really care about keeping track of these.
Instead, add __dict__ and getattr()/setattr() to all of the lldwarf
objects so higher layers can store the offset if they want.
DwarfDie is special in two ways:
1. We want to store an offset for the DIE so we know where to parse
sibling and children entries.
2. For some attributes, we need to store an offset so we know where to
find them later.
Both of these are changed to be relative to the CU rather than the
buffer.
I wrote all of this code a few months back and am just now getting
around to committing it. The low-level DWARF parsing library is pretty
solid, although it only implements a subset of DWARF so far. The CLI and
higher-level interface are experimental.