This has a few benefits:
1. We no longer have to parse /proc/kallsyms, which actually takes just
as long as parsing all of the DWARF files
2. We can support vmcores
3. We can find the address of a specific variable even if it has the
same name as other variables in the same object file
/proc/kcore and vmcores include the physical memory address in the
program headers. Reading from physical memory can be useful, so support
it in CoreReader.
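As a rough illustration of what a physical read involves, the sketch
below maps an address to a file offset using either p_vaddr or p_paddr
from the program headers. The Segment type and file_offset() helper are
hypothetical names for this example, not the CoreReader interface.

```python
from typing import NamedTuple, Optional, Sequence


class Segment(NamedTuple):
    """The fields of an ELF program header that matter for address lookup."""
    p_offset: int  # offset of the segment's data in the core file
    p_vaddr: int   # virtual address of the segment
    p_paddr: int   # physical address of the segment
    p_filesz: int  # number of bytes present in the file


def file_offset(segments: Sequence[Segment], address: int,
                physical: bool = False) -> Optional[int]:
    """Return the file offset backing address, or None if it is unmapped."""
    for seg in segments:
        start = seg.p_paddr if physical else seg.p_vaddr
        if start <= address < start + seg.p_filesz:
            return seg.p_offset + (address - start)
    return None


# Made-up values, for illustration only.
segments = [Segment(p_offset=0x2000, p_vaddr=0xFFFF880000000000,
                    p_paddr=0x0, p_filesz=0x1000)]
print(file_offset(segments, 0x10, physical=True))
print(file_offset(segments, 0xFFFF880000000010))
```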
We need this in order to be able to reallocate the files and cus arrays.
It also shrinks the size of struct die_hash_entry back to what it was
before file_name_hash was added.
For variables which are predeclared, GCC generates a DW_TAG_variable DIE
with DW_AT_name and DW_AT_declaration as well as a DW_TAG_variable DIE
without DW_AT_name but with DW_AT_specification pointing to the
declaration DIE. We should index the latter, not the former. This has a
couple of benefits: we can skip indexing variable declaration DIEs,
which contribute a lot of duplicate hash table insertions, and we can
always get the address of a variable from DW_AT_location of the indexed
DIE instead of having to parse the symbol table.
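A minimal sketch of that indexing rule, using plain dicts as stand-ins
for DIEs. The dict layout, the offsets, and the helper names are
assumptions for illustration; they are not the indexer's data
structures.

```python
# Hypothetical DIEs keyed by offset: a named declaration and the
# unnamed definition whose DW_AT_specification points back at it.
DIES = {
    0x10: {"tag": "DW_TAG_variable", "DW_AT_name": "counter",
           "DW_AT_declaration": True},
    0x20: {"tag": "DW_TAG_variable", "DW_AT_specification": 0x10,
           "DW_AT_location": 0x1000},
}


def die_name(dies, die):
    """Return the DIE's name, following DW_AT_specification if needed."""
    if "DW_AT_name" in die:
        return die["DW_AT_name"]
    spec = die.get("DW_AT_specification")
    return die_name(dies, dies[spec]) if spec is not None else None


def index_variables(dies):
    """Index variable definitions by name, skipping declaration DIEs."""
    index = {}
    for offset, die in dies.items():
        if die["tag"] != "DW_TAG_variable" or die.get("DW_AT_declaration"):
            continue
        name = die_name(dies, die)
        if name is not None:
            index.setdefault(name, []).append(offset)
    return index


index = index_variables(DIES)
# The indexed DIE carries DW_AT_location, so no symbol table is needed.
address = DIES[index["counter"][0]]["DW_AT_location"]
```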
Currently, we use the sequentially consistent memory model for all
operations, which isn't necessary. The only ordering requirement is that
a thread which finds an already-used hash table slot sees it initialized
once it sees the tag set. Relaxing the ordering does get rid of an
mfence on x86_64, but it didn't make any measurable performance
difference.
A name and tag are not always enough to uniquely identify a type or
variable. For example, "struct workspace" in the Linux kernel can refer
to one of at least three types; fs/btrfs/{lzo,zlib,zstd}.c each have
their own struct workspace type. We can, however, also differentiate
DIEs on the file they were declared in.
The naive thing to do would be to include the file name as a string in
the hash table entry. However, that means we must allocate and
canonicalize each path in the line number program header and pay an
extra cache miss plus string comparison when adding a new entry.
We can get rid of the cache miss and string comparison if we instead map
the file name to a unique identifier. The foolproof way to do this would
be to create another big hash table of file names and use the hash table
entry index as the unique identifier. However, for this, we'd still need
to allocate and canonicalize each path as well as worry about another big
hash table.
Once we observe that we can get away with "almost certainly unique"
instead of "truly unique" identifiers, the next logical step is to just
use a hash of the file name as the identifier. With a 64-bit hash and
the ~50k files in the kernel, the probability of a collision is on the
order of 1 in 10 billion. Even in the extremely unlikely event that
there is a collision,
it only matters if the files with colliding names also have colliding
DIEs, which brings things pretty close to the realm of impossibility.
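The estimate can be checked with the standard birthday approximation,
p ~= 1 - exp(-n(n - 1) / 2^(b + 1)) for n items and a b-bit hash. This
is a back-of-the-envelope sketch that assumes the hash output is
uniformly distributed; the function name is made up for the example.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Birthday approximation of P(any collision) among n uniformly
    distributed hashes of the given width."""
    # -expm1(-x) is 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))


p64 = collision_probability(50_000, 64)
p32 = collision_probability(50_000, 32)
print(f"64-bit: about 1 in {1 / p64:.2g}")
print(f"32-bit: about {p32:.0%}")
```

For the same file count, a 32-bit hash gives a collision probability of
roughly one in four, which is why the width matters here.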
After this change, DwarfIndex.find() returns a list of DIEs matching the
name and tag. The callers will be updated to use the list in upcoming
changes.
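A toy model of the resulting lookup, for illustration only. ToyIndex
and file_name_hash() are hypothetical names, and hashlib.blake2b with an
8-byte digest stands in for whatever 64-bit hash the index really uses.

```python
import hashlib
from collections import defaultdict


def file_name_hash(path: str) -> int:
    """Stand-in 64-bit hash of a file name (assumption: any good
    64-bit hash works for this sketch)."""
    digest = hashlib.blake2b(path.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyIndex:
    def __init__(self):
        # (name, tag) -> list of (file name hash, DIE)
        self._entries = defaultdict(list)

    def add(self, name, tag, path, die):
        ident = file_name_hash(path)
        entries = self._entries[(name, tag)]
        # Same name, tag, and file: treat as a duplicate and skip it.
        if all(h != ident for h, _ in entries):
            entries.append((ident, die))

    def find(self, name, tag):
        """Return every DIE matching the name and tag."""
        return [die for _, die in self._entries.get((name, tag), [])]


idx = ToyIndex()
for src in ("fs/btrfs/lzo.c", "fs/btrfs/zlib.c", "fs/btrfs/zstd.c"):
    idx.add("workspace", "DW_TAG_structure_type", src, f"DIE from {src}")
idx.add("workspace", "DW_TAG_structure_type", "fs/btrfs/lzo.c", "dup")
matches = idx.find("workspace", "DW_TAG_structure_type")
```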
For disambiguating symbols which are defined in multiple files, I want
to use a hash to identify file names without needing to compare the file
names themselves. DJBX33A doesn't cut it for this, as 32 bits isn't
enough, and we can afford to spend more cycles to avoid collisions.
SipHash is a good candidate: it produces a 64-bit output, is pretty fast
for short keys, and can be used incrementally. This adds an
implementation of SipHash-1-3 (the cheapest variant) without a key.
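For reference, here is a compact Python sketch of SipHash with
configurable round counts and incremental updates. It follows the
published algorithm rather than this change's code, and it assumes that
"without a key" means an all-zero key; the class name and interface are
made up for the example.

```python
MASK64 = (1 << 64) - 1


def _rotl(x, b):
    return ((x << b) | (x >> (64 - b))) & MASK64


class SipHash:
    """SipHash-c-d; defaults to SipHash-1-3 with an all-zero key."""

    def __init__(self, c=1, d=3, k0=0, k1=0):
        self.c, self.d = c, d
        self.v = [k0 ^ 0x736F6D6570736575, k1 ^ 0x646F72616E646F6D,
                  k0 ^ 0x6C7967656E657261, k1 ^ 0x7465646279746573]
        self.tail = b""   # bytes not yet forming a full 8-byte word
        self.length = 0   # total number of bytes hashed so far

    def _round(self):
        v0, v1, v2, v3 = self.v
        v0 = (v0 + v1) & MASK64
        v1 = _rotl(v1, 13) ^ v0
        v0 = _rotl(v0, 32)
        v2 = (v2 + v3) & MASK64
        v3 = _rotl(v3, 16) ^ v2
        v0 = (v0 + v3) & MASK64
        v3 = _rotl(v3, 21) ^ v0
        v2 = (v2 + v1) & MASK64
        v1 = _rotl(v1, 17) ^ v2
        v2 = _rotl(v2, 32)
        self.v = [v0, v1, v2, v3]

    def _compress(self, m):
        self.v[3] ^= m
        for _ in range(self.c):
            self._round()
        self.v[0] ^= m

    def update(self, data: bytes):
        """Feed more input; may be called any number of times."""
        self.length += len(data)
        buf = self.tail + data
        full = len(buf) - len(buf) % 8
        for i in range(0, full, 8):
            self._compress(int.from_bytes(buf[i:i + 8], "little"))
        self.tail = buf[full:]
        return self

    def digest(self) -> int:
        """Return the 64-bit hash without disturbing the running state."""
        h = SipHash(self.c, self.d)
        h.v = list(self.v)
        # Final word: leftover bytes plus the length in the top byte.
        h._compress(((self.length & 0xFF) << 56)
                    | int.from_bytes(self.tail, "little"))
        h.v[2] ^= 0xFF
        for _ in range(self.d):
            h._round()
        return h.v[0] ^ h.v[1] ^ h.v[2] ^ h.v[3]


# Hashing a path piecewise gives the same result as hashing it whole.
whole = SipHash().update(b"fs/btrfs/zstd.c").digest()
pieces = SipHash().update(b"fs/btrfs/").update(b"zstd.c").digest()
```

Being able to hash piecewise means a directory and a file name can be
fed in separately, without first building the joined path.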
Instead of defining each command number explicitly, just define the
maximum skip command (i.e., the minimum explicit command minus one),
and assert that the maximum explicit command is 255.
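The numbering scheme might look like the sketch below. The command
names, the value of CMD_MAX_SKIP, and the decode() helper are
hypothetical; the only parts taken from the description above are that
one constant defines the skip range and that an assertion pins the last
explicit command to 255.

```python
from enum import IntEnum

# Hypothetical value: byte values 1..CMD_MAX_SKIP mean "skip that many
# bytes"; everything above it is an explicit command.
CMD_MAX_SKIP = 250

# Explicit commands are numbered sequentially from CMD_MAX_SKIP + 1
# instead of each being given a number by hand. Names are made up.
Cmd = IntEnum(
    "Cmd",
    ["SKIP_BLOCK", "SKIP_STRING", "NAME", "SIBLING", "END"],
    start=CMD_MAX_SKIP + 1,
)

# Catches a command being added or removed without adjusting
# CMD_MAX_SKIP.
assert max(Cmd) == 255


def decode(byte: int):
    """Classify one command byte as a skip count or an explicit command."""
    if 1 <= byte <= CMD_MAX_SKIP:
        return ("skip", byte)
    # Raises ValueError for bytes outside the explicit range (e.g. 0),
    # which this sketch does not assign a meaning to.
    return (Cmd(byte).name, None)
```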
For now, return None for these special dentries. In the future, we can
probably special-case the common ones, like sockfs and anon_inode.