So the problem is the hash of a reference attribute not being
"stable"
in the face of circularity. The dirty solution of just always using a
hash of zero for a reference makes things indeed a lot worse, too many
things hash together to be practical.
Right.
I tried to hack together a "solution" that used the
referent's
non-reference attributes and tag. That showed a bug which looks
suspiciously like what we tried to debug before:
https://fedorahosted.org/pipermail/elfutils-devel/2010-September/001584.html
The dag burn
fedorahosted.org webserver is not responding at the moment,
so I can't look that up right now.
So I will try to get to the bottom of that first (somewhere when we
try
to finalize a die the attributes get misplaced). But in practice this
might also not be enough to create very distinct hashes.
It certainly won't in the case of <pointer_type type=.../> or
<reference_type type=.../>. Those are almost certainly part of
every circular reference chain in practice.
Another idea would be to use not use a "perfect hash" for
reference
attributes. Allow some equal reference attributes to not hash the same.
I don't see how that would fly, unless, as you say, we do it only
temporarily and then replace them with perfect ones later. The matching
hashes is the core of the whole plan for duplicate elimination.
The nice thing about most of the hashing done in dwarf_output is that
it
"uniqifies" value/die groups and so creates a perfect unique hash. This
can be done with ordered sets of values or die trees. But the reference
attributes introduce arbitrary (possibly circular) graphs between
elements. We get in trouble because we want to cut the circularity, but
don't know "the start of the circle".
It might be possible to choose a canonical ordering so that we always
start at the same step in the cycle. I haven't thought this through,
but maybe it is a line worth considering. For example, say we chose a
canonical ordering of the direct children of a CU. Then, among all CUs
so canonicalized, there would be a single choice for the first DIE in
the depth-first ordering of the whole CU tree that is the start of a
cycle, the same choice for all instances of an identical cycle.
Is that true?
So the problem comes from including the hash of all reference
attributes
in the construction of the "unique identifier object" for a attribute
reference that points to a die. That means that references inside
circular structures cannot really be completed since during the
construction of the "unique identifier object" of reference to that
circular structure itself it can never be fully resolved.
Here you mean the references that are themselves steps along the cyclic
reference chain, right? That is, not each and every reference that
appears inside a DIE tree that is part of a cycle. For example, a
reference whose referent is some <base_type .../> DIE can always be
completed, because that referent has no other references. Likewise, a
<const_type type=.../> can be completed if its type=... referent is a
base_type. Likewise, a <pointer_type type=.../> whose referent is such
a const_type, etc. i.e., any reference that is not itself part of a
cyclic reference chain. I think that's what you meant, but I want
always to be sure we are being precise in understanding each other.
So it seems that we should declare two "resolved" states
for DIEs. One
"tree resolved" that takes into account just the basic attributes plus
all the children "tree resolved". And another "fully resolved" that
all attributes, including reference values into account, plus "fully
resolved" children.
That sounds familiar. In some previous iterations of the 'struct entry'
thing with pending counts, I had multiple different kinds of pending
counts. It all got very confusing and I went back and forth confounding
myself several times before settling on the apparently simpler approach
we have now. That doesn't mean this isn't the right way to go. But it
will take great care, and much better reasoning and commenting than I
did before, to make sure it all makes sense. I'm sure what I did before
is not exactly what you are proposing, but it certainly was a way to
make all the code even harder to follow.
A reference value then only needs its referent DIE to be "tree
resolved" to create the "unique identifier object" for it. That
"breaks the circle" at the start of a reference attribute value so you
don't need to track circularity anymore.
Ok, I think that makes sense. But then we come back to the case of
<pointer_type type=.../>. That has no children and no non-reference
attributes, so there is nothing but DW_TAG_pointer_type to contribute to
its hash value, and so all of them hash the same.
The disadvantage is that you end up with two "pending
queues" which
seems to effectively mean you will scan the whole DIE tree twice (or at
least calculate the hash over a DIE twice, without and then with taking
reference values into account).
This might not be as repetitive as it sounds. You don't need to visit
the non-reference attributes a second time. In "tree resolved" state,
you can take the complete hash of all the non-reference attributes, and
also hash just the names of the reference attributes, and combine those
together as a single hash value. Then the second pass in "fully
resolved" state will just take the final hashes of the referent DIEs
(pointer-hashes of their .identity () values) and combine those with the
hash saved in the first pass--this yields the hash of an "attrs_pair".
You do then need to visit each child DIE again to combine their final
hashes into the "brood hash", but since each child is fully resolved by
then, that is just a (reduce 'identity children) iteration on the vector
(std::accumulate(c.begin(),c.end(),0,identity_hash) or something).
And since some "tree resolved" DIEs might have the same
hash, but are
not equal if you take the value references into account, you don't
have perfect hashes for the attribute reference values. But they only
collide when referring to the same structural trees with all basic
attribute in all included DIEs matching completely.
That sounds nice, but the "structural tree" <singleton_tag/> with
"all
basic attributes" being the empty set is a common case. Not that I have
any better plan. The supposed plan as I've left it is just lots of both
negative and positive caching for these costly comparisons after every
instance of {pointer,reference,const,...}_type has hashed the same (one
distinct hash per tag) by this plan. Maybe that really is good enough
if we address the cyclic cases in ways that avoid false-positive caching
(somehow that doesn't also abandon the correct caching, so we don't do
lots of repetitive costly comparisons while handling cycles).
But I just don't know.
Thanks,
Roland