Hi folks! So rather than send a welcome mail I figured let's get right
into something real ;) I actually got the idea for this list because I
wrote up some thoughts on something, and realized there was nowhere
good to send it. So I invented this mailing list. And now I can send
it! Here it is.
I've been thinking some more about convenient consumption of ResultsDB
results - specifically, about the problem of generic handling of
duplicated / repeated tests.
Say Alice from release engineering wants to decide whether a given
deliverable is of release quality. With ResultsDB, Alice can query all
test results for that deliverable, regardless of where they came from.
Great. However, it's not uncommon for a test to have been repeated. Say
the test failed, the failure was investigated and determined to be a
bug in the test, and the test was repeated and passed. Both results
wind up in ResultsDB; we have a fail, then a pass.
How does Alice conveniently and *without special knowledge of the
system that submitted the result* identify the second result as being
for 'the same test' as the first, and thus know she can consider only
the second (most recent) result, and not worry about the failure? (I'm
expecting that to usually be the desired behaviour).
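To illustrate, here's a minimal sketch of the behaviour Alice probably
wants - keep only the most recent result per test case. The field names
('testcase', 'submit_time') are my assumptions for illustration, not a
confirmed ResultsDB schema:

    def latest_per_testcase(results):
        """Keep only the newest result for each testcase name."""
        latest = {}
        for result in sorted(results, key=lambda r: r["submit_time"]):
            # a later submission overwrites an earlier one for the same test
            latest[result["testcase"]] = result
        return list(latest.values())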
Let's complicate it even more. Sometimes it's the same result repeated (e.g.
new depcheck result with updated repo state), but sometimes it's a different tool
submitting the result for the "same" test case. For example, I assume this is
the reason why you chose "compose.base_selinux" testcase name instead of
"compose.openqa.base_selinux". The idea is that several tools can submit the
result for the same test case, so openqa, autocloud or even a manual tester can do it.
I'm not currently sold on this idea (sharing the testcase name instead of having
"compose.openqa.base_selinux", "compose.autocloud.base_selinux" and
"compose.manual.base_selinux"), but that seems to be the current state. So with
this, it's even harder to recognize whether we've received two results from openqa
(the latter superseding the former), or two results from two different tools
(and therefore we should consider both).
There are also
other situations in which it's useful to be able to identify 'the same
test' for different executions; for instance, `check-compose` needs to
do this when it does its 'system information comparison' checks from
compose to compose.
I guess it's worth noting that this is somewhat related to the similar
question for test 'items' (the 'thing' being tested, in ResultsDB
parlance) - the question of uniquely identifying 'the same' item within
and across composes. At least for productmd 'images', lsedlar and I
are currently discussing that in
https://pagure.io/pungi/issue/525 .
We discussed this with Josef a while back on qa-devel. We seemed to agree that item
should identify the thing under test well, even uniquely if possible, but stay simple. We
want to avoid having too many pieces of information concatenated into a single string,
just for the purpose of unique identification. Extra data should be used for that
(structured data, no string parsing). The tradeoff is that searching is a bit more
difficult (we'd need to allow users to also search by extra data in the frontend, and
they'd have to know what to search for).
For example, for git commits, we don't really like items like
"pagure#namespace/project#githash". Perhaps we could have just githash as item
(because it's an almost unique identifier even across many projects) and the rest as
extradata. This way we keep the item simple and easy to search for, both manually
and automatically (no string parsing).
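To make that concrete, a rough sketch of how such a result might be
shaped, with the bare githash as the item and everything else as
structured extradata - the key names besides 'item' and 'type' are
illustrative, not an agreed convention:

    result = {
        "testcase": "rpmlint",        # example testcase name
        "outcome": "PASSED",
        "data": {
            "item": "7b6c1f0e9d...",  # just the githash (truncated here)
            "type": "git_commit",
            "namespace": "namespace", # the rest as extradata
            "project": "project",
        },
    }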
Obviously it's more or less a
solved problem for RPMs.
Almost. For upgradepath, yes, NVR uniquely identifies the result. For depcheck, the
unique identifier is NVR + arch (with arch coming from extradata).
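So the key that makes a result unique differs per check; roughly (the
testcase names and fields here are illustrative):

    def dedup_key(result):
        data = result["data"]
        if result["testcase"] == "upgradepath":   # NVR alone is unique
            return (data["item"],)
        if result["testcase"] == "depcheck":      # NVR + arch is unique
            return (data["item"], data["arch"])
        return (data["item"],)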
I can think of two possible ways to handle this: via the extradata, or
via the test case name.
openQA has a useful concept here. It specifies which combination of
metadata defines a unique test like this, and calls it...well,
that - a 'scenario'. There's a constant definition called
SCENARIO_KEYS in openQA that you can use to discover the appropriate
keys. So I'm going to use the term 'scenario' for this from now on.
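As a rough illustration of the idea (the authoritative list is openQA's
SCENARIO_KEYS constant; the keys below are my approximation of what it
contains):

    SCENARIO_KEYS = ("distri", "version", "flavor", "arch", "test")

    def scenario(job_settings):
        """Join the scenario-defining settings into one identifier."""
        return "-".join(job_settings[key] for key in SCENARIO_KEYS)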
There are kinda two levels of scenario, now that I think about it, depending
on whether you include 'item' identification in the scenario definition
or not. For identifying duplicates within the results for a single
item, you don't need to, but it doesn't hurt; for identifying the same
scenario across multiple composes, you do need to.
I don't follow here. In order to identify another execution of the same scenario, you
need at least testcase name and item to be exactly the same (and possibly also some
metadata). Do you have any examples showing otherwise?
I suppose someone
may have a case for identifying 'the same' test against different
items; for that purpose, you'd need the lighter 'scenario' definition
(not including the item identifier).
I don't understand this at all, it seems to go against the intended meaning of
"item".
One thing we could do is make it a convention that each test case (and/or
test case name?)
What's the difference between the two?
indicates a test 'scenario' - such that all
results for the same test case for the same item should be considered
'duplicates' in this sense, and consumers can confidently count all
results for the same test case as results for the same test 'scenario'.
That was our naive idea originally, until we got things that are not easily uniquely
identifiable via item only (if you don't want to make item a horrible compound of all
needed information).
This seems to me like the simplest possibility, but I do see two
potential issues.
First, it may result in rather long and unwieldy test case names in
some situations. If we take the more complete 'scenario' definition and
include sufficient information to uniquely identify the item under
test, the test case name for an openQA test may look something like:
`fedora.25.server-dvd-iso.x86_64.install_default.uefi` (and that's with
a fairly short test name).
Abomination! Let's not do that.
The purpose of testcase is to identify the steps that were taken to perform the test, or
the tool, or both. In the case of rpmlint, the tool and what gets done are the same
("the steps that rpmlint performs"), so the testcase name (excluding namespace)
is just "rpmlint". For openqa, it's a tool that performs many different test
cases, so it should create a separate testcase for each of them. For your example, it
should be "install_default", or better "openqa.install_default". I
wouldn't mind adding ".uefi" to the end, even though it's not clear
whether it's a sub-step result or part of the testcase (so maybe rather
"install_default_uefi"). But it would also make complete sense to make
"firmware_type" as part of the extradata and have just
"install_default".
We didn't want to impose any restrictions on what the tool produces (so everything
under "compose.openqa" is your playground, do whatever you wish), but of course
once the tool is important enough that we want to use it for gating or other important
tasks, we need to recommend some approaches to make it report similarly to the other
important tools, so that querying those results is not overly difficult. That could be
part of the resultsdb_conventions, I guess.
Second, it makes it difficult to handle the two different kinds of
'scenario' - i.e. it's not obvious how to split off the bits that
identify the 'item' from the bits that identify the 'test scenario'
proper. In this case the 'test scenario' is `install_default.uefi` and
the 'item identifier' is `fedora.25.server-dvd-iso.x86_64`, but there's
no real way to *know* that from the outside, unless we get into
defining separators, which always seems to be a losing game.
Yes, splitting strings is a bad idea.
Another possibility would be to make it a convention to include some
kind of indication of the test 'scenarios' in the extradata for each
result: a 'scenario' key, or something along those lines. This would
make it much easier to include the 'item identifier' and 'test
scenario' proper separately, and you could simply combine them when you
needed the 'complete' scenario.
I'm not sure what you mean exactly, but having a "scenario" key that
would list all the other keys which are necessary to understand what makes this scenario
unique looks like a reasonable idea. For example:
scenario = ["firmware_type", "arch"]  # testcase name and item are implied
The downside is that task authors are required to provide this, and therefore it's
error-prone. I'm not sure how to do it better, though. We could set some reasonable
defaults for each type - so e.g. for koji_build type, we know we compare testcase name +
item (required to be nvr) + arch (if present). For bodhi_update type, it would be
testcase_name + item (required to be Bodhi ID) + last_updated_timestamp. Etc. Anything
beyond those defaults would need to be in "scenario".
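A consumer could then build the de-duplication key generically; a sketch
of that, under the defaults proposed above (this is a proposal, not any
implemented API):

    DEFAULT_SCENARIO_KEYS = {
        "koji_build": ["arch"],                      # plus testcase + item (NVR)
        "bodhi_update": ["last_updated_timestamp"],  # plus testcase + item (Bodhi ID)
    }

    def scenario_key(result):
        data = result["data"]
        # per-type defaults, plus anything the task declared in "scenario"
        extra = DEFAULT_SCENARIO_KEYS.get(data.get("type"), []) + data.get("scenario", [])
        # testcase name and item are always implied
        return (result["testcase"], data["item"]) + tuple(data.get(k) for k in extra)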
I'm trying to avoid consumers of ResultsDB data having to start
learning about the details of individual test 'sources' in order to be
able to perform this kind of de-duplication. It'd suck if releng had to
learn the openQA 'scenario keys' concept directly, for instance, then
learn corresponding smarts for any other system that submitted results.
Yes, that's definitely an important goal. Initially, I guess, we didn't think
about this much. My idea was that each task would document its results structure (it would
become an API basically), and the tools would learn how to consume that. I didn't
expect to have too many important tasks that we use for gating etc. But having common
conventions is definitely easier.