Thx for taking the time to put our discussions together! You are a gentleman and a scholar.
J.

On Wed, Mar 22, 2017 at 2:07 PM, Kamil Paral <kparal@redhat.com> wrote:
I've had a long (three days long!) chat about this with Josef. The solution discussed here (Bodhi deduplicating results by scenario, if it exists) is usable temporarily, but might not work well enough in the future due to performance. The problem is that this approach requires searching through most or all of the database before you can be sure you have every unique result. For example, an openqa result uses "scenario" to record either "uefi" or "bios". However, Bodhi doesn't know all the possible scenario values for a particular test case. So even if it has already found several results with scenario:uefi and several with scenario:bios, it needs to keep searching until no more results are returned, because only then can it be sure that no other scenario value exists.
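To make the performance problem concrete, here is a minimal sketch of that consumer-side deduplication (assuming a ResultsDB v2-style paginated JSON response; the instance URL and field handling are assumptions, not Bodhi's actual code):

# Minimal sketch; assumes ResultsDB v2-style JSON with a paginated "data" list,
# a "next" page URL, and per-result "testcase"/"data"/"submit_time" fields.
import requests

RESULTSDB_URL = "https://resultsdb.example.org/api/v2.0"  # hypothetical instance

def latest_unique_results(item, item_type):
    """Latest result for each (testcase, scenario) pair -- has to read every page."""
    latest = {}
    url = RESULTSDB_URL + "/results"
    params = {"item": item, "type": item_type, "limit": 50}
    while url:
        page = requests.get(url, params=params).json()
        for result in page["data"]:
            scenario = result["data"].get("scenario")
            if isinstance(scenario, list):  # extra-data values may be stored as lists
                scenario = scenario[0] if scenario else None
            key = (result["testcase"]["name"], scenario)
            best = latest.get(key)
            # ISO 8601 submit_time strings compare correctly as plain strings.
            if best is None or result["submit_time"] > best["submit_time"]:
                latest[key] = result
        # No early exit is possible: an unseen scenario value could still appear
        # on any later page, so pagination must be followed until exhausted.
        url, params = page.get("next"), None  # assuming "next" is a full URL or None
    return latest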

There are some optimizations we can do, for example limiting the search with the Bodhi update "last modified" timestamp (for bodhi_update results) or with the Koji build "build_time" timestamp (for koji_build results). However, that won't always help (a Koji build can exist for weeks or months before you create a Bodhi update with it, and then you're again searching through most of the database).

We tried to brainstorm the best solution that wouldn't be overly demanding on performance, and we have some ideas. But first, let's clarify what the requirement is, so that there's no confusion about what we're solving and why. IIUIC, the requirement is:
** We want to be able to retrieve all unique results for a particular item. Unique results might be determined just by the testcase+item+type combination, or they can be further distinguished by some (but not necessarily all) additional extradata fields. Those fields might be arbitrary, i.e. not restricted to the "common fields" for that type (e.g. "arch" for koji builds), but we can require them to be identified somehow (e.g. a scenario key linking to them or duplicating them). We want to be able to do this from a consumer that has no internal knowledge about the tasks it displays, e.g. Bodhi showing all task results for a specific item without knowing which fields differentiate them or what all the available values of those fields are. **

Do you think that's a correct description? Also, do you see any other similar requirement/use case that we would need in the mid-term future (maybe for Bodhi, maybe for some other consumer)? If you have any ideas about further use cases, please state them now.
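To make the requirement a bit more concrete, here is a tiny made-up example (all values invented) of what "unique results" would mean for a single koji_build item:

# Made-up results for item "foo-1.0-1.fc26" of type "koji_build" (illustration only).
results = [
    {"id": 1, "testcase": "some.rpm.check",      "arch": "x86_64", "scenario": None},
    {"id": 2, "testcase": "some.rpm.check",      "arch": "armhfp", "scenario": None},
    {"id": 3, "testcase": "openqa.base_install", "arch": "x86_64", "scenario": "uefi"},
    {"id": 4, "testcase": "openqa.base_install", "arch": "x86_64", "scenario": "bios"},
    {"id": 5, "testcase": "openqa.base_install", "arch": "x86_64", "scenario": "bios"},  # newer rerun of 4
]
# Uniqueness key: testcase + common fields (here "arch") + scenario.
# Results 1-4 are all unique; result 5 merely supersedes result 4, so a consumer
# should end up with {1, 2, 3, 5} -- the latest result for each unique combination.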

Now, based on the requirement above, we thought of this:
1. We will use resultsdb_conventions to specify common fields for a particular item type. For example, for koji builds it will be 'arch'. For git_commit it might be 'url' (provided the commit hash is the item). For compose it might be 'compose_id' (provided it won't be the item). Those fields will cover only things that are universal and make sense for all results of a given type.
2. Filling in those common fields will be mandatory. Libtaskotron will be able to emit warnings/fail early if any of them is missing. ResultsDB will be able to reject results not meeting the conventions. Even if ResultsDB stores such results, consumers might not display them properly, or at all.
3. The purpose of these common fields is to ensure that all results for a given type have enough basic information for the consumer to make proper queries, if needed.
4. If the result needs to be distinguished further, it will need to add a "scenario" field and populate it with a unique string that allows the results to be told apart.
5. The scenario field has the advantage (over adding environment information to the testcase name) that it serves solely this purpose and doesn't force the user to mix unrelated information (testcase and environment). It also still allows the user to make exact queries for a particular testcase (if the environment were added to the testcase name, the user would have to do wildcarded queries, which can be problematic in some cases).
6. We will implement a "middleware" tailored to the Bodhi use case (getting the latest result for every unique result). Bodhi will query this middleware instead of ResultsDB directly. The middleware will intercept resultsdb fedmsgs and store only the latest unique results for a particular item. It will know the resultsdb conventions (and therefore which fields are important and mandatory for each item type), and for each item it will store the combination of testcase+type+common fields+scenario together with the matching resultsdb result ID (see the sketch after this list). For this reason, finding all unique results should be much faster, because we won't have to search through the whole database space; instead, all the unique result IDs will be stored in a separate database. Josef wants to keep this middleware separated from resultsdb, because it's specific to a particular Fedora workflow. If there are other similar use cases in the future, the same approach would be used if possible (middleware separated from resultsdb).
7. Bodhi will query the middleware for an item, and it will receive back all existing unique (common fields+scenario) results for all testcases. Bodhi won't need any additional logic on its side. It will also be possible to further restrict the query (ask just for a specific testcase prefix, or for a particular common field value), but the filters will only narrow down the complete list of latest results. Specifying additional filters (other than item+type) will not change the set of results that would be returned without those filters present. This is, in fact, mostly a user-facing convenience, since the post-process filtering can easily be implemented on the server side. If you need to query for specific non-mandatory fields, you'll need to ask resultsdb instead of the middleware, but that will be slower and the latest unique results won't be filtered for you automatically; you'll need to do that yourself.
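To illustrate point 6, here is a minimal sketch of the middleware idea. The conventions mapping, the fedmsg body shape and all names below are assumptions for illustration, not a design for the real implementation:

# Hypothetical middleware sketch -- not the real implementation.
COMMON_FIELDS = {
    "koji_build": ("arch",),       # taken from resultsdb_conventions (assumed)
    "compose": ("compose_id",),    # assumed
}

# In the real service this would be its own database; a dict stands in here.
latest = {}  # {(item, type): {(testcase, common field values, scenario): result id}}

def handle_new_result(result):
    """Consume one resultsdb 'new result' fedmsg body (field names assumed)."""
    data = result["data"]
    item, item_type = data["item"], data["type"]
    common = tuple(data.get(field) for field in COMMON_FIELDS.get(item_type, ()))
    key = (result["testcase"], common, data.get("scenario"))
    # A newer result for the same key simply overwrites the stored result ID.
    latest.setdefault((item, item_type), {})[key] = result["id"]

def unique_results(item, item_type, testcase_prefix=""):
    """What Bodhi would ask for: all latest unique result IDs for an item."""
    stored = latest.get((item, item_type), {})
    return {key: rid for key, rid in stored.items()
            if key[0].startswith(testcase_prefix)}

The point being that a Bodhi query becomes a plain lookup in a small, pre-deduplicated store instead of a scan over the whole of ResultsDB.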

What do you think, does that sound reasonable and does it fulfill all the needs we can envision at this moment?

For completeness, I'm adding a solution B, which we considered inferior to the one above:
B1) Resultsdb conventions again specify mandatory common fields for specific types.
B2) There will be no scenario field.
B3) If you need to differentiate your result beyond the common fields, you need to append that information to the testcase name (so e.g. autoqa.base_install.uefi).
B4) The consumer (e.g. Bodhi) needs to be aware of the resultsdb conventions and make very specific, directed queries. For example, it knows that koji_build results have a mandatory common field "arch", which can contain one of 7 different values, and it needs to perform a separate query for each arch it is interested in (see the sketch below).
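For contrast, a sketch of what B4 means on the consumer side (the arch list and the client call are assumptions; resultsdb_client stands in for a real API client):

# Hypothetical sketch of solution B from the consumer's point of view.
KOJI_BUILD_ARCHES = ["x86_64", "i386", "armhfp", "aarch64", "ppc64", "ppc64le", "s390x"]  # assumed

def latest_per_arch(resultsdb_client, item, testcase):
    """One directed query per known arch, as B4 requires."""
    results = {}
    for arch in KOJI_BUILD_ARCHES:
        # Ask only for the newest result matching this exact testcase and arch.
        hits = resultsdb_client.get_results(item=item, type="koji_build",
                                            testcases=testcase, arch=arch, limit=1)
        if hits:
            results[arch] = hits[0]
    return results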

Solution B doesn't require any middleware, but it requires consumers to have deeper knowledge about what they're asking for, and it forces tasks with extra requirements to mix up their testcase names with environment descriptions. That might make it difficult to ask for a specific testcase result (for example, you can't ask for openqa.base_install regardless of environment, because openqa.base_install* might also select openqa.base_install.uefi.prepare and similar "sub-results").
If we're OK with these tradeoffs, however, this approach would again be reasonably fast.


I know this has been a long email, but, feedback needed! :-)

Thanks,
Kamil