On Fri, 2017-03-24 at 00:05 +0100, Josef Skladanka wrote:
On Wed, Mar 22, 2017 at 7:55 PM, Adam Williamson
<adamwill(a)fedoraproject.org> wrote:
> Is asking for 'all results for item == X' gonna be too much work for
> rdb to handle? Is that the problem?
>
Sure, it could be done that way; the only slight difference is throughput.
First up, be sure to have a look at the attached graph_data_count.png
representing the number of results per item (Y-axis). tl;dr:
- DEV contains data since 2015-03-18, and has 226428 distinct items
- PROD contains data since 2017-01-10, and has 30913 distinct items
- DEV
- median = 16
- max = 16817
 - outliers begin at about 140 results per item (roughly 7% of items)
- PROD
- median = 17
- max = 2867
 - outliers begin at about 65 results per item (roughly 13% of items)
The bulk of the outliers is composed of depcheck results, taking
gnote-3.17.0-3.fc23 as an example:
testcase_name | count
------------------+-------
dist.rpmlint | 1
dist.depcheck | 16287
dist.upgradepath | 21
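For illustration, a tally like the one above is just a group-and-count over the result rows. A minimal sketch, using made-up record shapes rather than the real ResultsDB schema:

```python
from collections import Counter

# Hypothetical result records, loosely shaped like ResultsDB rows for
# gnote-3.17.0-3.fc23; the field names here are illustrative only.
results = (
    [{"item": "gnote-3.17.0-3.fc23", "testcase": "dist.rpmlint"}]
    + [{"item": "gnote-3.17.0-3.fc23", "testcase": "dist.depcheck"}] * 16287
    + [{"item": "gnote-3.17.0-3.fc23", "testcase": "dist.upgradepath"}] * 21
)

# Tally results per testcase, mirroring the table above.
counts = Counter(r["testcase"] for r in results)
for testcase, count in counts.most_common():
    print(testcase, count)
```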
So, wait. Let me be totally sure I'm on the same page here. You mean
that dist.depcheck has been run on that single package *sixteen
thousand times*? So part of the problem is, if you just ask for 'all
test results for package X', occasionally you might get tens of
thousands of results?
If so, then that certainly helps me understand the issue. I simply
hadn't expected anything like that to be the case. Obviously openQA is
my usual point of reference, and if you ask for 'all openQA results for
compose X', you're going to get, at max, a few hundred results.
(Usually, in prod, a couple hundred). From what I knew of the package
tests, I was expecting something in the same range, or fewer, for
those. So I just wasn't expecting that we needed to worry a lot about
optimizing beyond 'get all the results for the item you want results
for, then filter from there'.
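The 'fetch everything for the item, then filter' approach I had in mind can be sketched roughly as below. Note that `fetch_results_for_item` is a hypothetical stand-in for a ResultsDB API call, returning canned data here, and the record fields are assumptions:

```python
def fetch_results_for_item(item):
    # Hypothetical stand-in for querying ResultsDB for all results
    # where item == <item>; returns canned sample data.
    return [
        {"item": item, "testcase": "dist.rpmlint", "outcome": "PASSED"},
        {"item": item, "testcase": "dist.depcheck", "outcome": "FAILED"},
        {"item": item, "testcase": "dist.depcheck", "outcome": "PASSED"},
    ]

def results_for(item, wanted_testcases):
    # Client-side filtering: grab all results for the item, then keep
    # only the testcases the caller cares about.
    return [r for r in fetch_results_for_item(item)
            if r["testcase"] in wanted_testcases]

print(results_for("gnote-3.17.0-3.fc23", {"dist.depcheck"}))
```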
If I'm understanding correctly, that certainly helps me to understand
why there might be a problem, and I'll go back and read kparal's mail
again in that light.
Still, if we're mostly ignoring the outliers for now and thinking about
the 'normal' cases, I think I'd contend that getting 140 results at a
time should be a pretty 'normal' use case for something like resultsdb.
I'm not sure I'm a fan of doing something more complex than the
'scenario' key if the goal is purely to try and reduce the number of
results that actually need to be returned by the server in the first
place from, say, 200 to 50. That's just MHO, though. :)
= Conclusion =
== Regarding ResultsDB ==
I will profile the serialization code itself, and investigate a way to
force fewer SQL queries, so we can compare the results. My gut feeling is
that we could lower the "best" time by focusing on these aspects.
The actual problem that needs solving, though, is moving the resultsdb
production DB to a separate machine to get rid of the IO-bound problems.
These cause (or will cause) the actual slowdown that can be observed when
querying resultsdb in the semi-random way Bodhi does/will be doing.
== Regarding the Bodhi endpoint/middleware optimization ==
I'm not sure it is worth doing _right now_, but given the nature of what it
would do, and the way it would be implemented, there would be a measurable
speed-up in the DB part, as the database would be optimized for this one
specific use case. It would also lower the Bodhi load, since the
de-duplication would be taken care of there, instead of in the consumer.
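The de-duplication mentioned here amounts to keeping only the most recent result per (item, testcase) pair. A minimal sketch, where the field names and the `submit_time` ordering key are my assumptions, not the real schema:

```python
def dedupe_latest(results):
    # Keep only the newest result for each (item, testcase) pair,
    # using submit_time (assumed field) as the ordering key.
    latest = {}
    for r in results:
        key = (r["item"], r["testcase"])
        if key not in latest or r["submit_time"] > latest[key]["submit_time"]:
            latest[key] = r
    return list(latest.values())

sample = [
    {"item": "gnote", "testcase": "dist.depcheck", "submit_time": 1, "outcome": "FAILED"},
    {"item": "gnote", "testcase": "dist.depcheck", "submit_time": 2, "outcome": "PASSED"},
]
print(dedupe_latest(sample))
```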
I still believe that having a designated DB server for resultsdb prod and
exploring the loading process will have a bigger impact, but I cannot take
care of the first part on my own.
I *think* kparal's approach was an attempt to be more generic than just
tailored to Bodhi's use case, but I'll have to go back and read it
again.
Did this help to answer your questions?
Yes, thanks very much!
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net