On Thu, May 11, 2017 at 9:39 AM, Kamil Paral <kparal(a)redhat.com> wrote:
> On Wed, May 10, 2017 at 4:30 PM, Josef Skladanka <jskladan(a)redhat.com>
> wrote:
>
>> On Wed, May 10, 2017 at 4:23 PM, Kamil Paral <kparal(a)redhat.com> wrote:
>> tl;dr; but:
>
>>> ... The simple and fast solution is making the database schema directly
>>> tailored to our Fedora use cases, making scenario searches fast....
>>
>> Not really simple, honestly, especially if by "scenario searches" you
>> mean "retrieving a list of results deduplicated by scenario" - as it is
>> the same as that middleware thing.
>
> I didn't really mean that when I wrote it. I meant it would be fast to
> ask for all scenarios available for $item, and then you could ask for the
> latest result for each scenario. But I admit I don't know much about
> databases, so...
Honestly, this is just a different angle on the same thing - if you want
to store item->ANY(scenario) mappings, then you can just as well store the
item->LATEST(scenario) mapping, and do what you wanted to do in the first
place. I can see why this can seem different to you, but it really is not.
As I said many times - each time you want to have "all of something", it
means either having a very specific table structure, or traversing the
whole database.
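To make the point concrete, here is a minimal sketch of the "latest result
per scenario for $item" query. The table and column names are hypothetical
(nothing like the real resultsdb schema), and it assumes a higher row id
means a newer result - without a dedicated item->LATEST(scenario) table,
this is a grouped scan over every row stored for the item:

```python
import sqlite3

# Hypothetical, simplified schema -- NOT the real resultsdb tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE results (
        id INTEGER PRIMARY KEY,
        item TEXT,
        scenario TEXT,
        outcome TEXT
    );
    INSERT INTO results (item, scenario, outcome) VALUES
        ('bash-4.4-1', 'x86_64/updates', 'FAILED'),
        ('bash-4.4-1', 'x86_64/updates', 'PASSED'),
        ('bash-4.4-1', 'i386/updates',   'PASSED');
""")

# "Latest result for each scenario of $item": group all of the item's
# rows by scenario and keep only the newest (highest id) per group.
rows = db.execute("""
    SELECT scenario, outcome
    FROM results
    WHERE id IN (
        SELECT MAX(id) FROM results
        WHERE item = ?
        GROUP BY scenario
    )
    ORDER BY scenario
""", ("bash-4.4-1",)).fetchall()

print(rows)
```

The point of the sketch: the query only stays cheap if the schema is
shaped for it (or if a separate "latest" table is maintained on write);
otherwise it scans and groups everything stored for the item.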
>
>> as long as there is no rigorous specification,
>
> Hahahah. This is not NASA, Josef, this is open source! :o) Even if you
> received a rigorous specification, the requirements would change in 6
> months :)
No, it is not. But this is software engineering - knowing _what_, and even
more importantly _why_, you want to do something is an integral part of the
development process. We sure can have a steaming pile of quick hacks and
workarounds, but if that is the desired/expected outcome, I don't really
see a point in having the "design conversations".
> But I would like to know your thoughts on my last paragraph in my
> previous email.
Ad pruning - sure, why not. Even though I don't think that it is the right
way of tackling this, should we decide to do it, my preferred way would be
"move to archive" instead of "delete".
But we are getting back to those NASA-like specifications and policies
here, so... You still need to define "what is unique", "what is latest",
"how long do we keep stuff" - just like with any other solution. The only
difference here is that instead of adding a layer on top of a datastore
(resultsdb), you decided to just prune the data in said data store.
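For illustration, "move to archive" instead of "delete" could look roughly
like the following sketch. The table names, the `submitted` column, and the
cutoff are all made up for the example; the only point is that archiving
and pruning happen in one transaction, so rows are never lost in between:

```python
import sqlite3

# Hypothetical sketch: results older than a cutoff are copied into an
# archive table before being removed from the fast-read main table.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE results (
        id INTEGER PRIMARY KEY, item TEXT, submitted TEXT);
    CREATE TABLE results_archive (
        id INTEGER PRIMARY KEY, item TEXT, submitted TEXT);
    INSERT INTO results (item, submitted) VALUES
        ('bash-4.4-1', '2016-11-01'),
        ('bash-4.4-1', '2017-05-01');
""")

CUTOFF = "2017-01-01"
with db:  # one transaction: archive and prune together, or not at all
    db.execute("""
        INSERT INTO results_archive
        SELECT * FROM results WHERE submitted < ?
    """, (CUTOFF,))
    db.execute("DELETE FROM results WHERE submitted < ?", (CUTOFF,))

live = db.execute("SELECT COUNT(*) FROM results").fetchone()[0]
archived = db.execute("SELECT COUNT(*) FROM results_archive").fetchone()[0]
print(live, archived)
```

Note that this still forces the retention policy questions from above to
be answered explicitly: the cutoff and the "what counts as prunable"
predicate have to be defined before any row can be moved.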
I'm not trying to get personal, or petty, but in almost three months, we
were not able to agree even on what the "scenario" should be. I came from
"this is nonsense" to "OK, if used systematically, this can be helpful",
Adam is on the path of switching from "Why do you guys need to make
everything so overcomplicated, this is but a string!" to "Well, some of
what you said actually makes sense, now that I think about it a bit more",
but that's about it. We do "something" now, but we don't do it in a known,
defined way, so while this works for some cases, it does not for others.
And more than that - we are not even really sure where it does and does
not work.
Yes, this was slightly hyperbolic, but I hope you get the point. And I
might be a data-freak, but for some reason, I don't think it is a good idea
to solve "we have too much data" by deleting it haphazardly.
J.
Bah, do you also hate it when you hit ctrl+enter instead of shift+enter,
and the email gets sent? ... Following up with that last sentence:
I sure agree that having thousands of Depcheck results is nonsense. And in
this special case, I really agree that pruning the data is a good idea (if
we have a very well defined way of devising the "this is just a duplicate"
thing, that is). I also agree that we don't need to have the whole history
of all the results in one place with fast read access. But it still makes
sense (to me at least) to have an archiving policy, rather than a deletion
policy, as the first step. And to be honest, I don't think that the policy
for data retention has anything to do with result de-duplication. Sure, it
makes the "stupid" undefined approach easier - we could go on saying "just
download it all, and decide for yourself" a bit longer, and effectively
stall the "NASA-level" decision making. And once again - I'm not saying
that it is necessarily a wrong step to take now. But we must be very
conscious about what we are doing, and why. Because having less data in
the database does not really help with deciding "what is the latest result
in this usecase".
So, to sum it up a bit - I am the last person to go against making our
stuff a bit (a lot) more tailored to the usecases we have. But I need to
have those usecases defined, and ideally have policies in place to support
those usecases, before I feel it is a good idea to do it.
J.