On Thu, May 11, 2017 at 10:02 AM, Josef Skladanka <jskladan@redhat.com> wrote:
On Thu, May 11, 2017 at 9:39 AM, Kamil Paral <kparal@redhat.com> wrote:
On Wed, May 10, 2017 at 4:30 PM, Josef Skladanka <jskladan@redhat.com> wrote:

On Wed, May 10, 2017 at 4:23 PM, Kamil Paral <kparal@redhat.com> wrote:
tl;dr; but:
... The simple and fast solution is making the database schema directly tailored to our Fedora use cases, making scenario searches fast....

Not really simple, honestly, especially if by "scenario searches" you mean "retrieving a list of results deduplicated by scenario" - as it is the same as that middleware thing.

I didn't really mean that when I wrote it. I meant it would be fast to ask for all scenarios available for $item, and then you could ask for the latest result for each scenario. But I admit I don't know much about databases, so...

Honestly, this is just a different angle on the same thing - if you want to store item->ANY(scenario) mappings, then you can just as well store item->LATEST(scenario), and do what you wanted to do in the first place. I can see why this can seem different to you, but it really is not. As I said many times - any time you want "all of something", it means either having a very specific table structure, or traversing the whole database.
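To make the dedup concrete, here is a minimal sketch of the item->LATEST(scenario) idea in plain Python. The field names and the sample records are hypothetical, not the actual ResultsDB schema - it just shows what "keep the newest result per scenario" means:

```python
from datetime import datetime

# Hypothetical result records, roughly as a query for one item might return them.
results = [
    {"scenario": "x86_64 Server", "outcome": "PASSED",
     "submitted": datetime(2017, 5, 9)},
    {"scenario": "x86_64 Server", "outcome": "FAILED",
     "submitted": datetime(2017, 5, 10)},
    {"scenario": "armhfp Workstation", "outcome": "PASSED",
     "submitted": datetime(2017, 5, 8)},
]

def latest_per_scenario(results):
    """Deduplicate results, keeping only the newest result for each scenario."""
    latest = {}
    for r in results:
        key = r["scenario"]
        # Keep this record only if we have not seen the scenario yet,
        # or if this record is newer than the one we kept so far.
        if key not in latest or r["submitted"] > latest[key]["submitted"]:
            latest[key] = r
    return latest

deduped = latest_per_scenario(results)
```

Done in the application this is a full traversal of the result set; doing it fast on the database side is exactly the "very specific table structure" (or a maintained item->LATEST mapping) mentioned above.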
 
as long as there is no rigorous specification,

Hahahah. This is not NASA, Josef, this is open source! :o) Even if you received a rigorous specification, the requirements would change in 6 months :)


No, it is not. But this is software engineering - knowing _what_, and even more importantly _why_, you want to do something is an integral part of the development process. We sure can have a steaming pile of quick hacks and workarounds, but if that is the desired/expected outcome, I don't really see the point of having the "design conversations".
 
But I would like to know your thoughts on my last paragraph in my previous email.

Regarding pruning - sure, why not. Even though I don't think it is the right way of tackling this, should we decide to do it, my preferred way would be "move to archive" instead of "delete".
But we are getting back to those NASA-like specifications and policies here, so... You still need to define "what is unique", "what is latest", and "how long do we keep stuff" - just like with any other solution. The only difference here is that instead of adding a layer on top of a data store (resultsdb), you decided to just prune the data in said data store.
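For illustration, "move to archive" instead of "delete" could look something like this rough sketch - an in-memory SQLite stand-in with a made-up two-column table, not the real ResultsDB schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE result (id INTEGER PRIMARY KEY, scenario TEXT, submitted TEXT);
    CREATE TABLE result_archive (id INTEGER, scenario TEXT, submitted TEXT);
    INSERT INTO result VALUES
        (1, 'x86_64 Server', '2017-01-01'),
        (2, 'x86_64 Server', '2017-05-10');
""")

def archive_older_than(conn, cutoff):
    """Move results older than `cutoff` out of the hot table.

    Copy old rows into the archive table, then remove them from the
    hot table - "move to archive" rather than plain "delete", so the
    history is still queryable, just not on the fast path.
    """
    with conn:  # one transaction: either both statements apply, or neither
        conn.execute(
            "INSERT INTO result_archive SELECT * FROM result WHERE submitted < ?",
            (cutoff,))
        conn.execute(
            "DELETE FROM result WHERE submitted < ?",
            (cutoff,))

archive_older_than(conn, "2017-05-01")
```

The retention policy question ("how long do we keep stuff") is then just the choice of `cutoff`, independent of any dedup logic.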

I'm not trying to get personal, or petty, but in almost three months, we were not able to agree even on what the "scenario" should be. I went from "this is nonsense" to "OK, if used systematically, this can be helpful"; Adam is on the path from "Why do you guys need to make everything so overcomplicated, this is but a string!" to "Well, some of what you said actually makes sense, now that I think about it a bit more"; but that's about it. We do "something" now, but we don't do it in a known, defined way, so while this works for some cases, it does not for others. And more than that - we are not even really sure where it does and does not work.

Yes, this was slightly hyperbolic, but I hope you get the point. And I might be a data freak, but for some reason, I don't think it is a good idea to solve "we have too much data" by deleting it haphazardly.

J.

Bah, do you also hate it, when you hit ctrl+enter instead of shift+enter, and the email gets sent? ... Following up with that last sentence:

I sure agree that having thousands of Depcheck results is nonsense. And in this special case, I really agree that pruning the data is a good idea (provided we have a very well defined way of deciding what is "just a duplicate", that is). I also agree that we don't need to have the whole history of all the results in one place with fast read access. But it still makes sense (to me at least) to have an archiving policy, rather than a deletion policy, as the first step. And to be honest, I don't think that the policy for data retention has anything to do with result de-duplication. Sure, it makes the "stupid" undefined approach easier - we could keep saying "just download it all, and decide for yourself" a bit longer, and effectively stall the "NASA-level" decision making. And once again - I'm not saying that it is necessarily a wrong step to take now. But we must be very conscious of what we are doing, and why. Because having less data in the database does not really help with deciding "what is the latest result in this use case".

So, to sum it up a bit - I am the last person to go against making our stuff a bit (a lot) more tailored to the use cases we have. But I need to have those use cases defined, and ideally policies in place to support them, before I feel it is a good idea to do it.

J.