Hi folks!
We've had openQA testing of updates for stable and branched releases, and gating based on those tests, enabled for a while now. I believe this is going quite well, and I think we addressed the issues reported when we first enabled gating - Bodhi's gating status updates work more smoothly now, and openQA respects Bodhi's "re-run tests" button so failed tests can be re-triggered.
A few weeks ago, I enabled testing of Rawhide updates in the openQA lab/stg instance. This was to see how smoothly the tests run, how often we run into unexpected failures or problems, and whether the hardware resources we have are sufficient for the extra load.
So far this has been going more smoothly than I anticipated, if anything. The workers seem to keep up with the test load, even though one out of three worker systems for the stg instance is currently out of commission (we're using it to investigate a bug). We do get occasional failures which seem to be related to Rawhide kernel slowness (e.g. operations timing out that usually don't otherwise time out), but on the whole, the level of false failures is (I would say) acceptably low, enough that my current regime of checking the test results daily and restarting failed ones that don't seem to indicate a real bug should be sufficient.
So, I'd like to propose that we enable Rawhide update testing on the production openQA instance also. This would cause results to appear on the Automated Tests tab in Bodhi, but they would be only informational (and unless the update was gated by a CI test, or somehow otherwise configured not to be pushed automatically, updates would continue to be pushed 'stable' almost immediately on creation, regardless of the openQA results).
More significantly, I'd also propose that we turn on gating on openQA results for Rawhide updates. This would mean Rawhide updates would be held from going 'stable' (and included in the next compose) until the gating openQA tests had run and passed. We may want to do this a bit after turning on the tests; perhaps Fedora 37 branch point would be a natural time to do it.
Currently this would usually mean a wait from update submission to 'stable push' (which really means that the build goes into the buildroot, and will go into the next Rawhide compose when it happens) of somewhere between 45 minutes and a couple of hours. It would also mean that if Rawhide updates for inter-dependent packages are not correctly grouped, the dependent update(s) will fail testing and be gated until the update they depend on has passed testing and been pushed. The tests for the dependent update(s) would then need to be re-run, either by someone hitting the button in Bodhi or an openQA admin noticing and restarting them, before the dependent update(s) could be pushed.
In the worst case, if updated packages A and B both need the other to work correctly but the updates are submitted separately, both updates may fail tests and be blocked. This could only be resolved by waiving the failures, or replacing the separate updates with an update containing both packages.
All of those considerations are already true for stable and branched releases, but people are probably more used to grouping updates for stable and branched than doing it for Rawhide, and the typical flow of going from a build to an update provides more opportunity to create grouped updates for branched/stable. For Rawhide, the easiest way to group updates when you need to is to do the builds in a side tag and use Bodhi's ability to create an update from that side tag.
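The grouping behaviour described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how Bodhi or openQA is implemented: an update is treated as passing if every new build its packages need is either in the same update or already pushed, and the names `simulate_gating`, `updates` and `requires` are made up for the illustration.

```python
def simulate_gating(updates, requires):
    """Toy model of gated updates (illustrative only).

    updates:  dict mapping a hypothetical update name to the set of packages in it
    requires: dict mapping a package to the set of new builds it needs to work
    Returns (rounds, blocked): the updates pushed in each test round, and the
    updates that can never pass without a waiver or regrouping.
    """
    stable = set()            # packages already pushed
    pending = dict(updates)   # updates still waiting on tests
    rounds = []
    while pending:
        # An update passes if everything its packages need is either
        # already pushed or shipped in the same update.
        passed = [
            name for name, pkgs in pending.items()
            if all(requires.get(p, set()) <= (stable | pkgs) for p in pkgs)
        ]
        if not passed:
            break  # nothing can make progress: remaining updates stay gated
        for name in passed:
            stable |= pending.pop(name)
        rounds.append(sorted(passed))
    return rounds, sorted(pending)


separate = {"U-A": {"A"}, "U-B": {"B"}}
print(simulate_gating(separate, {"A": {"B"}}))              # A needs B
print(simulate_gating(separate, {"A": {"B"}, "B": {"A"}}))  # A and B need each other
print(simulate_gating({"U-AB": {"A", "B"}}, {"A": {"B"}, "B": {"A"}}))  # grouped
```

In this model the one-way dependency resolves after a re-run in a second round, the mutual dependency submitted as two updates never resolves, and the grouped update passes in the first round.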
As with branched/stable, only critical path updates would have the tests run and be gated on the results. Non-critpath updates would be unaffected. (There's a small allowlist of non-critpath packages for which the tests are also run, but they are not currently gated on the results).
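The tested-versus-gated distinction above can be written down as a small rule. This is a sketch only: the function name, the package sets, and the assumption that one critical path package is enough to make the whole update gated are all illustrative, not taken from the Bodhi or openQA configuration.

```python
def openqa_policy(update_packages, critpath, allowlist):
    """Return (tested, gated) for an update, per the rule described above.

    critpath and allowlist are hypothetical sets of package names.
    Assumption: a single critical path package makes the update gated.
    """
    pkgs = set(update_packages)
    gated = bool(pkgs & critpath)                # critpath: tests run and gate
    tested = gated or bool(pkgs & allowlist)     # allowlist: tests run, no gating
    return tested, gated


# Illustrative package names only.
critpath = {"critical-pkg"}
allowlist = {"allowlisted-pkg"}
print(openqa_policy(["critical-pkg"], critpath, allowlist))
print(openqa_policy(["allowlisted-pkg"], critpath, allowlist))
print(openqa_policy(["other-pkg"], critpath, allowlist))
```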
I think doing this could really help us keep Rawhide solid and avoid introducing major compose-breaking bugs, at minimal cost. But it's a significant change and I wanted to see what folks think. In particular, if you find the existing gating of updates for stable/branched releases to cause problems in any way, I'd love to hear about it.
Thanks folks!
On Thu, 2022-06-09 at 12:48 -0700, Adam Williamson wrote:
> Currently this would usually mean a wait from update submission to 'stable push' (which really means that the build goes into the buildroot, and will go into the next Rawhide compose when it happens) of somewhere between 45 minutes and a couple of hours.
Hi, it is very convenient to run `fedpkg chain-build ....` in the right order, in a fire-and-forget way, then check the whole chain-build task in Koji after some time, and that's all. If it's going to take a couple of hours, the Koji build will surely time out.
I do the chain-build for stable branches too, and just step in and set overrides for the packages when needed. It might be an outdated workflow, but it has worked for me for years.
I understand the side tags are here to help. That means filing the update manually for Rawhide, because the automatic updates won't make it, right? It sounds like more work and time for the packagers.
I see some of my packages are failing automated tests, specifically the rpminspect 'runpath' test.
Bye, Milan
On Fri, 2022-06-10 at 08:56 +0200, Milan Crha wrote:
> On Thu, 2022-06-09 at 12:48 -0700, Adam Williamson wrote:
> > Currently this would usually mean a wait from update submission to 'stable push' (which really means that the build goes into the buildroot, and will go into the next Rawhide compose when it happens) of somewhere between 45 minutes and a couple of hours.
> Hi, it is very convenient to run `fedpkg chain-build ....` in the right order, in a fire-and-forget way, then check the whole chain-build task in Koji after some time, and that's all. If it's going to take a couple of hours, the Koji build will surely time out.
Possibly. I'm not sure what the timeout on chain builds in Koji is. But yes, that's a scenario worth looking at.
> I do the chain-build for stable branches too, and just step in and set overrides for the packages when needed. It might be an outdated workflow, but it has worked for me for years.
> I understand the side tags are here to help. That means filing the update manually for Rawhide, because the automatic updates won't make it, right? It sounds like more work and time for the packagers.
You have to trigger the update creation manually, yeah. I don't see an easy way around that because there's no way the system can know when you're 'done' doing builds. But it's something we could try and improve at least, maybe make it easier to trigger the update creation too. What if there were a command like chain-build that uses a side tag? Puts each build into a side tag as it goes, then when all the builds are done, automatically creates the update from the side tag?
> I see some of my packages are failing automated tests, specifically the rpminspect 'runpath' test.
That's a Fedora CI test, not an openQA one. This proposal would not involve gating on that test, so it shouldn't change anything. It's always worth taking a look at rpminspect failures/warnings and figuring out what it's trying to tell you, though; sometimes it's worthwhile.
On Fri, 2022-06-10 at 08:33 -0700, Adam Williamson wrote:
> Possibly. I'm not sure what the timeout on chain builds in Koji is. But yes, that's a scenario worth looking at.
Hi, if I recall correctly, it's around 2 hours (by default).
> What if there were a command like chain-build that uses a side tag? Puts each build into a side tag as it goes, then when all the builds are done, automatically creates the update from the side tag?
As Kevin mentioned in the other mail in this thread, one can --target the chain build too (and I use it for things like "f36-gnome"). Having this all done in a single command would be preferred, so there are fewer things to think about from the packager's point of view.
I see a little similarity with scratch-build's --srpm argument. One can pass the srpm file name to it, but if one doesn't, fedpkg builds the .srpm in the current folder and uses it as the srpm for the scratch build. That's quite convenient and helps to avoid mistakes.
If it's going to happen, then I agree with adding a new argument to chain-build, to be able to cover both scenarios: 1) as it is now, just build packages in a certain order and do not do anything else; 2) the new one, create a side tag, build packages in it, and file the update automatically (when it's a Rawhide build) once all the packages are built. As things can fail, the side tag should be reusable, so one can continue the build from the follow-up build(s), but that feels natural.
This all looks like a concatenation of several fedpkg commands, as you mentioned.
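That concatenation might look roughly like the dry-run plan below. It is a sketch under assumptions, not an existing fedpkg feature: the helper name `side_tag_chain_plan` is invented, the side tag is assumed to exist already (for example, one obtained earlier with `fedpkg request-side-tag`), and the exact options should be checked against the installed fedpkg and bodhi clients before running anything.

```python
import shlex


def side_tag_chain_plan(side_tag, earlier_packages, notes):
    """Build (but do not run) the commands for a side-tag chain build.

    side_tag:         name of an existing side tag (assumed already requested)
    earlier_packages: packages to build before the one in the current directory
    notes:            text for the update notes
    """
    return [
        # Build the chain into the side tag instead of the default target.
        ["fedpkg", "chain-build", "--target", side_tag, *earlier_packages],
        # Once every build has succeeded, create one update from the side tag.
        ["bodhi", "updates", "new", "--notes", notes, "--from-tag", side_tag],
    ]


# Placeholder tag and package names, printed rather than executed.
for argv in side_tag_chain_plan("fNN-build-side-XXXX", ["libfoo", "libbar"], "Update the foo stack"):
    print(shlex.join(argv))
```

Keeping the side tag as an input rather than creating it inside the helper matches the point above that the tag should be reusable when a build in the middle of the chain fails.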
> That's a Fedora CI test, not an openQA one.
Aha. I thought those Automated Tests attached to the update were it. Where does one see the openQA test results, please?
Bye, Milan
On Mon, 2022-06-13 at 08:32 +0200, Milan Crha wrote:
> On Fri, 2022-06-10 at 08:33 -0700, Adam Williamson wrote:
> > Possibly. I'm not sure what the timeout on chain builds in Koji is. But yes, that's a scenario worth looking at.
> Hi, if I recall correctly, it's around 2 hours (by default).
> > What if there were a command like chain-build that uses a side tag? Puts each build into a side tag as it goes, then when all the builds are done, automatically creates the update from the side tag?
> As Kevin mentioned in the other mail in this thread, one can --target the chain build too (and I use it for things like "f36-gnome"). Having this all done in a single command would be preferred, so there are fewer things to think about from the packager's point of view.
> I see a little similarity with scratch-build's --srpm argument. One can pass the srpm file name to it, but if one doesn't, fedpkg builds the .srpm in the current folder and uses it as the srpm for the scratch build. That's quite convenient and helps to avoid mistakes.
> If it's going to happen, then I agree with adding a new argument to chain-build, to be able to cover both scenarios: 1) as it is now, just build packages in a certain order and do not do anything else; 2) the new one, create a side tag, build packages in it, and file the update automatically (when it's a Rawhide build) once all the packages are built. As things can fail, the side tag should be reusable, so one can continue the build from the follow-up build(s), but that feels natural.
> This all looks like a concatenation of several fedpkg commands, as you mentioned.
> > That's a Fedora CI test, not an openQA one.
> Aha. I thought those Automated Tests attached to the update were it. Where does one see the openQA test results, please?
The Automated Tests tab shows all test results from *both* Fedora CI and openQA that appear relevant to the update (i.e. are "for" the update itself or for any NVR it includes).
There are a few ways you can tell which results are from which system. Results from Fedora CI have names that start with "fedora-ci.". Results from openQA have names that start with "update.". Currently, results from openQA are always considered to be tests of "the update", so they show up under the update title (e.g. FEDORA-2022-6d2c62d6d6). Results from Fedora CI are always considered to be tests of "a specific package", so they show up under the name of a package in the update (e.g. systemd-249.12-4.fc35).
If you want to look at openQA update test results directly in openQA's own interface, you can go to https://openqa.fedoraproject.org/group_overview/2 (or go to the front page and click "Fedora Updates"), where you'll find all x86_64 update test results. You can see up to the last 400 by clicking the little number links under the initial list of 10.
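The naming rule described above is simple enough to express as a small helper. This is a minimal sketch: the function name and the example result names are made up for illustration, and only the two prefixes come from the explanation above.

```python
def result_source(testcase_name: str) -> str:
    """Guess which system produced a result, based only on its name prefix."""
    if testcase_name.startswith("fedora-ci."):
        return "Fedora CI"   # shown under a package NVR in the update
    if testcase_name.startswith("update."):
        return "openQA"      # shown under the update title
    return "unknown"


# Hypothetical result names, used only to exercise the prefixes.
for name in ["fedora-ci.example-test", "update.example_test", "something-else"]:
    print(name, "->", result_source(name))
```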
On Fri, Jun 10, 2022 at 08:56:25AM +0200, Milan Crha wrote:
> On Thu, 2022-06-09 at 12:48 -0700, Adam Williamson wrote:
> > Currently this would usually mean a wait from update submission to 'stable push' (which really means that the build goes into the buildroot, and will go into the next Rawhide compose when it happens) of somewhere between 45 minutes and a couple of hours.
> Hi, it is very convenient to run `fedpkg chain-build ....` in the right order, in a fire-and-forget way, then check the whole chain-build task in Koji after some time, and that's all. If it's going to take a couple of hours, the Koji build will surely time out.
> I do the chain-build for stable branches too, and just step in and set overrides for the packages when needed. It might be an outdated workflow, but it has worked for me for years.
You can use chain-build with side tags just fine too.
Just pass a --target fNN-build-side-XXXX
No need for overrides.
> I understand the side tags are here to help. That means filing the update manually for Rawhide, because the automatic updates won't make it, right? It sounds like more work and time for the packagers.
Perhaps we could enhance fedpkg chain-build to submit the update at the end if all the builds are successful? Then you could just start the chain-build and, if it all worked, be done.
kevin
On Thu, 2022-06-09 at 12:48 -0700, Adam Williamson wrote:
> Hi folks!
> ...
> I think doing this could really help us keep Rawhide solid and avoid introducing major compose-breaking bugs, at minimal cost. But it's a significant change and I wanted to see what folks think. In particular, if you find the existing gating of updates for stable/branched releases to cause problems in any way, I'd love to hear about it.
> Thanks folks!
One thing I forgot to mention in the original email, the benefit here isn't theoretical - I've already caught several Rawhide-breaking bugs early, or been able to identify the cause more easily, because we have the tests running in staging. Here's an example I just caught: a new popt version that was sent out today seems to break authselect, which is a critical problem and breaks all new installs:
https://bugzilla.redhat.com/show_bug.cgi?id=2100287
If nirik catches my message in time before the next compose runs, he'll be able to untag the new build and the compose won't be completely broken. If we had this testing deployed in prod and gating turned on, the update would be blocked automatically.
On Wed, Jun 22, 2022 at 06:18:08PM -0700, Adam Williamson wrote:
> On Thu, 2022-06-09 at 12:48 -0700, Adam Williamson wrote:
> > Hi folks!
> > ...
> > I think doing this could really help us keep Rawhide solid and avoid introducing major compose-breaking bugs, at minimal cost. But it's a significant change and I wanted to see what folks think. In particular, if you find the existing gating of updates for stable/branched releases to cause problems in any way, I'd love to hear about it.
> > Thanks folks!
> One thing I forgot to mention in the original email, the benefit here isn't theoretical - I've already caught several Rawhide-breaking bugs early, or been able to identify the cause more easily, because we have the tests running in staging. Here's an example I just caught: a new popt version that was sent out today seems to break authselect, which is a critical problem and breaks all new installs:
> https://bugzilla.redhat.com/show_bug.cgi?id=2100287
> If nirik catches my message in time before the next compose runs, he'll be able to untag the new build and the compose won't be completely broken. If we had this testing deployed in prod and gating turned on, the update would be blocked automatically.
It's been untagged from rawhide and eln.
kevin
On Thu, 2022-06-09 at 12:48 -0700, Adam Williamson wrote:
> Hi folks!
> More significantly, I'd also propose that we turn on gating on openQA results for Rawhide updates. This would mean Rawhide updates would be held from going 'stable' (and included in the next compose) until the gating openQA tests had run and passed. We may want to do this a bit after turning on the tests; perhaps Fedora 37 branch point would be a natural time to do it.
Hi again folks! A quick update here. Now that Rawhide update testing has been running in production for over a year - and Kevin and I have been "shadow gating" Rawhide for several months, untagging updates where openQA tests indicate genuine bugs - I think it's time to go ahead and enable gating for Rawhide updates. I've worked to make sure the tests are reliable and failures are promptly investigated, and that Bodhi provides accurate information on test and gating status. I've proposed this as a FESCo ticket just to get some visibility and sign-off on the idea:
https://pagure.io/fesco/issue/3011
thanks everyone!
desktop@lists.fedoraproject.org