Excuse my ignorance, but I'm just wondering what sense the regular rawhide "compose check reports" make. Who processes this information?
E.g. in today's report "Fedora-Rawhide-20220224.n.1 compose check report" on devel@lists.fedoraproject.org I find under New OpenQA failures
URL: https://openqa.fedoraproject.org/tests/1146933 ID: 1146980 Test: aarch64 Server-dvd-iso base_reboot_unmount@uefi
and about 17 other failure entries.
Who cares about that? Or is it not necessary to take care of it?
Hi Peter,
Peter Boy pboy@uni-bremen.de writes:
> Excuse my ignorance, but I'm just wondering what sense the regular rawhide "compose check reports" make. Who processes this information?
> E.g. in today's report "Fedora-Rawhide-20220224.n.1 compose check report" on devel@lists.fedoraproject.org I find under New OpenQA failures
> URL: https://openqa.fedoraproject.org/tests/1146933 ID: 1146980 Test: aarch64 Server-dvd-iso base_reboot_unmount@uefi
> and about 17 other failure entries.
> Who cares about that? Or is it not necessary to take care of it?
I am pretty certain that Adam and probably other members of Fedora's QA team review these reports and decide whether the compose is good to go or not.
Cheers,
Dan
On Fri, 2022-02-25 at 22:35 +0100, Dan Čermák wrote:
> Hi Peter,
> Peter Boy pboy@uni-bremen.de writes:
> > Excuse my ignorance, but I'm just wondering what sense the regular rawhide "compose check reports" make. Who processes this information?
> > E.g. in today's report "Fedora-Rawhide-20220224.n.1 compose check report" on devel@lists.fedoraproject.org I find under New OpenQA failures
> > URL: https://openqa.fedoraproject.org/tests/1146933 ID: 1146980 Test: aarch64 Server-dvd-iso base_reboot_unmount@uefi
> > and about 17 other failure entries.
> > Who cares about that? Or is it not necessary to take care of it?
> I am pretty certain that Adam and probably other members of Fedora's QA team review these reports and decide whether the compose is good to go or not.
It's meant to give anyone who's interested a quick overview of the day's status, and anyone can review the results. The failed tests are linked directly. If you find the volume a bit much, the topic is easy to filter to a separate folder - this is what I do in fact (I filter these reports and the "compose report" mails generated by releng to a dedicated Reports folder).
When I have time I send a manual reply explaining the failures, but 36 and Rawhide have both been pretty chaotic lately so I haven't had time. Very broadly, most tests are actually passing now, a few are affected by known bugs. Most of interest to Server is that the aarch64 disk images are affected by https://bugzilla.redhat.com/show_bug.cgi?id=2057600 (initial-setup does not run on first boot as it should).
On 25.02.2022 at 23:19, Adam Williamson adamwill@fedoraproject.org wrote:
> On Fri, 2022-02-25 at 22:35 +0100, Dan Čermák wrote:
> > Hi Peter,
> > Peter Boy pboy@uni-bremen.de writes:
> > > Excuse my ignorance, but I'm just wondering what sense the regular rawhide "compose check reports" make. Who processes this information?
> > I am pretty certain that Adam and probably other members of Fedora's QA team review these reports and decide whether the compose is good to go or not.
Hi Dan, I hope my wording did not have the wrong connotation. I didn't mean to doubt the usefulness, but to clarify whether the Server WG is expected to watch the reports and take care of failures, or whether they are purely informational and others, like the QA team, take care of them primarily (contacting the Server WG if necessary or helpful).
> It's meant to give anyone who's interested a quick overview of the day's status, and anyone can review the results. The failed tests are linked directly. If you find the volume a bit much, the topic is easy to filter to a separate folder - this is what I do in fact (I filter these reports and the "compose report" mails generated by releng to a dedicated Reports folder).
> When I have time I send a manual reply explaining the failures, but 36 and Rawhide have both been pretty chaotic lately so I haven't had time. Very broadly, most tests are actually passing now, a few are affected by known bugs. Most of interest to Server is that the aarch64 disk images are affected by https://bugzilla.redhat.com/show_bug.cgi?id=2057600 (initial-setup does not run on first boot as it should).
Hi Adam, thanks for the information. As you may have noticed, I've initiated a discussion about how to contribute to improving Server release quality and avoiding some of the unpleasant experiences we had with the last releases. Those types of issues may not be detectable with automated tests. Unfortunately, I lack an overview of the processes; I never had any reason to get familiar with them until now (having unexpectedly become engaged in the Server WG).
Peter
Hi Peter,
Peter Boy pboy@uni-bremen.de writes:
> On 25.02.2022 at 23:19, Adam Williamson adamwill@fedoraproject.org wrote:
> > On Fri, 2022-02-25 at 22:35 +0100, Dan Čermák wrote:
> > > Hi Peter,
> > > Peter Boy pboy@uni-bremen.de writes:
> > > > Excuse my ignorance, but I'm just wondering what sense the regular rawhide "compose check reports" make. Who processes this information?
> > > I am pretty certain that Adam and probably other members of Fedora's QA team review these reports and decide whether the compose is good to go or not.
> Hi Dan, I hope my wording did not have the wrong connotation. I didn't mean to doubt the usefulness, but to clarify whether the Server WG is expected to watch the reports and take care of failures, or whether they are purely informational and others, like the QA team, take care of them primarily (contacting the Server WG if necessary or helpful).
Not at all! The openQA tests are rather specific and hard to analyze and triage for anyone who isn't really familiar with the system. Requiring others to take care of test failures in openQA would be a pointless waste of everyone's time. AFAIK the process is roughly this: if tests fail, the QA team takes a look at them. If it turns out to be an actual bug and not just a fluke, the QA team files a bug against the corresponding component with a reproducer, so that the maintainer can take a look.
Adam, please correct me if I'm wrong!
> > It's meant to give anyone who's interested a quick overview of the day's status, and anyone can review the results. The failed tests are linked directly. If you find the volume a bit much, the topic is easy to filter to a separate folder - this is what I do in fact (I filter these reports and the "compose report" mails generated by releng to a dedicated Reports folder).
> > When I have time I send a manual reply explaining the failures, but 36 and Rawhide have both been pretty chaotic lately so I haven't had time. Very broadly, most tests are actually passing now, a few are affected by known bugs. Most of interest to Server is that the aarch64 disk images are affected by https://bugzilla.redhat.com/show_bug.cgi?id=2057600 (initial-setup does not run on first boot as it should).
> Hi Adam, thanks for the information. As you may have noticed, I've initiated a discussion about how to contribute to improving Server release quality and avoiding some of the unpleasant experiences we had with the last releases. Those types of issues may not be detectable with automated tests. Unfortunately, I lack an overview of the processes; I never had any reason to get familiar with them until now (having unexpectedly become engaged in the Server WG).
I am not Adam, but I'll ask anyway: what issues did you encounter? openQA is pretty versatile and can test *a lot* of scenarios.
Cheers,
Dan
Hi Dan,
thanks for all the information. It makes me feel better that we, as the Server WG, are not missing anything we are expected to contribute.
On 27.02.2022 at 17:02, Dan Čermák dan.cermak@cgc-instruments.com wrote:
> ... I am not Adam, but I'll ask anyway: what issues did you encounter? openQA is pretty versatile and can test *a lot* of scenarios.
With Fedora 33 it was the switch to systemd-resolved. It broke name resolution on the internal libvirt network virbr0, and after an update it triggered an unexpected service outage on all servers that used that network. That was pretty bad, and it took some time until a solution was found. (You have to use a libvirt hook script to apply the configuration via resolvectl afterwards; systemd normally discovers that configuration itself, but cannot with libvirt, because libvirt configures the interface directly, independently of NetworkManager.)
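For readers who haven't seen this workaround: a minimal sketch of such a libvirt hook is below. It is an illustration of the idea, not the exact script discussed here; the network name "default", the dnsmasq address 192.168.122.1 and the domain "~vm.lan" are example values you would replace with your own.

```shell
#!/bin/sh
# Hypothetical /etc/libvirt/hooks/network script (sketch only).
# libvirt invokes the hook as: network <network-name> <operation> <sub-op>
NETWORK="$1"
OPERATION="$2"

# Act only when the default NAT network has come up.
if [ "$NETWORK" = "default" ] && [ "$OPERATION" = "started" ]; then
    # Point systemd-resolved at libvirt's dnsmasq on virbr0 for the
    # guest domain. Because libvirt configures virbr0 directly
    # (bypassing NetworkManager), resolved never learns this on its own.
    resolvectl dns virbr0 192.168.122.1
    resolvectl domain virbr0 '~vm.lan'
fi
```

The hook has to be executable and is re-run by libvirtd whenever the network starts, which is what makes the setting survive reboots.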
With F35, it was the switch to the modularized libvirt libraries. We had tried to get a test day, unfortunately without success, so we relied on the usual test procedures. Then, after the release was published, we suddenly noticed that the usual startup and management procedures no longer worked, and every server hosting virtual machines had a problem. And which server doesn't host VMs nowadays?
Fortunately, this could be fixed very quickly, but the damage was done.
There was a third problem, probably with F34. It was obviously not that big; I don't remember the details at the moment.
And there were issues with earlier releases, too. With F30 or F31 there was a bug with systemd-nspawn and SELinux which has not been fixed to this day. Fortunately there is a workaround: you have to apply about 10 SELinux policy overrides manually after the installation. Nice. And every time you want to administer something, you have to set SELinux to permissive, and remember to enable it again afterwards.
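The usual way such manual overrides are applied is to turn the logged denials into a local policy module, roughly like this (a sketch; "local-nspawn" is a made-up module name, and the actual rules depend on the denials you hit):

```shell
# Build a local SELinux policy module from recent AVC denials caused
# by systemd-nspawn, then install it. Requires root.
ausearch -m AVC -ts recent -c systemd-nspawn | audit2allow -M local-nspawn
semodule -i local-nspawn.pp

# The "switch off, administer, switch back on" dance described above:
setenforce 0    # permissive
# ... do the administrative work ...
setenforce 1    # enforcing again
```

Forgetting the final `setenforce 1` is exactly the kind of pitfall that makes this workaround so unpleasant for day-to-day administration.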
For a server administrator, these experiences do not inspire much confidence. SELinux, systemd, network configuration, name resolution, all these are central and indispensable elements in Fedora and especially in Fedora Server. And obviously, not even those elements are reliably under control.
So perhaps we have developed a kind of allergy to, or paranoia about, change, which each time threatens to add another problem and quick&dirty solution to a situation already stuffed with "Q&Ds".
But, </complainMode off>, most things work very well in Fedora and especially Fedora Server. It's the best I've been able to find so far, hence my commitment.
And therefore I'm looking for ways to find such bugs or issues that have "slipped through" so far before the release date (so we can at least add yet another Q&D to our collection beforehand). I wonder whether bugs like the ones described above are conceptually detectable at all with automated testing like openQA. As a first step we have started work on an update of our technical specification and, subsequently, our test and release criteria. Perhaps we can expand or sharpen our tests on this basis.
I have never dealt with this practically before and therefore have a pretty big knowledge gap and need to catch up.
Best Peter
P.S. Sorry for the long text.
On Sun, 2022-02-27 at 21:55 +0100, Peter Boy wrote:
> And therefore I'm looking for ways to find such bugs or issues that have "slipped through" so far before the release date (so we can at least add yet another Q&D to our collection beforehand). I wonder whether bugs like the ones described above are conceptually detectable at all with automated testing like openQA. As a first step we have started work on an update of our technical specification and, subsequently, our test and release criteria. Perhaps we can expand or sharpen our tests on this basis.
I mean, the answer is broadly speaking 'yes'. If you can come up with a definition of a precise set of steps that you want to test, we can usually implement that as an automated test.
The issues are purely resource ones. We need the resources to write the tests. We need the resources to run them. And we need the resources to do something if the tests fail: someone needs to review the result and find out if it's a real bug or a test failure. If it's a test failure someone needs to fix the test. If it's a real bug someone needs to fix the bug.
Practically speaking, there are limits on all of these in Fedora. We only really have two people writing tests for openQA at the moment. We only have a limited amount of machines to run the tests on. And it's mostly only me reviewing failures and deciding what to do about them at the moment.
Beyond that, the final limit there is the subtlest but maybe the most important of all. One of the most important things I've picked up doing this job for over a decade is that a test that runs every day and fails every day is the most useless thing in software. There is very little point in implementing a test if, when that test finds a bug, nobody is going to fix it.
This is a key reason, for me, why it's not a good idea to just write automated tests for *everything*. I tend to be quite strict about adding tests to openQA only when I'm very sure that someone is going to care if the test fails. For a long time we specifically only ran tests in openQA that validate the release criteria, because that neatly solves this problem almost entirely: if the test fails, then that's a release-blocking bug, and somebody *has* to fix it or the release doesn't go out.
In the last couple of years we have started carefully adding some tests beyond that scope, but usually only in response to requests from *developers*. If the FreeIPA or GNOME team comes to us and says, hey, can we add a test for this feature we think is really important, I have the confidence to say "yes" because I know that if the test finds a bug, I can go back to that team and say "hey, this test you told us to add found a bug, fix it". If someone who *isn't* the developer of Thing X says "can we add a test for Thing X?", my first question is, "if the test finds out that Thing X is broken, who do I email to fix it the next day?"
So to go back to your message - updating the tech spec and release criteria is an excellent idea. If we can get broad buy-in that "Thing X must work" ought to be a release criterion, then I would be very confident in adding a test for Thing X. In fact, I would very much *want to have* a test for Thing X, because one of our key goals is to automate testing for the release criteria as far as we possibly can. But there does need to be a solid justification for why Thing X working should be in the release criteria, and we need to have a "throat to choke" to fix Thing X when it breaks.
Hope that makes sense!
On Sun, 2022-02-27 at 21:55 +0100, Peter Boy wrote:
> Hi Dan,
> thanks for all the information. It makes me feel better that we, as the Server WG, are not missing anything we are expected to contribute.
> On 27.02.2022 at 17:02, Dan Čermák dan.cermak@cgc-instruments.com wrote:
> > ... I am not Adam, but I'll ask anyway: what issues did you encounter? openQA is pretty versatile and can test *a lot* of scenarios.
> With Fedora 33 it was the switch to systemd-resolved. It broke name resolution on the internal libvirt network virbr0, and after an update it triggered an unexpected service outage on all servers that used that network. That was pretty bad, and it took some time until a solution was found. (You have to use a libvirt hook script to apply the configuration via resolvectl afterwards; systemd normally discovers that configuration itself, but cannot with libvirt, because libvirt configures the interface directly, independently of NetworkManager.)
> With F35, it was the switch to the modularized libvirt libraries. We had tried to get a test day, unfortunately without success, so we relied on the usual test procedures. Then, after the release was published, we suddenly noticed that the usual startup and management procedures no longer worked, and every server hosting virtual machines had a problem. And which server doesn't host VMs nowadays?
> Fortunately, this could be fixed very quickly, but the damage was done.
> There was a third problem, probably with F34. It was obviously not that big; I don't remember the details at the moment.
> And there were issues with earlier releases, too. With F30 or F31 there was a bug with systemd-nspawn and SELinux which has not been fixed to this day. Fortunately there is a workaround: you have to apply about 10 SELinux policy overrides manually after the installation. Nice. And every time you want to administer something, you have to set SELinux to permissive, and remember to enable it again afterwards.
> For a server administrator, these experiences do not inspire much confidence. SELinux, systemd, network configuration, name resolution, all these are central and indispensable elements in Fedora and especially in Fedora Server. And obviously, not even those elements are reliably under control.
All of those seem a bit vague, and without Bugzilla links I can't quite tell exactly what the bug was in each case. So let me talk a bit more generally about what we cover right now.
We cover what's in the release criteria. So, we check that the system can be installed, and that it boots to a console, and you can log in as a regular user and root. We check that no system services fail on boot. We check that you can install updates. We check that Cockpit is set up by default and that you can connect to it. We test quite a few bits of Cockpit's feature set. We check that you can set up a postgresql database and connect to that. We check that you can deploy the system as a FreeIPA server, and deploy another system as a FreeIPA client of that server, and check various things about that work, including enrolling a client via Cockpit and doing some admin tasks in the web UI. We check the FreeIPA and postgresql tests work on a clean upgrade from the last two releases. All of these things we test with SELinux enabled.
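To make one of those checks concrete, the postgresql validation described above boils down to something like the following (a sketch of the idea in plain shell, not the actual openQA test code; "smoketest" is an example database name):

```shell
# Sketch of the postgresql smoke test idea on Fedora Server. Requires root.
sudo dnf -y install postgresql-server
sudo postgresql-setup --initdb        # initialize the data directory
sudo systemctl enable --now postgresql

# Create a database and verify that we can connect and run a query.
sudo -u postgres createdb smoketest
sudo -u postgres psql -d smoketest -c 'SELECT 1;'
```

The openQA test performs the equivalent steps inside a VM, with SELinux enforcing, and fails the job if any step errors out.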
There's quite a lot of "SELinux, systemd, network configuration, name resolution" in all the above. But the thing is, potential configurations of Fedora are essentially endless, and we can't really reproduce every one.
One specific thing about openQA is that it does not use libvirt. The tests run in qemu VMs, but openQA's test runner (os-autoinst) launches these directly; it does not use libvirt. We use qemu usermode networking for simple tests, and openvswitch networking for tests where two or more machines need to communicate with each other.
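For comparison, launching a VM directly with qemu usermode networking looks roughly like this (an illustrative command line, not os-autoinst's exact invocation; the disk image name and forwarded port are example values):

```shell
# Boot a disk image with qemu directly; no libvirt involved.
# "user" networking gives the guest NAT'd outbound access, and
# hostfwd exposes the guest's sshd on host port 2222.
qemu-system-x86_64 \
    -machine accel=kvm -m 2048 -smp 2 \
    -drive file=disk.qcow2,if=virtio \
    -nic user,model=virtio-net-pci,hostfwd=tcp::2222-:22
```

This is part of why libvirt-specific breakage (like the virbr0 name-resolution issue above) is invisible to openQA: the stack it exercises simply never goes through libvirt.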
SELinux denials on things we don't test are definitely a thing that happens, yeah. What's really useful to avoid them in stable releases is for folks from this group to test their Server workflows *before* the release goes stable - say, when Beta comes out - and see what's broken. Then we can get them fixed before the stable release. I used to run my own mail, web, FreeIPA and radicale servers, and I upgraded my server VMs a little before the Final freeze so I could catch problems like that and report them. But maintaining those servers just got to be too much of a burden and I quit doing it, so I don't catch those things any more.