reversible dual-boot test station
by Chris Murphy
Hi,
I mentioned this in a QA meeting, and have given it enough testing that I think it's broadly usable. If desired, it can be copied out of my user account and put somewhere QA folks will see it and can modify it as issues or improvements are discovered.
What is it? The idea is to produce a system that can confidently be used for bare-metal testing without risking the primary operating system. While VMs are a great way to test, they are also a really idealized environment that tends not to expose an assortment of bugs that affect particular hardware. But quite a lot of folks reasonably don't want to upgrade their daily-use hardware early on, because they don't want to always have to debug things, or to figure out how to undo the upgrade if it goes really badly.
Therefore, I present a dual-boot setup offering:
* no re-partitioning;
* no installation step, instead system upgrade is used;
* reversibility, or undoability, i.e. with just a few steps you can delete the "test OS".
https://fedoraproject.org/wiki/User:Chrismurphy/Draft/dualboot_teststation
--
Chris Murphy
3 weeks, 4 days
Re: Thoughts welcome: interface between automated test gating and the "critical path"
by Adam Williamson
On Fri, 2022-09-02 at 08:37 +0000, Zbigniew Jędrzejewski-Szmek wrote:
> >
> > Now, because I glued openQA to the critpath because it was handy, there
> > are two sets of consequences to a package being in critical path:
> >
> > 1. Tighter Bodhi requirements
> > 2. openQA tests are run, and results gate the update (except Rawhide)
> >
> > So, one of the implicit questions here is, is it OK to keep twinning
> > these two sets of consequences, or should we split them up? Splitting
> > them up kinda implies answer 2) from my original mail: "Keep the
> > current "critical path" concept but define a broader group
> > of "gated packages" somewhere". Because we would then need some new
> > concept that isn't "critical path". As I said, that's more *work* -
> > it'd require us to write new code in several places[0]. Even if we
> > decide it'd be nice to do this, is it nice *enough* to be worth doing
> > that work?
>
> I'd still vote for keeping a single critpath list and using it as
> "the list of packages that require extra care and testing".
>
> As you describe, the original meaning of critpath has shifted, but
> it's because the way we do updates and QA has also shifted. Doing
> gating tests for a package seems much more useful than just keeping
> it longer in 'updates-testing' in hope that somebody discovers an
> important regression in the second week.
Well, there's a caveat there - openQA doesn't test everything. On the
whole we cover quite a lot with the set of tests that gets run on
updates, but there's certainly lots of potential for there to be
important bugs it misses, that a human tester might catch. So I think
there is still a case for the higher karma requirements too.
>
> So yeah, I don't think it makes sense to do the extra work to split
> the concepts. Also because we have way too many concepts and processes
> in Fedora already.
On the whole, though, I agree with you. I just don't trust my own
opinion because it's obviously biased by what's convenient for me. :D
> > If we don't think it's worth doing that work, then we're kinda stuck
> > with openQA glomming onto the critpath definition to decide which
> > updates to test and gate, because I don't have any other current viable
> > choices for that, really. And we'd have to figure out a critpath
> > definition that's as viable as possible for both purposes.
> >
> >
> > BTW, one other thought I've had in relation to all this is that we
> > could enhance the current critpath definition somewhat. Right now, it's
> > built out of package groups in comps which are kinda topic-separated:
> > there's a critpath-kde, a critpath-gnome, a critpath-server, and so on.
> > But the generated critical path package list is a monolith: it doesn't
> > distinguish between a package that's on the GNOME critpath and a
> > package that's on the KDE critpath, you just get a big list of all
> > critpath packages. It might be nice if we actually did distinguish
> > between those - the critpath definition could keep track of which
> > critpath topic(s) a package is included in, and Bodhi could display
> > that information in the web UI and provide it via the API. That way
> > manual testers could get a bit more info on why a package is critpath
> > and what areas to test, and openQA could potentially target its test
> > runs to conserve resources a bit, though this might require a bit more
> > coding work on the gating stuff now I think about it.
>
> That sounds useful. We only need a volunteer to figure out the details
> and do the work ;)
I actually did a huge rewrite of the thing that generates the critpath
data this week, and it probably wouldn't be tooooo much work, honestly.
The most annoying bit would be the Bodhi frontend stuff, but that's
because I'm bad at frontend dev in general. :P But yeah, this is
definitely off in sky-castle land. I'll add it to my ever-growing list
of sky-castle projects to do when I get a couple of years of spare
time...
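
To make the per-topic idea a bit more concrete, here is a minimal sketch. It is not the real critpath generator: the group names follow the comps groups mentioned above, but the package lists and the function name `topics_by_package` are made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical input: critpath comps group -> packages (illustrative data only,
# not the real contents of these groups).
CRITPATH_GROUPS = {
    "critpath-gnome": ["gnome-shell", "gdm", "systemd"],
    "critpath-kde": ["plasma-workspace", "sddm", "systemd"],
    "critpath-server": ["systemd", "openssh"],
}


def topics_by_package(groups):
    """Invert group -> packages into package -> sorted list of topics."""
    topics = defaultdict(set)
    for group, packages in groups.items():
        for package in packages:
            topics[package].add(group)
    return {package: sorted(found) for package, found in topics.items()}


if __name__ == "__main__":
    # Print each package with the critpath topic(s) it belongs to.
    for package, found in sorted(topics_by_package(CRITPATH_GROUPS).items()):
        print(f"{package}: {', '.join(found)}")
```

With data in that shape, a tool could show testers why a package is critpath, or pick only the test suites relevant to the topics a package appears in.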
--
Adam Williamson
Fedora QA
IRC: adamw | Twitter: adamw_ha
https://www.happyassassin.net
7 months
Plan / proposal: enable openQA update testing and potentially gating on Rawhide updates
by Adam Williamson
Hi folks!
We've had openQA testing of updates for stable and branched releases,
and gating based on those tests, enabled for a while now. I believe
this is going quite well, and I think we addressed the issues reported
when we first enabled gating - Bodhi's gating status updates work more
smoothly now, and openQA respects Bodhi's "re-run tests" button so
failed tests can be re-triggered.
A few weeks ago, I enabled testing of Rawhide updates in the openQA
lab/stg instance. This was to see how smoothly the tests run, how often
we run into unexpected failures or problems, and whether the hardware
resources we have are sufficient for the extra load.
So far this has been going more smoothly than I anticipated, if
anything. The workers seem to keep up with the test load, even though
one out of three worker systems for the stg instance is currently out
of commission (we're using it to investigate a bug). We do get
occasional failures which seem to be related to Rawhide kernel slowness
(e.g. operations timing out that usually don't otherwise time out), but
on the whole, the level of false failures is (I would say) acceptably
low, enough that my current regime of checking the test results daily
and restarting failed ones that don't seem to indicate a real bug
should be sufficient.
So, I'd like to propose that we enable Rawhide update testing on the
production openQA instance also. This would cause results to appear on
the Automated Tests tab in Bodhi, but they would be only informational
(and unless the update was gated by a CI test, or somehow otherwise
configured not to be pushed automatically, updates would continue to be
pushed 'stable' almost immediately on creation, regardless of the
openQA results).
More significantly, I'd also propose that we turn on gating on openQA
results for Rawhide updates. This would mean Rawhide updates would be
held from going 'stable' (and from being included in the next compose) until the
gating openQA tests had run and passed. We may want to do this a bit
after turning on the tests; perhaps Fedora 37 branch point would be a
natural time to do it.
Currently this would usually mean a wait from update submission to
'stable push' (which really means that the build goes into the
buildroot, and will go into the next Rawhide compose when it happens)
of somewhere between 45 minutes and a couple of hours. It would also
mean that if Rawhide updates for inter-dependent packages are not
correctly grouped, the dependent update(s) will fail testing and be
gated until the update they depend on has passed testing and been
pushed. The tests for the dependent update(s) would then need to be re-
run, either by someone hitting the button in Bodhi or an openQA admin
noticing and restarting them, before the dependent update(s) could be
pushed.
In the worst case, if updated packages A and B both need the other to
work correctly but the updates are submitted separately, both updates
may fail tests and be blocked. This could only be resolved by waiving
the failures, or replacing the separate updates with an update
containing both packages.
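
A toy model of the grouping problem described above, as a rough sketch only. Real gating depends on openQA test results, not on a dependency check; here "requires" simply stands in for "needs the new build of", and the function name `gate` and the update names are hypothetical.

```python
def gate(updates, requires, stable=frozenset()):
    """Toy gating model.

    An update passes only if every package its contents require is
    already stable or is shipped in the same update.
    """
    results = {}
    for name, packages in updates.items():
        available = set(stable) | set(packages)
        needed = set()
        for package in packages:
            needed |= requires.get(package, set())
        results[name] = needed <= available
    return results


# Worst case from the text: A and B each need the other's new build.
REQUIRES = {"A": {"B"}, "B": {"A"}}

# Submitted separately: neither update can pass on its own.
separate = gate({"update-A": ["A"], "update-B": ["B"]}, REQUIRES)

# Submitted together (e.g. built in a side tag): the update passes.
grouped = gate({"update-AB": ["A", "B"]}, REQUIRES)

if __name__ == "__main__":
    print(separate)
    print(grouped)
```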
All of those considerations are already true for stable and branched
releases, but people are probably more used to grouping updates for
stable and branched than doing it for Rawhide, and the typical flow of
going from a build to an update provides more opportunity to create
grouped updates for branched/stable. For Rawhide the easiest way to do
it if you need to do it is to do the builds in a side tag and use
Bodhi's ability to create updates from a side tag.
As with branched/stable, only critical path updates would have the
tests run and be gated on the results. Non-critpath updates would be
unaffected. (There's a small allowlist of non-critpath packages for
which the tests are also run, but they are not currently gated on the
results).
I think doing this could really help us keep Rawhide solid and avoid
introducing major compose-breaking bugs, at minimal cost. But it's a
significant change and I wanted to see what folks think. In particular,
if you find the existing gating of updates for stable/branched releases
to cause problems in any way, I'd love to hear about it.
Thanks folks!
--
Adam Williamson
Fedora QA
IRC: adamw | Twitter: adamw_ha
https://www.happyassassin.net
9 months, 1 week
Proposed change to rendering of release schedule ics file
by Ben Cotton
Hi everyone,
I want to get feedback before I make a change to how Fedora Linux
schedules are generated. As reported in schedule#91[1], the current
ics files include very long tasks that aren't particularly useful. I
have a plan for how to address that, but first I want to check that it
won't impact how people currently use the ics files.
If you're using the Fedora Linux schedule ics files, please see the
details in the Pagure issue[2]. If this will affect how you or your
team use those files, comment in the issue before 13 February. Note
that this will not change how the html or json versions are produced.
[1] https://pagure.io/fedora-pgm/schedule/issue/91
[2] https://pagure.io/fedora-pgm/schedule/issue/91#comment-838492
--
Ben Cotton
He / Him / His
Fedora Program Manager
Red Hat
TZ=America/Indiana/Indianapolis
1 year, 1 month
Boot issue, probably with grub and maybe mdraid
by Bruno Wolff III
Today one of three machines failed to boot 6.2.0-0.rc4. The machine that
failed had not been rebooted in a week and now won't boot any kernel and
it appears grub is aborting with a pointer out of range error. All three
machines use ext4 and luks, but only the failing machine uses mdraid.
I haven't recovered the failing machine yet, but plan to downgrade grub
tomorrow and hope that confirms a grub bug by allowing it to boot. If
so, I'll file a bug report.
I'm wondering if anyone else saw this and/or if they think there might be
a different issue I should be looking for?
1 year, 1 month
Delayed openQA test execution
by Adam Williamson
Hey folks! Wanted to send a note out for any maintainers who may be
waiting on openQA test results for critical path updates etc. Tests are
currently taking much longer than usual to be completed because there's
a large backlog. The Rawhide mass rebuild, plus the three critical bugs
(python-ptyprocess, systemd, and mesa) that I posted about yesterday,
caused 2-3 days' worth of Rawhide update tests to fail. After finally
sorting out those problems enough that the tests now pass, I
rescheduled that whole backlog, which turned out to be over 2000 tests.
The system is working its way through the backlog but it'll take a
while (I expect it should be clear by end of day today or so; it
depends a bit on how many updates are created in the meantime), and
until that gets done, it may take much longer for the tests for any
given update to be completed after the update is created.
I do apologize for this; with hindsight it might've been better to try
and hack up a way to reduce the priority on the backlogged Rawhide
update tests so tests for stable release updates ran first, but I
didn't think of that yesterday.
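
For illustration, the kind of prioritization meant here could be sketched like this. It is a generic priority queue, not openQA's actual scheduler code; it assumes a lower number means "run sooner", and the class name `TestQueue`, the job names and the priority values are all made up.

```python
import heapq
from itertools import count


class TestQueue:
    """Minimal priority queue: lower priority number runs first."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker keeps FIFO order within a priority

    def add(self, name, priority):
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = TestQueue()
# Backlogged Rawhide tests are queued first, but at a reduced priority.
queue.add("rawhide-update-1", priority=60)
queue.add("rawhide-update-2", priority=60)
# A stable release update arrives later, at a more urgent priority.
queue.add("f37-security-update", priority=40)

run_order = [queue.pop() for _ in range(len(queue))]

if __name__ == "__main__":
    print(run_order)
```

The stable release test runs first even though it was queued last, and the backlogged tests keep their original order among themselves.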
If there are any urgent security or critical bug fix updates for stable
releases which are waiting on testing and really need it run ASAP, let
me know and I can manually bump the priority of those tests so they run
sooner.
Thanks folks!
--
Adam Williamson
Fedora QA
IRC: adamw | Twitter: adamw_ha
https://www.happyassassin.net
1 year, 1 month
Re: Some heads-ups for Rawhide users: python3-ptyprocess and systemd issues
by Adam Williamson
On Thu, 2023-01-26 at 12:41 +0100, Florian Weimer wrote:
> * Adam Williamson:
>
> > 6 (__libc_message.cold+0x5) [0x7fbae3c2560f]
> > Jan 25 13:38:47 fedora /usr/libexec/gdm-x-session[1040]: (EE) 6: /lib64/libc.so.6 (malloc_printerr+0x15) [0x7fbae3c96a05]
> > Jan 25 13:38:47 fedora /usr/libexec/gdm-x-session[1040]: (EE) 7: /lib64/libc.so.6 (_int_free+0x9e5) [0x7fbae3c98de5]
> > Jan 25 13:38:47 fedora /usr/libexec/gdm-x-session[1040]: (EE) 8: /lib64/libc.so.6 (__libc_free+0x7e) [0x7fbae3c9b42e]
> > Jan 25 13:38:47 fedora /usr/libexec/gdm-x-session[1040]: (EE) 9: /usr/lib64/dri/zink_dri.so (__driDriverGetExtensions_zink+0x9e70) [0x7fbad82b8180]
> > Jan 25 13:38:47 fedora /usr/libexec/gdm-x-session[1040]: (EE) 10: /lib64/libgbm.so.1 (gbm_format_get_name+0xe81) [0x7fbae3229361]
> > Jan 25 13:38:47 fedora /usr/libexec/gdm-x-session[1040]: (EE) 11: /lib64/libgbm.so.1 (gbm_format_get_name+0x1018) [0x7fbae32294f8]
> > Jan 25 13:38:47 fedora /usr/libexec/gdm-x-session[1040]: (EE) 12: /lib64/libgbm.so.1 (gbm_format_get_name+0x121a) [0x7fbae32296fa]
>
> I saw this during the latest glibc update, but given that the update
> wasn't tested in isolation, I waived it through because the update
> addresses the known sprintf issue.
>
> I'm not aware of anything in glibc that might have caused this. The
> crashes related to -D_FORTIFY_SOURCE=3 look different (they try to crash
> before causing heap corruption!), and it's not the sprintf assertion
> failure in __printf_buffer_as_file_commit, either.
>
No, it's definitely mesa that causes it - I verified that with manual
local testing: updating only mesa causes the bug to start happening, and
downgrading it makes it go away. This is now being tracked/investigated
as https://bugzilla.redhat.com/show_bug.cgi?id=2164667 .
--
Adam Williamson
Fedora QA
IRC: adamw | Twitter: adamw_ha
https://www.happyassassin.net
1 year, 1 month
Some heads-ups for Rawhide users: python3-ptyprocess and systemd issues
by Adam Williamson
Hey folks! Just wanted to send out a heads-up for Rawhide users about
some issues that have shown up in openQA testing.
First, you may have found that trying to update to today's Rawhide
fails like this:
Error: Transaction test error:
file /usr/lib/python3.11/site-packages/ptyprocess-0.7.0-py3.11.egg-info from install of python3-ptyprocess-0.7.0-2.fc38.noarch conflicts with file from package python3-ptyprocess-0.7.0-1.fc38.noarch
That's https://bugzilla.redhat.com/show_bug.cgi?id=2164207 . We figured
out the problem (thanks to Panu for getting to the root cause of it)
and it is fixed in python-ptyprocess-0.7.0-3.fc38 , which should be in
the next Rawhide compose. If you really can't wait you can update
directly to that version
from https://koji.fedoraproject.org/koji/buildinfo?buildID=2138649 ,
but you'd better watch out for issue #2...
The second issue is that systems that update to systemd-253~rc1-1.fc38
seem to get stuck on boot. With Plymouth enabled you just see the
splash screen. With it disabled (or by pressing esc) it seems to be
stuck at "Stopped initrd-switch-root.service - Switch Root.". I'm still
looking into this one, but it's happened to a lot of openQA tests and I
was able to confirm it first try in a local VM, by installing from the
20230123.n.0 compose then updating systemd and rebooting. Fresh
installs with the newer systemd seem to be OK, at least most openQA
tests for the new compose passed - it seems to be only updating an
existing install that has the problem, at least so far.
--
Adam Williamson
Fedora QA
IRC: adamw | Twitter: adamw_ha
https://www.happyassassin.net
1 year, 1 month