ResultsDB 2.0 - DB migration on DEV
by Josef Skladanka
So, I have performed the migration on DEV - there were some problems with
it running out of memory, so I had to tweak it a bit (please have a look at
D1059, that is what I ended up using by hot-fixing on DEV).
There still is a slight problem, though - the migration of DEV took about
12 hours total, which is a bit unreasonable. Most of the time was spent in
`alembic/versions/dbfab576c81_change_schema_to_v2_0_step_2.py` lines 84-93
in D1059. The code takes about 5 seconds to change 1k results. That would
mean at least 15 hours of downtime on PROD, and that, I think, is unrealistic...
And since I don't know how to make it faster (tips are most welcomed), I
suggest that we archive most of the data in STG/PROD before we go forward
with the migration. I'd make a complete backup, and delete all but the
data from the last 3 months (or any other reasonable time span).
We can then populate an "archive" database, and migrate it on its own,
should we decide it is worth it (I don't think it is).
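For what it's worth, one common way to speed up a per-row migration like this is to batch the updates into executemany() calls instead of issuing one UPDATE per row. A minimal, self-contained sketch using sqlite and invented table/column names (not ResultsDB's actual schema):

```python
# Hypothetical sketch: batching per-row UPDATEs into executemany() calls.
# Table and column names are made up, not ResultsDB's real schema.
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, group_uuid TEXT)")
conn.executemany("INSERT INTO result (id) VALUES (?)",
                 [(i,) for i in range(1, 1001)])

def migrate_in_batches(conn, batch_size=500):
    """Issue one executemany() per batch instead of one UPDATE per row."""
    ids = [row[0] for row in conn.execute("SELECT id FROM result")]
    for start in range(0, len(ids), batch_size):
        batch = [(str(uuid.uuid4()), rid)
                 for rid in ids[start:start + batch_size]]
        conn.executemany("UPDATE result SET group_uuid = ? WHERE id = ?",
                         batch)
    conn.commit()

migrate_in_batches(conn)
migrated = conn.execute(
    "SELECT COUNT(*) FROM result WHERE group_uuid IS NOT NULL").fetchone()[0]
```

Whether this actually helps depends on where the time goes in step 2 of the migration, so treat it as a direction to profile, not a fix.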
What do you think?
J.
New ExecDB
by Josef Skladanka
With ResultsDB and Trigger rewrite done, I'd like to get started on ExecDB.
The current ExecDB is more of a tech preview that was meant to show that
it's possible to consume the push notifications from Buildbot. The thing
is that the code doing it is quite a mess (mostly because the notifications
are quite a mess), and it's directly tied not only to Buildbot, but quite
probably to the one version of Buildbot we currently use.
I'd like to change the process to a style, where ExecDB provides an API,
and Buildbot (or possibly any other execution tool we use in the future)
will just use that to switch the execution states.
ExecDB should be the hub, in which we can go to search for execution state
and statistics of our jobs/tasks. The execution is tied together via UUID,
provided by ExecDB at Trigger time. The UUID is passed through the whole
stack, from Trigger to ResultsDB.
The process, as I envision it, is:
1) Trigger consumes FedMsg
2) Trigger creates a new Job in ExecDB, storing data like FedMsg message
id, and other relevant information (to make rescheduling possible)
3) ExecDB provides the UUID, marks the Job as SCHEDULED, and Trigger then
passes the UUID, along with other data, to Buildbot.
4) Buildbot runs runtask (and sets the ExecDB job to RUNNING)
5) Libtaskotron is provided the UUID, so it can then be used to report
results to ResultsDB.
6) Libtaskotron reports to ResultsDB, using the UUID as the Group UUID.
7) Libtaskotron ends, creating a status file in a known location
8) The status file contains machine-parsable information about the
runtask execution - either "OK" or a description of "Fault" (network
failed, package to be installed did not exist, koji did not respond... you
name it)
9) Buildbot parses the status file, and reports back to ExecDB, marking the
Job either as FINISHED or CRASHED (+details)
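The job lifecycle in the steps above can be sketched as a tiny state machine; the state names follow the proposal, everything else here is purely illustrative:

```python
# Illustrative sketch of the proposed ExecDB job lifecycle; state names are
# from the proposal, the class itself is hypothetical.
from enum import Enum

class JobState(Enum):
    SCHEDULED = "SCHEDULED"  # set by ExecDB when Trigger creates the Job
    RUNNING = "RUNNING"      # set by Buildbot when runtask starts
    FINISHED = "FINISHED"    # set from the status file on success
    CRASHED = "CRASHED"      # set from the status file on any Fault

# Allowed transitions between states.
TRANSITIONS = {
    JobState.SCHEDULED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.FINISHED, JobState.CRASHED},
}

class Job:
    def __init__(self, uuid):
        self.uuid = uuid  # handed out by ExecDB at Trigger time
        self.state = JobState.SCHEDULED

    def transition(self, new_state):
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(
                f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```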
This will need changes in the Buildbot steps - a step that switches the job
to RUNNING at the beginning, and a step that handles the FINISHED/CRASHED
switch. The way I see it, this can be done via a simple curl or HTTPie call
from the command line. No big issue here.
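The parsing side could be equally simple. As a sketch, assuming the status file holds a single line that is either "OK" or "Fault: <description>" (the exact format is my assumption, it's not pinned down above):

```python
# Hedged sketch of parsing the proposed status file. The one-line
# "OK" / "Fault: <description>" format is an assumption, not a spec.
def parse_status(text):
    """Return (state, details) for ExecDB: FINISHED on OK, CRASHED otherwise."""
    line = text.strip()
    if line == "OK":
        return ("FINISHED", None)
    if line.startswith("Fault:"):
        return ("CRASHED", line[len("Fault:"):].strip())
    # An unparsable file is itself a crash worth recording.
    return ("CRASHED", f"unparsable status: {line!r}")
```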
We should make sure that ExecDB stores data that:
1) show the execution state
2) allow job re-scheduling
3) describe the reason the Job CRASHED
1 is obviously the state. 2 I think can be satisfied by storing the Fedmsg
Message ID and/or the Trigger-parsed data, which are passed to Buildbot.
Here I'd like to focus on 3:
My initial idea was to have SCHEDULED, RUNNING, FINISHED states, and four
crashed states, to describe where the fault was:
- CRASHED_TASKOTRON for when the error is on "our" side (minion could not
be started, git repo with task not cloned...)
- CRASHED_TASK to use when there's an unhandled exception in the Task code
- CRASHED_RESOURCES when network is down, etc
- CRASHED_OTHER whenever we are not sure
The point of the crashed "classes" is to be able to act on different kinds
of crashes - notify the right party, or even automatically reschedule the
job in the case of network failure, for example.
After talking this through with Kamil, I'd rather do something slightly
different. There would only be one CRASHED state, but the job would contain
additional information to
- find the right person to notify
- get more information about the cause of the failure
To do this, we came up with a structure like this:
{state: CRASHED, blame: [TASKOTRON, TASK, UNIVERSE], details:
"free-text-ish description"}
The "blame" classes are self-describing, although I'd love to have a better
name for "UNIVERSE". We might want to add more, should it make sense, but
my main focus is to find the right party to notify.
The "details" field will contain the actual cause of the failure (in the
case we know it), and although I have it marked as free-text, I'd like to
have a set of values defined in docs, to keep things consistent.
Doing this, we could record that "Koji failed, timed out" (and blame
UNIVERSE, and possibly reschedule) or "DNF failed, package not found"
(blame TASK if it was in the formula, and notify the task maintainer), or
"Minion creation failed" (and blame TASKOTRON, notify us, I guess).
Implementing the crash classification will obviously take some time, but it
can be gradual, and we can start handling the "well known" failures soon,
for the bigger gain (kparal had some examples, IIRC).
So - what do you think about it? Is it a good idea? Do you feel like there
should be more (I can't really imagine there being less) blame targets
(like NETWORK, for example), and if so, why, and which? How about the
details - should we go with a pre-defined set of values (because enums are
better than free-text, but adding more would mean DB changes), or is
free-text + docs fine? Or do you see some other, better solution?
joza
Release validation NG: planning thoughts
by Adam Williamson
Hi folks!
We should probably set up some projects and so on for this so we can
use issue trackers, but I thought before committing to any structure we
could have at least a short mailing list discussion for planning the
'release validation NG' work.
For anyone who forgot / didn't know - 'release validation NG' is my
nickname for the project to write a dedicated system for manual release
validation testing result submission, using resultsdb for storage. The
goal is to make manual validation testing result submission easier and
less error-prone, and also to allow for improved analysis of results
and integration of manual results with results from other systems
(taskotron, openQA, autocloud etc). This would be designed to replace
the system of editable wiki pages that I call 'Wikitcms':
https://fedoraproject.org/wiki/Test_Results:Current_Installation_Test (etc.)
https://fedoraproject.org/wiki/Wikitcms
the latter page is a broad overview of how I see the Wikitcms 'system'
working at present. It's that system we'd be replacing, so it may help
you to read through that page to get some context and background on how
we got here and why 'release validation NG' might be a good idea :)
We have a ticket open with the design team:
https://pagure.io/design/issue/483
where kathryng is helping us with design mock ups based on my initial
rough sketches, which is great. Please do take a look at the mockups
and discussion there and add thoughts if you have any.
My very initial thought on architecture is that we could have two main
components, a webui component and a validator/resultsdb submitter
component.
The webui component would be exactly that, the actual web UI for users
to interact with and submit their results to. It would query the
validator/submitter component to find out what relevant 'test events'
were available, and what tests and environments and so forth for each
event, and then present an appropriate UI to the user for them to fill
in their results.
The validator/submitter component would be responsible for watching out
for new composes and keeping track of tests and 'test environments' (if
we keep that concept); it would have an API with endpoints you could
query for this kind of information in order to construct a result
submission, and for submitting results in some kind of defined form. On
receiving a result it would validate it according to some schemas that
admins of the system could configure (to ensure the report is for a
known compose, image, test and test environment, and do some checking
of stuff like the result status, user who submitted the result, comment
content, stuff like that). Then it'd forward the result to resultsdb.
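As a sketch of that validation step - every schema field and allowed value below is a hypothetical placeholder, not anything agreed on:

```python
# Purely illustrative sketch of the validator/submitter's checking step.
# All field names and allowed values here are hypothetical placeholders.
KNOWN_COMPOSES = {"Fedora-25-20161120.0"}
KNOWN_TESTS = {"QA:Testcase_Boot_default_install"}
VALID_STATUSES = {"pass", "fail", "warn"}

def validate_report(report):
    """Check a submitted result against admin-configured schemas.

    Returns a list of error strings; an empty list means the report is
    valid and would be forwarded on to resultsdb.
    """
    errors = []
    if report.get("compose") not in KNOWN_COMPOSES:
        errors.append("unknown compose")
    if report.get("test") not in KNOWN_TESTS:
        errors.append("unknown test")
    if report.get("status") not in VALID_STATUSES:
        errors.append("invalid status")
    return errors
```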
This is just an idea, though. There are a few reasons I thought it
might make sense to separate these two elements:
* It gives us flexibility in a few important respects:
* The validator/submitter could accept results from other things, not
just the webUI - e.g. relval
* The validator/submitter could send results to other things, not
just ResultsDB - e.g. the wiki
* The validator/submitter could be written to allow expansion to
cover things other than release validation results, e.g. Test Day
results, so a future rewrite of the 'testdays' webapp could use it
* It should help with splitting up the work between people; different
people can work on the web UI and the validator/submitter without
blocking each other too often
So these are just my very early thoughts on the project, it'd be great
to know what other folks think! If we can agree on a basic architecture
and plan we could start setting up projects (I think I'd suggest we do
this in Pagure, but we can also consider Phabricator) and tickets for
the initial work.
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Re: Release validation NG: planning thoughts
by Adam Williamson
On Tue, 2016-11-29 at 19:41 +0530, Kanika Murarka wrote:
> Hey everyone,
> I have some thoughts for the project:-
>
> 1. We can have a notification system, which gives a notifications like:-
> * 'There is a test day coming for this compose in 2 days'
> * 'A new compose has been added'
> Something to motivate and keep reminding testers about test days and new
> composes.
Yeah, this is certainly going to be needed, if only to replace
the Wikitcms event creation notification emails (these are sent by
'relvalconsumer', which is the fedmsg consumer bot that creates the
events).
> 2. Keep a record of no. of validation test done by a tester and highlight
> it once he login. A badge is being prepared for no. of validation testing
> done by a contributor[1].
Well, this information would kind of inevitably be collected at least
in resultsdb and probably wind up in the validator/submitter component's DB
too, depending on exactly how we set things up. For badge purposes,
we're *certainly* going to have this system firing off fedmsgs in all
directions, so the badges can be granted just based on the fedmsgs.
'User W reported a X for test Y on compose Z' (or similar) is a very
obvious fedmsg to emit.
> 3. Someway to show that testing for a particular compose is not required
> now, so testers can move on to newer composes.
We're talking about approximately this in the design ticket. My initial
design idea would *only* show images for the 'current' validation event
if you need to download an image for testing; I don't really see an
awful lot of point in offering older images for download. I suggested
offering events from the previous week or so for selection if you
already have an image downloaded, to prevent people having to download
new images all the time but also prevent us getting uselessly old
reports.
I'd see it as the validator/submitter component's job to keep track of
information about events/composes (however we conceive it), like when
they appeared, and the web UI's job to make decisions about which to
actually show people.
> 4. Also, we can add a 'sort by priority' option in the list of test images.
Yes, something like that, at least. The current system actually does
something more or less like this. The download tables on the wiki pages
are not randomly ordered, but ordered using a weighting provided by
fedfind which includes the importance of the image subvariant as a
factor:
https://pagure.io/fedora-qa/fedfind/blob/master/f/fedfind/helpers.py#_331
It currently penalizes ARM images quite heavily, which is not because
ARM isn't important, but a craven surrender to the practical realities
of wiki tables: they look a lot better if all the ARM disk images are
grouped together than if they're interspersed throughout the table. We
obviously have more freedom to avoid this issue in the design of the
new system.
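To illustrate the kind of weighting fedfind applies - the real logic and numbers live in fedfind/helpers.py; everything below is made up for illustration:

```python
# Illustrative weight-based image ordering, in the spirit of fedfind's
# helpers.py. The weights, subvariants and penalty values are invented.
IMAGES = [
    {"subvariant": "Minimal", "arch": "armhfp"},
    {"subvariant": "Workstation", "arch": "x86_64"},
    {"subvariant": "Server", "arch": "x86_64"},
]

SUBVARIANT_WEIGHT = {"Workstation": 0, "Server": 1, "Minimal": 5}
ARCH_PENALTY = {"armhfp": 100}  # push ARM disk images to the end, together

def sort_key(img):
    """Lower key sorts first: subvariant importance plus any arch penalty."""
    return (ARCH_PENALTY.get(img["arch"], 0)
            + SUBVARIANT_WEIGHT.get(img["subvariant"], 10))

ordered = sorted(IMAGES, key=sort_key)
```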
Thanks for the thoughts!
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
openQA update heads-up
by Adam Williamson
Hey folks! Just a quick heads-up about updates coming to the openQA
instances. I've upgraded staging to F25 today, that seems to have gone
pretty much flawlessly. I'm building current git snapshots of os-
autoinst and openQA at present and will update staging to those as
well, and we'll see how things look over the next few days. Depending
on how that goes I'll aim to upgrade prod to F25 and update it to the
git snapshots shortly afterwards. Hoping this goes a bit smoother than
last cycle and I don't wind up spending another month cleaning up
upstream issues...
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
2016-11-07 @ **15:00 UTC** - Fedora QA Devel Meeting
by Tim Flink
# Fedora QA Devel Meeting
# Date: 2016-11-07
# Time: 15:00 UTC (note time change)
(https://fedoraproject.org/wiki/Infrastructure/UTCHowto)
# Location: #fedora-meeting-1 on irc.freenode.net
Note the time change - as with the other QA meetings, we are keeping in
sync with US DST.
It's the second half of that time of the year again - the time when
nobody is quite sure what time meetings are at because many clocks have
changed.
https://phab.qadevel.cloud.fedoraproject.org/w/meetings/20161107-fedoraqa...
If you have any additional topics, please reply to this thread or add
them in the wiki doc.
Tim
Proposed Agenda
===============
Announcements and Information
-----------------------------
- Please list announcements or significant information items below so
the meeting goes faster
Tasking
-------
- Does anyone need tasks to do?
Potential Other Topics
----------------------
- Docker Testing Status
- Dist-Git Task Storage Proposal (and test case docs)
- Rebuilding Taskotron instances
Open Floor
----------
- TBD
stats-bodhi license
by Adam Williamson
I'm back working on moving fedora-qa to Pagure. I'm now dealing with
the 'stats' scripts, and there's a problem: it appears that stats-bodhi
has never actually been properly licensed. It has no license header or
license text, and AFAICS, never has. I can't simply declare it to be
F/OSS licensed, as I didn't write it.
Can Lukas or Kamil give us a license declaration for this code? Thanks!
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Proposal to move things from fedora-qa.git to Pagure
by Adam Williamson
Hey folks!
So, if no-one has any objections, I'm intending to move the contents of
fedora-qa.git from fedorahosted to Pagure. At the same time, I think
it'd make sense to split some things out into their own projects. My
rough plan is to split out at least check-compose, relvalconsumer and
stats into separate projects. I'm not sure which of the other things
it's worth splitting out.
I'll probably put the new projects in the fedora-qa namespace and under
the fedora-qa group (if I can). git seems to have some fairly nifty
capabilities for isolating the history of individual files /
directories:
https://blogs.atlassian.com/2014/04/tear-apart-repository-git-way/
so we should be able to produce decent histories for each new project.
Does anyone mind me going ahead and doing this? And importantly, is
anyone aware of any significant deployments besides the ones I'm
already looking after (openQA boxes etc) which use the stuff from this
git repo, and would need to be updated to pull from the new project
repos?
Thanks, everyone!
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net