New ExecDB
by Josef Skladanka
With ResultsDB and Trigger rewrite done, I'd like to get started on ExecDB.
The current ExecDB is more of a tech preview, meant to show that it's
possible to consume the push notifications from Buildbot. The thing is
that the code doing it is quite a mess (mostly because the notifications
are quite a mess), and it's tied directly not only to Buildbot, but quite
probably to the one version of Buildbot we currently use.
I'd like to change the process so that ExecDB provides an API, and
Buildbot (or possibly any other execution tool we use in the future)
just uses that API to switch the execution states.
ExecDB should be the hub where we can look up the execution state and
statistics of our jobs/tasks. The execution is tied together via a UUID,
provided by ExecDB at Trigger time. The UUID is passed through the whole
stack, from Trigger to ResultsDB.
The process, as I envision it, is:
1) Trigger consumes FedMsg
2) Trigger creates a new Job in ExecDB, storing data like FedMsg message
id, and other relevant information (to make rescheduling possible)
3) ExecDB provides the UUID, marks the Job as SCHEDULED and Trigger then
passes the UUID, along with other data, to Buildbot.
4) Buildbot runs runtask (and sets the ExecDB job to RUNNING)
5) Libtaskotron is provided the UUID, so it can then be used to report
results to ResultsDB.
6) Libtaskotron reports to ResultsDB, using the UUID as the Group UUID.
7) Libtaskotron ends, creating a status file in a known location
8) The status file contains machine-parsable information about the
runtask execution - either "OK" or a description of the "Fault" (network
failed, package to be installed did not exist, koji did not respond... you
name it)
9) Buildbot parses the status file, and reports back to ExecDB, marking the
Job either as FINISHED or CRASHED (+details)
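To make steps 7-9 a bit more concrete, here is a minimal sketch of how the
status file could be turned into a job state. The file format is not
decided yet, so the JSON layout, the "status" and "details" keys, and the
parse_status() name are all assumptions made up for illustration.

```python
import json
from enum import Enum


class JobState(Enum):
    FINISHED = "FINISHED"
    CRASHED = "CRASHED"


def parse_status(text):
    """Map the contents of a status file to (JobState, details).

    Assumed (hypothetical) format: a JSON object such as
    {"status": "OK"} or {"status": "Fault", "details": "koji did not respond"}.
    """
    try:
        data = json.loads(text)
    except ValueError:
        # An unreadable status file is itself treated as a crash.
        return JobState.CRASHED, "unreadable status file"
    if not isinstance(data, dict):
        return JobState.CRASHED, "unreadable status file"
    if data.get("status") == "OK":
        return JobState.FINISHED, None
    return JobState.CRASHED, data.get("details", "unknown fault")
```

Anything that is not an explicit "OK" ends up as CRASHED here, so a
missing or broken status file never gets reported as a success.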
This will need changes in Buildbot steps - a step that switches the job to
RUNNING at the beginning, and a step that handles the FINISHED/CRASHED
switch. The way I see it, this can be done via a simple curl or HTTPie call
from the command line. No big issue here.
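The state switching behind such an API could look roughly like the sketch
below. It is an in-memory stand-in, not the real ExecDB: the class and
method names are hypothetical, and allowing SCHEDULED -> CRASHED (for jobs
that die before runtask starts) is my assumption.

```python
import uuid

# Allowed state transitions; terminal states have no successors.
ALLOWED = {
    "SCHEDULED": {"RUNNING", "CRASHED"},
    "RUNNING": {"FINISHED", "CRASHED"},
    "FINISHED": set(),
    "CRASHED": set(),
}


class ExecDB:
    """In-memory illustration of the job bookkeeping (hypothetical API)."""

    def __init__(self):
        self.jobs = {}

    def create_job(self, fedmsg_id, trigger_data):
        """Called by Trigger: store rescheduling data, hand back the UUID."""
        job_uuid = str(uuid.uuid4())
        self.jobs[job_uuid] = {
            "state": "SCHEDULED",
            "fedmsg_id": fedmsg_id,
            "trigger_data": trigger_data,
            "details": None,
        }
        return job_uuid

    def set_state(self, job_uuid, new_state, details=None):
        """Called by the execution tool (e.g. a Buildbot step)."""
        job = self.jobs[job_uuid]
        if new_state not in ALLOWED[job["state"]]:
            raise ValueError(
                "cannot go from %s to %s" % (job["state"], new_state))
        job["state"] = new_state
        job["details"] = details
```

A Buildbot step would then only need the UUID and the target state, e.g.
set_state(job_uuid, "RUNNING") at the beginning and
set_state(job_uuid, "CRASHED", details="...") at the end.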
We should make sure that ExecDB stores data that:
1) show the execution state
2) allow job re-scheduling
3) describe the reason the Job CRASHED
1 is obviously the state. 2 I think can be satisfied by storing the Fedmsg
Message ID and/or the Trigger-parsed data, which are passed to Buildbot.
Here I'd like to focus on 3:
My initial idea was to have SCHEDULED, RUNNING, FINISHED states, and four
crashed states, to describe where the fault was:
- CRASHED_TASKOTRON for when the error is on "our" side (minion could not
be started, git repo with task not cloned...)
- CRASHED_TASK to use when there's an unhandled exception in the Task code
- CRASHED_RESOURCES when network is down, etc
- CRASHED_OTHER whenever we are not sure
The point of the crashed "classes" is to be able to act on different kinds
of crashes - notify the right party, or even automatically reschedule the
job, in the case of a network failure, for example.
After talking this through with Kamil, I'd rather do something slightly
different. There would only be one CRASHED state, but the job would contain
additional information to
- find the right person to notify
- get more information about the cause of the failure
To do this, we came up with a structure like this:
{state: CRASHED, blame: [TASKOTRON, TASK, UNIVERSE], details:
"free-text-ish description"}
The "blame" classes are self-describing, although I'd love to have a better
name for "UNIVERSE". We might want to add more, should it make sense, but
my main focus is to find the right party to notify.
The "details" field will contain the actual cause of the failure (in the
case we know it), and although I have it marked as free-text, I'd like to
have a set of values defined in docs, to keep things consistent.
Doing this, we could record that "Koji failed, timed out" (and blame
UNIVERSE, and possibly reschedule) or "DNF failed, package not found"
(blame TASK if it was in the formula, and notify the task maintainer), or
"Minion creation failed" (and blame TASKOTRON, notify us, I guess).
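A rough sketch of how the three examples above could map onto the
structure. The lookup table, the crash_record() helper and the
"reschedule" flag are made up for illustration; the sketch also ignores
the "if it was in the formula" condition for the DNF case, and falling
back to TASKOTRON for unknown failures (so that we get notified and can
classify them) is my assumption.

```python
from enum import Enum


class Blame(Enum):
    TASKOTRON = "TASKOTRON"
    TASK = "TASK"
    UNIVERSE = "UNIVERSE"


# "Well known" failures: details -> (blame, worth rescheduling?)
KNOWN_FAILURES = {
    "Koji failed, timed out": (Blame.UNIVERSE, True),
    "DNF failed, package not found": (Blame.TASK, False),
    "Minion creation failed": (Blame.TASKOTRON, False),
}


def crash_record(details):
    """Build the CRASHED record for a given failure description."""
    # Unknown failures are blamed on TASKOTRON (assumption, see above).
    blame, reschedule = KNOWN_FAILURES.get(details, (Blame.TASKOTRON, False))
    return {
        "state": "CRASHED",
        "blame": blame.value,
        "details": details,
        "reschedule": reschedule,
    }
```

With a table like this, new "well known" failures can be added one at a
time, without touching the code that does the notifying or rescheduling.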
Implementing the crash classification will obviously take some time, but it
can be gradual, and we can start handling the "well known" failures soon,
for the bigger gain (kparal had some examples, IIRC).
So - what do you think about it? Is it a good idea? Do you feel like there
should be more (I can't really imagine there being fewer) blame targets
(like NETWORK, for example), and if so, why, and which? How about the
details - should we go with a pre-defined set of values (because enums are
better than free-text, but adding more would mean DB changes), or is
free-text + docs fine? Or do you see some other, better solution?
joza
6 years, 8 months
Proposal to move things from fedora-qa.git to Pagure
by Adam Williamson
Hey folks!
So, if no-one has any objections, I'm intending to move the contents of
fedora-qa.git from fedorahosted to Pagure. At the same time, I think
it'd make sense to split some things out into their own projects. My
rough plan is to split out at least check-compose, relvalconsumer and
stats into separate projects. I'm not sure which of the other things
it's worth splitting out.
I'll probably put the new projects in the fedora-qa namespace and under
the fedora-qa group (if I can). git seems to have some fairly nifty
capabilities for isolating the history of individual files /
directories:
https://blogs.atlassian.com/2014/04/tear-apart-repository-git-way/
so we should be able to produce decent histories for each new project.
Does anyone mind me going ahead and doing this? And importantly, is
anyone aware of any significant deployments besides the ones I'm
already looking after (openQA boxes etc) which use the stuff from this
git repo, and would need to be updated to pull from the new project
repos?
Thanks, everyone!
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
6 years, 10 months
Proposal to CANCEL: 2016-10-31 Fedora QA Devel Meeting
by Tim Flink
I'm not aware of any topics that need to be discussed/reviewed as a
group this week, so I propose that we cancel the weekly Fedora QA devel
meeting.
If there are any topics that I'm forgetting about and/or you think
should be brought up with the group, reply to this thread and we can
un-cancel the meeting.
Tim
6 years, 11 months
Moving all my tools to Pagure
by Adam Williamson
Hey folks! Just a heads up that I'm moving all the repos I maintain to
the fedora-qa space on Pagure. That includes:
(python-)wikitcms
relval
fedfind
testdays
The new projects will be:
https://pagure.io/fedora-qa/python-wikitcms
https://pagure.io/fedora-qa/relval
https://pagure.io/fedora-qa/fedfind
https://pagure.io/fedora-qa/testdays
testdays is already migrated, and I'm in the middle of doing wikitcms
now (renaming it as it goes). The others I'll get to later today I
hope.
The pages for each tool on happyassassin.net will go away and the URLs
will simply redirect to the Pagure project pages.
I plan to push a final commit to each repo on happyassassin.net/cgit
which will just have a 'MOVED' file or something with the Pagure
project info. (Except I haven't bothered for 'testdays', because it's a
small thing I don't think anyone else really uses). I'll leave that up
for a few weeks or something, then kill cgit from happyassassin
entirely.
This saves me maintaining the cgit setup and the front pages, and means
there's now a handy place to file issues and pull requests for each
project; I'm going to remove the Phabricator issue / PR tracking for
these and just go with Pagure, unless anyone yells that they really
want to be able to send issues/PRs via Phab.
I guess we should also migrate the stuff from
https://fedorahosted.org/fedora-qa/ soon.
Thanks folks!
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
6 years, 11 months
Resultsdb v2.0 - API docs
by Josef Skladanka
Hey gang,
I spent most of today working on the new API docs for ResultsDB, making use
of the even better Apiary.io tool.
Before I put even more hours into it, please let me know whether you think
it's fine at all - I have yet to find a better tool for describing APIs, so
I'm definitely biased, but since it's the Documentation, it also needs to
be useful.
http://docs.resultsdb20.apiary.io/
I am also trying to put more work towards documenting the attributes and
the "usual" queries, so please try and think about this aspect of the docs
too.
Thanks, Joza
6 years, 11 months
What to do with fedora-qa (fedorahosted is dying)
by Adam Williamson
We still have a few miscellaneous things hosted in:
https://git.fedorahosted.org/cgit/fedora-qa.git
Since fedorahosted is dying next February, what should we do with them?
Is this the point where we should finally decide whether to use
Phabricator's built-in repository support or Pagure for this stuff and
the stuff we currently host on bitbucket?
We also still use the fedorahosted *trac* for non-code-related activity
tracking, but I guess that's better followed up on test@.
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
6 years, 11 months