Fraser Tweedale wrote:
On Wed, Oct 24, 2018 at 04:49:21PM -0400, Rob Crittenden via
FreeIPA-devel wrote:
> I started a design of an IPA healthcheck framework at
>
> https://www.freeipa.org/page/V4/Healthcheck
>
> Have at it.
>
> Note that this concentrates more on how it will work big picture and
> less on individual checks that may be performed. I'm happy to add any
> ideas you come up with for specific tests.
>
> rob
>
Thanks Rob, feedback below.
1. I think we should consider promoting the server hostname into the
object, with attribute name 'ipaErrorHost' (or whatever). This may
make some kinds of searches easier, e.g. if you have
ipa[123].bne.example.com and ipa[123].bos.example.com, and you are
interested in errors from the bne site, you can search for
'(ipaErrorHost=*.bne.example.com)'. We can index the attribute.
We already have a fqdn attribute in the IPA schema. I'd prefer to re-use
that. It has the eq, pres and sub indices.
It does make some sense to group in per-host subtrees but because
there is no subtree delete operation a flat container might be worth
it for the additional search flexibility.
Yes, I suppose if we specify the master within the entry that is
sufficient. Let's agree on what to call the master and I'll make this
change.
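As a rough illustration of the per-site search discussed above, here is a minimal sketch that builds a substring filter against the existing fqdn attribute. The helper names (escape_filter_value, site_filter) are made up for this example and are not part of IPA; the escaping follows the LDAP filter string rules.

```python
def escape_filter_value(value: str) -> str:
    """Escape characters that are special in LDAP search filters."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\x00": r"\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)


def site_filter(domain_suffix: str, attr: str = "fqdn") -> str:
    """Build a substring filter matching hosts under the given domain suffix."""
    return "(%s=*.%s)" % (attr, escape_filter_value(domain_suffix))


# Hypothetical usage: all reports from masters in the bne site.
print(site_filter("bne.example.com"))
```

A filter of this shape is a trailing-substring match, so it relies on the sub index mentioned above.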
2. Schema and indices:
- for ipaErrorDateReported and ipaErrorDateResolved, specify:
EQUALITY generalizedTimeMatch
ORDERING generalizedTimeOrderingMatch
- for ipaSeverity specify:
EQUALITY integerMatch
ORDERING integerOrderingMatch
- ipaIgnoreError specify: EQUALITY booleanMatch
- ipaIgnoreError being MAY is a pitfall. Assuming absence
implies "not ignored", searching for:
(ipaIgnoreError=FALSE)
will _exclude_ entries without the ipaIgnoreError attribute.
The correct filter is '(!(ipaIgnoreError=TRUE))'. Better to
make it a MUST attribute and avoid this pitfall.
- We probably want presence index for ipaErrorDateResolved
Done.
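To make the optional-attribute pitfall concrete, here is a toy model of equality matching over three entries. It is not a real LDAP client; the entry names are invented, and it only assumes that an equality assertion on an absent attribute does not match, so a negated match on TRUE picks up entries that lack the attribute.

```python
# Toy model of LDAP equality matching; entries and names are hypothetical.
entries = [
    {"cn": "report-1", "ipaIgnoreError": "TRUE"},
    {"cn": "report-2", "ipaIgnoreError": "FALSE"},
    {"cn": "report-3"},  # attribute absent: intended to mean "not ignored"
]


def eq(attr, value):
    """(attr=value): an entry without the attribute does not match."""
    return lambda entry: entry.get(attr) == value


def negate(pred):
    """(!(filter)): invert the result of the inner filter."""
    return lambda entry: not pred(entry)


def search(pred):
    """Return the cn of every entry the filter matches."""
    return [e["cn"] for e in entries if pred(e)]


naive = search(eq("ipaIgnoreError", "FALSE"))            # (ipaIgnoreError=FALSE)
negated = search(negate(eq("ipaIgnoreError", "TRUE")))   # (!(ipaIgnoreError=TRUE))

print(naive)    # ['report-2']
print(negated)  # ['report-2', 'report-3']
```

With a MUST attribute every entry carries an explicit value, and the two searches return the same set.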
3. Execution; we might want a watchdog to kill checks that take too
long (for whatever reason). There'll be some complexity so maybe
just make a note not to code ourselves into a corner and we can
defer it.
Added. I also added a config file so it can be overridden. I think I
need to explore configuration a bit more. Ideally most of the config
would be stored in LDAP (e.g. if you want to disable a whole set of
tests from running).
A local config for timeout is preferred in case LDAP is inaccessible for
some reason.
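A minimal sketch of the watchdog idea, assuming each check can be run as a separate process and that the timeout comes from a local INI-style config. The section name "healthcheck", the option name "timeout", the default value and the function names are all assumptions for illustration, not the actual design.

```python
import configparser
import subprocess
import sys

DEFAULT_TIMEOUT = 10.0  # seconds; assumed default, not from the design page


def load_timeout(config_text: str) -> float:
    """Read the timeout from local config text, falling back to the default."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.getfloat("healthcheck", "timeout", fallback=DEFAULT_TIMEOUT)


def run_check(argv, timeout: float) -> str:
    """Run one check as a child process and kill it if it exceeds the timeout."""
    try:
        proc = subprocess.run(argv, timeout=timeout, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising.
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    timeout = load_timeout("[healthcheck]\ntimeout = 0.5\n")
    slow_check = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_check(slow_check, timeout))
```

Because the timeout is read locally, the watchdog keeps working when LDAP itself is the thing that is broken.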
4. (Comment) regarding the separate repo, I'm not against it but
there's some interdependency, i.e. HC will depend on a lot of stuff
from ipalib, but the IPA healthcheck plugin will also depend on
stuff defined by HC. What bits will live where is not fully clear.
We might have to work it out as we go.
I'm not dead set on this but it might be nice, and it would act as a
check on the developer API changing. I added a bit more verbiage.
5. CLI: the '--source' option has not been defined. Does '--tool'
mean the same thing?
6. Terminology: not sure about "source"/"command" (especially
"command", which could be confusing: "what command failed?"). Some
ideas: command -> item/check/fault. I don't care about bikeshedding
the strings, I just want to avoid overloaded/confusing terms.
PLEASE, bikeshed away! As you can see I'm having a heck of a time coming
up with a good way to specify the group of tests versus an individual
test. This is key to understanding everything so good naming is
important. I'm very open to suggestions on this.
7. CLI: there is some inconsistency with how other IPA commands work
(not necessarily bad, but it should be justified). If we follow the
IPA pattern:
- `ipa healthcheck-show UUID` would show a single report
- `ipa healthcheck-find` would have a `--master=HOSTNAME` filter
option.
- `--all` would show all attributes, and there would be a separate
option to show ignored reports (e.g. `--include-ignored`).
So again, we don't have to do it that way, but the current design is
a deviation from the norm so I think that should be discussed from a
usability perspective.
Yes, this is complicated. If we want to drop it, and I'm perfectly ok
with it, we'd have to have extremely atomic, uniquely named individual
tests within a plugin. For example, to check on file ownership one way
to do it would be with a table:
    files = [
        ('/etc/httpd/alias/key3.db', 'root', 'apache', '0640'),
        ('/etc/httpd/alias/cert8.db', 'root', 'apache', '0640'),
        ...
    ]
    for (file, owner, group, mode) in files:
        [ test ]
How would we name a particular failure? This is why I went with UUID.
Similar applies to the certmonger tracking. We have 8 or so tracking
requests by default, if one or more fail we'd report each one
individually but how to name them automatically? I punted.
Honestly I think the -show command will be used more within the UI than
the CLI. The -find command will show the same information.
8. Can a single tool+command combo produce multiple reports for a
single master, with different ipaErrorMessage key-value pairs?
Example: file permissions. Is every possible file to check a
different tool+command, or is it one tool+command, with potentially
multiple reports with different ipaErrorMessage parameters?
Exactly. I imagined a separate report for every single failure.
Consider this from a usability perspective: the resolution is likely
to be very similar for all the possible instantiations. Also
consider how many tool+command combinations there would be if all
the possible files to check had to have different names. Lookup
tables for error message generation and external resources get huge.
The failure lookup is by the plugin and particular test. I kept them
separate so it is insanely easy to track which ones are resolved and
when (if 3 files have bad perms and are reported in a single LDAP entry
and one is fixed, what do we do?). I looked at it like a transaction file.
OTOH if a single tool+command can produce multiple reports, it
affects the API/CLI somewhat (e.g. `ipa healthcheck-ignore` must now
be given the UUID or enough parameters to uniquely identify the
report to ignore).
Yes. Icky but necessary.
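A small in-memory sketch of what "UUID or enough parameters" could mean in practice: the ignore operation refuses to act unless the criteria select exactly one report. ReportStore and its methods are hypothetical and stand in for whatever the LDAP-backed API ends up being.

```python
import uuid


class ReportStore:
    """Hypothetical in-memory stand-in for the LDAP report container."""

    def __init__(self):
        self.reports = {}

    def add(self, master, tool, command, params):
        """Store one report and return its generated UUID."""
        report_id = str(uuid.uuid4())
        self.reports[report_id] = {
            "master": master,
            "tool": tool,
            "command": command,
            "params": dict(params),
            "ignored": False,
        }
        return report_id

    def find(self, **criteria):
        """Return the UUIDs of all reports matching every given criterion."""
        return [
            rid for rid, report in self.reports.items()
            if all(report.get(key) == value for key, value in criteria.items())
        ]

    def ignore(self, report_id=None, **criteria):
        """Mark exactly one report as ignored, selected by UUID or criteria."""
        if report_id is not None:
            matches = [report_id] if report_id in self.reports else []
        else:
            matches = self.find(**criteria)
        if len(matches) != 1:
            raise LookupError("expected exactly one report, found %d" % len(matches))
        self.reports[matches[0]]["ignored"] = True
        return matches[0]
```

With two file-permission reports from the same tool+command, ignoring by tool and command alone is ambiguous and fails, while the UUID or the full parameter set selects a single report.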
9. Would be good to include links to external resources etc in
healthcheck-show. Also to indicate when 'ipa-healthcheck' may be
able to repair the issue (may reduce support burden if we can subtly
encourage the administrator to run the repair tool instead of
contacting support or the mailing list).
I've been a bit vague about working with the user on how to resolve a
particular problem. We have a few obvious options:
1) external documentation: wiki, downstream docs, both?
2) a separate LDAP lookup table
I've added a section on this and some additional schema as a starting
point for discussion.
That's all for now :) Overall the design is looking good.
Thanks for the feedback
rob