On Thu, Feb 20, 2014 at 05:48:23PM -0500, Nir Soffer wrote:
Can you describe a situation where guessing the lockspace is useful?
No, I'll remove this until there's a reason for it.
> This was the one big question I had about the design. If it's necessary
> to address more than one host simultaneously I can do that, but I'll need
> to go back and come up with a more complex design. The existing design is
> simple (and completely compatible with the existing format) because it
> uses three unused fields in the delta lease area. So, perhaps think a
> little more about how important this would be and let me know.
We would like to be able to fence more than one host at a time, but having
a backward-compatible format is more important.
This can help when a network issue causes many hosts to become
inaccessible, and you have highly available VMs on those hosts that
should be started as soon as possible on another host.
OK, I'll keep the current method.
> How configurable would you need:
> 1. a daemon config option (set it when the daemon starts)
I think this will be good enough.
OK
> - sanlock/wdmd/watchdog lease protection + WD_RESET host_message
> . send victim the WD_RESET host_message, which would cause the victim
> to force its own watchdog to expire in a minute
> . assume nothing about the receipt or effect of WD_RESET
> . wait for a fixed timeout from the victim host's last storage renewal
> . assume that the victim's watchdog has reset due to no lease renewal
> . let programs use locks/resources that the victim had been using
We do not want to assume that the victim's watchdog has reset the machine.
What we plan to do is to wait until the host is up and query the state
of the VMs before we start these VMs on another host.
OK, the obvious limitation being a host that really lost power or had a
hardware failure.
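
To make the fixed-timeout step in the quoted sequence concrete, here is a
rough sketch of the arithmetic only. It is not sanlock code: the function
names and parameters are hypothetical, the timeout is whatever value the
daemon is configured with, and both timestamps are assumed to come from
the same clock.

```python
import time


def seconds_until_assumed_reset(last_renewal, fixed_timeout, now=None):
    """Seconds left before the victim may be assumed reset by its watchdog.

    last_renewal  -- time of the victim's last storage renewal (assumed
                     to be on the same clock as `now`)
    fixed_timeout -- hypothetical fixed wait, in seconds
    """
    if now is None:
        now = time.monotonic()
    deadline = last_renewal + fixed_timeout
    # Never report a negative wait once the deadline has passed.
    return max(0.0, deadline - now)


def may_reuse_resources(last_renewal, fixed_timeout, now=None):
    """True once the fixed timeout since the last renewal has elapsed."""
    return seconds_until_assumed_reset(last_renewal, fixed_timeout, now) == 0.0
```

Because the deadline is computed from the last renewal, any renewal the
victim manages to write after WD_RESET is sent pushes the deadline out,
which matches "assume nothing about the receipt or effect of WD_RESET".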
> Notice that:
> - the end goal/result is the same in all cases
> - there are assumptions made in all cases
Some assumptions are more likely to hold than others. When you talk to a
power management device and it tells you that the machine is powered off,
there is very little chance that this is not true.
Human error plugging hosts into the wrong power outlet is probably the
biggest problem.
I'll try to get the patch updated next week.
Dave