Nir Soffer has uploaded a new change for review.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
vm: Improve error handling when Vm._dom is None
Vm._dom is initialized to None on when creating a vm, and set to None if the underlying libvirt domain has died. Since Vm._dom is modified by multiple threads, it is impossible to check for None before using it. Even if it was possible, we don't want to litter the code with None checks everywhere.
Some code was using this pattern:
try: self._dom.doSomething() except AttributeError: # Oh, it was None
This code is not communicating well our intent. Worse, it hides AttributeError in doSomthing()!
Most code never check self._dom before using it. In the rare cases it is None, we fail with this Traceback:
Thread-460::ERROR::2015-07-04 19:53:21,977::__init__::520::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 515, in _serveRequest res = method(**params) File "/usr/share/vdsm/rpc/Bridge.py", line 277, in _dynamicMethod result = fn(*methodArgs) File "/usr/share/vdsm/API.py", line 727, in freeze return v.freeze() File "/usr/share/vdsm/virt/vm.py", line 2882, in freeze frozen = self._dom.fsFreeze() AttributeError: 'NoneType' object has no attribute 'fsFreeze'
This traceback is a poor way to say "the vm is not running".
This patch introduces the DisconnectedVirDomain class. This object will raise NotConnectedError for any attribute access. Vm._dom is initialzied to DisconnectedVirDomain(vmid) on startup and after underlying libvirt domain has died.
Code trying to talk with a dead vm will fail now with:
NotConnectedError: Vm '681f6b09-a9c3-4422-a7e2-2f607368718b' is not running yet or was shut down.
Code handling disconnected state is using now:
try: self._dom.doSomething() except NotConnctedError: ...
This communicates our intent, and does not hide any error from the underlying code.
Code checking for None is using now:
if self._dom.connected: ...
Change-Id: I5349ec51c7accf3b417b3bc9489c7eed5bfd8733 Signed-off-by: Nir Soffer nsoffer@redhat.com --- M tests/vmTests.py M tests/vmfakelib.py M vdsm/virt/sampling.py M vdsm/virt/vm.py 4 files changed, 47 insertions(+), 20 deletions(-)
git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/25/43225/1
diff --git a/tests/vmTests.py b/tests/vmTests.py index 038ffcc..5b63e89 100644 --- a/tests/vmTests.py +++ b/tests/vmTests.py @@ -1056,7 +1056,6 @@
def testDomainNotRunningWithoutDomain(self): with fake.VM() as testvm: - self.assertEqual(testvm._dom, None) self.assertFalse(testvm._isDomainRunning())
def testDomainNotRunningByState(self): @@ -1087,7 +1086,6 @@
def testDomainNoneNotReadyForCommands(self): with fake.VM() as testvm: - testvm._dom = None self.assertFalse(testvm.isDomainReadyForCommands())
def testReadyForCommandsRaisesLibvirtError(self): @@ -1492,7 +1490,6 @@
def testVmWithoutDom(self): with fake.VM() as testvm: - self.assertTrue(testvm._dom is None) self.assertAPIFailed(testvm.setBalloonTarget(128))
def testTargetValueNotInteger(self): diff --git a/tests/vmfakelib.py b/tests/vmfakelib.py index 211d647..d75198e 100644 --- a/tests/vmfakelib.py +++ b/tests/vmfakelib.py @@ -179,6 +179,10 @@ self._diskErrors = {} self._downtimes = []
+ @property + def connected(self): + return True + def _failIfRequested(self): if self._virtError != libvirt.VIR_ERR_OK: raise Error(self._virtError, self._errorMessage) diff --git a/vdsm/virt/sampling.py b/vdsm/virt/sampling.py index af0910a..a14a1ef 100644 --- a/vdsm/virt/sampling.py +++ b/vdsm/virt/sampling.py @@ -546,6 +546,9 @@ elif not vm_obj.isDomainReadyForCommands(): self._skip_doms[vm_id] = True else: + # TODO: This racy check may fail if the underlying libvirt + # domain has died just after checking isDomainReadyForCommands + # succeeded. doms.append(vm_obj._dom._dom) return doms
diff --git a/vdsm/virt/vm.py b/vdsm/virt/vm.py index 826e4e9..fa7c517 100644 --- a/vdsm/virt/vm.py +++ b/vdsm/virt/vm.py @@ -138,6 +138,13 @@ pass
+class NotConnectedError(Exception): + """ + Raised when trying to talk with a vm that was not started yet or was shut + down. + """ + + VALID_STATES = (vmstatus.DOWN, vmstatus.MIGRATION_DESTINATION, vmstatus.MIGRATION_SOURCE, vmstatus.PAUSED, vmstatus.POWERING_DOWN, vmstatus.REBOOT_IN_PROGRESS, @@ -188,6 +195,20 @@ pass
+class DisconnectedVirDomain(object): + + def __init__(self, vmid): + self.vmid = vmid + + @property + def connected(self): + return False + + def __getattr__(self, name): + raise NotConnectedError("VM %r was not started yet or was shut down" + % self.vmid) + + class NotifyingVirDomain: # virDomain wrapper that notifies vm when a method raises an exception with # get_error_code() = VIR_ERR_OPERATION_TIMEOUT @@ -195,6 +216,10 @@ def __init__(self, dom, tocb): self._dom = dom self._cb = tocb + + @property + def connected(self): + return True
def __getattr__(self, name): attr = getattr(self._dom, name) @@ -277,7 +302,7 @@ :param recover: Signal if the Vm is recovering; :type recover: bool """ - self._dom = None + self._dom = DisconnectedVirDomain(params["vmId"]) self.recovering = recover self.conf = {'pid': '0', '_blockJobs': {}, 'clientIp': ''} self.conf.update(params) @@ -810,7 +835,7 @@
def _onQemuDeath(self): self.log.info('underlying process disconnected') - self._dom = None + self._dom = DisconnectedVirDomain(self.id) # Try release VM resources first, if failed stuck in 'Powering Down' # state result = self.releaseVm() @@ -1555,10 +1580,8 @@ if e.get_error_code() == libvirt.VIR_ERR_OPERATION_INVALID: return errCode['migCancelErr'] raise - except AttributeError: - if self._dom is None: - return errCode['migCancelErr'] - raise + except NotConnectedError: + return errCode['migCancelErr'] finally: self._guestCpuLock.release()
@@ -1673,10 +1696,10 @@ def _isDomainRunning(self): try: status = self._dom.info() - except AttributeError: + except NotConnectedError: # Known reasons for this: # * on migration destination, and migration not yet completed. - # * self._dom may be set to None asynchronously (_onQemuDeath). + # * self._dom may be disconnected asynchronously (_onQemuDeath). # If so, the VM is shutting down or already shut down. return False else: @@ -1744,7 +1767,7 @@ pass raise Exception('destroy() called before Vm started')
- if self._dom is None: + if not self._dom.connected: raise MissingLibvirtDomainError(vmexitreason.LIBVIRT_START_FAILED)
self._updateDomainDescriptor() @@ -1852,7 +1875,7 @@ self._connection.lookupByUUIDString(self.id), self._timeoutExperienced) elif 'migrationDest' in self.conf: - pass # self._dom will be None until migration ends. + pass # self._dom will be disconnected until migration ends. elif 'restoreState' in self.conf: fromSnapshot = self.conf.get('restoreFromSnapshot', False) srcDomXML = self.conf.pop('_srcDomXML') @@ -2676,14 +2699,14 @@ """ try: state, details, stateTime = self._dom.controlInfo() - except AttributeError: + except NotConnectedError: # this method may be called asynchronously by periodic # operations. Thus, we must use a try/except block # to avoid racy checks. return False except libvirt.libvirtError as e: if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN: - # same as AttributeError above: possible race on shutdown + # same as NotConnectedError above: possible race on shutdown return False else: raise @@ -3615,7 +3638,7 @@ # Terminate the VM's creation thread. self._incomingMigrationFinished.set() self.guestAgent.stop() - if self._dom: + if self._dom.connected: result = self._destroyVmGraceful() if result['status']['code']: return result @@ -3707,7 +3730,7 @@
def setBalloonTarget(self, target):
- if self._dom is None: + if not self._dom.connected: return self._reportError(key='balloonErr') try: target = int(target) @@ -4363,11 +4386,11 @@ # Libvirt sometimes send the SUSPENDED/SUSPENDED_PAUSED event # after RESUMED/RESUMED_MIGRATED (when VM status is PAUSED # when migration completes, see qemuMigrationFinish function). - # In this case self._dom is None because the function + # In this case self._dom is disconnected because the function # _completeIncomingMigration didn't update it yet. try: domxml = self._dom.XMLDesc(0) - except AttributeError: + except NotConnectedError: pass else: hooks.after_vm_pause(domxml, self.conf) @@ -4383,7 +4406,7 @@ # callback however we do not use it. try: domxml = self._dom.XMLDesc(0) - except AttributeError: + except NotConnectedError: pass else: hooks.after_vm_cont(domxml, self.conf)
automation@ovirt.org has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 1:
* Update tracker::IGNORE, no Bug-Url found * Check Bug-Url::WARN, no bug url found, make sure header matches 'Bug-Url: ' and is a valid url. * Check merged to previous::IGNORE, Not in stable branch (['ovirt-3.5', 'ovirt-3.4', 'ovirt-3.3'])
Nir Soffer has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 1: Verified+1
Verified on rhel 7.1:
- Start and stop vms - Restart vdsm while vm are running - Restart vdsm while restarting vms - Migrate vms few times
Vinzenz Feenstra has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 1: Code-Review+1
Nice, I like this one :-)
Daniel Erez has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 1: Code-Review+1
Martin Polednik has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 1: Code-Review+1
Francesco Romani has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 1: Code-Review+1
I thought I already ACK'd this. Both Verification and code seems good.
Francesco Romani has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 1: Code-Review+2
automation@ovirt.org has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 2:
* Update tracker::IGNORE, no Bug-Url found * Check Bug-Url::WARN, no bug url found, make sure header matches 'Bug-Url: ' and is a valid url. * Check merged to previous::IGNORE, Not in stable branch (['ovirt-3.5', 'ovirt-3.4', 'ovirt-3.3'])
Francesco Romani has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 2: Code-Review+1
looks good, preliminar ACK
automation@ovirt.org has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 3:
* Update tracker::IGNORE, no Bug-Url found * Check Bug-Url::WARN, no bug url found, make sure header matches 'Bug-Url: ' and is a valid url. * Check merged to previous::IGNORE, Not in stable branch (['ovirt-3.5', 'ovirt-3.4', 'ovirt-3.3'])
Nir Soffer has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 3: Verified+1
Vinzenz Feenstra has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 3: Code-Review+1
Francesco Romani has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 3: Code-Review+2
(1 comment)
raising score(s)
https://gerrit.ovirt.org/#/c/43225/3/vdsm/virt/sampling.py File vdsm/virt/sampling.py:
Line 551: self._skip_doms[vm_id] = True Line 552: else: Line 553: # TODO: This racy check may fail if the underlying libvirt Line 554: # domain has died just after checking isDomainReadyForCommands Line 555: # succeeded. note to self: fix in a future patch Line 556: doms.append(vm_obj._dom._dom) Line 557: return doms Line 558: Line 559:
Dan Kenigsberg has submitted this change and it was merged.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
vm: Improve error handling when Vm._dom is None
Vm._dom is initialized to None on when creating a vm, and set to None if the underlying libvirt domain has died. Since Vm._dom is modified by multiple threads, it is impossible to check for None before using it. Even if it was possible, we don't want to litter the code with None checks everywhere.
Some code was using this pattern:
try: self._dom.doSomething() except AttributeError: # Oh, it was None
This code is not communicating well our intent. Worse, it hides AttributeError in doSomthing()!
Most code never check self._dom before using it. In the rare cases it is None, we fail with this Traceback:
Thread-460::ERROR::2015-07-04 19:53:21,977::__init__::520::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 515, in _serveRequest res = method(**params) File "/usr/share/vdsm/rpc/Bridge.py", line 277, in _dynamicMethod result = fn(*methodArgs) File "/usr/share/vdsm/API.py", line 727, in freeze return v.freeze() File "/usr/share/vdsm/virt/vm.py", line 2882, in freeze frozen = self._dom.fsFreeze() AttributeError: 'NoneType' object has no attribute 'fsFreeze'
This traceback is a poor way to say "the vm is not running".
This patch introduces the DisconnectedVirDomain class. This object will raise NotConnectedError for any attribute access. Vm._dom is initialzied to DisconnectedVirDomain(vmid) on startup and after underlying libvirt domain has died.
Code trying to talk with a dead vm will fail now with:
NotConnectedError: Vm '681f6b09-a9c3-4422-a7e2-2f607368718b' is not running yet or was shut down.
Code handling disconnected state is using now:
try: self._dom.doSomething() except NotConnctedError: ...
This communicates our intent, and does not hide any error from the underlying code.
Code checking for None is using now:
if self._dom.connected: ...
Change-Id: I5349ec51c7accf3b417b3bc9489c7eed5bfd8733 Signed-off-by: Nir Soffer nsoffer@redhat.com Reviewed-on: https://gerrit.ovirt.org/43225 Continuous-Integration: Jenkins CI Reviewed-by: Vinzenz Feenstra vfeenstr@redhat.com Reviewed-by: Francesco Romani fromani@redhat.com --- M tests/vmTests.py M tests/vmfakelib.py M vdsm/virt/sampling.py M vdsm/virt/vm.py 4 files changed, 47 insertions(+), 20 deletions(-)
Approvals: Nir Soffer: Verified Jenkins CI: Passed CI tests Vinzenz Feenstra: Looks good to me, but someone else must approve Francesco Romani: Looks good to me, approved
automation@ovirt.org has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 4:
* Update tracker::IGNORE, no Bug-Url found * Set MODIFIED::IGNORE, no Bug-Url found.
Nir Soffer has posted comments on this change.
Change subject: vm: Improve error handling when Vm._dom is None ......................................................................
Patch Set 3:
(1 comment)
https://gerrit.ovirt.org/#/c/43225/3/vdsm/virt/sampling.py File vdsm/virt/sampling.py:
Line 551: self._skip_doms[vm_id] = True Line 552: else: Line 553: # TODO: This racy check may fail if the underlying libvirt Line 554: # domain has died just after checking isDomainReadyForCommands Line 555: # succeeded.
note to self: fix in a future patch
The issue is is not only accessing private of a private (extremely evil :-), but the fact that vm_obj._dom may be replaced behind your back with DisconnectedDomain, while you hold the old object, so the call you do later may use a invalid object or one that we should not access.
You should add vm_obj to the list, and handle DisconnectedError if the vm was disconnected while the periodic loop was trying to access it. Line 556: doms.append(vm_obj._dom._dom) Line 557: return doms Line 558: Line 559:
vdsm-patches@lists.fedorahosted.org