Pavel Zhukov has uploaded a new change for review.
Change subject: vgscan doesn't return a nonzero return code if one or more physical volumes are filtered (missed). We should check the vg before raising the exception. chkVG itself raises storageAccessError if it fails, so selftest doesn't have to do it a second time. ......................................................................
vgscan doesn't return a nonzero return code if one or more physical volumes are filtered (missed). We should check the vg before raising the exception. chkVG itself raises storageAccessError if it fails, so selftest doesn't have to do it a second time.
Change-Id: If1eeed1c203f2c8c73370987048565d665932299 Bugzilla-Url: https://bugzilla.redhat.com/1022976 Signed-off-by: Pavel Zhukov pzhukov@redhat.com --- M vdsm/storage/blockSD.py 1 file changed, 1 insertion(+), 1 deletion(-)
git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/52/20552/1
diff --git a/vdsm/storage/blockSD.py b/vdsm/storage/blockSD.py
index b95be21..4b21c35 100644
--- a/vdsm/storage/blockSD.py
+++ b/vdsm/storage/blockSD.py
@@ -846,7 +846,7 @@
             self._lastUncachedSelftest = now
             lvm.chkVG(self.sdUUID)
         elif lvm.getVG(self.sdUUID).partial != lvm.VG_OK:
-            raise se.StorageDomainAccessError(self.sdUUID)
+            lvm.chkVG(self.sdUUID)

     def validate(self):
         """
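For readers following along, the branch this one-line change alters can be sketched as runnable Python. This is a toy reconstruction, not vdsm code: FakeLvm, FakeVG, VG_OK and the `stale` flag are invented stand-ins; only the branch structure mirrors the quoted hunk.

```python
# Toy reconstruction of the selftest branch touched by the diff above.
# All classes here are invented for illustration.

class StorageDomainAccessError(Exception):
    pass

VG_OK = "OK"

class FakeVG:
    def __init__(self, partial):
        self.partial = partial

class FakeLvm:
    def __init__(self, partial, chk_ok=True):
        self._partial = partial
        self._chk_ok = chk_ok
        self.chk_calls = 0

    def getVG(self, sdUUID):
        return FakeVG(self._partial)

    def chkVG(self, sdUUID):
        # chkVG raises on failure itself, which is why the patch can
        # replace the explicit raise with a second check.
        self.chk_calls += 1
        if not self._chk_ok:
            raise StorageDomainAccessError(sdUUID)

def selftest(lvm, sdUUID, stale):
    if stale:
        lvm.chkVG(sdUUID)
    elif lvm.getVG(sdUUID).partial != VG_OK:
        # before the patch: raise StorageDomainAccessError(sdUUID)
        # after the patch: let chkVG rebuild the stale filter, or raise
        lvm.chkVG(sdUUID)
```

With this shape, a partial VG leads to a vgck run instead of an immediate failure; the host only errors out if chkVG itself fails.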
oVirt Jenkins CI Server has posted comments on this change.
Change subject: vgscan doesn't return a nonzero return code if one or more physical volumes are filtered (missed). We should check the vg before raising the exception. chkVG itself raises storageAccessError if it fails, so selftest doesn't have to do it a second time. ......................................................................
Patch Set 1:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5078/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4274/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5152/ : SUCCESS
Tomáš Došek has posted comments on this change.
Change subject: vgscan doesn't return a nonzero return code if one or more physical volumes are filtered (missed). We should check the vg before raising the exception. chkVG itself raises storageAccessError if it fails, so selftest doesn't have to do it a second time. ......................................................................
Patch Set 1:
Nack - we also need to reset self._lastUncachedSelftest = now in the elif branch, to avoid the whole section being repeated indefinitely (possible race).
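The reset Tomáš asks for might look roughly like this. This is a sketch only: the injected lvm stub and the 300-second timeout are assumptions made for illustration, while the attribute names follow the quoted vdsm code.

```python
# Sketch of selftest with the timestamp reset Tomáš requests.

class DomainSelftest:
    TIMEOUT = 300  # assumed value, seconds

    def __init__(self, lvm, sdUUID):
        self._lvm = lvm
        self.sdUUID = sdUUID
        self._lastUncachedSelftest = 0

    def selftest(self, now):
        if now - self._lastUncachedSelftest > self.TIMEOUT:
            self._lastUncachedSelftest = now
            self._lvm.chkVG(self.sdUUID)
        elif self._lvm.getVG(self.sdUUID).partial != self._lvm.VG_OK:
            # Reset the timestamp here too, so a partial VG does not
            # re-trigger this branch on every call (the race mentioned).
            self._lastUncachedSelftest = now
            self._lvm.chkVG(self.sdUUID)

class StubLvm:
    # Minimal stand-in: chkVG "repairs" the partial state, mimicking a
    # rebuilt filter making the VG complete again.
    VG_OK = "OK"

    def __init__(self):
        self.partial = "PARTIAL"
        self.chk_calls = 0

    def getVG(self, sdUUID):
        return self

    def chkVG(self, sdUUID):
        self.chk_calls += 1
        self.partial = self.VG_OK
```

The point of the extra assignment is that once chkVG has been run for a partial VG, the next calls within the timeout window do nothing unless the VG goes partial again.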
Tomáš Došek has posted comments on this change.
Change subject: vgscan doesn't return a nonzero return code if one or more physical volumes are filtered (missed). We should check the vg before raising the exception. chkVG itself raises storageAccessError if it fails, so selftest doesn't have to do it a second time. ......................................................................
Patch Set 4: Code-Review+1
oVirt Jenkins CI Server has posted comments on this change.
Change subject: vgscan doesn't return a nonzero return code if one or more physical volumes are filtered (missed). We should check the vg before raising the exception. chkVG itself raises storageAccessError if it fails, so selftest doesn't have to do it a second time. ......................................................................
Patch Set 2:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5080/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4276/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5154/ : SUCCESS
oVirt Jenkins CI Server has posted comments on this change.
Change subject: vgscan doesn't return a nonzero return code if one or more physical volumes are filtered (missed). We should check the vg before raising the exception. chkVG itself raises storageAccessError if it fails, so selftest doesn't have to do it a second time. ......................................................................
Patch Set 3: Code-Review-1 Verified-1
Build Failed
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5081/ : UNSTABLE
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4277/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5155/ : ABORTED
oVirt Jenkins CI Server has posted comments on this change.
Change subject: vgscan doesn't return a nonzero return code if one or more physical volumes are filtered (missed). We should check the vg before raising the exception. chkVG itself raises storageAccessError if it fails, so selftest doesn't have to do it a second time. ......................................................................
Patch Set 4: Verified-1
Build Failed
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5082/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4278/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5156/ : ABORTED
oVirt Jenkins CI Server has posted comments on this change.
Change subject: vgscan doesn't return a nonzero return code if one or more physical volumes are filtered (missed). We should check the vg before raising the exception. chkVG itself raises storageAccessError if it fails, so selftest doesn't have to do it a second time. ......................................................................
Patch Set 5:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5085/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4281/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5159/ : SUCCESS
Dan Kenigsberg has posted comments on this change.
Change subject: vgscan doesn't return a nonzero return code if one or more physical volumes are filtered (missed). We should check the vg before raising the exception. chkVG itself raises storageAccessError if it fails, so selftest doesn't have to do it a second time. ......................................................................
Patch Set 5: Code-Review-1
(1 comment)
....................................................
Commit Message

Line 3: AuthorDate: 2013-10-25 11:10:39 +0200
Line 4: Commit: Pavel Zhukov pzhukov@redhat.com
Line 5: CommitDate: 2013-10-25 13:00:55 +0200
Line 6:
Line 7: vgscan doesn't return nonzero return code if one or more phisycal

please have a short line at the top, separated by a newline. Your current subject has double negation and is hard to comprehend. Please explain the problem at hand and how this patch solves it.

Line 8: volumes are filtered (missed). We should check vg before raising of
Line 9: the exception. chkVG itself raises storageAccessError if failed, so
Line 10: selftest doesn't have to it second time.
Line 11:
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Add additional vgcheck. ......................................................................
Patch Set 6:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5088/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4284/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5162/ : SUCCESS
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Add additional vgcheck. ......................................................................
Patch Set 7:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5089/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4285/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5163/ : SUCCESS
Tomáš Došek has posted comments on this change.
Change subject: Add additional vgcheck. ......................................................................
Patch Set 7: Code-Review+1
Pavel Zhukov has posted comments on this change.
Change subject: Add additional vgcheck to fix vgextend workflow on HSMs. ......................................................................
Patch Set 8:
I'm sorry for the push flooding. Added the main reason for the patch (the vgextend workflow) to the subject.
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Add additional vgcheck to fix vgextend workflow on HSMs. ......................................................................
Patch Set 8:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5090/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4286/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5164/ : SUCCESS
Tomáš Došek has posted comments on this change.
Change subject: Add additional vgcheck to fix vgextend workflow on HSMs. ......................................................................
Patch Set 8: Code-Review+1
Sergey Gotliv has posted comments on this change.
Change subject: Add additional vgcheck to fix vgextend workflow on HSMs. ......................................................................
Patch Set 8:
(2 comments)
....................................................
Commit Message

Line 6:
Line 7: Add additional vgcheck to fix vgextend workflow on HSMs.
Line 8:
Line 9: vgscan doesn't return nonzero return code if one or more physical
Line 10: volumes are missed during the _reloadvgs opration. As a result

Can you rephrase "doesn't return nonzero" with something like "returns zero even if one...", please.

Line 11: LvmCache doesn't invalidate stale filters and volume group is
Line 12: marked as partial and sefttest fails (host goes to non-operational status).

selftest

Line 13: We should call vgck before raising of the exception, because vgck
Line 14: returns error code 5 in case of missed PVs, as a result filters
Line 15: are rebuilded with correct uuids. chkVG itself raises
Line 16: storageAccessError if failed, so selftest doesn't have to do it second time.
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Add additional vgcheck to fix vgextend workflow on HSMs. ......................................................................
Patch Set 9:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5091/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4287/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5165/ : SUCCESS
Dan Kenigsberg has posted comments on this change.
Change subject: Add additional vgcheck to fix vgextend workflow on HSMs. ......................................................................
Patch Set 9: Code-Review+1
Nir Soffer has posted comments on this change.
Change subject: Add additional vgcheck to fix vgextend workflow on HSMs. ......................................................................
Patch Set 9: Code-Review-1
(1 comment)
I don't understand how this fixes the related bug. If the issue was a missing pv, how does running vgck again fix the missing pv?
If the issue is stale filters, why not invalidate the filters in _reloadvgs, when we find that a vg has missing pv? Why wait until selftest()?
....................................................
File vdsm/storage/blockSD.py

Line 844:
Line 845:         if now - self._lastUncachedSelftest > timeout or \
Line 846:            lvm.getVG(self.sdUUID).partial != lvm.VG_OK:
Line 847:             self._lastUncachedSelftest = now
Line 848:             lvm.chkVG(self.sdUUID)

The indentation is wrong, and using \ for breaking long lines is not recommended. If this is indeed a correct fix, I think this would be more clear:

    isStale = now - self._lastUncachedSelftest > timeout
    isPartial = lvm.getVG(self.sdUUID).partial != lvm.VG_OK

    if isStale or isPartial:
        self._lastUncachedSelftest = now
        lvm.chkVG(self.sdUUID)

Line 849:
Line 850:     def validate(self):
Line 851:         """
Line 852:         Validate that the storage domain metadata
Pavel Zhukov has posted comments on this change.
Change subject: Add additional vgcheck to fix vgextend workflow on HSMs. ......................................................................
Patch Set 9:
I don't understand how this fixes the related bug. If the issue was a missing pv, how does running vgck again fix the missing pv?
The issue is stale filters
If the issue is stale filters, why not invalidate the filters in _reloadvgs, when we find that a vg has missing pv? Why wait until selftest()?
Because _reloadvgs uses vgs (not vgck). vgs returns zero even if a PV is missed (see commit message), and we know nothing about missed PVs at this step. I thought about rebuilding filters before the vgs command, but do we really want to do that each time for each vg?
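Pavel's point - that invalidation keyed on a nonzero exit code can never fire for vgs - can be reduced to a toy model. Everything below is invented for illustration; it only models the policy described in the commit message, not the real LvmCache code.

```python
# Toy model of the invalidation policy under discussion: the cached lvm
# filter is dropped only when a command exits nonzero, so a vgs run that
# exits 0 despite a filtered-out PV leaves the stale filter in place.

class FilterCache:
    def __init__(self, devs):
        self.filter = set(devs)  # devices allowed through the lvm filter

    def after_command(self, rc):
        # "by design filter should be invalidated if cmd returns
        # nonzero code" -- vgs returning 0 never takes this branch.
        if rc != 0:
            self.filter = None   # invalidated; rebuilt on next use
        return rc

cache = FilterCache({"36001-old-pv"})
cache.after_command(0)   # vgs "succeeds" despite a missing PV
assert cache.filter == {"36001-old-pv"}   # stale filter survives
cache.after_command(5)   # vgck reports missing PVs with a nonzero code
assert cache.filter is None               # now the filter is rebuilt
```

This is why the patch routes the check through vgck: only a command that actually fails can trip the existing invalidation path.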
Tomáš Došek has posted comments on this change.
Change subject: Add additional vgcheck to fix vgextend workflow on HSMs. ......................................................................
Patch Set 9: Code-Review+1
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Add additional vgcheck to fix vgextend workflow on HSMs. ......................................................................
Patch Set 10:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5095/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4291/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5169/ : SUCCESS
Tomáš Došek has posted comments on this change.
Change subject: Add additional vgcheck to fix vgextend workflow on HSMs. ......................................................................
Patch Set 10: Verified+1
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Invalidate filters before reloading vgs ......................................................................
Patch Set 11:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5106/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4302/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5180/ : SUCCESS
Ayal Baron has posted comments on this change.
Change subject: Invalidate filters before reloading vgs ......................................................................
Patch Set 11: Code-Review-1
This defeats the purpose of short filters [1], which I believe already take care of this, so this patch seems irrelevant for 3.3 (we still need to understand under what conditions this happens in 3.2).
[1] http://gerrit.ovirt.org/#/c/17968/
Pavel Zhukov has abandoned this change.
Change subject: Invalidate filters before reloading vgs ......................................................................
Abandoned
Fixed in http://gerrit.ovirt.org/#/c/17968/
Pavel Zhukov has restored this change.
Change subject: Invalidate filters before reloading vgs ......................................................................
Restored
Checked http://gerrit.ovirt.org/#/c/17968/ - it doesn't solve the problem.
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Invalidate filters before reloading vgs ......................................................................
Patch Set 12:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5118/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4314/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5192/ : SUCCESS
Ayal Baron has posted comments on this change.
Change subject: Invalidate filters before reloading vgs ......................................................................
Patch Set 12: Code-Review-1
reloadvgs is the wrong place to do this, as it deprives the cache of meaning and will become a serious problem. The vgextend flow calls getDevicesVisibility, so at the least the invalidation should be limited to that verb.
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 13:
Build Successful
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5135/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4331/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5209/ : SUCCESS
Sergey Gotliv has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 13: Code-Review+1
Ayal Baron has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 13: Code-Review+2
Tomáš Došek has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 13: Code-Review+1
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 13: Code-Review-1
(1 comment)
....................................................
File vdsm/storage/hsm.py

Line 2004:         boolean
Line 2005:         :rtype: dict
Line 2006:         """
Line 2007:         visibility = self.scanDevicesVisibility(guids)
Line 2008:         lvm._lvminfo.invalidateFilter()

Can we avoid using a private data member here? Could you explain why adding a side effect to getDevicesVisibility() is a good thing?

Line 2009:         for guid in guids:
Line 2010:             if visibility[guid]:
Line 2011:                 visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012:                                     stat.S_IRUSR != 0)
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 14: Code-Review-1 Verified-1
Build Failed
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5223/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4344/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5148/ : UNSTABLE
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 15: Code-Review-1 Verified-1
Build Failed
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5224/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4345/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5149/ : UNSTABLE
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16: Verified-1
Build Failed
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5225/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4346/ : FAILURE
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5150/ : SUCCESS
Pavel Zhukov has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16: Verified+1
Steps to verify:
- Extend the VG with a new LUN.
- Wait for the next vgs call (domainMonitor selftest), up to 300 sec.
- The new device should be in the filter set, and the vgs output should contain nothing on stderr. (Without the patch applied, the new device is not in the filter set.)
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16: Code-Review-1
(2 comments)
Looks good, but I think it can be more correct.
....................................................
File vdsm/storage/hsm.py

Line 1985:     def scanDevicesVisibility(self, guids):
Line 1986:         visible = lambda guid: (guid, os.path.exists(
Line 1987:             os.path.join("/dev/mapper", guid)))
Line 1988:         visibleDevs = map(visible, guids)
Line 1989:         if not all(visibleDevs):

This seems to be the correct way to invalidate the filters - we found a device which is not visible. Can you validate this?

Line 1990:             multipath.rescan()
Line 1991:             visibleDevs = map(visible, guids)
Line 1992:         return dict(visibleDevs)
Line 1993:

Line 2004:         boolean
Line 2005:         :rtype: dict
Line 2006:         """
Line 2007:         visibility = self.scanDevicesVisibility(guids)
Line 2008:         lvm.invalidateFilter()

I suspect that this is not the most correct way to invalidate the filter. If all guids are visible, the filters are probably correct as well. If we choose to invalidate the filters here anyway, it should be done either before calling scanDevicesVisibility, or after processing the visibility dict. Not sure why you chose to insert the call in the middle of the existing code.

Line 2009:         for guid in guids:
Line 2010:             if visibility[guid]:
Line 2011:                 visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012:                                     stat.S_IRUSR != 0)
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16:
One more thing - please document the fact that filters are invalidated by the function that invalidates them.
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16:
(1 comment)
....................................................
File vdsm/storage/hsm.py

Line 2004:         boolean
Line 2005:         :rtype: dict
Line 2006:         """
Line 2007:         visibility = self.scanDevicesVisibility(guids)
Line 2008:         lvm.invalidateFilter()

And I am still missing an explanation why adding this side-effect to getDevicesVisibility solves anything.

Line 2009:         for guid in guids:
Line 2010:             if visibility[guid]:
Line 2011:                 visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012:                                     stat.S_IRUSR != 0)
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16:
(1 comment)
....................................................
File vdsm/storage/hsm.py

Line 2004:         boolean
Line 2005:         :rtype: dict
Line 2006:         """
Line 2007:         visibility = self.scanDevicesVisibility(guids)
Line 2008:         lvm.invalidateFilter()

getDeviceVisibility is called before engine extends a volume group, to make sure that all hosts see the new physical volume. Here it is likely that the host does not yet know the pv and will discover it using multipath (see scanDeviceVisibility). So getDeviceVisibility already has a side effect and a poor name, but it is too late to fix that.

When a new pv is found, the filters must be invalidated, otherwise volume groups that use that pv become partial, because we filter all calls to lvm with the list of known pvs. Having a partial vg will cause selftest() to fail, and the host will become non-operational.

This should probably be documented in the relevant code.

Line 2009:         for guid in guids:
Line 2010:             if visibility[guid]:
Line 2011:                 visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012:                                     stat.S_IRUSR != 0)
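The "partial vg" mechanism Nir describes can be condensed into a few lines. This is a toy model, not the vdsm implementation: sets of device names stand in for the real lvm filter, and the function name is invented.

```python
# Toy model: lvm commands run with a filter built from the known
# devices; a pv outside the filter is invisible to the command, so its
# vg is reported partial, which in turn fails selftest().

def vg_state(known_devs, vg_pvs):
    visible = vg_pvs & known_devs
    return "OK" if visible == vg_pvs else "PARTIAL"

known = {"pv1"}                       # filter built before the extend
assert vg_state(known, {"pv1", "pv2"}) == "PARTIAL"   # selftest fails

known = {"pv1", "pv2"}                # after invalidating + rescanning
assert vg_state(known, {"pv1", "pv2"}) == "OK"
```

The fix under review is exactly the second transition: invalidating the filter so the newly mapped pv joins the known set before vgs runs again.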
Sergey Gotliv has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16:
(1 comment)
....................................................
File vdsm/storage/hsm.py

Line 2004:         boolean
Line 2005:         :rtype: dict
Line 2006:         """
Line 2007:         visibility = self.scanDevicesVisibility(guids)
Line 2008:         lvm.invalidateFilter()

Nir, at this point the device is already connected via iscsi; otherwise running a multipath rescan won't help if the device is invisible. So if multipath is properly configured and its daemon was running when the iscsi connection was established, then /dev/mapper/{guid} is already created, therefore the device is visible, but we still have to invalidate the filters. This fix looks fine; it can move one row up if it really matters, but this is the place.

Line 2009:         for guid in guids:
Line 2010:             if visibility[guid]:
Line 2011:                 visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012:                                     stat.S_IRUSR != 0)
Sergey Gotliv has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16:
(1 comment)
....................................................
File vdsm/storage/hsm.py

Line 2004:         boolean
Line 2005:         :rtype: dict
Line 2006:         """
Line 2007:         visibility = self.scanDevicesVisibility(guids)
Line 2008:         lvm.invalidateFilter()

Just to be clear, it can move into scanDevicesVisibility as well, but not into the "if" there.

Line 2009:         for guid in guids:
Line 2010:             if visibility[guid]:
Line 2011:                 visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012:                                     stat.S_IRUSR != 0)
Sergey Gotliv has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16: Code-Review+1
Tomáš Došek has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16: Code-Review+1
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16:
(1 comment)
....................................................
File vdsm/storage/hsm.py

Line 2004:         boolean
Line 2005:         :rtype: dict
Line 2006:         """
Line 2007:         visibility = self.scanDevicesVisibility(guids)
Line 2008:         lvm.invalidateFilter()

The code in scanDevicesVisibility is trying to handle the situation where a new pv is not known yet.

Now the question is whether the filters are stale in the first case, where the device is already known but is not part of the vg. If they are stale, then this patch is correct.

We use filters in two ways - some calls use shorter filters, where only the relevant pvs are used, and some use a long filter with all the known devices. The question is why the call failed - because we used a wrong short filter (the pv was not part of the vg), or a wrong long filter (unknown pv).

Line 2009:         for guid in guids:
Line 2010:             if visibility[guid]:
Line 2011:                 visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012:                                     stat.S_IRUSR != 0)
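The short-versus-long filter distinction above can be modelled in a few lines. Sets stand in for the real lvm --config filter strings, and every name below is invented for illustration.

```python
# Toy rendering of short vs long lvm filters. A short filter admits only
# the pvs of the vg being operated on; a long filter admits every device
# the host currently knows about. Either kind goes stale the same way:
# a pv outside the filter makes a command that needs it fail.

def build_filter(known_devs, vg_pvs=None):
    # vg_pvs given  -> short filter (just that vg's pvs)
    # vg_pvs absent -> long filter (all known devices)
    return set(vg_pvs) if vg_pvs is not None else set(known_devs)

def call_succeeds(filter_devs, needed_pvs):
    return needed_pvs <= filter_devs

known = {"pv1"}                 # host has not learned about pv2 yet
needed = {"pv1", "pv2"}         # vg was just extended with pv2
assert not call_succeeds(build_filter(known), needed)           # long: stale
assert not call_succeeds(build_filter(known, {"pv1"}), needed)  # short: stale
```

Both filters fail here for different reasons, which is Nir's question: was the failure a short filter built from an outdated pv list, or a long filter that simply never heard of the device?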
Ayal Baron has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16: Code-Review+2
(1 comment)
....................................................
File vdsm/storage/hsm.py

Line 2004:         boolean
Line 2005:         :rtype: dict
Line 2006:         """
Line 2007:         visibility = self.scanDevicesVisibility(guids)
Line 2008:         lvm.invalidateFilter()

Seeing as this bug was found on 3.2, which didn't have short filters, the answer is pretty clear.

Line 2009:         for guid in guids:
Line 2010:             if visibility[guid]:
Line 2011:                 visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012:                                     stat.S_IRUSR != 0)
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16: Code-Review-1
(6 comments)
....................................................
Commit Message

Line 5: CommitDate: 2013-10-30 20:01:34 +0100
Line 6:
Line 7: Invalidate filters on HSMs before rescanning extended VG
Line 8:
Line 9: domainMonitor thread might use stale filters while running vgs command

filters in plural, or the global filter?

Line 10: after extending of the VG. As a result LvmCache doesn't invalidate

English sentence is not clear: "As a result LvmCache..."? Did you mean "Since LvmCache doesn't invalidate the stale filter,"?

Line 11: stale filters, volume group is marked as partial and selftest fails
Line 12: (host goes to non-operational status). By design filter should be

maybe: "currently, the filter is invalidated if cmd returns nonzero code but vgs returns zero even if devices are filtered out."

Line 13: invalidated if cmd returns nonzero code but vgs returns zero even if
Line 14: devices are filtered . The only way to invalidate filter now is flush LvmCache.
Line 15: To avoid perfomance issues the patch introduce public method for filter invalidation

keep lines shorter, introduce->introduces

Line 16: and calls it from the getDevicesVisibility instead of _reloadvgs.

The commit message lacks an explanation on why it makes sense to invalidate the filter on getDevicesVisibility.

Line 17:
Line 18: Change-Id: If1eeed1c203f2c8c73370987048565d665932299
Line 19: Bugzilla-Url: https://bugzilla.redhat.com/1022976

....................................................
File vdsm/storage/hsm.py

Line 2004:         boolean
Line 2005:         :rtype: dict
Line 2006:         """
Line 2007:         visibility = self.scanDevicesVisibility(guids)
Line 2008:         lvm.invalidateFilter()

That's a good explanation, that should appear in the commit message.

Line 2009:         for guid in guids:
Line 2010:             if visibility[guid]:
Line 2011:                 visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012:                                     stat.S_IRUSR != 0)
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 16:
oh please explain whether this solves a similar "partial" condition on hsm hosts.
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 17:
Build Successful
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4405/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5209/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5285/ : SUCCESS
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 17: Code-Review-1
(2 comments)
sorry, I still do not understand how this fix works.
....................................................
Commit Message

Line 12: group is marked as partial and domainMonitor selftest fails
Line 13: (host goes to non-operational status). By design filter should be
Line 14: invalidated if cmd returns nonzero code but vgs returns zero even if
Line 15: devices are filtered . The only way to invalidate filter now is flush LvmCache.
Line 16: To avoid perfomance issues the patch introduces public method for filter invalidation

please keep shorter lines

Line 17: and calls it from the getDevicesVisibility. As far as getDevicesVisibility
Line 18: is called to all added devices this allows to reload filters every time
Line 19: after new devices has been mapped but before vgscan.
Line 20:
Line 15: devices are filtered . The only way to invalidate filter now is flush LvmCache.
Line 16: To avoid perfomance issues the patch introduces public method for filter invalidation
Line 17: and calls it from the getDevicesVisibility. As far as getDevicesVisibility
Line 18: is called to all added devices this allows to reload filters every time
Line 19: after new devices has been mapped but before vgscan.

Please state that getDevicesVisibility is called on the spm before extendSD, and explain what happens on hsm hosts (why don't they get into "partial" state?).

Line 20:
Line 21: Change-Id: If1eeed1c203f2c8c73370987048565d665932299
Line 22: Bugzilla-Url: https://bugzilla.redhat.com/1022976
Sergey Gotliv has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 17:
(1 comment)
....................................................
Commit Message

Line 15: devices are filtered . The only way to invalidate filter now is flush LvmCache.
Line 16: To avoid perfomance issues the patch introduces public method for filter invalidation
Line 17: and calls it from the getDevicesVisibility. As far as getDevicesVisibility
Line 18: is called to all added devices this allows to reload filters every time
Line 19: after new devices has been mapped but before vgscan.

getDeviceVisibility() is hsm call, not spm!

Line 20:
Line 21: Change-Id: If1eeed1c203f2c8c73370987048565d665932299
Line 22: Bugzilla-Url: https://bugzilla.redhat.com/1022976
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 17:
(3 comments)
I think that this fix is correct. The only thing holding this now is more clear commit message. If you use the suggested commit message I'm sure Dan will be happy to merge this soon.
....................................................
Commit Message

Line 3: AuthorDate: 2013-10-25 11:10:39 +0200
Line 4: Commit: Pavel Zhukov pzhukov@redhat.com
Line 5: CommitDate: 2013-11-04 09:39:47 +0100
Line 6:
Line 7: Invalidate filters on HSMs before rescanning extended VG

Let's use:
    hsm: Invalidate lvm filter before extending a volume group

Line 8:
Line 9: domainMonitor thread might use stale lvm filters while running vgs command
Line 10: after extending of the VG (added PV is missed in the filter set).
Line 11: Since LvmCache doesn't invalidate filters, the volume
Line 16: To avoid perfomance issues the patch introduces public method for filter invalidation
Line 17: and calls it from the getDevicesVisibility. As far as getDevicesVisibility
Line 18: is called to all added devices this allows to reload filters every time
Line 19: after new devices has been mapped but before vgscan.
Line 20:

Let's use something like:
lvm module maintains a cache of the global filter, used to speed up lvm operations. The cache becomes stale after we extend a volume group with a new physical volume, leading to failure in the domain monitor thread, where a volume group is found partial.
The current code tries to invalidate the filter when an lvm command fails, but this logic is not good enough for commands such as vgs, which do not return a nonzero exit code when a vg is found partial.
This patch invalidates the lvm filter cache before extending a volume group, ensuring that the next lvm command will build up a new filter, and would not report a volume group as partial when it is actually not.
The change is made in getDevicesVisibility(), since it is called before extending a volume group, to make sure that all hosts are seeing the new physical volume.

Line 21: Change-Id: If1eeed1c203f2c8c73370987048565d665932299
Line 22: Bugzilla-Url: https://bugzilla.redhat.com/1022976
....................................................
File vdsm/storage/hsm.py

Line 2004: boolean
Line 2005: :rtype: dict
Line 2006: """
Line 2007: visibility = self.scanDevicesVisibility(guids)
Line 2008: lvm.invalidateFilter()

This is not critical, but I think the lvm.invalidateFilter() call should be before the visibility scan.
There are a few reasons:
- This is an important side effect, and we don't want to hide it in the middle of the visibility processing.
- You don't want to break the visibility logic anyway.
- It will merge more cleanly into 3.2.

Line 2009: for guid in guids:
Line 2010: if visibility[guid]:
Line 2011: visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012: stat.S_IRUSR != 0)
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 17:
(2 comments)
....................................................
Commit Message

Line 15: devices are filtered . The only way to invalidate filter now is flush LvmCache.
Line 16: To avoid perfomance issues the patch introduces public method for filter invalidation
Line 17: and calls it from the getDevicesVisibility. As far as getDevicesVisibility
Line 18: is called to all added devices this allows to reload filters every time
Line 19: after new devices has been mapped but before vgscan.

sure! but according to Nir's explanation, we expect it to be called on the spm!

Line 20:
Line 21: Change-Id: If1eeed1c203f2c8c73370987048565d665932299
Line 22: Bugzilla-Url: https://bugzilla.redhat.com/1022976
Line 16: To avoid perfomance issues the patch introduces public method for filter invalidation
Line 17: and calls it from the getDevicesVisibility. As far as getDevicesVisibility
Line 18: is called to all added devices this allows to reload filters every time
Line 19: after new devices has been mapped but before vgscan.
Line 20:

I'm missing an explanation of why this cannot be done at the beginning of the extendStorageDomain verb, and of what happens on other hosts in the cluster, which would have stale filters as well.

Line 21: Change-Id: If1eeed1c203f2c8c73370987048565d665932299
Line 22: Bugzilla-Url: https://bugzilla.redhat.com/1022976
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 17:
(1 comment)
The missing piece?
....................................................
Commit Message

Line 16: To avoid perfomance issues the patch introduces public method for filter invalidation
Line 17: and calls it from the getDevicesVisibility. As far as getDevicesVisibility
Line 18: is called to all added devices this allows to reload filters every time
Line 19: after new devices has been mapped but before vgscan.
Line 20:

Consider adding to the commit message:
    Before extending a volume group, engine invokes getDevicesVisibility() on
    all hosts to verify that all hosts are seeing all physical volumes. This
    change will cause all hosts to invalidate their filter, ensuring that the
    extended volume group will not become partial on any host when the storage
    is updated.

Line 21: Change-Id: If1eeed1c203f2c8c73370987048565d665932299
Line 22: Bugzilla-Url: https://bugzilla.redhat.com/1022976
Ayal Baron has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 17: Code-Review+2
(2 comments)
....................................................
Commit Message

Line 15: devices are filtered . The only way to invalidate filter now is flush LvmCache.
Line 16: To avoid perfomance issues the patch introduces public method for filter invalidation
Line 17: and calls it from the getDevicesVisibility. As far as getDevicesVisibility
Line 18: is called to all added devices this allows to reload filters every time
Line 19: after new devices has been mapped but before vgscan.

No, we don't. There is no problem on the spm; that works fine. This patch makes sure that the other hosts in the cluster work properly as well.

Line 20:
Line 21: Change-Id: If1eeed1c203f2c8c73370987048565d665932299
Line 22: Bugzilla-Url: https://bugzilla.redhat.com/1022976
....................................................
File vdsm/storage/hsm.py

Line 2004: boolean
Line 2005: :rtype: dict
Line 2006: """
Line 2007: visibility = self.scanDevicesVisibility(guids)
Line 2008: lvm.invalidateFilter()

Absolutely not; that is raceful. You first scan for new devices, then update the filter to include anything new found in the scan.

Line 2009: for guid in guids:
Line 2010: if visibility[guid]:
Line 2011: visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012: stat.S_IRUSR != 0)
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 17:
(1 comment)
....................................................
Commit Message

Line 15: devices are filtered . The only way to invalidate filter now is flush LvmCache.
Line 16: To avoid perfomance issues the patch introduces public method for filter invalidation
Line 17: and calls it from the getDevicesVisibility. As far as getDevicesVisibility
Line 18: is called to all added devices this allows to reload filters every time
Line 19: after new devices has been mapped but before vgscan.

So please state in the commit message that getDevVis is called on all DC hosts before extendSD. That would have answered my question. It's worth documenting in getDevVis, too.

Line 20:
Line 21: Change-Id: If1eeed1c203f2c8c73370987048565d665932299
Line 22: Bugzilla-Url: https://bugzilla.redhat.com/1022976
oVirt Jenkins CI Server has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18: Verified-1
Build Failed
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/4420/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_pep8_gerrit/5224/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/5300/ : FAILURE
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18: Code-Review+1
Tomáš Došek has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18: Code-Review+1
Sergey Gotliv has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18: Code-Review+1
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 17:
(1 comment)
....................................................
File vdsm/storage/hsm.py

Line 2004: boolean
Line 2005: :rtype: dict
Line 2006: """
Line 2007: visibility = self.scanDevicesVisibility(guids)
Line 2008: lvm.invalidateFilter()

Good point - therefore it must be documented in the code. As is, this line may be moved up later by mistake.

Line 2009: for guid in guids:
Line 2010: if visibility[guid]:
Line 2011: visibility[guid] = (os.stat('/dev/mapper/' + guid).st_mode &
Line 2012: stat.S_IRUSR != 0)
Pavel Zhukov has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18: Verified+1
Verification steps: extend the VG with a new LUN, then wait for the next vgs call (domainMonitor selftest), up to 300 sec. The new device should be in the filter set, and the vgs output should have empty stderr.
Ayal Baron has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18: Code-Review+2
Eduardo has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18: Code-Review-1
Very respectfully, I disagree with this solution. My opinion is that the design should be revised. A long mail about it was sent.
Sergey Gotliv has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
Eduardo, where can I read that mail?
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18: Code-Review+1
I'm not happy with the commit message, but hey, this is good enough, and everyone is waiting for this patch.
Please merge it.
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18: Code-Review-1
I am reluctant to merge, since I have understood (guess how) that this patch does not eliminate the race between extendVG (on the SPM) and domainMonitor (on an HSM).
After getDevVis is called on HSM (but before extendVG), the domainMonitor can sample the storage, and re-create an exact same filter.
Later, after the vg is extended, the domainMonitor would report it as "partial" again.
We should find a way to halt domainMonitor on HSMs before a vgextend, and restart it later. Alternatively, Engine should ignore false negative results during that period.
Sergey Gotliv has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
Dan,
Correct me if I am wrong, filters are created from here:
for dmInfoDir in glob("/sys/block/dm-*/dm/")
If device is visible means that:
os.stat('/dev/mapper/' + guid).st_mode & stat.S_IRUSR != 0
How is it possible that the device is visible, and now I am creating a filter, but my device is not part of it?
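Sergey's two quoted snippets can be combined into a small, self-contained toy: enumerate the dm devices the kernel exposes and build a whole-system "long" filter from them. The filter syntax and helper names below are illustrative approximations of what lvm.py does, not the real implementation.

```python
import os
from glob import glob

def build_long_filter(devices):
    # Accept exactly the given device paths, reject everything else,
    # in the accept/reject style of lvm.conf filters.
    accepts = ''.join('"a|^%s$|", ' % dev for dev in sorted(devices))
    return 'filter = [ %s"r|.*|" ]' % accepts

def visible_dm_devices():
    # Enumerate device-mapper devices the kernel currently exposes,
    # mirroring the glob("/sys/block/dm-*/dm/") loop quoted above.
    names = []
    for dm_info_dir in glob('/sys/block/dm-*/dm/'):
        try:
            with open(os.path.join(dm_info_dir, 'name')) as f:
                names.append('/dev/mapper/' + f.read().strip())
        except IOError:
            continue  # device disappeared mid-scan
    return names
```

The point Sergey is making follows directly: if a device's mapper node is visible at the moment the filter is rebuilt, it necessarily lands in the accept list.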
Federico Simoncelli has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18: Code-Review+1
Dan it seems that such race could not exist, I've been convinced to think that the flow would be:
- connectStorageServer
- getDevicesVisibility (refresh filters, including the new devices)
- extendStorageDomain (on SPM)
Let's be clear, I'm not thrilled by this solution at all, but if the assumption above is correct, it seems that it's the only thing that we can do now. I don't feel comfortable enough to give more than a +1.
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
What happens is this:
1. In getDevicesVisibility, I can see the new device (but it is not in the filter, because we do not update it normally)
2. We now mark the filter as stale (this patch)
3. SPM extends the vg with the new device
What we like to happen, is that the filter is updated after 3 by the next lvm command (filters are built when a command is run and filter is stale)
What may happen (the race Eduardo is talking about) is that the filter is updated before 3, when the new device is not yet used by the vg.
So yes there is little race, but it does not cause any harm, since the filter always contains all visible devices on the machine.
So this patch is good enough for 3.2 and 3.3 and should be merged *now*.
Should we have an even better way to do it in 3.4? Maybe, but that is not interesting now.
Ayal Baron has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
This talk about a race is simply incorrect. There is NO race here, not small, nor big.
There are two ways in lvm.py that a filter is created:

1. When you call cmd without any list of devices. In this case the filter you get will include *all* visible devices on the system, regardless of whether they belong to any VG or not. In such calls, invalidating the filter after calling getDevicesVisibility is the correct behaviour and there is no race (the filter *will* contain the new devices whether the vg has already been extended or not).

2. When you call cmd with a specific list of devices. In this case a filter is created based solely on this list and _filterStale is skipped entirely, so again, no race.
The only question is when and how the list of devices that are associated with the VG would get updated [1] and that is taken care of irrespective of this patch
[1] so that calls to cmd with short filter would be with full list of devices
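Ayal's two code paths can be sketched as a simplified model of the cmd()/_filterStale machinery he describes. This is a toy under the assumptions stated in his comment, not the actual vdsm code; the class and method names are merely modeled on lvm.py.

```python
def build_filter(devices):
    # Accept exactly the listed devices, reject everything else.
    accepts = ''.join('"a|^%s$|", ' % d for d in sorted(devices))
    return 'filter = [ %s"r|.*|" ]' % accepts

class LVMCache(object):
    def __init__(self, list_all_devices):
        # Callable returning every visible device on the system.
        self._list_all_devices = list_all_devices
        self._filter_stale = True
        self._filter = None

    def invalidate_filter(self):
        # The public method this patch adds: the next command using the
        # long filter rebuilds it from scratch.
        self._filter_stale = True

    def _long_filter(self):
        # Path 1: no explicit device list. The filter covers *all*
        # visible devices, so rebuilding it before or after vgextend
        # yields the same correct result -- no race.
        if self._filter_stale:
            self._filter = build_filter(self._list_all_devices())
            self._filter_stale = False
        return self._filter

    def cmd(self, args, devices=None):
        if devices is not None:
            # Path 2: explicit ("short") per-VG list; the staleness
            # flag is skipped entirely, so again no race.
            flt = build_filter(devices)
        else:
            flt = self._long_filter()
        return args + ['--config', 'devices { %s }' % flt]
```

With this model, the long filter always reflects the full device set at rebuild time, which is the crux of Ayal's "no race" argument.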
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
It's hard to have a multi-participant conversation in gerrit, but I'll try to join.
Sergey and Ayal: The devices may be visible in getDevVis, but clearly there has been a time in the past where they did not exist (that's the reason of this bug, devices popping up unattended).
If the domainMonitor has issued its vgs command when the new lun was not there, its filter would have lacked the new device. On other threads, getDevVis may now report the device as visible, a vgextend command takes place, and the old domainMonitor would return a dreaded "partial" result.
Federico: I believe that connectStorageServer is not sent for FC, which is the core of the bug. If it was called after the lun was visible on HSM, its invalidateStorage() call would have given HSM to rebuild its filter.
Nir: as Ayal said, this 3.2 bug is about the "long" filter, not the new per-vg short filter.
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
If the domainMonitor has issued its vgs command when the new lun was not there, it's filter would have lacked the new device. On other threads, getDevVis may now report the device as visible, a vgextend command takes place, and the old domainMonitor would return a dreaded "partial" result.
When domain monitor runs, it invokes some lvm commands; on the first command, the filter is rebuilt using all /dev/mapper devices. Since we invalidate the filter in getDeviceVisibility, the next lvm command after that will cause the filter to be correct.
If the domain monitor runs before vgextend, the vg is obviously not partial. If it runs after vgextend, the vg now contains the new pv, and the vg is not partial. In both cases the filter created by running the domain monitor is correct.
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
Moreover, even if the extend fails, the filter is still correct, because it contains all the devices in the system, not only the devices used by our vgs.
Pavel Zhukov has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
Dan,
On other threads, getDevVis may now report the device as visible, a vgextend command takes place, and the old domainMonitor would return a dreaded "partial" result.
It's not correct. domainMonitor uses an LVMCache instance. The instance is being updated in the _reloadvgs flow (not in vgck, unfortunately; the fix might be easy), so it'll never be marked as partial. The only thing here is that the vg is extended but domainMonitor uses the "old" cache until the next invalidateSD is called.
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
Pavel, Nir: Think of the following scenario: domainMonitor runs dom.selftest() that ends up calling lvmcache's cmd(). cmd() builds the command line based on the old filter. Later, on another thread, the filter is invalidated. cmd() does not care, and moves on with its oldish vgchk, returning "partial".
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
Pavel, Nir: Think of the following scenario: domainMonitor runs dom.selftest() that ends up calling lvmcache's cmd(). cmd() builds the command line based on the old filter. Later, on another thread, the filter is invalidated. cmd() does not care, and moves on with its oldish vgchk, returning "partial".
This cannot happen. I think you miss an important point here - the vg is not extended before getDevicesVisibility returns from all hosts and the new device is visible on all hosts.
So the next command on any host will include the new device in the filter, and will never fail with a partial vg.
Pavel Zhukov has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
Dan, I thought about it (that's why I submitted the wrong patch the first time, actually). But vgck is not able to manipulate (update) LVMCache (well, it can only if rc != 0, but it invalidates the filter before the second attempt anyway). The vgck logic works properly because it invalidates filters if the vg is partial (retcode = 5).
You can check the logs attached to the bugzilla. There are ~200 secs between the getDevVisibil. and vgs calls, so vgs used updated filters but an (actually) outdated LVMCache this time.
Ayal Baron has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
Dan, clearly there is not enough clarity on the flow, which is leading you to wrongly think there is a race here. The flow is:

On the SPM host:
- discover and connect to storage (optional)
- getDeviceList
- in GUI, choose a new device to add to the domain
Now, *before* vgextend is run, on *all* hosts:
- discover and connect to storage (if needed)
- getDevicesVisibility
Only after all hosts have reported that the new device(s) is visible (and filter is up to date) do we go back to the spm and run: vgextend
On top of all this, a host only moves to non-op after *5* minutes !!! (to differentiate between host and total storage failure).
So now is it clear that there is no race?
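The flow Ayal lays out above can be condensed into a toy orchestration sketch. The host class, method names, and error handling here are invented for illustration; the real orchestration lives engine-side and is not part of this patch.

```python
class FakeHost(object):
    """A toy stand-in for a vdsm host, just enough to model the flow."""

    def __init__(self, name, visible):
        self.name = name
        self.visible = set(visible)   # devices this host can see
        self.filter_stale = False
        self.extended = []

    def connect_storage_server(self):
        pass  # no-op for FC, as noted in the thread

    def get_devices_visibility(self, guids):
        # Side effect added by this patch: mark the lvm filter stale.
        self.filter_stale = True
        return {g: g in self.visible for g in guids}

    def vgextend(self, vg, devices):
        self.extended.append((vg, tuple(devices)))

def extend_flow(spm, hsm_hosts, vg, new_devices):
    # Engine-side orchestration as described in the thread (illustrative).
    for host in hsm_hosts + [spm]:
        host.connect_storage_server()
        vis = host.get_devices_visibility(new_devices)
        if not all(vis.values()):
            raise RuntimeError('%s cannot see all new devices' % host.name)
    # vgextend runs only after every host reports visibility and holds a
    # stale filter, so the next lvm command on any host includes the PV.
    spm.vgextend(vg, new_devices)
```

The ordering is the whole argument: by the time vgextend runs, every host has already been forced to rebuild its filter on its next lvm command.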
Ayal Baron has submitted this change and it was merged.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Invalidate filters on HSMs before rescanning extended VG
domainMonitor thread might use stale lvm filters while running the vgs command after extending the VG (the added PV is missed in the filter set). Since LvmCache doesn't invalidate filters, the volume group is marked as partial and the domainMonitor selftest fails (host goes to non-operational status). By design the filter should be invalidated if cmd returns a nonzero code, but vgs returns zero even if devices are filtered. The patch introduces a public method for filter invalidation and calls it from getDevicesVisibility, because getDevicesVisibility is called on all DC hosts after adding a new device but before extendSD.
Change-Id: If1eeed1c203f2c8c73370987048565d665932299 Bugzilla-Url: https://bugzilla.redhat.com/1022976 Signed-off-by: Pavel Zhukov pzhukov@redhat.com Reviewed-on: http://gerrit.ovirt.org/20552 Reviewed-by: Tomáš Došek tdosek@redhat.com Reviewed-by: Sergey Gotliv sgotliv@redhat.com Reviewed-by: Ayal Baron abaron@redhat.com Reviewed-by: Nir Soffer nsoffer@redhat.com Reviewed-by: Federico Simoncelli fsimonce@redhat.com --- M vdsm/storage/hsm.py M vdsm/storage/lvm.py 2 files changed, 5 insertions(+), 0 deletions(-)
Approvals:
  Ayal Baron: Looks good to me, approved
  Nir Soffer: Looks good to me, but someone else must approve
  Pavel Zhukov: Verified
  Federico Simoncelli: Looks good to me, but someone else must approve
  Sergey Gotliv: Looks good to me, but someone else must approve
  Tomáš Došek: Looks good to me, but someone else must approve
Objections:
  Eduardo: I would prefer that you didn't submit this
  Dan Kenigsberg: I would prefer that you didn't submit this
Dan Kenigsberg has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 18:
Ayal, the domainMonitor thread is independent of all Engine commands. We have no assurances on when it decides to issue its vgchk command. It may happen immediately after vgextend was performed.
At least a race condition exists. Please relate to my scenario.
However, it is unlikely to manifest itself after 5 minutes.
Nir Soffer has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 19:
If vgck will run before vg is extended, it will run fine.
If run after vg is extended, it will fail because it uses short filter. But this failure is not seen by selftest, since all lvm commands are executed using LVMCache.cmd.
Failing commands are run again *without* the short filter. vgck with the long filter will succeed because the long filter is updated (this patch fixed it).
So the only issue here is that commands using short filter will always fail on the first run and then will run again with long filter.
Please check this patch that try to solve this problem: http://gerrit.ovirt.org/#/c/21192/
Ayal Baron has posted comments on this change.
Change subject: Invalidate filters on HSMs before rescanning extended VG ......................................................................
Patch Set 19:
Dan, as Nir noted and I tried to explain previously, your race does not exist. It is very simple: devices are added, and getDevicesVisibility is used to discover the newly added devices; this must update the list of devices used in commands, period, regardless of anything else. The patch does exactly what it is supposed to do.
If you forget monitoring and focus on the simple fact that with the spm architecture you always have a period of time between spm making changes and other hosts being aware of these changes then you realize why we have the fallback in case of error (i.e. if failed - bypass caches and refresh).
vdsm-patches@lists.fedorahosted.org