Liron Ar has uploaded a new change for review.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
sp: fix spm start when failing to produce domain
During spmStart we attept to go over all the domains and set their role to be regular in case it's not to avoid situation in which there are two domains with the master role. If we fail to produce any of the domains (which might happen, the domains returned by getDomains(activeOnly=True) are all the domains that are are marked as active in the metadata, but it doesn't mean that they are actually reachable and that calling produce() for the domain won't fail.
In case the we fail to call produce() on domain or failing to alter it's role, it doesn't mean that spmStart should fail - otherwise we might never succeed to start spm.
Change-Id: Ia4f34360770ca8c8741a50956d0cab92bcd9a810 Signed-off-by: Liron Aravot laravot@redhat.com --- M vdsm/storage/sp.py 1 file changed, 9 insertions(+), 4 deletions(-)
git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/24/25424/1
diff --git a/vdsm/storage/sp.py b/vdsm/storage/sp.py index a63d074..df89c27 100644 --- a/vdsm/storage/sp.py +++ b/vdsm/storage/sp.py @@ -201,12 +201,17 @@ for sdUUID in self.getDomains(activeOnly=True): if sdUUID == self.masterDomain.sdUUID: continue + try: + domain = sdCache.produce(sdUUID)
- domain = sdCache.produce(sdUUID) - if domain.getDomainRole() == sd.REGULAR_DOMAIN: - continue + if domain.getDomainRole() == sd.REGULAR_DOMAIN: + continue
- self._backend.setDomainRegularRole(domain) + self._backend.setDomainRegularRole(domain) + except Exception: + # log any exception, but keep going + self.log.error("Error trying to check/update domain %s role", + sdUUID, exc_info=True)
@unsecured def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
oVirt Jenkins CI Server has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 1:
Build Successful
http://jenkins.ovirt.org/job/vdsm_master_pep8_gerrit/7420/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/6627/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/7529/ : SUCCESS
Nir Soffer has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 1: Code-Review+1
The error in the log may be little noisy for expected errors (such as no such domain), but lets see customers logs before we attempt to clean this.
Nir Soffer has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 1:
(3 comments)
The commit message needs more love.
http://gerrit.ovirt.org/#/c/25424/1//COMMIT_MSG Commit Message:
Line 3: AuthorDate: 2014-03-05 15:25:56 +0200 Line 4: Commit: Liron Aravot laravot@redhat.com Line 5: CommitDate: 2014-03-05 15:56:21 +0200 Line 6: Line 7: sp: fix spm start when failing to produce domain fix -> Fix
when failing to produce domain -> when failing to update domain role
If you want to handle only produce errors, you should put the try-except around that line only. Line 8: Line 9: During spmStart we attept to go over all the domains and set their role Line 10: to be regular in case it's not to avoid situation in which there are two Line 11: domains with the master role.
Line 5: CommitDate: 2014-03-05 15:56:21 +0200 Line 6: Line 7: sp: fix spm start when failing to produce domain Line 8: Line 9: During spmStart we attept to go over all the domains and set their role attept -> attempt Line 10: to be regular in case it's not to avoid situation in which there are two Line 11: domains with the master role. Line 12: If we fail to produce any of the domains (which might happen, the Line 13: domains returned by getDomains(activeOnly=True) are all the domains that
Line 7: sp: fix spm start when failing to produce domain Line 8: Line 9: During spmStart we attept to go over all the domains and set their role Line 10: to be regular in case it's not to avoid situation in which there are two Line 11: domains with the master role. Empty line between paragraphs makes it easier to read. If this should be one paragraph, do not break the line. Line 12: If we fail to produce any of the domains (which might happen, the Line 13: domains returned by getDomains(activeOnly=True) are all the domains that Line 14: are are marked as active in the metadata, but it doesn't mean that they Line 15: are actually reachable and that calling produce() for the domain won't fail.
oVirt Jenkins CI Server has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
Build Successful
http://jenkins.ovirt.org/job/vdsm_master_pep8_gerrit/7427/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit_el/6634/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_unit_tests_gerrit/7536/ : SUCCESS
Nir Soffer has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(2 comments)
Commit message still need more love.
http://gerrit.ovirt.org/#/c/25424/2//COMMIT_MSG Commit Message:
Line 3: AuthorDate: 2014-03-05 15:25:56 +0200 Line 4: Commit: Liron Ar laravot@redhat.com Line 5: CommitDate: 2014-03-06 10:55:35 -0500 Line 6: Line 7: sp: fix spm start when failing to produce domain Tittle is still wrong - you explain bellow that this handle errors when altering domain role. Line 8: Line 9: During spmStart we attept to go over all the domains and set their role Line 10: to be regular in case it's not to avoid situation in which there are two Line 11: domains with the master role.
Line 5: CommitDate: 2014-03-06 10:55:35 -0500 Line 6: Line 7: sp: fix spm start when failing to produce domain Line 8: Line 9: During spmStart we attept to go over all the domains and set their role attept is still misspelled. Line 10: to be regular in case it's not to avoid situation in which there are two Line 11: domains with the master role. Line 12: If we fail to produce any of the domains (which might happen, the Line 13: domains returned by getDomains(activeOnly=True) are all the domains that
Federico Simoncelli has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role", use self.log.exception (and drop exc_info). Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Dan Kenigsberg has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2: Code-Review-1
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
use self.log.exception (and drop exc_info).
If it's fine to keep the master role on active domains, why do we even try to unset it? This new try-except block is sooo wide that we could equally drop _updateDomainsRole(). Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Federico Simoncelli has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2: Code-Review+2
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
If it's fine to keep the master role on active domains, why do we even try
Up until now we always used best-effort to set the domains role to regular. Think of a master domain that gets unreachable, we obviously can't set the regular role in its metadata. We'll keep failing to set its role until it gets reachable again but in the end we'll eventually fix it. So by the rules we're playing now it's not fine to keep the master role on multiple active domains, we just can't fix it on unreachable domains.
That said, this will become irrelevant as soon as we get rid of the master domain. Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Federico Simoncelli has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
Commit message must be fixed.
Dan Kenigsberg has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
Up until now we always used best-effort to set the domains role to regular.
Federico, I am afraid I disagree. If we can live with with two domains with ROLES=MASTER, we should not bother to fix this condition. And if we cannot live with that condition, we must not swallow this failure.
I understand that avoiding this "best effort" attempt may expose more bugs that ovirt-3.4 can handle right now, but this patch is destined to *master* branch, which should not include this kind of shortcuts.
Based on the cited BZ, it might be possible to work around the issue by skipping setDomainRegularRole() on iso and export domains. But dropping _updateDomainsRole() seems nicer. Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Federico Simoncelli has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
Federico, I am afraid I disagree. If we can live with with two domains with
Once we'll drop the storage pool we won't call startSpm anymore and this part will be kept only to deal with old data centers. This is not a matter of master/ovirt-3.4 branches, it's a matter of removing something that was part of the format (and we'll have to assess if older vdsm may have a problem with that). If we end up with 2 masters and we find an issue, the problem would be fixed as easily as re-electing the spm. On the contrary if you remove the code we'll have to handle the bug somehow (what if it's in old versions?). Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Dan Kenigsberg has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
Once we'll drop the storage pool we won't call startSpm anymore and this pa
I do not understand how the suggested code is any different from dropping _updateDomainsRole() - both end up with two master domains in a pool. In the latter case, it's going to be more frequent, but it is not fundamentally different. Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Liron Ar has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
I do not understand how the suggested code is any different from dropping _
Dan, skipping setDomainRegularRole() on iso and export domains isn't good enough as solution - as the bug will still occur for unreachable data domains - so that's not a bug solver and this could be easily reproduced preventing spm starts.
IMO there's no reason for doing the best effort for having the correct MD here - it may help users and provide better accuracy of the role of each domain (for example, when accessing the storage directly it may help a user to tell in which domain it should look for his most updated OVF files (located under the master), so that's the way i'd prefer it rather than dropping it completely. Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Dan Kenigsberg has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
Dan,
Comparing the masterVersion is the way to tell which is your most-recent master domain. That's not a good-enough reason to erase the master role. And note again that you do not always erase it: if there has been a temporary IO failure you say "never mind", and go on as if nothing happened.
Why do we want to assume spm role if we cannot access our storage domains? Why such an spm is useful? Engine should select another spm in such condition, or reconstruct the master domain without the inaccessible domains.
It boils down to a simple condition: * If we must have a single master role, the suggested code gets us into a bad state. * if we may live with multiple master roles, we do not need _updateDomainsRole. Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Eduardo has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2: Code-Review-2
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 207: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 208: continue Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: SPM start should fail if one of the domains is unreacheable by the host designated of the SPM.
Allowing this behaviour can lead, among others, to a situation when extend request of VMs using the missing SD are not served.
If at any point in time one of the domains is unreacheable by the host, spm property should be lost.
The missing domain is reported by SPM and should be used by engine to find another host to be SPM or to change the pool definition accordingly. Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role", Line 214: sdUUID, exc_info=True) Line 215:
Liron Ar has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(2 comments)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 207: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 208: continue Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception:
SPM start should fail if one of the domains is unreacheable by the host des
Regardless of changes in the engine that you think that should be done (I don't agree with the comment, but that's not the scope for discussing it), This code currently breaks BC with older engine versions, so we need to keep the behavior that was before, therefore it's irrelevant - the two options are either removing the code or to ignore the expcetion. Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role", Line 214: sdUUID, exc_info=True) Line 215:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
Comparing the masterVersion is the way to tell which is your most-recent ma
Dan, Let's separate the issues here - 1. there's a current situation that breaks spm start, the current terminology of starting SPM doesn't require access to all the storage domain (previously to that code) - why is it needed? because that's excatly how the user can continue to work with the system while having monitoring thread on the domain - but that's irrelevant to this patch (that's the current way the system operates).
Previous to this change, vdsm never required the spm to access all domains - therfore it breaks BC with older engines regardless of what will be done in the engine.
let's first get this solved by either removing the method or having this fix.
2. About that master version - i'm not sure that it's always good enough (because you might have attached a domain that was master in other dc for example..etc), I don't see harm in trying to set the correct role for the domains that we can reach - but again, i'm not totally against removing it. Fede? Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Eduardo has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(2 comments)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
Dan,
2. How are you attaching such domain? Why is still there the former pool MD? Do you understand the implications = bugs? Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
Dan,
SPM _should not_ start or continue in the same if it is not capable to reach all the pools domains. Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Eduardo has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 207: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 208: continue Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception:
Regardless of changes in the engine that you think that should be done (I d
This was merged in 3.0. (vdsm-4.9-112, Nov 7 16:45:00 2011)
Therefore has nothing to do with BC.
If any engine version expects SPM to start when no all the SDs are reacheable this is wrong and should be fixed, as was explained before.
Is not what "I think that should be done", it is basic oVirt knowledge. Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role", Line 214: sdUUID, exc_info=True) Line 215:
Sergey Gotliv has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 207: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 208: continue Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception:
This was merged in 3.0. (vdsm-4.9-112, Nov 7 16:45:00 2011)
This change was introduced by I2ecf801d58b34c1c811e311e3779887a406af5f0 on December 6, 2013. Before that It wasn't a part of the spmStart flow, I think this is new. What I am missing? Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role", Line 214: sdUUID, exc_info=True) Line 215:
Eduardo has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 207: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 208: continue Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception:
This change was introduced by I2ecf801d58b34c1c811e311e3779887a406af5f0 on
You are wrong.
Change-Id: I593060e354c0bdc9b19f4e11a376094d83e567ce
Anyway maintaining wrong behaviours is not "BC" . Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role", Line 214: sdUUID, exc_info=True) Line 215:
Nir Soffer has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 207: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 208: continue Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception:
You are wrong.
Nothing here: http://gerrit.ovirt.org/#/q/I593060e354c0bdc9b19f4e11a376094d83e567ce,n,z
Or on master: git grep I593060e354c0bdc9b19f4e11a376094d83e567ce
Do you have the commit hash? Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role", Line 214: sdUUID, exc_info=True) Line 215:
Liron Ar has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(3 comments)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 207: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 208: continue Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception:
Nothing here: http://gerrit.ovirt.org/#/q/I593060e354c0bdc9b19f4e11a376094d
As sergey wrote, seems like it was introduced in I2ecf801d58b34c1c811e311e3779887a406af5f0. Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role", Line 214: sdUUID, exc_info=True) Line 215:
Line 207: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 208: continue Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception:
You are wrong.
As Sergey wrote, it was introduced in I2ecf801d58b34c1c811e311e3779887a406af5f0. Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role", Line 214: sdUUID, exc_info=True) Line 215:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
SPM _should not_ start or continue in the same if it is not capable to reac
Eduardo, this is behavior change that is debatable (and we can discuss it regardless) and isn't related to this bug - this bug is about the changed requirements for succesful start spm introduced in I2ecf801d58b34c1c811e311e3779887a406af5f0, not a change the terms for succesfully starting the spm and related side effacts from it.
Just to add - ReconstructMaster won't help you in all vdsm versions, because if you reconstruct without this domain appearing as "active"/without this domain listed you won't have domain monitoring for this domain so you won't be able to tell when it's reachable (and not all versions support "monitordomains").
To sum it up, the spmStart "requirments" were changed in patch I2ecf801d58b34c1c811e311e3779887a406af5f0 since last december and should be fixed, discussions about other refactors can be done regardless. Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Eduardo has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(2 comments)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 207: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 208: continue Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception:
As Sergey wrote, it was introduced in I2ecf801d58b34c1c811e311e3779887a406a
Use your git-fu and look in 3.0! Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role", Line 214: sdUUID, exc_info=True) Line 215:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
Eduardo, this is behavior change that is debatable (and we can discuss it r
Your are killing any sense of the SPM semantics and you can understand it if you were so kind to read other people comments, in this patch and in the BZ.
As stated in the BZ you have nothing to do with monitor domains, use getStorageDomainsList() in order to discover reachable domains. (Supported on all versions.)
Your "requeriments" and "behavioral changes" are the result of careless changes in the engine reconstruct flows. Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
Liron Ar has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 2:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/2/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 209: Line 210: self._backend.setDomainRegularRole(domain) Line 211: except Exception: Line 212: # log any exception, but keep going Line 213: self.log.error("Error trying to check/update domain %s role",
Your are killing any sense of the SPM semantics and you can understand it i
no one is killing the SPM semantics - this behavior was the behavior till it was changed accidentally in the given patch I2ecf801d58b34c1c811e311e3779887a406af5f0 (as stated by the patch owner as well). You want to change the semantics that currently all the flows rely on, you are free to do so - first of all we should return the semantics to what they were before they were mistakenly changed either by deleting it or by ignoring a failure - that's what the discussion here should be about. Line 214: sdUUID, exc_info=True) Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None):
oVirt Jenkins CI Server has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 3:
Build Successful
http://jenkins.ovirt.org/job/vdsm_master_pep8_gerrit/8359/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit_el/7569/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit/8479/ : SUCCESS
Liron Ar has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 3: Verified+1
verified on NFS
Dan Kenigsberg has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 3:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/3/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 206: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 207: continue Line 208: Line 209: self._backend.setDomainRegularRole(domain) Line 210: except se.StorageDomainDoesNotExist: iirc, this error can be raised only from within sdc.produce(). So either the span of the try block is too big, or that we need to catch another exception that may be raised within setDomainRegularRole(). Line 211: self.log.error("Error when trying to find domain %s, ignoring", Line 212: sdUUID, exc_info=True) Line 213: Line 214: @unsecured
Nir Soffer has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 3:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/3/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 206: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 207: continue Line 208: Line 209: self._backend.setDomainRegularRole(domain) Line 210: except se.StorageDomainDoesNotExist:
iirc, this error can be raised only from within sdc.produce(). So either th
We need to catch exception raised in getDomainRole and in setDomainRole. Line 211: self.log.error("Error when trying to find domain %s, ignoring", Line 212: sdUUID, exc_info=True) Line 213: Line 214: @unsecured
Liron Ar has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 4: Verified+1
oVirt Jenkins CI Server has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 4:
Build Failed
http://jenkins.ovirt.org/job/vdsm_master_pep8_gerrit/8367/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit_el/7577/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit/8487/ : FAILURE
Liron Ar has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 3:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/3/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 206: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 207: continue Line 208: Line 209: self._backend.setDomainRegularRole(domain) Line 210: except se.StorageDomainDoesNotExist:
We need to catch exception raised in getDomainRole and in setDomainRole.
Done Line 211: self.log.error("Error when trying to find domain %s, ignoring", Line 212: sdUUID, exc_info=True) Line 213: Line 214: @unsecured
Dan Kenigsberg has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 4: Code-Review+1
Federico Simoncelli has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 4:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/4/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 203: continue Line 204: try: Line 205: domain = sdCache.produce(sdUUID) Line 206: except se.StorageDomainDoesNotExist: Line 207: self.log.error("Error when trying to find domain %s, ignoring", minor (no need to re-verify): could you use:
self.log.exception("Error when trying to find domain %s, ignoring", sdUUID)
thanks Line 208: sdUUID, exc_info=True) Line 209: continue Line 210: Line 211: if domain.getDomainRole() == sd.REGULAR_DOMAIN:
Federico Simoncelli has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 4: Code-Review+2
Liron Ar has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 4:
(1 comment)
http://gerrit.ovirt.org/#/c/25424/4/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 203: continue Line 204: try: Line 205: domain = sdCache.produce(sdUUID) Line 206: except se.StorageDomainDoesNotExist: Line 207: self.log.error("Error when trying to find domain %s, ignoring",
minor (no need to re-verify): could you use:
Done Line 208: sdUUID, exc_info=True) Line 209: continue Line 210: Line 211: if domain.getDomainRole() == sd.REGULAR_DOMAIN:
Liron Ar has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 5: Verified+1
verified on last version, the only change is the use of log.exception instead of log.error
oVirt Jenkins CI Server has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 5:
Build Successful
http://jenkins.ovirt.org/job/vdsm_master_pep8_gerrit/8368/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit_el/7578/ : SUCCESS
http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit/8488/ : SUCCESS
Federico Simoncelli has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 5: Code-Review+2
Dan Kenigsberg has posted comments on this change.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
Patch Set 5:
(1 comment)
SPM never promised unintermitted visibility to all storage domains. Vdsm revokes its own spm role only when the master domain is lost - it is Engine's responsibility to decide whether an SPM with partial visibility to other storage domains should survive.
With this patch, this is kept also on spm start.
http://gerrit.ovirt.org/#/c/25424/5/vdsm/storage/sp.py File vdsm/storage/sp.py:
Line 210: Line 211: if domain.getDomainRole() == sd.REGULAR_DOMAIN: Line 212: continue Line 213: Line 214: self._backend.setDomainRegularRole(domain) if produce() succeeds, but connection to storage is lost a bit later, setDomainRegularRole's failure would fail spmStart. I suppose that this can be fixed in a future patch. Line 215: Line 216: @unsecured Line 217: def startSpm(self, prevID, prevLVER, maxHostID, expectedDomVersion=None): Line 218: """
Dan Kenigsberg has submitted this change and it was merged.
Change subject: sp: fix spm start when failing to produce domain ......................................................................
sp: fix spm start when failing to produce domain
During spmStart we attept to go over all the domains and set their role to be regular in case it's not to avoid situation in which there are two domains with the master role. If we fail to produce any of the domains (which might happen, the domains returned by getDomains(activeOnly=True) we shouldn't fail, as we attempt to produce all the domains that are marked as active in the metadata, but it doesn't mean that they are actually reachable and that calling produce() for the domain won't fail.
In case the we fail to call produce() on domain with StorageDomainDoesNotExist exception, it doesn't mean that spmStart should fail - otherwise we might never succeed to start spm.
I believe that we should ignore any exception to avoid the same case in different scenarios (for example - failure to alter the metadata), but as reviewers requested, let's start with StorageDomainDoesNotExist exception only.
Change-Id: Ia4f34360770ca8c8741a50956d0cab92bcd9a810 Bug-Url: https://bugzilla.redhat.com/show_bug.cgi?id=1072900 Signed-off-by: Liron Aravot laravot@redhat.com Reviewed-on: http://gerrit.ovirt.org/25424 Reviewed-by: Federico Simoncelli fsimonce@redhat.com --- M vdsm/storage/sp.py 1 file changed, 6 insertions(+), 1 deletion(-)
Approvals: Federico Simoncelli: Looks good to me, approved Liron Ar: Verified
vdsm-patches@lists.fedorahosted.org