Hi,
A patch I had pushed into 2.16 had an error which resulted in the regression described in the aforementioned bug as well as the bug described in bz @15339. Attached patch fixes this. If it looks OK, I'll apply it in rawhide and f19. I have also posted it upstream:
http://sourceware.org/ml/libc-alpha/2013-05/msg00567.html
Siddhesh
commit 69cbe0434e5ddf223ed17141ae9527fb83fde1fa Author: Siddhesh Poyarekar siddhesh@redhat.com Date: Wed May 15 20:06:20 2013 +0530
Set EAI_SYSTEM only when h_errno is NETDB_INTERNAL
Fixes BZ #15339.
NSS_STATUS_UNAVAIL may mean that a necessary input resource is not available. This could occur in a number of cases including when the network is down, system runs out of file descriptors, etc. The correct differentiator in such a case is the h_errno, which gives the nature of failure. In case of failures other than a simple 'not found', we set h_errno as NETDB_INTERNAL and let errno be the identifier for the exact error.
diff --git a/nss/getXXbyYY_r.c b/nss/getXXbyYY_r.c index 1067744..9e4d110 100644 --- a/nss/getXXbyYY_r.c +++ b/nss/getXXbyYY_r.c @@ -284,11 +284,11 @@ done: #endif *result = status == NSS_STATUS_SUCCESS ? resbuf : NULL; #ifdef NEED_H_ERRNO - if (status == NSS_STATUS_UNAVAIL) - /* Either we failed to lookup the functions or the functions themselves - had a system error. Set NETDB_INTERNAL here to let the caller know - that the errno may have the real reason for failure. */ - *h_errnop = NETDB_INTERNAL; + if (status == NSS_STATUS_UNAVAIL && !any_service && errno != ENOENT) + /* This happens when we weren't able to use a service for reasons other + than the module not being found. In such a case, we'd want to tell the + caller that errno has the real reason for failure. */ + *h_errnop = NETDB_INTERNAL; else if (status != NSS_STATUS_SUCCESS && !any_service) /* We were not able to use any service. */ *h_errnop = NO_RECOVERY; diff --git a/sysdeps/posix/getaddrinfo.c b/sysdeps/posix/getaddrinfo.c index d368306..cdd869d 100644 --- a/sysdeps/posix/getaddrinfo.c +++ b/sysdeps/posix/getaddrinfo.c @@ -1035,7 +1035,14 @@ gaih_inet (const char *name, const struct gaih_service *service, } } else - status = NSS_STATUS_UNAVAIL; + { + status = NSS_STATUS_UNAVAIL; + /* Could not load any of the lookup functions. Indicate + an internal error if the file was found but some other + error led to the failure. */ + if (errno != 0 && errno != ENOENT) + __set_h_errno (NETDB_INTERNAL); + } }
if (nss_next_action (nip, status) == NSS_ACTION_RETURN) @@ -1049,7 +1056,7 @@ gaih_inet (const char *name, const struct gaih_service *service,
_res.options |= old_res_options & RES_USE_INET6;
- if (status == NSS_STATUS_UNAVAIL) + if (h_errno == NETDB_INTERNAL) { result = GAIH_OKIFUNSPEC | -EAI_SYSTEM; goto free_and_return;
On 05/16/2013 02:06 AM, Siddhesh Poyarekar wrote:
Hi,
A patch I had pushed into 2.16 had an error which resulted in the regression described in the aforementioned bug as well as the bug described in bz @15339. Attached patch fixes this. If it looks OK, I'll apply it in rawhide and f19. I have also posted it upstream:
This class of change is interesting, in that nobody can easily review it because it was a test-driven change. The patch adjusts the code in response to feedback from tests run by users with observed versus expected output being adjusted.
Reviewing the code is easy, it looks right, it isn't invalid C, and it's only three lines of change.
However, the existing implementation is so complex and subtle that it would take me days to review this code well.
What would help in the review of this, and the upstream review, is to have a detailed breakdown of the analysis you did while fixing this.
Do you have such an analysis?
I wish we had some white-box testing for these internals, similar to the testing I did for gettext with systemtap where you inject a failure on purpose.
We need to look at better testing for all of the networking interfaces for glibc.
http://sourceware.org/ml/libc-alpha/2013-05/msg00567.html
Siddhesh
commit 69cbe0434e5ddf223ed17141ae9527fb83fde1fa Author: Siddhesh Poyarekar siddhesh@redhat.com Date: Wed May 15 20:06:20 2013 +0530
Set EAI_SYSTEM only when h_errno is NETDB_INTERNAL Fixes BZ #15339. NSS_STATUS_UNAVAIL may mean that a necessary input resource is not available. This could occur in a number of cases including when the network is down, system runs out of file descriptors, etc. The correct differentiator in such a case is the h_errno, which gives the nature of failure. In case of failures other than a simple 'not found', we set h_errno as NETDB_INTERNAL and let errno be the identifier for the exact error.
diff --git a/nss/getXXbyYY_r.c b/nss/getXXbyYY_r.c index 1067744..9e4d110 100644 --- a/nss/getXXbyYY_r.c +++ b/nss/getXXbyYY_r.c @@ -284,11 +284,11 @@ done: #endif *result = status == NSS_STATUS_SUCCESS ? resbuf : NULL; #ifdef NEED_H_ERRNO
- if (status == NSS_STATUS_UNAVAIL)
- /* Either we failed to lookup the functions or the functions themselves
had a system error. Set NETDB_INTERNAL here to let the caller know
that the errno may have the real reason for failure. */
*h_errnop = NETDB_INTERNAL;
- if (status == NSS_STATUS_UNAVAIL && !any_service && errno != ENOENT)
- /* This happens when we weren't able to use a service for reasons other
than the module not being found. In such a case, we'd want to tell the
caller that errno has the real reason for failure. */
- *h_errnop = NETDB_INTERNAL; else if (status != NSS_STATUS_SUCCESS && !any_service) /* We were not able to use any service. */ *h_errnop = NO_RECOVERY;
diff --git a/sysdeps/posix/getaddrinfo.c b/sysdeps/posix/getaddrinfo.c index d368306..cdd869d 100644 --- a/sysdeps/posix/getaddrinfo.c +++ b/sysdeps/posix/getaddrinfo.c @@ -1035,7 +1035,14 @@ gaih_inet (const char *name, const struct gaih_service *service, } } else
status = NSS_STATUS_UNAVAIL;
{
status = NSS_STATUS_UNAVAIL;
/* Could not load any of the lookup functions. Indicate
an internal error if the file was found but some other
error led to the failure. */
if (errno != 0 && errno != ENOENT)
__set_h_errno (NETDB_INTERNAL);
}
}
if (nss_next_action (nip, status) == NSS_ACTION_RETURN)
@@ -1049,7 +1056,7 @@ gaih_inet (const char *name, const struct gaih_service *service,
_res.options |= old_res_options & RES_USE_INET6;
if (status == NSS_STATUS_UNAVAIL)
if (h_errno == NETDB_INTERNAL) { result = GAIH_OKIFUNSPEC | -EAI_SYSTEM; goto free_and_return;
glibc mailing list glibc@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/glibc
On Thu, May 16, 2013 at 09:31:54AM -0400, Carlos O'Donell wrote:
On 05/16/2013 02:06 AM, Siddhesh Poyarekar wrote:
Hi,
A patch I had pushed into 2.16 had an error which resulted in the regression described in the aforementioned bug as well as the bug described in bz @15339. Attached patch fixes this. If it looks OK, I'll apply it in rawhide and f19. I have also posted it upstream:
This class of change is interesting, in that nobody can easily review it because it was a test-driven change. The patch adjusts the code in response to feedback from tests run by users with observed versus expected output being adjusted.
Reviewing the code is easy, it looks right, it isn't invalid C, and it's only three lines of change.
However, the existing implementation is so complex and subtle that it would take me days to review this code well.
Agreed, these bits are an ugly maze.
What would help in the review of this, and the upstream review, is to have a detailed breakdown of the analysis you did while fixing this.
Do you have such an analysis?
I don't have a written detailed breakdown right now, but I don't mind doing it. I'll post an analysis of this soon. Meanwhile, here's the original patch which caused this breakage, in case that helps build some of the context:
http://sourceware.org/ml/libc-alpha/2012-10/msg00611.html
Siddhesh
On Thu, May 16, 2013 at 07:20:22PM +0530, Siddhesh Poyarekar wrote:
What would help in the review of this, and the upstream review, is to have a detailed breakdown of the analysis you did while fixing this.
Do you have such an analysis?
I don't have a written detailed breakdown right now, but I don't mind doing it. I'll post an analysis of this soon. Meanwhile, here's the original patch which caused this breakage, in case that helps build some of the context:
OK, so here's the background of all these changes:
The original request was to have getaddrinfo return EAI_SYSTEM if the errno from the resulting request was EMFILE or ENFILE during the DNS host resolution, to differentiate those conditions from the usual EAI_NONAME. The reason behind this was that they could be treated differently from the default error conditions. To satisfy this, I looked at the two places that these were implemented, i.e. get_XX_by_YY.c (for AF_INET) and dns-host.c (for AF_UNSPEC and AF_INET6). The places that set this errno were during module load and/or during socket creation in both those places. I set nss_status to NSS_STATUS_UNAVAILABLE in those cases to signify that the error was something other than a standard failure and used it in gaih_inet() to set the return code as EAI_SYSTEM.
There were two problems with that patch. The last part (of setting NSS_STATUS_UNAVAILABLE and checking on it) was wrong since the default failure was NSS_STATUS_UNAVAILABLE. As a result, all failures were returning EAI_SYSTEM instead of EAI_NONAME. The correct check here would have been to look for h_errno to be NETDB_INTERNAL since I set that explicitly. Secondly, I had not checked for module load errors due to AF_UNSPEC or AF_INET6, which would occur in gaih_inet() itself. This needed the second patch (i.e. the one I'm requesting review for) which checks h_errno to check failure and ensures that ENOENT does not qualify for an EAI_SYSTEM, since the latter case is documented to be a valid reason for EAI_NONAME.
Finally, the tweak for #958652. This was a minor oversight TBH - I was testing for errno != ENOENT, but the module open code resets errno to 0 if there is a failure in dlopen'ing the module. As a result this check also needs a check to ensure that errno is actually set, to signal a failure.
I hope this is sufficient. Please let me know if you need any further clarification.
Siddhesh
On 05/21/2013 01:24 AM, Siddhesh Poyarekar wrote:
On Thu, May 16, 2013 at 07:20:22PM +0530, Siddhesh Poyarekar wrote:
What would help in the review of this, and the upstream review, is to have a detailed breakdown of the analysis you did while fixing this.
Do you have such an analysis?
I don't have a written detailed breakdown right now, but I don't mind doing it. I'll post an analysis of this soon. Meanwhile, here's the original patch which caused this breakage, in case that helps build some of the context:
OK, so here's the background of all these changes:
The original request was to have getaddrinfo return EAI_SYSTEM if the errno from the resulting request was EMFILE or ENFILE during the DNS host resolution, to differentiate those conditions from the usual EAI_NONAME. The reason behind this was that they could be treated differently from the default error conditions. To satisfy this, I looked at the two places that these were implemented, i.e. get_XX_by_YY.c (for AF_INET) and dns-host.c (for AF_UNSPEC and AF_INET6). The places that set this errno were during module load and/or during socket creation in both those places. I set nss_status to NSS_STATUS_UNAVAILABLE in those cases to signify that the error was something other than a standard failure and used it in gaih_inet() to set the return code as EAI_SYSTEM.
There were two problems with that patch. The last part (of setting NSS_STATUS_UNAVAILABLE and checking on it) was wrong since the default failure was NSS_STATUS_UNAVAILABLE. As a result, all failures were returning EAI_SYSTEM instead of EAI_NONAME. The correct check here would have been to look for h_errno to be NETDB_INTERNAL since I set that explicitly. Secondly, I had not checked for module load errors due to AF_UNSPEC or AF_INET6, which would occur in gaih_inet() itself. This needed the second patch (i.e. the one I'm requesting review for) which checks h_errno to check failure and ensures that ENOENT does not qualify for an EAI_SYSTEM, since the latter case is documented to be a valid reason for EAI_NONAME.
Finally, the tweak for #958652. This was a minor oversight TBH - I was testing for errno != ENOENT, but the module open code resets errno to 0 if there is a failure in dlopen'ing the module. As a result this check also needs a check to ensure that errno is actually set, to signal a failure.
I hope this is sufficient. Please let me know if you need any further clarification.
That's perfect and exactly what I needed to look over and review the code changes.
Cheers, Carlos.