Sorry about this, but it appears that my original hunch was correct. The 1.1 DS instance hung again recently. I was able to try a localhost query, and that failed too, so the problem definitely appears to be a hang somewhere in the FDS code. The question is, how do I go about debugging this? strace doesn’t show much at all, and enabling debug trace logging kills the server. Any ideas? Thanks.

-richard


On 2/14/08 11:21 AM, "Richard Hesse" <richard@powerset.com> wrote:

Actually, it turns out that debug logging was putting too much disk load on the server, and the process fell behind and stopped servicing socket requests. Thanks for your help, Richard.

-richard


On 2/12/08 12:32 PM, "Richard Megginson" <rmeggins@redhat.com> wrote:

Richard Hesse wrote:
> There's a load balancer acting as the client to the DS (proxying client requests). I think that's a red herring though. Any search requests sent directly to the DS, bypassing the LB, would fail. I think I even tried requests locally from the server and they still failed. I can't be sure about that last statement, it was a long day.
>
What are all of these closed connections from? e.g. conn=71007,
conn=71003, etc.?  Are they from the load balancer?

I'm not really sure how to proceed to diagnose this from the directory
server because events like these usually indicate something is happening
at the TCP/IP layer.

I would be really interested to see if you continued to have problems if
you shut off the load balancer completely and just contacted the
directory server via the loopback interface.
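
One way to run that loopback check without any LDAP client tooling is a small probe that sends an anonymous bind and waits for a reply with a timeout. This is only a rough sketch: the port (389), the timeout, and the probe_ldap name are assumptions for illustration, not anything taken from the server or its documentation.

```python
import socket

# BER encoding of an LDAPv3 anonymous simple bind request with message ID 1.
ANON_BIND = bytes.fromhex("300c020101600702010304008000")


def probe_ldap(host="127.0.0.1", port=389, timeout=5.0):
    """Send an anonymous bind and classify what comes back.

    Returns "ok", "timeout", "closed", "unexpected", or "error: <detail>".
    The host, port, and timeout defaults are assumptions; adjust for your setup.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(ANON_BIND)
            reply = sock.recv(64)
    except socket.timeout:
        # The connection was accepted (or queued) but nothing answered in time.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
    if not reply:
        return "closed"
    # For a short reply to message ID 1, byte 5 is the protocol op tag;
    # 0x61 is the BindResponse tag. This shortcut is not a full BER parser.
    return "ok" if len(reply) > 5 and reply[5] == 0x61 else "unexpected"
```

Calling probe_ldap() on the server itself while the load balancer is shut off would distinguish a refused connection (an "error: ..." result) from a server that accepts the connection and then never answers ("timeout").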
> What about the network file descriptor is not connected error?
>
It's similar to the B4 - it means there was a problem with the
connection to the client.
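
To see how often each close reason shows up, and on which connections, the "closed" lines in the access log can be tallied. The sketch below assumes the line layout seen in the excerpt quoted further down in this thread; the regular expression and the tally_closes name are illustrative assumptions, not part of the server.

```python
import re
from collections import Counter

# Matches e.g. "conn=71007 op=60 fd=69 closed - B4" and the variant that
# carries "error 107 (Transport endpoint is not connected)" before the dash.
CLOSE_RE = re.compile(
    r"conn=(\d+) op=(-?\d+) fd=(\d+) closed "
    r"(?:error (\d+) \(([^)]*)\) )?- (.+)$"
)


def tally_closes(lines):
    """Count access-log close reasons (the text after the final dash)."""
    counts = Counter()
    for line in lines:
        match = CLOSE_RE.search(line.rstrip())
        if match:
            counts[match.group(6)] += 1
    return counts
```

Passing an open access log file to tally_closes gives a Counter keyed by close reason, which makes it easier to tell whether the B4 closes are a steady background or a burst around the time of the hang.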
> Thanks.
>
> -richard
>
> -----Original Message-----
> From: fedora-directory-users-bounces@redhat.com [mailto:fedora-directory-users-bounces@redhat.com] On Behalf Of Richard Megginson
> Sent: Monday, February 11, 2008 7:43 PM
> To: General discussion list for the Fedora Directory server project.
> Subject: Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
>
> Richard Hesse wrote:
>
>> Started to play with FDS 1.1 for some dogfood testing. After running for 10-15 minutes, the server stopped responding to network requests and went silent. The process was running, the error log was updating with the ldbm event loop, but no socket requests were fulfilled. Checking the access log, I saw this:
>>
>> [12/Feb/2008:01:47:58 +0000] conn=71108 op=-1 fd=79 closed error 107 (Transport endpoint is not connected) - Network file descriptor is not connected.
>> [12/Feb/2008:01:47:59 +0000] conn=71007 op=60 fd=69 closed - B4
>> [12/Feb/2008:01:48:00 +0000] conn=71003 op=48 fd=68 closed - B4
>> [12/Feb/2008:01:48:01 +0000] conn=71017 op=47 fd=72 closed - B4
>> [12/Feb/2008:01:48:06 +0000] conn=71102 op=2 fd=66 closed - B4
>> [12/Feb/2008:01:48:07 +0000] conn=71103 op=2 fd=70 closed - B4
>> [12/Feb/2008:01:48:07 +0000] conn=71040 op=10 fd=76 closed - B4
>>
>> Any ideas or suggestions on how to approach troubleshooting this issue would be greatly appreciated.
>>
>>
> B4 means SLAPD_DISCONNECT_BER_FLUSH - this usually means the client has reset or closed the connection while the server was attempting to send a response.
>
> http://www.redhat.com/docs/manuals/dir-server/cli/8.0/Configuration_Command_File_Reference-Access_Log_and_Connection_Code_Reference-Common_Connection_Codes.html
>
> Do you have a firewall or some other network device?
>
>> Thanks.
>>
>> -richard
>>
>> --
>> Fedora-directory-users mailing list
>> Fedora-directory-users@redhat.com
>> https://www.redhat.com/mailman/listinfo/fedora-directory-users
>>
>>