[389-users] 389 directory server crash

Wed Jul 17 15:36:21 UTC 2013

On 07/17/2013 01:52 AM, Mitja Mihelič wrote:
> On 07/16/2013 04:49 PM, Rich Megginson wrote:
>> On 07/16/2013 01:23 AM, Mitja Mihelič wrote:
>>> On 07/15/2013 05:28 PM, Rich Megginson wrote:
>>>> On 07/15/2013 02:57 AM, Mitja Mihelič wrote:
>>>>> On 07/12/2013 05:55 PM, Rich Megginson wrote:
>>>>>> On 07/12/2013 08:22 AM, Mitja Mihelič wrote:
>>>>>>> On 07/09/2013 03:34 PM, Rich Megginson wrote:
>>>>>>>> On 07/09/2013 06:43 AM, Mitja Mihelič wrote:
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> We are having problems with some of our 389-DS instances. They 
>>>>>>>>> crash after receiving an update from the provider.
>>>>>>>>
>>>>>>>> After looking at the stack trace, I think this is 
>>>>>>>> https://fedorahosted.org/389/ticket/47391
>>>>> Yes, it looks like it might be it. When CONSUMER_ONE crashed for 
>>>>> the first time, the last thing replicated was a password change.
>>>>> Do you perhaps know where I could get a 389DS version for CentOS 6 
>>>>> that has the patch? The ticket says it was pushed to 1.2.11, but it 
>>>>> would seem that our 1.2.11.15-14 is still unpatched and the 
>>>>> repositories do not have any newer versions.
>>>>
>>>> Is that the 389-ds-base that is included with CentOS6?
>>> Yes, the 389-ds-base-1.2.11.15-14.el6_4.x86_64 and 
>>> 389-ds-base-libs-1.2.11.15-14.el6_4.x86_64 are from the official 
>>> CentOS 6 updates repository.
>>> 389-ds-base-debuginfo is from http://debuginfo.centos.org/6/
>>> The rest are from epel.
>>
>> Looking at the stack trace you sent earlier - there is only 1 
>> thread?  You ran
>> gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof ns-slapd` > stacktrace.`date +%s`.txt 2>&1
>>
>>
>> ?  If so, I have no idea what's going on - I've never seen the server deadlock itself with only 1 thread . . .
> I ran
> gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof -o 49171 ns-slapd` > stacktrace.`date +%s`.txt 2>&1
> The "-o 49171" is to exclude the pid of the config server instance, so 
> only the problematic pid was looked at.
> If you get any more information regarding this crash it would be very 
> much appreciated.
>
> It may be best if I remove all 389DS-related data from both of the 
> consumer servers and start fresh. If they crash again, I will send the 
> relevant stack traces.

Yes, that sounds good.
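
If more than one ns-slapd process is still left after the "-o" exclusion, pidof prints several pids and gdb gets all of them on one command line. A minimal sketch of capturing one trace file per pid follows. The helper names trace_file_name and capture_trace are hypothetical, not part of 389-ds, and the gdb options are the ones already quoted in this thread.

```shell
#!/bin/sh
# Sketch only: capture a full backtrace of one ns-slapd process by explicit pid.
# trace_file_name and capture_trace are hypothetical helpers, not 389-ds tools.

# Build the output file name from a pid and a timestamp.
trace_file_name() {
    printf 'stacktrace.%s.%s.txt\n' "$1" "$2"
}

# Attach gdb to the given pid, dump all thread backtraces into a file,
# and print the name of the file that was written.
capture_trace() {
    pid=$1
    out=$(trace_file_name "$pid" "$(date +%s)")
    gdb -ex 'set confirm off' -ex 'set pagination off' \
        -ex 'thread apply all bt full' -ex 'quit' \
        /usr/sbin/ns-slapd "$pid" > "$out" 2>&1
    printf '%s\n' "$out"
}

# Usage (assumption: 49171 is the pid to skip, as in the message above):
#   for pid in $(pidof -o 49171 ns-slapd); do capture_trace "$pid"; done
```

Putting the pid in the file name keeps the traces apart if both instances end up being traced in the same second.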

>
>>
>>
>>
>>>>
>>>>>>>>
>>>>>>>>> The crash happened twice after about a week of running without 
>>>>>>>>> problems. The crashes happened on two consumer servers but not 
>>>>>>>>> at the same time.
>>>>>>>>> The servers are running CentOS 6x with the following 389DS 
>>>>>>>>> packages installed:
>>>>>>>>> 389-ds-console-doc-1.2.6-1.el6.noarch
>>>>>>>>> 389-console-1.1.7-1.el6.noarch
>>>>>>>>> 389-adminutil-1.1.15-1.el6.x86_64
>>>>>>>>> 389-dsgw-1.1.10-1.el6.x86_64
>>>>>>>>> 389-ds-base-debuginfo-1.2.11.15-14.el6_4.x86_64
>>>>>>>>> 389-admin-1.1.29-1.el6.x86_64
>>>>>>>>> 389-ds-console-1.2.6-1.el6.noarch
>>>>>>>>> 389-admin-console-doc-1.1.8-1.el6.noarch
>>>>>>>>> 389-ds-1.2.2-1.el6.noarch
>>>>>>>>> 389-ds-base-1.2.11.15-14.el6_4.x86_64
>>>>>>>>> 389-ds-base-libs-1.2.11.15-14.el6_4.x86_64
>>>>>>>>> 389-admin-console-1.1.8-1.el6.noarch
>>>>>>>>>
>>>>>>>>> We are in the process of replacing the CentOS 5x based 
>>>>>>>>> consumer+provider setup with a CentOS 6x based one. For the 
>>>>>>>>> time being, the CentOS 6 machines are acting as consumers for 
>>>>>>>>> the old server. They run for a while and then the replicated 
>>>>>>>>> instances crash though not at the same time.
>>>>>>>>> One of the servers did not want to start after the crash,
>>>>>>>>
>>>>>>>> Can you provide the error messages from the errors log?
>>>>>>> I have attached error logs from the provider 
>>>>>>> (2013-06-27-provider_error) and the consumer 
>>>>>>> (2013-06-27-server_two_error) in question.
>>>>>>>>
>>>>>>>>> so I have run db2index on its database. It's been running for 
>>>>>>>>> four days and it has still not finished. 
>>>>>>>>
>>>>>>>> Try exporting using db2ldif, then importing using ldif2db.
>>>>>>> The export process hangs. After an hour strace still shows:
>>>>>>> futex(0x7f5822670ed4, FUTEX_WAIT, 1, NULL
>>>>>>> The error log for this is attached as 
>>>>>>> 2013-07-10-server_two-ldif_import_hangs.
>>>>>>
>>>>>> Are you using db2ldif or db2ldif.pl?  If you are using db2ldif, 
>>>>>> is the server running?  If not, please try first shutting down 
>>>>>> the server and use db2ldif.
>>>>>>
>>>>>> If db2ldif still hangs, then please follow the instructions at 
>>>>>> http://port389.org/wiki/FAQ#Debugging_Hangs to get a stack trace 
>>>>>> of the hung process.
>>>>> I was using db2ldif with the server shut down. I tried it again 
>>>>> and it hung. The LDIF file was created but its size was zero. The 
>>>>> produced stack trace is attached as 
>>>>> server_two-db2ldif_hang-stacktrace.1373877200.txt.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> All I get from db2index now are these outputs:
>>>>>>>>> [09/Jul/2013:13:29:11 +0200] - reindex db: Processed 65095 
>>>>>>>>> entries (pass 1104) -- average rate 53686277.5/sec, recent 
>>>>>>>>> rate 0.0/sec, hit ratio 0%
>>>>>>>>
>>>>>>>> How many entries do you have in your database?
>>>>>>> The number is around 65400. It varies by perhaps 2 user 
>>>>>>> del/add operations a month and 20 attribute changes per week, if 
>>>>>>> that.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The other instance did start up, but the replication process 
>>>>>>>>> did not work anymore. I disabled the replication to this host 
>>>>>>>>> and set it up again. I chose "Initialize consumer now" and the 
>>>>>>>>> consumer crashed every time.
>>>>>>>>
>>>>>>>> Can you provide a stack trace of the core when the server crashes?  
>>>>>>>> This may be different from the stack trace below.
>>>>>>> The last provided stack trace was produced at the last server 
>>>>>>> crash. I will provide another stack trace when CONSUMER_ONE 
>>>>>>> crashes again. Currently it refuses to crash at initialization 
>>>>>>> time and keeps running.
>>>>>>>>
>>>>>>>>> I have enabled full error logging and could find nothing.
>>>>>>>>> I have read a few threads (not all, I admit) on this list and 
>>>>>>>>> http://directory.fedoraproject.org/wiki/FAQ#Debugging_Crashes 
>>>>>>>>> and tried to troubleshoot.
>>>>>>>>>
>>>>>>>>> The crash produced the attached core dump and I could use your 
>>>>>>>>> help with understanding it, as well as any help with the 
>>>>>>>>> crash itself. If more info is needed I will gladly provide it.
>>>>>>>>>
>>>>>>>>> Regards, Mitja
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
