[389-users] replication from 1.2.8.3 to 1.2.10.4

Thu Jul 12 20:52:22 UTC 2012

On 07/12/2012 02:47 PM, Robert Viduya wrote:
> On Jul 12, 2012, at 11:36 AM, Rich Megginson wrote:
>
>> On 07/12/2012 08:50 AM, Robert Viduya wrote:
>>> On Jul 11, 2012, at 7:17 PM, Rich Megginson wrote:
>>>
>>>> On 07/11/2012 11:12 AM, Robert Viduya wrote:
>> So is it possible that the hub was
> This question seems incomplete?
Sorry, I didn't mean to send that.
>
>> ok - please follow the directions at http://port389.org/wiki/FAQ#Debugging_Crashes to enable core files and get a stack trace
>>
>> Also, 1.2.10.12 is available in the testing repos.  Please give this a try.  There were a couple of fixes made since 1.2.10.4 that may be applicable:
>>
>> Ticket 336 [abrt] 389-ds-base-1.2.10.4-2.fc16: index_range_read_ext: Process /usr/sbin/ns-slapd was killed by signal 11 (SIGSEGV)
>> Ticket #347 - IPA dirsvr seg-fault during system longevity test
>> Ticket #348 - crash in ldap_initialize with multiple threads
>> Ticket #361: Bad DNs in ACIs can segfault ns-slapd
>> Trac Ticket #359 - Database RUV could mismatch the one in changelog under the stress
>> Ticket #382 - DS Shuts down intermittently
>> Ticket #390 - [abrt] 389-ds-base-1.2.10.6-1.fc16: slapi_attr_value_cmp: Process /usr/sbin/ns-slapd was killed by signal 11 (SIGSEGV)
> I've enabled the core dump stuff, but now I can't seem to get it to crash.  But I'm still getting the changelog messages in the error logs whenever I restart.  In addition, the hub server keeps running out of disk space.  I tracked it down to the access log filling up with MOD messages from replication.  It looks like changes are coming down from our 1.2.8 servers and being applied over and over again.  As an example, one of our entries was modified three times today, and on all our other machines I see the following in the access log file:
>
> # egrep 78b8cc871a3cda9f352580e797b270bc access
> [12/Jul/2012:11:00:59 -0400] conn=383671 op=3145 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:11:01:24 -0400] conn=383671 op=3153 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:11:01:38 -0400] conn=383671 op=3157 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
>
> But on the problematic hub server, I see:
>
> # egrep 78b8cc871a3cda9f352580e797b270bc access
> [12/Jul/2012:15:17:29 -0400] conn=2 op=58 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:17:29 -0400] conn=2 op=60 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:17:29 -0400] conn=2 op=61 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:42 -0400] conn=6 op=169 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:42 -0400] conn=6 op=171 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:42 -0400] conn=6 op=172 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:45 -0400] conn=3 op=170 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:45 -0400] conn=3 op=172 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:45 -0400] conn=3 op=173 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:51 -0400] conn=2 op=2234 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:51 -0400] conn=2 op=2236 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:51 -0400] conn=2 op=2237 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:55 -0400] conn=6 op=2233 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:55 -0400] conn=6 op=2235 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:55 -0400] conn=6 op=2236 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> [12/Jul/2012:15:24:57 -0400] conn=3 op=2234 MOD dn="gtdirguid=78b8cc871a3cda9f352580e797b270bc,ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu"
> ...
>
> I truncated the output for brevity, but there are over 250 MODs to that one object.  It's as if the server isn't able to do the replication bookkeeping and is accepting the same changes over and over again.  Eventually the disk fills up.
Do you see error messages from the supplier suggesting that it is 
attempting to send the operation but failing and retrying?

Do all of these operations have the same CSN?  The CSN is logged on the 
RESULT line for each operation.  Also, what is the err= value for the 
MOD operations?  Is it err=0, or some other code?
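
One way to answer both questions at once is to pair each MOD line with the 
RESULT line that has the same conn= and op= values, then count how many MODs 
share each CSN and err= code.  The following is only a rough sketch: the 
function name summarize_mods is made up for illustration, the sample log 
lines are invented (not taken from the hub's real log), and it assumes that 
RESULT lines carry err= and csn= fields in the usual access log layout.

```python
import re
from collections import defaultdict

# "conn=N op=N" is shared by a request line and its RESULT line.
LINE_RE = re.compile(r'conn=(?P<conn>\d+) op=(?P<op>\d+) (?P<rest>.*)')
MOD_RE = re.compile(r'^MOD dn="(?P<dn>[^"]*)"')
ERR_RE = re.compile(r'\berr=(?P<err>-?\d+)')
CSN_RE = re.compile(r'\bcsn=(?P<csn>[0-9a-fA-F]+)')


def summarize_mods(lines, dn_substring):
    """Count MODs on matching DNs, grouped by (csn, err) from their RESULT lines."""
    pending = {}               # (conn, op) -> dn, MODs still waiting for RESULT
    counts = defaultdict(int)  # (csn or None, err or None) -> number of MODs
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        key = (m.group("conn"), m.group("op"))
        rest = m.group("rest")
        mod = MOD_RE.match(rest)
        if mod:
            if dn_substring in mod.group("dn"):
                pending[key] = mod.group("dn")
            continue
        if rest.startswith("RESULT") and key in pending:
            del pending[key]
            err = ERR_RE.search(rest)
            csn = CSN_RE.search(rest)
            counts[(csn.group("csn") if csn else None,
                    int(err.group("err")) if err else None)] += 1
    return dict(counts)


# Invented sample lines, for illustration only.
sample = [
    '[12/Jul/2012:15:24:42 -0400] conn=6 op=169 MOD dn="gtdirguid=AAAA,ou=accounts,dc=test"',
    '[12/Jul/2012:15:24:42 -0400] conn=6 op=169 RESULT err=0 tag=103 nentries=0 etime=0 csn=4fff2b3a000000010000',
    '[12/Jul/2012:15:24:45 -0400] conn=3 op=170 MOD dn="gtdirguid=AAAA,ou=accounts,dc=test"',
    '[12/Jul/2012:15:24:45 -0400] conn=3 op=170 RESULT err=0 tag=103 nentries=0 etime=0 csn=4fff2b3a000000010000',
    '[12/Jul/2012:15:24:46 -0400] conn=3 op=171 MOD dn="gtdirguid=BBBB,ou=accounts,dc=test"',
    '[12/Jul/2012:15:24:46 -0400] conn=3 op=171 RESULT err=0 tag=103 nentries=0 etime=0 csn=4fff2b3c000000010000',
    '[12/Jul/2012:15:24:51 -0400] conn=2 op=2234 MOD dn="gtdirguid=AAAA,ou=accounts,dc=test"',
    '[12/Jul/2012:15:24:51 -0400] conn=2 op=2234 RESULT err=0 tag=103 nentries=0 etime=0 csn=4fff2b3b000000010000',
]

for (csn, err), n in sorted(summarize_mods(sample, "gtdirguid=AAAA").items()):
    print(f"csn={csn} err={err} count={n}")
```

If one CSN shows up with a count well above 1, the same change is being 
applied repeatedly.  Distinct CSNs would instead point to genuinely separate 
updates arriving from the suppliers.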
>
> I just upgraded it to 1.2.10.12 as suggested and just to be safe, I'm doing a clean import.  We'll see how it goes.
>
> --
> 389 users mailing list
> 389-users at lists.fedoraproject.org
> https://admin.fedoraproject.org/mailman/listinfo/389-users