[389-users] how to quickly recover from a corrupt database in multiple master configuration

Thu Jun 24 13:52:26 UTC 2010

Just in case anybody is interested in the reason the corruption occurred.
Apparently a rotten browsing index caused it.
An error message pointed me in this direction :
 errors:[17/Jun/2010:12:51:18 +0200] - vlv_build_idl: can't follow db cursor
(err -30989)
I deleted the browsing index from the particular ou and the problem was
gone.

On Wed, Jun 16, 2010 at 9:39 PM, Rich Megginson <rmeggins at redhat.com> wrote:

> mark benschop wrote:
> > Hi Rich,
> > Thanks for your reply.
> > Please find the logging from the problems below.
> > The serverb55 is one of 2 servers in a multiple masters configuration
> > that consists of serverb55 and serverb05.
> >
> > The problem I inititially had was that I had 2 entries that could not
> > be deleted serverb55.
> >
> > Here's logging from the access file.
> > =======================================================================
> > access.20100614-092820:[15/Jun/2010:09:20:49 +0200] conn=342177 op=7
> > SRCH base="uid=dbeijk, ou=people, dc=directory,dc=intern" scope=0
> > filter="(objectClass=*)" attrs=ALL
> > access.20100614-092820:[15/Jun/2010:09:20:49 +0200] conn=342177 op=7
> > RESULT err=0 tag=101 nentries=1 etime=0
> > access.20100614-092820:[15/Jun/2010:09:22:08 +0200] conn=342177 op=8
> > SRCH base="uid=dbeijk, ou=people, dc=directory,dc=intern" scope=1
> > filter="(objectClass=*)" attrs="objectClass"
> > access.20100614-092820:[15/Jun/2010:09:22:08 +0200] conn=342177 op=8
> > RESULT err=0 tag=101 nentries=0 etime=0 notes=U
> > access.20100614-092820:[15/Jun/2010:09:22:08 +0200] conn=342177 op=9
> > DEL dn="uid=dbeijk, ou=people, dc=directory,dc=intern"
> > access.20100614-092820:[15/Jun/2010:09:22:08 +0200] conn=342177 op=9
> > RESULT err=1 tag=107 nentries=0 etime=0 csn=4c172a21000000370000
> > access.20100614-092820:[15/Jun/2010:09:22:08 +0200] conn=342177 op=10
> > SRCH base="uid=dbeijk, ou=people, dc=directory,dc=intern" scope=1
> > filter="(objectClass=*)" attrs="objectClass"
> > access.20100614-092820:[15/Jun/2010:09:22:08 +0200] conn=342177 op=10
> > RESULT err=0 tag=101 nentries=0 etime=0 notes=U
> > =======================================================================
> >
> > LDAP error 1 i found means 'unwiling to perform'. First I thought
> > something might be wrong with the entry itself.
> > The error log found in the error log from the serverb55 I've added
> > below seemed to point in that direction.
> >
> >
> > When I logged on the the other ldapserver, serverb05, I tried to
> > delete the same entry to see if this slapd had the same issue but here
> > it worked.
> > Replicating the delete didn't. The following error was logged to the
> > errorlog of this :
> >
> > ========================================================================
> > [15/Jun/2010:09:35:17 +0200] NSMMReplicationPlugin -
> > agmt="cn=serverb55" (serverb55:636): Consumer failed to replay change
> > (uniqueid a276337c-5dc511df-852cfef8-667fa4d4, CSN
> > 4c172d36000000050000): Operations error. Will retry later.
> > =======================================================================
> >
> > So there seemed to be a problem with the serverb55 only.
> > Since I assumed the database got somehow corrupt or inconsistent I've
> > tried the following steps to try and recreate the database or had it
> > checked in order to get it right again.
> > First there's the errors from the account that could not be deleted.
> > I 'reinitialised the consumer' from the working serverb05 to the
> > problematic serverb55.
> > Then I restarted the slapd.
> > Made an export of the database and imported that.
> > Slapd stopped the database.
> >
> > Please find the logging from /var/log/dirsrv/slapd-serverb55/errors
> > from the actions leading to the problem of the fatal server stop.
> > ======================================================================
> >         CentOS-Directory/8.1.0
> > B2009.134.1334
> >
> >         serverb55:636
> > (/etc/dirsrv/slapd-serverb55)
> >
> >
> >
> > [15/Jun/2010:09:22:58 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" missing attribute "uidNumber" required by
> > object class "posixAccount"
> > [15/Jun/2010:09:22:58 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" missing attribute "gidNumber" required by
> > object class "posixAccount"
> > [15/Jun/2010:09:22:58 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" -- attribute "uidNumber" not allowed
> > [15/Jun/2010:09:22:58 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" missing attribute "uid" required by object
> > class "posixAccount"
> > [15/Jun/2010:09:22:58 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" missing attribute "cn" required by object
> > class "posixAccount"
> > [15/Jun/2010:09:22:58 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" missing attribute "homeDirectory" required by
> > object class "posixAccount"
> > [15/Jun/2010:09:23:04 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" missing attribute "uidNumber" required by
> > object class "posixAccount"
> > [15/Jun/2010:09:23:18 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" missing attribute "uidNumber" required by
> > object class "posixAccount"
> > [15/Jun/2010:09:23:18 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" missing attribute "gidNumber" required by
> > object class "posixAccount"
> > [15/Jun/2010:09:23:18 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" required attribute "objectclass" missing
> > [15/Jun/2010:09:23:18 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" required attribute "objectclass" missing
> > [15/Jun/2010:09:23:18 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" missing attribute "uid" required by object
> > class "posixAccount"
> > [15/Jun/2010:09:23:18 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" missing attribute "cn" required by object
> > class "posixAccount"
> > [15/Jun/2010:09:23:18 +0200] - Entry "uid=dbeijk, ou=People,
> > dc=directory,dc=intern" missing attribute "homeDirectory" required by
> > object class "posixAccount"
> > [15/Jun/2010:09:24:56 +0200] - Entry "uid=DEL *.*, ou=People,
> > dc=directory,dc=intern" missing attribute "homeDirectory" required by
> > object class "posixAccount"
> > [15/Jun/2010:09:50:20 +0200] - Entry "cn=wchiman, ou=people,
> > dc=directory,dc=intern" -- attribute "uidNumber" not allowed
> > [15/Jun/2010:10:12:43 +0200] NSMMReplicationPlugin -
> > multimaster_be_state_change: replica dc=directory,dc=intern is going
> > offline; disabling replication
> > [15/Jun/2010:10:12:43 +0200] - attrcrypt_unwrap_key: failed to unwrap
> > key for cipher AES
> > [15/Jun/2010:10:12:43 +0200] - Failed to retrieve key for cipher AES
> > in attrcrypt_cipher_init
> > [15/Jun/2010:10:12:43 +0200] - Failed to initialize cipher AES in
> > attrcrypt_init
> > [15/Jun/2010:10:12:43 +0200] - WARNING: Import is running with
> > nsslapd-db-private-import-mem on; No other process is allowed to
> > access the database
> > [15/Jun/2010:10:12:48 +0200] - import userRoot: Workers finished;
> > cleaning up...
> > [15/Jun/2010:10:12:49 +0200] - import userRoot: Workers cleaned up.
> > [15/Jun/2010:10:12:49 +0200] - import userRoot: Indexing complete.
> > Post-processing...
> > [15/Jun/2010:10:12:49 +0200] - import userRoot: Flushing caches...
> > [15/Jun/2010:10:12:49 +0200] - import userRoot: Closing files...
> > [15/Jun/2010:10:12:49 +0200] - import userRoot: Import complete.
> > Processed 4849 entries in 6 seconds. (808.17 entries/sec)
> > [15/Jun/2010:10:12:49 +0200] - attrcrypt_unwrap_key: failed to unwrap
> > key for cipher AES
> > [15/Jun/2010:10:12:49 +0200] - Failed to retrieve key for cipher AES
> > in attrcrypt_cipher_init
> > [15/Jun/2010:10:12:49 +0200] - Failed to initialize cipher AES in
> > attrcrypt_init
> > [15/Jun/2010:10:12:49 +0200] NSMMReplicationPlugin -
> > multimaster_be_state_change: replica dc=directory,dc=intern is coming
> > online; enabling replication
> > [15/Jun/2010:10:12:49 +0200] NSMMReplicationPlugin -
> > replica_reload_ruv: Warning: new data for replica
> > dc=directory,dc=intern does not match the data in the changelog.
> >  Recreating the changelog file. This could affect replication with
> > replica's  consumers in which case the consumers should be reinitialized.
> > [15/Jun/2010:10:12:49 +0200] - skipping cos definition
> > cn=nsAccountInactivation_cos,dc=directory,dc=intern--no templates found
> > [15/Jun/2010:10:55:43 +0200] NSMMReplicationPlugin -
> > multimaster_be_state_change: replica dc=directory,dc=intern is going
> > offline; disabling replication
> > [15/Jun/2010:10:55:43 +0200] - attrcrypt_unwrap_key: failed to unwrap
> > key for cipher AES
> > [15/Jun/2010:10:55:43 +0200] - Failed to retrieve key for cipher AES
> > in attrcrypt_cipher_init
> > [15/Jun/2010:10:55:43 +0200] - Failed to initialize cipher AES in
> > attrcrypt_init
> > [15/Jun/2010:10:55:43 +0200] - WARNING: Import is running with
> > nsslapd-db-private-import-mem on; No other process is allowed to
> > access the database
> > [15/Jun/2010:10:55:49 +0200] - import userRoot: Workers finished;
> > cleaning up...
> > [15/Jun/2010:10:55:49 +0200] - import userRoot: Workers cleaned up.
> > [15/Jun/2010:10:55:49 +0200] - import userRoot: Indexing complete.
> > Post-processing...
> > [15/Jun/2010:10:55:49 +0200] - import userRoot: Flushing caches...
> > [15/Jun/2010:10:55:49 +0200] - import userRoot: Closing files...
> > [15/Jun/2010:10:55:49 +0200] - import userRoot: Import complete.
> > Processed 4850 entries in 5 seconds. (970.00 entries/sec)
> > [15/Jun/2010:10:55:49 +0200] - attrcrypt_unwrap_key: failed to unwrap
> > key for cipher AES
> > [15/Jun/2010:10:55:49 +0200] - Failed to retrieve key for cipher AES
> > in attrcrypt_cipher_init
> > [15/Jun/2010:10:55:49 +0200] - Failed to initialize cipher AES in
> > attrcrypt_init
> > [15/Jun/2010:10:55:49 +0200] NSMMReplicationPlugin -
> > multimaster_be_state_change: replica dc=directory,dc=intern is coming
> > online; enabling replication
> > [15/Jun/2010:10:55:49 +0200] NSMMReplicationPlugin -
> > replica_reload_ruv: Warning: new data for replica
> > dc=directory,dc=intern does not match the data in the changelog.
> >  Recreating the changelog file. This could affect replication with
> > replica's  consumers in which case the consumers should be reinitialized.
> > [15/Jun/2010:10:55:49 +0200] - skipping cos definition
> > cn=nsAccountInactivation_cos,dc=directory,dc=intern--no templates found
> > [15/Jun/2010:10:59:57 +0200] - slapd shutting down - signaling
> > operation threads
> > [15/Jun/2010:10:59:57 +0200] - slapd shutting down - waiting for 26
> > threads to terminate
> > [15/Jun/2010:10:59:57 +0200] - slapd shutting down - closing down
> > internal subsystems and plugins
> > [15/Jun/2010:10:59:58 +0200] - Waiting for 4 database threads to stop
> > [15/Jun/2010:10:59:59 +0200] - All database threads now stopped
> > [15/Jun/2010:10:59:59 +0200] - slapd stopped.
> >         CentOS-Directory/8.1.0 B2009.134.1334
> >         <host>:<port> (/etc/dirsrv/slapd-serverb55)
> >
> > [15/Jun/2010:11:00:01 +0200] - Entry "cn=schema" single-valued
> > attribute "modifyTimestamp" has multiple values
> >         CentOS-Directory/8.1.0 B2009.134.1334
> >         serverb55:636 (/etc/dirsrv/slapd-serverb55)
> >
> > [15/Jun/2010:11:00:01 +0200] - CentOS-Directory/8.1.0 B2009.134.1334
> > starting up
> > [15/Jun/2010:11:00:01 +0200] - I'm resizing my cache now...cache was
> > 20000000 and is now 8000000
> > [15/Jun/2010:11:00:01 +0200] - attrcrypt_unwrap_key: failed to unwrap
> > key for cipher AES
> > [15/Jun/2010:11:00:01 +0200] - Failed to retrieve key for cipher AES
> > in attrcrypt_cipher_init
> > [15/Jun/2010:11:00:01 +0200] - Failed to initialize cipher AES in
> > attrcrypt_init
> > [15/Jun/2010:11:00:01 +0200] - attrcrypt_unwrap_key: failed to unwrap
> > key for cipher AES
> > [15/Jun/2010:11:00:01 +0200] - Failed to retrieve key for cipher AES
> > in attrcrypt_cipher_init
> > [15/Jun/2010:11:00:01 +0200] - Failed to initialize cipher AES in
> > attrcrypt_init
> > [15/Jun/2010:11:00:01 +0200] - skipping cos definition
> > cn=nsAccountInactivation_cos,dc=directory,dc=intern--no templates found
> > [15/Jun/2010:11:00:01 +0200] NSMMReplicationPlugin -
> > replica_check_for_data_reload: Warning: data for replica
> > dc=directory,dc=intern was reloaded and it no longer matches the data
> > in the changelog (replica data > changelog). Recreating the changelog
> > file. This could affect replication with replica's consumers in which
> > case the consumers should be reinitialized.
> > [15/Jun/2010:11:00:01 +0200] - skipping cos definition
> > cn=nsAccountInactivation_cos,dc=directory,dc=intern--no templates found
> > [15/Jun/2010:11:00:01 +0200] - slapd started.  Listening on All
> > Interfaces port 389 for LDAP requests
> > [15/Jun/2010:11:00:01 +0200] - Listening on All Interfaces port 636
> > for LDAPS requests
> > [15/Jun/2010:11:29:59 +0200] - slapd shutting down - signaling
> > operation threads
> > [15/Jun/2010:11:29:59 +0200] - slapd shutting down - closing down
> > internal subsystems and plugins
> > [15/Jun/2010:11:30:00 +0200] - Waiting for 4 database threads to stop
> > [15/Jun/2010:11:30:00 +0200] - All database threads now stopped
> > [15/Jun/2010:11:30:00 +0200] - slapd stopped.
> >         CentOS-Directory/8.1.0 B2009.134.1334
> >         <host>:<port> (/etc/dirsrv/slapd-serverb55)
> >
> > [15/Jun/2010:11:30:03 +0200] - Entry "cn=schema" single-valued
> > attribute "modifyTimestamp" has multiple values
> >         CentOS-Directory/8.1.0 B2009.134.1334
> >         serverb55:636 (/etc/dirsrv/slapd-serverb55)
> >
> > [15/Jun/2010:11:30:03 +0200] - CentOS-Directory/8.1.0 B2009.134.1334
> > starting up
> > [15/Jun/2010:11:30:03 +0200] - attrcrypt_unwrap_key: failed to unwrap
> > key for cipher AES
> > [15/Jun/2010:11:30:03 +0200] - Failed to retrieve key for cipher AES
> > in attrcrypt_cipher_init
> > [15/Jun/2010:11:30:03 +0200] - Failed to initialize cipher AES in
> > attrcrypt_init
> > [15/Jun/2010:11:30:03 +0200] - attrcrypt_unwrap_key: failed to unwrap
> > key for cipher AES
> > [15/Jun/2010:11:30:03 +0200] - Failed to retrieve key for cipher AES
> > in attrcrypt_cipher_init
> > [15/Jun/2010:11:30:03 +0200] - Failed to initialize cipher AES in
> > attrcrypt_init
> > [15/Jun/2010:11:30:03 +0200] - skipping cos definition
> > cn=nsAccountInactivation_cos,dc=directory,dc=intern--no templates found
> > [15/Jun/2010:11:30:03 +0200] - skipping cos definition
> > cn=nsAccountInactivation_cos,dc=directory,dc=intern--no templates found
> > [15/Jun/2010:11:30:03 +0200] - slapd started.  Listening on All
> > Interfaces port 389 for LDAP requests
> > [15/Jun/2010:11:30:03 +0200] - Listening on All Interfaces port 636
> > for LDAPS requests
> > [15/Jun/2010:11:40:44 +0200] - Beginning export of 'userroot'
> > [15/Jun/2010:11:40:44 +0200] - export userRoot: Processed 139 entries
> > (100%).
> > [15/Jun/2010:11:40:44 +0200] - Export finished.
> > [15/Jun/2010:11:46:12 +0200] NSMMReplicationPlugin -
> > multimaster_be_state_change: replica dc=directory,dc=intern is going
> > offline; disabling replication
> > [15/Jun/2010:11:46:12 +0200] - attrcrypt_unwrap_key: failed to unwrap
> > key for cipher AES
> > [15/Jun/2010:11:46:12 +0200] - Failed to retrieve key for cipher AES
> > in attrcrypt_cipher_init
> > [15/Jun/2010:11:46:12 +0200] - Failed to initialize cipher AES in
> > attrcrypt_init
> > [15/Jun/2010:11:46:12 +0200] - WARNING: Import is running with
> > nsslapd-db-private-import-mem on; No other process is allowed to
> > access the database
> > [15/Jun/2010:11:46:14 +0200] - libdb: page 1: illegal page type or format
> > [15/Jun/2010:11:46:14 +0200] - libdb: PANIC: Invalid argument
> > [15/Jun/2010:11:46:14 +0200] - FATAL ERROR at by MCC ou=people
> > dc=directory dc=intern (77); server stopping as database recovery needed.
> >         CentOS-Directory/8.1.0 B2009.134.1334
> >         <host>:<port> (/etc/dirsrv/slapd-serverb55
> > ======================================================================
> >
> > Finally my questions :
> > What could be the cause of the problem ?
> Not sure - never seen this particular error before - any disk errors in
> /var/log/messages?
>
> Can you try doing a reinit of this server from the other server?
> > What would be the best procedure to get the serverb55 up and running
> > again ?
> Doing a reinit of this server from the other server.
> >
> > Thanks for any advise.
> >
> > Regards,
> > Mark
> >
> >
> >
> > =======
> >
> >
> > On Tue, Jun 15, 2010 at 7:04 PM, Rich Megginson <rmeggins at redhat.com
> > <mailto:rmeggins at redhat.com>> wrote:
> >
> >     mark benschop wrote:
> >     > Hi All,
> >     >
> >     > I'm having a problem on a CentOs Directory Server 8.1 multiple
> >     master
> >     > setup.
> >     > The database of one of the servers has been marked as corrupt
> >     and has
> >     > been brought offline by the Directory Server.
> >     Can you post any relevant error messages from the error log of the
> >     server?
> >     > Ldapclients querying the ldapserver for e.g. loggin in of users
> >     get an
> >     > errormessage, effectively disabling users to log in.
> >     What error message?
> >     >
> >     > I'm wondering what the best method is to recover from this
> >     situation.
> >     > I can think of a few :
> >     > 1) Starting the ldapserver, deleting the database, recreating it
> and
> >     > restoring a backup.
> >
> >     > 2) Starting the ldapserver, deleting the database and
> reinitialising
> >     > the server from the other master.
> >     If you reinitialize the problem server from another server, you don't
> >     need to delete the database, reinit will do that for you.
> >     >
> >     > Can anyone give me some hints if this wil work or would another
> >     > approach be better ?
> >     >
> >     > Thanks for your advise,
> >     > Mark
> >     >
> >
> ------------------------------------------------------------------------
> >     >
> >     > --
> >     > 389 users mailing list
> >     > 389-users at lists.fedoraproject.org
> >     <mailto:389-users at lists.fedoraproject.org>
> >     > https://admin.fedoraproject.org/mailman/listinfo/389-users
> >
> >     --
> >     389 users mailing list
> >     389-users at lists.fedoraproject.org
> >     <mailto:389-users at lists.fedoraproject.org>
> >     https://admin.fedoraproject.org/mailman/listinfo/389-users
> >
> >
> > ------------------------------------------------------------------------
> >
> > --
> > 389 users mailing list
> > 389-users at lists.fedoraproject.org
> > https://admin.fedoraproject.org/mailman/listinfo/389-users
>
> --
> 389 users mailing list
> 389-users at lists.fedoraproject.org
> https://admin.fedoraproject.org/mailman/listinfo/389-users
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.fedoraproject.org/pipermail/389-users/attachments/20100624/bf76993d/attachment.html>