On Tue, Oct 1, 2013 at 8:19 AM, Ric <389-users-list@vorticity.org> wrote:
Hello All,

I hope you can forgive a request which I am sure doesn't have enough
information in it; please let me know what else I can add if you might
be able to help.

I have a problem with our installation of RHDS9 and practically
nothing in the logs to suggest where to look.

We have a multi-master pair, with DNS round robin to load balance.
Because of the problem I have updated DNS to point all traffic to the
working server, so I hope I can get this fixed again without
impacting the users. But since I don't know the cause, I'm concerned
it may also occur on the working server and prevent all logins. :(

We first noticed that replication was not working; now it seems that I
can't get slapd to start on one of the pair.
I have restarted both dirsrv instances and both servers.

There is woefully little in the log files, but if there is a way to
increase logging levels I haven't found it yet. If there is, please
advise and I'll do that and post.

This is the info I have gathered so far. Please let me know what else
might help.


/usr/sbin/ns-slapd -v
389 Project
389-Directory/1.2.11.15 B2013.211.1952

dirsrv dir01 is stopped
There is no:
/var/run/dirsrv/slapd-dir01.pid

# service dirsrv start
  *** Error: 1 instance(s) failed to start

The start-up runs the wait loop and finally exits with the message above.
The errors log includes:

[01/Oct/2013:12:14:47 +0100] - 389-Directory/1.2.11.15 B2013.211.1952
starting up
[01/Oct/2013:12:14:47 +0100] - WARNING: userRoot: entry cache size
10485760B is less than db size 10739712B; We recommend to increase the
entry cache size nsslapd-cachememsize.


The start-up process leaves one slapd running:
# ps -ef |grep slapd
dsuser   12560     1  0 09:51 ?        00:00:03 /usr/sbin/ns-slapd -D
/etc/dirsrv/slapd-dir01 -i /var/run/dirsrv/slapd-dir01.pid -w
/var/run/dirsrv/slapd-dir01.startpid

but no working ns-slapd.

I recognise that we need to tune the cache, but I don't believe that
would cause the start-up failure, just a performance hit. And to tune
it via the console I suspect I have to get the server running first!
The working server shows the same error, along with:

[01/Oct/2013:12:16:26 +0100] slapi_ldap_bind - Error: could not send
bind request for id [cn=repman,cn=config] mech [SIMPLE]: error -1
(Can't contact LDAP server) 0 (unknown) 107 (Transport endpoint is not
connected)

Which makes sense.

The errors and access logs provide no other content at all, so there
is nothing to indicate what is failing.

Any ideas where I might start will be greatly welcomed.

Many thanks, Ric.
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

I'm surprised that the failing node doesn't produce more useful output from a startup failure. Try setting the ownership of the /var/run/dirsrv directory to root:nobody, and if that doesn't help, nobody:nobody. Remove any stale PID files from within the directories.
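A sketch of those steps, using the slapd-dir01 instance name and the leftover PID (12560) from your ps output; double-check both on your system before running, as root:

```shell
# Kill the ns-slapd process left behind by the failed start
# (12560 was the PID in your ps listing; verify it first)
kill 12560

# Reset ownership on the run directory (try root:nobody first,
# then nobody:nobody if the start still fails)
chown -R nobody:nobody /var/run/dirsrv

# Remove stale PID files so the next start isn't confused by them
rm -f /var/run/dirsrv/slapd-dir01.pid \
      /var/run/dirsrv/slapd-dir01.startpid

service dirsrv start
```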

A few to start:
 - Check for differences in the dse.ldif files. Node-specific information (replication agreements, etc.) will show up as normal differences; look for anything that was changed on the non-starting node. Which logs are you looking at?
 - Check permissions on the files and directories that the directory server uses; nobody:nobody should be the owner for 389 DS.
 - Check the location and status of any PID files, such as /var/run/dirsrv/admin-serv.pid and /var/run/dirsrv/slapd-dirsrv1.pid
 - Check the logs of the working node around the time of the initial failure
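For the dse.ldif comparison, something along these lines; "workinghost" is a placeholder for your working node, and slapd-dir01 is the instance name from your output:

```shell
# Pull the working node's config over for comparison
scp workinghost:/etc/dirsrv/slapd-dir01/dse.ldif /tmp/dse.ldif.working

# Replication agreements and host-specific attributes will differ
# normally; anything else that changed recently is suspect
diff -u /etc/dirsrv/slapd-dir01/dse.ldif /tmp/dse.ldif.working

# While you're there, verify ownership under the main 389 DS paths
ls -ld /etc/dirsrv/slapd-dir01 /var/lib/dirsrv/slapd-dir01 /var/run/dirsrv
```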

A few for the hopeful:
- Do you have backups? Mine are in "/var/lib/dirsrv/slapd-baldirsrv1/bak"
- Can you build a new node and join it to the multi-master topology? I believe it supports 20+ masters now. Adding more is fairly easy once you have worked out the kinks.
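On the logging question from the original post: error-log verbosity is controlled by nsslapd-errorlog-level on cn=config. On the running node you can change it over LDAP; on the node that won't start, stop dirsrv and edit the attribute directly in dse.ldif, then start again to capture verbose output. A sketch, where the bind DN is an assumption (use your own Directory Manager DN) and 8192 is the replication debugging level:

```shell
# On the working node: enable replication debug logging at runtime
ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192
EOF

# On the failing node: with dirsrv stopped, set the same attribute
# under "dn: cn=config" in /etc/dirsrv/slapd-dir01/dse.ldif, then
# run "service dirsrv start" and watch the errors log.
```

Remember to set the level back to the default afterwards, as verbose logging is noisy.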