That is odd behavior.
Do all of the replicas have the same applications connecting to them? Not necessarily the same instances, but the same applications configured in a similar way. The reason I ask is that I'm wondering if there might be a rogue app repeatedly sending heavy queries to the servers. It sounds like issues I've seen in environments where nscd wasn't properly tuned (hundreds of processes querying nscd, but only the default low thread count limit) and had stopped functioning correctly. The result was that the LDAP servers were victims of hundreds of boxes launching an unintended DoS attack.
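
If you want to check for a noisy client, a rough sketch like the one below could tally search operations per client IP from the directory server access log. It assumes the usual access log layout, where a "connection from <client> to <server>" line introduces each conn= number and later "SRCH" lines reuse that number; the function name and the sample lines are made up for illustration, and your log format may differ.

```python
import re
from collections import Counter

# Assumed access log layout: a connection line maps conn=N to a client IP,
# and later operation lines for that connection carry the same conn=N.
CONN_RE = re.compile(r"conn=(\d+) fd=\d+ slot=\d+ .*connection from (\S+) to \S+")
SRCH_RE = re.compile(r"conn=(\d+) op=\d+ SRCH ")


def count_searches_by_client(lines):
    """Return a Counter of SRCH operations keyed by client IP."""
    conn_to_ip = {}
    counts = Counter()
    for line in lines:
        conn = CONN_RE.search(line)
        if conn:
            conn_to_ip[conn.group(1)] = conn.group(2)
            continue
        srch = SRCH_RE.search(line)
        if srch:
            # Connections opened before the log starts have no known client.
            counts[conn_to_ip.get(srch.group(1), "unknown")] += 1
    return counts


# Hypothetical sample lines, only to show the shape of the input.
sample = [
    '[14/May/2012:19:00:01 -0700] conn=11 fd=64 slot=64 connection from 10.0.0.5 to 10.0.0.1',
    '[14/May/2012:19:00:01 -0700] conn=11 op=0 BIND dn="" method=128 version=3',
    '[14/May/2012:19:00:02 -0700] conn=11 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(uid=a)" attrs=ALL',
    '[14/May/2012:19:00:03 -0700] conn=11 op=2 SRCH base="dc=example,dc=com" scope=2 filter="(uid=b)" attrs=ALL',
    '[14/May/2012:19:00:04 -0700] conn=12 fd=65 slot=65 connection from 10.0.0.9 to 10.0.0.1',
    '[14/May/2012:19:00:05 -0700] conn=12 op=0 SRCH base="dc=example,dc=com" scope=2 filter="(uid=c)" attrs=ALL',
]

for ip, n in count_searches_by_client(sample).most_common():
    print(ip, n)
```

To run it against a real hub, pass an open file instead of the sample list, for example the access file under /var/log/dirsrv/slapd-<instance>/ if your instance logs to the default location. A single client far ahead of the rest in the hour before the IO spike would be worth a closer look.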

On May 14, 2012 7:54 PM, "Brad Schuetz" <brad@omnis.com> wrote:
I have recently upgraded our 389 servers from pretty old versions that
were a mix and match of 389 release and CentOS released versions (all on
centos 5) to the latest (on centos6) (specific RPMs listed below).

I did this through a full ldif dump of the original server and imported
into a freshly installed new master server.  Then I set up the
replication agreements with the 7 slave servers and everything was
running fine.

After about a week I started having a problem with the hub servers,
where all of them after (possibly exactly) 24 hours would start going
crazy on the disk IO (95-100% according to sysstat) of that server,
making queries to ldap slow.  The master server does not exhibit this
problem; it will run completely fine.

A simple restart of the dirsrv process corrects the issue and then it
will run for another 24 hours before repeating the issue.

The hardware running each node is somewhat different with varying disk
speeds underlying, but all exhibit the same behavior.

This happens the same on the 2 nodes that get relatively little traffic
and the 5 nodes that get a lot of traffic.

I was originally on the 389-ds-base release that shipped with CentOS6
and have changed to the version from the
<http://repos.fedorapeople.org/repos/rmeggins/389-ds-base/epel-389-ds-base.repo>
repo; both do the same thing.

Any thoughts/suggestions on how to fix or further diagnose this?  I've
had no luck with strace or error logs to find any issues.  At this point
I've unfortunately had to resort to a cron job to restart all of my LDAP
hubs.

Installed RPMs:
389-ds-console-1.2.6-1.el6.noarch
389-ds-1.2.2-1.el6.noarch
389-console-1.1.7-1.el6.noarch
389-admin-console-1.1.8-1.el6.noarch
389-ds-console-doc-1.2.6-1.el6.noarch
389-dsgw-1.1.9-1.el6.x86_64
389-admin-1.1.29-1.el6.x86_64
389-ds-base-1.2.10.7-1.el6.x86_64
389-adminutil-1.1.15-1.el6.x86_64
389-admin-console-doc-1.1.8-1.el6.noarch
389-ds-base-libs-1.2.10.7-1.el6.x86_64

--
Brad
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users