The problem is that when the "switch to primary server" task runs and we
are about to reset the server status, the timeout (30 seconds) has not
yet expired, because the task is scheduled exactly 30 seconds after the
fallback to a backup server. As a result, the server status is only
reset after another 30 seconds. This can be seen nicely in the logs:
[be_primary_server_timeout] (0x0400): Looking for primary server!
[fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
[get_server_status] (0x1000): Status of server 'backup.pb' is 'working'
[get_port_status] (0x1000): Port status of port 389 for server
'backup.pb' is 'working'
[get_server_status] (0x1000): Status of server 'ipa-server.ipa.pb' is
'working'
[get_port_status] (0x1000): Port status of port 389 for server
'ipa-server.ipa.pb' is 'not working'
[get_port_status] (0x0010): ===== DIFF = 30 > 30
[get_server_status] (0x1000): Status of server 'backup.pb' is 'working'
[get_port_status] (0x1000): Port status of port 389 for server
'backup.pb' is 'working'
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to
10 seconds
[get_server_status] (0x1000): Status of server 'backup.pb' is 'working'
[be_resolve_server_process] (0x1000): Saving the first resolved server
[be_resolve_server_process] (0x0200): Found address for server
backup.pb: [10.16.78.43] TTL 60
[be_primary_server_timeout_activate] (0x2000): Primary server
reactivation timeout set to 30 seconds
The question is how we should deal with this. The easiest solution (in
the attached patch) is simply to replace > with >=. But I'm not sure
whether it is the best solution, since it doesn't really fit my
understanding of a timeout.
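
To make the boundary case concrete, here is a rough sketch of the timing
in Python. It is not the SSSD code: the names port_is_retryable and
first_retry_time are made up for illustration, and it assumes the port is
marked 'not working' at the same moment we fall back to the backup server.

```python
RETRY_TIMEOUT = 30     # seconds a port stays marked 'not working'
PRIMARY_TIMEOUT = 30   # seconds between "switch to primary server" runs


def port_is_retryable(now, last_status_change, strict=True):
    """Return True when the 'not working' mark may be reset.

    strict=True models the current '>' comparison,
    strict=False models the '>=' from the attached patch.
    """
    diff = now - last_status_change
    return diff > RETRY_TIMEOUT if strict else diff >= RETRY_TIMEOUT


def first_retry_time(strict):
    """Time (in seconds) of the first task run that resets the status."""
    failed_at = 0                      # assumed: marked 'not working' at t=0
    now = failed_at + PRIMARY_TIMEOUT  # first run, 30 s after the fallback
    while not port_is_retryable(now, failed_at, strict):
        now += PRIMARY_TIMEOUT         # task is rescheduled for another 30 s
    return now


print(first_retry_time(strict=True))   # current behaviour
print(first_retry_time(strict=False))  # with the patch
```

With the strict comparison the first run sees DIFF = 30, which is not
greater than 30, so the reset only happens on the second run; with >=
it happens on the first one.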
My other thoughts were:
1. schedule the "switch to primary server" task after 31 seconds
2. reset the server status of all primary servers when this task is
triggered