Fri, Apr 03, 2020 at 05:00:50PM CEST, olichtne(a)redhat.com wrote:
On Fri, Apr 03, 2020 at 04:43:07PM +0200, Jan Tluka wrote:
> Wed, Apr 01, 2020 at 06:52:53PM CEST, olichtne(a)redhat.com wrote:
> >From: Ondrej Lichtner <olichtne(a)redhat.com>
> >
> >File descriptors opened for communication with a Job process on the
> >slave were not being closed and cleaned up properly. This exhausts
> >the file descriptor ulimit very quickly, even when recipe jobs are
> >launched purely in sequence. This effectively limited every Recipe to
> >a certain number of jobs it could execute over its lifetime.
> >
> >Closing the multiprocessing.Pipe connections (unix sockets) and
> >removing the Job from the JobContext, which cleans up the sentinel
> >pipe, resolves this issue.
> >
> >The file descriptor limit can still be hit if enough parallel Jobs
> >are run, but that is intended behavior and a problem that is out of
> >scope for LNST.
>
> What happens if this is hit?
> Would LNST handle this properly?
> Would the user get information that the limit has been reached
> (OSError), and would all resources be freed?
>
> -Jan
Tested it out with the following recipe:
from lnst.Controller import BaseRecipe, HostReq, DeviceReq

class ReproducerRecipe(BaseRecipe):
    host1 = HostReq()
    host1.eth0 = DeviceReq(label="to_switch")

    def test(self):
        host1 = self.matched.host1

        for host in self.matched:
            for dev in host.devices:
                dev.down()

        host1.eth0.ip_add("192.168.1.1/24")
        host1.eth0.up()

        # each background job leaves its communication fds open on the
        # slave, so this loop eventually exhausts the fd limit
        for i in range(1500):
            host1.run("sleep 100", bg=True)
ulimit on the slave:
[root@f28 lnst]# ulimit -n
1024
the recipe just gets stuck after 336 jobs, which is consistent with the
1024 fd limit assuming each leaked job holds roughly 3 fds (the two
pipe ends plus the process sentinel) on top of the slave's baseline fds.
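The leak mechanism itself can be seen with plain multiprocessing,
outside of LNST; a minimal sketch (Linux-only /proc fd counting, the
numbers are illustrative):

import multiprocessing
import os

def child(conn):
    conn.send("done")
    conn.close()

def count_fds():
    # Linux-specific: number of open fds in this process
    return len(os.listdir("/proc/self/fd"))

jobs = []
print("fds before:", count_fds())
for i in range(100):
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=child, args=(child_conn,))
    proc.start()
    parent_conn.recv()
    proc.join()
    # keeping all three objects alive mirrors what the slave did:
    # both pipe fds plus the process sentinel stay open per job
    jobs.append((proc, parent_conn, child_conn))
print("fds after:", count_fds())  # grows by ~3 per iteration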
Logs from the slave:
2020-04-03 16:55:00 (localhost) - DEBUG: Running job 336 with pid "2342"
2020-04-03 16:55:00 (localhost) - DEBUG: Executing: "sleep 100"
Traceback (most recent call last):
  File "./lnst-slave", line 102, in <module>
  File "./lnst-slave", line 99, in main
  File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 966, in run
  File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 999, in _process_msg
  File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 391, in run_job
  File "/tmp/lnst/lnst/Slave/Job.py", line 91, in run
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 105, in start
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 223, in _Popen
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 277, in _Popen
  File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
  File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 65, in _launch
OSError: [Errno 24] Too many open files
^C2020-04-03 16:55:10 (localhost) - INFO: Caught signal 2 -> dying
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 28, in poll
  File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 1058, in _signal_die_handler
lnst.Slave.NetTestSlave.SystemCallException
I had to kill the slave with two Ctrl-C presses, and there's no log on
the Controller indicating any issue. The IP address and UP state
remained on the device mapped to eth0.
So, good point. I think that's a separate issue that exists even now
and that this patchset simply doesn't address. The patchset does,
however, fix the original issue of exhausting the pipe fds when jobs
are run sequentially.
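For reference, the fix boils down to a cleanup pattern like this (a
minimal sketch; the actual code in lnst/Slave/Job.py and the JobContext
handling differ in detail):

import multiprocessing

def run_job(job_target):
    # one Pipe end for the slave, one for the forked Job process
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=job_target, args=(child_conn,))
    proc.start()
    # close the child's end in the parent, otherwise its fd stays
    # open for the lifetime of the slave process
    child_conn.close()
    try:
        return parent_conn.recv()
    finally:
        parent_conn.close()
        proc.join()
        # dropping the last reference to proc (analogous to removing
        # the Job from the JobContext) lets the sentinel fd be closed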
I'll look into the parallel scenario and send a separate patch(set)
that attempts to handle the exception and die gracefully.
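On the slave side that could look something like this (just a sketch;
the reporting helper is hypothetical):

import errno

def run_job(self, job):
    try:
        job.run()
    except OSError as e:
        if e.errno != errno.EMFILE:
            raise
        # hypothetical helper: report the failure back to the
        # Controller instead of leaving it waiting forever
        self._report_job_start_failure(job, str(e))
        return False
    return True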
-Ondrej
Sure, agree.
Ack to the series.