[PATCH 1/5] lnst.Common.ConnectionHandler: don't pass connections as a reference

[PATCH]...

[PATCH v3] Sphinx based...

olichtne＠redhat.com

Wednesday, 1 April 2020 Wed, 1 Apr '20

11:52 a.m.

From: Ondrej Lichtner <olichtne(a)redhat.com> Passing self._connections directly like this means that the list object is passed as a reference. This is potentially dangerous as the _check_connections method loop as well as updates this list. We should instead pass a copy of the list. Signed-off-by: Ondrej Lichtner <olichtne(a)redhat.com> --- lnst/Common/ConnectionHandler.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lnst/Common/ConnectionHandler.py b/lnst/Common/ConnectionHandler.py index 91a94aa..29d4b19 100644 --- a/lnst/Common/ConnectionHandler.py +++ b/lnst/Common/ConnectionHandler.py @@ -48,7 +48,7 @@ class ConnectionHandler(object): self._connection_mapping = {} def check_connections(self, timeout=None): - return self._check_connections(self._connections, timeout) + return self._check_connections(list(self._connections), timeout) def check_connections_by_id(self, connection_ids, timeout=None): connections = [] -- 2.26.0

Show replies by date

olichtne＠redhat.com

Wednesday, 1 April Wed, 1 Apr

11:52 a.m.

New subject: [PATCH 2/5] lnst.Common.ConnectionHandler: add exception logging

From: Ondrej Lichtner <olichtne(a)redhat.com> This except block can hide potential issues, the exception should be logged. Signed-off-by: Ondrej Lichtner <olichtne(a)redhat.com> --- lnst/Common/ConnectionHandler.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lnst/Common/ConnectionHandler.py b/lnst/Common/ConnectionHandler.py index 29d4b19..3f7f99f 100644 --- a/lnst/Common/ConnectionHandler.py +++ b/lnst/Common/ConnectionHandler.py @@ -13,6 +13,8 @@ olichtne(a)redhat.com (Ondrej Lichtner) import select import socket +import logging +import traceback from multiprocessing.connection import Connection from pyroute2 import IPRSocket from lnst.Common.SecureSocket import SecureSocket, SecSocketException @@ -61,6 +63,7 @@ class ConnectionHandler(object): try: rl, wl, xl = select.select(connections, [], [], timeout) except select.error: + logging.debug(traceback.format_exc()) return [] for f in rl: f_ready = True -- 2.26.0

olichtne＠redhat.com

11:52 a.m.

New subject: [PATCH 3/5] lnst.Common.ConnectionHandler: clean up closed connections

From: Ondrej Lichtner <olichtne(a)redhat.com> Before calling select on provided connection/socket objects we need to check if any of them may have been closed and need to be removed first. The main reason for this is that a closed multiprocessing.Connection object may be present in the list which will result in the following select call returning with an exception. This exception doesn't indicate which of the file objects causes the issue and selecting for "error" file descriptors also doesn't help. As such we need to check this before calling select. Signed-off-by: Ondrej Lichtner <olichtne(a)redhat.com> --- lnst/Common/ConnectionHandler.py | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/lnst/Common/ConnectionHandler.py b/lnst/Common/ConnectionHandler.py index 3f7f99f..a74d4a8 100644 --- a/lnst/Common/ConnectionHandler.py +++ b/lnst/Common/ConnectionHandler.py @@ -59,6 +59,11 @@ class ConnectionHandler(object): return self._check_connections(connections, timeout) def _check_connections(self, connections, timeout): + for c in list(connections): + if c.closed: + self.remove_connection(c) + connections.remove(c) + requests = [] try: rl, wl, xl = select.select(connections, [], [], timeout) -- 2.26.0

Jan Tluka

Friday, 3 April Fri, 3 Apr

9:18 a.m.

New subject: [PATCH 3/5] lnst.Common.ConnectionHandler: clean up closed connections

Wed, Apr 01, 2020 at 06:52:51PM CEST, olichtne(a)redhat.com wrote:

...

From: Ondrej Lichtner <olichtne(a)redhat.com> Before calling select on provided connection/socket objects we need to check if any of them may have been closed and need to be removed first.

Not quite sure but don't we need to first check if there was any data available on the connection/socket? With the patch we could silently discard such data.

...

The main reason for this is that a closed multiprocessing.Connection object may be present in the list which will result in the following select call returning with an exception. This exception doesn't indicate which of the file objects causes the issue and selecting for "error" file descriptors also doesn't help. As such we need to check this before calling select. Signed-off-by: Ondrej Lichtner <olichtne(a)redhat.com> --- lnst/Common/ConnectionHandler.py | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/lnst/Common/ConnectionHandler.py b/lnst/Common/ConnectionHandler.py index 3f7f99f..a74d4a8 100644 --- a/lnst/Common/ConnectionHandler.py +++ b/lnst/Common/ConnectionHandler.py @@ -59,6 +59,11 @@ class ConnectionHandler(object): return self._check_connections(connections, timeout) def _check_connections(self, connections, timeout): + for c in list(connections): + if c.closed: + self.remove_connection(c) + connections.remove(c) + requests = [] try: rl, wl, xl = select.select(connections, [], [], timeout) -- 2.26.0 _______________________________________________ LNST-developers mailing list -- lnst-developers(a)lists.fedorahosted.org To unsubscribe send an email to lnst-developers-leave(a)lists.fedorahosted.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedorahosted.org/archives/list/lnst-developers@lists.fedora...

Ondrej Lichtner

9:24 a.m.

New subject: [PATCH 3/5] lnst.Common.ConnectionHandler: clean up closed connections

On Fri, Apr 03, 2020 at 04:18:38PM +0200, Jan Tluka wrote:

...

Wed, Apr 01, 2020 at 06:52:51PM CEST, olichtne(a)redhat.com wrote: >From: Ondrej Lichtner <olichtne(a)redhat.com> > >Before calling select on provided connection/socket objects we need to >check if any of them may have been closed and need to be removed first. Not quite sure but don't we need to first check if there was any data available on the connection/socket? With the patch we could silently discard such data.

We don't because "closed" is only true if the file descriptor number for the socket is -1, so we can't really read from it anymore anyway. And for Connection objects created by multiprocessing.Pipe it's a similar situation, the "handle" which would normally contain the filedescriptor is cleaned up already and the "closed" property is calculated based on the availability of this handle. -Ondrej

Jan Tluka

9:34 a.m.

New subject: [PATCH 3/5] lnst.Common.ConnectionHandler: clean up closed connections

Fri, Apr 03, 2020 at 04:24:47PM CEST, olichtne(a)redhat.com wrote:

...

On Fri, Apr 03, 2020 at 04:18:38PM +0200, Jan Tluka wrote: > Wed, Apr 01, 2020 at 06:52:51PM CEST, olichtne(a)redhat.com wrote: > >From: Ondrej Lichtner <olichtne(a)redhat.com> > > > >Before calling select on provided connection/socket objects we need to > >check if any of them may have been closed and need to be removed first. > > Not quite sure but don't we need to first check if there was any data > available on the connection/socket? With the patch we could silently > discard such data. We don't because "closed" is only true if the file descriptor number for the socket is -1, so we can't really read from it anymore anyway. And for Connection objects created by multiprocessing.Pipe it's a similar situation, the "handle" which would normally contain the filedescriptor is cleaned up already and the "closed" property is calculated based on the availability of this handle. -Ondrej

Ok. Then it looks fine. Reading the "lost [5/5]" patch now :-) -Jan

olichtne＠redhat.com

Wednesday, 1 April Wed, 1 Apr

11:52 a.m.

New subject: [PATCH 4/5] lnst.Common.SecureSocket: add 'closed' property

From: Ondrej Lichtner <olichtne(a)redhat.com> Indicates if the socket has been closed, extends compatibility with other file like objects in python. Signed-off-by: Ondrej Lichtner <olichtne(a)redhat.com> --- lnst/Common/SecureSocket.py | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/lnst/Common/SecureSocket.py b/lnst/Common/SecureSocket.py index 8fae397..48262f1 100644 --- a/lnst/Common/SecureSocket.py +++ b/lnst/Common/SecureSocket.py @@ -335,6 +335,10 @@ class SecureSocket(object): def close(self): return self._socket.close() + @property + def closed(self): + return self._socket.fileno == -1 + def shutdown(self, how): return self._socket.shutdown(how) -- 2.26.0

olichtne＠redhat.com

11:52 a.m.

New subject: [PATCH 5/5] lnst.Slave.Job: cleanup leftover open file descriptors

From: Ondrej Lichtner <olichtne(a)redhat.com> File descriptors opened for communication with a Job process on the slave were not being closed and cleaned up properly. This leads to file descriptor ulimit exhaustion very quickly and only via launching recipe jobs in a sequence. This basically made any Recipe "time" limited to a certain number of jobs that it can execute. Closing the multiprocessing.Pipe-s (unix sockets) and removing the Job from the JobContext to clean up the sentinel pipe resolves this issue. The file descriptor limit can still be hit in case enough parallel Jobs are run, but that is intended and a problem that is out-of-scope for LNST. Signed-off-by: Ondrej Lichtner <olichtne(a)redhat.com> --- lnst/Slave/Job.py | 7 +++++++ lnst/Slave/NetTestSlave.py | 3 +++ 2 files changed, 10 insertions(+) diff --git a/lnst/Slave/Job.py b/lnst/Slave/Job.py index 004ef02..b4cdac4 100644 --- a/lnst/Slave/Job.py +++ b/lnst/Slave/Job.py @@ -95,6 +95,8 @@ class Job(object): return True def _run(self): + self._parent_pipe.close() + os.setpgrp() signal.signal(signal.SIGHUP, signal.SIG_DFL) signal.signal(signal.SIGINT, signal.SIG_DFL) @@ -150,6 +152,11 @@ class Job(object): self._finished = True self._result = result + self._parent_pipe.close() + self._child_pipe.close() + self._parent_pipe = None + self._child_pipe = None + def get_result(self): return self._result diff --git a/lnst/Slave/NetTestSlave.py b/lnst/Slave/NetTestSlave.py index e2e8d6d..710dbce 100644 --- a/lnst/Slave/NetTestSlave.py +++ b/lnst/Slave/NetTestSlave.py @@ -1032,6 +1032,9 @@ class NetTestSlave: job.set_finished(msg["result"]) self._server_handler.send_data_to_ctl(msg) + + self._job_context.del_job(job) + elif msg["type"] == "from_netns": self._server_handler.send_data_to_ctl(msg["data"]) elif msg["type"] == "to_netns": -- 2.26.0

Jan Tluka

Friday, 3 April Fri, 3 Apr

9:43 a.m.

New subject: [PATCH 5/5] lnst.Slave.Job: cleanup leftover open file descriptors

Wed, Apr 01, 2020 at 06:52:53PM CEST, olichtne(a)redhat.com wrote:

...

What happens if this is hit? Would LNST handle this properly? Would user get an information that the limit has been reached (OSError) and all resources will be freed? -Jan

Ondrej Lichtner

10 a.m.

New subject: [PATCH 5/5] lnst.Slave.Job: cleanup leftover open file descriptors

On Fri, Apr 03, 2020 at 04:43:07PM +0200, Jan Tluka wrote:

...

Wed, Apr 01, 2020 at 06:52:53PM CEST, olichtne(a)redhat.com wrote: >From: Ondrej Lichtner <olichtne(a)redhat.com> > >File descriptors opened for communication with a Job process on the >slave were not being closed and cleaned up properly. This leads to file >descriptor ulimit exhaustion very quickly and only via launching recipe >jobs in a sequence. This basically made any Recipe "time" limited to a >certain number of jobs that it can execute. > >Closing the multiprocessing.Pipe-s (unix sockets) and removing the Job >from the JobContext to clean up the sentinel pipe resolves this issue. > >The file descriptor limit can still be hit in case enough parallel Jobs >are run, but that is intended and a problem that is out-of-scope for >LNST. What happens if this is hit? Would LNST handle this properly? Would user get an information that the limit has been reached (OSError) and all resources will be freed? -Jan

Tested it out with the following recipe: class ReproducerRecipe(BaseRecipe): host1 = HostReq() host1.eth0 = DeviceReq(label="to_switch") def test(self): host1 = self.matched.host1 for host in self.matched: for dev in host.devices: dev.down() host1.eth0.ip_add("192.168.1.1/24") host1.eth0.up() for i in range(1500): host1.run("sleep 100", bg=True) ulimit on the slave: [root@f28 lnst]# ulimit -n 1024 the recipe just gets stuck after 336 jobs Logs from slave: 2020-04-03 16:55:00 (localhost) - DEBUG: Running job 336 with pid "2342" 2020-04-03 16:55:00 (localhost) - DEBUG: Executing: "sleep 100" Traceback (most recent call last): File "./lnst-slave", line 102, in <module> File "./lnst-slave", line 99, in main File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 966, in run File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 999, in _process_msg File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 391, in run_job File "/tmp/lnst/lnst/Slave/Job.py", line 91, in run File "/usr/lib64/python3.6/multiprocessing/process.py", line 105, in start File "/usr/lib64/python3.6/multiprocessing/context.py", line 223, in _Popen File "/usr/lib64/python3.6/multiprocessing/context.py", line 277, in _Popen File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 19, in __init__ File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 65, in _launch OSError: [Errno 24] Too many open files ^C2020-04-03 16:55:10 (localhost) - INFO: Caught signal 2 -> dying Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 28, in poll File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 1058, in _signal_die_handler lnst.Slave.NetTestSlave.SystemCallException I had to kill the slave with two Ctrl-C key combinations. And there's no log indicating an issue on the Controller. The ip address and UP state remained on the eth0 mapped device. So good point, I think that's a separate issue that's there even now, and the patchset just doesn't address it. The patchset does however address the original issue of exhausting the pipes sequentially. I'll look into the parallel scenario and send a separate patch(set) which will attempt to handle the exception and die gracefully. -Ondrej

Jan Tluka

10:26 a.m.

New subject: [PATCH 5/5] lnst.Slave.Job: cleanup leftover open file descriptors

Fri, Apr 03, 2020 at 05:00:50PM CEST, olichtne(a)redhat.com wrote:

...

On Fri, Apr 03, 2020 at 04:43:07PM +0200, Jan Tluka wrote: > Wed, Apr 01, 2020 at 06:52:53PM CEST, olichtne(a)redhat.com wrote: > >From: Ondrej Lichtner <olichtne(a)redhat.com> > > > >File descriptors opened for communication with a Job process on the > >slave were not being closed and cleaned up properly. This leads to file > >descriptor ulimit exhaustion very quickly and only via launching recipe > >jobs in a sequence. This basically made any Recipe "time" limited to a > >certain number of jobs that it can execute. > > > >Closing the multiprocessing.Pipe-s (unix sockets) and removing the Job > >from the JobContext to clean up the sentinel pipe resolves this issue. > > > >The file descriptor limit can still be hit in case enough parallel Jobs > >are run, but that is intended and a problem that is out-of-scope for > >LNST. > > What happens if this is hit? > Would LNST handle this properly? > Would user get an information that the limit has been reached (OSError) and > all resources will be freed? > > -Jan Tested it out with the following recipe: class ReproducerRecipe(BaseRecipe): host1 = HostReq() host1.eth0 = DeviceReq(label="to_switch") def test(self): host1 = self.matched.host1 for host in self.matched: for dev in host.devices: dev.down() host1.eth0.ip_add("192.168.1.1/24") host1.eth0.up() for i in range(1500): host1.run("sleep 100", bg=True) ulimit on the slave: [root@f28 lnst]# ulimit -n 1024 the recipe just gets stuck after 336 jobs Logs from slave: 2020-04-03 16:55:00 (localhost) - DEBUG: Running job 336 with pid "2342" 2020-04-03 16:55:00 (localhost) - DEBUG: Executing: "sleep 100" Traceback (most recent call last): File "./lnst-slave", line 102, in <module> File "./lnst-slave", line 99, in main File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 966, in run File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 999, in _process_msg File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 391, in run_job File "/tmp/lnst/lnst/Slave/Job.py", line 91, in run File "/usr/lib64/python3.6/multiprocessing/process.py", line 105, in start File "/usr/lib64/python3.6/multiprocessing/context.py", line 223, in _Popen File "/usr/lib64/python3.6/multiprocessing/context.py", line 277, in _Popen File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 19, in __init__ File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 65, in _launch OSError: [Errno 24] Too many open files ^C2020-04-03 16:55:10 (localhost) - INFO: Caught signal 2 -> dying Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 28, in poll File "/tmp/lnst/lnst/Slave/NetTestSlave.py", line 1058, in _signal_die_handler lnst.Slave.NetTestSlave.SystemCallException I had to kill the slave with two Ctrl-C key combinations. And there's no log indicating an issue on the Controller. The ip address and UP state remained on the eth0 mapped device. So good point, I think that's a separate issue that's there even now, and the patchset just doesn't address it. The patchset does however address the original issue of exhausting the pipes sequentially. I'll look into the parallel scenario and send a separate patch(set) which will attempt to handle the exception and die gracefully. -Ondrej

Sure, agree. Ack to the series.

Ondrej Lichtner

Monday, 6 April Mon, 6 Apr

10:18 a.m.

New subject: [PATCH 1/5] lnst.Common.ConnectionHandler: don't pass connections as a reference

On Wed, Apr 01, 2020 at 06:52:49PM +0200, olichtne(a)redhat.com wrote:

...

Pushing the set to master. -Ondrej

1481

days inactive

1486

days old

lnst-developers@lists.fedorahosted.org

Manage subscription

11 comments

3 participants

tags (0)

participants (3)

Jan Tluka
olichtne＠redhat.com
Ondrej Lichtner

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

[PATCH 1/5] lnst.Common.ConnectionHandler: don't pass connections as a reference