Hi,
I've got a system that is acting as an OpenStack controller. As such, it's exporting LVM block devices over iSCSI using LIO, managed with the targetcli/targetctl tools. The system being tested is running CentOS 7.
We have an active/standby configuration, and when we want to switch activity we need to shut down the iSCSI targets, move activity to the other node, and then bring them back up again. To shut down the iSCSI targets we save the configuration with "targetctl save" and then run "targetctl clear". We then go on to try to deactivate the LVM volumes.
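Roughly, the shutdown half of the switchover boils down to something like this (a simplified sketch; the "cinder-volumes" VG name is made up for illustration):

    # save the running LIO configuration (defaults to /etc/target/saveconfig.json)
    targetctl save

    # tear down all targets, LUNs and backstores
    targetctl clear

    # then try to deactivate the LVM volumes -- this is where things go wrong
    vgchange -an cinder-volumes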
The problem we're running into is that if there are any actively-in-use targets (the backing store for a running boot-from-volume OpenStack guest, for example), after we run "targetctl clear" any LVM-related command hangs. As an example, I ran "vgs" and it hung with the following stack trace:
[<ffffffff81081ae5>] flush_work+0x105/0x1d0
[<ffffffff81081c39>] __cancel_work_timer+0x89/0x120
[<ffffffff81081d03>] cancel_delayed_work_sync+0x13/0x20
[<ffffffff812dba60>] disk_block_events+0x80/0x90
[<ffffffff811dee0e>] __blkdev_get+0x6e/0x4d0
[<ffffffff811df445>] blkdev_get+0x1d5/0x360
[<ffffffff811df67b>] blkdev_open+0x5b/0x80
[<ffffffff811a1cc7>] do_dentry_open+0x1a7/0x2e0
[<ffffffff811a1ef9>] vfs_open+0x39/0x70
[<ffffffff811b131d>] do_last+0x1ed/0x1270
[<ffffffff811b4082>] path_openat+0xc2/0x490
[<ffffffff811b584b>] do_filp_open+0x4b/0xb0
[<ffffffff811a33c3>] do_sys_open+0xf3/0x1f0
[<ffffffff811a34de>] SyS_open+0x1e/0x20
[<ffffffff81681249>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Other times it looks like this:

[<ffffffff811e19a3>] do_blockdev_direct_IO+0xbc3/0x2560
[<ffffffff811e3395>] __blockdev_direct_IO+0x55/0x60
[<ffffffff811ddc77>] blkdev_direct_IO+0x57/0x60
[<ffffffff81134913>] generic_file_aio_read+0x6d3/0x750
[<ffffffff811de0ec>] blkdev_aio_read+0x4c/0x70
[<ffffffff811a38bd>] do_sync_read+0x8d/0xd0
[<ffffffff811a401c>] vfs_read+0x9c/0x170
[<ffffffff811a4b6f>] SyS_read+0x7f/0xe0
[<ffffffff81681249>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Or this:

[<ffffffff812dba11>] disk_block_events+0x31/0x90
[<ffffffff811dee0e>] __blkdev_get+0x6e/0x4d0
[<ffffffff811df445>] blkdev_get+0x1d5/0x360
[<ffffffff811df67b>] blkdev_open+0x5b/0x80
[<ffffffff811a1cc7>] do_dentry_open+0x1a7/0x2e0
[<ffffffff811a1ef9>] vfs_open+0x39/0x70
[<ffffffff811b131d>] do_last+0x1ed/0x1270
[<ffffffff811b4082>] path_openat+0xc2/0x490
[<ffffffff811b584b>] do_filp_open+0x4b/0xb0
[<ffffffff811a33c3>] do_sys_open+0xf3/0x1f0
[<ffffffff811a34de>] SyS_open+0x1e/0x20
[<ffffffff81681249>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
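(In case it matters, the traces above were grabbed from /proc, roughly like so:)

    # dump the kernel stack of the hung LVM command
    pid=$(pidof vgs)
    cat /proc/$pid/stack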
Eventually we seem to get a kernel log like this:

[23448.266758] session41: session recovery timed out after 900 secs

followed by a number of logs like this:

[23448.266772] sd 48:0:0:0: rejecting I/O to offline device
[23448.272701] sd 48:0:0:0: rejecting I/O to offline device
[23448.278630] sd 48:0:0:0: rejecting I/O to offline device
[23448.284554] sd 48:0:0:0: rejecting I/O to offline device
[23448.290481] sd 48:0:0:0: rejecting I/O to offline device
And the hung processes stay stuck.
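As far as I can tell, the 900 secs corresponds to the initiator-side iSCSI session recovery timeout (node.session.timeo.replacement_timeout in /etc/iscsi/iscsid.conf, which defaults to 120, so it looks like something in our setup raised it). The per-session value can be checked with something like:

    # show session details, including the recovery timeout
    iscsiadm -m session -P 3 | grep -i 'recovery timeout'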
Are we doing something wrong here by just calling "targetctl clear" while there are actively-in-use targets? Or is this a bug in the CentOS kernel?
Thanks, Chris
On 06/14/2016 11:07 AM, Chris Friesen wrote:
> I've got a system that is acting as an OpenStack controller. As such, it's exporting LVM block devices over iSCSI using LIO, managed with the targetcli/targetctl tools. The system being tested is running CentOS 7.
> We have an active/standby configuration, and when we want to switch activity we need to shut down the iSCSI targets, move activity to the other node, and then bring them back up again. To shut down the iSCSI targets we save the configuration with "targetctl save" and then run "targetctl clear". We then go on to try to deactivate the LVM volumes.
Have you tried deactivating the LVM volumes first?
> The problem we're running into is that if there are any actively-in-use targets (the backing store for a running boot-from-volume OpenStack guest, for example), after we run "targetctl clear" any LVM-related command hangs. As an example, I ran "vgs" and it hung with the following stack trace:
Unless I'm missing something, it makes sense that once you clear the target configuration, the initiator can no longer issue commands to the LUNs (and LVM commands may read from the PVs, etc.). The most logical approach is therefore to deactivate/quiesce the use of the LUNs before you make them inaccessible.
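Roughly (just a sketch, with made-up IQN/portal/VG names), the order would be:

    # on any initiator nodes: log out so nothing is still using the LUNs
    iscsiadm -m node -T iqn.2016-06.example:volumes -p 192.168.1.10 -u

    # on the target node: now tear the targets down
    targetctl save
    targetctl clear

    # the LVs can then be deactivated without anything hanging
    vgchange -an cinder-volumes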
-- Andy