Re: plans for user-backed device support
by Andy Grover
On 02/04/2015 12:21 PM, Alex Elsayed wrote:
>> Maybe we step back and define a DBus interface
>
> I'd be quite happy with that - I considered suggesting it, in fact, but
> wasn't sure of the prevailing opinion re: dbus around here.
>
> What would you do re: discovery, though?
>
> (Explaining some DBus things, which readers may already be aware of)
>
> In DBus, there's a two-level hierarchy of busnames holding objects.
> Busnames are either the inherent, connection-level one of the :\d+\.\d+ form
> or the human-readable well-known name form (reverse DNS). Objects, then,
> implement interfaces.
>
> However, well-known names can only have a single owner - so discovering
> which busnames have objects which implement an interface is non-trivial.
>
> The approach taken by KDE is to suffix the well-known name with the PID
> (org.kde.StatusNotifierItem-2055 or whatever), call ListNames, and filter in
> the client. This has the drawback of making DBus activation impossible.
>
> Another approach is for every implementor to try to claim the well-known
> name, and on failure contact the existing owner to republish their objects
> (possibly under a namespaced object path). This has the drawback of
> complicating the implementation somewhat, as well as making bus activation
> only able to activate a single 'default' implementation.
>
> A third approach would be to explicitly define a multiplexor, which backends
> ask to republish their objects. This simplifies implementations, and it
> could also provide its own API that requests a backend by name, and ensures
> that backend's object is available. This could be driven by something as
> simple as a key-value mapping from backend name to a well-known DBus name
> specific to that backend, which the multiplexor calls to trigger service
> activation.
>
> Thoughts?
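(As an aside, the client-side filtering that the PID-suffix approach forces
on consumers could be sketched like this; the names are illustrative, and the
actual org.freedesktop.DBus.ListNames call is elided:)

```python
import re

# Hypothetical ListNames() result: unique connection names (":N.M") plus
# well-known names, some PID-suffixed in the KDE StatusNotifierItem style.
bus_names = [
    ":1.7",
    ":1.42",
    "org.freedesktop.DBus",
    "org.kde.StatusNotifierItem-2055",
    "org.kde.StatusNotifierItem-3190",
]

def find_implementors(names, base):
    """Filter a ListNames() result for '<base>-<pid>' well-known names.

    This is the per-client work the PID-suffix scheme requires, and note
    that nothing here can trigger D-Bus activation of a missing backend --
    which is exactly the drawback described above.
    """
    pattern = re.compile(re.escape(base) + r"-\d+$")
    return [n for n in names if pattern.match(n)]

implementors = find_implementors(bus_names, "org.kde.StatusNotifierItem")
```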
It really seems to come down to: will multiple independent user-handler
daemons be needed? Because I'm trying really hard to make tcmu-runner
good enough so that the answer is no :-)
tcmu-runner supports multiple handler modules, so it's extensible. It is
permissively licensed, so there are no issues with non-FOSS handlers needing
their own daemon. It could also be replaced entirely (either with a modified
version of itself or from scratch) without giving up the single-busname,
service-activation approach.
So that would be my current preference. (The fact that the kernel API
doesn't preclude multiple handler daemons does not mean we need to
*support* those right away, or ever.)
If there are likely use cases that tcmu-runner is unsuitable for solving by
itself, then of course that would change things, so please let's talk
about them!
-- Andy
LIO crashing Fedora box, multiple versions and kernels tested
by Dan Lane
I have now built out several servers in an attempt to use LIO for my
lab, but all recent attempts have met failure. I used a very similar
setup in the past with a different lab so I know this all works, but
that build used Ubuntu and an older kernel (Ubuntu 11.04). Here is a
brief description of my test environment:
Storage servers: IBM bladecenter using HS21 (8853) blades
Disk backend: 10k SAS JBOD, RAID 1, 5 and 6 attempted
Disk controllers: On-board LSI and Serveraid 8k (adaptec) tested
Storage fabric: QLA2462 equivalent (IBM branded) 4Gb FC cards (PCI-X based)
OS: Fedora 19, 20 and 21 using multiple kernels, the latest being kernel 3.19.3
I'm able to get things up and running, but as soon as the ESXi hosts
start using the shared storage, the server blade crashes (it actually
seems to lock up). I've also tried multiple versions of ESXi, just in
case. To ensure the problem wasn't with the server I ran multiple
loops of bonnie++, used Linux "stress", and ran memtest+ on the server
for over 24 hours, with no failures. I found reports of similar
problems in the past and it was suggested that the problem was with
the performance of the backend storage. As unlikely as I expect this
to be with my storage, here are the results of bonnie++ on my RAID 6
array, hopefully one of you are better at interpreting them than I am.
[root@labsan1 bonnie]# bonnie++ -d /bonnie -r 4096 -s 32G -n 0 -m TEST
-f -b -n 128 -u root
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
TEST 32G 336366 47 80745 12 352922 28 542.4 8
Latency 348ms 297ms 42853us 224ms
Version 1.96 ------Sequential Create------ --------Random Create--------
TEST -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
128 1786 9 +++++ +++ 1033 5 1769 9 +++++ +++ 984 5
Latency 286ms 478us 448ms 282ms 32us 597ms
I've seen quite a few different errors; the following are some of the
most common. Please let me know what I should do to collect the best
possible logs for troubleshooting the problem.
--------------------------------SNIP--------------------------------
Message from syslogd@labsan1 at Feb 5 23:49:01 ...
kernel:[17184.127000] NMI watchdog: BUG: soft lockup - CPU#3 stuck
for 22s! [systemd-udevd:1602]
--------------------------------/SNIP--------------------------------
--------------------------------SNIP--------------------------------
Apr 3 00:24:34 labsan1 kernel: rport-5:0-9: blocked FC remote port
time out: removing rport
Apr 3 00:24:34 labsan1 kernel: rport-4:0-9: blocked FC remote port
time out: removing rport
Apr 3 00:24:34 labsan1 kernel: rport-6:0-9: blocked FC remote port
time out: removing rport
Apr 3 00:24:34 labsan1 kernel: rport-3:0-9: blocked FC remote port
time out: removing rport
Apr 3 00:26:40 labsan1 kernel: [41095.711611] MODE SENSE:
unimplemented page/subpage: 0x1c/0x02
Apr 3 00:26:40 labsan1 kernel: MODE SENSE: unimplemented
page/subpage: 0x1c/0x02
Apr 3 00:34:03 labsan1 kernel: [41538.771168] Detected MISCOMPARE for
addr: ffff8800caeb7000 buf: ffff8800c9896e00
Apr 3 00:34:03 labsan1 kernel: [41538.771173] Target/iblock: Send
MISCOMPARE check condition and sense
Apr 3 00:34:03 labsan1 kernel: Detected MISCOMPARE for addr:
ffff8800caeb7000 buf: ffff8800c9896e00
Apr 3 00:34:03 labsan1 kernel: Target/iblock: Send MISCOMPARE check
condition and sense
Apr 3 00:34:48 labsan1 kernel: [41584.159441] ABORT_TASK: Found
referenced qla2xxx task_tag: 1170576
Apr 3 00:34:48 labsan1 kernel: [41584.159446] ABORT_TASK: ref_tag:
1170576 already complete, skipping
Apr 3 00:34:48 labsan1 kernel: [41584.159448] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1170576
Apr 3 00:34:48 labsan1 kernel: ABORT_TASK: Found referenced qla2xxx
task_tag: 1170576
Apr 3 00:34:48 labsan1 kernel: ABORT_TASK: ref_tag: 1170576 already
complete, skipping
Apr 3 00:34:48 labsan1 kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1170576
Apr 3 00:34:49 labsan1 kernel: [41585.156976] ABORT_TASK: Found
referenced qla2xxx task_tag: 1171236
Apr 3 00:34:49 labsan1 kernel: ABORT_TASK: Found referenced qla2xxx
task_tag: 1171236
Apr 3 00:34:50 labsan1 kernel: [41586.351683] ABORT_TASK: Sending
TMR_FUNCTION_COMPLETE for ref_tag: 1171236
Apr 3 00:34:50 labsan1 kernel: [41586.351691] ABORT_TASK: Found
referenced qla2xxx task_tag: 1172644
Apr 3 00:34:50 labsan1 kernel: ABORT_TASK: Sending
TMR_FUNCTION_COMPLETE for ref_tag: 1171236
Apr 3 00:34:50 labsan1 kernel: ABORT_TASK: Found referenced qla2xxx
task_tag: 1172644
Apr 3 00:34:50 labsan1 kernel: [41586.472423] ABORT_TASK: Sending
TMR_FUNCTION_COMPLETE for ref_tag: 1172644
Apr 3 00:34:50 labsan1 kernel: [41586.472432] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1147388
Apr 3 00:34:50 labsan1 kernel: [41586.472438] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1171368
Apr 3 00:34:50 labsan1 kernel: [41586.472441] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1171456
Apr 3 00:34:50 labsan1 kernel: [41586.472445] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1171500
--------------------------------/SNIP--------------------------------
--------------------------------SNIP--------------------------------
Apr 3 02:04:40 labsan1 kernel: [ 2282.079013] rport-6:0-5: blocked
FC remote port time out: no longer a FCP target, removing starget
Apr 3 02:04:40 labsan1 kernel: rport-6:0-5: blocked FC remote port
time out: no longer a FCP target, removing starget
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] ------------[ cut here
]------------
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] WARNING: CPU: 3 PID: 0
at kernel/watchdog.c:317 watchdog_overflow_callback+0x92/0xc0()
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] Watchdog detected hard
LOCKUP on cpu 3
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] Modules linked in:
tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file
target_core_iblock iscsi_target_mod target_core_mod ip6t_rpfilter
ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute
bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security i
p6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip
table_security iptable_raw coretemp kvm_intel kvm iTCO_wdt
iTCO_vendor_support gpio_ich ipmi_ssif ipmi_devintf lpc_ich mfd_core
i5000
_edac serio_raw edac_core ses ioatdma enclosure i5k_amb shpchp dca
ipmi_si acpi_cpufreq ipmi_msghandler nfsd auth_rpcgss nfs_acl lock
d grace sunrpc radeon i2c_algo_bit drm_kms_helper ttm drm qla2xxx bnx2
ata_generic pata_acpi scsi_transport_fc aacraid
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] CPU: 3 PID: 0 Comm:
swapper/3 Not tainted 3.19.3-200.fc21.x86_64 #1
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] Hardware name: IBM IBM
eServer BladeCenter HS21 -[8853L6U]-/Server Blade, BIOS -[BCE14
8BUS-1.21]- 04/04/2011
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] 0000000000000000
ef4cea677ede47e6 ffff88012fd85a60 ffffffff8176e215
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] 0000000000000000
ffff88012fd85ab8 ffff88012fd85aa0 ffffffff8109bc1a
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] 0000000000000000
ffff88012a510000 0000000000000000 ffff88012fd85c00
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] Call Trace:
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] <NMI>
[<ffffffff8176e215>] dump_stack+0x45/0x57
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8109bc1a>]
warn_slowpath_common+0x8a/0xc0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8109bca5>]
warn_slowpath_fmt+0x55/0x70
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81150ad2>]
watchdog_overflow_callback+0x92/0xc0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8119191b>]
__perf_event_overflow+0x9b/0x250
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81192434>]
perf_event_overflow+0x14/0x20
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8103460a>]
intel_pmu_handle_irq+0x1da/0x3f0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8102bafb>]
perf_event_nmi_handler+0x2b/0x50
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81018fd8>]
nmi_handle+0x88/0x130
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81019562>]
default_do_nmi+0x42/0x110
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810196b8>]
do_nmi+0x88/0xd0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81776d21>]
end_repeat_nmi+0x1e/0x2e
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8177448a>] ?
_raw_spin_lock_irqsave+0x4a/0x60
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8177448a>] ?
_raw_spin_lock_irqsave+0x4a/0x60
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8177448a>] ?
_raw_spin_lock_irqsave+0x4a/0x60
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] <<EOE>> <IRQ>
[<ffffffffa00eade2>] qlt_fc_port_deleted+0x62/0xd0 [qla2xxx]
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffffa008cc13>]
qla2x00_mark_device_lost+0x153/0x2e0 [qla2xxx]
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffffa00ac8a9>]
qla2x00_async_event+0xe39/0x1890 [qla2xxx]
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810cb828>] ?
sched_clock_cpu+0x88/0xb0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810c04ca>] ?
update_rq_clock.part.78+0x1a/0xe0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810cf9c3>] ?
update_blocked_averages+0x2f3/0x7a0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffffa00ade41>]
qla24xx_intr_handler+0x1a1/0x2f0 [qla2xxx]
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810f2a47>]
handle_irq_event_percpu+0x77/0x1a0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810f2bab>]
handle_irq_event+0x3b/0x60
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8101e8ca>] ?
native_sched_clock+0x2a/0x90
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810f5a99>]
handle_fasteoi_irq+0x79/0x120
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81017414>]
handle_irq+0x74/0x140
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810bb54a>] ?
atomic_notifier_call_chain+0x1a/0x20
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8177777f>]
do_IRQ+0x4f/0xf0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8177556d>]
common_interrupt+0x6d/0x6d
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] <EOI>
[<ffffffff81103c98>] ? hrtimer_start+0x18/0x20
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8105ea56>] ?
native_safe_halt+0x6/0x10
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810fabb3>] ?
rcu_eqs_enter+0xa3/0xb0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8101f97e>]
default_idle+0x1e/0xc0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8102034f>]
arch_cpu_idle+0xf/0x20
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810de15a>]
cpu_startup_entry+0x37a/0x3c0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8104af3a>]
start_secondary+0x1aa/0x1f0
Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] ---[ end trace
05113eaf8d7e2e9d ]---
Apr 3 02:05:10 labsan1 kernel: [ 2312.441004] INFO: rcu_sched
detected stalls on CPUs/tasks: { 3} (detected by 0, t=60003 jiffies,
g=3933, c=3932, q=0)
Apr 3 02:05:10 labsan1 kernel: [ 2312.441207] Task dump for CPU 3:
Apr 3 02:05:10 labsan1 kernel: [ 2312.441211] swapper/3 R
running task 0 0 1 0x00000008
Apr 3 02:05:10 labsan1 kernel: [ 2312.441215] 0000000000000000
0000000000000000 ffffffffffffff1e ffffffff8105ea56
Apr 3 02:05:10 labsan1 kernel: [ 2312.441219] 0000000000000010
0000000000000246 ffff88012abf7e88 0000000000000018
Apr 3 02:05:10 labsan1 kernel: [ 2312.441222] ffffffff810fabb3
ffff88012abf7ea8 ffffffff8101f97e ffffffff81d2a6c0
Apr 3 02:05:10 labsan1 kernel: [ 2312.441226] Call Trace:
Apr 3 02:05:10 labsan1 kernel: [ 2312.441237] [<ffffffff8105ea56>] ?
native_safe_halt+0x6/0x10
Apr 3 02:05:10 labsan1 kernel: [ 2312.441242] [<ffffffff810fabb3>] ?
rcu_eqs_enter+0xa3/0xb0
Apr 3 02:05:10 labsan1 kernel: [ 2312.441247] [<ffffffff8101f97e>] ?
default_idle+0x1e/0xc0
Apr 3 02:05:10 labsan1 kernel: [ 2312.441251] [<ffffffff8102034f>] ?
arch_cpu_idle+0xf/0x20
Apr 3 02:05:10 labsan1 kernel: [ 2312.441254] [<ffffffff810de15a>] ?
cpu_startup_entry+0x37a/0x3c0
Apr 3 02:05:10 labsan1 kernel: [ 2312.441259] [<ffffffff8104af3a>] ?
start_secondary+0x1aa/0x1f0
Apr 3 02:05:10 labsan1 kernel: ------------[ cut here ]------------
Apr 3 02:05:10 labsan1 kernel: WARNING: CPU: 3 PID: 0 at
kernel/watchdog.c:317 watchdog_overflow_callback+0x92/0xc0()
Apr 3 02:05:10 labsan1 kernel: Watchdog detected hard LOCKUP on cpu 3
Apr 3 02:05:10 labsan1 kernel: Modules linked in:
Apr 3 02:05:16 labsan1 kernel: [ 2318.111331] ABORT_TASK: Found
referenced qla2xxx task_tag: 1192752
Apr 3 02:05:30 labsan1 kernel: [ 2332.303075] ABORT_TASK: Sending
TMR_FUNCTION_COMPLETE for ref_tag: 1192752
Apr 3 02:05:30 labsan1 kernel: [ 2332.303084] ABORT_TASK: Found
referenced qla2xxx task_tag: 1199704
--------------------------------/SNIP--------------------------------
Thanks,
Dan
New releases: targetcli.fb40 and rtslib.fb53
by Andy Grover
Hi all,
These releases include some small usability tweaks: the targetcli man page
now makes it a little clearer that the user will likely have to enable a
service to get settings restored on reboot, and the targetctl script in
rtslib-fb no longer returns a failure exit code when only recoverable errors
were encountered. In some cases, setting an attribute returned an error and
led to the service being marked as failed by systemd, which is stricter
than we want.
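(The pattern here, sketched hypothetically rather than quoting the actual
rtslib-fb code, is to collect recoverable failures as warnings during restore
and reserve a nonzero exit code for fatal ones; the function and error names
below are illustrative:)

```python
# Hedged sketch of the restore-error policy described above; not the
# real rtslib-fb API, just the shape of the fix.
def restore_config(items, apply_attr):
    """Apply each (name, value) attribute; collect recoverable failures
    instead of aborting, report them, and still exit successfully."""
    errors = []
    for name, value in items:
        try:
            apply_attr(name, value)
        except ValueError as e:  # treat a bad attribute as recoverable
            errors.append("%s: %s" % (name, e))
    # Return 0 even with recoverable errors, so systemd does not mark
    # the service failed; just surface them as warnings.
    for msg in errors:
        print("warning:", msg)
    return 0

# Usage: a setter that rejects one (made-up) unknown attribute.
def setter(name, value):
    if name == "fabric_max_sectors":
        raise ValueError("cannot find attribute")

rc = restore_config([("emulate_tpu", 1), ("fabric_max_sectors", 8192)], setter)
```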
Github:
https://github.com/agrover/targetcli-fb
https://github.com/agrover/rtslib-fb
https://github.com/agrover/configshell-fb
tarballs:
https://fedorahosted.org/released/targetcli-fb/
targetcli-fb 2.1.fb40:
Andy Grover <agrover(a)redhat.com> (4):
self.cfs_cwd is unused
Don't need to set ui_so.name = so.name in refresh()
Add enabling service to man page quickstart section
update to 2.1.fb40
rtslib-fb 2.1.fb53:
Andy Grover <agrover(a)redhat.com> (4):
Add properties to RTSRoot that return new Group objects
Remove trailing whitespace
Do not fail if there were recoverable errors on restore
update version to 2.1.fb53
Regards -- Andy
Re: target refuses to start due to sector size after update
by Andy Grover
On 04/03/2015 03:28 PM, Dan Lane wrote:
> That took care of half the problem, but it's still failing because of:
>
> block/data: Cannot find attribute: fabric_max_sectors, skipped
>
> I tried adding it but either I'm putting it in the wrong location or
> my syntax is wrong, please advise.
Hmm? I think this means saveconfig.json contains fabric_max_sectors, but
it doesn't exist as an attribute for LIO. So I'd think you'd want to
remove it from saveconfig.json, not add it.
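In case it helps, here's a rough sketch of stripping the attribute from
saveconfig.json with Python. I haven't checked your exact file layout, so
this just deletes the key wherever it appears under an "attributes" dict:

```python
import json

def strip_attribute(node, attr):
    """Recursively delete `attr` from any 'attributes' dict in the
    parsed saveconfig.json tree (layout assumed, not verified)."""
    if isinstance(node, dict):
        attrs = node.get("attributes")
        if isinstance(attrs, dict):
            attrs.pop(attr, None)
        for value in node.values():
            strip_attribute(value, attr)
    elif isinstance(node, list):
        for item in node:
            strip_attribute(item, attr)

# Example against a made-up fragment shaped like a storage object entry:
config = {"storage_objects": [{"name": "data",
                               "attributes": {"fabric_max_sectors": 8192,
                                              "emulate_tpu": 0}}]}
strip_attribute(config, "fabric_max_sectors")
print(json.dumps(config))
```

(Against the real file you'd json.load() it, run the strip, and json.dump()
it back, keeping a copy of the original first.)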
BTW, since this is a -fb-specific issue, we can spare everyone else and
use the targetcli-fb list (CC'd) if further discussion is needed.
-- Andy