sanlock events - events not received after releasing and acquiring host id

Sunday, 10 August 2014

Hi David,

While testing host events using the python bindings, I found a corner case which
may cause trouble.

I have 2 hosts running vdsm, using the vdsm ids lockspace.

On host_id 1, I run this code:

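    import pprint
    import select

    import sanlock

    lockspace = 'add39817-ed89-4ef7-88cd-f6e4c5a11d1d'

    # Register for events on this lockspace; returns an fd to poll on.
    fd = sanlock.reg_event(lockspace)
    try:
        poll = select.poll()
        poll.register(fd, select.POLLIN)
        while True:
            # Block until the fd becomes readable, then fetch the events.
            poll.poll()
            events = sanlock.get_event(fd)
            for ev in events:
                print('received event')
                pprint.pprint(ev)
    finally:
        poll.unregister(fd)
        sanlock.end_event(fd, lockspace)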

On the other host, host_id 3, I run this:

    sanlock.set_event(lockspace, 1, 0, 1, flags=sanlock.SETEV_CUR_GENERATION)

When both hosts are up, events sent from host_id 3 to host_id 1 are received
correctly.

For example, I get this output:

    received event
    {'data': 0,
     'event': 1,
     'from_generation': 12,
     'from_host_id': 3,
     'generation': 21,
     'host_id': 0}

Trouble starts when I put host_id 1 into maintenance, which releases the host id
on this lockspace.

When the host is down, sanlock.get_hosts() on host_id 3 returns:

    sanlock.get_hosts('add39817-ed89-4ef7-88cd-f6e4c5a11d1d', 1)
    [{'generation': 21, 'host_id': 1, 'flags': 2, 'io_timeout': 10, 'timestamp': 0}]

The code registered for events on lockspace
'add39817-ed89-4ef7-88cd-f6e4c5a11d1d' does not receive events while the host
id is released. This makes sense, as the host is not part of the cluster.

But when I activate the host and it acquires the same host id again, on the
same lockspace, the code registered for this lockspace still does not receive
events.

Each time you put a host into maintenance and activate it, releasing the
host_id and acquiring it again, you can see in sanlock.log that sanlock creates
a new lockspace:

    2014-08-10 18:40:01+0300 842420 [16247]: s6 lockspace 67ebc639-f790-4b1b-b5d2-526b0f569ac4:1:/dev/67ebc639-f790-4b1b-b5d2-526b0f569ac4/ids:0
    2014-08-10 18:44:35+0300 842694 [16659]: s7 lockspace add39817-ed89-4ef7-88cd-f6e4c5a11d1d:1:/dev/add39817-ed89-4ef7-88cd-f6e4c5a11d1d/ids:0
    2014-08-10 18:44:35+0300 842694 [16247]: s8 lockspace 67ebc639-f790-4b1b-b5d2-526b0f569ac4:1:/dev/67ebc639-f790-4b1b-b5d2-526b0f569ac4/ids:0
    2014-08-10 18:56:16+0300 843394 [16247]: s9 lockspace add39817-ed89-4ef7-88cd-f6e4c5a11d1d:1:/dev/add39817-ed89-4ef7-88cd-f6e4c5a11d1d/ids:0
    2014-08-10 18:56:16+0300 843394 [5023]: s10 lockspace 67ebc639-f790-4b1b-b5d2-526b0f569ac4:1:/dev/67ebc639-f790-4b1b-b5d2-526b0f569ac4/ids:0

To continue getting events, I must unregister the event fd, and register a new one
on the same lockspace.
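
The workaround can be sketched as a small helper. This is only a sketch:
reregister is a hypothetical name, and the register and unregister callables
are passed in as parameters, so the helper assumes nothing about sanlock beyond
the reg_event(lockspace) and end_event(fd, lockspace) calls already used in the
listener code above.

```python
def reregister(lockspace, old_fd, reg_event, end_event):
    """Replace a stale event fd with a fresh one on the same lockspace.

    reg_event and end_event are assumed to behave like
    sanlock.reg_event(lockspace) and sanlock.end_event(fd, lockspace).
    """
    # Release the old registration first so its fd is not left dangling.
    end_event(old_fd, lockspace)
    # Register again; the new fd belongs to the current lockspace instance.
    return reg_event(lockspace)
```

In the listener it would be called as
fd = reregister(lockspace, fd, sanlock.reg_event, sanlock.end_event) once the
host id has been acquired again. The poll object must also unregister the old
fd and register the new one. The listener still has no way to learn from the fd
itself that it has gone stale, so something else has to trigger the call.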

It seems that the listening fd is tied to the lockspace number, e.g. "s8", and
not to the lockspace name "67ebc639-f790-4b1b-b5d2-526b0f569ac4". So when "s8"
is replaced by "s10", the listener is broken silently.
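
To make this guess concrete, here is a toy model of the suspected behavior. It
is purely illustrative and is not sanlock's implementation: ToyDaemon and all
of its methods and attributes are made up for this example, and events are
delivered by a direct callback instead of an fd.

```python
class ToyDaemon(object):
    """Illustrative model only; not sanlock's real data structures."""

    def __init__(self):
        self.next_id = 1
        self.by_name = {}     # lockspace name -> current lockspace number
        self.listeners = {}   # lockspace number -> list of callbacks

    def add_lockspace(self, name):
        # Every acquire creates a new numbered lockspace ("s1", "s2", ...).
        number = "s%d" % self.next_id
        self.next_id += 1
        self.by_name[name] = number
        return number

    def rem_lockspace(self, name):
        # Listeners of the old number are left behind and never notified.
        self.by_name.pop(name)

    def reg_event(self, name, callback):
        # The listener is attached to the number, not to the name.
        self.listeners.setdefault(self.by_name[name], []).append(callback)

    def set_event(self, name, event):
        # Deliver only to listeners of the current number; return the count.
        callbacks = self.listeners.get(self.by_name[name], [])
        for callback in callbacks:
            callback(event)
        return len(callbacks)
```

In this model, a callback registered while the lockspace is "s1" receives
nothing once the same name has been removed and added again as "s2", and it
gets no error either. Keying the listeners by name, or dropping them with an
error in rem_lockspace, would correspond to the two options I suggest at the
end of this mail.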

A few questions:

1. Why does listening for events stop working after releasing the host id and
   acquiring it again on the same lockspace?

2. Is this the right behavior?

3. Assuming this is the right behavior, why does the listener not fail? I would
   expect the event fds to be closed or to fail, so the listener can release
   its resources.
   For example, when I restart sanlock, the listener fails with:

       sanlock.SanlockException: (1, 'Unable to get events', 'Operation not permitted')

   So I would expect a similar failure in this case.

4. Do we have an event fd leak until the listener calls end_event() on the fd?

I think that we should have one of these:

- Listening for events keeps working after the host id is released and
  acquired again
- Listening for events fails when the host id is released

Nir
