----- Original Message -----
From: "David Teigland" <teigland(a)redhat.com>
To: "Nir Soffer" <nsoffer(a)redhat.com>
Cc: "Allon Mureinik" <amureini(a)redhat.com>, "Ayal Baron"
<abaron(a)redhat.com>, sanlock-devel(a)lists.fedorahosted.org,
fsimonce(a)redhat.com, smizrahi(a)redhat.com, "Barak Azulay"
<bazulay(a)redhat.com>, "Eli Mesika" <emesika(a)redhat.com>
Sent: Thursday, March 6, 2014 12:32:42 AM
Subject: Re: [PATCH] sanlock: host_message
I visited with Nir and one of the big problems with my initial
host_message design was the lack of acknowledgements, which I'd
been strongly resisting.
I've come up with a new design that could be a workable way of doing host
messages with acknowledgements. I don't like it very much, but will give
it a try.
There are three 64 bit fields in the delta lease leader record that we can
use as follows:
field 1:
uint32_t send_to_host_id; /* message destination */
Do we need 32 bits for that? Isn't this a number from 1 to 2000 (11 bits)?
uint32_t send_to_host_generation; /* message destination */
Is this a 32 bit value?
field 2:
uint32_t send_msg; /* the caller-specified message */
Do we really need 4G different messages?
uint32_t send_seq; /* internal sequence number */
Do we really need a 32 bit counter?
field 3:
uint32_t recv_from_host_id; /* acknowledgement: message source */
uint32_t recv_seq; /* acknowledgement: send_seq */
Why send this message in a different field?
host_id and host_generation are 64 bit values everywhere else in sanlock,
and this shortens them to 32 bits to fit them into the available space.
Realistically, they should always fit in 32 bits, but it's ugly.
Can be cleaned by converting these everywhere to 32 bit values :-)
This also removes the 32 bit "data" field that could previously be sent
along with the 32 bit msg number. 160 bits of overhead for a 32 bit
message is a little sad.
How about this:
Message format - 64bit value:
field bits
-----------------
host_id 12
generation 32
send_msg 8
send_seq 12
Acknowledge is just another message, so we send in the same field.
RESET = 0x01
ACK = 0xFF
The sender detects the ack by getting an ACK message in the lease of the receiver
with the sender host_id and generation and the same send_seq as the sent message.
field2 and field3 - use for sending messages to multiple hosts, or keep as
reserved for future extension.
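To make the proposed layout concrete, here is a minimal sketch of packing and unpacking the 64 bit value. The bit order (host_id in the top bits, send_seq in the bottom bits) and all of the hm_* names are assumptions for illustration only; nothing here is existing sanlock code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, high bits to low bits:
 * host_id (12) | generation (32) | send_msg (8) | send_seq (12) */
#define HM_HOST_BITS 12
#define HM_GEN_BITS  32
#define HM_MSG_BITS   8
#define HM_SEQ_BITS  12

#define HM_MSG_RESET 0x01
#define HM_MSG_ACK   0xFF

/* Hypothetical unpacked form of one message. */
struct hm_fields {
    uint16_t host_id;    /* only 12 bits are used */
    uint32_t generation;
    uint8_t  send_msg;
    uint16_t send_seq;   /* only 12 bits are used, so it wraps at 4096 */
};

static uint64_t hm_pack(const struct hm_fields *f)
{
    /* Values that do not fit in their bit width are a caller bug. */
    assert(f->host_id < (1u << HM_HOST_BITS));
    assert(f->send_seq < (1u << HM_SEQ_BITS));

    return ((uint64_t)f->host_id << (HM_GEN_BITS + HM_MSG_BITS + HM_SEQ_BITS)) |
           ((uint64_t)f->generation << (HM_MSG_BITS + HM_SEQ_BITS)) |
           ((uint64_t)f->send_msg << HM_SEQ_BITS) |
           (uint64_t)f->send_seq;
}

static struct hm_fields hm_unpack(uint64_t v)
{
    struct hm_fields f;

    f.send_seq = (uint16_t)(v & ((1u << HM_SEQ_BITS) - 1));
    f.send_msg = (uint8_t)((v >> HM_SEQ_BITS) & 0xFF);
    f.generation = (uint32_t)((v >> (HM_MSG_BITS + HM_SEQ_BITS)) & 0xFFFFFFFFu);
    f.host_id = (uint16_t)((v >> (HM_GEN_BITS + HM_MSG_BITS + HM_SEQ_BITS)) &
                           ((1u << HM_HOST_BITS) - 1));
    return f;
}

/* Sender side check: does the value found in the receiver's lease
 * acknowledge the message we sent with sent_seq? */
static int hm_is_ack_for(uint64_t receiver_value, uint16_t my_host_id,
                         uint32_t my_generation, uint16_t sent_seq)
{
    struct hm_fields f = hm_unpack(receiver_value);

    return f.send_msg == HM_MSG_ACK &&
           f.host_id == my_host_id &&
           f.generation == my_generation &&
           f.send_seq == sent_seq;
}
```

With a 12 bit send_seq the counter wraps after 4096 messages, which should be acceptable as long as a sender never has that many unacknowledged messages outstanding.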
Sending a message
-----------------
int sanlock_host_message(const char *ls_name, uint32_t flags,
int hm_size, struct sanlk_host_message *hm,
uint32_t *send_seq);
struct sanlk_host_message {
uint64_t host_id;
uint64_t generation;
uint32_t send_msg;
};
The sending host sets the following in its delta_lease:
field 1:
send_to_host_id = hm.host_id & 0xFFFFFFFF;
send_to_host_gen = hm.generation & 0xFFFFFFFF;
field 2:
send_msg = hm.send_msg;
send_seq = local_msg_seq++;
send_seq is returned to the caller for matching an acknowledgement.
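A rough sketch of the sending step in C, under some assumptions: struct hm_leader_fields and hm_prepare_send() are made-up names standing in for the three 64 bit leader record fields and the daemon-side code, and the sketch rejects a host_id or generation that does not fit in 32 bits instead of masking it with 0xFFFFFFFF as described above. That is a deliberate variation, so that a truncated value can never address the wrong host.

```c
#include <stdint.h>

/* Hypothetical view of the three 64 bit leader record fields. */
struct hm_leader_fields {
    uint32_t send_to_host_id;         /* field 1 */
    uint32_t send_to_host_generation; /* field 1 */
    uint32_t send_msg;                /* field 2 */
    uint32_t send_seq;                /* field 2 */
    uint32_t recv_from_host_id;       /* field 3 */
    uint32_t recv_seq;                /* field 3 */
};

/* Fill in fields 1 and 2 for the next delta lease renewal.
 * Returns 0 and stores the sequence number used in *send_seq,
 * or -1 if host_id or generation does not fit in 32 bits. */
static int hm_prepare_send(struct hm_leader_fields *lf, uint32_t *local_msg_seq,
                           uint64_t host_id, uint64_t generation,
                           uint32_t msg, uint32_t *send_seq)
{
    if (host_id > UINT32_MAX || generation > UINT32_MAX)
        return -1;

    lf->send_to_host_id = (uint32_t)host_id;
    lf->send_to_host_generation = (uint32_t)generation;
    lf->send_msg = msg;
    lf->send_seq = (*local_msg_seq)++;

    /* The caller keeps this value to match the acknowledgement later. */
    *send_seq = lf->send_seq;
    return 0;
}
```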
Receiving a message
-------------------
The receiving host sees its own host_id/generation in the sending host's
lease, processes send_msg, and saves host_id/seq in a list of messages to
be acknowledged.
At the next delta lease renewal, it takes the next host_id/seq from its
list and sets:
field 3:
recv_from_host_id = the host_id that sent the message;
recv_seq = the seq number that accompanied the message;
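The list of messages to be acknowledged could be as simple as a small FIFO that gives up one entry per renewal. The following is only a sketch: the queue length, the hm_* names and the fixed-size ring buffer are all assumptions, not part of sanlock.

```c
#include <stdint.h>

#define HM_ACK_QUEUE_LEN 16 /* arbitrary size chosen for the sketch */

struct hm_pending_ack {
    uint32_t from_host_id;
    uint32_t seq;
};

/* FIFO of acknowledgements waiting to be written, oldest first. */
struct hm_ack_queue {
    struct hm_pending_ack slots[HM_ACK_QUEUE_LEN];
    unsigned int head;  /* index of the oldest entry */
    unsigned int count; /* number of entries in use */
};

/* Called when a message addressed to this host has been processed.
 * Returns 0, or -1 if the queue is full and the ack is dropped. */
static int hm_ack_enqueue(struct hm_ack_queue *q, uint32_t from_host_id,
                          uint32_t seq)
{
    unsigned int tail;

    if (q->count == HM_ACK_QUEUE_LEN)
        return -1;

    tail = (q->head + q->count) % HM_ACK_QUEUE_LEN;
    q->slots[tail].from_host_id = from_host_id;
    q->slots[tail].seq = seq;
    q->count++;
    return 0;
}

/* Called once per delta lease renewal. Returns 1 and fills in the
 * field 3 values if an ack is pending, or 0 if there is nothing to ack. */
static int hm_ack_next(struct hm_ack_queue *q, uint32_t *recv_from_host_id,
                       uint32_t *recv_seq)
{
    if (q->count == 0)
        return 0;

    *recv_from_host_id = q->slots[q->head].from_host_id;
    *recv_seq = q->slots[q->head].seq;
    q->head = (q->head + 1) % HM_ACK_QUEUE_LEN;
    q->count--;
    return 1;
}
```

Because only one entry is written per renewal, N pending acknowledgements take N renewals to drain, which is the delay discussed in the problems listed further down.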
Receiving acknowledgement
-------------------------
sanlock will not keep any state about the host messages it has sent or try
to match acknowledgements. But, sanlock does keep track of other hosts'
delta lease state, and that could include recv_from_host_id/recv_seq. We
can add an api for the caller to query the recv_from_host_id/recv_seq for
a given host_id.
This means that the client has to remember the sequence number of each sent
message, so implementing a simple fence agent script will be impossible. You
will have to create another process that runs from the start of fencing,
remembers the sent message sequence, and polls the sanlock daemon for the result.
If sanlock does remember sent messages and check for acks, it will be easier
to use it from other tools.
In the caller, sanlock_host_message() returned the send_seq value that was
used for the message. After this, the caller would query sanlock for the
recv_seq until it matched send_seq (or until it wants to give up.)
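The caller's side of this could look roughly like the loop below. The query api does not exist yet, so it is represented by a callback type (hm_query_fn); that type, hm_wait_for_ack() and the fake query used for demonstration are all hypothetical names, and the sleep between polls is left out of the sketch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the proposed query api: reports the
 * recv_from_host_id/recv_seq last seen in host_id's delta lease.
 * Returns 0 on success or a negative value on error. */
typedef int (*hm_query_fn)(void *ctx, uint64_t host_id,
                           uint32_t *recv_from_host_id, uint32_t *recv_seq);

enum { HM_ACKED = 0, HM_TIMEOUT = 1 };

/* Poll until dest_host_id acknowledges send_seq from my_host_id, or
 * give up after max_polls queries. Query errors are passed through. */
static int hm_wait_for_ack(hm_query_fn query, void *ctx,
                           uint64_t dest_host_id, uint32_t my_host_id,
                           uint32_t send_seq, int max_polls)
{
    int i;

    for (i = 0; i < max_polls; i++) {
        uint32_t from = 0, seq = 0;
        int rv = query(ctx, dest_host_id, &from, &seq);

        if (rv < 0)
            return rv;
        if (from == my_host_id && seq == send_seq)
            return HM_ACKED;
        /* A real caller would sleep here, for less than one renewal
         * interval, since each ack is visible for one renewal only. */
    }
    return HM_TIMEOUT;
}

/* Fake query for demonstration: shows the ack after a number of polls. */
struct hm_fake_host {
    int polls_before_ack;
    uint32_t from;
    uint32_t seq;
};

static int hm_fake_query(void *ctx, uint64_t host_id,
                         uint32_t *recv_from_host_id, uint32_t *recv_seq)
{
    struct hm_fake_host *h = ctx;

    (void)host_id;
    if (h->polls_before_ack > 0) {
        h->polls_before_ack--;
        *recv_from_host_id = 0;
        *recv_seq = 0;
        return 0;
    }
    *recv_from_host_id = h->from;
    *recv_seq = h->seq;
    return 0;
}
```

Something has to stay alive across the whole loop to hold send_seq, whether that is the caller as sketched here or the sanlock daemon itself.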
Problems with the acknowledgement scheme:
1. It will not work with a fast reset option using /proc/sysrq-trigger
because there will not be enough time for the acknowledgement to be
written before the host is reset. (With another independent message
area, we could write an acknowledgement immediately, but borrowing the
lockspace lease means we do not have this option.)
You can do a fast reset after the write to storage has finished, assuming
that the write is not asynchronous.
2. If multiple hosts send messages to a single destination at once, the
destination host will need to acknowledge them one at a time in
consecutive renewals. It takes longer to get an ack, and each ack would
be visible for only one renewal and could be missed.
I don't see a problem here for the fencing use case.
Nir