On 09/14/2011 02:34 AM, Anand Avati wrote:
> There is a situation where, in the middle of Step 2, half the servers
> have completed the write (the other half have not yet processed it)
> and there is a power outage of the entire data center, including the
> client. If the writes happened to be overwrites which do not extend
> the file size, they will go unnoticed and never get healed.
Indeed, Joe raised the same issue. The funny thing is that the state
we're in here is essentially the same as we get in the async
long-distance replication case. I have a pretty complete design for
handling these scenarios there, but it's necessarily complex and I was
hoping to avoid some of that complexity in the local case. I guess at
least some of it is still necessary; the trick is going to be figuring
out how much.
> Being optimistic (without writing a pre-changelog) works in situations
> where partial failures are trivially detected - e.g. namespace
> operations, where a lookup can detect that there was a failure just by
> the fact that an entry is present on one server and not on the other
> (an xattr journal is not necessary to "show" a mismatch). It could
> even work for writes which extend the file size, as a lookup will
> notice the mismatched file sizes instantly without the need for an
> xattr changelog.
I think the key here is that the journal/changelog/whatever can be
maintained *locally* on the servers, without extra network round trips
in the latency path. Clearly, as you/Kaleb/Joe/Etsuji have pointed out,
there are some details that still need to be discussed, but I think
avoiding those network round trips is essential to improving latency.
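To make that concrete, here's a rough sketch (Python, purely for illustration) of what I mean by keeping the changelog local: the server marks the file dirty in a local xattr before applying the write and clears the mark once the write is durable, so the bookkeeping is nothing but local disk I/O. The xattr name and counter encoding are made up for the example; a real translator would obviously pick its own.

import os
import struct

CHANGELOG_XATTR = "user.replica.pending"   # hypothetical name, illustration only

def mark_dirty(path):
    # Bump the local pending counter before applying a write. This is
    # plain local I/O on the server - no extra RPC in the latency path.
    try:
        pending = struct.unpack(">I", os.getxattr(path, CHANGELOG_XATTR))[0]
    except OSError:
        pending = 0
    os.setxattr(path, CHANGELOG_XATTR, struct.pack(">I", pending + 1))

def mark_clean(path):
    # Drop the counter once the write is durable on this server.
    pending = struct.unpack(">I", os.getxattr(path, CHANGELOG_XATTR))[0]
    os.setxattr(path, CHANGELOG_XATTR, struct.pack(">I", max(pending - 1, 0)))

def apply_write(path, offset, data):
    mark_dirty(path)
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    mark_clean(path)

A non-zero counter left behind after a crash is exactly the "this copy may need healing" hint, and it never cost the client a round trip.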
> Another important point to note here is the recovery process. Even in
> the failure situations described above, the question of the direction
> of recovery comes into the picture. If a changelog exists (i.e., the
> client survived long enough to write out the journal), then that will
> indicate the direction of "healing". The client should absolutely not
> return from the syscall before the journal update is done (it just
> cannot be a background process).
Completely agree. The only place we differ so far seems to be on where
the changelog is and who updates it. I think having it on the client is
unsafe because of the scenario you describe and others as well. Clients
can't be trusted. Any number of clients can go away mid-operation and
never come back, and the result should still be consistent. A single
server can also go away mid-operation and never come back, but not N
servers all at once (where N is the replication level and thus N-1 is
the number of concurrent failures the system has been explicitly
configured to tolerate).
> But if there was a partial overwrite in the middle of the file, it is
> just not feasible to bring it under the "optimistic changelogging"
> kind of optimization.
I think it is feasible if sufficient ordering information is present
(e.g. version vectors). Yes, I know that would require a significant
protocol change. This is precisely the complexity I've gone through
with the async stuff, which I was hoping to avoid for sync. Let's walk
through the relevant failure scenario with N=2 to see how this works. A
client writes to two servers, using a last known version number at each
server as a predicate. If the write succeeds in both places, that means
there were no conflicting writes and we're done. If the write fails in
both places, for any reason (not just predicate failure), that means we
had no effect at all and can simply retry (presumably using the new
version numbers we got back in the replies). So far, so good.
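Just to pin down what I mean by a predicated write, here's a toy sketch; the Server class and its single version counter are stand-ins for illustration, not the real protocol.

class PredicateFailure(Exception):
    # Carries the server's current version back to the caller.
    pass

class Server:
    # Toy replica: versioning would be per-file (or per-region) in
    # reality; one counter per server keeps the sketch short.
    def __init__(self):
        self.version = 0
        self.data = bytearray(1024)

    def write(self, offset, buf, expected_version):
        if expected_version != self.version:
            raise PredicateFailure(self.version)
        self.data[offset:offset + len(buf)] = buf
        self.version += 1
        return self.version

def replicated_write(servers, known_versions, offset, buf):
    # Issue the predicated write to every replica and collect outcomes.
    results = []
    for srv, ver in zip(servers, known_versions):
        try:
            results.append(("ok", srv.write(offset, buf, ver)))
        except PredicateFailure as e:
            results.append(("conflict", e.args[0]))
    return results

All "ok" means we can acknowledge, all "conflict" means we retry with the returned versions, and a mixed result is the interesting case.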
The real fun starts when a write succeeds at one server X and fails at
another server Y because of a version mismatch. This means someone
updated Y without (yet) updating X, either because the writes were
concurrent or because the other writer failed in the middle of an update
(the situation Kaleb and Etsuji both pointed out). The key here is that
we haven't yet acknowledged the write to the user, and both versions of
the conflicting region exist - our version on X and some other writer's
version on Y. All that remains is to pick an order, and ensure that the
conflict region contains the later version (according to the chosen
order), all before we acknowledge to the user. To do that, we define
the version vectors as follows:
{ server1_version, server2_version, client_ID, client_version }
For the most part, the standard older-than rules for version vectors
apply. We can add a twist, though, which is that the client versions
are only comparable when the client IDs are identical. When all of the
server versions are identical, the client ID is used instead of the
client version to break the tie. This establishes a consistent "pecking
order" to determine the order of application for concurrent writes from
mutually oblivious clients.
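In code, the comparison might look something like the following; the field names and tuple layout are just assumptions for the example.

from collections import namedtuple

# One version per server, plus the writing client's identity and counter.
Version = namedtuple("Version", ["server_versions", "client_id", "client_version"])

def compare(a, b):
    # Returns -1 if a is older than b, 1 if newer, 0 if equal/concurrent.
    a_le = all(x <= y for x, y in zip(a.server_versions, b.server_versions))
    b_le = all(y <= x for x, y in zip(a.server_versions, b.server_versions))
    if a_le and not b_le:
        return -1
    if b_le and not a_le:
        return 1
    if a.server_versions == b.server_versions:
        if a.client_id == b.client_id:
            # Same writer: its own counter orders its writes.
            return ((a.client_version > b.client_version)
                    - (a.client_version < b.client_version))
        # Mutually oblivious writers with identical server views: break
        # the tie on client ID so every server picks the same winner.
        return (a.client_id > b.client_id) - (a.client_id < b.client_id)
    return 0   # concurrent in the server components; repair sorts it out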
OK, enough computer science. How does this work in practice? Simply
put, when two clients write to two servers in opposite orders, each
server will end up with a fully versioned write which it will try to
push to the other. When the two servers try to push to one another,
they'll agree on an order for the two writes and cross-propagate the
parts that correspond to that order. This is the same code that would
get exercised during startup to deal with the server-failure case, or
possibly as the result of a dirty-status timeout to deal with the
network-partition case, and here it can run in response to an explicit
request from a client that has detected a conflict.
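As a rough illustration of that reconciliation step (reusing compare() from the sketch above; the record layout is invented for the example), each server applies both writes in the agreed order, so the overlapping region converges to the later version on both sides.

def reconcile(server_data, local_write, remote_write):
    # Each write is a (version, offset, data) triple; server_data is a
    # bytearray standing in for the file. Both servers compute the same
    # order, so both end up with the later write in the conflict region.
    first, second = ((local_write, remote_write)
                     if compare(local_write[0], remote_write[0]) < 0
                     else (remote_write, local_write))
    for _version, offset, data in (first, second):
        server_data[offset:offset + len(data)] = data
    return server_data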
> Another step which I don't see in the above sequence of operations is
> locking/unlocking of the modification regions.
The one drawback to this approach is that the conflicting region will
differ transiently on the two servers, and without additional mechanisms
it would be possible for two clients to read different data. I'm going
to commit heresy and suggest that that's OK. If clients issue reads
while a prior write is in progress, they have no guarantee whether it
and/or any other unrelated writes will be present in what they read. If
they want that kind of ordering they should take locks, not expect that
somebody else will reduce their performance by automagically taking
locks on their behalf.
We can apply much stricter consistency rules and accompanying mechanisms
to every operation other than data reads and writes, and IMO should do
so. In the specific case of data reads and writes, for the workloads
that are driving this, incurring large performance penalties in return
for imperceptible or irrelevant consistency gains would be the wrong
tradeoff.