After thinking about some of the problem scenarios people had mentioned
in my first RFC (thank you), I've devised a new version which uses a
much more robust conflict resolution and repair component. The bad
news is that the proposal is a lot more complex than before. The good
news is that most of the complexity is stuff that was already part of
the plan for doing asynchronous wide-area replication in a future
release, so doing this stuff now actually gives us a head start on
that. As always, please comment freely so we can get this right and
make things better sooner rather than later.
# HekaFS Improved Replication #
## Background and Requirements ##
One of the most serious internal complaints about GlusterFS is performance for
small synchronous requests when using its filesystem-level replication (AFR).
This problem particularly afflicts virtual-machine-image and database
workloads, reducing performance to about a third of what it "should" be
(compared on a per-server basis to NFS on the same hardware). The fundamental
problem is that the AFR approach to making writes crash-proof involves the
following operations:
1. Lock on the primary (first) server
2. Record operation-pending state (using extended attributes) on all
servers
3. Issue write to all servers
4. As writes complete, update operation-pending state on other servers
5. Unlock on primary server
Even with some operations in parallel, this requires a minimum of five network
round trips to/from the primary server - possibly more as step 4 might be
repeated if there are more than two replicas. Even with pending changes to
AFR, such as coalescing step 4 updates, AFR's per-request latency is likely to
remain terrible.
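To see why the round-trip count dominates, here's some illustrative arithmetic. The 0.2 ms LAN round-trip time below is an assumed figure, chosen only to show how per-request latency scales with the number of sequential round trips, not a measurement of any real deployment:

```python
# Illustrative (not measured) round-trip arithmetic for synchronous writes.
# RTT_MS is an assumed LAN round-trip time used purely for scaling.
RTT_MS = 0.2

def max_sync_iops(round_trips, rtt_ms=RTT_MS):
    """Upper bound on synchronous IOPS with one outstanding request."""
    return 1000.0 / (round_trips * rtt_ms)

afr = max_sync_iops(5)         # lock, mark, write, unmark, unlock
optimistic = max_sync_iops(1)  # single write round trip in the happy path
print(f"AFR-style: {afr:.0f} IOPS, optimistic: {optimistic:.0f} IOPS")
```

Whatever the actual RTT, five sequential round trips put a 5x lower ceiling on synchronous IOPS than a single one.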
Externally, users seem to focus on a different problem: the timeliness and
observability of replica repair after a server has failed and been
restored[1][2]. AFR was built on the assumption that on-demand repair of
individual files or directories as they're accessed would be sufficient. The
message from users ever since has been unequivocal: leaving unknown numbers of
unrepaired files vulnerable to a second failure for an indefinite period is
unacceptable. These users require immediate repair with explicit notification
of return to a fully protected state, but here they run into a second snag: the
time required to do a full xattr scan of a multi-terabyte filesystem through a
single node is also unacceptable. Patches were submitted almost a year ago[3]
to implement precise recovery by maintaining a list of files that are partially
written and might therefore require repair, but those have never been adopted.
The recently introduced "proactive self heal" functionality is only slightly
better. It is triggered automatically and runs inside one of the server
daemons - avoiding many machine-to-machine and user-to-kernel round trips - but
it's still single-threaded and drags all data through one server that might be
neither source nor destination. Worse, if a second failure occurs while the
lengthy repair process for a previous failure is still ongoing, a new repair
cycle will be scheduled but might not even start for days while the previous
repair scans millions of perfectly healthy files.
The primary requirements, therefore, are:
* Improve performance for synchronous small requests
* Provide efficient "minimal" replica repair with a positive indication
of replica status
In addition to these requirements, compatibility with planned enhancements to
distribution and wide-area replication would also be highly desirable.
## Proposed Solution ##
The origin of AFR's performance problems is that it requires extra operations
(beyond the necessary N writes) in the non-failure case to ensure correct
operation in the failure case. The basis of the proposed solution is therefore
to be optimistic instead of pessimistic, expending minimal resources in the
normal case and taking extra steps only after a failure. The basic write
algorithm (which will be extended later in this document so don't get too
attached to it) becomes:
1. Forward the write to all N replicas
2. If all N replicas indicate success, we're done
3. If any replica fails, add information about the failed request (e.g.
file, offset, length) to journals on the replicas where it succeeded
4. As part of the startup process, defer completion of startup until
the restarting server has been brought up to date by replaying its
peers' journals
Because the process relies on a journal, there's no need to maintain a
separate list of files in need of repair; journal contents can be examined at
any time, and if they're empty (the normal case) that serves as a positive
indication that the volume is in a fully protected state.
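The optimistic write path above can be sketched in a few lines. The `Server` class and its methods here are hypothetical stand-ins for the real translator and RPC machinery, not GlusterFS code:

```python
# Sketch of the optimistic write path: no locks or xattr updates in the
# happy path; journal entries are created only when a replica fails.
class Server:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.journal = []          # entries describing writes peers missed

    def write(self, path, offset, data):
        if not self.up:
            raise IOError(f"{self.name} unreachable")
        # ... apply the write to local storage ...
        return True

def replicated_write(servers, path, offset, data):
    ok, failed = [], []
    for s in servers:                      # 1. forward write to all N replicas
        try:
            s.write(path, offset, data)
            ok.append(s)
        except IOError:
            failed.append(s)
    if not failed:
        return True                        # 2. all succeeded: done, no extra I/O
    for s in ok:                           # 3. journal the failure on survivors
        s.journal.append((path, offset, len(data), [f.name for f in failed]))
    return bool(ok)
```

An empty journal on every server is the positive indication that the volume is fully protected; non-empty journals drive repair at startup.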
Doing repair as part of the startup process means that, if the failure is a
network partition rather than a server failure[4], then neither side will go
through the startup process. Each server must therefore initiate repair upon
being notified of another server coming up as well as during startup. Journal
entries are pushed rather than pulled, from the servers that have them to the
newly booted or reconnected server. Each server must also be a client, both to
receive peer-status notifications (which currently go only to clients) and to
issue journal-related requests.
In the case of a network partition, a second problem also arises: split brain.
Writes might continue to be received and entered into the journal on both sides
of the partition. When journal entries are being propagated in both directions
between two servers, establishing the correct combined order for writes that
overlap would require additional information (e.g. version vectors) not
currently present in the GlusterFS network protocol. Part of the solution to
this is to enforce quorum as has already been suggested[5] and implemented for
AFR. Only a client which can contact a majority of servers, or exactly half
with the first server as part of the set, will even attempt a write. The other
part of the solution to network-partition problems is the same as for wide-area
replication: vector clocks and a robust conflict-resolution method. These will
be described in a subsequent section.
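The quorum rule described above is simple enough to state in code. This is a minimal sketch of the decision only; the function name and representation of servers are invented for illustration:

```python
# Quorum check: write only if a strict majority of replicas is reachable,
# or exactly half provided the first replica is among the reachable set.
def have_quorum(reachable, all_servers):
    n = len(all_servers)
    r = len(reachable)
    if 2 * r > n:                  # strict majority
        return True
    if 2 * r == n:                 # exactly half: first server breaks the tie
        return all_servers[0] in reachable
    return False
```

The first-server rule guarantees that at most one side of a perfectly even partition can ever claim quorum.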
Although the description so far has mostly concentrated on writes, other
modifications - e.g. create, symlink, setxattr - mostly work the same way. In
the case of namespace operations followed by data operations - e.g. rename
followed by write - ordinary care must be taken to ensure that the second
operation is applied to the correct object. In the worst case, we might need
to store UUIDs in the journal and use a UUID-to-path mapping maintained on each
server (which would be useful for other reasons).
Full scans must still be available as a fallback in case of a total brick
failure, to repopulate it or its replacement from scratch.
## Vector Clocks ##
The conflict resolution protocol exists not only to handle server and network
failures, but also transient conflicts that can occur when two (or more)
clients write to two (or more) servers in different orders. For example,
consider what happens if events regarding the same byte range occur in this
order:
1. Client A sends a write ("lmnop") to servers X and Y
2. Client B sends a write ("rstuv") to the same two servers
3. Client A's write is performed at server X
4. Client B's write is performed at server Y
5. The other two writes complete
Now server X has "lmnop" and server Y has "rstuv" - a permanently
inconsistent state since both writes seem complete (to their issuing
clients). To avoid this sort of insanity, we use simple vector
clocks[6]. Each server maintains its own "clock" (integer sequence
number) and the "vector clock" (really a clock vector) is an array of
all such clocks for all replica servers. Every
write or conflict-resolution message is accompanied by the vector clock from
its sender, which might cause some clock values to be updated on the receiver.
(Yes, this means the low-level RPC protocol must be extended.) Lastly, we
define vector clock M as being later than vector clock N if and only if:
* For every component clock, M's value is *at least* N's
* At least one of M's clock values is *strictly greater* than N's
If these rules are followed, the vector clocks provide enough information to
resolve most conflicts automatically. For truly simultaneous operations at
different places, it's possible that the later-than relation described above
does not apply in either direction. In these cases, some tie-breaker must be
applied to make the comparison result deterministic again. To fulfill this
role, a unique ID for the client attempting a write is also passed along with
all messages related to that write.
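The ordering rules above translate directly into code. This is a sketch of the comparison only; function names are invented, and clocks are modeled as plain tuples:

```python
# later_than() implements the two rules defining the later-than relation;
# ordered() falls back to a client-ID tie-breaker when neither clock is
# later than the other (a truly concurrent pair of writes).
def later_than(m, n):
    """True iff vector clock m is later than vector clock n."""
    return all(a >= b for a, b in zip(m, n)) and \
           any(a > b for a, b in zip(m, n))

def ordered(clock_a, client_a, clock_b, client_b):
    """Return -1 if write (a) orders before write (b), 1 if after."""
    if later_than(clock_b, clock_a):
        return -1                  # b saw a's effects, so a is earlier
    if later_than(clock_a, clock_b):
        return 1
    # Concurrent: deterministic tie-break on the writing client's ID.
    return -1 if client_a < client_b else 1
```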
## Advanced Write Protocol ##
The full write protocol uses vector clocks to deal with the various conflict
and fault-recovery cases in three ways.
* Clients send their current vector clock, indicating the expected
clock values at each server, along with each write. Each server
checks the clock value contained in the write against its own
current clock value. If there's a mismatch, indicating a write at
the server that the client didn't know about, then the new write is
rejected as conflicting.
* Servers maintain a journal of writes whose full disposition is
currently unknown. These writes can be forwarded to other servers
immediately to handle conflicts, or when a server comes back up to
handle faults.
* If a client's write fails everywhere (for conflict or other reasons)
this means it had no effect anywhere and can simply be retried. If
it succeeded at some servers and failed at others, then the client
can tell one or more of the succeeding servers to forward the write
immediately to the failing servers. As part of this write-forwarding
process, vector clocks are used again to determine the order in which
conflicting writes will be applied.
The full write sequence is therefore as follows.
1. Client sends a write to all servers, with vector clocks.
2. Each server compares the write's vector clock to the server's own,
possibly rejecting the write as conflicting.
3. Each server also adds the write to its local journal.
4. If the client's write succeeds everywhere, the client can complete
it. In addition, a "cleanup" message is sent to each replica to
retire the write's journal entry. This can be done asynchronously
so that it doesn't increase write latency.
5. If the client's write fails everywhere - meaning that it had no
permanent effect at all - it can simply be retried.
6. In mixed success/failure cases, the client sends "forward" messages
to servers where the write succeeded, directing them to forward the
write to other servers where it failed.
7. A server receiving a forwarded write uses the contained vector clock
to determine order relative to items in its own journal. The
incoming write is either partially or fully applied according to the
outcome of this ordering process.
8. The server receiving the forwarded write replies to the sender,
which replies to the client. The forwarding server's journal is
cleaned up along the way, as its disposition elsewhere becomes
known.
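The eight steps above reduce to a small client-side decision procedure. The `Stub` class and its methods below are hypothetical stand-ins for the real RPC layer, used only to make the control flow concrete:

```python
# Client-side sketch of the eight-step write sequence.
class Stub:
    def __init__(self, accepts):
        self.accepts = accepts     # whether this server accepts the write
        self.log = []
    def send(self, w):
        self.log.append(("write", w)); return self.accepts
    def cleanup(self, w):
        self.log.append(("cleanup", w))
    def forward(self, w, to):
        self.log.append(("forward", w, [t for t in to]))

def client_write(servers, write):
    results = {s: s.send(write) for s in servers}   # steps 1-3: send + journal
    ok = [s for s, r in results.items() if r]
    bad = [s for s, r in results.items() if not r]
    if not bad:                                     # step 4: success everywhere;
        for s in servers:                           # cleanup can be asynchronous
            s.cleanup(write)
        return "done"
    if not ok:                                      # step 5: failed everywhere,
        return "retry"                              # no effect, safe to retry
    ok[0].forward(write, to=bad)                    # steps 6-8: forward + resolve
    return "forwarded"
```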
Step 7 is really the key. By comparing vector clocks (with client IDs as
tie-breakers), applying some writes partially and others completely, even two
servers forwarding writes to one another converge on a single set of file
contents. To see this in action, let's expand a little on the example in the
previous section.
1. Initial data on both servers is "AAAA" with vector clock {0,0}.
2. Client A sends a write for "BBB_" (where "_" means that character is
not being written) to both servers with vector clock {0,0}.
3. Client B sends a write for "_CCC" to both servers with vector clock
{0,0}.
4. Server X receives and accepts A's write. It now has data "BBBA" and
vector clock {1,0}. It also has a journal entry for a write from A
at {1,0}.
5. Server Y receives and accepts B's write. It now has data "ACCC" and
vector clock {0,1}. It also has a journal entry for a write from B
at {0,1}.
6. The other two writes (A->Y and B->X) fail due to version conflict.
7. Client A, seeing the mixed result, sends a "forward" message to X.
8. Server X forwards the write (client=A, clock={1,0}, data="BBB_") to
Y.
9. Server Y receives the forwarded write and compares the clock to that
for its own journaled write. Since {1,0} and {0,1} are not
resolvable, the client ID tie-breaker is applied. A's write is
before B's, so it's only applied partially. This yields "BCCC" with
a merged version vector of {1,1}.
10. Replies propagate back from Y to X to A. X's journal entry is
retired, and A indicates completion of the write.
11. Client B, also seeing a mixed result, sends a "forward" message to
Y, which forwards the write (client=B, clock={0,1}, data="_CCC") to
X.
12. Server X receives the second forwarded write. This time the
tie-break rule has the opposite effect, causing the forwarded write
to apply in its entirety. X now has data "BCCC" with merged
version {1,1} - the same as Y.
13. Replies propagate from X to Y to B, etc.
The same protocol also works and yields consistent results even if the two
resolutions overlap, as has been verified by model checking[7].
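The convergence in the walkthrough above can be reproduced with toy strings, where "_" marks bytes a write leaves untouched. This is a sketch of the merge rule in steps 9 and 12 only, not the actual wire protocol; all names here are invented:

```python
# Reproduce the two-server example: both servers converge on "BCCC".
def apply_patch(data, patch):
    """Apply a patch string, leaving "_" positions untouched."""
    return "".join(p if p != "_" else d for d, p in zip(data, patch))

def mask(patch):
    """Set of byte positions a patch actually writes."""
    return {i for i, c in enumerate(patch) if c != "_"}

def merge_forwarded(data, journaled, fwd_patch, fwd_wins):
    """Apply a forwarded write. If the journaled local write wins the
    tie-break, the forwarded write only fills bytes the local one skipped."""
    if fwd_wins:
        return apply_patch(data, fwd_patch)
    keep = mask(journaled)
    trimmed = "".join(c if i not in keep else "_"
                      for i, c in enumerate(fwd_patch))
    return apply_patch(data, trimmed)

# Server X accepted A's "BBB_", server Y accepted B's "_CCC" (both on "AAAA").
x = apply_patch("AAAA", "BBB_")        # X now holds "BBBA"
y = apply_patch("AAAA", "_CCC")        # Y now holds "ACCC"
# Forward A's write to Y: A tie-breaks *before* B, so B's bytes win there.
y = merge_forwarded(y, journaled="_CCC", fwd_patch="BBB_", fwd_wins=False)
# Forward B's write to X: B orders *after* A, so B's write applies fully.
x = merge_forwarded(x, journaled="BBB_", fwd_patch="_CCC", fwd_wins=True)
print(x, y)                            # both converge on "BCCC"
```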
## Consistency Issues ##
In the absence of a "cleanup" message, the same write forwarding with the same
conflict-resolution rules can also be applied in response to a server
restarting or becoming reachable after a network partition. In all cases, a
write remains in a server's journal until that server is *sure* it has been
fully forwarded and resolved at all other servers within its replica set.
In the server-unavailable case, if multiple overlapping writes occur at one
of the surviving servers, the metadata-journaling approach can introduce a
transient sort of inconsistency. Consider the following sequence.
1. Original state: file X=AAA, file Y=BBB
2. Write CC_ to file X (contents = CCA)
3. Write DDD to file Y (contents = DDD)
4. Write _EE to file X (contents = CEE)
If the journal contains only metadata, then the first write to file X
will be forwarded to other servers (using data read back from the live
file) as CE_. This could result in a
state of CEA on the destination server - i.e. part of the third write has
effectively been applied before the second. Since these writes might have
been fully acknowledged on the first server, this state - which the application
might have carefully avoided on the first server by sequencing its writes -
is invalid. There are four ways to deal with this.
A. Accept it, and write applications so that they don't rely on this
level of consistency.
B. Do full data journaling for every write. This approach is simple,
but has the worst performance characteristics of the
consistency-preserving options.
C. Store the contents of the first write directly in the file, journal
full data for later writes. This requires two disk writes for each
user write during the failure, then a read plus a write to apply the
journal during recovery. It also has terrible read performance, as
reads must consult an arbitrary number of journal entries to find
the current contents of a byte range.
D. Copy on write, to move overlapped data from the live file into the
journal before it's overwritten. This requires a read plus two
writes during the failure (worse than the previous option) but no
additional I/O to replay the journal. Read performance is good
because only the live file needs to be consulted for current data.
Based on this analysis, D is clearly preferable to B or C. Some users might
still prefer A for its performance-during-failure characteristics, so that
should also be available as an option.
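Option D can be sketched as follows: before a new write overwrites bytes that an earlier journaled write still needs for replay, those bytes are copied into the journal. All class and method names below are invented for illustration:

```python
# Sketch of copy-on-write journaling (option D). Replay reconstructs each
# write exactly as issued, so a peer receives CC_ then _EE for file X,
# never a conflated CE_ that interleaves a later write's data.
class CowJournal:
    def __init__(self, data):
        self.data = list(data)
        self.entries = []          # [offset, length, {pos: saved_old_byte}]

    def write(self, offset, new):
        touched = range(offset, offset + len(new))
        # Copy-on-write: before overwriting, preserve any byte that an
        # earlier journaled write still needs for its eventual replay.
        for off, length, saved in self.entries:
            for pos in touched:
                if off <= pos < off + length and pos not in saved:
                    saved[pos] = self.data[pos]
        self.entries.append([offset, len(new), {}])
        self.data[offset:offset + len(new)] = new

    def replay(self):
        """Yield each journaled write, oldest first: saved bytes where a
        later write overwrote it, live-file bytes everywhere else."""
        for offset, length, saved in self.entries:
            yield offset, "".join(saved.get(p, self.data[p])
                                  for p in range(offset, offset + length))

j = CowJournal("AAA")              # file X starts as AAA
j.write(0, "CC")                   # CC_ -> file is now CCA
j.write(1, "EE")                   # _EE -> file is now CEE
```

Replaying `j` yields the original CC_ and _EE writes in order, so the destination passes through CCA rather than the invalid CEA state.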
## Notes ##
[0] No, this document isn't really proper Markdown. Close enough.
[1] "Experience with GlusterFS"
    http://www.devco.net/archives/2010/09/22/experience_with_glusterfs.php
[2] "Why GlusterFS is Glusterfsck'd Too"
    http://chip.typepad.com/weblog/2011/09/why-glusterfs-is-glusterfsckd-too....
[3] http://bugs.gluster.com/show_bug.cgi?id=2088
[4] Yes, partitions do occur even in a local network environment.
[5] http://bugs.gluster.com/show_bug.cgi?id=3533
[6] http://wiki.basho.com/Vector-Clocks.html (includes further references)
[7] See e.g. http://www.cs.utah.edu/formal_verification/Murphi/ for
    background, though this effort did not use Murphi. 391,368 states
    (after symmetry breaking) were checked for the two-client two-server
    scenario described above.