On Tue, Sep 13, 2011 at 10:15 PM, Jeff Darcy <jdarcy@redhat.com> wrote:

The origin of AFR's performance problems is that it requires extra operations
(beyond the necessary N writes) in the non-failure case to ensure correct
operation in the failure case.  The basis of the proposed solution is therefore
to be optimistic instead of pessimistic, expending minimal resources in the
normal case and taking extra steps only after a failure.  The basic write
algorithm becomes:

       1. Forward the write to all N replicas
       2. If all N replicas indicate success, we're done
       3. If any replica fails, add information about the failed request (e.g.
          file, offset, length) to journals on the replicas where it succeeded
       4. As part of the startup process, defer completion of startup until
          brought up to date by replaying peers' journals
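
In pseudocode (the replica objects and journal_append() here are purely
illustrative, not actual AFR/translator APIs), the write path would look
roughly like:

    def replicated_write(replicas, fd, offset, data):
        results = [r.write(fd, offset, data) for r in replicas]    # step 1: forward to all N
        failed = [r for r, ok in zip(replicas, results) if not ok]
        if not failed:
            return True                                            # step 2: all succeeded, no journal
        entry = {"file": fd.path, "offset": offset, "length": len(data)}
        for r in replicas:
            if r not in failed:
                r.journal_append(entry, missing=failed)            # step 3: record on the survivors
        return True    # step 4 (replay) happens later, when the failed replica comes back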

Because the process relies on a journal, there's no need to maintain a
separate list of files in need of repair; journal contents can be examined at
any time, and if they're empty (the normal case) that serves as a positive
indication that the volume is in a fully protected state.

Doing repair as part of the startup process means that, if the failure is a
network partition rather than a server failure, then neither side will go
through the startup process.  Each server must therefore initiate repair upon
being notified of another server coming up as well as during startup.  Journal
entries are pushed rather than pulled, from the servers that have them to the
newly booted or reconnected server.  Each server must also be a client, both to
receive peer-status notifications (which currently go only to clients) and to
issue journal-related requests.
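
Sketched out (again with made-up names, not real APIs), the push-based repair
hook on each server would look something like:

    # Each server is also a client, so it receives peer-status notifications
    # and can push its journal entries to a peer that has just come (back) up.
    def on_peer_up(self, peer):
        for entry in self.journal.entries_for(peer):
            data = self.read_local(entry["file"], entry["offset"], entry["length"])
            peer.repair_write(entry["file"], entry["offset"], data)   # push, not pull
            self.journal.retire(entry)                                # entry no longer needed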

There is a situation where, in the middle of Step 2, half the servers have completed the write, the other half have not yet processed it, and there is a power outage of the entire data center, including the client. If the writes happened to be overwrites which do not extend the file size, they will go unnoticed and never get healed.

Being optimistic (without writing a pre-changelog) works in situations where partial failures are trivially detected - e.g. namespace operations, where a lookup can detect that there was a failure just by the fact that an entry is present on one server and not on the other (an xattr journal is not necessary to "show" the mismatch). It could even work for writes which extend the file size, as a lookup will notice the mismatching file sizes instantly, without the need for an xattr changelog.
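
To illustrate what "trivially detected" means here (the names are made up):

    # What a plain lookup can and cannot notice without any xattr changelog.
    def lookup_detects_mismatch(replicas, path):
        stats = [r.lookup(path) for r in replicas]         # None if the entry is absent
        if len({s is not None for s in stats}) > 1:
            return True                                    # entry present on some servers only
        if all(stats) and len({s.size for s in stats}) > 1:
            return True                                    # sizes disagree (partial extend/truncate)
        return False    # a partial overwrite inside the file is invisible to lookup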

That is the pattern of situations where such "optimistic" changelog handling can be done (where partial failures result in easily noticeable mismatches). By "optimistic changelog" I mean proceeding to perform the actual syscall modification without the initial changelog shielding against failures in the middle (and writing out a changelog - if you survive - recording where the change succeeded).

Another important point to note here is the recovery process. Even in the failure situations described above, the question of the direction of recovery comes into the picture. If a changelog exists (i.e., the client survived long enough to write out the journal), then that will indicate the direction of "healing". The client should absolutely not return the syscall before the journal update is done (it just cannot be a background process). But if a changelog does not exist after the mismatch is noticed, it means that the client did not survive long enough to record the change in the journal.
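
As a small sketch of that decision at repair time (names made up):

    # Pick the healing direction when a mismatch is found.
    def pick_heal_direction(replicas, path):
        sources = [r for r in replicas if r.has_journal_entry(path)]
        if sources:
            return sources    # the journal tells us which copies are good
        return None           # no journal: fall back to the conservative rules below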

At this point, choosing the direction of healing becomes an arbitrary decision. With most changes (both namespace and data) you can always make a "conservative" choice. If a file exists on one server and not on the other, then recreate it. That means that if the mismatch was due to a partial creation, we "roll ahead" the transaction to completion, whereas if it was due to a partial unlink, we "roll back" the transaction to its initial state - either way we can make a conservative decision without really caring what the actual transaction was.

The situation with file data is slightly different: if there is a mismatch in file size, we can heal in the direction that makes the file size bigger on both servers. That way, we would have "rolled ahead" a partial file-extending write and "rolled back" a partial truncate.
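
As a sketch of both of these conservative rules (the helper names are made up):

    # Conservative healing when there is no changelog, only an observed mismatch.
    def heal_without_changelog(replicas, path):
        stats = {r: r.lookup(path) for r in replicas}      # None where the entry is absent
        present = [r for r, s in stats.items() if s is not None]
        missing = [r for r, s in stats.items() if s is None]
        if not present:
            return                                         # entry exists nowhere; nothing to heal
        if missing:
            for r in missing:                              # partial create -> roll ahead,
                r.create_like(path, stats[present[0]])     # partial unlink -> roll back
            return
        sizes = {stats[r].size for r in present}
        if len(sizes) > 1:                                 # partial extend/truncate:
            biggest = max(present, key=lambda r: stats[r].size)
            for r in present:                              # grow to the larger size everywhere
                if r is not biggest:
                    r.copy_range_from(biggest, path, stats[r].size, stats[biggest].size)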

But if there was a partial overwrite in the middle of the file, it is just not feasible to bring it under this kind of "optimistic changelogging" optimization.

Another step which I don't see in the above sequence of operations is the locking/unlocking of the regions being modified.
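
Something like the following, wrapping the write path sketched earlier (the
lock helpers here are made up):

    # Hold a byte-range lock across the replicated write, so concurrent writers
    # cannot leave the replicas with different interleavings of the same region.
    def locked_replicated_write(replicas, fd, offset, data):
        lock = acquire_region_lock(replicas, fd, offset, len(data))
        try:
            return replicated_write(replicas, fd, offset, data)
        finally:
            release_region_lock(lock)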

Avati