(Stupid mail client replied only to Edward instead of the list before)
On Mon, 14 Nov 2011 19:31:13 +0100
Edward Shishkin <edward@redhat.com> wrote:
> > Your discussion of how clients handle appending vs. overwriting
> > sequences seems to indicate that sequences are recognized as such on
> > the client side. How?
> The oplock xlator accumulates events (file size changes) in specially
> maintained data structures and classifies every arriving request as
> an append/truncate or an overwrite. This is a part of the locking
> protocol. I'll post the design document a bit later.
That would be nice. I look forward to seeing how it can do such
aggregation without adding even more latency.
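In the meantime, here's a minimal sketch of how I imagine that
classification working, assuming the xlator tracks the current file
size; the names here are mine, not from the actual oplock code:

    #include <sys/types.h>

    /* Hypothetical client-side classification of an incoming write,
     * based on the file size the oplock xlator is tracking. */
    enum wr_class { WR_OVERWRITE, WR_APPEND, WR_HOLE };

    static enum wr_class
    classify_write (off_t tracked_size, off_t offset, size_t size)
    {
            if (offset > tracked_size)
                    return WR_HOLE;      /* starts beyond EOF: leaves a hole */
            if (offset + (off_t) size > tracked_size)
                    return WR_APPEND;    /* extends the file */
            return WR_OVERWRITE;         /* entirely within current EOF */
    }

Recognizing a whole *sequence* presumably means folding consecutive
WR_APPEND results together, which is where I'd expect the extra
latency to come from.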
> > If the client has grouped the writes it receives into a sequence,
> > why not coalesce the entire sequence into a single writev?
> On the one hand, there is a funny restriction on iov_len: it must not
> be larger than (2G minus a small amount). I tried to eliminate this
> restriction a year ago without success:
> http://bugzilla.redhat.com/show_bug.cgi?id=612839
> On the other hand, Gluster restricts the number of iovecs to
> MAX_IOVEC.
> All this means that we cannot avoid the granulation stuff (teaching
> ->cbk() to spawn a ->writev() for the next chunk, etc.).
Shouldn't we at least *try* to do such aggregation in the vast majority
of cases where it's still possible? Failure to do so will impact
already-questionable performance.
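To make the granulation concrete, here's roughly the loop I have in
mind, written synchronously for brevity; in the xlator it would be
->cbk() spawning the next ->writev() instead. IOV_CHUNK_MAX and
IOV_LEN_MAX are stand-ins for MAX_IOVEC and the "2G minus small"
limit, not Gluster's actual definitions:

    #include <sys/uio.h>
    #include <limits.h>

    #define IOV_CHUNK_MAX  16            /* stand-in for MAX_IOVEC */
    #define IOV_LEN_MAX    INT_MAX       /* "2G minus small" */

    /* Submit a long buffer list as a series of writev() calls, each
     * respecting both the per-call iovec-count cap and the per-iovec
     * size cap.  Partial writes are ignored for brevity. */
    static int
    submit_chunked (int fd, struct iovec *vec, int count)
    {
            int i = 0;

            while (i < count) {
                    int n = 0;

                    while (i + n < count && n < IOV_CHUNK_MAX &&
                           vec[i + n].iov_len <= IOV_LEN_MAX)
                            n++;
                    if (n == 0)
                            return -1;   /* one iovec too large: split it first */
                    if (writev (fd, vec + i, n) < 0)
                            return -1;
                    i += n;
            }
            return 0;
    }

Even with the chunking, a sequence turns into a handful of large calls
instead of hundreds of small ones, which is the aggregation I'm after.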
> > If a sequence is *only* appending (i.e. completely beyond current
> > EOF), does the client synthesize an encrypted-zero-byte write to
> > fill the hole? Does it lock (actually lease) the entire region
> > from current EOF to the end of the sequence all at once?
> Appending writes are performed under an exclusive lock (see next
> mail). Perhaps we can grant a shared lock for a single appending
> write. However, more than one appending write executing in parallel
> can conflict:
> Suppose the file is 20K.
> Process A performs ->writev(size = 10K, off = 20K);
> Process B performs ->writev(size = 10K, off = 30K);
> Process A checks the file size (20K);
> Process B checks the file size (20K);
> Process A writes 10K bytes at offset 20K;
> Process B converts the 10K "hole" at offset 20K to zeros, clobbering
> A's data;
> Process B writes at offset 30K.
> As a result we'll have an unexpected 10K of zeros at offset 20K.
> Obviously "shared" locks don't work here, even though the writes go
> to disjoint intervals.
So why not exclusive locks? Such conflicts should be rare, and we have
to issue some kind of lock request anyway, so there should be no
additional overhead for the common cases.
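To be concrete, the append path I have in mind looks roughly like
this; struct file_ctx and the fill_hole()/do_write() helpers are
hypothetical, and a local mutex stands in for whatever lease the
oplock protocol actually grants:

    #include <pthread.h>
    #include <sys/types.h>

    struct file_ctx {
            pthread_mutex_t lock;   /* stand-in for the oplock lease */
            off_t           size;   /* tracked file size */
    };

    extern int fill_hole (struct file_ctx *ctx, off_t off, off_t len);
    extern int do_write (struct file_ctx *ctx, const char *buf,
                         size_t size, off_t offset);

    /* With the lock held, the size check, the hole fill, and the
     * write are atomic with respect to other appenders, so the
     * zero-clobbering race above cannot happen. */
    static int
    append_locked (struct file_ctx *ctx, const char *buf,
                   size_t size, off_t offset)
    {
            int ret = 0;

            pthread_mutex_lock (&ctx->lock);
            if (offset > ctx->size)
                    ret = fill_hole (ctx, ctx->size, offset - ctx->size);
            if (ret == 0)
                    ret = do_write (ctx, buf, size, offset);
            if (ret == 0 && offset + (off_t) size > ctx->size)
                    ctx->size = offset + size;
            pthread_mutex_unlock (&ctx->lock);
            return ret;
    }

The cost is one lock acquisition per append, which we'd be paying
anyway.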
> > Let me play devil's advocate here. Why fill holes at all?
> Because this is a reasonable working solution.
It is clearly *not* working yet, not without a lot of additional work,
and whether it's reasonable is what we're discussing.
> > An alternative would be for the server to store information about
> > holes, e.g. in one or more xattrs, and keep an up-to-date version
> > of that information in memory for any open file.
> Such hole maps will be large consumers of xattrs/memory. We'll need
> to flush parts of such a map to disk once in a while. Also, we'll
> need to synchronize the set of on-disk holes with the hole map in the
> xattrs in a special order, to make sure that the set of on-disk holes
> is "not larger" than the hole map.
A fair point. In many cases, the holes will be aligned to filesystem
blocks, so the information we need will already be available from
there. That means we only need to track holes bigger than a
cipherblock but smaller than a filesystem block, but that list could
still become quite large. Could we "cheat" by filling such holes with
explicit zeroes but leaving bigger holes at the local-filesystem
level?
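Sketched out, that cheat might look like the following; blk is the
backend filesystem's block size, and zero_fill() is a hypothetical
placeholder for writing encrypted zeroes:

    #include <sys/types.h>

    extern int zero_fill (int fd, off_t off, off_t len);

    /* Given a hole [off, off + len): explicitly zero the unaligned
     * head and tail, and leave the whole aligned blocks in the middle
     * as a real hole in the backend fs, where no extra tracking is
     * needed. */
    static int
    punch_hole (int fd, off_t off, off_t len, off_t blk)
    {
            off_t head = (blk - off % blk) % blk;  /* bytes to next boundary */
            off_t end  = off + len;
            off_t tail = end % blk;

            if (head >= len)    /* hole fits within one block: zero it all */
                    return zero_fill (fd, off, len);
            if (head && zero_fill (fd, off, head) < 0)
                    return -1;
            if (tail && zero_fill (fd, end - tail, tail) < 0)
                    return -1;
            /* [off + head, end - tail) stays sparse. */
            return 0;
    }

That would bound the explicit-zero writes to at most two per hole, at
the cost of losing sparseness for the sub-block fragments.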
I'm not trying to "champion" one approach or the other. We shouldn't
be rejecting *any* alternative because of easily-solved secondary
issues, or just because of who has proposed what. The goal here is to
determine how this problem can be solved with minimum development time
and/or impact on performance, and it's not at all clear which route
leads to that goal. Let's not cut short lines of inquiry too soon
because of any one person's opinion or bias (including mine).