While there seemed to be good reasons for it at the time, managing our
own ports and daemons in HekaFS has put us in the awkward position of
having to play catch-up with Gluster as they add features and
dependencies on their own glusterd infrastructure. There are three
categories of operations we need to worry about:
* Gluster CLI functions that don't work (or don't work as expected) for
HekaFS volumes - e.g. start/stop volume, set volume options, get
profile/top data.
* Gluster CLI operations that take on per-tenant instead of global
scope for HekaFS volumes - e.g. quota operations.
* HekaFS-specific operations - e.g. adding/removing tenants, setting
UID ranges.
The first and second categories can mostly be addressed by moving the
place where we rewrite volfiles from our own scripts to the Gluster
CLI, at precisely the point where it generates its own volfiles
(volgen_write_volfile). This completely addresses start/stop and port
mapping, since glusterfsd processes for HekaFS volumes would be started
exactly the same way as others. Nothing in that code needs to know that
the volfiles have been rearranged. Setting volume options would also
work, due to a quirk in how that's done currently. The process is:
(1) CLI parses command, sends request to local glusterd.
(2) Glusterd creates new volfiles, sends "fetch spec" RPC calls to each
glusterfsd.
(3) Glusterfsd actually fetches and parses the new volfile.
(4) The new "graph" - the list of translators and relationships between
them - is compared to the old one.
(5) The "reconfigure" entry point on any existing translator is called
with a dictionary containing the new options.
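Steps (4) and (5) can be sketched as follows. This is an illustrative
Python sketch only (the real logic is C in glusterfs's graph code, and
all names here are made up); it just shows why a matching graph lets us
reconfigure in place instead of restarting:

```python
# Hypothetical sketch of the graph comparison: diff old and new
# translator graphs and reconfigure translators present in both.

def reconfigure_graph(old_graph, new_graph):
    """old_graph/new_graph map translator name -> options dict.
    Returns a list of (action, name, options) tuples."""
    actions = []
    for name, new_opts in new_graph.items():
        if name in old_graph:
            # Translator exists in both graphs: call its "reconfigure"
            # entry point with the new option dictionary (step 5).
            if new_opts != old_graph[name]:
                actions.append(("reconfigure", name, new_opts))
        else:
            # Brand-new translator: a plain reconfigure won't do.
            actions.append(("add", name, new_opts))
    for name in old_graph:
        if name not in new_graph:
            actions.append(("remove", name, None))
    return actions
```

If the rewritten volfile produces the same translator names as before,
everything falls into the "reconfigure" bucket and no brick restart is
needed.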
As long as our translator rewriting process is deterministic, this
should work fine because the same tenant-specific translators will
(given the same list of tenants) be the same as before. This will
suffice even for setting UID-range options on existing per-tenant
bricks.
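The determinism requirement amounts to this: given the same tenant
list, the rewriting step must emit per-tenant translators with the same
names in the same order, regardless of how the list was supplied. A
minimal sketch (names are illustrative, not the real HekaFS ones):

```python
# Sketch of deterministic per-tenant translator generation: sorting the
# tenant list guarantees identical names and ordering on every run, so
# the graph comparison matches existing translators instead of
# tearing them down.

def per_tenant_sections(volname, tenants):
    sections = []
    for tenant in sorted(tenants):  # sort => deterministic output
        sections.append(
            "volume %s-%s-quota\n"
            "    type features/quota\n"
            "    subvolumes %s-%s-posix\n"
            "end-volume\n" % (volname, tenant, volname, tenant)
        )
    return sections
```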
For quota operations, we need to start by doing something very
different: Gluster runs the quota translator on the client side, where
an untrusted tenant could simply bypass it, so we'll run it on the
server side instead.
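Concretely, a per-tenant server-side brick stack might carry a fragment
like the one below. This is a rough sketch from memory of the 3.x
volfile syntax (the translator names are invented, and the quota
option name should be checked against the current features/quota
source):

```
volume myvol-tenant1-quota
    type features/quota
    option limit-set /:20GB
    subvolumes myvol-tenant1-posix
end-volume
```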
Adding and removing tenants is probably the trickiest piece, because in
our case it means we'll be adding and removing whole translators with
connections to the protocol/server translator. It looks like
server_setvolume is capable of finding the new translator and binding
it to a connection properly, but I haven't been able to check whether
the "right things" will happen when per-tenant translators are removed.
The hfs_* scripts shouldn't need to change all that much except for
daemon stop/start (which will go through the gluster CLI). Much of the
config-database manipulation in existing commands will happen the same
way as before. The new quota commands will update our config database,
which will then be used during the volfile-generation phase to set
correct values on individual translators. Hekafsd will remain as a web
interface, and possibly also as the communications path for our own
distributed config database.
Longer term, we probably need to think more about how to avoid spurious
"quota overruns" that can occur when the division of actual usage
across servers doesn't match the division of quota among those same
servers. For example, if a user has 60GB of total quota we might
divide it into 20GB each on bricks X, Y, and Z. Perhaps, thanks to the
inherent randomness of consistent hashing, they end up with 18GB used
on X but only 10GB each on the others. The actual filesystem
containing X might not be anywhere near full, so the consistent-hashing
algorithm won't avoid it and a new 4GB file might be put on X. Even
though that filesystem isn't full, and even though that user has plenty
of quota remaining, they would still get a quota failure. Perhaps a
periodic quota-rebalancing task (implemented within hekafsd?) could
adjust the quota so that free space remains even across a tenant's
bricks, with a little bit of slack to make such errors even less likely.
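The arithmetic of such a rebalance is simple. A sketch of one possible
policy (this is not an existing hekafsd function, just an illustration):
re-split the tenant's total quota so each brick keeps its current usage
plus an equal share of the unused remainder.

```python
# Hypothetical quota-rebalance policy: give each brick its current
# usage plus an equal slice of the tenant's unused quota, so free
# quota stays roughly even across bricks.

def rebalance_quota(total_quota, usage):
    """usage: brick name -> units used. Returns brick -> new local quota."""
    free = total_quota - sum(usage.values())
    share = free // len(usage)  # integer share of unused quota per brick
    return {brick: used + share for brick, used in usage.items()}
```

With the example above (60GB total, 18/10/10GB used on X/Y/Z), this
yields local quotas of 25/17/17GB, so the new 4GB file on X lands well
under X's limit instead of triggering a spurious failure.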