While there seemed to be good reasons for it at the time, managing our
own ports and daemons in HekaFS has put us in the awkward position of
having to play catch-up with Gluster as they add features and
dependencies on their own glusterd infrastructure. There are three
categories of operations we need to worry about:
* Gluster CLI functions that don't work (or don't work as expected) for
HekaFS volumes - e.g. start/stop volume, set volume options, get
profile/top data.
* Gluster CLI operations that take on per-tenant instead of global
scope for HekaFS volumes - e.g. quota operations.
* HekaFS-specific operations - e.g. adding/removing tenants, setting
UID ranges.
The first and second categories can mostly be addressed by moving the
place where we rewrite volfiles from our own scripts to the Gluster
CLI, at precisely the point where it generates its own volfiles
(volgen_write_volfile). This completely addresses start/stop and port
mapping, since glusterfsd processes for HekaFS volumes would be started
exactly the same way as others. Nothing in that code needs to know that
the volfiles have been rearranged. Setting volume options would also
work, due to a quirk in how that's done currently. The process is:
(1) CLI parses command, sends request to local glusterd.
(2) Glusterd creates new volfiles, sends "fetch spec" RPC calls to each
glusterfsd.
(3) Glusterfsd actually fetches and parses the new volfile.
(4) The new "graph" - the list of translators and relationships between
them - is compared to the old one.
(5) The "reconfigure" entry point on any existing translator is called
with a dictionary containing the new options.
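Steps (4) and (5) can be sketched as follows. This is an illustrative
Python sketch only (the real logic is C in glusterfs's graph code, and
all names here are made up); it just shows why a matching graph lets us
reconfigure in place instead of restarting:

```python
# Hypothetical sketch of the graph comparison: diff old and new
# translator graphs and reconfigure translators present in both.

def reconfigure_graph(old_graph, new_graph):
    """old_graph/new_graph map translator name -> options dict.
    Returns a list of (action, name, options) tuples."""
    actions = []
    for name, new_opts in new_graph.items():
        if name in old_graph:
            # Translator exists in both graphs: call its "reconfigure"
            # entry point with the new option dictionary (step 5).
            if new_opts != old_graph[name]:
                actions.append(("reconfigure", name, new_opts))
        else:
            # Brand-new translator: a plain reconfigure won't do.
            actions.append(("add", name, new_opts))
    for name in old_graph:
        if name not in new_graph:
            actions.append(("remove", name, None))
    return actions
```

If the rewritten volfile produces the same translator names as before,
everything falls into the "reconfigure" bucket and no brick restart is
needed.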
As long as our translator rewriting process is deterministic, this
should work fine because the same tenant-specific translators will
(given the same list of tenants) be the same as before. This will
suffice even for setting UID-range options on existing per-tenant
bricks.
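The determinism requirement amounts to this: given the same tenant
list, the rewriting step must emit per-tenant translators with the same
names in the same order, regardless of how the list was supplied. A
minimal sketch (names are illustrative, not the real HekaFS ones):

```python
# Sketch of deterministic per-tenant translator generation: sorting the
# tenant list guarantees identical names and ordering on every run, so
# the graph comparison matches existing translators instead of
# tearing them down.

def per_tenant_sections(volname, tenants):
    sections = []
    for tenant in sorted(tenants):  # sort => deterministic output
        sections.append(
            "volume %s-%s-quota\n"
            "    type features/quota\n"
            "    subvolumes %s-%s-posix\n"
            "end-volume\n" % (volname, tenant, volname, tenant)
        )
    return sections
```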
For quota operations, we need to start by doing something very
different: Gluster runs the quota translator on the client side, where
an untrusted tenant could simply bypass it, so we'll run it on the
server side instead.
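Concretely, a per-tenant server-side brick stack might carry a fragment
like the one below. This is a rough sketch from memory of the 3.x
volfile syntax (the translator names are invented, and the quota
option name should be checked against the current features/quota
source):

```
volume myvol-tenant1-quota
    type features/quota
    option limit-set /:20GB
    subvolumes myvol-tenant1-posix
end-volume
```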
Adding and removing tenants is probably the trickiest piece, because in
our case it means we'll be adding and removing whole translators with
connections to the protocol/server translator. It looks like
server_setvolume is capable of finding the new translator and binding
it to a connection properly, but I haven't been able to check whether
the "right things" will happen when per-tenant translators are removed.
The hfs_* scripts shouldn't need to change all that much except for
daemon stop/start (which will go through the gluster CLI). Much of the
config-database manipulation in existing commands will happen the same
way as before. The new quota commands will update our config database,
which will then be used during the volfile-generation phase to set
correct values on individual translators. Hekafsd will remain as a web
interface, and possibly also as the communications path for our own
distributed config database.
Longer term, we probably need to think more about how to avoid spurious
"quota overruns" that can occur when the division of actual usage
across servers doesn't match the division of quota among those same
servers. For example, if a user has 60GB of total quota we might
divide it into 20GB each on bricks X, Y, and Z. Perhaps, thanks to the
inherent randomness of consistent hashing, they end up with 18GB used
on X but only 10GB each on the others. The actual filesystem
containing X might not be anywhere near full, so the consistent-hashing
algorithm won't avoid it and a new 4GB file might be put on X. Even
though that filesystem isn't full, and even though that user has plenty
of quota remaining, they would still get a quota failure. Perhaps a
periodic quota-rebalancing task (implemented within hekafsd?) could
adjust the quota so that free space remains even across a tenant's
bricks, with a little bit of slack to make such errors even less likely.
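The arithmetic of such a rebalance is simple. A sketch of one possible
policy (this is not an existing hekafsd function, just an illustration):
re-split the tenant's total quota so each brick keeps its current usage
plus an equal share of the unused remainder.

```python
# Hypothetical quota-rebalance policy: give each brick its current
# usage plus an equal slice of the tenant's unused quota, so free
# quota stays roughly even across bricks.

def rebalance_quota(total_quota, usage):
    """usage: brick name -> units used. Returns brick -> new local quota."""
    free = total_quota - sum(usage.values())
    share = free // len(usage)  # integer share of unused quota per brick
    return {brick: used + share for brick, used in usage.items()}
```

With the example above (60GB total, 18/10/10GB used on X/Y/Z), this
yields local quotas of 25/17/17GB, so the new 4GB file on X lands well
under X's limit instead of triggering a spurious failure.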