On Thu, 8 Sep 2011 13:29:32 -0600
Pete Zaitcev <zaitcev(a)redhat.com> wrote:
> On Tue, 6 Sep 2011 16:50:27 -0400
> Jeff Darcy <jdarcy(a)redhat.com> wrote:
> > Kaleb deserves the credit for
> > pointing out that we already have a pretty good distributed database
> > available to us - GlusterFS itself, which already has some of the
> > distribution and self-healing features we need.
> It is a clever idea if the chicken and egg problem gets resolved
> neatly.
It's not quite a chicken and egg problem because HekaFS isn't using
itself. It's just using GlusterFS - which is oblivious to HekaFS's
machinations - in a different way.
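To make that concrete: the idea is just that hekafsd treats a small,
replicated GlusterFS volume as its configuration store. A rough sketch of
what that could look like - the volume name, mount point, and file layout
below are purely illustrative, not anything we've settled on:

    import json
    import os
    import subprocess

    CONFIG_VOL = "hekafs-config"             # illustrative volume name
    MOUNT_POINT = "/var/lib/hekafsd/config"  # illustrative mount point

    def mount_config_volume(server):
        # Mount the private config volume with the stock GlusterFS client;
        # replication and self-heal are GlusterFS's problem, not ours.
        if not os.path.isdir(MOUNT_POINT):
            os.makedirs(MOUNT_POINT)
        subprocess.check_call(["mount", "-t", "glusterfs",
                               "%s:/%s" % (server, CONFIG_VOL), MOUNT_POINT])

    def save_tenant(tenant, settings):
        # Config entries are just files on the shared volume.
        tenant_dir = os.path.join(MOUNT_POINT, "tenants")
        if not os.path.isdir(tenant_dir):
            os.makedirs(tenant_dir)
        with open(os.path.join(tenant_dir, tenant), "w") as f:
            json.dump(settings, f)

All of the distribution and self-heal machinery comes from GlusterFS;
hekafsd just sees ordinary files.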
> But I'm just curious, did you guys think about using external
> configuration management infrastructure, and if so, what?
> For example, we already rely on DNS (even if implicitly). How about
> using the zone update feature to add and remove SRV records of servers
> that have this "small private filesystem"? They have port numbers.
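For concreteness, I take this to mean something like the following dynamic
update, sketched here with dnspython; the zone, record name, port, and
nameserver address are all made up:

    import dns.query
    import dns.update

    # Advertise a config server by adding an SRV record via dynamic update.
    update = dns.update.Update("hekafs.example.com")
    update.add("_hekafs-config._tcp", 300, "SRV",
               "0 0 24007 node1.hekafs.example.com.")
    dns.query.tcp(update, "192.0.2.1")   # the zone's primary nameserver

    # Removing a server would be the corresponding delete.
    removal = dns.update.Update("hekafs.example.com")
    removal.delete("_hekafs-config._tcp", "SRV",
                   "0 0 24007 node1.hekafs.example.com.")
    dns.query.tcp(removal, "192.0.2.1")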
Relying on DNS in that way seems fraught with peril. For one thing it
implies either a private DNS server or a complex security setup to
allow such arbitrary modification of records, and that will simply not
be acceptable to many users. That bad decision was one of the main
obstacles to adoption of Hail, and I don't want it to be a similar kind
of barrier for HekaFS.
> Or, do not use a private filesystem at all and roll out a Zookeeper
> client (personally I hate both the complexity and verbosity of its
> API, but I'm just asking).
I actually did think about using Zookeeper (or CLD). On the one hand,
it would solve the config-consistency problem quite well. On the
other hand, ...
* It would add a pretty nasty set of dependencies not only on ZK itself,
but also on various libraries, the JVM itself, etc.
* It would increase startup/shutdown complexity and resource use, as
there would be more daemons and ports to manage.
* Since we probably don't want to run ZK on every single server, we'd
still need a way to identify which GlusterFS/HekaFS servers are also
ZK servers . . . unless the ZK servers are separate, which would add
even more complexity.
I'm not totally opposed to that approach, but it doesn't seem like the
incremental benefits outweigh the costs. There are other distributed-DB
options that would pull in fewer dependencies and consume fewer
resources, but the other issues would remain, and for every such option
I looked at (e.g. Voldemort, Riak) the result would have been the same.
> Or, use rsyncd like in OpenStack (in Rackspace production really).
That's similar to what we do now. The difference between broadcasting
high-level requests and broadcasting low-level data operations isn't
that great in that case. It works OK in the normal cases, but leaves us
with the problem of whose copy is authoritative after a failure.
Building N-way consensus on top of two-way replication is exactly the
kind of thing I want to avoid; conflict resolution in the "weird" cases
is exactly what I want to push down to Someone Else's Code.
> In short, I see what's approved but I wish to see what was rejected,
> if you have the time to satisfy my curiosity.
> > When it starts up, hekafsd would go through [...]
> > Check whether this node is supposed to be a
> > config-filesystem server. If so, generate a simple server volfile
> > and spin up an instance of glusterfsd.
> What does the "check" include, in more detail? Is there a seed
> configuration file in /var/lib/hekafsd that "hfs_config init" creates?
Correct, except that it's really "hfs_config serve" that would do
what you describe; "hfs_config init" would populate the FS that's
shared among all of the servers.
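For what it's worth, the "serve" step I have in mind is roughly the
following: write out a trivial one-brick server volfile and spawn a
private glusterfsd on it. This is only a sketch; the paths, port, and
exact volfile options are illustrative and omit details such as auth:

    import os
    import subprocess

    BRICK_DIR = "/var/lib/hekafsd/config-brick"      # illustrative
    VOLFILE = "/var/lib/hekafsd/config-server.vol"   # illustrative
    PORT = 24010                                     # illustrative

    VOLFILE_TEMPLATE = """\
    volume config-posix
      type storage/posix
      option directory %(brick)s
    end-volume

    volume config-server
      type protocol/server
      option transport-type tcp
      option transport.socket.listen-port %(port)d
      subvolumes config-posix
    end-volume
    """

    def serve_config_volume():
        # Write the volfile and start a dedicated glusterfsd for the
        # shared config filesystem.
        if not os.path.isdir(BRICK_DIR):
            os.makedirs(BRICK_DIR)
        with open(VOLFILE, "w") as f:
            f.write(VOLFILE_TEMPLATE % {"brick": BRICK_DIR, "port": PORT})
        subprocess.check_call(["glusterfsd", "-f", VOLFILE])

"hfs_config init" would then be the piece that seeds the shared
filesystem's contents, e.g. creating whatever directory layout the
servers expect to find there.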