RFC: configuration via distributed database

Tuesday, 6 September 2011

Many of the I/O parts of HekaFS rely on having consistent configuration
data, e.g. to ensure that servers export the same volumes with the same
credentials that clients are going to use.  Unfortunately, the way the
various hfs_* scripts manage configuration data (essentially by
forwarding user requests in their entirety to every node) has proven
tedious, inefficient, and error-prone.  In particular, many instances
have already been found of configuration data getting out of sync
between nodes, and there's no infrastructure for bringing them back
into sync after either an error or a failure.  That's why one of the
tasks for "HekaFS 1.1" has been to change the configuration subsystem
to use a distributed database instead.  Kaleb deserves the credit for
pointing out that we already have a pretty good distributed database
available to us - GlusterFS itself, which already has some of the
distribution and self-healing features we need.  Using a small private
filesystem also offers other advantages, such as easing configuration
backup or allowing the use of directories as containers for
multi-valued configuration variables.  There are still some details
that need to be worked out about how to configure that private
filesystem, hence this email.

Most of the code to deal with the configuration filesystem is in two
places: the hfs_utils module (e.g. open_db) and a new hfs_config
script.  This script would support the following operations:

	hfs_config serve
	Make the current node a server for the config filesystem.  You
	need at least one server for anything else to work; normally
	you'd have at least two or three to protect against node
	failures (with GlusterFS handling replication between them).

	hfs_config retire [node_name]
	Remove the named node, or the current node by default, from the
	list of config-filesystem servers.

	hfs_config init
	Populate the config filesystem with initial values.  Obviously
	there would have to be some protection against running this on
	an already-populated filesystem.

	hfs_config volfile
	Generate a client volfile for the config filesystem, e.g. to
	perform manual repair or actions for which no scripts exist
	(yet).

When it starts up, hekafsd would go through the following sequence:

	Check whether this node is supposed to be a config-filesystem
	server.  If so, generate a simple server volfile and spin up an
	instance of glusterfsd.

	Get the list of cluster members (currently from "gluster peer
	status"; in future it might be from a real cluster
	infrastructure).

	Query each member for its server status, using a curl request
	to hekafsd on that node.

	Construct a client volfile based on which members identify
	themselves as config-filesystem servers, including the local
	node if appropriate, and mount that.

	Verify the server list.  If a discrepancy is found (see below),
	take corrective action and re-mount if necessary.

The last step is necessary because a config-filesystem server might be
down while another node is querying for servers, and it would be
disastrous if this led to different nodes writing to different
overlapping subsets of servers.  The GlusterFS repair code would get
hopelessly confused, and the config filesystem would probably end up in
an unrecoverable state.  Therefore, each config-filesystem server
leaves a marker for itself *in the config filesystem*, to persist even
when that server is down.  (Needless to say, this marker should be
removed by "hfs_config retire" and the node_name option exists for the
specific purpose of removing a server that will never come back up.)
The list stored within the filesystem can then be checked against that
used for the initial mount.  If the two lists differ, then corrective
action can be taken before attempting the "real" mount.  Some examples
of corrective action might include:

	Manual sync from the "consensus" config to the local copy.

	Force a GlusterFS replica repair ("find ... | stat").

	Check versions/timestamps (maintained at the end of every user
	request) and perform more complex repair.

This approach - unlike the "broadcast operations" approach used
currently - should be robust against all but the most pathological
multiple-failure cases.  Now is the time to poke holes in it, if you're
so inclined.  Otherwise I'll begin implementing the infrastructure
described here, and converting scripts to use it.

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010