attempted hosted migration to gluster back end post-mortem

Fri Apr 27 16:45:07 UTC 2012

As most of you know, we attempted to migrate hosted to a gluster
backend across two systems on Wednesday evening. On Thursday we awoke
to a host of problems and set about solving them. On Thursday evening
we migrated back to our previous configuration.

Thanks for your patience on Thursday, everyone.
-sv

Below is an explanation of what happened:

The hosted migration started on Wednesday afternoon.

The plan was to move to glusterfs from a single node/drbd failover
configuration.

hosted01 and hosted02 would become 'hosted' - both serving files
from /srv (our glusterfs share)

both systems were clients and servers (in gluster's sense):
 - both systems exporting a brick of the same replica.
 - both systems mounting that replicated share.
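
For reference, a two-brick replica like this is typically created along
the following lines. This is only a sketch and it prints the commands
instead of running them. The volume name "hosted" and the brick path
/glusterfs/hosted are inferred from the attribute names and paths that
appear further down in this report; the exact commands we ran may have
differed.

```shell
#!/bin/sh
# Sketch only: print the setup commands for a 2-way replica.
# VOLUME and BRICK are assumptions inferred from this report.
VOLUME=hosted
BRICK=/glusterfs/hosted

print_replica_setup() {
    cat <<EOF
gluster peer probe hosted02
gluster volume create $VOLUME replica 2 hosted01:$BRICK hosted02:$BRICK
gluster volume start $VOLUME
mount -t glusterfs localhost:/$VOLUME /srv
EOF
}

print_replica_setup
```

The first three commands would be run once from hosted01, the mount on
each of the two hosts.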

When mounting with fuse we started seeing pretty serious performance
issues, to the point that users were complaining it was not working. It
would take 20-30s to render a single ticket from trac.

We switched to nfs mounts and performance improved, but we saw an
enormous number of db locking issues on the servers.

At this point we contacted the gluster upstream developers who were
outrageously helpful in tracking down the problems.

After some research it was determined that:

- gluster 3.2 over nfs doesn't support any remote locking at all
- if we brought things down to 1 node and local_lock=all then things
  would work and perform 'ok' but would not allow us to access from the
  other client
- this meant we could replicate the fs but not use it from both hosts
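
To make the locking options concrete, here is a rough sketch of the two
nfs mount variants discussed above. The vers=3 and proto=tcp options and
the hosted01:/hosted export name are assumptions (gluster's nfs server
speaks NFSv3); local_lock is the standard Linux nfs mount option. The
helper only prints the command, it does not mount anything.

```shell
#!/bin/sh
# Print an nfs mount command for the gluster volume.
#   $1 = local_lock mode: all, flock, posix or none (see nfs(5))
#   $2 = server to mount from
# The export name /hosted and the other options are assumptions.
nfs_mount_cmd() {
    case $1 in
        all|flock|posix|none) ;;
        *) echo "unknown local_lock mode: $1" >&2; return 1 ;;
    esac
    echo "mount -t nfs -o vers=3,proto=tcp,local_lock=$1 $2:/hosted /srv"
}

# Single-node workaround: all locks are handled locally on the client.
nfs_mount_cmd all hosted01
# What two active hosts would need: locks passed through to the server.
nfs_mount_cmd none hosted01
```

With local_lock=all the client never tells the server about its locks,
which is why it only works while a single host is using the share.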

After moving to gluster over nfs we ran into a new problem:

  gluster's nfs server does not support --manage-gids, so we were
restricted to 16 gids per user. There is no solution outside of new code
for this one - investigation into doing that for gluster ++ is occurring.

jdarcy and pranithk were given sysadmin-hosted access to look at logs
directly on hosted01/02 and investigate the split-brain reports we were
seeing.

jdarcy and pranithk tracked the self-heal/split brain problems back to
dirs with out of sync fattrs. The only way to solve this was to
manually remove the out of sync fattrs after verifying that ONLY the
fattrs were out of sync and not any data.
This involved looking at all dirs with self-heal problems and running:

> setfattr -x trusted.afr.hosted-client-0 /glusterfs/hosted/$dir
> setfattr -x trusted.afr.hosted-client-1 /glusterfs/hosted/$dir

to clear those settings then reaccessing the dir at:
 /srv/$dir

 to force the self-heal to complete correctly.
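
The per-directory steps above could be wrapped up roughly as follows.
This is a sketch, not the script we used: the DRY_RUN switch and the
function names are made up for illustration, and it defaults to only
printing what it would do. It does NOT check that the data is in sync;
that comparison still has to be done by hand before any attribute is
removed.

```shell
#!/bin/sh
# Paths are taken from the commands above; DRY_RUN is an illustrative
# safety switch (1 = only print, 0 = really run).
BRICK=/glusterfs/hosted
MOUNT=/srv
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

clear_afr_attrs() {
    dir=$1
    # Show the current afr attributes so they can be compared between
    # the two bricks before anything is removed.
    run getfattr -d -m trusted.afr -e hex "$BRICK/$dir"
    for client in 0 1; do
        run setfattr -x "trusted.afr.hosted-client-$client" "$BRICK/$dir"
    done
    # Re-access the dir through the mounted share to trigger self-heal.
    run stat "$MOUNT/$dir"
}
```

Example: "clear_afr_attrs some/dir" prints the four commands for that
directory; running with DRY_RUN=0 as root executes them.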

At this point we did not appear to be having self-heal issues, but we
still had group ids limited to 16 under the nfs clients.

The only option to resolve that is to patch the gluster nfs server to
do the equivalent of --manage-gids.

We attempted to see if we could optimize the fuse mounts to work around
the nfs limitations. We set the fuse mount up on hosted02 and did
performance tests - they were 'okay' but not really acceptable.
Additionally, after testing fuse enhancements we were informed that fuse
suffers from the same 16 gid limitation that nfs suffers from, so we
were completely dead in the water.
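
A quick way to see how many accounts are affected by the 16 gid limit
is to count each user's groups. A minimal sketch, assuming getent and id
are available and that all accounts are visible through the passwd
database; the function names are made up for illustration.

```shell
#!/bin/sh
# The limit described in this report.
GID_LIMIT=16

# Print how many groups (primary plus supplementary) a user is in.
group_count() {
    id -G "$1" 2>/dev/null | wc -w | tr -d ' '
}

# Print "over" if a group count exceeds the limit, "ok" otherwise.
limit_status() {
    if [ "$1" -gt "$GID_LIMIT" ]; then
        echo over
    else
        echo ok
    fi
}

# List every account that is in more groups than the limit allows.
users_over_limit() {
    getent passwd | cut -d: -f1 | while read -r user; do
        n=$(group_count "$user")
        if [ "$(limit_status "$n")" = over ]; then
            echo "$user $n"
        fi
    done
}
```

Running users_over_limit on one of the hosted systems prints one line
per affected account with its group count.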

We punted back to hosted03 - re-rsyncing everything back.

We also set up a new host: hosted-list01.fedoraproject.org at internetx.
This will allow us to move the hosted mailing lists OFF of
fedorahosted.org which gains us a lot of latitude in how we move around
projects that we did not have before.

We will start on the gluster migration + testing if/when we get a patch
for 3.3 from jdarcy to handle > 16 gids via nfs.

If that happens, these are the tests to run once 3.3 and the > 16 gid
patch are in place:
1. that nfs locking actually works (test with local_lock=none) and a
sqlite3 .dump:
    rm -f /srv/trac/projects/fedora-infrastructure/db/fixed.db
    sqlite3 /srv/trac/projects/fedora-infrastructure/db/trac.db .dump |
      sqlite3 /srv/trac/projects/fedora-infrastructure/db/fixed.db
2. that writes with a gid beyond 16 work
3. that performance is palatable: cloning git repos
4. test trac with both systems
5. look for self-healing issues
6. failover testing. Kill one node and confirm the other keeps working
with limited problems.
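
The checks above could be driven from a small pass/fail runner so the
same set gets re-run after every change. This is only a sketch: run_check
and sqlite_roundtrip are hypothetical helper names, and the sqlite step
assumes the dump-and-reload of trac.db described in test 1.

```shell
#!/bin/sh
# Run one check and report the result.
#   $1 = label, remaining arguments = command to run
run_check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS $label"
    else
        echo "FAIL $label"
    fi
}

# Test 1: dump trac.db and reload it into a fresh file in the same dir.
#   $1 = directory holding trac.db
sqlite_roundtrip() {
    db_dir=$1
    rm -f "$db_dir/fixed.db"
    sqlite3 "$db_dir/trac.db" .dump | sqlite3 "$db_dir/fixed.db"
}
```

For example:
  run_check "sqlite dump" sqlite_roundtrip \
      /srv/trac/projects/fedora-infrastructure/db
Other checks (git clones, writes with a high gid) can be passed to
run_check the same way as plain commands.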

Things to do before production of gluster:
- MOVE GITWEB CACHING OFF OF /srv

Many thanks to the gluster dev team for helping us track down where the
problems were coming from and attempting to help us fix them. Their
help was indispensable.