Gareth Bult wrote:
Hi,
Thanks for that, but I'm afraid I'd already read it ... :(
The fundamental problem I have is with the method apparently employed
by self-heal.
Here's what I'm thinking;
Take a 5G database sitting on an AFR with three copies. Normal
operation - three consistent replicas, no problem.
Issue # 1; glusterfsd crashes (or is crashed) on one node. That
replica is immediately out of date as a result of continuous writes
to the DB.
Question # 1; When glusterfsd is restarted on the crashed node, how
does the system know that node is out of date and should not be used
for striped reads?
The trusted.afr.version extended attribute tracks which file version is
in use. On a read, all participating AFR members should respond with
this information, and any older/obsolete file versions are replaced by
a newer copy from one of the valid AFR members (this is self-heal).
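If you want to see this for yourself, you can dump those attributes
directly on each server's backend export (the path below is just a
placeholder, and trusted.* attributes are only visible to root), for
example:

  getfattr -d -m trusted.afr -e hex /data/export/path/to/dbfile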
My assumption; because striped reads are per file, striping will not be
applied to the database, hence there will be no read advantage to be
gained by putting the database on the filesystem .. ??
I think they are planning striped reads per block (maybe definable) at a
later date.
Question # 2; Apart from closing the database and hence closing the
file, how do we tell the crashed node that it needs to re-mirror the
file?
Read the file from a client (head -c1 FILE >/dev/null to force it).
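If you want to trigger that for everything under a mount point rather
than one file at a time, a one-liner along these lines should do it
(the mount path is a placeholder):

  find /mnt/glusterfs -type f -exec head -c1 {} \; > /dev/null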
Question # 3; Mirroring a 5G file will take "some time" and happens
when you re-open the file. While mirroring, the file is effectively
locked.
Net effect;
a. To recover from a crash, the DB needs a restart
b. On restart, the DB is down for the time taken to copy 5G between
machines (over a minute)
From an operational point of view, this doesn't fly .. am I missing
something?
You could use the stripe translator over AFR to AFR chunks of the DB
file, thus allowing per-chunk self-heal (see the volume spec sketch
after the list below). I'm not familiar enough with database file
writing practices in general (not to mention your particular database's
practices), or with the stripe translator, to tell whether any of the
following will cause you problems, but they are worth looking into:
1) Will the overhead the stripe translator introduces with a very large
file and relatively small chunks cause performance problems? (5G in 1MB
stripes = 5000 parts...)
2) How will GlusterFS handle a write to a stripe that is currently
self-healing? Block?
3) Does the way the DB writes the DB file cause massive updates
throughout the file, or does it generally just append and update the
indices, or something completely different? It could have an effect on
how well something like this works.
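For reference, here's a rough sketch of what the client-side volume
spec for stripe-over-AFR might look like. All the volume names,
hostnames and the block-size value are made up, and the exact option
syntax depends on your GlusterFS version, so treat it as a starting
point rather than a tested config:

  # one protocol/client volume per exported chunk brick per server,
  # e.g. (repeat for serverB/serverC and for chunk1):
  volume serverA-chunk0
    type protocol/client
    option transport-type tcp/client
    option remote-host serverA          # hypothetical hostname
    option remote-subvolume chunk0      # export name on that server
  end-volume

  # each AFR volume keeps three copies of its chunk set
  volume afr-chunk0
    type cluster/afr
    subvolumes serverA-chunk0 serverB-chunk0 serverC-chunk0
  end-volume

  volume afr-chunk1
    type cluster/afr
    subvolumes serverA-chunk1 serverB-chunk1 serverC-chunk1
  end-volume

  # stripe across the AFR'd chunk sets in 1MB pieces
  volume stripe0
    type cluster/stripe
    option block-size *:1MB     # pattern:size form; check your version's docs
    subvolumes afr-chunk0 afr-chunk1
  end-volume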
Essentially, using this layout, you are keeping track of which stripes
have changed and only have to sync those particular ones on self-heal.
The longer the downtime, the longer self-heal will take, but you can
mitigate that problem with an rsync of the stripes between the active
and failed GlusterFS nodes BEFORE starting glusterfsd on the failed
node (make sure to get the extended attributes too).
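Something along these lines, run as root on the failed node before
bringing glusterfsd back up (hostname and backend path are
placeholders; -X needs an rsync recent enough to support extended
attributes, otherwise you'd have to copy the trusted.* attributes some
other way):

  rsync -avX active-node:/data/export/ /data/export/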
Also, it appears that I need to restart glusterfsd when I change the
configuration files (i.e. to re-read them), which effectively crashes
the node .. is there a way to re-read a config without crashing the
node? (on the assumption that, as above, crashing a node is effectively
"very" expensive...?)
The above setup, if feasible, would mitigate restart cost, to the point
where only a few megs might need to be synced on a glusterfs restart.
--
-Kevan Benson
-A-1 Networks