Re: Choice of Translator question

Gareth Bult wrote:
Hi,

Thanks for that, but I'm afraid I'd already read it ... :(

The fundamental problem I have is with the method apparently employed
by self-heal.

Here's what I'm thinking;

Take a 5G database sitting on an AFR with three copies. Normal
operation - three consistent replicas, no problem.

Issue # 1; glusterfsd crashes (or is crashed) on one node. That
replica is immediately out of date as a result of continuous writes
to the DB.

Question # 1; When glusterfsd is restarted on the crashed node, how
does the system know that node is out of date and should not be used
for striped reads?

The trusted.afr.version extended attribute tracks which version of the file each member holds. On a read, all participating AFR members report this version, and any member holding an older/obsolete copy has it replaced by a newer copy from one of the valid AFR members (this is self-heal).
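
If you want to watch that mechanism, you can read the attribute straight off a server's backend export directory (not through the client mount). A rough sketch, run as root, with a placeholder path and file name; the exact attribute name can also differ between GlusterFS releases:

  # Dump AFR's version attribute for one file on the backend store
  getfattr -n trusted.afr.version -e hex /export/brick/mydb.dat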

My assumption; Because striped reads are per file, striping will not
be applied to the database, hence there will be no read advantage
obtained by putting the database on the filesystem .. ??

I think they are planning striped reads per block (with the block size perhaps definable) at a later date.

Question # 2; Apart from closing the database and hence closing the
file, how do we tell the crashed node that it needs to re-mirror the
file?

Read the file from a client (head -c1 FILE >/dev/null to force it).
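
If you have more than one file to bring back in sync, the same trick can be looped over the whole tree on the client mount. A sketch, assuming a hypothetical mount point of /mnt/glusterfs:

  # Reading one byte of each file through the mount is enough to make
  # AFR compare versions and self-heal any stale copies.
  find /mnt/glusterfs -type f -exec head -c1 '{}' \; > /dev/null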

Question # 3; Mirroring a 5G file will take "some time" and happens
when you re-open the file. While mirroring, the file is effectively
locked.

Net effect;

a. To recover from a crash, the DB needs a restart.
b. On restart, the DB is down for the time taken to copy 5G between
   machines (over a minute).

From an operational point of view, this doesn't fly .. am I missing
something?

You could use the stripe translator over AFR to AFR the chunks of the DB file, thus allowing per-chunk self-heal; a sketch of such a layout follows the list below. I'm not familiar enough with database file-writing practices in general (not to mention your particular database's practices), or with the stripe translator, to tell whether any of the following will cause you problems, but they are worth looking into:

1) Will the overhead the stripe translator introduces with a very large file and relatively small chunks cause performance problems? (5G in 1MB stripes = 5000 parts...)
2) How will GlusterFS handle a write to a stripe that is currently self-healing? Block?
3) Does the way the DB writes the DB file cause massive updates throughout the file, or does it generally just append and update the indices, or something completely different? It could have an effect on how well something like this works.
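
For illustration, here is roughly what such a layout could look like in a client volume file. This is only a sketch in 1.3-era volume-spec syntax with placeholder names (server1/server2, brick-a/brick-b) and an arbitrary 1MB block size; you would want each AFR to carry as many replicas as you need (three, in your case), and the exact option syntax may differ in your release:

  # Two servers, each exporting two backend directories
  volume s1a
    type protocol/client
    option transport-type tcp/client
    option remote-host server1
    option remote-subvolume brick-a
  end-volume

  volume s1b
    type protocol/client
    option transport-type tcp/client
    option remote-host server1
    option remote-subvolume brick-b
  end-volume

  volume s2a
    type protocol/client
    option transport-type tcp/client
    option remote-host server2
    option remote-subvolume brick-a
  end-volume

  volume s2b
    type protocol/client
    option transport-type tcp/client
    option remote-host server2
    option remote-subvolume brick-b
  end-volume

  # Each stripe lives on its own AFR, so it self-heals independently
  volume afr-a
    type cluster/afr
    subvolumes s1a s2a
  end-volume

  volume afr-b
    type cluster/afr
    subvolumes s1b s2b
  end-volume

  # The stripe translator splits files into chunks across the AFRs
  volume stripe0
    type cluster/stripe
    option block-size *:1MB   # pattern:size; check the syntax for your version
    subvolumes afr-a afr-b
  end-volume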

Essentially, using this layout, you are keeping track of which stripes have changed and only have to sync those particular ones on self-heal. The longer the downtime, the longer self-heal will take, but you can mitigate that problem with an rsync of the stripes between the active and failed GlusterFS nodes BEFORE starting glusterfsd on the failed node (make sure to get the extended attributes too), as sketched below.
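
A sketch of that pre-seed step, with placeholder paths and host names; run it as root so the trusted.* attributes come across (requires rsync with --xattrs support on both ends), and only start glusterfsd on the failed node afterwards:

  # Copy the stripe files plus their extended attributes from a healthy
  # backend store to the failed node's backend store before restarting.
  rsync -aXS /export/brick/ failed-node:/export/brick/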

Also, it appears that I need to restart glusterfsd when I change the
configuration files (i.e. to re-read them) which effectively crashes
the node .. is there a way to re-read a config without crashing the
node? (on the assumption that as above, crashing a node is
effectively "very" expensive...?)

The above setup, if feasible, would mitigate restart cost, to the point where only a few megs might need to be synced on a glusterfs restart.

--

-Kevan Benson
-A-1 Networks



