Re: Choice of Translator question

Kevan Benson <kbenson@xxxxxxxxxxxxxxx> · Thu, 27 Dec 2007 09:58:13 -0800

Gareth Bult wrote:
The trusted.afr.version extended attribute tracks which file
version is being used, and on a read, all participating AFR members
should respond with this information, and any older/obsoleted file
versions are replaced by a newer copy from one of the valid AFR
members (this is self-heal)

Yes, understood.
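
As a rough illustration of the version comparison described above, here is a toy model in Python. It is not GlusterFS code: the function name, the brick names, and the idea that the version is a plain integer counter are all assumptions made for the sketch.

```python
def plan_self_heal(versions):
    """Pick a heal source and the stale members from per-brick versions.

    `versions` maps a (hypothetical) brick name to an integer version
    counter, standing in for whatever trusted.afr.version really holds.
    Returns (source_brick, [stale_bricks]); source is None if no bricks.
    """
    if not versions:
        return None, []
    newest = max(versions.values())
    # Any brick already at the newest version is a valid source.
    sources = sorted(b for b, v in versions.items() if v == newest)
    # Every brick behind the newest version needs a fresh copy.
    stale = sorted(b for b, v in versions.items() if v < newest)
    return sources[0], stale
```

For example, plan_self_heal({"brick1": 7, "brick2": 6}) names brick1 as the source and brick2 as the only member to be overwritten.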

I think they are planning striped reads per block (maybe definable)
at a later date.

Mmm, so at the moment, when it says AFR does striped reads, what it
really means is that it does striped reads, just so long as you have
lots of relatively small files and not a few large files .. ???

I'm not sure.  It could very well depend on which version you are using, 
and where you read that.  I'm sure some features listed in the wiki are 
only implemented in the TLA releases until they put out the next point 
release.

Read the file from a client (head -c1 FILE >/dev/null to
force)

OR find /mountedfs -type f -exec head -c1 {} \; > /dev/null

.. which is good, but VERY inefficient for a large file-system.

Agreed, which is why I just showed the single file self-heal method, 
since in your case targeted self heal (maybe before a full filesystem 
self heal) might be more useful.
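
A minimal sketch of the targeted-versus-full approach, assuming (as discussed above) that reading the first byte of a file through the client mount is enough to trigger the self-heal. The function names are made up for illustration, and /mountedfs is just the placeholder path used earlier.

```python
import os


def touch_file(path):
    """Read one byte from `path`; return True if the read succeeded.

    On a GlusterFS client mount this is the Python equivalent of
    `head -c1 FILE >/dev/null`.
    """
    try:
        with open(path, "rb") as fh:
            fh.read(1)
        return True
    except OSError:
        return False


def touch_tree(root):
    """Read one byte from every file under `root`.

    Returns the list of paths that were read successfully. This is the
    slow, whole-filesystem variant.
    """
    touched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if touch_file(path):
                touched.append(path)
    return touched
```

Calling touch_file() on the one large database file first, and touch_tree("/mountedfs") only later, gets the important file healed without waiting for a walk of the whole tree.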

you could use the stripe translator over AFR to AFR chunks of the
DB file, thus allowing per chunk self-heal.

Mmm, my experimentation indicates that this does not happen. I've
just spent 3 hours trying to prove / disprove this with various
configurations - AFR self-heals on a file basis, not on a
stripe-chunk basis.

If I have 4 bricks, two stripes using 2 bricks each, then an AFR on
top - any sort of self-heal replicates the entire DB. If I have 4
bricks, two AFRs and one stripe on top, I get the same thing.

I would expect AFR over stripe to replicate the whole file on 
inconsistent AFR versions, but I would have thought stripe over AFR would 
work, as the AFR should only be seeing chunks of files.  I don't see how 
the AFR could even be aware the chunks belong to the same file, so how 
it would know to replicate all the chunks of a file is a bit of a 
mystery to me.  I will admit I haven't done much with the stripe 
translator though, so my understanding of its operation may be wrong.

I'm not familiar enough with database file writing practices in
general (not to mention your particular database's practices), or
the stripe translator to tell whether any of the following will
cause you problems, but they are worth looking into:

We're talking about flat files here, some with append, some with
seek/write updates.

Eh, it's probably not a problem anyways because of the way filesystems 
do block management.

1) Will the overhead the stripe translator introduces with a very
large file and relatively small chunks cause performance problems?
(5G in 1MB stripes = 5000 parts...)

No, this would be fine if the AFR/Stripe combination actually did a
per-chunk self heal.

I was thinking the stripe translator may add some extra overhead to the 
network, but it probably only requests the stripes that hold data you 
are requesting, so it probably is a non-issue (as you said).
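
To put numbers on that, here is a small sketch of the arithmetic only. It assumes fixed-size chunks laid out back to back in file order; it says nothing about how the stripe translator actually places or requests chunks, and the function names are hypothetical.

```python
MiB = 1024 * 1024


def chunk_count(file_size, chunk_size):
    """Number of fixed-size chunks needed to hold `file_size` bytes."""
    return -(-file_size // chunk_size)  # ceiling division


def chunks_touched(offset, length, chunk_size):
    """Indices of the chunks covered by `length` bytes at `offset`."""
    if length <= 0:
        return []
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return list(range(first, last + 1))


# A 5 GiB file in 1 MiB chunks (the "5000 parts" above is a round figure).
parts = chunk_count(5 * 1024 * MiB, MiB)

# A short write 900 MiB into the file lands in a single chunk.
record = b"Change set version # 6\n"
dirty = chunks_touched(900 * MiB, len(record), MiB)
```

With binary units, parts comes out as 5120 and dirty as [900], so under these assumptions a per-chunk self-heal would only need to copy about 1 MiB for a write like that.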

2) How will GlusterFS handle a write to a stripe that is currently
self-healing?  Block?

The stripe replicates the entire stripe (which is big) and both read
and write operations block during the heal.

Do you mean that a change to a stripe replicates the entire file?

3) Does the way the DB writes the DB file cause massive updates
throughout the file, or does it generally just append and update
the indices, or something completely different.  It could have an
effect on how well something like this works.

I don't think access speed is an issue, glusterfs is very quick. The
issue is recovery, it appears not to operate as advertised!

Understood.  I'll have to actually try this when I have some time, 
instead of just doing some armchair theorizing.

Essentially, using this layout, you are keeping track of which
stripes have changed and only have to sync those particular ones on
self-heal. The longer the downtime, the longer self-heal will take,
but you can mitigate that problem with an rsync of the stripes
between the active and failed GlusterFS nodes BEFORE starting
glusterfsd on the failed node (make sure to get the extended
attributes too).

Ok, firstly, manual rsyncs sort of defeat the object of the
exercise. Secondly, having to go through this process every time a
configuration is changed / glusterfsd is restarted is unworkable. 
Thirdly, replicating many GBs of data hammers the IO system and
slows down the entire cluster - again undesirable.

Well, it depends on your goal.  I only suggested rsync for when a node 
was offline for quite a while, which meant a large number of stripe 
components would have needed to be updated, requiring a long sync time. 
If it was a quick outage (glusterfs restart or system reboot), it 
wouldn't be needed.  Think of it as a jumpstart on the self-heal process 
without blocking.

This, of course, was assuming that the stripe of AFR setup works.

Being able to restart a glusterfsd without breaking the replicas
would help, but I see no mention of this ...

Because I'm not a dev, and have no control over this.  ;)  Yes, I would 
like this feature as well, although I can imagine a couple of snags that 
can make it problematic to implement.

The above setup, if feasible, would mitigate restart cost, to the
point where only a few megs might need to be synced on a glusterfs
restart.

Ok, well I appear to have both AFR and Striping working and I can
observe their operation at brick level and confirm they are working
Ok.

Here's my basic test harness;

On the client system;

$ dd if=/dev/zero of=/mnt/stripe/database bs=1M count=1024

write.py:

#!/usr/bin/python
io=open("/mnt/stripe/database","r+")
io.seek(1024*1024*900)
io.write("Change set version # 6\n")
io.close()

On the bricks I have;

read.py:

#!/usr/bin/python
io=open("/export/stripe-1/database","r+")
io.seek(1024*1024*900)
print io.readline()
io.close()

When I run write.py on the client, both bricks show the correct
change. Then I kill glusterfsd on brick2. Running write.py on the
client shows an update on brick1, obviously not on brick2. Restarting
glusterfsd on brick2 shows a reconnect in the logs.

On the client;

head -c1 database

Initiates a self heal, shown in the logs with DEBUG turned on.
Running read.py on brick1 and brick2 blocks ...
An entire 1G chunk is copied to brick 2.
read.py on bricks 1 and 2 then continue when the copy finishes ..

(!)

Was this on AFR over stripe or stripe over AFR?

I'm using fuse-2.7.2 from the repos and gluster 1.3.7 from the stable
tgz ...

fyi; The fuse that comes with Ubuntu/Gutsy seems to cause gluster to
crash under write-load, I'm still waiting to see if the current CVS
version solves the problem ...

The GlusterFS provided fuse is supposed to have some better default 
values for certain variables relating to transfer block size or some 
such that optimize it for glusterfs, and it's probably what they test 
against, so it's what I've been using.

--

-Kevan Benson
-A-1 Networks