On Sun, Aug 23, 2009 at 3:17 AM, Stephan von Krawczynski <skraw@xxxxxxxxxx> wrote:
> On Sat, 22 Aug 2009 10:24:48 -0700
> Anand Avati <avati@xxxxxxxxxxx> wrote:
>
>> [... long technical explanation ...]
>> As you rightly summarized,
>> Your theory: glusterfs is buggy (cause) and results in all fuse
>> mountpoints hanging, and also results in server2's backend fs hanging
>> (effect)
>>
>> My theory: your backend fs is buggy (cause) and hangs, and results in
>> all fuse mountpoints hanging (effect), which happens because of the
>> reasons explained above
>>
>> I maintain that my theory is right because glusterfsd just cannot
>> cause a backend filesystem to hang, and if it indeed did, the bug is
>> in the backend fs, because glusterfsd only performs system calls to
>> access it.
>
> Let's assume your theory is right. Then I obviously managed to create a
> scenario where the bail-out decisions for servers are clearly bad. In fact
> they are so bad that the whole service breaks down. This is of course a
> no-go for an application whose sole (or primary) purpose is to keep your
> file service up, no matter which servers in the backend crash or vanish.
> As long as there is a theoretical way of performing the needed file
> service, it should be up and running. Even if your theory were right,
> glusterfs still does not handle the situation as well as it could
> (read: as a user would expect).

OK, first of all, this is now a very different issue we are trying to address. Correct me if I'm wrong: the new problem definition is 'when glusterfs is presented with a backend filesystem which hangs FS calls, the replicate module does not provide FS service' (and no longer, as you previously described it, 'glusterfs has not been able to run bonnie even for an hour on all 2.0.x releases because of lack of attention towards stability and concentration on featurism').
Please do understand that this is not at all a (regular) crash of the filesystem, as described, which can be reliably reproduced within an hour and which the dev team does not care to fix. The problem does not deserve such an attack.

The reason why this issue persists is that there is no reliable way to even detect this hang programmatically. The right way to "deal" with it would be to translate the "disk hang" into a "subvolume down" event, but that is hard, because: Has the server stopped responding? No, ping-pong replies are coming just fine. Has the backend disk started returning IO errors? No, the FS calls just hang, exactly like a deadlock. Detecting hardware failures can be done with reasonable reliability; detecting buggy software lockups and such deadlocks is a very hard (theoretical) problem. The simplest way around it is having timeouts at a higher layer.

And it is for a reason that the current call timeouts are 1800 seconds: we have seen in our QA lab that a truncate() call on a multi-terabyte file on ext3 takes more than 20 minutes to complete, and during that period all other calls happening on that filesystem also freeze. Programmatically this situation is no different from the hang you face. The 1800-second timeout currently used is based on experimental measurement, not arbitrary choice.

If you can come up with a better way of reliably detecting that the backend FS has hung itself (even considering the delay situations explained above), we are willing to use that technique, provided it is reasonable enough (do consider situations where the backend fs could be an NFS mount that has temporarily blocked for several minutes while its server reboots, etc.).

Avati