Re: Timeout settings and self-healing ? (WAS: HA failover test unsuccessful (inaccessible mountpoint))

"Anand Avati" <avati@xxxxxxxxxxxxx> · Fri, 4 Apr 2008 14:40:54 +0530

Daniel/Guido,
 can you paste the relevant logs, from the time you unplugged the
cable until the end of the experiment?

avati

2008/4/3, Daniel Maher <dma+gluster@xxxxxxxxx>:
>
> On Thu, 3 Apr 2008 14:55:48 +0530 "Anand Avati" <avati@xxxxxxxxxxxxx>
> wrote:
>
> > Daniel,
> >  maybe it is just taking long to detect connection failure. Can you
> > try with 'option transport-timeout 20' (sets response timeout to 20
> > seconds) in all your protocol/client and see if you still face the
> > 'hang' ?
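
For reference, a minimal sketch of where that option would sit in a
protocol/client volume definition. The volume name, remote-host address
and remote-subvolume name are placeholders, not values from Daniel's
configs; only the transport-timeout line comes from the suggestion
quoted above.

```
# Hypothetical client volume; names and address are placeholders.
volume client-dfsD
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.4      # placeholder address
  option remote-subvolume brick       # placeholder export name
  option transport-timeout 20         # response timeout, in seconds
end-volume
```

The same option line would be repeated in every protocol/client volume,
including any that appear in the server-side spec files.
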
>
> My simple test case is as follows :
> 1. Unplug one of the nodes (dfsD)
> 2. Attempt to ls -l the /opt/ (in which gfs-mount/ - the mountpoint -
> is contained)
>
> I set the timeout option on every client instance in both the
> client and server configs.  I tested timeout settings of 10 and 20
> seconds (just to see).  In both cases, the 'hang' releases after a while
> (approx 30 seconds), but the results are odd. For example :
>
> # ls -l
>    (hang ~ 30 seconds)
> ls: cannot access gfs-mount: Transport endpoint is not connected
> total 0
> d????????? ? ? ? ?                ? gfs-mount
>
> # ls -l
>    (immediate)
> ls: cannot access gfs-mount: Transport endpoint is not connected
> total 0
> d????????? ? ? ? ?                ? gfs-mount
>
>    (user wait ~ 5 seconds)
>
> # ls -l
> total 8
> drwxr-xr-x 2 root root 4096 2008-04-03 09:43 gfs-mount
>
> It would appear that the "recovery" time, regardless of whether the
> timeout is set to 10 or 20, is around 35 to 40 seconds - though, at the
> very least, it recovered.  Is there any reasonable way to bring this
> period of time down ?
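
To compare timeout settings with numbers rather than wall-clock
estimates, the manual 'ls -l' loop above could be scripted. The
following is only a sketch: the function names are made up for
illustration, and it assumes that calling lstat on each directory entry
is close enough to what 'ls -l' does to trigger the same hang and the
same "Transport endpoint is not connected" error.

```python
import errno
import os
import time


def timed_listing(directory):
    """lstat every entry in `directory`, roughly what `ls -l` does.

    Returns (elapsed_seconds, errors), where errors maps an entry name
    to the errno symbol reported for it (for example 'ENOTCONN').
    """
    start = time.monotonic()
    errors = {}
    for name in os.listdir(directory):
        try:
            os.lstat(os.path.join(directory, name))
        except OSError as exc:
            errors[name] = errno.errorcode.get(exc.errno, str(exc.errno))
    return time.monotonic() - start, errors


def watch(directory, interval=1.0, max_rounds=120):
    """Repeat the listing until a round completes without errors.

    Prints one line per round and returns the total seconds waited,
    or None if no clean round was seen within max_rounds.
    """
    began = time.monotonic()
    for _ in range(max_rounds):
        elapsed, errors = timed_listing(directory)
        total = time.monotonic() - began
        print(f"t+{total:6.1f}s  call took {elapsed:5.1f}s  errors={errors}")
        if not errors:
            return total
        time.sleep(interval)
    return None
```

Started right after unplugging the node, for example with
watch("/opt"), it prints how long each listing blocked and returns the
total time until the mountpoint was readable again.
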
>
> Thank you all so much for your feedback on this topic !