bailout after period of inactivity

ajwytzes at wise-guys.nl (Arend-Jan Wijtzes) · Tue, 23 Jun 2009 13:24:11 +0200

On Tue, Jun 23, 2009 at 05:45:10AM -0500, Vikas Gorur wrote:
> 
> ----- "Arend-Jan Wijtzes" <ajwytzes at wise-guys.nl> wrote:
> 
> > Hi Gluster people,
> > 
> > We are seeing errors when GlusterFS is being accessed after a long
> > period (days) of inactivity (the FS is used but not from this
> > machine).
> 
> The error is not related to the inactivity. Take a look at these lines of
> the log file:
> 
> > 2009-04-10 16:18:06 E [client-protocol.c:263:call_bail] brick-0-0:
> > activating bail-out. pending frames = 1. last sent = 2009-04-10
> > 16:17:14. last received = 2009-03-30 03:13:43. transport-timeout = 42
> > 2009-04-10 16:18:06 C [client-protocol.c:298:call_bail] brick-0-0:
> > bailing transport
> 
> A request was sent at 16:17:14 but no reply has been received even at
> 16:18:06 (= 52 seconds). Since the transport timeout is set to 42 seconds,
> the request has been aborted. There was probably some kind of network issue
> which caused the reply to not arrive.

The strange thing is that this happens every time after a long period
of inactivity. The first access to the filesystem, whatever it is, fails
in this manner *every time*. Subsequent accesses are fine. The network is
a local gigabit switch and, although not impossible, 'a network issue'
does not sound too plausible to me. The machine is used plenty for other
tasks without problems.

So there is a timeout, but whatever the cause is, it is triggered by
long-term inactivity. We have never had any network problems.
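
The timestamps in the quoted log lines can be checked with a short
Python sketch. It is not part of GlusterFS and the variable names are
purely illustrative; the values are copied from the call_bail messages
above. It shows both the 52-second wait before the bail-out and the gap
of more than eleven days since the last reply was received on this
connection:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Values copied from the call_bail log lines quoted above.
bail_logged = datetime.strptime("2009-04-10 16:18:06", FMT)
last_sent = datetime.strptime("2009-04-10 16:17:14", FMT)
last_received = datetime.strptime("2009-03-30 03:13:43", FMT)
transport_timeout = 42  # seconds, as reported in the log

# How long the pending request had been waiting when the bail-out fired.
waited = (bail_logged - last_sent).total_seconds()

# How long the connection had been silent before that request was sent.
idle = last_sent - last_received

print(f"waited {waited:.0f}s (timeout {transport_timeout}s)")
print(f"idle for {idle.days} days before the request")
```

The wait exceeds the timeout, so the bail-out itself is expected. The
idle gap is the part that matches what we see here.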

Other machines that access the filesystem on a regular basis do not show
this problem. It is only the machine that gets used once in a while.
The problem is reproducible, not a one-time event.

-- 
Arend-Jan Wijtzes -- Wiseguys -- www.wise-guys.nl