Replication not working on server hang

avati at gluster.com (Anand Avati) · Sat, 29 Aug 2009 11:24:30 -0700

> well, that never happened before when using NFS with the same
> computers, same disks, etc. for almost 2 years, so it's more
> than possible that glusterfs is the one triggering this
> supposed ext3 bug, but apart from this:
>
> a) documentation says "All operations that do not modify the file
> or directory are sent to all the subvolumes and the first successful
> reply is returned to the application", so why is it blocking then?
> the reply from the non-blocked server is supposed to
> come first so that nothing blocks, but clients are blocking on
> a simple ls operation

The calls (as you have seen in the logs as well) which are hanging are
lookup calls, which have to be sent to all subvolumes to ensure all
the copies are in sync.

> b) server1 (the non-blocked one) also has the volumes mounted like
> any other client, but with option read-subvolume set to the local
> volume, yet it also hangs when it was supposed to read from the local
> volume, not from the hung one

The read calls are indeed served from read-subvolume, but that is only
for read() system calls so that you can avoid bulk data transfer on
the network. Calls like lookup() have to be sent to every subvolume
that reports itself as "up". The problem is that in the current
version there is no way to translate a "hanging backend fs" into a
"down subvolume".

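To make the difference concrete, here is a toy model in Python. It is
not GlusterFS code: the Subvolume class, the latency numbers and both
helper functions are made up for illustration. It only assumes what is
described above: read() goes to the read-subvolume alone, while
lookup() waits for every subvolume that reports itself as "up".

```python
import math
from dataclasses import dataclass


@dataclass
class Subvolume:
    """Hypothetical stand-in for one replica; not a GlusterFS structure."""
    name: str
    up: bool = True             # what the subvolume reports about itself
    reply_after: float = 0.01   # seconds; math.inf models a hung backend fs


def lookup_latency(subvolumes):
    """lookup() completes only when every 'up' subvolume has replied."""
    return max((s.reply_after for s in subvolumes if s.up), default=0.0)


def read_latency(read_subvolume):
    """read() is served by the read-subvolume alone."""
    return read_subvolume.reply_after


if __name__ == "__main__":
    local = Subvolume("server1")
    hung = Subvolume("server2", reply_after=math.inf)  # still reports "up"

    print("read from local:", read_latency(local))
    print("lookup, hung replica up:", lookup_latency([local, hung]))

    hung.up = False  # what a hang-to-down translation would achieve
    print("lookup, hung replica down:", lookup_latency([local, hung]))
```

In this model the lookup never completes while the hung replica still
counts as "up", and it completes at local speed as soon as that
replica is treated as "down".
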
> c) doesn't glusterfs ping the servers periodically to see if they
> are available or not? if so, why doesn't it detect that situation?

It does, but in this case the server is up and running and replying
with pongs. The current ping-pong only checks for network reachability
to the server process.
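
A rough way to picture this, again as a hypothetical Python sketch and
not the actual GlusterFS protocol code: ToyServer, probe and the 0.2
second timeout are all invented for the example. The ping handler
never touches the backend filesystem, so it keeps answering while a
lookup on the same server stays blocked.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


class ToyServer:
    """Hypothetical server whose backend fs is hung but whose process is alive."""

    def __init__(self):
        # Stays unset for as long as the backend filesystem "hangs".
        self.backend_unblocked = threading.Event()

    def ping(self):
        # Answered by the server process itself; no backend access needed.
        return "pong"

    def lookup(self, path):
        # Needs the backend filesystem, so it blocks while the backend hangs.
        self.backend_unblocked.wait()
        return {"path": path}


def call_with_timeout(pool, fn, *args, timeout=0.2):
    """Return fn(*args), or None if no reply arrives within the timeout."""
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return None


def probe(server, timeout=0.2):
    """Send one ping and one lookup; return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ping_reply = call_with_timeout(pool, server.ping, timeout=timeout)
        lookup_reply = call_with_timeout(pool, server.lookup, "/", timeout=timeout)
        # Demo cleanup only: release the blocked worker so the pool can exit.
        server.backend_unblocked.set()
    return ping_reply, lookup_reply


if __name__ == "__main__":
    print(probe(ToyServer()))
```

The ping gets its pong while the lookup times out, which is why a
reachability check alone cannot tell a healthy server from one whose
backend filesystem is hung.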

Avati