On Wed, Apr 23, 2008 at 3:47 AM, Guido Smit <guido@xxxxxxxxx> wrote:
> Krishna,
>
> I did the test. I killed glusterfsd on one server.
> All tests (ls, df, cp) worked like it should. I didn't even notice any
> difference. Unplugging the cable however, blocked all operations and finally
> after a few minutes
> the transport endpoint message appears.
>

The problem with TCP/IP is that when you unplug the cable, no event is
delivered to the application's poll() on the socket. The driver keeps
retrying internally, and only after a long time (around 10+ minutes when we
tested) do we get an error saying "no route to host". When the server
process dies, or the machine is shut down, the connected nodes get a
notification, so everything goes smoothly. That is why there is a delay when
the network cable is unplugged.

We came up with a workaround for managing this delay: the
'transport-timeout' option, which times out each request after a given time.
The default is now 108 seconds. We kept it that high because a few
applications use mandatory locks (which block a write until the lock is
freed) and can easily take more than a minute to release them.

Users can set 'transport-timeout' in the protocol/client volume, so they can
tune it to the I/O time of their applications.

In our test setups, requests timed out exactly after the configured
transport-timeout, every time. So we could not reproduce the issue of
freezing indefinitely.

Regards,
Amar

-- 
Amar Tumballi
Gluster/GlusterFS Hacker
[bulde on #gluster/irc.gnu.org]
http://www.zresearch.com - Commoditizing Super Storage!
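
P.S. To illustrate the per-request timeout idea described above, here is a minimal Python sketch. It is not GlusterFS code: the function name request_with_timeout and the 0.2 second value are made up for the example, and a local socket pair stands in for a server that has gone silent, as happens when the cable is unplugged. The point is only that the caller gives up after a bounded wait instead of blocking until the TCP stack reports an error.

```python
import socket


def request_with_timeout(sock, payload, timeout_s):
    """Send payload and wait at most timeout_s seconds for a reply.

    Returns the reply bytes, or None if the peer stayed silent.
    (Hypothetical helper for illustration, not a GlusterFS API.)
    """
    sock.settimeout(timeout_s)
    try:
        sock.sendall(payload)
        return sock.recv(4096)
    except socket.timeout:
        # No reply within the deadline: report failure to the caller
        # instead of blocking until the kernel gives up on the link.
        return None


if __name__ == "__main__":
    client, server = socket.socketpair()

    # The "server" never answers, like a peer behind an unplugged cable.
    print(request_with_timeout(client, b"stat", 0.2))

    # With a reply already queued, the same call returns it.
    server.sendall(b"ok")
    print(request_with_timeout(client, b"stat", 0.2))

    client.close()
    server.close()
```

As with transport-timeout, the value is a trade-off: too low and slow but healthy operations (for example, a write waiting on a lock) get cut off; too high and a dead link takes longer to notice.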