Re: GlusterFS hangs/fails: Transport endpoint is not connected

"Harald Stürzebecher" <haralds@xxxxxxxxxxxxxxx> · Tue, 25 Nov 2008 15:21:20 +0100

Hello!

2008/11/25 Fred Hucht <fred@xxxxxxxxxxxxxx>:
> Hello Harald!
>
> I didn't test Infiniband transport until now, as I don't want to interfere
> with the parallel applications which are running over Infiniband. Gigabit
> Ethernet throughput would be sufficient for us at the moment.
>
> Today "only" three nodes were affected; yesterday it was nine nodes. The
> problems only occur on nodes to which jobs are scheduled which use /scratch
> as working directory: We test the filesystem in normal operation, one user
> submits jobs to the queueing system which use /scratch/... as working
> directory. While some of his jobs run without problems, other jobs fail due
> to FS problems. No problems occur over the usual NFS home directory.

IMHO, the fact that everything else works rules out my "network
problem" theory. Sorry for wasting your time.

> When I test the FS with, e.g., dd on all nodes in parallel, no problems
> occur.
>
> Which timeout shall I increase?

I had the "transport-timeout" option in the back of my mind, but the doc
(http://www.gluster.org/docs/index.php/GlusterFS_Translators_v1.3#client)
says that the default is already 30 seconds.
I would not change anything there unless the developers ask for it.
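
For reference only: as far as I understand the 1.3 docs, "transport-timeout"
is an option of the protocol/client volume in the client spec file. A rough
sketch, where the volume name, host and remote subvolume are made-up
placeholders and 30 is just the documented default written out explicitly,
not a suggested value:

    volume client0                        # hypothetical volume name
      type protocol/client
      option transport-type tcp/client    # Gigabit Ethernet, not Infiniband
      option remote-host 192.168.0.1      # placeholder server address
      option remote-subvolume brick       # placeholder name of the exported volume
      option transport-timeout 30         # seconds; the default according to the doc
    end-volume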

Harald Stürzebecher