On 08/28/2009 09:28 AM, Stephan von Krawczynski wrote:
> On Fri, 28 Aug 2009 14:28:51 +0200
> David Saez Padros <david at ols.es> wrote:
>
>> Hi
>>
>> well, that never happened before when using nfs with the same
>> computers, same disk, etc ... for almost 2 years, so it's more
>> than possible that is glusterfs the one which is triggering this
>> supposed ext3 bug, but apart from this:

FYI: NFS will have problems in such a situation as well. If NFS tries to write to a local file system that hangs, NFS will hang too. With soft mounts, these operations can time out, but that's not clearly the better option either. Anybody who has worked with NFS for some time has seen this before. Once upon a time at our company (1990?), all mounts were hard mounts, and when a problem occurred, the processes completely locked out the user. Control-C wouldn't even work! NFS doesn't offer the options that GlusterFS does, so a 1:1 comparison doesn't really make sense.

As for "not experiencing it in 2 years": I think people really need to understand that GlusterFS is a *user space application*. Specifically, this means that it *only* makes the same standard system calls that any other program, such as /bin/cat, /bin/tar, or /bin/du, would make. There is no legitimate reason for the underlying file system to lock up as a result. With some file systems there can be a hang (not a lock-up) until work on some critical section of the on-disk or in-memory representation has finished. I understand that ext3 removing a large file is one example of something that can trigger this for a while.

I think GlusterFS should also try to be more resilient to this sort of thing, but comparisons to NFS are invalid, and treating this as only a GlusterFS problem (i.e. not tracking down the FS vendor and having them fix their FS) is also invalid. It's not a simple problem to solve. But it should be solved.
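As a rough illustration of the kind of resilience argued for above, here is a minimal Python sketch, not GlusterFS code: a user-space server could run each blocking file-system call on a worker thread and stop waiting after a deadline, returning an error to its client instead of hanging, similar in spirit to an NFS soft mount. The helper name `call_with_timeout`, the pool size of 8, and the 5-second default are all hypothetical choices for the example.

```python
import concurrent.futures
import errno
import os

# Hypothetical shared pool of worker threads that perform the blocking
# system calls on behalf of the server (not part of GlusterFS).
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def call_with_timeout(func, *args, timeout=5.0):
    """Run a blocking call such as os.stat() on a worker thread.

    If it has not finished within `timeout` seconds, raise an OSError with
    errno ETIMEDOUT so the caller gets an error instead of hanging.
    Exceptions raised by `func` itself propagate unchanged.
    """
    future = _POOL.submit(func, *args)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # The worker thread is NOT cancelled: a thread stuck inside the
        # kernel keeps occupying a pool slot until the system call returns.
        raise OSError(errno.ETIMEDOUT, os.strerror(errno.ETIMEDOUT)) from None
```

For example, `call_with_timeout(os.stat, "/some/path", timeout=2.0)` either returns the stat result or raises `OSError` with `errno.ETIMEDOUT`. The trade-off is the one discussed in this thread: the caller stays responsive, but a stuck call still ties up a worker, and a legitimately slow operation can be reported as a failure if the timeout is too short.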
For RAID disks, for example, the retry attempts and timeouts are often tuned down significantly so that the system remains responsive even when a disk is failing. GlusterFS should do the same. It's not a perfect solution - having a long-running operation incorrectly time out too soon is a risk - but it's a necessary one for any large-scale cluster.

Cheers,
mark

-- 
Mark Mielke <mark at mielke.cc>