Hello everybody!

I've been using GlusterFS for quite a long time. It's a great project! Thanks! My old GlusterFS 1.2tla184 is rock stable, but with the new 1.3.x series I still have problems :(. Here is a bunch of them for version 1.3.9tla772.

I use unify over 24 bricks, each on a different cluster node. Each node runs a glusterfs server (exporting the local HDD) and a client (mounting the glusterFS unify at /home) as separate processes. The server xlators are: storage/posix (plus the name-space brick on the head node) -> features/posix-locks -> tcp/server. The client consists of: tcp/clients -> cluster/unify with NUFA (except the head node, which uses ALU) -> performance/write-behind (a rough sketch of my volfiles is at the end of this message). Each node runs openSUSE 10.3 with kernel 2.6.22.17-0.1 x86_64 and fuse-2.7.2glfs9.

1) The most annoying problem is a complete glusterFS lockup. It became apparent in real-world usage by multiple users. At a random moment any attempt to access glusterFS on the head node (name-space, ALU unify) fails and the client.log is flooded with messages like

--------
2008-06-03 17:11:06 W [client-protocol.c:205:call_bail] c36: activating bail-out. pending frames = 3. last sent = 2008-06-03 17:10:23. last received = 2008-06-03 17:10:23 transport-timeout = 42
2008-06-03 17:11:06 C [client-protocol.c:212:call_bail] c36: bailing transport
2008-06-03 17:11:06 W [client-protocol.c:205:call_bail] c45: activating bail-out. pending frames = 4. last sent = 2008-06-03 17:10:23. last received = 2008-06-03 17:10:23 transport-timeout = 42
2008-06-03 17:11:06 C [client-protocol.c:212:call_bail] c45: bailing transport
--------

repeated every minute, indefinitely (with the node names changing in a loop). I have two log files of 62 MB and 138 MB filled with such errors (they were generated when I left the system unattended for a day). Moreover, once glusterfs enters this state it cannot be killed, not even with "killall -9 glusterfs". On the other cluster nodes (with NUFA unify), however, the logs are free of these messages, and the unify FS can be accessed without a lockup.

I can't identify the initial cause of the lockup. Once it happened right after I switched off one of the bricks, but most of the time there are no unusual actions on the FS at all, just file/dir creation and copying/moving. The logs are too huge, and too full of other errors (see below), to find the cause. BTW, what does this message actually mean? :)

2) The second problem has already been mentioned on the mailing list: sometimes files get created twice, on two different bricks, and the file becomes inaccessible until I delete one of the copies. Can this be handled automatically?

3) My logs are also full of the following error:

-----
2008-06-02 16:03:33 E [unify.c:325:unify_lookup] bricks: returning ESTALE for / [translator generation (25) inode generation (23)]
2008-06-02 16:03:33 E [fuse-bridge.c:459:fuse_entry_cbk] glusterfs-fuse: 301: (34) / => -1 (116)
2008-06-02 16:03:33 E [unify.c:325:unify_lookup] bricks: returning ESTALE for / [translator generation (25) inode generation (23)]
2008-06-02 16:03:33 E [fuse-bridge.c:459:fuse_entry_cbk] glusterfs-fuse: 302: (34) / => -1 (116)
-----

This error appears whenever the glusterFS mount point itself is touched (for example "ls /home"), but not its subdirectories. Despite the error the operation succeeds, only with a lag. It seems to be connected somehow with the non-simultaneous startup of the cluster nodes (more precisely, of their glusterfs servers). Once all nodes are up, remounting glusterfs gets rid of the error.

Hope these problems can be resolved...

With best regards,
Andrey
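
P.S. In case it helps with diagnosing, here is roughly what my volfiles look like. This is only a simplified sketch written from memory: the directory paths, hostnames, volume names (apart from "bricks", which you can see in the logs above) and the exact option values are placeholders rather than my real ones, and I have trimmed the 24 client volumes down to one plus a comment.

--------
### server.vol (every node; the name-space brick exists only on the head node)

volume brick
  type storage/posix
  option directory /export/home          # placeholder path
end-volume

volume plocks
  type features/posix-locks
  subvolumes brick
end-volume

volume brick-ns                          # head node only
  type storage/posix
  option directory /export/home-ns       # placeholder path
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes plocks brick-ns
  option auth.ip.plocks.allow *
  option auth.ip.brick-ns.allow *
end-volume

### client.vol (NUFA variant; the head node uses "option scheduler alu" instead)

volume c01
  type protocol/client
  option transport-type tcp/client
  option remote-host node01              # placeholder hostname
  option remote-subvolume plocks
end-volume

# ... c02 through c24 are defined the same way, one per node ...

volume ns
  type protocol/client
  option transport-type tcp/client
  option remote-host head                # placeholder hostname
  option remote-subvolume brick-ns
end-volume

volume bricks
  type cluster/unify
  option namespace ns
  option scheduler nufa
  option nufa.local-volume-name c01      # the brick local to this node
  subvolumes c01                         # all 24 client volumes listed in the real file
end-volume

volume wb
  type performance/write-behind
  option aggregate-size 128kB
  subvolumes bricks
end-volume
--------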