Hi all,

I've run into a problem while testing the replicate translator, and it has been bothering me for two weeks.

My setup is 4 copies on 4 machines (A, B, C, D). Each of A, B, C and D acts as both server and client (the mount point is /home/). A fifth machine, E, holds no copy; it acts purely as a client and uses the disks of A, B, C and D.

My failure-simulation procedure is:

Step 1. Randomly choose one machine and take down the NIC that glusterfs is listening on (3 copies are still online).
Step 2. Sleep for a while (e.g. 60 seconds).
Step 3. Bring the failed NIC back up (all 4 copies are online again).
Step 4. Run "ls -laR /home" on the machine that was just failed.
Step 5. Go to step 1.

At the same time, I repeatedly run `ls -laR /home > test_result.txt` on machine E.

I've found the following problems:

1. Files are missing from a directory, or the same name appears twice in the same directory, in the ls output. Below is a small part of the vimdiff output comparing two `ls -laR` results:

-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file84 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file83
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file84 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file84
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file85 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file85
------------------------------------------------------ | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file86
------------------------------------------------------ | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file87
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file88 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file88
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file89 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file89
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file89 | ------------------------------------------------------
-rw-r--r-- 1 root root 5004454 2009-10-16 19:20 file9  | -rw-r--r-- 1 root root 5004454 2009-10-16 19:20 file9
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file90 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file90
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file91 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file91
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file91 | ------------------------------------------------------
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file92 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file92
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file93 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file93
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file94 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file94
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file95 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file95
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file97 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file96
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file97 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file97
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file98 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file98
-rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file99 | -rw-r--r-- 1 root root 5004454 2009-10-16 19:21 file99

In the left-hand listing some files are clearly missing (file83, file86, file87, file96) and some names are duplicated (file84, file89, file91, file97), all within the same directory.

2. Occasionally, ls reports:

ls: reading directory /home/dir1/dir21: File descriptor in bad state

I would really appreciate it if someone could help solve this basic and essential problem.

The glusterfsd.vol I'm using on all 5 machines is:

============================================================
# THIS IS THE SERVER-END CONFIGURATION

# Brick 1
volume posix
  type storage/posix
  option directory /mnt/disk1
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 16
  subvolumes locks
end-volume

# Server
volume server
  type protocol/server
  option transport-type tcp/server
  option transport.socket.bind-address `ifconfig -a | grep "10.106.105." | awk '{print $2}' | awk 'BEGIN {FS=":"};{print $2}'`
  option transport.socket.listen-port 6996
  subvolumes brick
  option auth.addr.brick.allow *
end-volume

# SERVER-END CONFIGURATION ENDS

# THIS IS THE CLIENT-END CONFIGURATION

# 3 Disks Machines

# Machine 1
volume cbrick1
  type protocol/client
  option transport-type tcp
  option remote-host 10.106.105.150
  option remote-port 6996
  option remote-subvolume brick
end-volume

# Machine 2
volume cbrick4
  type protocol/client
  option transport-type tcp
  option remote-host 10.106.105.151
  option remote-port 6996
  option remote-subvolume brick
end-volume

# Machine 3
volume cbrick7
  type protocol/client
  option transport-type tcp
  option remote-host 10.106.105.152
  option remote-port 6996
  option remote-subvolume brick
end-volume

# Machine 4
volume cbrick10
  type protocol/client
  option transport-type tcp
  option remote-host 10.106.105.153
  option remote-port 6996
  option remote-subvolume brick
end-volume

# All brick declarations finished

# Replicate part
volume rep1
  type cluster/replicate
  subvolumes cbrick1 cbrick4 cbrick7 cbrick10
end-volume

# CLIENT-END CONFIGURATION ENDS
============================================================

Regards,
Zhuo Yin
(917)215-8740
Gentoo Linux Fan - int (*(*(*pFile)())[10])();
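
P.S. For anyone who wants to check a saved listing such as test_result.txt for the duplicate-name symptom without eyeballing a vimdiff, here is a minimal Python sketch. It is not part of the test setup above. The function name `find_duplicates` is made up for illustration, and the parser assumes the `ls -laR` format shown in this mail: 8 whitespace-separated fields per entry, and directory headers that end in ":" and contain fewer than 8 fields.

```python
from collections import Counter, defaultdict


def find_duplicates(ls_output):
    """Return {directory: [names listed more than once]} for `ls -laR` text.

    Assumed entry format (as in the listing above):
        perms links owner group size date time name
    """
    counts = defaultdict(Counter)
    current = "."  # entries that appear before any "dir:" header
    for raw in ls_output.splitlines():
        line = raw.rstrip()
        if not line or line.startswith("total "):
            continue  # blank separators and per-directory "total N" lines
        # A directory header ends with ":" and is not a full 8-field entry.
        if line.endswith(":") and len(line.split()) < 8:
            current = line[:-1]
            continue
        fields = line.split(None, 7)  # keep spaces inside the file name
        if len(fields) < 8:
            continue  # not an entry line in the assumed format
        name = fields[7]
        if name in (".", ".."):
            continue
        counts[current][name] += 1
    return {
        directory: sorted(n for n, c in names.items() if c > 1)
        for directory, names in counts.items()
        if any(c > 1 for c in names.values())
    }
```

It could be used as `find_duplicates(open("test_result.txt").read())`; an empty dict would mean no name was listed twice in any directory. Missing files are not detected this way, since that needs a known-good listing to compare against.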