I've been doing some failure testing, and I ran into a really nasty condition. I'm hoping that I did something stupid. If you guys know what happened, or can shed some light, please let me know.

My test environment is four virtual machines. On two of them I installed Gluster 3.3 and created a redundant volume between the two. I also installed Apache and my custom application (it's like WebDAV) on these boxes. The boxes mount the redundant volume via 127.0.0.1 as Gluster clients. The application uses the volume as its storage.

The other two boxes are clients. They run a custom Python script to download files, upload files, remove files and list directories; very similar to WebDAV. Clients connect via HTTP, perform the operation (PUT, GET, DELETE), then disconnect. Rinse, repeat. The balance of PUT/GET/DELETE is 1/5/1. One client connects to one server/brick, the other client connects to the other server/brick.

I let both clients run for a while, then I stop one client. I then "reset" the brick/server that is now not "active" (the other one is servicing the HTTP traffic). It is interesting to watch the test client at this point, because there is a 15 second pause, then the operations proceed. This is great. I'm very happy with this.

When the "failed" brick comes back up, the operations stop for 45 seconds. This is also fine. I then let the client run for a while, but the test suite fails shortly (10 minutes?) afterwards with a 500 server error.

While investigating, I discovered that there are a lot of "phantom" files that are listed with just a filename and lots of question marks (????) when doing an "ls -l". "rm -rf *" on the Gluster volume seems to complete, but leaves behind all the "broken" files.

I eventually decided to blow away the volume and start over again, which caused me to get educated on 'setfattr' and wasted the rest of the day. I'm going to start some overnight runs now (before I leave for the day). I'm going to try to reproduce this failure mode tomorrow.

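To make the "phantom" entries easier to count on the next run, something like the sketch below might help. It walks a directory tree and reports every entry whose metadata cannot be read, which, as far as I understand it, is the condition that makes "ls -l" print question marks. The function name find_broken_entries and the lstat parameter are my own inventions for illustration, and the mount point in the usage note is hypothetical; none of this is part of Gluster.

```python
import os
import stat as stat_module


def find_broken_entries(root, lstat=os.lstat):
    """Return a sorted list of paths under root whose metadata cannot be read.

    An entry counts as "broken" when its name shows up in a directory
    listing but calling lstat on it raises OSError. The lstat parameter
    exists only so the check can be exercised without a real failure.
    """
    broken = []
    pending = [root]
    while pending:
        current = pending.pop()
        try:
            names = os.listdir(current)
        except OSError:
            # The directory itself cannot be listed; report it and move on.
            broken.append(current)
            continue
        for name in names:
            path = os.path.join(current, name)
            try:
                info = lstat(path)
            except OSError:
                # Listed by name, but no readable metadata.
                broken.append(path)
                continue
            if stat_module.S_ISDIR(info.st_mode):
                # Descend into real directories only; symlinks are not followed.
                pending.append(path)
    return sorted(broken)
```

Calling find_broken_entries("/mnt/gluster") (substitute the real mount point) right after the 500 error would give a list of affected paths to compare against the files on each brick.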
So guys, what might be going on here? My workload is moderate, and it's only one client; it's not like it's writing a bunch of files at once. Gluster has been pretty bulletproof, and this is the first time it's really scared me. If this were production, I'd certainly have data loss. I have to believe that I'm doing something very wrong, as hardware failures (simulated by the virtual 'reset') are very common and should not be a problem.

Thanks for any insights,
Steve