Hello,

> -----Original Message-----
> I'm dumping data to it like mad, and I decide to simulate a filesystem error
> by remounting half of the cluster's drives in read-only mode with
> "mount -o remount,ro".
>
> The cluster seems to slow just slightly, but it kept on ticking.

Great. You were lucky, then. In my case, when the filesystem was remounted
read-only because the SCSI controller detected an ECC error in its cache memory
(if I recall correctly, read operations were failing as well), things did not
go nearly as smoothly. Everything seemed to be working fine until I listed the
mount point; then the whole cluster ground to a halt. It seems that in my case
Gluster was not able to identify that the node had failed and kept attempting
I/O on that brick. As the node did not respond, the Gluster client waited and
waited for a reply.

To avoid that, I changed the fstab mount option to errors=panic instead of the
default errors=remount-ro (see the example fstab line at the end of this
message).

> Awesome. I look at the log files and I see lots of stuff like this:
>
> > [2013-07-11 16:20:13.387798] E [posix.c:1853:posix_open]
> > 0-bkupc1-posix: open on
> > /export/a/glusterfs/BACKUPS/2013-07-10/cat/export/c/data/dir2/OLD/.dir.old.tar.gz.hdbx3z:
> > File exists
> > [2013-07-11 16:20:13.387819] I [server3_1-fops.c:1538:server_open_cbk]
> > 0-bkupc1-server: 24283714: OPEN
> > /BACKUPS/2013-07-10/cat/export/c/data/dir2/OLD/.dir.old.tar.gz.hdbx3z
> > (0c079382-e88b-432c-83e3-79bd9f5b8bb9) ==> -1 (File exists)
>
> I believe that this particular file was still open for writing when I took away
> write access to half of the cluster. Once the rsync process I'm running
> finishes with that file the errors cease.

Can you open the file from the mount point?

> The manual lists "volume heal <VOLUME> info heal-failed" as a command
> that I can check out, so I run that and get a list of files for which healing has
> failed. There's quite a bit of them.

Did you check the output of "volume heal <VOLUME> info healed"? Maybe those
files were eventually healed; compare the timestamps.

> any, that I should take to fix a failed heal. Is there anything that I should do?
> Or will this eventually sort itself out?

If I remember correctly, there is a bug in 3.3.1 which prevents the heal log
from being cleared, so the list of failed items will keep growing until the
daemon is restarted. Try restarting glusterfs-server and re-running the heal
command to see the results (the commands I have in mind are sketched at the
end of this message).

Also, if a file is in a split-brain state, I believe it will not be healed, as
Gluster cannot identify which copy of the replica is good. You are using a
replicated or distributed-replicated volume, am I right?

Please correct me if I am wrong about split-brain, as this question is
important for me as well!

Thank you
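P.S. Here is a minimal sketch of the fstab change I mean, assuming an ext4
brick device mounted at /export/a (the device name and mount point are only
placeholders, adjust them for your bricks):

    # /etc/fstab: panic on filesystem errors instead of remounting read-only,
    # so the node goes down hard and Gluster stops waiting on a half-alive brick
    /dev/sdb1  /export/a  ext4  defaults,errors=panic  0  2

The option takes effect on the next mount; you can also apply it right away
with "mount -o remount,errors=panic /export/a".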
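P.P.S. And a rough sketch of the heal checks and the restart I am suggesting,
assuming a replicated volume. Replace <VOLUME> with your volume name; the
service name glusterfs-server is the Debian/Ubuntu one and may differ on your
distribution:

    # compare what failed against what was eventually healed (check timestamps)
    gluster volume heal <VOLUME> info heal-failed
    gluster volume heal <VOLUME> info healed

    # files currently in split-brain, which self-heal will not resolve on its own
    gluster volume heal <VOLUME> info split-brain

    # restart the daemon to clear the stale heal log (the 3.3.1 bug I mentioned),
    # then trigger a full heal and check the lists again
    service glusterfs-server restart
    gluster volume heal <VOLUME> full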