First of all, I apologize that the responses will not be interleaved with your questions.

RAID: The RAID is running at level 6. To start from the beginning, it performed a rebuild after a failed drive. Nothing spectacular there, aside from the fact that during the rebuild these bad logical sectors seemed to be generated. That's when the filesystem problems began. I did a rebuild, then a verify with fix, and it found and fixed the four bad sectors that were detected. I reran a verify to be sure, and it came back clean. Since then, the RAID device appears to be humming along as if everything's fine.

Backup: I was in the process of migrating my data from this cluster to a mirror running RAID10. What this means is that, if this cluster is toast, I will hopefully still have my primary data intact and can rerun the analyses as needed. My fear is that the data that came off the cluster after the issue is corrupted in ways that are masked by the gluster striping. Unfortunately, because I'm dealing with data in the 70TB range, a replication strategy was not put into place until the mirror could be purchased at the beginning of the year. Since then I've been moving data in stages.

Also, I'm running CentOS 5.4.

I've since tried to umount the filesystem, and the umount is currently hanging. See below.

Pre-umount:

[root at temporal002 glusterfs]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md2              143G  4.9G  131G   4% /
/dev/md0               99M   11M   83M  12% /boot
tmpfs                 5.8G     0  5.8G   0% /dev/shm
/dev/sdc1              21T   17T  3.2T  85% /mnt/data0
/dev/sdd1              21T   17T  3.3T  84% /mnt/data1

Post-umount, while it's hanging:

[root at temporal002 log]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md2              143G  4.9G  131G   4% /
/dev/md0               99M   11M   83M  12% /boot
tmpfs                 5.8G     0  5.8G   0% /dev/shm
/dev/sdc1              21T   17T  3.2T  85% /mnt/data0
/dev/sdd1             143G  4.9G  131G   4% /mnt/data1

So the volume is effectively gone: /mnt/data1 is now reporting the root filesystem's numbers. In the messages file I'm getting:

Mar 14 11:15:00 temporal002 kernel: xfs_force_shutdown(sdd1,0x1) called from line 420 of file fs/xfs/xfs_rw.c.  Return address = 0xffffffff8840ccf5
Mar 14 11:15:00 temporal002 kernel: Filesystem "sdd1": I/O Error Detected. Shutting down filesystem: sdd1
Mar 14 11:15:00 temporal002 kernel: Please umount the filesystem, and rectify the problem(s)
Mar 14 11:15:00 temporal002 kernel: Filesystem "sdd1": xfs_log_force: error 5 returned.
Mar 14 11:15:54 temporal002 last message repeated 3 times
Mar 14 11:17:24 temporal002 last message repeated 3 times
Mar 14 11:18:54 temporal002 last message repeated 3 times
Mar 14 11:20:24 temporal002 last message repeated 3 times
Mar 14 11:21:54 temporal002 last message repeated 3 times
Mar 14 11:23:24 temporal002 last message repeated 3 times

At this point, all I can see in my future is rebooting without remounting and then doing the repair, which seems like a long shot. Suggestions?

-----Original Message-----
From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at gluster.org] On Behalf Of Joe Landman
Sent: Monday, March 14, 2011 12:57 PM
To: gluster-users at gluster.org
Subject: Re: Quick question regarding xfs_repair

On 03/14/2011 12:50 PM, Terry Haley wrote:
> Hello,

[...]

> My question has two parts.
>
> 1. If I perform an xfs_repair will gluster become out of sync with the
> filesystem in lieu of repairs?

If the xfs_repair modifies the extended attributes, it is possible that the gluster file system will appear to be inconsistent. The same issue would exist with ext* and fsck. The Gluster team would need to respond to this.
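One rough way to check that directly is to dump the extended attributes on a handful of files straight off the backend export (not through the gluster mount) before and after the repair and compare the output. This is only a sketch: the /mnt/data1 path comes from the df output above, and the file path itself is a placeholder.

# Run as root against the backend filesystem, not the gluster mount;
# point it at a real file on the affected export.
getfattr -d -m . -e hex /mnt/data1/path/to/some/file

# Gluster keeps its metadata under the trusted.* namespace (for example
# trusted.afr.* on replicated volumes and trusted.glusterfs.* keys for
# distribute/stripe), so if those keys change or disappear after the
# repair, gluster may treat the file as inconsistent.

On CentOS 5, getfattr comes from the attr package.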
> 2. Can I trust the backups I performed after these issues cropped up
> as a result of the distribution of files across the nodes? I.e., a file is
> only partially complete, or would gluster complain and make that file
> unavailable, as it should?

Good question. Can you verify that the data stored on the corrupted RAID hasn't been replicated to the other nodes? If so, the easiest route might be a RAID rebuild, then a file system wipe on the affected node, followed by a resync (assuming you are using a replicated design). If the bad data has been replicated, then you probably have to trash that data (and do something similar to the above anyway).

Does your RAID kit have a scan/check function? We do this with our units (both hardware and software RAID) and schedule full scans at least weekly. I strongly advise something similar if you aren't doing it already.

Regards,

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
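On the scan/check point: for the md software RAID sets visible in the df output (md0 and md2), a periodic scrub can be driven through the md sysfs interface. This is a minimal sketch assuming stock CentOS 5 software RAID; the 21T hardware RAID LUNs would use the controller's own verify task instead (the one already used for the rebuild/verify above).

# Kick off a background consistency check on each md array; schedule
# this from cron (e.g. weekly) rather than running it by hand.
echo check > /sys/block/md0/md/sync_action
echo check > /sys/block/md2/md/sync_action

# Progress is visible here while the check runs:
cat /proc/mdstat

# When it finishes, a non-zero count here means mirror/parity
# inconsistencies were found:
cat /sys/block/md0/md/mismatch_cnt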