On 01/04/2013 10:18 PM, ??? wrote:
> Yes, the filesystem shutdown confuses glusterfs, since glusterfsd is
> still live for that brick; it tries to retrieve file extended
> attributes and fails.
> When accessing some of the files from the client side, an
> "Input/output error" occurs. The symptom is the same as when the
> underlying filesystem doesn't support extended attributes (for
> example, a volume created on /dev/shm).
> However, I still hope the glusterfs replica can handle this kind of
> failure, since that is what it is supposed to do (fault tolerance for
> a single hardware failure).
>

Your corruption issue aside, I was able to reproduce the EIO errors by
running an untar on a replica volume and shutting down the XFS
filesystem for one of my bricks (via the 'godown' utility in xfstests).
I've filed the following gluster bug to track this:

https://bugzilla.redhat.com/show_bug.cgi?id=892730

Thanks for calling this out.

Brian

> 2013/1/4, Brian Foster <bfoster at redhat.com>:
>> On 01/04/2013 01:00 AM, ??? wrote:
>>> Dear gluster experts,
>>>
>>> A glusterfs replica is supposed to handle the hardware failure of
>>> one brick (for example, a power outage). However, we recently
>>> encountered an issue related to an XFS filesystem crash and
>>> shutdown. When it happens, the whole volume doesn't work. Some
>>> files are inaccessible, and even worse, some directories become
>>> inaccessible, which leaves thousands of files missing.
>>> To handle it, we had to force a shutdown of the peer. This solved
>>> the problem, but our services were impacted and data loss occurred.
>>> A glusterfs replica should be able to handle a brick filesystem
>>> shutdown smoothly. What's your opinion on how to avoid this kind of
>>> failure?
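[Editor's note: the xattr failure symptom described above can be probed from a script. This is a minimal illustrative sketch, not gluster code; it uses a `user.*` attribute rather than the `trusted.*` attributes glusterfsd actually uses, since those require root, and the attribute name is made up for the example.]

```python
import errno
import os
import tempfile

def probe_xattr_support(path):
    """Try to set and read back a user xattr on `path`.

    Returns "ok" if xattrs work, "unsupported" if the filesystem
    rejects them (e.g. a volume created on /dev/shm), or "io-error"
    if the call fails with EIO -- the symptom reported when the
    brick's XFS filesystem has shut down underneath glusterfsd.
    """
    try:
        # "user.gluster.probe" is a hypothetical name for this sketch.
        os.setxattr(path, b"user.gluster.probe", b"1")
        value = os.getxattr(path, b"user.gluster.probe")
        return "ok" if value == b"1" else "io-error"
    except OSError as e:
        if e.errno in (errno.ENOTSUP, errno.EOPNOTSUPP):
            return "unsupported"
        if e.errno == errno.EIO:
            return "io-error"
        raise

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        print(probe_xattr_support(f.name))
```

Running this against a file on a brick distinguishes a filesystem that never supported xattrs from one that is returning EIO because it has shut down.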
>>>
>>
>> Hi,
>>
>> First, I would suggest you independently characterize your XFS crash
>> on the XFS mailing list (xfs at oss.sgi.com):
>>
>> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>>
>> Hopefully they can help assess the state and possible recovery of
>> your local filesystem. How to proceed on the gluster side of things
>> probably depends on the outcome of that analysis. My guess is that
>> the filesystem going into a shutdown state causes confusion for
>> gluster, due to the runtime limitations it imposes on the
>> filesystem. I haven't actually tested an active gluster mount on a
>> brick in the shutdown state, so I can't specifically characterize
>> the behavior (at minimum, I'd expect read-only behavior), but I'll
>> give it a try and see what happens...
>>
>> Brian
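[Editor's note: the fault-tolerance behavior the thread asks for, where an EIO from one replica brick is masked by reading from the other, can be sketched as below. This is only an illustration of the principle, not the logic of gluster's replication translator; the brick reader callables are hypothetical stand-ins.]

```python
import errno

def replica_read(readers):
    """Return data from the first replica brick whose read succeeds.

    `readers` is a list of zero-argument callables, one per replica
    brick. An EIO from one brick (e.g. its XFS filesystem has shut
    down) is masked by trying the next replica, rather than being
    propagated to the client as an "Input/output error".
    """
    last_err = None
    for read in readers:
        try:
            return read()
        except OSError as e:
            if e.errno != errno.EIO:
                raise  # only mask I/O errors from a dead brick
            last_err = e
    raise last_err  # all replicas failed

# Hypothetical bricks for illustration:
def dead_brick():
    raise OSError(errno.EIO, "Input/output error")  # fs shut down

def healthy_brick():
    return b"file contents"

if __name__ == "__main__":
    print(replica_read([dead_brick, healthy_brick]))
```

With one dead and one healthy brick, the read still succeeds; only if every replica returns EIO does the error reach the client.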