Files present on the backend but have become invisible from clients

amar at gluster.com (Amar Tumballi) · Fri, 27 May 2011 21:53:26 +0530

James,

Replies inline.

> The directories are all still visible to the users, but scanning for
> attributes of 0sAAAAAAAAAAAAAAAA still yielded matches on the set of
> GlusterFS servers.
>
> http://pastebin.com/mxvFnFj4
>
> I tried running this command, but as you can see it wasn't happy, even
> though the syntax was correct:
>
> root at jc1letgfs17:~# gluster volume rebalance pfs-ro1 fix-layout start
> Usage: volume rebalance <VOLNAME> [fix-layout|migrate-data]
> {start|stop|status}
>
> I suspect this is a bug because of the "-" in my volume name. I'll test and
> confirm and file when I get a chance.
>
>
This seems to be a bug in the 'fix-layout' CLI option itself (assuming you
are on 3.1.3; it is fixed in 3.1.4 and later, and in 3.2.0). Please use just
'rebalance <VOLNAME> start'.

> So I just did the standard rebalance command:
>  gluster volume rebalance pfs-ro1 start
>
> and it trundled along for a while, and then one time when I checked its
> status, it failed:
>  date; gluster volume rebalance pfs-ro1 status
>  Thu May 26 09:02:00 EDT 2011
>  rebalance failed
>
> I re-ran it FOUR times getting a little farther with each attempt, and it
> eventually completed and then started doing the actual file migration part
> of the rebalance:
>  Thu May 26 12:22:25 EDT 2011
>  rebalance step 1: layout fix in progress: fixed layout 779
>  Thu May 26 12:23:25 EDT 2011
>  rebalance step 2: data migration in progress: rebalanced 71 files of size
> 136518704 (total files scanned 57702)
>
> Now scanning for attributes of 0sAAAAAAAAAAAAAAAA yields fewer results, but
> some are still present:
>
>  <http://pastebin.com/x4wYq8ic>

Note that 'rebalance' is not the way to heal 'replicate'-related attributes.
'rebalance' only fixes the 'distribute' layouts and rebalances the data
across the servers.

It may still have resolved some of the 'replicate' attributes, because
issuing a rebalance triggers a directory traversal of the volume (in effect
the same as running 'ls -lR' or 'find' on the volume).
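
For reference, here is a minimal Python sketch of such a traversal. It assumes the volume is mounted on a client at some path (the '/mnt/pfs-ro1' used below is hypothetical). Run it against the client mount, not the backend /export bricks, because only lookups made through the client pass through 'replicate'. It just lstat()s every entry, much as 'ls -lR' or 'find' would:

```python
import os


def traverse_and_stat(mount_point):
    """lstat() every entry under mount_point, similar to 'ls -lR' or 'find'.

    mount_point should be a GlusterFS *client* mount (hypothetical path,
    e.g. /mnt/pfs-ro1), not a backend brick directory.
    Returns the number of entries successfully looked up, including the root.
    """
    visited = 0
    # Look up the top-level directory itself as well.
    os.lstat(mount_point)
    visited += 1
    # followlinks=False: do not descend into symlinked directories.
    for dirpath, dirnames, filenames in os.walk(mount_point, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # The lookup/stat is what reaches the client-side translators.
                os.lstat(path)
                visited += 1
            except OSError as exc:
                print(f"stat failed: {path}: {exc}")
    return visited
```

For example, traverse_and_stat('/mnt/pfs-ro1') returns how many entries it looked up. This only gives 'replicate' a chance to look at each entry. It does not guarantee that any particular attribute gets fixed.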

> As a possible sanity check, I did this command on my Read-Write GlusterFS
> storage servers (2 boxes, Distributed-Replicate), and got no "bad"
> attributes:
>  jc1ladmin1:~/projects/gluster  loop_check ' getfattr -dm - /export/read-only/g*' jc1letgfs{13,16} | egrep "jc1letgfs|0sAAAAAAAAAAAAAAAA$|file:" | less
>  getfattr: /export/read-only/g*: No such file or directory
>  getfattr: /export/read-only/g*: No such file or directory
>  jc1letgfs13
>  jc1letgfs16
>
> One difference in these two Storage server groups - the Read-Only group of
> 4 servers have their backend file systems formatted as XFS, while the
> Read-Write group of 2 are formatted with EXT4.
>
> Suggestions, critiques, etc gratefully solicited.
>
>
Next time you look at the GlusterFS attributes, please use '-e hex' with the
'getfattr' command. In any case, I think the issue here is most likely a bug
that wrote attributes indicating that a 'split-brain' happened. When such an
attribute is present, the 'replicate' module does not heal anything and leaves
the file as is (without even fixing the attribute).
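
As an aside, the '0s' prefix in your dumps means getfattr printed the value base64-encoded. With '-e hex' the same value is printed with a '0x' prefix instead. If you only have dumps captured without '-e hex', the minimal Python sketch below can convert them after the fact. It assumes the plain '# file: <path>' / 'name=value' layout of 'getfattr -d' output and leaves quoted text values untouched (their escapes are not decoded):

```python
import base64


def parse_dump(text):
    """Parse 'getfattr -d -m -' output into {path: {attribute: raw_value}}."""
    result = {}
    current = None
    for line in text.splitlines():
        if line.startswith("# file: "):
            # getfattr starts each file's section with "# file: <path>".
            current = line[len("# file: "):]
            result[current] = {}
        elif current is not None and line.strip():
            # Split on the first '=' only, so base64 '=' padding survives.
            name, _, value = line.partition("=")
            result[current][name] = value
    return result


def to_hex(value):
    """Rewrite one getfattr value in the 0x form that '-e hex' would print.

    getfattr marks base64 values with '0s' and hex values with '0x';
    anything else (e.g. quoted text) is returned unchanged.
    """
    if value.startswith("0s"):
        return "0x" + base64.b64decode(value[2:]).hex()
    return value


def hexify_dump(text):
    """Parse a getfattr dump and convert every base64 value to hex."""
    return {
        path: {name: to_hex(value) for name, value in attrs.items()}
        for path, attrs in parse_dump(text).items()
    }
```

For example, given a dump containing a (hypothetical) line trusted.afr.pfs-ro1-client-0=0sAAAAAAAAAAAAAAAA, hexify_dump() reports that value as 0x000000000000000000000000, i.e. twelve zero bytes. Simply rerunning 'getfattr -d -m - -e hex' gives the same view directly. The sketch is only for dumps you have already captured in base64.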

We are currently working on fixing these metadata self-heal issues and hope
to fix many of them by 3.2.1 (and 3.1.5).

Regards,
Amar

> James Burnash
> Unix Engineer.
>
>