Re: self-heal behavior

Avati,
 Comments inline...

Anand Avati wrote:
Gerry,
Your question is appropriate, but the answer to 'when to resync' is not very simple. When a brick that was brought down is brought up later, it may be a completely new (empty) brick. In that case, starting to sync every file would most likely be the wrong decision (we should rather sync the file the user needs than some unused file). Even if we chose to sync files without the user accessing them, it would be very sluggish, since it would interfere with other operations.
Self-heal should start syncing files immediately, not at full speed but at some throttled 'nice' level that would not impact operations.



The current approach is to sync a file on the next open() of it. This is usually a good balance: if we sync a file during open(), even a GB-sized file would take 10-15 secs, and for normal files (on the order of a few MB) the delay is almost not noticeable. But if this were to happen for all files at once, whether the user accessed them or not, there would be a lot of traffic and everything would be very sluggish.
Again, this should be done at a throttled level while other operations are happening; if there are none, then throttle it up.
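
The throttling idea in the reply above could be sketched roughly as follows. This is only an illustration, not GlusterFS code: the function `throttled_copy`, the `is_busy` callback, and the parameters `chunk_size` and `busy_delay` are all hypothetical names chosen for the example.

```python
import time


def throttled_copy(src, dst, is_busy, chunk_size=64 * 1024,
                   busy_delay=0.05, sleep=time.sleep):
    """Copy src to dst chunk by chunk, backing off while the system is busy.

    is_busy is a caller-supplied function (an assumption of this sketch)
    that returns True when foreground operations are in progress.
    Returns the number of bytes copied.
    """
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Yield to foreground work: pause after each chunk while busy,
        # and run at full speed when nothing else is happening.
        if is_busy():
            sleep(busy_delay)
    return copied
```

With `is_busy` always returning False the copy runs unthrottled; with it returning True the copy pauses for `busy_delay` seconds after every chunk, which caps the bandwidth the resync can take from other operations.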



This approach of syncing on open() is also what other filesystems that support redundancy do.
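
A toy model of sync-on-open() may make the trade-off concrete. It is a sketch under stated assumptions, not the AFR implementation: each replica is modeled as a plain dict mapping a file name to a `(version, data)` pair, and the per-file version number and the name `open_with_heal` are hypothetical.

```python
def open_with_heal(name, replicas):
    """Open a file and heal stale replicas of that one file only.

    replicas: list of dicts mapping file name -> (version, data).
    Returns (data, number_of_replicas_healed).
    """
    holders = [r for r in replicas if name in r]
    if not holders:
        raise FileNotFoundError(name)
    # Pick the replica holding the highest version of this file.
    newest = max(holders, key=lambda r: r[name][0])
    version, data = newest[name]
    healed = 0
    for r in replicas:
        # A missing file counts as older than any real version.
        if r.get(name, (-1, None))[0] < version:
            r[name] = (version, data)
            healed += 1
    return data, healed
```

Only the file being opened is copied, so an empty replacement brick is repopulated gradually with the files users actually touch, and files nobody opens are left alone.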

Detecting 'idle time', beginning the sync-up, and pausing it when the user begins activity is a very tricky job, but that is definitely what we aim at finally. It is not enough for AFR to detect that the client is free, because the servers may be busy serving files to another client, and that may not be the most appropriate time to sync. The following versions of AFR will have more options to tune 'when' to sync. Currently it is only at open(). We plan to add options to make it sync on lookup() (which happens on ls). Later versions would have pro-active syncing (detecting that both servers and clients are idle, etc.).
That will be great.
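
The point that the client and the servers must all be idle before a pro-active sync starts could be sketched like this. Nothing here is an AFR option or API: `IdleTracker`, `may_sync`, and the `idle_after` threshold are hypothetical names used only for illustration.

```python
import time


class IdleTracker:
    """Tracks the last activity time of one node (client or server)."""

    def __init__(self, idle_after=5.0, clock=time.monotonic):
        self.idle_after = idle_after  # seconds of quiet before "idle"
        self.clock = clock
        self.last_activity = clock()

    def touch(self):
        """Record that the node just served a request."""
        self.last_activity = self.clock()

    def is_idle(self):
        return self.clock() - self.last_activity >= self.idle_after


def may_sync(trackers):
    """Allow background sync only when every tracked node is idle."""
    return all(t.is_idle() for t in trackers)
```

A sync loop would call `may_sync()` before each unit of work and stop as soon as it returns False, which gives the pause-on-activity behavior described above. Sharing each server's activity state with the client is the hard part and is not modeled here.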

Gerry






thanks,
avati

2007/7/4, Gerry Reno <greno@xxxxxxxxxxx <mailto:greno@xxxxxxxxxxx>>:

    I've been doing some testing of self-heal: basically taking down one
    brick, copying some files to one of the client mounts, then bringing
    the downed brick back up.  What I see is that when I bring the downed
    brick back up, no activity occurs.  It's only when I start doing
    something in one of the client mounts that something occurs to rebuild
    the out-of-sync brick.  My concern is this: suppose I have four
    applications on different client nodes (separate machines) using the
    same data set (mounted on GlusterFS).  The brick on one of these nodes
    is out-of-sync, and it is not until some user tries to use the
    application that the brick starts to resync.  This results in sluggish
    performance for the user, as all the data has to be brought over the
    network from other bricks since the local brick is out-of-sync.  Now
    there may have been ten minutes of idle time prior to this user trying
    to access the data, but glusterfs did not make any use of this time to
    rebuild the out-of-sync brick; rather, it waited until a user tried to
    access data.  To me, it appears that glusterfs should be making use of
    such opportunities, and this would diminish the overall impact of the
    out-of-sync condition on users.

    Regards,
    Gerry







--
Anand V. Avati




