self-heal behavior
I've been doing some testing of self-heal: taking down one brick, copying
some files to one of the client mounts, then bringing the downed brick
back up. What I see is that when the downed brick comes back up, no
activity occurs. Only when I start doing something in one of the client
mounts does anything happen to rebuild the out-of-sync brick.

My concern is this: suppose I have four applications on different client
nodes (separate machines) using the same data set mounted on GlusterFS,
and the brick on one of those nodes is out of sync. Nothing resyncs that
brick until some user tries to use the application, and at that point
performance is sluggish for that user, because all the data has to come
over the network from the other bricks while the local brick catches up.
There may have been ten minutes of idle time before that access, but
GlusterFS made no use of it to rebuild the out-of-sync brick; it simply
waited until a user touched the data. It seems to me that GlusterFS
should take advantage of such idle periods, which would diminish the
overall impact of the out-of-sync condition on users.
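In the meantime, the heal can be kicked off by hand once the brick is
back, instead of waiting for user access. This is only a sketch:
`myvol` and `/mnt/glusterfs` are placeholder names for your volume and
client mount, and the `gluster volume heal` subcommands assume a
GlusterFS version that ships the heal CLI; older releases rely on the
find/stat walk instead.

```shell
# Classic approach for versions without a heal CLI: walk the client
# mount and stat every entry. On a replicated volume the client checks
# each file it touches and heals any stale copy on the returned brick.
find /mnt/glusterfs -noleaf -print0 | xargs -0 stat >/dev/null

# On versions with the heal CLI, ask the self-heal daemon directly:
gluster volume heal myvol        # heal entries already known to need it
gluster volume heal myvol full   # crawl the whole volume
gluster volume heal myvol info   # list entries still pending heal
```

Running either of these from cron right after a brick rejoins would at
least move the resync cost off the first unlucky user.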
Regards,
Gerry