On 29/09/11 12:28, Dan Bretherton wrote:
>
> On 08/09/11 23:51, Dan Bretherton wrote:
>>
>>> On Wed, Sep 7, 2011 at 4:27 PM, Dan Bretherton <d.a.bretherton at reading.ac.uk> wrote:
>>>
>>> On 17/08/11 16:19, Dan Bretherton wrote:
>>>
>>> Dan Bretherton wrote:
>>>
>>> On 15/08/11 20:00, gluster-users-request at gluster.org wrote:
>>>
>>> Message: 1
>>> Date: Sun, 14 Aug 2011 23:24:46 +0300
>>> From: "Deyan Chepishev - SuperHosting.BG" <dchepishev at superhosting.bg>
>>> Subject: cluster.min-free-disk separate for each brick
>>> To: gluster-users at gluster.org
>>> Message-ID: <4E482F0E.3030604 at superhosting.bg>
>>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>>
>>> Hello,
>>>
>>> I have a Gluster setup with very different brick sizes:
>>>
>>> brick1: 9T
>>> brick2: 9T
>>> brick3: 37T
>>>
>>> With this configuration, if I set the parameter cluster.min-free-disk to 10% it applies to all bricks, which is quite awkward with these brick sizes, because 10% of the small bricks is ~1T but for the big brick it is ~3.7T. What happens in the end is that if all bricks reach 90% usage and I continue writing, the small ones eventually fill up to 100% while the big one still has plenty of free space.
>>>
>>> My question is: is there a way to set cluster.min-free-disk per brick instead of setting it for the entire volume, or any other way to work around this problem?
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>> Deyan
>>>
>>> Hello Deyan,
>>>
>>> I have exactly the same problem and I have asked about it before - see the links below.
>>>
>>> http://community.gluster.org/q/in-version-3-1-4-how-can-i-set-the-minimum-amount-of-free-disk-space-on-the-bricks/
>>> http://gluster.org/pipermail/gluster-users/2011-May/007788.html
>>>
>>> My understanding is that the patch referred to in Amar's reply in the May thread prevents a "migrate-data" rebalance operation from failing by running out of space on smaller bricks, but that doesn't solve the problem we are having. Being able to set min-free-disk for each brick separately would be useful, as would being able to set this value as a number of bytes rather than a percentage. However, even if these features were present we would still have a problem when the amount of free space becomes less than min-free-disk, because this just results in a warning message in the logs and doesn't actually prevent more files from being written. In other words, min-free-disk is a soft limit rather than a hard limit. When a volume is more than 90% full there may still be hundreds of gigabytes of free space spread over the large bricks, but the small bricks may each have only a few gigabytes left, or even less. Users do "df", see lots of free space in the volume, and continue writing files. However, when GlusterFS chooses to write a file to a small brick, the write fails with "device full" errors if the file grows too large, which is often the case here because some applications typically produce files several gigabytes in size.
>>>
>>> I would really like to know if there is a way to make min-free-disk a hard limit.
>>> Ideally, GlusterFS would choose the brick on which to write a file based on how much free space it has left, rather than choosing a brick at random (or however it is done now). That would solve the problem of non-uniform brick sizes without the need for a hard min-free-disk limit.
>>>
>>> Amar's comment in the May thread about QA testing being done only on volumes with uniform brick sizes prompted me to start standardising on a uniform brick size for each volume in my cluster. My impression is that implementing the features needed for users with non-uniform brick sizes is not a priority for Gluster, and that users are all expected to use uniform brick sizes. I really think this should be stated clearly in the GlusterFS documentation, for example in the sections on creating volumes in the Administration Guide. That would stop other users from going down the path that I did initially, which has given me a real headache because I am now having to move tens of terabytes of data off bricks that are larger than the new standard size.
>>>
>>> Regards
>>> Dan.
>>>
>>> Hello,
>>>
>>> This is really bad news, because I have already migrated my data and I have just realized that I am stuck, because Gluster simply does not take brick sizes into account. It is impossible for me to move to uniform brick sizes.
>>>
>>> Currently we use 2TB HDDs, but disks keep growing and soon we will probably use 3TB HDDs or whatever larger sizes appear on the market. So if we choose to use RAID5 with some level of redundancy (for example six HDDs in RAID5, whatever their size), this will sooner or later lead us to non-uniform bricks, which is a problem; it is not reasonable to expect that we always can, or want to, provide uniformly sized bricks.
>>>
>>> By that way of thinking, if we currently have 10T from 6x2T in RAID5, then at some point, when a single disk holds 10T, we will have to use no RAID at all just because Gluster cannot handle non-uniform bricks.
>>>
>>> Regards,
>>> Deyan
>>>
>>> I think Amar might have provided the answer in his posting to the thread yesterday, which has just appeared in my autospam folder.
>>>
>>> http://gluster.org/pipermail/gluster-users/2011-August/008579.html
>>>
>>> "With size option, you can have a hardbound on min-free-disk"
>>>
>>> This means that you can set a hard limit on min-free-disk, and set a value in GB that is bigger than the biggest file that is ever likely to be written. This looks likely to solve our problem and make non-uniform brick sizes a practical proposition. I wish I had known about this back in May when I embarked on my cluster restructuring exercise; the issue was discussed in this thread in May as well:
>>> http://gluster.org/pipermail/gluster-users/2011-May/007794.html
>>>
>>> Once I have moved all the data off the large bricks and standardised on a uniform brick size, it will be relatively easy to stick to it, because I use LVM and create logical volumes for new bricks whenever a volume needs extending. The only problem with this approach is what happens when the amount of free space left on a server is less than the size of the brick you want to create. The only option then would be to use new servers, potentially wasting several TB of free space on existing servers.
>>> The standard brick size for most of my volumes is 3TB, which allows me to use a mixture of small and large servers in a volume and limits the amount of free space that would be wasted if there wasn't quite enough room on a server to create another brick. Another consequence of having 3TB bricks is that a single server typically has two or more bricks belonging to the same volume, although I do my best to distribute the volumes across different servers in order to spread the load. I am not aware of any problems associated with exporting multiple bricks from a single server, and it has not caused me any trouble so far.
>>>
>>> -Dan.
>>>
>>> Hello Deyan,
>>>
>>> Have you tried giving min-free-disk a value in gigabytes, and if so, does it prevent new files being written to your bricks when they are nearly full? I recently tried it myself and found that min-free-disk had no effect at all. I deliberately filled my test/backup volume and most of the bricks became 100% full. I had set min-free-disk to "20GB", as reported in the "gluster volume ... info" output below.
>>>
>>> cluster.min-free-disk: 20GB
>>>
>>> Unless I am doing something wrong, it seems as though we cannot "have a hardbound on min-free-disk" after all, and uniform brick size is therefore an essential requirement. It still doesn't say that in the documentation, at least not in the volume creation sections.
>>>
>>> -Dan.
>>>
>>> On 08/09/11 06:35, Raghavendra Bhat wrote:
>>> > This is how it is supposed to work.
>>> >
>>> > Suppose a distribute volume is created with 2 bricks: the 1st brick has 25GB of free space and the 2nd has 35GB. If one sets a minimum-free-disk of 30GB through volume set (gluster volume set <volname> min-free-disk 30GB), then whenever a file is created and it hashes to the 1st brick (which has only 25GB free), the actual file will be created on the 2nd brick and a linkfile will be created on the 1st brick; the linkfile points to the actual file. A warning message indicating that the minimum free disk limit has been crossed, and suggesting that more nodes be added, will be printed in the glusterfs log file. So any file which hashes to the 1st brick will be created on the 2nd brick.
>>> >
>>> > Once the free space on the 2nd brick also drops below 30GB, files will be created on their respective hashed bricks only. There will be a warning message in the log file about the 2nd brick also crossing the minimum free disk limit.
>>> >
>>> > Regards,
>>> > Raghavendra Bhat
>>>
>> Dear Raghavendra,
>> Thanks for explaining this to me. This mechanism should allow a volume to function correctly with non-uniform brick sizes even though min-free-disk is not a hard limit. I can understand now why I had so many problems with the default value of 10% for min-free-disk. 10% of a large brick can be very large compared to 10% of a small brick, so when the bricks carried on filling up at the same rate after all of them had less than 10% free space, the small bricks usually filled up long before the large ones, giving "device full" errors even when df still showed a lot of free space in the volume. At least now we can minimise this effect by setting min-free-disk to a value in GB.
>>
>> -Dan.
>>
> Dear Raghavendra,
> Unfortunately I am still having problems with some bricks filling up completely, despite having "cluster.min-free-disk: 20GB".
> In one case I am still seeing warnings in the client logs about bricks being nearly full in percentage terms, so I am wondering if the volume is still using cluster.min-free-disk: 10% and ignoring the 20GB setting I changed it to. When I changed cluster.min-free-disk, should it have taken effect immediately, or is there something else I should have done to activate the change?
>
> In your example above, suppose there are 9 bricks instead of 2 (as in my volume), and they all have less than 30GB of free space except for one, which is nearly empty. Is GlusterFS clever enough to find that nearly empty brick every time it creates new files? I expected all new files to be created on my nearly empty brick, but that has not happened. Some files have gone in there, but most have gone to nearly full bricks, one of which has now filled up completely. I have done rebalance...fix-layout a number of times. What can I do to fix this problem? The volumes with one or more full bricks are unusable, because users get "device full" errors for some writes even though both volumes show several TB of free space.
>
> Regards
> -Dan Bretherton.

Dear All,
If anyone is interested, I managed to produce the expected behaviour by setting min-free-disk to 300GB rather than 30GB. 300GB is approximately 10% of the size of most of the bricks in the volume. I don't understand why setting min-free-disk to 30GB (about 1% of the brick) didn't work; maybe it is too close to the limit for some reason. I wonder if the default value of min-free-disk=10% is significant. It seems that for non-uniform brick sizes, the correct approach is to set min-free-disk to a value in GB that is approximately 10% of the brick size in each case.

-Dan
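
For anyone hitting the same issue, the workaround described above amounts to roughly the following command sequence. This is only a sketch: the volume name "myvol" and the brick mount point are placeholders, the 300GB figure assumes bricks of roughly 3TB (about 10% of the brick size), and the exact syntax may vary between GlusterFS 3.x releases.

    # Check how much space is left on each brick filesystem
    # ("/mnt/brick1" is a placeholder path; run on each server).
    df -h /mnt/brick1

    # Set min-free-disk as an absolute size (~10% of the brick size)
    # instead of relying on the default 10% percentage value.
    gluster volume set myvol cluster.min-free-disk 300GB

    # Confirm that the new value has been applied to the volume.
    gluster volume info myvol | grep min-free-disk

    # After adding bricks, refresh the layout so that new files can be
    # hashed to the emptier bricks.
    gluster volume rebalance myvol fix-layout start

With the threshold set comfortably above the largest file likely to be written, a file that hashes to a nearly full brick should be redirected to a brick with more free space, leaving only a linkfile on the full one, as Raghavendra describes above.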