On Wed, Sep 7, 2011 at 4:27 PM, Dan Bretherton <d.a.bretherton at reading.ac.uk> wrote:

On 17/08/11 16:19, Dan Bretherton wrote:

Dan Bretherton wrote:
On 15/08/11 20:00, gluster-users-request at gluster.org wrote:

Message: 1
Date: Sun, 14 Aug 2011 23:24:46 +0300
From: "Deyan Chepishev - SuperHosting.BG" <dchepishev at superhosting.bg>
Subject: cluster.min-free-disk separate for each brick
To: gluster-users at gluster.org
Message-ID: <4E482F0E.3030604 at superhosting.bg>
Content-Type: text/plain; charset=UTF-8; format=flowed

Hello,

I have a gluster setup with very different brick sizes.

brick1: 9T
brick2: 9T
brick3: 37T

With this configuration, if I set the parameter cluster.min-free-disk to 10% it applies to all bricks, which is quite awkward with these brick sizes, because 10% of the small bricks is ~1T but 10% of the big brick is ~3.7T. What happens in the end is that if all bricks reach 90% usage and I continue writing, the small ones eventually fill up to 100% while the big one still has plenty of free space.

My question is: is there a way to set cluster.min-free-disk per brick instead of setting it for the entire volume, or any other way to work around this problem?

Thank you in advance.

Regards,
Deyan

Hello Deyan,

I have exactly the same problem and I have asked about it before - see the links below.

http://community.gluster.org/q/in-version-3-1-4-how-can-i-set-the-minimum-amount-of-free-disk-space-on-the-bricks/

http://gluster.org/pipermail/gluster-users/2011-May/007788.html

My understanding is that the patch referred to in Amar's reply in the May thread prevents a "migrate-data" rebalance operation from failing by running out of space on smaller bricks, but that doesn't solve the problem we are having. Being able to set min-free-disk for each brick separately would be useful, as would being able to set this value as a number of bytes rather than a percentage. However, even if these features were present we would still have a problem when the amount of free space becomes less than min-free-disk, because this just results in a warning message in the logs and doesn't actually prevent more files from being written. In other words, min-free-disk is a soft limit rather than a hard limit. When a volume is more than 90% full there may still be hundreds of gigabytes of free space spread over the large bricks, but the small bricks may each only have a few gigabytes left, or even less. Users do "df" and see lots of free space in the volume, so they continue writing files. However, when GlusterFS chooses to write a file to a small brick, the write fails with "device full" errors if the file grows too large, which is often the case here, with files typically several gigabytes in size for some applications.

I would really like to know if there is a way to make min-free-disk a hard limit. Ideally, GlusterFS would choose a brick on which to write a file based on how much free space it has left, rather than choosing a brick at random (or however it is done now). That would solve the problem of non-uniform brick sizes without the need for a hard min-free-disk limit.
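To illustrate what our users see, the mismatch shows up roughly like this when you compare df on the client mount with df on the individual bricks (the host names and brick paths below are made up for the example, not our real layout):

    # On a client, the volume as a whole still looks comfortable:
    df -h /mnt/glustervol

    # On the servers, the small bricks are already nearly full:
    ssh server1 df -h /export/brick1    # 9T brick,  ~99% used
    ssh server2 df -h /export/brick2    # 9T brick,  ~98% used
    ssh server3 df -h /export/brick3    # 37T brick, ~85% used

A file that happens to be placed on brick1 or brick2 then fails with "device full" once it grows past the few gigabytes left there, even though "df" on the mount point still shows plenty of space.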
Amar's comment in the May thread, about QA testing being done only on volumes with uniform brick sizes, prompted me to start standardising on a uniform brick size for each volume in my cluster. My impression is that implementing the features needed for users with non-uniform brick sizes is not a priority for Gluster, and that users are all expected to use uniform brick sizes. I really think this should be stated clearly in the GlusterFS documentation, for example in the sections on creating volumes in the Administration Guide. That would stop other users from going down the path that I did initially, which has given me a real headache because I am now having to move tens of terabytes of data off bricks that are larger than the new standard size.

Regards,
Dan.

Hello,

This is really bad news, because I have already migrated my data and I have just realized that I am screwed, because Gluster simply does not take brick sizes into account. It is impossible for me to move to uniform brick sizes.

Currently we use 2TB HDDs, but disks keep growing and soon we will probably use 3TB HDDs or whatever larger sizes appear on the market. So if we choose to use RAID5 and some level of redundancy (for example six HDDs in RAID5, no matter what their size is), sooner or later this will lead us to non-uniform bricks, which is a problem, and it is not reasonable to expect that we always can, or want to, provide uniformly sized bricks.

By this way of thinking, if we currently have 10T from 6x2T disks in RAID5, then at some point, when a single disk holds 10T, we will have to use no RAID at all just because Gluster cannot handle non-uniform bricks.

Regards,
Deyan

I think Amar might have provided the answer in his posting to the thread yesterday, which has just appeared in my autospam folder.

http://gluster.org/pipermail/gluster-users/2011-August/008579.html

"With size option, you can have a hardbound on min-free-disk"

This means that you can set a hard limit on min-free-disk, and set a value in GB that is bigger than the biggest file that is ever likely to be written. This looks likely to solve our problem and make non-uniform brick sizes a practical proposition. I wish I had known about this back in May when I embarked on my cluster restructuring exercise; the issue was discussed in this thread in May as well:
http://gluster.org/pipermail/gluster-users/2011-May/007794.html

Once I have moved all the data off the large bricks and standardised on a uniform brick size, it will be relatively easy to stick to it, because I use LVM and create logical volumes for new bricks whenever a volume needs extending. The only problem with this approach is what happens when the amount of free space left on a server is less than the size of the brick you want to create. The only option then would be to use new servers, potentially wasting several TB of free space on the existing servers. The standard brick size for most of my volumes is 3TB, which allows me to use a mixture of small servers and large servers in a volume and limits the amount of free space that would be wasted if there wasn't quite enough free space on a server to create another brick. Another consequence of having 3TB bricks is that a single server typically has two or more bricks belonging to the same volume, although I do my best to distribute the volumes across different servers in order to spread the load. I am not aware of any problems associated with exporting multiple bricks from a single server, and it has not caused me any problems so far.
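For what it's worth, adding another 3TB brick from LVM boils down to something like the following on my servers (the volume group, brick path, server and volume names are only examples, and mkfs.xfs is just a placeholder for whatever filesystem you normally put on your bricks):

    # Carve a 3TB logical volume out of the server's volume group
    lvcreate -L 3T -n brick7 vg_bricks
    mkfs.xfs /dev/vg_bricks/brick7
    mkdir -p /export/brick7
    mount /dev/vg_bricks/brick7 /export/brick7

    # Add it to the existing distributed volume and rebalance
    gluster volume add-brick myvolume server5:/export/brick7
    gluster volume rebalance myvolume start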
-Dan.

Hello Deyan,

Have you tried giving min-free-disk a value in gigabytes, and if so does it prevent new files being written to your bricks when they are nearly full? I recently tried it myself and found that min-free-disk had no effect at all. I deliberately filled my test/backup volume and most of the bricks became 100% full. I set min-free-disk to "20GB", as reported in "gluster volume ... info" below.

cluster.min-free-disk: 20GB

Unless I am doing something wrong, it seems as though we cannot "have a hardbound on min-free-disk" after all, and uniform brick size is therefore an essential requirement. It still doesn't say that in the documentation, at least not in the volume creation sections.

-Dan.

On 08/09/11 06:35, Raghavendra Bhat wrote:

This is how it is supposed to work.

Suppose a distribute volume is created with 2 bricks. The 1st brick has 25GB of free space and the 2nd brick has 35GB of free space. If one sets a minimum free disk of 30GB through volume set (gluster volume set <volname> min-free-disk 30GB), then whenever a file is created, if the file is hashed to the 1st brick (which has 25GB of free space), the actual file will be created on the 2nd brick and a linkfile will be created on the 1st brick. The linkfile points to the actual file. A warning message, indicating that the minimum free disk limit has been crossed and that more nodes should be added, will be printed in the glusterfs log file. So any file which is hashed to the 1st brick will be created on the 2nd brick.

Once the free space on the 2nd brick also drops below 30GB, files will be created on their respective hashed bricks. There will be a warning message in the log file about the 2nd brick also crossing the minimum free disk limit.

Regards,
Raghavendra Bhat

Dear Raghavendra,

Thanks for explaining this to me. This mechanism should allow a volume to function correctly with non-uniform brick sizes even though min-free-disk is not a hard limit. I can understand now why I had so many problems with the default value of 10% for min-free-disk. 10% of a large brick can be very large compared to 10% of a small brick, so once all the bricks had less than 10% free space and continued filling up at the same rate, the small bricks usually filled up long before the large ones, giving "device full" errors even while df still showed a lot of free space in the volume. At least now we can minimise this effect by setting min-free-disk to a value in GB.

-Dan.
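P.S. If I have understood the linkfile mechanism correctly, the redirected files should show up on the nearly-full brick as zero-byte entries with only the sticky bit set, carrying an xattr that names the subvolume holding the real file. Something along these lines (the brick path is made up, and this assumes the usual DHT link-file layout) should reveal them:

    # Zero-length files with the sticky bit set should be DHT linkfiles
    find /export/brick1 -type f -size 0 -perm /01000

    # The xattr points at the subvolume that holds the actual data
    getfattr -n trusted.glusterfs.dht.linkto -e text /export/brick1/some/file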