On 17/08/11 16:19, Dan Bretherton wrote:
>
>> Dan Bretherton wrote:
>>>
>>> On 15/08/11 20:00, gluster-users-request at gluster.org wrote:
>>>> Message: 1
>>>> Date: Sun, 14 Aug 2011 23:24:46 +0300
>>>> From: "Deyan Chepishev - SuperHosting.BG" <dchepishev at superhosting.bg>
>>>> Subject: cluster.min-free-disk separate for each brick
>>>> To: gluster-users at gluster.org
>>>> Message-ID: <4E482F0E.3030604 at superhosting.bg>
>>>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>>>
>>>> Hello,
>>>>
>>>> I have a Gluster setup with very different brick sizes:
>>>>
>>>> brick1: 9T
>>>> brick2: 9T
>>>> brick3: 37T
>>>>
>>>> With this configuration, if I set the parameter cluster.min-free-disk
>>>> to 10% it applies to all bricks, which is quite awkward with these
>>>> brick sizes: 10% of a small brick is ~1T, but 10% of the big brick is
>>>> ~3.7T. What happens in the end is that if all bricks reach 90% usage
>>>> and I continue writing, the small ones eventually fill up to 100%
>>>> while the big one still has plenty of free space.
>>>>
>>>> My question is: is there a way to set cluster.min-free-disk per brick
>>>> instead of setting it for the entire volume, or any other way to work
>>>> around this problem?
>>>>
>>>> Thank you in advance.
>>>>
>>>> Regards,
>>>> Deyan
>>>>
>>> Hello Deyan,
>>>
>>> I have exactly the same problem and I have asked about it before - see
>>> the links below.
>>>
>>> http://community.gluster.org/q/in-version-3-1-4-how-can-i-set-the-minimum-amount-of-free-disk-space-on-the-bricks/
>>>
>>> http://gluster.org/pipermail/gluster-users/2011-May/007788.html
>>>
>>> My understanding is that the patch referred to in Amar's reply in the
>>> May thread prevents a "migrate-data" rebalance operation from failing
>>> by running out of space on smaller bricks, but that doesn't solve the
>>> problem we are having. Being able to set min-free-disk for each brick
>>> separately would be useful, as would being able to set the value as a
>>> number of bytes rather than a percentage. However, even if these
>>> features were present we would still have a problem when the amount of
>>> free space falls below min-free-disk, because that just produces a
>>> warning message in the logs and doesn't actually prevent more files
>>> from being written. In other words, min-free-disk is a soft limit
>>> rather than a hard limit. When a volume is more than 90% full there
>>> may still be hundreds of gigabytes of free space spread over the large
>>> bricks, but the small bricks may each have only a few gigabytes left,
>>> or even less. Users do "df", see lots of free space in the volume, and
>>> continue writing files. However, when GlusterFS chooses to write a
>>> file to a small brick, the write fails with "device full" errors if
>>> the file grows too large - which happens often here, because some
>>> applications typically produce files several gigabytes in size.
>>>
>>> I would really like to know if there is a way to make min-free-disk a
>>> hard limit. Ideally, GlusterFS would choose the brick on which to
>>> write a file based on how much free space it has left, rather than
>>> choosing a brick at random (or however it is done now). That would
>>> solve the problem of non-uniform brick sizes without the need for a
>>> hard min-free-disk limit.
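To be explicit about what we are both describing: as far as I know
cluster.min-free-disk can only be applied to a whole volume, not to
individual bricks, and it is set with the ordinary "volume set" command,
along these lines ("datavol" is just a placeholder volume name for this
sketch):

    # volume-wide soft limit; 10% is evaluated against each brick's own capacity
    gluster volume set datavol cluster.min-free-disk 10%
    gluster volume info datavol | grep min-free-disk

It is exactly because the option is volume-wide that 10% means very
different absolute amounts on a 9T brick and a 37T brick.
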
>>>
>>> Amar's comment in the May thread about QA testing being done only on
>>> volumes with uniform brick sizes prompted me to start standardising on
>>> a uniform brick size for each volume in my cluster. My impression is
>>> that implementing the features needed by users with non-uniform brick
>>> sizes is not a priority for Gluster, and that users are all expected
>>> to use uniform brick sizes. I really think this should be stated
>>> clearly in the GlusterFS documentation, for example in the sections on
>>> creating volumes in the Administration Guide. That would stop other
>>> users from going down the path that I did initially, which has given
>>> me a real headache because I am now having to move tens of terabytes
>>> of data off bricks that are larger than the new standard size.
>>>
>>> Regards
>>> Dan.
>>>
>> Hello,
>>
>> This is really bad news, because I have already migrated my data and I
>> have just realised that I am stuck: Gluster simply does not take brick
>> sizes into account, and it is impossible for me to move to uniform
>> brick sizes.
>>
>> Currently we use 2TB HDDs, but disks keep growing and soon we will
>> probably use 3TB HDDs or whatever larger sizes appear on the market. So
>> if we build bricks from RAID5 arrays with some level of redundancy (for
>> example 6 HDDs in RAID5, whatever their size), sooner or later this
>> leads to non-uniform bricks, which is a problem; it is not reasonable
>> to expect that we always can, or want to, provide bricks of uniform
>> size.
>>
>> Following that logic, if we currently have 10T from 6x2T in RAID5, then
>> at some point, when a single disk holds 10T, we would have to use no
>> RAID at all just because Gluster cannot handle non-uniform bricks.
>>
>> Regards,
>> Deyan
>>
>
> I think Amar might have provided the answer in his posting to the
> thread yesterday, which has just appeared in my autospam folder.
>
> http://gluster.org/pipermail/gluster-users/2011-August/008579.html
>
>> With size option, you can have a hardbound on min-free-disk
>
> This means that you can set a hard limit on min-free-disk, and set a
> value in GB that is bigger than the biggest file that is ever likely to
> be written. This looks likely to solve our problem and make non-uniform
> brick sizes a practical proposition. I wish I had known about this back
> in May when I embarked on my cluster restructuring exercise; the issue
> was discussed in this thread in May as well:
> http://gluster.org/pipermail/gluster-users/2011-May/007794.html
>
> Once I have moved all the data off the large bricks and standardised on
> a uniform brick size, it will be relatively easy to stick to it, because
> I use LVM and create logical volumes for new bricks whenever a volume
> needs extending. The only problem with this approach is what happens
> when the amount of free space left on a server is less than the size of
> the brick you want to create; the only option then would be to use new
> servers, potentially wasting several TB of free space on existing
> servers. The standard brick size for most of my volumes is 3TB, which
> allows me to use a mixture of small servers and large servers in a
> volume and limits the amount of free space that would be wasted if
> there wasn't quite enough room on a server to create another brick. A
> consequence of having 3TB bricks is that a single server typically has
> two or more bricks belonging to the same volume, although I do my best
> to distribute the volumes across different servers in order to spread
> the load.
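For concreteness, the routine I have in mind when a volume needs extending
is roughly the following (the volume group, brick and volume names here are
just placeholders for this sketch):

    # carve a new 3TB brick out of free space on an existing server
    lvcreate -L 3T -n brick05 vg_bricks
    mkfs.xfs /dev/vg_bricks/brick05
    mkdir -p /mnt/brick05
    mount /dev/vg_bricks/brick05 /mnt/brick05

    # add the brick to the volume, then set the size-based limit that Amar
    # described, larger than the biggest file we ever expect to write
    gluster volume add-brick datavol server3:/mnt/brick05
    gluster volume set datavol cluster.min-free-disk 20GB

Whether that last setting really behaves as a hard limit is the question I
come back to below.
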
> I am not aware of any problems associated with exporting multiple
> bricks from a single server, and doing so has not caused me any trouble
> so far.
>
> -Dan.
>

Hello Deyan,

Have you tried giving min-free-disk a value in gigabytes, and if so does it
prevent new files being written to your bricks when they are nearly full? I
recently tried it myself and found that min-free-disk had no effect at all.
I deliberately filled my test/backup volume and most of the bricks became
100% full, even though I had set min-free-disk to "20GB", as reported by
"gluster volume ... info" below.

cluster.min-free-disk: 20GB

Unless I am doing something wrong, it seems as though we cannot "have a
hardbound on min-free-disk" after all, and a uniform brick size is
therefore an essential requirement. It still doesn't say that in the
documentation, at least not in the volume creation sections.

-Dan.

--
Mr. D.A. Bretherton
Computer System Manager
Environmental Systems Science Centre
Harry Pitt Building
3 Earley Gate
University of Reading
Reading, RG6 6AL
UK
Tel. +44 118 378 5205
Fax: +44 118 378 6413