Re: Ceph performance, empty vs part full

Shinobu Kinjo <skinjo@xxxxxxxxxx> · Fri, 4 Sep 2015 20:42:06 -0400 (EDT)

Very nice.
You're my hero!

 Shinobu

----- Original Message -----
From: "GuangYang" <yguang11@xxxxxxxxxxx>
To: "Shinobu Kinjo" <skinjo@xxxxxxxxxx>
Cc: "Ben Hines" <bhines@xxxxxxxxx>, "Nick Fisk" <nick@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Saturday, September 5, 2015 9:40:06 AM
Subject: RE:  Ceph performance, empty vs part full

----------------------------------------
> Date: Fri, 4 Sep 2015 20:31:59 -0400
> From: skinjo@xxxxxxxxxx
> To: yguang11@xxxxxxxxxxx
> CC: bhines@xxxxxxxxx; nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Ceph performance, empty vs part full
>
>> IIRC, it only triggers the move (merge or split) when that folder is hit by a request, so most likely it happens gradually.
>
> Do you know what causes this?
A request (read, write, setxattr, etc.) hitting an object in that folder.
> I would like to be clearer about what "gradually" means.
>
> Shinobu
>
> ----- Original Message -----
> From: "GuangYang" <yguang11@xxxxxxxxxxx>
> To: "Ben Hines" <bhines@xxxxxxxxx>, "Nick Fisk" <nick@xxxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Saturday, September 5, 2015 9:27:31 AM
> Subject: Re:  Ceph performance, empty vs part full
>
> IIRC, it only triggers the move (merge or split) when that folder is hit by a request, so most likely it happens gradually.
>
> Another thing that might be helpful (and that we have had good experience with) is to do the folder splitting at pool creation time, which avoids the performance impact of runtime splitting (which is high if you have a large cluster). In order to do that:
>
> 1. Configure "filestore merge threshold" with a negative value, which disables merging.
> 2. When creating the pool, specify the "expected_num_objects" parameter; the folders will then be split to the right level at pool creation.
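A rough sketch of the two steps above, written as a small Python helper that only assembles the text and never touches a cluster. The pool name, PG counts, ruleset name and object count are made-up example values, -10 is just one possible negative threshold, and the argument order is my reading of the `ceph osd pool create` usage from that era, so please check it against your release before relying on it.

```python
# Sketch only: builds the config fragment and the command line as text.
# All concrete values below are hypothetical examples, not recommendations.

def presplit_conf(merge_threshold: int = -10, split_multiple: int = 2) -> str:
    """Return a ceph.conf [osd] fragment with directory merging disabled."""
    if merge_threshold >= 0:
        raise ValueError("merge threshold must be negative to disable merging")
    return (
        "[osd]\n"
        f"filestore merge threshold = {merge_threshold}\n"
        f"filestore split multiple = {split_multiple}\n"
    )


def pool_create_argv(pool, pg_num, pgp_num, ruleset, expected_num_objects):
    """Return the argv for creating a replicated pool with pre-split folders.

    Assumed usage: ceph osd pool create <pool> <pg-num> <pgp-num>
                   replicated <ruleset> <expected-num-objects>
    """
    return [
        "ceph", "osd", "pool", "create", pool,
        str(pg_num), str(pgp_num),
        "replicated", ruleset, str(expected_num_objects),
    ]


if __name__ == "__main__":
    print(presplit_conf())
    print(" ".join(pool_create_argv(
        "testpool", 2048, 2048, "replicated_ruleset", 1_000_000)))
```

The negative threshold has to be in place on the OSDs before the pool is created, otherwise the pre-split folders could be merged again.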
>
> Hope that helps.
>
> Thanks,
> Guang
>
>
> ----------------------------------------
>> From: bhines@xxxxxxxxx
>> Date: Fri, 4 Sep 2015 12:05:26 -0700
>> To: nick@xxxxxxxxxx
>> CC: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re:  Ceph performance, empty vs part full
>>
>> Yeah, I'm not seeing stuff being moved at all. Perhaps we should file
>> a ticket to request a way to tell an OSD to rebalance its directory
>> structure.
>>
>> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>> I've just made the same change (4 and 40 for now) on my cluster, which is a similar size to yours. I didn't see any merging happening, although most of the directories I looked at had more files in them than the new merge threshold, so I guess this is to be expected.
>>>
>>> I'm currently splitting my PGs from 1024 to 2048 to see if that helps to bring things back into order.
>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>>>> Wang, Warren
>>>> Sent: 04 September 2015 01:21
>>>> To: Mark Nelson <mnelson@xxxxxxxxxx>; Ben Hines <bhines@xxxxxxxxx>
>>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>>>> Subject: Re:  Ceph performance, empty vs part full
>>>>
>>>> I'm about to change it on a big cluster too. It totals around 30 million, so I'm a
>>>> bit nervous about changing it. As far as I understood, it would indeed move
>>>> them around, if you can get underneath the threshold, but it may be hard to
>>>> do. Two more settings that I highly recommend changing on a big prod
>>>> cluster. I'm in favor of bumping these two up in the defaults.
>>>>
>>>> Warren
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>>>> Mark Nelson
>>>> Sent: Thursday, September 03, 2015 6:04 PM
>>>> To: Ben Hines <bhines@xxxxxxxxx>
>>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>>>> Subject: Re:  Ceph performance, empty vs part full
>>>>
>>>> Hrm, I think it will follow the merge/split rules if it's out of whack given the
>>>> new settings, but I don't know that I've ever tested it on an existing cluster to
>>>> see that it actually happens. I guess let it sit for a while and then check the
>>>> OSD PG directories to see if the object counts make sense given the new
>>>> settings? :D
>>>>
>>>> Mark
>>>>
>>>> On 09/03/2015 04:31 PM, Ben Hines wrote:
>>>>> Hey Mark,
>>>>>
>>>>> I've just tweaked these filestore settings for my cluster -- after
>>>>> changing this, is there a way to make ceph move existing objects
>>>>> around to new filestore locations, or will this only apply to newly
>>>>> created objects? (I would assume the latter.)
>>>>>
>>>>> thanks,
>>>>>
>>>>> -Ben
>>>>>
>>>>> On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>>>> Basically for each PG, there's a directory tree where only a certain
>>>>>> number of objects are allowed in a given directory before it splits
>>>>>> into new branches/leaves. The problem is that this has a fair amount
>>>>>> of overhead, and there are also extra dentry lookups needed to get at
>>>>>> any given object.
>>>>>>
>>>>>> You may want to try something like:
>>>>>>
>>>>>> "filestore merge threshold = 40"
>>>>>> "filestore split multiple = 8"
>>>>>>
>>>>>> This will dramatically increase the number of objects per directory
>>>>>> allowed.
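To put a number on "dramatically": as far as I know, filestore splits a directory once it holds more than 16 * split multiple * abs(merge threshold) objects. A minimal sketch of that arithmetic follows; the defaults of 10 and 2 are quoted from memory of the docs of that era, so treat them as an assumption.

```python
def split_threshold(merge_threshold: int, split_multiple: int) -> int:
    """Objects a filestore directory may hold before it is split.

    Assumed rule: 16 * split_multiple * abs(merge_threshold).
    """
    return 16 * split_multiple * abs(merge_threshold)


if __name__ == "__main__":
    default = split_threshold(10, 2)   # assumed defaults
    tuned = split_threshold(40, 8)     # the values suggested above
    print(default)           # 320
    print(tuned)             # 5120
    print(tuned // default)  # 16
```

So with the suggested values a directory holds 16 times as many objects before it splits.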
>>>>>>
>>>>>> Another thing you may want to try is telling the kernel to greatly
>>>>>> favor retaining dentries and inodes in cache:
>>>>>>
>>>>>> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
>>>>>>>
>>>>>>> If I create a new pool it is generally fast for a short amount of time.
>>>>>>> Not as fast as if I had a blank cluster, but close to it.
>>>>>>>
>>>>>>> Bryn
>>>>>>>>
>>>>>>>> On 8 Jul 2015, at 13:55, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> I think you're probably running into the internal PG/collection
>>>>>>>> splitting here; try searching for those terms and seeing what your
>>>>>>>> OSD folder structures look like. You could test by creating a new
>>>>>>>> pool and seeing if it's faster or slower than the one you've already
>>>>>>>> filled up.
>>>>>>>> -Greg
>>>>>>>>
>>>>>>>> On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
>>>>>>>> <bryn.mathias@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm perf testing a cluster again.
>>>>>>>>> This time I have rebuilt the cluster and am filling it for testing.
>>>>>>>>>
>>>>>>>>> On a 10 min run I get the following results from 5 load
>>>>>>>>> generators, each writing through 7 iocontexts, with a queue depth of
>>>>>>>>> 50 async writes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Gen1
>>>>>>>>> Percentile 100 = 0.729775905609
>>>>>>>>> Max latencies = 0.729775905609, Min = 0.0320818424225, mean =
>>>>>>>>> 0.0750389684542
>>>>>>>>> Total objects writen = 113088 in time 604.259738207s gives
>>>>>>>>> 187.151307376/s (748.605229503 MB/s)
>>>>>>>>>
>>>>>>>>> Gen2
>>>>>>>>> Percentile 100 = 0.735981941223
>>>>>>>>> Max latencies = 0.735981941223, Min = 0.0340068340302, mean =
>>>>>>>>> 0.0745198070711
>>>>>>>>> Total objects writen = 113822 in time 604.437897921s gives
>>>>>>>>> 188.310495407/s (753.241981627 MB/s)
>>>>>>>>>
>>>>>>>>> Gen3
>>>>>>>>> Percentile 100 = 0.828994989395
>>>>>>>>> Max latencies = 0.828994989395, Min = 0.0349340438843, mean =
>>>>>>>>> 0.0745455575197
>>>>>>>>> Total objects writen = 113670 in time 604.352181911s gives
>>>>>>>>> 188.085694736/s (752.342778944 MB/s)
>>>>>>>>>
>>>>>>>>> Gen4
>>>>>>>>> Percentile 100 = 1.06834602356
>>>>>>>>> Max latencies = 1.06834602356, Min = 0.0333499908447, mean =
>>>>>>>>> 0.0752239764659
>>>>>>>>> Total objects writen = 112744 in time 604.408732891s gives
>>>>>>>>> 186.536020849/s (746.144083397 MB/s)
>>>>>>>>>
>>>>>>>>> Gen5
>>>>>>>>> Percentile 100 = 0.609658002853
>>>>>>>>> Max latencies = 0.609658002853, Min = 0.032968044281, mean =
>>>>>>>>> 0.0744482759499
>>>>>>>>> Total objects writen = 113918 in time 604.671534061s gives
>>>>>>>>> 188.396498897/s (753.585995589 MB/s)
>>>>>>>>>
>>>>>>>>> example ceph -w output:
>>>>>>>>> 2015-07-07 15:50:16.507084 mon.0 [INF] pgmap v1077: 2880 pgs: 2880
>>>>>>>>> active+clean; 1996 GB data, 2515 GB used, 346 TB / 348 TB avail;
>>>>>>>>> 2185 MB/s wr, 572 op/s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> However, when the cluster gets over 20% full I see the following
>>>>>>>>> results, and this gets worse as the cluster fills up:
>>>>>>>>>
>>>>>>>>> Gen1
>>>>>>>>> Percentile 100 = 6.71176099777
>>>>>>>>> Max latencies = 6.71176099777, Min = 0.0358741283417, mean =
>>>>>>>>> 0.161760483485
>>>>>>>>> Total objects writen = 52196 in time 604.488474131s gives
>>>>>>>>> 86.347386648/s (345.389546592 MB/s)
>>>>>>>>>
>>>>>>>>> Gen2
>>>>>>>>> Max latencies = 4.09169006348, Min = 0.0357890129089, mean =
>>>>>>>>> 0.163243938477
>>>>>>>>> Total objects writen = 51702 in time 604.036739111s gives
>>>>>>>>> 85.5941313704/s (342.376525482 MB/s)
>>>>>>>>>
>>>>>>>>> Gen3
>>>>>>>>> Percentile 100 = 7.32526683807
>>>>>>>>> Max latencies = 7.32526683807, Min = 0.0366668701172, mean =
>>>>>>>>> 0.163992217926
>>>>>>>>> Total objects writen = 51476 in time 604.684302092s gives
>>>>>>>>> 85.1287189397/s (340.514875759 MB/s)
>>>>>>>>>
>>>>>>>>> Gen4
>>>>>>>>> Percentile 100 = 7.56094503403
>>>>>>>>> Max latencies = 7.56094503403, Min = 0.0355761051178, mean =
>>>>>>>>> 0.162109421231
>>>>>>>>> Total objects writen = 52092 in time 604.769910812s gives
>>>>>>>>> 86.1352376642/s (344.540950657 MB/s)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Gen5
>>>>>>>>> Percentile 100 = 6.99595499039
>>>>>>>>> Max latencies = 6.99595499039, Min = 0.0364680290222, mean =
>>>>>>>>> 0.163651215426
>>>>>>>>> Total objects writen = 51566 in time 604.061977148s gives
>>>>>>>>> 85.3654127404/s (341.461650961 MB/s)
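To make the before/after gap easier to see, here is a small sketch that averages the two result sets quoted above. The numbers are copied from the thread; the 4 MB object size is my inference from dividing the reported MB/s by the reported objects/s, not something stated in the original post.

```python
from statistics import mean

# (objects written, elapsed seconds, mean latency in seconds) per generator.
EMPTY = [
    (113088, 604.259738207, 0.0750389684542),
    (113822, 604.437897921, 0.0745198070711),
    (113670, 604.352181911, 0.0745455575197),
    (112744, 604.408732891, 0.0752239764659),
    (113918, 604.671534061, 0.0744482759499),
]
PART_FULL = [
    (52196, 604.488474131, 0.161760483485),
    (51702, 604.036739111, 0.163243938477),
    (51476, 604.684302092, 0.163992217926),
    (52092, 604.769910812, 0.162109421231),
    (51566, 604.061977148, 0.163651215426),
]

OBJECT_MB = 4  # assumption: inferred from MB/s divided by objects/s


def summarize(runs):
    """Average per-generator write rate and mean latency for one result set."""
    rates = [objs / secs for objs, secs, _ in runs]
    return {
        "objects_per_s": mean(rates),
        "mb_per_s": mean(rates) * OBJECT_MB,
        "mean_latency_s": mean(lat for _, _, lat in runs),
    }


if __name__ == "__main__":
    empty, full = summarize(EMPTY), summarize(PART_FULL)
    for name, s in (("empty", empty), ("over 20% full", full)):
        print(f"{name}: {s['objects_per_s']:.1f} obj/s, "
              f"{s['mb_per_s']:.0f} MB/s, {s['mean_latency_s']:.3f} s mean latency")
    print(f"throughput ratio: {full['objects_per_s'] / empty['objects_per_s']:.2f}")
    print(f"latency ratio: {full['mean_latency_s'] / empty['mean_latency_s']:.2f}")
```

Per generator, the write rate falls to a bit under half and the mean latency slightly more than doubles, while the minimum latency barely moves.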
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cluster details:
>>>>>>>>> 5 x HP DL380s with 13 x 6 TB OSDs
>>>>>>>>> 128 GB RAM
>>>>>>>>> 2 x Intel 2620v3
>>>>>>>>> 10 Gbit Ceph public network
>>>>>>>>> 10 Gbit Ceph private network
>>>>>>>>>
>>>>>>>>> Load generators connected via a 20Gbit bond to the ceph public
>>>>>>>>> network.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is this likely to be something happening to the journals?
>>>>>>>>>
>>>>>>>>> Or is there something else going on?
>>>>>>>>>
>>>>>>>>> I have run FIO and iperf tests and the disk and network
>>>>>>>>> performance is very high.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>> Bryn Mathias
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list
>>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
