Re: Ceph performance, empty vs part full

It's been a while since I looked at this, but my recollection is that
the FileStore will check if it should split on every object create,
and will check if it should merge on every delete. It's conceivable it
checks for both whenever the number of objects changes, though, which
would make things easier.
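A minimal sketch of that bookkeeping, assuming the commonly quoted FileStore rule (split when a directory exceeds split_multiple * merge_threshold * 16 objects, merge when it drops below merge_threshold). This is an illustration only, not the actual C++ HashIndex code:

```shell
# Hypothetical sketch of the per-directory check; the real logic lives in
# FileStore's C++ HashIndex code, this just mirrors the idea.
maybe_split_or_merge() {
    objects=$1; merge_thr=$2; split_mult=$3
    split_point=$(( split_mult * merge_thr * 16 ))
    if [ "$objects" -gt "$split_point" ]; then
        echo "split"          # checked on object create
    elif [ "$objects" -lt "$merge_thr" ]; then
        echo "merge"          # checked on object delete
    else
        echo "ok"
    fi
}
```

Under that rule, with the era's defaults (merge threshold 10, split multiple 2) a directory would split once it passed 320 objects.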

I don't think scrub or anything else will do the work, though. :/
-Greg

On Tue, Sep 8, 2015 at 2:26 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Nick Fisk
>> Sent: 06 September 2015 15:11
>> To: 'Shinobu Kinjo' <skinjo@xxxxxxxxxx>; 'GuangYang'
>> <yguang11@xxxxxxxxxxx>
>> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>; 'Nick Fisk' <nick@xxxxxxxxxx>
>> Subject: Re:  Ceph performance, empty vs part full
>>
>> Just a quick update: after upping the thresholds, not much happened. This is
>> probably because the merge threshold is several times lower than the trigger
>> for the split. So I have now bumped the merge threshold up to 1000
>> temporarily, to hopefully force some DIRs to merge.
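For reference, a sketch of how a change like that can be pushed to running OSDs without a restart (assuming Hammer-era option names; verify against your release before use):

```
# Inject the raised merge threshold into every running OSD (sketch).
ceph tell osd.\* injectargs '--filestore_merge_threshold 1000'

# And persist it in ceph.conf so it survives restarts:
#   [osd]
#   filestore merge threshold = 1000
```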
>>
>> I believe this has started to happen, but it only seems to merge right at the
>> bottom of the tree.
>>
>> E.g.
>>
>> /var/lib/ceph/osd/ceph-1/current/0.106_head/DIR_6/DIR_0/DIR_1/
>>
>> All the directories only have 1 directory in them; DIR_1 is the only one in the
>> path that has any objects in it. Is this the correct behaviour? Is there any
>> impact from having these deeper paths compared to when the objects are
>> just in the root directory?
>>
>> I guess the only real way to get the objects back into the root would be to
>> out->drain->in the OSD?
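The out/drain/in cycle mentioned above would look roughly like this (a sketch; the OSD id is an example, and you would wait for recovery to finish between steps):

```
ceph osd out 1          # stop mapping PGs to osd.1; data drains off it
ceph -s                 # wait until recovery finishes / HEALTH_OK
ceph osd in 1           # map PGs back; objects are rewritten into
                        # freshly laid-out directories
```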
>>
>>
>> > -----Original Message-----
>> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>> > Of Shinobu Kinjo
>> > Sent: 05 September 2015 01:42
>> > To: GuangYang <yguang11@xxxxxxxxxxx>
>> > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>; Nick Fisk
>> > <nick@xxxxxxxxxx>
>> > Subject: Re:  Ceph performance, empty vs part full
>> >
>> > Very nice.
>> > You're my hero!
>> >
>> >  Shinobu
>> >
>> > ----- Original Message -----
>> > From: "GuangYang" <yguang11@xxxxxxxxxxx>
>> > To: "Shinobu Kinjo" <skinjo@xxxxxxxxxx>
>> > Cc: "Ben Hines" <bhines@xxxxxxxxx>, "Nick Fisk" <nick@xxxxxxxxxx>,
>> > "ceph- users" <ceph-users@xxxxxxxxxxxxxx>
>> > Sent: Saturday, September 5, 2015 9:40:06 AM
>> > Subject: RE:  Ceph performance, empty vs part full
>> >
>> > ----------------------------------------
>> > > Date: Fri, 4 Sep 2015 20:31:59 -0400
>> > > From: skinjo@xxxxxxxxxx
>> > > To: yguang11@xxxxxxxxxxx
>> > > CC: bhines@xxxxxxxxx; nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
>> > > Subject: Re:  Ceph performance, empty vs part full
>> > >
>> > >> IIRC, it only triggers the move (merge or split) when that folder
>> > >> is hit by a
>> > request, so most likely it happens gradually.
>> > >
>> > > Do you know what causes this?
>> > A request (read/write/setxattr, etc.) hitting objects in that folder.
>> > > I would like you to be clearer about "gradually".
>
>
> Does anyone know if a scrub is included in this? I have kicked off a deep scrub of an OSD and yet I still don't see merging happening, even with a merge threshold of 1000.
>
> Example
> /var/lib/ceph/osd/ceph-0/current/0.108_head : 0 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8 : 0 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0 : 0 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1 : 15 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_4 : 85 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_B : 63 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_D : 88 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_8 : 73 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_0 : 77 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_6 : 79 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_3 : 67 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_E : 94 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_C : 91 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_A : 88 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_5 : 96 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_2 : 88 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_9 : 70 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_1 : 95 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_7 : 87 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_F : 88 files
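For anyone wanting to reproduce a listing like the one above, a small sketch (the PG path in the example comment is illustrative):

```shell
# Print "<dir> : N files" for every directory under a PG head directory,
# counting only regular files directly inside each directory.
count_per_dir() {
    find "$1" -type d | while read -r d; do
        printf '%s : %d files\n' "$d" "$(find "$d" -maxdepth 1 -type f | wc -l)"
    done
}
# e.g.: count_per_dir /var/lib/ceph/osd/ceph-0/current/0.108_head
```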
>
>
>
>> > >
>> > > Shinobu
>> > >
>> > > ----- Original Message -----
>> > > From: "GuangYang" <yguang11@xxxxxxxxxxx>
>> > > To: "Ben Hines" <bhines@xxxxxxxxx>, "Nick Fisk" <nick@xxxxxxxxxx>
>> > > Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>> > > Sent: Saturday, September 5, 2015 9:27:31 AM
>> > > Subject: Re:  Ceph performance, empty vs part full
>> > >
>> > > IIRC, it only triggers the move (merge or split) when that folder is
>> > > hit by a
>> > request, so most likely it happens gradually.
>> > >
>> > > Another thing that might be helpful (and that we have had good
>> > > experience with) is doing the folder splitting at pool creation time,
>> > so that we avoid the performance impact of runtime splitting (which is
>> > high if you have a large cluster). In order to do that:
>> > >
>> > > 1. You will need to configure "filestore merge threshold" with a
>> > > negative value so that merging is disabled.
>> > > 2. When creating the pool, there is a parameter named
>> > "expected_num_objects"; by specifying that number, the folders will be
>> > split to the right level at pool creation.
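Put together, those two steps might look like this (a sketch: the pool name, PG counts, ruleset name, and object count are placeholders, and the exact `ceph osd pool create` argument order should be checked against your release):

```
# ceph.conf on the OSD hosts: a negative value disables merging
[osd]
filestore merge threshold = -10

# Create the pool pre-split; the last argument is expected-num-objects
ceph osd pool create mypool 1024 1024 replicated replicated_ruleset 500000000
```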
>> > >
>> > > Hope that helps.
>> > >
>> > > Thanks,
>> > > Guang
>> > >
>> > >
>> > > ----------------------------------------
>> > >> From: bhines@xxxxxxxxx
>> > >> Date: Fri, 4 Sep 2015 12:05:26 -0700
>> > >> To: nick@xxxxxxxxxx
>> > >> CC: ceph-users@xxxxxxxxxxxxxx
>> > >> Subject: Re:  Ceph performance, empty vs part full
>> > >>
>> > >> Yeah, I'm not seeing stuff being moved at all. Perhaps we should
>> > >> file a ticket to request a way to tell an OSD to rebalance its
>> > >> directory structure.
>> > >>
>> > >> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> > >>> I've just made the same change (4 and 40 for now) on my cluster,
>> > >>> which is a similar size to yours. I didn't see any merging
>> > >>> happening, although most of the directories I looked at had more
>> > >>> files in them than the new merge threshold, so I guess this is to be
>> > >>> expected.
>> > >>>
>> > >>> I'm currently splitting my PGs from 1024 to 2048 to see if that
>> > >>> helps to bring things back into order.
>> > >>>
>> > >>>> -----Original Message-----
>> > >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
>> > >>>> Behalf Of Wang, Warren
>> > >>>> Sent: 04 September 2015 01:21
>> > >>>> To: Mark Nelson <mnelson@xxxxxxxxxx>; Ben Hines
>> > <bhines@xxxxxxxxx>
>> > >>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> > >>>> Subject: Re:  Ceph performance, empty vs part full
>> > >>>>
>> > >>>> I'm about to change it on a big cluster too. It totals around 30
>> > >>>> million objects, so I'm a bit nervous about changing it. As far as I
>> > >>>> understood, it would indeed move them around if you can get
>> > >>>> underneath the threshold, but that may be hard to do. These are two
>> > >>>> more settings that I highly recommend changing on a big prod
>> > >>>> cluster; I'm in favor of bumping them up in the defaults.
>> > >>>>
>> > >>>> Warren
>> > >>>>
>> > >>>> -----Original Message-----
>> > >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
>> > >>>> Behalf Of Mark Nelson
>> > >>>> Sent: Thursday, September 03, 2015 6:04 PM
>> > >>>> To: Ben Hines <bhines@xxxxxxxxx>
>> > >>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> > >>>> Subject: Re:  Ceph performance, empty vs part full
>> > >>>>
>> > >>>> Hrm, I think it will follow the merge/split rules if it's out of
>> > >>>> whack given the new settings, but I don't know that I've ever
>> > >>>> tested it on an existing cluster to see that it actually happens.
>> > >>>> I guess let it sit for a while and then check the OSD PG
>> > >>>> directories to see if the object counts make sense given the new
>> > >>>> settings? :D
>> > >>>>
>> > >>>> Mark
>> > >>>>
>> > >>>> On 09/03/2015 04:31 PM, Ben Hines wrote:
>> > >>>>> Hey Mark,
>> > >>>>>
>> > >>>>> I've just tweaked these filestore settings for my cluster --
>> > >>>>> after changing this, is there a way to make ceph move existing
>> > >>>>> objects around to new filestore locations, or will this only
>> > >>>>> apply to newly created objects? (I would assume the latter...)
>> > >>>>>
>> > >>>>> thanks,
>> > >>>>>
>> > >>>>> -Ben
>> > >>>>>
>> > >>>>> On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson
>> <mnelson@xxxxxxxxxx>
>> > >>>> wrote:
>> > >>>>>> Basically for each PG, there's a directory tree where only a
>> > >>>>>> certain number of objects are allowed in a given directory
>> > >>>>>> before it splits into new branches/leaves. The problem is that
>> > >>>>>> this has a fair amount of overhead, and there are also extra
>> > >>>>>> dentry lookups associated with getting at any
>> > >>>> given object.
>> > >>>>>>
>> > >>>>>> You may want to try something like:
>> > >>>>>>
>> > >>>>>> "filestore merge threshold = 40"
>> > >>>>>> "filestore split multiple = 8"
>> > >>>>>>
>> > >>>>>> This will dramatically increase the number of objects per
>> > >>>>>> directory
>> > >>>> allowed.
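Assuming the usual FileStore split formula (split_multiple * merge_threshold * 16; hypothetical here, but the one commonly quoted), the suggested settings work out to:

```shell
merge_threshold=40
split_multiple=8
# Per-directory split point under the suggested settings.
split_point=$(( split_multiple * merge_threshold * 16 ))
echo "split beyond ${split_point} objects per directory"   # 5120
```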
>> > >>>>>>
>> > >>>>>> Another thing you may want to try is telling the kernel to
>> > >>>>>> greatly favor retaining dentries and inodes in cache:
>> > >>>>>>
>> > >>>>>> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
>> > >>>>>>
>> > >>>>>> Mark
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
>> > >>>>>>>
>> > >>>>>>> If I create a new pool it is generally fast for a short amount
>> > >>>>>>> of time. Not as fast as if I had a blank cluster, but close to it.
>> > >>>>>>>
>> > >>>>>>> Bryn
>> > >>>>>>>>
>> > >>>>>>>> On 8 Jul 2015, at 13:55, Gregory Farnum <greg@xxxxxxxxxxx>
>> > wrote:
>> > >>>>>>>>
>> > >>>>>>>> I think you're probably running into the internal
>> > >>>>>>>> PG/collection splitting here; try searching for those terms
>> > >>>>>>>> and seeing what your OSD folder structures look like. You
>> > >>>>>>>> could test by creating a new pool and seeing if it's faster
>> > >>>>>>>> or slower than the one you've already filled
>> > >>>> up.
>> > >>>>>>>> -Greg
>> > >>>>>>>>
>> > >>>>>>>> On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
>> > >>>>>>>> <bryn.mathias@xxxxxxxxxxxxxxxxxx> wrote:
>> > >>>>>>>>>
>> > >>>>>>>>> Hi All,
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> I’m perf testing a cluster again. This time I have re-built
>> > >>>>>>>>> the cluster and am filling it for testing.
>> > >>>>>>>>>
>> > >>>>>>>>> On a 10 min run I get the following results from 5 load
>> > >>>>>>>>> generators, each writing through 7 iocontexts, with a queue
>> > >>>>>>>>> depth of 50 async writes.
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> Gen1
>> > >>>>>>>>> Percentile 100 = 0.729775905609 Max latencies =
>> > >>>>>>>>> 0.729775905609, Min = 0.0320818424225, mean =
>> > >>>>>>>>> 0.0750389684542
>> > >>>>>>>>> Total objects written = 113088 in time 604.259738207s gives
>> > >>>>>>>>> 187.151307376/s (748.605229503 MB/s)
>> > >>>>>>>>>
>> > >>>>>>>>> Gen2
>> > >>>>>>>>> Percentile 100 = 0.735981941223 Max latencies =
>> > >>>>>>>>> 0.735981941223, Min = 0.0340068340302, mean =
>> > >>>>>>>>> 0.0745198070711
>> > >>>>>>>>> Total objects written = 113822 in time 604.437897921s gives
>> > >>>>>>>>> 188.310495407/s (753.241981627 MB/s)
>> > >>>>>>>>>
>> > >>>>>>>>> Gen3
>> > >>>>>>>>> Percentile 100 = 0.828994989395 Max latencies =
>> > >>>>>>>>> 0.828994989395, Min = 0.0349340438843, mean =
>> > >>>>>>>>> 0.0745455575197
>> > >>>>>>>>> Total objects written = 113670 in time 604.352181911s gives
>> > >>>>>>>>> 188.085694736/s (752.342778944 MB/s)
>> > >>>>>>>>>
>> > >>>>>>>>> Gen4
>> > >>>>>>>>> Percentile 100 = 1.06834602356 Max latencies =
>> > >>>>>>>>> 1.06834602356, Min = 0.0333499908447, mean =
>> > >>>>>>>>> 0.0752239764659
>> > >>>>>>>>> Total objects written = 112744 in time 604.408732891s gives
>> > >>>>>>>>> 186.536020849/s (746.144083397 MB/s)
>> > >>>>>>>>>
>> > >>>>>>>>> Gen5
>> > >>>>>>>>> Percentile 100 = 0.609658002853 Max latencies =
>> > >>>>>>>>> 0.609658002853, Min = 0.032968044281, mean =
>> > >>>>>>>>> 0.0744482759499
>> > >>>>>>>>> Total objects written = 113918 in time 604.671534061s gives
>> > >>>>>>>>> 188.396498897/s (753.585995589 MB/s)
>> > >>>>>>>>>
>> > >>>>>>>>> example ceph -w output:
>> > >>>>>>>>> 2015-07-07 15:50:16.507084 mon.0 [INF] pgmap v1077: 2880 pgs:
>> > >>>>>>>>> 2880 active+clean; 1996 GB data, 2515 GB used, 346 TB / 348 TB
>> > >>>>>>>>> avail; 2185 MB/s wr, 572 op/s
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> However, when the cluster gets over 20% full I see the
>> > >>>>>>>>> following results; this gets worse as the cluster fills up:
>> > >>>>>>>>>
>> > >>>>>>>>> Gen1
>> > >>>>>>>>> Percentile 100 = 6.71176099777 Max latencies =
>> > >>>>>>>>> 6.71176099777, Min = 0.0358741283417, mean =
>> > >>>>>>>>> 0.161760483485
>> > >>>>>>>>> Total objects written = 52196 in time 604.488474131s gives
>> > >>>>>>>>> 86.347386648/s
>> > >>>>>>>>> (345.389546592 MB/s)
>> > >>>>>>>>>
>> > >>>>>>>>> Gen2
>> > >>>>>>>>> Max latencies = 4.09169006348, Min = 0.0357890129089, mean =
>> > >>>>>>>>> 0.163243938477
>> > >>>>>>>>> Total objects written = 51702 in time 604.036739111s gives
>> > >>>>>>>>> 85.5941313704/s (342.376525482 MB/s)
>> > >>>>>>>>>
>> > >>>>>>>>> Gen3
>> > >>>>>>>>> Percentile 100 = 7.32526683807 Max latencies =
>> > >>>>>>>>> 7.32526683807, Min = 0.0366668701172, mean =
>> > >>>>>>>>> 0.163992217926
>> > >>>>>>>>> Total objects written = 51476 in time 604.684302092s gives
>> > >>>>>>>>> 85.1287189397/s (340.514875759 MB/s)
>> > >>>>>>>>>
>> > >>>>>>>>> Gen4
>> > >>>>>>>>> Percentile 100 = 7.56094503403 Max latencies =
>> > >>>>>>>>> 7.56094503403, Min = 0.0355761051178, mean =
>> > >>>>>>>>> 0.162109421231
>> > >>>>>>>>> Total objects written = 52092 in time 604.769910812s gives
>> > >>>>>>>>> 86.1352376642/s (344.540950657 MB/s)
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> Gen5
>> > >>>>>>>>> Percentile 100 = 6.99595499039 Max latencies =
>> > >>>>>>>>> 6.99595499039, Min = 0.0364680290222, mean =
>> > >>>>>>>>> 0.163651215426
>> > >>>>>>>>> Total objects written = 51566 in time 604.061977148s gives
>> > >>>>>>>>> 85.3654127404/s (341.461650961 MB/s)
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> Cluster details:
>> > >>>>>>>>> 5 * HP DL380s, each with 13 * 6 TB OSDs, 128 GB RAM, 2 * Intel 2620v3
>> > >>>>>>>>> 10 Gbit Ceph public network
>> > >>>>>>>>> 10 Gbit Ceph private network
>> > >>>>>>>>>
>> > >>>>>>>>> Load generators connected via a 20Gbit bond to the ceph
>> > >>>>>>>>> public
>> > >>>> network.
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> Is this likely to be something happening to the journals?
>> > >>>>>>>>>
>> > >>>>>>>>> Or is there something else going on?
>> > >>>>>>>>>
>> > >>>>>>>>> I have run FIO and iperf tests and the disk and network
>> > >>>>>>>>> performance is very high.
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> Kind Regards,
>> > >>>>>>>>> Bryn Mathias
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> _______________________________________________
>> > >>>>>>>>> ceph-users mailing list
>> > >>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>> > >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >>>>>>>
>> > >>>>>>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >
>> >
>>
>>
>>
>>
>
>
>
>