Re: Ceph performance, empty vs part full


 



> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Nick Fisk
> Sent: 06 September 2015 15:11
> To: 'Shinobu Kinjo' <skinjo@xxxxxxxxxx>; 'GuangYang'
> <yguang11@xxxxxxxxxxx>
> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>; 'Nick Fisk' <nick@xxxxxxxxxx>
> Subject: Re:  Ceph performance, empty vs part full
> 
> Just a quick update after upping the thresholds: not much happened. This is
> probably because the merge threshold is several times lower than the trigger
> for the split. So I have now bumped the merge threshold up to 1000
> temporarily, to hopefully force some DIRs to merge.
> 
> I believe this has started to happen, but it only seems to merge right at the
> bottom of the tree.
> 
> Eg
> 
> /var/lib/ceph/osd/ceph-1/current/0.106_head/DIR_6/DIR_0/DIR_1/
> 
> All the directories have only one subdirectory in them; DIR_1 is the only one
> in the path that has any objects in it. Is this the correct behaviour? Is there
> any impact from having these deeper paths compared to when the objects are
> just in the root directory?
> 
> I guess the only real way to get the objects back into the root would be to
> out->drain->in the OSD?
> 
> 
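(A quick aside on the out -> drain -> in idea above, in case anyone wants the concrete steps: something like the below is what I have in mind, assuming the standard ceph CLI and osd.1 as an example; a sketch, not a tested procedure.)

# Stop mapping data to osd.1 and let recovery move its PGs elsewhere:
ceph osd out 1

# Watch until recovery finishes and all PGs are active+clean again:
ceph -s

# Bring the OSD back in; the data backfills into freshly created PG
# directories, which should then reflect the current split/merge settings:
ceph osd in 1
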
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > Of Shinobu Kinjo
> > Sent: 05 September 2015 01:42
> > To: GuangYang <yguang11@xxxxxxxxxxx>
> > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>; Nick Fisk
> > <nick@xxxxxxxxxx>
> > Subject: Re:  Ceph performance, empty vs part full
> >
> > Very nice.
> > You're my hero!
> >
> >  Shinobu
> >
> > ----- Original Message -----
> > From: "GuangYang" <yguang11@xxxxxxxxxxx>
> > To: "Shinobu Kinjo" <skinjo@xxxxxxxxxx>
> > Cc: "Ben Hines" <bhines@xxxxxxxxx>, "Nick Fisk" <nick@xxxxxxxxxx>,
> > "ceph- users" <ceph-users@xxxxxxxxxxxxxx>
> > Sent: Saturday, September 5, 2015 9:40:06 AM
> > Subject: RE:  Ceph performance, empty vs part full
> >
> > ----------------------------------------
> > > Date: Fri, 4 Sep 2015 20:31:59 -0400
> > > From: skinjo@xxxxxxxxxx
> > > To: yguang11@xxxxxxxxxxx
> > > CC: bhines@xxxxxxxxx; nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re:  Ceph performance, empty vs part full
> > >
> > >> IIRC, it only triggers the move (merge or split) when that folder
> > >> is hit by a
> > request, so most likely it happens gradually.
> > >
> > > Do you know what causes this?
> > A request (read/write/setxattr, etc.) hitting objects in that folder.
> > > I would like to be more clear about "gradually".


Does anyone know if a scrub counts as one of these requests? I have kicked off a deep scrub of an OSD and yet I still don't see any merging happening, even with the merge threshold set to 1000.

Example (files per directory for one PG):
/var/lib/ceph/osd/ceph-0/current/0.108_head : 0 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8 : 0 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0 : 0 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1 : 15 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_4 : 85 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_B : 63 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_D : 88 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_8 : 73 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_0 : 77 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_6 : 79 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_3 : 67 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_E : 94 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_C : 91 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_A : 88 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_5 : 96 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_2 : 88 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_9 : 70 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_1 : 95 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_7 : 87 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_F : 88 files
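
In case anyone wants to reproduce this kind of listing, something along these lines will do it; plain find, nothing Ceph-specific, with the paths for osd.0 / PG 0.108 as shown above:

# Count regular files per directory under one PG's _head dir; each
# subdirectory is counted separately, so nothing is double-counted:
find /var/lib/ceph/osd/ceph-0/current/0.108_head -type d | while read -r d; do
    printf '%s : %s files\n' "$d" "$(find "$d" -maxdepth 1 -type f | wc -l)"
done

# A deep scrub can be requested per OSD or per single PG, e.g.:
ceph osd deep-scrub 0
ceph pg deep-scrub 0.108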



> > >
> > > Shinobu
> > >
> > > ----- Original Message -----
> > > From: "GuangYang" <yguang11@xxxxxxxxxxx>
> > > To: "Ben Hines" <bhines@xxxxxxxxx>, "Nick Fisk" <nick@xxxxxxxxxx>
> > > Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> > > Sent: Saturday, September 5, 2015 9:27:31 AM
> > > Subject: Re:  Ceph performance, empty vs part full
> > >
> > > IIRC, it only triggers the move (merge or split) when that folder is
> > > hit by a
> > request, so most likely it happens gradually.
> > >
> > > Another thing that might be helpful (and that we have had good experience
> > > with) is to do the folder splitting at pool creation time, so that we
> > > avoid the performance impact of runtime splitting (which is high if
> > > you have a large cluster). In order to do that:
> > >
> > > 1. You will need to configure "filestore merge threshold" with a negative
> > > value so that merging is disabled.
> > > 2. When creating the pool, there is a parameter named
> > > "expected_num_objects"; by specifying that number, the folders will be
> > > split to the right level at pool creation time.
> > >
> > > Hope that helps.
> > >
> > > Thanks,
> > > Guang
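
For anyone who wants to try what Guang describes above, my reading of the docs is that it looks roughly like the below. Treat it as a sketch: the option names are the standard filestore ones, but the exact positional syntax of "ceph osd pool create" (and whether you must name the crush rule before the object count) varies between releases, so check "ceph osd pool create -h" first. The pool name, PG counts, rule name and object count here are all just placeholders.

# ceph.conf on the OSD hosts; a negative merge threshold disables merging:
#   [osd]
#   filestore merge threshold = -10
#   filestore split multiple = 8

# Create the pool pre-split to the right depth by passing the expected
# object count at creation time:
ceph osd pool create mypool 2048 2048 replicated replicated_ruleset 100000000
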
> > >
> > >
> > > ----------------------------------------
> > >> From: bhines@xxxxxxxxx
> > >> Date: Fri, 4 Sep 2015 12:05:26 -0700
> > >> To: nick@xxxxxxxxxx
> > >> CC: ceph-users@xxxxxxxxxxxxxx
> > >> Subject: Re:  Ceph performance, empty vs part full
> > >>
> > >> Yeah, I'm not seeing stuff being moved at all. Perhaps we should
> > >> file a ticket to request a way to tell an OSD to rebalance its
> > >> directory structure.
> > >>
> > >> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > >>> I've just made the same change (4 and 40 for now) on my cluster,
> > >>> which is a similar size to yours. I didn't see any merging
> > >>> happening, although most of the directories I looked at had more
> > >>> files in them than the new merge threshold, so I guess this is to be
> > >>> expected.
> > >>>
> > >>> I'm currently splitting my PGs from 1024 to 2048 to see if that
> > >>> helps to bring things back into order.
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > >>>> Behalf Of Wang, Warren
> > >>>> Sent: 04 September 2015 01:21
> > >>>> To: Mark Nelson <mnelson@xxxxxxxxxx>; Ben Hines
> > <bhines@xxxxxxxxx>
> > >>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > >>>> Subject: Re:  Ceph performance, empty vs part full
> > >>>>
> > >>>> I'm about to change it on a big cluster too. It totals around 30
> > >>>> million objects, so I'm a bit nervous about changing it. As far as I
> > >>>> understand, it would indeed move them around if you can get
> > >>>> underneath the threshold, but that may be hard to do. These are two
> > >>>> more settings that I highly recommend changing on a big prod cluster;
> > >>>> I'm in favor of bumping these two up in the defaults.
> > >>>>
> > >>>> Warren
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > >>>> Behalf Of Mark Nelson
> > >>>> Sent: Thursday, September 03, 2015 6:04 PM
> > >>>> To: Ben Hines <bhines@xxxxxxxxx>
> > >>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > >>>> Subject: Re:  Ceph performance, empty vs part full
> > >>>>
> > >>>> Hrm, I think it will follow the merge/split rules if it's out of
> > >>>> whack given the new settings, but I don't know that I've ever
> > >>>> tested it on an existing cluster to see that it actually happens.
> > >>>> I guess let it sit for a while and then check the OSD PG
> > >>>> directories to see if the object counts make sense given the new
> > >>>> settings? :D
> > >>>>
> > >>>> Mark
> > >>>>
> > >>>> On 09/03/2015 04:31 PM, Ben Hines wrote:
> > >>>>> Hey Mark,
> > >>>>>
> > >>>>> I've just tweaked these filestore settings for my cluster --
> > >>>>> after changing this, is there a way to make ceph move existing
> > >>>>> objects around to new filestore locations, or will this only
> > >>>>> apply to newly created objects? (I would assume the latter...)
> > >>>>>
> > >>>>> thanks,
> > >>>>>
> > >>>>> -Ben
> > >>>>>
> > >>>>> On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson
> <mnelson@xxxxxxxxxx>
> > >>>> wrote:
> > >>>>>> Basically, for each PG there's a directory tree where only a
> > >>>>>> certain number of objects are allowed in a given directory
> > >>>>>> before it splits into new branches/leaves. The problem is that
> > >>>>>> this has a fair amount of overhead, and there are also extra
> > >>>>>> dentry lookups involved in getting at any given object.
> > >>>>>>
> > >>>>>> You may want to try something like:
> > >>>>>>
> > >>>>>> "filestore merge threshold = 40"
> > >>>>>> "filestore split multiple = 8"
> > >>>>>>
> > >>>>>> This will dramatically increase the number of objects per
> > >>>>>> directory allowed.
> > >>>>>>
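
(For what it's worth, my understanding of how those two settings combine, which is worth checking against the docs for your release, is that a directory is split once it holds more than filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects, so with the values suggested above:)

# 8 * 40 * 16 objects per directory before a split is triggered:
echo $(( 8 * 40 * 16 ))    # prints 5120
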
> > >>>>>> Another thing you may want to try is telling the kernel to
> > >>>>>> greatly favor retaining dentries and inodes in cache:
> > >>>>>>
> > >>>>>> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
> > >>>>>>
> > >>>>>> Mark
> > >>>>>>
> > >>>>>>
> > >>>>>> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
> > >>>>>>>
> > >>>>>>> If I create a new pool, it is generally fast for a short amount of
> > >>>>>>> time. Not as fast as if I had a blank cluster, but close to it.
> > >>>>>>>
> > >>>>>>> Bryn
> > >>>>>>>>
> > >>>>>>>> On 8 Jul 2015, at 13:55, Gregory Farnum <greg@xxxxxxxxxxx>
> > wrote:
> > >>>>>>>>
> > >>>>>>>> I think you're probably running into the internal
> > >>>>>>>> PG/collection splitting here; try searching for those terms
> > >>>>>>>> and seeing what your OSD folder structures look like. You
> > >>>>>>>> could test by creating a new pool and seeing if it's faster
> > >>>>>>>> or slower than the one you've already filled up.
> > >>>>>>>> -Greg
> > >>>>>>>>
> > >>>>>>>> On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
> > >>>>>>>> <bryn.mathias@xxxxxxxxxxxxxxxxxx> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Hi All,
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> I’m perf testing a cluster again. This time I have re-built
> > >>>>>>>>> the cluster and am filling it for testing.
> > >>>>>>>>>
> > >>>>>>>>> On a 10 min run I get the following results from 5 load
> > >>>>>>>>> generators, each writing through 7 iocontexts, with a queue
> > >>>>>>>>> depth of 50 async writes.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Gen1
> > >>>>>>>>> Percentile 100 = 0.729775905609 Max latencies =
> > >>>>>>>>> 0.729775905609, Min = 0.0320818424225, mean =
> > >>>>>>>>> 0.0750389684542
> > >>>>>>>>> Total objects writen = 113088 in time 604.259738207s gives
> > >>>>>>>>> 187.151307376/s (748.605229503 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen2
> > >>>>>>>>> Percentile 100 = 0.735981941223 Max latencies =
> > >>>>>>>>> 0.735981941223, Min = 0.0340068340302, mean =
> > >>>>>>>>> 0.0745198070711
> > >>>>>>>>> Total objects writen = 113822 in time 604.437897921s gives
> > >>>>>>>>> 188.310495407/s (753.241981627 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen3
> > >>>>>>>>> Percentile 100 = 0.828994989395 Max latencies =
> > >>>>>>>>> 0.828994989395, Min = 0.0349340438843, mean =
> > >>>>>>>>> 0.0745455575197
> > >>>>>>>>> Total objects writen = 113670 in time 604.352181911s gives
> > >>>>>>>>> 188.085694736/s (752.342778944 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen4
> > >>>>>>>>> Percentile 100 = 1.06834602356 Max latencies =
> > >>>>>>>>> 1.06834602356, Min = 0.0333499908447, mean =
> > >>>>>>>>> 0.0752239764659
> > >>>>>>>>> Total objects writen = 112744 in time 604.408732891s gives
> > >>>>>>>>> 186.536020849/s (746.144083397 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen5
> > >>>>>>>>> Percentile 100 = 0.609658002853 Max latencies =
> > >>>>>>>>> 0.609658002853, Min = 0.032968044281, mean =
> > >>>>>>>>> 0.0744482759499
> > >>>>>>>>> Total objects writen = 113918 in time 604.671534061s gives
> > >>>>>>>>> 188.396498897/s (753.585995589 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> example ceph -w output:
> > >>>>>>>>> 2015-07-07 15:50:16.507084 mon.0 [INF] pgmap v1077: 2880 pgs:
> > >>>>>>>>> 2880 active+clean; 1996 GB data, 2515 GB used, 346 TB / 348 TB
> > >>>>>>>>> avail; 2185 MB/s wr, 572 op/s
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> However, when the cluster gets over 20% full I see the
> > >>>>>>>>> following results; this gets worse as the cluster fills up:
> > >>>>>>>>>
> > >>>>>>>>> Gen1
> > >>>>>>>>> Percentile 100 = 6.71176099777 Max latencies =
> > >>>>>>>>> 6.71176099777, Min = 0.0358741283417, mean =
> > >>>>>>>>> 0.161760483485
> > >>>>>>>>> Total objects writen = 52196 in time 604.488474131s gives
> > >>>>>>>>> 86.347386648/s
> > >>>>>>>>> (345.389546592 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen2
> > >>>>>>>>> Max latencies = 4.09169006348, Min = 0.0357890129089, mean =
> > >>>>>>>>> 0.163243938477
> > >>>>>>>>> Total objects writen = 51702 in time 604.036739111s gives
> > >>>>>>>>> 85.5941313704/s (342.376525482 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen3
> > >>>>>>>>> Percentile 100 = 7.32526683807 Max latencies =
> > >>>>>>>>> 7.32526683807, Min = 0.0366668701172, mean =
> > >>>>>>>>> 0.163992217926
> > >>>>>>>>> Total objects writen = 51476 in time 604.684302092s gives
> > >>>>>>>>> 85.1287189397/s (340.514875759 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen4
> > >>>>>>>>> Percentile 100 = 7.56094503403 Max latencies =
> > >>>>>>>>> 7.56094503403, Min = 0.0355761051178, mean =
> > >>>>>>>>> 0.162109421231
> > >>>>>>>>> Total objects writen = 52092 in time 604.769910812s gives
> > >>>>>>>>> 86.1352376642/s (344.540950657 MB/s)
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Gen5
> > >>>>>>>>> Percentile 100 = 6.99595499039 Max latencies =
> > >>>>>>>>> 6.99595499039, Min = 0.0364680290222, mean =
> > >>>>>>>>> 0.163651215426
> > >>>>>>>>> Total objects writen = 51566 in time 604.061977148s gives
> > >>>>>>>>> 85.3654127404/s (341.461650961 MB/s)
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Cluster details:
> > >>>>>>>>> 5 x HP DL380 with 13 x 6 TB OSDs, 128 GB RAM, 2 x Intel 2620 v3
> > >>>>>>>>> 10 Gbit Ceph public network
> > >>>>>>>>> 10 Gbit Ceph private network
> > >>>>>>>>>
> > >>>>>>>>> Load generators connected via a 20 Gbit bond to the Ceph public
> > >>>>>>>>> network.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Is this likely to be something happening to the journals?
> > >>>>>>>>>
> > >>>>>>>>> Or is there something else going on?
> > >>>>>>>>>
> > >>>>>>>>> I have run FIO and iperf tests and the disk and network
> > >>>>>>>>> performance is very high.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Kind Regards,
> > >>>>>>>>> Bryn Mathias
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



