Actually, just thinking about this some more: shouldn't the PGs-per-OSD "golden rule" also depend on the size of the OSD? If this directory splitting is a big deal, then an 8 TB OSD is going to need a lot more PGs than, say, a 1 TB OSD. Any thoughts, Mark?

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Nick Fisk
> Sent: 04 September 2015 13:08
> To: 'Wang, Warren' <Warren_Wang@xxxxxxxxxxxxxxxxx>; 'Mark Nelson'
> <mnelson@xxxxxxxxxx>; 'Ben Hines' <bhines@xxxxxxxxx>
> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Ceph performance, empty vs part full
>
> I've just made the same change (4 and 40 for now) on my cluster, which is a
> similar size to yours. I didn't see any merging happening, although most of
> the directories I looked at held more files than the new merge threshold, so
> I guess this is to be expected.
>
> I'm currently splitting my PGs from 1024 to 2048 to see if that helps to
> bring things back into order.
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > Of Wang, Warren
> > Sent: 04 September 2015 01:21
> > To: Mark Nelson <mnelson@xxxxxxxxxx>; Ben Hines <bhines@xxxxxxxxx>
> > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re: Ceph performance, empty vs part full
> >
> > I'm about to change it on a big cluster too. It totals around 30
> > million objects, so I'm a bit nervous about changing it. As far as I
> > understood, it would indeed move them around if you can get underneath
> > the threshold, but that may be hard to do. These are two more settings
> > that I highly recommend changing on a big prod cluster; I'm in favor of
> > bumping both of them up in the defaults.
> >
> > Warren
> >
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > Of Mark Nelson
> > Sent: Thursday, September 03, 2015 6:04 PM
> > To: Ben Hines <bhines@xxxxxxxxx>
> > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re: Ceph performance, empty vs part full
> >
> > Hrm, I think it will follow the merge/split rules if it's out of whack
> > given the new settings, but I don't know that I've ever tested it on
> > an existing cluster to see that it actually happens. I guess let it
> > sit for a while and then check the OSD PG directories to see if the
> > object counts make sense given the new settings? :D
> >
> > Mark
> >
> > On 09/03/2015 04:31 PM, Ben Hines wrote:
> > > Hey Mark,
> > >
> > > I've just tweaked these filestore settings for my cluster -- after
> > > changing this, is there a way to make Ceph move existing objects
> > > around to new filestore locations, or will this only apply to newly
> > > created objects? (I would assume the latter...)
> > >
> > > thanks,
> > >
> > > -Ben
> > >
> > > On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> > >> Basically, for each PG there's a directory tree where only a
> > >> certain number of objects are allowed in a given directory before
> > >> it splits into new branches/leaves. The problem is that this has a
> > >> fair amount of overhead, and there are also extra associated dentry
> > >> lookups to get at any given object.
> > >>
> > >> You may want to try something like:
> > >>
> > >> "filestore merge threshold = 40"
> > >> "filestore split multiple = 8"
> > >>
> > >> This will dramatically increase the number of objects allowed per
> > >> directory.
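> > >>
> > >> As a rough sketch of the math (this is the documented filestore
> > >> behaviour, but double-check it against your release): a PG
> > >> subdirectory is split once it holds more than
> > >>
> > >>     filestore_split_multiple * abs(filestore_merge_threshold) * 16
> > >>
> > >> objects, so the values above allow 8 * 40 * 16 = 5120 objects per
> > >> directory before a split, versus 2 * 10 * 16 = 320 with the stock
> > >> defaults of 2 and 10. In ceph.conf that would look like:
> > >>
> > >>     [osd]
> > >>     filestore merge threshold = 40
> > >>     filestore split multiple = 8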
> > >>
> > >> Another thing you may want to try is telling the kernel to greatly
> > >> favor retaining dentries and inodes in cache:
> > >>
> > >> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
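> > >>
> > >> Bear in mind that a write to /proc/sys only lasts until the next
> > >> reboot; the usual way to make it persistent is a sysctl drop-in
> > >> (the file name below is just an example):
> > >>
> > >>     echo "vm.vfs_cache_pressure = 1" | sudo tee /etc/sysctl.d/90-vfs-cache.conf
> > >>     sudo sysctl --system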
> > >>
> > >> Mark
> > >>
> > >> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
> > >>>
> > >>> If I create a new pool, it is generally fast for a short amount of
> > >>> time. Not as fast as if I had a blank cluster, but close to it.
> > >>>
> > >>> Bryn
> > >>>>
> > >>>> On 8 Jul 2015, at 13:55, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> > >>>>
> > >>>> I think you're probably running into the internal PG/collection
> > >>>> splitting here; try searching for those terms and seeing what
> > >>>> your OSD folder structures look like. You could test by creating
> > >>>> a new pool and seeing if it's faster or slower than the one
> > >>>> you've already filled up.
> > >>>> -Greg
> > >>>>
> > >>>> On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
> > >>>> <bryn.mathias@xxxxxxxxxxxxxxxxxx> wrote:
> > >>>>>
> > >>>>> Hi All,
> > >>>>>
> > >>>>> I'm perf testing a cluster again. This time I have re-built the
> > >>>>> cluster and am filling it for testing.
> > >>>>>
> > >>>>> On a 10 min run I get the following results from 5 load
> > >>>>> generators, each writing through 7 IO contexts with a queue
> > >>>>> depth of 50 async writes.
> > >>>>>
> > >>>>> Gen1
> > >>>>> Percentile 100 = 0.729775905609
> > >>>>> Max latencies = 0.729775905609, Min = 0.0320818424225, mean =
> > >>>>> 0.0750389684542
> > >>>>> Total objects written = 113088 in time 604.259738207s gives
> > >>>>> 187.151307376/s (748.605229503 MB/s)
> > >>>>>
> > >>>>> Gen2
> > >>>>> Percentile 100 = 0.735981941223
> > >>>>> Max latencies = 0.735981941223, Min = 0.0340068340302, mean =
> > >>>>> 0.0745198070711
> > >>>>> Total objects written = 113822 in time 604.437897921s gives
> > >>>>> 188.310495407/s (753.241981627 MB/s)
> > >>>>>
> > >>>>> Gen3
> > >>>>> Percentile 100 = 0.828994989395
> > >>>>> Max latencies = 0.828994989395, Min = 0.0349340438843, mean =
> > >>>>> 0.0745455575197
> > >>>>> Total objects written = 113670 in time 604.352181911s gives
> > >>>>> 188.085694736/s (752.342778944 MB/s)
> > >>>>>
> > >>>>> Gen4
> > >>>>> Percentile 100 = 1.06834602356
> > >>>>> Max latencies = 1.06834602356, Min = 0.0333499908447, mean =
> > >>>>> 0.0752239764659
> > >>>>> Total objects written = 112744 in time 604.408732891s gives
> > >>>>> 186.536020849/s (746.144083397 MB/s)
> > >>>>>
> > >>>>> Gen5
> > >>>>> Percentile 100 = 0.609658002853
> > >>>>> Max latencies = 0.609658002853, Min = 0.032968044281, mean =
> > >>>>> 0.0744482759499
> > >>>>> Total objects written = 113918 in time 604.671534061s gives
> > >>>>> 188.396498897/s (753.585995589 MB/s)
> > >>>>>
> > >>>>> Example ceph -w output:
> > >>>>> 2015-07-07 15:50:16.507084 mon.0 [INF] pgmap v1077: 2880 pgs:
> > >>>>> 2880 active+clean; 1996 GB data, 2515 GB used, 346 TB / 348 TB
> > >>>>> avail; 2185 MB/s wr, 572 op/s
> > >>>>>
> > >>>>> However, when the cluster gets over 20% full I see the following
> > >>>>> results, and this gets worse as the cluster fills up:
> > >>>>>
> > >>>>> Gen1
> > >>>>> Percentile 100 = 6.71176099777
> > >>>>> Max latencies = 6.71176099777, Min = 0.0358741283417, mean =
> > >>>>> 0.161760483485
> > >>>>> Total objects written = 52196 in time 604.488474131s gives
> > >>>>> 86.347386648/s (345.389546592 MB/s)
> > >>>>>
> > >>>>> Gen2
> > >>>>> Max latencies = 4.09169006348, Min = 0.0357890129089, mean =
> > >>>>> 0.163243938477
> > >>>>> Total objects written = 51702 in time 604.036739111s gives
> > >>>>> 85.5941313704/s (342.376525482 MB/s)
> > >>>>>
> > >>>>> Gen3
> > >>>>> Percentile 100 = 7.32526683807
> > >>>>> Max latencies = 7.32526683807, Min = 0.0366668701172, mean =
> > >>>>> 0.163992217926
> > >>>>> Total objects written = 51476 in time 604.684302092s gives
> > >>>>> 85.1287189397/s (340.514875759 MB/s)
> > >>>>>
> > >>>>> Gen4
> > >>>>> Percentile 100 = 7.56094503403
> > >>>>> Max latencies = 7.56094503403, Min = 0.0355761051178, mean =
> > >>>>> 0.162109421231
> > >>>>> Total objects written = 52092 in time 604.769910812s gives
> > >>>>> 86.1352376642/s (344.540950657 MB/s)
> > >>>>>
> > >>>>> Gen5
> > >>>>> Percentile 100 = 6.99595499039
> > >>>>> Max latencies = 6.99595499039, Min = 0.0364680290222, mean =
> > >>>>> 0.163651215426
> > >>>>> Total objects written = 51566 in time 604.061977148s gives
> > >>>>> 85.3654127404/s (341.461650961 MB/s)
> > >>>>>
> > >>>>> Cluster details:
> > >>>>> 5 x HP DL380, each with 13 x 6 TB OSDs
> > >>>>> 128 GB RAM
> > >>>>> 2 x Intel 2620 v3
> > >>>>> 10 Gbit Ceph public network
> > >>>>> 10 Gbit Ceph private network
> > >>>>>
> > >>>>> Load generators are connected via a 20 Gbit bond to the Ceph
> > >>>>> public network.
> > >>>>>
> > >>>>> Is this likely to be something happening to the journals?
> > >>>>> Or is there something else going on?
> > >>>>>
> > >>>>> I have run FIO and iperf tests, and the disk and network
> > >>>>> performance is very high.
> > >>>>>
> > >>>>> Kind Regards,
> > >>>>> Bryn Mathias

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com