Mark,

Could you please elaborate on this?

"use larger directory splitting thresholds to at least balance that part of the equation out"

Thanks
Jan

> On 04 Sep 2015, at 15:31, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> There are a lot of factors that play into all of this. The more PGs you
> have, the more total objects you can store before you hit the thresholds.
> More PGs also means slightly better random distribution across OSDs (not
> really affected by the size of the OSD, assuming all OSDs are uniform).
> You have to be careful increasing the PG count though. I've tested about
> a million PGs and things more or less worked, but the mons were pretty
> laggy and I didn't test recovery. For small clusters I personally like to
> use more PGs than our guidelines indicate, and for very large clusters I
> suspect you might have to under-allocate but then probably use larger
> directory splitting thresholds to at least balance that part of the
> equation out.
>
> Mark
>
> On 09/04/2015 07:18 AM, Nick Fisk wrote:
>> Actually, just thinking about this some more, shouldn't the PGs per OSD
>> "golden rule" also depend on the size of the OSD? If this directory
>> splitting is a big deal then an 8TB OSD is going to need a lot more PGs
>> than, say, a 1TB OSD.
>>
>> Any thoughts Mark?
>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>>> Nick Fisk
>>> Sent: 04 September 2015 13:08
>>> To: 'Wang, Warren' <Warren_Wang@xxxxxxxxxxxxxxxxx>; 'Mark Nelson'
>>> <mnelson@xxxxxxxxxx>; 'Ben Hines' <bhines@xxxxxxxxx>
>>> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
>>> Subject: Re: Ceph performance, empty vs part full
>>>
>>> I've just made the same change (4 and 40 for now) on my cluster, which
>>> is a similar size to yours.
>>> I didn't see any merging happening, although most of the directories I
>>> looked at had more files in them than the new merge threshold, so I
>>> guess this is to be expected.
>>>
>>> I'm currently splitting my PGs from 1024 to 2048 to see if that helps to
>>> bring things back into order.
>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>>>> Of Wang, Warren
>>>> Sent: 04 September 2015 01:21
>>>> To: Mark Nelson <mnelson@xxxxxxxxxx>; Ben Hines <bhines@xxxxxxxxx>
>>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>>>> Subject: Re: Ceph performance, empty vs part full
>>>>
>>>> I'm about to change it on a big cluster too. It totals around 30
>>>> million, so I'm a bit nervous about changing it. As far as I
>>>> understood, it would indeed move them around, if you can get underneath
>>>> the threshold, but it may be hard to do. Two more settings that I
>>>> highly recommend changing on a big prod cluster. I'm in favor of
>>>> bumping these two up in the defaults.
>>>>
>>>> Warren
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>>>> Of Mark Nelson
>>>> Sent: Thursday, September 03, 2015 6:04 PM
>>>> To: Ben Hines <bhines@xxxxxxxxx>
>>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>>>> Subject: Re: Ceph performance, empty vs part full
>>>>
>>>> Hrm, I think it will follow the merge/split rules if it's out of whack
>>>> given the new settings, but I don't know that I've ever tested it on
>>>> an existing cluster to see that it actually happens. I guess let it
>>>> sit for a while and then check the OSD PG directories to see if the
>>>> object counts make sense given the new settings?
>>>> :D
>>>>
>>>> Mark
>>>>
>>>> On 09/03/2015 04:31 PM, Ben Hines wrote:
>>>>> Hey Mark,
>>>>>
>>>>> I've just tweaked these filestore settings for my cluster -- after
>>>>> changing this, is there a way to make ceph move existing objects
>>>>> around to new filestore locations, or will this only apply to newly
>>>>> created objects? (I would assume the latter.)
>>>>>
>>>>> thanks,
>>>>>
>>>>> -Ben
>>>>>
>>>>> On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>>>> Basically for each PG, there's a directory tree where only a
>>>>>> certain number of objects are allowed in a given directory before
>>>>>> it splits into new branches/leaves. The problem is that this has a
>>>>>> fair amount of overhead, and there are also extra dentry lookups
>>>>>> needed to get at any given object.
>>>>>>
>>>>>> You may want to try something like:
>>>>>>
>>>>>> "filestore merge threshold = 40"
>>>>>> "filestore split multiple = 8"
>>>>>>
>>>>>> This will dramatically increase the number of objects allowed per
>>>>>> directory.
>>>>>>
>>>>>> Another thing you may want to try is telling the kernel to greatly
>>>>>> favor retaining dentries and inodes in cache:
>>>>>>
>>>>>> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
>>>>>>>
>>>>>>> If I create a new pool it is generally fast for a short amount of
>>>>>>> time. Not as fast as if I had a blank cluster, but close to.
>>>>>>>
>>>>>>> Bryn
>>>>>>>>
>>>>>>>> On 8 Jul 2015, at 13:55, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> I think you're probably running into the internal PG/collection
>>>>>>>> splitting here; try searching for those terms and seeing what
>>>>>>>> your OSD folder structures look like. You could test by creating
>>>>>>>> a new pool and seeing if it's faster or slower than the one
>>>>>>>> you've already filled up.
>>>>>>>> -Greg
>>>>>>>>
>>>>>>>> On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
>>>>>>>> <bryn.mathias@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I’m perf testing a cluster again. This time I have rebuilt the
>>>>>>>>> cluster and am filling it for testing.
>>>>>>>>>
>>>>>>>>> On a 10 min run I get the following results from 5 load
>>>>>>>>> generators, each writing through 7 iocontexts, with a queue depth
>>>>>>>>> of 50 async writes.
>>>>>>>>>
>>>>>>>>> Gen1
>>>>>>>>> Percentile 100 = 0.729775905609
>>>>>>>>> Max latencies = 0.729775905609, Min = 0.0320818424225, mean = 0.0750389684542
>>>>>>>>> Total objects writen = 113088 in time 604.259738207s gives 187.151307376/s (748.605229503 MB/s)
>>>>>>>>>
>>>>>>>>> Gen2
>>>>>>>>> Percentile 100 = 0.735981941223
>>>>>>>>> Max latencies = 0.735981941223, Min = 0.0340068340302, mean = 0.0745198070711
>>>>>>>>> Total objects writen = 113822 in time 604.437897921s gives 188.310495407/s (753.241981627 MB/s)
>>>>>>>>>
>>>>>>>>> Gen3
>>>>>>>>> Percentile 100 = 0.828994989395
>>>>>>>>> Max latencies = 0.828994989395, Min = 0.0349340438843, mean = 0.0745455575197
>>>>>>>>> Total objects writen = 113670 in time 604.352181911s gives 188.085694736/s (752.342778944 MB/s)
>>>>>>>>>
>>>>>>>>> Gen4
>>>>>>>>> Percentile 100 = 1.06834602356
>>>>>>>>> Max latencies = 1.06834602356, Min = 0.0333499908447, mean = 0.0752239764659
>>>>>>>>> Total objects writen = 112744 in time 604.408732891s gives 186.536020849/s (746.144083397 MB/s)
>>>>>>>>>
>>>>>>>>> Gen5
>>>>>>>>> Percentile 100 = 0.609658002853
>>>>>>>>> Max latencies = 0.609658002853, Min = 0.032968044281, mean = 0.0744482759499
>>>>>>>>> Total objects writen = 113918 in time 604.671534061s gives 188.396498897/s (753.585995589 MB/s)
>>>>>>>>>
>>>>>>>>> example ceph -w output:
>>>>>>>>> 2015-07-07 15:50:16.507084 mon.0 [INF]
>>>>>>>>> pgmap v1077: 2880 pgs: 2880 active+clean; 1996 GB data, 2515 GB used, 346 TB / 348 TB avail; 2185 MB/s wr, 572 op/s
>>>>>>>>>
>>>>>>>>> However, when the cluster gets over 20% full I see the following
>>>>>>>>> results; this gets worse as the cluster fills up:
>>>>>>>>>
>>>>>>>>> Gen1
>>>>>>>>> Percentile 100 = 6.71176099777
>>>>>>>>> Max latencies = 6.71176099777, Min = 0.0358741283417, mean = 0.161760483485
>>>>>>>>> Total objects writen = 52196 in time 604.488474131s gives 86.347386648/s (345.389546592 MB/s)
>>>>>>>>>
>>>>>>>>> Gen2
>>>>>>>>> Max latencies = 4.09169006348, Min = 0.0357890129089, mean = 0.163243938477
>>>>>>>>> Total objects writen = 51702 in time 604.036739111s gives 85.5941313704/s (342.376525482 MB/s)
>>>>>>>>>
>>>>>>>>> Gen3
>>>>>>>>> Percentile 100 = 7.32526683807
>>>>>>>>> Max latencies = 7.32526683807, Min = 0.0366668701172, mean = 0.163992217926
>>>>>>>>> Total objects writen = 51476 in time 604.684302092s gives 85.1287189397/s (340.514875759 MB/s)
>>>>>>>>>
>>>>>>>>> Gen4
>>>>>>>>> Percentile 100 = 7.56094503403
>>>>>>>>> Max latencies = 7.56094503403, Min = 0.0355761051178, mean = 0.162109421231
>>>>>>>>> Total objects writen = 52092 in time 604.769910812s gives 86.1352376642/s (344.540950657 MB/s)
>>>>>>>>>
>>>>>>>>> Gen5
>>>>>>>>> Percentile 100 = 6.99595499039
>>>>>>>>> Max latencies = 6.99595499039, Min = 0.0364680290222, mean = 0.163651215426
>>>>>>>>> Total objects writen = 51566 in time 604.061977148s gives 85.3654127404/s (341.461650961 MB/s)
>>>>>>>>>
>>>>>>>>> Cluster details:
>>>>>>>>> 5 x HP DL380 with 13 x 6 TB OSDs
>>>>>>>>> 128 GB RAM
>>>>>>>>> 2 x Intel 2620v3
>>>>>>>>> 10 Gbit Ceph public network
>>>>>>>>> 10 Gbit Ceph private network
>>>>>>>>>
>>>>>>>>> Load generators connected via a 20Gbit bond to the
>>>>>>>>> ceph public network.
>>>>>>>>>
>>>>>>>>> Is this likely to be something happening to the journals?
>>>>>>>>>
>>>>>>>>> Or is there something else going on?
>>>>>>>>>
>>>>>>>>> I have run FIO and iperf tests and the disk and network
>>>>>>>>> performance is very high.
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>> Bryn Mathias
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list
>>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
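
For anyone trying to put numbers on the "directory splitting thresholds" discussed in this thread, here is a rough sketch. It assumes the rule as I understand it from the FileStore config docs: a PG subdirectory splits once it holds more than 16 * "filestore split multiple" * abs("filestore merge threshold") objects. The function names are made up for illustration. The 2880 PGs come from the ceph -w output quoted in the thread; the 4 MB object size is only inferred from the benchmark (about 748 MB/s at about 187 objects/s), and the defaults of 10 and 2 are as I remember them for that era, so please check both against your own cluster.

```python
def split_threshold(merge_threshold: int, split_multiple: int) -> int:
    """Objects a FileStore subdirectory may hold before it splits.

    Assumed rule: 16 * split_multiple * abs(merge_threshold).
    """
    return 16 * split_multiple * abs(merge_threshold)


def data_before_first_split_gb(pg_count: int, object_size_mb: float,
                               merge_threshold: int, split_multiple: int) -> float:
    """Rough amount of data (GB) a pool holds when PG directories start splitting.

    Assumes objects are spread evenly over PGs and are all the same size,
    which is only an approximation of a real cluster.
    """
    objects = pg_count * split_threshold(merge_threshold, split_multiple)
    return objects * object_size_mb / 1024


if __name__ == "__main__":
    # Assumed defaults (merge threshold 10, split multiple 2).
    print(split_threshold(10, 2))                      # 320
    # Values suggested in the thread.
    print(split_threshold(40, 8))                      # 5120
    # 2880 PGs with ~4 MB objects (object size is an inference, see above).
    print(data_before_first_split_gb(2880, 4, 10, 2))  # 3600.0
    print(data_before_first_split_gb(2880, 4, 40, 8))  # 57600.0
```

Under those assumptions, fewer PGs per OSD (or bigger OSDs) means each PG directory reaches the split point sooner, which is why raising the thresholds is one way to balance out under-allocating PGs on a large cluster.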