> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
> Sent: 06 September 2015 15:11
> To: 'Shinobu Kinjo' <skinjo@xxxxxxxxxx>; 'GuangYang' <yguang11@xxxxxxxxxxx>
> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>; 'Nick Fisk' <nick@xxxxxxxxxx>
> Subject: Re: Ceph performance, empty vs part full
>
> Just a quick update after up'ing the thresholds: not much happened. This is probably because the merge threshold is several times less than the trigger for the split. So I have now bumped the merge threshold up to 1000 temporarily to hopefully force some DIRs to merge.
>
> I believe this has started to happen, but it only seems to merge right at the bottom of the tree.
>
> E.g.
>
> /var/lib/ceph/osd/ceph-1/current/0.106_head/DIR_6/DIR_0/DIR_1/
>
> All the directories only have 1 directory in them, and DIR_1 is the only one in the path that has any objects in it. Is this the correct behaviour? Is there any impact from having these deeper paths compared to when the objects are just in the root directory?
>
> I guess the only real way to get the objects back into the root would be to out->drain->in the OSD?
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Shinobu Kinjo
> > Sent: 05 September 2015 01:42
> > To: GuangYang <yguang11@xxxxxxxxxxx>
> > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>; Nick Fisk <nick@xxxxxxxxxx>
> > Subject: Re: Ceph performance, empty vs part full
> >
> > Very nice.
> > You're my hero!
> >
> > Shinobu
> >
> > ----- Original Message -----
> > From: "GuangYang" <yguang11@xxxxxxxxxxx>
> > To: "Shinobu Kinjo" <skinjo@xxxxxxxxxx>
> > Cc: "Ben Hines" <bhines@xxxxxxxxx>, "Nick Fisk" <nick@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> > Sent: Saturday, September 5, 2015 9:40:06 AM
> > Subject: RE: Ceph performance, empty vs part full
> >
> > ----------------------------------------
> > > Date: Fri, 4 Sep 2015 20:31:59 -0400
> > > From: skinjo@xxxxxxxxxx
> > > To: yguang11@xxxxxxxxxxx
> > > CC: bhines@xxxxxxxxx; nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re: Ceph performance, empty vs part full
> > >
> > >> IIRC, it only triggers the move (merge or split) when that folder is hit by a request, so most likely it happens gradually.
> > >
> > > Do you know what causes this?
> > A request (read/write/setxattr, etc.) hitting objects in that folder.
> > > I would like to be more clear about "gradually".

Does anyone know if a scrub is included in this? I have kicked off a deep scrub of an OSD and yet I still don't see merging happening, even with a merge threshold of 1000.
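For reference, the per-directory counts in the example below were gathered with a quick loop along these lines (a rough sketch rather than the exact commands, and the PG path is just an example):

    # count the plain files (objects) sitting directly in each directory under one PG
    PG=/var/lib/ceph/osd/ceph-0/current/0.108_head
    find "$PG" -type d | sort | while read -r d; do
        printf '%s : %s files\n' "$d" "$(find "$d" -maxdepth 1 -type f | wc -l)"
    done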
Example

/var/lib/ceph/osd/ceph-0/current/0.108_head : 0 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8 : 0 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0 : 0 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1 : 15 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_4 : 85 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_B : 63 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_D : 88 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_8 : 73 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_0 : 77 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_6 : 79 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_3 : 67 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_E : 94 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_C : 91 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_A : 88 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_5 : 96 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_2 : 88 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_9 : 70 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_1 : 95 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_7 : 87 files
/var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_F : 88 files

> > >
> > > Shinobu
> > >
> > > ----- Original Message -----
> > > From: "GuangYang" <yguang11@xxxxxxxxxxx>
> > > To: "Ben Hines" <bhines@xxxxxxxxx>, "Nick Fisk" <nick@xxxxxxxxxx>
> > > Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> > > Sent: Saturday, September 5, 2015 9:27:31 AM
> > > Subject: Re: Ceph performance, empty vs part full
> > >
> > > IIRC, it only triggers the move (merge or split) when that folder is hit by a request, so most likely it happens gradually.
> > >
> > > Another thing that might be helpful (and that we have had good experience with) is to do the folder splitting at pool creation time, so that we avoid the performance impact of runtime splitting (which is high if you have a large cluster). In order to do that:
> > >
> > > 1. You will need to configure "filestore merge threshold" with a negative value so that it disables merging.
> > > 2. When creating the pool, there is a parameter named "expected_num_objects"; by specifying that number, the folders will be split to the right level at pool creation.
> > >
> > > Hope that helps.
> > >
> > > Thanks,
> > > Guang
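(As an aside, for anyone wanting to try the pre-split approach Guang describes above, my understanding is that it looks roughly like the sketch below. The pool name, PG counts, ruleset name and object count are made-up examples, and it is worth checking "ceph osd pool create --help" on your release for the exact argument order. Treat it as a starting point rather than a tested recipe.

    # ceph.conf on the OSD hosts: a negative merge threshold disables merging entirely
    [osd]
    filestore merge threshold = -40
    filestore split multiple = 8

    # create the pool with an expected object count so the PG directories are pre-split up front
    ceph osd pool create mypool 2048 2048 replicated replicated_ruleset 1000000000
)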
> > >
> > > ----------------------------------------
> > >> From: bhines@xxxxxxxxx
> > >> Date: Fri, 4 Sep 2015 12:05:26 -0700
> > >> To: nick@xxxxxxxxxx
> > >> CC: ceph-users@xxxxxxxxxxxxxx
> > >> Subject: Re: Ceph performance, empty vs part full
> > >>
> > >> Yeah, I'm not seeing stuff being moved at all. Perhaps we should file a ticket to request a way to tell an OSD to rebalance its directory structure.
> > >>
> > >> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > >>> I've just made the same change (4 and 40 for now) on my cluster, which is a similar size to yours. I didn't see any merging happening, although most of the directories I looked at had more files in them than the new merge threshold, so I guess this is to be expected.
> > >>>
> > >>> I'm currently splitting my PGs from 1024 to 2048 to see if that helps to bring things back into order.
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Wang, Warren
> > >>>> Sent: 04 September 2015 01:21
> > >>>> To: Mark Nelson <mnelson@xxxxxxxxxx>; Ben Hines <bhines@xxxxxxxxx>
> > >>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > >>>> Subject: Re: Ceph performance, empty vs part full
> > >>>>
> > >>>> I'm about to change it on a big cluster too. It totals around 30 million, so I'm a bit nervous about changing it. As far as I understood, it would indeed move them around if you can get underneath the threshold, but that may be hard to do. These are two more settings that I highly recommend changing on a big prod cluster, and I'm in favor of bumping them up in the defaults.
> > >>>>
> > >>>> Warren
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> > >>>> Sent: Thursday, September 03, 2015 6:04 PM
> > >>>> To: Ben Hines <bhines@xxxxxxxxx>
> > >>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > >>>> Subject: Re: Ceph performance, empty vs part full
> > >>>>
> > >>>> Hrm, I think it will follow the merge/split rules if it's out of whack given the new settings, but I don't know that I've ever tested it on an existing cluster to see that it actually happens. I guess let it sit for a while and then check the OSD PG directories to see if the object counts make sense given the new settings? :D
> > >>>>
> > >>>> Mark
> > >>>>
> > >>>> On 09/03/2015 04:31 PM, Ben Hines wrote:
> > >>>>> Hey Mark,
> > >>>>>
> > >>>>> I've just tweaked these filestore settings for my cluster -- after changing this, is there a way to make Ceph move existing objects around to new filestore locations, or will this only apply to newly created objects? (I would assume the latter...)
> > >>>>>
> > >>>>> thanks,
> > >>>>>
> > >>>>> -Ben
> > >>>>>
> > >>>>> On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> > >>>>>> Basically for each PG, there's a directory tree where only a certain number of objects are allowed in a given directory before it splits into new branches/leaves. The problem is that this has a fair amount of overhead, and there are also extra associated dentry lookups to get at any given object.
> > >>>>>>
> > >>>>>> You may want to try something like:
> > >>>>>>
> > >>>>>> "filestore merge threshold = 40"
> > >>>>>> "filestore split multiple = 8"
> > >>>>>>
> > >>>>>> This will dramatically increase the number of objects allowed per directory.
> > >>>>>>
> > >>>>>> Another thing you may want to try is telling the kernel to greatly favor retaining dentries and inodes in cache:
> > >>>>>>
> > >>>>>> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
> > >>>>>>
> > >>>>>> Mark
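(A side note on the arithmetic, as I understand the filestore behaviour: a subdirectory splits when it holds roughly "filestore split multiple" * "filestore merge threshold" * 16 objects, so the values Mark suggests raise the ceiling a long way compared with the defaults. Treat this as my reading of the code rather than anything authoritative:

    # approximate per-directory split trigger with the suggested values vs the defaults
    echo $((8 * 40 * 16))   # suggested values: 5120 objects per directory before a split
    echo $((2 * 10 * 16))   # defaults (split multiple 2, merge threshold 10): 320 objects
)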
> > >>>>>>
> > >>>>>> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
> > >>>>>>>
> > >>>>>>> If I create a new pool it is generally fast for a short amount of time. Not as fast as if I had a blank cluster, but close to it.
> > >>>>>>>
> > >>>>>>> Bryn
> > >>>>>>>>
> > >>>>>>>> On 8 Jul 2015, at 13:55, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> > >>>>>>>>
> > >>>>>>>> I think you're probably running into the internal PG/collection splitting here; try searching for those terms and seeing what your OSD folder structures look like. You could test by creating a new pool and seeing if it's faster or slower than the one you've already filled up.
> > >>>>>>>> -Greg
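(If anyone wants to repeat Greg's fresh-pool comparison quickly, something along these lines should do it -- the pool name, PG count and thread count are just placeholders, not what Bryn used:

    # create a throwaway pool, benchmark 60s of writes against it, then remove it
    ceph osd pool create benchtest 256 256
    rados bench -p benchtest 60 write -t 16
    ceph osd pool delete benchtest benchtest --yes-i-really-really-mean-it
)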
> > >>>>>>>>
> > >>>>>>>> On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn) <bryn.mathias@xxxxxxxxxxxxxxxxxx> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Hi All,
> > >>>>>>>>>
> > >>>>>>>>> I'm perf testing a cluster again. This time I have re-built the cluster and am filling it for testing.
> > >>>>>>>>>
> > >>>>>>>>> On a 10 min run I get the following results from 5 load generators, each writing through 7 iocontexts, with a queue depth of 50 async writes.
> > >>>>>>>>>
> > >>>>>>>>> Gen1
> > >>>>>>>>> Percentile 100 = 0.729775905609
> > >>>>>>>>> Max latencies = 0.729775905609, Min = 0.0320818424225, mean = 0.0750389684542
> > >>>>>>>>> Total objects written = 113088 in time 604.259738207s gives 187.151307376/s (748.605229503 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen2
> > >>>>>>>>> Percentile 100 = 0.735981941223
> > >>>>>>>>> Max latencies = 0.735981941223, Min = 0.0340068340302, mean = 0.0745198070711
> > >>>>>>>>> Total objects written = 113822 in time 604.437897921s gives 188.310495407/s (753.241981627 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen3
> > >>>>>>>>> Percentile 100 = 0.828994989395
> > >>>>>>>>> Max latencies = 0.828994989395, Min = 0.0349340438843, mean = 0.0745455575197
> > >>>>>>>>> Total objects written = 113670 in time 604.352181911s gives 188.085694736/s (752.342778944 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen4
> > >>>>>>>>> Percentile 100 = 1.06834602356
> > >>>>>>>>> Max latencies = 1.06834602356, Min = 0.0333499908447, mean = 0.0752239764659
> > >>>>>>>>> Total objects written = 112744 in time 604.408732891s gives 186.536020849/s (746.144083397 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen5
> > >>>>>>>>> Percentile 100 = 0.609658002853
> > >>>>>>>>> Max latencies = 0.609658002853, Min = 0.032968044281, mean = 0.0744482759499
> > >>>>>>>>> Total objects written = 113918 in time 604.671534061s gives 188.396498897/s (753.585995589 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> example ceph -w output:
> > >>>>>>>>> 2015-07-07 15:50:16.507084 mon.0 [INF] pgmap v1077: 2880 pgs: 2880 active+clean; 1996 GB data, 2515 GB used, 346 TB / 348 TB avail; 2185 MB/s wr, 572 op/s
> > >>>>>>>>>
> > >>>>>>>>> However, when the cluster gets over 20% full I see the following results, and this gets worse as the cluster fills up:
> > >>>>>>>>>
> > >>>>>>>>> Gen1
> > >>>>>>>>> Percentile 100 = 6.71176099777
> > >>>>>>>>> Max latencies = 6.71176099777, Min = 0.0358741283417, mean = 0.161760483485
> > >>>>>>>>> Total objects written = 52196 in time 604.488474131s gives 86.347386648/s (345.389546592 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen2
> > >>>>>>>>> Max latencies = 4.09169006348, Min = 0.0357890129089, mean = 0.163243938477
> > >>>>>>>>> Total objects written = 51702 in time 604.036739111s gives 85.5941313704/s (342.376525482 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen3
> > >>>>>>>>> Percentile 100 = 7.32526683807
> > >>>>>>>>> Max latencies = 7.32526683807, Min = 0.0366668701172, mean = 0.163992217926
> > >>>>>>>>> Total objects written = 51476 in time 604.684302092s gives 85.1287189397/s (340.514875759 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen4
> > >>>>>>>>> Percentile 100 = 7.56094503403
> > >>>>>>>>> Max latencies = 7.56094503403, Min = 0.0355761051178, mean = 0.162109421231
> > >>>>>>>>> Total objects written = 52092 in time 604.769910812s gives 86.1352376642/s (344.540950657 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Gen5
> > >>>>>>>>> Percentile 100 = 6.99595499039
> > >>>>>>>>> Max latencies = 6.99595499039, Min = 0.0364680290222, mean = 0.163651215426
> > >>>>>>>>> Total objects written = 51566 in time 604.061977148s gives 85.3654127404/s (341.461650961 MB/s)
> > >>>>>>>>>
> > >>>>>>>>> Cluster details:
> > >>>>>>>>> 5 * HP DL380s with 13 * 6TB OSDs, 128GB RAM, 2 * Intel 2620 v3
> > >>>>>>>>> 10 Gbit Ceph public network
> > >>>>>>>>> 10 Gbit Ceph private network
> > >>>>>>>>>
> > >>>>>>>>> Load generators connected via a 20 Gbit bond to the Ceph public network.
> > >>>>>>>>>
> > >>>>>>>>> Is this likely to be something happening to the journals?
> > >>>>>>>>>
> > >>>>>>>>> Or is there something else going on?
> > >>>>>>>>>
> > >>>>>>>>> I have run FIO and iperf tests and the disk and network performance is very high.
> > >>>>>>>>>
> > >>>>>>>>> Kind Regards,
> > >>>>>>>>> Bryn Mathias

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com