Mark,

Thanks for the comments. One more question: is there any negative impact
if we use a higher PG# per OSD, e.g. 200x (I think a lot of people use
this?) or 400x? For example, more memory consumption or lock contention?

-jiangang

-----Original Message-----
From: Mark Nelson [mailto:mark.nelson@xxxxxxxxxxx]
Sent: Wednesday, December 11, 2013 10:38 PM
To: Duan, Jiangang
Cc: Sage Weil; Zhang, Jian; ceph-devel@xxxxxxxxxxxxxxx; He, Yujie
Subject: Re: question on BG# and its performance impact

On 12/11/2013 08:24 AM, Mark Nelson wrote:
> Hi Jiangang,
>
> To answer your earlier question about uniformity:
>
> What I saw in my testing was that as the PG count increases, things did
> tend to get more uniform, i.e. the standard deviation of the percentages
> distributed over the set of OSDs slowly decreased with more PGs.
> Primarily what I am interested in, though, is whether or not any
> specific OSD has more PGs than the rest, as that's all it will take to
> screw up performance. As far as performance goes, though, in my
> testing it didn't necessarily seem to be strongly correlated with the
> PG distribution, except for very small numbers of PGs. Much more
> rigorous testing is probably needed to draw much of a conclusion.
>
> Sage and I had a conversation a while ago about how to deal with
> situations where you have uneven distributions (either through not
> having enough PGs to ensure even distribution, or simply bad luck at
> pseudo-random roulette). I proposed that we might iterate through
> multiple possible pool distributions using different seed values until
> we found one we liked with a good pseudo-random distribution. Perhaps
> you could get even fancier by looking at what happens when you lose an
> OSD or two. As this is all during pool creation, a little extra time
> finding a nice initial distribution doesn't really hurt.
> [a toy sketch of this seed-search idea is appended after the thread]
>
> Sage mentioned, though, that it may be better to simply take whatever
> distribution is generated and re-weight it to deal with uniformity
> imperfections. I can't see any reason why this wouldn't also work, and
> it has the benefit of working no matter how the distribution changes.
> Arguably this technique could go beyond just looking at PG
> distributions and look at actual data distribution too, if the user
> wants extremely even data uniformity at the expense of a re-weighting
> tweak.
>
> In any event, with very large clusters with lots of pools, I think we
> will likely need to at some point adopt some kind of scheme that lets
> us get away with fewer PGs per pool than our current recommendations.

Ha, replying to my own reply! Thinking about this a little more, these
two techniques may in fact still be complementary. For very large
clusters where the PG counts per OSD may be low, I suspect we will want
to at least make sure the initial map guarantees that every OSD has at
least 1 PG so we can do proper re-weighting down the road. In fact, the
better the initial distribution is, the less crazy we'll have to get
with re-weighting, so it may not be a bad idea to use both techniques.

>
> Mark
>
>
> On 12/11/2013 12:22 AM, Duan, Jiangang wrote:
>> Cc the mail list as Sage suggested.
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@xxxxxxxxxxx]
>> Sent: Wednesday, December 11, 2013 2:10 PM
>> To: Zhang, Jian
>> Cc: Mark Nelson; Duan, Jiangang
>> Subject: RE: question on BG# and its performance impact
>>
>> On Wed, 11 Dec 2013, Zhang, Jian wrote:
>>> Thanks for the suggestions, I will take a look at the ls output.
>>> No, we didn't use the optimal crush tunables.
>>
>> Hopefully that is part of it... try repeating the test with the
>> optimal tunables (now the default in master)!
>>
>> s
>>
>>
>>>
>>> Jian
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sage@xxxxxxxxxxx]
>>> Sent: Wednesday, December 11, 2013 1:36 PM
>>> To: Zhang, Jian
>>> Cc: Mark Nelson; Duan, Jiangang
>>> Subject: RE: question on BG# and its performance impact
>>>
>>> It might be worthwhile here to get the actual list of objects (rados
>>> -p $pool ls list.txt) and calculate the pg and osd mappings for each
>>> of them to verify things are uniform.
>>> [a sketch of this check is appended after the thread]
>>>
>>> One thing: are you using the 'optimal' crush tunables (ceph osd
>>> crush tunables optimal)?
>>>
>>> Also, can we cc ceph-devel?
>>>
>>> sage
>>>
>>>
>>> On Wed, 11 Dec 2013, Zhang, Jian wrote:
>>>
>>>> Mark,
>>>> Thanks for the help.
>>>> For the performance dip, I think it should be caused by directory
>>>> splitting; I just checked several OSDs, and they do have many
>>>> subdirectories.
>>>> For the PG # and distribution, see if I understand you correctly:
>>>> when you said "a slow trend toward uniformity", do you mean the PG #
>>>> for each pool is uniform? But from sheet2, the PG # on OSD10 is 103,
>>>> while the PG # on OSD8 is 72, so there is still a 30% gap. And I
>>>> think that's the reason we saw a performance drop for 10M reads with
>>>> 1280 PGs
>>>> - the PG # across the OSDs is not balanced.
>>>>
>>>> Thanks
>>>> Jian
>>>>
>>>> -----Original Message-----
>>>> From: Mark Nelson [mailto:mark.nelson@xxxxxxxxxxx]
>>>> Sent: Wednesday, December 11, 2013 12:01 PM
>>>> To: Duan, Jiangang
>>>> Cc: Sage Weil (sage@xxxxxxxxxxx); Zhang, Jian
>>>> Subject: Re: question on BG# and its performance impact
>>>>
>>>> Hi Jiangang,
>>>>
>>>> My results are rather old at this point, but I did similar testing
>>>> last spring to look at PG distribution and performance (with larger
>>>> writes) with varying numbers of PGs. I saw what looked like a slow
>>>> trend toward uniformity. Performance, however, was quite variable.
>>>>
>>>> The performance dip you saw after many hours may have been due to
>>>> directory splitting on the underlying OSDs. When this happens
>>>> depends on the number of objects that are written out and the
>>>> number of PGs in the pool. Eventually, when enough objects are
>>>> written, the filestore will create a deeper nested directory
>>>> structure to store objects, to keep the number of objects per
>>>> directory below a certain maximum.
>>>> This is governed by two settings:
>>>>
>>>> filestore merge threshold = 10
>>>>
>>>> filestore split multiple = 2
>>>>
>>>> The total number of objects per directory is by default 10 * 2 * 16
>>>> = 320. With small PG counts this can cause quite a bit of directory
>>>> splitting if there are many objects.
>>>> [a worked example of this arithmetic is appended after the thread]
>>>>
>>>> I believe it is likely that these defaults are lower than necessary
>>>> and that we could allow more objects per directory, potentially
>>>> reducing the number of seeks for dentry lookups (though
>>>> theoretically this should be cached). We have definitely seen this
>>>> have a large performance impact with RGW, though, on clusters with
>>>> small numbers of PGs. With more PGs, and more relaxed thresholds,
>>>> directory splitting doesn't happen until many, many millions of
>>>> objects are written out, and the performance degradation as the
>>>> disk fills up appears to be less severe.
>>>>
>>>> Mark
>>>>
>>>> On 12/10/2013 09:05 PM, Duan, Jiangang wrote:
>>>>> Sage/mark,
>>>>>
>>>>> We see an unbalanced object # distribution in our Ceph setup, for
>>>>> both RBD and object storage.
>>>>> Refer to the attached pdf.
>>>>>
>>>>> Increasing the PG # does increase performance, however it results
>>>>> in stability issues?
>>>>>
>>>>> Is this a known issue, and do you have any BKM to fix this?
>>>>>
>>>>> -jiangang
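
Appendix (editorial; not part of the original thread). First, a rough
sketch of the per-object mapping check Sage suggests above: list the
objects in a pool with rados ls, run each through ceph osd map, and
tally how many object replicas land on each OSD. Both commands are
real, but the text output of ceph osd map varies across Ceph releases,
so the parsing below is a best-effort assumption, the pool name is a
placeholder, and one call per object will be slow on large pools.

#!/usr/bin/env python3
# Rough sketch: tally object -> OSD placement for one pool by shelling out
# to `rados ls` and `ceph osd map`.  The regex is a best-effort guess at the
# `ceph osd map` text format, which differs slightly between releases.
import re
import subprocess
from collections import Counter
from statistics import pstdev

POOL = "rbd"  # placeholder pool name -- substitute your own

def list_objects(pool):
    out = subprocess.check_output(["rados", "-p", pool, "ls"], text=True)
    return [line.strip() for line in out.splitlines() if line.strip()]

def acting_osds(pool, obj):
    # Example output: "... -> pg 2.7f9 (2.9) -> up [1,0] acting [1,0]"
    out = subprocess.check_output(["ceph", "osd", "map", pool, obj], text=True)
    m = re.search(r"acting \(?\[([^\]]+)\]", out)
    return [int(x) for x in m.group(1).split(",")] if m else []

def main():
    per_osd = Counter()
    for obj in list_objects(POOL):
        for osd in acting_osds(POOL, obj):
            per_osd[osd] += 1
    # Note: OSDs that received nothing at all will not show up here.
    for osd, n in sorted(per_osd.items()):
        print(f"osd.{osd}: {n} object replicas")
    counts = list(per_osd.values())
    print(f"min={min(counts)} max={max(counts)} stddev={pstdev(counts):.1f}")

if __name__ == "__main__":
    main()

A wide min/max gap here is the same kind of imbalance Jian reports for
PGs (103 on OSD10 vs. 72 on OSD8).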
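
Second, a small worked example of the filestore split arithmetic Mark
gives above (10 * 2 * 16 = 320). The function names are made up for
illustration, and the objects-per-PG estimate assumes a perfectly even
spread across PGs, which the thread shows is optimistic.

# Illustrative arithmetic only: a filestore directory splits once it holds
# roughly merge_threshold * split_multiple * 16 objects (10 * 2 * 16 = 320
# with the defaults quoted in the thread).
def split_threshold(filestore_merge_threshold=10, filestore_split_multiple=2):
    return filestore_merge_threshold * filestore_split_multiple * 16

def splitting_expected(total_objects, pg_count, **cfg):
    # Assumes objects are spread evenly across PGs (optimistic).
    return total_objects / pg_count > split_threshold(**cfg)

print(split_threshold())                    # 320 with the defaults
print(splitting_expected(1_000_000, 1280))  # True: ~781 objects per PG
print(splitting_expected(1_000_000, 1280,
                         filestore_merge_threshold=40))  # False: threshold 1280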
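
Finally, a toy model of Mark's seed-search proposal. This is not CRUSH
and not Ceph code; it simply hashes PG ids under different candidate
seeds, rejects any map that leaves an OSD with zero PGs (per Mark's
follow-up about keeping re-weighting possible), and keeps the seed with
the lowest per-OSD standard deviation.

# Toy illustration of the seed-search idea (not CRUSH, ignores replication):
# try several seeds for a trivial hash placement and keep the most even map
# that still gives every OSD at least one PG.
import hashlib
from collections import Counter
from statistics import pstdev

def place(seed, pg_count, num_osds):
    # Map each PG to a single OSD with a plain hash -- a stand-in for CRUSH.
    counts = Counter({osd: 0 for osd in range(num_osds)})
    for pg in range(pg_count):
        digest = hashlib.sha1(f"{seed}:{pg}".encode()).hexdigest()
        counts[int(digest, 16) % num_osds] += 1
    return counts

def pick_seed(pg_count, num_osds, candidate_seeds=range(64)):
    best_seed, best_spread = None, None
    for seed in candidate_seeds:
        counts = place(seed, pg_count, num_osds)
        if min(counts.values()) == 0:   # reject maps that starve an OSD
            continue
        spread = pstdev(counts.values())
        if best_spread is None or spread < best_spread:
            best_seed, best_spread = seed, spread
    return best_seed, best_spread

seed, spread = pick_seed(pg_count=1280, num_osds=40)
print(f"best seed {seed}: per-OSD PG stddev {spread:.2f}")

The other half of the thread's idea, re-weighting whatever map comes
out, would then only need to correct the residual unevenness.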