[resent to list because I missed that Cc:]

Hi Sage,

On 05/13/2013 06:35 PM, Sage Weil wrote:
> Hi Jim-
>
> You mentioned the other day your concerns about the uniformity of the PG
> and data distribution.  There are several ways to attack it (including
> increasing the number of PGs), but one that we haven't tested much yet is
> the 'reweight-by-utilization' function in the monitor.
>
> The idea is that there will always be some statistical variance in the
> distribution and a non-zero probability of having outlier OSDs with too
> many PGs.  We adjust for this by taking nodes that are substantially above
> the mean down by some adjustment factor in an automated way.
>
>   ceph osd reweight-by-utilization MIN
>
> where MIN is the minimum relative utilization at which we will start
> adjusting down.  It is always > 100 (100% of the mean), and defaults to
> 120.  After it adjusts the reweights, you should see the result in
> 'ceph osd tree' output.
>
> Have you played with this at all on your cluster?  I'd be very interested
> in how well this does/does not improve things for you.

I haven't yet, mostly because I don't understand exactly what the
reweighting does.  Maybe you can comment inline below where I go wrong?

Here's my thinking:

I'm only partially motivated by the actual amount of storage used per
OSD, although it is a factor.  My major concern is a performance issue
for our parallel application codes.  Their computation cycle is:
compute furiously, write results, repeat.

The issue is that none of our codes implement write-behind; each task
must finish writing its results before any task can resume computing.
So, when some OSDs carry more PGs, they cannot complete their portion
of the write phase as quickly as OSDs with fewer PGs.  Thus, the
application's ability to resume computation is delayed by the busiest
OSDs.

My concern is that when we rebalance, we just cause some other subset
of the OSDs to be the busiest, because rebalancing sends fewer writes
to the overused OSDs and more writes to the underused ones.  At least,
that's what I was thinking, without actually examining the code to see
what really goes on during rebalancing, and without testing.

Another thing I haven't done is compute, from the statistics of a
uniform distribution, what the expected variance is for my specific
layout (currently 256K PGs across 576 OSDs, with a root/host/device
hierarchy and 24 OSDs/host).  That's mostly due to my lack of
knowledge of statistics; the rough back-of-the-envelope I have in mind
is appended at the bottom of this mail.  If I'm getting more variance
than expected, I want to understand why, in case it can be fixed.

In any event, I think it's past time I tried reweighting.  Suppose I
use 'ceph osd reweight-by-utilization 101', on the theory that I'd
cause continuous, small adjustments to utilization and learn what the
maximum impact can be.  Does that seem like a bad idea to you, and if
so, could you help me understand why?

Thanks for taking the time to think about this - I know you're busy.

PS - FWIW, another reason I keep pushing the number of PGs is that
when we actually deploy Ceph for production, it'll be at a bigger
scale than my testbed.  So I'm trying to shake out any scale-related
issues now, to make sure our users' first experience with Ceph is a
good one.

-- Jim

>
> Thanks!
> sage
>
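
PPS - Here's the back-of-the-envelope sketch mentioned above.  It is
only a sketch: it assumes PG placements are independent and uniform
across OSDs, which the CRUSH root/host/device hierarchy only
approximates, and it assumes 3x replication, which is a guess rather
than a fact about my pools.  It computes the binomial mean and standard
deviation of PGs per OSD for my layout, and simulates a few uniform
placements to estimate how far above the mean the busiest OSD should
land even with perfectly uniform placement.

#!/usr/bin/env python
# Rough estimate of the PG-count spread expected from purely uniform,
# independent placement.  This is NOT how CRUSH actually places PGs
# (it constrains replicas across hosts), so treat the numbers as a
# ballpark only.  The 3x replication factor is an assumption.

import math
import random

n_pgs      = 256 * 1024      # PGs in my testbed
n_replicas = 3               # assumed replication factor
n_osds     = 576             # OSDs (24 hosts x 24 OSDs/host)

placements = n_pgs * n_replicas    # total PG replicas to place
p = 1.0 / n_osds                   # chance a replica lands on a given OSD

mean  = placements * p
sigma = math.sqrt(placements * p * (1.0 - p))
print("mean PGs/OSD      : %.1f" % mean)
print("std dev (binomial): %.1f  (%.2f%% of mean)" % (sigma, 100.0 * sigma / mean))

# Estimate the busiest OSD by simulating a few uniform placements.
trials = 10
worst = []
for t in range(trials):
    counts = [0] * n_osds
    for i in range(placements):
        counts[random.randrange(n_osds)] += 1
    worst.append(max(counts))
avg_worst = sum(worst) / float(trials)
print("busiest OSD (avg of %d trials): %.0f PGs, %.1f%% above mean" %
      (trials, avg_worst, 100.0 * (avg_worst - mean) / mean))

If the spread I actually see across OSDs is much wider than this
predicts, that would point at something beyond pure statistical
variance.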