> >> Running 'ceph osd reweight-by-utilization' clears the issue up
> >> temporarily, but additional data inevitably causes certain OSDs to be
> >> overloaded again.
> >>
> > The only time I've ever seen this kind of uneven distribution is when
> > using too few (and using the default formula with few OSDs might
> > still be too few) PGs/PG_NUMs.
> >
> > Did you look into that?

A bit, yeah. It was one of the first things I tried. It didn't seem to have much, if any, effect.

I did see a reference in an older list discussion about wide variations in OSD sizes causing unbalanced usage, so that's my current operating theory.

> Yep, this is deliberate -- the sizing knobs aren't used as CRUSH inputs; it just
> impacts how often the CRUSH calculation is run.
> Scaling that value up or down adds or removes values to the end of the set of
> OSDs hosting a PG, but doesn't change the order they appear in.
> Things that do shuffle data:
> 1) changing weights (obviously)
> 2) changing internal CRUSH parameters (for most users, this means changing
> the tunables)
> 3) changing how the map looks (i.e., adding OSDs)

Makes sense. Good to know. Thanks.
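
For anyone finding this later via the archives, a rough sketch of the commands in question (assuming a reasonably recent release; the pool name 'rbd' and the PG count 256 are just placeholders, substitute your own values):

    # show per-OSD utilization so the overloaded OSDs stand out
    ceph osd df

    # check the current PG count on a pool
    ceph osd pool get rbd pg_num

    # raise it if it looks too low; pgp_num has to follow pg_num
    # before any data actually moves
    ceph osd pool set rbd pg_num 256
    ceph osd pool set rbd pgp_num 256

    # the temporary workaround discussed above
    ceph osd reweight-by-utilization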