On Sun, Nov 19, 2017 at 02:41:56AM PST, Gregory Farnum spake thusly:
> Okay, so the hosts look okay (although very uneven numbers of OSDs).
>
> But the sizes are pretty wonky. Are the disks really that mismatched
> in size? I note that many of them in host10 are set to 1.0, but most
> of the others are some fraction less than that.

Yes, they are that mismatched. This is a very mix-and-match cluster we
built out of what we had lying around. I know that isn't ideal.

Possibly due to the large mismatch in disk sizes (although I had always
expected CRUSH to manage it better, given that the default weighting is
proportional to size), we used to run into situations where the small
disks would fill up even when the large disks were barely at 50%. So
back in June we ran bc-ceph-reweight-by-utilization.py fairly
frequently for a few days until things were happy and stable, and it
stayed that way until tonight's incident.

I'm pretty sure you are right: the weights got reset to defaults,
causing lots of movement. I had forgotten that ceph osd reweight is not
a persistent setting. So it looks like once things settle I need to
adjust the CRUSH weights appropriately and set the reweights back to 1
to make this permanent. That explains it. Thanks!

--
Tracy Reed
http://tracyreed.org
Digital signature attached for your safety.
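P.S. For anyone finding this later, the fix I describe above would look
roughly like the sketch below, using the standard ceph CLI. The OSD id
(osd.12) and the weight value (1.819, i.e. a 2 TB disk expressed in TiB)
are placeholders, not values from our cluster; substitute your own per
OSD, e.g. from ceph osd df tree.

    # Persistently bake the desired ratio into the CRUSH map
    # (CRUSH weight is conventionally the disk capacity in TiB):
    ceph osd crush reweight osd.12 1.819

    # Then clear the temporary override back to full:
    ceph osd reweight 12 1.0

    # Confirm weights and utilization afterwards:
    ceph osd df tree

The idea is that ceph osd crush reweight changes the CRUSH map itself,
so the adjustment survives, whereas ceph osd reweight is only the 0-1
override that got us into trouble when it was reset.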