Hi,

It seems that my ceph cluster is in an erroneous state that I currently
cannot see a way out of. The status is the following:

     health HEALTH_WARN
            25 pgs degraded
            1 pgs stale
            26 pgs stuck unclean
            25 pgs undersized
            recovery 23578/9450442 objects degraded (0.249%)
            recovery 45/9450442 objects misplaced (0.000%)
            crush map has legacy tunables (require bobtail, min is firefly)
     monmap e17: 3 mons at x
            election epoch 8550, quorum 0,1,2 store1,store3,store2
     osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
            flags require_jewel_osds
      pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
            36750 GB used, 61947 GB / 98697 GB avail
            23578/9450442 objects degraded (0.249%)
            45/9450442 objects misplaced (0.000%)
                4362 active+clean
                  24 active+undersized+degraded
                   1 stale+active+undersized+degraded+remapped
                   1 active+remapped

I tried restarting all OSDs, to no avail; it actually made things a bit
worse. From a user point of view the cluster works perfectly (apart from
that stale pg, which fortunately hit the pool on which I keep only swap
images).

A little background: I made the mistake of creating the cluster with
size=2 pools, which I'm now in the process of rectifying, but that
requires some fiddling around. I also tried moving to more optimal
tunables (firefly), but the documentation is a bit optimistic about the
'up to 10%' data movement - it was over 50% in my case, so I reverted to
bobtail immediately after I saw that number. I then started reweighting
the OSDs in anticipation of the size=3 bump, and I think that's when this
bug hit me.

Right now I have a pg (6.245) that cannot even be queried - the command
times out, or gives this output:
https://atw.hu/~koszik/ceph/pg6.245

I queried a few other pgs that are acting up, but I cannot see anything
suspicious, other than the fact that they do not have a working peer:
https://atw.hu/~koszik/ceph/pg4.2ca
https://atw.hu/~koszik/ceph/pg4.2e4

Health details can be found here:
https://atw.hu/~koszik/ceph/health

OSD tree:
https://atw.hu/~koszik/ceph/tree
(here the weight sum of ssd/store3_ssd seems to be off, but that has been
the case for quite some time - not sure if it's related to any of this)

I tried setting debugging to 20/20 on some of the affected OSDs (rough
commands in the P.S. below), but nothing in the logs gave me any ideas
for solving this. How should I continue debugging this issue?

BTW, I'm running 10.2.5 on all of my OSD/mon nodes.

Thanks,
Matyas
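
P.S. For reference, here is roughly what I have been running. The OSD id,
the crush weight and the pool name below are just examples, not the exact
values from my cluster:

    # overall state and per-pg detail
    ceph health detail
    ceph pg 6.245 query        # this is the one that times out
    ceph pg 4.2ca query

    # restarting an OSD (systemd hosts)
    systemctl restart ceph-osd@12

    # bumping debug logging on a running OSD
    ceph tell osd.12 injectargs '--debug-osd 20/20'

    # the tunables change I tried and then reverted
    ceph osd crush tunables firefly
    ceph osd crush tunables bobtail

    # reweighting ahead of the size=3 bump (weight is an example)
    ceph osd crush reweight osd.12 1.8

    # the planned size bump, per pool
    ceph osd pool set rbd size 3    # 'rbd' is just an example pool name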