Situations that are stable with lots of undersized PGs like this generally mean that the CRUSH map is failing to allocate enough OSDs for certain PGs. The log you have says the OSD is trying to NOTIFY the new primary that the PG exists here on this replica. I'd guess you only have 3 hosts and are trying to place all your replicas on independent boxes. Bobtail tunables have trouble with that and you're going to need to pay the cost of moving to more modern ones.
-Greg

On Fri, Feb 17, 2017 at 5:30 AM, Matyas Koszik <koszik at atw.hu> wrote:
>
>
> I'm not sure what variable I should be looking at exactly, but after
> reading through all of them I don't see anything suspicious, all values are
> 0. I'm attaching it anyway, in case I missed something:
> https://atw.hu/~koszik/ceph/osd26-perf
>
>
> I tried debugging the ceph pg query a bit more, and it seems that it
> gets stuck communicating with the mon - it doesn't even try to connect to
> the osd. This is the end of the log:
>
> 13:36:07.006224 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"\7", 1}, {"\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\17\0\177\0\2\0\27\0\0\0\0\0\0\0\0\0"..., 53}, {"\1\0\0\0\6\0\0\0osdmap9\4\1\0\0\0\0\0\1", 23}, {"\255UC\211\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1", 21}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 98
> 13:36:07.207010 recvfrom(3, "\10\6\0\0\0\0\0\0\0", 4096, MSG_DONTWAIT, NULL, NULL) = 9
> 13:36:09.963843 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, {"9\356\246X\245\330r9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9
> 13:36:09.964340 recvfrom(3, "\0179\356\246X\245\330r9", 4096, MSG_DONTWAIT, NULL, NULL) = 9
> 13:36:19.964154 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, {"C\356\246X\24\226w9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9
> 13:36:19.964573 recvfrom(3, "\17C\356\246X\24\226w9", 4096, MSG_DONTWAIT, NULL, NULL) = 9
> 13:36:29.964439 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, {"M\356\246X|\353{9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9
> 13:36:29.964938 recvfrom(3, "\17M\356\246X|\353{9", 4096, MSG_DONTWAIT, NULL, NULL) = 9
>
> ... and this goes on for as long as I let it. When I kill it, I get this:
> RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", "pgid": "6.245"}']": exception 'int' object is not iterable
>
> I restarted (again) osd26 with max debugging; after grepping for 6.245,
> this is the log I get:
> https://atw.hu/~koszik/ceph/ceph-osd.26.log.6245
>
> Matyas
>
>
> On Fri, 17 Feb 2017, Tomasz Kuzemko wrote:
>
>> If the PG cannot be queried I would bet on the OSD message throttler. Check with
>> "ceph --admin-daemon PATH_TO_ADMIN_SOCK perf dump" on each OSD which is holding
>> this PG whether the message throttler's current value equals its max. If it does,
>> increase the max value in ceph.conf and restart the OSD.
>>
>> --
>> Tomasz Kuzemko
>> tomasz.kuzemko at corp.ovh.com
>>
>> On 17.02.2017 at 01:59, Matyas Koszik <koszik at atw.hu> wrote:
>>
>> >
>> > Hi,
>> >
>> > It seems that my ceph cluster is in an erroneous state which I cannot
>> > see right now how to get out of.
>> >
>> > The status is the following:
>> >
>> >      health HEALTH_WARN
>> >             25 pgs degraded
>> >             1 pgs stale
>> >             26 pgs stuck unclean
>> >             25 pgs undersized
>> >             recovery 23578/9450442 objects degraded (0.249%)
>> >             recovery 45/9450442 objects misplaced (0.000%)
>> >             crush map has legacy tunables (require bobtail, min is firefly)
>> >      monmap e17: 3 mons at x
>> >             election epoch 8550, quorum 0,1,2 store1,store3,store2
>> >      osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
>> >             flags require_jewel_osds
>> >       pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
>> >             36750 GB used, 61947 GB / 98697 GB avail
>> >             23578/9450442 objects degraded (0.249%)
>> >             45/9450442 objects misplaced (0.000%)
>> >                 4362 active+clean
>> >                   24 active+undersized+degraded
>> >                    1 stale+active+undersized+degraded+remapped
>> >                    1 active+remapped
>> >
>> >
>> > I tried restarting all OSDs, to no avail; it actually made things a bit
>> > worse.
>> > From a user point of view the cluster works perfectly (apart from that
>> > stale pg, which fortunately hit the pool on which I keep swap images
>> > only).
>> >
>> > A little background: I made the mistake of creating the cluster with
>> > size=2 pools, which I'm now in the process of rectifying, but that
>> > requires some fiddling around. I also tried moving to more optimal
>> > tunables (firefly), but the documentation is a bit optimistic
>> > with the 'up to 10%' data movement - it was over 50% in my case, so I
>> > reverted to bobtail immediately after I saw that number. I then started
>> > reweighting the osds in anticipation of the size=3 bump, and I think that's
>> > when this bug hit me.
>> >
>> > Right now I have a pg (6.245) that cannot even be queried - the command
>> > times out, or gives this output: https://atw.hu/~koszik/ceph/pg6.245
>> >
>> > I queried a few other pgs that are acting up, but cannot see anything
>> > suspicious, other than the fact they do not have a working peer:
>> > https://atw.hu/~koszik/ceph/pg4.2ca
>> > https://atw.hu/~koszik/ceph/pg4.2e4
>> >
>> > Health details can be found here: https://atw.hu/~koszik/ceph/health
>> > OSD tree: https://atw.hu/~koszik/ceph/tree (here the weight sum of
>> > ssd/store3_ssd seems to be off, but that has been the case for quite some
>> > time - not sure if it's related to any of this)
>> >
>> >
>> > I tried setting debugging to 20/20 on some of the affected osds, but there
>> > was nothing there that gave me any ideas on solving this. How should I
>> > continue debugging this issue?
>> >
>> > BTW, I'm running 10.2.5 on all of my osd/mon nodes.
>> >
>> > Thanks,
>> > Matyas
>> >
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users at lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
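
For reference, a rough sketch of what Greg's tunables suggestion above amounts to on a Jewel (10.2.x) cluster; the recovery-throttling values are illustrative choices, not something prescribed in the thread:

  # Check which tunables profile the CRUSH map currently uses
  ceph osd crush show-tunables

  # Optionally dampen the impact of the resulting data movement first
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

  # Move to the firefly profile (adds chooseleaf_vary_r, which helps CRUSH map
  # all replicas onto independent hosts in small clusters); expect substantial
  # data movement, as Matyas already observed
  ceph osd crush tunables firefly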
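
Similarly, a minimal sketch of the throttler check Tomasz describes above, assuming the default admin socket path for osd.26; ms_dispatch_throttle_bytes is the usual option behind the messenger dispatch throttler, not something named in the thread:

  # In the throttle-msgr_dispatch_throttler-* sections of the perf dump,
  # a saturated throttler shows "val" equal (or close) to "max"
  ceph --admin-daemon /var/run/ceph/ceph-osd.26.asok perf dump | grep -A 12 'throttle-msgr_dispatch_throttler'

If one of them is pegged at its max, raising ms_dispatch_throttle_bytes (or the osd_client_message_cap / osd_client_message_size_cap limits) in ceph.conf and restarting that OSD is the kind of change Tomasz is suggesting.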