Hi, I work with Tomasz and I'm investigating this situation. We still don't fully understood why there was unfound object after removing single OSD. >From logs[1] it looks like all PGs were active+clean before marking that OSD out. After that backfills started on multiple OSDs. Three minutes later our robot removed that OSD from cluster. Before removing OSD there were some PGs marked as degraded. And we ended up with some PGs degraded+undersized and with unfound objects. During whole event there were some slow requests related to this OSD -- that's why we decided to get rid of it. Then, after a while whole cluster seems to be blocked -- those unfound objects were blocking recovery and client operations -- we only got something like 10% of normal traffic. It looks like this was caused by OSD with unfound objects -- there is throttling mechanism in OSD which will by default allow only 100 messages. If there are some unfound objects and many client requests for it it will completely block that OSD after a while with all ops "waiting for missing object"[2]. In this state it's not possible to query PG, even do 'ceph tell osd.X version'. Restarting OSD will give a chance to execute 'pg query' or 'make_unfound_lost revert|delete' before throttle limit will be reached. It's also possible to increase that limit by setting 'osd_client_message_cap' in config file (option not documented in docs). I was able to reproduce blocking on throttling limit on Hammer (0.94.5 and 0.94.9) and Jewel (10.2.3). This looks like a bug - client operations are blocking recovery process/admin tools. 1: 2016-10-01 07:25:05 - mon.0 10.97.212.2:6789/0 8690339 : cluster [INF] pgmap v31174228: 10432 pgs: 10432 active+clean; 141 TB data, 423 TB used, 807 TB / 1231 TB avail; 83202 kB/s rd, 144 MB/s wr, 8717 op/s <-- cluster seems ok 2016-10-01 07:25:06 - mon.0 10.97.212.2:6789/0 8690341 : cluster [INF] osdmap e696136: 227 osds: 227 up, 226 in <-- ceph osd out 2016-10-01 07:28:47 - mon.0 10.97.212.2:6789/0 8691212 : cluster [INF] pgmap v31174433: 10432 pgs: 4 active+recovery_wait+degraded+remapped, 9 activating+degraded+remapped, 10301 active+clean, 59 active+remapped+wait_backfill, 19 activating+remapped, 38 active+remapped+backfilling, 2 active+recovering+degraded; 141 TB data, 421 TB used, 803 TB / 1225 TB avail; 4738 kB/s rd, 19025 kB/s wr, 2006 op/s; 1863/125061398 objects degraded (0.001%); 1211382/125061398 objects misplaced (0.969%); 994 MB/s, 352 objects/s recovering 2016-10-01 07:28:47 - mon.0 10.97.212.2:6789/0 8691232 : cluster [INF] osd.87 marked itself down <-- osd down 2016-10-01 07:28:41 - osd.102 10.99.208.134:6801/77524 6820 : cluster [WRN] slow request 30.664042 seconds old, received at 2016-10-01 07:28:11.248835: osd_op(client.458404438.0:13798725 rbd_data.e623422ae8944a.0000000000000d49 [write 786432~8192] 3.abb76dbb snapc 9256b=[] ack+ondisk+write+known_if_redirected e696155) currently waiting for subops from 5,54,87,142 2016-10-01 07:28:54 - mon.0 10.97.212.2:6789/0 8691266 : cluster [INF] pgmap v31174439: 10432 pgs: 3 stale+active+recovery_wait+degraded+remapped, 9 stale+activating+degraded+remapped, 11 stale+active+remapped+wait_backfill, 10301 active+clean, 49 active+remapped+wait_backfill, 1 active+recovering+undersized+degraded+remapped, 17 activating+remapped, 39 active+remapped+backfilling, 2 active+recovering+degraded; 141 TB data, 421 TB used, 803 TB / 1225 TB avail; 85707 kB/s rd, 139 MB/s wr, 6271 op/s; 6326/125057026 objects degraded (0.005%); 1224087/125057026 objects misplaced (0.979%); 1/41406770 unfound (0.000%) <-- first unfound full log: https://gist.githubusercontent.com/anonymous/0336ea26c7c165e20ae75fbe03204d19/raw/9f5053e91ec5ae10bfb614033c924fe2361a116e/ceph_log 2: # ceph --admin-daemon /var/run/ceph/*.asok dump_ops_in_flight { "description": "osd_op(client.176998.0:2805522 rbd_data.24d2b238e1f29.000000000000182f [set-alloc-hint object_size 4194304 write_size 4194304,write 2662400~4096] 2.62ea5cd0 ack+ondisk+write+known_if_redirected e1919)", "initiated_at": "2016-10-06 07:19:45.650089", "age": 647.394210, "duration": 0.003584, "type_data": [ "delayed", { "client": "client.176998", "tid": 2805522 }, [ { "time": "2016-10-06 07:19:45.650089", "event": "initiated" }, { "time": "2016-10-06 07:19:45.653667", "event": "reached_pg" }, { "time": "2016-10-06 07:19:45.653673", "event": "waiting for missing object" } ] ] } 3: # ceph --admin-daemon /var/run/ceph/*.asok perf dump | grep -A 20 "throttle-osd_client_messages" "throttle-osd_client_messages": { "val": 100, "max": 100, "get": 1739, "get_sum": 1739, "get_or_fail_fail": 0, "get_or_fail_success": 0, "take": 0, "take_sum": 0, "put": 1639, "put_sum": 1639, "wait": { "avgcount": 1511, "sum": 1139.416916927 } } } On 10/02/2016 10:22 AM, Dan van der Ster wrote: > > Hi, > > Do you understand why removing that osd led to unfound objects? Do you > have the ceph.log from yesterday? > > Cheers, Dan > > On 2 Oct 2016 10:18, "Tomasz Kuzemko" <tomasz@xxxxxxxxxxx > <mailto:tomasz@xxxxxxxxxxx>> wrote: > > > > Forgot to mention Ceph version - 0.94.5. > > > > I managed to fix this. By chance I found that when an OSD for a > blocked PG is starting, there is a few-second time window (after > load_pgs) in which it accepts commands related to the blocked PG. So > first I managed to capture "ceph pg PGID query" this way. Then I tried > to issue "ceph pg missing_lost delete" and it worked too. After > deleting all unfound objects this way cluster finally unblocked. > Before that I exported all blocked PGs so hopefully I will be able to > recover those 17 objects to a near-latest state. > > > > Hope this helps anyone who might run into the same problem. > > > > > > 2016-10-01 14:27 GMT+02:00 Tomasz Kuzemko <tomasz@xxxxxxxxxxx > <mailto:tomasz@xxxxxxxxxxx>>: > >> > >> Hi, > >> > >> I have a production cluster on which 1 OSD on a failing disk was > slowing the whole cluster down. I removed the OSD (osd.87) like usual > in such case but this time it resulted in 17 unfound objects. I no > longer have the files from osd.87. I was able to call "ceph pg PGID > mark_unfound_lost delete" on 10 of those objects. > >> > >> On the remaining objects 7 the command blocks. When I try to do > "ceph pg PGID query" on this PG it also blocks. I suspect this is > same reason why mark_unfound blocks. > >> > >> Other client IO to PGs that have unfound objects are also blocked. > When trying to query the OSDs which has the PG with unfound objects, > "ceph tell" blocks. > >> > >> I tried to mark the PG as complete using ceph-objectstore-tool but > it did not help as the PG is in fact complete but for some reason blocks. > >> > >> I tried recreating an empty osd.87 and importing the PG exported > from other replica but it did not help. > >> > >> Can someone help me please? This is really important. > >> > >> ceph pg dump: > >> > https://gist.github.com/anonymous/c0622ef0d8c0ac84e0778e73bad3c1af/raw/206a06e674ed1c870bbb09bb75fe4285a8e20ba4/pg-dump > >> > >> ceph osd dump: > >> > https://gist.github.com/anonymous/64e237d85016af6bd7879ef272ca5639/raw/d6fceb9acd206b75c3ce59c60bcd55a47dea7acd/osd-dump > >> > >> ceph health detail: > >> > https://gist.github.com/anonymous/ddb27863ecd416748ebd7ebbc036e438/raw/59ef1582960e011f10cbdbd4ccee509419b95d4e/health-detail > >> > >> > >> -- > >> Pozdrawiam, > >> Tomasz Kuzemko > >> tomasz@xxxxxxxxxxx <mailto:tomasz@xxxxxxxxxxx> > > > > > > > > > > -- > > Pozdrawiam, > > Tomasz Kuzemko > > tomasz@xxxxxxxxxxx <mailto:tomasz@xxxxxxxxxxx> > > > > _______________________________________________ > > ceph-users mailing list > > ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html