Can you share your current OSD configuration? It's very curious that your
scrub is randomly getting stuck on a few objects for hours at a time
until an OSD is reset.

On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> Hello Jason,
>
> A few minutes ago I had another case where I restarted the OSD that was
> shown in the objecter_requests output.
>
> It seems other scrubs and deep scrubs were hanging as well.
>
> Output before:
>     4095 active+clean
>        1 active+clean+scrubbing
>
> Output after the restart:
>     4084 active+clean
>        7 active+clean+scrubbing+deep
>        5 active+clean+scrubbing
>
> Both values keep changing every few seconds; the cluster is doing a lot
> of scrubs and deep scrubs.
>
> Greets,
> Stefan
>
> On 17.05.2017 at 20:36, Stefan Priebe - Profihost AG wrote:
>> Hi,
>>
>> that command does not exist.
>>
>> But at least "ceph -s" permanently reports 1 pg in scrubbing, with no
>> change.
>>
>> Log attached as well.
>>
>> Greets,
>> Stefan
>>
>> On 17.05.2017 at 20:20, Jason Dillaman wrote:
>>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>>> can quickly scan the new log if you send it directly to me.
>>>
>>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>> I can send the OSD log, if you want?
>>>>
>>>> Stefan
>>>>
>>>> On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
>>>>> Hello Jason,
>>>>>
>>>>> the command
>>>>>
>>>>>     # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>>
>>>>> hangs as well, doing absolutely nothing... waiting forever.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>> On 17.05.2017 at 17:05, Jason Dillaman wrote:
>>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>>> queued on a scrub are not properly awoken when the scrub completes.
>>>>>>
>>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>> Hello Jason,
>>>>>>>
>>>>>>> after enabling the log and generating a gcore dump, the request was
>>>>>>> successful ;-(
>>>>>>>
>>>>>>> So the log only contains the successful request; that was all I was
>>>>>>> able to catch. I can send you the log on request.
>>>>>>>
>>>>>>> Luckily I had another VM on another cluster behaving the same way.
>>>>>>>
>>>>>>> This time it is osd.23:
>>>>>>>
>>>>>>> # ceph --admin-daemon \
>>>>>>>     /var/run/ceph/ceph-client.admin.22969.140085040783360.asok \
>>>>>>>     objecter_requests
>>>>>>> {
>>>>>>>     "ops": [
>>>>>>>         {
>>>>>>>             "tid": 18777,
>>>>>>>             "pg": "2.cebed0aa",
>>>>>>>             "osd": 23,
>>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>             "object_locator": "@2",
>>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>             "target_object_locator": "@2",
>>>>>>>             "paused": 0,
>>>>>>>             "used_replica": 0,
>>>>>>>             "precalc_pgid": 0,
>>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>>             "attempts": 1,
>>>>>>>             "snapid": "head",
>>>>>>>             "snap_context": "28a43=[]",
>>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>>             "osd_ops": [
>>>>>>>                 "delete"
>>>>>>>             ]
>>>>>>>         }
>>>>>>>     ],
>>>>>>>     "linger_ops": [
>>>>>>>         {
>>>>>>>             "linger_id": 1,
>>>>>>>             "pg": "2.f0709c34",
>>>>>>>             "osd": 23,
>>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>             "object_locator": "@2",
>>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>             "target_object_locator": "@2",
>>>>>>>             "paused": 0,
>>>>>>>             "used_replica": 0,
>>>>>>>             "precalc_pgid": 0,
>>>>>>>             "snapid": "head",
>>>>>>>             "registered": "1"
>>>>>>>         }
>>>>>>>     ],
>>>>>>>     "pool_ops": [],
>>>>>>>     "pool_stat_ops": [],
>>>>>>>     "statfs_ops": [],
>>>>>>>     "command_ops": []
>>>>>>> }
>>>>>>>
>>>>>>> OSD logfile of OSD 23 attached.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 17.05.2017 at 16:26, Jason Dillaman wrote:
>>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>> You mean the request no matter if it is successful or not? Which
>>>>>>>>> log level should be set to 20?
>>>>>>>>
>>>>>>>> I'm hoping you can re-create the hung remove op while OSD logging
>>>>>>>> is increased -- "debug osd = 20" would be nice if you can turn it
>>>>>>>> up that high while attempting to capture the blocked op.
>>>
>>> --
>>> Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com