Hello,

now it shows again:
>> 4095 active+clean
>> 1 active+clean+scrubbing

and:

# ceph pg dump | grep -i scrub
dumped all in format plain
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
2.aa 4040 0 0 0 0 10128667136 3010 3010 active+clean+scrubbing 2017-05-11 09:37:37.962700 181936'11196478 181936:8688051 [23,41,9] 23 [23,41,9] 23 176730'10793226 2017-05-10 03:43:20.849784 171715'10548192 2017-05-04 14:27:39.210713

So it seems the same scrub is stuck again, even after restarting the OSD. It
just took some time until the next scrub of this PG happened again.

Greets,
Stefan

On 17.05.2017 at 21:13, Jason Dillaman wrote:
> Can you share your current OSD configuration? It's very curious that
> your scrub is getting randomly stuck on a few objects for hours at a
> time until an OSD is reset.
>
> On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>> Hello Jason,
>>
>> Minutes ago I had another case where I restarted the OSD that was shown
>> in the objecter_requests output.
>>
>> It seems other scrubs and deep scrubs were hanging as well.
>>
>> Output before:
>> 4095 active+clean
>> 1 active+clean+scrubbing
>>
>> Output after restart:
>> 4084 active+clean
>> 7 active+clean+scrubbing+deep
>> 5 active+clean+scrubbing
>>
>> Both values are changing every few seconds again, doing a lot of scrubs
>> and deep scrubs.
>>
>> Greets,
>> Stefan
>> On 17.05.2017 at 20:36, Stefan Priebe - Profihost AG wrote:
>>> Hi,
>>>
>>> that command does not exist.
>>>
>>> But at least ceph -s permanently reports 1 pg in scrubbing with no
>>> change.
>>>
>>> Log attached as well.
>>>
>>> Greets,
>>> Stefan
>>> On 17.05.2017 at 20:20, Jason Dillaman wrote:
>>>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>>>> can quickly scan the new log if you directly send it to me.
>>>>
>>>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>> I can send the OSD log, if you want?
>>>>>
>>>>> Stefan
>>>>>
>>>>> On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
>>>>>> Hello Jason,
>>>>>>
>>>>>> the command
>>>>>> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>>>
>>>>>> hangs as well, doing absolutely nothing... waiting forever.
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>> On 17.05.2017 at 17:05, Jason Dillaman wrote:
>>>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>>>> queued on a scrub are not properly awoken when the scrub completes.
>>>>>>>
>>>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>> Hello Jason,
>>>>>>>>
>>>>>>>> After enabling the log and generating a gcore dump, the request was
>>>>>>>> successful ;-(
>>>>>>>>
>>>>>>>> So the log only contains the successful request -- that was the only
>>>>>>>> one I was able to catch. I can send you the log on request.
>>>>>>>>
>>>>>>>> Luckily I had another VM on another cluster behaving the same way.
>>>>>>>>
>>>>>>>> This time osd.23:
>>>>>>>>
>>>>>>>> # ceph --admin-daemon /var/run/ceph/ceph-client.admin.22969.140085040783360.asok objecter_requests
>>>>>>>> {
>>>>>>>>     "ops": [
>>>>>>>>         {
>>>>>>>>             "tid": 18777,
>>>>>>>>             "pg": "2.cebed0aa",
>>>>>>>>             "osd": 23,
>>>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>             "object_locator": "@2",
>>>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>             "paused": 0,
>>>>>>>>             "used_replica": 0,
>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>>>             "attempts": 1,
>>>>>>>>             "snapid": "head",
>>>>>>>>             "snap_context": "28a43=[]",
>>>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>>>             "osd_ops": [
>>>>>>>>                 "delete"
>>>>>>>>             ]
>>>>>>>>         }
>>>>>>>>     ],
>>>>>>>>     "linger_ops": [
>>>>>>>>         {
>>>>>>>>             "linger_id": 1,
>>>>>>>>             "pg": "2.f0709c34",
>>>>>>>>             "osd": 23,
>>>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>             "object_locator": "@2",
>>>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>             "paused": 0,
>>>>>>>>             "used_replica": 0,
>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>             "snapid": "head",
>>>>>>>>             "registered": "1"
>>>>>>>>         }
>>>>>>>>     ],
>>>>>>>>     "pool_ops": [],
>>>>>>>>     "pool_stat_ops": [],
>>>>>>>>     "statfs_ops": [],
>>>>>>>>     "command_ops": []
>>>>>>>> }
>>>>>>>>
>>>>>>>> Log file of OSD 23 attached.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> On 17.05.2017 at 16:26, Jason Dillaman wrote:
>>>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>> You mean the request, no matter if it is successful or not? Which
>>>>>>>>>> log level should be set to 20?
>>>>>>>>>
>>>>>>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>>>>>>> increased -- "debug osd = 20" would be nice if you can turn it up
>>>>>>>>> that high while attempting to capture the blocked op.
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
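As a side note, the objecter_requests admin-socket dump shown in the thread
can also be polled with a small script, so a hung client op (and the OSD it
is pinned to) is visible without reading the JSON by hand. The following is
a minimal sketch, not part of the thread, assuming Python 3 and the ceph CLI
are available on the client host; the socket path is the one quoted above
and must be replaced with the asok of the client actually being debugged.

    #!/usr/bin/env python3
    # Sketch: list in-flight objecter ops for one librbd/RADOS client.
    # Assumption: the asok path below (taken from the thread) exists; adjust
    # it to the client you are debugging (see /var/run/ceph/).
    import json
    import subprocess

    ASOK = "/var/run/ceph/ceph-client.admin.22969.140085040783360.asok"

    def objecter_requests(asok):
        """Return the parsed 'objecter_requests' dump for a client admin socket."""
        out = subprocess.check_output(
            ["ceph", "--admin-daemon", asok, "objecter_requests"])
        return json.loads(out.decode("utf-8"))

    def main():
        dump = objecter_requests(ASOK)
        ops = dump.get("ops", [])
        if not ops:
            print("no in-flight ops")
            return
        for op in ops:
            # Field names as they appear in the dump quoted above.
            print("tid=%s pg=%s osd=%s object=%s osd_ops=%s attempts=%s" % (
                op["tid"], op["pg"], op["osd"],
                op["object_id"], op["osd_ops"], op["attempts"]))

    if __name__ == "__main__":
        main()

If the op list stays non-empty for a long time while ceph -s keeps reporting
the same PG in active+clean+scrubbing, that matches the stuck-scrub pattern
discussed above, and the reported OSD is the one whose logs (or restart) to
look at.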