I'm unfortunately out of ideas at the moment. I think the best chance of
figuring out what is wrong is to repeat it while logs are enabled.

On Wed, May 17, 2017 at 4:51 PM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> No, I can't reproduce it with active logs. Any further ideas?
>
> Greets,
> Stefan
>
> On 17.05.2017 at 21:26, Stefan Priebe - Profihost AG wrote:
>> On 17.05.2017 at 21:21, Jason Dillaman wrote:
>>> Any chance you still have debug logs enabled on OSD 23 after you
>>> restarted it and the scrub froze again?
>>
>> No, but I can do that ;-) Hopefully it freezes again.
>>
>> Stefan
>>
>>> On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>> Hello,
>>>>
>>>> now it shows again:
>>>>>>          4095 active+clean
>>>>>>             1 active+clean+scrubbing
>>>>
>>>> and:
>>>> # ceph pg dump | grep -i scrub
>>>> dumped all in format plain
>>>> pg_stat  objects  mip  degr  misp  unf  bytes  log  disklog  state  state_stamp  v  reported  up  up_primary  acting  acting_primary  last_scrub  scrub_stamp  last_deep_scrub  deep_scrub_stamp
>>>> 2.aa  4040  0  0  0  0  10128667136  3010  3010  active+clean+scrubbing  2017-05-11 09:37:37.962700  181936'11196478  181936:8688051  [23,41,9]  23  [23,41,9]  23  176730'10793226  2017-05-10 03:43:20.849784  171715'10548192  2017-05-04 14:27:39.210713
>>>>
>>>> So it seems the same scrub is stuck again... even after restarting the
>>>> OSD. It just took some time until the scrub of this pg happened again.
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>> On 17.05.2017 at 21:13, Jason Dillaman wrote:
>>>>> Can you share your current OSD configuration? It's very curious that
>>>>> your scrub is getting randomly stuck on a few objects for hours at a
>>>>> time until an OSD is reset.
>>>>>
>>>>> On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG
>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>> Hello Jason,
>>>>>>
>>>>>> A few minutes ago I had another case where I restarted the OSD that
>>>>>> was shown in the objecter_requests output.
>>>>>>
>>>>>> It seems other scrubs and deep scrubs were hanging as well.
>>>>>>
>>>>>> Output before:
>>>>>>          4095 active+clean
>>>>>>             1 active+clean+scrubbing
>>>>>>
>>>>>> Output after restart:
>>>>>>          4084 active+clean
>>>>>>             7 active+clean+scrubbing+deep
>>>>>>             5 active+clean+scrubbing
>>>>>>
>>>>>> Both values are changing every few seconds again, doing a lot of
>>>>>> scrubs and deep scrubs.
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>> On 17.05.2017 at 20:36, Stefan Priebe - Profihost AG wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> that command does not exist.
>>>>>>>
>>>>>>> But at least ceph -s permanently reports 1 pg in scrubbing with no
>>>>>>> change.
>>>>>>>
>>>>>>> Log attached as well.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 17.05.2017 at 20:20, Jason Dillaman wrote:
>>>>>>>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>>>>>>>> can quickly scan the new log if you directly send it to me.
>>>>>>>>
>>>>>>>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>> I can send the OSD log - if you want?
>>>>>>>>>
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> Hello Jason,
>>>>>>>>>>
>>>>>>>>>> the command
>>>>>>>>>> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>>>>>>>
>>>>>>>>>> hangs as well. Doing absolutely nothing... waiting forever.
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
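
The checks discussed above boil down to a handful of CLI calls. A rough
sketch of what that could look like -- assuming the standard ceph CLI,
osd.23 and pg 2.aa from this thread, and that injecting debug settings at
runtime is acceptable in your environment:

Find PGs that have been sitting in "scrubbing" for a long time by comparing
state_stamp against the current time:
# ceph pg dump | grep -i scrub

Query the suspicious PG for its primary OSD and scrub timestamps:
# ceph pg 2.aa query

Raise logging on the primary OSD at runtime, without a restart:
# ceph tell osd.23 injectargs '--debug-osd 20 --debug-ms 1'

Dump the running configuration of that OSD (run on the host carrying
osd.23), which also covers the "share your OSD configuration" question:
# ceph daemon osd.23 config show

Once the stuck scrub has been captured, lower the debug levels again:
# ceph tell osd.23 injectargs '--debug-osd 0/5 --debug-ms 0/5'
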
>>>>>>>>>> On 17.05.2017 at 17:05, Jason Dillaman wrote:
>>>>>>>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>>>>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>>>>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>>>>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>>>>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>>>>>>>> queued on a scrub are not properly awoken when the scrub completes.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>
>>>>>>>>>>>> after enabling the log and generating a gcore dump, the request was
>>>>>>>>>>>> successful ;-(
>>>>>>>>>>>>
>>>>>>>>>>>> So the log only contains the successful request; that was the only
>>>>>>>>>>>> one I was able to catch. I can send you the log on request.
>>>>>>>>>>>>
>>>>>>>>>>>> Luckily I had another VM on another cluster behaving the same way.
>>>>>>>>>>>>
>>>>>>>>>>>> This time osd.23:
>>>>>>>>>>>> # ceph --admin-daemon /var/run/ceph/ceph-client.admin.22969.140085040783360.asok objecter_requests
>>>>>>>>>>>> {
>>>>>>>>>>>>     "ops": [
>>>>>>>>>>>>         {
>>>>>>>>>>>>             "tid": 18777,
>>>>>>>>>>>>             "pg": "2.cebed0aa",
>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>>>>>>>             "attempts": 1,
>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>             "snap_context": "28a43=[]",
>>>>>>>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>>>>>>>             "osd_ops": [
>>>>>>>>>>>>                 "delete"
>>>>>>>>>>>>             ]
>>>>>>>>>>>>         }
>>>>>>>>>>>>     ],
>>>>>>>>>>>>     "linger_ops": [
>>>>>>>>>>>>         {
>>>>>>>>>>>>             "linger_id": 1,
>>>>>>>>>>>>             "pg": "2.f0709c34",
>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>             "registered": "1"
>>>>>>>>>>>>         }
>>>>>>>>>>>>     ],
>>>>>>>>>>>>     "pool_ops": [],
>>>>>>>>>>>>     "pool_stat_ops": [],
>>>>>>>>>>>>     "statfs_ops": [],
>>>>>>>>>>>>     "command_ops": []
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> OSD logfile of OSD 23 attached.
>>>>>>>>>>>>
>>>>>>>>>>>> Greets,
>>>>>>>>>>>> Stefan
>>>>>>>>>>>>
>>>>>>>>>>>> On 17.05.2017 at 16:26, Jason Dillaman wrote:
>>>>>>>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>> You mean the request no matter whether it is successful or not?
>>>>>>>>>>>>>> Which log level should be set to 20?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>>>>>>>>>>> increased -- "debug osd = 20" would be nice if you can turn it up that
>>>>>>>>>>>>> high while attempting to capture the blocked op.
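
The reproduction Jason asks for above can be written down as a short
recipe. A possible sequence, assuming the pool, object and client
admin-socket path quoted in this thread and the default OSD log location
(/var/log/ceph/ceph-osd.23.log); injectargs is used here as the runtime
equivalent of "debug osd = 20" in ceph.conf:

Turn up logging on the primary OSD:
# ceph tell osd.23 injectargs '--debug-osd 20'

In one terminal, follow the OSD log for the affected object:
# tail -f /var/log/ceph/ceph-osd.23.log | grep rbd_data.21aafa6b8b4567.0000000000000aaa

In a second terminal, retry the remove that previously hung:
# rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa

While it is blocked, capture both the client's and the OSD's view of the op
(the client socket name changes with every client process):
# ceph --admin-daemon /var/run/ceph/ceph-client.admin.22969.140085040783360.asok objecter_requests
# ceph daemon osd.23 dump_ops_in_flight
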
--
Jason