If you cannot recreate it with debug logging enabled, that might be the
next best option.

On Mon, May 22, 2017 at 2:30 AM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> Hello Jason,
>
> i had another 8 cases where a scrub was running for hours. Sadly i
> couldn't get it to hang again after an osd restart. Any further ideas?
>
> Coredump of the OSD with the hanging scrub?
>
> Greets,
> Stefan
>
> On 18.05.2017 at 17:26, Jason Dillaman wrote:
>> I'm unfortunately out of ideas at the moment. I think the best chance
>> of figuring out what is wrong is to repeat it while logs are enabled.
>>
>> On Wed, May 17, 2017 at 4:51 PM, Stefan Priebe - Profihost AG
>> <s.priebe@xxxxxxxxxxxx> wrote:
>>> No, i can't reproduce it with active logs. Any further ideas?
>>>
>>> Greets,
>>> Stefan
>>>
>>> On 17.05.2017 at 21:26, Stefan Priebe - Profihost AG wrote:
>>>> On 17.05.2017 at 21:21, Jason Dillaman wrote:
>>>>> Any chance you still have debug logs enabled on OSD 23 after you
>>>>> restarted it and the scrub froze again?
>>>>
>>>> No, but i can do that ;-) Hopefully it freezes again.
>>>>
>>>> Stefan
>>>>
>>>>> On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG
>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> now it shows again:
>>>>>>>> 4095 active+clean
>>>>>>>>    1 active+clean+scrubbing
>>>>>>
>>>>>> and:
>>>>>> # ceph pg dump | grep -i scrub
>>>>>> dumped all in format plain
>>>>>> pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
>>>>>> 2.aa 4040 0 0 0 0 10128667136 3010 3010 active+clean+scrubbing 2017-05-11 09:37:37.962700 181936'11196478 181936:8688051 [23,41,9] 23 [23,41,9] 23 176730'10793226 2017-05-10 03:43:20.849784 171715'10548192 2017-05-04 14:27:39.210713
>>>>>>
>>>>>> So it seems the same scrub is stuck again... even after restarting the
>>>>>> osd. It just took some time until the scrub of this pg happened again.
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>> On 17.05.2017 at 21:13, Jason Dillaman wrote:
>>>>>>> Can you share your current OSD configuration? It's very curious that
>>>>>>> your scrub is getting randomly stuck on a few objects for hours at a
>>>>>>> time until an OSD is reset.
>>>>>>>
>>>>>>> On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG
>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>> Hello Jason,
>>>>>>>>
>>>>>>>> minutes ago i had another case where i restarted the osd which was
>>>>>>>> shown in the objecter_requests output.
>>>>>>>>
>>>>>>>> It seems other scrubs and deep scrubs were hanging as well.
>>>>>>>>
>>>>>>>> Output before:
>>>>>>>> 4095 active+clean
>>>>>>>>    1 active+clean+scrubbing
>>>>>>>>
>>>>>>>> Output after restart:
>>>>>>>> 4084 active+clean
>>>>>>>>    7 active+clean+scrubbing+deep
>>>>>>>>    5 active+clean+scrubbing
>>>>>>>>
>>>>>>>> Both values keep changing every few seconds, doing a lot of scrubs
>>>>>>>> and deep scrubs.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>> On 17.05.2017 at 20:36, Stefan Priebe - Profihost AG wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> that command does not exist.
>>>>>>>>>
>>>>>>>>> But at least ceph -s permanently reports 1 pg in scrubbing with no change.
>>>>>>>>>
>>>>>>>>> Log attached as well.
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>> On 17.05.2017 at 20:20, Jason Dillaman wrote:
>>>>>>>>>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>>>>>>>>>> can quickly scan the new log if you directly send it to me.
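
A side note for anyone chasing a similar hang: the primary OSD's admin
socket can usually show whether client ops really are queued behind a
scrub. A rough sketch, to be run on the node hosting osd.23 (the daemon
name resolves to the default /var/run/ceph socket path, and
dump_blocked_ops only exists on newer releases):

# ceph daemon osd.23 dump_ops_in_flight   # every op the OSD is currently processing
# ceph daemon osd.23 dump_blocked_ops     # only the ops the OSD considers blocked (e.g. behind a scrub)
# ceph daemon osd.23 dump_historic_ops    # recently completed slow ops with their event timelines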
>>>>>>>>>>
>>>>>>>>>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>> I can send the osd log - if you want?
>>>>>>>>>>>
>>>>>>>>>>> Stefan
>>>>>>>>>>>
>>>>>>>>>>> On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>
>>>>>>>>>>>> the command
>>>>>>>>>>>> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>>>>>>>>>
>>>>>>>>>>>> hangs as well. Doing absolutely nothing... waiting forever.
>>>>>>>>>>>>
>>>>>>>>>>>> Greets,
>>>>>>>>>>>> Stefan
>>>>>>>>>>>>
>>>>>>>>>>>> On 17.05.2017 at 17:05, Jason Dillaman wrote:
>>>>>>>>>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>>>>>>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>>>>>>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>>>>>>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>>>>>>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>>>>>>>>>> queued on a scrub are not properly awoken when the scrub completes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> after enabling the log and generating a gcore dump, the request was
>>>>>>>>>>>>>> successful ;-(
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So the log only contains the successful request -- that was all i
>>>>>>>>>>>>>> could catch. I can send you the log on request.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Luckily i had another VM on another cluster behaving the same.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This time osd.23:
>>>>>>>>>>>>>> # ceph --admin-daemon /var/run/ceph/ceph-client.admin.22969.140085040783360.asok objecter_requests
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>     "ops": [
>>>>>>>>>>>>>>         {
>>>>>>>>>>>>>>             "tid": 18777,
>>>>>>>>>>>>>>             "pg": "2.cebed0aa",
>>>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>>>>>>>>>             "attempts": 1,
>>>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>>>             "snap_context": "28a43=[]",
>>>>>>>>>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>>>>>>>>>             "osd_ops": [
>>>>>>>>>>>>>>                 "delete"
>>>>>>>>>>>>>>             ]
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>     ],
>>>>>>>>>>>>>>     "linger_ops": [
>>>>>>>>>>>>>>         {
>>>>>>>>>>>>>>             "linger_id": 1,
>>>>>>>>>>>>>>             "pg": "2.f0709c34",
>>>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>>>             "registered": "1"
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>     ],
>>>>>>>>>>>>>>     "pool_ops": [],
>>>>>>>>>>>>>>     "pool_stat_ops": [],
>>>>>>>>>>>>>>     "statfs_ops": [],
>>>>>>>>>>>>>>     "command_ops": []
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> OSD Logfile of OSD 23 attached.
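
For completeness: the object named in the objecter_requests output
above can be mapped back to its placement group and primary OSD from
any admin node, which is a quick way to confirm that the stuck delete
really is waiting on pg 2.aa with osd.23 as primary. cephstor6 is the
pool name used earlier in the thread, and the exact output format
varies a little between releases:

# ceph osd map cephstor6 rbd_data.21aafa6b8b4567.0000000000000aaa   # object -> PG -> up/acting OSDs
# ceph pg 2.aa query                                                # detailed PG state as reported by the primary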
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 17.05.2017 at 16:26, Jason Dillaman wrote:
>>>>>>>>>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>> You mean the request no matter if it is successful or not? Which log
>>>>>>>>>>>>>>>> level should be set to 20?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>>>>>>>>>>>>> increased -- "debug osd = 20" would be nice if you can turn it up that
>>>>>>>>>>>>>>> high while attempting to capture the blocked op.
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
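
For reference, the debug logging and gcore dump discussed in this
thread would typically be captured along these lines. This is only a
sketch: osd.23 is the primary OSD from the output above,
/var/log/ceph/ceph-osd.23.log is the default log location, debug_osd 20
makes the log grow very quickly (so revert it afterwards), and gcore
briefly pauses the process while it writes the core:

# ceph tell osd.23 injectargs '--debug_osd 20'    # raise OSD logging at runtime
  (reproduce the hanging rados rm, then collect /var/log/ceph/ceph-osd.23.log)
# ceph tell osd.23 injectargs '--debug_osd 0/5'   # back to the usual default
# pidof ceph-osd                                  # pick the PID that belongs to osd.23 if several OSDs run on the host
# gcore -o /var/tmp/ceph-osd.23 <PID>             # writes /var/tmp/ceph-osd.23.<PID>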