If you have the debug symbols installed, I'd suggest capturing
"thread apply all bt" output in addition to a "generate-core-file".
The backtrace would only be useful if it shows a thread deadlocked
on something.
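A minimal sketch of how both could be captured with gdb, assuming the
ceph-osd debug symbol package is installed and <pid-of-osd.23> stands in
for the actual pid of the OSD that owns the hanging scrub:

# gdb -p <pid-of-osd.23>
(gdb) thread apply all bt     # per-thread backtraces; look for a thread blocked in the scrub/op path
(gdb) generate-core-file      # writes core.<pid> in the current directory without killing the daemon
(gdb) detach
(gdb) quit

Attaching gdb pauses the daemon, so expect the OSD to be unresponsive
(and possibly flag slow requests) while the backtraces and core file are
written.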
On Mon, May 22, 2017 at 9:29 AM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> Hello Jason,
>
> should I do a coredump or a "thread apply all bt"?
>
> I don't know which is better.
>
> Greets,
> Stefan
>
> On 22.05.2017 at 15:19, Jason Dillaman wrote:
>> If you cannot recreate it with debug logging enabled, that might be the
>> next best option.
>>
>> On Mon, May 22, 2017 at 2:30 AM, Stefan Priebe - Profihost AG
>> <s.priebe@xxxxxxxxxxxx> wrote:
>>> Hello Jason,
>>>
>>> I had another 8 cases where a scrub was running for hours. Sadly I
>>> couldn't get it to hang again after an OSD restart. Any further ideas?
>>>
>>> A coredump of the OSD with the hanging scrub?
>>>
>>> Greets,
>>> Stefan
>>>
>>> On 18.05.2017 at 17:26, Jason Dillaman wrote:
>>>> I'm unfortunately out of ideas at the moment. I think the best chance
>>>> of figuring out what is wrong is to repeat it while logs are enabled.
>>>>
>>>> On Wed, May 17, 2017 at 4:51 PM, Stefan Priebe - Profihost AG
>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>> No, I can't reproduce it with active logs. Any further ideas?
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>> On 17.05.2017 at 21:26, Stefan Priebe - Profihost AG wrote:
>>>>>> On 17.05.2017 at 21:21, Jason Dillaman wrote:
>>>>>>> Any chance you still have debug logs enabled on OSD 23 after you
>>>>>>> restarted it and the scrub froze again?
>>>>>>
>>>>>> No, but I can do that ;-) Hopefully it freezes again.
>>>>>>
>>>>>> Stefan
>>>>>>
>>>>>>> On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG
>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> now it shows again:
>>>>>>>>>> 4095 active+clean
>>>>>>>>>>    1 active+clean+scrubbing
>>>>>>>>
>>>>>>>> and:
>>>>>>>> # ceph pg dump | grep -i scrub
>>>>>>>> dumped all in format plain
>>>>>>>> pg_stat objects mip degr misp unf bytes log disklog
>>>>>>>> state state_stamp v reported up up_primary
>>>>>>>> acting acting_primary last_scrub scrub_stamp last_deep_scrub
>>>>>>>> deep_scrub_stamp
>>>>>>>> 2.aa 4040 0 0 0 0 10128667136 3010
>>>>>>>> 3010 active+clean+scrubbing 2017-05-11 09:37:37.962700
>>>>>>>> 181936'11196478 181936:8688051 [23,41,9] 23 [23,41,9]
>>>>>>>> 23 176730'10793226 2017-05-10 03:43:20.849784 171715'10548192
>>>>>>>> 2017-05-04 14:27:39.210713
>>>>>>>>
>>>>>>>> So it seems the same scrub is stuck again... even after restarting the
>>>>>>>> OSD. It just took some time until the scrub of this PG happened again.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
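A quick way to confirm that client ops really are queued behind the stuck
scrub is to look at the PG and at the op tracker of its primary. This is a
rough sketch, assuming (as in the pg dump above) that pg 2.aa has osd.23 as
its primary and that the "ceph daemon" commands are run on the node hosting
osd.23 with the default admin socket path:

# ceph pg 2.aa query > /tmp/pg-2.aa-query.json   # full peering/scrub state of the stuck PG
# ceph daemon osd.23 dump_ops_in_flight          # in-flight ops; requests delayed by the scrub should show up here
# ceph daemon osd.23 dump_historic_ops           # recently completed slow ops, for comparison

Ops that sit in dump_ops_in_flight for hours without progressing past a
scrub-related wait state would match the "queued on a scrub and never
awoken" behaviour discussed further down in the thread.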
>>>>>>>> On 17.05.2017 at 21:13, Jason Dillaman wrote:
>>>>>>>>> Can you share your current OSD configuration? It's very curious that
>>>>>>>>> your scrub is getting randomly stuck on a few objects for hours at a
>>>>>>>>> time until an OSD is reset.
>>>>>>>>>
>>>>>>>>> On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG
>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>> Hello Jason,
>>>>>>>>>>
>>>>>>>>>> Minutes ago I had another case where I restarted the OSD that was shown
>>>>>>>>>> in the objecter_requests output.
>>>>>>>>>>
>>>>>>>>>> It seems other scrubs and deep scrubs were hanging as well.
>>>>>>>>>>
>>>>>>>>>> Output before:
>>>>>>>>>> 4095 active+clean
>>>>>>>>>>    1 active+clean+scrubbing
>>>>>>>>>>
>>>>>>>>>> Output after restart:
>>>>>>>>>> 4084 active+clean
>>>>>>>>>>    7 active+clean+scrubbing+deep
>>>>>>>>>>    5 active+clean+scrubbing
>>>>>>>>>>
>>>>>>>>>> Both values have been changing every few seconds again, doing a lot of
>>>>>>>>>> scrubs and deep scrubs.
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
>>>>>>>>>> On 17.05.2017 at 20:36, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> that command does not exist.
>>>>>>>>>>>
>>>>>>>>>>> But at least ceph -s permanently reports 1 pg in scrubbing with no change.
>>>>>>>>>>>
>>>>>>>>>>> Log attached as well.
>>>>>>>>>>>
>>>>>>>>>>> Greets,
>>>>>>>>>>> Stefan
>>>>>>>>>>> On 17.05.2017 at 20:20, Jason Dillaman wrote:
>>>>>>>>>>>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>>>>>>>>>>>> can quickly scan the new log if you directly send it to me.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>> I can send the OSD log - if you want?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the command
>>>>>>>>>>>>>> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> hangs as well. It does absolutely nothing... waiting forever.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 17.05.2017 at 17:05, Jason Dillaman wrote:
>>>>>>>>>>>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>>>>>>>>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>>>>>>>>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>>>>>>>>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>>>>>>>>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>>>>>>>>>>>> queued on a scrub are not properly awoken when the scrub completes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> After enabling the log and generating a gcore dump, the request was
>>>>>>>>>>>>>>>> successful ;-(
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So the log only contains the successful request -- that was the only
>>>>>>>>>>>>>>>> one I was able to catch. I can send you the log on request.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Luckily I had another VM on another cluster behaving the same way.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This time it is osd.23:
>>>>>>>>>>>>>>>> # ceph --admin-daemon
>>>>>>>>>>>>>>>> /var/run/ceph/ceph-client.admin.22969.140085040783360.asok
>>>>>>>>>>>>>>>> objecter_requests
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>     "ops": [
>>>>>>>>>>>>>>>>         {
>>>>>>>>>>>>>>>>             "tid": 18777,
>>>>>>>>>>>>>>>>             "pg": "2.cebed0aa",
>>>>>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>>>>>>>>>>>             "attempts": 1,
>>>>>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>>>>>             "snap_context": "28a43=[]",
>>>>>>>>>>>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>>>>>>>>>>>             "osd_ops": [
>>>>>>>>>>>>>>>>                 "delete"
>>>>>>>>>>>>>>>>             ]
>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>     ],
>>>>>>>>>>>>>>>>     "linger_ops": [
>>>>>>>>>>>>>>>>         {
>>>>>>>>>>>>>>>>             "linger_id": 1,
>>>>>>>>>>>>>>>>             "pg": "2.f0709c34",
>>>>>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>>>>>             "registered": "1"
>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>     ],
>>>>>>>>>>>>>>>>     "pool_ops": [],
>>>>>>>>>>>>>>>>     "pool_stat_ops": [],
>>>>>>>>>>>>>>>>     "statfs_ops": [],
>>>>>>>>>>>>>>>>     "command_ops": []
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> OSD logfile of OSD 23 attached.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 17.05.2017 at 16:26, Jason Dillaman wrote:
>>>>>>>>>>>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>>> You mean the request no matter if it is successful or not? Which log
>>>>>>>>>>>>>>>>>> level should be set to 20?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>>>>>>>>>>>>>>> increased -- "debug osd = 20" would be nice if you can turn it up that
>>>>>>>>>>>>>>>>> high while attempting to capture the blocked op.

--
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com