Hello Jason,

should I do a coredump or a "thread apply all bt"? I don't know which is better.

Greets,
Stefan

On 22.05.2017 at 15:19, Jason Dillaman wrote:
> If you cannot recreate it with debug logging enabled, that might be the
> next best option.
>
> On Mon, May 22, 2017 at 2:30 AM, Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>> Hello Jason,
>>
>> I had another 8 cases where a scrub was running for hours. Sadly, I
>> couldn't get it to hang again after an OSD restart. Any further ideas?
>>
>> A coredump of the OSD with the hanging scrub?
>>
>> Greets,
>> Stefan
>>
>> On 18.05.2017 at 17:26, Jason Dillaman wrote:
>>> I'm unfortunately out of ideas at the moment. I think the best chance
>>> of figuring out what is wrong is to repeat it while logs are enabled.
>>>
>>> On Wed, May 17, 2017 at 4:51 PM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>> No, I can't reproduce it with active logging. Any further ideas?
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>> On 17.05.2017 at 21:26, Stefan Priebe - Profihost AG wrote:
>>>>> On 17.05.2017 at 21:21, Jason Dillaman wrote:
>>>>>> Any chance you still have debug logs enabled on OSD 23 after you
>>>>>> restarted it and the scrub froze again?
>>>>>
>>>>> No, but I can do that ;-) Hopefully it freezes again.
>>>>>
>>>>> Stefan
>>>>>
>>>>>> On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG
>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> now it shows again:
>>>>>>>>> 4095 active+clean
>>>>>>>>> 1 active+clean+scrubbing
>>>>>>>
>>>>>>> and:
>>>>>>> # ceph pg dump | grep -i scrub
>>>>>>> dumped all in format plain
>>>>>>> pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
>>>>>>> 2.aa 4040 0 0 0 0 10128667136 3010 3010 active+clean+scrubbing 2017-05-11 09:37:37.962700 181936'11196478 181936:8688051 [23,41,9] 23 [23,41,9] 23 176730'10793226 2017-05-10 03:43:20.849784 171715'10548192 2017-05-04 14:27:39.210713
>>>>>>>
>>>>>>> So it seems the same scrub is stuck again, even after restarting the
>>>>>>> OSD. It just took some time until the scrub of this PG happened again.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 17.05.2017 at 21:13, Jason Dillaman wrote:
>>>>>>>> Can you share your current OSD configuration? It's very curious that
>>>>>>>> your scrub is getting randomly stuck on a few objects for hours at a
>>>>>>>> time until an OSD is reset.
>>>>>>>>
>>>>>>>> On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>> Hello Jason,
>>>>>>>>>
>>>>>>>>> Minutes ago I had another case where I restarted the OSD that was
>>>>>>>>> shown in the objecter_requests output.
>>>>>>>>>
>>>>>>>>> It seems other scrubs and deep scrubs were hanging as well.
>>>>>>>>>
>>>>>>>>> Output before:
>>>>>>>>> 4095 active+clean
>>>>>>>>> 1 active+clean+scrubbing
>>>>>>>>>
>>>>>>>>> Output after restart:
>>>>>>>>> 4084 active+clean
>>>>>>>>> 7 active+clean+scrubbing+deep
>>>>>>>>> 5 active+clean+scrubbing
>>>>>>>>>
>>>>>>>>> Both values keep changing every few seconds again, with a lot of
>>>>>>>>> scrubs and deep scrubs running.
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> On 17.05.2017 at 20:36, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> that command does not exist.
>>>>>>>>>>
>>>>>>>>>> But at least ceph -s permanently reports 1 PG in scrubbing with no change.
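>>>>>>>>>>
>>>>>>>>>> For reference, I am watching it roughly like this (nothing special,
>>>>>>>>>> just the status plus the PG dump we already used; names and paths
>>>>>>>>>> are the defaults):
>>>>>>>>>>
>>>>>>>>>> # ceph -s | grep scrub
>>>>>>>>>> # ceph pg dump | grep -i scrubbing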
>>>>>>>>>>
>>>>>>>>>> Log attached as well.
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>> On 17.05.2017 at 20:20, Jason Dillaman wrote:
>>>>>>>>>>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>>>>>>>>>>> can quickly scan the new log if you send it directly to me.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>> I can send the OSD log if you want.
>>>>>>>>>>>>
>>>>>>>>>>>> Stefan
>>>>>>>>>>>>
>>>>>>>>>>>> On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>>
>>>>>>>>>>>>> the command
>>>>>>>>>>>>> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>>>>>>>>>>
>>>>>>>>>>>>> hangs as well. It does absolutely nothing... waiting forever.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 17.05.2017 at 17:05, Jason Dillaman wrote:
>>>>>>>>>>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>>>>>>>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>>>>>>>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>>>>>>>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>>>>>>>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>>>>>>>>>>> queued on a scrub are not properly awoken when the scrub completes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> after enabling the log and generating a gcore dump, the request was
>>>>>>>>>>>>>>> successful ;-(
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So the log only contains the successful request; that was all I was
>>>>>>>>>>>>>>> able to catch. I can send you the log on request.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Luckily, I had another VM on another cluster behaving the same way.
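>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To find the matching client admin socket I simply listed the
>>>>>>>>>>>>>>> sockets on the host (this assumes the default run directory):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # ls /var/run/ceph/ceph-client.*.asok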
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This time it is osd.23:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # ceph --admin-daemon \
>>>>>>>>>>>>>>>     /var/run/ceph/ceph-client.admin.22969.140085040783360.asok \
>>>>>>>>>>>>>>>     objecter_requests
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>     "ops": [
>>>>>>>>>>>>>>>         {
>>>>>>>>>>>>>>>             "tid": 18777,
>>>>>>>>>>>>>>>             "pg": "2.cebed0aa",
>>>>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>>>>>>>>>>             "attempts": 1,
>>>>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>>>>             "snap_context": "28a43=[]",
>>>>>>>>>>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>>>>>>>>>>             "osd_ops": [
>>>>>>>>>>>>>>>                 "delete"
>>>>>>>>>>>>>>>             ]
>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>     ],
>>>>>>>>>>>>>>>     "linger_ops": [
>>>>>>>>>>>>>>>         {
>>>>>>>>>>>>>>>             "linger_id": 1,
>>>>>>>>>>>>>>>             "pg": "2.f0709c34",
>>>>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>>>>             "registered": "1"
>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>     ],
>>>>>>>>>>>>>>>     "pool_ops": [],
>>>>>>>>>>>>>>>     "pool_stat_ops": [],
>>>>>>>>>>>>>>>     "statfs_ops": [],
>>>>>>>>>>>>>>>     "command_ops": []
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The logfile of OSD 23 is attached.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 17.05.2017 at 16:26, Jason Dillaman wrote:
>>>>>>>>>>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>> You mean the request, no matter whether it is successful or not?
>>>>>>>>>>>>>>>>> Which log level should be set to 20?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>>>>>>>>>>>>>> increased -- "debug osd = 20" would be nice if you can turn it up that
>>>>>>>>>>>>>>>> high while attempting to capture the blocked op.
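>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Something along these lines should let you raise it at runtime,
>>>>>>>>>>>>>>>> assuming injectargs is available in your release (replace N with
>>>>>>>>>>>>>>>> the id of the OSD from the objecter output):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # ceph tell osd.N injectargs '--debug-osd 20/20'
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> and turn it back down once you've captured the hang, e.g.:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # ceph tell osd.N injectargs '--debug-osd 0/5'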