Hello Jason,

I had another 8 cases where a scrub was running for hours. Sadly I couldn't
get it to hang again after an OSD restart. Any further ideas? Would a
coredump of the OSD with the hanging scrub help?

Greets,
Stefan

On 18.05.2017 at 17:26, Jason Dillaman wrote:
> I'm unfortunately out of ideas at the moment. I think the best chance
> of figuring out what is wrong is to repeat it while logs are enabled.
>
> On Wed, May 17, 2017 at 4:51 PM, Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>> No, I can't reproduce it with active logs. Any further ideas?
>>
>> Greets,
>> Stefan
>>
>> On 17.05.2017 at 21:26, Stefan Priebe - Profihost AG wrote:
>>> On 17.05.2017 at 21:21, Jason Dillaman wrote:
>>>> Any chance you still have debug logs enabled on OSD 23 after you
>>>> restarted it and the scrub froze again?
>>>
>>> No, but I can do that ;-) Hopefully it freezes again.
>>>
>>> Stefan
>>>
>>>> On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG
>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>> Hello,
>>>>>
>>>>> now it shows again:
>>>>>>> 4095 active+clean
>>>>>>>    1 active+clean+scrubbing
>>>>>
>>>>> and:
>>>>> # ceph pg dump | grep -i scrub
>>>>> dumped all in format plain
>>>>> pg_stat objects mip degr misp unf bytes log disklog
>>>>> state state_stamp v reported up up_primary
>>>>> acting acting_primary last_scrub scrub_stamp last_deep_scrub
>>>>> deep_scrub_stamp
>>>>> 2.aa 4040 0 0 0 0 10128667136 3010
>>>>> 3010 active+clean+scrubbing 2017-05-11 09:37:37.962700
>>>>> 181936'11196478 181936:8688051 [23,41,9] 23 [23,41,9]
>>>>> 23 176730'10793226 2017-05-10 03:43:20.849784 171715'10548192
>>>>> 2017-05-04 14:27:39.210713
>>>>>
>>>>> So it seems the same scrub is stuck again... even after restarting the
>>>>> OSD. It just took some time until the scrub of this PG happened again.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>> On 17.05.2017 at 21:13, Jason Dillaman wrote:
>>>>>> Can you share your current OSD configuration? It's very curious that
>>>>>> your scrub is getting randomly stuck on a few objects for hours at a
>>>>>> time until an OSD is reset.
>>>>>>
>>>>>> On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG
>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>> Hello Jason,
>>>>>>>
>>>>>>> Minutes ago I had another case where I restarted the OSD that was shown
>>>>>>> in the objecter_requests output.
>>>>>>>
>>>>>>> It seems other scrubs and deep scrubs were hanging as well.
>>>>>>>
>>>>>>> Output before:
>>>>>>> 4095 active+clean
>>>>>>>    1 active+clean+scrubbing
>>>>>>>
>>>>>>> Output after restart:
>>>>>>> 4084 active+clean
>>>>>>>    7 active+clean+scrubbing+deep
>>>>>>>    5 active+clean+scrubbing
>>>>>>>
>>>>>>> Both values are changing every few seconds again, doing a lot of scrubs
>>>>>>> and deep scrubs.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 17.05.2017 at 20:36, Stefan Priebe - Profihost AG wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> that command does not exist.
>>>>>>>>
>>>>>>>> But at least ceph -s permanently reports 1 PG in scrubbing, with no change.
>>>>>>>>
>>>>>>>> Log attached as well.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> On 17.05.2017 at 20:20, Jason Dillaman wrote:
>>>>>>>>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>>>>>>>>> can quickly scan the new log if you directly send it to me.
>>>>>>>>>
>>>>>>>>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>> I can send the OSD log - if you want?
>>>>>>>>>>
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>> On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>
>>>>>>>>>>> the command
>>>>>>>>>>> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>>>>>>>>
>>>>>>>>>>> hangs as well. It does absolutely nothing... waiting forever.
>>>>>>>>>>>
>>>>>>>>>>> Greets,
>>>>>>>>>>> Stefan
>>>>>>>>>>>
>>>>>>>>>>> On 17.05.2017 at 17:05, Jason Dillaman wrote:
>>>>>>>>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>>>>>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>>>>>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>>>>>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>>>>>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>>>>>>>>> queued on a scrub are not properly awoken when the scrub completes.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>>
>>>>>>>>>>>>> After enabling the log and generating a gcore dump, the request was
>>>>>>>>>>>>> successful ;-(
>>>>>>>>>>>>>
>>>>>>>>>>>>> So the log only contains the successful request; that was the only
>>>>>>>>>>>>> one I was able to catch. I can send you the log on request.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Luckily I had another VM on another cluster behaving the same way.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This time it's osd.23:
>>>>>>>>>>>>> # ceph --admin-daemon
>>>>>>>>>>>>> /var/run/ceph/ceph-client.admin.22969.140085040783360.asok
>>>>>>>>>>>>> objecter_requests
>>>>>>>>>>>>> {
>>>>>>>>>>>>>     "ops": [
>>>>>>>>>>>>>         {
>>>>>>>>>>>>>             "tid": 18777,
>>>>>>>>>>>>>             "pg": "2.cebed0aa",
>>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>>>>>>>>             "attempts": 1,
>>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>>             "snap_context": "28a43=[]",
>>>>>>>>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>>>>>>>>             "osd_ops": [
>>>>>>>>>>>>>                 "delete"
>>>>>>>>>>>>>             ]
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>     ],
>>>>>>>>>>>>>     "linger_ops": [
>>>>>>>>>>>>>         {
>>>>>>>>>>>>>             "linger_id": 1,
>>>>>>>>>>>>>             "pg": "2.f0709c34",
>>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>>             "registered": "1"
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>     ],
>>>>>>>>>>>>>     "pool_ops": [],
>>>>>>>>>>>>>     "pool_stat_ops": [],
>>>>>>>>>>>>>     "statfs_ops": [],
>>>>>>>>>>>>>     "command_ops": []
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> The log file of OSD 23 is attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 17.05.2017 at 16:26, Jason Dillaman wrote:
>>>>>>>>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>> You mean the request no matter if it is successful or not? Which log
>>>>>>>>>>>>>>> level should be set to 20?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>>>>>>>>>>>> increased -- "debug osd = 20" would be nice if you can turn it up that
>>>>>>>>>>>>>> high while attempting to capture the blocked op.
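
A minimal sketch of how one might capture the state of the OSD that owns a
stuck scrub (osd.23 in this thread) without restarting it. The debug values,
the output paths and the OSD_PID variable are illustrative assumptions, not
something from this thread -- OSD_PID has to be set to the PID of the
ceph-osd process for osd.23 on its host, and the commands must be run there:

# ceph tell osd.23 injectargs '--debug-osd 20/20'   ## raise OSD log verbosity while the op is blocked
# ceph daemon osd.23 dump_ops_in_flight             ## look for the op stuck waiting on the scrub
# gdb --batch -p "$OSD_PID" -ex 'thread apply all bt' > /tmp/osd.23-threads.txt
                                                    ## thread backtraces of the stuck OSD
# gcore -o /tmp/osd.23-core "$OSD_PID"              ## full coredump, if the backtraces are not enough
# ceph tell osd.23 injectargs '--debug-osd 1/5'     ## restore a normal debug level afterwards

Note that attaching gdb or gcore pauses the OSD process while the dump is
taken, so it may briefly miss heartbeats; for a one-off capture that is
usually acceptable.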