No, I can't reproduce it with active logs. Any further ideas?

Greets,
Stefan

On 17.05.2017 at 21:26, Stefan Priebe - Profihost AG wrote:
> On 17.05.2017 at 21:21, Jason Dillaman wrote:
>> Any chance you still have debug logs enabled on OSD 23 after you
>> restarted it and the scrub froze again?
>
> No, but I can do that ;-) Hopefully it freezes again.
>
> Stefan
>
>> On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG
>> <s.priebe@xxxxxxxxxxxx> wrote:
>>> Hello,
>>>
>>> now it shows again:
>>>>>         4095 active+clean
>>>>>            1 active+clean+scrubbing
>>>
>>> and:
>>> # ceph pg dump | grep -i scrub
>>> dumped all in format plain
>>> pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
>>> 2.aa 4040 0 0 0 0 10128667136 3010 3010 active+clean+scrubbing 2017-05-11 09:37:37.962700 181936'11196478 181936:8688051 [23,41,9] 23 [23,41,9] 23 176730'10793226 2017-05-10 03:43:20.849784 171715'10548192 2017-05-04 14:27:39.210713
>>>
>>> So it seems the same scrub is stuck again... even after restarting the
>>> OSD. It just took some time until the scrub of this PG happened again.
>>>
>>> Greets,
>>> Stefan
>>> On 17.05.2017 at 21:13, Jason Dillaman wrote:
>>>> Can you share your current OSD configuration? It's very curious that
>>>> your scrub is getting randomly stuck on a few objects for hours at a
>>>> time until an OSD is reset.
>>>>
>>>> On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG
>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>> Hello Jason,
>>>>>
>>>>> Minutes ago I had another case where I restarted the OSD which was
>>>>> shown in the objecter_requests output.
>>>>>
>>>>> It seems other scrubs and deep scrubs were hanging as well.
>>>>>
>>>>> Output before:
>>>>>         4095 active+clean
>>>>>            1 active+clean+scrubbing
>>>>>
>>>>> Output after restart:
>>>>>         4084 active+clean
>>>>>            7 active+clean+scrubbing+deep
>>>>>            5 active+clean+scrubbing
>>>>>
>>>>> Both values are changing every few seconds again, doing a lot of
>>>>> scrubs and deep scrubs.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>> On 17.05.2017 at 20:36, Stefan Priebe - Profihost AG wrote:
>>>>>> Hi,
>>>>>>
>>>>>> that command does not exist.
>>>>>>
>>>>>> But at least ceph -s permanently reports 1 PG in scrubbing with no
>>>>>> change.
>>>>>>
>>>>>> Log attached as well.
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>> On 17.05.2017 at 20:20, Jason Dillaman wrote:
>>>>>>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>>>>>>> can quickly scan the new log if you directly send it to me.
>>>>>>>
>>>>>>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>> I can send the OSD log - if you want?
>>>>>>>>
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
>>>>>>>>> Hello Jason,
>>>>>>>>>
>>>>>>>>> the command
>>>>>>>>> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>>>>>>
>>>>>>>>> hangs as well, doing absolutely nothing... waiting forever.
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> On 17.05.2017 at 17:05, Jason Dillaman wrote:
>>>>>>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>>>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>>>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>>>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>>>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>>>>>>> queued on a scrub are not properly awoken when the scrub completes.
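A minimal sketch of what could be checked on the OSD side while the delete hangs; osd.23 and pg 2.aa are taken from the output above, everything else assumes a default install, and the admin-socket commands have to be run on the host carrying osd.23:

# ceph pg 2.aa query
# ceph daemon osd.23 dump_ops_in_flight
# ceph daemon osd.23 dump_historic_ops

The first command asks the primary (osd.23) for the PG's current state; the other two show which client ops the OSD itself still considers in flight or recently completed, which should make it visible whether the delete is still sitting in the OSD's queue while the scrub runs.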
>>>>>>>>>>
>>>>>>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>
>>>>>>>>>>> after enabling the log and generating a gcore dump, the request was
>>>>>>>>>>> successful ;-(
>>>>>>>>>>>
>>>>>>>>>>> So the log only contains the successful request - that is the only
>>>>>>>>>>> one I was able to catch. I can send you the log on request.
>>>>>>>>>>>
>>>>>>>>>>> Luckily I had another VM on another cluster behaving the same way.
>>>>>>>>>>>
>>>>>>>>>>> This time osd.23:
>>>>>>>>>>> # ceph --admin-daemon /var/run/ceph/ceph-client.admin.22969.140085040783360.asok objecter_requests
>>>>>>>>>>> {
>>>>>>>>>>>     "ops": [
>>>>>>>>>>>         {
>>>>>>>>>>>             "tid": 18777,
>>>>>>>>>>>             "pg": "2.cebed0aa",
>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>>>>>>             "attempts": 1,
>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>             "snap_context": "28a43=[]",
>>>>>>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>>>>>>             "osd_ops": [
>>>>>>>>>>>                 "delete"
>>>>>>>>>>>             ]
>>>>>>>>>>>         }
>>>>>>>>>>>     ],
>>>>>>>>>>>     "linger_ops": [
>>>>>>>>>>>         {
>>>>>>>>>>>             "linger_id": 1,
>>>>>>>>>>>             "pg": "2.f0709c34",
>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>             "registered": "1"
>>>>>>>>>>>         }
>>>>>>>>>>>     ],
>>>>>>>>>>>     "pool_ops": [],
>>>>>>>>>>>     "pool_stat_ops": [],
>>>>>>>>>>>     "statfs_ops": [],
>>>>>>>>>>>     "command_ops": []
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> OSD logfile of OSD 23 attached.
>>>>>>>>>>>
>>>>>>>>>>> Greets,
>>>>>>>>>>> Stefan
>>>>>>>>>>>
>>>>>>>>>>> On 17.05.2017 at 16:26, Jason Dillaman wrote:
>>>>>>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>> You mean the request, no matter whether it is successful or not?
>>>>>>>>>>>>> Which log level should be set to 20?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>>>>>>>>>> increased -- "debug osd = 20" would be nice if you can turn it up that
>>>>>>>>>>>> high while attempting to capture the blocked op.
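A rough sketch of how that logging could be raised at runtime without restarting the OSD and reverted afterwards; the pool and object names are reused from this thread, the log path and the "normal" levels are assumptions (0/5 is the usual default for debug osd):

# ceph tell osd.23 injectargs '--debug-osd 20 --debug-ms 1'
# rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
# ceph tell osd.23 injectargs '--debug-osd 0/5 --debug-ms 0/5'
# less /var/log/ceph/ceph-osd.23.log

injectargs only changes the running daemon, so nothing has to be added to ceph.conf for a one-off capture; if the remove hangs again while the level is raised, the OSD log should show in detail what happens to that op around the scrub.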