Hello Jason,

attached is the "thread apply all bt" output taken while a pg is deadlocked in
scrubbing. The core dump is 5.5 GB (500 MB compressed). A rough sketch of a gdb
session for capturing such a backtrace is included below the quoted thread.

pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.137 2028 0 0 0 0 7236313088 3007 3007 active+clean+scrubbing 2017-05-21 04:45:30.321710 453856'20824101 453856:25450494 [116,37,85] 116 [116,37,85] 116 452296'20769703 2017-05-19 23:40:14.150887 451238'20721095 2017-05-17 21:25:09.174598

Greets,
Stefan

Am 22.05.2017 um 15:32 schrieb Jason Dillaman:
> If you have the debug symbols installed, I'd say "thread apply all bt"
> in addition to a "generate-core-file". The backtrace would only be
> useful if it showed a thread deadlocked on something.
>
> On Mon, May 22, 2017 at 9:29 AM, Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>> Hello Jason,
>>
>> should i do a coredump or a thread apply all bt?
>>
>> Don't know what is better.
>>
>> Greets,
>> Stefan
>>
>> Am 22.05.2017 um 15:19 schrieb Jason Dillaman:
>>> If you cannot recreate it with debug logging enabled, that might be the
>>> next best option.
>>>
>>> On Mon, May 22, 2017 at 2:30 AM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>> Hello Jason,
>>>>
>>>> i had another 8 cases where a scrub was running for hours. Sadly i
>>>> couldn't get it to hang again after an osd restart. Any further ideas?
>>>>
>>>> Coredump of the OSD with the hanging scrub?
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>> Am 18.05.2017 um 17:26 schrieb Jason Dillaman:
>>>>> I'm unfortunately out of ideas at the moment. I think the best chance
>>>>> of figuring out what is wrong is to repeat it while logs are enabled.
>>>>>
>>>>> On Wed, May 17, 2017 at 4:51 PM, Stefan Priebe - Profihost AG
>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>> No, i can't reproduce it with active logs. Any further ideas?
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>> Am 17.05.2017 um 21:26 schrieb Stefan Priebe - Profihost AG:
>>>>>>> Am 17.05.2017 um 21:21 schrieb Jason Dillaman:
>>>>>>>> Any chance you still have debug logs enabled on OSD 23 after you
>>>>>>>> restarted it and the scrub froze again?
>>>>>>>
>>>>>>> No, but i can do that ;-) Hopefully it freezes again.
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> now it shows again:
>>>>>>>>>>> 4095 active+clean
>>>>>>>>>>>    1 active+clean+scrubbing
>>>>>>>>>
>>>>>>>>> and:
>>>>>>>>> # ceph pg dump | grep -i scrub
>>>>>>>>> dumped all in format plain
>>>>>>>>> pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
>>>>>>>>> 2.aa 4040 0 0 0 0 10128667136 3010 3010 active+clean+scrubbing 2017-05-11 09:37:37.962700 181936'11196478 181936:8688051 [23,41,9] 23 [23,41,9] 23 176730'10793226 2017-05-10 03:43:20.849784 171715'10548192 2017-05-04 14:27:39.210713
>>>>>>>>>
>>>>>>>>> So it seems the same scrub is stuck again... even after restarting the
>>>>>>>>> osd. It just took some time until the scrub of this pg happened again.
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>> Am 17.05.2017 um 21:13 schrieb Jason Dillaman:
>>>>>>>>>> Can you share your current OSD configuration?
>>>>>>>>>> It's very curious that your scrub is getting randomly stuck on a few
>>>>>>>>>> objects for hours at a time until an OSD is reset.
>>>>>>>>>>
>>>>>>>>>> On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG
>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>
>>>>>>>>>>> minutes ago i had another case where i restarted the osd which was shown
>>>>>>>>>>> in the objecter_requests output.
>>>>>>>>>>>
>>>>>>>>>>> It seems other scrubs and deep scrubs were hanging as well.
>>>>>>>>>>>
>>>>>>>>>>> Output before:
>>>>>>>>>>> 4095 active+clean
>>>>>>>>>>>    1 active+clean+scrubbing
>>>>>>>>>>>
>>>>>>>>>>> Output after restart:
>>>>>>>>>>> 4084 active+clean
>>>>>>>>>>>    7 active+clean+scrubbing+deep
>>>>>>>>>>>    5 active+clean+scrubbing
>>>>>>>>>>>
>>>>>>>>>>> Both values are changing every few seconds again, doing a lot of scrubs
>>>>>>>>>>> and deep scrubs.
>>>>>>>>>>>
>>>>>>>>>>> Greets,
>>>>>>>>>>> Stefan
>>>>>>>>>>> Am 17.05.2017 um 20:36 schrieb Stefan Priebe - Profihost AG:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> that command does not exist.
>>>>>>>>>>>>
>>>>>>>>>>>> But at least ceph -s permanently reports 1 pg in scrubbing with no change.
>>>>>>>>>>>>
>>>>>>>>>>>> Log attached as well.
>>>>>>>>>>>>
>>>>>>>>>>>> Greets,
>>>>>>>>>>>> Stefan
>>>>>>>>>>>> Am 17.05.2017 um 20:20 schrieb Jason Dillaman:
>>>>>>>>>>>>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>>>>>>>>>>>>> can quickly scan the new log if you directly send it to me.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>> can send the osd log - if you want?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 17.05.2017 um 20:13 schrieb Stefan Priebe - Profihost AG:
>>>>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> the command
>>>>>>>>>>>>>>> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> hangs as well. Doing absolutely nothing... waiting forever.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am 17.05.2017 um 17:05 schrieb Jason Dillaman:
>>>>>>>>>>>>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>>>>>>>>>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>>>>>>>>>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>>>>>>>>>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>>>>>>>>>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>>>>>>>>>>>>> queued on a scrub are not properly awoken when the scrub completes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> after enabling the log and generating a gcore dump, the request was
>>>>>>>>>>>>>>>>> successful ;-(
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So the log only contains the successful request -- i was only able to
>>>>>>>>>>>>>>>>> catch that one. I can send you the log on request.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Luckily i had another VM on another cluster behaving the same.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This time osd.23:
>>>>>>>>>>>>>>>>> # ceph --admin-daemon \
>>>>>>>>>>>>>>>>>     /var/run/ceph/ceph-client.admin.22969.140085040783360.asok \
>>>>>>>>>>>>>>>>>     objecter_requests
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>     "ops": [
>>>>>>>>>>>>>>>>>         {
>>>>>>>>>>>>>>>>>             "tid": 18777,
>>>>>>>>>>>>>>>>>             "pg": "2.cebed0aa",
>>>>>>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>>>>>>>>>>>>             "attempts": 1,
>>>>>>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>>>>>>             "snap_context": "28a43=[]",
>>>>>>>>>>>>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>>>>>>>>>>>>             "osd_ops": [
>>>>>>>>>>>>>>>>>                 "delete"
>>>>>>>>>>>>>>>>>             ]
>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>     ],
>>>>>>>>>>>>>>>>>     "linger_ops": [
>>>>>>>>>>>>>>>>>         {
>>>>>>>>>>>>>>>>>             "linger_id": 1,
>>>>>>>>>>>>>>>>>             "pg": "2.f0709c34",
>>>>>>>>>>>>>>>>>             "osd": 23,
>>>>>>>>>>>>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>>>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>>>>>>>>             "paused": 0,
>>>>>>>>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>>>>>>>>             "registered": "1"
>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>     ],
>>>>>>>>>>>>>>>>>     "pool_ops": [],
>>>>>>>>>>>>>>>>>     "pool_stat_ops": [],
>>>>>>>>>>>>>>>>>     "statfs_ops": [],
>>>>>>>>>>>>>>>>>     "command_ops": []
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> OSD Logfile of OSD 23 attached.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Am 17.05.2017 um 16:26 schrieb Jason Dillaman:
>>>>>>>>>>>>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>>>> You mean the request no matter if it is successful or not? Which log
>>>>>>>>>>>>>>>>>>> level should be set to 20?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>>>>>>>>>>>>>>>> increased -- "debug osd = 20" would be nice if you can turn it up that
>>>>>>>>>>>>>>>>>> high while attempting to capture the blocked op.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
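For reference, a backtrace like the attached one can be captured roughly as
follows while the scrub is stuck (debug symbols installed). The pid lookup and
the output file name are only illustrative -- the file name simply mirrors the
attachment -- and with several OSDs on one host you would pick the right pid
by hand instead of using pidof. Note that the OSD is paused for as long as gdb
stays attached:

# gdb -p $(pidof ceph-osd)
(gdb) set pagination off
(gdb) set logging file threadallbt_osd_116.txt
(gdb) set logging on
(gdb) thread apply all bt
(gdb) generate-core-file
(gdb) detach
(gdb) quit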
Attachment:
threadallbt_osd_116.txt.gz
Description: application/gzip
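As a side note on the "debug osd = 20" suggestion in the quoted thread: the
level can also be raised on a running OSD without restarting it, for example
via injectargs. osd.23 below is just the OSD from the quoted example, the log
path is the usual default, and the restore values assume the stock defaults
(they may differ per release):

# ceph tell osd.23 injectargs '--debug-osd 20'
  ... reproduce / wait for the blocked op, then grab /var/log/ceph/ceph-osd.23.log ...
# ceph tell osd.23 injectargs '--debug-osd 1/5'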
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com