The cache tiering has nothing to do with the PG of the underlying pool
being incomplete. You are just seeing these requests as stuck because
the cache tier is the only thing trying to write to the underlying pool.

What you need to fix is the PG showing incomplete. I assume you already
tried reducing the min_size to 4 as suggested? Or did you by chance
always run with min_size 4 on the EC pool, which is a common cause of
problems like this?

Can you share the output of "ceph osd pool ls detail"? Also, which
version of Ceph are you running?

Paul

On Fri, 21 Sep 2018 at 19:28, Olivier Bonvalet <ceph.list@xxxxxxxxx> wrote:
>
> So I've totally disabled cache-tiering and overlay. Now OSDs 68 & 69
> are fine, no longer blocked.
>
> But OSD 32 is still blocked, and PG 37.9c is still marked incomplete
> with:
>
>     "recovery_state": [
>         {
>             "name": "Started/Primary/Peering/Incomplete",
>             "enter_time": "2018-09-21 18:56:01.222970",
>             "comment": "not enough complete instances of this PG"
>         },
>
> But I don't see blocked requests in the OSD 32 logs; should I increase
> one of the "debug_xx" flags?
>
>
> On Friday, 21 September 2018 at 16:51 +0200, Maks Kowalik wrote:
> > According to the query output you pasted, shards 1 and 2 are broken.
> > But, on the other hand, an EC profile (4+2) should make it possible
> > to recover from 2 shards lost simultaneously...
> >
> > On Fri, 21 Sep 2018 at 16:29, Olivier Bonvalet <ceph.list@xxxxxxxxx>
> > wrote:
> > > Well, on the drives I can find those parts:
> > >
> > > - cs0 on OSD 29 and 30
> > > - cs1 on OSD 18 and 19
> > > - cs2 on OSD 13
> > > - cs3 on OSD 66
> > > - cs4 on OSD 0
> > > - cs5 on OSD 75
> > >
> > > And I can read those files too.
> > >
> > > And all those OSDs are UP and IN.
> > >
> > >
> > > On Friday, 21 September 2018 at 13:10 +0000, Eugen Block wrote:
> > > > > > I tried to flush the cache with "rados -p cache-bkp-foo
> > > > > > cache-flush-evict-all", but it blocks on the object
> > > > > > "rbd_data.f66c92ae8944a.00000000000f2596".
> > > >
> > > > This is the object that's stuck in the cache tier (according to
> > > > your output in https://pastebin.com/zrwu5X0w). Can you verify
> > > > whether that block device is in use and healthy, or whether it
> > > > is corrupt?
> > > >
> > > >
> > > > Quoting Maks Kowalik <maks_kowalik@xxxxxxxxx>:
> > > >
> > > > > Could you please paste the output of "pg 37.9c query"?
> > > > >
> > > > > On Fri, 21 Sep 2018 at 14:39, Olivier Bonvalet
> > > > > <ceph.list@xxxxxxxxx> wrote:
> > > > >
> > > > > > In fact, one object (only one) seems to be blocked in the
> > > > > > cache tier (writeback).
> > > > > >
> > > > > > I tried to flush the cache with "rados -p cache-bkp-foo
> > > > > > cache-flush-evict-all", but it blocks on the object
> > > > > > "rbd_data.f66c92ae8944a.00000000000f2596".
> > > > > >
> > > > > > So I reduced the cache tier (a lot) to 200MB; "rados -p
> > > > > > cache-bkp-foo ls" now shows only 3 objects:
> > > > > >
> > > > > > rbd_directory
> > > > > > rbd_data.f66c92ae8944a.00000000000f2596
> > > > > > rbd_header.f66c92ae8944a
> > > > > >
> > > > > > And "cache-flush-evict-all" still hangs.
> > > > > >
> > > > > > I also switched the cache tier to "readproxy", to avoid using
> > > > > > this cache. But it's still blocked.
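
For reference, here is a rough sketch of the cache-tier and debugging
commands being discussed; the pool and OSD names (cache-bkp-foo, osd.32,
PG 37.9c) are taken from this thread, the debug levels are only a
suggestion, and exact syntax may differ between Ceph releases:

    # list what is still sitting in the cache pool
    rados -p cache-bkp-foo ls

    # try to flush and evict everything from the cache tier
    rados -p cache-bkp-foo cache-flush-evict-all

    # switch the cache tier mode to readproxy, as described above
    ceph osd tier cache-mode cache-bkp-foo readproxy

    # temporarily raise the debug level on the primary OSD to see why
    # requests are stuck (remember to lower it again afterwards)
    ceph tell osd.32 injectargs '--debug_osd 20 --debug_ms 1'

    # re-check the peering state of the incomplete PG
    ceph pg 37.9c query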
> > > > > >
> > > > > > On Friday, 21 September 2018 at 02:14 +0200, Olivier Bonvalet
> > > > > > wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > On a Luminous cluster, I have an incomplete PG and I can't
> > > > > > > find how to fix it.
> > > > > > >
> > > > > > > It's an EC pool (4+2):
> > > > > > >
> > > > > > >     pg 37.9c is incomplete, acting [32,50,59,1,0,75]
> > > > > > >     (reducing pool bkp-sb-raid6 min_size from 4 may help;
> > > > > > >     search ceph.com/docs for 'incomplete')
> > > > > > >
> > > > > > > Of course, we can't reduce min_size from 4.
> > > > > > >
> > > > > > > And the full state: https://pastebin.com/zrwu5X0w
> > > > > > >
> > > > > > > So I/O is blocked and we can't access the damaged data.
> > > > > > > The OSDs are blocked too:
> > > > > > >
> > > > > > >     osds 32,68,69 have stuck requests > 4194.3 sec
> > > > > > >
> > > > > > > OSD 32 is the primary of this PG.
> > > > > > > And OSD 68 and 69 are for cache tiering.
> > > > > > >
> > > > > > > Any idea how I can fix that?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Olivier

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
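
For completeness, a rough sketch of how the information asked for at the
top of this message could be gathered; the pool and PG names are taken
from this thread, and <profile> is a placeholder for whatever
erasure-code profile "ceph osd pool ls detail" reports for bkp-sb-raid6:

    # pool settings, including min_size and the erasure-code profile in use
    ceph osd pool ls detail

    # Ceph release running on the mons, OSDs and other daemons
    ceph versions

    # k/m layout and failure domain of the EC profile
    ceph osd erasure-code-profile get <profile>

    # full peering state and history of the incomplete PG
    ceph pg 37.9c query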