Re: PG stuck incomplete

Maks Kowalik <maks_kowalik@xxxxxxxxx> · Fri, 21 Sep 2018 16:51:48 +0200

According to the query output you pasted, shards 1 and 2 are broken.
On the other hand, the EC profile (4+2) should make it possible to recover from 2 shards lost simultaneously...
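
To make the arithmetic behind that remark explicit, here is a minimal sketch, assuming the usual k+m erasure-coding semantics (any k surviving shards are enough to rebuild the data). The function name `can_recover` is made up for illustration and is not part of Ceph. It covers only the shard count and says nothing about why peering reports the PG as incomplete.

```python
def can_recover(k: int, m: int, lost_shards: set) -> bool:
    """Return True if at least k of the k+m shards survive.

    Assumption: shards are numbered 0 .. k+m-1 and any k of them
    are sufficient to reconstruct the object.
    """
    total = k + m
    # Ignore shard ids that do not exist in this profile.
    lost = {s for s in lost_shards if 0 <= s < total}
    return total - len(lost) >= k


K, M = 4, 2          # EC profile from the thread (4+2)
broken = {1, 2}      # shards reported as broken in the pg query

print(can_recover(K, M, broken))       # two shards lost
print(can_recover(K, M, {1, 2, 3}))    # three shards lost
```

With two shards lost, four remain, which matches k; a third lost shard would drop below k.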

On Fri, 21 Sep 2018 at 16:29, Olivier Bonvalet <ceph.list@xxxxxxxxx> wrote:
Well, on disk I can find these parts:

- cs0 on OSD 29 and 30
- cs1 on OSD 18 and 19
- cs2 on OSD 13
- cs3 on OSD 66
- cs4 on OSD 0
- cs5 on OSD 75

And I can read those files too.

And all those OSDs are UP and IN.
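
The locations above can be compared with the acting set reported for pg 37.9c further down the thread, [32,50,59,1,0,75]. The following is only a sketch under two assumptions: that "csN" refers to shard N, and that position N of the acting set is the OSD expected to hold shard N. The helper `misplaced_shards` is hypothetical, not a Ceph tool.

```python
# Acting set of pg 37.9c, as reported in the health output in this thread.
acting = [32, 50, 59, 1, 0, 75]

# OSDs on which each part was found on disk (assumption: csN == shard N).
found_on_disk = {
    0: [29, 30],
    1: [18, 19],
    2: [13],
    3: [66],
    4: [0],
    5: [75],
}


def misplaced_shards(acting, found):
    """Map shard -> (acting OSD, OSDs holding data) where they disagree."""
    result = {}
    for shard, osd in enumerate(acting):
        holders = found.get(shard, [])
        if osd not in holders:
            result[shard] = (osd, holders)
    return result


for shard, (expected, holders) in misplaced_shards(acting, found_on_disk).items():
    print(f"shard {shard}: acting OSD {expected}, data seen on OSD(s) {holders}")
```

Under those assumptions, shards 0 to 3 are reported as sitting on OSDs other than the acting ones, while shards 4 and 5 are on their acting OSDs (0 and 75).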

On Friday, 21 September 2018 at 13:10 +0000, Eugen Block wrote:

> > > I tried to flush the cache with "rados -p cache-bkp-foo
> > > cache-flush-evict-all", but it blocks on the object
> > > "rbd_data.f66c92ae8944a.00000000000f2596".

> 

> This is the object that's stuck in the cache tier (according to your
> output in https://pastebin.com/zrwu5X0w). Can you verify whether that
> block device is in use and healthy, or is it corrupt?

> 

> 

> Quoting Maks Kowalik <maks_kowalik@xxxxxxxxx>:

> 

> > Could you please paste the output of "pg 37.9c query"?

> > 

> > On Fri, 21 Sep 2018 at 14:39, Olivier Bonvalet <ceph.list@xxxxxxxxx> wrote:

> > 

> > > In fact, one object (only one) seems to be blocked on the cache tier
> > > (writeback).

> > > 

> > > I tried to flush the cache with "rados -p cache-bkp-foo
> > > cache-flush-evict-all", but it blocks on the object
> > > "rbd_data.f66c92ae8944a.00000000000f2596".

> > > 

> > > So I reduced the cache tier (a lot), to 200MB; "rados -p cache-bkp-foo
> > > ls" now shows only 3 objects:

> > > 

> > >     rbd_directory

> > >     rbd_data.f66c92ae8944a.00000000000f2596

> > >     rbd_header.f66c92ae8944a

> > > 

> > > And "cache-flush-evict-all" still hangs.

> > > 

> > > I also switched the cache tier to "readproxy", to avoid using this
> > > cache. But it's still blocked.

> > > 

> > > 

> > > 

> > > 

> > > On Friday, 21 September 2018 at 02:14 +0200, Olivier Bonvalet wrote:

> > > > Hello,

> > > > 

> > > > On a Luminous cluster, I have an incomplete PG and I can't find
> > > > how to fix it.

> > > > 

> > > > It's an EC pool (4+2) :

> > > > 

> > > >     pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing pool bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for 'incomplete')

> > > > 

> > > > Of course, we can't reduce min_size from 4.

> > > > 

> > > > And the full state : https://pastebin.com/zrwu5X0w

> > > > 

> > > > So I/O is blocked; we can't access the damaged data.
> > > > OSDs are blocked too:

> > > >     osds 32,68,69 have stuck requests > 4194.3 sec

> > > > 

> > > > OSD 32 is the primary of this PG.

> > > > And OSD 68 and 69 are for cache tiering.

> > > > 

> > > > Any idea how I can fix that?

> > > > 

> > > > Thanks,

> > > > 

> > > > Olivier

> > > > 

> > > > 

> > > > _______________________________________________

> > > > ceph-users mailing list

> > > > ceph-users@xxxxxxxxxxxxxx

> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

> > > > 

> > > 


> > > 

> 

> 

> 


