I think those osds (1, 11, 21, 32, ...) need a little kick to re-peer
their degraded PGs.

Open a window with `watch ceph -s`, then in another window slowly do

   ceph osd down 1
   # then wait a minute or so for that osd.1 to re-peer fully.
   ceph osd down 11
   ...

Continue that for each of the osds with stuck requests, or until there
are no more recovery_wait/degraded PGs.

After each `ceph osd down ...`, you should expect to see several PGs
re-peer, and then ideally the slow requests will disappear and the
degraded PGs will become active+clean. If anything else happens, you
should stop and let us know.

-- dan

On Thu, May 23, 2019 at 10:59 AM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
>
> This is the current status of ceph:
>
>   cluster:
>     id:     23e72372-0d44-4cad-b24f-3641b14b86f4
>     health: HEALTH_ERR
>             9/125481144 objects unfound (0.000%)
>             Degraded data redundancy: 9/497011417 objects degraded
>             (0.000%), 7 pgs degraded
>             9 stuck requests are blocked > 4096 sec. Implicated osds
>             1,11,21,32,43,50,65
>
>   services:
>     mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
>     mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
>     mds: cephfs-1/1/1 up {0=ceph-node03.etp.kit.edu=up:active}, 3 up:standby
>     osd: 96 osds: 96 up, 96 in
>
>   data:
>     pools:   2 pools, 4096 pgs
>     objects: 125.48M objects, 259TiB
>     usage:   370TiB used, 154TiB / 524TiB avail
>     pgs:     9/497011417 objects degraded (0.000%)
>              9/125481144 objects unfound (0.000%)
>              4078 active+clean
>              11   active+clean+scrubbing+deep
>              7    active+recovery_wait+degraded
>
>   io:
>     client:   211KiB/s rd, 46.0KiB/s wr, 158op/s rd, 0op/s wr
>
> On 23.05.19 10:54 AM, Dan van der Ster wrote:
> > What's the full ceph status?
> > Normally recovery_wait just means that the relevant osds are busy
> > recovering/backfilling another PG.
> >
> > On Thu, May 23, 2019 at 10:53 AM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
> >> Hi,
> >>
> >> we have set the PGs to recover and now they are stuck in
> >> active+recovery_wait+degraded, and instructing them to deep-scrub does
> >> not change anything. Hence, the rados report is empty. Is there a way
> >> to stop the recovery wait so we can start the deep-scrub and get the
> >> output? I guess the recovery_wait might be caused by missing objects.
> >> Do we need to delete them first to get the recovery going?
> >>
> >> Kevin
> >>
> >> On 22.05.19 6:03 PM, Robert LeBlanc wrote:
> >>
> >> On Wed, May 22, 2019 at 4:31 AM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
> >>> Hi,
> >>>
> >>> thank you, it worked. The PGs are not incomplete anymore. Still we
> >>> have another problem: there are 7 PGs inconsistent and a ceph pg
> >>> repair is not doing anything. I just get "instructing pg 1.5dd on
> >>> osd.24 to repair" and nothing happens. Does somebody know how we can
> >>> get the PGs to repair?
> >>>
> >>> Regards,
> >>>
> >>> Kevin
> >>
> >> Kevin,
> >>
> >> I just fixed an inconsistent PG yesterday. You will need to figure out
> >> why they are inconsistent. Do these steps and then we can figure out
> >> how to proceed.
> >>
> >> 1. Do a deep-scrub on each PG that is inconsistent. (This may fix some
> >>    of them.)
> >> 2. Print out the inconsistent report for each inconsistent PG:
> >>    `rados list-inconsistent-obj <PG_NUM> --format=json-pretty`
> >> 3. You will want to look at the error messages and see if all the
> >>    shards have the same data.
> >>
> >> Robert LeBlanc

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
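
For reference, a minimal bash sketch of the per-osd kick Dan describes
above, plus the inconsistency-report commands from Robert's list. The
osd ids come from the "Implicated osds" line in the status output and
the 60-second pause is an assumption standing in for "wait a minute or
so"; Dan's actual advice is to run the commands by hand while watching
`ceph -s` in another window and to stop if anything unexpected happens.

   # Sketch only: mark each implicated osd down and let it re-peer
   # before moving on to the next one.
   for osd in 1 11 21 32 43 50 65; do
       ceph osd down "$osd"   # the osd rejoins immediately and re-peers its PGs
       sleep 60               # rough stand-in for "wait a minute or so"
   done

   # Steps 1 and 2 from Robert's list, for one inconsistent PG
   # (1.5dd is the example PG mentioned earlier in the thread):
   ceph pg deep-scrub 1.5dd
   rados list-inconsistent-obj 1.5dd --format=json-pretty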