Serious cluster issue - Incomplete PGs

Deep Dish <deeepdish@xxxxxxxxx> · Sun, 8 Jan 2023 19:32:20 -0500

Hello. I really screwed up my Ceph cluster, and I'm hoping to get the data
off it so I can rebuild it.

In summary, too many changes made too quickly caused the cluster to develop
incomplete PGs. Some PGs were reporting OSDs that still needed to be probed.
I've created those OSD IDs (empty), but this didn't clear the incompletes.
The incomplete PGs all belong to EC pools. Running 17.2.5.
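
For anyone wanting to look at the same probe information, here is a minimal sketch (not a definitive tool) of how it could be pulled out of `ceph pg <pgid> query`. It assumes the query output is JSON with a `recovery_state` list whose entries may carry `probing_osds`, `down_osds_we_would_probe` and `peering_blocked_by`. The helper names `query_pg` and `peering_summary` are hypothetical, made up for this example.

```python
import json
import subprocess


def query_pg(pgid):
    """Run `ceph pg <pgid> query` and return the parsed JSON.

    Needs a working ceph CLI and keyring on the host where it runs.
    """
    out = subprocess.run(
        ["ceph", "pg", pgid, "query"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def peering_summary(pg_query):
    """Collect peering-related fields from a parsed pg query document.

    Assumption: the keys below appear inside entries of "recovery_state".
    Missing keys are treated as empty lists.
    """
    summary = {
        "state": pg_query.get("state"),
        "probing_osds": [],
        "down_osds_we_would_probe": [],
        "peering_blocked_by": [],
    }
    keys = ("probing_osds", "down_osds_we_would_probe", "peering_blocked_by")
    for entry in pg_query.get("recovery_state", []):
        for key in keys:
            summary[key].extend(entry.get(key, []))
    return summary
```

`peering_summary` also works on a saved copy of the query output, e.g. `peering_summary(json.load(open("pg-2.35.json")))`, so it does not need access to the cluster itself.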

This is the overall state:

  cluster:
    id:     49057622-69fc-11ed-b46e-d5acdedaae33
    health: HEALTH_WARN
            Failed to apply 1 service(s): osd.dashboard-admin-1669078094056
            1 hosts fail cephadm check
            cephadm background work is paused
            Reduced data availability: 28 pgs inactive, 28 pgs incomplete
            Degraded data redundancy: 55 pgs undersized
            2 slow ops, oldest one blocked for 4449 sec, daemons [osd.25,osd.50,osd.51] have slow ops.

These are the incomplete PGs that HAVE DATA (Objects > 0) [via ceph pg ls
incomplete]:

2.35   23199  0  0  0  95980273664  0  0  2477  incomplete           10s  2104'46277  28260:686871   [44,4,37,3,40,32]p44    [44,4,37,3,40,32]p44  2023-01-03T03:54:47.821280+0000  2022-12-29T18:53:09.287203+0000  14   queued for deep scrub
2.53   22821  0  0  0  94401175552  0  0  2745  remapped+incomplete  10s  2104'45845  28260:565267   [60,48,52,65,67,7]p60   [60]p60               2023-01-03T10:18:13.388383+0000  2023-01-03T10:18:13.388383+0000  408  queued for scrub
2.9f   22858  0  0  0  94555983872  0  0  2736  remapped+incomplete  10s  2104'45636  28260:759872   [56,59,3,57,5,32]p56    [56]p56               2023-01-03T10:55:49.848693+0000  2023-01-03T10:55:49.848693+0000  376  queued for scrub
2.be   22870  0  0  0  94429110272  0  0  2661  remapped+incomplete  10s  2104'45561  28260:813759   [41,31,37,9,7,69]p41    [41]p41               2023-01-03T14:02:15.790077+0000  2023-01-03T14:02:15.790077+0000  360  queued for scrub
2.e4   22953  0  0  0  94912278528  0  0  2648  remapped+incomplete  20m  2104'46048  28259:732896   [37,46,33,4,48,49]p37   [37]p37               2023-01-02T18:38:46.268723+0000  2022-12-29T18:05:47.431468+0000  18   queued for deep scrub
17.78  20169  0  0  0  84517834400  0  0  2198  remapped+incomplete  10s  3735'53405  28260:1243673  [4,37,2,36,66,0]p4      [41]p41               2023-01-03T14:21:41.563424+0000  2023-01-03T14:21:41.563424+0000  348  queued for scrub
17.d8  20328  0  0  0  85196053130  0  0  1852  remapped+incomplete  10s  3735'54458  28260:1309564  [38,65,61,37,58,39]p38  [53]p53               2023-01-02T18:32:35.371071+0000  2022-12-28T19:08:29.492244+0000  21   queued for deep scrub
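
In case it helps anyone reading the listing above, here is a rough sketch of how its rows could be parsed and filtered for PGs that hold objects. The column names in `COLUMNS` are labels I am assuming from the order of the fields, and `parse_row` / `parse_osd_set` are hypothetical helper names. Each row is expected with all its fields joined on one line; the two sample rows are copied from the listing.

```python
import re

# Assumed labels for the whitespace-separated fields, in listing order.
# Whatever follows the last label is kept as free-text scrub scheduling.
COLUMNS = [
    "pg", "objects", "degraded", "misplaced", "unfound", "bytes",
    "omap_bytes", "omap_keys", "log", "state", "since", "version",
    "reported", "up", "acting", "scrub_stamp", "deep_scrub_stamp",
    "last_scrub_duration",
]
INT_COLUMNS = ("objects", "degraded", "misplaced", "unfound", "bytes",
               "omap_bytes", "omap_keys", "log")

# Matches an OSD set such as "[44,4,37,3,40,32]p44".
OSD_SET = re.compile(r"\[([\d,]*)\]p(-?\d+)")


def parse_osd_set(field):
    """Split "[a,b,c]pN" into ([a, b, c], N)."""
    match = OSD_SET.fullmatch(field)
    if match is None:
        raise ValueError(f"not an OSD set: {field!r}")
    osds = [int(x) for x in match.group(1).split(",") if x]
    return osds, int(match.group(2))


def parse_row(line):
    """Parse one joined listing row into a dict keyed by COLUMNS."""
    tokens = line.split()
    if len(tokens) < len(COLUMNS):
        raise ValueError(f"too few fields: {line!r}")
    row = dict(zip(COLUMNS, tokens))
    for key in INT_COLUMNS:
        row[key] = int(row[key])
    row["up"], row["up_primary"] = parse_osd_set(row["up"])
    row["acting"], row["acting_primary"] = parse_osd_set(row["acting"])
    row["scrub_scheduling"] = " ".join(tokens[len(COLUMNS):])
    return row


ROWS = [
    "2.35 23199 0 0 0 95980273664 0 0 2477 incomplete 10s 2104'46277 "
    "28260:686871 [44,4,37,3,40,32]p44 [44,4,37,3,40,32]p44 "
    "2023-01-03T03:54:47.821280+0000 2022-12-29T18:53:09.287203+0000 "
    "14 queued for deep scrub",
    "2.53 22821 0 0 0 94401175552 0 0 2745 remapped+incomplete 10s "
    "2104'45845 28260:565267 [60,48,52,65,67,7]p60 [60]p60 "
    "2023-01-03T10:18:13.388383+0000 2023-01-03T10:18:13.388383+0000 "
    "408 queued for scrub",
]

for row in map(parse_row, ROWS):
    if row["objects"] > 0:
        print(row["pg"], row["state"],
              f"{row['bytes'] / 2**30:.1f} GiB",
              f"acting {len(row['acting'])} of {len(row['up'])} up OSDs")
```

Comparing the sizes of the acting and up sets per PG makes it easy to spot the PGs whose acting set has shrunk to a single OSD.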

At present I'm unable to reliably access my data due to the incomplete PGs
above. I'll post whatever outputs are requested (I won't post them now, as
they can be rather verbose). Is there hope?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx