Hi,
Backups will be challenging. I honestly didn't anticipate that this kind of failure would be possible with Ceph; we've been using it for several years now and were encouraged by the orchestrator and performance improvements in the 17 code branch.
That's exactly what a backup is for: to be prepared for the unexpected. Besides the fact that Ceph didn't actually fail (you removed too many OSDs too early), you can't expect bug-free software, no matter how long it has been running successfully.
- Identifying the pools / images / files that are affected by incomplete PGs;
The PGs start with a number which reflects the pool ID in your cluster; check the output of 'ceph osd pool ls detail' (e.g. PG 2.35 belongs to pool 2). There's no easy way to tell which images or files are affected. You can query each OSD and list a PG's objects, but that doesn't work for missing OSDs/PGs, of course. I'm not sure how promising it is, but maybe try a for loop over all rbd images and just execute 'rbd info <pool>/<image>' for each image; maybe it will tell you which images are incomplete.
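A minimal sketch of such a loop could look like the following (the pool name 'rbd_pool' is only a placeholder for your own pools, and treating a hanging or failing 'rbd info' as a hint that the image touches an incomplete PG is an assumption; the 30 second timeout is an arbitrary choice):

pool=rbd_pool   # placeholder, repeat this for each of your RBD pools
for img in $(rbd ls "$pool"); do
    # if the image's header/metadata sits in an incomplete PG, 'rbd info' tends to hang or error out
    if ! timeout 30 rbd info "$pool/$img" >/dev/null 2>&1; then
        echo "possibly affected: $pool/$img"
    fi
done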
- Extracting and reconstructing data for RBD images (these images are XFS formatted filesystems);
- Extracting and reconstructing data for CephFS files not affected by incomplete PGs.
If you kept the disks you removed too early (and didn't wipe them), there may be a chance to export the PG chunks with ceph-objectstore-tool [2]. I haven't used that myself in a production cluster, so be careful and get familiar with the commands in a test environment first. If you already wiped the temporary OSDs I don't see a chance to recover from this.
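As a rough sketch only (the data path, the OSD IDs and the PG/shard ID below are placeholders, the OSD daemons involved must be stopped while the tool runs, and on a cephadm-managed cluster you would typically run this from within the OSD's container, e.g. via 'cephadm shell --name osd.<id>'):

# on the node still holding the old, un-wiped OSD (daemon stopped):
# list the objects of one EC shard, then export it to a file
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-70 --pgid 2.35s0 --op list
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-70 --pgid 2.35s0 --op export --file /mnt/backup/pg-2.35s0.export

# on a current OSD that is supposed to hold that shard (daemon stopped):
# import the exported file, then start the OSD again and let peering/recovery run
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-44 --op import --file /mnt/backup/pg-2.35s0.export

Note that for EC pools you have to address the individual shard (the sXX suffix in the PG ID), and whether the import helps depends on how many shards of each PG you can still recover.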
Regards,
Eugen

[2] https://docs.ceph.com/en/pacific/man/8/ceph-objectstore-tool/

Zitat von Deep Dish <deeepdish@xxxxxxxxx>:
Thanks for the insight Eugen. Here's what basically happened:

- Upgrade from Nautilus to Quincy via migration to a new cluster on temp hardware;
- Data from Nautilus migrated successfully to older / lab-type equipment running Quincy;
- Nautilus hardware rebuilt for Quincy, data migrated back;
- As data was migrating we set the older nodes to maintenance mode and started to drain them;
- After several days many OSDs were showing as spinning in "deleting" status on the portal and were marked OUT;
- At this point we made the incorrect assumption that those OSDs were no longer required and proceeded to remove those nodes / OSDs.

I understand incomplete PGs are basically lost, and it's likely a lengthy task to attempt to salvage data. Backups will be challenging. I honestly didn't anticipate that this kind of failure would be possible with Ceph; we've been using it for several years now and were encouraged by the orchestrator and performance improvements in the 17 code branch.

The fact is, of the incomplete PGs that have object counts > 0, there's about 644 GB of data tied up in this mess. There are other incomplete PGs with objects = 0 which I understand can be manually marked as complete. The cluster has a data usage of 61 TiB. Of this I can categorize about 14 TB as critical data and 40 TB as data of medium / high importance. There's 14 TB in RBD images on an EC pool that would be critical; there are other images, however, of lower importance at this point. There's also about a 20 TB CephFS file system of lower data importance as well.

Question - Can you kindly point me to procedures for:

- Identifying the pools / images / files that are affected by incomplete PGs;
- Extracting and reconstructing data for RBD images (these images are XFS formatted filesystems);
- Extracting and reconstructing data for CephFS files not affected by incomplete PGs.

Much appreciated.

------------------------------

Date: Mon, 09 Jan 2023 10:12:49 +0000
From: Eugen Block <eblock@xxxxxx>
Subject: Re: Serious cluster issue - Incomplete PGs
To: ceph-users@xxxxxxx

Hi,

can you clarify what exactly you did to get into this situation? What about the undersized PGs, any chance to bring those OSDs back online? Regarding the incomplete PGs I'm not sure there's much you can do if the OSDs are lost. To me it reads like you may have destroyed/recreated more OSDs than you should have; just recreating OSDs with the same IDs is not sufficient if you destroyed too many chunks. Each OSD only contains a chunk of the PG due to the erasure coding. I'm afraid those objects are lost and you would have to restore from backup. To get the cluster into a healthy state again there are a couple of threads, e.g. [1], but recovering the lost chunks from ceph will probably not work.

Regards,
Eugen

[1] https://www.mail-archive.com/ceph-users@xxxxxxx/msg14757.html

Zitat von Deep Dish <deeepdish@xxxxxxxxx>:

Hello. I really screwed up my ceph cluster. Hoping to get data off it so I can rebuild it. In summary, too many changes too quickly caused the cluster to develop incomplete PGs. Some PGs were reporting that OSDs were to be probed. I've created those OSD IDs (empty), however this wouldn't clear the incompletes. The incompletes are part of EC pools. Running 17.2.5.
This is the overall state:

  cluster:
    id:     49057622-69fc-11ed-b46e-d5acdedaae33
    health: HEALTH_WARN
            Failed to apply 1 service(s): osd.dashboard-admin-1669078094056
            1 hosts fail cephadm check
            cephadm background work is paused
            Reduced data availability: 28 pgs inactive, 28 pgs incomplete
            Degraded data redundancy: 55 pgs undersized
            2 slow ops, oldest one blocked for 4449 sec, daemons [osd.25,osd.50,osd.51] have slow ops.

These are PGs that are incomplete that HAVE DATA (objects > 0) [via ceph pg ls incomplete]:

2.35  23199 0 0 0 95980273664 0 0 2477 incomplete 10s 2104'46277 28260:686871 [44,4,37,3,40,32]p44 [44,4,37,3,40,32]p44 2023-01-03T03:54:47.821280+0000 2022-12-29T18:53:09.287203+0000 14 queued for deep scrub
2.53  22821 0 0 0 94401175552 0 0 2745 remapped+incomplete 10s 2104'45845 28260:565267 [60,48,52,65,67,7]p60 [60]p60 2023-01-03T10:18:13.388383+0000 2023-01-03T10:18:13.388383+0000 408 queued for scrub
2.9f  22858 0 0 0 94555983872 0 0 2736 remapped+incomplete 10s 2104'45636 28260:759872 [56,59,3,57,5,32]p56 [56]p56 2023-01-03T10:55:49.848693+0000 2023-01-03T10:55:49.848693+0000 376 queued for scrub
2.be  22870 0 0 0 94429110272 0 0 2661 remapped+incomplete 10s 2104'45561 28260:813759 [41,31,37,9,7,69]p41 [41]p41 2023-01-03T14:02:15.790077+0000 2023-01-03T14:02:15.790077+0000 360 queued for scrub
2.e4  22953 0 0 0 94912278528 0 0 2648 remapped+incomplete 20m 2104'46048 28259:732896 [37,46,33,4,48,49]p37 [37]p37 2023-01-02T18:38:46.268723+0000 2022-12-29T18:05:47.431468+0000 18 queued for deep scrub
17.78 20169 0 0 0 84517834400 0 0 2198 remapped+incomplete 10s 3735'53405 28260:1243673 [4,37,2,36,66,0]p4 [41]p41 2023-01-03T14:21:41.563424+0000 2023-01-03T14:21:41.563424+0000 348 queued for scrub
17.d8 20328 0 0 0 85196053130 0 0 1852 remapped+incomplete 10s 3735'54458 28260:1309564 [38,65,61,37,58,39]p38 [53]p53 2023-01-02T18:32:35.371071+0000 2022-12-28T19:08:29.492244+0000 21 queued for deep scrub

At present I'm unable to reliably access my data due to the incomplete PGs above. I'll post whatever outputs are requested (won't post them now as they can be rather verbose). Is there hope?