Re: Serious cluster issue - Incomplete PGs

Eugen,

I never insinuated that my circumstances are the result of buggy software, and
I have acknowledged our operational missteps.   Let's please leave it there.
Ceph remains a technology I like and will continue to use.   Our operational
understanding has evolved greatly as a result of the current circumstances.

The removed OSDs are gone and not recoverable (i.e. lockbox keys gone, VGs
removed).

The objective of this post is to validate my understanding of an alternate
recovery scenario (recovering the available, not the complete, data):

1. The cluster has blocked IO due to incomplete PGs.   Therefore any online
operations on affected pools / images / filesystems are blocked.

# ceph -s

  cluster:
    id:
    health: HEALTH_WARN
            1 hosts fail cephadm check
            cephadm background work is paused
            Reduced data availability: 28 pgs inactive, 28 pgs incomplete
            5 pgs not deep-scrubbed in time
            3 slow ops, oldest one blocked for 347227 sec, daemons [osd.25,osd.50,osd.51] have slow ops.

  services:
    mon: 5 daemons, quorum  (age 8h)
    mgr: (active, since 27m)
    mds: 2/2 daemons up, 3 standby
    osd: 70 osds: 70 up (since 3d), 45 in (since 3d); 24 remapped pgs

  data:
    volumes: 2/2 healthy
    pools:   9 pools, 1056 pgs
    objects: 10.64M objects, 40 TiB
    usage:   61 TiB used, 266 TiB / 327 TiB avail
    pgs:     2.652% pgs not active
             1027 active+clean
             24   remapped+incomplete
             4    incomplete
             1    active+clean+scrubbing+deep

2. Since the PGs are incomplete and their supporting data is lost, I found a
documented process that will mark the PGs as complete and unblock IO for the
cluster.   I fully understand that marking PGs that hold 0 objects as
complete will have no impact on data integrity; however, marking those PGs
that contain objects will result in complete data loss, limited to those
affected PGs.
Link:
https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1


Based on the above-referenced link, commands to this effect would mark
incomplete PGs as complete (examples):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op mark-complete --pgid 2.50
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op mark-complete --pgid 2.57
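
One wrinkle I'd like to confirm: as I understand it, ceph-objectstore-tool
must be run with the OSD daemon stopped, and on a cephadm / containerized
deployment like ours that would mean entering the daemon's container.  A
rough sketch of what I assume the sequence would be (the OSD id and PG id
are illustrative; run against the PG's acting primary, per the article):

ceph orch daemon stop osd.2
cephadm shell --name osd.2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op mark-complete --pgid 2.50
exit
ceph orch daemon start osd.2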

3. My cluster, at present, has a total of 28 incomplete PGs.   Of these, 7
reference approximately 644 GB of now lost / irrecoverable data; the rest
reference 0 objects and 0 bytes (empty).   The cluster holds a total of
61.3T of data, leaving ~60.8T available for recovery.
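
For reference, this is roughly how I'd split the empty from the non-empty
incomplete PGs; I'm assuming the Quincy JSON layout of 'ceph pg ls' here
(a pg_stats array with stat_sum.num_objects and stat_sum.num_bytes per PG):

ceph pg ls incomplete -f json | jq -r '.pg_stats[] | select(.stat_sum.num_objects == 0) | .pgid'
ceph pg ls incomplete -f json | jq -r '.pg_stats[] | select(.stat_sum.num_objects > 0) | "\(.pgid) \(.stat_sum.num_bytes)"'

The first command lists the empty PGs; the second lists the 7 that still
reference data, with their byte counts.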
4. If I were to mark ALL incomplete PGs as complete, the cluster would be
operable, meaning I could interact with pool images and surviving files on
the CephFS pools.
5. Although data loss may affect the contents of RBD images, these images
could still be mapped (rbd map) and made available for alternate recovery
methods, e.g. dd the contents to a separate volume for use at a recovery
facility, or attempt to read them with recovery tools that understand the
filesystem on those block devices (XFS in this case).   Lost data would be
equivalent to blocks of zeros in the overall image data stream wherever
data was lost.
6. The above could be successful in extracting the available / recoverable
data, as sketched below.
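
To make steps 5 and 6 concrete, a sketch of what I have in mind (the pool,
image, and target path are placeholders):

rbd map mypool/myimage
dd if=/dev/rbd/mypool/myimage of=/mnt/recovery/myimage.raw bs=4M conv=noerror,sync status=progress

conv=noerror,sync should carry dd past any unreadable regions, padding them
with zeros, which matches the expectation above.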
7. Upon marking the 2 incomplete PGs affecting the CephFS volume as
complete, CephFS would be accessible minus the affected files.   How would
these files be represented?  (Corrupted, or simply 0 bytes?)
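
One idea for inventorying the CephFS damage before marking anything
complete: the cephfs-data-scan tool appears to have a 'pg_files' mode that
lists files with data in specific data-pool PGs.  If I've read the docs
right, something like the following (assuming pool 17 is the CephFS data
pool here; the path is an example) would enumerate the affected files:

cephfs-data-scan pg_files /path/in/cephfs 17.78 17.d8

Can anyone confirm whether that works against incomplete PGs, or whether it
would also block?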

Thank you.

Date: Tue, 10 Jan 2023 08:15:31 +0000
From: Eugen Block <eblock@xxxxxx>
Subject:  Re: Serious cluster issue - Incomplete PGs
To: ceph-users@xxxxxxx
Message-ID:
        <20230110081531.Horde.NfeIXEvXkBYy6JFyMgYbpX2@xxxxxxxxxxxxxx>
Content-Type: text/plain; charset=utf-8; format=flowed; DelSp=Yes

Hi,

> Backups will be challenging.   I honestly didn't anticipate this kind of
> failure with ceph to be possible, we've been using it for several years now
> and were encouraged by orchestrator and performance improvements in the 17
> code branch.

that's exactly what a backup is for: to be prepared for the
unexpected. Besides the fact that ceph didn't actually fail (you
removed too many OSDs, too early), you can't expect bug-free software,
no matter how long it has been running successfully.

> - Identifying the pools / images / files that are affected by incomplete
> PGs;

Each PG ID starts with the number of the pool it belongs to; check the
output of 'ceph osd pool ls detail' to map numbers to pools. There's no
easy way to tell which images or files are affected. You can query each
OSD and list a PG's objects, but that doesn't work for missing OSDs/PGs,
of course. I'm not sure how promising it is, but maybe try a for loop
over all rbd images that executes 'rbd info <pool>/<image>' for each
image; maybe it will tell you which images are incomplete.
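
A sketch of such a loop (untested; the pool name is an example, and with
IO currently blocked the commands may hang rather than fail):

POOL=rbd
for img in $(rbd ls "$POOL"); do
  rbd info "$POOL/$img" >/dev/null || echo "possibly affected: $img"
done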

> - Extracting and reconstructing data for RBD images (these images are XFS
> formatted filesystems);
> - Extracting and reconstructing data for CephFS Files not affected by
> incomplete PGs.

If you kept the disks you removed too early (and didn't wipe them)
there may be a chance to export the PG chunks with
ceph-objectstore-tool [2]. I haven't used that myself in a production
cluster so be careful and get familiar with the commands in a test
environment first. If you already wiped the temporary OSDs I don't see
a chance to recover from this.
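
For the archives, the rough shape of that operation per [2] would be an
export from the surviving disk and an import on a current OSD, with both
OSDs stopped while the tool runs (the OSD ids, PG id, and file name are
examples only):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op export --pgid 2.53 --file /tmp/pg.2.53.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-60 --op import --file /tmp/pg.2.53.export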

Regards,
Eugen

[2] https://docs.ceph.com/en/pacific/man/8/ceph-objectstore-tool/

Quoting Deep Dish <deeepdish@xxxxxxxxx>:

> Thanks for the insight Eugen.
>
> Here's what basically happened:
>
> - Upgrade from Nautilus to Quincy via migration to a new cluster on temp
> hardware;
> - Data from Nautilus migrated successfully to older / lab-type equipment
> running Quincy;
> - Nautilus hardware rebuilt for Quincy, data migrated back;
> - As data was migrating we set the older nodes to maintenance mode and
> started to drain them;
> - After several days many OSDs were showing as spinning in "deleting"
> status on the portal and were marked OUT;
> - At this point we made the incorrect assumption that those OSDs were no
> longer required and proceeded to remove those nodes / OSDs.
>
> I understand incomplete PGs are basically lost.   And it's likely a
> lengthy task to attempt to salvage data.
>
> Backups will be challenging.   I honestly didn't anticipate this kind of
> failure with ceph to be possible, we've been using it for several years now
> and were encouraged by orchestrator and performance improvements in the 17
> code branch.
>
> The fact is, of the incomplete PGs that have object counts > 0, there's
> about 644 GB of data tied up in this mess.   There are other
> incomplete PGs with objects = 0, which I understand can be manually marked
> as complete.   The cluster has a data usage of 61 TiB.   Of this I can
> categorize about 14 TB as critical data and 40 TB as data of medium /
> high importance.
>
> There's 14 TB in RBD images on an EC pool that would be critical; there are
> other images, however of lower importance at this point;
>
> There's also a roughly 20 TB CephFS filesystem of lower data importance.
>
> Question - Can you kindly point me to procedures for:
>
> - Identifying the pools / images / files that are affected by incomplete
> PGs;
> - Extracting and reconstructing data for RBD images (these images are XFS
> formatted filesystems);
> - Extracting and reconstructing data for CephFS Files not affected by
> incomplete PGs.
>
> Much appreciated.
>
>
> ------------------------------
>
> Date: Mon, 09 Jan 2023 10:12:49 +0000
> From: Eugen Block <eblock@xxxxxx>
> Subject:  Re: Serious cluster issue - Incomplete PGs
> To: ceph-users@xxxxxxx
> Message-ID:
>         <20230109101249.Horde.hAHCWQijFMYLNdX8a2YQDVV@xxxxxxxxxxxxxx>
> Content-Type: text/plain; charset=utf-8; format=flowed; DelSp=Yes
>
> Hi,
>
> can you clarify what exactly you did to get into this situation? What
> about the undersized PGs, any chance to bring those OSDs back online?
> Regarding the incomplete PGs I'm not sure there's much you can do if
> the OSDs are lost. To me it reads like you may have
> destroyed/recreated more OSDs than you should have; just recreating
> OSDs with the same IDs is not sufficient if you destroyed too many
> chunks. Each OSD only contains a chunk of the PG due to the erasure
> coding. I'm afraid those objects are lost and you would have to
> restore from backup. To get the cluster into a healthy state again
> there are a couple of threads, e.g. [1], but recovering the lost chunks
> from ceph will probably not work.
>
> Regards,
> Eugen
>
> [1] https://www.mail-archive.com/ceph-users@xxxxxxx/msg14757.html
>
> Quoting Deep Dish <deeepdish@xxxxxxxxx>:
>
>> Hello.   I really screwed up my ceph cluster.   Hoping to get data off it
>> so I can rebuild it.
>>
>> In summary, too many changes too quickly caused the cluster to develop
>> incomplete PGs.  Some PGs were reporting that OSDs were to be probed.
>> I've created those OSD IDs (empty), however this wouldn't clear the
>> incompletes.   The incompletes are part of EC pools.  Running 17.2.5.
>>
>> This is the overall state:
>>
>>   cluster:
>>
>>     id:     49057622-69fc-11ed-b46e-d5acdedaae33
>>
>>     health: HEALTH_WARN
>>
>>             Failed to apply 1 service(s): osd.dashboard-admin-1669078094056
>>
>>             1 hosts fail cephadm check
>>
>>             cephadm background work is paused
>>
>>             Reduced data availability: 28 pgs inactive, 28 pgs incomplete
>>
>>             Degraded data redundancy: 55 pgs undersized
>>
>>             2 slow ops, oldest one blocked for 4449 sec, daemons
>> [osd.25,osd.50,osd.51] have slow ops.
>>
>>
>>
>> These are the incomplete PGs that HAVE DATA (objects > 0) [ via ceph pg ls incomplete ]:
>>
>> 2.35   23199  0  0  0  95980273664  0  0  2477  incomplete           10s  2104'46277  28260:686871   [44,4,37,3,40,32]p44    [44,4,37,3,40,32]p44  2023-01-03T03:54:47.821280+0000  2022-12-29T18:53:09.287203+0000  14   queued for deep scrub
>> 2.53   22821  0  0  0  94401175552  0  0  2745  remapped+incomplete  10s  2104'45845  28260:565267   [60,48,52,65,67,7]p60   [60]p60               2023-01-03T10:18:13.388383+0000  2023-01-03T10:18:13.388383+0000  408  queued for scrub
>> 2.9f   22858  0  0  0  94555983872  0  0  2736  remapped+incomplete  10s  2104'45636  28260:759872   [56,59,3,57,5,32]p56    [56]p56               2023-01-03T10:55:49.848693+0000  2023-01-03T10:55:49.848693+0000  376  queued for scrub
>> 2.be   22870  0  0  0  94429110272  0  0  2661  remapped+incomplete  10s  2104'45561  28260:813759   [41,31,37,9,7,69]p41    [41]p41               2023-01-03T14:02:15.790077+0000  2023-01-03T14:02:15.790077+0000  360  queued for scrub
>> 2.e4   22953  0  0  0  94912278528  0  0  2648  remapped+incomplete  20m  2104'46048  28259:732896   [37,46,33,4,48,49]p37   [37]p37               2023-01-02T18:38:46.268723+0000  2022-12-29T18:05:47.431468+0000  18   queued for deep scrub
>> 17.78  20169  0  0  0  84517834400  0  0  2198  remapped+incomplete  10s  3735'53405  28260:1243673  [4,37,2,36,66,0]p4      [41]p41               2023-01-03T14:21:41.563424+0000  2023-01-03T14:21:41.563424+0000  348  queued for scrub
>> 17.d8  20328  0  0  0  85196053130  0  0  1852  remapped+incomplete  10s  3735'54458  28260:1309564  [38,65,61,37,58,39]p38  [53]p53               2023-01-02T18:32:35.371071+0000  2022-12-28T19:08:29.492244+0000  21   queued for deep scrub
>>
>> At present I'm unable to reliably access my data due to the incomplete PGs
>> above.  I'll post whatever outputs are requested (not posting now as they
>> can be rather verbose).  Is there hope?