Re: ceph PGs issues

Note: I am not entirely sure here, and would love other input from the ML about this, so take this with a grain of salt.

You don't show any unfound objects, which I think is excellent news as far as data loss is concerned.
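
If you want to double check that, a quick grep of the health output should confirm nothing unfound is lurking (just a sanity check, nothing fancy):
$ ceph health detail | grep -i unfound
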
>>            96   active+clean+scrubbing+deep+repair
The deep scrub + repair seems auspicious, and also seems like a really heavy operation on those PGs.
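
If you want to see exactly which PGs are in that repair state, something like this should list them (purely illustrative; the grep is my own):
$ ceph pg dump pgs_brief 2>/dev/null | grep repair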

I can't tell for sure, but it looks like your EC profile is K+M=12, which could be 10+2, 9+3, or (hopefully not) 11+1.
That said, being on Mimic, I am thinking that you are more than likely running into this: https://docs.ceph.com/en/latest/rados/operations/erasure-code/#erasure-coded-pool-recovery
> Prior to Octopus, erasure coded pools required at least min_size shards to be available, even if min_size is greater than K. (We generally recommend min_size be K+2 or more to prevent loss of writes and data.) This conservative decision was made out of an abundance of caution when designing the new pool mode but also meant pools with lost OSDs but no data loss were unable to recover and go active without manual intervention to change the min_size.
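
For what it's worth, you could confirm the actual K/M split with something like the below (assuming the data pool is cephfs-data, per your pool detail; the profile name placeholder is whatever the first command returns):
$ ceph osd pool get cephfs-data erasure_code_profile
$ ceph osd erasure-code-profile get <profile-name-from-above>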

I can't definitively say whether reducing the min_size will unlock the offline data, but I think it could.
As for what that value should be, I'm guessing just drop it by one and see if the PGs come out of their incomplete state.
After (hopeful) recovery, I would revert the min_size back to the original value for safety.
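
Purely as a sketch of what I mean (your pool detail shows size 12 / min_size 11, so dropping it by one would be 10):
$ ceph osd pool get cephfs-data min_size
$ ceph osd pool set cephfs-data min_size 10
# ...and once the incomplete PGs have recovered, put it back:
$ ceph osd pool set cephfs-data min_size 11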

Something odd I did notice from the pastebin of ceph health detail,
> pg 3.e5 is remapped+incomplete, acting [2147483647,2147483647,2147483647,2147483647,2147483647,278,2147483647,2147483647,273,2147483647,2147483647,2147483647]
> pg 3.14e is remapped+incomplete, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,271,2147483647,222,416,2147483647]
> pg 3.45e is remapped+incomplete, acting [2147483647,2147483647,2147483647,2147483647,2147483647,377,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> pg 3.4bc is remapped+incomplete, acting [2147483647,280,2147483647,2147483647,2147483647,407,445,268,2147483647,2147483647,418,273]
> pg 3.7c6 is remapped+incomplete, acting [2147483647,338,2147483647,2147483647,261,2147483647,2147483647,2147483647,416,415,337,2147483647]
> pg 3.8e8 is remapped+incomplete, acting [2147483647,2147483647,2147483647,2147483647,360,418,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647] 
> pg 3.b5e is remapped+incomplete, acting [2147483647,242,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,399,2147483647,2147483647] 

These 7 PGs are reporting a really large percentage of chunks with no OSD found (2147483647 is the placeholder value for 'none').
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#erasure-coded-pgs-are-not-active-clean
I think this could relate to the bit below about osd.73 throwing off the crush map.
I'm sure someone with more experience may have a better understanding of what this implies.
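
Querying one of those PGs might also shed some light on what it thinks it is missing (just an example, using one of the PGs from your health detail):
$ ceph pg 3.e5 query
# look at the "recovery_state" section, in particular any
# "down_osds_we_would_probe" or "peering_blocked_by" entries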

As for osd.73, I would remove it from the crush map.
Leaving it in the crush map while it is not a valid OSD may be throwing off the crush mappings.
I think the first step I would take would be:
$ ceph osd crush remove osd.73    # removes it from the crush map
$ ceph osd rm osd.73              # removes it from the osdmap

This should reweight the ceph003 host, and cause some data movement.
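
To sanity check afterwards, something along these lines (rough sketch):
$ ceph osd tree | grep -w osd.73    # should return nothing once it is gone
$ ceph -s                           # watch for remapped PGs / backfill starting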

So, in summation,
I would kill off osd.73 first.
Then, after the expected rebalancing, I would reduce the min_size to try to bring the PGs out of their incomplete state.

As I said, I'm not entirely sure, and would love a second opinion from someone, but if it were me in a vacuum, I think these would be my steps.

Reed

> On Jun 15, 2021, at 10:14 AM, Aly, Adel <adel.aly@xxxxxxxx> wrote:
> 
> Hi Reed,
> 
> Thank you for getting back to us.
> 
> We had indeed several disk failures at the same time.
> 
> Regarding the OSD map, we have an OSD that failed and needed to be removed, but we didn't update the crush map.
> 
> The question here is: is it safe to update the OSD crush map without affecting the available data?
> 
> We can free up more space on the monitors if that will indeed help.
> 
> More information which can be helpful:
> 
> # ceph -v
> ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
> 
> # ceph health detail
> https://pastebin.pl/view/2b8b337d
> 
> # ceph osd pool ls detail
> pool 3 'cephfs-data' erasure size 12 min_size 11 crush_rule 1 object_hash rjenkins pg_num 3072 pgp_num 3072 last_change 370219 lfor 0/367599 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 40960 fast_read 1 compression_algorithm snappy compression_mode force application cephfs
>        removed_snaps [2~7c]
> pool 4 'cephfs-meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 370219 lfor 0/367414 flags hashpspool stripe_width 0 compression_algorithm none compression_mode none application cephfs
> 
> # ceph osd tree
> https://pastebin.pl/view/eac56017
> 
> Our main struggle is that when we try to rsync data, the rsync process hangs when it encounters an inaccessible object.
> 
> Is there a way we can take out the incomplete PGs to be able to copy data smoothly without having to reset the rsync process?
> 
> Kind regards,
> adel
> 
> -----Original Message-----
> From: Reed Dier <reed.dier@xxxxxxxxxxx>
> Sent: Tuesday, June 15, 2021 4:21 PM
> To: Aly, Adel <adel.aly@xxxxxxxx>
> Cc: ceph-users@xxxxxxx
> Subject: Re:  ceph PGs issues
> 
> You have incomplete PGs, which means you have inactive data, because the data isn't there.
> 
> This will typically only happen when you have multiple concurrent disk failures, or something like that, so I think there is some missing info.
> 
>>           1 osds exist in the crush map but not in the osdmap
> 
> This seems like a red flag to have an OSD in the crush map but not the osdmap.
> 
>>           mons xyz01,xyz02 are low on available space
> 
> Your mons are probably filling up with data from running in the warn state.
> This can be problematic for recovery.
> 
> I think you will be more likely to receive some useful suggestions by providing things like which version of ceph you are using ($ ceph -v), major events that caused this, pool ($ ceph osd pool ls detail) and osd ($ ceph osd tree) topology, as well as maybe detailed health output ($ ceph health detail).
> 
> Given how much data some things may be, like the osd tree, you may want to paste to pastebin and link here.
> 
> Reed
> 
>> On Jun 15, 2021, at 2:48 AM, Aly, Adel <adel.aly@xxxxxxxx> wrote:
>> 
>> Dears,
>> 
>> We have a ceph cluster with 4096 PGs, out of which 100+ PGs are not active+clean.
>> 
>> On top of the ceph cluster, we have a ceph FS, with 3 active MDS servers.
>> 
>> It seems that we can’t get all the files out of it because of the affected PGs.
>> 
>> The object store has more than 400 million objects.
>> 
>> When we do “rados -p cephfs-data ls”, the listing stops (hangs) after listing 11+ million objects.
>> 
>> When we try to access an object which we can’t copy, the rados command hangs forever:
>> 
>> ls -i <filename>
>> 2199140525188
>> 
>> printf "%x\n" 2199140525188
>> 20006fd6484
>> 
>> rados -p cephfs-data stat 20006fd6484.00000000 (hangs here)
>> 
>> This is the current status of the ceph cluster:
>>   health: HEALTH_WARN
>>           1 MDSs report slow metadata IOs
>>           1 MDSs report slow requests
>>           1 MDSs behind on trimming
>>           1 osds exist in the crush map but not in the osdmap
>>           *Reduced data availability: 22 pgs inactive, 22 pgs incomplete*
>>           240324 slow ops, oldest one blocked for 391503 sec, daemons
>> [osd.144,osd.159,osd.180,osd.184,osd.242,osd.271,osd.275,osd.278,osd.280,osd.332]... have slow ops.
>>           mons xyz01,xyz02 are low on available space
>> 
>> services:
>>   mon: 4 daemons, quorum abc001,abc002,xyz02,xyz01
>>   mgr: abc002(active), standbys: xyz01, xyz02, abc001
>>   mds: cephfs-3/3/3 up  {0=xyz02=up:active,1=abc001=up:active,2=abc002=up:active}, 1 up:standby
>>   osd: 421 osds: 421 up, 421 in; 7 remapped pgs
>> 
>> data:
>>   pools:   2 pools, 4096 pgs
>>   objects: 403.4 M objects, 846 TiB
>>   usage:   1.2 PiB used, 1.4 PiB / 2.6 PiB avail
>>   pgs:     0.537% pgs not active
>>            3968 active+clean
>>            96   active+clean+scrubbing+deep+repair
>>            15   incomplete
>>            10   active+clean+scrubbing
>>            7    remapped+incomplete
>> 
>> io:
>>   client:   89 KiB/s rd, 13 KiB/s wr, 34 op/s rd, 1 op/s wr
>> 
>> The 100+ PGs have been in this state for a long time already.
>> 
>> Sometimes when we try to copy some files, the rsync process hangs and we can’t kill it; from the process stack, it seems to be hanging on a ceph I/O operation.
>> 
>> # cat /proc/51795/stack
>> [<ffffffffc184206d>] ceph_mdsc_do_request+0xfd/0x280 [ceph]
>> [<ffffffffc181e92e>] __ceph_do_getattr+0x9e/0x200 [ceph]
>> [<ffffffffc181eb08>] ceph_getattr+0x28/0x100 [ceph]
>> [<ffffffffab853689>] vfs_getattr+0x49/0x80
>> [<ffffffffab8537b5>] vfs_fstatat+0x75/0xc0
>> [<ffffffffab853bc1>] SYSC_newlstat+0x31/0x60
>> [<ffffffffab85402e>] SyS_newlstat+0xe/0x10
>> [<ffffffffabd93f92>] system_call_fastpath+0x25/0x2a
>> [<ffffffffffffffff>] 0xffffffffffffffff
>> 
>> # cat /proc/51795/mem
>> cat: /proc/51795/mem: Input/output error
>> 
>> Any idea on how to move forward with debugging and fixing this issue so we can get the data out of the ceph FS?
>> 
>> Thank you in advance.
>> 
>> Kind regards,
>> adel
>> 
> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



