Re: Changing crush map result in > 100% objects degraded

Oh, but of course everything smooths out after a while.

My main concern is that if I do this on a large cluster, it will send it spinning...
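(For what it's worth, this is roughly how I would stage it - a sketch only, using the standard norecover/nobackfill/norebalance flags mentioned elsewhere in this thread; adjust to your own procedure:)

# Sketch: pause data movement while the crush buckets are rearranged,
# then release it once the final topology is in place.
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover

ceph osd crush move ksr-ceph-osd1 rack=rack1
# ...move the remaining hosts / inject the edited crush map...

ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance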

________________________________
From: Kasper Rasmussen <kasper_steengaard@xxxxxxxxxxx>
Sent: Tuesday, January 21, 2025 18:35
To: Dan van der Ster <dan.vanderster@xxxxxxxxx>; Anthony D'Atri <aad@xxxxxxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxx>
Subject: Re:  Re: Changing crush map result in > 100% objects degraded

Hi Dan

"Also, in the process of moving the hosts one by one, each step creates
a new topology which can change the ordering of hosts, incrementally
putting things out of whack."

RESPONSE: Would it be better to edit the crush map as a file and load the new one with ceph osd setcrushmap -i <file>, so the whole topology changes in a single step?
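Roughly the workflow I have in mind - just a sketch, file names are placeholders and the crushtool --test run only gives a rough before/after comparison:

# Export and decompile the current crush map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Edit crushmap.txt: add the rack buckets and move all hosts in one go

# Recompile and compare mappings before injecting anything
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.bin --test --show-mappings --rule 0 --num-rep 3 --min-x 0 --max-x 1023 > mappings.old
crushtool -i crushmap.new --test --show-mappings --rule 0 --num-rep 3 --min-x 0 --max-x 1023 > mappings.new
diff mappings.old mappings.new | wc -l   # rough idea of how many inputs change placement

# Inject the new map in a single step
ceph osd setcrushmap -i crushmap.new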



Kaspar: I assume the cluster was idle during your tests?
RESPONSE: Yes, the cluster was indeed idle

Also -- can you reproduce it without norecover/nobackfill set ?
RESPONSE: Yes, I reproduced it with no flags set, starting from HEALTH_OK.

The result is shown below (the output from ceph pg "$pgid" query is 8548 lines - I don't know how to send that - just as an attachment? A short jq sketch for trimming it follows the output below):

ubuntu@ksr-ceph-deploy:~$ ceph osd crush move ksr-ceph-osd1 rack=rack1;
moved item id -7 name 'ksr-ceph-osd1' to location {rack=rack1} in crush map
ubuntu@ksr-ceph-deploy:~$ sleep 10
ubuntu@ksr-ceph-deploy:~$ ceph pg ls undersized;
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP LAST_SCRUB_DURATION SCRUB_SCHEDULING
1.0 2 2 0 0 459280 0 0 117 0 active+recovery_wait+undersized+degraded+remapped 7s 732'117 963:994 [11,2,8]p11 [11,2]p11 2025-01-21T05:25:36.075052+0000 2025-01-17T14:16:37.883834+0000 1 periodic scrub scheduled @ 2025-01-22T15:03:39.660268+0000
6.0 22 60 0 0 2252800 0 0 124 0 active+recovery_wait+undersized+degraded+remapped 7s 732'124 962:2333 [7,6,1]p7 [7,1]p7 2025-01-21T09:06:38.302061+0000 2025-01-20T03:35:48.722520+0000 1 periodic scrub scheduled @ 2025-01-22T18:15:49.710411+0000
6.1 12 50 0 0 1228800 0 0 110 0 active+recovery_wait+undersized+degraded+remapped 6s 732'110 963:2298 [6,3,1]p6 [6,3]p6 2025-01-21T09:04:29.912825+0000 2025-01-20T03:11:56.962281+0000 1 periodic scrub scheduled @ 2025-01-22T14:09:59.472013+0000
6.2 13 52 52 0 5423104 0 0 107 0 active+recovery_wait+undersized+degraded+remapped 6s 732'107 963:2273 [10,11,4]p10 [0,4]p0 2025-01-21T06:46:11.657543+0000 2025-01-17T14:16:41.932263+0000 0 periodic scrub scheduled @ 2025-01-22T13:52:17.796513+0000
6.5 18 100 0 0 10027008 0 0 113 0 active+recovery_wait+undersized+degraded+remapped 6s 732'113 963:726 [0,9,3]p0 [0,3]p0 2025-01-21T05:24:25.006369+0000 2025-01-21T05:24:25.006369+0000 0 periodic scrub scheduled @ 2025-01-22T12:29:24.576297+0000
6.9 16 55 0 0 5730304 0 0 104 0 active+recovery_wait+undersized+degraded+remapped 6s 732'104 963:996 [10,9,1]p10 [10,1]p10 2025-01-21T09:02:25.957504+0000 2025-01-17T14:16:50.479422+0000 0 periodic scrub scheduled @ 2025-01-22T19:49:15.092705+0000
6.b 18 60 0 0 1843200 0 0 114 0 active+recovery_wait+undersized+degraded+remapped 6s 732'114 962:3860 [1,2,6]p1 [1,2]p1 2025-01-21T06:57:22.832565+0000 2025-01-17T14:16:57.820141+0000 0 periodic scrub scheduled @ 2025-01-22T09:03:26.636583+0000
6.f 17 60 0 0 5832704 0 0 117 0 active+recovery_wait+undersized+degraded+remapped 6s 732'117 963:1437 [4,3,2]p4 [4,3]p4 2025-01-21T07:59:55.049488+0000 2025-01-17T14:17:05.581667+0000 0 periodic scrub scheduled @ 2025-01-22T14:09:31.906176+0000
6.11 20 59 0 0 7888896 0 0 123 0 active+recovery_wait+undersized+degraded+remapped 7s 732'123 963:3329 [11,0,8]p11 [11,8]p11 2025-01-21T01:32:23.458956+0000 2025-01-17T14:17:09.774195+0000 0 periodic scrub scheduled @ 2025-01-22T06:25:32.462414+0000
6.12 21 4 42 0 10334208 0 0 141 0 active+recovering+undersized+degraded+remapped 1.11259s 732'141 964:613 [9,8,11]p9 [2,11]p2 2025-01-21T05:01:52.629884+0000 2025-01-21T05:01:52.629884+0000 0 periodic scrub scheduled @ 2025-01-22T07:57:03.899309+0000
6.13 22 138 0 0 8093696 0 0 156 0 active+recovery_wait+undersized+degraded+remapped 6s 732'156 963:1312 [3,6,9]p3 [3,6]p3 2025-01-21T04:47:23.091543+0000 2025-01-18T19:41:37.702881+0000 0 periodic scrub scheduled @ 2025-01-22T13:44:27.566620+0000
6.14 14 57 0 0 5525504 0 0 116 0 active+recovery_wait+undersized+degraded+remapped 6s 732'116 963:804 [11,9,8]p11 [11,9]p11 2025-01-21T04:58:22.800659+0000 2025-01-18T18:51:30.797784+0000 0 periodic scrub scheduled @ 2025-01-22T14:10:05.285157+0000
6.17 15 58 0 0 3284992 0 0 117 0 active+recovery_wait+undersized+degraded+remapped 7s 732'117 963:1758 [11,8,5]p11 [11,8]p11 2025-01-21T17:02:51.283098+0000 2025-01-17T14:17:24.300985+0000 3 periodic scrub scheduled @ 2025-01-23T02:23:35.432833+0000
6.1a 16 2 0 0 3387392 0 0 118 0 active+recovering+undersized+remapped 0.802018s 732'118 964:1263 [6,7,1]p6 [1,6]p1 2025-01-21T09:04:47.728667+0000 2025-01-18T21:05:47.129277+0000 1 periodic scrub scheduled @ 2025-01-22T16:41:28.828878+0000
6.1d 18 118 0 0 11776000 0 0 111 0 active+recovery_wait+undersized+degraded+remapped 6s 732'111 963:1239 [3,5,10]p3 [3,5]p3 2025-01-21T06:57:54.358465+0000 2025-01-17T14:17:34.523095+0000 0 periodic scrub scheduled @ 2025-01-22T10:05:29.250471+0000

* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilization. See http://docs.ceph.com/en/latest/dev/placement-group/#omap-statistics for further details.
ubuntu@ksr-ceph-deploy:~$ sleep 5;
ubuntu@ksr-ceph-deploy:~$ ceph pg ls degraded;
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP LAST_SCRUB_DURATION SCRUB_SCHEDULING
1.0 2 2 0 0 459280 0 0 117 0 active+recovery_wait+undersized+degraded+remapped 12s 732'117 965:996 [11,2,8]p11 [11,2]p11 2025-01-21T05:25:36.075052+0000 2025-01-17T14:16:37.883834+0000 1 periodic scrub scheduled @ 2025-01-22T15:03:39.660268+0000
6.2 13 52 52 0 5423104 0 0 107 0 active+recovery_wait+undersized+degraded+remapped 12s 732'107 965:2275 [10,11,4]p10 [0,4]p0 2025-01-21T06:46:11.657543+0000 2025-01-17T14:16:41.932263+0000 0 periodic scrub scheduled @ 2025-01-22T13:52:17.796513+0000
6.5 18 100 0 0 10027008 0 0 113 0 active+recovery_wait+undersized+degraded+remapped 11s 732'113 965:728 [0,9,3]p0 [0,3]p0 2025-01-21T05:24:25.006369+0000 2025-01-21T05:24:25.006369+0000 0 periodic scrub scheduled @ 2025-01-22T12:29:24.576297+0000
6.7 19 2 0 0 9535488 0 0 100 0 active+recovering+degraded 2s 732'100 966:1371 [11,10,1]p11 [11,10,1]p11 2025-01-21T11:56:25.453326+0000 2025-01-17T14:16:46.382792+0000 0 periodic scrub scheduled @ 2025-01-22T21:24:43.245561+0000
6.9 16 55 0 0 5730304 0 0 104 0 active+recovery_wait+undersized+degraded+remapped 11s 732'104 965:998 [10,9,1]p10 [10,1]p10 2025-01-21T09:02:25.957504+0000 2025-01-17T14:16:50.479422+0000 0 periodic scrub scheduled @ 2025-01-22T19:49:15.092705+0000
6.b 18 60 0 0 1843200 0 0 114 0 active+recovery_wait+undersized+degraded+remapped 12s 732'114 965:3862 [1,2,6]p1 [1,2]p1 2025-01-21T06:57:22.832565+0000 2025-01-17T14:16:57.820141+0000 0 periodic scrub scheduled @ 2025-01-22T09:03:26.636583+0000
6.f 17 60 0 0 5832704 0 0 117 0 active+recovery_wait+undersized+degraded+remapped 12s 732'117 965:1439 [4,3,2]p4 [4,3]p4 2025-01-21T07:59:55.049488+0000 2025-01-17T14:17:05.581667+0000 0 periodic scrub scheduled @ 2025-01-22T14:09:31.906176+0000
6.11 20 59 0 0 7888896 0 0 123 0 active+recovery_wait+undersized+degraded+remapped 12s 732'123 965:3331 [11,0,8]p11 [11,8]p11 2025-01-21T01:32:23.458956+0000 2025-01-17T14:17:09.774195+0000 0 periodic scrub scheduled @ 2025-01-22T06:25:32.462414+0000
6.13 22 138 0 0 8093696 0 0 156 0 active+recovery_wait+undersized+degraded+remapped 11s 732'156 963:1312 [3,6,9]p3 [3,6]p3 2025-01-21T04:47:23.091543+0000 2025-01-18T19:41:37.702881+0000 0 periodic scrub scheduled @ 2025-01-22T13:44:27.566620+0000
6.14 14 57 0 0 5525504 0 0 116 0 active+recovery_wait+undersized+degraded+remapped 12s 732'116 965:806 [11,9,8]p11 [11,9]p11 2025-01-21T04:58:22.800659+0000 2025-01-18T18:51:30.797784+0000 0 periodic scrub scheduled @ 2025-01-22T14:10:05.285157+0000
6.17 15 58 0 0 3284992 0 0 117 0 active+recovery_wait+undersized+degraded+remapped 12s 732'117 965:1760 [11,8,5]p11 [11,8]p11 2025-01-21T17:02:51.283098+0000 2025-01-17T14:17:24.300985+0000 3 periodic scrub scheduled @ 2025-01-23T02:23:35.432833+0000
6.18 10 41 0 0 5115904 0 0 95 0 active+recovery_wait+degraded 12s 732'95 965:888 [4,10,5]p4 [4,10,5]p4 2025-01-21T07:22:43.040812+0000 2025-01-17T14:17:26.898595+0000 0 periodic scrub scheduled @ 2025-01-22T17:24:55.003668+0000
6.1c 9 86 0 0 5013504 0 0 88 0 active+recovery_wait+degraded 12s 732'88 963:24 [3,0,8]p3 [3,0,8]p3 2025-01-21T09:08:33.536101+0000 2025-01-17T14:17:37.489220+0000 1 periodic scrub scheduled @ 2025-01-22T11:35:58.254737+0000
6.1d 18 118 0 0 11776000 0 0 111 0 active+recovery_wait+undersized+degraded+remapped 11s 732'111 963:1239 [3,5,10]p3 [3,5]p3 2025-01-21T06:57:54.358465+0000 2025-01-17T14:17:34.523095+0000 0 periodic scrub scheduled @ 2025-01-22T10:05:29.250471+0000

* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilization. See http://docs.ceph.com/en/latest/dev/placement-group/#omap-statistics for further details.
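(Regarding the 8548-line query output mentioned above: a sketch of how it could be trimmed with jq, assuming the usual JSON layout of ceph pg query - key names may vary slightly between releases:)

pgid=6.5
# Keep only the summary fields and the peering/recovery history
ceph pg "$pgid" query | jq '{state: .state, up: .up, acting: .acting, recovery_state: .recovery_state}' > pg-"$pgid"-query.json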


________________________________
From: Dan van der Ster <dan.vanderster@xxxxxxxxx>
Sent: Tuesday, January 21, 2025 16:51
To: Anthony D'Atri <aad@xxxxxxxxxxxxxx>; Kasper Rasmussen <kasper_steengaard@xxxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxx>
Subject: Re:  Re: Changing crush map result in > 100% objects degraded

On Tue, Jan 21, 2025 at 7:12 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
> > On Jan 21, 2025, at 7:59 AM, Kasper Rasmussen <kasper_steengaard@xxxxxxxxxxx> wrote:
> >
> > 1 - Why do this result in such a high - objects degraded - percentage?
>
> I suspect that’s a function of the new topology having changed the mappings of multiple OSDs for given PGs.  It’s subtle, but when you move hosts into rack CRUSH buckets, that’s a different set of inputs into the CRUSH hash function, so the mappings that come out are different, even though you haven’t changed the rules and would think that hosts are hosts.

Also, in the process of moving the hosts one by one, each step creates
a new topology which can change the ordering of hosts, incrementally
putting things out of whack.

> > 2 - Why do PGs get undersized?
>
> That often means that CRUSH can’t find a complete set of placements.  In your situation maybe those would resolve themselves when you unleash the recovery hounds.

We started noticing this kind of issue around pacific, but haven't
fully tracked down what broke yet.
See https://tracker.ceph.com/issues/56046 for similar.

Undersized or degraded should only happen -- by design -- if objects
were modified while the PG did not have 3 OSDs up and acting.
Kaspar: I assume the cluster was idle during your tests?
Also -- can you reproduce it without norecover/nobackfill set ?

Could you simplify your reproducer down to:

> HEALTH_OK
> ceph osd crush move ksr-ceph-osd1 rack=rack1
> ceph pg ls undersized / degraded # get a pgid of a degraded PG
> ceph pg $pgid query

Cheers, dan


--
Dan van der Ster
CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | dan.vanderster@xxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



