On 11/15/18 4:37 AM, Gregory Farnum wrote:
> This is weird. Can you capture the pg query for one of them and narrow
> down in which epoch it “lost” the previous replica and see if there’s
> any evidence of why?

So I checked further, dug deeper into the logs and found this on osd.1982:

2018-11-14 15:03:04.261689 7fde7b525700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.1982 down, but it is still running
2018-11-14 15:03:04.261713 7fde7b525700  0 log_channel(cluster) log [DBG] : map e647120 wrongly marked me down at e647120

After searching further (Zabbix graphs) it seems that this machine had a
spike in CPU load around that time, which probably caused the OSD to be
marked down.

As osd.1982 was involved with these PGs, they are now in the
undersized+degraded state. Recovery didn't start; instead Ceph chose to
wait for backfill, as the PGs needed to be vacated from this OSD anyway.

The side effect is that it took 14 hours before these PGs started to
backfill. I would say that a PG which is undersized+degraded should get
the highest possible priority so it is repaired as soon as possible.
(A possible manual workaround is sketched at the end of this mail, below
the quoted report.)

Wido

> On Wed, Nov 14, 2018 at 8:09 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>
> Hi,
>
> I'm in the middle of expanding a Ceph cluster and while having 'ceph -s'
> open I suddenly saw a bunch of Placement Groups go undersized.
>
> My first hint was that one or more OSDs had failed, but none did.
>
> So I checked and I saw these Placement Groups undersized:
>
> 11.3b54 active+undersized+degraded+remapped+backfill_wait [1795,639,1422]  1795 [1795,639]  1795
> 11.362f active+undersized+degraded+remapped+backfill_wait [1431,1134,2217] 1431 [1134,1468] 1134
> 11.3e31 active+undersized+degraded+remapped+backfill_wait [1451,1391,1906] 1451 [1906,2053] 1906
> 11.50c  active+undersized+degraded+remapped+backfill_wait [1867,1455,1348] 1867 [1867,2036] 1867
> 11.421e active+undersized+degraded+remapped+backfilling   [280,117,1421]   280  [280,117]   280
> 11.700  active+undersized+degraded+remapped+backfill_wait [2212,1422,2087] 2212 [2055,2087] 2055
> 11.735  active+undersized+degraded+remapped+backfilling   [772,1832,1433]  772  [772,1832]  772
> 11.d5a  active+undersized+degraded+remapped+backfill_wait [423,1709,1441]  423  [423,1709]  423
> 11.a95  active+undersized+degraded+remapped+backfill_wait [1433,1180,978]  1433 [978,1180]  978
> 11.a67  active+undersized+degraded+remapped+backfill_wait [1154,1463,2151] 1154 [1154,2151] 1154
> 11.10ca active+undersized+degraded+remapped+backfill_wait [2012,486,1457]  2012 [2012,486]  2012
> 11.2439 active+undersized+degraded+remapped+backfill_wait [910,1457,1193]  910  [910,1193]  910
> 11.2f7e active+undersized+degraded+remapped+backfill_wait [1423,1356,2098] 1423 [1356,2098] 1356
>
> After searching I found that OSDs
> 1422,1431,1451,1455,1421,1422,1433,1441,1433,1463,1457,1457 and 1423 are
> all running on the same (newly) added host.
>
> I checked:
> - The host did not reboot
> - The OSDs did not restart
>
> The OSDs are up_thru since map 646724, which is from 11:05 this morning
> (4.5 hours ago), which is about the same time these were added.
>
> So these PGs are currently running on *2* replicas while they should be
> running on *3*.
>
> We just added 8 nodes with 24 disks each to the cluster, but none of the
> existing OSDs were touched.
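As an aside, for anyone who wants to run the same checks: listing the
undersized PGs and mapping a suspect OSD back to its host can be done with
something like the commands below. This is only a rough sketch; the exact
JSON layout of 'ceph osd find' output may differ between releases, so the
jq filter may need adjusting.

$ ceph pg dump_stuck undersized
$ ceph pg 11.3b54 query | jq '.recovery_state'
$ ceph osd find 1422 | jq -r '.crush_location.host'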
>
> When looking at PG 11.3b54 I see that 1422 is a backfill target:
>
> $ ceph pg 11.3b54 query | jq '.recovery_state'
>
> The 'enter time' for this is about 30 minutes ago and that's about the
> same time this happened.
>
> 'might_have_unfound' tells me OSD 1982, which is in the same rack as 1422
> (CRUSH replicates over racks), but that OSD is also online.
>
> Its up_thru = 647122 and that's from about 30 minutes ago. That
> ceph-osd process has however been running since September and seems to be
> functioning fine.
>
> This confuses me, as during such an expansion I know that normally a PG
> would map to size+1 until the backfill finishes.
>
> The cluster is running Luminous 12.2.8 on CentOS 7.5.
>
> Any ideas on what this could be?
>
> Wido
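To follow up on the priority point above: until PG prioritization handles
this case better, it should be possible to push the affected PGs to the
front of the queue by hand. A rough sketch, assuming the force-recovery /
force-backfill commands are available on this 12.2.8 cluster (I haven't
verified how much they help in this exact undersized+degraded+backfill_wait
situation):

$ ceph pg force-recovery 11.3b54
$ ceph pg force-backfill 11.3b54

The marks can be cleared again with cancel-force-recovery /
cancel-force-backfill once the PGs are clean.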