Re: Remapped PGs

I wanted to revisit this - we're now on 15.2.9 and still have this one
cluster with 5 PGs "stuck" in pg_temp. Any idea how to clean this up,
or how it might have occurred? I'm fairly certain it showed up after
an autoscale-up and an autoscale-down overlapped each other.
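
One thing we're considering trying (untested here, so treat it as a
sketch rather than a known fix) is forcing the affected PGs through
peering again, on the theory that the monitors drop the stale pg_temp
entry once peering completes. For 3.7af that would be:

root@ceph01:~# ceph pg repeer 3.7af

or, more heavy-handed, briefly marking the acting primary down (87,
per the pg map output quoted below) so the PG re-peers on its own:

root@ceph01:~# ceph osd down 87

If anyone knows whether that's safe/sufficient for stale pg_temp
entries, or has a better approach, I'd appreciate hearing it.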

On Mon, Aug 10, 2020 at 10:28 AM David Orman <ormandj@xxxxxxxxxxxx> wrote:
>
> We've gotten a bit further: after working out how this remapped count is determined (it comes from pg_temp), we've found the PGs being counted as remapped:
>
> root@ceph01:~# ceph osd dump |grep pg_temp
> pg_temp 3.7af [93,1,29]
> pg_temp 3.7bc [137,97,5]
> pg_temp 3.7d9 [72,120,18]
> pg_temp 3.7e8 [80,21,71]
> pg_temp 3.7fd [74,51,8]
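>
> For reference, a quick way to compare each of those pg_temp entries
> against the current up/acting sets is just to loop the same pg map
> command over the PG IDs above:
>
> root@ceph01:~# for pg in 3.7af 3.7bc 3.7d9 3.7e8 3.7fd; do ceph pg map $pg; done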
>
> Looking at 3.7af:
> root@ceph01:~# ceph pg map 3.7af
> osdmap e15406 pg 3.7af (3.f) -> up [87,156,29] acting [87,156,29]
>
> I'm unclear on why this is staying in pg_temp. Is there a way to clean it up? I would have expected it to be cleaned up automatically, per the docs, but I might be missing something here.
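>
> (If it's useful to anyone digging into this, the per-PG state can also
> be pulled straight from a pg query, e.g.:
>
> root@ceph01:~# ceph pg 3.7af query | grep '"state"'
>
> though I'm not sure it will show anything beyond what ceph status
> already reports.)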
>
> On Thu, Aug 6, 2020 at 2:40 PM David Orman <ormandj@xxxxxxxxxxxx> wrote:
>>
>> Still haven't figured this out. We went ahead and upgraded the entire cluster to Podman 2.0.4, and in the process did OS/kernel upgrades and rebooted every node, one at a time. We still have 5 PGs stuck in the 'remapped' state according to 'ceph -s', but zero show up in that state in the pg dump output. Does anybody have any suggestions on what to do about this?
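>>
>> For anyone comparing notes: 'ceph pg ls' can also filter by state, and
>> I'd expect it to agree with the pg dump grep quoted below:
>>
>> root@ceph01:~# ceph pg ls remapped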
>>
>> On Wed, Aug 5, 2020 at 10:54 AM David Orman <ormandj@xxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> We see that we have 5 'remapped' PGs, but we're unclear why, or what to do about it. We shifted some target ratios for the autoscaler, and that resulted in this state. While adjusting the ratios, we noticed two OSDs go down; we just restarted the containers for those OSDs with podman and they came back up. Here's the status output:
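>>>
>>> (In case it matters, the OSD restarts were plain container restarts,
>>> roughly along these lines - the real container names came from podman
>>> ps, the one below is just a placeholder:)
>>>
>>> root@ceph01:~# podman ps --format '{{.Names}}' | grep osd
>>> root@ceph01:~# podman restart <osd-container-name>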
>>>
>>> ###################
>>> root@ceph01:~# ceph status
>>> INFO:cephadm:Inferring fsid x
>>> INFO:cephadm:Inferring config x
>>> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
>>>   cluster:
>>>     id:     41bb9256-c3bf-11ea-85b9-9e07b0435492
>>>     health: HEALTH_OK
>>>
>>>   services:
>>>     mon: 5 daemons, quorum ceph01,ceph04,ceph02,ceph03,ceph05 (age 2w)
>>>     mgr: ceph03.ytkuyr(active, since 2w), standbys: ceph01.aqkgbl, ceph02.gcglcg, ceph04.smbdew, ceph05.yropto
>>>     osd: 168 osds: 168 up (since 2d), 168 in (since 2d); 5 remapped pgs
>>>
>>>   data:
>>>     pools:   3 pools, 1057 pgs
>>>     objects: 18.00M objects, 69 TiB
>>>     usage:   119 TiB used, 2.0 PiB / 2.1 PiB avail
>>>     pgs:     1056 active+clean
>>>              1    active+clean+scrubbing+deep
>>>
>>>   io:
>>>     client:   859 KiB/s rd, 212 MiB/s wr, 644 op/s rd, 391 op/s wr
>>>
>>> root@ceph01:~#
>>>
>>> ###################
>>>
>>> When I look at ceph pg dump, I don't see any marked as remapped:
>>>
>>> ###################
>>> root@ceph01:~# ceph pg dump |grep remapped
>>> INFO:cephadm:Inferring fsid x
>>> INFO:cephadm:Inferring config x
>>> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
>>> dumped all
>>> root@ceph01:~#
>>> ###################
>>>
>>> Any idea what might be going on/how to recover? All OSDs are up. Health is 'OK'. This is Ceph 15.2.4 deployed using Cephadm in containers, on Podman 2.0.3.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


