Re: Octopus 15.2.8 slow ops causing inactive PGs upon disk replacement

Hi,

Stuck activating could be an old known issue: if the cluster has many
(>100) PGs per OSD, the OSDs may temporarily need to hold more than the
hard max (300), and PGs therefore get stuck activating.

We always use this option as a workaround:
    osd max pg per osd hard ratio = 10.0

I suggest giving this a try -- it can't hurt much.
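
If you want to try it at runtime first, something like this should work
(the OSD ID in the check is just an example):

    # raise the hard per-OSD PG limit cluster-wide via the mon config DB
    ceph config set osd osd_max_pg_per_osd_hard_ratio 10.0

    # or the ceph.conf equivalent under [osd], followed by OSD restarts:
    #   osd max pg per osd hard ratio = 10.0

    # confirm what a given OSD is actually running with
    ceph config show osd.103 | grep osd_max_pg_per_osd_hard_ratio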

Cheers, Dan



On Wed, Jun 23, 2021 at 4:29 PM Justin Goetz <jgoetz@xxxxxxxxxxxxxx> wrote:
>
> Hello!
>
> We are in the process of expanding our Ceph cluster (both adding OSD
> hosts and replacing smaller HDDs on our existing hosts). So far we
> have gone host by host, removing the old OSDs, swapping the physical
> HDDs, and re-adding them. This process has gone smoothly, aside from
> one issue: upon any action taken on the cluster (adding new OSDs,
> replacing old ones, etc.), PGs get stuck "activating", which leaves
> around 3.5% of PGs inactive and stops IO.
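>
> (For reference, these are roughly the standard queries I use to see
> which PGs are stuck; the PG ID below is just the one from the slow-op
> log further down, as an example:)
>
>     ceph health detail | grep -i inactive
>     ceph pg dump_stuck inactive
>     # query one stuck PG to see which OSDs it is waiting on
>     ceph pg 8.6fb query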
>
> Here is the current output of ceph -s:
>
> cluster:
>      id:     e8ffe2eb-f8fc-4110-a4bc-1715e878fb7b
>      health: HEALTH_WARN
>              Reduced data availability: 166 pgs inactive
>              Degraded data redundancy: 137153907/3658405707 objects
> degraded (3.749%), 930 pgs degraded, 928 pgs undersized
>              10 pgs not deep-scrubbed in time
>              33709 slow ops, oldest one blocked for 35956 sec, daemons
> [osd.103,osd.104,osd.105,osd.106,osd.107,osd.109,osd.111,osd.112,osd.113,osd.114]...
> have slow ops.
>
>    services:
>      mon: 3 daemons, quorum lb3,lb2,lb1 (age 8w)
>      mgr: lb1(active, since 6w), standbys: lb3, lb2
>      osd: 117 osds: 117 up (since 15m), 117 in (since 10h); 2033
> remapped pgs
>      rgw: 3 daemons active (lb1.rgw0, lb2.rgw0, lb3.rgw0)
>
>    task status:
>
>    data:
>      pools:   8 pools, 5793 pgs
>      objects: 609.74M objects, 169 TiB
>      usage:   308 TiB used, 430 TiB / 738 TiB avail
>      pgs:     2.866% pgs not active
>               137153907/3658405707 objects degraded (3.749%)
>               262215404/3658405707 objects misplaced (7.167%)
>               3754 active+clean
>               963  active+remapped+backfill_wait
>               892  active+undersized+degraded+remapped+backfill_wait
>               136  activating+remapped
>               27   activating+undersized+degraded+remapped
>               8    active+undersized+degraded+remapped+backfilling
>               6    active+clean+scrubbing+deep
>               3    activating+degraded+remapped
>               3    active+remapped+backfilling
>               1    active+undersized+remapped+backfill_wait
>
>    io:
>      client:   94 KiB/s rd, 94 op/s rd, 0 op/s wr
>      recovery: 112 MiB/s, 372 objects/s
>
>    progress:
>      Rebalancing after osd.20 marked in (10h)
>        [............................] (remaining: 11d)
>      Rebalancing after osd.41 marked in (10h)
>        [=...........................] (remaining: 8d)
>      Rebalancing after osd.30 marked in (10h)
>        [=...........................] (remaining: 9d)
>      Rebalancing after osd.1 marked in (10h)
>        [=======.....................] (remaining: 2h)
>      Rebalancing after osd.10 marked in (10h)
>        [............................] (remaining: 12d)
>      Rebalancing after osd.50 marked in (10h)
>        [............................] (remaining: 2w)
>      Rebalancing after osd.71 marked out (10h)
>        [==..........................] (remaining: 5d)
>
> What you may find interesting are the "slow ops" warnings. This is
> where our inactive PGs become stuck. Once the cluster gets into this
> state, I'm usually able to recover IO by restarting the OSDs with slow
> ops. However, what's extremely strange is that this workaround only
> works once roughly 12 hours have passed since the last OSD addition;
> restarting the slow-ops OSDs any earlier results in the slow ops
> returning immediately.
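>
> (The "restart" above is just the per-OSD systemd unit -- we deploy
> with ceph-ansible, so plain packages rather than containers; osd.103
> is an example ID taken from the health warning:)
>
>     # on the host that owns osd.103
>     systemctl restart ceph-osd@103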
>
> Our first thought was hardware issues; however, we ruled this out
> after the slow ops warnings appeared on brand-new HDDs and OSD hosts.
> Monitoring the IO saturation of the OSDs reporting slow ops shows
> actual usage nowhere near saturation, and no hardware issues are
> present on the drives themselves.
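>
> (By "monitoring the IO saturation" I mean roughly the usual views,
> on the OSD hosts and from the cluster side:)
>
>     # device-level utilization on an OSD host
>     iostat -x 1
>     # per-OSD commit/apply latency as reported by Ceph
>     ceph osd perf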
>
> Looking at the journalctl logs of one of the affected OSDs above, we see
> the following repeated multiple times:
>
> osd.103 56934 get_health_metrics reporting 2 slow ops, oldest is
> osd_op(client.467952.0:1520304537 8.6fbs0 8.1e6826fb (undecoded)
> ondisk+retry+write+known_if_redirected e56923
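>
> (If it helps, the affected OSD's in-flight and recent ops can also be
> dumped through its admin socket; osd.103 matches the log above:)
>
>     # run on the host where osd.103 lives
>     ceph daemon osd.103 dump_ops_in_flight
>     ceph daemon osd.103 dump_historic_ops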
>
> So far my procedure for the disk swaps has been as follows:
>
> 1. Set noout, norebalance, and norecover on the cluster (flag commands
>    shown below for reference).
> 2. Use ceph-ansible to remove the old disks' OSD IDs.
> 3. Swap the physical HDDs, then re-add them with ceph-ansible.
> 4. Unset noout, norebalance, and norecover.
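>
> (Steps 1 and 4 are just the standard flag toggles:)
>
>     ceph osd set noout
>     ceph osd set norebalance
>     ceph osd set norecover
>     # ... remove OSDs, swap disks, re-add via ceph-ansible ...
>     ceph osd unset noout
>     ceph osd unset norebalance
>     ceph osd unset norecover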
>
> I should note this issue appears even with simple OSD additions (not
> removals); we added 2 brand-new hosts to the cluster and saw the same
> behavior.
>
> I've been trying to think of any possible cause. I should mention our
> cluster is messy at the moment hardware-wise (we have a mix of 7 TB,
> 4 TB, and 10 TB HDDs -- we are moving to all 10 TB HDDs, but the swap
> process has been taking a while). One warning I've noticed during the
> old disk removals is "too many PGs per OSD"; however, this warning
> clears once the new OSDs are added, which I assume is to be expected.
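>
> (For completeness, the per-OSD PG counts behind that warning show up
> in the PGS column of ceph osd df:)
>
>     ceph osd df tree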
>
> If anyone would be willing to provide any hints of where to look, it
> would be much appreciated!
>
> Thanks for your time.
> --
>
> Justin Goetz
> Systems Engineer, TeraSwitch Inc.
> jgoetz@xxxxxxxxxxxxxx
> 412-945-7045 (NOC) | 412-459-7945 (Direct)
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


