Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

Personally, I don't think the compaction is actually required. Reef
has compact-on-iteration enabled, which should take care of this
automatically. We see this sort of delay pretty often at the end of
PG cleaning, when a PG with a high object count finishes being
cleaned, whether or not OSD compaction has been keeping up with
tombstones. It's unfortunately just something to ride through these
days until backfill completes.
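
If you do want to nudge an affected OSD along anyway, an online
compaction can also be triggered per OSD. A minimal sketch, using
osd.112 from your log snippet as the example (substitute your own IDs,
and expect some extra load while it runs):

  # trigger an online RocksDB compaction on a running OSD;
  # the OSD stays up and serves I/O while it runs
  ceph tell osd.112 compact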

https://github.com/ceph/ceph/pull/49438 is a recent attempt to improve
things in this area, but I'm not sure whether it would eliminate this
issue. We've considered going to higher PG counts (and thus fewer
objects per PG) as a possible mitigation as well.
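
For reference, bumping pg_num on the heavy pool would look roughly
like this (the pool name "rbd_data" and the target of 4096 are made-up
examples; check the autoscaler's view first and disable it for the
pool if you want to pin the value):

  # current pg_num and the autoscaler's recommendation
  ceph osd pool get rbd_data pg_num
  ceph osd pool autoscale-status
  # raise pg_num; pgp_num follows automatically on recent releases
  ceph osd pool set rbd_data pg_autoscale_mode off
  ceph osd pool set rbd_data pg_num 4096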

Josh

On Fri, Mar 22, 2024 at 2:59 AM Alexander E. Patrakov
<patrakov@xxxxxxxxx> wrote:
>
> Hello Torkil,
>
> The easiest way (in my opinion) to perform offline compaction is a bit
> different from what Igor suggested. We discussed this off-list earlier,
> and the results should be equivalent.
>
> 1. ceph config set osd osd_compact_on_start true
> 2. Restart the OSD that you want to compact (or the whole host at
> once, if you want to compact the whole host and your failure domain
> allows for that)
> 3. ceph config set osd osd_compact_on_start false
>
> The OSD will restart, but will not show as "up" until the compaction
> process completes. In your case, I would expect it to take up to 40
> minutes.
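>
> Put together for a single OSD on a cephadm cluster (which you are
> running), that would look roughly like this; osd.112 is just the OSD
> from your log snippet, substitute whichever IDs you want to compact:
>
>   ceph config set osd osd_compact_on_start true
>   ceph orch daemon restart osd.112
>   # wait for "ceph osd tree" to show the OSD up again, then revert
>   ceph config set osd osd_compact_on_start false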
>
> On Fri, Mar 22, 2024 at 3:46 PM Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> >
> >
> > On 22-03-2024 08:38, Igor Fedotov wrote:
> > > Hi Torkil,
> >
> > Hi Igor
> >
> > > Highly likely you're facing a well-known issue: RocksDB performance
> > > drops after bulk data removal, which can occur at source OSDs once
> > > PG migration completes.
> >
> > Aha, thanks.
> >
> > > You might want to use DB compaction (preferably an offline one, using
> > > ceph-kvstore-tool) to get the OSDs out of this "degraded" state, or as a
> > > preventive measure. I'd recommend doing that for all the OSDs right now,
> > > and once again after rebalancing is completed. This should improve
> > > things, but unfortunately there's no 100% guarantee.
> >
> > Why is offline preferred? With offline compaction, the easiest way would
> > be something like stopping all OSDs one host at a time and running a
> > loop over /var/lib/ceph/$id/osd.*?
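> >
> > Something like this, perhaps (untested sketch; paths assume the cephadm
> > layout, $fsid and the OSD IDs need filling in, and all OSDs on the host
> > are stopped first):
> >
> >   for d in /var/lib/ceph/$fsid/osd.*; do
> >       id=${d##*/osd.}
> >       cephadm shell --name osd.$id -- \
> >           ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$id compact
> >   done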
> >
> > > Also, I'm curious whether you have DB/WAL on fast (SSD or NVMe) drives?
> > > This might be crucial.
> >
> > We do, 22 HDDs and 2 DB/WAL NVMes per host.
> >
> > Thanks.
> >
> > Mvh.
> >
> > Torkil
> >
> > >
> > > Thanks,
> > >
> > > Igor
> > >
> > > On 3/22/2024 9:59 AM, Torkil Svensgaard wrote:
> > >> Good morning,
> > >>
> > >> Cephadm Reef 18.2.1. We recently added 4 hosts and changed the failure
> > >> domain from host to datacenter, which is the reason for the large
> > >> misplaced percentage.
> > >>
> > >> We are seeing some pretty crazy spikes in "OSD Read Latencies" and
> > >> "OSD Write Latencies" on the dashboard. Most of the time everything is
> > >> fine, but then for periods of 1-4 hours latencies will go to 10+
> > >> seconds for one or more OSDs. This also happens outside scrub hours,
> > >> and it is not the same OSDs every time. The affected OSDs are HDDs
> > >> with DB/WAL on NVMe.
> > >>
> > >> Log snippet:
> > >>
> > >> "
> > >> ...
> > >> 2024-03-22T06:48:22.859+0000 7fb184b52700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.000000954s
> > >> 2024-03-22T06:48:22.859+0000 7fb185b54700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.000000954s
> > >> 2024-03-22T06:48:22.864+0000 7fb169898700  1 heartbeat_map
> > >> clear_timeout 'OSD::osd_op_tp thread 0x7fb169898700' had timed out
> > >> after 15.000000954s
> > >> 2024-03-22T06:48:22.864+0000 7fb169898700  0 bluestore(/var/lib/ceph/
> > >> osd/ceph-112) log_latency slow operation observed for submit_transact,
> > >> latency = 17.716707230s
> > >> 2024-03-22T06:48:22.880+0000 7fb1748ae700  0 bluestore(/var/lib/ceph/
> > >> osd/ceph-112) log_latency_fn slow operation observed for
> > >> _txc_committed_kv, latency = 17.732601166s, txc = 0x55a5bcda0f00
> > >> 2024-03-22T06:48:38.077+0000 7fb184b52700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.000000954s
> > >> 2024-03-22T06:48:38.077+0000 7fb184b52700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.000000954s
> > >> ...
> > >> "
> > >>
> > >> "
> > >> [root@dopey ~]# ceph -s
> > >>   cluster:
> > >>     id:     8ee2d228-ed21-4580-8bbf-0649f229e21d
> > >>     health: HEALTH_WARN
> > >>             1 failed cephadm daemon(s)
> > >>             Low space hindering backfill (add storage if this doesn't
> > >> resolve itself): 1 pg backfill_toofull
> > >>
> > >>   services:
> > >>     mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 3d)
> > >>     mgr: jolly.tpgixt(active, since 10d), standbys: dopey.lxajvk,
> > >> lazy.xuhetq
> > >>     mds: 1/1 daemons up, 2 standby
> > >>     osd: 540 osds: 539 up (since 6m), 539 in (since 15h); 6250
> > >> remapped pgs
> > >>
> > >>   data:
> > >>     volumes: 1/1 healthy
> > >>     pools:   15 pools, 10849 pgs
> > >>     objects: 546.35M objects, 1.1 PiB
> > >>     usage:   1.9 PiB used, 2.3 PiB / 4.2 PiB avail
> > >>     pgs:     1425479651/3163081036 objects misplaced (45.066%)
> > >>              6224 active+remapped+backfill_wait
> > >>              4516 active+clean
> > >>              67   active+clean+scrubbing
> > >>              25   active+remapped+backfilling
> > >>              16   active+clean+scrubbing+deep
> > >>              1    active+remapped+backfill_wait+backfill_toofull
> > >>
> > >>   io:
> > >>     client:   117 MiB/s rd, 68 MiB/s wr, 274 op/s rd, 183 op/s wr
> > >>     recovery: 438 MiB/s, 192 objects/s
> > >> "
> > >>
> > >> Anyone know what the issue might be? Given that it happens on and off,
> > >> with long stretches of normal low latencies in between, I think it
> > >> unlikely that it is just because the cluster is busy.
> > >>
> > >> Also, how come there are only a small number of PGs doing backfill when
> > >> we have such a large misplaced percentage? Could this just be a
> > >> backfill reservation logjam?
> > >>
> > >> Mvh.
> > >>
> > >> Torkil
> > >>
> >
> > --
> > Torkil Svensgaard
> > Systems Administrator
> > Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> > Copenhagen University Hospital Amager and Hvidovre
> > Kettegaard Allé 30, 2650 Hvidovre, Denmark
>
>
>
> --
> Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



