Re: Slow OSD startup and slow ops

Hello Stefan,

Thank you for your answers.

On Thu, Sep 22, 2022 at 5:54 PM Stefan Kooman <stefan@xxxxxx> wrote:

> Hi,
>
> On 9/21/22 18:00, Gauvain Pocentek wrote:
> > Hello all,
> >
> > We are running several Ceph clusters and are facing an issue on one of
> > them, we would appreciate some input on the problems we're seeing.
> >
> > We run Ceph in containers on CentOS Stream 8, and we deploy using
> > ceph-ansible. While upgrading Ceph from 16.2.7 to 16.2.10, we noticed
> > that OSDs were taking a very long time to restart on one of the
> > clusters. (Other clusters were not impacted at all.)
>
> Are the other clusters of similar size?
>

We have at least one cluster that is roughly the same size. It hasn't been
upgraded yet, but restarting its OSDs doesn't cause any issues.



> > The OSD startup was so slow sometimes that we ended up having slow
> > ops, with 1 or 2 PGs stuck in a peering state. We've interrupted the
> > upgrade and the cluster runs fine now, although we have seen 1 OSD
> > flapping recently, having trouble coming back to life.
> >
> > We've checked a lot of things and read a lot of mails from this list,
> > and here is some info:
> >
> > * this cluster has RBD pools for OpenStack and RGW pools; everything is
> > replicated x 3, except the RGW data pool which is EC 4+2
> > * we haven't found any hardware related issues; we run fully on SSDs
> > and they are all in good shape, no network issue, RAM and CPU are
> > available on all OSD hosts
> > * bluestore with an LVM collocated setup
> > * we have seen the slow restart with almost all the OSDs we've upgraded
> > (100 out of 350)
> > * on restart the ceph-osd process runs at 100% CPU but we haven't seen
> > anything weird on the host
>
> Are the containers restricted to use a certain amount of CPU? Do the
> OSDs, after ~ 10-20 seconds, increase their CPU usage to 200%? (If so,
> this is probably because of the RocksDB option
> max_background_compactions = 2.)
>

This is actually a good point. We run the containers with --cpus=2. We
also had a couple of incidents where OSDs started to act up on nodes where
VMs were running CPU-intensive workloads (we have a hyperconverged setup
with OpenStack). So there's definitely something going on there.

I haven't had the opportunity to do a new restart to check more about the
CPU usage, but I hope to do that this week.
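For reference, here is roughly how I plan to check the effective limit and
the OSD CPU usage during the next restart (a rough sketch; the OSD id and
container name are just examples, and the ceph-ansible variable is from
memory, so it may need double-checking):

    # Effective CPU quota of a running OSD container (NanoCpus: 2000000000 == --cpus=2)
    docker inspect --format '{{.HostConfig.NanoCpus}}' ceph-osd-12

    # Watch ceph-osd CPU usage during the restart (matches all OSD processes on the host)
    top -p "$(pgrep -d, -x ceph-osd)"

If the 2-CPU limit turns out to be the bottleneck during the startup
compaction, we would raise it through our ceph-ansible group_vars
(ceph_osd_docker_cpu_limit, if I remember the variable name correctly)
and re-run the playbook.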


>
> > * no DB spillover
> > * we have other clusters with the same hardware, and we don't see
> > problems there
> >
> > The only thing that we found that looks suspicious is the number of op
> > logs for the PGs of the RGW index pool. `osd_max_pg_log_entries` is set
> > to 10k but `ceph pg dump` shows PGs with more than 100k logs (the
> > largest one has > 400k logs).
> >
> > Could this be the reason for the slow startup of OSDs? If so, is there
> > a way to trim these logs without too much impact on the cluster?
>
> Not sure. We have ~ 2K logs per PG.
>
> >
> > Let me know if additional info or logs are needed.
>
> Do you have a log of slow ops and osd logs?
>

I will get more logs when I restart an OSD this week. What log levels for
bluestore/rocksdb would you recommend?
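Unless you suggest otherwise, I was thinking of bumping something along
these lines beforehand via the centralized config, so the higher levels
are already in place when the OSD boots (just a guess at reasonable
values; osd.12 is an example id):

    ceph config set osd.12 debug_bluestore 10
    ceph config set osd.12 debug_bluefs 10
    ceph config set osd.12 debug_rocksdb 10
    ceph config set osd.12 debug_osd 10

and then removing the overrides with "ceph config rm" once I have the logs.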


>
> Do you have any non-standard configuration for the daemons? I.e. ceph
> daemon osd.$id config diff
>

Nothing non-standard.
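(In case it matters, I checked from the admin socket inside the OSD
container, something along these lines; osd.12 and the container name are
just examples of how things look on our hosts:

    docker exec ceph-osd-12 ceph daemon osd.12 config diff
)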


>
> We are running a Ceph Octopus (15.2.16) cluster with a similar
> configuration. We have *a lot* of slow ops when starting OSDs, and also
> during peering. When the OSDs start they consume 100% CPU for up to ~ 10
> seconds, and after that consume 200% for a minute or more. During that
> time the OSDs perform a compaction. You should be able to find this in
> the OSD logs if it's the same in your case. After some time the OSDs are
> done initializing and start the boot process. As soon as they boot up
> and start peering the slow ops kick in: lots of "transitioning to
> Primary" and "transitioning to Stray" logging. Some time later the OSD
> becomes "active". While the OSD is busy with peering it's also busy
> compacting, as I also see RocksDB compaction logging. So it might be
> RocksDB compactions impacting OSD performance while it's already busy
> becoming primary (and/or secondary/tertiary) for its PGs.
>
> We had norecover, nobackfill, norebalance active when booting the OSDs.
>
> So, it might just take a long time to do RocksDB compaction. In this
> case it might be better to do all needed RocksDB compactions first, and
> then start booting. So, what might help is "ceph osd set noup". This
> prevents the OSDs from becoming active; wait for the RocksDB
> compactions to finish, and after that unset the flag.
>
> If you try this, please let me know how it goes.
>

That sounds like a good thing to try, I'll keep you posted.
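For my own notes, the rough sequence I have in mind, based on your
description (not tested yet; osd.12 and the systemd unit name are just
how things look on our ceph-ansible hosts):

    ceph osd set noup                 # booting OSDs stay marked down, no peering yet
    systemctl restart ceph-osd@12     # restart the containerized OSD
    # wait for the startup RocksDB compaction to finish (CPU usage drops,
    # compaction messages stop appearing in the OSD log)
    ceph osd unset noup               # let the OSD come up and start peering

probably combined with norecover/nobackfill/norebalance as you did.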

Thanks again,
Gauvain
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


