On Mon, 3 Feb 2020 at 08:25, Wido den Hollander <wido@xxxxxxxx> wrote:
>
> > The crash happens when the OSD wants to read from the pipe while
> > processing a heartbeat. To me it sounds like a networking issue.
>
> It could also be that this OSD is so busy internally with other stuff
> that it doesn't respond to heartbeats and then commits suicide.
>
> Combined with the comment that VMs can't read their data, it could very
> well be that the OSD is super busy.
>
> Maybe try a compaction of the LevelDB database.
>

I think I am with Wido on this one: if one or a few PGs end up so full of
metadata or other weird stuff that handling them takes longer than the
suicide timeout, this is exactly what you will see. At startup the OSD
tries to finish whatever operation was queued (scrubs, recovery and the
like), gets stuck doing that instead of answering heartbeats or serving
requests from other OSDs and clients, and gets ejected from the cluster.

If it is anything like what we see on our Jewel cluster, you can move
these PGs around (with impact to clients), but you can't "fix" them
without deeper changes: moving the omap store from LevelDB to RocksDB (if
FileStore), splitting PGs, or resharding buckets if it is RGW metadata
that causes these huge indexes to end up on a single OSD. You need to
figure out what the root cause is and aim to fix that part. A rough
sketch of the compaction and reshard commands is appended below.

-- 
May the most significant bit of your life be positive.
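
P.S. A rough sketch of what I mean. The OSD id, omap path, bucket name and
shard count are placeholders you would have to adapt, I have not verified
these on your exact release, and the offline compaction assumes a FileStore
OSD whose LevelDB omap lives under current/omap:

    # Offline compaction of a FileStore OSD's LevelDB omap (stop the OSD first)
    systemctl stop ceph-osd@<id>
    ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-<id>/current/omap compact
    systemctl start ceph-osd@<id>

    # Newer releases can also trigger an online compaction (check that your
    # version has it before relying on it)
    ceph tell osd.<id> compact

    # If RGW bucket indexes are the culprit, check shard utilisation and
    # reshard the offending bucket
    radosgw-admin bucket limit check
    radosgw-admin bucket reshard --bucket=<bucket-name> --num-shards=<new-count>

Compaction only buys you time if the omap keeps growing, so the resharding
or PG-splitting part is what actually addresses the root cause.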