On Mon, 3 Feb 2020 at 08:25, Wido den Hollander <wido@xxxxxxxx> wrote:
>
> > The crash happens when the OSD wants to read from the pipe while
> > processing a heartbeat. To me it sounds like a networking issue.
>
> It could also be that this OSD is so busy internally with other stuff
> that it doesn't respond to heartbeats and then commits suicide.
>
> Combined with the comment that VMs can't read their data, it could very
> well be that the OSD is super busy.
>
> Maybe try a compaction of the LevelDB database.
>

I think I am with Wido on this one: if one or a few PGs end up so full of
metadata or other weird stuff that handling them takes longer than the
suicide timeout, this is exactly what you will see. At startup the OSD
tries to finish whatever operation was queued (scrubs, recovery and the
like), gets stuck doing that instead of answering heartbeats or serving
requests from other OSDs and clients, and gets ejected from the cluster.

If it is anything like what we see on our Jewel cluster, you can move
these PGs around (with impact to clients), but you can't "fix" them
without deeper changes: moving the omap store from LevelDB to RocksDB (if
FileStore), splitting PGs, or resharding buckets if it is RGW metadata
that causes these huge indexes to end up on a single OSD. You need to
figure out what the root cause is and aim to fix that part. A rough
sketch of the compaction and reshard commands is appended below.

-- 
May the most significant bit of your life be positive.
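
P.S. A rough sketch of what I mean. The OSD id, omap path, bucket name and
shard count are placeholders you would have to adapt, I have not verified
these on your exact release, and the offline compaction assumes a FileStore
OSD whose LevelDB omap lives under current/omap:

    # Offline compaction of a FileStore OSD's LevelDB omap (stop the OSD first)
    systemctl stop ceph-osd@<id>
    ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-<id>/current/omap compact
    systemctl start ceph-osd@<id>

    # Newer releases can also trigger an online compaction (check that your
    # version has it before relying on it)
    ceph tell osd.<id> compact

    # If RGW bucket indexes are the culprit, check shard utilisation and
    # reshard the offending bucket
    radosgw-admin bucket limit check
    radosgw-admin bucket reshard --bucket=<bucket-name> --num-shards=<new-count>

Compaction only buys you time if the omap keeps growing, so the resharding
or PG-splitting part is what actually addresses the root cause.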