Re: osds dropping out of the cluster w/ "OSD::osd_op_tp thread … had timed out"

Thoralf,

from your perf counter dump:

        "db_total_bytes": 15032377344,
        "db_used_bytes": 411033600,
        "wal_total_bytes": 0,
        "wal_used_bytes": 0,
        "slow_total_bytes": 94737203200,
        "slow_used_bytes": 10714480640,

slow_used_bytes is non-zero, hence you have a spillover - BlueFS is already keeping ~10 GB of DB data on the slow (main) device.
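In case it helps, a quick way to check that per OSD without grabbing a full perf dump (osd.293 just as an example, assuming jq is available on the node):

        ceph daemon osd.293 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'

If you are on Nautilus or later there should also be a BLUEFS_SPILLOVER health warning, so "ceph health detail" may already list the affected OSDs.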

Additionally, your DB volume size selection isn't perfect. For optimal space usage, RocksDB/BlueFS require DB volume sizes to be aligned with the following sequence (a somewhat simplified view):

3-6 GB, 30-60 GB, 300+ GB. This has been discussed on this mailing list multiple times.

Using a DB volume size outside these ranges (15 GB in your case) wastes space on the one hand and causes early spillovers on the other.

Hence this is worth adjusting in the long term too.
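For reference, a rough sketch of where those numbers come from, assuming Ceph's default RocksDB options (max_bytes_for_level_base = 256 MB, level multiplier = 10) and the fact that, roughly speaking, a RocksDB level only stays on the fast device if it fits there entirely:

        L0+L1:  ~0.25 GB
        L2:     ~2.5  GB   ->  3-6 GB partitions are useful
        L3:     ~25   GB   ->  30-60 GB
        L4:     ~250  GB   ->  300+ GB

So a 15 GB partition holds everything up to L2, the remaining ~10 GB stays mostly unused, and L3 and beyond spill over to the slow device.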


Thanks,

Igor



On 5/19/2020 4:11 PM, thoralf schulze wrote:
hi igor, hi paul -

thank you for your answers.

On 5/19/20 2:05 PM, Igor Fedotov wrote:
I presume that your OSDs suffer from slow RocksDB access; the collection listing operation is the culprit in this case - listing 30 items takes 96 seconds to complete.
From my experience such issues tend to happen after massive DB data removals (e.g. pool removal(s)), often backed by RGW usage which is "DB access greedy".
DB data fragmentation is presumably the root cause of the resulting slowdown. BlueFS spillover to the main HDD device, if any, should be eliminated too.
To temporarily work around the issue you might want to do a manual RocksDB compaction - it's known to be helpful in such cases. But the positive effect doesn't last forever - the DB might go into a degraded state again.
so i'll try to compact the rocksdbs (rough commands below) and report back … we didn't see any
spillovers yet, but indeed created a few large test pools with many pgs
and removed these afterwards. also, the affected pools had a significant
number of osds added to them recently. apart from this, the pools are
mainly being used for cephfs, with some rather small rgw pools for
openstack on top.
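for the record, this is roughly what i plan to run (osd.293 just as an example - please correct me if there is a better way):

        # online, via the admin socket:
        ceph daemon osd.293 compact

        # or offline, with the osd stopped:
        ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-293 compact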

On 5/19/20 2:13 PM, Paul Emmerich wrote:
3) if necessary add more OSDs; common problem is having very
few dedicated OSDs for the index pool; running the index on
all OSDs (and having a fast DB device for every disk) is
better. But sounds like you already have that
nope, unfortunately not. default.rgw.buckets.index is a replicated pool
on hdds with only 4 pgs, i'll see if i can change that.
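something along these lines, i guess (32 is just a placeholder, the idea being to spread the index over more osds):

        ceph osd pool set default.rgw.buckets.index pg_num 32
        # iirc on pre-nautilus releases pgp_num has to follow by hand:
        ceph osd pool set default.rgw.buckets.index pgp_num 32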

back to igors questions:

Some questions about your cases:
- What kind of payload do you have - RGW or something else?
mostly cephfs. the most active pools in terms of i/o are the openstack
rgw ones, though.

- Have you done massive removals recently?
yes, see above

- How large are main and DB disks for suffering OSDs? How much is their
current utilization?
for osd.293, for which i've sent the log:
main: 2tb hdd (5% used), db: 14gb partition on a 180gb nvme (~400mb used)
… i'll attach a perf dump for this osd.

- Do you see multiple "slow operation observed" patterns in OSD logs?
yes, although they do not necessarily correlate with osd down events.

Are they all about _collection_list function?
no, there are also submit_transact and _txc_committed_kv, with about the
same frequency as collection_list.
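in case it is useful, this is how i'm tallying them (log path just as an example):

        grep -o 'slow operation observed for [_a-z]*' /var/log/ceph/ceph-osd.293.log | sort | uniq -c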

thank you very much for your analysis & with kind regards,
thoralf.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



