On Tue, May 19, 2020 at 3:11 PM thoralf schulze <t.schulze@xxxxxxxxxxxx> wrote:
>
> On 5/19/20 2:13 PM, Paul Emmerich wrote:
> > 3) if necessary add more OSDs; common problem is having very
> > few dedicated OSDs for the index pool; running the index on
> > all OSDs (and having a fast DB device for every disk) is
> > better. But sounds like you already have that
>
> nope, unfortunately not. default.rgw.buckets.index is a replicated pool
> on hdds with only 4 pgs, i'll see if i can change that.
>

these PGs should be distributed across all OSDs; in general it's a good
idea to have at least as many PGs as you have OSDs of the target type
for that pool (technically a third as many would be enough to put one PG
on every OSD, because of x3 replication)

Paul

> back to igor's questions:
>
> > Some questions about your cases:
> > - What kind of payload do you have - RGW or something else?
> mostly cephfs. the most active pools in terms of i/o are the openstack
> rgw ones, though.
>
> > - Have you done massive removals recently?
> yes, see above
>
> > - How large are main and DB disks for suffering OSDs? How much is their
> > current utilization?
> for osd.293, for which i've sent the log:
> main: 2tb hdd (5% used), db: 14gb partition on a 180gb nvme (~400mb used)
> … i'll attach a perf dump for this osd.
>
> > - Do you see multiple "slow operation observed" patterns in OSD logs?
> yes, although they do not necessarily correlate with osd down events.
>
> > Are they all about _collection_list function?
> no, there are also submit_transact and _txc_committed_kv, with about the
> same frequency as collection_list.
>
> thank you very much for your analysis & with kind regards,
> thoralf.
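
a minimal sketch of the pg_num change discussed above, assuming the pool
name default.rgw.buckets.index from this thread and a purely hypothetical
target of 128 PGs (pick a value that fits your OSD count); on pre-Nautilus
releases pgp_num has to be raised separately, it does not follow pg_num
automatically:

    # check the current PG count of the index pool
    ceph osd pool get default.rgw.buckets.index pg_num

    # raise it towards roughly one PG per OSD of the target device class
    ceph osd pool set default.rgw.buckets.index pg_num 128
    # pre-Nautilus only: pgp_num must be bumped as well
    ceph osd pool set default.rgw.buckets.index pgp_num 128

the perf dump mentioned for osd.293 can be taken via the OSD's admin
socket on the host running that daemon:

    ceph daemon osd.293 perf dump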