Re: osds dropping out of the cluster w/ "OSD::osd_op_tp thread … had timed out"

On Tue, May 19, 2020 at 2:06 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

> Hi Thoralf,
>
> given the following indication from your logs:
>
> May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.211
> 7fb25cc80700  0 bluestore(/var/lib/ceph/osd/ceph-293) log_latency_fn
> slow operation observed for _collection_list, latency = 96.337s, lat =
> 96s cid =2.0s2_head start GHMAX end GHMAX max 30
>
>
> I presume that your OSDs suffer from slow RocksDB access; the
> collection-listing operation is the culprit in this case - listing 30
> items takes 96 seconds to complete.
> From my experience, such issues tend to happen after massive DB data
> removals (e.g. pool removals), often in RGW deployments, which are "DB
> access greedy".
>

+1, also my experience: this happens when running large RGW setups.

Usually the solution is making sure that

1) all metadata really goes to SSD
2) there is no spillover
3) if necessary, more OSDs are added; a common problem is having very few
dedicated OSDs for the index pool. Running the index on all OSDs (with a
fast DB device for every disk) is better - but it sounds like you already
have that. (See the commands sketched below for checking 1) and 2).)
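
Something like this is a quick way to check 1) and 2) (just a sketch -
osd.293 is taken from your logs as an example, and the index pool name
below is the default one, yours may differ):

  # OSDs that are spilling over show up as a BLUEFS_SPILLOVER warning
  ceph health detail | grep -i spillover

  # per-OSD view, run on the OSD host: slow_used_bytes > 0 means BlueFS is
  # already writing DB data to the slow (HDD) device; exact counter names
  # can differ a bit between releases
  ceph daemon osd.293 perf dump bluefs

  # confirm the DB really sits on a dedicated fast device
  ceph osd metadata 293 | grep -i bluefs

  # which CRUSH rule (and thus which device class) the index pool uses
  ceph osd pool get default.rgw.buckets.index crush_rule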


Paul



> DB data fragmentation is presumably the root cause of the resulting
> slowdown. BlueFS spillover to the main HDD device, if any, should be
> eliminated too.
> To temporarily work around the issue you might want to do a manual
> RocksDB compaction - it's known to be helpful in such cases. But the
> positive effect doesn't last forever - the DB might degrade again.
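
For the manual compaction Igor mentions, roughly (a sketch - run on the
OSD host, adjust the OSD id; the offline variant needs the OSD stopped,
so consider setting noout first):

  # online, via the admin socket - the OSD keeps serving IO, but expect
  # additional load while the compaction runs
  ceph daemon osd.293 compact

  # offline, usually more thorough
  systemctl stop ceph-osd@293
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-293 compact
  systemctl start ceph-osd@293

As Igor says, expect the effect to wear off again if the workload keeps
churning the DB.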
>
> Some questions about your case:
> - What kind of payload do you have - RGW or something else?
> - Have you done massive removals recently?
> - How large are the main and DB disks for the suffering OSDs? How much is
> their current utilization?
> - Do you see multiple "slow operation observed" patterns in the OSD logs?
> Are they all about the _collection_list function?
>
> Thanks,
> Igor
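
For the utilization and log-pattern questions, a quick way to collect the
numbers (sketch - adjust the OSD id; assuming your units follow the usual
ceph-osd@<id> naming, which matches the journal excerpts below):

  # per-OSD fill level, including the OMAP/META columns
  ceph osd df tree

  # how often the slow-operation warning fires, and for which functions
  journalctl -u ceph-osd@293 | grep 'slow operation observed' \
      | grep -o 'observed for [_a-z]*' | sort | uniq -c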
> On 5/19/2020 12:07 PM, thoralf schulze wrote:
> > hi there,
> >
> > we are seeing osds occasionally getting kicked out of our cluster, after
> > having been marked down by other osds. most of the time, the affected
> > osd rejoins the cluster after about ~5 minutes, but sometimes this takes
> > much longer. during that time, the osd seems to run just fine.
> >
> > this happens more often than we'd like it to … is "OSD::osd_op_tp thread
> > … had timed out" a real error condition or just a warning about certain
> > operations on the osd taking a long time? i already set
> > osd_op_thread_timeout to 120 (was 60 before, default should be 15
> > according to the docs), but apparently that doesn't make any difference.
> >
> > are there any other settings that prevent this kind of behaviour?
> > mon_osd_report_timeout maybe, as in frank schilder's case?
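
(The timeouts can be inspected and raised centrally if you want to rule
them out while debugging - osd.293 and the values below are only examples:

  # what the OSD is actually running with right now
  ceph daemon osd.293 config get osd_op_thread_timeout

  # cluster-wide override via the config database (value is an example)
  ceph config set osd osd_op_thread_timeout 120

  # current mon-side reporting timeout
  ceph config get mon mon_osd_report_timeout

but that only hides the symptom - the interesting question is why the DB
is that slow, see above.)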
> >
> > the cluster runs nautilus 14.2.7, osds are backed by spinning platters
> > with their rocksdb and wals on nvmes. in general, there seems to be the
> > following pattern:
> >
> > - it happens under moderate to heavy load, e.g. while creating pools with
> > a lot of pgs
> > - the affected osd logs a lot of:
> > "heartbeat_map is_healthy 'OSD::osd_op_tp thread ${thread-id}' had timed
> > out after 60"
> >   … and finally something along the lines of:
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.211
> > 7fb25cc80700  0 bluestore(/var/lib/ceph/osd/ceph-293) log_latency_fn
> > slow operation observed for _collection_list, latency = 96.337s, lat =
> > 96s cid =2.0s2_head start GHMAX end GHMAX max 30
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.219
> > 7fb25cc80700  1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread
> > 0x7fb25cc80700' had timed out after 60
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: osd.293 osd.293 2 :
> > Monitor daemon marked osd.293 down, but it is still running
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
> > 7fb267c96700  0 log_channel(cluster) log [WRN] : Monitor daemon marked
> > osd.293 down, but it is still running
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
> > 7fb267c96700  0 log_channel(cluster) do_log log to syslog
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
> > 7fb267c96700  0 log_channel(cluster) log [DBG] : map e646639 wrongly
> > marked me down at e646638
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
> > 7fb267c96700  0 log_channel(cluster) do_log log to syslog
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371
> > 7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify
> > public interface 'br-bond0' numa node: (2) No such file or directory
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371
> > 7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify
> > public interface 'br-bond0' numa node: (2) No such file or directory
> >
> > - meanwhile on the mon:
> > 2020-05-18 21:12:16.440 7f08f7933700  0 mon.ceph-mon-01@0(leader) e4
> > handle_command mon_command({"prefix": "status"} v 0) v1
> > entity='client.admin' cmd=[{"prefix": "status"}]: dispatch
> > 2020-05-18 21:12:18.436 7f08f7933700  0 log_channel(cluster) log [DBG] :
> > osd.293 reported failed by osd.101
> > 2020-05-18 21:12:18.848 7f08f7933700  0 log_channel(cluster) log [DBG] :
> > osd.293 reported failed by osd.533
> > [… lots of these from various osds]
> > 2020-05-18 21:12:24.992 7f08f7933700  0 log_channel(cluster) log [DBG] :
> > osd.293 reported failed by osd.421
> > 2020-05-18 21:12:26.124 7f08f7933700  0 log_channel(cluster) log [DBG] :
> > osd.293 reported failed by osd.504
> > 2020-05-18 21:12:26.132 7f08f7933700  0 log_channel(cluster) log [INF] :
> > osd.293 failed (root=tuberlin,datacenter=barz,host=ceph-osd-05) (16
> > reporters from different host after 27.137527 >= grace 26.361774)
> > 2020-05-18 21:12:26.236 7f08fa138700  0 log_channel(cluster) log [WRN] :
> > Health check failed: 1 osds down (OSD_DOWN)
> > 2020-05-18 21:12:26.280 7f08f6130700  0 log_channel(cluster) log [DBG] :
> > osdmap e646638: 604 total, 603 up, 604 in
> > 2020-05-18 21:12:27.336 7f08f6130700  0 log_channel(cluster) log [DBG] :
> > osdmap e646639: 604 total, 603 up, 604 in
> > 2020-05-18 21:12:28.248 7f08fa138700  0 log_channel(cluster) log [WRN] :
> > Health check failed: Reduced data availability: 17 pgs peering
> > (PG_AVAILABILITY)
> > 2020-05-18 21:12:29.392 7f08fa138700  0 log_channel(cluster) log [WRN] :
> > Health check failed: Degraded data redundancy: 80091/181232010 objects
> > degraded (0.044%), 18 pgs degraded (PG_DEGRADED)
> > 2020-05-18 21:12:33.927 7f08fa138700  0 log_channel(cluster) log [INF] :
> > Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1
> > pg inactive, 22 pgs peering)
> > 2020-05-18 21:12:35.095 7f08fa138700  0 log_channel(cluster) log [INF] :
> > Health check cleared: OSD_DOWN (was: 1 osds down)
> > 2020-05-18 21:12:35.119 7f08f6130700  0 log_channel(cluster) log [INF] :
> > osd.293 [v2:172.28.9.26:6936/2356578,v1:172.28.9.26:6937/2356578] boot
> > 2020-05-18 21:12:35.119 7f08f6130700  0 log_channel(cluster) log [DBG] :
> > osdmap e646640: 604 total, 604 up, 604 in
> > 2020-05-18 21:12:36.175 7f08f6130700  0 log_channel(cluster) log [DBG] :
> > osdmap e646641: 604 total, 604 up, 604 in
> >
> > i can happily provide more detailed logs, if that helps.
> >
> > thank you very much & with kind regards,
> > thoralf.
> >
> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



