Hi,

so I extended the IO capability by adding spinning disks (+10%) and I stopped scrubbing completely. But the problem keeps coming back:

2017-01-12 21:19:18.275826 7f5d93e58700 0 log_channel(cluster) log [WRN] : 19 slow requests, 5 included below; oldest blocked for > 202.408648 secs
2017-01-12 21:19:18.275839 7f5d93e58700 0 log_channel(cluster) log [WRN] : slow request 60.008335 seconds old, received at 2017-01-12 21:18:18.267397: osd_op(client.245117.1:639159942 13.21d2b510 rbd_data.320282ae8944a.00000000000a0058 [set-alloc-hint object_size 4194304 write_size 4194304,write 765952~4096] snapc 0=[] ondisk+write e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275847 7f5d93e58700 0 log_channel(cluster) log [WRN] : slow request 60.143672 seconds old, received at 2017-01-12 21:18:18.132060: osd_op(client.245117.1:639158909 13.caf24910 rbd_data.320282ae8944a.0000000000067db7 [set-alloc-hint object_size 4194304 write_size 4194304,write 741376~4096] snapc 0=[] ondisk+write e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275858 7f5d93e58700 0 log_channel(cluster) log [WRN] : slow request 60.164862 seconds old, received at 2017-01-12 21:18:18.110870: osd_op(client.245117.1:639158730 13.c9d74f90 rbd_data.320282ae8944a.000000000008f18e [set-alloc-hint object_size 4194304 write_size 4194304,write 897024~4096] snapc 0=[] ondisk+write e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275863 7f5d93e58700 0 log_channel(cluster) log [WRN] : slow request 60.127854 seconds old, received at 2017-01-12 21:18:18.147878: osd_op(client.245117.1:639159079 13.a2efa410 rbd_data.320282ae8944a.000000000008e5cf [set-alloc-hint object_size 4194304 write_size 4194304,write 1703936~4096] snapc 0=[] ondisk+write e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275867 7f5d93e58700 0 log_channel(cluster) log [WRN] : slow request 60.183234 seconds old, received at 2017-01-12 21:18:18.092498: osd_op(client.245117.1:639158607 13.b56e4190 rbd_data.320282ae8944a.00000000000f45eb [set-alloc-hint object_size 4194304 write_size 4194304,write 2850816~8192] snapc 0=[] ondisk+write e5148) currently waiting for subops from 15

At that time the spinning disks were around 10-20% busy, while the SSD caching disks (writeback config) were around 2% busy. So to me it does not look like the problem here is missing IO power.

So any idea how to find out more?
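The only idea I have myself so far is to look directly at osd.15, since all five requests above are waiting for subops from it, roughly like this (the "ceph daemon" calls assume I run them on the node that hosts osd.15 and that its admin socket is at the default path):

ceph health detail                      # which OSDs currently have blocked requests
ceph osd find 15                        # which host holds osd.15
ceph osd perf                           # commit/apply latency of every OSD
ceph daemon osd.15 dump_ops_in_flight   # what osd.15 is working on right now
ceph daemon osd.15 dump_historic_ops    # the slowest recent ops with their per-step timestamps
iostat -x 1                             # on the osd.15 node, to see whether its disk or journal device is saturated

But maybe there is a better way to see where the subops get stuck?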
Thank you!

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG (haftungsbeschraenkt)
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, Amtsgericht Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107

On 06.01.2017 at 01:56, Christian Balzer wrote:
>
> Hello,
>
> On Thu, 5 Jan 2017 23:02:51 +0100 Oliver Dzombic wrote:
>
> I've never seen hung qemu tasks; slow/hung I/O tasks inside VMs with a
> broken/slow cluster I have seen.
> That's because mine are all RBD librbd backed.
>
> I think your approach with cephfs probably isn't the way forward.
> Also with cephfs you probably want to run the latest and greatest kernel
> there is (4.8?).
>
> Is your cluster logging slow request warnings during that time?
>
>> In the night, that's when these issues occur primarily (only?); we run the
>> scrubs and deep scrubs then.
>>
>> During this time the HDD utilization of the cold storage peaks at 80-95%.
>>
> Never a good thing, if they are also expected to do something useful.
> HDD OSDs have their journals inline?
>
>> But we have a SSD hot storage in front of this, which is buffering
>> writes and reads.
>>
> With that you mean cache-tier in writeback mode?
>
>> In our ceph.conf we already have these settings active:
>>
>> osd max scrubs = 1
>> osd scrub begin hour = 20
>> osd scrub end hour = 7
>> osd op threads = 16
>> osd client op priority = 63
>> osd recovery op priority = 1
>> osd op thread timeout = 5
>>
>> osd disk thread ioprio class = idle
>> osd disk thread ioprio priority = 7
>>
> You're missing the most powerful scrub dampener there is:
> osd_scrub_sleep = 0.1
>
>> All in all I do not think that there is not enough IO for the clients on
>> the cold storage (even if it looks like that at first view).
>>
> I find that one of the best ways to understand and thus manage your
> cluster is to run something like collectd with graphite (or grafana or
> whatever cranks your tractor).
>
> This should, in combination with detailed spot analysis by atop or similar,
> give a very good idea of what is going on.
>
> So in this case, watch cache-tier promotions and flushes, and see if your
> clients' I/Os really are covered by the cache, or if during the night your
> VMs do log rotates or access other cold data and thus have to go to the
> HDD-based OSDs...
>
>> And if it's really as simple as too little IO for the clients, my question
>> would be: how to avoid it?
>>
>> Turning off scrub/deep scrub completely? That should not be needed and
>> is also not really advisable.
>>
> From where I'm standing, deep-scrub is a luxury bling thing of limited
> value when compared to something with integrated live checksums as in
> BlueStore (so we hope) and BTRFS/ZFS.
>
> That said, your cluster NEEDS to be able to survive scrubs or it will be
> in even bigger trouble when OSDs/nodes fail.
>
> Christian
>
>> We simply can not run less than
>>
>> osd max scrubs = 1
>>
>> So if scrub is eating away all the IO, the scrub algorithm is simply too
>> aggressive.
>>
>> Or, and that's most probable I guess, I have some kind of config mistake.
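PS: one more thing, so I do not get it wrong when I re-enable scrubbing: if I add the scrub sleep you mentioned, it would simply be this in the [osd] section of ceph.conf (0.1 taken from your mail):

osd scrub sleep = 0.1

and, if runtime injection works for this option on our version, something like:

ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'

or does it need an OSD restart to take effect?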