Hi,

so I extended the IO capability by adding spinning disks (+10%) and I stopped scrubbing completely. But the problem keeps coming back:

2017-01-12 21:19:18.275826 7f5d93e58700 0 log_channel(cluster) log [WRN] : 19 slow requests, 5 included below; oldest blocked for > 202.408648 secs
2017-01-12 21:19:18.275839 7f5d93e58700 0 log_channel(cluster) log [WRN] : slow request 60.008335 seconds old, received at 2017-01-12 21:18:18.267397: osd_op(client.245117.1:639159942 13.21d2b510 rbd_data.320282ae8944a.00000000000a0058 [set-alloc-hint object_size 4194304 write_size 4194304,write 765952~4096] snapc 0=[] ondisk+write e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275847 7f5d93e58700 0 log_channel(cluster) log [WRN] : slow request 60.143672 seconds old, received at 2017-01-12 21:18:18.132060: osd_op(client.245117.1:639158909 13.caf24910 rbd_data.320282ae8944a.0000000000067db7 [set-alloc-hint object_size 4194304 write_size 4194304,write 741376~4096] snapc 0=[] ondisk+write e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275858 7f5d93e58700 0 log_channel(cluster) log [WRN] : slow request 60.164862 seconds old, received at 2017-01-12 21:18:18.110870: osd_op(client.245117.1:639158730 13.c9d74f90 rbd_data.320282ae8944a.000000000008f18e [set-alloc-hint object_size 4194304 write_size 4194304,write 897024~4096] snapc 0=[] ondisk+write e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275863 7f5d93e58700 0 log_channel(cluster) log [WRN] : slow request 60.127854 seconds old, received at 2017-01-12 21:18:18.147878: osd_op(client.245117.1:639159079 13.a2efa410 rbd_data.320282ae8944a.000000000008e5cf [set-alloc-hint object_size 4194304 write_size 4194304,write 1703936~4096] snapc 0=[] ondisk+write e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275867 7f5d93e58700 0 log_channel(cluster) log [WRN] : slow request 60.183234 seconds old, received at 2017-01-12 21:18:18.092498: osd_op(client.245117.1:639158607 13.b56e4190 rbd_data.320282ae8944a.00000000000f45eb [set-alloc-hint object_size 4194304 write_size 4194304,write 2850816~8192] snapc 0=[] ondisk+write e5148) currently waiting for subops from 15

At that time the spinning disks were around 10-20% busy, while the SSD caching disks (writeback config) were around 2% busy. So to me it does not look like the problem here is missing IO power.

So any idea how to find out more?
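The only idea I have myself so far is to look directly at osd.15, since all five requests above are waiting for subops from it, roughly like this (the "ceph daemon" calls assume I run them on the node that hosts osd.15 and that its admin socket is at the default path):

ceph health detail                      # which OSDs currently have blocked requests
ceph osd find 15                        # which host holds osd.15
ceph osd perf                           # commit/apply latency of every OSD
ceph daemon osd.15 dump_ops_in_flight   # what osd.15 is working on right now
ceph daemon osd.15 dump_historic_ops    # the slowest recent ops with their per-step timestamps
iostat -x 1                             # on the osd.15 node, to see whether its disk or journal device is saturated

But maybe there is a better way to see where the subops get stuck?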
Thank you!

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG (haftungsbeschraenkt)
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, Amtsgericht Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107

On 06.01.2017 at 01:56, Christian Balzer wrote:
>
> Hello,
>
> On Thu, 5 Jan 2017 23:02:51 +0100 Oliver Dzombic wrote:
>
> I've never seen hung qemu tasks; slow/hung I/O tasks inside VMs with a
> broken/slow cluster I have seen.
> That's because mine are all RBD librbd backed.
>
> I think your approach with cephfs probably isn't the way forward.
> Also with cephfs you probably want to run the latest and greatest kernel
> there is (4.8?).
>
> Is your cluster logging slow request warnings during that time?
>
>> In the night, that's when these issues occur primarily (only?); we run the
>> scrubs and deep scrubs then.
>>
>> During this time the HDD utilization of the cold storage peaks at 80-95%.
>>
> Never a good thing, if they are also expected to do something useful.
> HDD OSDs have their journals inline?
>
>> But we have a SSD hot storage in front of this, which is buffering
>> writes and reads.
>>
> With that you mean cache-tier in writeback mode?
>
>> In our ceph.conf we already have these settings active:
>>
>> osd max scrubs = 1
>> osd scrub begin hour = 20
>> osd scrub end hour = 7
>> osd op threads = 16
>> osd client op priority = 63
>> osd recovery op priority = 1
>> osd op thread timeout = 5
>>
>> osd disk thread ioprio class = idle
>> osd disk thread ioprio priority = 7
>>
> You're missing the most powerful scrub dampener there is:
> osd_scrub_sleep = 0.1
>
>> All in all I do not think that there is not enough IO for the clients on
>> the cold storage (even if it looks like that at first view).
>>
> I find that one of the best ways to understand and thus manage your
> cluster is to run something like collectd with graphite (or grafana or
> whatever cranks your tractor).
>
> This should, in combination with detailed spot analysis by atop or similar,
> give a very good idea of what is going on.
>
> So in this case, watch cache-tier promotions and flushes, and see if your
> clients' I/Os really are covered by the cache, or if during the night your
> VMs do log rotates or access other cold data and thus have to go to the
> HDD-based OSDs...
>
>> And if it's really as simple as too little IO for the clients, my question
>> would be: how to avoid it?
>>
>> Turning off scrub/deep scrub completely? That should not be needed and
>> is also not really advisable.
>>
> From where I'm standing, deep-scrub is a luxury bling thing of limited
> value when compared to something with integrated live checksums as in
> BlueStore (so we hope) and BTRFS/ZFS.
>
> That said, your cluster NEEDS to be able to survive scrubs or it will be
> in even bigger trouble when OSDs/nodes fail.
>
> Christian
>
>> We simply can not run less than
>>
>> osd max scrubs = 1
>>
>> So if scrub is eating away all the IO, the scrub algorithm is simply too
>> aggressive.
>>
>> Or, and that's most probable I guess, I have some kind of config mistake.
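PS: one more thing, so I do not get it wrong when I re-enable scrubbing: if I add the scrub sleep you mentioned, it would simply be this in the [osd] section of ceph.conf (0.1 taken from your mail):

osd scrub sleep = 0.1

and, if runtime injection works for this option on our version, something like:

ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'

or does it need an OSD restart to take effect?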