Re: in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

Brad Hubbard <bhubbard@xxxxxxxxxx> · Sat, 19 May 2018 18:09:03 +1000

On Sat, May 19, 2018 at 5:01 PM, Uwe Sauter <uwe.sauter.de@xxxxxxxxx> wrote:
>>>>> The mystery is that these blocked requests occur in large numbers
>>>>> when at least one of the 6 servers is booted with kernel 4.15.17; if
>>>>> all are running 4.13.16, blocked requests are infrequent and few.
>>>>
>>>>
>>>> Sounds like you need to profile your two kernel versions and work out
>>>> why one is under-performing.
>>>>
>>>
>>> Well, the problem is that I see this behavior only in our production
>>> system (6 hosts and 22 OSDs total). The test system I have is
>>> a bit smaller (only 3 hosts with 12 OSDs on older hardware) and shows no
>>> sign of this possible regression…
>>
>>
>> Are you saying you can't gather performance data from your production
>> system?
>
>
> As far as I can tell the issue only occurs on the production cluster.
> Without a way to reproduce it on the test cluster I can't bisect the
> kernels, as the production cluster runs our central infrastructure and
> each time the active LDAP is stuck, most of the other services are stuck
> as well…
> My colleagues won't appreciate that.
>
> What other kind of performance data would you have collected?
>

On systems where this can be reproduced I would use tools like 'perf
top', pvp, collectd and maybe something like the following to capture
data that can be analysed to define the nature of the issue.

// written for RHEL 6 and RHEL 7, so it may need modification elsewhere

# { top -n 5 -b > /tmp/top.out; \
vmstat 1 50 > /tmp/vm.out; \
iostat -tkx -p ALL 1 10 > /tmp/io.out; \
mpstat -A 1 10 > /tmp/mp.out; \
ps auwwx > /tmp/ps1.out; \
ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan > /tmp/ps2.out; \
sar -A 1 50 > /tmp/sar.out; \
free > /tmp/free.out; } ; \
tar -cjvf outputs_$(hostname)_$(date +"%d-%b-%Y_%H%M").tar.bz2 /tmp/*.out

As you've already pointed out this currently seems to be a kernel
performance issue but analysis of this sort of data should help you
narrow it down.
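
As one small example of that kind of analysis, something like the
following could reduce a saved vmstat capture to a single number per
column, so a capture from each kernel can be compared. This is only a
sketch: the function name vmstat_mean is made up for illustration, and
it assumes the usual vmstat layout of a header line naming the columns
followed by numeric rows.

```shell
# Hypothetical helper: mean of one named column in saved `vmstat` output.
# Assumes a header line starting "r b ..." followed by numeric data rows.
vmstat_mean() {
    col=$1      # column name as printed by vmstat, e.g. wa, id, bo
    file=$2     # file holding saved vmstat output
    awk -v col="$col" '
        # Header row: remember which field holds the requested column.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            next
        }
        # Data rows start with a number; the "procs ---memory---" banner does not.
        idx && $1 ~ /^[0-9]+$/ { sum += $idx; n++ }
        # Fail if the column was not found or there were no data rows.
        END { if (n) printf "%.1f\n", sum / n; else exit 1 }
    ' "$file"
}
```

For instance, `vmstat_mean wa /tmp/vm.out` would print the mean I/O wait
percentage for that capture. Bear in mind that the first data row vmstat
prints reports averages since boot, so it may be worth dropping it
before comparing.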

Of course, all of this relies on you being able to reproduce the
issue, but maybe you can gather a baseline to begin with so you have
something to compare to when you are in a position to gather perf data
during an issue.
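
A baseline capture could be scripted roughly as below. Treat it as a
sketch under assumptions rather than a tested procedure: the function
name capture_snapshot, the directory naming and the sample counts are
all invented for illustration, and it only runs the collectors that
happen to be installed.

```shell
# Hypothetical helper: capture a short, labelled snapshot so that a later
# capture taken during an incident can be compared with it.
capture_snapshot() {
    label=$1                # e.g. "baseline" or "incident"
    outdir=${2:-/tmp}       # parent directory for the snapshot
    count=${3:-10}          # number of 1-second samples per tool
    dir="$outdir/perf_${label}_$(uname -n)_$(date +%Y%m%d_%H%M%S)"
    mkdir -p "$dir" || return 1

    # Record the running kernel, since the comparison is between kernels.
    uname -r > "$dir/kernel.out"

    # Run a collector only if it is installed; note the ones that are missing.
    collect() {
        name=$1; shift
        if command -v "$1" >/dev/null 2>&1; then
            "$@" > "$dir/$name.out" 2>&1 ||
                echo "$1 exited with status $?" >> "$dir/$name.out"
        else
            echo "$1 not installed" > "$dir/$name.skipped"
        fi
    }
    collect vm   vmstat 1 "$count"
    collect io   iostat -tkx 1 "$count"
    collect mp   mpstat -A 1 "$count"
    collect free free

    echo "$dir"             # print the snapshot directory for the caller
}
```

Running `capture_snapshot baseline` on each host while things are quiet,
and `capture_snapshot incident` when requests are blocked, would leave
two directories per host that can be compared file by file.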

At the same time I'd suggest pursuing this with Proxmox and/or Ubuntu
to see if they have anything to offer.

-- 
Cheers,
Brad