Hello,
On Wed, 01 Feb 2017 11:43:02 +0100 Eugen Block wrote:
Hi,
I haven't tracked the slow requests yet, but I ran some performance
tests. Although I'm not an expert, I believe the results are quite
unsatisfactory.
You've told us absolutely nothing about your cluster that would help
determine whether your results are actually to be expected or not.
Anything, and I mean anything, is relevant, from OS, kernel, and Ceph
versions to a detailed description of ALL HW in there: CPU, RAM, SSD,
HDD, network, topology, etc.
I ran a couple of rados bench tests in a pool, with different
replication sizes (1 to 3). The tests were executed on one of the ceph
nodes, which is also a monitor.
Benchmarks, including rados bench runs from a cluster node that also
hosts OSDs, will give you skewed results.
Results for size 1:
Total time run: 20.428947
Total writes made: 639
Write size: 4194304
Bandwidth (MB/sec): 125.117
This and the following results would be on par with a smallish, 1Gb/s
network HDD based cluster.
But we don't know if that's what you're running or not...
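Incidentally, the "Bandwidth (MB/sec)" figure rados bench prints is just
total writes * write size / total run time (with 1 MB = 2^20 bytes), so the
three results in this thread are at least internally consistent with their
counters. A quick sketch to verify:

```python
# Recompute the bandwidth rados bench reports from its own counters:
# bandwidth (MB/s) = writes * write_size_bytes / 2^20 / run_time_seconds
def bench_bandwidth(total_writes, write_size_bytes, total_time_s):
    return total_writes * write_size_bytes / (1024 * 1024) / total_time_s

# the three runs quoted in this mail
print(f"size 1: {bench_bandwidth(639, 4194304, 20.428947):.3f} MB/s")
print(f"size 2: {bench_bandwidth(454, 4194304, 20.741990):.3f} MB/s")
print(f"size 3: {bench_bandwidth(252, 4194304, 22.934567):.3f} MB/s")
```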
Results for size 2:
Total time run: 20.741990
Total writes made: 454
Write size: 4194304
Bandwidth (MB/sec): 87.552
Still looks nice, at least we would be totally fine with this throughput.
As mentioned countless times here, throughput is most likely your least
worry with Ceph, IOPS are more likely to be your bottleneck.
Results for size 3:
Total time run: 22.934567
Total writes made: 252
Write size: 4194304
Bandwidth (MB/sec): 43.951
I guess this is still ok for a replication size of 3. But the real
problem seems to be the client throughput.
Running fio on a Xen-VM inside rbd:
Since when does Xen support Ceph?
As in, how are you running a Xen VM that uses Ceph?
Run status group 0 (all jobs):
WRITE: io=1024.0MB, aggrb=21776KB/s, minb=21776KB/s,
maxb=21776KB/s, mint=48151msec, maxt=48151msec
And this is one of the better results, we often have values of 12MB/s
or even lower, e.g. on a KVM-VM:
Run status group 0 (all jobs):
WRITE: io=512000KB, aggrb=6425KB/s, minb=6425KB/s, maxb=6425KB/s,
mint=79683msec, maxt=79683msec
Fio results w/o the full command line parameters are basically worthless,
I'm afraid.
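A fio number only means something alongside the job definition that produced
it; something like the following would make the results interpretable (every
value here is a made-up placeholder, not the OP's actual job):

```ini
; hypothetical fio job file -- all parameters are examples, not the OP's settings
[rbd-seq-write]
rw=write          ; sequential writes
bs=4m             ; block size
size=1G           ; total amount of data written
ioengine=libaio   ; async I/O engine
direct=1          ; bypass the page cache
iodepth=16        ; number of queued I/Os
filename=/mnt/rbd/testfile
```

Block size, queue depth, and direct vs. buffered I/O in particular can change
the reported throughput by an order of magnitude.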
This is unacceptable!
I use almost all defaults in our ceph.conf,
What did you change that could be relevant then?
I've been reading
everything I could find on Google, but I don't understand why the
clients are so slow. Is it a valid assumption that the slow requests
could be caused by our slow clients?
Slow requests typically occur when your cluster is not capable of
keeping up.
One of the things I tried was
vm1:~ # echo 4096 > /sys/block/xvda/queue/read_ahead_kb
Only affects reads.
but this had no effect. I tried it with different numbers of PGs, but
that also didn't have any effect.
Again, w/o knowing your cluster and these numbers, that information is of
little value.
What else can I do to increase the performance of my instances or to
analyse the bottleneck?
Understand your HW/cluster.
Monitor the nodes with atop etc., and use collectd, graphite, and the
like to keep records of what happened when and to visualize things.
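Brad's "perf dump" suggestion further down can also be made concrete with a
few lines of scripting. This sketch assumes the usual avgcount/sum layout of
Ceph latency counters; the JSON excerpt is invented, and a real dump contains
many more counters:

```python
import json

# Hypothetical excerpt of `ceph daemon osd.N perf dump` output.
# Latency counters use the avgcount/sum layout: total seconds spent
# and the number of operations counted.
dump = json.loads("""
{
  "osd": {
    "op_w_latency": {"avgcount": 1000, "sum": 250.0},
    "op_r_latency": {"avgcount": 4000, "sum": 40.0}
  }
}
""")

def avg_latency_ms(counter):
    # average latency in milliseconds: total seconds spent / number of ops
    if counter["avgcount"] == 0:
        return 0.0
    return 1000.0 * counter["sum"] / counter["avgcount"]

for name, counter in dump["osd"].items():
    print(f"{name}: {avg_latency_ms(counter):.1f} ms avg")
```

Unusually high averages in one area (write, sub-op, journal) point at where
to dig next.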
Christian
Regards,
Eugen
Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:
> Check the latency figures in a "perf dump". High numbers in a
> particular area may help you nail it.
>
> I suspect though, that it may come down to enabling debug logging and
> tracking a slow request through the logs.
>
> On Thu, Jan 12, 2017 at 8:41 PM, Eugen Block <eblock@xxxxxx> wrote:
>> Hi,
>>
>>> Looking at the output of dump_historic_ops and dump_ops_in_flight
>>
>>
>> I waited for new slow request messages and dumped the historic_ops into a
>> file. The reporting OSD shows lots of "waiting for rw locks" messages and a
>> duration of more than 30 secs:
>>
>> "age": 366.044746,
>> "duration": 32.491506,
>> "type_data": [
>> "commit sent; apply or cleanup",
>> {
>> "client": "client.9664429",
>> "tid": 130439910
>> },
>> [
>> {
>> "time": "2017-01-12 10:38:15.227649",
>> "event": "initiated"
>> },
>> {
>> "time": "2017-01-12 10:38:15.232310",
>> "event": "reached_pg"
>> },
>> {
>> "time": "2017-01-12 10:38:15.232341",
>> "event": "waiting for rw locks"
>> },
>> {
>> "time": "2017-01-12 10:38:15.268819",
>> "event": "reached_pg"
>> },
>> [
>> .
>> .
>> .
>> ]
>> {
>> "time": "2017-01-12 10:38:45.515055",
>> "event": "waiting for rw locks"
>> },
>> {
>> "time": "2017-01-12 10:38:46.921095",
>> "event": "reached_pg"
>> },
>> {
>> "time": "2017-01-12 10:38:46.921157",
>> "event": "started"
>> },
>> {
>> "time": "2017-01-12 10:38:46.921342",
>> "event": "waiting for subops from 9,15"
>> },
>> {
>> "time": "2017-01-12 10:38:46.921724",
>> "event": "commit_queued_for_journal_write"
>> },
>> {
>> "time": "2017-01-12 10:38:46.922186",
>> "event": "write_thread_in_journal_buffer"
>> },
>> {
>> "time": "2017-01-12 10:38:46.931103",
>> "event": "sub_op_commit_rec"
>> },
>> {
>> "time": "2017-01-12 10:38:46.968730",
>> "event": "sub_op_commit_rec"
>> },
>> {
>> "time": "2017-01-12 10:38:47.717770",
>> "event": "journaled_completion_queued"
>> },
>> {
>> "time": "2017-01-12 10:38:47.718280",
>> "event": "op_commit"
>> },
>> {
>> "time": "2017-01-12 10:38:47.718359",
>> "event": "commit_sent"
>> },
>> {
>> "time": "2017-01-12 10:38:47.718890",
>> "event": "op_applied"
>> },
>> {
>> "time": "2017-01-12 10:38:47.719154",
>> "event": "done"
>> }
>>
>>
>> There were about 70 "waiting for rw locks" events; I truncated the output.
>> Based on the message "waiting for subops from 9,15" I also dumped the
>> historic_ops for these two OSDs.
>>
>> Duration on OSD.9
>>
>> "initiated_at": "2017-01-12 10:38:29.258221",
>> "age": 54.069919,
>> "duration": 20.831568,
>>
>> Duration on OSD.15
>>
>> "initiated_at": "2017-01-12 10:38:23.695098",
>> "age": 112.118210,
>> "duration": 26.452526,
>>
>> They also contain many "waiting for rw locks" messages, but not as many as
>> the dump from the reporting OSD.
>> To me it seems that because two OSDs take a lot of time to process their
>> requests (only slightly less than 30 secs), it sums up to more than 30 secs
>> on the reporting (primary?) OSD. Is the reporting OSD always the primary?
>>
>> How can I debug this further? I searched the web for "waiting for rw locks"
>> and also found Wido's blog [1] about my exact problem, but I'm not sure how
>> to continue. Our admin says our network should be fine, but what can I do to
>> rule that out?
>>
>> I don't think I have provided information about our cluster yet:
>>
>> 4 nodes, 3 mons, 20 OSDs on
>> ceph version 0.94.7-84-g8e6f430 (8e6f430683e4d8293e31fd4eb6cb09be96960cfa)
>>
>> [1]
>> https://blog.widodh.nl/2016/01/slow-requests-with-ceph-waiting-for-rw-locks/
>>
>> Thanks!
>> Eugen
>>
>>
>> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:
>>
>>
>>> On Thu, Jan 12, 2017 at 2:19 AM, Eugen Block <eblock@xxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I simply grepped for "slow request" in ceph.log. What exactly do you mean
>>>> by "effective OSD"?
>>>>
>>>> If I have this log line:
>>>> 2017-01-11 [...] osd.16 [...] cluster [WRN] slow request 32.868141
>>>> seconds
>>>> old, received at 2017-01-11 [...] ack+ondisk+write+known_if_redirected
>>>> e12440) currently waiting for subops from 0,12
>>>>
>>>> I assumed that osd.16 is the one causing problems.
>>>
>>>
>>> 0 and/or 12 in this case. 16 is just waiting. Looking at the output of
>>> dump_historic_ops and dump_ops_in_flight on the implicated osds may
>>> help to narrow down what the issue is.
>>>
>>>> But now that you mention the subops, I only noticed them yesterday, but
>>>> didn't have the time yet to investigate further. I'll have a look into the
>>>> subops messages and report back.
>>>>
>>>> Thanks!
>>>> Eugen
>>>>
>>>> Zitat von Burkhard Linke
>>>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>:
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> just for clarity:
>>>>>
>>>>>
>>>>> Did you parse the slow request messages and use the effective OSD in the
>>>>> statistics? Some messages may refer to other OSDs, e.g. "waiting for sub
>>>>> op on OSD X,Y". The reporting OSD is not the root cause in that case, but
>>>>> one of the mentioned OSDs (and I'm currently not aware of a method to
>>>>> determine which of the two OSDs is the cause in case of 3 replicas...).
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Burkhard
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Eugen Block voice : +49-40-559 51 75
>>>> NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77
>>>> Postfach 61 03 15
>>>> D-22423 Hamburg e-mail : eblock@xxxxxx
>>>>
>>>> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>>>> Sitz und Registergericht: Hamburg, HRB 90934
>>>> Vorstand: Jens-U. Mozdzen
>>>> USt-IdNr. DE 814 013 983
>>>>
>>>
>>>
>>>
>>>
>>> --
>>> Cheers,
>>> Brad
>>
>>
>>
>>
>>
>
>
>
> --
> Cheers,
> Brad
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
http://www.gol.com/