Re: Object Write Latency

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Mon, 23 Sep 2013 09:39:34 +0200

On Fri, Sep 20, 2013 at 5:34 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Fri, 20 Sep 2013, Andreas Joachim Peters wrote:
>> Hi,
>>
>>
>> We ran some benchmarks of object read/write latencies on the CERN Ceph installation.
>>
>> The cluster has 44 nodes and ~1k disks, all on 10GE, and the pool is configured with 3 copies.
>> Client and server are both 0.67.
>>
>> The latencies we observe (using tiny objects ... 5 bytes) on the idle pool:
>>
>> write full object(sync) ~65-80ms
>> append to object ~60-75ms
>> set xattr object ~65-80ms
>> lock object ~65-80ms
>> stat object ~1ms
>
> How are the individual OSDs configured?  Are they purely HDD's with a
> journal partition, or is there an SSD journal?  If it's pure HDD, this
> will be a significant source of latency.

No SSD journal; the journal and the data partition are on the same HDD.
It still seems large to have >45ms for 1 copy though, no?
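
For comparing against the raw disk, the "append few bytes + fsync" test
quoted further down can be approximated with a few lines of Python. This is
only a sketch of that kind of measurement, not the tool that was actually
used; the function names, the round count and the 5-byte payload are my
assumptions.

```python
import os
import statistics
import tempfile
import time


def fsync_append_latencies(path, rounds=20, payload=b"hello"):
    """Append `payload` to `path` and fsync, `rounds` times.

    Returns the per-iteration latency (write + fsync) in milliseconds.
    """
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for _ in range(rounds):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage before timing stops
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies


def summarize(latencies):
    """Reduce a list of latencies to min / median / max (same unit as input)."""
    return {
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }


if __name__ == "__main__":
    # Hypothetical usage: the temp file lands wherever tempfile puts it, so
    # point `dir=` at a directory on the disk you actually want to measure.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        target = tmp.name
    try:
        print(summarize(fsync_append_latencies(target)))
    finally:
        os.unlink(target)
```

The number only means something if the file sits on the same kind of HDD
(and filesystem) as the OSD journal and data.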

>
> That said, Dieter just recently pointed out to me that he's observing
> significant time in a request turnaround (single request, 1 request in
> flight) that is not related to the storage backend.  I've been
> traveling this week and haven't had time to look into it yet.  Tracking
> this down should just be a matter of turning up the logs and looking
> carefully at the timestamps to see where things are being delayed.

Which logs: debug_ms, debug_osd, both, or something else?
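
Whichever logs it turns out to be, the "look carefully at the timestamps"
step could be scripted roughly as below. This is a sketch under one
assumption: that each log line starts with a timestamp of the form
"YYYY-MM-DD HH:MM:SS.ffffff". The regex and the helper names are mine, and
the pattern will need adjusting if the real prefix differs.

```python
import re
from datetime import datetime

# Assumed line prefix; lines that do not match are skipped.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})")


def parse_ts(line):
    """Return the leading timestamp of `line` as a datetime, or None."""
    m = TS_RE.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")


def largest_gaps(lines, top=5):
    """Return up to `top` tuples (gap_ms, previous_line, next_line).

    The gap is the time between two consecutive timestamped lines; the
    result is sorted with the largest gap first.
    """
    gaps = []
    prev_ts = None
    prev_line = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if prev_ts is not None:
            gap_ms = (ts - prev_ts).total_seconds() * 1000.0
            gaps.append((gap_ms, prev_line.rstrip("\n"), line.rstrip("\n")))
        prev_ts, prev_line = ts, line
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]
```

Feeding it the log lines for a single request (one request in flight)
should show between which two messages most of the time goes.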

>
>> We seem to saturate the pools writing ~ 20k objects/s (= internally 60k/s).
>>
>> Is there an easy explanation for 80 ms (with practically no payload), and is there any tuning that could reduce it?
>> I measured (append few bytes + fsync) on such a disk at around 33ms, which probably explains part of the latency.
>>
>> Then I tried the async API to see whether the measurement differs
>> between wait_for_complete and wait_for_safe. Shouldn't
>> wait_for_complete be much shorter? I always get comparable results.
>
> If you are using XFS or ext4 on the backend, the OSD is doing write-ahead
> journaling, which means that in reality the commit happens before the op
> is applied to the fs and is readable.  (The 'commit/ondisk' reply implies
> an ack so the OSD has some internal locking to maintain this illusion from
> the client's perspective.)

It's XFS.
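
For anyone who wants to repeat the wait_for_complete vs. wait_for_safe
comparison, here is a minimal sketch of how it could be timed from Python.
It assumes the python-rados bindings, where, as far as I know,
Ioctx.aio_write_full() returns a completion with wait_for_complete() and
wait_for_safe(). The helper names, the object name and the payload are
hypothetical, and this is not the benchmark Andreas ran.

```python
import time


def time_aio_write_full(ioctx, name, data):
    """Issue one async full-object write and time both notifications.

    Returns (ack_ms, commit_ms), both measured from submission. The waits
    run one after the other, so commit_ms >= ack_ms by construction.
    """
    start = time.perf_counter()
    completion = ioctx.aio_write_full(name, data)
    completion.wait_for_complete()  # "ack" notification
    ack_ms = (time.perf_counter() - start) * 1000.0
    completion.wait_for_safe()      # "commit/ondisk" notification
    commit_ms = (time.perf_counter() - start) * 1000.0
    return ack_ms, commit_ms


def run_against_cluster(conffile, pool, name="latency-test", data=b"hello"):
    """Hypothetical driver: connect, write one tiny object, return timings."""
    import rados  # imported lazily so the timing helper has no hard dependency

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            return time_aio_write_full(ioctx, name, data)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

If the write-ahead journaling explanation above holds, the two numbers
should come out close to each other on XFS, which would match the
"comparable results" observation.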

Cheers, Dan

>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html