Re: Replication lag in block storage

Gregory Farnum <greg@xxxxxxxxxxx> · Fri, 14 Mar 2014 09:52:45 -0700



On Fri, Mar 14, 2014 at 9:37 AM, Greg Poirier <greg.poirier@xxxxxxxxxx> wrote:
> So, on the cluster that I _expect_ to be slow, it appears that we are
> waiting on journal commits. I want to make sure that I am reading this
> correctly:
>
>           "received_at": "2014-03-14 12:14:22.659170",
>
>                     { "time": "2014-03-14 12:14:22.660191",
>                       "event": "write_thread_in_journal_buffer"},
>
> At this point we have received the write and are attempting to write the
> transaction to the OSD's journal, yes?
>
> Then:
>
>                     { "time": "2014-03-14 12:14:22.900779",
>                       "event": "journaled_completion_queued"},
>
> 240ms later we have successfully written to the journal?

Correct. That seems an awfully long time for a 16K write, although I
don't know how much data I have on co-located journals. (At least, I'm
assuming it's in the 16K range based on the others, although I'm just
now realizing that subops aren't providing that information...I've
created a ticket to include that diagnostic info in future.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


> I expect this particular slowness due to colocation of journal and data on
> the same disk (and it's a spinning disk, not an SSD). I expect some of this
> could be alleviated by migrating journals to SSDs, but I am looking to
> rebuild in the near future--so am willing to hobble in the meantime.
>
> I am surprised that our all SSD cluster is also underperforming. I am trying
> colocating the journal on the same disk with all SSDs at the moment and will
> see if the performance degradation is of the same nature.
>
>
>
> On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>
>> Right. So which is the interval that's taking all the time? Probably
>> it's waiting for the journal commit, but maybe there's something else
>> blocking progress. If it is the journal commit, check out how busy the
>> disk is (is it just saturated?) and what its normal performance
>> characteristics are (is it breaking?).
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier <greg.poirier@xxxxxxxxxx>
>> wrote:
>> > Many of the sub ops look like this, with significant lag between
>> > received_at
>> > and commit_sent:
>> >
>> >         { "description": "osd_op(client.6869831.0:1192491
>> > rbd_data.67b14a2ae8944a.0000000000009105 [write 507904~3686400]
>> > 6.556a4db0
>> > e660)",
>> >           "received_at": "2014-03-13 20:42:05.811936",
>> >           "age": "46.088198",
>> >           "duration": "0.038328",
>> > <snip>
>> >                     { "time": "2014-03-13 20:42:05.850215",
>> >                       "event": "commit_sent"},
>> >                     { "time": "2014-03-13 20:42:05.850264",
>> >                       "event": "done"}]]},
>> >
>> > In this case almost 39ms between received_at and commit_sent.
>> >
>> > A particularly egregious example of 80+ms lag between received_at and
>> > commit_sent:
>> >
>> >        { "description": "osd_op(client.6869831.0:1190526
>> > rbd_data.67b14a2ae8944a.0000000000008fac [write 3325952~868352]
>> > 6.5255f5fd
>> > e660)",
>> >           "received_at": "2014-03-13 20:41:40.227813",
>> >           "age": "320.017087",
>> >           "duration": "0.086852",
>> > <snip>
>> >                     { "time": "2014-03-13 20:41:40.314633",
>> >                       "event": "commit_sent"},
>> >                     { "time": "2014-03-13 20:41:40.314665",
>> >                       "event": "done"}]]},
>> >
>> >
>> >
>> > On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum <greg@xxxxxxxxxxx>
>> > wrote:
>> >>
>> >> On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier <greg.poirier@xxxxxxxxxx>
>> >> wrote:
>> >> > We've been seeing this issue on all of our dumpling clusters, and I'm
>> >> > wondering what might be the cause of it.
>> >> >
>> >> > In dump_historic_ops, the time between op_applied and
>> >> > sub_op_commit_rec
>> >> > or
>> >> > the time between commit_sent and sub_op_applied is extremely high.
>> >> > Some
>> >> > of
>> >> > the osd_sub_ops are as long as 100 ms. A sample dump_historic_ops is
>> >> > included at the bottom.
>> >>
>> >> It's important to understand what each of those timestamps are
>> >> reporting.
>> >>
>> >> op_applied: the point at which an OSD has applied an operation to its
>> >> readable backing filesystem in-memory (which for xfs or ext4 will be
>> >> after it's committed to the journal)
>> >> sub_op_commit_rec: the point at which an OSD has gotten commits from
>> >> the replica OSDs
>> >> commit_sent: the point at which a replica OSD has sent a commit back
>> >> to its primary
>> >> sub_op_applied: the point at which a replica OSD has applied a
>> >> particular operation to its backing filesystem in-memory (again, after
>> >> the journal if using xfs)
>> >>
>> >> Reads are never served from replicas, so a long time between
>> >> commit_sent and sub_op_applied should not in itself be an issue. A lag
>> >> time between op_applied and sub_op_commit_rec means that the OSD is
>> >> waiting on its replicas. A long time there indicates either that the
>> >> replica is processing slowly, or that there's some issue in the
>> >> communications stack (all the way from the raw ethernet up to the
>> >> message handling in the OSD itself).
>> >> So the first thing to look for are sub ops which have a lag time
>> >> between the received_at and commit_sent timestamps. If none of those
>> >> ever turn up, but unusually long waits for sub_op_commit_rec are still
>> >> present, then it'll take more effort to correlate particular subops on
>> >> replicas with the op on the primary they correspond to, and see where
>> >> the time lag is coming into it.
>> >> -Greg
>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> >
>> >
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com