----------------------------------------
> Date: Thu, 23 Oct 2014 06:58:58 -0700
> From: sage@xxxxxxxxxxxx
> To: yguang11@xxxxxxxxxxx
> CC: ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> Subject: RE: Filestore throttling
>
> On Thu, 23 Oct 2014, GuangYang wrote:
>> Thanks Sage for the quick response!
>>
>> We are using firefly (v0.80.4 with a couple of back-ports). One
>> observation we have is that during the peering stage (especially if the
>> OSD was down/in for several hours under high load), the peering ops
>> contend with normal ops and thus bring extremely long latency (up to
>> minutes) for client ops. The contention happens in the filestore over
>> the throttling budget, and also at the dispatcher/op threads; I will
>> send another email with more details after more investigation.
>
> It sounds like the problem here is that when the pg logs are long (1000's
> of entries) the MOSDPGLog messages are big and generate a big
> ObjectStore::Transaction. This can be mitigated by shortening the logs,
> but that means shortening the duration that an OSD can be down without
> triggering a backfill. Part of the answer is probably to break the PGLog
> messages into smaller pieces.

Making the transaction small should help; let me test that and get back with more information.

>
>> As for this one, I created pull request #2779 to change the default
>> value of filestore_queue_max_ops to 500 (which is the value specified
>> in the documentation, though the code is inconsistent with it). Do you
>> think we should change the other defaults as well?
>
> We reduced it to 50 almost 2 years ago, in this commit:
>
> commit 44dca5c8c5058acf9bc391303dc77893793ce0be
> Author: Sage Weil <sage@xxxxxxxxxxx>
> Date:   Sat Jan 19 17:33:25 2013 -0800
>
>     filestore: disable extra committing queue allowance
>
>     The motivation here is if there is a problem draining the op queue
>     during a sync. For XFS and ext4, this isn't generally a problem: you
>     can continue to make writes while a syncfs(2) is in progress. There
>     are currently some possible implementation issues with btrfs, but we
>     have not demonstrated them recently.
>
>     Meanwhile, this can cause queue length spikes that screw up latency.
>     During a commit, we allow too much into the queue (say, recovery
>     operations). After the sync finishes, we have to drain it out before
>     we can queue new work (say, a higher priority client request). Having
>     a deep queue below the point where priorities order work limits the
>     value of the priority queue.
>
>     Signed-off-by: Sage Weil <sage@xxxxxxxxxxx>
>
> I'm not sure it makes sense to increase it in the general case. It might
> make sense for your workload, or we may want to make peering transactions
> some sort of special case...?

It is actually another commit:

commit 40654d6d53436c210b2f80911217b044f4d7643a

    filestore: filestore_queue_max_ops 500 -> 50

    Having a deep queue limits the effectiveness of the priority queues
    above by adding additional latency.

I don't quite understand the case in which increasing this value would add additional latency; would you mind elaborating?
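For concreteness, here is a minimal sketch of the effect that commit message describes: a priority queue only orders work that has not yet entered the throttled FIFO below it, so a deep FIFO already full of low-priority recovery ops delays a newly arrived high-priority client op. Everything in the sketch is an illustrative assumption (the two-priority model, the recovery burst size, the 2 ms per-op apply cost), not Ceph's actual implementation.

    #include <cstdio>
    #include <initializer_list>
    #include <queue>
    #include <string>

    struct Op {
        int priority;                  // higher = more urgent
        std::string name;
        bool operator<(const Op &rhs) const { return priority < rhs.priority; }
    };

    int main() {
        const double apply_ms = 2.0;   // assumed per-op apply cost (illustrative)

        for (int max_ops : {50, 500}) {    // candidate filestore_queue_max_ops values
            std::priority_queue<Op> prio; // above the throttle: priorities order work
            std::queue<Op> fifo;          // below the throttle: strict arrival order

            // A recovery burst arrives first and drains into the FIFO until
            // the throttle (max_ops) is reached.
            for (int i = 0; i < 1000; ++i)
                prio.push({1, "recovery"});
            while (!prio.empty() && (int)fifo.size() < max_ops) {
                fifo.push(prio.top());
                prio.pop();
            }

            // A high-priority client op arrives now. It beats everything still
            // in the priority queue, but every op already admitted to the FIFO
            // is ahead of it, so its wait grows with the FIFO depth.
            double wait_ms = (double)fifo.size() * apply_ms;
            std::printf("max_ops=%3d -> client op waits ~%4.0f ms behind %zu queued ops\n",
                        max_ops, wait_ms, fifo.size());
        }
        return 0;
    }

    // With these assumptions the output is:
    //   max_ops= 50 -> client op waits ~ 100 ms behind 50 queued ops
    //   max_ops=500 -> client op waits ~1000 ms behind 500 queued ops

Under these made-up numbers, raising the throttle from 50 to 500 ops raises the worst-case queueing delay seen by a high-priority client op from roughly 100 ms to a full second, which is the latency penalty the commit message is guarding against.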
>
> sage
>
>
>>
>> Thanks,
>> Guang
>>
>> ----------------------------------------
>>> Date: Wed, 22 Oct 2014 21:06:21 -0700
>>> From: sage@xxxxxxxxxxxx
>>> To: yguang11@xxxxxxxxxxx
>>> CC: ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
>>> Subject: Re: Filestore throttling
>>>
>>> On Thu, 23 Oct 2014, GuangYang wrote:
>>>> Hello Cephers,
>>>> During our testing, I found that filestore throttling became a limiting
>>>> factor for performance. The four settings (with default values) are:
>>>>     filestore queue max ops = 50
>>>>     filestore queue max bytes = 100 << 20
>>>>     filestore queue committing max ops = 500
>>>>     filestore queue committing max bytes = 100 << 20
>>>>
>>>> My understanding is that if we lift the thresholds, the end-to-end op
>>>> response time could improve a lot under high load, and that is one
>>>> reason to have the journal. The downside is that a read following a
>>>> successful write might be stuck longer, since the object has not yet
>>>> been flushed.
>>>>
>>>> Is my understanding correct here?
>>>>
>>>> If that is the tradeoff, and read-after-write is not a concern in our
>>>> use case, can I lift the parameters to the values below?
>>>>     filestore queue max ops = 500
>>>>     filestore queue max bytes = 200 << 20
>>>>     filestore queue committing max ops = 500
>>>>     filestore queue committing max bytes = 200 << 20
>>>>
>>>> It turns out to be very helpful during the PG peering stage (e.g. when
>>>> an OSD goes down and comes back up).
>>>
>>> That looks reasonable to me.
>>>
>>> For peering, I think there isn't really any reason to block sooner rather
>>> than later. I wonder if we should try to mark those transactions such
>>> that they don't run up against the usual limits...
>>>
>>> Is this firefly or something later? Sometime after firefly Sam made some
>>> changes so that the OSD is more careful about waiting for PG metadata to
>>> be persisted before sharing state. I wonder if you will still see the
>>> same improvement now...
>>>
>>> sage
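For reference, the values proposed in the thread would go in the [osd] section of ceph.conf. A sketch, with the shifted byte values written out; whether these are appropriate depends on the workload, per the discussion above:

    [osd]
        ; Values proposed in the thread above. Firefly defaults: 50 ops /
        ; 100 MB for the op queue, 500 ops / 100 MB for the committing queue.
        ; 209715200 bytes = 200 << 20.
        filestore queue max ops = 500
        filestore queue max bytes = 209715200
        filestore queue committing max ops = 500
        filestore queue committing max bytes = 209715200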