On Fri, Oct 30, 2015 at 12:53 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> I'll take a look at the two values mentioned to see if adjusting them
> helps any.
>
> My reading of the code suggests that the token implementation is not
> like any I'm used to. My understanding of token bucket algorithms is
> that queues which stay under their token rate get priority, whether
> from a steady rate of token depletion or from bursts with idle time
> between them. That doesn't seem to be what this code does: as soon as
> a subqueue is empty it is removed, along with any excess tokens. So
> when a "new" op for an empty priority is queued, it is immediately
> penalized because it hasn't accrued any tokens yet, and it has to sit
> and wait for tokens to build up even if there has been no activity in
> that queue for some time.
>
> The real effect of this depends on how many tokens remain versus the
> priority of the op. If all tokens have been exhausted, the strict
> priority queue kicks in, which dequeues differently from the token
> queue. All of this seems to lead to very unpredictable behavior in
> how ops are dequeued.
>
> What I'd like to know is the consensus on how ops should be dequeued.
> Since priorities are being assigned, I'm going to assume they should
> be taken into account:
> 1. Strict priority (some priorities can be starved)
> 2. Priority token bucket (sort of what is there now; dequeues in
> priority order but favors idle priorities)
> 3. Fair share (each priority gets an opportunity to run in proportion
> to its priority)
>
> I'm leaning toward implementing #3 to test against what is in the
> code now. I don't feel like #1 is the right way to go. I think #2
> adds a lot of administrative overhead to each op, especially if we
> have to remember a lot of different priorities even when their
> subqueues are empty (I think the MDS assigns many different
> priorities), which would really hurt high-IOPS workloads (SSDs). I
> think I can implement #3 with less overhead than the current code and
> prevent any queue from being starved.

I honestly can't talk about the specifics of this as I don't think
I've even read through the whole thing. I thought it was supposed to
be #3 already, but perhaps it's broken. When looking at it in response
to your initial PR I was worried about the eager deletion and queues
losing their tokens, so if you say that's not working I think you may
be right. I think it could explain the observed behavior too, and
fixing that seems like an easier/quicker solution than a new queue
type?
-Greg

> I'll rough in some code and hopefully I can get it put up to the
> group next week.
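To make option #3 concrete, here is a minimal, hypothetical C++ sketch
of a fair-share (weighted round-robin) dequeue. The names here
(FairShareQueue, enqueue, dequeue) are invented for illustration and
are not Ceph's actual PrioritizedQueue API; the point is only that
per-round credits proportional to priority avoid starvation without
keeping any token state for idle priorities.

#include <algorithm>
#include <cassert>
#include <deque>
#include <map>
#include <utility>

// Hypothetical fair-share queue (option #3): each priority gets dequeue
// opportunities in proportion to its value via per-round credits.
template <typename T>
class FairShareQueue {
  std::map<unsigned, std::deque<T>> queues;  // subqueues keyed by priority
  std::map<unsigned, unsigned> credits;      // remaining credits this round

public:
  void enqueue(unsigned priority, T op) {
    queues[priority].push_back(std::move(op));
  }

  bool empty() const { return queues.empty(); }

  T dequeue() {
    assert(!queues.empty());
    while (true) {
      // Lowest priorities are visited first within a round, mirroring
      // the existing low-to-high iteration Greg describes below.
      for (auto it = queues.begin(); it != queues.end(); ++it) {
        unsigned prio = it->first;
        unsigned &c = credits[prio];
        if (c == 0)
          continue;                  // this priority used its share
        --c;
        T op = std::move(it->second.front());
        it->second.pop_front();
        if (it->second.empty()) {    // eager erase is safe here: no
          queues.erase(it);          // token state needs to survive
          credits.erase(prio);
        }
        return op;
      }
      // Everyone spent their share: start a new round with credits
      // proportional to priority (at least 1, so priority 0 still runs).
      for (auto &q : queues)
        credits[q.first] = std::max(q.first, 1u);
    }
  }
};

Because fairness comes from credits refilled each round rather than
from accumulated tokens, an empty subqueue can be erased eagerly
without penalizing the next op that arrives at that priority, which is
exactly the penalty Robert describes above.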
> On Fri, Oct 30, 2015 at 12:08 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> As we've discussed on the PR, this isn't right. We deliberately
>> iterate through the lower-priority queues first so that the
>> higher-priority ones don't starve them out entirely: the high queues
>> might be generating tokens more quickly than they can use them, which
>> would prevent the low queues from ever getting any ops at all, and
>> that's contrary to the intention. ...I guess this could also happen
>> where the lower-priority queues are generating tokens so quickly that
>> they never run out, but that doesn't seem likely unless you have
>> *very* small objects.
>>
>> You might also try tuning the osd_op_pq_max_tokens_per_priority and
>> osd_op_pq_min_cost values, which specify, for each priority, the
>> maximum number of tokens and the minimum token cost of an op. (The
>> defaults are 4194304 and 65536, i.e., 4 MB and 64 KB.) I wouldn't
>> expect this to be an issue (that allows a maximum of 64 recovery ops
>> to be processed in a single go before the queue hits its long-term
>> token average limit), but maybe something isn't behaving as expected.
>> -Greg
>>
>> On Wed, Oct 28, 2015 at 9:52 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> On Wed, 28 Oct 2015, Robert LeBlanc wrote:
>>>> I created a pull request to fix an op dequeuing order problem. I'm
>>>> not sure if I need to mention it here.
>>>>
>>>> https://github.com/ceph/ceph/pull/6417
>>>
>>> Wow, good catch. Have you found that this materially impacts the
>>> behavior in your cluster?
>>>
>>> sage
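For reference, here is a hedged sketch of the per-priority token
accounting Greg describes above. The struct and member names are
invented for illustration (this is not Ceph's actual queue code); only
the two option names and their defaults come from the thread. With the
defaults, every op costs at least 65536 tokens and the bucket caps at
4194304, so at most 4194304 / 65536 = 64 minimum-cost ops can run in
one burst before dequeueing falls back to the strict-priority path.

#include <algorithm>
#include <cstdint>

// Illustrative per-priority token bucket matching the description above.
struct TokenBucket {
  uint64_t tokens = 0;
  uint64_t max_tokens;  // osd_op_pq_max_tokens_per_priority, default 4194304
  uint64_t min_cost;    // osd_op_pq_min_cost, default 65536

  explicit TokenBucket(uint64_t max_t = 4194304, uint64_t min_c = 65536)
      : max_tokens(max_t), min_cost(min_c) {}

  // Grant tokens on each scheduling pass, capped so an idle priority
  // cannot accumulate an unbounded burst.
  void put(uint64_t t) { tokens = std::min(tokens + t, max_tokens); }

  // Charge an op; every op costs at least min_cost so tiny ops cannot
  // drain the queue for free. Returns false when the bucket is out of
  // tokens, i.e., when dequeueing would fall through to strict priority.
  bool take(uint64_t op_cost) {
    uint64_t cost = std::max(op_cost, min_cost);
    if (cost > tokens)
      return false;
    tokens -= cost;
    return true;
  }
};

Note that if a bucket is destroyed whenever its subqueue drains, as
Robert observes, a freshly recreated bucket starts at zero tokens and
take() fails until put() has been called enough times; that is the
penalty on newly active priorities discussed at the top of the thread.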