Re: Fix OP dequeuing order

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Fri, 30 Oct 2015 13:53:00 -0600

I'll take a look at the two values mentioned to see if adjusting them
helps at all.

My reading of the code is that the token implementation is not like
any I'm used to. My understanding of token bucket algorithms is that
queues that stay under their token rate get priority, whether they
deplete tokens at a steady rate or in bursts with idle time between
them. The code doesn't seem to do this, because as soon as a subqueue
is empty it is removed along with any excess tokens. So when a "new"
op is queued for an empty priority, it is immediately penalized and
cannot run because it hasn't accrued any tokens yet. It has to sit and
wait for tokens to build up, even if there has been no activity in
that queue for some time.
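
To make the concern concrete, here is a toy model of the behavior
described above. It is only a sketch: the class, its methods and the
numbers are hypothetical and are not taken from the Ceph source. It
compares a queue that deletes an empty subqueue (and its tokens) with
one that keeps the bucket around while the priority is idle.

```python
class TokenQueue:
    """Toy per-priority token bucket (hypothetical, not the Ceph class)."""

    def __init__(self, cap, drop_when_empty):
        self.cap = cap
        self.drop_when_empty = drop_when_empty
        self.buckets = {}  # priority -> tokens currently held
        self.ops = {}      # priority -> list of (cost, name)

    def refill(self, amount):
        # Only priorities that still have a bucket accrue tokens.
        for prio in self.buckets:
            self.buckets[prio] = min(self.cap, self.buckets[prio] + amount)

    def enqueue(self, prio, cost, name):
        self.buckets.setdefault(prio, 0)  # a newly created bucket starts empty
        self.ops.setdefault(prio, []).append((cost, name))

    def try_dequeue(self, prio):
        """Run the head op of `prio` if it can pay its cost, else return None."""
        queue = self.ops.get(prio)
        if not queue or self.buckets[prio] < queue[0][0]:
            return None
        cost, name = queue.pop(0)
        self.buckets[prio] -= cost
        if not queue:
            del self.ops[prio]
            if self.drop_when_empty:
                del self.buckets[prio]  # leftover tokens are discarded
        return name


def burst_after_idle(drop_when_empty):
    q = TokenQueue(cap=64, drop_when_empty=drop_when_empty)
    q.enqueue(10, 8, "first")
    q.refill(16)
    q.try_dequeue(10)       # runs; the subqueue is now empty
    for _ in range(5):      # idle period with no ops at this priority
        q.refill(16)
    q.enqueue(10, 8, "second")
    return q.try_dequeue(10)
```

In this model, burst_after_idle(True) returns None (the op must wait
for tokens), while burst_after_idle(False) returns "second" because
the idle priority kept accruing tokens.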

The actual result depends on how many tokens are still available
versus the priority of the op. If all tokens have been exhausted, it
switches to strict priority dequeuing, which behaves differently from
the token queue. All of this seems to lead to very unpredictable
behavior in how OPs are dequeued.

What I'd like to know is whether there is a consensus on how OPs
should be dequeued. Since priorities are being assigned, I'm going to
assume that they should be taken into account:
1. Strict priority (some priorities can be starved)
2. Priority token bucket (sort of what is there now; dequeues in
priority order but favors idle priorities)
3. Fair share (each priority gets an opportunity to run in proportion
to its priority)

I'm leaning toward implementing #3 to test against what is in the
code now. I don't feel like #1 is the right way to go. I think #2 can
add a lot of administrative overhead to each OP, especially if there
are many different priorities that we have to remember even when
their subqueues are empty (I think MDS assigns many different
priorities). That would be really detrimental for high-IOPS workloads
(SSD). I think I can implement #3 with less overhead than the current
code and prevent any queue from being starved.
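
As a rough illustration of what "fair share" could mean, the sketch
below uses a smooth weighted round-robin with the priority value as
the weight. This is just one possible technique, chosen here for
illustration; it is not the code proposed for Ceph, and the class and
method names are hypothetical. Priorities are assumed to be positive
integers.

```python
from collections import deque


class FairShareQueue:
    """Toy fair-share dequeuer: service is proportional to priority."""

    def __init__(self):
        self.queues = {}  # priority -> deque of pending ops
        self.credit = {}  # priority -> running credit

    def enqueue(self, prio, op):
        if prio not in self.queues:
            self.queues[prio] = deque()
            self.credit[prio] = 0
        self.queues[prio].append(op)

    def dequeue(self):
        if not self.queues:
            return None
        total = sum(self.queues.keys())  # sum of the active priorities
        for prio in self.queues:
            self.credit[prio] += prio
        # Highest credit wins; ties go to the higher priority.
        chosen = max(self.queues, key=lambda p: (self.credit[p], p))
        self.credit[chosen] -= total
        op = self.queues[chosen].popleft()
        if not self.queues[chosen]:
            # State is kept only for priorities that have pending ops.
            del self.queues[chosen]
            del self.credit[chosen]
        return op
```

With two busy priorities 3 and 1, every four dequeues serve priority 3
three times and priority 1 once, so the low priority is slowed down
but never starved. The only per-priority state is one counter, and it
is dropped when the subqueue empties.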

I'll rough in some code and hopefully I can get it put up to the group
next week.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Fri, Oct 30, 2015 at 12:08 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> As we've discussed on the PR, this isn't right. We deliberately
> iterate through the lower-priority queues first so that the
> higher-priority ones don't starve them out entirely: they might be
> generating tokens more quickly than tokens can be used, which would
> prevent low queues from ever getting any ops at all, and that's
> contrary to the intention. ...I guess this could also happen where the
> lower-priority queues are generating tokens so quickly they never run
> out, but that doesn't seem likely unless you have *very* small
> objects.
>
> You might also try tuning the osd_op_pq_max_tokens_per_priority and
> osd_op_pq_min_cost values, which specify the maximum number of tokens
> and the minimum token cost of an op for each. (Defaults are 4194304
> and 65536, ie 4MB and 64KB.) I wouldn't expect this to be an issue
> (that builds up to a maximum of 64 recovery ops getting processed in a
> single go before it hits its long-term token average limit) but maybe
> something isn't behaving as expected.
> -Greg
>
>
>
> On Wed, Oct 28, 2015 at 9:52 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> On Wed, 28 Oct 2015, Robert LeBlanc wrote:
>>> I created a pull request to fix an op dequeuing order problem. I'm not
>>> sure if I need to mention it here.
>>>
>>> https://github.com/ceph/ceph/pull/6417
>>
>> Wow, good catch.  Have you found that this materially impacts the behavior
>> in your cluster?
>>
>> sage
>>
>>
>>>
>>> - ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html