On Fri, Oct 30, 2015 at 12:53 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> I'll take a look at the two values mentioned to see if adjusting them
> helps any.
>
> My reading of the code suggests that the token implementation is not
> like any I'm used to. My understanding of token bucket algorithms is
> that queues which stay under their token rate get priority, whether
> from a steady rate of token depletion or from bursts with idle time
> between them. That doesn't seem to be what this code does: as soon as
> a subqueue is empty it is removed, along with any excess tokens. So
> when a "new" op for an empty priority is queued, it is immediately
> penalized because it hasn't accrued any tokens yet, and it has to sit
> and wait for tokens to build up even if there has been no activity in
> that queue for some time.
>
> The real effect of this depends on how many tokens remain versus the
> priority of the op. If all tokens have been exhausted, the strict
> priority queue kicks in, which dequeues differently from the token
> queue. All of this seems to lead to very unpredictable behavior in
> how ops are dequeued.
>
> What I'd like to know is the consensus on how ops should be dequeued.
> Since priorities are being assigned, I'm going to assume they should
> be taken into account:
> 1. Strict priority (some priorities can be starved)
> 2. Priority token bucket (sort of what is there now; dequeues in
> priority order but favors idle priorities)
> 3. Fair share (each priority gets an opportunity to run in proportion
> to its priority)
>
> I'm leaning toward implementing #3 to test against what is in the
> code now. I don't feel like #1 is the right way to go. I think #2
> adds a lot of administrative overhead to each op, especially if we
> have to remember a lot of different priorities even when their
> subqueues are empty (I think the MDS assigns many different
> priorities), which would really hurt high-IOPS workloads (SSDs). I
> think I can implement #3 with less overhead than the current code and
> prevent any queue from being starved.

I honestly can't talk about the specifics of this as I don't think
I've even read through the whole thing. I thought it was supposed to
be #3 already, but perhaps it's broken. When looking at it in response
to your initial PR I was worried about the eager deletion and queues
losing their tokens, so if you say that's not working I think you may
be right. I think it could explain the observed behavior too, and
fixing that seems like an easier/quicker solution than a new queue
type?
-Greg

> I'll rough in some code and hopefully I can get it put up to the
> group next week.
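To make option #3 concrete, here is a minimal, hypothetical C++ sketch
of a fair-share (weighted round-robin) dequeue. The names here
(FairShareQueue, enqueue, dequeue) are invented for illustration and
are not Ceph's actual PrioritizedQueue API; the point is only that
per-round credits proportional to priority avoid starvation without
keeping any token state for idle priorities.

#include <algorithm>
#include <cassert>
#include <deque>
#include <map>
#include <utility>

// Hypothetical fair-share queue (option #3): each priority gets dequeue
// opportunities in proportion to its value via per-round credits.
template <typename T>
class FairShareQueue {
  std::map<unsigned, std::deque<T>> queues;  // subqueues keyed by priority
  std::map<unsigned, unsigned> credits;      // remaining credits this round

public:
  void enqueue(unsigned priority, T op) {
    queues[priority].push_back(std::move(op));
  }

  bool empty() const { return queues.empty(); }

  T dequeue() {
    assert(!queues.empty());
    while (true) {
      // Lowest priorities are visited first within a round, mirroring
      // the existing low-to-high iteration Greg describes below.
      for (auto it = queues.begin(); it != queues.end(); ++it) {
        unsigned prio = it->first;
        unsigned &c = credits[prio];
        if (c == 0)
          continue;                  // this priority used its share
        --c;
        T op = std::move(it->second.front());
        it->second.pop_front();
        if (it->second.empty()) {    // eager erase is safe here: no
          queues.erase(it);          // token state needs to survive
          credits.erase(prio);
        }
        return op;
      }
      // Everyone spent their share: start a new round with credits
      // proportional to priority (at least 1, so priority 0 still runs).
      for (auto &q : queues)
        credits[q.first] = std::max(q.first, 1u);
    }
  }
};

Because fairness comes from credits refilled each round rather than
from accumulated tokens, an empty subqueue can be erased eagerly
without penalizing the next op that arrives at that priority, which is
exactly the penalty Robert describes above.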
> On Fri, Oct 30, 2015 at 12:08 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> As we've discussed on the PR, this isn't right. We deliberately
>> iterate through the lower-priority queues first so that the
>> higher-priority ones don't starve them out entirely: the high queues
>> might be generating tokens more quickly than they can use them, which
>> would prevent the low queues from ever getting any ops at all, and
>> that's contrary to the intention. ...I guess this could also happen
>> where the lower-priority queues are generating tokens so quickly that
>> they never run out, but that doesn't seem likely unless you have
>> *very* small objects.
>>
>> You might also try tuning the osd_op_pq_max_tokens_per_priority and
>> osd_op_pq_min_cost values, which specify, for each priority, the
>> maximum number of tokens and the minimum token cost of an op. (The
>> defaults are 4194304 and 65536, i.e., 4 MB and 64 KB.) I wouldn't
>> expect this to be an issue (that allows a maximum of 64 recovery ops
>> to be processed in a single go before the queue hits its long-term
>> token average limit), but maybe something isn't behaving as expected.
>> -Greg
>>
>> On Wed, Oct 28, 2015 at 9:52 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> On Wed, 28 Oct 2015, Robert LeBlanc wrote:
>>>> I created a pull request to fix an op dequeuing order problem. I'm
>>>> not sure if I need to mention it here.
>>>>
>>>> https://github.com/ceph/ceph/pull/6417
>>>
>>> Wow, good catch. Have you found that this materially impacts the
>>> behavior in your cluster?
>>>
>>> sage
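For reference, here is a hedged sketch of the per-priority token
accounting Greg describes above. The struct and member names are
invented for illustration (this is not Ceph's actual queue code); only
the two option names and their defaults come from the thread. With the
defaults, every op costs at least 65536 tokens and the bucket caps at
4194304, so at most 4194304 / 65536 = 64 minimum-cost ops can run in
one burst before dequeueing falls back to the strict-priority path.

#include <algorithm>
#include <cstdint>

// Illustrative per-priority token bucket matching the description above.
struct TokenBucket {
  uint64_t tokens = 0;
  uint64_t max_tokens;  // osd_op_pq_max_tokens_per_priority, default 4194304
  uint64_t min_cost;    // osd_op_pq_min_cost, default 65536

  explicit TokenBucket(uint64_t max_t = 4194304, uint64_t min_c = 65536)
      : max_tokens(max_t), min_cost(min_c) {}

  // Grant tokens on each scheduling pass, capped so an idle priority
  // cannot accumulate an unbounded burst.
  void put(uint64_t t) { tokens = std::min(tokens + t, max_tokens); }

  // Charge an op; every op costs at least min_cost so tiny ops cannot
  // drain the queue for free. Returns false when the bucket is out of
  // tokens, i.e., when dequeueing would fall through to strict priority.
  bool take(uint64_t op_cost) {
    uint64_t cost = std::max(op_cost, min_cost);
    if (cost > tokens)
      return false;
    tokens -= cost;
    return true;
  }
};

Note that if a bucket is destroyed whenever its subqueue drains, as
Robert observes, a freshly recreated bucket starts at zero tokens and
take() fails until put() has been called enough times; that is the
penalty on newly active priorities discussed at the top of the thread.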