-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 After trying to look through the recovery code, I'm getting the feeling that recovery OPs are not scheduled in the OP queue that I've been working on. Does that sound right? In the OSD logs I'm only seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply). If the recovery is in another separate queue, then there is no reliable way to prioritize OPs between them. If I'm going off in to the weeds, please help me get back on the trail. Thanks, - ---------------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA256 > > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil wrote: >> On Thu, 5 Nov 2015, Robert LeBlanc wrote: >>> -----BEGIN PGP SIGNED MESSAGE----- >>> Hash: SHA256 >>> >>> Thanks Gregory, >>> >>> People are most likely busy and haven't had time to digest this and I >>> may be expecting more excitement from it (I'm excited due to the >>> results and probably also that such a large change still works). I'll >>> keep working towards a PR, this was mostly proof of concept, now that >>> there is some data I'll clean up the code. >> >> I'm *very* excited about this. This is something that almost every >> operator has problems with so it's very encouraging to see that switching >> up the queue has a big impact in your environment. >> >> I'm just following up on this after a week of travel, so apologies if this >> is covered already, but did you compare this implementation to the >> original one with the same tunables? I see somewhere that you had >> max_backfills=20 at some point, which is going to be bad regardless of the >> queue. >> >> I also see that you chnaged the strict priority threshold from LOW to HIGH >> in OSD.cc; I'm curious how much of an impact was from this vs the queue >> implementation. > > Yes max_backfills=20 is problematic for both queues and from what I > can tell is because the OPs are waiting for PGs to get healthy. In a > busy cluster it can take a while due to the recovery ops having low > priority. In the current queue, it is possible to be blocked for a > long time. The new queue seems to prevent that, but they do still back > up. After this, I think I'd like to look into promoting recovery OPs > that are blocking client OPs to higher priorities so that client I/O > doesn't suffer as much during recovery. I think that will be a very > different problem to tackle because I don't think I can do the proper > introspection at the queue level. I'll have to do that logic in OSD.cc > or PG.cc. > > The strict priority threshold didn't make much of a difference with > the original queue. I initially eliminated it all together in the WRR, > but there were times that peering would never complete. I want to get > as many OPs in the WRR queue to provide fairness as much as possible. > I haven't tweaked the setting much in the WRR queue yet. > >> >>> I was thinking that a config option to choose the scheduler would be a >>> good idea. In terms of the project what is the better approach: create >>> a new template and each place the template class is instantiated >>> select the queue, or perform the queue selection in the same template >>> class, or something else I haven't thought of. >> >> A config option would be nice, but I'd start by just cleaning up the code >> and putting it in a new class (WeightedRoundRobinPriorityQueue or >> whatever). If we find that it's behaving better I'm not sure how much >> value we get from a tunable. Note that there is one other user >> (msgr/simple/DispatchQueue) that we might also was to switch over at some >> point.. especially if this implementation is faster. >> >> Once it's cleaned up (remove commented out code, new class) put it up as a >> PR and we can review and get it through testing. > > In talking with Samuel in IRC, we think creating an abstract class for > the queue is the best option. C++11 allows you to still optimize > abstract template classes if you use final in the derived class (I > verified the assembly). I'm planning to refactor the code so that > similar code can be reused between queues and allows more flexibility > in the future (components can chose the queue that works the best for > them, etc). The test for which queue to use should be a very simple > comparison and it would allow us to let it bake much longer. I hope to > have a PR mid next week. > > - ---------------- > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > -----BEGIN PGP SIGNATURE----- > Version: Mailvelope v1.2.3 > Comment: https://www.mailvelope.com > > wsFcBAEBCAAQBQJWPN1xCRDmVDuy+mK58QAA2XwP/1bv4DUVTfoAGU8q6RDK > xXCcqNoy2rFcG/D4wipnnGrjMYnVlH33l73hyaZiSQzMwvfzBAl5igQbIlAh > 41yqXOaGxk+BYRXRNHL5KCP0p0esjV8Wv1z9X2yfKdWeHbwueOKju5ljDQ6X > AaVXefw1fdag8JEvSjh0dsjgh8wf3G+lAcC9GHB/PFNHXYsl1BVOUz1REnno > v5vIAZz+iySb8vVrWXJUBaPdW9aao/sqJFU2ZHBziWgeIZ9OlrTlhr9znsxy > aDa18suMC8vhcrZjyAgKlSbxhgynWh7R2RjxFA5ZObBEsdbztJfg9ibyDzKG > Ngpe+jVXGTM03z4ohajzPPJ0tzj03XpGc45yXzj6Q4NHOlp5CPdzAPgmxQkz > ot5cAIR83z67PBIkemeiBQvbC4/ToVCXIBCfEPVW5Yu6grnTd4+AAKxTakip > +tXSai03MNMlNBeaBnooZ/li7s9VMSluXheZ2JNs9ssRTZkGQH3Pof3p3Y5t > pAb7qeRlxm+t+i1rZ1tn1FtF/YAx4DKGvyFz4Pzk8pe77jZ+nQLMtoOJJgGJ > w/+TGiegnUPt6pqWf/Z5o6+GB8SiM/5zKr+Xkm8aIcju/Fq0qy3fx96z81Cv > QC25ZklTblVt1ImSG30qoVcZdqWKTMwnJhpFNj8GVbzyV5EoFh4T0YBmu3fm > FKe/ > =yodk > -----END PGP SIGNATURE----- -----BEGIN PGP SIGNATURE----- Version: Mailvelope v1.2.3 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJWPVZPCRDmVDuy+mK58QAAyK4QAL4ZdF0bRxSVSQAZGgDN pEfGEO1+heaj5Uj1sUitoXct5f//TbXcnuJDStlMe0rbplZDPUU0ZsXs8hNE sro6GiFuSP6ZQgHshW50d8iCGjmF/DKhYPs6jWJUIwCMelY45YLfpadAmkZT GePGEu5UzhYhlfQeiaQOFd7jWH2uVOnPLASK6f68cNRUv8rywJ8q5/6h0p8I TPg277NglGP1VntZ0z4/9CsSl49YOowVQooRZ9JQr3BpFYsbSEBBY5vLak8q X9Rb0rngG52vKT5VE58wUY/Pfbdwn7nbnV/BOUBnhBr+f14QKhNsWKpVM9EV R/cjlqJV3vesrwrXWay+4AaVoOn1TPMgBc/YV9LOlSdectNC0Ig7iBqC0Mjo kgeSQ0NJZSN99o4GKUnfwnd/fjDLzyi03XX5JkUMmEDLKPjT0LTmcnVSP5gu GGdEDNNEfIyt8PZalB4HN1Ik0c4/YdQKpb6XjbejoN37NvYom+dwZsKk2g/J Qa1bFDzvUZoTfax1yyMh2xu4b0rI6+a3bBhVBbY6Wz417aPRAhz09DecJoxt 28jqn3Aj7ARETg5BTCn1gGjEWP4IytLKOvctukCFSnxJWKPumTMRqfTUnsKu FxNjhSk5Kc+kVV7wQ7cU6NzxoBYHXMoEeamFXBmLooUG4lDKEeg0t+R9hPbT ABCA =yXJO -----END PGP SIGNATURE----- -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html