On Mon, Nov 9, 2015 at 12:31 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just wrote:
>> What I really want from PrioritizedQueue (and from the dmclock/mclock
>> approaches that are also being worked on) is a solution to the problem
>> of efficiently deciding which op to do next taking into account
>> fairness across io classes and ops with different costs.
>
>> On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc wrote:
>>> Thanks, I think some of the fog is clearing. I was wondering how
>>> operations between threads were keeping the order of operations in
>>> PGs; that explains it.
>>>
>>> My original thoughts were to have a queue in front of and behind the
>>> Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
>>> The queue thread would pull ops off that queue and place them into the
>>> specialized queue, do housekeeping, etc., and would dequeue ops in that
>>> queue to a post-queue that worker threads would monitor. The queue
>>> thread could keep a certain number of items in the post-queue to
>>> prevent starvation and worker threads from being blocked.
>>
>> I'm not sure what the advantage of this would be -- it adds another thread
>> to the processing pipeline at best.
>
> There are a few reasons I thought about it. 1. It is hard to
> prioritize/manage the workload if you can't see/manage all the
> operations. One queue allows the algorithm to make decisions based on
> all available information. (This point seems to be handled in a
> different way in the future.) 2. Reduce latency in the Op path. When an
> OP is queued, there is overhead in getting it in the right place. When
> an OP is dequeued there is more overhead in spreading tokens, etc.
> Right now that is all serial: if an OP is stuck in the queue waiting
> to be dispatched, some of this overhead can't be performed during
> this waiting period. The idea is pushing that overhead to a separate
> thread and allowing a worker thread to queue/dequeue in the most
> efficient manner. It also allows for more complex trending,
> scheduling, etc., because it can sit outside of the OP path. As the
> workload changes, it can dynamically change how it manages the queue,
> from simple FIFO for low periods where latency is dominated by compute
> time, to Token/WRR when latency is dominated by disk access, etc.
>

We basically don't want a single thread to see all of the operations -- it
would cause a tremendous bottleneck and complicate the design immensely.
It shouldn't be necessary anyway since PGs are a form of coarse-grained
locking, so it's probably fine to schedule work for different groups of PGs
independently if we assume that all kinds of work are well distributed over
those groups.

>>> It would require the worker thread to be able to handle any kind of
>>> op, or having separate post-queues for the different kinds of work.
>>> I'm getting the feeling that this may be a far too simplistic approach
>>> to the problem (or at least in terms of the organization of Ceph at
>>> this point). I'm also starting to feel that I'm getting out of my
>>> league trying to understand all the intricacies of the OSD workflow
>>> (trying to start with one of the most complicated parts of the system
>>> doesn't help).
>>>
>>> Maybe what I should do is just code up the queue to drop in as a
>>> replacement for the Prio queue for the moment. Then as your async work
>>> is completing we can shake out the potential issues with recovery and
>>> costs that we talked about earlier. One thing that I'd like to look
>>> into is elevating the priority of recovery ops that have client OPs
>>> blocked.
>>> I don't think the WRR queue gives the recovery thread a lot
>>> of time to get its work done.
>>>
>>
>> If an op comes in that requires recovery to happen before it can be
>> processed, we send the recovery messages with client priority rather
>> than recovery priority.
>
> But the recovery is still happening in the recovery thread and not the
> client thread, right? The recovery thread has a lower priority than
> the op thread? That's how I understand it.
>

No, in hammer we removed the snap trim and scrub workqueues. With
wip-recovery-wq, I remove the recovery wqs as well. Ideally, the only
meaningful set of threads remaining will be the op_tp and associated
queues.

>>> Based on some testing on Friday, the number of recovery ops on an osd
>>> did not really change if there were 20 backfilling or 1 backfilling.
>>> The difference came in with how many client I/Os were blocked waiting
>>> for objects to recover. When 20 backfills were going, there were a lot
>>> more blocked I/Os waiting for objects to show up or recover. With one
>>> backfill, there were far fewer blocked I/Os, but there were still times
>>> I/O would block.
>>
>> The number of recovery ops is actually a separate configurable
>> (osd_recovery_max_active -- defaults to 15). It's odd that with more
>> backfilling on a single osd, there is more blocked IO. Looking into
>> that would be helpful and would probably give you some insight
>> into recovery and the op processing pipeline.
>
> I'll see what I can find here.
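
The promotion described above (recovery that a waiting client op depends on
is sent at client priority rather than recovery priority) can be sketched
roughly as follows. This is only a toy strict-priority queue in Python, not
Ceph's PrioritizedQueue and not its token/WRR logic. The names OpQueue and
queue_recovery and both priority constants are hypothetical, chosen purely
for illustration.

```python
import heapq
import itertools

# Illustrative placeholder values (assumption), not taken from Ceph's config.
CLIENT_PRIORITY = 63
RECOVERY_PRIORITY = 10


class OpQueue:
    """Toy strict-priority queue: highest priority first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def enqueue(self, priority, op):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), op))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]


def queue_recovery(queue, obj, blocked_objects):
    """Queue recovery of obj, promoted to client priority if a client op waits on it."""
    prio = CLIENT_PRIORITY if obj in blocked_objects else RECOVERY_PRIORITY
    queue.enqueue(prio, ("recover", obj))
    return prio


q = OpQueue()
blocked = {"obj_b"}                      # a client op is waiting on obj_b
queue_recovery(q, "obj_a", blocked)      # background recovery
queue_recovery(q, "obj_b", blocked)      # promoted recovery
q.enqueue(CLIENT_PRIORITY, ("client_write", "obj_c"))
order = [q.dequeue() for _ in range(3)]
print(order)
```

In this sketch the recovery of obj_b is dequeued first, then the client
write, and the unpromoted recovery of obj_a last, all from the same queue
and without a separate recovery thread.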
>
> ----------------
> Robert LeBlanc
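
The point made earlier in the thread, that groups of PGs can be scheduled
independently instead of funnelling every op through one thread, can be
sketched as below. This is a minimal illustration under stated assumptions,
not the OSD's actual work queue: ShardedOpQueue and its methods are
hypothetical names, the crc32-based mapping is just one stable way to pick a
shard, and a real implementation would need a lock per shard.

```python
import zlib
from collections import deque


class ShardedOpQueue:
    """Ops for one PG always land in the same shard, so per-PG order is kept,
    while different shards can be drained by different workers."""

    def __init__(self, num_shards):
        self.shards = [deque() for _ in range(num_shards)]

    def shard_for(self, pg_id):
        # crc32 is stable across runs, unlike Python's built-in str hash().
        return zlib.crc32(pg_id.encode()) % len(self.shards)

    def enqueue(self, pg_id, op):
        self.shards[self.shard_for(pg_id)].append((pg_id, op))

    def dequeue(self, shard_index):
        """Called by the worker that owns shard_index; returns None if empty."""
        shard = self.shards[shard_index]
        return shard.popleft() if shard else None


queue = ShardedOpQueue(num_shards=4)
for seq in range(3):
    queue.enqueue("1.2f", f"write-{seq}")   # PG ids here are made-up examples
    queue.enqueue("3.a", f"read-{seq}")

# A worker drains only its own shard and never sees the other shards.
shard = queue.shard_for("1.2f")
while (item := queue.dequeue(shard)) is not None:
    print(shard, item)
```

Each shard could hold its own priority/WRR queue instead of a plain deque.
The trade-off is that no single scheduler sees every op, which is only
reasonable if all kinds of work are spread evenly across the shards.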