Re: Request for Comments: Weighted Round Robin OP Queue

On Mon, Nov 9, 2015 at 3:49 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> On Mon, Nov 9, 2015 at 12:31 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>>
>> On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just  wrote:
>>> What I really want from PrioritizedQueue (and from the dmclock/mclock
>>> approaches that are also being worked on) is a solution to the problem
>>> of efficiently deciding which op to do next, taking into account
>>> fairness across IO classes and ops with different costs.
>>
>>> On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc  wrote:
>>>>
>>>> Thanks, I think some of the fog is clearing. I was wondering how
>>>> operations were kept in order within a PG as they move between
>>>> threads; that explains it.
>>>>
>>>> My original thought was to have a queue in front of and behind the
>>>> Prio/WRR queue. Threads scheduling work would enqueue to the pre-queue.
>>>> The queue thread would pull ops off that queue, place them into the
>>>> specialized queue, do housekeeping, etc., and dequeue ops from that
>>>> queue into a post-queue that worker threads would monitor. The queue
>>>> thread could keep a certain number of items in the post-queue to
>>>> prevent starvation and to keep worker threads from blocking.
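
For what it's worth, Robert's pre-/post-queue idea above would look roughly
like this -- a minimal sketch with invented names (Op, SchedulerPipeline),
not the actual OSD code path:

#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>

struct Op { int priority = 0; /* payload elided */ };

class SchedulerPipeline {
  std::mutex pre_mtx, post_mtx;
  std::condition_variable pre_cv, post_cv;
  std::deque<Op> pre_queue;    // threads scheduling work enqueue here
  std::deque<Op> post_queue;   // worker threads dequeue from here

public:
  // Cheap for the op-path threads: no prioritization logic here.
  void enqueue(Op op) {
    {
      std::lock_guard<std::mutex> l(pre_mtx);
      pre_queue.push_back(std::move(op));
    }
    pre_cv.notify_one();
  }

  // Body of the single queue thread: drain the pre-queue, do the expensive
  // ordering/token bookkeeping off the op path, and keep items staged in the
  // post-queue so workers are not blocked.
  void scheduler_loop() {
    for (;;) {
      Op op;
      {
        std::unique_lock<std::mutex> l(pre_mtx);
        pre_cv.wait(l, [this] { return !pre_queue.empty(); });
        op = std::move(pre_queue.front());
        pre_queue.pop_front();
      }
      // ... priority/WRR housekeeping would happen here ...
      {
        std::lock_guard<std::mutex> l(post_mtx);
        post_queue.push_back(std::move(op));
      }
      post_cv.notify_one();
    }
  }

  // Called by worker threads.
  Op dequeue() {
    std::unique_lock<std::mutex> l(post_mtx);
    post_cv.wait(l, [this] { return !post_queue.empty(); });
    Op op = std::move(post_queue.front());
    post_queue.pop_front();
    return op;
  }
};

The point being that enqueue()/dequeue() stay cheap for the threads on the
op path, and all of the ordering cost lives in scheduler_loop().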
>>>
>>> I'm not sure what the advantage of this would be -- it adds another thread
>>> to the processing pipeline at best.
>>
>> There are a few reasons I thought about it:
>>
>> 1. It is hard to prioritize/manage the workload if you can't see/manage
>> all the operations. A single queue allows the algorithm to make
>> decisions based on all available information. (This point seems to be
>> handled in a different way in the future.)
>>
>> 2. It reduces latency in the op path. When an op is queued there is
>> overhead in getting it into the right place, and when an op is dequeued
>> there is more overhead in spreading tokens, etc. Right now that is all
>> serial; if an op is stuck in the queue waiting to be dispatched, some of
>> that overhead can't be performed during the wait. The idea is to push
>> that overhead into a separate thread and let worker threads
>> queue/dequeue in the most efficient manner. It also allows for more
>> complex trending, scheduling, etc., because it sits outside of the op
>> path. As the workload changes, it can dynamically change how it manages
>> the queue -- for example, a simple FIFO during quiet periods where
>> latency is dominated by compute time, and Token/WRR when latency is
>> dominated by disk access, etc.
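
The "dynamically change how it manages the queue" part could be as simple as
a mode switch like this -- thresholds and names are invented purely for
illustration, not a proposal for actual values:

#include <cstddef>

enum class QueueMode { FIFO, WRR };

// Fall back to plain FIFO ordering while the queue is shallow and latency is
// compute-bound; only pay the Token/WRR bookkeeping cost once a backlog
// suggests we are waiting on the disk.
QueueMode pick_mode(std::size_t queue_depth, double avg_wait_ms) {
  if (queue_depth < 4 && avg_wait_ms < 1.0)
    return QueueMode::FIFO;   // queue nearly empty: ordering cost not worth it
  return QueueMode::WRR;      // backlog building: fairness/weighting matters
}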
>>
>
> We basically don't want a single thread to see all of the operations -- it
> would cause a tremendous bottleneck and complicate the design
> immensely.  It shouldn't be necessary anyway since PGs are a form
> of coarse-grained locking, so it's probably fine to schedule work for
> different groups of PGs independently if we assume that all kinds of
> work are well distributed over those groups.
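
For reference, the per-PG-group scheduling Sam describes has roughly this
shape -- the names here are illustrative, not the actual OSD types:

#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

struct PGId { unsigned pool = 0; unsigned seed = 0; };
struct OpRef { PGId pg; /* payload elided */ };

// Each shard owns its own lock and queue; its worker threads only contend
// with each other, never with the other shards.
struct Shard {
  std::mutex lock;
  std::deque<OpRef> queue;  // a prioritized/WRR queue in a real implementation
  void enqueue(OpRef op) {
    std::lock_guard<std::mutex> l(lock);
    queue.push_back(std::move(op));
  }
};

class ShardedScheduler {
  std::vector<Shard> shards;
public:
  explicit ShardedScheduler(std::size_t n) : shards(n) {}
  void enqueue(OpRef op) {
    // Ops for the same PG always hash to the same shard, so per-PG ordering
    // is preserved without a global queue or a global lock.
    std::size_t idx = std::hash<unsigned>{}(op.pg.seed) % shards.size();
    shards[idx].enqueue(std::move(op));
  }
};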

There are some queue implementations that rely on a single thread
essentially playing traffic cop between queues, and they're pretty
fast. FastFlow, the C++ lib, does that: it constructs other kinds of
queues out of fast lock-free / wait-free SPSC queues. For something
like MPMC there's a mediator thread that feeds N SPSC in-queues into
M SPSC out-queues.

I'm only bringing this up because, if you have a problem that needs a
mediator to arrange ordering, it's possible to do it fast.
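
To make that concrete, the pattern is roughly the following -- this is not
FastFlow's actual API, just a sketch of a mediator over SPSC ring buffers:

#include <atomic>
#include <cstddef>
#include <vector>

// Minimal single-producer/single-consumer ring buffer (fixed capacity).
template <typename T, std::size_t N>
class SpscRing {
  T buf[N];
  std::atomic<std::size_t> head{0}, tail{0};
public:
  bool push(const T& v) {
    std::size_t t = tail.load(std::memory_order_relaxed);
    std::size_t next = (t + 1) % N;
    if (next == head.load(std::memory_order_acquire)) return false;  // full
    buf[t] = v;
    tail.store(next, std::memory_order_release);
    return true;
  }
  bool pop(T& v) {
    std::size_t h = head.load(std::memory_order_relaxed);
    if (h == tail.load(std::memory_order_acquire)) return false;     // empty
    v = buf[h];
    head.store((h + 1) % N, std::memory_order_release);
    return true;
  }
};

// The mediator is the only consumer of every in-queue and the only producer
// for every out-queue, so plain SPSC queues compose into an MPMC-like whole
// without any locks on the hot path.
void mediate(std::vector<SpscRing<int, 1024>>& in,
             std::vector<SpscRing<int, 1024>>& out) {
  std::size_t next_worker = 0;
  for (;;) {
    for (auto& q : in) {
      int item;
      if (q.pop(item)) {
        while (!out[next_worker].push(item)) { /* worker backlogged: spin */ }
        next_worker = (next_worker + 1) % out.size();
      }
    }
  }
}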

>
>>>> It would require the worker thread to be able to handle any kind of
>>>> op, or require separate post-queues for the different kinds of work.
>>>> I'm getting the feeling that this may be far too simplistic an approach
>>>> to the problem (or at least given how Ceph is organized at this point).
>>>> I'm also starting to feel that I'm getting out of my league trying to
>>>> understand all the intricacies of the OSD workflow (trying to start
>>>> with one of the most complicated parts of the system doesn't help).
>>>>
>>>> Maybe what I should do for the moment is just code up the queue as a
>>>> drop-in replacement for the Prio queue. Then, as your async work
>>>> completes, we can shake out the potential issues with recovery and
>>>> costs that we talked about earlier. One thing I'd like to look into
>>>> is elevating the priority of recovery ops that have client ops
>>>> blocked on them. I don't think the WRR queue gives the recovery thread
>>>> a lot of time to get its work done.
>>>>
>>>
>>> If an op comes in that requires recovery to happen before it can be
>>> processed, we send the recovery messages with client priority rather
>>> than recovery priority.
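
(Roughly, and purely as an illustration with an invented helper: the priority
chosen for the recovery messages depends on whether a client op is waiting on
the object. The constants are only meant to echo the osd_client_op_priority /
osd_recovery_op_priority defaults, which I believe are 63 and 10.)

// Invented sketch, not the actual OSD code.
constexpr int PRIO_CLIENT   = 63;
constexpr int PRIO_RECOVERY = 10;

int recovery_msg_priority(bool client_op_blocked_on_object) {
  return client_op_blocked_on_object ? PRIO_CLIENT : PRIO_RECOVERY;
}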
>>
>> But the recovery is still happening in the recovery thread and not the
>> client thread, right? And the recovery thread has a lower priority than
>> the op thread? That's how I understand it.
>>
>
> No, in hammer we removed the snap trim and scrub workqueues.  With
> wip-recovery-wq, I remove the recovery wqs as well.  Ideally, the only
> meaningful set of threads remaining will be the op_tp and associated
> queues.
>
>>>> Based on some testing on Friday, the number of recovery ops on an OSD
>>>> did not really change whether there were 20 backfills or 1. The
>>>> difference was in how many client I/Os were blocked waiting for
>>>> objects to recover. With 20 backfills going, there were a lot more
>>>> blocked I/Os waiting for objects to show up or recover. With one
>>>> backfill there were far fewer blocked I/Os, but there were still
>>>> times when I/O would block.
>>>
>>> The number of recovery ops is actually a separate configurable
>>> (osd_recovery_max_active -- defaults to 15).  It's odd that with more
>>> backfilling on a single OSD there is more blocked I/O.  Looking into
>>> that would be helpful and would probably give you some insight
>>> into recovery and the op processing pipeline.
>>
>> I'll see what I can find here.
>>
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx