Re: Request for Comments: Weighted Round Robin OP Queue


On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just  wrote:
> What I really want from PrioritizedQueue (and from the dmclock/mclock
> approaches that are also being worked on) is a solution to the problem
> of efficiently deciding which op to do next taking into account
> fairness across io classes and ops with different costs.

> On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc  wrote:
>>
>> Thanks, I think some of the fog is clearing. I was wondering how
>> operations moving between threads kept the order of operations within
>> PGs; that explains it.
>>
>> My original thought was to have a queue in front of and behind the
>> Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
>> The queue thread would pull ops off that queue, place them into the
>> specialized queue, do housekeeping, etc., and dequeue ops from that
>> queue into a post-queue that worker threads would monitor. The queue
>> thread could keep a certain number of items in the post-queue to
>> prevent starvation and keep worker threads from being blocked.
>
> I'm not sure what the advantage of this would be -- it adds another thread
> to the processing pipeline at best.

There are a few reasons I thought about it:

1. It is hard to prioritize/manage the workload if you can't see/manage
all the operations. A single queue lets the algorithm make decisions
based on all of the available information. (This point seems to be
handled in a different way in the future.)

2. It reduces latency in the op path. When an op is queued there is
overhead in getting it into the right place, and when an op is dequeued
there is more overhead in distributing tokens, etc. Right now all of
that is serial, so while an op sits in the queue waiting to be
dispatched none of that overhead can be done. The idea is to push that
overhead onto a separate thread and let the worker threads queue/dequeue
in the most efficient manner possible. It also allows for more complex
trending, scheduling, etc., because that logic sits outside of the op
path. As the workload changes, the queue thread can dynamically change
how it manages the queue: a simple FIFO during quiet periods where
latency is dominated by compute time, Token/WRR when latency is
dominated by disk access, and so on. A rough sketch of what I have in
mind is below.
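
To make the idea concrete, here is a very rough sketch (not Ceph code;
the names are made up and a plain FIFO stands in for the real policy)
of the pre-queue/post-queue arrangement, with a dedicated scheduler
thread keeping a few ops staged so the workers are not held up by the
scheduling overhead:

// Illustrative sketch only -- Op and QueueShim are invented names.
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

struct Op { int priority; unsigned cost; /* payload omitted */ };

class QueueShim {
  std::mutex lock;
  std::condition_variable cond;
  std::deque<Op> pre;                // raw arrivals from scheduling threads
  std::deque<Op> post;               // ops staged for the worker threads
  const size_t staged_target = 4;    // keep a few ops ready so workers
                                     // aren't stalled by scheduling work
  bool stopping = false;
  std::thread scheduler;

  void schedule_loop() {
    std::unique_lock<std::mutex> l(lock);
    while (!stopping) {
      // Policy point: plain FIFO here, but this is the single spot where
      // a Token/WRR (or dmclock-style) decision would be made instead.
      while (!pre.empty() && post.size() < staged_target) {
        post.push_back(pre.front());
        pre.pop_front();
        cond.notify_one();           // wake a worker waiting for staged work
      }
      cond.wait(l);                  // sleep until something arrives or is consumed
    }
  }

public:
  QueueShim() : scheduler([this] { schedule_loop(); }) {}

  ~QueueShim() {
    { std::lock_guard<std::mutex> l(lock); stopping = true; }
    cond.notify_all();
    scheduler.join();
  }

  void enqueue(Op op) {              // called by the threads scheduling work
    std::lock_guard<std::mutex> l(lock);
    pre.push_back(op);
    cond.notify_all();               // wake the scheduler thread
  }

  bool dequeue(Op *out) {            // called by the worker threads
    std::unique_lock<std::mutex> l(lock);
    cond.wait(l, [this] { return !post.empty() || stopping; });
    if (post.empty())
      return false;                  // shutting down
    *out = post.front();
    post.pop_front();
    cond.notify_all();               // let the scheduler refill the staging area
    return true;
  }
};

The point is that the policy decision lives in one place on the
scheduler thread, off the workers' critical path, so swapping the FIFO
for Token/WRR (or something dmclock-like) would only touch that one
loop.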

>> It would require the worker threads to be able to handle any kind of
>> op, or separate post-queues for the different kinds of work. I'm
>> getting the feeling that this may be far too simplistic an approach
>> to the problem (or at least given how Ceph is organized at this
>> point). I'm also starting to feel that I'm getting out of my league
>> trying to understand all the intricacies of the OSD workflow (trying
>> to start with one of the most complicated parts of the system doesn't
>> help).
>>
>> Maybe what I should do is just code up the queue to drop in as a
>> replacement for the Prio queue for the moment. Then, as your async
>> work completes, we can shake out the potential issues with recovery
>> and costs that we talked about earlier. One thing that I'd like to
>> look into is elevating the priority of recovery ops that have client
>> ops blocked behind them. I don't think the WRR queue gives the
>> recovery thread a lot of time to get its work done.
>>
>
> If an op comes in that requires recovery to happen before it can be
> processed, we send the recovery messages with client priority rather
> than recovery priority.

But the recovery is still happening in the recovery thread and not the
client thread, right? And the recovery thread has a lower priority than
the op thread? That's how I understand it.
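
Just to make sure I follow, conceptually I'm picturing something like
this (purely illustrative; the names and values are placeholders, not
the actual Ceph code):

// Illustrative only: choose the priority for a recovery message based
// on whether a client op is already blocked waiting on the object.
// PRIO_CLIENT_OP / PRIO_RECOVERY_OP are placeholder names and values.
const int PRIO_CLIENT_OP   = 63;
const int PRIO_RECOVERY_OP = 10;

int recovery_msg_priority(bool client_op_blocked_on_object) {
  // A recovery message that unblocks a waiting client op is sent at
  // client priority; background recovery keeps the lower priority.
  return client_op_blocked_on_object ? PRIO_CLIENT_OP : PRIO_RECOVERY_OP;
}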

>> Based on some testing on Friday, the number of recovery ops on an OSD
>> did not really change whether there were 20 backfills or 1. The
>> difference came in how many client I/Os were blocked waiting for
>> objects to recover. With 20 backfills going, there was a lot more
>> blocked I/O waiting for objects to show up or recover. With one
>> backfill there was far less blocked I/O, but there were still times
>> I/O would block.
>
> The number of recovery ops is actually a separate configurable
> (osd_recovery_max_active -- defaults to 15).  It's odd that with more
> backfilling on a single OSD, there is more blocked I/O.  Looking into
> that would be helpful and would probably give you some insight
> into recovery and the op processing pipeline.

I'll see what I can find here.
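
For reference for anyone following along, the two knobs in question can
be set in ceph.conf, e.g. (values here are just examples, not
recommendations):

[osd]
    # maximum concurrent backfills per OSD (the "20 vs 1" above)
    osd max backfills = 1
    # maximum active recovery ops per OSD (the default Sam mentions is 15)
    osd recovery max active = 15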

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1