Re: Filestore throttling

On Fri, 24 Oct 2014, GuangYang wrote:
> > commit 44dca5c8c5058acf9bc391303dc77893793ce0be
> > Author: Sage Weil <sage@xxxxxxxxxxx>
> > Date: Sat Jan 19 17:33:25 2013 -0800
> >
> > filestore: disable extra committing queue allowance
> >
> > The motivation here is if there is a problem draining the op queue
> > during a sync. For XFS and ext4, this isn't generally a problem: you
> > can continue to make writes while a syncfs(2) is in progress. There
> > are currently some possible implementation issues with btrfs, but we
> > have not demonstrated them recently.
> >
> > Meanwhile, this can cause queue length spikes that screw up latency.
> > During a commit, we allow too much into the queue (say, recovery
> > operations). After the sync finishes, we have to drain it out before
> > we can queue new work (say, a higher priority client request). Having
> > a deep queue below the point where priorities order work limits the
> > value of the priority queue.
> >
> > Signed-off-by: Sage Weil <sage@xxxxxxxxxxx>
> >
> > I'm not sure it makes sense to increase it in the general case. It might
> > make sense for your workload, or we may want to make peering transactions
> > some sort of special case...?
> It is actually another commit:
> 
> commit 40654d6d53436c210b2f80911217b044f4d7643a
> filestore: filestore_queue_max_ops 500 -> 50
> Having a deep queue limits the effectiveness of the priority queues
> above by adding additional latency.

Ah, you're right.
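
For reference, the tunable in question is filestore_queue_max_ops.  If 
the deeper queue does turn out to make sense for a particular workload, 
it can be raised back in ceph.conf (a sketch only; 500 and 50 are just 
the before/after values from that commit):

  [osd]
  ; old default was 500; the commit above dropped it to 50
  filestore queue max ops = 500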

> I don't quite understand the case where increasing this value might 
> add additional latency; would you mind elaborating?

There is a priority queue a bit further up the stack, OpWQ, in which high 
priority items (e.g., client IO) can move ahead of low priority items 
(e.g., recovery).  If the queue beneath that (the filestore one) is very 
deep, the client IO will only have a marginal advantage over the recovery 
IO, since it will still sit in the second queue for a long time.  Ideally, 
we want the priority queue to be the deepest one (so that we maximize the 
amount of stuff we can reorder) and the queues above and below it to be 
as shallow as possible.
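
To make the latency effect concrete, a back-of-the-envelope sketch (not 
Ceph code; the 2 ms per-op apply cost is an assumption for illustration):

  // Why a deep FIFO beneath the priority queue erodes the value of
  // prioritization: a client op that wins in OpWQ still drains behind
  // every op already sitting in the filestore queue below it.
  #include <cstdio>

  int main() {
    const double op_cost_ms = 2.0;  // assumed average apply time per op
    for (int fifo_depth : {50, 500}) {
      std::printf("filestore queue depth %3d -> up to %.0f ms of added "
                  "latency after winning the priority queue\n",
                  fifo_depth, fifo_depth * op_cost_ms);
    }
    return 0;
  }

With the old 500-op default that is up to a second of queueing below the 
point where priorities can reorder anything; with 50 it is a tenth of 
that.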

I think the peering operations are different because they can't be 
reordered with respect to anything else in the same PG (unlike, say, 
client vs recovery IO for that PG).  On the other hand, there may be 
client IO on other PGs that we want to reorder and finish more quickly.  
Allowing all of the right reordering and also getting the priority 
inheritance right here is probably a hugely complex undertaking, so we 
probably just want to go for a reasonably simple strategy that avoids the 
worst instances of priority inversion (where an important thing is stuck 
behind a slow thing).  :/
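
To illustrate the constraint, a hedged sketch (assumed priority values 
and PG contents, and a far simpler scheduler than the real OSD's): within 
one PG only the head-of-line op is runnable, so a high priority client 
write can be stuck behind a low priority peering op in its own PG even 
while ops on other PGs are reordered freely.

  // Ops within one PG stay FIFO; across PGs the highest-priority
  // head-of-line op runs first.
  #include <cstdio>
  #include <deque>
  #include <map>
  #include <string>

  struct Op { int priority; std::string desc; };

  int main() {
    std::map<int, std::deque<Op>> pg_queues;  // pg id -> FIFO of ops
    pg_queues[1] = {{10, "pg1 peering"}, {63, "pg1 client write"}};
    pg_queues[2] = {{63, "pg2 client read"}};

    while (!pg_queues.empty()) {
      // Only heads of line are candidates: intra-PG order is preserved.
      auto best = pg_queues.begin();
      for (auto it = pg_queues.begin(); it != pg_queues.end(); ++it)
        if (it->second.front().priority > best->second.front().priority)
          best = it;
      std::printf("run: %s\n", best->second.front().desc.c_str());
      best->second.pop_front();
      if (best->second.empty()) pg_queues.erase(best);
    }
    return 0;
  }

This prints the pg2 client read first, then the pg1 peering op, and only 
then the pg1 client write, despite its high priority -- exactly the sort 
of inversion described above.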

In any case, though, I'm skeptical that making the lowest-level queue 
deeper is going to help in general, even if it addresses the peering 
case specifically...

sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
