On Tue, Apr 25, 2017 at 03:39:42PM -0400, Gregory Farnum wrote:
> > I'd like to understand if "prio" in Jewel is as explained, i.e.
> > something similar to the following pseudo code:
> >
> > if len(subqueue) > 0:
> >     dequeue(subqueue)
> > if tokens(global) > some_cost:
> >     for queue in queues_high_to_low:
> >         if len(queue) > 0:
> >             dequeue(queue)
> >             tokens = tokens - some_other_cost
> > else:
> >     for queue in queues_low_to_high:
> >         if len(queue) > 0:
> >             dequeue(queue)
> >             tokens = tokens - some_other_cost
> > tokens = min(tokens + some_refill_rate, max_tokens)
>
> That looks about right.

OK, thanks for the validation. That indeed has an impact on the entire
priority queue under stress, then. (The motivation for WPQ seems clear. :))

> > The objective is to improve servicing of client IO, especially
> > reads, while a deep scrub is occurring. It essentially doesn't
> > matter to us if a deep scrub takes x or 3x the time; more
> > consistent latency for clients is more important.
>
> I don't have any experience with SMR drives so it wouldn't surprise me
> if there are some exciting emergent effects with them.

Basically, a very large chunk of the disk area needs to be rewritten on
each write, so the write amplification factor of an inode update is just
silly. The drives have a PMR buffer area of approximately 500 GB, but
that area can run out pretty fast under sustained IO (the exact buffer
management logic is not known).

> But it sounds
> to me like you want to start by adjusting the osd_scrub_priority
> (default 5) and osd_scrub_cost (default 50 << 20, i.e. 50MB). That will
> directly impact how they move through the queue in relation to client
> ops. (There are also the family of scrub scheduling options, which
> might make sense if you are more tolerant of slow IO at certain times
> of the day/week, but I'm not familiar with them).
> -Greg

Thanks for those pointers!

It seems from a distance that it's necessary to use WPQ if there is any
suspicion that the IO scheduler is running without available tokens (not
sure how to verify *that*).

#ceph also helped point out that I am indeed missing noatime,nodiratime
in the mount options. So every read is causing an inode update, which is
extremely expensive on SMR compared with a regular HDD (e.g. PMR). (Not
sure how I missed this when I set it up, since I've been aware of
noatime for a long time. :))

I think that's the first fix we'll want to make, as it's probably the
biggest source of trouble; we'll then check back in a week or so to see
how things look, and after that dig into the various scrub-vs-client op
scheduling artefacts.

Thanks!
/M
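
For reference, here is a self-contained, runnable Python sketch of the
token-bucket dequeue behaviour described in the quoted pseudocode. All
names (PrioQueueSketch, op_cost, refill_rate, etc.) and constants are
illustrative assumptions; this is not Ceph's actual PrioritizedQueue
implementation, only a model of the logic validated above:

    # Sketch of the "prio" dequeue logic; names and constants are
    # illustrative assumptions, not Ceph internals.
    from collections import deque

    class PrioQueueSketch:
        def __init__(self, priorities, max_tokens=1000,
                     refill_rate=100, op_cost=10):
            self.subqueue = deque()   # strict queue, bypasses tokens
            self.queues = {p: deque() for p in priorities}
            self.tokens = max_tokens
            self.max_tokens = max_tokens
            self.refill_rate = refill_rate
            self.op_cost = op_cost

        def enqueue(self, op, priority=None):
            if priority is None:
                self.subqueue.append(op)      # strict-priority op
            else:
                self.queues[priority].append(op)

        def dequeue(self):
            op = None
            if self.subqueue:
                # Strict queue is always served first.
                op = self.subqueue.popleft()
            elif self.tokens > self.op_cost:
                # Tokens available: serve highest priority first.
                for p in sorted(self.queues, reverse=True):
                    if self.queues[p]:
                        op = self.queues[p].popleft()
                        self.tokens -= self.op_cost
                        break
            else:
                # Tokens exhausted: serve lowest priority first --
                # the inversion under stress discussed above.
                for p in sorted(self.queues):
                    if self.queues[p]:
                        op = self.queues[p].popleft()
                        self.tokens -= self.op_cost
                        break
            self.tokens = min(self.tokens + self.refill_rate,
                              self.max_tokens)
            return op

    # Example: a scrub op (priority 5) vs. a client op (priority 63).
    q = PrioQueueSketch(priorities=[5, 63])
    q.enqueue("client-read", priority=63)
    q.enqueue("deep-scrub", priority=5)
    assert q.dequeue() == "client-read"  # served first while tokens last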
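
As a concrete illustration of the tuning Greg suggests, a minimal
ceph.conf sketch: osd_scrub_priority, osd_scrub_cost, and osd_op_queue
are real Jewel-era options, but the values shown are illustrative
starting points to experiment with, not tested recommendations:

    [osd]
    # Switch from the default "prio" queue to WPQ, avoiding the
    # token-exhaustion behaviour discussed above (needs OSD restart).
    osd op queue = wpq

    # Deprioritize scrub ops relative to client ops (default 5;
    # the value 1 here is an illustrative assumption).
    osd scrub priority = 1

    # Raise the accounted cost of a scrub op from the default
    # 50 << 20 (50MB) to 100MB -- again an illustrative value.
    osd scrub cost = 104857600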
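
And for the noatime fix, a hedged /etc/fstab sketch; the device, mount
point, and filesystem here are hypothetical placeholders for an OSD data
partition:

    # Hypothetical OSD data mount. noatime implies nodiratime on
    # current Linux kernels, but listing both is harmless.
    /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  rw,noatime,nodiratime  0 0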