On Tue, Apr 25, 2017 at 03:39:42PM -0400, Gregory Farnum wrote:
> > I'd like to understand if "prio" in Jewel is as explained, i.e.
> > something similar to the following pseudo code:
> >
> > if len(subqueue) > 0:
> >     dequeue(subqueue)
> > if tokens(global) > some_cost:
> >     for queue in queues_high_to_low:
> >         if len(queue) > 0:
> >             dequeue(queue)
> >             tokens = tokens - some_other_cost
> > else:
> >     for queue in queues_low_to_high:
> >         if len(queue) > 0:
> >             dequeue(queue)
> >             tokens = tokens - some_other_cost
> > tokens = min(tokens + some_refill_rate, max_tokens)
>
> That looks about right.

OK, thanks for the validation. That indeed has an impact on the entire
priority queue under stress, then. (The motivation for WPQ seems clear. :))

> > The objective is to improve servicing of client IO, especially
> > reads, while a deep scrub is occurring. It essentially doesn't
> > matter to us if a deep scrub takes x or 3x the time; more
> > consistent latency for clients is more important.
>
> I don't have any experience with SMR drives so it wouldn't surprise me
> if there are some exciting emergent effects with them.

Basically, a very large chunk of the disk area needs to be rewritten on
each write, so the write amplification factor of an inode update is just
silly. The drives have a PMR buffer area of approximately 500 GB, but
that area can run out pretty fast under sustained IO (the exact buffer
management logic is not known).

> But it sounds
> to me like you want to start by adjusting the osd_scrub_priority
> (default 5) and osd_scrub_cost (default 50 << 20, i.e. 50MB). That will
> directly impact how they move through the queue in relation to client
> ops. (There are also the family of scrub scheduling options, which
> might make sense if you are more tolerant of slow IO at certain times
> of the day/week, but I'm not familiar with them).
> -Greg

Thanks for those pointers!

It seems from a distance that it's necessary to use WPQ if there is any
suspicion that the IO scheduler is running without available tokens (not
sure how to verify *that*).

#ceph also helped point out that I am indeed missing noatime,nodiratime
in the mount options. So every read is causing an inode update, which is
extremely expensive on SMR compared with a regular HDD (e.g. PMR). (Not
sure how I missed this when I set it up, since I've been aware of
noatime for a long time. :))

I think that's the first fix we'll want to make, as it's probably the
biggest source of trouble; we'll then check back in a week or so to see
how things look, and after that dig into the various scrub-vs-client op
scheduling artefacts.

Thanks!
/M
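
For reference, here is a self-contained, runnable Python sketch of the
token-bucket dequeue behaviour described in the quoted pseudocode. All
names (PrioQueueSketch, op_cost, refill_rate, etc.) and constants are
illustrative assumptions; this is not Ceph's actual PrioritizedQueue
implementation, only a model of the logic validated above:

    # Sketch of the "prio" dequeue logic; names and constants are
    # illustrative assumptions, not Ceph internals.
    from collections import deque

    class PrioQueueSketch:
        def __init__(self, priorities, max_tokens=1000,
                     refill_rate=100, op_cost=10):
            self.subqueue = deque()   # strict queue, bypasses tokens
            self.queues = {p: deque() for p in priorities}
            self.tokens = max_tokens
            self.max_tokens = max_tokens
            self.refill_rate = refill_rate
            self.op_cost = op_cost

        def enqueue(self, op, priority=None):
            if priority is None:
                self.subqueue.append(op)      # strict-priority op
            else:
                self.queues[priority].append(op)

        def dequeue(self):
            op = None
            if self.subqueue:
                # Strict queue is always served first.
                op = self.subqueue.popleft()
            elif self.tokens > self.op_cost:
                # Tokens available: serve highest priority first.
                for p in sorted(self.queues, reverse=True):
                    if self.queues[p]:
                        op = self.queues[p].popleft()
                        self.tokens -= self.op_cost
                        break
            else:
                # Tokens exhausted: serve lowest priority first --
                # the inversion under stress discussed above.
                for p in sorted(self.queues):
                    if self.queues[p]:
                        op = self.queues[p].popleft()
                        self.tokens -= self.op_cost
                        break
            self.tokens = min(self.tokens + self.refill_rate,
                              self.max_tokens)
            return op

    # Example: a scrub op (priority 5) vs. a client op (priority 63).
    q = PrioQueueSketch(priorities=[5, 63])
    q.enqueue("client-read", priority=63)
    q.enqueue("deep-scrub", priority=5)
    assert q.dequeue() == "client-read"  # served first while tokens last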
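
As a concrete illustration of the tuning Greg suggests, a minimal
ceph.conf sketch: osd_scrub_priority, osd_scrub_cost, and osd_op_queue
are real Jewel-era options, but the values shown are illustrative
starting points to experiment with, not tested recommendations:

    [osd]
    # Switch from the default "prio" queue to WPQ, avoiding the
    # token-exhaustion behaviour discussed above (needs OSD restart).
    osd op queue = wpq

    # Deprioritize scrub ops relative to client ops (default 5;
    # the value 1 here is an illustrative assumption).
    osd scrub priority = 1

    # Raise the accounted cost of a scrub op from the default
    # 50 << 20 (50MB) to 100MB -- again an illustrative value.
    osd scrub cost = 104857600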
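
And for the noatime fix, a hedged /etc/fstab sketch; the device, mount
point, and filesystem here are hypothetical placeholders for an OSD data
partition:

    # Hypothetical OSD data mount. noatime implies nodiratime on
    # current Linux kernels, but listing both is harmless.
    /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  rw,noatime,nodiratime  0 0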