Re: osd_snap_trim_sleep keeps locks PG during sleep?

David Turner <david.turner@xxxxxxxxxxxxxxxx> · Thu, 19 Jan 2017 22:25:19 +0000

We are a couple of weeks away from upgrading our production clusters to Jewel (after months of testing in our QA environments), but this might prevent us from migrating off Hammer. We delete ~8,000 snapshots/day across 3 clusters, and the snap_trim_q reaches about 60 million in each of them. We have to use an osd_snap_trim_sleep of 0.25 to keep our clusters from falling on their faces during our big load, and 0.1 the rest of the day to catch up on the snap trim queue.

Is our setup workable on Jewel?
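
For a rough sense of why the sleep value matters at this queue size, here is a back-of-the-envelope sketch in Python. It assumes each queue entry costs exactly one sleep interval and ignores the trim work itself, so it gives only a lower bound. The function name and the count of 100 parallel trimmers are made up for illustration; neither is a Ceph setting or a figure from this thread.

```python
def min_drain_seconds(queue_len, sleep_s, parallel_trimmers=1):
    """Lower bound on the time needed to drain a trim queue.

    Assumption: every queue entry is followed by one sleep of `sleep_s`
    seconds, and the work is split evenly across `parallel_trimmers`.
    """
    if parallel_trimmers < 1:
        raise ValueError("parallel_trimmers must be at least 1")
    return queue_len * sleep_s / parallel_trimmers


QUEUE_LEN = 60_000_000       # queue depth quoted in the message above
HYPOTHETICAL_TRIMMERS = 100  # illustrative only, not a real cluster figure

for sleep_s in (0.25, 0.1):
    seconds = min_drain_seconds(QUEUE_LEN, sleep_s, HYPOTHETICAL_TRIMMERS)
    print(f"sleep={sleep_s}s: at least {seconds / 86400:.1f} days")
```

The point of the sketch is only that drain time scales linearly with the sleep, which is why the lower value is needed off-peak to catch up.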

David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation


________________________________________

From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Samuel Just [sjust@xxxxxxxxxx]

Sent: Thursday, January 19, 2017 2:45 PM

To: Nick Fisk

Cc: ceph-users

Subject: Re:  osd_snap_trim_sleep keeps locks PG during sleep?

Yeah, I think you're probably right.  The answer is probably to add an explicit rate-limiting element to the way the snaptrim events are scheduled.

-Sam
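
One common way to build an explicit rate limiter of the kind Sam describes is a token bucket. The Python sketch below is a generic illustration, not Ceph code and not what was eventually implemented. The class, the function, and the parameter names are all hypothetical. The idea it shows is that a trim which cannot get a token is deferred and requeued, rather than sleeping while holding the PG.

```python
import time


class TokenBucket:
    """Allow at most `rate` events per second, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        if rate <= 0 or burst <= 0:
            raise ValueError("rate and burst must be positive")
        self.rate = float(rate)
        self.burst = float(burst)
        self.clock = clock
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def try_acquire(self, cost=1.0):
        """Spend `cost` tokens and return True, or return False without blocking."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def schedule_trims(pending, bucket):
    """Split pending trims into those to run now and those to requeue later."""
    run_now, deferred = [], []
    for trim in pending:
        (run_now if bucket.try_acquire() else deferred).append(trim)
    return run_now, deferred
```

Because `try_acquire` never blocks, the caller stays free to serve client IO while deferred trims wait for the bucket to refill.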

On Thu, Jan 19, 2017 at 1:34 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

> I will give both of those a go and report back, but the more I think about this, the less I'm convinced that it's going to help.

>

> I think the problem is a general IO imbalance: there is probably something like 100+ times more trimming IO than client IO, so even if Ceph promotes client IO to the front of the queue, once it hits the Linux IO layer it's fighting for itself. I guess this approach works with scrubbing because each read IO has to complete before the next one is submitted, so the queue can be managed on the OSD. With trimming, writes can buffer up below what the OSD controls.

>

> I don't know if the snap trimming goes nuts because the journals are acking each request and the spinning disks can't keep up, or if it's something else. Does WBThrottle get involved with snap trimming?

>

> But from an underlying disk perspective, there are definitely more than 2 snaps per OSD being processed at a time, even if the OSD itself is not processing more than 2 at a time. I think there either needs to be another knob so that Ceph can throttle snaps back, not just de-prioritise them, or there needs to be a whole new kernel interface where an application can priority-tag individual IOs for CFQ to handle, instead of the current limitation of one priority per thread. I realise this is probably very, very hard or impossible, but it would allow Ceph to control IO queues right down to the disk.

>

>> -----Original Message-----

>> From: Samuel Just [mailto:sjust@xxxxxxxxxx]

>> Sent: 19 January 2017 18:58

>> To: Nick Fisk <nick@xxxxxxxxxx>

>> Cc: Dan van der Ster <dan@xxxxxxxxxxxxxx>; ceph-users <ceph-users@xxxxxxxxxxxxxx>

>> Subject: Re:  osd_snap_trim_sleep keeps locks PG during sleep?

>>

>> Have you also tried setting osd_snap_trim_cost to 16777216 (16x the default value, equal to a 16MB IO) and osd_pg_max_concurrent_snap_trims to 1 (from 2)?

>> -Sam
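
As a quick check of the numbers in Sam's suggestion, the Python sketch below confirms that 16777216 is 16 MiB and assembles a `ceph tell ... injectargs` command line for the two options. The helper name is hypothetical, and whether these options take effect at runtime on a given release is an assumption to verify before relying on it. Setting them in ceph.conf and restarting the OSDs is the conservative route.

```python
MIB = 1024 * 1024

# Values taken from the message above: 16x the stated default, and 1 instead of 2.
settings = {
    "osd_snap_trim_cost": 16 * MIB,
    "osd_pg_max_concurrent_snap_trims": 1,
}


def injectargs_command(options, target="osd.*"):
    """Build (but do not run) a `ceph tell` command string for the given options."""
    args = " ".join(f"--{name} {value}" for name, value in options.items())
    return f"ceph tell {target} injectargs '{args}'"


print(injectargs_command(settings))
```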

>>

>> On Thu, Jan 19, 2017 at 7:57 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:

>> > Hi Sam,

>> >

>> > Thanks for confirming both which thread the trimming happens in and my suspicion that sleeping is now a bad idea.

>> >

>> > The problem I see is that even with the trimming priority set low, it still seems to completely swamp the cluster. The trims seem to be submitted asynchronously, which leaves all my disks sitting at queue depths of 50+ for several minutes until the snapshot is removed, often also causing several OSDs to get marked out and start flapping. I'm using WPQ but haven't changed the cutoff variable yet, as I know you are working on fixing a bug with that.

>> >

>> > Nick

>> >

>> >> -----Original Message-----

>> >> From: Samuel Just [mailto:sjust@xxxxxxxxxx]

>> >> Sent: 19 January 2017 15:47

>> >> To: Dan van der Ster <dan@xxxxxxxxxxxxxx>

>> >> Cc: Nick Fisk <nick@xxxxxxxxxx>; ceph-users

>> >> <ceph-users@xxxxxxxxxxxxxx>

>> >> Subject: Re:  osd_snap_trim_sleep keeps locks PG during sleep?

>> >>

>> >> Snaptrimming is now in the main op threadpool along with scrub, recovery, and client IO.  I don't think it's a good idea to use any of the _sleep configs anymore -- the intention is that by setting the priority low, they won't actually be scheduled much.

>> >> -Sam

>> >>

>> >> On Thu, Jan 19, 2017 at 5:40 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:

>> >> > On Thu, Jan 19, 2017 at 1:28 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

>> >> >> Hi Dan,

>> >> >>

>> >> >> I carried out some more testing after doubling the op threads. It may have had a small benefit, as potentially some threads are available, but latency still sits more or less around the configured snap sleep time. Even more threads might help, but I suspect you are just lowering the chance of IOs getting stuck behind the sleep, rather than actually solving the problem.

>> >> >>

>> >> >> I'm guessing that when the snap trimming was in the disk thread, you wouldn't have noticed these sleeps, but now that it's in the op thread it will just sit there holding up all IO and be a lot more noticeable. It might be that this option shouldn't be used with Jewel+?

>> >> >

>> >> > That's a good thought -- so we need confirmation which thread is

>> >> > doing the snap trimming. I honestly can't figure it out from the

>> >> > code -- hopefully a dev could explain how it works.

>> >> >

>> >> > Otherwise, I don't have much practical experience with snap

>> >> > trimming in jewel yet -- our RBD cluster is still running 0.94.9.

>> >> >

>> >> > Cheers, Dan

>> >> >

>> >> >

>> >> >>

>> >> >>> -----Original Message-----

>> >> >>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On

>> >> >>> Behalf Of Nick Fisk

>> >> >>> Sent: 13 January 2017 20:38

>> >> >>> To: 'Dan van der Ster' <dan@xxxxxxxxxxxxxx>

>> >> >>> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>

>> >> >>> Subject: Re:  osd_snap_trim_sleep keeps locks PG during sleep?

>> >> >>>

>> >> >>> We're on Jewel and you're right, I'm pretty sure the snap stuff is also now handled in the op thread.

>> >> >>>

>> >> >>> The dump historic ops socket command showed a 10s delay at the "Reached PG" stage. From Greg's response [1], that would suggest that the OSD itself isn't blocking, but that the PG is currently sleeping whilst trimming. I think in the former case, it would have a high time on the "Started" part of the op? Anyway, I will carry out some more testing with higher osd op threads and see if that makes any difference. Thanks for the suggestion.

>> >> >>>

>> >> >>> Nick

>> >> >>>

>> >> >>>

>> >> >>> [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008652.html

>> >> >>>

>> >> >>> > -----Original Message-----

>> >> >>> > From: Dan van der Ster [mailto:dan@xxxxxxxxxxxxxx]

>> >> >>> > Sent: 13 January 2017 10:28

>> >> >>> > To: Nick Fisk <nick@xxxxxxxxxx>

>> >> >>> > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>

>> >> >>> > Subject: Re:  osd_snap_trim_sleep keeps locks PG during sleep?

>> >> >>> >

>> >> >>> > Hammer or jewel? I've forgotten which thread pool is handling the snap trim nowadays -- is it the op thread yet? If so, perhaps all the op threads are stuck sleeping? Just a wild guess. (Maybe increasing # op threads would help?).

>> >> >>> >

>> >> >>> > -- Dan

>> >> >>> >

>> >> >>> >

>> >> >>> > On Thu, Jan 12, 2017 at 3:11 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

>> >> >>> > > Hi,

>> >> >>> > >

>> >> >>> > > I had been testing some higher values of the osd_snap_trim_sleep variable to try to reduce the impact of removing RBD snapshots on our cluster, and I have come across what I believe to be a possible unintended consequence. The sleep seems to hold the lock on the PG, so that no other IO can use the PG whilst the snap removal operation is sleeping.

>> >> >>> > >

>> >> >>> > > I had set the variable to 10s to completely minimise the impact, as I had some multi-TB snapshots to remove, and noticed that suddenly all IO to the cluster had a latency of roughly 10s as well. All the dumped ops show waiting on PG for 10s too.

>> >> >>> > >

>> >> >>> > > Is the osd_snap_trim_sleep variable only ever meant to be used up to, say, a max of 0.1s, and this is a known side effect, or should the lock on the PG be released so that normal IO can continue during the sleeps?

>> >> >>> > >

>> >> >>> > > Nick

>> >> >>> > >

>> >> >>> > > _______________________________________________

>> >> >>> > > ceph-users mailing list

>> >> >>> > > ceph-users@xxxxxxxxxxxxxx

>> >> >>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>> >> >>>


>> >> >>


>> >

>
