Re: osd_snap_trim_sleep keeps locks PG during sleep?

We still had blocked requests with osd_snap_trim_cost set to 1GB and osd_snap_trim_priority set to 1 in our test cluster.  The test has 20 threads writing to RBDs and 1 thread deleting snapshots on RBDs with an osd_map.

The snap_trim_q on the PGs holds at empty unless we use osd_snap_trim_sleep, no matter how strictly we set the osd_snap_trim cost and priority settings.
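
For reference, the load generator is roughly along the lines of the sketch below (illustrative only; the pool name, image names, and timings are placeholders, not our exact harness):

    #!/usr/bin/env python
    # Illustrative sketch of the test load: writer threads drive rbd
    # bench-write while one thread cycles snapshot create/remove, so the
    # OSDs always have snap trim work queued. Names/timings are placeholders.
    import subprocess
    import threading
    import time

    POOL = "rbd"
    IMAGES = ["test%02d" % i for i in range(20)]

    def writer(image):
        # Sustained 4KB writes against one RBD image.
        subprocess.call(["rbd", "bench-write", "%s/%s" % (POOL, image),
                         "--io-size", "4096"])

    def snap_cycler(images):
        # Repeatedly create and then remove a snapshot on every image.
        n = 0
        while True:
            for image in images:
                subprocess.call(["rbd", "snap", "create",
                                 "%s/%s@snap%d" % (POOL, image, n)])
            time.sleep(60)
            for image in images:
                subprocess.call(["rbd", "snap", "rm",
                                 "%s/%s@snap%d" % (POOL, image, n)])
            n += 1

    writers = [threading.Thread(target=writer, args=(img,)) for img in IMAGES]
    cycler = threading.Thread(target=snap_cycler, args=(IMAGES,))
    cycler.daemon = True
    for t in writers:
        t.start()
    cycler.start()
    for t in writers:
        t.join()
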


David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.



From: David Turner
Sent: Friday, February 03, 2017 11:54 AM
To: Samuel Just
Cc: Nick Fisk; ceph-users
Subject: RE: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

We found where it is in 10.2.5: it is implemented in OSD.h in Jewel, but in OSD.cc in master.  We had assumed it would be in the same place.

We delete over 100TB of snapshots spread across thousands of snapshots every day.  We haven't yet found any combination of settings that allows us to delete snapshots in Jewel without blocking requests, even in a test cluster with a fraction of that workload.  We went as far as setting osd_snap_trim_cost to 512MB with the default osd_snap_trim_priority (before we noticed the priority setting), and setting osd_snap_trim_cost to 4MB (the size of our objects) with osd_snap_trim_priority set to 1.  We stopped testing there because we thought these settings weren't implemented in Jewel.  We will continue our testing and provide an update when we have it.

Our current solution in Hammer involves a daemon that monitors the cluster load and sets osd_snap_trim_sleep accordingly, between 0 and 0.35, which does a good job of preventing blocked IO and helps us clear out the snap_trim_q each day.  If these settings are not injectable in Jewel, that rules out using variable settings throughout the day.
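
Roughly, that daemon does something like the sketch below (illustrative only; the load metric, thresholds, and JSON field names are assumptions, and it relies on injectargs working for this option, which is the case for us on Hammer):

    #!/usr/bin/env python
    # Rough sketch of the Hammer-era workaround: scale osd_snap_trim_sleep
    # with cluster write load. Load metric, thresholds, and JSON field names
    # are placeholders, not our exact implementation.
    import json
    import subprocess
    import time

    def cluster_write_ops():
        # Pull the current client op rate from `ceph status`; the exact
        # field name varies between releases, so try a couple of variants.
        out = subprocess.check_output(["ceph", "status", "--format", "json"])
        pgmap = json.loads(out).get("pgmap", {})
        return pgmap.get("write_op_per_sec", pgmap.get("op_per_sec", 0))

    def set_snap_trim_sleep(value):
        # Inject the new sleep into every OSD.
        subprocess.call(["ceph", "tell", "osd.*", "injectargs",
                         "--osd_snap_trim_sleep %s" % value])

    BUSY_OPS = 10000.0   # op rate considered "fully busy" (placeholder)

    while True:
        ops = cluster_write_ops()
        # Busier cluster -> longer sleep, capped at 0.35; idle cluster -> 0.
        sleep = min(0.35, max(0.0, 0.35 * ops / BUSY_OPS))
        set_snap_trim_sleep(round(sleep, 2))
        time.sleep(60)
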

From: Samuel Just [sjust@xxxxxxxxxx]
Sent: Friday, February 03, 2017 11:24 AM
To: David Turner
Cc: Nick Fisk; ceph-users
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

They do seem to exist in Jewel.
-Sam

On Fri, Feb 3, 2017 at 10:12 AM, David Turner <david.turner@xxxxxxxxxxxxxxxx> wrote:

After searching the code, osd_snap_trim_cost and osd_snap_trim_priority exist in master but not in Jewel or Kraken.  If osd_snap_trim_sleep was made useless in Jewel by moving snap trimming to the main op thread, and no new feature was added to Jewel to allow clusters to throttle snap trimming... what recourse do people who use a lot of snapshots have on Jewel?  Luckily this thread came around right before we were ready to push to production; we tested snap trimming heavily in QA and found that on Jewel we can't keep up with even half of the snap trimming we need to do.  None of these settings are injectable into the OSD daemon either, so it would take a full restart of all of the OSDs to change them...
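
(For anyone wanting to double-check their own build, the OSD admin socket will report whether an option is known to the running daemon; a rough sketch below, using osd.0 purely as an example, run on the host where that OSD lives:)

    #!/usr/bin/env python
    # Check whether the snap trim throttling options exist on a running OSD
    # build via the admin socket. osd.0 is just an example daemon.
    import subprocess

    OPTIONS = [
        "osd_snap_trim_sleep",
        "osd_snap_trim_cost",
        "osd_snap_trim_priority",
        "osd_pg_max_concurrent_snap_trims",
    ]

    for opt in OPTIONS:
        try:
            out = subprocess.check_output(
                ["ceph", "daemon", "osd.0", "config", "get", opt])
            print("%s -> %s" % (opt, out.decode().strip()))
        except subprocess.CalledProcessError:
            # The request fails if the option is not known to this build.
            print("%s -> not present in this build" % opt)
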

Does anyone have any success stories for snap trimming on Jewel?



From: Samuel Just [sjust@xxxxxxxxxx]
Sent: Thursday, January 26, 2017 1:14 PM
To: Nick Fisk
Cc: David Turner; ceph-users

Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

Just an update.  I think the real goal with the sleep configs in general was to reduce the number of concurrent snap trims happening.  To that end, I've put together a branch which adds an AsyncReserver (as with backfill) for snap trims to each OSD.  Before actually starting to do trim work, the primary will wait in line to get one of the slots and will hold that slot until the repops are complete.  https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep is the branch (based on master), but I've got a bit more work to do (and testing to do) before it's ready to be tested.
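
In toy form, the idea is just a bounded set of trim slots per OSD, something like the sketch below (purely illustrative, not the actual AsyncReserver code in the branch):

    # Toy illustration of the reservation idea, not the real implementation:
    # the OSD hands out a fixed number of trim slots, and a primary must
    # hold a slot from the start of trimming until its repops are complete.
    import threading

    class TrimReserver(object):
        def __init__(self, max_concurrent_trims=2):
            self._slots = threading.Semaphore(max_concurrent_trims)

        def trim(self, pg, do_trim):
            self._slots.acquire()       # wait in line for a slot
            try:
                do_trim(pg)             # perform the trim and wait for repops
            finally:
                self._slots.release()   # release only once repops complete
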
-Sam

On Fri, Jan 20, 2017 at 2:05 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

Hi Sam,

 

I have a test cluster, albeit small. I’m happy to run tests + graph results with a wip branch and work out reasonable settings…etc

 

From: Samuel Just [mailto:sjust@xxxxxxxxxx]
Sent: 19 January 2017 23:23
To: David Turner <david.turner@xxxxxxxxxxxxxxxx>


Cc: Nick Fisk <nick@xxxxxxxxxx>; ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

I could probably put together a wip branch if you have a test cluster you could try it out on.

-Sam

 

On Thu, Jan 19, 2017 at 2:27 PM, David Turner <david.turner@xxxxxxxxxxxxxxxx> wrote:

To be clear, we are willing to change to a snap_trim_sleep of 0 and try to manage it with the other available settings... but it is sounding like that won't really work for us, since our main op threads will just be saturated with snap trimming almost all day.  We currently only have ~6 hours/day where our snap trim queues are empty.



From: ceph-users [ceph-users-bounces@xxxxxxxxxx.com] on behalf of David Turner [david.turner@xxxxxxxxxxxxxxxx]
Sent: Thursday, January 19, 2017 3:25 PM
To: Samuel Just; Nick Fisk


Cc: ceph-users
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

We are a couple of weeks away from upgrading to Jewel in our production clusters (after months of testing in our QA environments), but this might prevent us from making the migration from Hammer.  We delete ~8,000 snapshots/day across 3 clusters, and our snap_trim_q gets up to about 60 million in each of those clusters.  We have to use an osd_snap_trim_sleep of 0.25 to prevent our clusters from falling on their faces during our big load, and 0.1 the rest of the day to catch up on the snap_trim_q.

Is our setup possible to use on Jewel?



________________________________________
From: ceph-users [ceph-users-bounces@xxxxxxxxxx.com] on behalf of Samuel Just [sjust@xxxxxxxxxx]
Sent: Thursday, January 19, 2017 2:45 PM
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

Yeah, I think you're probably right.  The answer is probably to add an
explicit rate-limiting element to the way the snaptrim events are
scheduled.
-Sam

On Thu, Jan 19, 2017 at 1:34 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> I will give those both a go and report back, but the more I think about this the less I'm convinced that it's going to help.
>
> I think the problem is a general IO imbalance: there is probably something like 100+ times more trimming IO than client IO, so even if client IO gets promoted to the front of the queue by Ceph, once it hits the Linux IO layer it's fighting for itself. I guess this approach works with scrubbing because each read IO has to wait to be read before the next one is submitted, so the queue can be managed on the OSD. With trimming, writes can buffer up below what the OSD controls.
>
> I don't know if the snap trimming goes nuts because the journals are acking each request and the spinning disks can't keep up, or if it's something else. Does WBThrottle get involved with snap trimming?
>
> But from an underlying disk perspective, there is definitely more than 2 snaps per OSD at a time going on, even if the OSD itself is not processing more than 2 at a time. I think there either needs to be another knob so that Ceph can throttle back snaps, not just de-prioritise them, or there needs to be a whole new kernel interface where an application can priority-tag individual IOs for CFQ to handle, instead of the current limitation of priority per thread; I realise this is probably very, very hard or impossible. But it would allow Ceph to control IO queues right down to the disk.
>
>> -----Original Message-----
>> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
>> Sent: 19 January 2017 18:58
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: Dan van der Ster <dan@xxxxxxxxxxxxxx>; ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>>
>> Have you also tried setting osd_snap_trim_cost to be 16777216 (16x the default value, equal to a 16MB IO) and
>> osd_pg_max_concurrent_snap_trims to 1 (from 2)?
>> -Sam
>>
>> On Thu, Jan 19, 2017 at 7:57 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> > Hi Sam,
>> >
>> > Thanks for confirming both which thread the trimming happens in and my suspicion that sleeping is now a bad idea.
>> >
>> > The problem I see is that even with the priority for trimming set down low, it still seems to completely swamp the cluster. The
>> > trims seem to get submitted in an async nature, which leaves all my disks sitting at queue depths of 50+ for several minutes
>> > until the snapshot is removed, often also causing several OSDs to get marked out and start flapping. I'm using WPQ but haven't
>> > changed the cutoff variable yet, as I know you are working on fixing a bug with that.
>> >
>> > Nick
>> >
>> >> -----Original Message-----
>> >> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
>> >> Sent: 19 January 2017 15:47
>> >> To: Dan van der Ster <dan@xxxxxxxxxxxxxx>
>> >> Cc: Nick Fisk <nick@xxxxxxxxxx>; ceph-users
>> >> <ceph-users@xxxxxxxxxxxxxx>
>> >> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>> >>
>> >> Snaptrimming is now in the main op threadpool along with scrub,
>> >> recovery, and client IO.  I don't think it's a good idea to use any of the _sleep configs anymore -- the intention is that by setting the
>> priority low, they won't actually be scheduled much.
>> >> -Sam
>> >>
>> >> On Thu, Jan 19, 2017 at 5:40 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> >> > On Thu, Jan 19, 2017 at 1:28 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >> >> Hi Dan,
>> >> >>
>> >> >> I carried out some more testing after doubling the op threads, it
>> >> >> may have had a small benefit as potentially some threads are
>> >> >> available, but latency still sits more or less around the
>> >> >> configured snap sleep time. Even more threads might help, but I
>> >> >> suspect you are just lowering the chance of IOs getting stuck
>> >> >> behind the sleep, rather than actually solving the problem.
>> >> >>
>> >> >> I'm guessing that when the snap trimming was in the disk thread you
>> >> >> wouldn't have noticed these sleeps, but now that it's in the op
>> >> >> thread it will just sit there holding up all IO and be a lot more
>> >> >> noticeable. It might be that this option shouldn't be used with
>> >> >> Jewel+?
>> >> >
>> >> > That's a good thought -- so we need confirmation which thread is
>> >> > doing the snap trimming. I honestly can't figure it out from the
>> >> > code -- hopefully a dev could explain how it works.
>> >> >
>> >> > Otherwise, I don't have much practical experience with snap
>> >> > trimming in jewel yet -- our RBD cluster is still running 0.94.9.
>> >> >
>> >> > Cheers, Dan
>> >> >
>> >> >
>> >> >>
>> >> >>> -----Original Message-----
>> >> >>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On
>> >> >>> Behalf Of Nick Fisk
>> >> >>> Sent: 13 January 2017 20:38
>> >> >>> To: 'Dan van der Ster' <dan@xxxxxxxxxxxxxx>
>> >> >>> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
>> >> >>> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>> >> >>>
>> >> >>> We're on Jewel and you're right, I'm pretty sure the snap stuff is also now handled in the op thread.
>> >> >>>
>> >> >>> The dump historic ops socket command showed a 10s delay at the
>> >> >>> "Reached PG" stage. From Greg's response [1], that would suggest
>> >> >>> that the OSD itself isn't blocking, but rather the PG that it's
>> >> >>> currently sleeping on whilst trimming. I think in the former case
>> >> >>> it would have a high time on the "Started" part of the op? Anyway,
>> >> >>> I will carry out some more testing with higher osd op threads and
>> >> >>> see if that makes any difference. Thanks for the suggestion.
>> >> >>>
>> >> >>> Nick
>> >> >>>
>> >> >>>
>> >> >>> [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008652.html
>> >> >>>
>> >> >>> > -----Original Message-----
>> >> >>> > From: Dan van der Ster [mailto:dan@xxxxxxxxxxxxxx]
>> >> >>> > Sent: 13 January 2017 10:28
>> >> >>> > To: Nick Fisk <nick@xxxxxxxxxx>
>> >> >>> > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> >> >>> > Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>> >> >>> >
>> >> >>> > Hammer or jewel? I've forgotten which thread pool is handling
>> >> >>> > the snap trim nowadays -- is it the op thread yet? If so, perhaps
>> >> >>> > all the op threads are stuck sleeping? Just a wild guess. (Maybe
>> >> >>> > increasing # op threads would help?).
>> >> >>> >
>> >> >>> > -- Dan
>> >> >>> >
>> >> >>> >
>> >> >>> > On Thu, Jan 12, 2017 at 3:11 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >> >>> > > Hi,
>> >> >>> > >
>> >> >>> > > I had been testing some higher values with the
>> >> >>> > > osd_snap_trim_sleep variable to try and reduce the impact of
>> >> >>> > > removing RBD snapshots on our cluster and I have come across
>> >> >>> > > what I believe to be a possible unintended consequence. The
>> >> >>> > > value of the sleep seems to keep the lock on the PG open so
>> >> >>> > > that no other IO can use the PG whilst the snap removal
>> >> >>> > > operation is sleeping.
>> >> >>> > >
>> >> >>> > > I had set the variable to 10s to completely minimise the
>> >> >>> > > impact as I had some multi-TB snapshots to remove, and noticed
>> >> >>> > > that suddenly all IO to the cluster had a latency of roughly
>> >> >>> > > 10s as well; all the dumped ops show waiting on PG for 10s as
>> >> >>> > > well.
>> >> >>> > >
>> >> >>> > > Is the osd_snap_trim_sleep variable only ever meant to be
>> >> >>> > > used up to say a max of 0.1s and this is a known side effect,
>> >> >>> > > or should the lock on the PG be removed so that normal IO can
>> >> >>> > > continue during the sleeps?
>> >> >>> > >
>> >> >>> > > Nick
>> >> >>> > >
>> >
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
