Christian,
Yep, that describes what I see too. The good news is that I made a lot of progress on optimizing the queue today: a 10-50% performance increase in my microbenchmarks (that is only the improvement in enqueue and dequeue ops, which is a small part of the whole IO path, but every little bit helps). I have some more knobs to turn in my microbenchmarks, then I'll run it through some real tests, document the results, and submit the pull request, hopefully mid next week. The next thing to look into is what I've affectionately called "The 10 Minutes of Terror", which can be anywhere from 5 to 20 minutes in our cluster. It is our biggest pain point after the starved-IO situation, which the new queue I wrote goes a long way toward mitigating, and the cache promotion/demotion issues, which Nick and Sage have been working on (thanks for all the work on that).
I hope in a few weeks' time I can have a report on what I find. Hopefully we can have it fixed for Jewel and Hammer. Fingers crossed.
Robert LeBlanc
Sent from a mobile device; please excuse any typos.
Hello,
for the record, what Robert is writing below matches my experience best.
On Fri, 12 Feb 2016 22:17:01 +0000 Steve Taylor wrote:
> I could be wrong, but I didn't think a PG would have to peer when an OSD
> is restarted with noout set. If I'm wrong, then this peering would
> definitely block I/O. I just did a quick test on a non-busy cluster and
> didn't see any peering when my OSD went down or up, but I'm not sure how
> good a test that is. The OSD should also stay "in" throughout the
> restart with noout set, so it wouldn't have been "out" before to cause
> peering when it came "in."
>
It stays in as far as the ceph -s output is concerned, but clearly for
things to work as desired/imagined/expected some redirection state has to
be enabled for the duration.
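For reference, the flag handling around such a restart is roughly the following; a sketch only, the sysvinit-style restart invocation is just what applies on my Firefly-era nodes and the OSD id is an example:
---
# before the restart: keep the OSD from being marked out
ceph osd set noout

# restart the OSD (adjust the invocation for your init system)
/etc/init.d/ceph restart osd.2

# watch cluster and PG state while it comes back up;
# peering/degraded PGs and slow requests show up here
ceph -w
ceph health detail

# once everything is back to active+clean
ceph osd unset noout
---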
> I do know that OSDs don’t mark themselves "up" until they're caught up
> on OSD maps. They won't accept any op requests until they're "up," so
> they shouldn't have any catching up to do by the time they start taking
> op requests. In theory they're ready to handle I/O by the time they
> start handling I/O. At least that's my understanding.
>
Well, here is the 3-minute ordeal of a restart. Things went downhill from
the moment the shutdown was initiated to the time the OSD was fully back up,
with some additional fun during the recovery after that, but compared to the
3 minutes of near silence, that was a minor hiccup.
---
2016-02-12 01:33:45.408348 7f7f0c786700 -1 osd.2 1788 *** Got signal Terminated ***
2016-02-12 01:33:45.408414 7f7f0c786700 0 osd.2 1788 prepare_to_stop telling mon we are shutting down
2016-02-12 01:33:45.408807 7f7f1cfa7700 0 osd.2 1788 got_stop_ack starting shutdown
2016-02-12 01:33:45.408841 7f7f0c786700 0 osd.2 1788 prepare_to_stop starting shutdown
2016-02-12 01:33:45.408852 7f7f0c786700 -1 osd.2 1788 shutdown
2016-02-12 01:33:45.409541 7f7f0c786700 20 osd.2 1788 kicking pg 1.10
2016-02-12 01:33:45.409547 7f7f0c786700 30 osd.2 pg_epoch: 1788 pg[1.10( empty local-les=1788 n=0 ec=1 les/c 1788/1788 1787/1787/1787) [1,2] r=1 lpr=1787 pi=1656-1786/12 crt=0'0 active] lock
2016-02-12 01:33:45.409562 7f7f0c786700 10 osd.2 pg_epoch: 1788 pg[1.10( empty local-les=1788 n=0 ec=1 les/c 1788/1788 1787/1787/1787) [1,2] r=1 lpr=1787 pi=1656-1786/12 crt=0'0 active] on_shutdown
---
This goes on for quite a while with other PGs and more shutdown fun, until
we get here:
---
2016-02-12 01:33:46.413966 7f7f0c786700 20 osd.2 1788 kicking pg 2.17c
2016-02-12 01:33:46.413974 7f7f0c786700 30 osd.2 pg_epoch: 1788 pg[2.17c( v 1788'10083124 (1782'10080123,1788'10083124] local-les=1785 n=1522 ec=1 les/c 1785/1785 1784/1784/1784) [0,2] r=1 lpr=1784 pi=1652-1783/16 luod=0'0 crt=1692'3885111 lcod 1788'10083123 active] lock
2016-02-12 01:33:48.690760 7f75be4d57c0 0 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 24967
---
So from shutdown to startup was about 2 seconds, not that bad.
However, here is where the cookie crumbles massively:
---
2016-02-12 01:33:50.263152 7f75be4d57c0 0 filestore(/var/lib/ceph/osd/ceph-2) limited size xattrs
2016-02-12 01:35:31.809897 7f75be4d57c0 0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
---
Nearly 2 minutes to mount things; it probably had to go to disk quite a
bit, as not everything was in the various slab caches. And yes, there is
32GB of RAM, most of it pagecache, and vfs_cache_pressure is set to 1.
During that time, silence of the lambs when it came to ops.
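(In case anyone wants to replicate that setting, a minimal sketch; the sysctl.d file name is just an example:)
---
# check the current value
sysctl vm.vfs_cache_pressure

# set it at runtime
sysctl -w vm.vfs_cache_pressure=1

# persist it across reboots
echo "vm.vfs_cache_pressure = 1" > /etc/sysctl.d/90-vfs-cache.conf
---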
Next, this:
---
2016-02-12 01:35:33.915981 7f75be4d57c0 0 osd.2 1788 load_pgs
2016-02-12 01:36:32.989709 7f75be4d57c0 0 osd.2 1788 load_pgs opened 564 pgs
---
Another minute to load the PGs.
After this the boot completed and things normalized more or less.
> It would be interesting to see what this cluster looks like as far as
> OSD count, journal configuration, network, CPU, RAM, etc. Something is
> obviously amiss. Even in a semi-decent configuration one should be able
> to restart a single OSD with noout under little load without causing
> blocked op requests.
>
Well, I specifically stated that this cluster is very, VERY busy.
CPU, RAM and network (IPoIB) are not the culprit, and neither are the two
200GB DC S3700s, each holding just one journal.
Which brings us to the OSDs: just 4 of them, 2 per storage node, each
backed by HDDs in RAID behind a HW cache.
Size and min_size are of course 2 and 1 respectively.
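(For anyone checking their own pools, that is easy to verify; a quick sketch, the pool name is just an example:)
---
ceph osd pool get rbd size
ceph osd pool get rbd min_size
---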
I've described this problem child before. It is in this state because my
requests for additional nodes/OSDs over a year ago were ignored for 8
months, by which time the load was so high that backfilling in new OSDs
wasn't an option either.
So now I'm putting a cache tier on top of this (and the cost, compared to an
upgrade done in time, is of course much higher, la-di-da), but due to another
bug I found in Firefly with cache tiers I have to go to Hammer first to do that.
For the oldtimers here: You're in a twisty maze of passages, all alike.
Again, I was watching this with atop running on all nodes; only the backing
storage of the restarting OSD was busy to the point of melting. The
cluster should have had enough IOPS resources to still hold up on just 3
cylinders, but the other OSDs never got to see any of that traffic.
IOPS dropped 80-90 percent until the OSD was fully back "in".
I could more or less replicate this with my cruddy test cluster: 4 nodes,
4 HDDs (no SSDs), 1Gb links, totally anemic in the CPU and RAM department.
This one is already running Hammer.
When running "rados -p rbd bench 30 write -b 4096" and stopping one OSD
in the middle of things and starting it 10 seconds later I got this
cheerful log entry:
---
2016-02-13 12:22:42.114247 osd.20 203.216.0.83:6807/26301 239 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.986682 secs
2016-02-13 12:22:42.114270 osd.20 203.216.0.83:6807/26301 240 : cluster [WRN] slow request 30.986682 seconds old, received at 2016-02-13 12:22:11.127490: osd_op(client.28601805.0:114 benchmark_data_engtest03_23598_object113 [write 0~4194304] 8.43856a6f ack+ondisk+write+known_if_redirected e11563) currently reached_pg
2016-02-13 12:22:44.116518 osd.20 203.216.0.83:6807/26301 241 : cluster [WRN] 2 slow requests, 1 included below; oldest blocked for > 32.988940 secs
2016-02-13 12:22:44.116533 osd.20 203.216.0.83:6807/26301 242 : cluster [WRN] slow request 30.254598 seconds old, received at 2016-02-13 12:22:13.861832: osd_op(client.28601805.0:128 benchmark_data_engtest03_23598_object127 [write 0~4194304] 8.ddc143c ack+ondisk+write+known_if_redirected e11563) currently reached_pg
2016-02-13 12:23:01.768430 osd.22 203.216.0.83:6812/27118 203 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.158289 secs
2016-02-13 12:23:01.768458 osd.22 203.216.0.83:6812/27118 204 : cluster [WRN] slow request 30.158289 seconds old, received at 2016-02-13 12:22:31.610001: osd_op(client.28601805.0:191 benchmark_data_engtest03_23598_object190 [write 0~4194304] 8.cafe2d76 ack+ondisk+write+known_if_redirected e11563) currently reached_pg
---
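If anyone wants to try this on their own test gear, the reproduction is roughly the following; a sketch only, the OSD id is just my example and the sysvinit-style stop/start invocation will differ depending on your init system:
---
# terminal 1: small-block write load against the rbd pool
rados -p rbd bench 30 write -b 4096

# terminal 2: stop one OSD mid-bench, wait ~10 seconds, start it again
/etc/init.d/ceph stop osd.20
sleep 10
/etc/init.d/ceph start osd.20

# afterwards, look for the resulting slow request warnings
ceph health detail
grep "slow request" /var/log/ceph/ceph.log
---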
And when doing it with the default 4M blocks, the bench output looks like
this; spot the moment when I stopped/started the OSD:
---
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 311 295 58.9881 44 0.542233 1.03916
21 16 324 308 58.6546 52 0.563641 1.04909
22 16 338 322 58.5335 56 1.35033 1.05188
23 16 346 330 57.3796 32 1.01166 1.04428
24 16 355 339 56.4886 36 0.66026 1.04332
25 16 361 345 55.1888 24 0.503717 1.03625
26 16 364 348 53.5277 12 0.297751 1.02988
27 16 365 349 51.6934 4 0.456629 1.02823
28 16 365 349 49.8472 0 - 1.02823
29 16 365 349 48.1285 0 - 1.02823
30 16 365 349 46.5243 0 - 1.02823
31 16 365 349 45.0236 0 - 1.02823
32 16 365 349 43.6167 0 - 1.02823
33 16 365 349 42.2951 0 - 1.02823
34 16 365 349 41.0512 0 - 1.02823
35 16 365 349 39.8784 0 - 1.02823
36 16 365 349 38.7707 0 - 1.02823
37 16 366 350 37.8309 0.4 17.1657 1.07434
38 16 386 370 38.9395 80 0.363365 1.62187
---
Regards,
Christian
> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 | Fax: 801.545.4705
>
>
> -----Original Message-----
> From: Robert LeBlanc [mailto:robert@xxxxxxxxxxxxx]
> Sent: Friday, February 12, 2016 1:30 PM
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>; Christian Balzer
> <chibi@xxxxxxx>; ceph-users@xxxxxxxxxxxxxx Subject: Re:
> Reducing the impact of OSD restarts (noout ain't uptosnuff)
>
> What I've seen is that when an OSD starts up in a busy cluster, as soon
> as it is "in" (could be "out" before) it starts getting client traffic.
> However, it has to be "in" to start catching up and peering with the other
> OSDs in the cluster. The OSD is not ready to service requests for that
> PG yet, but it keeps the op queued until it is ready. On a busy cluster it
> can take an OSD a long time to become ready, especially if it is
> servicing client requests at the same time.
>
> If someone isn't able to look into the code to resolve this by the time
> I'm finished with the queue optimizations I'm doing (hopefully in a week
> or two), I plan on looking into this to see if there is something that
> can be done to prevent the OPs from being accepted until the OSD is
> ready for them.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Feb 12, 2016 at 9:42 AM, Nick Fisk wrote:
> > I wonder if Christian is hitting some performance issue when the OSD
> > or a number of OSDs all start up at once? Or maybe the OSD is still
> > doing some internal startup procedure and when the IO hits it on a
> > very busy cluster, it causes it to become overloaded for a few seconds?
> >
> > I've seen similar things in the past where if I did not have enough
> > min free KBs configured, PGs would take a long time to peer/activate
> > and cause slow ops.
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Steve Taylor
> >> Sent: 12 February 2016 16:32
> >> To: Nick Fisk ; 'Christian Balzer' ; ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re: Reducing the impact of OSD restarts (noout ain't uptosnuff)
> >>
> >> Nick is right. Setting noout is the right move in this scenario.
> >> Restarting an OSD shouldn't block I/O unless nodown is also set,
> >> however. The exception to this would be a case where min_size can't
> >> be achieved because of the down OSD, i.e. min_size=3 and 1 of 3 OSDs
> >> is restarting. That would certainly block writes. Otherwise the
> >> cluster will recognize down OSDs as down (without nodown set),
> >> redirect I/O requests to OSDs that are up, and backfill as necessary
> >> when things are back to normal.
> >>
> >> You can set min_size to something lower if you don't have enough OSDs
> >> to allow you to restart one without blocking writes. If this isn't
> >> the case, something deeper is going on with your cluster. You
> >> shouldn't get slow requests due to restarting a single OSD with only
> >> noout set and idle disks on the remaining OSDs. I've done this many,
> >> many times.
> >>
> >> Steve Taylor | Senior Software Engineer | StorageCraft Technology
> >> Corporation
> >> 380 Data Drive Suite 300 | Draper | Utah | 84020
> >> Office: 801.871.2799 | Fax: 801.545.4705
> >>
> >>
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Nick Fisk
> >> Sent: Friday, February 12, 2016 9:07 AM
> >> To: 'Christian Balzer' ; ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re: Reducing the impact of OSD restarts (noout ain't uptosnuff)
> >>
> >>
> >>
> >> > -----Original Message-----
> >> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >> > Behalf Of Christian Balzer
> >> > Sent: 12 February 2016 15:38
> >> > To: ceph-users@xxxxxxxxxxxxxx
> >> > Subject: Re: Reducing the impact of OSD restarts (noout ain't uptosnuff)
> >> >
> >> > On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > On 02/12/2016 03:47 PM, Christian Balzer wrote:
> >> > > > Hello,
> >> > > >
> >> > > > yesterday I upgraded our most busy (in other words lethally
> >> > > > overloaded) production cluster to the latest Firefly in
> >> > > > preparation for a Hammer upgrade and then phasing in of a cache tier.
> >> > > >
> >> > > > When restarting the OSDs it took 3 minutes (1 minute in a
> >> > > > consecutive repeat to test the impact of primed caches) during
> >> > > > which the cluster crawled to a near stand-still and the dreaded
> >> > > > slow requests piled up, causing applications in the VMs to fail.
> >> > > >
> >> > > > I had of course set things to "noout" beforehand, in hopes of
> >> > > > staving off this kind of scenario.
> >> > > >
> >> > > > Note that the other OSDs and their backing storage were NOT
> >> > > > overloaded during that time, only the backing storage of the
> >> > > > OSD being restarted was under duress.
> >> > > >
> >> > > > I was under the (wishful thinking?) impression that with noout
> >> > > > set and a controlled OSD shutdown/restart, operations would be
> >> > > > redirected to the new primary for the duration.
> >> > > > The strain on the restarted OSDs when recovering those
> >> > > > operations (which I also saw) I was prepared for, the near
> >> > > > screeching halt not so much.
> >> > > >
> >> > > > Any thoughts on how to mitigate this further or is this the
> >> > > > expected behavior?
> >> > >
> >> > > I wouldn't use noout in this scenario. It keeps the cluster from
> >> > > recognizing that an OSD is not available; other OSDs will still try
> >> > > to write to that OSD. This is probably the cause of the blocked
> >> > > requests. Redirecting only works if the cluster is able to detect
> >> > > a PG as being degraded.
> >> > >
> >> > Oh well, that makes of course sense, but I found some article stating that it
> >> > also would redirect things and the recovery activity I saw afterwards suggests
> >> > it did so at some point.
> >>
> >> Doesn't noout just stop the crushmap from being modified and hence
> >> data shuffling? Nodown controls whether or not the OSD is available
> >> for IO?
> >>
> >> Maybe try the reverse. Set noup so that OSDs don't participate in IO
> >> and then bring them in manually?
> >>
> >> >
> >> > > If the cluster is aware of the OSD being missing, it could handle
> >> > > the write requests more gracefully. To prevent it from
> >> > > backfilling etc, I prefer to use nobackfill and norecover. It
> >> > > blocks backfill on the cluster level, but allows requests to be
> >> > > carried out (at least in my understanding of these flags).
> >> > >
> >> > Yes, I concur and was thinking of that as well. Will give it a spin with the
> >> > upgrade to Hammer.
> >> >
> >> > > 'noout' is fine for large scale cluster maintenance, since it
> >> > > keeps the cluster from backfilling. I've used it when I had to power
> >> > > down our complete cluster.
> >> > >
> >> > Guess with my other, less busy clusters, this never showed up on my radar.
> >> >
> >> > Regards,
> >> >
> >> > Christian
> >> > > Regards,
> >> > > Burkhard
> >> >
> >> >
> >> > --
> >> > Christian Balzer Network/Systems Engineer
> >> > chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> >> > http://www.gol.com/
> >>
> >
>
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com