Re: Reducing the impact of OSD restarts (noout ain't uptosnuff)

Hello,

For the record, what Robert writes below matches my experience best.

On Fri, 12 Feb 2016 22:17:01 +0000 Steve Taylor wrote:

> I could be wrong, but I didn't think a PG would have to peer when an OSD
> is restarted with noout set. If I'm wrong, then this peering would
> definitely block I/O. I just did a quick test on a non-busy cluster and
> didn't see any peering when my OSD went down or up, but I'm not sure how
> good a test that is. The OSD should also stay "in" throughout the
> restart with noout set, so it wouldn't have been "out" before to cause
> peering when it came "in."
> 
It stays "in" as far as the ceph -s output is concerned, but clearly, for
things to work as desired/imagined/expected, some redirection state has to
be enabled for the duration.
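
For clarity, the whole procedure on my side is no more than this (the
restart command depends on your init system, sysvinit shown here as an
example):
---
ceph osd set noout                # keep the OSD from being marked "out"
/etc/init.d/ceph restart osd.2    # or whatever your init system uses
ceph -s                           # OSD goes "down" briefly but stays "in"
ceph osd unset noout              # clear the flag once it is back up
---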

> I do know that OSDs don’t mark themselves "up" until they're caught up
> on OSD maps. They won't accept any op requests until they're "up," so
> they shouldn't have any catching up to do by the time they start taking
> op requests. In theory they're ready to handle I/O by the time they
> start handling I/O. At least that's my understanding.
> 

Well, here is the 3-minute ordeal of a restart. Things went downhill from
the moment the shutdown was initiated until the OSD was fully back up,
with some additional fun during the recovery after that; but compared to
the 3 minutes of near silence, that was a minor hiccup.

---
2016-02-12 01:33:45.408348 7f7f0c786700 -1 osd.2 1788 *** Got signal Terminated ***
2016-02-12 01:33:45.408414 7f7f0c786700  0 osd.2 1788 prepare_to_stop telling mon we are shutting down
2016-02-12 01:33:45.408807 7f7f1cfa7700  0 osd.2 1788 got_stop_ack starting shutdown
2016-02-12 01:33:45.408841 7f7f0c786700  0 osd.2 1788 prepare_to_stop starting shutdown
2016-02-12 01:33:45.408852 7f7f0c786700 -1 osd.2 1788 shutdown
2016-02-12 01:33:45.409541 7f7f0c786700 20 osd.2 1788  kicking pg 1.10
2016-02-12 01:33:45.409547 7f7f0c786700 30 osd.2 pg_epoch: 1788 pg[1.10( empty local-les=1788 n=0 ec=1 les/c 1788/1788 1787/1787/1787) [1,2] r=1 lpr=1787 pi=1656-1786/12 crt=0'0 active] lock
2016-02-12 01:33:45.409562 7f7f0c786700 10 osd.2 pg_epoch: 1788 pg[1.10( empty local-les=1788 n=0 ec=1 les/c 1788/1788 1787/1787/1787) [1,2] r=1 lpr=1787 pi=1656-1786/12 crt=0'0 active] on_shutdown
---
This goes on for quite a while with other PGs and more shutdown fun, until
we get here:
---
2016-02-12 01:33:46.413966 7f7f0c786700 20 osd.2 1788  kicking pg 2.17c
2016-02-12 01:33:46.413974 7f7f0c786700 30 osd.2 pg_epoch: 1788 pg[2.17c( v 1788'10083124 (1782'10080123,1788'10083124] local-les=1785 n=1522 ec=1 les/c 1785/1785 1784/1784/1784) [0,2] r=1 lpr=1784 pi=1652-1783/16 luod=0'0 crt=1692'3885111 lcod 1788'10083123 active] lock
2016-02-12 01:33:48.690760 7f75be4d57c0  0 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 24967
---
So from shutdown to startup took about 2 seconds, not that bad.
However, here is where the cookie crumbles massively:
---
2016-02-12 01:33:50.263152 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) limited size xattrs
2016-02-12 01:35:31.809897 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
---
Nearly 2 minutes to mount things; it probably had to go to disk quite a
bit, as not everything was in the various slab caches. And yes, there is
32GB of RAM, most of it pagecache, and vfs_cache_pressure is set to 1.
During that time it was silence of the lambs as far as ops were concerned.
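
For reference, that knob is just the usual sysctl (runtime and persistent
forms shown):
---
sysctl -w vm.vfs_cache_pressure=1     # runtime
# and in /etc/sysctl.conf (or a file under /etc/sysctl.d/) to persist:
vm.vfs_cache_pressure = 1
---
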
Next, this:
---
2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
---
Another minute to load the PGs.

After this the boot completed and things normalized more or less.

> It would be interesting to see what this cluster looks like as far as
> OSD count, journal configuration, network, CPU, RAM, etc. Something is
> obviously amiss. Even in a semi-decent configuration one should be able
> to restart a single OSD with noout under little load without causing
> blocked op requests.
> 

Well, I specifically stated that this cluster is very, VERY busy.

CPU, RAM, and network (IPoIB) are not the culprit, and neither are the two
200GB DC S3700s, each holding just one journal.
Which brings us to the OSDs: just 4 of them, 2 per storage node, each
backed by a RAID of HDDs behind a HW cache.

Size and min_size are of course 2 and 1 respectively. 
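
For completeness, that is just the standard pool setting, checked and set
like below (rbd used as an example pool name):
---
ceph osd pool get rbd size
ceph osd pool get rbd min_size
ceph osd pool set rbd min_size 1
---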

I've described this problem child before. It is in this state because my
requests for additional nodes/OSDs over a year ago were ignored for 8
months, at which point the load was so high that backfilling in new OSDs
wasn't an option either.
So now I'm putting a cache tier on top of this (and the cost versus a
timely upgrade is of course much higher, la-di-da), but due to another bug
I found in Firefly with cache tiers I have to go to Hammer first to do
that.
For the oldtimers here: you're in a twisty maze of passages, all alike.
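
In case it helps anybody, the plan for that tier is roughly the standard
sequence below; the pool names are placeholders and the thresholds will of
course need tuning for the real cluster:
---
ceph osd tier add rbd cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd tier set-overlay rbd cache-pool
# Hammer wants a hit_set and sane limits on the cache pool
ceph osd pool set cache-pool hit_set_type bloom
ceph osd pool set cache-pool hit_set_count 1
ceph osd pool set cache-pool hit_set_period 3600
ceph osd pool set cache-pool target_max_bytes 200000000000
---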

Again, I was watching this with atop running on all nodes. Only the
backing storage of the restarting OSD was busy to the point of melting;
the cluster should have had enough IOPS headroom to keep going on just 3
cylinders, but the other OSDs never got to see any of that traffic.
IOPS dropped 80-90 percent until the OSD was fully back "in".
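
For anybody who wants to watch the same thing, plain atop or iostat with a
short interval is enough:
---
atop 2          # watch the DSK lines, the restarting OSD's disk sits near 100% busy
iostat -x 2     # or per-device via sysstat, %util is the column to watch
---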

I could more or less replicate this on my cruddy test cluster: 4 nodes,
4 HDDs (no SSDs), 1Gb links, totally anemic in the CPU and RAM department.
This one is already running Hammer.

When running "rados -p rbd  bench 30 write -b 4096" and stopping one OSD
in the middle of things and starting it 10 seconds later I got this
cheerful log entry:
---
2016-02-13 12:22:42.114247 osd.20 203.216.0.83:6807/26301 239 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.986682 secs
2016-02-13 12:22:42.114270 osd.20 203.216.0.83:6807/26301 240 : cluster [WRN] slow request 30.986682 seconds old, received at 2016-02-13 12:22:11.127490: osd_op(client.28601805.0:114 benchmark_data_engtest03_23598_object113 [write 0~4194304] 8.43856a6f ack+ondisk+write+known_if_redirected e11563) currently reached_pg
2016-02-13 12:22:44.116518 osd.20 203.216.0.83:6807/26301 241 : cluster [WRN] 2 slow requests, 1 included below; oldest blocked for > 32.988940 secs
2016-02-13 12:22:44.116533 osd.20 203.216.0.83:6807/26301 242 : cluster [WRN] slow request 30.254598 seconds old, received at 2016-02-13 12:22:13.861832: osd_op(client.28601805.0:128 benchmark_data_engtest03_23598_object127 [write 0~4194304] 8.ddc143c ack+ondisk+write+known_if_redirected e11563) currently reached_pg
2016-02-13 12:23:01.768430 osd.22 203.216.0.83:6812/27118 203 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.158289 secs
2016-02-13 12:23:01.768458 osd.22 203.216.0.83:6812/27118 204 : cluster [WRN] slow request 30.158289 seconds old, received at 2016-02-13 12:22:31.610001: osd_op(client.28601805.0:191 benchmark_data_engtest03_23598_object190 [write 0~4194304] 8.cafe2d76 ack+ondisk+write+known_if_redirected e11563) currently reached_pg
---
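
For the record, the whole test amounts to no more than this (the
stop/start syntax is whatever your init system uses, upstart shown here as
an example):
---
# terminal 1: generate load
rados -p rbd bench 30 write -b 4096
# terminal 2: bounce one OSD mid-run
stop ceph-osd id=20 ; sleep 10 ; start ceph-osd id=20
# terminal 3: watch the slow requests roll in
ceph -w
ceph health detail
---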

And when doing it with the default 4MB blocks, the bench output looks like
this; spot where I stopped/started the OSD:
---
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    20      16       311       295   58.9881        44  0.542233   1.03916
    21      16       324       308   58.6546        52  0.563641   1.04909
    22      16       338       322   58.5335        56   1.35033   1.05188
    23      16       346       330   57.3796        32   1.01166   1.04428
    24      16       355       339   56.4886        36   0.66026   1.04332
    25      16       361       345   55.1888        24  0.503717   1.03625
    26      16       364       348   53.5277        12  0.297751   1.02988
    27      16       365       349   51.6934         4  0.456629   1.02823
    28      16       365       349   49.8472         0         -   1.02823
    29      16       365       349   48.1285         0         -   1.02823
    30      16       365       349   46.5243         0         -   1.02823
    31      16       365       349   45.0236         0         -   1.02823
    32      16       365       349   43.6167         0         -   1.02823
    33      16       365       349   42.2951         0         -   1.02823
    34      16       365       349   41.0512         0         -   1.02823
    35      16       365       349   39.8784         0         -   1.02823
    36      16       365       349   38.7707         0         -   1.02823
    37      16       366       350   37.8309       0.4   17.1657   1.07434
    38      16       386       370   38.9395        80  0.363365   1.62187
---

Regards,

Christian 

> Steve Taylor | Senior Software Engineer | StorageCraft Technology
> Corporation 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 | Fax: 801.545.4705
> 
> If you are not the intended recipient of this message, be advised that
> any dissemination or copying of this message is prohibited. If you
> received this message erroneously, please notify the sender and delete
> it, together with any attachments.
> 
> 
> -----Original Message-----
> From: Robert LeBlanc [mailto:robert@xxxxxxxxxxxxx] 
> Sent: Friday, February 12, 2016 1:30 PM
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>; Christian Balzer <chibi@xxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Reducing the impact of OSD restarts (noout ain't uptosnuff)
> 
> What I've seen is that when an OSD starts up in a busy cluster, as soon
> as it is "in" (could be "out" before) it starts getting client traffic.
> However, it has to be "in" to start catching up and peering with the other
> OSDs in the cluster. The OSD is not ready to service requests for that
> PG yet, but it has the OP queued until it is ready. On a busy cluster it
> can take an OSD a long time to become ready especially if it is
> servicing client requests at the same time.
> 
> If someone isn't able to look into the code to resolve this by the time
> I'm finished with the queue optimizations I'm doing (hopefully in a week
> or two), I plan on looking into this to see if there is something that
> can be done to prevent the OPs from being accepted until the OSD is
> ready for them.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Fri, Feb 12, 2016 at 9:42 AM, Nick Fisk  wrote:
> > I wonder if Christian is hitting some performance issue when the OSD 
> > or a number of OSDs all start up at once? Or maybe the OSD is still 
> > doing some internal startup procedure and when the IO hits it on a 
> > very busy cluster, it causes it to become overloaded for a few seconds?
> >
> > I've seen similar things in the past where if I did not have enough 
> > min free KBs configured, PGs would take a long time to peer/activate 
> > and cause slow ops.
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf 
> >> Of Steve Taylor
> >> Sent: 12 February 2016 16:32
> >> To: Nick Fisk ; 'Christian Balzer' ; ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re:  Reducing the impact of OSD restarts (noout 
> >> ain't
> >> uptosnuff)
> >>
> >> Nick is right. Setting noout is the right move in this scenario.
> > Restarting an
> >> OSD shouldn't block I/O unless nodown is also set, however. The 
> >> exception to this would be a case where min_size can't be achieved 
> >> because of the down OSD, i.e. min_size=3 and 1 of 3 OSDs is 
> >> restarting. That would
> > certainly
> >> block writes. Otherwise the cluster will recognize down OSDs as down 
> >> (without nodown set), redirect I/O requests to OSDs that are up, and
> > backfill
> >> as necessary when things are back to normal.
> >>
> >> You can set min_size to something lower if you don't have enough OSDs 
> >> to allow you to restart one without blocking writes. If this isn't 
> >> the case, something deeper is going on with your cluster. You 
> >> shouldn't get slow requests due to restarting a single OSD with only 
> >> noout set and idle disks
> > on
> >> the remaining OSDs. I've done this many, many times.
> >>
> >> Steve Taylor | Senior Software Engineer | StorageCraft Technology 
> >> Corporation
> >> 380 Data Drive Suite 300 | Draper | Utah | 84020
> >> Office: 801.871.2799 | Fax: 801.545.4705
> >>
> >> If you are not the intended recipient of this message, be advised 
> >> that any dissemination or copying of this message is prohibited.
> >> If you received this message erroneously, please notify the sender 
> >> and delete it, together with any attachments.
> >>
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf 
> >> Of Nick Fisk
> >> Sent: Friday, February 12, 2016 9:07 AM
> >> To: 'Christian Balzer' ; ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re:  Reducing the impact of OSD restarts (noout 
> >> ain't
> >> uptosnuff)
> >>
> >>
> >>
> >> > -----Original Message-----
> >> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On 
> >> > Behalf Of Christian Balzer
> >> > Sent: 12 February 2016 15:38
> >> > To: ceph-users@xxxxxxxxxxxxxx
> >> > Subject: Re:  Reducing the impact of OSD restarts 
> >> > (noout ain't
> >> > uptosnuff)
> >> >
> >> > On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > On 02/12/2016 03:47 PM, Christian Balzer wrote:
> >> > > > Hello,
> >> > > >
> >> > > > yesterday I upgraded our most busy (in other words lethally
> >> > > > overloaded) production cluster to the latest Firefly in 
> >> > > > preparation for a Hammer upgrade and then phasing in of a cache
> > tier.
> >> > > >
> >> > > > When restarting the ODSs it took 3 minutes (1 minute in a 
> >> > > > consecutive repeat to test the impact of primed caches) during 
> >> > > > which the cluster crawled to a near stand-still and the dreaded 
> >> > > > slow requests piled up, causing applications in the VMs to fail.
> >> > > >
> >> > > > I had of course set things to "noout" beforehand, in hopes of 
> >> > > > staving off this kind of scenario.
> >> > > >
> >> > > > Note that the other OSDs and their backing storage were NOT 
> >> > > > overloaded during that time, only the backing storage of the 
> >> > > > OSD being restarted was under duress.
> >> > > >
> >> > > > I was under the (wishful thinking?) impression that with noout 
> >> > > > set and a controlled OSD shutdown/restart, operations would be 
> >> > > > redirect to the new primary for the duration.
> >> > > > The strain on the restarted OSDs when recovering those 
> >> > > > operations (which I also saw) I was prepared for, the near 
> >> > > > screeching halt not so much.
> >> > > >
> >> > > > Any thoughts on how to mitigate this further or is this the 
> >> > > > expected behavior?
> >> > >
> >> > > I wouldn't use noout in this scenario. It keeps the cluster from 
> >> > > recognizing that an OSD is not available; other OSDs will still try 
> >> > > to write to that OSD. This is probably the cause of the blocked
> >> > > requests. Redirecting only works if the cluster is able to detect
> >> > > a PG as being degraded.
> >> > >
> >> > Oh well, that makes of course sense, but I found some article 
> >> > stating that
> >> it
> >> > also would redirect things and the recovery activity I saw 
> >> > afterwards
> >> suggests
> >> > it did so at some point.
> >>
> >> Doesn't noout just stop the crushmap from being modified and hence 
> >> data shuffling? Nodown controls whether or not the OSD is available
> >> for IO?
> >>
> >> Maybe try the reverse. Set noup so that OSDs don't participate in IO 
> >> and then bring them in manually?
> >>
> >> >
> >> > > If the cluster is aware of the OSD being missing, it could handle 
> >> > > the write requests more gracefully. To prevent it from 
> >> > > backfilling etc, I prefer to use nobackfill and norecover. It 
> >> > > blocks backfill on the cluster level, but allows requests to be 
> >> > > carried out (at least in my understanding of these flags).
> >> > >
> >> > Yes, I concur and was thinking of that as well. Will give it a spin 
> >> > with
> >> the
> >> > upgrade to Hammer.
> >> >
> >> > > 'noout' is fine for large scale cluster maintenance, since it 
> >> > > keeps the cluster from backfilling. I've used it when I had to power 
> >> > > down our complete cluster.
> >> > >
> >> > Guess with my other, less busy clusters, this never showed up on my
> > radar.
> >> >
> >> > Regards,
> >> >
> >> > Christian
> >> > > Regards,
> >> > > Burkhard
> >> > >
> >> >
> >> >
> >> > --
> >> > Christian Balzer        Network/Systems Engineer
> >> > chibi@xxxxxxx       Global OnLine Japan/Rakuten Communications
> >> > http://www.gol.com/
> >>
> >
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



