Re: Reducing the impact of OSD restarts (noout ain't uptosnuff)

Christian Balzer <chibi@xxxxxxx> · Sun, 14 Feb 2016 14:39:31 +0900

Hello,

I was about to write something very much along these lines, thanks for
beating me to it. ^o^

On Sat, 13 Feb 2016 21:50:17 -0700 Robert LeBlanc wrote:

> I'm still going to see if I can get Ceph clients to hardly notice that
> an OSD comes back in. Our set up is EXT4 and our SSDs have the hardest
> time with the longest recovery impact. It should be painless no matter
> how slow the drives/CPU/etc are. If it means waiting to service client
> I/O until all the peering, and stuff (not including
> backfilling/recovery because that can be done in the background
> without much impact already) is completed before sending the client
> I/O to the OSD, then that is what I'm going to target. That way if it
> takes 5 minutes for the OSD to get its bearings because it is swapping
> due to low memory or whatever, the clients happily ignore the OSD
> until it says it is ready and don't have all the client I/O fighting
> to get a piece of scarce resources.
> 
Spot on. 
The recommendation in the Ceph documentation is noout, and the logic
everybody assumes is happening is that no I/O goes to the OSD until it is
actually ready to serve it. Reality clearly disproves that.
Once the restart takes longer than a few seconds, for whatever reason, it
becomes very visible.
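
For reference, the noout procedure mentioned above boils down to roughly
the following. This is only a sketch: the restart command is an assumption
(a systemd-managed OSD), and the helper names restart_plan and run_plan are
made up for illustration. It defaults to a dry run that only prints the
commands.

```python
import shlex
import subprocess


def restart_plan(osd_id, restart_cmd=None):
    """Return the command sequence for a noout-protected OSD restart."""
    if restart_cmd is None:
        # Assumption: the OSD is managed by systemd. Older sysvinit setups
        # would pass something like ["service", "ceph", "restart", "osd.2"].
        restart_cmd = ["systemctl", "restart", f"ceph-osd@{osd_id}"]
    return [
        ["ceph", "osd", "set", "noout"],    # down OSDs are not marked out
        list(restart_cmd),                  # the actual restart
        ["ceph", "osd", "unset", "noout"],  # back to normal behaviour
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


plan = restart_plan(2)
run_plan(plan)  # dry run: prints the three commands, executes nothing
```

Note that nothing in this sequence waits for the OSD to finish peering
before clients talk to it again, which is exactly the gap discussed here.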

> I appreciate all the suggestions that have been mentioned and believe
> that there is a fundamental issue here that causes a problem when you
> run your hardware into the red zone (like we have to do out of
> necessity). You may be happy with how things are set-up in your
> environment, but I'm not ready to give up on it and I think we can
> make it better. That way it "Just Works" (TM) with more hardware and
> configurations and doesn't need tons of effort to get it tuned just
> right. Oh, and be careful not to touch it, the balance of the force
> might get thrown off and the whole thing will tank. 

This is exactly what happened in my case, and we've seen evidence of it on
this ML plenty of times.
As with nearly all things I/O, there is a tipping point: everything is
fine up to it and then it isn't, often catastrophically so.

> That does not make
> me feel confident. Ceph is so resilient in so many ways already, why
> should this be an Achilles heel for some?

Well said indeed.

Christian
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Sat, Feb 13, 2016 at 8:51 PM, Tom Christensen <pavera@xxxxxxxxx>
> wrote:
> >> > Next this :
> >> > ---
> >> > 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
> >> > 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
> >> > ---
> >> > Another minute to load the PGs.
> >> Same OSD reboot as above : 8 seconds for this.
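
The "another minute" can be read straight off the two load_pgs timestamps
in the log excerpt quoted above. A small sketch, using exactly those two
lines; the helper names parse_ts and load_pgs_seconds are made up for
illustration:

```python
from datetime import datetime

# The two log lines quoted above, one per entry.
LOG_LINES = [
    "2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs",
    "2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs",
]


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' stamp (26 characters)."""
    return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")


def load_pgs_seconds(lines):
    """Seconds between 'load_pgs' starting and 'load_pgs opened ...'."""
    start = next(l for l in lines if l.rstrip().endswith("load_pgs"))
    end = next(l for l in lines if "load_pgs opened" in l)
    return (parse_ts(end) - parse_ts(start)).total_seconds()


elapsed = load_pgs_seconds(LOG_LINES)
print(f"load_pgs took {elapsed:.1f} s")
```

Running the same function over the OSD log after each restart gives a
comparable number for every OSD.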
> >
> > Do you really have 564 pgs on a single OSD?  I've never had anything
> > like decent performance on an OSD with greater than about 150pgs.  In
> > our production clusters we aim for 25-30 primary pgs per osd,
> > 75-90pgs/osd total (with size set to 3).  When we initially deployed
> > our large cluster with 150-200pgs/osd (total, 50-70 primary pgs/osd,
> > again size 3) we had no end of trouble getting pgs to peer.  The OSDs
> > ate RAM like nobody's business, took forever to do anything, and in
> > general caused problems.  If you're running 564 pgs/osd in this 4 OSD
> > cluster, I'd look at that first as the potential culprit.  That is a
> > lot of threads inside the OSD process that all need to get
> > CPU/network/disk time in order to peer as they come up.  Especially on
> > > Firefly I would point to this.  We've moved to Hammer and that did
> > improve a number of our performance bottlenecks, though we've also
> > grown our cluster without adding pgs, so we are now down in the 25-30
> > primary pgs/osd range, and restarting osds, or whole nodes (24-32 OSDs
> > for us) no longer causes us pain.  In the past restarting a node could
> > cause 5-10 minutes of peering and pain/slow requests/unhappiness of
> > various sorts (RAM exhaustion, OOM Killer, Flapping OSDs).  This all
> > improved greatly once we got our pg/osd count under 100 even before we
> > > upgraded to Hammer.
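
The per-OSD figures above follow from simple arithmetic, sketched below.
It assumes PGs are spread evenly across OSDs, and the pool values used
(pg_num=1024, 40 OSDs) are invented for illustration, not taken from any
cluster in this thread. With several pools, sum the results per pool.

```python
def pgs_per_osd(pg_num, size, num_osds):
    """Return (primary, total) PGs per OSD for one pool, assuming even spread.

    pg_num   -- number of placement groups in the pool
    size     -- replication factor of the pool
    num_osds -- number of OSDs the pool is distributed over
    """
    primary = pg_num / num_osds       # each PG has exactly one primary
    total = pg_num * size / num_osds  # every replica lives on some OSD
    return primary, total


# Hypothetical pool: 1024 PGs, size 3, spread over 40 OSDs.
primary, total = pgs_per_osd(pg_num=1024, size=3, num_osds=40)
print(f"{primary:.1f} primary PGs/OSD, {total:.1f} PGs/OSD in total")
```

With these made-up numbers the result lands inside the 25-30 primary and
75-90 total range quoted above.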
> >
> >
> >
> >
> >
> > On Sat, Feb 13, 2016 at 11:08 AM, Lionel Bouton
> > <lionel-subscription@xxxxxxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> On 13/02/2016 15:52, Christian Balzer wrote:
> >> > [..]
> >> >
> >> > Hum that's surprisingly long. How much data (size and nb of files)
> >> > do you have on this OSD, which FS do you use, what are the mount
> >> > options, what is the hardware and the kind of access ?
> >> >
> >> > I already mentioned the HW, Areca RAID controller with 2GB HW cache
> >> > and a
> >> > 7 disk RAID6 per OSD.
> >> > Nothing aside from noatime for mount options and EXT4.
> >>
> >> Thanks for the reminder. That said 7-disk RAID6 and EXT4 is new to me
> >> and may not be innocent.
> >>
> >> >
> >> > 2.6TB per OSD and with 1.4 million objects in the cluster a little
> >> > more than 700k files per OSD.
> >>
> >> That's nearly 3x more than my example OSD but it doesn't explain the
> >> more than 10x difference in startup time (especially considering BTRFS
> >> OSDs are slow to startup and my example was with dropped caches unlike
> >> your case). Your average file size is similar so it's not that either.
> >> Unless you have a more general, system-wide performance problem which
> >> impacts everything including the OSD init, there are 3 main components
> >> involved here :
> >> - Ceph OSD init code,
> >> - ext4 filesystem,
> >> - HW RAID6 block device.
> >>
> >> So either :
> >> - OSD init code doesn't scale past ~500k objects per OSD.
> >> - your ext4 filesystem is slow for the kind of access used during init
> >> (inherently or due to fragmentation, you might want to use filefrag
> >> on a random sample on PG directories, omap and meta),
> >> - your RAID6 array is slow for the kind of access used during init.
> >> - any combination of the above.
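
The filefrag check on a random sample suggested above could look roughly
like this. It is a sketch only: the function names are made up, the OSD
data path in the usage note is an assumption, and the parser relies on the
usual "<path>: N extents found" summary line that filefrag prints.

```python
import os
import random
import re
import subprocess

# Matches filefrag's summary line, e.g. "/some/file: 12 extents found".
EXTENTS_RE = re.compile(r"^(?P<path>.+): (?P<n>\d+) extents? found$")


def sample_files(root, k, seed=None):
    """Pick up to k random regular files from anywhere below root."""
    files = [os.path.join(d, f) for d, _, names in os.walk(root) for f in names]
    return random.Random(seed).sample(files, min(k, len(files)))


def parse_filefrag(output):
    """Map each path in filefrag's output to its extent count."""
    counts = {}
    for line in output.splitlines():
        match = EXTENTS_RE.match(line.strip())
        if match:
            counts[match.group("path")] = int(match.group("n"))
    return counts


def fragmentation(root, k=100):
    """Run filefrag on a random sample of k files below root."""
    paths = sample_files(root, k)
    if not paths:
        return {}
    result = subprocess.run(["filefrag", *paths], capture_output=True, text=True)
    return parse_filefrag(result.stdout)
```

Usage would be something like fragmentation("/var/lib/ceph/osd/ceph-2/current"),
where the path is an assumed filestore location and must be adjusted to the
actual OSD mount point; filefrag usually needs to run as root.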
> >>
> >> I believe it's possible but doubtful that the OSD code wouldn't scale
> >> at this level (this does not feel like an abnormally high number of
> >> objects to me). Ceph devs will know better.
> >> ext4 could be a problem as it's not the most common choice for OSDs
> >> (from what I read here XFS is usually preferred over it) and it forces
> >> Ceph to use omap to store data which would be stored in extended
> >> attributes otherwise (which probably isn't without performance
> >> problems). RAID5/6 on HW might have performance problems. The usual
> >> ones happen on writes and OSD init is probably read-intensive (or
> >> maybe not, you should check the kind of access happening during the
> >> OSD init to avoid any surprise) but with HW cards it's difficult to
> >> know for sure the performance limitations they introduce (the only
> >> sure way is testing the actual access patterns).
> >>
> >> So I would probably try to reproduce the problem by replacing one of
> >> the OSDs based on RAID6 arrays with as many OSDs as you have devices in
> >> the arrays. Then if it solves the problem and you didn't already do it
> >> you might want to explore Areca tuning, specifically with RAID6 if
> >> you must have it.
> >>
> >>
> >> >
> >> > And kindly take note that my test cluster has less than 120k
> >> > objects and thus 15k files per OSD and I still was able to
> >> > reproduce this behaviour (in
> >> > spirit at least).
> >>
> >> I assume the test cluster uses ext4 and RAID6 arrays too: it would be
> >> a perfect testing environment for defragmentation/switch to
> >> XFS/switch to single drive OSDs then.
> >>
> >> >
> >> >> The only time I saw OSDs take several minutes to reach the point
> >> >> where they fully rejoin is with BTRFS with default options/config.
> >> >>
> >> > There isn't a pole long enough I would touch BTRFS with for
> >> > production, especially in conjunction with Ceph.
> >>
> >> That's a matter of experience and environment but I can understand: we
> >> invested more than a week of testing/development to reach a point
> >> where BTRFS was performing better than XFS in our use case. Not
> >> everyone can dedicate as much time just to select a filesystem and
> >> support it. There might be use cases where it's not even possible to
> >> use it (I'm not sure how it would perform if you only did small
> >> objects storage for example).
> >>
> >> BTRFS has been invaluable though : it detected and helped fix
> >> corruption generated by faulty RAID controllers (by forcing Ceph to
> >> use other replicas when repairing). I wouldn't let precious data live
> >> on anything other than checksumming filesystems now (the
> >> probabilities of undetectable disk corruption are too high for our
> >> use case now). We have 30 BTRFS OSDs in production (and many BTRFS
> >> filesystems on other systems) and we've never had any problem with
> >> them. These filesystems even survived several bad datacenter
> >> equipment failures (faulty backup generator control system and UPS
> >> blowing up during periodic testing). That said I'm subscribed to
> >> linux-btrfs, was one of the SATA controller driver maintainers long
> >> ago so I know my way around kernel code, I hand pick the kernel
> >> versions going to production and we have custom tools and maintenance
> >> procedures for the BTRFS OSDs. So I have the means and experience that
> >> make this choice comfortable for me and my team: I wouldn't blindly
> >> advise BTRFS to anyone else (not yet).
> >>
> >> Anyway it's possible ext4 is a problem but it seems to me less likely
> >> than the HW RAID6. In my experience RAID controllers with cache aren't
> >> really worth it with Ceph. Most of the time they perform well because
> >> of BBWC/FBWC but when you get into a situation where you must
> >> repair/backfill because you lost an OSD or added a new one the HW
> >> cache is completely destroyed (what good can 4GB do when you must
> >> backfill 1TB or even catch up with tens of GB of writes ?). It's so
> >> bad that when we add an OSD the first thing we do now is selectively
> >> disable the HW cache for its device to avoid slowing all the other
> >> OSDs connected to the same controller.
> >> Using RAID6 for OSDs can minimize the backfills by avoiding losing
> >> OSDs but probably won't avoid them totally (most people have to
> >> increase storage eventually). In some cases it might be worth it
> >> (very large installations where the number of OSDs may become a
> >> problem) but we aren't there yet and you probably have to test these
> >> arrays extensively (how much IO can you get from them in various
> >> access patterns, including when they are doing internal maintenance,
> >> running with one or two devices missing and rebuilding one or two
> >> replaced devices) so we will delay any kind of RAID below OSDs as
> >> long as we can.
> >>
> >> Best regards,
> >>
> >> Lionel
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/