Re: Reducing the impact of OSD restarts (noout ain't uptosnuff)

Christian Balzer <chibi@xxxxxxx> · Mon, 15 Feb 2016 14:51:03 +0900

Hello,

Wall of text, paragraphs make for better reading. ^_-

On Sun, 14 Feb 2016 06:25:11 -0700 Tom Christensen wrote:

> To be clear when you are restarting these osds how many pgs go into
> peering state?  And do they stay there for the full 3 minutes?
>
I can't say that with anything resembling confidence, as the logs aren't
all that chatty when it comes to this and from observing atop and a 
"watch ceph -s" I would say things are not stuck in peering per se.

What the OSD was doing between those 2 entries, maybe the devs know; all I
can tell is that it mostly had a fun time with the backing storage:
---
2016-02-12 01:33:50.263152 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) limited size xattrs 
2016-02-12 01:35:31.809897 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
---
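
For reference, the gap between those two entries can be worked out directly
from the timestamps. A minimal Python sketch, assuming the timestamp layout
shown in the log lines above (the helper name elapsed_seconds is made up for
illustration):

```python
from datetime import datetime

# Timestamp layout as it appears in the OSD log lines quoted above.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two OSD log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


# The two filestore entries from the snippet above.
gap = elapsed_seconds("2016-02-12 01:33:50.263152",
                      "2016-02-12 01:35:31.809897")
print(f"{gap:.1f} s between the two filestore entries")
```

That comes to a bit over 101 seconds spent between the xattr check and the
journal mode message.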

> Certainly I've seen iops drop to zero or near zero when a large number
> of pgs are peering.  It would be wonderful if we could keep iops flowing
> even when pgs are peering.  In your case with such a high pg/osd count,
> my guess is peering always takes a long time.  As the OSD goes down it
> has to peer those 564 pgs across the remaining 3 osds, then re-peer them
> once the OSD comes up again...  

Just by "feel" (corroborated by atop with regards to network/disk/CPU
activities), I'd say that it was not peering by itself.

> Also because the OSD is a RAID6 I'm
> pretty sure the IO pattern is going to be bad, all 564 of those threads
> are going to request reads and writes (the peering process is going to
> update metadata in each pg directory on the OSD) nearly simultaneously.
> In a raid 6 each non-cached read will cause a read io from at least 5
> disks and each write will cause a write io to all 7 disks.  With that
> many threads hitting the volume simultaneously it means you're going to
> have massive disk head contention/head seek times, which is going to
> absolutely destroy your iops and make peering take that much longer.  In
> effect in the non-cached case the raid6 is going to almost entirely
> negate the distribution of IO load across those 7 disks, and is going to
> make them behave with a performance closer to a single HDD.  As Lionel
> said earlier, the HW Cache is going to be nearly useless in any sort of
> recovery scenario in ceph (which this is).
> 
I'm quite aware of this, I have been using RAID controllers for decades
and this brand (Areca) for more than 8 years and am very well acquainted
with what they can do (and what not so well).

Suffice to say that this cluster is currently handling more than 3 times
the load it was designed for and when not being forced to read things from
the actual disk (the 270 VMs using it are nearly write-only in steady
state), it is still doing well. 
Along the lines of this:
---
2016-02-15 07:01:04.359510 mon.0 10.0.0.10:6789/0 173335 : [INF] pgmap v35952617: 1152 pgs: 1152 active+clean; 5793 GB data, 11426 GB used, 88752 GB / 100178 GB avail; 22212 B/s rd, 14375 kB/s wr, 2558 op/s
---
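
To pull the numbers out of such a pgmap line for trending, something like
the following could be used. This is only a sketch: the regular expression
is tied to the exact units in the line above (B/s rd, kB/s wr), the function
name parse_pgmap is invented here, and treating every op as a write is an
approximation that only holds because the load is nearly write-only.

```python
import re

# The pgmap summary from the log line above (timestamp and mon prefix dropped).
PGMAP_LINE = ("pgmap v35952617: 1152 pgs: 1152 active+clean; 5793 GB data, "
              "11426 GB used, 88752 GB / 100178 GB avail; "
              "22212 B/s rd, 14375 kB/s wr, 2558 op/s")

# Assumption: units are exactly as in the line above; other units would
# need a more general pattern.
PGMAP_RE = re.compile(
    r"(?P<pgs>\d+) pgs:.*?"
    r"(?P<data_gb>\d+) GB data, (?P<used_gb>\d+) GB used,.*?"
    r"(?P<rd_bps>\d+) B/s rd, (?P<wr_kbps>\d+) kB/s wr, (?P<ops>\d+) op/s"
)


def parse_pgmap(line: str) -> dict:
    """Extract the integer fields of a pgmap summary line into a dict."""
    match = PGMAP_RE.search(line)
    if match is None:
        raise ValueError("line does not look like the pgmap summary above")
    return {name: int(value) for name, value in match.groupdict().items()}


stats = parse_pgmap(PGMAP_LINE)
# Rough average write size, treating all ops as writes.
kb_per_op = stats["wr_kbps"] / stats["ops"]
# Ratio of raw space used to logical data stored.
used_to_data = stats["used_gb"] / stats["data_gb"]
print(stats)
print(f"~{kb_per_op:.1f} kB per op, used/data ratio {used_to_data:.2f}")
```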

It also isn't helped by the fact that the actual HDDs are crap Toshiba DT
drives, which have (or had, we stopped buying them of course) a firmware
bug/feature that would slow them down 90% for at least 8 hours every week.
Some drives would stay in that mode until power-cycled, I posted about
this before.

That's the other reason why I'm doing this song and dance:
- To add a SSD cache tier that will reduce the load on the current storage
  servers.
- So with that reduced load I can add another HDD backed node (with 6
  RAID10 instead of 2 RAID6) where backfilling doesn't kill things.
- And then recycle the existing storage nodes one by one, replacing the
  HDDs and turning them into 6 RAID10s per node as well.

> I hope Robert or someone can come up with a way to continue IO to a pg in
> peering state, that would be wonderful as this is the fundamental
> problem I believe.  I'm not "happy" with the amount of work we had to
> put in to getting our cluster to behave as well as it is now, and it
> would certainly be great if things "Just Worked".  I'm just trying to
> relate our experience, and indicate what I see as the bottleneck in this
> particular setup based on that experience.  I believe the ceph pg
> calculator and recommendations about pg counts are too high and your
> setup is 2-3x above that.  

Again, it was a very conscious decision at the time, with a clearly
planned growth in the near future (that didn't happen).
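
For anyone wanting to redo the arithmetic, here is a small sketch of the
PG-copies-per-OSD calculation. It assumes a replica size of 2 for all pools,
which is only inferred from the used/data ratio in the pgmap line above
(11426 GB / 5793 GB, about 1.97), and it assumes a perfectly even
distribution; the function name is made up for illustration.

```python
def pg_copies_per_osd(pg_count: int, replica_size: int, osd_count: int) -> float:
    """Average number of PG copies each OSD holds, assuming even placement."""
    return pg_count * replica_size / osd_count


# Assumption: replica size 2, inferred from the used/data ratio, not stated.
REPLICA_SIZE = 2

# Current state: 1152 PGs in total spread over 4 OSDs.
current = pg_copies_per_osd(1152, REPLICA_SIZE, 4)

# Planned state: the 1024 PGs of the rbd pool spread over 18 OSDs.
planned = pg_copies_per_osd(1024, REPLICA_SIZE, 18)

print(f"current: ~{current:.0f} PG copies per OSD")
print(f"planned: ~{planned:.0f} PG copies per OSD")
```

The current figure works out to 576 on average, in line with the 564 PGs
that osd.2 reports when loading, and the planned one to roughly 114.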

> I've been able to easily topple clusters
> (mostly due to RAM exhaustion/swapping/OOM killer) with the recommended
> pg/osd counts and recommended RAM (1GB/OSD + 1GB/TB of storage) by
> causing recovery in a cluster for 2 years now, and it's not been improved
> as far as I can tell. 

I have about 100 PGs per OSD on our other, "normal" production cluster and
definitely have no issues there. But then again, it is:
a) lightly loaded and
b) has 64GB RAM for 8 OSDs per node and ample CPU power.

However, my crappy test cluster, with 4GB RAM for 4 OSDs per node and the
same (~100) PG/OSD ratio, I can indeed easily put into a world of memory
exhaustion pain.
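
As a rough illustration of the rule of thumb quoted above (1GB/OSD + 1GB/TB
of storage), here is a sketch with a made-up helper name. The storage size of
the test cluster isn't given in this thread, so the example only evaluates
the per-OSD term, which is a lower bound.

```python
def recommended_ram_gb(osd_count: int, storage_tb: float) -> float:
    """RAM per node by the quoted rule of thumb: 1GB/OSD + 1GB/TB of storage."""
    return osd_count * 1.0 + storage_tb * 1.0


# Test cluster node: 4 OSDs. Storage size is unknown here, so 0 TB is used
# to get the lower bound from the per-OSD term alone.
test_node_minimum = recommended_ram_gb(osd_count=4, storage_tb=0)
installed_gb = 4

print(f"rule of thumb needs at least {test_node_minimum:.0f} GB, "
      f"node has {installed_gb} GB")
```

So the per-OSD term alone already uses up all 4GB on those nodes, before any
allowance for the amount of storage behind them.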

> The only solution I've seen work reliably is to
> drop the pg/osd ratio. Dropping said ratio also greatly reduced the
> peering load and time and made the pain of osd restarts almost
> negligible.
> 
> To your question about our data distribution, it is excellent as far as
> per pg is concerned, less than 3% variance between pgs.  We did see a
> massive disparity between how many pgs each osd gets.  Originally we had
> osds with as few as 100pgs, and some with as many as 250 when on average
> they should have had about 175pgs each, that was with the recommended
> pg/osd settings. Additionally that ratio/variance has been the same
> regardless of the number of pgs/osd.  Meaning it started out bad, and
> stayed bad but didn't get worse as we added osds.  We've had to reweight
> osds in our crushmap to get anything close to a sane distribution of pgs.
> 
This is something that will also need addressing down the road, as manual
crush map tuning is not for the faint-hearted and of course will need
fiddling with on every OSD addition/removal.
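
To put the quoted spread into perspective, a quick sketch of how far the
extremes are from the average, using only the PG counts mentioned above
(100, 250 and an average of about 175). The helper name is made up, and this
only measures the imbalance; it is not a recipe for computing crush weights.

```python
def deviation_pct(pg_count: float, mean_pg_count: float) -> float:
    """Deviation of one OSD's PG count from the mean, in percent."""
    return (pg_count - mean_pg_count) / mean_pg_count * 100.0


# PG counts quoted above: emptiest OSD, fullest OSD, expected average.
MEAN_PGS = 175
low = deviation_pct(100, MEAN_PGS)
high = deviation_pct(250, MEAN_PGS)

print(f"emptiest OSD: {low:+.1f}% of the average")
print(f"fullest OSD:  {high:+.1f}% of the average")
```

That is roughly 43% below and above the average respectively, compared with
the less than 3% variance reported between PGs.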

Christian

> -Tom
> 
> 
> On Sat, Feb 13, 2016 at 10:57 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> > On Sat, 13 Feb 2016 20:51:19 -0700 Tom Christensen wrote:
> >
> > > > > Next this :
> > > > > ---
> > > > > 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
> > > > > 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
> > > > > ---
> > > > > Another minute to load the PGs.
> > > > Same OSD reboot as above : 8 seconds for this.
> > >
> > > Do you really have 564 pgs on a single OSD?
> >
> > Yes, the reason is simple, more than a year ago it should have been 8
> > OSDs (halving that number) and now it should be 18 OSDs, which would
> > be a perfect fit for the 1024 PGs in the rbd pool.
> >
> > > I've never had anything like
> > > decent performance on an OSD with greater than about 150pgs.  In our
> > > production clusters we aim for 25-30 primary pgs per osd,
> > > 75-90pgs/osd total (with size set to 3).  When we initially deployed
> > > our large cluster with 150-200pgs/osd (total, 50-70 primary pgs/osd,
> > > again size 3) we had no end of trouble getting pgs to peer.  The
> > > OSDs ate RAM like nobody's business, took forever to do anything,
> > > and in general caused problems.
> >
> > The cluster performs admirably for the stress it is under, the number
> > of PGs per OSD never really was an issue when it came to
> > CPU/RAM/network. For example the restart increased the OSD process
> > size from 1.3 to 2.8GB, but that left 24GB still "free".
> > The main reason to have more OSDs (and thus a lower PG count per OSD)
> > is to have more IOPS from the underlying storage.
> >
> > > If you're running 564 pgs/osd in this 4 OSD cluster, I'd look at that
> > > first as the potential culprit.  That is a lot of threads inside the
> > > OSD process that all need to get CPU/network/disk time in order to
> > > peer as they come up.  Especially on firefly I would point to this.
> > > We've moved to Hammer and that did improve a number of our
> > > performance bottlenecks, though we've also grown our cluster without
> > > adding pgs, so we are now down in the 25-30 primary pgs/osd range,
> > > and restarting osds, or whole nodes (24-32 OSDs for us) no longer
> > > causes us pain.
> >
> > At that PG count, how good (bad really) is your data balancing out?
> >
> > > In the past
> > > restarting a node could cause 5-10 minutes of peering and pain/slow
> > > requests/unhappiness of various sorts (RAM exhaustion, OOM Killer,
> > > Flapping OSDs).
> >
> > Nodes with that high number of OSDs I can indeed see cause pain, which
> >
> > > This all improved greatly once we got our pg/osd count
> > > under 100 even before we upgraded to hammer.
> > >
> >
> > Interesting point, but in my case all the slowness can be attributed to
> > disk I/O of the respective backing storage. Which should be fast
> > enough if ALL that it would do were to read things in.
> > I'll see if Hammer behaves better, but I doubt it (especially for the
> > first time when it upgrades stuff on the disk).
> >
> > Ultimately however I didn't ask on how to speed up OSD restarts (I
> > have a lot of knowledge/ideas on how to do that), I asked about
> > mitigating the impact of OSD restarts when they are going to be slow,
> > for whatever reason.
> >
> > Regards,
> >
> > Christian
> > >
> > >
> > >
> > >
> > > On Sat, Feb 13, 2016 at 11:08 AM, Lionel Bouton <
> > > lionel-subscription@xxxxxxxxxxx> wrote:
> > >
> > > > Hi,
> > > >
> > > > Le 13/02/2016 15:52, Christian Balzer a écrit :
> > > > > [..]
> > > > >
> > > > > Hum that's surprisingly long. How much data (size and nb of
> > > > > files) do you have on this OSD, which FS do you use, what are
> > > > > the mount options, what is the hardware and the kind of access ?
> > > > >
> > > > > I already mentioned the HW, Areca RAID controller with 2GB HW
> > > > > cache and a 7 disk RAID6 per OSD.
> > > > > Nothing aside from noatime for mount options and EXT4.
> > > >
> > > > Thanks for the reminder. That said 7-disk RAID6 and EXT4 is new to
> > > > me and may not be innocent.
> > > >
> > > > >
> > > > > 2.6TB per OSD and with 1.4 million objects in the cluster a
> > > > > little more than 700k files per OSD.
> > > >
> > > > That's nearly 3x more than my example OSD but it doesn't explain
> > > > the more than 10x difference in startup time (especially
> > > > considering BTRFS OSDs are slow to startup and my example was with
> > > > dropped caches unlike your case). Your average file size is
> > > > similar so it's not that either. Unless you have a more general,
> > > > system-wide performance problem which impacts everything including
> > > > the OSD init, there's 3 main components involved here :
> > > > - Ceph OSD init code,
> > > > - ext4 filesystem,
> > > > - HW RAID6 block device.
> > > >
> > > > So either :
> > > > - OSD init code doesn't scale past ~500k objects per OSD.
> > > > - your ext4 filesystem is slow for the kind of access used during
> > > > init (inherently or due to fragmentation, you might want to use
> > > > filefrag on a random sample on PG directories, omap and meta),
> > > > - your RAID6 array is slow for the kind of access used during init.
> > > > - any combination of the above.
> > > >
> > > > I believe it's possible but doubtful that the OSD code wouldn't
> > > > scale at this level (this does not feel like an abnormally high
> > > > number of objects to me). Ceph devs will know better.
> > > > ext4 could be a problem as it's not the most common choice for OSDs
> > > > (from what I read here XFS is usually preferred over it) and it
> > > > forces Ceph to use omap to store data which would be stored in
> > > > extended attributes otherwise (which probably isn't without
> > > > performance problems). RAID5/6 on HW might have performance
> > > > problems. The usual ones happen on writes and OSD init is probably
> > > > read-intensive (or maybe not, you should check the kind of access
> > > > happening during the OSD init to avoid any surprise) but with HW
> > > > cards it's difficult to know for sure the performance limitations
> > > > they introduce (the only sure way is testing the actual access
> > > > patterns).
> > > >
> > > > So I would probably try to reproduce the problem by replacing one of the OSDs
> > > > based on RAID6 arrays with as many OSDs as you have devices in the
> > > > arrays. Then if it solves the problem and you didn't already do it
> > > > you might want to explore Areca tuning, specifically with RAID6 if
> > > > you must have it.
> > > >
> > > >
> > > > >
> > > > > And kindly take note that my test cluster has less than 120k
> > > > > objects and thus 15k files per OSD and I still was able to
> > > > > reproduce this behaviour (in spirit at least).
> > > >
> > > > I assume the test cluster uses ext4 and RAID6 arrays too: it would
> > > > be a perfect testing environment for defragmentation/switch to
> > > > XFS/switch to single drive OSDs then.
> > > >
> > > > >
> > > > >> The only time I saw OSDs take several minutes to reach the point
> > > > >> where they fully rejoin is with BTRFS with default
> > > > >> options/config.
> > > > >>
> > > > > There isn't a pole long enough I would touch BTRFS with for
> > > > > production, especially in conjunction with Ceph.
> > > >
> > > > That's a matter of experience and environment but I can
> > > > understand: we invested more than a week of testing/development to
> > > > reach a point where BTRFS was performing better than XFS in our
> > > > use case. Not everyone can dedicate as much time just to select a
> > > > filesystem and support it. There might be use cases where it's not
> > > > even possible to use it (I'm not sure how it would perform if you
> > > > only did small objects storage for example).
> > > >
> > > > BTRFS has been invaluable though : it detected and helped fix
> > > > corruption generated by faulty Raid controllers (by forcing Ceph to
> > > > use other replicas when repairing). I wouldn't let precious data
> > > > live on anything other than checksumming filesystems now (the
> > > > probabilities of undetectable disk corruption are too high for our
> > > > use case now). We have 30 BTRFS OSDs in production (and many BTRFS
> > > > filesystems on other systems) and we've never had any problem with
> > > > them. These filesystems even survived several bad datacenter
> > > > equipment failures (faulty backup generator control system and UPS
> > > > blowing up during periodic testing). That said I'm subscribed to
> > > > linux-btrfs, was one of the SATA controller driver maintainers
> > > > long ago so I know my way around kernel code, I hand pick the
> > > > kernel versions going to production and we have custom tools and
> > > > maintenance procedures for the BTRFS OSDs. So I've means and
> > > > experience which make this choice comfortable for me and my team:
> > > > I wouldn't blindly advise BTRFS to anyone else (not yet).
> > > >
> > > > Anyway it's possible ext4 is a problem but it seems to me less
> > > > likely than the HW RAID6. In my experience RAID controllers with
> > > > cache aren't really worth it with Ceph. Most of the time they
> > > > perform well because of BBWC/FBWC but when you get into a
> > > > situation where you must repair/backfill because you lost an OSD
> > > > or added a new one the HW cache is completely destroyed (what good
> > > > can 4GB do when you must backfill 1TB or even catch up with tens
> > > > of GB of writes ?). It's so bad that when we add an OSD the first
> > > > thing we do now is selectively disable the HW cache for its device
> > > > to avoid slowing all the other OSDs connected to the same
> > > > controller. Using RAID6 for OSDs can minimize the backfills by
> > > > avoiding losing OSDs but probably won't avoid them totally (most
> > > > people have to increase storage eventually). In some cases it
> > > > might be worth it (very large installations where the number of
> > > > OSDs may become a problem) but we aren't there yet and you
> > > > probably have to test these arrays extensively (how much IO can
> > > > you get from them in various access patterns, including when they
> > > > are doing internal maintenance, running with one or two devices
> > > > missing and rebuilding one or two replaced devices) so we will
> > > > delay any kind of RAID below OSDs as long as we can.
> > > >
> > > > Best regards,
> > > >
> > > > Lionel
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users@xxxxxxxxxxxxxx
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/