Re: New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Christian Balzer
> Sent: 24 October 2016 02:30
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg
> queue size
> 
> 
> Hello,
> 
> On Fri, 21 Oct 2016 17:44:25 +0000 Jim Kilborn wrote:
> 
> > Reed/Christian,
> >
> > So if I put the OSD journals on an SSD that has power loss protection (Samsung SM863), all the writes then go through those journals.
> Can I then leave write caching turned on for the spinner OSDs, even without a BBU caching controller? In the event of a power outage past
> our UPS time, I want to ensure all the OSDs aren’t corrupt after bringing the nodes back up.

Just to clarify here: I think the poster who had data loss with the disk write cache enabled had it enabled behind a RAID controller. That is different from having the write cache enabled (via hdparm) on a disk presented directly to Linux, where the kernel can (hopefully!!!) force the disk to flush its cache. A RAID controller hides that capability, because it only exposes the virtual RAID disk(s), so the kernel has no way to tell the underlying drives to flush their caches.
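For reference (just a sketch, the device name is an example), checking and toggling the on-disk cache on a directly attached drive is a one-liner with hdparm, and note that it does not persist across reboots:

    hdparm -W /dev/sdb      # query the current write-cache state
    hdparm -W 0 /dev/sdb    # disable the volatile on-disk write cache
    hdparm -W 1 /dev/sdb    # re-enable it (only if you trust your power protection)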


> >
> 
> Nope, as the corruption happened on the actual OSD, the LevelDB.
> 
> At this point I'd like to point out the last paragraph here:
> https://en.wikipedia.org/wiki/LevelDB
> 
> Which matches my experience: on a MON node with non-power-fail-safe SSDs we lost power twice (by human error, not actual DC
> problems) and had the levelDB corrupted both times, while the filesystem was fine thanks to proper journaling with barriers and
> SYNC points.
> 
> If you have direct control/communications with your UPS, I'd recommend using that info to shut down things before power runs out.
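Agreed. For anyone wanting to automate that, a minimal sketch with NUT's upsmon (the UPS name and password are placeholders, and this assumes the UPS is already defined in ups.conf):

    # /etc/ups/upsmon.conf
    MONITOR myups@localhost 1 upsmon mypass master
    SHUTDOWNCMD "/sbin/shutdown -h +0"
    # upsmon runs SHUTDOWNCMD once the UPS reports on-battery + low-battery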
> 
> Alternatively, have ALL your data on PLP SSDs or, with lots of manual effort, symlink the leveldb to such a device.
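For the symlink route, roughly this on a mon (paths and the mount point are illustrative, it assumes the mon id is the short hostname, and on Jewel make sure ownership stays ceph:ceph):

    systemctl stop ceph-mon@$(hostname -s)
    # /mnt/plp-ssd is assumed to be a filesystem on a power-loss-protected SSD
    rsync -a /var/lib/ceph/mon/ceph-$(hostname -s)/store.db/ /mnt/plp-ssd/mon-store.db/
    mv /var/lib/ceph/mon/ceph-$(hostname -s)/store.db{,.old}
    ln -s /mnt/plp-ssd/mon-store.db /var/lib/ceph/mon/ceph-$(hostname -s)/store.db
    systemctl start ceph-mon@$(hostname -s)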
> 
> 
> Christian
> 
> > Secondly, Seagate 8TB enterprise drives say they employ power loss protection as well. Apparently, in your case, this turned out to
> be untrue?
> >
> >
> >
> >
> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > Windows 10
> >
> > From: Reed Dier<mailto:reed.dier@xxxxxxxxxxx>
> > Sent: Friday, October 21, 2016 10:06 AM
> > To: Christian Balzer<mailto:chibi@xxxxxxx>
> > Cc: ceph-users@xxxxxxxxxxxxxx<mailto:ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re:  New cephfs cluster performance issues- Jewel
> > - cache pressure, capability release, poor iostat await avg queue size
> >
> >
> > On Oct 19, 2016, at 7:54 PM, Christian Balzer <chibi@xxxxxxx<mailto:chibi@xxxxxxx>> wrote:
> >
> >
> > Hello,
> >
> > On Wed, 19 Oct 2016 12:28:28 +0000 Jim Kilborn wrote:
> >
> > I have setup a new linux cluster to allow migration from our old SAN based cluster to a new cluster with ceph.
> > All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> > As others mentioned, not a good choice, but also not the (main) cause
> > of your problems.
> >
> > I am basically running stock ceph settings, with just turning the write cache off via hdparm on the drives, and temporarily turning off
> scrubbing.
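(For anyone following along, temporarily disabling scrub is just the cluster-wide flags:)

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # and to turn scrubbing back on later:
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub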
> >
> > The former is bound to kill performance, if you care that much for
> > your data but can't guarantee constant power (UPS, dual PSUs, etc),
> > consider using a BBU caching controller.
> >
> > I wanted to comment on this small bolded bit. In the early days of my ceph cluster, while testing resiliency to power failure (worst case
> scenario), I would lose an OSD to leveldb corruption whenever the on-disk write cache was enabled on my drives, even with BBU.
> >
> > With BBU + no disk-level cache, the OSD would come back, with no data
> > loss, however performance would be significantly degraded. (xfsaild
> > process with 99% iowait, cured by zapping disk and recreating OSD)
> >
> > For reference, these were Seagate ST8000NM0065, backed by an LSI 3108 RoC, with the OSD set as a single RAID0 VD. On disk
> journaling.
> >
> > There was a decent hit to write performance after disabling write caching at the disk layer, but write-back caching at the
> controller layer provided enough of an offsetting gain that the data security was an acceptable trade-off.
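(If anyone wants to reproduce that combination on an LSI/Avago RoC, storcli does it roughly like this; the controller and VD numbers are just examples:)

    storcli /c0/v0 set pdcache=off   # turn off the physical drives' own volatile cache
    storcli /c0/v0 set wrcache=wb    # write-back at the controller, backed by the BBU/CacheVault
    storcli /c0/v0 show all          # verify the cache settings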
> >
> > It was a tough way to learn how important this is: the data center was struck by lightning two weeks after the initial ceph cluster install,
> one phase of power was knocked out for 15 minutes, and half the non-dual-PSU nodes went down with it.
> >
> > Just want to make sure that people learn from that painful experience.
> >
> > Reed
> >
> >
> > The latter I venture you did because performance was abysmal with
> > scrubbing enabled.
> > Which is always a good indicator that your cluster needs tuning, improving.
> >
> > The 4 ceph servers are all Dell 730XD with 128GB memory and dual Xeons, so server performance should be good.
> > Memory is fine; the CPU I can't tell from the model number and I'm not
> > inclined to look it up or guess, but that usually only becomes a
> > bottleneck with an all-SSD setup and things requiring the
> > lowest latency possible.
> >
> >
> > Since I am running cephfs, I have tiering set up.
> > That should read "on top of EC pools", and as John said, not a good
> > idea at all, both EC pools and cache-tiering.
> >
> > Each server has 4 x 4TB drives for the erasure code pool, with K=3 and M=1, so the idea is to tolerate a single host failure.
> > Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in
> > a replicated set with size=2
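(For context, a layout like that is normally expressed along these lines on Jewel; the pool names and PG counts below are placeholders:)

    ceph osd erasure-code-profile set ec31 k=3 m=1 ruleset-failure-domain=host
    ceph osd pool create cephfs_data_ec 512 512 erasure ec31
    ceph osd pool create cephfs_cache 128 128 replicated
    ceph osd pool set cephfs_cache size 2
    ceph osd tier add cephfs_data_ec cephfs_cache
    ceph osd tier cache-mode cephfs_cache writeback
    ceph osd tier set-overlay cephfs_data_ec cephfs_cache
    # plus a crush ruleset that pins cephfs_cache to the SSDs (see the crush map notes below)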
> >
> > This isn't a Seagate, you mean Samsung. And that's a consumer model,
> > ill suited for this task, even with the DC level SSDs below as journals.
> >
> > And as such a replication of 2 is also ill-advised; I've seen these
> > SSDs die w/o ANY warning whatsoever and long before their (abysmal)
> > endurance was exhausted.
> >
> > The cache tier also has a 128GB SM863 SSD that is being used as a
> > journal for the cache SSD. It has power loss protection
> >
> > Those are fine. If you re-do your cluster, don't put more than 4-5
> > journals on them.
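(On Jewel/filestore that just means pointing ceph-disk at a separate journal device; /dev/sdd and /dev/sdb below are examples for the data disk and the journal SSD:)

    ceph-disk prepare /dev/sdd /dev/sdb   # carves a journal partition on the SSD for this OSD
    ceph-disk activate /dev/sdd1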
> >
> > My crush map is set up to ensure the cache pool uses only the 4 850 Pro SSDs and the erasure code pool uses only the 16 spinning 4TB drives.
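(Jewel has no device classes yet, so that kind of split is usually done by hand-editing the crush map into separate roots/rulesets, roughly:)

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # edit crush.txt: separate ssd and hdd roots (or per-host ssd/hdd buckets), one ruleset per root
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new
    ceph osd pool set <cache-pool> crush_ruleset <ssd-ruleset-id>   # point the cache pool at the SSD ruleset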
> >
> > The problem that I am seeing is that I start copying data from our old SAN to the ceph volume, and once the cache tier gets to my
> target_max_bytes of 1.4 TB, I start seeing:
> >
> > HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow
> > requests; noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> > 26 ops are blocked > 65.536 sec on osd.0
> > 37 ops are blocked > 32.768 sec on osd.0
> > 1 osds have slow requests
> > noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> >
> > osd.0 is the cache ssd
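(When that happens, "ceph health detail" plus the admin socket on the offending OSD usually shows where the time is going:)

    ceph health detail                      # which OSDs/requests are slow
    ceph daemon osd.0 dump_ops_in_flight    # run on the node hosting osd.0
    ceph daemon osd.0 dump_historic_ops     # recent slow ops with per-stage timestamps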
> >
> > If I watch iostat on the cache SSD, I see the queue lengths and awaits are high. Below is the iostat for the cache drive (osd.0)
> > on the first host: the avgqu-sz is between 87 and 182 and the await is
> > between 88ms and 1193ms.
> >
> > Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
> > sdb        0.00    0.33   9.00   84.33   0.96  20.11    462.40     75.92  397.56   125.67   426.58  10.70   99.90
> > sdb        0.00    0.67  30.00   87.33   5.96  21.03    471.20     67.86  910.95    87.00  1193.99   8.27   97.07
> > sdb        0.00   16.67  33.00  289.33   4.21  18.80    146.20     29.83   88.99    93.91    88.43   3.10   99.83
> > sdb        0.00    7.33   7.67  261.67   1.92  19.63    163.81    117.42  331.97   182.04   336.36   3.71  100.00
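(For anyone wanting to reproduce that view, it is iostat's extended per-device stats, e.g. iostat -xm 3 sdb for 3-second samples in MB/s.)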
> >
> >
> > If I look at the iostat for all the drives, only the cache ssd drive
> > is backed up
> >
> > Yes, consumer SSDs on top of a design that channels everything through
> > them.
> >
> > Rebuild your cluster along more conventional and conservative lines,
> > don't use the 850 PROs.
> > Feel free to run any new design by us.
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx<mailto:chibi@xxxxxxx>    Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx<mailto:ceph-users@xxxxxxxxxxxxxx>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



