On Mon, 26 Mar 2018 23:00:28 +0700 Sam Huracan wrote:

> Thanks for your information.
> Here is the result when I run atop on 1 Ceph HDD host:
> http://prntscr.com/iwmc86
>
This pretty much confirms the iostat output: deep scrubbing is clearly
killing your cluster performance.

> Some disks are busy at over 100%, but the journal SSD is only at 3%.
> Is that normal? Is there any way to optimize the use of the SSD
> journal? Could you give me some keywords?
>
For starters, the journal SSD is not involved in deep scrubbing at all.
And even if you were to configure things differently (as you clumsily
and halfway attempted with the filestore and journal config changes),
at some point ALL those IOPS will hit your disks and the disks will
need to handle them.
Any cache/journal only helps with short bursts and workloads that stay
within the capacity of that cache. Prolonged high IOPS will eventually
run into the disk limitation wall.

> Here is the configuration of a Ceph HDD host:
> Dell PowerEdge R730xd Server                                  Quantity
> PE R730/xd Motherboard                                               1
> Intel Xeon E5-2620 v4 2.1GHz, 20M Cache, 8.0GT/s QPI, Turbo, HT,
> 8C/16T (85W), Max Mem 2133MHz                                        1
> 16GB RDIMM, 2400MT/s, Dual Rank, x8 Data Width                       2
> 300GB 15K RPM SAS 12Gbps 2.5in Flex Bay Hard Drive - OS Drive
> (RAID 1)                                                             2
> 4TB 7.2K RPM NLSAS 12Gbps 512n 3.5in Hot-plug Hard Drive - OSD Drive 7
>
As RAID 0 single disks connected to the RAID controller?

> 200GB Solid State Drive SATA Mix Use MLC 6Gbps 2.5in Hot-plug Drive -
> Journal Drive (RAID 1)                                               2
>
These vendor "names" could mean different things at different times,
but the journal SSD clearly isn't it.

> PERC H730 Integrated RAID Controller, 1GB Cache *(we are using
> Writeback mode)*                                                     1
>
As David said, check that the BBU is working and that writeback is
actually active.
Also, the SSDs will be fine without the RAID controller writeback
cache, and disabling it for them will give more cache to the HDDs,
which need it.
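A quick way to verify both on a PERC (LSI/Avago rebrand) is the MegaCli
(or equivalent perccli) tooling; a sketch only, since binary name, path
and adapter number vary per install:

```shell
# Check battery/BBU health; look for "Battery State: Optimal".
# Adapter 0 is assumed here.
MegaCli64 -AdpBbuCmd -GetBbuStatus -a0

# Check the per-logical-disk cache policy. "Current Cache Policy:
# WriteBack" means writeback is really active; "WriteThrough" means
# the controller fell back, e.g. because the BBU is dead or charging.
MegaCli64 -LDInfo -Lall -a0 | grep -i 'cache policy'
```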
> Dual, Hot-plug, Redundant Power Supply (1+1), 750W                   1
> Broadcom 5720 QP 1Gb Network Daughter Card                           1
> QLogic 57810 Dual Port 10Gb Direct Attach/SFP+ Network Adapter       1
>
> For some reasons, we can't configure jumbo frames in this cluster.
> We'll follow your suggestion about scrubs.
>
Jumbo frames are not the issue here. Depending on the overall
installation they may not work, may cause problems when somebody
forgets to configure them on either host or switch, and with a good
modern switch may not even buy you that much in the latency department.

Christian

> 2018-03-26 7:41 GMT+07:00 Christian Balzer <chibi@xxxxxxx>:
>
> > Hello,
> >
> > In general, and as a reminder for others: the more information you
> > supply, the more likely people are to answer, and to answer with
> > actually pertinent information.
> > Since you haven't mentioned the hardware (actual HDD/SSD models,
> > CPU/RAM, controllers, etc.) we're still missing a piece of the
> > puzzle that could be relevant.
> >
> > But given what we have, some things are more likely than others.
> > Also, an inline 90KB screenshot of a TEXT iostat output is a bit of
> > a no-no; never mind that atop instead of top from the start would
> > have given you and us much more insight.
> >
> > On Sun, 25 Mar 2018 14:35:57 +0700 Sam Huracan wrote:
> >
> > > Thank you all.
> > >
> > > 1. Here is my ceph.conf file:
> > > https://pastebin.com/xpF2LUHs
> > >
> > As Laszlo noted (and it matches your iostat output beautifully),
> > tuning down scrubs is likely to have an immediate beneficial
> > impact, as deep-scrubs in particular are VERY disruptive and
> > I/O-intense operations.
> >
> > However, "osd scrub sleep = 0.1" may make things worse in certain
> > Jewel versions: all operations went through the unified queue
> > there, so this would cause a sleep for ALL operations, not just the
> > scrub ones.
> > I can't remember when this was fixed and the changelog is of no
> > help, so hopefully somebody who knows will pipe up.
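Given that caveat, throttling scrubs through the scheduling options
rather than the sleep is the safer route. A minimal ceph.conf sketch
(option names as in Jewel; the values are illustrative, not tuned for
this cluster):

```ini
[osd]
; at most one concurrent scrub per OSD (this is the default)
osd max scrubs = 1
; deep-scrub each PG every 14 days instead of the default 7,
; halving the steady-state deep-scrub load
osd deep scrub interval = 1209600
; only start a scheduled scrub when the host loadavg is below this
osd scrub load threshold = 0.5
```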
> > If in doubt, of course, experiment.
> >
> > In addition to that, if you have low-usage times, set your
> > osd_scrub_(begin|end)_hour accordingly, and also check the ML
> > archives for other scrub scheduling tips.
> >
> > I'd also leave these:
> >   filestore max sync interval = 100
> >   filestore min sync interval = 50
> >   filestore queue max ops = 5000
> >   filestore queue committing max ops = 5000
> >   journal max write entries = 1000
> >   journal queue max ops = 5000
> > at their defaults. Playing with those parameters requires a good
> > understanding of how Ceph filestore works AND usually only makes
> > sense with SSD/NVMe setups.
> > Especially the first two could lead to quite the I/O pileup.
> >
> > > 2. Here is the result from "ceph -s":
> > > root@ceph1:/etc/ceph# ceph -s
> > >     cluster 31154d30-b0d3-4411-9178-0bbe367a5578
> > >      health HEALTH_OK
> > >      monmap e3: 3 mons at {ceph1=10.0.30.51:6789/0,ceph2=10.0.30.52:6789/0,ceph3=10.0.30.53:6789/0}
> > >             election epoch 18, quorum 0,1,2 ceph1,ceph2,ceph3
> > >      osdmap e2473: 63 osds: 63 up, 63 in
> > >             flags sortbitwise,require_jewel_osds
> > >       pgmap v34069952: 4096 pgs, 6 pools, 21534 GB data, 5696 kobjects
> > >             59762 GB used, 135 TB / 194 TB avail
> > >                 4092 active+clean
> > >                    2 active+clean+scrubbing
> > >                    2 active+clean+scrubbing+deep
> > >   client io 36096 kB/s rd, 41611 kB/s wr, 1643 op/s rd, 1634 op/s wr
> > >
> > See above about deep-scrub, which reads ALL the objects of the PG
> > being scrubbed and thus not only saturates the OSDs involved with
> > reads but ALSO dirties the pagecache with cold objects, making
> > other reads on those nodes slow by forcing them to hit the disks,
> > too.
> >
> > It would be interesting to see a "ceph -s" when your cluster is
> > busy but NOT scrubbing; 1600 write op/s are about what 21 HDDs can
> > handle.
> > So for the time being, disable scrubs entirely and see if your
> > problems go away.
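Disabling scrubs and confining them to a quiet window can be done live,
without OSD restarts. A sketch using the Jewel-era CLI (the 01:00-05:00
window is purely illustrative):

```shell
# Temporarily stop all scrubbing cluster-wide while you observe
# client latency:
ceph osd set noscrub
ceph osd set nodeep-scrub

# ...later, re-enable it:
ceph osd unset noscrub
ceph osd unset nodeep-scrub

# Restrict scheduled scrubs to a low-usage window on the fly:
ceph tell osd.* injectargs '--osd_scrub_begin_hour 1 --osd_scrub_end_hour 5'
```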
> > If so, you now know the limits of your current setup and will want
> > to avoid hitting them again.
> >
> > Having a dedicated SSD pool for high-end VMs, or a cache tier (if
> > it is a fit, which is not likely in your case), would be a way
> > forward if your client demands are still growing.
> >
> > Christian
> >
> > > 3. We use 1 SSD to journal 7 HDDs (/dev/sdi); I set 16GB for
> > > each journal. Here is the result of the "ceph-disk list" command:
> > >
> > > /dev/sda :
> > >  /dev/sda1 ceph data, active, cluster ceph, osd.0, journal /dev/sdi1
> > > /dev/sdb :
> > >  /dev/sdb1 ceph data, active, cluster ceph, osd.1, journal /dev/sdi2
> > > /dev/sdc :
> > >  /dev/sdc1 ceph data, active, cluster ceph, osd.2, journal /dev/sdi3
> > > /dev/sdd :
> > >  /dev/sdd1 ceph data, active, cluster ceph, osd.3, journal /dev/sdi4
> > > /dev/sde :
> > >  /dev/sde1 ceph data, active, cluster ceph, osd.4, journal /dev/sdi5
> > > /dev/sdf :
> > >  /dev/sdf1 ceph data, active, cluster ceph, osd.5, journal /dev/sdi6
> > > /dev/sdg :
> > >  /dev/sdg1 ceph data, active, cluster ceph, osd.6, journal /dev/sdi7
> > > /dev/sdh :
> > >  /dev/sdh3 other, LVM2_member
> > >  /dev/sdh1 other, vfat, mounted on /boot/efi
> > > /dev/sdi :
> > >  /dev/sdi1 ceph journal, for /dev/sda1
> > >  /dev/sdi2 ceph journal, for /dev/sdb1
> > >  /dev/sdi3 ceph journal, for /dev/sdc1
> > >  /dev/sdi4 ceph journal, for /dev/sdd1
> > >  /dev/sdi5 ceph journal, for /dev/sde1
> > >  /dev/sdi6 ceph journal, for /dev/sdf1
> > >  /dev/sdi7 ceph journal, for /dev/sdg1
> > >
> > > 4. With iostat, we just ran "iostat -x 2"; /dev/sdi is the
> > > journal SSD, /dev/sdh is the OS disk, and the rest are OSD disks.
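As an aside, a quick way to pick out saturated spinners from "iostat -x"
output is to filter on the %util column. A sketch that assumes %util is
the last field (column positions vary between sysstat versions), shown
here against two canned sample lines rather than a live stream:

```shell
# Print device name and %util for any sdX device over 90% busy.
# Header lines are skipped because their last field is non-numeric.
# Live use would be:  iostat -x 2 | awk '...same program...'
printf 'sda 1.0 99.5\nsdi 1.0 3.1\n' |
  awk '$NF+0 > 90 && $1 ~ /^sd/ { print $1, $NF }'
```

On the sample input this keeps only the 99.5%-busy sda and drops the
3.1%-busy journal SSD sdi.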
> > > root@ceph1:/etc/ceph# lsblk
> > > NAME                             MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> > > sda                                8:0    0   3.7T  0 disk
> > > └─sda1                             8:1    0   3.7T  0 part /var/lib/ceph/osd/ceph-0
> > > sdb                                8:16   0   3.7T  0 disk
> > > └─sdb1                             8:17   0   3.7T  0 part /var/lib/ceph/osd/ceph-1
> > > sdc                                8:32   0   3.7T  0 disk
> > > └─sdc1                             8:33   0   3.7T  0 part /var/lib/ceph/osd/ceph-2
> > > sdd                                8:48   0   3.7T  0 disk
> > > └─sdd1                             8:49   0   3.7T  0 part /var/lib/ceph/osd/ceph-3
> > > sde                                8:64   0   3.7T  0 disk
> > > └─sde1                             8:65   0   3.7T  0 part /var/lib/ceph/osd/ceph-4
> > > sdf                                8:80   0   3.7T  0 disk
> > > └─sdf1                             8:81   0   3.7T  0 part /var/lib/ceph/osd/ceph-5
> > > sdg                                8:96   0   3.7T  0 disk
> > > └─sdg1                             8:97   0   3.7T  0 part /var/lib/ceph/osd/ceph-6
> > > sdh                                8:112  0 278.9G  0 disk
> > > ├─sdh1                             8:113  0   512M  0 part /boot/efi
> > > └─sdh3                             8:115  0 278.1G  0 part
> > >   ├─hnceph--hdd1--vg-swap (dm-0) 252:0    0  59.6G  0 lvm  [SWAP]
> > >   └─hnceph--hdd1--vg-root (dm-1) 252:1    0 218.5G  0 lvm  /
> > > sdi                                8:128  0 185.8G  0 disk
> > > ├─sdi1                             8:129  0  16.6G  0 part
> > > ├─sdi2                             8:130  0  16.6G  0 part
> > > ├─sdi3                             8:131  0  16.6G  0 part
> > > ├─sdi4                             8:132  0  16.6G  0 part
> > > ├─sdi5                             8:133  0  16.6G  0 part
> > > ├─sdi6                             8:134  0  16.6G  0 part
> > > └─sdi7                             8:135  0  16.6G  0 part
> > >
> > > Could you give me some ideas on what to check next?
> > >
> > > 2018-03-25 12:25 GMT+07:00 Budai Laszlo <laszlo.budai@xxxxxxxxx>:
> > >
> > > > Could you post the result of "ceph -s"? Besides the health
> > > > status there are other details that could help, like the status
> > > > of your PGs. The result of "ceph-disk list" would also be useful
> > > > to understand how your disks are organized. For instance, with
> > > > 1 SSD for 7 HDDs, the SSD could be the bottleneck.
> > > > From the outputs you gave us we don't know which are the
> > > > spinning disks and which is the SSD (looking at the numbers I
> > > > suspect that sdi is your SSD). We also don't know what
> > > > parameters you used when you ran the iostat command.
> > > > Unfortunately it's difficult to help you without knowing more
> > > > about your system.
> > > >
> > > > Kind regards,
> > > > Laszlo
> > > >
> > > > On 24.03.2018 20:19, Sam Huracan wrote:
> > > > > This is from iostat:
> > > > >
> > > > > I'm using Ceph Jewel, with no HW errors.
> > > > > Ceph health is OK, and we've only used 50% of the total
> > > > > volume.
> > > > >
> > > > > 2018-03-24 22:20 GMT+07:00 <ceph@xxxxxxxxxx>:
> > > > >
> > > > >     I would also check the utilization of your disks with
> > > > >     tools like atop. Perhaps something related shows up in
> > > > >     dmesg or thereabouts?
> > > > >
> > > > >     - Mehmet
> > > > >
> > > > >     Am 24. März 2018 08:17:44 MEZ schrieb Sam Huracan
> > > > >     <nowitzki.sammy@xxxxxxxxx>:
> > > > >
> > > > >         Hi guys,
> > > > >         We are running a production OpenStack backed by Ceph.
> > > > >
> > > > >         At present we are hitting an issue with high iowait
> > > > >         in VMs. In some MySQL VMs we sometimes see iowait
> > > > >         reach abnormally high peaks, which leads to an
> > > > >         increase in slow queries even though the load is
> > > > >         stable (we test with a script simulating real load),
> > > > >         as you can see in the graph:
> > > > >         https://prnt.sc/ivndni
> > > > >
> > > > >         The MySQL VMs are placed on the Ceph HDD cluster,
> > > > >         with 1 SSD journal per 7 HDDs. In this cluster,
> > > > >         iowait on each Ceph host is about 20%.
> > > > >         https://prnt.sc/ivne08
> > > > >
> > > > >         Can you guys help me find the root cause of this
> > > > >         issue, and how to eliminate this high iowait?
> > > > >
> > > > >         Thanks in advance.
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Christian Balzer                Network/Systems Engineer
chibi@xxxxxxx                   Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com