Hello,

On Fri, 24 Jun 2016 15:45:52 -0400 Wade Holler wrote:

> Not reasonable, as you say:
>
> vm.min_free_kbytes = 90112
>
Yeah, my nodes with IB adapters all have that set to at least 512MB,
or 1GB if they have more than 64GB of RAM.

> we're in recovery post expansion (48->54 OSDs) right now but free -t is:
>
> #free -t
>
Free can be very misleading when it comes to the actual state of things
with regards to memory fragmentation.

Take a look at "cat /proc/buddyinfo" and read up on Linux memory
fragmentation.

Also this:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg22214.html

Christian

>               total        used        free      shared  buff/cache   available
> Mem:      693097104   378383384    36870080      369292   277843640   250931372
> Swap:       1048572         956     1047616
> Total:    694145676   378384340    37917696
>
> On Fri, Jun 24, 2016 at 12:24 PM, Warren Wang - ISD
> <Warren.Wang@xxxxxxxxxxx> wrote:
> > Oops, that reminds me, do you have min_free_kbytes set to something
> > reasonable, like at least 2-4GB?
> >
> > Warren Wang
> >
> >
> > On 6/24/16, 10:23 AM, "Wade Holler" <wade.holler@xxxxxxxxx> wrote:
> >
> >> On the vm.vfs_cache_pressure = 1: We had this initially and I still
> >> think it is the best choice for most configs. However, with our large
> >> memory footprint, vfs_cache_pressure=1 increased the likelihood of
> >> hitting an issue where our write response time would double; then a
> >> drop of caches would return response time to normal. I don't claim to
> >> totally understand this and I only have speculation at the moment.
> >> Again, thanks for this suggestion; I do think it is best for boxes that
> >> don't have very large memory.
> >>
> >> @Christian - reformatting to btrfs or ext4 is an option in my test
> >> cluster. I thought about that but needed to sort xfs first (that's
> >> what production will run right now). You all have helped me do that,
> >> and thank you again. I will circle back and test btrfs under the same
> >> conditions. I suspect that it will behave similarly, but it's only a
> >> day and a half's work or so to test.
> >>
> >> Best Regards,
> >> Wade
> >>
> >>
> >> On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> >> wrote:
> >>> Oops, typo, 128 GB :-)...
> >>>
> >>> -----Original Message-----
> >>> From: Christian Balzer [mailto:chibi@xxxxxxx]
> >>> Sent: Thursday, June 23, 2016 5:08 PM
> >>> To: ceph-users@xxxxxxxxxxxxxx
> >>> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite;
> >>> Ceph Development
> >>> Subject: Re: Dramatic performance drop at certain number
> >>> of objects in pool
> >>>
> >>>
> >>> Hello,
> >>>
> >>> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
> >>>
> >>>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
> >>>> *pin* inodes/dentries in memory. We have been using that for a long
> >>>> time now (with 128 TB node memory) and it seems to help, especially
> >>>> for the random write workload, by saving the xattr reads in between.
> >>>>
> >>> 128TB node memory, really?
> >>> Can I have some of those, too? ^o^
> >>> And here I was thinking that Wade's 660GB machines were on the
> >>> excessive side.
> >>>
> >>> There's something to be said (and optimized) when your storage nodes
> >>> have the same or more RAM as your compute nodes...
> >>>
> >>> As for Warren, well spotted.
> >>> I personally use vm.vfs_cache_pressure = 1; this avoids the potential
> >>> fireworks if your memory is really needed elsewhere, while keeping
> >>> things in memory normally.
> >>>
> >>> Christian
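For reference, the VM tunables being debated above can be checked and
applied along these lines. This is only a minimal sketch: the values are
simply the ones suggested in this thread (Warren's 2-4GB min_free_kbytes,
vfs_cache_pressure=1, swappiness=1), not universal recommendations, and
the sysctl.d file name is arbitrary.

    # Fragmentation check: mostly-zero columns on the right-hand
    # (higher-order) side of each zone mean large contiguous blocks
    # are scarce, even if "free" still looks healthy.
    cat /proc/buddyinfo

    # Apply the tunables at runtime (example values from this thread):
    sysctl -w vm.min_free_kbytes=4194304   # ~4GB, per Warren's 2-4GB suggestion
    sysctl -w vm.vfs_cache_pressure=1      # strongly prefer keeping inodes/dentries cached
    sysctl -w vm.swappiness=1

    # Persist across reboots (file name is arbitrary):
    printf '%s\n' \
        'vm.min_free_kbytes = 4194304' \
        'vm.vfs_cache_pressure = 1' \
        'vm.swappiness = 1' > /etc/sysctl.d/90-ceph-osd.conf
    sysctl --system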
> >>>
> >>>> Thanks & Regards
> >>>> Somnath
> >>>>
> >>>> -----Original Message-----
> >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >>>> Behalf Of Warren Wang - ISD
> >>>> Sent: Thursday, June 23, 2016 3:09 PM
> >>>> To: Wade Holler; Blair Bethwaite
> >>>> Cc: Ceph Development; ceph-users@xxxxxxxxxxxxxx
> >>>> Subject: Re: Dramatic performance drop at certain
> >>>> number of objects in pool
> >>>>
> >>>> vm.vfs_cache_pressure = 100
> >>>>
> >>>> Go the other direction on that. You'll want to keep it low to help
> >>>> keep inode/dentry info in memory. We use 10, and haven't had a
> >>>> problem.
> >>>>
> >>>>
> >>>> Warren Wang
> >>>>
> >>>>
> >>>>
> >>>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@xxxxxxxxx> wrote:
> >>>>
> >>>> >Blairo,
> >>>> >
> >>>> >We'll speak in pre-replication numbers; replication for this pool
> >>>> >is 3.
> >>>> >
> >>>> >23.3 Million Objects / OSD
> >>>> >pg_num 2048
> >>>> >16 OSDs / Server
> >>>> >3 Servers
> >>>> >660 GB RAM Total, 179 GB Used (free -t) / Server
> >>>> >vm.swappiness = 1
> >>>> >vm.vfs_cache_pressure = 100
> >>>> >
> >>>> >Workload is native librados with Python. ALL 4k objects.
> >>>> >
> >>>> >Best Regards,
> >>>> >Wade
> >>>> >
> >>>> >
> >>>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
> >>>> ><blair.bethwaite@xxxxxxxxx> wrote:
> >>>> >> Wade, good to know.
> >>>> >>
> >>>> >> For the record, what does this work out to roughly per OSD? And
> >>>> >> how much RAM and how many PGs per OSD do you have?
> >>>> >>
> >>>> >> What's your workload? I wonder whether for certain workloads
> >>>> >> (e.g. RBD) it's better to increase the default object size somewhat
> >>>> >> before pushing the split/merge up a lot...
> >>>> >>
> >>>> >> Cheers,
> >>>> >>
> >>>> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@xxxxxxxxx>
> >>>> >> wrote:
> >>>> >>> Based on everyone's suggestions: the first modification, to 50 /
> >>>> >>> 16, enabled our config to get to ~645 million objects before the
> >>>> >>> behavior in question was observed (~330 million was the previous
> >>>> >>> ceiling). A subsequent modification to 50 / 24 has enabled us to
> >>>> >>> get to 1.1 billion+.
> >>>> >>>
> >>>> >>> Thank you all very much for your support and assistance.
> >>>> >>>
> >>>> >>> Best Regards,
> >>>> >>> Wade
> >>>> >>>
> >>>> >>>
> >>>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer
> >>>> >>> <chibi@xxxxxxx> wrote:
> >>>> >>>>
> >>>> >>>> Hello,
> >>>> >>>>
> >>>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
> >>>> >>>>
> >>>> >>>>> Sorry, late to the party here. I agree, up the merge and split
> >>>> >>>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket
> >>>> >>>>> here. One of those things you just have to find out as an
> >>>> >>>>> operator, since it's not well documented :(
> >>>> >>>>>
> >>>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> >>>> >>>>>
> >>>> >>>>> We have over 200 million objects in this cluster, and it's
> >>>> >>>>> still doing over 15000 write IOPS all day long with 302 spinning
> >>>> >>>>> drives + SATA SSD journals. Having enough memory and dropping
> >>>> >>>>> your vfs_cache_pressure should also help.
> >>>> >>>>>
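The "50 / 16", "50 / 24" and "50/12" pairs above are the filestore
merge/split settings. Assuming they are quoted as merge threshold /
split multiple, a sketch of what that looks like in ceph.conf, together
with the per-directory ceiling the two values imply (formula as
documented for filestore; the numbers are just the ones from this
thread, not a recommendation):

    # ceph.conf on the OSD nodes
    [osd]
    filestore merge threshold = 50
    filestore split multiple  = 16    # Wade later raised this to 24

    # Filestore splits a PG subdirectory once it holds more than about
    #   filestore_split_multiple * abs(filestore_merge_threshold) * 16
    # files, so 16 * 50 * 16 = 12800 objects per directory with 50/16,
    # and 24 * 50 * 16 = 19200 with 50/24.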
> >>>> >>>> Indeed.
> >>>> >>>>
> >>>> >>>> Since it was asked in that bug report and was also my first
> >>>> >>>> suspicion, it is probably a good time to clarify that it isn't
> >>>> >>>> the splits themselves that cause the performance degradation,
> >>>> >>>> but the resulting inflation of dir entries and exhaustion of
> >>>> >>>> SLAB, and thus having to go to disk for things that normally
> >>>> >>>> would be in memory.
> >>>> >>>>
> >>>> >>>> Looking at Blair's graph from yesterday pretty much makes that
> >>>> >>>> clear; a purely split-caused degradation should have relented
> >>>> >>>> much quicker.
> >>>> >>>>
> >>>> >>>>
> >>>> >>>>> Keep in mind that if you change the values, it won't take
> >>>> >>>>> effect immediately. It only merges them back if the directory
> >>>> >>>>> is under the calculated threshold and a write occurs (maybe a
> >>>> >>>>> read, I forget).
> >>>> >>>>>
> >>>> >>>> If it's a read, a plain scrub might do the trick.
> >>>> >>>>
> >>>> >>>> Christian
> >>>> >>>>
> >>>> >>>>> Warren
> >>>> >>>>>
> >>>> >>>>>
> >>>> >>>>> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx>
> >>>> >>>>> on behalf of Wade Holler <wade.holler@xxxxxxxxx>
> >>>> >>>>> Date: Monday, June 20, 2016 at 2:48 PM
> >>>> >>>>> To: Blair Bethwaite <blair.bethwaite@xxxxxxxxx>,
> >>>> >>>>> Wido den Hollander <wido@xxxxxxxx>
> >>>> >>>>> Cc: Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>,
> >>>> >>>>> "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
> >>>> >>>>> Subject: Re: Dramatic performance drop at certain number of
> >>>> >>>>> objects in pool
> >>>> >>>>>
> >>>> >>>>> Thanks everyone for your replies. I sincerely appreciate it.
> >>>> >>>>> We are testing with different pg_num and
> >>>> >>>>> filestore_split_multiple settings. Early indications are ....
> >>>> >>>>> well, not great. Regardless, it is nice to understand the
> >>>> >>>>> symptoms better so we can try to design around them.
> >>>> >>>>>
> >>>> >>>>> Best Regards,
> >>>> >>>>> Wade
> >>>> >>>>>
> >>>> >>>>>
> >>>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
> >>>> >>>>> <blair.bethwaite@xxxxxxxxx> wrote:
> >>>> >>>>> On 20 June 2016 at 09:21, Blair Bethwaite
> >>>> >>>>> <blair.bethwaite@xxxxxxxxx> wrote:
> >>>> >>>>> > slow request issues). If you watch your xfs stats you'll
> >>>> >>>>> > likely get further confirmation. In my experience
> >>>> >>>>> > xs_dir_lookups balloons (which means directory lookups are
> >>>> >>>>> > missing cache and going to disk).
> >>>> >>>>>
> >>>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
> >>>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit
> >>>> >>>>> this very problem we had only ephemerally set the new
> >>>> >>>>> filestore merge/split values - oops. Here's what started
> >>>> >>>>> happening when we upgraded and restarted a bunch of OSDs:
> >>>> >>>>>
> >>>> >>>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
> >>>> >>>>>
> >>>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it
> >>>> >>>>> about 12:30, then it still took a while to settle.
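To watch for the xs_dir_lookups symptom Blair describes, the counter is
exposed in the kernel's global XFS stats. A rough sketch only, assuming
the usual "dir lookups creates removes getdents" field order of the
/proc/fs/xfs/stat dir line on your kernel; verify the column index
before trusting the numbers:

    # Cumulative counters since boot; the second field of the "dir"
    # line is xs_dir_lookup (directory lookups).
    grep '^dir' /proc/fs/xfs/stat

    # Print the lookup rate every 10 seconds; a sustained jump after OSD
    # restarts or splits suggests dentry/inode cache misses going to disk.
    prev=$(awk '/^dir/ {print $2}' /proc/fs/xfs/stat)
    while sleep 10; do
        cur=$(awk '/^dir/ {print $2}' /proc/fs/xfs/stat)
        echo "$(date +%T)  xs_dir_lookup/s: $(( (cur - prev) / 10 ))"
        prev=$cur
    done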
> >>>> >>>>>
> >>>> >>>>> --
> >>>> >>>>> Cheers,
> >>>> >>>>> ~Blairo
> >>>> >>>>
> >>>> >>>>
> >>>> >>>> --
> >>>> >>>> Christian Balzer           Network/Systems Engineer
> >>>> >>>> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >>>> >>>> http://www.gol.com/
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> --
> >>>> >> Cheers,
> >>>> >> ~Blairo
> >>>
> >>>
> >>> --
> >>> Christian Balzer           Network/Systems Engineer
> >>> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >>> http://www.gol.com/
>

--
Christian Balzer           Network/Systems Engineer
chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com