Re: Hammer to Jewel Upgrade - Extreme OSD Boot Time

I'll document the resolution here for anyone else who experiences similar issues.

We have determined that the root cause of the long boot time was a combination of factors involving the ZFS version and its tuning, together with how long filenames are handled.

## 1 ## Insufficient ARC cache size. 

Dramatically increasing arc_max and arc_meta_limit allowed much better performance once the cache had time to populate. Previously, each getxattr call took about 8 ms (0.008 sec). Multiplied by the millions of getxattr calls made during OSD daemon startup, this added up to hours. This only became apparent when we upgraded to Jewel: Hammer does not appear to parse all of the extended attributes during startup; this behavior appears to have been introduced in Jewel as part of the sortbitwise algorithm.
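
For anyone who wants to verify per-call latency on their own OSDs without attaching strace, a minimal Python sketch along the following lines can time individual getxattr calls. The path below is a placeholder; substitute a real object file from the OSD's current/ directory and the user.cephos.lfn3 attribute shown in the strace output further down.

import os
import time

# Placeholder path; point this at a real object file under the OSD's
# current/ directory.  user.cephos.lfn3 is the attribute seen in strace.
path = "/osd/9/current/<pg>_head/<object-file>"
attr = "user.cephos.lfn3"

start = time.perf_counter()
try:
    os.getxattr(path, attr)
except OSError:
    pass  # ENODATA etc. still exercises the same metadata lookup
print("getxattr took %.6f sec" % (time.perf_counter() - start))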

Increasing arc_max and arc_meta_limit allowed more of the metadata to be cached in memory. This reduced getxattr call duration to between 10 and 100 microseconds (0.00001 to 0.0001 sec), on average around 400x faster.
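
The right values depend on how much RAM the node has. As a rough sketch (assuming ZFS on Linux, which exposes these as writable module parameters under /sys/module/zfs/parameters), the limits can be raised at runtime like this; the sizes below are purely illustrative:

# Rough sketch; run as root.  The 64/48 GiB figures are illustrative only.
ARC_MAX        = 64 * 1024**3   # total ARC size cap
ARC_META_LIMIT = 48 * 1024**3   # portion of the ARC allowed for metadata

def set_zfs_param(name, value):
    # Writable ZFS-on-Linux module parameters live under this sysfs path.
    with open("/sys/module/zfs/parameters/" + name, "w") as f:
        f.write(str(value))

set_zfs_param("zfs_arc_max", ARC_MAX)
set_zfs_param("zfs_arc_meta_limit", ARC_META_LIMIT)

# To make the change persistent, the same values can go in
# /etc/modprobe.d/zfs.conf, e.g.:
#   options zfs zfs_arc_max=68719476736 zfs_arc_meta_limit=51539607552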

## 2 ## ZFS version 0.6.5.11 and the inability to store large amounts of metadata in the inode/dnode.

My understanding is that the ability to use a larger dnode size to store metadata was not introduced until ZFS version 0.7.x. In version 0.6.5.11 this caused large quantities of metadata to be stored in inefficient spill blocks, which took longer to access because they were not cached due to the (previously) undersized ARC settings.
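
For reference, on 0.7.x the dnode size is controlled by a per-dataset property. A quick sketch of checking and enabling it follows; the dataset name is hypothetical, and the change only applies to newly written files:

import subprocess

DATASET = "osd/9"   # hypothetical pool/dataset name; substitute your own

# dnodesize only exists as a property on ZFS >= 0.7.x.
subprocess.run(["zfs", "get", "dnodesize", DATASET], check=True)

# Larger dnodes let xattr metadata stay in the dnode instead of spilling.
# Existing files keep their old dnode size, so they would need to be
# rewritten (or the OSD refilled) to benefit.
subprocess.run(["zfs", "set", "dnodesize=auto", DATASET], check=True)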

## Summary ##

Increasing the ARC cache settings improved performance, but performance will still be a concern whenever the ARC is purged/flushed, such as during a system reboot, until the cache rebuilds itself.
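
One way to watch the cache rebuild after a reboot is to read the ARC counters that ZFS on Linux exports in /proc/spl/kstat/zfs/arcstats. A small sketch (field names assumed from the stock arcstats output):

# Minimal sketch for watching ARC warm-up; assumes ZFS on Linux, which
# exports its counters in /proc/spl/kstat/zfs/arcstats.
def arcstats():
    stats = {}
    with open("/proc/spl/kstat/zfs/arcstats") as f:
        for line in f.readlines()[2:]:        # skip the two kstat header lines
            name, _type, value = line.split()
            stats[name] = int(value)
    return stats

s = arcstats()
print("ARC size      : %d of %d bytes" % (s["size"], s["c_max"]))
print("ARC meta used : %d of %d bytes" % (s["arc_meta_used"], s["arc_meta_limit"]))
total = s["hits"] + s["misses"]
if total:
    print("ARC hit rate  : %.1f%%" % (100.0 * s["hits"] / total))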

Upgrading to ZFS version 0.7.x is one potential upgrade path, since it allows a larger dnode size. Another is switching to XFS, which is the recommended filesystem for Ceph. XFS does not appear to need this kind of metadata caching because it handles metadata in the inode differently.


--------------------------------------------------
Chris


From: Willem Jan Withagen <wjw@xxxxxxxxxxx>
Sent: Wednesday, November 1, 2017 4:51:52 PM
To: Chris Jones; Gregory Farnum
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Hammer to Jewel Upgrade - Extreme OSD Boot Time
 
On 01/11/2017 18:04, Chris Jones wrote:
> Greg,
>
> Thanks so much for the reply!
>
> We are not clear on why ZFS is behaving poorly under some circumstances
> on getxattr system calls, but that appears to be the case.
>
> Since the last update we have discovered that back-to-back booting of
> the OSD yields very fast boot time, and very fast getxattr system calls.
>
> A longer period between boots (or perhaps related to influx of new data)
> correlates to longer boot duration. This is due to slow getxattr calls
> of certain types.
>
> We suspect this may be a caching or fragmentation issue with ZFS for
> xattrs. Use of longer filenames appears to make this worse.

As far as I understand, a lot of this data is stored in the metadata,
which is (or can be) a different set in the (L2)ARC cache.

So are you talking about an OSD reboot, or a system reboot?
I don't quite understand what you mean by back-to-back...

I have little experience with ZFS on Linux, so it is hard for me
to tell whether the behaviour there is different.

If you are rebooting the OSD, I can imagine that certain sequences of
rebooting pre-load the meta-cache. Reboots further apart can lead to a
different working set in the ZFS caches, and then all the data needs to
be refetched instead of coming from the L2ARC.

And note that in newer ZFS versions the in-memory ARC can even be
compressed, leading to an even higher hit rate.

For example, on my development server with 32 GB of memory:
ARC: 20G Total, 1905M MFU, 16G MRU, 70K Anon, 557M Header, 1709M Other
      17G Compressed, 42G Uncompressed, 2.49:1 Ratio

--WjW
>
> We experimented on some OSDs with swapping over to XFS as the
> filesystem, and the problem does not appear to be present on those OSDs.
>
> The two examples below are representative of a Long Boot (longer running
> time and more data influx between OSD reboots) and a Short Boot, where
> we booted the same OSD back to back.
>
> Notice the drastic difference in time on the getxattr that yields the
> ENODATA return. Around 0.009 secs for the "long boot" and 0.0002 secs when
> the same OSD is booted back to back. Long boot time is approx 40x to 50x
> longer. Multiplied by thousands of getxattr calls, this is/was our
> source of longer boot time.
>
> We are considering a full switch to XFS, but would love to hear any ZFS
> tuning tips that might be a short term workaround.
>
> We are using ZFS 0.6.5.11, prior to the implementation of the ability to use
> large dnodes which would allow the use of dnodesize=auto.
>
> #Long Boot
> <0.000044>[pid 3413902] 13:08:00.884238
> getxattr("/osd/9/current/20.86bs3_head/default.34597.7\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebana_1d9e1e82d623f49c994f_0_long",
> "user.cephos.lfn3",
> "default.34597.7\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememamboptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememambo-92d9df789f9aaf007c50c50bb66e70af__head_0177C86B__14_ffffffffffffffff_3",
> 1024) = 616 <0.000044>
> <0.008875>[pid 3413902] 13:08:00.884476
> getxattr("/osd/9/current/20.86bs3_head/default.34597.57\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_79a7acf2d32f4302a1a4_0_long",
> "user.cephos.lfn3-alt", 0x7f849bf95180, 1024) = -1 ENODATA (No data
> available) <0.008875>
>
> #Short Boot
> <0.000015> [pid 3452111] 13:37:18.604442
> getxattr("/osd/9/current/20.15c2s3_head/default.34597.22\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_efb8ca13c57689d76797_0_long",
> "user.cephos.lfn3",
> "default.34597.22\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememamboptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememambo-b519f8607a3d9de0f815d18b6905b27d__head_9726F5C2__14_ffffffffffffffff_3",
> 1024) = 617 <0.000015>
> <0.000018> [pid 3452111] 13:37:18.604546
> getxattr("/osd/9/current/20.15c2s3_head/default.34597.66\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_0e6d86f58e03d0f6de04_0_long",
> "user.cephos.lfn3-alt", 0x7fd4e8017680, 1024) = -1 ENODATA (No data
> available) <0.000018>
>
>
> --------------------------------------------------
> Christopher J. Jones
>
> ------------------------------------------------------------------------
> *From:* Gregory Farnum <gfarnum@xxxxxxxxxx>
> *Sent:* Monday, October 30, 2017 6:20:15 PM
> *To:* Chris Jones
> *Cc:* ceph-users@xxxxxxxxxxxxxx
> *Subject:* Re: Hammer to Jewel Upgrade - Extreme OSD Boot Time
> On Thu, Oct 26, 2017 at 11:33 AM Chris Jones <chris.jones@xxxxxx
> <mailto:chris.jones@xxxxxx>> wrote:
>
>     The long running functionality appears to be related to
>     clear_temp_objects(); from OSD.cc called from init().
>
>
>     What is this functionality intended to do? Is it required to be run
>     on every OSD startup? Any configuration settings that would help
>     speed this up?
>
>
>
> This function looks like it's cleaning up temporary objects that might
> have been left behind. Basically, we are scanning through the objects
> looking for temporaries, but we stop doing so once we hit a non-temp
> object (implying they are ordered). So in the common case I think we're
> doing one listing in each PG, finding there are no temp objects (or
> noting the few that remain), and then advancing to the next PG. This
> will take a little time as we're basically doing one metadata listing
> per PG, but that should end quickly.
>
> I'm curious why this is so slow for you as I'm not aware of anybody else
> reporting such issues. I suspect the ZFS backend is behaving rather
> differently than the others, or that you've changed the default config
> options dramatically, so that your OSDs have to do a much larger listing
> in order to return the sorted list the OSD interface requires.
> -Greg
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
