Hello Blair,

On Mon, 20 Jun 2016 09:21:27 +1000 Blair Bethwaite wrote:

> Hi Wade,
>
> (Apologies for the slowness - AFK for the weekend).
>
> On 16 June 2016 at 23:38, Wido den Hollander <wido@xxxxxxxx> wrote:
> >
> >> On 16 June 2016 at 14:14, Wade Holler <wade.holler@xxxxxxxxx> wrote:
> >>
> >>
> >> Hi All,
> >>
> >> I have a repeatable condition: when the object count in a pool gets to
> >> 320-330 million, the object write time dramatically and almost
> >> instantly increases by as much as 10X, exhibited by fs_apply_latency
> >> going from 10ms to 100s of ms.
> >
> > My first guess is the filestore splitting and the number of files per
> > directory.
>
> I concur with Wido and suggest you try upping your filestore split and
> merge threshold config values.
>
This is probably a good idea, but as mentioned/suggested below, it is
something that would eventually settle down at a new equilibrium, which I
don't think is what is happening here.

> I've seen this issue a number of times now with write-heavy workloads,
> and would love to at least write some docs about it, because it must
> happen to a lot of users running RBD workloads on largish drives.
> However, I'm not sure how to definitively diagnose the issue and
> pinpoint the problem. The gist of the issue is the number of files
> and/or directories on your OSD filesystems: at some system-dependent
> threshold you get to a point where you can no longer sufficiently
> cache inodes and/or dentries, so IOs on those files(ystems) have to
> incur extra disk IOPS to read the filesystem structure from disk (I
> believe that's the small read IO you're seeing, and unfortunately it
> seems to effectively choke writes - we've seen all sorts of related
> slow request issues). If you watch your xfs stats you'll likely get
> further confirmation. In my experience xs_dir_lookups balloons (which
> means directory lookups are missing cache and going to disk).
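
A rough sketch of how one might watch that directory lookup counter, in
Python. It assumes the usual layout of /proc/fs/xfs/stat, where the line
starting with "dir" carries four counters in the order lookup, create,
remove, getdents; the function names are made up for illustration, and
what counts as "ballooning" is something you would have to judge against
your own baseline.

```python
import time


def parse_xfs_dir_stats(stat_text):
    """Return the counters on the 'dir' line of XFS stat text as a dict.

    Assumed field order: lookup, create, remove, getdents.
    """
    names = ("lookup", "create", "remove", "getdents")
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "dir":
            return dict(zip(names, (int(v) for v in fields[1:5])))
    raise ValueError("no 'dir' line found in XFS stat text")


def lookup_rate(before, after, seconds):
    """Directory lookups per second between two parsed samples."""
    return (after["lookup"] - before["lookup"]) / seconds


def sample_lookup_rate(path="/proc/fs/xfs/stat", interval=10.0):
    """Read the stat file twice, `interval` seconds apart, and return the rate."""
    with open(path) as f:
        before = parse_xfs_dir_stats(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_xfs_dir_stats(f.read())
    return lookup_rate(before, after, interval)
```

Calling sample_lookup_rate() on an OSD host before and after the latency
jump would give two numbers to compare. Note that these counters are
global to all XFS filesystems on the host, not per OSD.
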
>
> What I'm not clear on is whether there are two different pathologies
> at play here, i.e., specifically dentry cache issues versus inode
> cache issues. In the former case making Ceph's directory structure
> shallower with more files per directory may help (or perhaps
> increasing the number of PGs - more top-level directories), but in the
> latter case you're likely to need various system tuning (lower vfs
> cache pressure, more memory?, fewer files (larger object size))
> depending on your workload.
>
I can very much confirm this from the days when, on my main production
cluster, all 1024 PGs (but only about 6GB of data and 1.6 million objects)
were on just 4 OSDs (25TB each).

Once SLAB ran out of steam and couldn't hold all the respective entries
(Ext4 here, but same diff), things became very slow.

My litmus test is that a "ls -R /var/lib/ceph/osd/ceph-nn/ >/dev/null"
should be pretty much instantaneous and not have to access the disk at
all.

More RAM and proper tuning, as well as smaller OSDs, are all ways forward
to alleviate/prevent this issue.

It would be interesting to see/know how bluestore fares in this kind of
scenario.

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
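
P.S. The "ls -R" litmus test can also be sketched in Python if you want a
number rather than a feeling. This is only an approximation of what ls
does: it walks the tree, counts directory entries and reports the elapsed
time. The function name is made up for illustration, and what counts as
"pretty much instantaneous" depends on your hardware and file counts.

```python
import os
import time


def time_tree_walk(root):
    """Walk `root` recursively; return (entries_seen, elapsed_seconds).

    os.walk skips directories it cannot read unless an onerror handler
    is supplied, so run this as a user that can read the OSD data.
    """
    start = time.monotonic()
    entries = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        entries += len(dirnames) + len(filenames)
    return entries, time.monotonic() - start
```

Running it twice in a row against an OSD data directory and comparing the
two timings gives a hint: if the second run is not dramatically faster
than the first, the dentries and inodes are probably not staying cached.
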