> FWIW, I often see performance increase when favoring inode/dentry cache,
> but probably with far fewer inodes than the setup you just saw. It sounds
> like there needs to be some maximum limit on the inode/dentry cache to
> prevent this kind of behavior but still favor it up until that point.
> Having said that, maybe avoiding syncfs is best as you say below.

We see this in most cases as well. We usually set vfs_cache_pressure to 1
(prefer inodes/dentries) as a tuning BKM for small-file storage. Could we
work around the syncfs(2) cost by enlarging the FDCache and setting
/proc/sys/vm/vfs_cache_pressure to 100 (prefer data pages)? That is, use
the FDCache in place of the inode/dentry cache as much as possible. A
concrete sketch of this tuning, and of the f[data]sync alternative
discussed below, is at the end of this mail.

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> Sent: Thursday, August 6, 2015 5:56 AM
> To: Sage Weil; Somnath.Roy@xxxxxxxxxxx
> Cc: ceph-devel@xxxxxxxxxxxxxxx; sjust@xxxxxxxxxx
> Subject: Re: FileStore should not use syncfs(2)
>
> On 08/05/2015 04:26 PM, Sage Weil wrote:
> > Today I learned that syncfs(2) does an O(n) search of the superblock's
> > inode list searching for dirty items. I've always assumed that it was
> > only traversing dirty inodes (e.g., a list of dirty inodes), but that
> > appears not to be the case, even on the latest kernels.
> >
> > That means that the more RAM in the box, the larger (generally) the
> > inode cache, the longer syncfs(2) will take, and the more CPU you'll
> > waste doing it. The box I was looking at had 256GB of RAM, 36 OSDs,
> > and a load of ~40 servicing a very light workload, and each syncfs(2)
> > call was taking ~7 seconds (usually to write out a single inode).
> >
> > A possible workaround for such boxes is to turn
> > /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors
> > caching pages instead of inodes/dentries)...
>
> FWIW, I often see performance increase when favoring inode/dentry cache,
> but probably with far fewer inodes than the setup you just saw. It sounds
> like there needs to be some maximum limit on the inode/dentry cache to
> prevent this kind of behavior but still favor it up until that point.
> Having said that, maybe avoiding syncfs is best as you say below.
>
> > I think the take-away though is that we do need to bite the bullet and
> > make FileStore f[data]sync all the right things so that the syncfs
> > call can be avoided. This is the path you were originally headed
> > down, Somnath, and I think it's the right one.
> >
> > The main thing to watch out for is that according to POSIX you really
> > need to fsync directories. With XFS that isn't the case since all
> > metadata operations are going into the journal and that's fully
> > ordered, but we don't want to allow data loss on e.g. ext4 (we need
> > to check what the metadata ordering behavior is there) or other file
> > systems.
> >
> > :(
> >
> > sage
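
For concreteness, the tuning I am asking about above might look like the
following. The values are illustrative only, and I am assuming the FDCache
size is controlled by the "filestore fd cache size" option; please correct
me if that is the wrong knob.

    # Kernel: move vfs_cache_pressure from our usual 1 (strongly retain
    # inodes/dentries) back to 100, the default, which reclaims them at
    # a "fair" rate relative to the page cache.
    sysctl vm.vfs_cache_pressure=100

    # ceph.conf: enlarge the FileStore FDCache so cached open fds stand
    # in for cached dentries/inodes (illustrative value, not a
    # recommendation).
    [osd]
        filestore fd cache size = 16384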
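
On Sage's point about making FileStore f[data]sync all the right things:
a minimal sketch of per-file syncing in place of syncfs(2) is below. The
helper name is hypothetical and error handling is abbreviated; this is
not FileStore's actual code.

    /* Make one object write durable without syncfs(2).  fdatasync()
     * flushes this file's dirty data plus whatever metadata is needed
     * to retrieve it (e.g. the file size), so the cost no longer
     * scales with the number of cached inodes on the filesystem. */
    #include <sys/types.h>
    #include <unistd.h>

    int write_object_durably(int fd, const void *buf, size_t len, off_t off)
    {
        ssize_t n = pwrite(fd, buf, len, off);
        if (n < 0 || (size_t)n != len)  /* short-write retry elided */
            return -1;
        return fdatasync(fd);
    }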
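
And on the POSIX caveat about directories: a create or rename only
becomes durable once the containing directory is synced. XFS's ordered
metadata journaling happens to give us this, but ext4 and others may
not. The pattern would be roughly the following (again a hypothetical
sketch, not actual FileStore code):

    /* Rename a finished object into place and persist the directory
     * entry itself.  Without the directory fsync, POSIX allows the new
     * name to vanish after a crash even though the file's data was
     * already fdatasync'ed. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int rename_durably(const char *dirpath, const char *tmp, const char *dst)
    {
        if (rename(tmp, dst) < 0)
            return -1;
        int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
        if (dfd < 0)
            return -1;
        int rc = fsync(dfd);
        close(dfd);
        return rc;
    }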