2012/1/23 Chris Mason <chris.mason@xxxxxxxxxx>:
> On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
>> On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
>> > As you might know, I have been seeing btrfs slowdowns in our ceph
>> > cluster for quite some time. Even with the latest btrfs code for 3.3
>> > I'm still seeing these problems. To make things reproducible, I've
>> > now written a small test that imitates ceph's behavior:
>> >
>> > On a freshly created btrfs filesystem (2 TB size, mounted with
>> > "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
>> > 100 files. After that I'm doing random writes on these files with a
>> > sync_file_range after each write (each write has a size of 100 bytes)
>> > and an ioctl(BTRFS_IOC_SYNC) after every 100 writes.
>> >
>> > After approximately 20 minutes, write activity suddenly increases
>> > fourfold and the average request size decreases (see chart in the
>> > attachment).
>> >
>> > You can find iostat output here: http://pastebin.com/Smbfg1aG
>> >
>> > I hope that you are able to track down the problem with the test
>> > program in the attachment.
>>
>> Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's
>> tree and formatted the fs with 64k node and leaf sizes, and the problem
>> appeared to go away. So, surprise surprise, fragmentation is biting us
>> in the ass. If you can, try running that branch with 64k node and leaf
>> sizes on your ceph cluster and see how that works out. Of course you
>> should only do that if you don't mind losing everything :). Thanks,
>>
>
> Please keep in mind this branch is only out there for development, and
> it really might have huge flaws. Scrub doesn't work with it correctly
> right now, and the IO error recovery code is probably broken too.
>
> Long term though, I think the bigger block sizes are going to make a
> huge difference in these workloads.
>
> If you use the very dangerous code:
>
> mkfs.btrfs -l 64k -n 64k /dev/xxx
>
> (-l is leaf size, -n is node size).
>
> 64K is the max right now; 32K may help just as much at a lower CPU cost.

Thanks for taking a look. I'm glad to hear that there is a solution on
the horizon, but I'm not brave enough to try this on our ceph cluster.
I'll try it when the code has stabilized a bit.

Regards,
Christian
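
For reference, the workload Christian describes can be sketched roughly as
below. This is not the attached test program: the file names, the assumed
4 MB per-file size, the endless loop, and the minimal error handling are
all assumptions made for illustration; only the 100 files, 100-byte writes,
sync_file_range after each write, and BTRFS_IOC_SYNC every 100 writes come
from the description above.

#define _GNU_SOURCE             /* for sync_file_range() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* from the btrfs ioctl ABI: BTRFS_IOCTL_MAGIC is 0x94, SYNC is nr 8 */
#ifndef BTRFS_IOC_SYNC
#define BTRFS_IOC_SYNC _IO(0x94, 8)
#endif

#define NUM_FILES   100
#define WRITE_SIZE  100
#define FILE_SIZE   (4 * 1024 * 1024)   /* assumed per-file size */

int main(void)
{
	int fds[NUM_FILES];
	char buf[WRITE_SIZE];
	char name[64];
	unsigned long writes = 0;
	int i;

	memset(buf, 'x', sizeof(buf));

	/* open (and create) 100 files in the current directory on btrfs */
	for (i = 0; i < NUM_FILES; i++) {
		snprintf(name, sizeof(name), "testfile.%03d", i);
		fds[i] = open(name, O_CREAT | O_RDWR, 0644);
		if (fds[i] < 0) {
			perror("open");
			return 1;
		}
	}

	/* run until interrupted; the slowdown shows up after ~20 minutes */
	for (;;) {
		int f = rand() % NUM_FILES;
		off_t off = (off_t)(rand() % (FILE_SIZE / WRITE_SIZE)) * WRITE_SIZE;

		/* 100-byte write at a random offset in a random file ... */
		if (pwrite(fds[f], buf, WRITE_SIZE, off) != WRITE_SIZE) {
			perror("pwrite");
			return 1;
		}

		/* ... followed by sync_file_range() on that range */
		if (sync_file_range(fds[f], off, WRITE_SIZE,
				    SYNC_FILE_RANGE_WRITE) < 0) {
			perror("sync_file_range");
			return 1;
		}

		/* every 100 writes, force a full btrfs sync */
		if (++writes % 100 == 0)
			ioctl(fds[f], BTRFS_IOC_SYNC);
	}

	return 0;
}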