On Tue, Jun 04, 2013 at 12:43:29PM +1000, Dave Chinner wrote:
> On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
> > On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
> > This will have significant impact
> > on SGI's DMF managed filesystems.
>
> You're concerned about bulkstat performance, then? Bulkstat will CRC
> every inode it reads, so the increase in inode size is the least of
> your worries....
>
> But bulkstat scalability is an unrelated issue to the CRC work,
> especially as bulkstat already needs application provided
> parallelism to scale effectively.

So, I just added a single threaded bulkstat pass to the fsmark
workload by running xfs_fsr across the filesystem to test out what
impact it has. So, 50 million inodes in the directory structure:

                256 byte inodes,        512 byte inodes
                CRCs disabled           CRCs enabled
                ---------------------------------------
wall time       13m34.203s              14m15.266s
sys CPU          7m7.930s                8m52.050s
rate            61,425 inodes/s         58,479 inodes/s
efficiency      116,800 inodes/CPU/s    93,984 inodes/CPU/s

So, really, it's not particularly significant in terms of a
performance differential, and there certainly isn't any significant
problem caused by the larger inodes. For comparison, the 8-way find
workloads:

                256 byte inodes,        512 byte inodes
                CRCs disabled           CRCs enabled
                ---------------------------------------
wall time       5m33.165s               8m18.256s
sys CPU         18m36.731s              22m2.277s
rate            150,055 inodes/s        100,400 inodes/s
efficiency      44,800 inodes/CPU/s     37,800 inodes/CPU/s

Which makes me think something is not right with the bulkstat pass
I've just done. It's way too slow if a find+stat is 2-2.5x faster.
Ah, xfs_fsr only bulkstats 64 inodes at a time. That's right, last
time I did this I used bstat out of xfstests. On a CRC enabled fs:

ninodes         runtime         sys time        read bw (IOPS)
64              14m01s           8m37s
128             11m20s           7m58s          35MB/s (5000)
256              8m53s           7m24s          45MB/s (6000)
512              7m24s           6m28s          55MB/s (7000)
1024             6m37s           5m40s          65MB/s (8000)
2048            10m50s           6m51s          35MB/s (5000)
4096 (default)  26m23s           8m38s

Ask bulkstat for too few or too many inodes at a time, and it all
goes to hell. So, if we get the bulkstat config right, a single
threaded bulkstat is faster than the 8-way find, and a whole lot
more efficient at it. But there is still effectively no performance
differential between 256 byte and 512 byte inodes worth talking
about.

And, FWIW, I just hacked threading into bstat to run a thread per AG
and scan just a single AG per thread. It's not perfect - it counts
some inodes twice (threads * ninodes at most) before it detects that
it's run into the next AG. This is on a 100TB filesystem, so it runs
100 threads. CRC enabled fs:

ninodes         runtime         sys time        read bw (IOPS)
64              1m53s           10m25s          220MB/s (27000)
256             1m52s           10m03s          220MB/s (27000)
1024            1m55s           10m08s          210MB/s (26000)

So when it's threaded, the small request size just doesn't matter -
there's enough IO in flight to drive the system to being IOPS bound,
and that limits performance.

Just to go full circle, the differences between the 256 byte inode,
no-CRC filesystem and the CRC enabled filesystem for a single
threaded bulkstat:

                256 byte inodes,        512 byte inodes
                CRCs disabled           CRCs enabled
                ---------------------------------------
ninodes         1024                    1024
wall time       5m22s                   6m37s
sys CPU         4m46s                   5m40s
bw (IOPS)       40MB/s (5000)           65MB/s (8000)
rate            155,300 inodes/s        126,000 inodes/s
efficiency      174,800 inodes/CPU/s    147,000 inodes/CPU/s

Both follow the same ninodes profile, but there is less IO done for
the 256 byte inode filesystem and throughput is higher. There's no
big surprise there; what does surprise me is that the difference
isn't larger.
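FWIW, the threaded scan is basically just the bulkstat ioctl with a
per-AG starting inode number. A minimal sketch of the idea follows
(this is not the actual bstat hack from xfstests; it assumes the
xfsprogs headers for the bulkstat/geometry ioctls and the usual XFS
inode number layout, with the AG number in the bits above
agblklog + inopblog):

/*
 * Per-AG threaded bulkstat sketch.  NINODES is the batch size being
 * tuned in the tables above.  Each thread starts its scan at the
 * first possible inode number in its AG and stops once bulkstat
 * returns an inode that belongs to the next AG, which is why it can
 * overrun by up to one batch per thread.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>

#define NINODES	1024			/* bulkstat batch size */

static int fsfd;			/* fd on the mount point */
static int aginolog;			/* inum bits below the AG number */

static void *
scan_ag(void *arg)
{
	__u64			agno = (unsigned long)arg;
	__u64			lastino = agno << aginolog;
	__u64			agend = (agno + 1) << aginolog;
	__u64			nscanned = 0;
	__s32			count;
	struct xfs_bstat	buf[NINODES];
	struct xfs_fsop_bulkreq	req = {
		.lastip		= &lastino,
		.icount		= NINODES,
		.ubuffer	= buf,
		.ocount		= &count,
	};

	while (ioctl(fsfd, XFS_IOC_FSBULKSTAT, &req) == 0 && count > 0) {
		for (int i = 0; i < count; i++) {
			if (buf[i].bs_ino >= agend)
				goto done;	/* ran into the next AG */
			nscanned++;		/* "stat" the inode here */
		}
	}
done:
	printf("AG %llu: %llu inodes\n", (unsigned long long)agno,
	       (unsigned long long)nscanned);
	return NULL;
}

int
main(int argc, char **argv)
{
	struct xfs_fsop_geom	geo;
	pthread_t		*tid;

	if (argc != 2)
		return 1;
	fsfd = open(argv[1], O_RDONLY);
	if (fsfd < 0 || ioctl(fsfd, XFS_IOC_FSGEOMETRY, &geo) < 0) {
		perror(argv[1]);
		return 1;
	}

	/* bits for (block within AG) + (inode within block) */
	while ((1U << aginolog) < geo.agblocks)
		aginolog++;
	aginolog += __builtin_ffs(geo.blocksize / geo.inodesize) - 1;

	tid = calloc(geo.agcount, sizeof(*tid));
	for (unsigned long agno = 0; agno < geo.agcount; agno++)
		pthread_create(&tid[agno], NULL, scan_ag, (void *)agno);
	for (unsigned long agno = 0; agno < geo.agcount; agno++)
		pthread_join(tid[agno], NULL);
	return 0;
}

Drop the threads and start lastino at zero, and it's the single
threaded ninodes experiment from the tables above.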
Let's drive it to being I/O bound with threading:

                256 byte inodes,        512 byte inodes
                CRCs disabled           CRCs enabled
                ---------------------------------------
ninodes         256                     256
wall time       1m02s                   1m52s
sys CPU         7m04s                   10m03s
bw (IOPS)       210MB/s (27000)         220MB/s (27000)
rate            806,500 inodes/s        446,500 inodes/s
efficiency      117,900 inodes/CPU/s    82,900 inodes/CPU/s

The 256 byte inode test is completely CPU bound - it can't go any
faster than that - and it just so happens to be pretty close to IO
bound as well. So, while there's double the throughput for 256 byte
inodes, it raises an interesting question: why are all the IOs only
8k in size?

That means the inode readahead that bulkstat is doing is not being
combined in the elevator - it is either being cancelled because
there is too much of it, or it is being dispatched immediately and
so we are being IOPS limited long before we should be. i.e. there's
still 500MB/s of bandwidth available on this filesystem and we're
issuing sequential, adjacent 8k IOs. Either way, it's not
functioning as it should.

<blktrace>

Yup, immediate, explicit unplug and dispatch. No readahead batching,
and the unplug is coming from _xfs_buf_ioapply(). Well, that is easy
to fix:

                256 byte inodes,        512 byte inodes
                CRCs disabled           CRCs enabled
                ---------------------------------------
ninodes         256                     256
wall time       1m02s                   1m08s
sys CPU         7m07s                   8m09s
bw (IOPS)       210MB/s (13500)         360MB/s (14000)
rate            806,500 inodes/s        735,300 inodes/s
efficiency      117,100 inodes/CPU/s    102,200 inodes/CPU/s

So, the difference in performance pretty much goes away. We burn
more bandwidth, but now the multithreaded bulkstat is CPU limited
for both the non-CRC, 256 byte inodes and the CRC enabled 512 byte
inodes.

What this says to me is that there isn't a bulkstat performance
problem that we need to fix, apart from the 3 lines of code for the
readahead IO plugging that I just added. Bulkstat is only limited by
storage IOPS and available CPU power, yet the bandwidth it consumes
is sufficiently low that any storage system SGI installs for DMF is
not going to be stressed by it. IOPS, yes. Bandwidth, no.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
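PS: the shape of that readahead plugging change, as a sketch only
(not the actual xfs_itable.c diff - the existing readahead loop is
elided into a comment here), is just a block plug held around the
loop that issues the inode cluster readahead:

#include <linux/blkdev.h>

	struct blk_plug		plug;

	blk_start_plug(&plug);
	/*
	 * ... existing loop issuing readahead for each inode cluster
	 * buffer in the chunk ...
	 */
	blk_finish_plug(&plug);

With the plug held across the whole readahead pass, the 8k buffer
reads are held and merged on the plug list rather than being
dispatched one buffer at a time, which is where the larger IOs and
the lower IOPS in the table above come from.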