Hi folks,

After listening to Eric swear about the per-cpu counter implementation we have for the in-core superblock all week, I decided the best thing to do would be to simply replace them with generic per-cpu counters. The current icsb counters were implemented long before we had generic counter infrastructure, and it's remained that way because if it ain't broke....

Anyway, we do have a couple of issues with the counters to do with enforcing the maximum inode count on small filesystems. Fixing these problems is what Eric spent time swearing about.

Anyway, to cut a long story short, there is nothing unique about the inode counters - neither the allocated inode count nor the free inode count needs to be accurate at zero as they are not used for ENOSPC enforcement at that limit, and the allocated inode count doesn't need to be perfectly accurate at the maximum count, either. Hence we can just replace them with generic per-cpu counters without second thoughts.

The free block counter is a little different. We need to be able to accurately determine zero free blocks due to ENOSPC detection requirements, and this is where all the complexity came from in the existing infrastructure. The key technique the existing infrastructure uses to be accurate at zero is that it falls back to a global lock and serialisation as the count approaches zero. Hence we trade off scalability for accuracy at ENOSPC.

It turns out we can play the same trick with the generic per-cpu counter infrastructure. The generic counters allow a customised "batch" value, which is the threshold at which the local per-cpu counter is folded back into the global counter. By setting this batch to 1 we effectively serialise all modifications to the counter, as any change will be over the batch fold threshold. Hence we can add a simple check on the global counter value and switch from large batch values to small values as we approach the zero threshold (a rough sketch of how this looks is appended at the end of this mail).

This patchset has passed xfstests with no regressions, and there are no measurable performance impacts on my 16p test VM on inode allocation/freeing intensive workloads, nor on delayed allocation workloads (which reserve a block at a time and hence trigger extremely frequent updates) at IO rates of over 1GB/s. It also fixes the maxicount enforcement issue on small filesystems that started this off...

SGI: this is a change that you are going to want to test for regressions on one of your large machines with multiple GB/s of IO bandwidth. I don't expect there to be any problems, but if there are we might need to tweak batch thresholds based on CPU count......

This patchset is based on for-next, as it is dependent on the superblock logging changes that are already queued for the next cycle.

Diffstat is as follows:

 fs/xfs/libxfs/xfs_bmap.c   |  16 +-
 fs/xfs/libxfs/xfs_format.h |  96 +------
 fs/xfs/libxfs/xfs_ialloc.c |   6 +-
 fs/xfs/libxfs/xfs_sb.c     |  43 +--
 fs/xfs/xfs_fsops.c         |  16 +-
 fs/xfs/xfs_iomap.c         |   3 +-
 fs/xfs/xfs_linux.h         |   9 -
 fs/xfs/xfs_log_recover.c   |   5 +-
 fs/xfs/xfs_mount.c         | 730 ++++++----------------------------------
 fs/xfs/xfs_mount.h         |  67 +----
 fs/xfs/xfs_rtalloc.c       |   6 +-
 fs/xfs/xfs_super.c         | 101 +++++--
 fs/xfs/xfs_super.h         |  83 ++++++
 fs/xfs/xfs_trans.c         |  19 +-
 14 files changed, 309 insertions(+), 891 deletions(-)

Comments, thoughts?

-Dave.
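
P.S. For anyone who wants to see the shape of the batch trick without reading the patches, here's a rough sketch against the generic percpu_counter API. The function name, the batch constant and the thresholds below are made up for illustration - the real code is in the xfs_mount.c changes and also has to deal with things like the reserved block pool - but the idea of shrinking the batch as the global count approaches zero is the same:

	#include <linux/percpu_counter.h>

	#define FDBLOCKS_BATCH	1024		/* illustrative large batch */

	static struct percpu_counter fdblocks;	/* free block count */

	/* Modify the free block count by delta, failing with ENOSPC at zero. */
	static int mod_fdblocks(s64 delta)
	{
		s32 batch;

		/*
		 * The closer the global count gets to zero, the smaller the
		 * batch we use, so that per-cpu deltas are folded back into
		 * the global count (under the counter's internal lock) on
		 * every modification. That serialises updates and keeps the
		 * count accurate right where ENOSPC detection needs it.
		 */
		if (percpu_counter_read(&fdblocks) < 2 * FDBLOCKS_BATCH)
			batch = 1;
		else
			batch = FDBLOCKS_BATCH;

		__percpu_counter_add(&fdblocks, delta, batch);

		/* percpu_counter_compare() does a precise sum near the limit */
		if (percpu_counter_compare(&fdblocks, 0) < 0) {
			/* overshot - undo the change and report ENOSPC */
			/* (error handling heavily simplified here) */
			percpu_counter_add(&fdblocks, -delta);
			return -ENOSPC;
		}
		return 0;
	}

Usage would be percpu_counter_init(&fdblocks, <free blocks from the superblock>, GFP_KERNEL) at mount time, then mod_fdblocks(-n)/mod_fdblocks(n) at allocation/free time.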