Topic: SSD Optimised allocation policies

Scope:
	Performance
	Storage efficiency

Proposal:

Non-rotational storage is typically very fast. Our allocation policies
are all, fundamentally, based on very slow storage which has extremely
high latency between IO to different LBA regions. We burn CPU to
optimise for minimal seeks to minimise the expensive physical movement
of disk heads and platter rotation.

We know when the underlying storage is solid state - there's a
"non-rotational" field in the block device config that tells us the
storage doesn't need physical seek optimisation. We should make use of
that.

My proposal is that we look towards arranging the filesystem allocation
policies into CPU-optimised silos. We start by making filesystems on
SSDs with AG counts that are multiples of the CPU count in the system
(e.g. 4x the number of CPUs) to drive parallelism at the allocation
level, and then associate allocation groups with specific CPUs in the
system. Hence each CPU has a set of allocation groups it selects
between for the operations that are run on it, and so allocation is
typically local to a specific CPU.

Optimisation proceeds from the basis of CPU locality optimisation, not
storage locality optimisation. What this allows is processes on
different CPUs to never contend for allocation resources. Locality of
objects just doesn't matter for solid state storage, so we gain nothing
by trying to group inodes, directories, their metadata and data
physically close together. We want writes that happen at the same time
to be physically close together so we aggregate them into larger IOs,
but we really don't care about optimising write locality for best read
performance (i.e. must be contiguous for sequential access) for this
storage.

Further, we can look at faster allocation strategies - we don't need to
find the "nearest" if we don't have a contiguous free extent to
allocate into, we just want the one that costs the least CPU to find.
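
As a rough illustration of the silo idea above, here is a minimal Python
sketch, not real filesystem code. It assumes the Linux sysfs file
/sys/block/<dev>/queue/rotational as the source of the "non-rotational"
flag, and it assumes a simple strided CPU-to-AG mapping. The function
names (is_non_rotational, ag_count_for, ags_for_cpu, pick_ag) and the
first-fit policy are hypothetical, chosen only to show "cheapest to
find" rather than "nearest".

```python
from pathlib import Path


def is_non_rotational(dev, sysfs_root="/sys/block"):
    """Return True if the kernel reports `dev` (e.g. "sda") as non-rotational.

    Linux exposes this as <sysfs_root>/<dev>/queue/rotational, where "0"
    means non-rotational and "1" means rotational.
    """
    flag = Path(sysfs_root) / dev / "queue" / "rotational"
    try:
        return flag.read_text().strip() == "0"
    except OSError:
        # Unknown device: assume rotational and keep seek-optimised policies.
        return False


def ag_count_for(ncpus, ags_per_cpu=4):
    """AG count as a multiple of the CPU count (4x is the post's example)."""
    return ncpus * ags_per_cpu


def ags_for_cpu(cpu, ncpus, agcount):
    """Assumed static mapping: CPU c owns AGs c, c + ncpus, c + 2*ncpus, ..."""
    return list(range(cpu, agcount, ncpus))


def pick_ag(cpu, ncpus, free_blocks, want):
    """Return the first CPU-local AG with at least `want` free blocks, or None.

    First fit stands in for "costs the least CPU to find"; there is no
    search for the nearest or best-fitting free extent.
    """
    for ag in ags_for_cpu(cpu, ncpus, len(free_blocks)):
        if free_blocks[ag] >= want:
            return ag
    return None
```

With 4 CPUs and 16 AGs, ags_for_cpu(1, 4, 16) gives [1, 5, 9, 13], and
no two CPUs share an AG, so allocations issued from different CPUs never
touch the same allocation group in this model.
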
CPU cost is what matters here because solid state storage is so fast
that filesystem performance is CPU limited, not storage limited. Hence
we need to think about allocation policies differently and start
optimising them for minimum CPU expenditure rather than best layout.

Other things to discuss include:
	- how do we convert metadata structures to write-once style
	  behaviour rather than overwrite in place?
	- extremely large block sizes for metadata (e.g. 4MB) to align
	  better with SSD erase block sizes
	- what parts of the allocation algorithms don't we need?
	- are we better off with huge numbers of small AGs rather than
	  fewer large AGs?

--
Dave Chinner
david@xxxxxxxxxxxxx