On 2/15/16 9:18 AM, David Casier wrote:
> Hi Dave,
> 1TB is very large for an SSD.
> Example with only 10GiB:
> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/

It wouldn't be too hard to modify the inode32 restriction to a lower
threshold, I think, if it would really be useful.

On the other hand, 10GiB seems awfully small. What are realistic sizes
for this use case?

-Eric

> 2015-12-08 5:46 GMT+01:00 Dave Chinner <dchinner@xxxxxxxxxx>:
>> On Fri, Dec 04, 2015 at 03:12:25PM -0500, Ric Wheeler wrote:
>>> On 12/01/2015 05:02 PM, Sage Weil wrote:
>>>> Hi David,
>>>>
>>>> On Tue, 1 Dec 2015, David Casier wrote:
>>>>> Hi Sage,
>>>>> With a standard disk (4 to 6 TB) and a small flash drive, it's easy
>>>>> to create an ext4 FS with the metadata on flash.
>>>>>
>>>>> Example with sdg1 on flash and sdb on HDD:
>>>>>
>>>>> size_of() {
>>>>>     blockdev --getsize $1
>>>>> }
>>>>>
>>>>> mkdmsetup() {
>>>>>     _ssd=/dev/$1
>>>>>     _hdd=/dev/$2
>>>>>     _size_of_ssd=$(size_of $_ssd)
>>>>>     echo "0 $_size_of_ssd linear $_ssd 0
>>>>> $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
>>>>> }
>>
>> So this is just a linear concatenation that relies on ext4 putting
>> all its metadata at the front of the filesystem?
>>
>>>>> mkdmsetup sdg1 sdb
>>>>>
>>>>> mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode \
>>>>>     -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 \
>>>>>     -i $((1024*512)) /dev/mapper/dm-sdg1-sdb
>>>>>
>>>>> With that, all metadata blocks are on the SSD.
>>
>> IIRC, it's the "packed_meta_blocks=1" that does this.
>>
>> This is something that is pretty trivial to do with XFS, too,
>> by use of the inode32 allocation mechanism. That reserves the
>> first TB of space for inodes and other metadata allocations,
>> so if you span the first TB with SSDs, you get almost all the
>> metadata on the SSDs, and all the data in the higher AGs. With the
>> undocumented log location mkfs option, you can also put the log at
>> the start of AG 0, which means that would sit on the SSD, too,
>> without needing an external log device.
>>
>> SGI even had a mount option hack to limit this allocator behaviour
>> to a block limit lower than 1TB so they could limit the metadata AG
>> regions to, say, the first 200GB.
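
For anyone who wants to experiment with the XFS layout Dave describes, I
think it comes out roughly like the sketch below. The device names and
sizes are made up, and -l agnum=0 is my guess at the log-placement option
he mentions, so treat this as untested:

# Assumed devices: enough flash to cover the metadata region, plus a big HDD.
SSD=/dev/sdg1
HDD=/dev/sdb

# Same linear concatenation trick as the ext4 example above, SSD first,
# so the low block ranges (and therefore the first AGs) live on flash.
SSD_SECTORS=$(blockdev --getsz $SSD)
HDD_SECTORS=$(blockdev --getsz $HDD)
dmsetup create dm-ssd-hdd <<EOF
0 $SSD_SECTORS linear $SSD 0
$SSD_SECTORS $HDD_SECTORS linear $HDD 0
EOF

# Put the internal log in AG 0 so it sits on the SSD as well (my guess at
# the log-location option Dave mentions).
mkfs.xfs -l agnum=0 /dev/mapper/dm-ssd-hdd

# inode32 keeps inodes and most metadata allocations below 1TB, i.e. on
# the SSD portion of the concatenation; file data goes to the higher AGs.
mount -o inode32 /dev/mapper/dm-ssd-hdd /mnt/osd0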

>>>> This is coincidentally what I've been working on today. So far I've just
>>>> added the ability to put the rocksdb WAL on a second device, but it's
>>>> super easy to push rocksdb data there as well (and have it spill over onto
>>>> the larger, slower device if it fills up). Or to put the rocksdb WAL on a
>>>> third device (e.g., expensive NVMe or NVRAM).
>>
>> I have old bits and pieces from 7-8 years ago that would allow some
>> application control of allocation policy to allow things like this
>> to be done, but I left SGI before it was anything more than just a
>> proof of concept....
>>
>>>> See this ticket for the ceph-disk tooling that's needed:
>>>>
>>>> http://tracker.ceph.com/issues/13942
>>>>
>>>> I expect this will be more flexible and perform better than the ext4
>>>> metadata option, but we'll need to test on your hardware to confirm!
>>>>
>>>> sage
>>>
>>> I think that XFS "realtime" subvolumes are the thing that does this
>>> - the second volume contains only the data (no metadata).
>>>
>>> I seem to recall that it is popular historically with video
>>> appliances, etc., but it is not commonly used.
>>
>> Because it's a single threaded allocator. It's not suited to highly
>> concurrent applications, just applications that require large
>> extents allocated in a deterministic manner.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> dchinner@xxxxxxxxxx
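
Going back up-thread to Sage's point about splitting the rocksdb data and
WAL onto faster devices: for reference, that split ends up looking roughly
like the snippet below. The bluestore_* option names are how I recall this
being exposed in later BlueStore code, and the device paths are examples,
so treat both as assumptions rather than what the branch in the ticket does:

# Sketch only; appends example settings to the OSD section of ceph.conf.
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
# large, slow data device
bluestore_block_path = /dev/sdb
# rocksdb data on flash, spilling over to the slow device if it fills up
bluestore_block_db_path = /dev/sdg1
# rocksdb WAL on a third, faster device (NVMe / NVRAM)
bluestore_block_wal_path = /dev/nvme0n1p1
EOF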