Hi Dave,

1 TB is very large for the SSD portion. Example with only 10 GiB:
https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/

2015-12-08 5:46 GMT+01:00 Dave Chinner <dchinner@xxxxxxxxxx>:
> On Fri, Dec 04, 2015 at 03:12:25PM -0500, Ric Wheeler wrote:
>> On 12/01/2015 05:02 PM, Sage Weil wrote:
>> >Hi David,
>> >
>> >On Tue, 1 Dec 2015, David Casier wrote:
>> >>Hi Sage,
>> >>With a standard disk (4 to 6 TB) and a small flash drive, it's easy
>> >>to create an ext4 FS with metadata on flash.
>> >>
>> >>Example with sdg1 on flash and sdb on HDD:
>> >>
>> >>size_of() {
>> >>    blockdev --getsize $1
>> >>}
>> >>
>> >>mkdmsetup() {
>> >>    _ssd=/dev/$1
>> >>    _hdd=/dev/$2
>> >>    _size_of_ssd=$(size_of $_ssd)
>> >>    echo "0 $_size_of_ssd linear $_ssd 0
>> >>$_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
>> >>}
>
> So this is just a linear concatenation that relies on ext4 putting
> all its metadata at the front of the filesystem?
>
>> >>mkdmsetup sdg1 sdb
>> >>
>> >>mkfs.ext4 \
>> >>    -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode \
>> >>    -E packed_meta_blocks=1,lazy_itable_init=0 \
>> >>    -G 32768 -I 128 -i $((1024*512)) /dev/mapper/dm-sdg1-sdb
>> >>
>> >>With that, all metadata blocks are on the SSD.
>
> IIRC, it's the "packed_meta_blocks=1" option that does this.
>
> This is something that is pretty trivial to do with XFS, too,
> by using the inode32 allocation mechanism. That reserves the
> first TB of space for inodes and other metadata allocations,
> so if you span the first TB with SSDs, you get almost all the
> metadata on the SSDs and all the data in the higher AGs. With the
> undocumented log location mkfs option, you can also put the log at
> the start of AG 0, which means it would sit on the SSD, too,
> without needing an external log device.
>
> SGI even had a mount option hack to limit this allocator behaviour
> to a block limit lower than 1 TB so they could limit the metadata AG
> regions to, say, the first 200 GB.
>
>> >This is coincidentally what I've been working on today. So far I've just
>> >added the ability to put the rocksdb WAL on a second device, but it's
>> >super easy to push rocksdb data there as well (and have it spill over onto
>> >the larger, slower device if it fills up). Or to put the rocksdb WAL on a
>> >third device (e.g., expensive NVMe or NVRAM).
>
> I have old bits and pieces from 7-8 years ago that would allow some
> application control of allocation policy to allow things like this
> to be done, but I left SGI before it was anything more than just a
> proof of concept...
>
>> >See this ticket for the ceph-disk tooling that's needed:
>> >
>> >  http://tracker.ceph.com/issues/13942
>> >
>> >I expect this will be more flexible and perform better than the ext4
>> >metadata option, but we'll need to test on your hardware to confirm!
>> >
>> >sage
>>
>> I think that XFS "realtime" subvolumes are the thing that does this
>> - the second volume contains only the data (no metadata).
>>
>> I seem to recall that it was popular historically with video
>> appliances, etc., but it is not commonly used.
>
> Because it's a single-threaded allocator. It's not suited to highly
> concurrent applications, just applications that require large
> extents allocated in a deterministic manner.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> dchinner@xxxxxxxxxx
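
As a quick sanity check on the ext4 layout above, dumpe2fs shows where each
group's bitmaps and inode tables ended up; they should all be at low block
numbers, inside the SSD segment (untested sketch, assuming the dm-sdg1-sdb
device created by the script above):

# dumpe2fs reports filesystem blocks (4 KiB by default) while the
# dmsetup table is in 512-byte sectors, so divide the SSD sector
# count by 8 when comparing against these block numbers.
dumpe2fs /dev/mapper/dm-sdg1-sdb 2>/dev/null | \
    grep -E 'Block bitmap|Inode bitmap|Inode table'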
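
To make Dave's inode32 suggestion concrete, here is a rough, untested sketch
of the same dm-linear trick with XFS. The device names and mount point are
illustrative assumptions; inode32 confines inode/metadata allocation to
roughly the first 1 TB of the address space, so the flash segment would need
to cover about that much unless you have something like the SGI hack Dave
mentions, and the undocumented log-location mkfs option is not shown here.

ssd=/dev/sdg1   # fast device, should span roughly the first 1 TB (assumption)
hdd=/dev/sdb    # large, slow device (assumption)

ssd_sectors=$(blockdev --getsize $ssd)
hdd_sectors=$(blockdev --getsize $hdd)

# Linear concatenation, SSD first, same idea as the ext4 script above.
dmsetup create dm-xfs-tiered <<EOF
0 $ssd_sectors linear $ssd 0
$ssd_sectors $hdd_sectors linear $hdd 0
EOF

mkfs.xfs /dev/mapper/dm-xfs-tiered
# inode32 keeps inodes (and most metadata allocations) in the low AGs,
# which sit on the SSD segment; file data goes to the higher AGs on the HDD.
mount -o inode32 /dev/mapper/dm-xfs-tiered /mnt/osd0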
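
And for the realtime subvolume variant Ric mentions, something like the
following gives a hard data/metadata split, since the rtdev only ever holds
file data (again untested; device names and mount point are assumptions, and
files only land on the rtdev once they carry the XFS realtime bit, which I
believe can be inherited from the parent directory, see xfs_io(8) chattr):

# Data device (holds all metadata and the log) on flash, realtime
# subvolume (file data only) on the spinning disk.  Names are assumptions.
mkfs.xfs -r rtdev=/dev/sdb /dev/sdg1
mount -o rtdev=/dev/sdb /dev/sdg1 /mnt/osd0
# New files need the realtime bit set (or inherited from the directory)
# before their data is placed on the rtdev.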

--
________________________________________________________

Regards,

David CASIER
3B Rue Taylor, CS20004
75481 PARIS Cedex 10
Paris

Direct line: 01 75 98 53 85
Email: david.casier@xxxxxxxx
________________________________________________________