Re: Fwd: Fwd: [newstore (again)] how disable double write WAL

On 12/01/2015 05:02 PM, Sage Weil wrote:
Hi David,

On Tue, 1 Dec 2015, David Casier wrote:
Hi Sage,
With a standard disk (4 to 6 TB) and a small flash drive, it's easy
to create an ext4 FS with its metadata on flash.

Example with sdg1 on flash and sdb on HDD:

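# Size of a block device in 512-byte sectors (the unit dmsetup expects).
size_of() {
   blockdev --getsz "$1"
}

# Concatenate the SSD, then the HDD, into one linear device-mapper device,
# so that the lowest block addresses land on the SSD.
mkdmsetup() {
   _ssd=/dev/$1
   _hdd=/dev/$2
   _size_of_ssd=$(size_of "$_ssd")
   printf '%s\n' \
      "0 $_size_of_ssd linear $_ssd 0" \
      "$_size_of_ssd $(size_of "$_hdd") linear $_hdd 0" | dmsetup create "dm-${1}-${2}"
}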

mkdmsetup sdg1 sdb
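
To inspect the mapping before creating anything, the table can be printed on its own. This is a minimal sketch, not part of the original script: linear_table is a hypothetical helper, and the sector counts in the example are made up (a 16 GiB flash partition and a 4 TiB disk, both in 512-byte sectors).

```shell
# Hypothetical helper: print the two-line linear table for an SSD+HDD concatenation.
# Arguments: SSD path, SSD size in 512-byte sectors, HDD path, HDD size in sectors.
linear_table() {
   ssd=$1; ssd_sectors=$2; hdd=$3; hdd_sectors=$4
   # Each line is: <start sector> <length in sectors> linear <device> <offset>
   printf '0 %s linear %s 0\n' "$ssd_sectors" "$ssd"
   # The HDD segment starts where the SSD segment ends.
   printf '%s %s linear %s 0\n' "$ssd_sectors" "$hdd_sectors" "$hdd"
}

# Made-up sizes: 16 GiB of flash followed by a 4 TiB disk.
linear_table /dev/sdg1 33554432 /dev/sdb 8589934592
```

The first line maps sectors 0 up to the SSD size onto the flash device, and the second maps everything after that onto the disk, which is why anything placed at the start of the filesystem ends up on flash.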

mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode \
   -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 \
   -i $((1024*512)) /dev/mapper/dm-sdg1-sdb

With that, all metadata blocks are packed at the start of the device,
which the linear mapping places on the SSD.
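
Whether the metadata actually fits on the flash partition depends mostly on the inode table. A rough estimate, assuming one inode per -i bytes of filesystem and -I bytes per inode, and ignoring per-group rounding, bitmaps and group descriptors (inode_table_bytes is a hypothetical helper, and the 4 TiB size is only an example):

```shell
# Hypothetical helper: rough size of the inode table mkfs.ext4 would allocate.
# Arguments: filesystem size in bytes, bytes-per-inode (-i), inode size in bytes (-I).
inode_table_bytes() {
   fs_bytes=$1; bytes_per_inode=$2; inode_size=$3
   # Number of inodes times the size of each inode.
   echo $(( fs_bytes / bytes_per_inode * inode_size ))
}

# Example: a 4 TiB device with -i $((1024*512)) and -I 128.
inode_table_bytes $((4 * 1024 * 1024 * 1024 * 1024)) $((1024 * 512)) 128
# prints 1073741824 (1 GiB)
```

Under these assumptions the inode table for a 4 TiB device is about 1 GiB, so a small flash partition is enough to hold it.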

If the omap is also on SSD, there is almost no metadata on the HDD.

Consequence: Ceph performance (with a filestore hack: no journal, and
direct I/O) is almost the same as the performance of the HDD.

With cache-tier, it's very cool !
Cool!  I know XFS lets you do that with the journal, but I'm not sure if
you can push the fs metadata onto a different device too.. I'm guessing
not?

That is why we are working on a hybrid HDD/flash approach on ARM or Intel.

With newstore, it's much more difficult to control the I/O profile,
because RocksDB embeds its own intelligence.
This is coincidentally what I've been working on today.  So far I've just
added the ability to put the rocksdb WAL on a second device, but it's
super easy to push rocksdb data there as well (and have it spill over onto
the larger, slower device if it fills up).  Or to put the rocksdb WAL on a
third device (e.g., expensive NVMe or NVRAM).

See this ticket for the ceph-disk tooling that's needed:

	http://tracker.ceph.com/issues/13942

I expect this will be more flexible and perform better than the ext4
metadata option, but we'll need to test on your hardware to confirm!

sage

I think that XFS "realtime" subvolumes are the thing that does this - the second volume contains only the data (no metadata).

I seem to recall that it was historically popular with video appliances, etc., but it is not commonly used.

Some of the XFS crew cc'ed above would have more information on this.

Ric

