Re: Fwd: Fwd: [newstore (again)] how disable double write WAL

On Fri, Dec 04, 2015 at 03:12:25PM -0500, Ric Wheeler wrote:
> On 12/01/2015 05:02 PM, Sage Weil wrote:
> >Hi David,
> >
> >On Tue, 1 Dec 2015, David Casier wrote:
> >>Hi Sage,
> >>With a standard disk (4 to 6 TB), and a small flash drive, it's easy
> >>to create an ext4 FS with metadata on flash
> >>
> >>Example with sdg1 on flash and sdb on hdd :
> >>
> >>size_of() {
> >>   blockdev --getsize $1
> >>}
> >>
> >>mkdmsetup() {
> >>   _ssd=/dev/$1
> >>   _hdd=/dev/$2
> >>   _size_of_ssd=$(size_of $_ssd)
> >>   echo """0 $_size_of_ssd linear $_ssd 0
> >>   $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
> >>}

So this is just a linear concatenation that relies on ext4 putting
all its metadata at the front of the filesystem?
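
For anyone following along, the layout that function builds can be
sketched like this. The function name is made up for illustration, and
the sizes are passed in by hand (in 512-byte sectors) rather than read
from the devices, so it only prints the table and touches nothing:

```shell
# Hypothetical helper: print a device-mapper table that concatenates
# an SSD and an HDD, SSD first. All sizes are in 512-byte sectors.
# Table line format: <start> <length> linear <device> <offset>
linear_concat_table() {
    ssd_dev=$1
    ssd_sectors=$2
    hdd_dev=$3
    hdd_sectors=$4
    # The SSD occupies logical sectors [0, ssd_sectors).
    printf '0 %s linear %s 0\n' "$ssd_sectors" "$ssd_dev"
    # The HDD starts where the SSD ends.
    printf '%s %s linear %s 0\n' "$ssd_sectors" "$hdd_sectors" "$hdd_dev"
}

# Possible use (needs root, not run here):
#   linear_concat_table /dev/sdg1 "$(blockdev --getsz /dev/sdg1)" \
#                       /dev/sdb  "$(blockdev --getsz /dev/sdb)" \
#       | dmsetup create dm-sdg1-sdb
```

Anything the filesystem places below the SSD's sector count lands on
flash; everything above it lands on the spinning disk.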

> >>
> >>mkdmsetup sdg1 sdb
> >>
> >>mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode
> >>-E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 -i
> >>$((1024*512)) /dev/mapper/dm-sdg1-sdb
> >>
> >>With that, all meta_blocks are on the SSD

IIRC, it's the "packed_meta_blocks=1" that does this.

This is something that is pretty trivial to do with XFS, too,
by use of the inode32 allocation mechanism. That reserves the
first TB of space for inodes and other metadata allocations,
so if you span the first TB with SSDs, you get almost all the
metadata on the SSDs, and all the data in the higher AGs. With the
undocumented log location mkfs option, you can also put the log at
the start of AG 0, which means that would sit on the SSD, too,
without needing an external log device.
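
A rough sketch of the XFS variant, assuming the same kind of
SSD-then-HDD concatenated device as above. The function names are
invented for illustration, the commands are only printed rather than
run, and the log location option is left out because it isn't named
here:

```shell
# Sectors (512 bytes each) in 1 TiB: 2^40 / 2^9 = 2^31.
TIB_SECTORS=2147483648

# Hypothetical check: does the SSD cover the first TB of the device?
ssd_spans_first_tib() {
    [ "$1" -ge "$TIB_SECTORS" ]
}

# Hypothetical helper: print the commands for an inode32 XFS setup.
# Args: <concatenated device> <mount point> <SSD size in sectors>
xfs_inode32_plan() {
    dev=$1
    mnt=$2
    ssd_sectors=$3
    if ! ssd_spans_first_tib "$ssd_sectors"; then
        # Assumption: with a smaller SSD, part of the inode32 region
        # sits on the HDD, so some metadata may end up there.
        echo "# warning: SSD is smaller than 1 TiB"
    fi
    echo "mkfs.xfs $dev"
    # inode32 has to be requested explicitly on kernels where
    # inode64 is the default.
    echo "mount -o inode32 $dev $mnt"
}
```

For example, `xfs_inode32_plan /dev/mapper/dm-sdg1-sdb /mnt/osd 2147483648`
prints the mkfs and mount lines with no warning.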

SGI even had a mount option hack to limit this allocator behaviour
to a block limit lower than 1TB so they could limit the metadata AG
regions to, say, the first 200GB.

> >This is coincidentally what I've been working on today.  So far I've just
> >added the ability to put the rocksdb WAL on a second device, but it's
> >super easy to push rocksdb data there as well (and have it spill over onto
> >the larger, slower device if it fills up).  Or to put the rocksdb WAL on a
> >third device (e.g., expensive NVMe or NVRAM).

I have old bits and pieces from 7-8 years ago that would allow some
application control of allocation policy to allow things like this
to be done, but I left SGI before it was anything more than just a
proof of concept....

> >See this ticket for the ceph-disk tooling that's needed:
> >
> >	http://tracker.ceph.com/issues/13942
> >
> >I expect this will be more flexible and perform better than the ext4
> >metadata option, but we'll need to test on your hardware to confirm!
> >
> >sage
> 
> I think that XFS "realtime" subvolumes are the thing that does this
> -  the second volume contains only the data (no metadata).
> 
> Seem to recall that it is popular historically with video
> appliances, etc but it is not commonly used.

Because it's a single threaded allocator. It's not suited to highly
concurrent applications, just applications that require large
extents allocated in a deterministic manner.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@xxxxxxxxxx