Re: newstore direction

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Mon, 19 Oct 2015 14:22:28 -0600

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I think there is a lot that can be gained by Ceph managing a raw block
device. As I mentioned on ceph-users, I've given this some thought and
a lot of optimizations could be done that are conducive to storing
objects. I didn't think, however, to bypass VFS altogether by opening
the raw device directly, but this would make things simpler as you
don't have to program things for VFS that don't make sense.

Some of my thoughts were to employ a hashing algorithm for inode
lookup (CRUSH like). Is there a good use case for listing a directory?
We may need to keep a list for deletion, but there may be a better way
to handle this. Is there a need to do snapshots at the block layer if
operations can be atomic? Is there a real advantage to have an
allocation as small as 4K, or does it make sense to use something like
512K?
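
To put rough numbers on the 4K vs 512K question, here is a minimal
sketch. The object sizes are made up for illustration and
allocation_cost is a hypothetical helper, not anything in Ceph. It
only counts how many allocation units have to be tracked and how many
bytes are lost to rounding up.

```python
def allocation_cost(object_sizes, alloc_unit):
    """Return (units_used, wasted_bytes) when every object is rounded
    up to a whole number of allocation units."""
    units = 0
    wasted = 0
    for size in object_sizes:
        n = -(-size // alloc_unit)  # ceiling division
        units += n
        wasted += n * alloc_unit - size
    return units, wasted


# Hypothetical object sizes: 4 KiB, 100 KiB and 4 MiB.
sizes = [4 * 1024, 100 * 1024, 4 * 1024 * 1024]

for unit in (4 * 1024, 512 * 1024):
    units, wasted = allocation_cost(sizes, unit)
    print(f"unit={unit:>7}  units tracked={units:>5}  bytes wasted={wasted}")
```

With these sizes the 4K unit wastes nothing but tracks 1050 units,
while the 512K unit tracks only 10 and wastes about 920 KiB, almost
all of it on the two small objects. Which side wins depends entirely
on the real object size distribution.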

I'm interested in how this might pan out.
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWJVEACRDmVDuy+mK58QAAIQEQAK9GUmGQBP1wYa9yXNEp
juofzj5SCxxiNCBdY3kkdHXELCWkLGn331JX2El8h1lPaqH8/nWNy4U6hx0s
7A5EBgQp7+LN03OLroSfiSccPhEe5B/OB1cnyZjmxwDXyaMJzqXwn231f5ev
lBEzvU5PpHrMdNIIGxNFEHgduxfPIw5ciOokP27Tle1JdAGSn6fL6nRLtQfd
HmVLnnXJT9zaGRyxnL8ZQU8IlfjfhMpIc1bM3QKkQkBmXanzCaNaULrlO35L
XtIy0fEXAjkcGHpxOTz4yx5OFKwkpirFduU2PBn+5kqxPRvGL/eEzIxTV89c
SfhAkyBFpl+g7G+q532i7L/34r2wXOL7wcn9seLdOZIt1LVnb059r0tpy4Fz
X/V2/ao1Fua2BFMYzMskPXiKFzxLu/jOS12CjvYWkNhN4C2pGUbRxhqYnC0k
gjRpoOZHDr+RogQdlzXeUmcbZzvtwWqk2uECIX2mLR1aHTVgnpegJhvvHdl3
Nm7jxLyTof2bcXQgSwO5YEXvWO3dNfQynrb5zE+aIVM5ps9D95Mmm94lJtda
47zraQNwrL1OVS7Fd4ot9VepLcQ4orCUZPSqrm5FBlBWj5G+/U0F8VQl8u/g
/nSZrxMXjHJWRhFvzFMYC3yUp59N75LXR5wId8RkAkgZVM+PftB4LmB7spHC
WcGR
=j3i1
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Mon, Oct 19, 2015 at 1:49 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> The current design is based on two simple ideas:
>
>  1) a key/value interface is a better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>  2) a file system is well suited for storing object data (as files).
>
> So far #1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
>
>  - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb changes
> land... the kv commit is currently 2-3).  So two parties are managing
> metadata here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
>
>  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but
> at a minimum it is a couple btree lookups.  We'd love to use open by
> handle (which would reduce this to 1 btree traversal), but running
> the daemon as ceph and not root makes that hard...
>
>  - ...and file systems insist on updating mtime on writes, even when it is
> an overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
>
>  - XFS is (probably) never going to give us data checksums, which we
> want desperately.
>
> But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep
> it pretty simple, and manage it in kv store along with all of our other
> metadata.
>
> Wins:
>
>  - 2 IOs for most writes: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).
>
>  - No concern about mtime getting in the way
>
>  - Faster reads (no fs lookup)
>
>  - Similarly sized metadata for most objects.  If we assume most objects
> are not fragmented, then the metadata to store the block offsets is about
> the same size as the metadata to store the filenames we have now.
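
The two-IO write path described above can be sketched roughly as
follows. This is a toy model under stated assumptions, not newstore
code: a regular file stands in for the raw block device, a Python
dict stands in for the kv store, and RawStore, BLOCK, and the "obj/"
and "alloc/" key names are all invented for illustration. The
allocator is a deliberately naive bump pointer that never reuses
space.

```python
import os
import tempfile

BLOCK = 4096  # hypothetical allocation unit


class RawStore:
    """Toy model: a flat file plays the raw device, a dict plays the kv store."""

    def __init__(self, path, size):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(self.fd, size)
        self.size = size
        self.kv = {"alloc/next_free": 0}  # allocator state lives in the kv store
        self.ios = 0  # counts logical IOs, not syscalls

    def write_new(self, name, data):
        offset = self.kv["alloc/next_free"]
        length = -(-len(data) // BLOCK) * BLOCK  # round up to whole blocks
        if offset + length > self.size:
            raise OSError("device full")
        # IO 1: write the data into unused space on the device.
        os.pwrite(self.fd, data, offset)
        os.fsync(self.fd)
        self.ios += 1
        # IO 2: one kv transaction records the extent and the new
        # allocator state together.
        self.kv.update({
            "obj/" + name: (offset, len(data)),
            "alloc/next_free": offset + length,
        })
        self.ios += 1

    def read(self, name):
        # No namespace traversal: a single kv lookup yields the offset.
        offset, size = self.kv["obj/" + name]
        return os.pread(self.fd, size, offset)

    def close(self):
        os.close(self.fd)


with tempfile.TemporaryDirectory() as d:
    store = RawStore(os.path.join(d, "dev.img"), 64 * BLOCK)
    store.write_new("a", b"hello")
    print(store.read("a"), store.ios)
    store.close()
```

The ordering is the point of the sketch: the data lands in space that
no committed metadata refers to yet, so a crash before the kv commit
leaves the store consistent. Overwrites in place do not have that
property, which is why they would go through the write-ahead log
first.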
>
> Problems:
>
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.
>
>  - We have to write and maintain an allocator.  I'm still optimistic this
> can be reasonably simple, especially for the flash case (where
> fragmentation isn't such an issue as long as our blocks are reasonably
> sized).  For disk we may need to be moderately clever.
>
>  - We'll need a fsck to ensure our internal metadata is consistent.  The
> good news is it'll just need to validate what we have stored in the kv
> store.
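
As a rough idea of what "validate what we have stored in the kv
store" could look like, here is a minimal sketch. It assumes a
hypothetical layout in which every "obj/<name>" key maps to one
(offset, length) extent in bytes. The key names and the fsck function
are illustrative only, and a real check would have to cover much more
(allocator state, multi-extent objects, the write-ahead log).

```python
def fsck(kv, device_size, block=4096):
    """Return a list of error strings; an empty list means the extents
    are aligned, inside the device, and do not overlap."""
    extents = []
    for key, value in kv.items():
        if key.startswith("obj/"):  # skip allocator and other keys
            offset, length = value
            extents.append((offset, length, key))
    extents.sort()

    errors = []
    prev_end, prev_key = 0, None
    for offset, length, key in extents:
        if offset % block or length % block or length <= 0:
            errors.append(f"{key}: misaligned or empty extent")
        if offset < 0 or offset + length > device_size:
            errors.append(f"{key}: extent outside device")
        if prev_key is not None and offset < prev_end:
            errors.append(f"{key}: overlaps {prev_key}")
        if offset + length > prev_end:
            prev_end, prev_key = offset + length, key
    return errors


good = {"obj/a": (0, 4096), "obj/b": (4096, 8192), "alloc/next_free": 12288}
bad = {"obj/a": (0, 8192), "obj/b": (4096, 4096)}
print(fsck(good, 16 * 4096))
print(fsck(bad, 16 * 4096))
```

Because everything being checked lives in the kv store, the pass is a
single scan plus a sort and never has to read the data area of the
device.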
>
> Other thoughts:
>
>  - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
>
>  - Rocksdb can push colder data to a second directory, so we could have a
> fast ssd primary area (for wal and most metadata) and a second hdd
> directory for stuff it has to push off.  Then have a conservative amount
> of file space on the hdd.  If our block fills up, use the existing file
> mechanism to put data there too.  (But then we have to maintain both the
> current kv + file approach and not go all-in on kv + block.)
>
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html