On 10/19/2015 09:49 PM, Sage Weil wrote:
> The current design is based on two simple ideas:
>
> 1) a key/value interface is a better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
> 2) a file system is well suited for storing object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2. A few
> things:
>
> - We currently write the data to the file, fsync, then commit the kv
> transaction. That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb changes
> land... the kv commit is currently 2-3). So two people are managing
> metadata here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
>
> - On read we have to open files by name, which means traversing the fs
> namespace. Newstore tries to keep it as flat and simple as possible, but
> at a minimum it is a couple btree lookups. We'd love to use open by
> handle (which would reduce this to 1 btree traversal), but running
> the daemon as ceph and not root makes that hard...
>
> - ...and file systems insist on updating mtime on writes, even when it is
> an overwrite with no allocation changes. (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
>
> - XFS is (probably) never going to give us data checksums, which we
> want desperately.
>
> But what's the alternative? My thought is to just bite the bullet and
> consume a raw block device directly. Write an allocator, hopefully keep
> it pretty simple, and manage it in the kv store along with all of our
> other metadata.
>
> Wins:
>
> - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before). For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).
>
> - No concern about mtime getting in the way
>
> - Faster reads (no fs lookup)
>
> - Similarly sized metadata for most objects. If we assume most objects
> are not fragmented, then the metadata to store the block offsets is about
> the same size as the metadata to store the filenames we have now.
>
> Problems:
>
> - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage. Maybe we do this anyway (put metadata on
> SSD!) so it won't matter. But what happens when we are storing gobs of
> rgw index data or cephfs metadata? Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.
>
> - We have to write and maintain an allocator. I'm still optimistic this
> can be reasonably simple, especially for the flash case (where
> fragmentation isn't such an issue as long as our blocks are reasonably
> sized). For disk we may need to be moderately clever.
>
> - We'll need an fsck to ensure our internal metadata is consistent. The
> good news is it'll just need to validate what we have stored in the kv
> store.
>
> Other thoughts:
>
> - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
>
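To make the proposed 2-IO write path concrete, here is a rough sketch of what consuming a raw block device could look like. The Allocator, Extent and KVTransaction interfaces below are hypothetical placeholders, not NewStore code; the point is only that the data goes into free space with one IO and the extent map, object metadata and allocator state land together in a single kv commit.

// Sketch only: hypothetical interfaces illustrating "one data IO + one kv commit".
// None of these types come from the actual NewStore/Ceph tree.
#include <unistd.h>     // pwrite, fsync
#include <cstdint>
#include <string>

struct Extent { uint64_t offset; uint64_t length; };   // location on the raw block device

struct Allocator {                                     // hypothetical free-space manager
  virtual Extent allocate(uint64_t length) = 0;        // find unused space (state kept in the kv store)
  virtual ~Allocator() = default;
};

struct KVTransaction {                                 // hypothetical kv batch (e.g. a rocksdb write batch)
  virtual void set(const std::string &key, const std::string &value) = 0;
  virtual int commit() = 0;                            // IO #2: one journaled kv commit
  virtual ~KVTransaction() = default;
};

// Write a new (non-overwrite) object: data goes straight into unused block space,
// then the object metadata and the extent it points at are committed atomically.
int write_new_object(int block_fd, Allocator &alloc, KVTransaction &txn,
                     const std::string &object_key, const char *data, uint64_t len)
{
  Extent e = alloc.allocate(len);

  // IO #1: write the data into free space on the raw device (no fs metadata, no mtime).
  if (pwrite(block_fd, data, len, e.offset) != (ssize_t)len)
    return -1;
  if (fsync(block_fd) < 0)       // with O_DIRECT this only flushes the device cache
    return -1;

  // IO #2: commit object metadata, extent map and allocator state in one kv transaction.
  txn.set("meta/" + object_key, "onode...");            // object metadata (size, attrs, ...)
  txn.set("extent/" + object_key,
          std::to_string(e.offset) + "+" + std::to_string(e.length));
  txn.set("alloc/used/" + std::to_string(e.offset),
          std::to_string(e.length));
  return txn.commit();
}

For an overwrite, the single synchronous IO would instead be the kv commit carrying a WAL record with the new data, and the in-place write would be applied asynchronously afterwards, as described above.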
I've been using bcache for a while now in production and that has helped a lot: Intel SSDs with GPT, the first few partitions as journals and then one big partition for bcache.

/dev/bcache0    2.8T  264G  2.5T  10%  /var/lib/ceph/osd/ceph-60
/dev/bcache1    2.8T  317G  2.5T  12%  /var/lib/ceph/osd/ceph-61
/dev/bcache2    2.8T  303G  2.5T  11%  /var/lib/ceph/osd/ceph-62
/dev/bcache3    2.8T  316G  2.5T  12%  /var/lib/ceph/osd/ceph-63
/dev/bcache4    2.8T  167G  2.6T   6%  /var/lib/ceph/osd/ceph-64
/dev/bcache5    2.8T  295G  2.5T  11%  /var/lib/ceph/osd/ceph-65

The maintainers of bcache have also presented bcachefs: https://lkml.org/lkml/2015/8/21/22

"checksumming, compression: currently only zlib is supported for compression, and for checksumming there's crc32c and a 64 bit checksum."

Wouldn't that be something we could leverage? Consuming a raw block device seems like re-inventing the wheel to me. I might be wrong though. I have no idea how stable bcachefs is, but it might be worth looking into.

> - Rocksdb can push colder data to a second directory, so we could have a
> fast ssd primary area (for wal and most metadata) and a second hdd
> directory for stuff it has to push off. Then have a conservative amount
> of file space on the hdd. If our block fills up, use the existing file
> mechanism to put data there too. (But then we have to maintain both the
> current kv + file approach and not go all-in on kv + block.)
>
> Thoughts?
> sage

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
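On the RocksDB point quoted above: RocksDB can indeed be pointed at multiple directories with per-path size targets via db_paths, plus a separate wal_dir. A rough sketch of such a configuration follows; the paths and sizes are invented for illustration and this is not an actual NewStore or ceph.conf setting.

// Illustrative only: RocksDB multi-path placement, roughly matching the
// "fast ssd primary area + hdd spill directory" idea quoted above.
// Paths and target sizes are invented for the example.
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // WAL on the fast device: this is the latency-critical commit path.
  options.wal_dir = "/ssd/newstore/wal";

  // SST files fill these paths in order, up to each target_size; colder
  // data gradually moves to the paths listed later (the HDD directory).
  options.db_paths.emplace_back("/ssd/newstore/db",      64ull << 30);  // ~64 GB on SSD
  options.db_paths.emplace_back("/hdd/newstore/db.slow", 1ull  << 40);  // ~1 TB on HDD

  rocksdb::DB *db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/ssd/newstore/db", &db);
  if (!s.ok()) return 1;

  // ... kv metadata, allocator state, WAL records etc. would live here ...
  delete db;
  return 0;
}

Newer, hotter SST data stays in the earlier path until its target_size is reached and older data overflows to the HDD path, which is essentially the fast-primary plus conservative HDD spill layout described in the quoted paragraph.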