On Tue, 20 Oct 2015, Haomai Wang wrote:
> On Tue, Oct 20, 2015 at 3:49 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > The current design is based on two simple ideas:
> >
> > 1) a key/value interface is a better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> > 2) a file system is well suited for storing object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2. A few
> > things:
> >
> > - We currently write the data to the file, fsync, then commit the kv
> > transaction. That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb changes
> > land... the kv commit is currently 2-3). So two separate layers are
> > managing metadata here: the fs managing the file metadata (with its own
> > journal) and the kv backend (with its journal).
> >
> > - On read we have to open files by name, which means traversing the fs
> > namespace. Newstore tries to keep it as flat and simple as possible, but
> > at a minimum it is a couple of btree lookups. We'd love to use open by
> > handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> >
> > - ...and file systems insist on updating mtime on writes, even when it is
> > an overwrite with no allocation changes. (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> >
> > - XFS is (probably) never going to give us data checksums, which we
> > want desperately.
> >
> > But what's the alternative? My thought is to just bite the bullet and
> > consume a raw block device directly. Write an allocator, hopefully keep
> > it pretty simple, and manage it in the kv store along with all of our
> > other metadata.
>
> This is really a tough decision. The idea of a block-device-based
> objectstore has never left my mind over the past two years.
>
> My main concerns are the efficiency of space utilization compared to a
> local fs, the potential for bugs, and the time it would take to build a
> tiny local filesystem. I'm a little afraid of what we could get stuck
> in....
>
> > Wins:
> >
> > - 2 IOs for most writes: one to write the data to unused space in the
> > block device, one to commit our transaction (vs 4+ before). For
> > overwrites, we'd have one io to do our write-ahead log (kv journal),
> > then do the overwrite async (vs 4+ before).
>
> Compared to filejournal, it seems the key/value db doesn't play well in
> the WAL area, judging from my perf results.

With this change it is close to parity: https://github.com/facebook/rocksdb/pull/746
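
For concreteness, here is a minimal, hypothetical sketch (not newstore's
actual code) of the single-commit idea on the rocksdb side: the object's
extent metadata and the WAL record for a small overwrite go into one
WriteBatch, so the commit costs a single synchronous IO, and the overwrite
itself can be applied to the raw block device asynchronously afterwards.
The key prefixes and values are invented for illustration.

// Hypothetical sketch: one synchronous rocksdb commit carrying both the
// object's extent map and a WAL record for a small overwrite.
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <cassert>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/kv-sketch", &db);
  assert(s.ok());

  rocksdb::WriteBatch batch;
  // Updated extent map for the object (logical offset -> raw device block).
  batch.Put("M/pool1/obj123", "extents: [0~65536 @ 0x4a000]");
  // WAL record holding the overwrite payload; it gets applied to the block
  // device asynchronously and is deleted afterwards.
  batch.Put("L/000000017", "overwrite obj123 4096~512 <data>");

  rocksdb::WriteOptions wopts;
  wopts.sync = true;            // one journaled commit covers both records
  s = db->Write(wopts, &batch);
  assert(s.ok());

  // Later, once the async apply to the block device has completed:
  db->Delete(rocksdb::WriteOptions(), "L/000000017");

  delete db;
  return 0;
}

Keeping the overwrite payload in the kv WAL like this is why the rocksdb
commit-path performance addressed by the pull request above matters so
much here.
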
> > - No concern about mtime getting in the way
> >
> > - Faster reads (no fs lookup)
> >
> > - Similarly sized metadata for most objects. If we assume most objects
> > are not fragmented, then the metadata to store the block offsets is about
> > the same size as the metadata to store the filenames we have now.
> >
> > Problems:
> >
> > - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage. Maybe we do this anyway (put metadata on
> > SSD!) so it won't matter. But what happens when we are storing gobs of
> > rgw index data or cephfs metadata? Suddenly we are pulling storage out of
> > a different pool and those aren't currently fungible.
> >
> > - We have to write and maintain an allocator. I'm still optimistic this
> > can be reasonably simple, especially for the flash case (where
> > fragmentation isn't such an issue as long as our blocks are reasonably
> > sized). For disk we may need to be moderately clever.
> >
> > - We'll need a fsck to ensure our internal metadata is consistent. The
> > good news is it'll just need to validate what we have stored in the kv
> > store.
> >
> > Other thoughts:
> >
> > - We might want to consider whether dm-thin or bcache or other block
> > layers might help us with elasticity of file vs block areas.
> >
> > - Rocksdb can push colder data to a second directory, so we could have a
> > fast ssd primary area (for wal and most metadata) and a second hdd
> > directory for stuff it has to push off. Then have a conservative amount
> > of file space on the hdd. If our block fills up, use the existing file
> > mechanism to put data there too. (But then we have to maintain both the
> > current kv + file approach and not go all-in on kv + block.)
>
> A complex way...
>
> Actually I would like to try a FileStore2 implementation, which means we
> still use FileJournal (or something like it), but we keep more
> metadata/xattrs in memory and use aio+dio to flush the disk. A userspace
> pagecache would need to be implemented. Then we could skip the journal for
> full writes; because the OSD isolates PGs, we could put a barrier on a
> single PG when skipping the journal. @Sage, are there other concerns about
> filestore skipping the journal?
>
> In short, I like the model that filestore uses, but it would need a big
> refactor of the existing implementation.
>
> Sorry to disturb the train of thought....

I think the directory (re)hashing strategy in filestore is too expensive,
and I don't see how it can be fixed without managing the namespace ourselves
(as newstore does). If we want a middle-road approach where we still rely
on a file system for doing block allocation then IMO the current incarnation
of newstore is the right path...

sage
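
As a rough, hypothetical illustration of how simple the allocator discussed
above could be, the sketch below does first-fit allocation over an in-memory
free-extent map. Persisting that map in the kv store, block-size/alignment
policy, and crash handling are all omitted, and every name here is invented.

// Hypothetical sketch: first-fit extent allocator over a raw block device.
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>

class ExtentAllocator {
  // Free extents, keyed by byte offset -> length in bytes.
  std::map<uint64_t, uint64_t> free_;

public:
  explicit ExtentAllocator(uint64_t device_size) { free_[0] = device_size; }

  // First fit: return the offset of a free extent of at least len bytes.
  std::optional<uint64_t> allocate(uint64_t len) {
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->second < len)
        continue;
      uint64_t off = it->first;
      uint64_t remaining = it->second - len;
      free_.erase(it);
      if (remaining)
        free_[off + len] = remaining;   // keep the unused tail free
      return off;
    }
    return std::nullopt;                // no space left
  }

  // Return an extent to the free map, merging with adjacent free extents.
  void release(uint64_t off, uint64_t len) {
    auto next = free_.lower_bound(off);
    if (next != free_.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second == off) {  // merge with predecessor
        off = prev->first;
        len += prev->second;
        free_.erase(prev);
      }
    }
    if (next != free_.end() && off + len == next->first) {  // merge with successor
      len += next->second;
      free_.erase(next);
    }
    free_[off] = len;
  }
};

A disk-oriented version would presumably want to prefer larger contiguous
extents to limit fragmentation, which is where the "moderately clever" part
comes in.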