I really like the idea. One scenario that keeps bothering us is that there
are too many small files, which makes file system indexing slow (so that a
single read request can take more than 10 disk I/Os for path lookup). If we
pursue this proposal, is there a chance we can take it one step further:
instead of storing one physical file per object, we allocate a big file
(tens of GB) and map each object to a chunk within that big file. That way,
all the big files' descriptors could be cached, avoiding the disk I/O
needed to open each file. At least we could keep the design flexible enough
that if someone wanted to implement it that way, they could leverage the
existing framework.

Thanks,
Guang

On Jul 31, 2014, at 1:25 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:

> After the latest set of bug fixes to the FileStore file naming code I am
> newly inspired to replace it with something less complex. Right now I'm
> mostly thinking about HDDs, although some of this may map well onto hybrid
> SSD/HDD as well. It may or may not make sense for pure flash.
>
> Anyway, here are the main flaws with the overall approach that FileStore
> uses:
>
> - It tries to maintain a direct mapping of object names to file names.
> This is problematic because of 255 character limits, rados namespaces, pg
> prefixes, and the pg directory hashing we do to allow efficient split, for
> starters. It is also problematic because we often want to do things like
> rename but can't make it happen atomically in combination with the rest of
> our transaction.
>
> - The PG directory hashing (that we do to allow efficient split) can have
> a big impact on performance, particularly when ingesting lots of data.
> (And when benchmarking.) It's also complex.
>
> - We often overwrite or replace entire objects.
> These are "easy" operations to do safely without doing complete data
> journaling, but the current design is not conducive to doing anything
> clever (and it's complex enough that I wouldn't want to add any cleverness
> on top).
>
> - Objects may contain only key/value data, but we still have to create an
> inode for them and look that up first. This only matters for some
> workloads (rgw indexes, cephfs directory objects).
>
> Instead, I think we should try a hybrid approach that more heavily
> leverages a key/value db in combination with the file system. The kv db
> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just
> assume it provides transactional key/value storage and efficient range
> operations. Here's the basic idea:
>
> - The mapping from names to objects lives in the kv db. The object
> metadata is in a structure we can call an "onode" to avoid confusing it
> with the inodes in the backing file system. The mapping is a simple
> ghobject_t -> onode map; there is no PG collection. The PG collections
> still exist, but really only as ranges of those keys. We will need to be
> slightly clever with the coll_t to distinguish between "bare" PGs (that
> live in this flat mapping) and the other collections (*_temp and
> metadata), but that should be easy. This makes PG splitting "free" as far
> as the objects go.
>
> - The onodes are relatively small. They will contain the xattrs and
> basic metadata like object size. They will also identify the file name of
> the backing file in the file system (if size > 0).
>
> - The backing file can have a random, short file name. We can just make a
> one or two level deep set of directories, and let the directories get
> reasonably big... whatever we decide the backing fs can handle
> efficiently.
> We can also store a file handle in the onode and use the
> open by handle API; this should let us go directly from the onode (in our
> kv db) to the on-disk inode without looking at the directory at all, and
> fall back to using the actual file name only if that fails for some reason
> (say, someone mucked around with the backing files). The backing file need
> not have any xattrs on it at all (except perhaps some simple id to verify
> it does in fact belong to the referring onode, just as a sanity check).
>
> - The name -> onode mapping can live in a disjoint part of the kv
> namespace so that the other kv data associated with the object (like omap
> pairs or big xattrs or whatever) doesn't blow up those parts of the db and
> slow down lookup.
>
> - We can keep a simple LRU of recent onodes in memory and avoid the kv
> lookup for hot objects.
>
> - Previously complicated operations like rename are now trivial: we just
> update the kv db with a transaction. The backing file never gets renamed,
> ever, and the other object omap data is keyed by a unique (onode) id, not
> the name.
>
> Initially, for simplicity, we can start with the existing data journaling
> behavior. However, I think there are opportunities to improve the
> situation there. There is a pending wip-transactions branch in which I
> started to rejigger the ObjectStore::Transaction interface a bit so that
> you identify objects by handle and then operate on them. Although it
> doesn't change the encoding yet, once it does, we can make the
> implementation take advantage of it by avoiding duplicate name lookups.
> It will also let us do things like clearly identify when an object is
> entirely new; in that case, we might forgo data journaling and instead
> write the data to the (new) file, fsync, and then commit the journal entry
> with the transaction that uses it. (On remount a simple cleanup process
> can throw out new but unreferenced backing files.)
> It would also make it easier to track all recently touched files and bulk
> fsync them instead of doing a syncfs (if we decide that is faster).
>
> Anyway, at the end of the day, small writes or overwrites would still be
> journaled, but large writes or large new objects would not, which would (I
> think) be a pretty big improvement. Overall, I think the design will be
> much simpler to reason about, and there are several potential avenues to
> be clever and make improvements. I'm not sure we can say the same about
> the FileStore design, which suffers from the fact that it has evolved
> slowly over the last 9 years or so.
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html