On Aug 2, 2014, at 5:34 AM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:

> Sage's basic approach sounds about right to me. I'm fairly skeptical about the benefits of packing small objects together within larger files, though. It seems like for very small objects, we would be better off stashing the contents opportunistically within the onode.

I really like this idea. For the radosgw + EC use case, lots of small physical files are generated (a few KB each), and once the OSD disk fills past a certain ratio, each read of a chunk can incur several disk I/Os (path lookup plus data). Putting the data as part of the onode could boost read performance and, at the same time, decrease the number of physical files.
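To make the idea concrete, here is a rough sketch of what an onode with opportunistically inlined data might look like (purely illustrative C++; these are made-up names, not actual Ceph types):

  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  struct onode_t {
    uint64_t size = 0;                          // logical object size
    std::map<std::string, std::string> xattrs;  // small xattrs live here too
    std::vector<char> inline_data;              // non-empty only for tiny objects
    std::string backing_file;                   // set when the data lives in a file
  };

  // For an inlined object, a read needs only the kv lookup that fetched the
  // onode -- no directory walk, no inode fetch, no separate data read.
  inline bool read_inline(const onode_t& o, std::string* out) {
    if (o.inline_data.empty())
      return false;  // data lives in the backing file instead
    out->assign(o.inline_data.begin(), o.inline_data.end());
    return true;
  }

The threshold for what counts as "tiny enough to inline" would presumably need to be tunable, since it trades kv db size against file system I/O.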
> For somewhat larger objects, it seems like the complexity of maintaining information about the larger pack objects would be equivalent to what the filesystem would do anyway.
> -Sam
>
> On Fri, Aug 1, 2014 at 8:08 AM, Guang Yang <yguang11@xxxxxxxxxxx> wrote:
>> I really like the idea. One scenario that keeps bothering us is that there are too many small files, which makes file system indexing slow (so that a single read request can take more than 10 disk I/Os just for the path lookup).
>>
>> If we pursue this proposal, is there a chance we can take it one step further: instead of storing one physical file for each object, we allocate a big file (tens of GB) and map each object to a chunk within that big file? That way the descriptors of those big files could be cached, avoiding the disk I/O needed to open a file. At the least, we should keep the design flexible so that if someone would like to implement it that way, there is a chance to leverage the existing framework.
>>
>> Thanks,
>> Guang
>>
>> On Jul 31, 2014, at 1:25 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>
>>> After the latest set of bug fixes to the FileStore file naming code I am newly inspired to replace it with something less complex. Right now I'm mostly thinking about HDDs, although some of this may map well onto hybrid SSD/HDD as well. It may or may not make sense for pure flash.
>>>
>>> Anyway, here are the main flaws with the overall approach that FileStore uses:
>>>
>>> - It tries to maintain a direct mapping of object names to file names. This is problematic because of 255 character limits, rados namespaces, pg prefixes, and the pg directory hashing we do to allow efficient split, for starters. It is also problematic because we often want to do things like rename but can't make them happen atomically in combination with the rest of our transaction.
>>>
>>> - The PG directory hashing (that we do to allow efficient split) can have a big impact on performance, particularly when ingesting lots of data. (And when benchmarking.) It's also complex.
>>>
>>> - We often overwrite or replace entire objects. These are "easy" operations to do safely without doing complete data journaling, but the current design is not conducive to doing anything clever (and it's complex enough that I wouldn't want to add any cleverness on top).
>>>
>>> - Objects may contain only key/value data, but we still have to create an inode for them and look that up first. This only matters for some workloads (rgw indexes, cephfs directory objects).
>>>
>>> Instead, I think we should try a hybrid approach that more heavily leverages a key/value db in combination with the file system. The kv db might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just assume it provides transactional key/value storage and efficient range operations. Here's the basic idea:
>>>
>>> - The mapping from names to objects lives in the kv db. The object metadata is in a structure we can call an "onode" to avoid confusing it with the inodes in the backing file system. The mapping is a simple ghobject_t -> onode map; there is no PG collection in the mapping itself. The PG collections still exist, but really only as ranges of those keys. We will need to be slightly clever with the coll_t to distinguish between "bare" PGs (that live in this flat mapping) and the other collections (*_temp and metadata), but that should be easy. This makes PG splitting "free" as far as the objects go.
>>>
>>> - The onodes are relatively small. They will contain the xattrs and basic metadata like object size. They will also identify the file name of the backing file in the file system (if size > 0).
>>>
>>> - The backing file can have a random, short file name. We can just make a one or two level deep set of directories, and let the directories get reasonably big... whatever we decide the backing fs can handle efficiently. We can also store a file handle in the onode and use the open by handle API; this should let us go directly from the onode (in our kv db) to the on-disk inode without looking at the directory at all, and fall back to using the actual file name only if that fails for some reason (say, someone mucked around with the backing files). The backing file need not have any xattrs on it at all (except perhaps some simple id to verify it does in fact belong to the referring onode, just as a sanity check).
>>>
>>> - The name -> onode mapping can live in a disjoint part of the kv namespace so that the other kv stuff associated with the object (like omap pairs or big xattrs or whatever) doesn't blow up those parts of the db and slow down lookups.
>>>
>>> - We can keep a simple LRU of recent onodes in memory and avoid the kv lookup for hot objects.
>>>
>>> - Previously complicated operations like rename are now trivial: we just update the kv db with a transaction. The backing file never gets renamed, ever, and the other object omap data is keyed by a unique (onode) id, not the name.
>>>
>>> Initially, for simplicity, we can start with the existing data journaling behavior. However, I think there are opportunities to improve the situation there. There is a pending wip-transactions branch in which I started to rejigger the ObjectStore::Transaction interface a bit so that you identify objects by handle and then operate on them. Although it doesn't change the encoding yet, once it does, we can make the implementation take advantage of that by avoiding duplicate name lookups. It will also let us do things like clearly identify when an object is entirely new; in that case, we might forgo data journaling and instead write the data to the (new) file, fsync, and then commit the journal entry with the transaction that uses it. (On remount a simple cleanup process can throw out new but unreferenced backing files.) It would also make it easier to track all recently touched files and bulk fsync them instead of doing a syncfs (if we decide that is faster).
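To make that write path concrete, here is a rough sketch of how a brand-new object could be committed without data journaling (purely illustrative; the kv store is stood in for by a std::map, and none of these names are real Ceph interfaces):

  #include <fcntl.h>
  #include <unistd.h>
  #include <map>
  #include <string>

  // Stand-in for a transactional kv store such as leveldb/rocksdb.
  using kv_store = std::map<std::string, std::string>;

  bool write_new_object(kv_store& db, const std::string& name,
                        const std::string& path, const char* buf, size_t len) {
    // 1. Write the data to a fresh, randomly named backing file and fsync it.
    //    No data journaling: the file itself is the only copy.
    //    (Short writes are ignored here for brevity.)
    int fd = ::open(path.c_str(), O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0)
      return false;
    if (::write(fd, buf, len) != (ssize_t)len || ::fsync(fd) < 0) {
      ::close(fd);
      ::unlink(path.c_str());
      return false;
    }
    ::close(fd);

    // 2. Only after the data is durable, commit the metadata: name -> onode,
    //    onode -> backing file.  If we crash before this point, the file is
    //    simply unreferenced and a remount-time sweep can delete it.
    db["onode/" + name] = path;  // real version: encoded onode in one kv transaction
    return true;
  }

The ordering matters: because the data is fsynced before the kv commit, a crash can only leave behind an unreferenced backing file, never a referenced-but-missing one, so the remount-time cleanup stays trivial.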
>>>
>>> Anyway, at the end of the day, small writes or overwrites would still be journaled, but large writes or large new objects would not, which would (I think) be a pretty big improvement. Overall, I think the design will be much simpler to reason about, and there are several potential avenues to be clever and make improvements. I'm not sure we can say the same about the FileStore design, which suffers from the fact that it has evolved slowly over the last 9 years or so.
>>>
>>> sage