Re: Radosgw - bucket index

Hi Guang,

[I think the problem is that your email is HTML formatted, and vger 
silently drops those.  Make sure your mailer is set to plain text mode.]

On Fri, 16 May 2014, Guang wrote:

> * *Key/value OSD backend* (experimental): An alternative storage
>   backend for Ceph OSD processes that puts all data in a key/value
>   database like leveldb.  This provides better performance for
>   workloads dominated by key/value operations (like radosgw bucket
>   indices).
> 
> Hi Yehuda and Haomai, I managed to set up a K/V store backend and played
> around with it.  As Sage mentioned in the release note, I thought the K/V
> store could be the solution for radosgw's bucket indexing feature, which
> currently has scaling problems [1].  However, after playing around with
> the K/V store and understanding the requirements for bucket indexing, I
> think that, at least for now, there is still a gap to close before bucket
> indexing can be fixed by leveraging the K/V store.
> 
> In my opinion, one requirement (API) for implementing bucket indexing is
> support for ordered scans (prefix filter), which is not part of the rados
> API.  Since the K/V store does not extend the rados API (it is not
> supposed to) but only changes the underlying object store strategy, it is
> not likely to help with bucket indexing, unless we keep the original
> approach of using omap to store the bucket index, with each bucket
> corresponding to one object.

The rados omap API does allow a prefix filter, although it's somewhat 
implicit:

    /**
     * omap_get_keys: keys from the object omap
     *
     * Get up to max_return keys beginning after start_after
     *
     * @param start_after [in] list keys starting after start_after
     * @param max_return [in] list no more than max_return keys
     * @param out_keys [out] place returned keys in out_keys on completion
     * @param prval [out] place error code in prval upon completion
     */
    void omap_get_keys(const std::string &start_after,
                       uint64_t max_return,
                       std::set<std::string> *out_keys,
                       int *prval);

Since all keys are sorted alphanumerically, you simply have to set
start_after == your prefix, and start ignoring the results once you get a
key that does not start with your prefix.  This could be improved by
having an explicit prefix argument that does this server-side, but for now
you can get the right data (plus a bit of extra at the end).
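
To make that concrete, here is a minimal sketch of how a prefix scan can
be emulated on a single index object with the librados C++ API quoted
above; the client id, pool, object, and prefix names are made up for
illustration, and error handling is omitted:

    // Emulate a prefix scan over one index object's omap: start the scan at
    // the prefix and stop once a returned key no longer begins with it.
    #include <rados/librados.hpp>
    #include <iostream>
    #include <set>
    #include <string>

    int main() {
      librados::Rados cluster;
      cluster.init("admin");                          // hypothetical client id
      cluster.conf_read_file("/etc/ceph/ceph.conf");
      cluster.connect();

      librados::IoCtx ioctx;
      cluster.ioctx_create(".rgw.buckets.index", ioctx);  // hypothetical pool

      const std::string prefix = "photos/2014/";
      std::string start_after = prefix;               // start right at the prefix
      bool done = false;

      while (!done) {
        std::set<std::string> keys;
        int prval = 0;
        librados::ObjectReadOperation op;
        op.omap_get_keys(start_after, 1000, &keys, &prval);

        librados::bufferlist bl;
        ioctx.operate(".dir.bucket-1234", &op, &bl);  // hypothetical index object
        if (keys.empty())
          break;                                      // no keys left at all

        for (const std::string &k : keys) {
          if (k.compare(0, prefix.size(), prefix) != 0) {
            done = true;                              // past the prefix: stop
            break;
          }
          std::cout << k << std::endl;                // key under the prefix
          start_after = k;                            // resume after last key seen
        }
      }

      cluster.shutdown();
      return 0;
    }

Everything past the prefix range is filtered client-side, which is the
"bit of extra at the end" above.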

Is that what you mean by prefix scan, or are you referring to the ability 
to scan for rados objects that begin with a prefix?  If it's the latter, 
you are right: objects are hashed across nodes and there is no sorted 
object name index to allow prefix filtering.  There is a list_objects 
filter option, but it is still O(objects in the pool).

> Did I miss anything obvious here?
> 
> We are very interested in the effort to improve the scalability of the
> bucket index [1] that the blueprint mentions; here are my thoughts on top
> of this:
>  1. It would be nice if we could refactor the interface so that it is
> easy to switch to a different underlying storage system for bucket
> indexing; for example, DynamoDB seems to be used for S3's implementation
> [2], and Swift uses SQLite [3] and has a flat namespace for listing
> purposes (with prefix and delimiter).

radosgw is using the omap key/value API for objects, which is more or less 
equivalent to what swift is doing with sqlite.  This data passes straight 
into leveldb on the backend (or whatever other backend you are using).  
Using something like rocksdb in its place is pretty simple and there are
unmerged patches to do that; the user would just need to adjust their 
crush map so that the rgw index pool is mapped to a different set of OSDs 
with the better k/v backend.
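
For reference, a minimal sketch of what writing an index entry through
that omap API looks like from the client side; the object name, key, and
payload are made up, and this only illustrates the librados call, not
rgw's actual index encoding (cluster/IoCtx setup as in the earlier
sketch):

    // Write one bucket index entry as an omap key/value on an index object.
    // The omap keys end up in the OSD's key/value backend (leveldb, or
    // rocksdb if the index pool is mapped to OSDs using that backend).
    #include <rados/librados.hpp>
    #include <map>
    #include <string>

    // Assumes 'ioctx' is already open on the bucket index pool.
    int add_index_entry(librados::IoCtx &ioctx,
                        const std::string &object_name) {
      librados::bufferlist val;
      val.append(std::string("size=1234;etag=abc123"));  // made-up payload

      std::map<std::string, librados::bufferlist> entries;
      entries[object_name] = val;

      librados::ObjectWriteOperation op;
      op.omap_set(entries);                         // one sorted key per object
      return ioctx.operate(".dir.bucket-1234", &op);  // hypothetical index object
    }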

>  2. As mentioned in the blueprint, if we go with the approach of sharding
> the bucket index object, what is the design choice?  Are we going to
> maintain a B-tree structure to keep all keys sorted and shard on demand,
> like having a background thread do the sharding when it reaches a certain
> threshold?

I don't know... I'm sure Yehuda has a more well-formed opinion on this.  I
suspect something simpler than a B-tree (like a single-level hash-based
fan-out) would be sufficient, although you'd pay a bit of a price for
object enumeration.
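
To illustrate one way a single-level hash-based fan-out could look, the
bucket index could be split across a fixed number of shard objects, with
each object name hashed to pick the shard whose omap holds its entry.
This is only a sketch of the idea; the shard count and naming scheme
below are hypothetical:

    // Single-level hash-based fan-out for a bucket index (illustrative only).
    #include <cstdio>
    #include <functional>
    #include <string>

    static const unsigned NUM_SHARDS = 32;   // hypothetical shards per bucket

    // Map an object name to the rados object whose omap holds its index
    // entry.  (A real implementation would want a hash that is stable
    // across builds, unlike std::hash.)
    std::string index_shard_for(const std::string &bucket_marker,
                                const std::string &object_name) {
      unsigned shard = std::hash<std::string>()(object_name) % NUM_SHARDS;
      char suffix[16];
      snprintf(suffix, sizeof(suffix), ".%u", shard);
      return ".dir." + bucket_marker + suffix;
    }

    int main() {
      // Writes for one object touch a single shard, but listing the bucket
      // in sorted order must merge the omap key streams of all NUM_SHARDS
      // shards; that is the enumeration price mentioned above.
      printf("%s\n", index_shard_for("bucket-1234", "photos/cat.jpg").c_str());
      return 0;
    }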

sage



> 
> [1] https://wiki.ceph.com/Planning/Sideboard/rgw%3A_bucket_index_scalability
> [2] http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html
> [3] https://swiftstack.com/openstack-swift/architecture/
> 
> Thanks,
> Guang
> 
> On May 7, 2014, at 9:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> 
> We did it!  Firefly v0.80 is built and pushed out to the ceph.com
> repositories.
>
> This release will form the basis for our long-term supported release
> Firefly, v0.80.x.  The big new features are support for erasure coding
> and cache tiering, although a broad range of other features, fixes,
> and improvements have been made across the code base.  Highlights
> include:
>
> * *Erasure coding*: support for a broad range of erasure codes for
>   lower storage overhead and better data durability.
> * *Cache tiering*: support for creating 'cache pools' that store hot,
>   recently accessed objects with automatic demotion of colder data to
>   a base tier.  Typically the cache pool is backed by faster storage
>   devices like SSDs.
> * *Primary affinity*: Ceph now has the ability to skew selection of
>   OSDs as the "primary" copy, which allows the read workload to be
>   cheaply skewed away from parts of the cluster without migrating any
>   data.
> * *Key/value OSD backend* (experimental): An alternative storage
>   backend for Ceph OSD processes that puts all data in a key/value
>   database like leveldb.  This provides better performance for
>   workloads dominated by key/value operations (like radosgw bucket
>   indices).
> * *Standalone radosgw* (experimental): The radosgw process can now run
>   in a standalone mode without an apache (or similar) web server or
>   fastcgi.  This simplifies deployment and can improve performance.
>
> We expect to maintain a series of stable releases based on v0.80
> Firefly for as much as a year.  In the meantime, development of Ceph
> continues with the next release, Giant, which will feature work on the
> CephFS distributed file system, more alternative storage backends
> (like RocksDB and f2fs), RDMA support, support for pyramid erasure
> codes, and additional functionality in the block device (RBD) like
> copy-on-read and multisite mirroring.
>
> This release is the culmination of a huge collective effort by about
> 100 different contributors.  Thank you everyone who has helped to make
> this possible!
>
> Upgrade Sequencing
> ------------------
>
> * If your existing cluster is running a version older than v0.67
>   Dumpling, please first upgrade to the latest Dumpling release before
>   upgrading to v0.80 Firefly.  Please refer to the :ref:`Dumpling
>   upgrade` documentation.
>
> * Upgrade daemons in the following order:
>
>     1. Monitors
>     2. OSDs
>     3. MDSs and/or radosgw
>
>   If the ceph-mds daemon is restarted first, it will wait until all
>   OSDs have been upgraded before finishing its startup sequence.  If
>   the ceph-mon daemons are not restarted prior to the ceph-osd
>   daemons, they will not correctly register their new capabilities
>   with the cluster and new features may not be usable until they are
>   restarted a second time.
>
> * Upgrade radosgw daemons together.  There is a subtle change in
>   behavior for multipart uploads that prevents a multipart request
>   that was initiated with a new radosgw from being completed by an
>   old radosgw.
>
> Notable changes since v0.79
> ---------------------------
>
> * ceph-fuse, libcephfs: fix several caching bugs (Yan, Zheng)
> * ceph-fuse: trim inodes in response to mds memory pressure (Yan, Zheng)
> * librados: fix inconsistencies in API error values (David Zafman)
> * librados: fix watch operations with cache pools (Sage Weil)
> * librados: new snap rollback operation (David Zafman)
> * mds: fix respawn (John Spray)
> * mds: misc bugs (Yan, Zheng)
> * mds: misc multi-mds fixes (Yan, Zheng)
> * mds: use shared_ptr for requests (Greg Farnum)
> * mon: fix peer feature checks (Sage Weil)
> * mon: require 'x' mon caps for auth operations (Joao Luis)
> * mon: shutdown when removed from mon cluster (Joao Luis)
> * msgr: fix locking bug in authentication (Josh Durgin)
> * osd: fix bug in journal replay/restart (Sage Weil)
> * osd: many many many bug fixes with cache tiering (Samuel Just)
> * osd: track omap and hit_set objects in pg stats (Samuel Just)
> * osd: warn if agent cannot enable due to invalid (post-split) stats
>   (Sage Weil)
> * rados bench: track metadata for multiple runs separately (Guang Yang)
> * rgw: fixed subuser modify (Yehuda Sadeh)
> * rpm: fix redhat-lsb dependency (Sage Weil, Alfredo Deza)
>
> For the complete release notes, please see:
>
>   http://ceph.com/docs/master/release-notes/#v0-80-firefly
>
>
> Getting Ceph
> ------------
>
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://ceph.com/download/ceph-0.80.tar.gz
> * For packages, see http://ceph.com/docs/master/install/get-packages
> * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> 
