v11.2.0 kraken released

This is the first release of the Kraken series.  It is suitable for
use in production deployments and will be maintained until the next
stable release, Luminous, is completed in the Spring of 2017.

Major Changes from Jewel
------------------------

- *RADOS*:

  * The new *BlueStore* backend now has a stable disk format and is
    passing our failure and stress testing. Although the backend is
    still flagged as experimental, we encourage users to try it out
    for non-production clusters and non-critical data sets.
  * RADOS now has experimental support for *overwrites on
    erasure-coded* pools. Because the disk format and implementation
    are not yet finalized, there is a special pool option that must be
    enabled to test the new feature. Enabling this option on a cluster
    will permanently bar that cluster from being upgraded to future
    versions.
  * We now default to the AsyncMessenger (``ms type = async``) instead
    of the legacy SimpleMessenger. The most noticeable difference is
    that we now use a fixed sized thread pool for network connections
    (instead of two threads per socket with SimpleMessenger).
  * Some OSD failures are now detected almost immediately, whereas
    previously the heartbeat timeout (which defaults to 20 seconds)
    had to expire.  This prevents IO from blocking for an extended
    period for failures where the host remains up but the ceph-osd
    process is no longer running.
  * There is a new ``ceph-mgr`` daemon. It is currently collocated with
    the monitors by default, and is not yet used for much, but the basic
    infrastructure is now in place.
  * The size of encoded OSDMaps has been reduced.
  * The OSDs now quiesce scrubbing when recovery or rebalancing is in progress.

- *RGW*:

  * RGW now supports a new zone type that can be used for metadata indexing
    via ElasticSearch.
  * RGW now supports the S3 multipart object copy-part API.
  * It is now possible to reshard an existing bucket. Note that bucket
    resharding currently requires that all IO (especially writes) to
    the specific bucket be quiesced (see the example below).
  * RGW now supports data compression for objects.
  * The Civetweb version has been upgraded to 1.8.
  * The Swift static website API is now supported (S3 support was added
    previously).
  * The S3 bucket lifecycle API has been added. Note that it currently
    supports only object expiration.
  * Support for custom search filters has been added to the LDAP auth
    implementation.
  * Support for NFS version 3 has been added to the RGW NFS gateway.
  * A Python binding has been created for librgw.
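
  As a sketch of the offline resharding workflow mentioned above (the
  bucket name and shard count are placeholders, and the exact
  ``radosgw-admin`` syntax may differ in your build), after quiescing
  all IO to the bucket you would run something like::

    # reshard the bucket index into 64 shards (illustrative values)
    radosgw-admin bucket reshard --bucket=mybucket --num-shards=64

    # confirm the result
    radosgw-admin bucket stats --bucket=mybucket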

- *RBD*:

  * RBD now supports images stored in an *erasure-coded* RADOS pool
    using the new (experimental) overwrite support. Images must be
    created using the new rbd CLI "--data-pool <ec pool>" option to
    specify the EC pool where the backing data objects are stored
    (see the example below). Attempting to create an image directly
    on an EC pool will fail, because the image's backing metadata is
    only supported on a replicated pool.
  * The rbd-mirror daemon now supports replicating dynamic image
    feature updates and image metadata key/value pairs from the
    primary image to the non-primary image.
  * The number of image snapshots can be optionally restricted to a
    configurable maximum.
  * The rbd Python API now supports asynchronous IO operations.
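
  A minimal sketch of the erasure-coded data pool workflow mentioned
  above (the pool and image names are placeholders; the EC pool must
  already have the experimental overwrite support enabled)::

    # metadata goes to the replicated pool, data objects to the EC pool
    rbd create mypool/myimage --size 1G --data-pool my-ec-pool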

- *CephFS*:

  * libcephfs function definitions have been changed to enable proper
    uid/gid control.  The library version has been increased to reflect the
    interface change.
  * Standby replay MDS daemons now consume less memory on workloads
    doing deletions.
  * Scrub now repairs backtraces and populates `damage ls` with
    discovered errors.
  * A new `pg_files` subcommand to `cephfs-data-scan` can identify
    files affected by a damaged or lost RADOS PG (see the example below).
  * The false-positive "failing to respond to cache pressure" warnings have
    been fixed.
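
  A sketch of the `pg_files` usage mentioned above (the path and PG IDs
  are placeholders; check `cephfs-data-scan --help` for the exact
  syntax in your build)::

    # list files under /mydir whose data lived in PG 2.4 or 2.5
    cephfs-data-scan pg_files /mydir 2.4 2.5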


Upgrading from Kraken release candidate 11.1.0
----------------------------------------------

* The new *BlueStore* backend had an on-disk format change after 11.1.0.
  Any BlueStore OSDs created with 11.1.0 will need to be destroyed and
  recreated.

Upgrading from Jewel
--------------------

* All clusters must first be upgraded to Jewel 10.2.z before upgrading
  to Kraken 11.2.z (or, eventually, Luminous 12.2.z).

* The ``sortbitwise`` flag must be set on the Jewel cluster before upgrading
  to Kraken.  The latest Jewel (10.2.4+) releases issue a health warning if
  the flag is not set, so this is probably already set.  If it is not, Kraken
  OSDs will refuse to start and will print an error message in their log.

* You may upgrade OSDs, Monitors, and MDSs in any order.  RGW daemons
  should be upgraded last.

* When upgrading, new ceph-mgr daemon instances will be created automatically
  alongside any monitors.  This will be true for Jewel to Kraken and Jewel to
  Luminous upgrades, but will likely not be true for future upgrades
  beyond Luminous.  You are, of course, free to create new ceph-mgr daemon
  instances and destroy the auto-created ones if you do not wish them to be
  colocated with the ceph-mon daemons.


BlueStore
---------

BlueStore is a new backend for managing the data stored by each OSD directly
on a hard disk or SSD.  Unlike the existing FileStore implementation, which
uses an XFS file system to store objects as files, BlueStore manages the
underlying block device directly, implementing its own file-system-like
on-disk structure that is designed specifically for Ceph OSD workloads.
Key features of BlueStore include:

 * Checksums on all data written to disk, with checksum verifications on all
   reads, enabled by default.
 * Inline compression support, which can be enabled on a per-pool or per-object
   basis via pool properties or client hints, respectively.
 * Efficient journaling.  Unlike FileStore, which writes *all* data to
   its journal device, BlueStore only journals metadata and (in some
   cases) small writes, reducing the size and throughput requirements
   for its journal.  As with FileStore, the journal can be colocated
   on the same device as other data or allocated on a smaller,
   high-performance device (e.g., an SSD or NVMe device).  BlueStore
   journals are only 512 MB by default.

The BlueStore on-disk format is expected to continue to evolve.  However, we
will provide support in the OSD to migrate to the new format on upgrade.

Note: BlueStore is still marked "experimental" in Kraken.  We
   recommend its use for proof-of-concept and test environments, or
   other cases where data loss can be tolerated.  Although it is
   stable in our testing environment, the code is new and bugs are
   inevitable.  We hope that with user feedback from Kraken
   deployments we will have sufficient confidence to mark it stable
   (and the default) in the next major release (Luminous).

In order to enable BlueStore, add the following to ceph.conf:

  enable experimental unrecoverable data corrupting features = bluestore

To create a BlueStore OSD, pass the --bluestore option to ceph-disk or
ceph-deploy during OSD creation.
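
For example, with ceph-disk (the device name is a placeholder;
ceph-deploy accepts an equivalent flag)::

  ceph-disk prepare --bluestore /dev/sdb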



Upgrade notes
-------------

* The OSDs now avoid starting new scrubs while recovery is in progress.  To
  revert to the old behavior (i.e., do not let recovery activity affect
  scrub scheduling) you can set the following option::

    osd scrub during recovery = true

* The list of monitor hosts/addresses for building the monmap can now be
  obtained from DNS SRV records. The service name used when querying the DNS
  is defined by the "mon_dns_srv_name" config option, which defaults to
  "ceph-mon".

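  For example, a DNS zone might publish the monitors with records along
  these lines (illustrative names; adapt the domain, port, priority and
  weight to your environment)::

    _ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon1.example.com.
    _ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon2.example.com.
    _ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon3.example.com.
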
* The 'osd class load list' config option is a list of object class names that
  the OSD is permitted to load (or '*' for all classes). By default it
  contains all existing in-tree classes for backwards compatibility.

* The 'osd class default list' config option is a list of object class
  names (or '*' for all classes) that clients may invoke having only
  the '*', 'x', 'class-read', or 'class-write' capabilities. By
  default it contains all existing in-tree classes for backwards
  compatibility. Invoking classes not listed in 'osd class default
  list' requires a capability naming the class (e.g. 'allow class
  foo').
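
  For example, a ceph.conf sketch that loads all classes but lets
  unprivileged clients invoke only a couple of them (the class names and
  list formatting are illustrative; anything else then needs an explicit
  'allow class <name>' cap)::

    [osd]
    osd class load list = *
    osd class default list = rbd lock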

* The 'rgw rest getusage op compat' config option allows you to dump
  (or not dump) the description of user stats in the S3 GetUsage
  API. This option defaults to false.  If the value is true, the
  response data for GetUsage looks like::

    "stats": {
                "TotalBytes": 516,
                "TotalBytesRounded": 1024,
                "TotalEntries": 1
             }

  If the value is false, the response for GetUsage looks as it did before::

    {
         516,
         1024,
         1
    }
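
  To opt in to the keyed format shown first, set the option in the RGW
  section of ceph.conf (the section name is a placeholder for your
  gateway instance)::

    [client.rgw.gateway1]
    rgw rest getusage op compat = true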

* The 'osd out ...' and 'osd in ...' commands now preserve the OSD
  weight.  That is, after marking an OSD out and then in, the weight
  will be the same as before (instead of being reset to 1.0).
  Previously the mons would only preserve the weight if the mon
  automatically marked an OSD out and then in, but not when an admin
  did so explicitly.
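
  For example (the OSD id is a placeholder)::

    ceph osd out osd.3    # weight is remembered
    ceph osd in osd.3     # weight is restored rather than reset to 1.0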

* The 'ceph osd perf' command now displays 'commit_latency(ms)' and
  'apply_latency(ms)'. Previously, these two columns were named
  'fs_commit_latency(ms)' and 'fs_apply_latency(ms)'. The 'fs_' prefix
  has been removed because these metrics are not FileStore-specific.

* Monitors will no longer allow pools to be removed by default.  The
  setting mon_allow_pool_delete has to be set to true (it defaults to
  false) before they allow pools to be removed.  This is an additional
  safeguard against pools being removed by accident.
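
  For example, to remove a pool after explicitly enabling the option
  (the pool name is a placeholder; injectargs only changes the running
  monitors, so persist the setting in ceph.conf if you want it to
  survive restarts)::

    ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
    ceph osd pool delete mypool mypool --yes-i-really-really-mean-it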

* If you have manually specified that the monitor use rocksdb via the
  ``mon keyvaluedb = rocksdb`` option, you will need to manually add a
  file to the mon data directory to preserve this option::

     echo rocksdb > /var/lib/ceph/mon/ceph-`hostname`/kv_backend

  New monitors will now use rocksdb by default, but if that file is
  not present, existing monitors will use leveldb.  The ``mon
  keyvaluedb`` option now only affects the backend chosen when a
  monitor is created.

* The 'osd crush initial weight' option allows you to specify a CRUSH
  weight for a newly added OSD.  Previously a value of 0 (the default)
  meant that we should use the size of the OSD's store to weight the
  new OSD.  Now, a value of 0 means it should have a weight of 0, and
  a negative value (the new default) means we should automatically
  weight the OSD based on its size.  If your configuration file
  explicitly specifies a value of 0 for this option you will need to
  change it to a negative value (e.g., -1) to preserve the current
  behavior.
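
  For example, to keep the automatic size-based weighting explicitly::

    [osd]
    osd crush initial weight = -1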

* The `osd crush location` config option is no longer supported.  Please
  update your ceph.conf to use the `crush location` option instead.
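
  For example (the bucket names are placeholders; use whatever CRUSH
  hierarchy your cluster defines)::

    [osd]
    crush location = root=default rack=rack1 host=node1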

* The static libraries are no longer included in the Debian
  development packages (lib*-dev), as they are not required per Debian
  packaging policy.  The shared (.so) versions are packaged as before.

* The libtool pseudo-libraries (.la files) are no longer included by
  the debian development packages (lib*-dev) as they are not required
  per https://wiki.debian.org/ReleaseGoals/LAFileRemoval and
  https://www.debian.org/doc/manuals/maint-guide/advanced.en.html.

* The jerasure and shec plugins can now detect SIMD instructions at
  runtime and no longer need to be explicitly configured for different
  processors.  The following plugins are now deprecated:
  jerasure_generic, jerasure_sse3, jerasure_sse4, jerasure_neon,
  shec_generic, shec_sse3, shec_sse4, and shec_neon. If you use any of
  these plugins directly you will see a warning in the mon log file.
  Please switch to using just 'jerasure' or 'shec'.
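
  For example, a new erasure-code profile simply names the generic
  plugin and lets it pick the SIMD variant at runtime (the k/m values
  and names are illustrative)::

    ceph osd erasure-code-profile set myprofile plugin=jerasure k=4 m=2
    ceph osd pool create my-ec-pool 64 64 erasure myprofile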

* The librados omap get_keys and get_vals operations include a start key and a
  limit on the number of keys to return.  The OSD now imposes a configurable
  limit on the number of keys and number of total bytes it will respond with,
  which means that a librados user might get fewer keys than they asked for.
  This is necessary to prevent careless users from requesting an unreasonable
  amount of data from the cluster in a single operation.  The new limits are
  configured with `osd_max_omap_entries_per_request`, defaulting to 131,072,
  and `osd_max_omap_bytes_per_request`, defaulting to 4 MB.
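
  The limits can be tuned in ceph.conf if the defaults do not suit your
  workload (the values shown are the defaults described above, with
  4 MB expressed in bytes)::

    [osd]
    osd max omap entries per request = 131072
    osd max omap bytes per request = 4194304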

* The calculation of recovery priorities has been updated.  This could
  lead to unintuitive recovery prioritization during a cluster upgrade:
  if recovery happens while some OSDs still run the old version, those
  OSDs will operate on different priority ranges than the upgraded
  ones.  Once the whole cluster is upgraded, it will operate on
  consistent values.


A more detailed list of all the features in Kraken and the full release
notes are available at http://ceph.com/releases/v11-2-0-kraken-released

A big thank you to everyone for contributing towards this release.


Getting Ceph
------------

* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-11.2.0.tar.gz
* For packages, see http://ceph.com/docs/master/install/get-packages
* For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy

Best,
Abhishek
--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)