This is the first release of the Kraken series. It is suitable for use
in production deployments and will be maintained until the next stable
release, Luminous, is completed in the Spring of 2017.

Major Changes from Jewel
------------------------

- *RADOS*:

  * The new *BlueStore* backend now has a stable disk format and is
    passing our failure and stress testing. Although the backend is
    still flagged as experimental, we encourage users to try it out
    for non-production clusters and non-critical data sets.
  * RADOS now has experimental support for *overwrites on
    erasure-coded* pools. Because the disk format and implementation
    are not yet finalized, there is a special pool option that must be
    enabled to test the new feature. Enabling this option on a cluster
    will permanently bar that cluster from being upgraded to future
    versions.
  * We now default to the AsyncMessenger (``ms type = async``) instead
    of the legacy SimpleMessenger. The most noticeable difference is
    that we now use a fixed-size thread pool for network connections
    (instead of two threads per socket with SimpleMessenger).
  * Some OSD failures are now detected almost immediately, whereas
    previously the heartbeat timeout (which defaults to 20 seconds)
    had to expire. This prevents IO from blocking for an extended
    period for failures where the host remains up but the ceph-osd
    process is no longer running.
  * There is a new ``ceph-mgr`` daemon. It is currently colocated with
    the monitors by default, and is not yet used for much, but the
    basic infrastructure is now in place.
  * The size of encoded OSDMaps has been reduced.
  * The OSDs now quiesce scrubbing when recovery or rebalancing is in
    progress.

- *RGW*:

  * RGW now supports a new zone type that can be used for metadata
    indexing via ElasticSearch.
  * RGW now supports the S3 multipart object copy-part API.
  * It is now possible to reshard an existing bucket.
    Note that bucket resharding currently requires that all IO
    (especially writes) to the specific bucket is quiesced.
  * RGW now supports data compression for objects.
  * Civetweb version has been upgraded to 1.8.
  * The Swift static website API is now supported (S3 support was
    added previously).
  * S3 bucket lifecycle API has been added. Note that currently it
    only supports object expiration.
  * Support for custom search filters has been added to the LDAP auth
    implementation.
  * Support for NFS version 3 has been added to the RGW NFS gateway.
  * A Python binding has been created for librgw.

- *RBD*:

  * RBD now supports images stored in an *erasure-coded* RADOS pool
    using the new (experimental) overwrite support. Images must be
    created using the new rbd CLI ``--data-pool <ec pool>`` option to
    specify the EC pool where the backing data objects are stored.
    Attempting to create an image directly on an EC pool will not be
    successful since the image's backing metadata is only supported on
    a replicated pool.
  * The rbd-mirror daemon now supports replicating dynamic image
    feature updates and image metadata key/value pairs from the
    primary image to the non-primary image.
  * The number of image snapshots can be optionally restricted to a
    configurable maximum.
  * The rbd Python API now supports asynchronous IO operations.

- *CephFS*:

  * libcephfs function definitions have been changed to enable proper
    uid/gid control. The library version has been increased to reflect
    the interface change.
  * Standby replay MDS daemons now consume less memory on workloads
    doing deletions.
  * Scrub now repairs backtrace, and populates `damage ls` with
    discovered errors.
  * A new `pg_files` subcommand to `cephfs-data-scan` can identify
    files affected by a damaged or lost RADOS PG.
  * The false-positive "failing to respond to cache pressure" warnings
    have been fixed.
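
As an illustration of the RBD item above, here is a minimal sketch of how
the ``rbd create`` command line for an EC-backed image might be assembled.
The pool and image names are hypothetical, the helper ``rbd_create_argv``
is not part of any Ceph library, and the sketch only builds and prints the
command. It does not cover the experimental pool option for EC overwrites,
which these notes do not name.

```python
import shlex


def rbd_create_argv(image, size_mb, metadata_pool, data_pool):
    """Build the argv for creating an RBD image whose data lives in an EC pool.

    The image spec names the replicated pool that holds the image
    metadata; --data-pool names the erasure-coded pool that holds the
    backing data objects.
    """
    if metadata_pool == data_pool:
        # Creating the image directly on the EC pool is not supported.
        raise ValueError("image metadata must live in a replicated pool")
    return [
        "rbd", "create",
        "--size", str(size_mb),        # rbd interprets a bare number as MB
        "--data-pool", data_pool,      # EC pool for the data objects
        f"{metadata_pool}/{image}",    # replicated pool / image name
    ]


# Hypothetical names, for illustration only.
argv = rbd_create_argv("vm-disk-1", 10240, metadata_pool="rbd", data_pool="ecpool")
print(shlex.join(argv))
```

Keeping the two pools as separate arguments makes the constraint from the
release note explicit: the image spec always points at a replicated pool,
and only the data objects are redirected to the EC pool.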

Upgrading from Kraken release candidate 11.1.0
----------------------------------------------

* The new *BlueStore* backend had an on-disk format change after
  11.1.0. Any BlueStore OSDs created with 11.1.0 will need to be
  destroyed and recreated.

Upgrading from Jewel
--------------------

* All clusters must first be upgraded to Jewel 10.2.z before upgrading
  to Kraken 11.2.z (or, eventually, Luminous 12.2.z).

* The ``sortbitwise`` flag must be set on the Jewel cluster before
  upgrading to Kraken. The latest Jewel (10.2.4+) releases issue a
  health warning if the flag is not set, so this is probably already
  set. If it is not, Kraken OSDs will refuse to start and will print
  an error message in their log.

* You may upgrade OSDs, Monitors, and MDSs in any order. RGW daemons
  should be upgraded last.

* When upgrading, new ceph-mgr daemon instances will be created
  automatically alongside any monitors. This will be true for Jewel to
  Kraken and Jewel to Luminous upgrades, but will likely not be true
  for future upgrades beyond Luminous. You are, of course, free to
  create new ceph-mgr daemon instances and destroy the auto-created
  ones if you do not wish them to be colocated with the ceph-mon
  daemons.

BlueStore
---------

BlueStore is a new backend for managing data stored by each OSD
directly on the hard disk or SSD. Unlike the existing FileStore
implementation, which makes use of an XFS file system to store objects
as files, BlueStore manages the underlying block device directly. It
implements its own file system-like on-disk structure that is designed
specifically for Ceph OSD workloads. Key features of BlueStore
include:

* Checksums on all data written to disk, with checksum verifications
  on all reads, enabled by default.

* Inline compression support, which can be enabled on a per-pool or
  per-object basis via pool properties or client hints, respectively.

* Efficient journaling.
  Unlike FileStore, which writes *all* data to its journal device,
  BlueStore only journals metadata and (in some cases) small writes,
  reducing the size and throughput requirements for its journal. As
  with FileStore, the journal can be colocated on the same device as
  other data or allocated on a smaller, high-performance device (e.g.,
  an SSD or NVMe device). BlueStore journals are only 512 MB by
  default.

The BlueStore on-disk format is expected to continue to evolve.
However, we will provide support in the OSD to migrate to the new
format on upgrade.

.. note:: BlueStore is still marked "experimental" in Kraken. We
   recommend its use for proof-of-concept and test environments, or
   other cases where data loss can be tolerated. Although it is stable
   in our testing environment, the code is new and bugs are
   inevitable. We hope that with user feedback from Kraken deployments
   we will have sufficient confidence to mark it stable (and the
   default) in the next major release (Luminous).

In order to enable BlueStore, add the following to ceph.conf::

  enable experimental unrecoverable data corrupting features = bluestore

To create a BlueStore OSD, pass the --bluestore option to ceph-disk or
ceph-deploy during OSD creation.

Upgrade notes
-------------

* The OSDs now avoid starting new scrubs while recovery is in
  progress. To revert to the old behavior (and not let recovery
  activity affect scrub scheduling), you can set the following
  option::

    osd scrub during recovery = true

* The list of monitor hosts/addresses for building the monmap can now
  be obtained from DNS SRV records. The service name used when
  querying DNS is defined in the "mon_dns_srv_name" config option,
  which defaults to "ceph-mon".

* The 'osd class load list' config option is a list of object class
  names that the OSD is permitted to load (or '*' for all classes). By
  default it contains all existing in-tree classes for backwards
  compatibility.

* The 'osd class default list' config option is a list of object class
  names (or '*' for all classes) that clients may invoke having only
  the '*', 'x', 'class-read', or 'class-write' capabilities. By
  default it contains all existing in-tree classes for backwards
  compatibility. Invoking classes not listed in 'osd class default
  list' requires a capability naming the class (e.g. 'allow class
  foo').

* The 'rgw rest getusage op compat' config option allows you to dump
  (or not dump) the description of user stats in the S3 GetUsage API.
  This option defaults to false. If the value is true, the response
  data for GetUsage looks like::

    "stats": {
      "TotalBytes": 516,
      "TotalBytesRounded": 1024,
      "TotalEntries": 1
    }

  If the value is false, the response for GetUsage looks as it did
  before::

    {
      516,
      1024,
      1
    }

* The 'osd out ...' and 'osd in ...' commands now preserve the OSD
  weight. That is, after marking an OSD out and then in, the weight
  will be the same as before (instead of being reset to 1.0).
  Previously the mons would only preserve the weight if the mon
  automatically marked an OSD out and then in, but not when an admin
  did so explicitly.

* The 'ceph osd perf' command will display 'commit_latency(ms)' and
  'apply_latency(ms)'. Previously, these two columns were named
  'fs_commit_latency(ms)' and 'fs_apply_latency(ms)'. The 'fs_' prefix
  was removed because the values are not filestore specific.

* Monitors will no longer allow pools to be removed by default. The
  setting mon_allow_pool_delete has to be set to true (defaults to
  false) before they allow pools to be removed. This is an additional
  safeguard against pools being removed by accident.

* If you have manually specified that the monitor use rocksdb via the
  ``mon keyvaluedb = rocksdb`` option, you will need to manually add a
  file to the mon data directory to preserve this option::

    echo rocksdb > /var/lib/ceph/mon/ceph-`hostname`/kv_backend

  New monitors will now use rocksdb by default, but if that file is
  not present, existing monitors will use leveldb. The ``mon
  keyvaluedb`` option now only affects the backend chosen when a
  monitor is created.

* The 'osd crush initial weight' option allows you to specify a CRUSH
  weight for a newly added OSD. Previously a value of 0 (the default)
  meant that we should use the size of the OSD's store to weight the
  new OSD. Now, a value of 0 means it should have a weight of 0, and a
  negative value (the new default) means we should automatically
  weight the OSD based on its size. If your configuration file
  explicitly specifies a value of 0 for this option you will need to
  change it to a negative value (e.g., -1) to preserve the current
  behavior.

* The `osd crush location` config option is no longer supported.
  Please update your ceph.conf to use the `crush location` option
  instead.

* The static libraries are no longer included by the Debian
  development packages (lib*-dev) as they are not required per Debian
  packaging policy. The shared (.so) versions are packaged as before.

* The libtool pseudo-libraries (.la files) are no longer included by
  the Debian development packages (lib*-dev) as they are not required
  per https://wiki.debian.org/ReleaseGoals/LAFileRemoval and
  https://www.debian.org/doc/manuals/maint-guide/advanced.en.html.

* The jerasure and shec plugins can now detect SIMD instructions at
  runtime and no longer need to be explicitly configured for different
  processors. The following plugins are now deprecated:
  jerasure_generic, jerasure_sse3, jerasure_sse4, jerasure_neon,
  shec_generic, shec_sse3, shec_sse4, and shec_neon. If you use any of
  these plugins directly you will see a warning in the mon log file.
  Please switch to using just 'jerasure' or 'shec'.

* The librados omap get_keys and get_vals operations include a start
  key and a limit on the number of keys to return. The OSD now imposes
  a configurable limit on the number of keys and number of total bytes
  it will respond with, which means that a librados user might get
  fewer keys than they asked for. This is necessary to prevent
  careless users from requesting an unreasonable amount of data from
  the cluster in a single operation. The new limits are configured
  with ``osd_max_omap_entries_per_request``, defaulting to 131,072,
  and ``osd_max_omap_bytes_per_request``, defaulting to 4MB.

* Calculation of recovery priorities has been updated. This could lead
  to unintuitive recovery prioritization during a cluster upgrade.
  During such a recovery, OSDs running the old version would operate
  on different priority ranges than new ones. Once upgraded, the
  cluster will operate on consistent values.

A more detailed list of all the features in Kraken and the full
release notes are available at
http://ceph.com/releases/v11-2-0-kraken-released

A big thank you to everyone for contributing towards this release.

Getting Ceph
------------

* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-11.2.0.tar.gz
* For packages, see http://ceph.com/docs/master/install/get-packages
* For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy

Best,
Abhishek

--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)