There were several OSD sessions at CDS on Wednesday; I'll try to summarize some of the key points.

======================EC Pool Overwrite Support=======================

https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_erasure_coding_pool_overwrite_support

One takeaway from the discussion was that the no-overwrite option for RBD and CephFS may not be feasible, since it's not clear that 4MB objects make sense for an EC pool, and with CephFS we would also need to handle the case where the file is in shared mode. We'd probably, therefore, want to use a two-phase-commit (2PC) approach, but we'd want much more feedback on use cases before implementing it ourselves.

========================Scrub and Repair==============================

https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Scrub_and_Repair
http://pad.ceph.com/p/I-osd-scrub

The discussion focused mainly on a more detailed description of the scrub state kept by the OSD during peering. See the etherpad for details.

=======================Less Intrusive Scrub===========================

https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Less_intrusive_scrub
http://pad.ceph.com/p/I-osd-less-intrusive-scrub

Some additional things we can do to reduce the impact of scrubbing came up; they can be found in the etherpad above.

=================Faster Peering/Lower Tail Latency====================

https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Faster_Peering
https://wiki.ceph.com/Planning/Blueprints/Infernalis/Improve_tail_latency
http://pad.ceph.com/p/I-faster-peering_tailing

In addition to what is in the blueprint, Sage suggested that in some cases the primary can keep the peer_info and peer_missing sets it already has if the acting set stays the same or shrinks (a rough sketch of the idea is appended at the end of this mail). We also touched on prepopulating pg_temp at the monitor, and on having the monitor set a different temporary primary in the map which marks an OSD back up, so that the returning OSD does not immediately become primary for its PGs (and have to block reads and writes while it recovers). In the ungraceful shutdown case, we could have a watchdog process (systemd or something else) mark the specific OSD instance which stopped as down (something like ceph osd down-instance <entity_inst_t>). For EC pools, the consensus seemed to be that the best way to reduce read latencies is to implement client-side reads.

========================Tiering II (Warm->Cold)========================

https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Tiering_II_(Warm-%3ECold)
https://wiki.ceph.com/Planning/Blueprints/Infernalis/Dynamic_data_relocation_for_cache_tiering
http://pad.ceph.com/p/I-tiering

Sage and I spent some time comparing the approach above to the approach from the firefly CDS below. It's still not clear whether we might want to do the firefly variant (with the client able to send IO directly to the cold tier) in addition to the one above (where the cold tier may not even be a RADOS pool).

https://wiki.ceph.com/Planning/Blueprints/%3CSIDEBOARD%3E/osd%3A_tiering%3A_object_redirects

From the discussion, it seemed like it might make sense to expand the interface somewhat to allow the OSD to proxy partial overwrites if the backend supports it. The consensus seemed to be that a RADOS-level pin operation, to force an object to stay in the hot tier, would be a good idea (a speculative client-side sketch of that is also appended below).

-Sam
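
To make the faster-peering idea a bit more concrete, here is a rough, illustrative C++ sketch. The shard_t/peer_state_t types and the on_new_interval() helper below are stand-ins rather than the actual PG internals; the sketch only shows the intended policy discussed in the session: keep the cached per-peer state when the acting set stays the same or shrinks, and drop only what is stale.

  #include <algorithm>
  #include <map>
  #include <set>

  // Illustrative stand-ins only -- not the real pg_shard_t / pg_info_t /
  // pg_missing_t types from the OSD.
  using shard_t = int;
  struct peer_state_t {
    // last-known info + missing set for one peer, as cached by the primary
  };

  // On an interval change, keep the peer_info/peer_missing entries the
  // primary already has when the acting set stays the same or shrinks, and
  // only fall back to a full re-query when a previously unseen shard joins.
  void on_new_interval(const std::set<shard_t>& old_acting,
                       const std::set<shard_t>& new_acting,
                       std::map<shard_t, peer_state_t>& cached_peer_state)
  {
    const bool same_or_smaller =
        std::includes(old_acting.begin(), old_acting.end(),
                      new_acting.begin(), new_acting.end());
    if (!same_or_smaller) {
      // A peer joined whose state we have never seen: drop everything and
      // re-peer from scratch.
      cached_peer_state.clear();
      return;
    }
    // Acting set is the same or a subset: drop only the peers that left;
    // the remaining cached entries are still usable in the new interval.
    for (auto it = cached_peer_state.begin(); it != cached_peer_state.end();) {
      if (new_acting.count(it->first) == 0)
        it = cached_peer_state.erase(it);
      else
        ++it;
    }
  }

The point is simply that a full clear-and-requery is only needed when a shard joins whose state the primary has never seen; otherwise the new interval can start from the info/missing sets already in hand.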
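
For the tiering pin operation: nothing like this was agreed at the session, so the sketch below is only a guess at how a client-side pin/unpin could look through librados. The cache_pin()/cache_unpin() calls are assumptions about how such an interface might be exposed (later librados releases did add calls along these lines), and the pool and object names are hypothetical.

  #include <rados/librados.hpp>
  #include <iostream>

  int main()
  {
    librados::Rados cluster;
    cluster.init("admin");              // assumes a client.admin keyring
    cluster.conf_read_file(nullptr);    // default ceph.conf search path
    if (cluster.connect() < 0)
      return 1;

    // "hot-pool" is a hypothetical name for a pool with cache tiering set
    // up; whether the pin should target the base pool or the cache pool is
    // part of the open interface question.
    librados::IoCtx ioctx;
    if (cluster.ioctx_create("hot-pool", ioctx) < 0)
      return 1;

    // Pin: ask the cache tier to keep this object resident in the hot tier.
    librados::ObjectWriteOperation pin_op;
    pin_op.cache_pin();
    int r = ioctx.operate("important-object", &pin_op);
    std::cout << "pin returned " << r << std::endl;

    // Unpin once the object no longer needs to stay hot, so the normal
    // promotion/eviction policy applies to it again.
    librados::ObjectWriteOperation unpin_op;
    unpin_op.cache_unpin();
    ioctx.operate("important-object", &unpin_op);
    return 0;
  }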