Hi all, This is an attempt to split DISCARD and WRITE_ZEROES paths on krbd side when REQ_NOUNMAP flag is set for a block layer request. Currently both REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES block layer requests fall down to CEPH_OSD_OP_ZERO request, which punches holes on osd side. With a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag for CEPH_OSD_OP_ZERO request osd can zero out blocks, instead of punching holes. Possible handling of a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP on osd: diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc index dff549535d39..d812d88d7bc0 100644 --- a/src/osd/PrimaryLogPG.cc +++ b/src/osd/PrimaryLogPG.cc @@ -5788,6 +5788,7 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops) // munge ZERO -> TRUNCATE? (don't munge to DELETE or we risk hosing attributes) if (op.op == CEPH_OSD_OP_ZERO && + !(op.flags & CEPH_OSD_OP_FLAG_ZERO_NOUNMAP) && obs.exists && op.extent.offset < static_cast<Option::size_t>(osd->osd_max_object_size) && op.extent.length >= 1 && @@ -6583,3 +6584,28 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops) result = -EOPNOTSUPP; break; } + if (op.flags & CEPH_OSD_OP_FLAG_ZERO_NOUNMAP) { + // ZERO -> WRITE_SAME + // Why? Internally storage backend punches holes on zeroing, but we + // need zeroed blocks instead. + + if (osd_op.indata.length()) { + // Zero op with data? No way. + result = -EINVAL; + goto fail; + } + + // Extent and writesame layouts are almost similar, so reset union + // members which are different + op.extent.truncate_size = 0; + op.extent.truncate_seq = 0; + + // Fill in zero data, will be duplicated inside do_writesame() + const char buf[2] = {0}; + osd_op.indata.append(buf, sizeof(buf)); + op.writesame.data_length = sizeof(buf); + + result = do_writesame(ctx, osd_op); + break; + } + The other possible solution is to send CEPH_OSD_OP_WRITESAME directly from krbd instead of CEPH_OSD_OP_ZERO + NOUNMAP flag, but IMO that has a drawback: OP_WRITESAME was implemented on osd couple of years ago and seems that is not very nice to break the compatibility if someone has updated kernel, but still using old ceph cluster. Also ZERO + NOUNMAP has a cleaner semantics. These are just thoughts, nothing is tested, thus RFC. Roman Penyaev (2): libceph, rbd: pass op flags to osd_req_op_extent_init() libceph, rbd: respect REQ_NOUNMAP by setting new nounmap flag for CEPH_OSD_OP_ZERO drivers/block/rbd.c | 51 ++++++++++++++++++++++++--------- include/linux/ceph/osd_client.h | 2 +- include/linux/ceph/rados.h | 1 + net/ceph/osd_client.c | 6 ++-- 4 files changed, 42 insertions(+), 18 deletions(-) Signed-off-by: Roman Penyaev <rpenyaev@xxxxxxx> Cc: Ilya Dryomov <idryomov@xxxxxxxxx> Cc: Sage Weil <sage@xxxxxxxxxx> Cc: Alex Elder <elder@xxxxxxxxxx> Cc: "Yan, Zheng" <zyan@xxxxxxxxxx> Cc: ceph-devel@xxxxxxxxxxxxxxx -- 2.19.1