[RFC 0/2] rbd: respect REQ_NOUNMAP by setting new nounmap flag for ZERO op

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi all,

This is an attempt to split DISCARD and WRITE_ZEROES paths on krbd side
when REQ_NOUNMAP flag is set for a block layer request.

Currently both REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES block layer requests
fall down to CEPH_OSD_OP_ZERO request, which punches holes on osd side.

With a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag for CEPH_OSD_OP_ZERO request
osd can zero out blocks, instead of punching holes.

Possible handling of a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP on osd:

 diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
 index dff549535d39..d812d88d7bc0 100644
 --- a/src/osd/PrimaryLogPG.cc
 +++ b/src/osd/PrimaryLogPG.cc
 @@ -5788,6 +5788,7 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
 
      // munge ZERO -> TRUNCATE?  (don't munge to DELETE or we risk hosing attributes)
      if (op.op == CEPH_OSD_OP_ZERO &&
 +       !(op.flags & CEPH_OSD_OP_FLAG_ZERO_NOUNMAP) &&
          obs.exists &&
          op.extent.offset < static_cast<Option::size_t>(osd->osd_max_object_size) &&
          op.extent.length >= 1 &&
 @@ -6583,3 +6584,28 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
         result = -EOPNOTSUPP;
         break;
        }
 +      if (op.flags & CEPH_OSD_OP_FLAG_ZERO_NOUNMAP) {
 +        // ZERO -> WRITE_SAME
 +        // Why?  Internally storage backend punches holes on zeroing, but we
 +        // need zeroed blocks instead.
 +
 +        if (osd_op.indata.length()) {
 +          // Zero op with data? No way.
 +          result = -EINVAL;
 +          goto fail;
 +        }
 +
 +        // Extent and writesame layouts are almost similar, so reset union
 +        // members which are different
 +        op.extent.truncate_size = 0;
 +        op.extent.truncate_seq  = 0;
 +
 +        // Fill in zero data, will be duplicated inside do_writesame()
 +        const char buf[2] = {0};
 +        osd_op.indata.append(buf, sizeof(buf));
 +        op.writesame.data_length = sizeof(buf);
 +
 +        result = do_writesame(ctx, osd_op);
 +        break;
 +      }
 +

The other possible solution is to send CEPH_OSD_OP_WRITESAME directly from
krbd instead of CEPH_OSD_OP_ZERO + NOUNMAP flag, but IMO that has a
drawback: OP_WRITESAME was implemented on osd couple of years ago and seems
that is not very nice to break the compatibility if someone has updated
kernel, but still using old ceph cluster. Also ZERO + NOUNMAP has a cleaner
semantics.

These are just thoughts, nothing is tested, thus RFC.

Roman Penyaev (2):
  libceph, rbd: pass op flags to osd_req_op_extent_init()
  libceph, rbd: respect REQ_NOUNMAP by setting new nounmap flag for
    CEPH_OSD_OP_ZERO

 drivers/block/rbd.c             | 51 ++++++++++++++++++++++++---------
 include/linux/ceph/osd_client.h |  2 +-
 include/linux/ceph/rados.h      |  1 +
 net/ceph/osd_client.c           |  6 ++--
 4 files changed, 42 insertions(+), 18 deletions(-)

Signed-off-by: Roman Penyaev <rpenyaev@xxxxxxx>
Cc: Ilya Dryomov <idryomov@xxxxxxxxx>
Cc: Sage Weil <sage@xxxxxxxxxx>
Cc: Alex Elder <elder@xxxxxxxxxx>
Cc: "Yan, Zheng" <zyan@xxxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx
-- 
2.19.1




[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux