Hi Maged,
On 06/30/2019 09:55 PM, Maged Mokhtar wrote:
Hi Ilya,
Nice work. Some comments/questions:
1) Generally, having a state machine makes things easier to track, as
the code is less dispersed than it was before.
2) The running_list is used to keep track of in-flight requests when the
exclusive lock is held, so that rbd_quiesce_lock() can wait for them
when the lock is being released. It would be great to generalize this
list to track all in-flight requests, even when the lock is not
required; it could then be used to support a generic flush (similar to
the librbd work queue flush/drain). A generic in-flight request list
would also make it easier to support other functions such as task
aborts, request timeouts, and flushing before snapshots. A rough sketch
of what I mean follows below.
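To make this concrete, here is a minimal sketch. All of the names
(inflight_list, inflight_item, rbd_img_track(), rbd_flush_requests(),
etc.) are made up for illustration, they are not from the patchset:

/*
 * Hypothetical fields added to struct rbd_device:
 *
 *        struct list_head        inflight_list;
 *        spinlock_t              inflight_lock;
 *        wait_queue_head_t       inflight_wq;
 *
 * and to struct rbd_img_request:
 *
 *        struct list_head        inflight_item;
 */

static void rbd_img_track(struct rbd_device *rbd_dev,
                          struct rbd_img_request *img_req)
{
        /* called for every image request, lock or no lock */
        spin_lock(&rbd_dev->inflight_lock);
        list_add_tail(&img_req->inflight_item, &rbd_dev->inflight_list);
        spin_unlock(&rbd_dev->inflight_lock);
}

static void rbd_img_untrack(struct rbd_device *rbd_dev,
                            struct rbd_img_request *img_req)
{
        spin_lock(&rbd_dev->inflight_lock);
        list_del(&img_req->inflight_item);
        spin_unlock(&rbd_dev->inflight_lock);
        wake_up(&rbd_dev->inflight_wq);
}

static bool rbd_no_inflight(struct rbd_device *rbd_dev)
{
        bool empty;

        spin_lock(&rbd_dev->inflight_lock);
        empty = list_empty(&rbd_dev->inflight_list);
        spin_unlock(&rbd_dev->inflight_lock);
        return empty;
}

/* generic flush: wait for all in-flight requests to complete */
static void rbd_flush_requests(struct rbd_device *rbd_dev)
{
        wait_event(rbd_dev->inflight_wq, rbd_no_inflight(rbd_dev));
}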
3) For the acquiring_list, my concern is that while the lock acquisition
is pending, requests are accepted without limit. If acquiring the lock
is delayed, for example because the primary OSD of the header object is
down (which could block for ~25 seconds), or worse because the pool is
inactive, the count could well exceed the max queue depth, and for write
requests this can consume a lot of memory. One way to push back is
sketched below.
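For example (a sketch only; RBD_MAX_ACQUIRING and acquiring_count are
made-up names, and I am assuming the spinlock that guards the new lists
is usable here), the queue_rq path could push back on the block layer
instead of buffering without bound:

#define RBD_MAX_ACQUIRING        128        /* arbitrary cap, for illustration */

static bool rbd_may_queue_acquiring(struct rbd_device *rbd_dev)
{
        bool ok;

        /* hypothetical counter, guarded like acquiring_list itself */
        spin_lock(&rbd_dev->lock_lists_lock);
        ok = rbd_dev->acquiring_count < RBD_MAX_ACQUIRING;
        if (ok)
                rbd_dev->acquiring_count++;
        spin_unlock(&rbd_dev->lock_lists_lock);
        return ok;
}

/*
 * In the blk-mq queue_rq path, before putting the request on
 * acquiring_list:
 *
 *        if (!rbd_may_queue_acquiring(rbd_dev))
 *                return BLK_STS_RESOURCE;  (blk-mq will retry later)
 */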
4) At the end of rbd_img_exclusive_lock(), we queue an acquire-lock task
for every request. I understand this is a single-threaded queue and that
all acquire tasks are cancelled once the lock is acquired; however, I
feel the queue could fill up a lot. Any chance we can schedule only one
acquire task? See the sketch below.
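Something along these lines, perhaps; the flag bit is made up, while
task_wq and lock_dwork are from the existing code. Note also that
queue_delayed_work() already refuses to double-queue a work item that is
still pending, so the flag would mainly cover the window after the work
has started running:

/* hypothetical new bit in rbd_dev->flags */
#define RBD_DEV_FLAG_ACQUIRE_QUEUED        2

static void rbd_schedule_acquire(struct rbd_device *rbd_dev)
{
        /*
         * Only the first caller queues the acquire task; the bit would
         * be cleared when the acquire attempt completes (or fails).
         */
        if (!test_and_set_bit(RBD_DEV_FLAG_ACQUIRE_QUEUED, &rbd_dev->flags))
                queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0);
}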
5) The RBD_IMG_EXCLUSIVE_LOCK state is used both when the image does not
require the exclusive lock and when the lock has been acquired.
Cosmetically, it may be better to separate the two, as sketched below.
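i.e. something like this (the state names are illustrative only):

enum rbd_img_state {
        RBD_IMG_START = 1,
        RBD_IMG_LOCK_NOT_NEEDED,        /* exclusive-lock feature is off */
        RBD_IMG_LOCK_ACQUIRED,          /* the lock is actually held */
        RBD_IMG_OBJECT_REQUESTS,
        RBD_IMG_DONE,
};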
6) Probably not an issue, but we are now creating a large number of
mutexes, at least two for every request. Maybe we need to test at high
IOPS/queue depths that this adds no measurable overhead.
7) Is there any consideration of splitting the rbd module into multiple
files? Looking at how big librbd is, fitting this into a single kernel
file is challenging at best.
Ilya could provide more information, but I can share something about
this. Ilya and I discussed splitting rbd.c, as it has really grown big
enough, but we want the work in progress (this patchset) to be merged
first. A rough idea is to pull exclusive-lock and object-map out of
rbd.c.
Thanx
/Maged
On 25/06/2019 16:40, Ilya Dryomov wrote:
Hello,
This series adds support for the object-map and fast-diff image features.
Patches 1 - 11 prepare the object and image request state machines;
patches 12 - 14 fix most of the shortcomings in our exclusive lock code,
making it suitable for guarding the object map; patches 15 - 18 take care
of the prerequisites; and finally patches 19 - 20 implement object-map
and fast-diff.
Thanks,
Ilya
Ilya Dryomov (20):
rbd: get rid of obj_req->xferred, obj_req->result and
img_req->xferred
rbd: replace obj_req->tried_parent with obj_req->read_state
rbd: get rid of RBD_OBJ_WRITE_{FLAT,GUARD}
rbd: move OSD request submission into object request state machines
rbd: introduce image request state machine
rbd: introduce obj_req->osd_reqs list
rbd: factor out rbd_osd_setup_copyup()
rbd: factor out __rbd_osd_setup_discard_ops()
rbd: move OSD request allocation into object request state machines
rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()
rbd: introduce copyup state machine
rbd: lock should be quiesced on reacquire
rbd: quiescing lock should wait for image requests
rbd: new exclusive lock wait/wake code
libceph: bump CEPH_MSG_MAX_DATA_LEN (again)
libceph: change ceph_osdc_call() to take page vector for response
libceph: export osd_req_op_data() macro
rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()
rbd: support for object-map and fast-diff
rbd: setallochint only if object doesn't exist
drivers/block/rbd.c | 2433 ++++++++++++++++++--------
drivers/block/rbd_types.h | 10 +
include/linux/ceph/cls_lock_client.h | 3 +
include/linux/ceph/libceph.h | 6 +-
include/linux/ceph/osd_client.h | 10 +-
include/linux/ceph/striper.h | 2 +
net/ceph/cls_lock_client.c | 47 +-
net/ceph/osd_client.c | 18 +-
net/ceph/striper.c | 17 +
9 files changed, 1817 insertions(+), 729 deletions(-)