On 02/18/2015 06:05 PM, "Sebastian Köhler [Alfahosting GmbH]" wrote:
> Hi,
>
> yesterday we had the problem that one of our cluster clients
> remounted an rbd device in read-only mode. We found this[1] stack trace
> in the logs. We investigated further and found similar traces on all
> other machines that are using the rbd kernel module. It seems to me that
> whenever there is a swapping situation on a client, those I/O errors occur.
> Is there anything we can do, or is this something that needs to be fixed
> in the code?

Hi,

I was looking at that code the other day and was thinking rbd.c might need
some changes:

1. We cannot use GFP_KERNEL in the main IO path (requests that are sent
down rbd_request_fn and related helper IO), because under memory pressure
the allocation could trigger writeback and come back down rbd_request_fn
on us.

2. We should use GFP_NOIO instead of GFP_ATOMIC if we have the proper
context and are not holding a spin lock.

3. We should be using a mempool, or preallocate enough memory, so we can
make forward progress on at least one IO at a time (a rough sketch of the
mempool idea follows below, before the patch).

I started to make the attached patch (the attached version is built against
Linus's tree as of today). I think it can be further refined so that we pass
the gfp_t in to some functions, because I think in some cases we could use
GFP_KERNEL and/or do not need the mempool. For example, I think we could use
GFP_KERNEL and skip the mempool in the rbd_obj_watch_request_helper code
paths (a sketch of that refinement is after the patch). I was not done
evaluating all the paths, so I had not yet posted it. The patch is not
tested.

Hey Ilya,

I was not sure about the layering-related code. I thought functions like
rbd_img_obj_parent_read_full could get called as a result of an IO getting
sent down rbd_request_fn, but was not 100% sure. I meant to test it out,
but have been busy with other stuff.
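To make point 3 a little more concrete, here is a rough, untested sketch of
what a mempool for object requests could look like. The pool name, its size,
and the helper names are made up for illustration only; none of this is in
the attached patch.

/* Rough sketch only -- names and pool size are illustrative assumptions.
 * The idea is to reserve a few rbd_obj_request structs so that at least
 * one IO can always make forward progress under memory pressure.
 */
#include <linux/mempool.h>
#include <linux/slab.h>

#define RBD_OBJ_REQ_POOL_MIN	4	/* assumed reserve size */

static mempool_t *rbd_obj_request_pool;

static int rbd_obj_request_pool_init(void)
{
	/* back the pool with the existing rbd_obj_request_cache slab */
	rbd_obj_request_pool = mempool_create_slab_pool(RBD_OBJ_REQ_POOL_MIN,
							rbd_obj_request_cache);
	return rbd_obj_request_pool ? 0 : -ENOMEM;
}

static struct rbd_obj_request *rbd_obj_request_alloc_noio(void)
{
	struct rbd_obj_request *obj_request;

	/* GFP_NOIO: we can be called from the IO path, so the allocation
	 * must not recurse into writeback; the reserved pool elements keep
	 * at least one request moving when the slab allocation cannot be
	 * satisfied.
	 */
	obj_request = mempool_alloc(rbd_obj_request_pool, GFP_NOIO);
	if (obj_request)
		memset(obj_request, 0, sizeof(*obj_request));
	return obj_request;
}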
[PATCH] ceph/rbd: use GFP_NOIO and mempool

1. We cannot use GFP_KERNEL in the main IO path, because it could come
back on us.
2. We should use GFP_NOIO instead of GFP_ATOMIC if we have the proper
context and are not holding a spin lock.
3. We should be using a mempool or preallocate enough mem, so we can
make forward progress on at least one IO at a time.

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8a86b62..c01ecaf 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1915,8 +1915,8 @@ static struct ceph_osd_request *rbd_osd_req_create(
 	/* Allocate and initialize the request, for the num_ops ops */
 
 	osdc = &rbd_dev->rbd_client->client->osdc;
-	osd_req = ceph_osdc_alloc_request(osdc, snapc, num_ops, false,
-					  GFP_ATOMIC);
+	osd_req = ceph_osdc_alloc_request(osdc, snapc, num_ops, true,
+					  GFP_NOIO);
 	if (!osd_req)
 		return NULL;	/* ENOMEM */
 
@@ -1998,11 +1998,11 @@ static struct rbd_obj_request *rbd_obj_request_create(const char *object_name,
 	rbd_assert(obj_request_type_valid(type));
 
 	size = strlen(object_name) + 1;
-	name = kmalloc(size, GFP_KERNEL);
+	name = kmalloc(size, GFP_NOIO);
 	if (!name)
 		return NULL;
 
-	obj_request = kmem_cache_zalloc(rbd_obj_request_cache, GFP_KERNEL);
+	obj_request = kmem_cache_zalloc(rbd_obj_request_cache, GFP_NOIO);
 	if (!obj_request) {
 		kfree(name);
 		return NULL;
@@ -2456,7 +2456,7 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
 					bio_chain_clone_range(&bio_list,
 								&bio_offset,
 								clone_size,
-								GFP_ATOMIC);
+								GFP_NOIO);
 			if (!obj_request->bio_list)
 				goto out_unwind;
 		} else if (type == OBJ_REQUEST_PAGES) {
@@ -2687,7 +2687,7 @@ static int rbd_img_obj_parent_read_full(struct rbd_obj_request *obj_request)
 	 * from the parent.
 	 */
 	page_count = (u32)calc_pages_for(0, length);
-	pages = ceph_alloc_page_vector(page_count, GFP_KERNEL);
+	pages = ceph_alloc_page_vector(page_count, GFP_NOIO);
 	if (IS_ERR(pages)) {
 		result = PTR_ERR(pages);
 		pages = NULL;
@@ -2814,7 +2814,7 @@ static int rbd_img_obj_exists_submit(struct rbd_obj_request *obj_request)
 	 */
 	size = sizeof (__le64) + sizeof (__le32) + sizeof (__le32);
 	page_count = (u32)calc_pages_for(0, size);
-	pages = ceph_alloc_page_vector(page_count, GFP_KERNEL);
+	pages = ceph_alloc_page_vector(page_count, GFP_NOIO);
 	if (IS_ERR(pages))
 		return PTR_ERR(pages);
 
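And as a rough illustration of the gfp_t refinement mentioned above (passing
the mask in from the caller so that non-IO paths such as the watch helpers
can keep GFP_KERNEL), something like the following could work. Again, this
is just an untested sketch; the signature change is not part of the attached
patch.

/* Sketch only, not part of the attached patch: let the caller pick the
 * gfp mask.  IO-path callers would pass GFP_NOIO, while setup/teardown
 * paths like the rbd_obj_watch_request_helper callers could keep
 * GFP_KERNEL.
 */
static struct rbd_obj_request *rbd_obj_request_create(const char *object_name,
						u64 offset, u64 length,
						enum obj_request_type type,
						gfp_t gfp)
{
	struct rbd_obj_request *obj_request;
	char *name;
	size_t size;

	rbd_assert(obj_request_type_valid(type));

	size = strlen(object_name) + 1;
	name = kmalloc(size, gfp);
	if (!name)
		return NULL;

	obj_request = kmem_cache_zalloc(rbd_obj_request_cache, gfp);
	if (!obj_request) {
		kfree(name);
		return NULL;
	}

	/* ... rest of the existing function unchanged ... */
}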