> On Jan 27, 2016, at 18:45, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Wed, Jan 27, 2016 at 10:42 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>
>>> On Jan 27, 2016, at 17:29, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>
>>> On Wed, Jan 27, 2016 at 2:52 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>>> On Wed, Jan 27, 2016 at 6:15 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>> On Tue, Jan 19, 2016 at 2:30 PM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>>>> This patch makes ceph_writepages_start() try to use a single OSD
>>>>>> request to write all dirty pages within a stripe unit. When a
>>>>>> nonconsecutive dirty page is found, ceph_writepages_start() tries to
>>>>>> add a new write operation to the existing OSD request. If that
>>>>>> succeeds, it uses the new operation to write back the dirty page.
>>>>>>
>>>>>> Signed-off-by: Yan, Zheng <zyan@xxxxxxxxxx>
>>>>>> ---
>>>>>> fs/ceph/addr.c | 263 ++++++++++++++++++++++++++++++++++++++-------------------
>>>>>> 1 file changed, 174 insertions(+), 89 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>>>>>> index c222137..ac14d02 100644
>>>>>> --- a/fs/ceph/addr.c
>>>>>> +++ b/fs/ceph/addr.c
>>>>>> @@ -606,71 +606,71 @@ static void writepages_finish(struct ceph_osd_request *req,
>>>>>>          struct inode *inode = req->r_inode;
>>>>>>          struct ceph_inode_info *ci = ceph_inode(inode);
>>>>>>          struct ceph_osd_data *osd_data;
>>>>>> -        unsigned wrote;
>>>>>>          struct page *page;
>>>>>> -        int num_pages;
>>>>>> -        int i;
>>>>>> +        int num_pages, total_pages = 0;
>>>>>> +        int i, j;
>>>>>> +        int rc = req->r_result;
>>>>>>          struct ceph_snap_context *snapc = req->r_snapc;
>>>>>>          struct address_space *mapping = inode->i_mapping;
>>>>>> -        int rc = req->r_result;
>>>>>> -        u64 bytes = req->r_ops[0].extent.length;
>>>>>>          struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
>>>>>> -        long writeback_stat;
>>>>>> -        unsigned issued = ceph_caps_issued(ci);
>>>>>> +        bool remove_page;
>>>>>>
>>>>>> -        osd_data = osd_req_op_extent_osd_data(req, 0);
>>>>>> -        BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_PAGES);
>>>>>> -        num_pages = calc_pages_for((u64)osd_data->alignment,
>>>>>> -                                   (u64)osd_data->length);
>>>>>> -        if (rc >= 0) {
>>>>>> -                /*
>>>>>> -                 * Assume we wrote the pages we originally sent. The
>>>>>> -                 * osd might reply with fewer pages if our writeback
>>>>>> -                 * raced with a truncation and was adjusted at the osd,
>>>>>> -                 * so don't believe the reply.
>>>>>> -                 */
>>>>>> -                wrote = num_pages;
>>>>>> -        } else {
>>>>>> -                wrote = 0;
>>>>>> +
>>>>>> +
>>>>>> +        dout("writepages_finish %p rc %d\n", inode, rc);
>>>>>> +        if (rc < 0)
>>>>>>                  mapping_set_error(mapping, rc);
>>>>>> -        }
>>>>>> -        dout("writepages_finish %p rc %d bytes %llu wrote %d (pages)\n",
>>>>>> -             inode, rc, bytes, wrote);
>>>>>>
>>>>>> -        /* clean all pages */
>>>>>> -        for (i = 0; i < num_pages; i++) {
>>>>>> -                page = osd_data->pages[i];
>>>>>> -                BUG_ON(!page);
>>>>>> -                WARN_ON(!PageUptodate(page));
>>>>>> +        /*
>>>>>> +         * We lost the cache cap, need to truncate the page before
>>>>>> +         * it is unlocked, otherwise we'd truncate it later in the
>>>>>> +         * page truncation thread, possibly losing some data that
>>>>>> +         * raced its way in
>>>>>> +         */
>>>>>> +        remove_page = !(ceph_caps_issued(ci) &
>>>>>> +                        (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO));
>>>>>>
>>>>>> -                writeback_stat =
>>>>>> -                        atomic_long_dec_return(&fsc->writeback_count);
>>>>>> -                if (writeback_stat <
>>>>>> -                    CONGESTION_OFF_THRESH(fsc->mount_options->congestion_kb))
>>>>>> -                        clear_bdi_congested(&fsc->backing_dev_info,
>>>>>> -                                            BLK_RW_ASYNC);
>>>>>> +        /* clean all pages */
>>>>>> +        for (i = 0; i < req->r_num_ops; i++) {
>>>>>> +                if (req->r_ops[i].op != CEPH_OSD_OP_WRITE)
>>>>>> +                        break;
>>>>>>
>>>>>> -                ceph_put_snap_context(page_snap_context(page));
>>>>>> -                page->private = 0;
>>>>>> -                ClearPagePrivate(page);
>>>>>> -                dout("unlocking %d %p\n", i, page);
>>>>>> -                end_page_writeback(page);
>>>>>> +                osd_data = osd_req_op_extent_osd_data(req, i);
>>>>>> +                BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_PAGES);
>>>>>> +                num_pages = calc_pages_for((u64)osd_data->alignment,
>>>>>> +                                           (u64)osd_data->length);
>>>>>> +                total_pages += num_pages;
>>>>>> +                for (j = 0; j < num_pages; j++) {
>>>>>> +                        page = osd_data->pages[j];
>>>>>> +                        BUG_ON(!page);
>>>>>> +                        WARN_ON(!PageUptodate(page));
>>>>>> +
>>>>>> +                        if (atomic_long_dec_return(&fsc->writeback_count) <
>>>>>> +                            CONGESTION_OFF_THRESH(
>>>>>> +                                    fsc->mount_options->congestion_kb))
>>>>>> +                                clear_bdi_congested(&fsc->backing_dev_info,
>>>>>> +                                                    BLK_RW_ASYNC);
>>>>>> +
>>>>>> +                        ceph_put_snap_context(page_snap_context(page));
>>>>>> +                        page->private = 0;
>>>>>> +                        ClearPagePrivate(page);
>>>>>> +                        dout("unlocking %p\n", page);
>>>>>> +                        end_page_writeback(page);
>>>>>> +
>>>>>> +                        if (remove_page)
>>>>>> +                                generic_error_remove_page(inode->i_mapping,
>>>>>> +                                                          page);
>>>>>>
>>>>>> -                /*
>>>>>> -                 * We lost the cache cap, need to truncate the page before
>>>>>> -                 * it is unlocked, otherwise we'd truncate it later in the
>>>>>> -                 * page truncation thread, possibly losing some data that
>>>>>> -                 * raced its way in
>>>>>> -                 */
>>>>>> -                if ((issued & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0)
>>>>>> -                        generic_error_remove_page(inode->i_mapping, page);
>>>>>> +                        unlock_page(page);
>>>>>> +                }
>>>>>> +                dout("writepages_finish %p wrote %llu bytes cleaned %d pages\n",
>>>>>> +                     inode, osd_data->length, rc >= 0 ? num_pages : 0);
>>>>>>
>>>>>> -                unlock_page(page);
>>>>>> +                ceph_release_pages(osd_data->pages, num_pages);
>>>>>>          }
>>>>>> -        dout("%p wrote+cleaned %d pages\n", inode, wrote);
>>>>>> -        ceph_put_wrbuffer_cap_refs(ci, num_pages, snapc);
>>>>>>
>>>>>> -        ceph_release_pages(osd_data->pages, num_pages);
>>>>>> +        ceph_put_wrbuffer_cap_refs(ci, total_pages, snapc);
>>>>>> +
>>>>>> +        osd_data = osd_req_op_extent_osd_data(req, 0);
>>>>>>          if (osd_data->pages_from_pool)
>>>>>>                  mempool_free(osd_data->pages,
>>>>>>                               ceph_sb_to_client(inode->i_sb)->wb_pagevec_pool);
>>>>>> @@ -778,17 +778,15 @@ retry:
>>>>>>          while (!done && index <= end) {
>>>>>>                  unsigned i;
>>>>>>                  int first;
>>>>>> -                pgoff_t next;
>>>>>> -                int pvec_pages, locked_pages;
>>>>>> -                struct page **pages = NULL;
>>>>>> +                pgoff_t strip_end = 0;
>>>>>> +                int num_ops = 0, max_ops = 0;
>>>>>> +                int pvec_pages, locked_pages = 0;
>>>>>> +                struct page **pages = NULL, **data_pages = NULL;
>>>>>>                  mempool_t *pool = NULL; /* Becomes non-null if mempool used */
>>>>>>                  struct page *page;
>>>>>>                  int want;
>>>>>> -                u64 offset, len;
>>>>>> -                long writeback_stat;
>>>>>> +                u64 offset = 0, len = 0;
>>>>>>
>>>>>> -                next = 0;
>>>>>> -                locked_pages = 0;
>>>>>>                  max_pages = max_pages_ever;
>>>>>>
>>>>>>  get_more_pages:
>>>>>> @@ -824,8 +822,8 @@ get_more_pages:
>>>>>>                                  unlock_page(page);
>>>>>>                                  break;
>>>>>>                          }
>>>>>> -                        if (next && (page->index != next)) {
>>>>>> -                                dout("not consecutive %p\n", page);
>>>>>> +                        if (strip_end && (page->index > strip_end)) {
>>>>>> +                                dout("end of strip %p\n", page);
>>>>>>                                  unlock_page(page);
>>>>>>                                  break;
>>>>>>                          }
>>>>>> @@ -871,28 +869,60 @@ get_more_pages:
>>>>>>                           * that it will use.
>>>>>>                           */
>>>>>>                          if (locked_pages == 0) {
>>>>>> +                                int j;
>>>>>>                                  BUG_ON(pages);
>>>>>>                                  /* prepare async write request */
>>>>>>                                  offset = (u64)page_offset(page);
>>>>>>                                  len = wsize;
>>>>>> -                                req = ceph_osdc_new_request(&fsc->client->osdc,
>>>>>> +
>>>>>> +                                max_ops = 1 + do_sync;
>>>>>> +                                strip_end = page->index +
>>>>>> +                                        ((len - 1) >> PAGE_CACHE_SHIFT);
>>>>>> +                                for (j = i + 1; j < pvec_pages; j++) {
>>>>>> +                                        if (pvec.pages[j]->index > strip_end)
>>>>>> +                                                break;
>>>>>> +                                        if (pvec.pages[j]->index - 1 !=
>>>>>> +                                            pvec.pages[j - 1]->index)
>>>>>> +                                                max_ops++;
>>>>>> +                                }
>>>>>> +
>>>>>> +                                if (max_ops > CEPH_OSD_INITIAL_OP) {
>>>>>> +                                        max_ops = min(max_ops, CEPH_OSD_MAX_OP);
>>>>>> +                                        req = ceph_osdc_new_request(
>>>>>> +                                                &fsc->client->osdc,
>>>>>>                                                  &ci->i_layout, vino,
>>>>>> -                                                offset, &len, 0,
>>>>>> -                                                do_sync ? 2 : 1,
>>>>>> +                                                offset, &len,
>>>>>> +                                                0, max_ops,
>>>>>> +                                                CEPH_OSD_OP_WRITE,
>>>>>> +                                                CEPH_OSD_FLAG_WRITE |
>>>>>> +                                                CEPH_OSD_FLAG_ONDISK,
>>>>>> +                                                snapc, truncate_seq,
>>>>>> +                                                truncate_size, false);
>>>>>> +                                        if (IS_ERR(req))
>>>>>> +                                                req = NULL;
>>>>>> +                                }
>>>>>> +                                if (!req) {
>>>>>> +                                        max_ops = CEPH_OSD_INITIAL_OP;
>>>>>> +                                        req = ceph_osdc_new_request(
>>>>>> +                                                &fsc->client->osdc,
>>>>>> +                                                &ci->i_layout, vino,
>>>>>> +                                                offset, &len,
>>>>>> +                                                0, max_ops,
>>>>>>                                                  CEPH_OSD_OP_WRITE,
>>>>>>                                                  CEPH_OSD_FLAG_WRITE |
>>>>>>                                                  CEPH_OSD_FLAG_ONDISK,
>>>>>>                                                  snapc, truncate_seq,
>>>>>>                                                  truncate_size, true);
>>>>>> -                                if (IS_ERR(req)) {
>>>>>> -                                        rc = PTR_ERR(req);
>>>>>> -                                        unlock_page(page);
>>>>>> -                                        break;
>>>>>> +                                        if (IS_ERR(req)) {
>>>>>> +                                                unlock_page(page);
>>>>>> +                                                rc = PTR_ERR(req);
>>>>>> +                                                req = NULL;
>>>>>> +                                                break;
>>>>>> +                                        }
>>>>>>                                  }
>>>>>>
>>>>>> -                                if (do_sync)
>>>>>> -                                        osd_req_op_init(req, 1,
>>>>>> -                                                        CEPH_OSD_OP_STARTSYNC, 0);
>>>>>> +                                strip_end = page->index +
>>>>>> +                                        ((len - 1) >> PAGE_CACHE_SHIFT);
>>>>>>
>>>>>>                                  req->r_callback = writepages_finish;
>>>>>>                                  req->r_inode = inode;
>>>>>> @@ -905,6 +935,60 @@ get_more_pages:
>>>>>>                                          pages = mempool_alloc(pool, GFP_NOFS);
>>>>>>                                          BUG_ON(!pages);
>>>>>>                                  }
>>>>>> +
>>>>>> +                                num_ops = 1;
>>>>>> +                                data_pages = pages;
>>>>>> +                                len = 0;
>>>>>> +                        } else if (page->index !=
>>>>>> +                                   (offset + len) >> PAGE_CACHE_SHIFT) {
>>>>>> +                                u64 offset_inc;
>>>>>> +                                bool new_op = true;
>>>>>> +
>>>>>> +                                if (num_ops + do_sync == CEPH_OSD_MAX_OP) {
>>>>>> +                                        new_op = false;
>>>>>> +                                } else if (num_ops + do_sync == max_ops) {
>>>>>> +                                        int new_max = max_ops;
>>>>>> +                                        int j;
>>>>>> +                                        for (j = i + 1; j < pvec_pages; j++) {
>>>>>> +                                                if (pvec.pages[j]->index >
>>>>>> +                                                    strip_end)
>>>>>> +                                                        break;
>>>>>> +                                                if (pvec.pages[j]->index - 1 !=
>>>>>> +                                                    pvec.pages[j - 1]->index)
>>>>>> +                                                        new_max++;
>>>>>> +                                        }
>>>>>> +                                        new_max++;
>>>>>> +                                        new_max = min(new_max, CEPH_OSD_MAX_OP);
>>>>>> +                                        if (ceph_osdc_extend_request(req,
>>>>>> +                                                                     new_max,
>>>>>> +                                                                     GFP_NOFS) >= 0)
>>>>>> +                                                max_ops = new_max;
>>>>>> +                                        else
>>>>>> +                                                new_op = false;
>>>>>> +                                }
>>>>>> +                                if (!new_op) {
>>>>>> +                                        redirty_page_for_writepage(wbc, page);
>>>>>> +                                        unlock_page(page);
>>>>>> +                                        break;
>>>>>> +                                }
>>>>>> +
>>>>>> +                                offset_inc = (u64)page_offset(page) - offset;
>>>>>> +                                osd_req_op_extent_dup_last(req, offset_inc);
>>>>>> +
>>>>>> +                                dout("writepages got pages at %llu~%llu\n",
>>>>>> +                                     offset, len);
>>>>>> +
>>>>>> +                                osd_req_op_extent_update(req, num_ops - 1, len);
>>>>>> +                                osd_req_op_extent_osd_data_pages(req,
>>>>>> +                                                                 num_ops - 1,
>>>>>> +                                                                 data_pages,
>>>>>> +                                                                 len, 0,
>>>>>> +                                                                 !!pool, false);
>>>>>> +
>>>>>> +                                num_ops++;
>>>>>> +                                data_pages = pages + locked_pages;
>>>>>> +                                offset = (u64)page_offset(page);
>>>>>> +                                len = 0;
>>>>>>                          }
>>>>>>
>>>>>>                          /* note position of first page in pvec */
>>>>>> @@ -913,9 +997,8 @@ get_more_pages:
>>>>>>                          dout("%p will write page %p idx %lu\n",
>>>>>>                               inode, page, page->index);
>>>>>>
>>>>>> -                        writeback_stat =
>>>>>> -                                atomic_long_inc_return(&fsc->writeback_count);
>>>>>> -                        if (writeback_stat > CONGESTION_ON_THRESH(
>>>>>> +                        if (atomic_long_inc_return(&fsc->writeback_count) >
>>>>>> +                            CONGESTION_ON_THRESH(
>>>>>>                                      fsc->mount_options->congestion_kb)) {
>>>>>>                                  set_bdi_congested(&fsc->backing_dev_info,
>>>>>>                                                    BLK_RW_ASYNC);
>>>>>> @@ -924,7 +1007,7 @@ get_more_pages:
>>>>>>                          set_page_writeback(page);
>>>>>>                          pages[locked_pages] = page;
>>>>>>                          locked_pages++;
>>>>>> -                        next = page->index + 1;
>>>>>> +                        len += PAGE_CACHE_SIZE;
>>>>>>                  }
>>>>>>
>>>>>>                  /* did we get anything? */
>>>>>> @@ -952,31 +1035,34 @@ get_more_pages:
>>>>>>                  }
>>>>>>
>>>>>>                  /* Format the osd request message and submit the write */
>>>>>> -                offset = page_offset(pages[0]);
>>>>>> -                len = (u64)locked_pages << PAGE_CACHE_SHIFT;
>>>>>>                  if (snap_size == -1) {
>>>>>> +                        /* writepages_finish() clears writeback pages
>>>>>> +                         * according to the data length, so make sure
>>>>>> +                         * data length covers all locked pages */
>>>>>> +                        u64 min_len = len + 1 - PAGE_CACHE_SIZE;
>>>>>>                          len = min(len, (u64)i_size_read(inode) - offset);
>>>>>> -                        /* writepages_finish() clears writeback pages
>>>>>> -                         * according to the data length, so make sure
>>>>>> -                         * data length covers all locked pages */
>>>>>> -                        len = max(len, 1 +
>>>>>> -                                  ((u64)(locked_pages - 1) << PAGE_CACHE_SHIFT));
>>>>>> +                        len = max(len, min_len);
>>>>>>                  } else {
>>>>>>                          len = min(len, snap_size - offset);
>>>>>>                  }
>>>>>> -                dout("writepages got %d pages at %llu~%llu\n",
>>>>>> -                     locked_pages, offset, len);
>>>>>> +                dout("writepages got pages at %llu~%llu\n", offset, len);
>>>>>>
>>>>>> -                osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0,
>>>>>> -                                                 !!pool, false);
>>>>>> +                osd_req_op_extent_osd_data_pages(req, num_ops - 1, data_pages,
>>>>>> +                                                 len, 0, !!pool, false);
>>>>>> +                osd_req_op_extent_update(req, num_ops - 1, len);
>>>>>>
>>>>>> +                if (do_sync) {
>>>>>> +                        osd_req_op_init(req, num_ops, CEPH_OSD_OP_STARTSYNC, 0);
>>>>>> +                        num_ops++;
>>>>>> +                }
>>>>>> +                BUG_ON(num_ops > max_ops);
>>>>>> +
>>>>>> +                index = pages[locked_pages - 1]->index + 1;
>>>>>>                  pages = NULL;   /* request message now owns the pages array */
>>>>>>                  pool = NULL;
>>>>>>
>>>>>>                  /* Update the write op length in case we changed it */
>>>>>>
>>>>>> -                osd_req_op_extent_update(req, 0, len);
>>>>>> -
>>>>>>                  vino = ceph_vino(inode);
>>>>>>                  ceph_osdc_build_request(req, offset, snapc, vino.snap,
>>>>>>                                          &inode->i_mtime);
>>>>>> @@ -986,7 +1072,6 @@ get_more_pages:
>>>>>>                  req = NULL;
>>>>>>
>>>>>>                  /* continue? */
>>>>>> -                index = next;
>>>>>>                  wbc->nr_to_write -= locked_pages;
>>>>>>                  if (wbc->nr_to_write <= 0)
>>>>>>                          done = 1;
>>>>>
>>>>> Hi Zheng,
>>>>>
>>>>> I know I mentioned off-list that v3 looked good, but I've taken
>>>>> a deeper look and have to retract that. This version still has a call
>>>>> to ceph_osdc_extend_request(), which suggests to me that the number of
>>>>> ops is calculated incorrectly.
>>>>>
>>>>> A single op within an OSD request can be used for a single region of
>>>>> contiguous dirty pages, so what this comes down to is calculating the
>>>>> number of such dirty contiguous regions. What you appear to be doing
>>>>> in this version is calculating this number for the *first* gang of
>>>>> pages we get from pagevec_lookup_tag() and allocating an OSD request
>>>>> based on that. If that's not enough (i.e. there are more dirty pages
>>>>> in the stripe unit), we move on, you do the calculation for the next
>>>>> gang and enlarge the OSD request by reallocating parts of it. This
>>>>> goes on until we either run out of dirty pages in the stripe unit or
>>>>> reach CEPH_OSD_MAX_OP ops in the request.
>>>>>
>>>>> What I don't understand is what prevents us from pre-calculating the
>>>>> number of dirty contiguous regions for the entire stripe unit in one
>>>>> go and trying to allocate the OSD request based on that number,
>>>>> possibly falling back to a smaller request if we are short on memory.
>>>>> The page pointer array can hold all of the pages in the stripe unit -
>>>>> I don't see the need for ceph_osdc_extend_request() at all. Am
>>>>> I missing something?
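
For illustration, a minimal sketch of the pre-calculation being discussed,
assuming the indices of the locked dirty pages have already been gathered in
ascending order (count_dirty_regions() and its arguments are hypothetical,
not actual kernel helpers):

/*
 * Sketch only: count the contiguous runs of dirty pages.  Each run maps
 * to one CEPH_OSD_OP_WRITE op, so a request allocated with this many ops
 * (plus one for an optional STARTSYNC op) never needs to be extended.
 */
static int count_dirty_regions(const unsigned long *idx, int n)
{
        int i, regions = 0;

        for (i = 0; i < n; i++) {
                /* a new region starts wherever the indices stop being consecutive */
                if (i == 0 || idx[i] != idx[i - 1] + 1)
                        regions++;
        }
        return regions;
}

With that count known up front, ceph_osdc_new_request() can be called once
with exactly the needed number of ops (capped at CEPH_OSD_MAX_OP), falling
back to a smaller request only if the allocation fails.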
>>>>>
>>>>
>>>> Pre-calculating the number of regions in the stripe unit is ideal for
>>>> random writes, but it imposes overhead on the most common sequential
>>>> writes (that's why I don't allocate CEPH_OSD_MAX_OP ops in the first
>>>> place). Besides, pre-calculating the number of regions requires big
>>>
>>> Care to be more specific about the overhead? You have to go through
>>> enough of the dirty pages to either stuff up the OSD request or move
>>> past the stripe unit no matter which type of write it is. Allocating
>>> CEPH_OSD_MAX_OP right off the bat is of course a bad idea, especially
>>> in the sequential case, but that's not what I'm advocating for. What
>>> I'm proposing would result in allocating a request with exactly the
>>> number of ops needed, in both the sequential and random cases - no
>>> more, no less.
>>
>> Scanning the pages within a stripe unit requires calling
>> pagevec_lookup_tag() multiple times. There are about 1000 pages in the
>> worst case, which means we need to get/release pages 1000 times. In my
>> opinion, that's non-trivial overhead.
>
> Which page would have to be gotten/released 1000 times? We already
> call pagevec_lookup_tag() more than once, until we either move past the
> stripe unit or fill in the OSD request. Pre-calculating the number of
> dirty regions properly doesn't entail any additional calls to it. We
> lock the page or release it - either way, it happens once. Locked pages
> are stashed in the pages pointer array - is there a reason we can't
> defer the OSD request allocation until after we lock and stash all of
> the pages we are interested in? Unless I'm misreading the code, the
> page pointer array is always big enough to accommodate them.

Makes sense, I will update my patches.

Regards
Yan, Zheng

>
> Thanks,
>
>                 Ilya

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html