Xiubo Li <xiubli@xxxxxxxxxx> wrote:

> Please note the 'ceph_cap_snap' struct is orient to the MDS daemons in
> cephfs and for metadata only. While the 'ceph_snap_context' struct is
> orient to the ceph Rados for data only instead.

That doesn't seem to be how it is implemented, though. Looking at the
writeback code, the ordered list of snaps to be written back *is* the
ceph_cap_snap list. The writeback code goes through the list in order,
writing out the data associated with the snapc that is the ->context for
each capsnap.

When a new snapshot occurs, the snap notification service code allocates and
queues a new capsnap on each inode. Is there any reason that cannot be done
lazily?

Also, it appears there may be an additional bug here: if memory allocation
fails when allocating a new capsnap in queue_realm_cap_snaps(), it abandons
the process (not unreasonably), but leaves a bunch of inodes with their old
capsnaps. If this could be done lazily, the failure could instead be handled
in the write routines, where it can be bounced back to the caller.

> As I mentioned above once a page/folio set the 'ceph_snap_context' struct it
> shouldn't change any more.

Until the page has been cleaned, I presume.

> Then could you explain how to pass the snapshot set to ceph Rados when
> flushing each dirty page/folio ?

The netfs_group pointer passed into netfs_write_group() is stored in the
netfs_io_request struct and is thus accessible by the filesystem. I've
attached the three functions involved below. Note that this is a work in
progress and will most certainly require further changes. It's also untested
as I haven't got it to a stage that I can test this part of the code yet.

> I don't understand how to bind the 'ceph_cap_snap' struct to dirty pages
> here.

Point folio->private at it and take a ref on it, much as is currently done
for the snapc. At the moment, capsnap->context points to the snapc and
folio->private points to that same snapc; in writeback we take the context
snapc from the capsnap and iterate over all the folios on that inode looking
for any with a private pointer that matches that snapc. We could store the
capsnap pointer there instead and iterate looking for that.
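
Very roughly, and just as an illustration rather than anything from the
patch set (ceph_attach_capsnap is a made-up name and the exact ref handling
may need to differ), binding the capsnap to a dirty folio might look
something like:

static void ceph_attach_capsnap(struct folio *folio,
				struct ceph_cap_snap *capsnap)
{
	/* Pin the capsnap; the ref would be dropped when the folio is
	 * cleaned, much as the snapc ref is dropped today. */
	refcount_inc(&capsnap->nref);
	folio_attach_private(folio, capsnap);
}

Writeback would then compare folio_get_private(folio) against the capsnap
being flushed rather than against its snapc.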

-~-

I've pushed my patches out to:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=ceph-iter

if you want a look. Note that the top 7 patches are the ones in flux at the
moment while I try and write the netfs helpers for the ceph filesystem.
Don't expect any of those to compile.

Actually, I'd appreciate it if you could have a look at the patches up to
"rbd: Use ceph_databuf_enc_start/stop()". Those are almost completely done,
though with one exception: there's one bit in RBD that still needs dealing
with (marked with a #warning and an "[INCOMPLETE]" marker in the patch
subject) - but RBD does otherwise work.

The aim is to remove all the different data types, except for two: one where
an iov_iter is provided (such as an ITER_XARRAY referring directly to the
pagecache) and one where we have a "ceph_databuf" with a bvec[]. Ideally,
there would only be one, but I'll see if I can manage that.

David
---
/*
 * Handle the termination of a write to the server.
 */
static void ceph_write_terminated(struct netfs_io_subrequest *subreq, int err)
{
	struct ceph_io_subrequest *csub =
		container_of(subreq, struct ceph_io_subrequest, sreq);
	struct ceph_io_request *creq = csub->creq;
	struct ceph_osd_request *req = csub->req;
	struct inode *inode = creq->rreq.inode;
	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
	struct ceph_client *cl = ceph_inode_to_client(inode);
	struct ceph_inode_info *ci = ceph_inode(inode);

	ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
				  req->r_end_latency, subreq->len, err);

	if (err != 0) {
		doutc(cl, "sync_write osd write returned %d\n", err);

		/* Version changed! Must re-do the rmw cycle */
		if ((creq->rmw_assert_version &&
		     (err == -ERANGE || err == -EOVERFLOW)) ||
		    (!creq->rmw_assert_version && err == -EEXIST)) {
			/* We should only ever see this on a rmw */
			WARN_ON_ONCE(!test_bit(NETFS_RREQ_RMW, &ci->netfs.flags));

			/* The version should never go backward */
			WARN_ON_ONCE(err == -EOVERFLOW);

			/* FIXME: limit number of times we loop? */
			set_bit(NETFS_RREQ_REPEAT_RMW, &ci->netfs.flags);
		}
		ceph_set_error_write(ci);
	} else {
		ceph_clear_error_write(ci);
	}

	csub->req = NULL;
	ceph_osdc_put_request(req);
	netfs_subreq_terminated(subreq, err < 0 ? err : subreq->len, false);
}

static void ceph_upload_to_server(struct netfs_io_subrequest *subreq)
{
	struct ceph_io_subrequest *csub =
		container_of(subreq, struct ceph_io_subrequest, sreq);
	struct ceph_osd_request *req = csub->req;
	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(subreq->rreq->inode);
	struct ceph_osd_client *osdc = &fsc->client->osdc;

	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
	ceph_osdc_start_request(osdc, req);
}

/*
 * Set up write requests for a writeback slice. We need to add a write
 * request for each write we want to make.
 */
void ceph_create_write_requests(struct netfs_io_request *wreq,
				loff_t start, size_t remain)
{
	struct netfs_io_subrequest *subreq;
	struct ceph_io_subrequest *csub;
	struct ceph_snap_context *snapc =
		container_of(wreq->group, struct ceph_snap_context, group);
	struct inode *inode = wreq->inode;
	struct ceph_inode_info *ci = ceph_inode(inode);
	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
	struct ceph_client *cl = ceph_inode_to_client(inode);
	struct ceph_osd_client *osdc = &fsc->client->osdc;
	struct ceph_osd_request *req;
	struct timespec64 mtime = current_time(inode);
	bool rmw = test_bit(NETFS_RREQ_RMW, &ci->netfs.flags);

	doutc(cl, "create_write_reqs on inode %p %lld~%zu snapc %p seq %lld\n",
	      inode, start, remain, snapc, snapc->seq);

	do {
		unsigned long long llpart;
		size_t part = remain;
		u64 assert_ver = 0;
		u64 objnum, objoff;

		subreq = netfs_create_write_request(wreq, NETFS_UPLOAD_TO_SERVER,
						    start, remain, NULL);
		if (!subreq) {
			wreq->error = -ENOMEM;
			break;
		}
		csub = container_of(subreq, struct ceph_io_subrequest, sreq);

		/* clamp the length to the end of first object */
		ceph_calc_file_object_mapping(&ci->i_layout, start, remain,
					      &objnum, &objoff, &part);
		llpart = part;

		doutc(cl, "create_write_reqs ino %llx %lld~%zu part %lld~%llu -- %srmw\n",
		      ci->i_vino.ino, start, remain, start, llpart,
		      rmw ? "" : "no ");

		req = ceph_osdc_new_request(osdc, &ci->i_layout, ci->i_vino,
					    start, &llpart,
					    rmw ? 1 : 0, rmw ? 2 : 1,
					    CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
					    snapc, ci->i_truncate_seq,
					    ci->i_truncate_size, false);
		if (IS_ERR(req)) {
			wreq->error = PTR_ERR(req);
			break;
		}
		part = llpart;

		doutc(cl, "write op %lld~%zu\n", start, part);
		osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter);
		req->r_inode = inode;
		req->r_mtime = mtime;
		req->subreq = subreq;
		csub->req = req;

		/*
		 * If we're doing an RMW cycle, set up an assertion that the
		 * remote data hasn't changed. If we don't have a version
		 * number, then the object doesn't exist yet. Use an exclusive
		 * create instead of a version assertion in that case.
		 */
		if (rmw) {
			if (assert_ver) {
				osd_req_op_init(req, 0, CEPH_OSD_OP_ASSERT_VER, 0);
				req->r_ops[0].assert_ver.ver = assert_ver;
			} else {
				osd_req_op_init(req, 0, CEPH_OSD_OP_CREATE,
						CEPH_OSD_OP_FLAG_EXCL);
			}
		}

		ceph_upload_to_server(subreq);

		start += part;
		remain -= part;
	} while (remain > 0 && !wreq->error);

	dout("create_write_reqs returning %d\n", wreq->error);
}