> On Jul 15, 2024, at 5:08 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Tue, Jul 09, 2024 at 12:10:22PM -0700, Wengang Wang wrote: >> For each segment, the following steps are done trying to defrag it: >> >> 1. share the segment with a temporary file >> 2. unshare the segment in the target file. kernel simulates Cow on the whole >> segment complete the unshare (defrag). >> 3. release blocks from the tempoary file. >> >> Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx> >> --- >> spaceman/defrag.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 114 insertions(+) >> >> diff --git a/spaceman/defrag.c b/spaceman/defrag.c >> index 175cf461..9f11e36b 100644 >> --- a/spaceman/defrag.c >> +++ b/spaceman/defrag.c >> @@ -263,6 +263,40 @@ add_ext: >> return ret; >> } >> >> +/* >> + * check if the segment exceeds EoF. >> + * fix up the clone range and return true if EoF happens, >> + * return false otherwise. >> + */ >> +static bool >> +defrag_clone_eof(struct file_clone_range *clone) >> +{ >> + off_t delta; >> + >> + delta = clone->src_offset + clone->src_length - g_defrag_file_size; >> + if (delta > 0) { >> + clone->src_length = 0; // to the end >> + return true; >> + } >> + return false; >> +} >> + >> +/* >> + * get the time delta since pre_time in ms. >> + * pre_time should contains values fetched by gettimeofday() >> + * cur_time is used to store current time by gettimeofday() >> + */ >> +static long long >> +get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time) >> +{ >> + long long us; >> + >> + gettimeofday(cur_time, NULL); >> + us = (cur_time->tv_sec - pre_time->tv_sec) * 1000000; >> + us += (cur_time->tv_usec - pre_time->tv_usec); >> + return us; >> +} >> + >> /* >> * defragment a file >> * return 0 if successfully done, 1 otherwise >> @@ -273,6 +307,7 @@ defrag_xfs_defrag(char *file_path) { >> long nr_seg_defrag = 0, nr_ext_defrag = 0; >> int scratch_fd = -1, defrag_fd = -1; >> char tmp_file_path[PATH_MAX+1]; >> + struct file_clone_range clone; >> char *defrag_dir; >> struct fsxattr fsx; >> int ret = 0; >> @@ -296,6 +331,8 @@ defrag_xfs_defrag(char *file_path) { >> goto out; >> } >> >> + clone.src_fd = defrag_fd; >> + >> defrag_dir = dirname(file_path); > > Just a note: can you please call this the "source fd", not the > "defrag_fd"? defrag_fd could mean either the source or the > temporary scratch file we use as the defrag destination. I have no problem to change that. Just I was thinking there is no data moving happening from a file to another. The defrag_fd means the file under defrag and the temp file is with scratch_fd. > >> snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir, >> getpid()); >> @@ -309,7 +346,11 @@ defrag_xfs_defrag(char *file_path) { >> } >> >> do { >> + struct timeval t_clone, t_unshare, t_punch_hole; >> struct defrag_segment segment; >> + long long seg_size, seg_off; >> + int time_delta; >> + bool stop; >> >> ret = defrag_get_next_segment(defrag_fd, &segment); >> /* no more segments, we are done */ >> @@ -322,6 +363,79 @@ defrag_xfs_defrag(char *file_path) { >> ret = 1; >> break; >> } >> + >> + /* we are done if the segment contains only 1 extent */ >> + if (segment.ds_nr < 2) >> + continue; >> + >> + /* to bytes */ >> + seg_off = segment.ds_offset * 512; >> + seg_size = segment.ds_length * 512; > > Ugh. Do this in the mapping code that gets the extent info. Have it > return bytes. Or just use FIEMAP because it uses byte ranges to > begin with. OK. > >> + >> + clone.src_offset = seg_off; >> + clone.src_length = seg_size; >> + clone.dest_offset = seg_off; >> + >> + /* checks for EoF and fix up clone */ >> + stop = defrag_clone_eof(&clone); > > Ok, so we copy the segment map into clone args, and ... > >> + gettimeofday(&t_clone, NULL); >> + ret = ioctl(scratch_fd, FICLONERANGE, &clone); >> + if (ret != 0) { >> + fprintf(stderr, "FICLONERANGE failed %s\n", >> + strerror(errno)); >> + break; >> + } > > clone the source to the scratch file. This blocks writes to the > source file while it is in progress, but allows reads to pass > through the source file as data is not changing. > > >> + /* for time stats */ >> + time_delta = get_time_delta_us(&t_clone, &t_unshare); >> + if (time_delta > max_clone_us) >> + max_clone_us = time_delta; >> + >> + /* for defrag stats */ >> + nr_ext_defrag += segment.ds_nr; >> + >> + /* >> + * For the shared range to be unshared via a copy-on-write >> + * operation in the file to be defragged. This causes the >> + * file needing to be defragged to have new extents allocated >> + * and the data to be copied over and written out. >> + */ >> + ret = fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, seg_off, >> + seg_size); >> + if (ret != 0) { >> + fprintf(stderr, "UNSHARE_RANGE failed %s\n", >> + strerror(errno)); >> + break; >> + } > > And now we unshare the source file. This blocks all IO to the source > file. Yes, even fetching data from disk is done while IO is locked. I am wondering if we can fetch data without locking IO. That’s also why I added readahead in a later patch. > > Ok, so this is the fundamental problem this whole "segmented > defrag" is trying to work around: FALLOC_FL_UNSHARE_RANGE blocks > all read and write IO whilst it is in progress. > > We had this same problem with FICLONERANGE taking snapshots of VM > files - we changed the locking to take shared IO locks to allow > reads to run while the clone was in progress. Because the Oracle Vm > infrastructure uses a sidecar to redirect writes while a snapshot > (clone) was in progress, no VM IO got blocked while the clone was in > progress and so the applications inside the VM never even noticed a > clone was taking place. > > Why isn't the same infrastructure being used here? > > FALLOC_FL_UNSHARE_RANGE is not changing data, nor is it freeing any > data blocks. Yes, we are re-writing the data somewhere else, but > in that case the original data is still intact in it's original > location on disk and not being freed. > > Hence if a read races with UNSHARE, it will hit a referenced extent > containing the correct data regardless of whether it is in the old > or new file. Hence we can likely use shared IO locking for UNSHARE, > just like we do for FICLONERANGE. > > At this point, if the Oracle VM infrastructure uses the sidecar > write channel whilst the defrag is in progress, this whole algorithm > simply becomes "for regions with extents smaller than X, clone and > unshare the region". > > The whole need for "idle time" goes away. The need for segment size > control largely goes away. The need to tune the defrag algorithm to > avoid IO latency and/or throughput issues goes away. > As we tested, the write-redirecting doesn’t work well. >> + >> + /* for time stats */ >> + time_delta = get_time_delta_us(&t_unshare, &t_punch_hole); >> + if (time_delta > max_unshare_us) >> + max_unshare_us = time_delta; >> + >> + /* >> + * Punch out the original extents we shared to the >> + * scratch file so they are returned to free space. >> + */ >> + ret = fallocate(scratch_fd, >> + FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, seg_off, >> + seg_size); >> + if (ret != 0) { >> + fprintf(stderr, "PUNCH_HOLE failed %s\n", >> + strerror(errno)); >> + break; >> + } > > This is unnecessary if there is lots of free space. You can leave > this to the very end of defrag so that the source file defrag > operation isn't slowed down by cleaning up all the fragmented > extents.... Yes. Two considerations here: 1. The space use as you mentioned, the temp file might grow huge on a low space system. 2. Punching holes on the temp file doesn’t lock the file under defrag, so with the PUNCH_HOLE, we need a little bit less sleep time for each segment. Thanks, Wengang