Re: [PATCH 2/9] spaceman/defrag: pick up segments from target file

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Tue, 16 Jul 2024 20:23:35 +0000

> On Jul 15, 2024, at 4:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:21PM -0700, Wengang Wang wrote:
>> segments are the smallest unit to defragment.
>> 
>> A segment
>> 1. Can't exceed size limit
>> 2. contains some extents
>> 3. the contained extents can't be "unwritten"
>> 4. the contained extents must be contigous in file blocks
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx>
>> ---
>> spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 204 insertions(+)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index c9732984..175cf461 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -14,6 +14,32 @@
>> #include "space.h"
>> #include "input.h"
>> 
>> +#define MAPSIZE 512
>> +/* used to fetch bmap */
>> +struct getbmapx g_mapx[MAPSIZE];
>> +/* current offset of the file in units of 512 bytes, used to fetch bmap */
>> +static long long  g_offset = 0;
>> +/* index to indentify next extent, used to get next extent */
>> +static int g_ext_next_idx = -1;
> 
> Please do not prefix global variables with "g_". This is not
> useful, and simply makes the code hard to read.
> 
> That said, it is much better to pass these as function parameters so
> they are specific to the mapping context and so are inherently
> thread safe.
> 

OK, will try to move them to function parameters, But I do see global variables used in xfsprog.

>> +/*
>> + * segment, the smallest unit to defrag
>> + * it includes some contiguous extents.
>> + * no holes included,
>> + * no unwritten extents included
>> + * the size is limited by g_segment_size_lmt
>> + */
> 
> I have no idea what this comment is trying to tell me.
OK.

> 
>> +struct defrag_segment {
>> + /* segment offset in units of 512 bytes */
>> + long long ds_offset;
>> + /* length of segment in units of 512 bytes */
>> + long long ds_length;
>> + /* number of extents in this segment */
>> + int ds_nr;
>> + /* flag indicating if segment contains shared blocks */
>> + bool ds_shared;
>> +};
>> +
>> /* defrag segment size limit in units of 512 bytes */
>> #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>> #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
>> @@ -78,6 +104,165 @@ defrag_check_file(char *path)
>> return true;
>> }
>> 
>> +/*
>> + * get next extent in the file.
>> + * Note: next call will get the same extent unless move_next_extent() is called.
>> + * returns:
>> + * -1: error happened.
>> + * 0: extent returned
>> + * 1: no more extent left
>> + */
>> +static int
>> +defrag_get_next_extent(int fd, struct getbmapx *map_out)
>> +{
>> + int err = 0, i;
>> +
>> + /* when no extents are cached in g_mapx, fetch from kernel */
>> + if (g_ext_next_idx == -1) {
>> + g_mapx[0].bmv_offset = g_offset;
>> + g_mapx[0].bmv_length = -1LL;
>> + g_mapx[0].bmv_count = MAPSIZE;
>> + g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
>> + err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
>> + if (err == -1) {
>> + perror("XFS_IOC_GETBMAPX failed");
>> + goto out;
>> + }
>> + /* for stats */
>> + g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
>> +
>> + /* no more extents */
>> + if (g_mapx[0].bmv_entries == 0) {
>> + err = 1;
>> + goto out;
>> + }
>> +
>> + /* for stats */
>> + for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
>> + if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
>> + g_ext_stats.nr_ext_unwritten++;
>> + if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
>> + g_ext_stats.nr_ext_shared++;
>> + }
>> +
>> + g_ext_next_idx = 1;
>> + g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
>> + g_mapx[g_mapx[0].bmv_entries].bmv_length;
>> + }
>> +
>> + map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
>> + map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
>> + map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
>> +out:
>> + return err;
>> +}
> 
> Ok, so the global variables are just a bmap cache. That's a problem,
> because this cache is stale the moment XFS_IOC_GETBMAPX returns to
> userspace. Iterating it to decide exactly waht to do next will
> race with ongoing file modifications and so it's not going to be
> accurate....

Yes, there is racing. 
And even we do defrag basing on stale extents, there is harm to the file under defrag
though it might not bring good defrag result.

> 
>> +
>> +/*
>> + * move to next extent
>> + */
>> +static void
>> +defrag_move_next_extent()
>> +{
>> + if (g_ext_next_idx == g_mapx[0].bmv_entries)
>> + g_ext_next_idx = -1;
>> + else
>> + g_ext_next_idx += 1;
>> +}
>> +
>> +/*
>> + * check if the given extent is a defrag target.
>> + * no need to check for holes as we are using BMV_IF_NO_HOLES
>> + */
>> +static bool
>> +defrag_is_target(struct getbmapx *mapx)
>> +{
>> + /* unwritten */
>> + if (mapx->bmv_oflags & BMV_OF_PREALLOC)
>> + return false;
>> + return mapx->bmv_length < g_segment_size_lmt;
>> +}
>> +
>> +static bool
>> +defrag_is_extent_shared(struct getbmapx *mapx)
>> +{
>> + return !!(mapx->bmv_oflags & BMV_OF_SHARED);
>> +}
>> +
>> +/*
>> + * get next segment to defragment.
>> + * returns:
>> + * -1 error happened.
>> + * 0 segment returned.
>> + * 1 no more segments to return
>> + */
>> +static int
>> +defrag_get_next_segment(int fd, struct defrag_segment *out)
>> +{
>> + struct getbmapx mapx;
>> + int ret;
>> +
>> + out->ds_offset = 0;
>> + out->ds_length = 0;
>> + out->ds_nr = 0;
>> + out->ds_shared = false;
> 
> out->ds_nr is never set to anything but zero in this patch.
> 

It’s set at line 211 in the raw patch.

206 +add_ext:
207 +               if (defrag_is_extent_shared(&mapx))
208 +                       out->ds_shared = true;
209 +
210 +               out->ds_length += mapx.bmv_length;
211 +               out->ds_nr += 1;
212 +               defrag_move_next_extent();

>> +
>> + do {
>> + ret = defrag_get_next_extent(fd, &mapx);
>> + if (ret != 0) {
>> + /*
>> +  * no more extetns, return current segment if its not
> 
> Typos everywhere.

OK.

> 
>> +  * empty
>> + */
>> + if (ret == 1 && out->ds_nr > 0)
>> + ret = 0;
>> + /* otherwise, error heppened, stop */
>> + break;
>> + }
> 
>> +
>> + /*
>> +  * If the extent is not a defrag target, skip it.
>> +  * go to next extent if the segment is empty;
>> +  * otherwise return the segment.
>> +  */
>> + if (!defrag_is_target(&mapx)) {
>> + defrag_move_next_extent();
>> + if (out->ds_nr == 0)
>> + continue;
>> + else
>> + break;
>> + }
>> +
>> + /* check for segment size limitation */
>> + if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
>> + break;
>> +
>> + /* the segment is empty now, add this extent to it for sure */
>> + if (out->ds_nr == 0) {
>> + out->ds_offset = mapx.bmv_offset;
>> + goto add_ext;
>> + }
> 
> So this is essentially a filter for the getbmapx output that strips
> away unwritten extents and anything outside/larger than the target
> range.

yes.

> 
>> +
>> + /*
>> +  * the segment is not empty, check for hole since the last exent
>> +  * if a hole exist before this extent, this extent can't be
>> +  * added to the segment. return the segment
>> +  */
>> + if (out->ds_offset + out->ds_length != mapx.bmv_offset)
>> + break;
>> +
>> +add_ext:
> 
> Why do you need a goto for this logic?
> 
> /*
>  * the segment is not empty, check for hole since the last exent
>  * if a hole exist before this extent, this extent can't be
>  * added to the segment. return the segment
>  */
> if (out->ds_nr) {
> if (out->ds_offset + out->ds_length != mapx.bmv_offset)
> break;
> } else {
> out->ds_offset = mapx.bmv_offset;
> }
> 

Above code also work.
The using of "goto add_ext” saved a “if" inside a “if” making the code clearer to me.
But I can change it as you expected.

>> + if (defrag_is_extent_shared(&mapx))
>> + out->ds_shared = true;
>> +
>> + out->ds_length += mapx.bmv_length;
>> + out->ds_nr += 1;
>> + defrag_move_next_extent();
>> +
>> + } while (true);
> 
>> +
>> + return ret;
>> +}
>> +
>> /*
>>  * defragment a file
>>  * return 0 if successfully done, 1 otherwise
>> @@ -92,6 +277,9 @@ defrag_xfs_defrag(char *file_path) {
>> struct fsxattr fsx;
>> int ret = 0;
>> 
>> + g_offset = 0;
>> + g_ext_next_idx = -1;
>> +
>> fsx.fsx_nextents = 0;
>> memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>> 
>> @@ -119,6 +307,22 @@ defrag_xfs_defrag(char *file_path) {
>> ret = 1;
>> goto out;
>> }
>> +
>> + do {
>> + struct defrag_segment segment;
>> +
>> + ret = defrag_get_next_segment(defrag_fd, &segment);
>> + /* no more segments, we are done */
>> + if (ret == 1) {
>> + ret = 0;
>> + break;
>> + }
>> + /* error happened when reading bmap, stop here */
>> + if (ret == -1) {
>> + ret = 1;
>> + break;
>> + }
> 
> ternary return values are nasty. Return a negative errno when a
> an error occurs, and -ENODATA when there are no more segments.
> Then you have
> 
> if (ret < 0) {
> if (ret == -ENODATA)
> exit_value = 0;
> else
> exit_value = 1;
> break;
> }

Agreed.
I have modified defrag_get_next_segment() and defrag_get_next_extent() for V2,
Making them return -1 for error and 0 for success. And the "no more extents/segements”
is checked by looking at their length. Will send them out. 

> 
>> + } while (true);
> 
> Not a fan of do {} while(true) loops.
> 
> WIth the above error handling changes, this becomes:
> 
> do {
> struct defrag_segment segment;
> 
> ret = defrag_get_next_segment(defrag_fd, &segment);
> } while (ret == 0);
> 
> if (ret == 0 || ret == -ENODATA)
> exit_value = 0;
> else
> exit_value = 1;
> 

Yes, I making “big” changes to defrag_get_next_segment()/defrag_get_next_extent()
Please review them when I sent them out.

> 
> Ok, so this is a linear iteration of all extents in the file that
> filters extents for the specific "segment" that is going to be
> processed. I still have no idea why fixed length segments are
> important, but "linear extent scan for filtering" seems somewhat
> expensive.

Hm… fixed length segments — actually not fixed length segments, but segment
size can’t exceed the limitation.  So segment.ds_length <=  LIMIT.
Larger segment take longer time (with filed locked) to defrag. The segment size limit is a way to balance
the defrag and the parallel IO latency.

> 
> Indeed, if you used FIEMAP, you can pass a minimum
> segment length to filter out all the small extents. Iterating that
> extent list means all the ranges you need to defrag are in the holes
> of the returned mapping information. This would be much faster
> than an entire linear mapping to find all the regions with small
> extents that need defrag. The second step could then be doing a
> fine grained mapping of each region that we now know either contains
> fragmented data or holes....

Hm… just a question here:
As your way, say you set the filter length to 2048, all extents with 2048 or less blocks are to defragmented.
What if the extent layout is like this:

1.    1
2.    2049
3.    2
4.    2050
5.    1
6.    2051

In above case, do you do defrag or not?

As I understand the situation, performance of defrag it’s self is not a critical concern here.

Thanks,
Wengang