Re: [PATCH 2/9] spaceman/defrag: pick up segments from target fileOM

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Thu, 11 Jul 2024 22:37:54 +0000

> On Jul 9, 2024, at 2:50 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:21PM -0700, Wengang Wang wrote:
>> segments are the smallest unit to defragment.
>> 
>> A segment
>> 1. Can't exceed size limit
> 
> What size limit?  Do you mean a segment can't extend beyond EOF?  Or did
> you actually mean RLIMIT_FSIZE?

We defrag things by share (non-contiguous) blocks and then UNSHARE them to get contiguous
Blocks.  The size limit is the limit of the range for which we issue a UNSHARE ioctl call.
We put some contiguous extents (mappings) together (as a segment) for a UNSHARE call.
We don’t hope that’s too big taking long time IO lock on file. Default size limit is 16MiB.

> 
>> 2. contains some extents
>> 3. the contained extents can't be "unwritten"
>> 4. the contained extents must be contigous in file blocks
> 
> As in the segment cannot contain sparse holes?

Holes are meaningless for unshare calls, so they are not included in segments.

> 
> I think what I"m reading here is that a segment cannot extend beyond EOF
> and must be completely filled with written extent mappings?

Yes, a segment as the unit/range for UNSHARE, 
1) extents included in the segment must be contiguous by file offset
2) is not a hole
3) is not unwritten extents. Unwritten extents are meaningless for UNSHARE.

> 
> Is there an upper limit on the number of mappings per segment?
> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx>
>> ---
>> spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 204 insertions(+)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index c9732984..175cf461 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -14,6 +14,32 @@
>> #include "space.h"
>> #include "input.h"
>> 
>> +#define MAPSIZE 512
>> +/* used to fetch bmap */
>> +struct getbmapx g_mapx[MAPSIZE];
> 
> Each of these global arrays increases the bss segment size, which
> increases the overall footprint of xfs_spaceman, even when it's not
> being used to defragment files.
> 
> Could you switch this data to be dynamically allocated at the start of
> defrag_f and freed at the end?
> 

Yes, I can.
Well I am wondering the binary size is concern this much?
After adding defrag, xfs_space has a size of 239K, I don’t think it’s too big.

$ ll -h spaceman/xfs_spaceman
-rwxrwxr-x 1 ubuntu ubuntu 239K Jul 10 17:20 spaceman/xfs_spaceman*

>> +/* current offset of the file in units of 512 bytes, used to fetch bmap */
>> +static long long  g_offset = 0;
> 
> Unnecessary space after the 'long long'.
> 

ok.

>> +/* index to indentify next extent, used to get next extent */
>> +static int g_ext_next_idx = -1;
>> +
>> +/*
>> + * segment, the smallest unit to defrag
>> + * it includes some contiguous extents.
>> + * no holes included,
>> + * no unwritten extents included
>> + * the size is limited by g_segment_size_lmt
>> + */
>> +struct defrag_segment {
>> + /* segment offset in units of 512 bytes */
>> + long long ds_offset;
>> + /* length of segment in units of 512 bytes */
>> + long long ds_length;
>> + /* number of extents in this segment */
>> + int ds_nr;
>> + /* flag indicating if segment contains shared blocks */
>> + bool ds_shared;
> 
> Maybe g_mapx belongs in here?  Wait, does a bunch of contiguous written
> bmapx records comprise a segment, or is a segment created from (possibly
> a subection of) a particular written bmapx record?

g_mapx is used to fetch the mappings from beginning of the file to the end.
It,
1) can include at most 511 (MAPSIZE-1) extents/mappings
2) can fill more than one segment
3) can be a part of a segment

We are walking through the extents/mappings to form segments.
g_mapx is used in walking through the extents/mappings.
For forming a segment, pls see other functions. 

> 
>> +};
>> +
>> /* defrag segment size limit in units of 512 bytes */
>> #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>> #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
>> @@ -78,6 +104,165 @@ defrag_check_file(char *path)
>> return true;
>> }
>> 
>> +/*
>> + * get next extent in the file.
>> + * Note: next call will get the same extent unless move_next_extent() is called.
>> + * returns:
>> + * -1: error happened.
>> + * 0: extent returned
>> + * 1: no more extent left
>> + */
>> +static int
>> +defrag_get_next_extent(int fd, struct getbmapx *map_out)
>> +{
>> + int err = 0, i;
>> +
>> + /* when no extents are cached in g_mapx, fetch from kernel */
>> + if (g_ext_next_idx == -1) {
>> + g_mapx[0].bmv_offset = g_offset;
>> + g_mapx[0].bmv_length = -1LL;
>> + g_mapx[0].bmv_count = MAPSIZE;
>> + g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
>> + err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
>> + if (err == -1) {
>> + perror("XFS_IOC_GETBMAPX failed");
>> + goto out;
>> + }
>> + /* for stats */
>> + g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
>> +
>> + /* no more extents */
>> + if (g_mapx[0].bmv_entries == 0) {
>> + err = 1;
>> + goto out;
>> + }
>> +
>> + /* for stats */
>> + for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
>> + if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
>> + g_ext_stats.nr_ext_unwritten++;
>> + if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
>> + g_ext_stats.nr_ext_shared++;
>> + }
>> +
>> + g_ext_next_idx = 1;
>> + g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
>> + g_mapx[g_mapx[0].bmv_entries].bmv_length;
>> + }
> 
> Huh.  AFAICT, g_ext_next_idx/g_mapx effectively act as a cursor over the
> mappings for this file segment.  

Yes, right.

> In that case, shouldn't
> defrag_move_next_extent (which actually advances the cursor) be in
> charge of grabbing bmapx records from the kernel, and
> defrag_get_next_extent be the trivial helper to pass mappings to the
> consumer of the bmapx objects (aka defrag_get_next_segment)?
> 

Defrag_move_next_extent moves the cursor and defrag_get_next_extent
Pick the extent on cursor but doesn’t move the cursor.
There is case that if we add current extent (returned by defrag_get_next_extent)
 to current segment, current segment would be too big. So we don’t add current extent
  But will add it to next segment. For next segment,  defrag_get_next_extent is called
To get the extent in question again.
So as summary on the cases we move cursor:
1). The extent on cursor is not target to defrag (unwritten or too big)
2). The extent is added in current segment.

>> +
>> + map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
>> + map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
>> + map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
>> +out:
>> + return err;
>> +}
>> +
>> +/*
>> + * move to next extent
>> + */
>> +static void
>> +defrag_move_next_extent()
>> +{
>> + if (g_ext_next_idx == g_mapx[0].bmv_entries)
>> + g_ext_next_idx = -1;
>> + else
>> + g_ext_next_idx += 1;
>> +}
>> +
>> +/*
>> + * check if the given extent is a defrag target.
>> + * no need to check for holes as we are using BMV_IF_NO_HOLES
>> + */
>> +static bool
>> +defrag_is_target(struct getbmapx *mapx)
>> +{
>> + /* unwritten */
>> + if (mapx->bmv_oflags & BMV_OF_PREALLOC)
>> + return false;
>> + return mapx->bmv_length < g_segment_size_lmt;
>> +}
>> +
>> +static bool
>> +defrag_is_extent_shared(struct getbmapx *mapx)
>> +{
>> + return !!(mapx->bmv_oflags & BMV_OF_SHARED);
>> +}
>> +
>> +/*
>> + * get next segment to defragment.
>> + * returns:
>> + * -1 error happened.
>> + * 0 segment returned.
>> + * 1 no more segments to return
>> + */
>> +static int
>> +defrag_get_next_segment(int fd, struct defrag_segment *out)
>> +{
>> + struct getbmapx mapx;
>> + int ret;
>> +
>> + out->ds_offset = 0;
>> + out->ds_length = 0;
>> + out->ds_nr = 0;
>> + out->ds_shared = false;
>> +
>> + do {
>> + ret = defrag_get_next_extent(fd, &mapx);
>> + if (ret != 0) {
>> + /*
>> +  * no more extetns, return current segment if its not
>> +  * empty
>> + */
>> + if (ret == 1 && out->ds_nr > 0)
>> + ret = 0;
>> + /* otherwise, error heppened, stop */
>> + break;
>> + }
>> +
>> + /*
>> +  * If the extent is not a defrag target, skip it.
>> +  * go to next extent if the segment is empty;
>> +  * otherwise return the segment.
>> +  */
>> + if (!defrag_is_target(&mapx)) {
>> + defrag_move_next_extent();
>> + if (out->ds_nr == 0)
>> + continue;
>> + else
>> + break;
>> + }
>> +
>> + /* check for segment size limitation */
>> + if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
>> + break;
>> +
>> + /* the segment is empty now, add this extent to it for sure */
>> + if (out->ds_nr == 0) {
>> + out->ds_offset = mapx.bmv_offset;
>> + goto add_ext;
>> + }
>> +
>> + /*
>> +  * the segment is not empty, check for hole since the last exent
>> +  * if a hole exist before this extent, this extent can't be
>> +  * added to the segment. return the segment
>> +  */
>> + if (out->ds_offset + out->ds_length != mapx.bmv_offset)
>> + break;
>> +
>> +add_ext:
>> + if (defrag_is_extent_shared(&mapx))
>> + out->ds_shared = true;
>> +
>> + out->ds_length += mapx.bmv_length;
>> + out->ds_nr += 1;
> 
> OH, ok.  So we walk the mappings for a file.  If we can identify a run
> of contiguous written mappings, we define a segment to be the file range
> described by that run, up to whatever the maximum is (~4-16M).  Each of
> these segments is defragmented (somehow).  Is that correct?

Yes, correct.

>> + defrag_move_next_extent();
>> +
>> + } while (true);
>> +
>> + return ret;
>> +}
>> +
>> /*
>>  * defragment a file
>>  * return 0 if successfully done, 1 otherwise
>> @@ -92,6 +277,9 @@ defrag_xfs_defrag(char *file_path) {
>> struct fsxattr fsx;
>> int ret = 0;
>> 
>> + g_offset = 0;
>> + g_ext_next_idx = -1;
>> +
>> fsx.fsx_nextents = 0;
>> memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>> 
>> @@ -119,6 +307,22 @@ defrag_xfs_defrag(char *file_path) {
>> ret = 1;
>> goto out;
>> }
>> +
>> + do {
>> + struct defrag_segment segment;
>> +
>> + ret = defrag_get_next_segment(defrag_fd, &segment);
>> + /* no more segments, we are done */
>> + if (ret == 1) {
>> + ret = 0;
>> + break;
>> + }
> 
> If you reverse the polarity of the 0/1 return values (aka return 1 if
> there is a segment and 0 if there is none) then you can shorten this
> loop to:
> 
> struct defrag_segment segment;
> int ret;
> 
> while ((ret = defrag_get_next_segment(...)) == 1) {
> /* process segment */
> }
> 
> return ret;
> 

Yes, that might be better. What I was just thinking is the “0” usually mean a ‘good’ return :D
Will consider this.

Thanks,
Wengang

> --D
> 
>> + /* error happened when reading bmap, stop here */
>> + if (ret == -1) {
>> + ret = 1;
>> + break;
>> + }
>> + } while (true);
>> out:
>> if (scratch_fd != -1) {
>> close(scratch_fd);
>> -- 
>> 2.39.3 (Apple Git-146)