Re: [RFC] Defragmentation interface

David Chinner <dgc@xxxxxxx> · Fri, 3 Nov 2006 09:59:53 +1100

On Thu, Nov 02, 2006 at 03:39:29PM +0100, Jan Kara wrote:
>   Hi,
> 
>   from the thread after my patch implementing ext3 online
> defragmentation I found out that probably the only (and definitely the
> biggest) issue is the interface. Some want it to be common enough so that
> we can profit from common tools for several filesystems, others object
> that some applications, e.g. defragmenter, need to know something about
> ext3 internals to work reasonably well. Moreover ioctl() is ugly and has
> some compatibility issues, on the other hand ext2meta is too lowlevel,
> fs-specific and it would be hard to do any reasonable application
> crash-safe...
>   So in this email I try to propose some interface which should hopefully
> address most of the concerns. The type of the interface is sysfs like
> (idea taken from ext2meta) - that has a few advantages:
>  - no 32/64-bit compatibility issues
>  - easily extensible
>  - generally nice ;)

- complex
- over-engineered
- little common code between filesystems

BTW, does use of sysfs mean ASCII encoding of all the data
passing between kernel and userspace?

>   Each filesystem willing to support this interface implements special
> filesystem (e.g. ext3meta, XFSmeta, ...) and admin/defrag-tool mounts it
> to some directory.

- not useful for wider audiences like applications that would like
  to direct allocation

> There are parts of this interface which should be
> common for all filesystems (so that tools don't have to care about
> particular filesystem and still get some useful results), other parts
> are fs-specific. Here is basic structure I propose:
> 
> meta/features
>   - bitmap of features supported by the interface (ext2/3-like) so that
>     the tool can verify whether it understands the interface and doesn't
>     mess with it otherwise

- will grow very large, very quickly if it has to support all the
  different quirks of different filesystems.

> meta/allocation/free_blocks
>   - RO file - if you read from fpos F, you'll get a list of extents
>     describing areas with free blocks (as many as fits into supplied
>     buffer) starting from block F. Fpos of your file descriptor is
>     shifted to the first unreported free block.

- linear search properties == Bad. (think fs sizes of hundreds of
  terabytes - XFS is already deployed with filesystems of this size)
- cannot use smart requests like give me free blocks near X,
  in AG Y or Z, etc.
- some filesystems have more than one data area - e.g. XFS has the
  realtime volume.
- every time you fail an allocation, you need to reread this file.
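
To make the "free blocks near X" point concrete, here is a rough sketch of the selection a defrag tool ends up doing in userspace once it has read the free extent list. The struct and function names are hypothetical and exist in no kernel or library; the point is only that a flat list forces a full scan for every such query.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration only; not an existing kernel structure. */
struct free_extent {
	uint64_t start;	/* first free block */
	uint64_t len;	/* number of contiguous free blocks */
};

/* Distance in blocks from goal to the extent; 0 if goal lies inside it. */
static uint64_t extent_distance(const struct free_extent *e, uint64_t goal)
{
	if (goal < e->start)
		return e->start - goal;
	if (goal >= e->start + e->len)
		return goal - (e->start + e->len - 1);
	return 0;
}

/*
 * Return the index of the free extent of at least min_len blocks that is
 * closest to goal, or -1 if no extent is large enough. Linear in n.
 */
static long nearest_free_extent(const struct free_extent *list, size_t n,
				uint64_t goal, uint64_t min_len)
{
	long best = -1;
	uint64_t best_dist = UINT64_MAX;

	assert(min_len > 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t d;

		if (list[i].len < min_len)
			continue;
		d = extent_distance(&list[i], goal);
		if (d < best_dist) {
			best_dist = d;
			best = (long)i;
		}
	}
	return best;
}
```

A request that carried the goal and minimum length to the filesystem could let it answer from its own free space indexes instead.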

> meta/super/blocksize
>   - filesystem block size

ioctl(FIGETBSZ).

Also:

- some filesystems can use different block sizes for different
  structures (e.g. XFS directory blocks can be larger than the fsb)
- stripe unit and stripe width need to be exposed so the defrag tool
  can make correct placement decisions.
- extent size hints, etc.

Hence this will require the super/ directory to be extensible
in a filesystem-specific way.
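
For reference, FIGETBSZ is an ioctl request defined in <linux/fs.h> and already returns the filesystem block size for an open file. A minimal sketch, assuming Linux; the helper name fs_block_size() is made up for this example:

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>	/* FIGETBSZ */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Return the block size of the filesystem holding path, or -1 if the
 * file cannot be opened or the ioctl is not supported there.
 */
static int fs_block_size(const char *path)
{
	int bsz = 0;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (ioctl(fd, FIGETBSZ, &bsz) < 0 || bsz <= 0)
		bsz = -1;
	close(fd);
	return bsz;
}
```

This reports only the single filesystem block size; stripe geometry and per-structure block sizes would still need something filesystem specific.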

> meta/super/id
>   - filesystem ID (for paranoid tools to verify that they are accessing
>     really the right meta-filesystem)

- UUID, please.

> meta/nodes/<ident>
>   - this should be a directory containing things specific for a fs-object
>     with identification <ident>. In case of ext3 these would be inode
>     numbers, I guess this should be plausible also for XFS and others
>     but I'm open to suggestions...
>   - directory contains the following:
>   alloc_goal
>     - block number with current allocation goal

The kernel has to store this across syscalls until you write into
data/alloc? That sounds dangerous...

>   data/extents
>     - if you read from this file, you get a list of extents describing
>       data blocks (and holes) of the file. The listing starts at logical
>       block fpos. Fpos is shifted to the first unreported data block.

ioctl(FIBMAP)
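
FIBMAP maps one logical block per call and reports 0 for a hole, so a tool has to coalesce the results into extents itself. A sketch of that step, assuming the per-block results have already been collected into an array; struct extent and coalesce_extents() are names invented for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory extent record for a defrag tool. */
struct extent {
	uint32_t logical;	/* first logical block in the file */
	uint32_t physical;	/* first physical block on disk */
	uint32_t len;		/* number of contiguous blocks */
};

/*
 * phys[i] holds the physical block for logical block i, with 0 meaning
 * a hole. Merge physically contiguous runs into extents and return how
 * many were written to out (at most max_out; the rest is dropped).
 */
static size_t coalesce_extents(const uint32_t *phys, size_t nblocks,
			       struct extent *out, size_t max_out)
{
	size_t n = 0;

	assert(out != NULL || max_out == 0);
	for (size_t i = 0; i < nblocks; i++) {
		if (phys[i] == 0)
			continue;	/* hole: nothing mapped here */
		if (n > 0) {
			struct extent *last = &out[n - 1];

			if (last->logical + last->len == i &&
			    last->physical + last->len == phys[i]) {
				last->len++;	/* extends the previous run */
				continue;
			}
		}
		if (n == max_out)
			break;
		out[n].logical = (uint32_t)i;
		out[n].physical = phys[i];
		out[n].len = 1;
		n++;
	}
	return n;
}
```

The number of extents returned for a file is a rough measure of how fragmented it is.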

>   data/alloc
>     - you write there a number L and fs allocates L blocks to a file
>       (preferable from alloc_goal) starting from file-block fpos. Fpos
>       is shifted after the last block allocated in this call.

You seek to the position you want (in blocks or bytes?), then write
a number into the file (in blocks or bytes)? That's messy compared
to a function call with an offset and length in it....
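
To show what that looks like from the tool's side, here is a sketch of the seek-then-write sequence. It assumes the count is passed as ASCII text, which the proposal does not actually specify; request_alloc() and its parameters are invented for this example, and it can be pointed at any writable file as a stand-in for data/alloc.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Seek to fpos in the file at alloc_path and write nblocks as a decimal
 * string. The file position carries the offset and the payload carries
 * the length; the units of both are whatever the filesystem defines.
 * Returns 0 on success, -1 on failure.
 */
static int request_alloc(const char *alloc_path, off_t fpos,
			 unsigned long nblocks)
{
	char buf[32];
	int len = snprintf(buf, sizeof(buf), "%lu\n", nblocks);
	int rc = -1;
	int fd;

	assert(len > 0 && (size_t)len < sizeof(buf));
	fd = open(alloc_path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (lseek(fd, fpos, SEEK_SET) == fpos &&
	    write(fd, buf, (size_t)len) == (ssize_t)len)
		rc = 0;
	close(fd);
	return rc;
}
```

A single call taking (fd, offset, length) states the same request with typed arguments and no text parsing in the kernel.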

>   data/reloc
>     - you write there <ident> and relocation of data happens as follows:
>       All blocks that are allocated both in original file and <ident>
>       are relocated to <ident>. Write returns number of relocated
>       blocks.

You can only relocate to a new inode (which in XFS will change
the inode number)? What happens if there are blocks in duplicate
offsets in both inodes? What happens if all the blocks aren't
relocated - how do you handle this?

Let me get this straight - the interface you propose for
moving data about is:

	read and process extents into an internal structure
	find range where you want to relocate
	find free space you want to relocate into
	write desired block to alloc_goal
	seek to allocation offset in data/alloc
	write length into data/alloc
	allocate new inode
	write new inode number into data/reloc to relocate blocks

What I proposed:

	ioctl(src, FIBMAP);
	/* find range to relocate */
	open(tmp, O_CREAT);
	funlink(tmp);
	fs_get_free_list(src, policy, list);
	/* select free extent to use */
	fs_allocate_space(tmp, list[X], off, len);
	fs_move_data(src, tmp, off, len);
	close(tmp);
	close(src);

So the process is pretty close to the same except the interface I
proposed does not change the location of the inode holding the data.
The major difference is that one implementation requires 3 new
generically useful syscalls, and the other requires every filesystem
to implement a metadata filesystem and requires root privileges
to use.

>   metadata/
>     - this directory is fs-specific, contains fs block pointers and
>       similar. Here I describe what I'd like to have for ext3.

Nothing really useful for XFS here unless we start talking
about btree defragmentation and attribute fork optimisation,
etc. We really don't need a sysfs interface for this, just
an additional fs_move_metadata() type of call....

Hmmm - how do you support objects in the filesystem not attached
to inodes (e.g. the freespace and inode btrees in XFS)? What sort
of interface would they use?

>   This is all that is needed for my purposes. Any comments welcome.

Then your purpose is explicitly data defragmentation? If that is
the case, I still fail to see any need for a new metadata fs for
every filesystem to support this.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group