Re: fragmentation && blocks "realloc"

On 1/24/06, Anton Altaparmakov <aia21@xxxxxxxxx> wrote:
> On Mon, 23 Jan 2006, Jan Koss wrote:
> > On 1/23/06, Anton Altaparmakov <aia21@xxxxxxxxx> wrote:
> > > That depends entirely on which function / call path you are in at
> > > present.  Taking minix as an example, tell me the call path where
> > > you end up wanting to do the above and I will tell you where to get the bh
> > > from...  (-:
> >
> > I was talking about 2.6.15.
> >
> > In fs/minix/bitmap.c there is minix_new_block(); we reach it from
> > get_block() in fs/minix/itree_common.c.
> >
> > After analyzing the block<->file mapping I want to move some blocks to
> > another location and update the page cache correspondingly; what should I do?
>
> <Argh I just spent ages writing an email and it got lost when the internet
> connection died...  I only have what was visible on the terminal screen,
> so starting again on the rest...>
>
> You cannot do what you want from such a low level because the upper layers
> hold locks that you need.  For example a readpage/writepage/prepare_write
> can be running concurrently with get_block() and even other instances of
> get_block() can be running at the same time and it would then be unsafe to
> do any sort of reallocation.  So you have to scrap that idea.
>
> You could do it at a higher level, i.e. in the file ->write itself, but
> again this introduces a lot of complexity into your file system.
>
> Basically what you are trying to do is much harder than you think and
> involves a lot of work...
>
> There is a possible alternative however.  Your get_block function could
> take a reference on the inode (i_count), set a flag in the file system
> specific inode "need realloc" and add the inode to a queue of a
> "realloc-demon" for your fs which is just a kernel thread which will run
> periodically, say every five seconds, and it will take inodes one after
> the other from its queue, then take all necessary locks so you can do this
> (e.g. i_mutex on the inode as well as i_alloc_sem/i_alloc_mutex - whatever
> it is called now) - note you will probably need an extra lock to prevent
> entry into readpage/writepage whilst this is happening and your
> readpage/writepage will need to take that lock for reading whilst your
> daemon takes it for writing so multiple read/writepage can run
> simultaneously but your daemon runs exclusively.
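
In code, the get_block() side of what is described above might look
roughly like the sketch below.  All of the myfs_* names (struct
myfs_inode_info, MYFS_I(), the flag bit, the queue) are made up for
illustration, not real minix code:

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* fs-private inode info; the flag bit and list link are made-up names */
struct myfs_inode_info {
	unsigned long flags;		/* MYFS_NEED_REALLOC lives here */
	struct list_head realloc_list;	/* link on the daemon's queue */
	struct inode vfs_inode;
};

#define MYFS_NEED_REALLOC	0	/* bit number in ->flags */

static inline struct myfs_inode_info *MYFS_I(struct inode *inode)
{
	return container_of(inode, struct myfs_inode_info, vfs_inode);
}

static DEFINE_SPINLOCK(myfs_realloc_lock);
static LIST_HEAD(myfs_realloc_queue);

/* called from get_block() when an allocation comes out fragmented */
static void myfs_queue_realloc(struct inode *inode)
{
	struct myfs_inode_info *mi = MYFS_I(inode);

	/* queue each inode only once; the daemon clears the bit */
	if (test_and_set_bit(MYFS_NEED_REALLOC, &mi->flags))
		return;

	/* pin i_count so the inode cannot be evicted under the daemon */
	if (!igrab(inode)) {
		clear_bit(MYFS_NEED_REALLOC, &mi->flags);
		return;
	}

	spin_lock(&myfs_realloc_lock);
	list_add_tail(&mi->realloc_list, &myfs_realloc_queue);
	spin_unlock(&myfs_realloc_lock);
}
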
>
> Then, if the inode is marked "need realloc", it will allocate a contiguous
> chunk of space equal to the file size, clear the "need realloc" bit, and do
> the reallocation by starting at the first page (index 0) and working
> upwards, getting each page (warning: deadlock is possible with a read or
> writepage holding that page's lock while blocked on your "realloc lock",
> so maybe trylock and, if that fails, abort and requeue the inode at the
> end of the daemon's queue).  Then, when you have a page, loop around its
> buffers and for each buffer move it from the old allocation to the new
> one as I described earlier (i.e. just change b_blocknr, invalidate the
> underlying metadata, and mark the buffer dirty).
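
The per-page move might look roughly like this.  Again a sketch:
myfs_move_page() is a made-up name, and it assumes the daemon already
holds i_mutex and the fs-private realloc lock, has allocated the new
contiguous run starting at new_block, and updates the on-disk block
pointers and frees the old blocks separately:

#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/pagemap.h>

/* move one page's worth of blocks to new_block onwards */
static int myfs_move_page(struct inode *inode, pgoff_t index,
			  sector_t new_block)
{
	struct page *page;
	struct buffer_head *bh, *head;
	sector_t blk = new_block;

	/*
	 * NB: find_lock_page() sleeps on the page lock; to follow the
	 * trylock-and-requeue advice above, use a non-blocking lookup
	 * instead and abort if the page is already locked.
	 */
	page = find_lock_page(inode->i_mapping, index);
	if (!page)
		return 0;		/* nothing cached at this index */

	if (!page_has_buffers(page))
		goto out;		/* never mapped, nothing to move */

	bh = head = page_buffers(page);
	do {
		if (buffer_mapped(bh)) {
			bh->b_blocknr = blk;	/* repoint at new location */
			/* drop any metadata buffer aliasing the new block */
			unmap_underlying_metadata(bh->b_bdev, blk);
			mark_buffer_dirty(bh);	/* ensure it is rewritten there */
		}
		blk++;
	} while ((bh = bh->b_this_page) != head);
out:
	unlock_page(page);
	page_cache_release(page);
	return 0;
}
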
>
> That or something similar should work with minimal impact on your
> existing fs code.  And it has the huge benefit of performing the reallocs
> in the background.  Otherwise your original idea would be disastrous for
> performance.  Imagine an 8G file that you are appending data to.  Every
> time you append a new block you may end up having to reallocate the file
> from inside your get_block (you don't know that more writes are coming in
> a second), and each time it will take a few minutes, so each little write
> will hang the system for a few minutes - hardly what you want...
>
> And the daemon at least batches things in 5 second intervals so multiple
> "need realloc" settings on an inode will be done in one go every 5
> seconds.
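
The daemon itself can be a plain kthread that wakes every five seconds
and drains its queue.  A sketch, with the hypothetical
myfs_realloc_inode() standing in for the lock-taking and page-moving
described above:

#include <linux/kthread.h>
#include <linux/sched.h>

static int myfs_reallocd(void *unused)
{
	while (!kthread_should_stop()) {
		spin_lock(&myfs_realloc_lock);
		while (!list_empty(&myfs_realloc_queue)) {
			struct myfs_inode_info *mi;

			mi = list_entry(myfs_realloc_queue.next,
					struct myfs_inode_info, realloc_list);
			list_del_init(&mi->realloc_list);
			spin_unlock(&myfs_realloc_lock);

			/*
			 * Clear the flag before working so a get_block()
			 * racing with us can legitimately requeue the inode.
			 */
			if (test_and_clear_bit(MYFS_NEED_REALLOC, &mi->flags))
				myfs_realloc_inode(&mi->vfs_inode);

			iput(&mi->vfs_inode);	/* drop get_block()'s reference */
			spin_lock(&myfs_realloc_lock);
		}
		spin_unlock(&myfs_realloc_lock);

		schedule_timeout_interruptible(5 * HZ);
	}
	return 0;
}

/* started at mount or module-init time, e.g.:
 *	kthread_run(myfs_reallocd, NULL, "myfs_reallocd");
 */
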
>
> You know, if it was that easy to keep fragmentation close or even equal to
> zero at all times without impact on performance, all file systems would be
> already doing that.  (-;

Well, the above is a reasonable solution, but if you were willing to
put up with more allocation and flush complexity, you could try a
strict allocate-on-flush design.  Just read in a page and promptly
unmap it; then you don't have to worry about CPU overhead until flush
time, when you map all the pages and write them out.  That would
give the lowest amount of fragmentation you can get without a
repacker of some sort.  It's not even all that hard, unless you try
to support file holes, transactions, non-4k blocks, or other
complexities.  There are also potential OOM issues if you are using
something as old-fashioned as bitmaps in your allocation code and
need to read them in under memory pressure...
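
One way to realize allocate-on-flush is the BH_Delay convention used by
delayed-allocation filesystems (e.g. XFS): the write-side get_block
reserves space but picks no disk block, and the flush path later
allocates one contiguous run for all the delayed buffers.  A rough
sketch; myfs_lookup_block() and myfs_reserve_blocks() are hypothetical,
and the flush-time conversion is not shown:

#include <linux/fs.h>
#include <linux/buffer_head.h>

static int myfs_get_block(struct inode *inode, sector_t iblock,
			  struct buffer_head *bh, int create)
{
	sector_t phys = myfs_lookup_block(inode, iblock);

	if (phys) {
		map_bh(bh, inode->i_sb, phys);	/* already on disk */
		return 0;
	}
	if (!create)
		return 0;			/* a hole on read */
	if (buffer_delay(bh))
		return 0;			/* already reserved */

	if (myfs_reserve_blocks(inode, 1))	/* space accounting only */
		return -ENOSPC;

	set_buffer_new(bh);
	set_buffer_delay(bh);	/* no b_blocknr until flush time */
	return 0;
}
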

NATE

--
Kernelnewbies: Help each other learn about the Linux kernel.
Archive:       http://mail.nl.linux.org/kernelnewbies/
FAQ:           http://kernelnewbies.org/faq/


