Re: [PATCH RFC 0/3] Block reservation for ext3

"Amir G." <amir73il@xxxxxxxxxxxxxxxxxxxxx> · Wed, 13 Oct 2010 10:49:15 +0200

---------- Forwarded message ----------
From: Amir Goldstein <amir73il@xxxxxxxxx>
Date: Wed, Oct 13, 2010 at 10:44 AM
Subject: Re: [PATCH RFC 0/3] Block reservation for ext3
To: Jan Kara <jack@xxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Ted Ts'o
<tytso@xxxxxxx>, linux-ext4@xxxxxxxxxxxxxxx

On Wed, Oct 13, 2010 at 1:14 AM, Jan Kara <jack@xxxxxxx> wrote:
>
> On Mon 11-10-10 14:59:45, Andrew Morton wrote:
> > On Mon, 11 Oct 2010 16:28:13 +0200ÂJan Kara <jack@xxxxxxx> wrote:
> >
> > > Â Doing allocation at mmap time does not really work - on each mmap we
> > > would have to map blocks for the whole file which would make mmap really
> > > expensive operation. Doing it at page-fault as you suggest in (2a) works
> > > (that's the second plausible option IMO) but the increased fragmentation
> > > and thus loss of performance is rather noticeable. I don't have current
> > > numbers but when I tried that last year Berkeley DB was like two or three
> > > times slower.
> >
> > ouch.
> >
> > Can we fix the layout problem? ÂAre reservation windows of no use here?
> ÂReservation windows do not work for this load. The reason is that the
> page-fault order is completely random so we just spend time creating and
> removing tiny reservation windows because the next page fault doing
> allocation is scarcely close enough to fall into the small window.
> ÂThe logic in ext3_find_goal() ends up picking blocks close together for
> blocks belonging to the same indirect block if we are lucky but they
> definitely won't be sequentially ordered. For Berkeley DB the situation is
> made worse by the fact that there are several database files and their
> blocks end up interleaved.
> ÂSo we could improve the layout but we'd have to tweak the reservation
> logic and allocator and it's not completely clear to me how.
> ÂOne thing to note is that currently, ext3 *is* in fact doing delayed
> allocation for writes via mmap. We just never called it like that and never
> bothered to do proper space estimation...
>
> > > > 3) Keep a global counter of sparse blocks which are mapped at mmap()
> > > > time, and update it as blocks are allocated, or when the region is
> > > > freed at munmap() time.
> > > Â Here again I see the problem that mapping all file blocks at mmap time
> > > is rather expensive and so does not seem viable to me. Also the
> > > overestimation of needed blocks could be rather huge.
> >
> > When I did ext2 delayed allocation back in, err, 2001 I had
> > considerable trouble working out how many blocks to actually reserve
> > for a file block, because it also had to reserve the indirect blocks.
> > One file block allocation can result in reserving four disk blocks!
> > And iirc it was not possible with existing in-core data structures to
> > work out whether all four blocks needed reserving until the actual
> > block allocation had occurred. ÂSo I ended up reserving the worst-case
> > number of indirects, based upon the file offset. ÂIf the disk ran out
> > of "space" I'd do a forced writeback to empty all the reservations and
> > would then take a look to see if the disk was _really_ out of space.
> >
> > Is all of this an issue with this work? ÂIf so, what approach did you
> > take?
> ÂYeah, I've spotted exactly the same problem. How I decided to solve it in
> the end is that in memory we keep track of each indirect block that has
> delay-allocated buffer under it. This allows us to reserve space for each
> indirect block at most once (I didn't bother with making the accounting
> precise for double or triple indirect blocks so when I need to reserve
> space for indirect block, I reserve the whole path just to be sure). This
> pushes the error in estimation to rather acceptable range for reasonably
> common workloads - the error can still be 50% for workloads which use just
> one data block in each indirect block but even in this case the absolute
> number of blocks falsely reserved is small.
> ÂThe cost is of course increased complexity of the code, the memory
> spent for tracking those indirect blocks (32 bytes per indirect block), and
> some time for lookups in the RB-tree of the structures. At least the nice
> thing is that when there are no delay-allocated blocks, there isn't any
> overhead (tree is empty).
>

How about allocating *only* the indirect blocks on page fault.
IMHO it seems like a fair mixture of high quota accuracy, low
complexity of the accounting code and low file fragmentation (only
indirect may be a bit further away from data).

In my snapshot patches I use the @create arg to get_blocks_handle() to
pass commands just like "allocate only indirect blocks".
The patch is rather simple. I can prepare it for ext3 if you like.

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html