---------- Forwarded message ----------
From: Amir Goldstein <amir73il@xxxxxxxxx>
Date: Wed, Oct 13, 2010 at 10:44 AM
Subject: Re: [PATCH RFC 0/3] Block reservation for ext3
To: Jan Kara <jack@xxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Ted Ts'o <tytso@xxxxxxx>, linux-ext4@xxxxxxxxxxxxxxx

On Wed, Oct 13, 2010 at 1:14 AM, Jan Kara <jack@xxxxxxx> wrote:
>
> On Mon 11-10-10 14:59:45, Andrew Morton wrote:
> > On Mon, 11 Oct 2010 16:28:13 +0200 Jan Kara <jack@xxxxxxx> wrote:
> >
> > >   Doing allocation at mmap time does not really work - on each mmap we
> > > would have to map blocks for the whole file which would make mmap really
> > > expensive operation. Doing it at page-fault as you suggest in (2a) works
> > > (that's the second plausible option IMO) but the increased fragmentation
> > > and thus loss of performance is rather noticeable. I don't have current
> > > numbers but when I tried that last year Berkeley DB was like two or three
> > > times slower.
> >
> > ouch.
> >
> > Can we fix the layout problem?  Are reservation windows of no use here?
>  Reservation windows do not work for this load. The reason is that the
> page-fault order is completely random so we just spend time creating and
> removing tiny reservation windows because the next page fault doing
> allocation is scarcely close enough to fall into the small window.
>  The logic in ext3_find_goal() ends up picking blocks close together for
> blocks belonging to the same indirect block if we are lucky but they
> definitely won't be sequentially ordered. For Berkeley DB the situation is
> made worse by the fact that there are several database files and their
> blocks end up interleaved.
>  So we could improve the layout but we'd have to tweak the reservation
> logic and allocator and it's not completely clear to me how.
>  One thing to note is that currently, ext3 *is* in fact doing delayed
> allocation for writes via mmap.
> We just never called it like that and never bothered to do proper space
> estimation...
>
> > > > 3) Keep a global counter of sparse blocks which are mapped at mmap()
> > > > time, and update it as blocks are allocated, or when the region is
> > > > freed at munmap() time.
> > >   Here again I see the problem that mapping all file blocks at mmap time
> > > is rather expensive and so does not seem viable to me. Also the
> > > overestimation of needed blocks could be rather huge.
> >
> > When I did ext2 delayed allocation back in, err, 2001 I had
> > considerable trouble working out how many blocks to actually reserve
> > for a file block, because it also had to reserve the indirect blocks.
> > One file block allocation can result in reserving four disk blocks!
> > And iirc it was not possible with existing in-core data structures to
> > work out whether all four blocks needed reserving until the actual
> > block allocation had occurred.  So I ended up reserving the worst-case
> > number of indirects, based upon the file offset.  If the disk ran out
> > of "space" I'd do a forced writeback to empty all the reservations and
> > would then take a look to see if the disk was _really_ out of space.
> >
> > Is all of this an issue with this work?  If so, what approach did you
> > take?
>  Yeah, I've spotted exactly the same problem. How I decided to solve it in
> the end is that in memory we keep track of each indirect block that has
> delay-allocated buffer under it. This allows us to reserve space for each
> indirect block at most once (I didn't bother with making the accounting
> precise for double or triple indirect blocks so when I need to reserve
> space for indirect block, I reserve the whole path just to be sure).
> This pushes the error in estimation to rather acceptable range for
> reasonably common workloads - the error can still be 50% for workloads
> which use just one data block in each indirect block but even in this
> case the absolute number of blocks falsely reserved is small.
>  The cost is of course increased complexity of the code, the memory
> spent for tracking those indirect blocks (32 bytes per indirect block), and
> some time for lookups in the RB-tree of the structures. At least the nice
> thing is that when there are no delay-allocated blocks, there isn't any
> overhead (tree is empty).
>

How about allocating *only* the indirect blocks on page fault?
IMHO it seems like a fair mixture of high quota accuracy, low complexity
of the accounting code and low file fragmentation (only the indirect blocks
may end up a bit further away from the data).

In my snapshot patches I use the @create arg to get_blocks_handle() to pass
commands just like "allocate only indirect blocks".
The patch is rather simple. I can prepare it for ext3 if you like.

Amir.
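
For reference, the worst-case reservation Andrew describes above (reserve
every indirect block on the path, based on the file offset) can be sketched
roughly as below. This is only an illustration, not code from any of the
patches discussed in this thread. It assumes the classic ext2/ext3 block map
(12 direct pointers in the inode, then single, double and triple indirect
blocks holding 32-bit block numbers), and the names NDIR_BLOCKS,
indirect_depth() and worst_case_reservation() are made up for the example.

```c
#include <stdint.h>

/* Assumption: 12 direct block pointers, as in the ext2/ext3 inode. */
#define NDIR_BLOCKS 12

/*
 * Number of indirect blocks on the path to logical file block lblk:
 * 0 = direct, 1 = single, 2 = double, 3 = triple indirect.
 * Returns -1 if lblk lies beyond what triple indirection can address.
 */
int indirect_depth(uint64_t lblk, uint32_t block_size)
{
    uint64_t ptrs = block_size / 4; /* 32-bit block numbers per indirect block */
    uint64_t span = ptrs;           /* file blocks covered at the current depth */
    int depth;

    if (lblk < NDIR_BLOCKS)
        return 0;
    lblk -= NDIR_BLOCKS;

    for (depth = 1; depth <= 3; depth++) {
        if (lblk < span)
            return depth;
        lblk -= span;
        span *= ptrs;
    }
    return -1;
}

/*
 * Worst case for one delayed-allocated file block: the data block itself
 * plus every indirect block on its path, assuming none of them exists yet.
 */
int worst_case_reservation(uint64_t lblk, uint32_t block_size)
{
    int depth = indirect_depth(lblk, block_size);

    return depth < 0 ? -1 : 1 + depth;
}
```

With 4096-byte blocks (1024 pointers per indirect block), file block 11
reserves 1 block, block 12 reserves 2, block 1036 reserves 3 and block
1049612 reserves 4, which is the "four disk blocks" case. Tracking which
indirect blocks already carry a reservation, as Jan does, is what brings
this estimate down for files with many blocks under the same indirect block.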