On Mon, Mar 17, 2014 at 09:42:31PM -0700, Darrick J. Wong wrote: > On Mon, Mar 17, 2014 at 05:10:22PM -0600, Andreas Dilger wrote: > > > > On Mar 11, 2014, at 12:57 AM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote: > > > > > e2fsck pass1 is modified to use the block group data prefetch function > > > to try to fetch the inode tables into the pagecache before it is > > > needed. In order to avoid cache thrashing, we limit ourselves to > > > prefetching at most half the available memory. > > > > It looks like the prefetching is done in huge chunks, and not incrementally? > > It makes more sense to have a steady amount of prefetch happening instead > > of waiting for it to all be consumed before starting a new batch. See in > > e2fsck_pass1() below. > > I agree that prefetch ought not to wait until the entire inode table is > consumed. > > > > pass2 is modified to use the dirblock prefetching function to prefetch > > > the list of directory blocks that are assembled in pass1. So long as > > > we don't anticipate rehashing the dirs (pass 3a), we can release the > > > dirblocks as soon as we're done checking them. > > > > > > pass4 is modified to prefetch the block and inode bitmaps in > > > anticipation of pass 5, because pass4 is entirely CPU bound. > > > > > > In general, these mechanisms can halve fsck time, if the host system > > > has sufficient memory and the storage system can provide a lot of > > > IOPs. SSDs and multi-spindle RAIDs see the most speedup; single disks > > > experience a modest speedup, and single-spindle USB mass storage > > > devices see hardly any benefit. > > > > > > By default, readahead will try to fill half the physical memory in the > > > system. The -R option can be given to specify the amount of memory to > > > use for readahead, or zero to disable it entirely; or an option can be > > > given in e2fsck.conf. > > > > > > > > > +static void *pass1_readahead(void *p) > > > +{ > > > + struct pass1ra_ctx *c = p; > > > + errcode_t err; > > > + > > > + ext2fs_readahead(c->fs, EXT2_READA_ITABLE, c->group, c->ngroups); > > > + return NULL; > > > +} > > > + > > > +static errcode_t initiate_readahead(e2fsck_t ctx, dgrp_t group, dgrp_t ngroups) > > > +{ > > > + struct pass1ra_ctx *ractx; > > > + errcode_t err; > > > + > > > + err = ext2fs_get_mem(sizeof(*ractx), &ractx); > > > + if (err) > > > + return err; > > > + > > > + ractx->fs = ctx->fs; > > > + ractx->group = group; > > > + ractx->ngroups = ngroups; > > > + > > > + err = e2fsck_run_thread(&ctx->ra_thread, pass1_readahead, > > > + pass1_readahead_cleanup, ractx); > > > + if (err) > > > + ext2fs_free_mem(&ractx); > > > + > > > + return err; > > > +} > > > + > > > void e2fsck_pass1(e2fsck_t ctx) > > > { > > > int i; > > > @@ -611,10 +654,37 @@ void e2fsck_pass1(e2fsck_t ctx) > > > int busted_fs_time = 0; > > > int inode_size; > > > int failed_csum = 0; > > > + dgrp_t grp; > > > + ext2_ino_t ra_threshold = 0; > > > + dgrp_t ra_groups = 0; > > > + errcode_t err; > > > > > > init_resource_track(&rtrack, ctx->fs->io); > > > clear_problem_context(&pctx); > > > > > > + /* If we can do readahead, figure out how many groups to pull in. */ > > > + if (!ext2fs_can_readahead(ctx->fs)) > > > + ctx->readahead_mem_kb = 0; > > > + if (ctx->readahead_mem_kb) { > > > + ra_groups = ctx->readahead_mem_kb / > > > + (fs->inode_blocks_per_group * fs->blocksize / > > > + 1024); > > > + if (ra_groups < 16) > > > + ra_groups = 0; > > > > It probably always makes sense to prefetch one group if possible? > > I was intending to skip pass1 RA if there wasn't a lot of memory around. Not > that I did a lot of work to figure out if < 16 groups really was a "lowmem" > situation. > > > > + else if (ra_groups > fs->group_desc_count) > > > + ra_groups = fs->group_desc_count; > > > + if (ra_groups) { > > > + err = initiate_readahead(ctx, grp, ra_groups); > > > > Looks like "grp" is used uninitialized here. Should be "grp = 0" to start. > > Oops, good catch. > > > > + if (err) { > > > + com_err(ctx->program_name, err, "%s", > > > + _("while starting pass1 readahead")); > > > + ra_groups = 0; > > > + } > > > + ra_threshold = ra_groups * > > > + fs->super->s_inodes_per_group; > > > > This is the threshold of the last inode to be prefetched. > > Yes. > > > > + } > > > + } > > > + > > > if (!(ctx->options & E2F_OPT_PREEN)) > > > fix_problem(ctx, PR_1_PASS_HEADER, &pctx); > > > > > > @@ -778,6 +848,19 @@ void e2fsck_pass1(e2fsck_t ctx) > > > if (e2fsck_mmp_update(fs)) > > > fatal_error(ctx, 0); > > > } > > > + if (ra_groups && ino > ra_threshold) { > > > > This doesn't start prefetching again until the last inode is checked. > > It probably makes sense to have a sliding window to start readahead > > again once half of the memory has been consumed or so. Otherwise, > > the scanning will block here until the next inode table is read from > > disk, instead of the readahead being started earlier and it is in RAM. > > You're right, it would be even faster if ra_threshold were to start RA a couple > of block groups *before* we run out of prefetched data. > > > > + grp = (ino - 1) / fs->super->s_inodes_per_group; > > > + ra_threshold = (grp + ra_groups) * > > > + fs->super->s_inodes_per_group; > > > > > + err = initiate_readahead(ctx, grp, ra_groups); > > > + if (err == EAGAIN) { > > > + printf("Disabling slow readahead.\n"); > > > + ra_groups = 0; > > > > I see that EAGAIN comes from e2fsck_run_thread(), if there is still a > > readahead thread running. Does it make sense to stop readahead in > > that case? It would seem to me that if readahead is taking a long > > time and the inode processing is catching up to it (i.e. IO bound) > > then it is even more important to do readahead in that case. > > This is tricky -- POSIX_FADV_WILLNEED starts a non-blocking readahead, so there > really isn't any good way to tell if the inode checker has caught up to RA. > Here I'm interpreting "RA thread still running" as a warning that soon the > inode checker will be ahead of the RA, so we might as well stop the RA. > However, there still isn't really much good way to find out exactly where RA > is. > > > Something like the following to readahead half of the inode tables once > > half of them have been processed, and shrink the readahead window if the > > readahead is being called too often: > > Hmm. I will give this a shot and report back; this seems like it ought to > produce a better result than "two before" as I suggested above. > > > if (ra_groups != 0 && ino > ra_threshold - (ra_groups + 1) / 2 * > > fs->super->s_inodes_per_group) { > > if (ra_threshold < ino) > > ra_threshold = ino; > > grp = (ra_threshold -1) / fs->super->s_inodes_per_group; > > err = initiate_readahead(ctx, grp, (ra_groups + 1) / 2); > > if (err == EAGAIN) > > ra_groups = (ra_groups + 1) / 2; > > else if (err) > > com_err(ctx->program_name, err, "%s", > > _("while starting pass1 readahead")); > > else > > ra_threshold += (ra_groups + 1) / 2 * > > fs->super->s_inodes_per_group; > > } Now that I've thought about this a little harder, even this isn't quite sufficient -- since the inode scan skips inode_uninit blockgroups, we have to figure out which group our new ra_threshold inode is in and scan backwards through the groups until we find a bg that isn't inode_uninit. If we don't do this, the scan will skip right past our ra_threshold, which means that RA starts late or possibly even after we've started scanning inodes from the group we're RAing. That said, even doing that I don't see much more of a speed up. --D > > > > > + } else if (err) { > > > + com_err(ctx->program_name, err, "%s", > > > + _("while starting pass1 readahead")); > > > + } > > > + } > > > old_op = ehandler_operation(_("getting next inode from scan")); > > > pctx.errcode = ext2fs_get_next_inode_full(scan, &ino, > > > inode, inode_size); > > > diff --git a/e2fsck/unix.c b/e2fsck/unix.c > > > index 80ebdb1..d6ef8c5 100644 > > > --- a/e2fsck/unix.c > > > +++ b/e2fsck/unix.c > > > @@ -74,7 +74,7 @@ static void usage(e2fsck_t ctx) > > > _("Usage: %s [-panyrcdfvtDFV] [-b superblock] [-B blocksize]\n" > > > "\t\t[-I inode_buffer_blocks] [-P process_inode_size]\n" > > > "\t\t[-l|-L bad_blocks_file] [-C fd] [-j external_journal]\n" > > > - "\t\t[-E extended-options] device\n"), > > > + "\t\t[-E extended-options] [-R readahead_kb] device\n"), > > > > Note that "-R" is only recently deprecated for raid options, why not make > > this an option under "-E"? > > Ok. > > --D > > > > > ctx->program_name); > > > > > > fprintf(stderr, "%s", _("\nEmergency help:\n" > > > @@ -90,6 +90,7 @@ static void usage(e2fsck_t ctx) > > > " -j external_journal Set location of the external journal\n" > > > " -l bad_blocks_file Add to badblocks list\n" > > > " -L bad_blocks_file Set badblocks list\n" > > > + " -R readahead_kb Allow this much readahead.\n" > > > > > > Cheers, Andreas > > > > > > > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html