On Fri, Jul 26, 2013 at 02:36:15PM +0800, Zhi Yong Wu wrote:
> Dave,
>
> All comments are good to me, and will be applied to next version, thanks a lot.
>
> On Fri, Jul 26, 2013 at 10:50 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Thu, Jul 25, 2013 at 04:23:39PM +0800, zwu.kernel@xxxxxxxxx wrote:
> >> From: Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx>
> >>
> >> It can take a long time to run log recovery operation because it is
> >> single threaded and is bound by read latency. We can find that it took
> >> most of the time to wait for the read IO to occur, so if one object
> >> readahead is introduced to log recovery, it will obviously reduce the
> >> log recovery time.
> >>
> >> In dirty log case as below:
> >> data device: 0xfd10
> >> log device: 0xfd10 daddr: 20480032 length: 20480
> >>
> >> log tail: 7941 head: 11077 state: <DIRTY>
> >
> > That's only a small log (10MB). As I've said on irc, readahead won't
> Yeah, it is one 10MB log, but how do you calculate it based on the above info?

length = 20480 blocks. 20480 * 512 = 10MB....

> > And the recovery time from this is between 15-17s:
> >
> > ....
> > log device: 0xfd20 daddr: 107374182032 length: 4173824
> >                                                ^^^^^^^ almost 2GB
> > log tail: 19288 head: 264809 state: <DIRTY>
> > ....
> > real 0m17.913s
> > user 0m0.000s
> > sys 0m2.381s
> >
> > And runs at 3-4000 read IOPs for most of that time. It's largely IO
> > bound, even on SSDs.
> >
> > With your patch:
> >
> > log tail: 35871 head: 308393 state: <DIRTY>
> > real 0m12.715s
> > user 0m0.000s
> > sys 0m2.247s
> >
> > And it peaked at ~5000 read IOPS.
> How do you know its READ IOPS is ~5000?

Other monitoring. iostat can tell you this, though I use PCP...

> > Ok, so you've based the readahead on the transaction item list
> > having a next pointer. What I think you should do is turn this into
> > a readahead queue by moving objects to a new list. i.e.
> >
> > list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) {
> >
> >     case XLOG_RECOVER_PASS2:
> >         if (ra_qdepth++ >= MAX_QDEPTH) {
> >             recover_items(log, trans, &buffer_list, &ra_item_list);
> >             ra_qdepth = 0;
> >         } else {
> >             xlog_recover_item_readahead(log, item);
> >             list_move_tail(&item->ri_list, &ra_item_list);
> >         }
> >         break;
> >     ...
> >     }
> > }
> > if (!list_empty(&ra_item_list))
> >     recover_items(log, trans, &buffer_list, &ra_item_list);
> >
> > I'd suggest that a queue depth somewhere between 10 and 100 will
> > be necessary to keep enough IO in flight to keep the pipeline full
> > and prevent recovery from having to wait on IO...
> Good suggestion, will apply it to next version, thanks.

FWIW, I hacked a quick test of this into your patch here and a depth
of 100 brought the recovery time down to under 8s. For other
workloads which have nothing but dirty inodes (like fsmark) a depth
of 100 drops the recovery time from ~100s to ~25s, and the iop rate
is peaking at well over 15,000 IOPS. So we definitely want to queue
up more than a single readahead...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
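
For anyone who wants to play with the batching logic outside the kernel,
the readahead queue idea from the quoted pseudocode can be modelled in
user space. This is only a rough sketch under stated assumptions, not the
XFS code: struct item, item_readahead(), recover_batch(),
recover_with_readahead() and the MAX_QDEPTH value of 4 are all
hypothetical stand-ins, and "readahead" is modelled by setting a flag
rather than issuing IO. This variant queues every item before checking
the depth, so the item that fills the queue is part of the batch that
gets flushed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical depth for the demo; the thread suggests 10 to 100. */
#define MAX_QDEPTH 4

/* Hypothetical stand-in for a recovery item. */
struct item {
    int id;
    int readahead_issued;   /* set once readahead was "submitted" */
    int recovered;          /* set once the item was processed */
};

/* Stand-in for issuing readahead on one item: just mark it. */
static void item_readahead(struct item *it)
{
    it->readahead_issued = 1;
}

/* Stand-in for recovering a batch: process queued items in order. */
static int recover_batch(struct item **batch, size_t n)
{
    int done = 0;

    for (size_t i = 0; i < n; i++) {
        /* Every queued item must already have readahead in flight. */
        assert(batch[i]->readahead_issued);
        batch[i]->recovered = 1;
        done++;
    }
    return done;
}

/*
 * Walk the items, issue readahead and queue each one, and flush the
 * queue whenever it reaches MAX_QDEPTH. Returns the number of batches
 * that were flushed, including a final partial batch.
 */
static int recover_with_readahead(struct item *items, size_t n)
{
    struct item *ra_queue[MAX_QDEPTH];
    size_t qdepth = 0;
    int batches = 0;

    for (size_t i = 0; i < n; i++) {
        item_readahead(&items[i]);
        ra_queue[qdepth++] = &items[i];
        if (qdepth == MAX_QDEPTH) {
            recover_batch(ra_queue, qdepth);
            qdepth = 0;
            batches++;
        }
    }
    /* Flush whatever is left over after the walk. */
    if (qdepth > 0) {
        recover_batch(ra_queue, qdepth);
        batches++;
    }
    return batches;
}
```

With MAX_QDEPTH set to 4, calling recover_with_readahead() on 10 items
flushes three batches (4, 4 and 2 items), and every item ends up both
read ahead and recovered.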