On Fri, Sep 24, 2010 at 12:24:04AM -0600, Andreas Dilger wrote:
> On 2010-09-23, at 17:25, Darrick J. Wong wrote:
> > To try to find an explanation, I started looking for connections between
> > fsync delay values and average flush times. I noticed that the setups with
> > low (< 8ms) flush times exhibit better performance when fsync coordination
> > is not attempted, and the setups with higher flush times exhibit better
> > performance when fsync coordination happens. This also is no surprise, as
> > it seems perfectly reasonable that the more time consuming a flush is, the
> > more desirous it is to spend a little time coordinating those flushes
> > across CPUs.
> >
> > I think a reasonable next step would be to alter this patch so that
> > ext4_sync_file always measures the duration of the flushes that it issues,
> > but only enable the coordination steps if it detects the flushes taking
> > more than about 8ms. One thing I don't know for sure is whether 8ms is a
> > result of 2*HZ (currently set to 250) or if 8ms is a hardware property.
>
> Note that the JBD/JBD2 code will already dynamically adjust the journal flush
> interval based on the delay seen when writing the journal commit block. This
> was done to allow aggregating sync journal operations for slow devices, and
> allowing fast (no delay) sync on fast devices. See jbd2_journal_stop() for
> details.
>
> I think the best approach is to just depend on the journal to do this sync
> aggregation, if at all possible, otherwise use the same mechanism in ext3/4
> for fsync operations that do not involve the journal (e.g. nojournal mode,
> data sync in writeback mode, etc).

I've been informed that there's confusion about how to interpret this
spreadsheet. I'll first provide a few clarifications, then discuss Andreas'
suggestion, which I've coded up and given some light testing.

Zeroth, the kernel is 2.6.36-rc5 with a few patchsets applied:

1. Tejun Heo's conversion of barriers to flush/fua.
2. Jan Kara's barrier generation patch.
3. My old patch to record if there's dirty data in the disk cache.
4. My newer patch to implement fsync coordination in ext4.
5. My newest patch which implements coordination via jbd2.

Patches 2, 3, 4, and 5 all have debugging toggles so I can quickly run
experiments.

First, the "fsync_delay_us" column records the behavior of my (latest) fsync
coordination patch. The raw control values might be a bit confusing, so I
elaborated them a little more in the spreadsheet. The "old fsync behavior"
entries use the current upstream semantics (no coordination, everyone issues
their own flush). "jbd2 fsync" means coordination of fsyncs through jbd2 as
detailed below. "use avg sync time" measures the average time it takes to
issue a flush command, and tells the first thread into ext4_sync_pages to wait
that amount of time for other threads to catch up.

Second, the "nojan" column is a control knob I added to Jan Kara's old barrier
generation patch so that I could measure its effects. 0 means always track
barrier generations and don't submit flushes for already-flushed data. 1 means
always issue flushes, regardless of generation counts.

Third, the "nodj" column is a control knob that controls my old
EXT4_STATE_DIRTY_DATA patch. A zero here means that a flush will only be
triggered if ext4_write_page has written some dirty data and there hasn't been
a flush yet. 1 disables this logic.

Fourth, the bolded cells in the table represent the highest transactions per
second count across all fsync_delay_us values when holding the other four
control variables constant. For example, let's take a look at
host=elm3a4,directio=0,nojan=0,nodj=0. There are five fsync_delay_us values
(old, jbd2, avg, 1, 500) and five corresponding results (145.84, 184.06,
181.58, 152.39, 158.19). 184.06 is the highest, hence jbd2 wins and is in bold
face. Background colors are used to group the rows by fsync_delay_us.
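To make the bolding rule concrete, here is a small user-space C sketch (not
part of any of the patches) that picks the winning fsync_delay_us setting for
one row of the spreadsheet. The struct and function names are made up for
illustration; the numbers are the elm3a4 row from the example above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one fsync_delay_us setting and its measured TPS. */
struct result {
	const char *mode;
	double tps;
};

/* The row for host=elm3a4, directio=0, nojan=0, nodj=0. */
static const struct result elm3a4_row[] = {
	{ "old",  145.84 },
	{ "jbd2", 184.06 },
	{ "avg",  181.58 },
	{ "1",    152.39 },
	{ "500",  158.19 },
};

/*
 * Return the index of the entry with the highest TPS, i.e. the cell
 * that would be bolded for this row. Ties go to the earliest entry.
 */
static size_t best_result(const struct result *r, size_t n)
{
	size_t best = 0;
	size_t i;

	assert(n > 0);
	for (i = 1; i < n; i++)
		if (r[i].tps > r[best].tps)
			best = i;
	return best;
}
```

Calling best_result(elm3a4_row, 5) selects the "jbd2" entry, which is the
cell shown in bold for that row.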
The barriers=0 results are, of course, the transactions per second count when
the fs is mounted with barrier support disabled. This ought to provide a rough
idea of the upper performance limit of each piece of hardware.

------

As for Andreas' suggestion, he wants ext4 to use jbd2 as the coordination
point for all fsync calls. I could be wrong, but I think that the following
snippet ought to do the trick:

	h = ext4_journal_start(journal, 0);
	ext4_journal_stop(h);
	if (jbd2_journal_start_commit(journal, &target))
		jbd2_log_wait_commit(journal, target);

It looks as though this snippet effectively says "Send an empty transaction.
Then, if there are any live or committing transactions, wait for them to
finish", which sounds like what we want. I figured this also means that the
nojan/nodj settings would not have any significant effect on the results,
which seems to be true (though nojan/nodj have had little effect under Tejun's
patchset). So I coded up that patch and gave it a spin on my testing farm. The
results have been added to the 2.6.36-rc5 spreadsheet. Though I have to say,
this seems like an awful lot of overhead just to issue a flush command.

Given a quick look around the jbd2 code, it seems that going through the
journal ought to have a higher overhead cost, which would negatively impact
performance on hardware that features low flush times, and this seems to be
true for elm3a63, elm3c44_sas, and elm3c71_sas in directio=1 mode, where we
see rather large regressions against fsync_delay=avg_sync_time. Curiously, I
saw a dramatic increase in speed for the SSDs when directio=1, which probably
relates to the way SSDs perform writes.

Other than those regressions, the jbd2 fsync coordination is about as fast as
sending the flush directly from ext4. Unfortunately, where there _are_
regressions they seem rather large, which makes this approach (as implemented,
anyway) less attractive. Perhaps there is a better way to do it?
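For reference, the threshold idea from the quoted message (always measure
flush durations, but only coordinate when flushes look slow) could be sketched
in user-space C roughly as below. This is only an illustration under stated
assumptions, not kernel code and not what any of the patches do: the struct,
the function names, and the 3:1 averaging weight are all hypothetical, and the
8000us cutoff is just the rough 8ms figure observed earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cutoff: roughly the 8ms boundary seen in the measurements. */
#define FLUSH_COORD_THRESHOLD_US 8000

/* Hypothetical per-filesystem bookkeeping for flush timings. */
struct flush_stats {
	uint64_t avg_flush_us;	/* running average of flush duration */
	int have_sample;	/* nonzero once at least one flush was timed */
};

/*
 * Fold one measured flush duration into the running average.
 * The 3:1 weighting of old average to new sample is an arbitrary
 * choice made for this sketch.
 */
static void flush_stats_update(struct flush_stats *s, uint64_t sample_us)
{
	assert(s);
	if (!s->have_sample) {
		s->avg_flush_us = sample_us;
		s->have_sample = 1;
	} else {
		s->avg_flush_us = (3 * s->avg_flush_us + sample_us) / 4;
	}
}

/* Coordinate fsyncs only if flushes have been slow on average. */
static int should_coordinate(const struct flush_stats *s)
{
	return s->have_sample &&
	       s->avg_flush_us > FLUSH_COORD_THRESHOLD_US;
}
```

With no samples the sketch falls back to the uncoordinated path, so fast
devices would pay only the cost of timing each flush.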
--D