On Mon, Mar 28, 2011 at 12:43 PM, Peter Grandi <pg_ext3@xxxxxxxxxxxxxxxxxxx> wrote:
> [ ... ]
>
>>> When executing an fsync(), in data=ordered mode you have to
>>> write the data blocks into the journal and wait for the
>>> data blocks to be written. This will generally require
>>> extra seeks. In data=journal mode, the data blocks
>>> can be written directly into the journal without needing
>>> to seek.
>
>>> Of course eventually the data and metadata blocks will need
>>> to be written to their permanent locations before the journal
>>> space can be reused. But for short bursty write patterns,
>>> the fsync() latency will be much smaller in data=journal
>>> mode.
>
>> [ ... ]
>
>> In this case, if we conduct the experiment in data=journal
>> mode and data=ordered mode respectively,
>
> That experiment is not necessarily demonstrative, it depends on
> RAM caching, elevator, ...
>
>> since write latency is much smaller in data=journal mode,
>
> Write latency is actually much longer: because it requires *two*
> writes instead of one. It is *fsync* latency as mentioned above
> that is smaller, because it depends only on the first write to
> what is in effect a small log based filesystem. This distinction
> matters a great deal, because it is the reason why "short bursty
> write patterns" is the qualification above. For long write
> patterns things are very different as the journal eventually
> fills up. For any given size it will also fill up a lot faster
> for 'data=journal'.
>
> Ahhh while writing that I have just realized that large journals
> can be a bad idea especially for metadata operations. Will have
> to think more about that.
>

Well, the experiment I described was actually taken from the following
article:

http://www.ibm.com/developerworks/library/l-fs8.html?S_TACT=105AGX52&S_CMP=cn-a-l

The author claims that it was Andrew Morton who tested this and showed
that "data=journal mode allowed the 16-meg-file to be read from 9 to
over 13 times faster than other ext3 modes, ReiserFS, and even ext2
(which has no journaling overhead)". Although I cannot find Andrew
Morton's original post on LKML, the article has been widely copied to
many other websites.

Furthermore, the kernel documentation,
Documentation/filesystems/ext3.txt, says:

* journal mode
data=journal mode provides full data and metadata journaling.  All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state.  This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all other modes.

Although Ted and you both explained that fsync() latency is shorter in
data=journal mode, my original question, as the title indicated, is:
why does data=journal outperform the other modes when reading and
writing simultaneously? Or is this statement in the kernel
documentation inaccurate? If so, we should submit a patch to correct
the document so that other people won't be misled, and it would be
better to show some more demonstrative examples in which data=journal
really outperforms the other modes.

In addition, I am not quite clear on why you said that write() latency
is longer while fsync() latency is shorter. Let me try to restate what
you said; please point out anything that is incorrect:

1. Normally we call the write() syscall first and then call fsync() to
   flush the data.

2. write() returns as soon as the data has been written into the page
   cache, while fsync() returns only once the data has been written to
   stable storage.

3. Although write latency in data=journal mode is much longer because
   it requires two writes instead of one, write() itself only writes
   to the page cache, so its actual cost is not so high compared to
   fsync(), where we have to write to disk and may need disk seeks.
   So we can mainly focus on the fsync() system call.

4. Since the journal is stable storage, in data=journal mode fsync()
   returns as soon as the metadata and the data have been written into
   the journal, and this is sequential access. But in data=ordered
   mode, fsync() returns only once the data itself has been written to
   its final location on disk; since this is random access, it needs
   many disk seeks, which are expensive, so fsync() latency is much
   longer than in data=journal mode. That is why we claim that
   data=journal wins in this bursty write case.

Are these correct?

Regards,
Jidong

_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users
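
For anyone who wants to check points 1-4 on their own setup, here is a
minimal sketch (not taken from any of the mails quoted above) that
times write() and fsync() separately over a short burst of small
files. The names time_write_then_fsync, run_bursts and TARGET_DIR, as
well as the burst count and the 64 KiB payload size, are my own
assumptions for illustration. As Peter noted, results will depend on
RAM caching, the elevator and the device, so treat the numbers as
indicative only.

```python
import os
import tempfile
import time

# Hypothetical knob: set this to a directory on the filesystem under test
# (for example one mounted with data=journal or data=ordered).
# None means "use the system default temporary directory".
TARGET_DIR = None


def time_write_then_fsync(path, payload):
    """Return (write_seconds, fsync_seconds) for one write()+fsync() pair."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        t0 = time.perf_counter()
        # write() normally returns once the data sits in the page cache.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        t1 = time.perf_counter()
        # fsync() returns only after the kernel reports the data as stable.
        os.fsync(fd)
        t2 = time.perf_counter()
    finally:
        os.close(fd)
    return t1 - t0, t2 - t1


def run_bursts(directory, bursts=20, size=64 * 1024):
    """Write `bursts` small files and collect their (write, fsync) timings."""
    payload = os.urandom(size)
    timings = []
    for i in range(bursts):
        path = os.path.join(directory, "burst_%d.dat" % i)
        timings.append(time_write_then_fsync(path, payload))
    return timings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory(dir=TARGET_DIR) as scratch:
        results = run_bursts(scratch)
    avg_write = sum(w for w, _ in results) / len(results)
    avg_fsync = sum(f for _, f in results) / len(results)
    print("avg write(): %.6f s" % avg_write)
    print("avg fsync(): %.6f s" % avg_fsync)
```

Running this once on a data=ordered mount and once on a data=journal
mount of the same device would compare only the short bursty case; it
says nothing about long write patterns where the journal fills up.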