On Tue, Dec 22, 2020 at 2:24 PM Andreas Dilger <adilger@xxxxxxxxx> wrote:
>
> On Dec 22, 2020, at 10:47 AM, Jan Kara <jack@xxxxxxx> wrote:
> >
> > Hi!
> >
> > On Thu 03-12-20 01:07:51, lokesh jaliminche wrote:
> >> Hi Martin,
> >>
> >> Thanks for the quick response.
> >>
> >> Apologies from my side, I should have posted my fio job description
> >> with the fio logs. Anyway, here is my fio workload:
> >>
> >> [global]
> >> filename=/mnt/ext4/test
> >> direct=1
> >> runtime=30s
> >> time_based
> >> size=100G
> >> group_reporting
> >>
> >> [writer]
> >> new_group
> >> rate_iops=250000
> >> bs=4k
> >> iodepth=1
> >> ioengine=sync
> >> rw=randwrite
> >> numjobs=1
> >>
> >> I am using an Intel Optane SSD, so it's certainly very fast.
> >>
> >> I agree that delayed logging could help to hide the performance
> >> degradation due to actual writes to the SSD. However, as per the iostat
> >> output, data is definitely crossing the block layer, and since
> >> data journaling logs both data and metadata, I am wondering why
> >> or how IO requests see reduced latencies compared to metadata
> >> journaling or even no journaling.
> >>
> >> Also, I am using direct IO mode, so ideally it should not be using any type
> >> of caching. I am not sure if it's applicable to journal writes, but the whole
> >> point of journaling is to prevent data loss in case of abrupt failures, so
> >> caching journal writes may result in data loss unless we are using NVRAM.
> >
> > Well, first bear in mind that in data=journal mode, ext4 does not support
> > direct IO, so all the IO is in fact buffered. So your random-write workload
> > will be transformed into semilinear writeback of the page cache pages. Now
> > I think that given your SSD storage this performs much better because the
> > journalling thread committing data will drive large IOs (IO to the journal
> > will be sequential), and even when the journal is filled and we have to
> > checkpoint, we will run many IOs in parallel, which is beneficial for SSDs.
> > Whereas without data journalling your fio job will just run one IO at a
> > time, which is far from utilizing the full SSD bandwidth.
> >
> > So to summarize, you see better results with data journalling because you
> > are in fact doing buffered IO under the hood :).

That makes sense, thank you!!

> IMHO that is one of the benefits of data=journal in the first place, regardless
> of whether the journal is on NVMe or HDD - that it linearizes what would otherwise
> be a random small-block IO workload into something much friendlier to the storage.
> As long as it maintains the "written to stable storage" semantic for O_DIRECT, I
> don't think it is a problem whether the data is copied or not. Even without the
> use of data=journal, there are still some code paths that copy O_DIRECT writes.
>
> Ideally, being able to dynamically/automatically change between data=journal
> and data=ordered depending on the IO workload (e.g. large writes go straight
> to their allocated blocks, small writes go into the journal) would be the best
> of both worlds: high "IOPS" for workloads that need it (even on HDD), without
> overwhelming the journal device bandwidth with large streaming writes.
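
Somewhat related to the dynamic switching idea: as far as I understand, ext4
already has a static, per-file form of this via the journal-data inode flag
(the bit that "chattr +j <file>" sets), although it is manual and per-inode
rather than automatic based on IO size. A rough sketch of flipping it from C
follows; error handling is kept minimal, and AFAIU changing the flag needs
CAP_SYS_RESOURCE and only affects data written after the change:

/*
 * Illustrative sketch only: set the per-file journal-data flag,
 * i.e. the same bit that "chattr +j <file>" toggles.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_JOURNAL_DATA_FL */

int main(int argc, char **argv)
{
        int fd, flags;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
                perror("FS_IOC_GETFLAGS");
                return 1;
        }
        flags |= FS_JOURNAL_DATA_FL;    /* journal this file's data, not just its metadata */
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
                perror("FS_IOC_SETFLAGS");
                return 1;
        }
        close(fd);
        return 0;
}

That still leaves the policy decision to the administrator or application,
which is exactly what the automatic switching would avoid.
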
>
> This would tie in well with the proposed SMR patches, which allow a very large
> journal device to (essentially) transform ext4 into a log-structured filesystem
> by allowing journal shadow buffers to be dropped from memory rather than being
> pinned in RAM:
>
> https://github.com/tytso/ext4-patch-queue/blob/master/series
> https://github.com/tytso/ext4-patch-queue/blob/master/jbd2-dont-double-bump-transaction-number
> https://github.com/tytso/ext4-patch-queue/blob/master/journal-superblock-changes
> https://github.com/tytso/ext4-patch-queue/blob/master/add-journal-no-cleanup-option
> https://github.com/tytso/ext4-patch-queue/blob/master/add-support-for-log-metadata-block-tracking-in-log
> https://github.com/tytso/ext4-patch-queue/blob/master/add-indirection-to-metadata-block-read-paths
> https://github.com/tytso/ext4-patch-queue/blob/master/cleaner
> https://github.com/tytso/ext4-patch-queue/blob/master/load-jmap-from-journal
> https://github.com/tytso/ext4-patch-queue/blob/master/disable-writeback
> https://github.com/tytso/ext4-patch-queue/blob/master/add-ext4-journal-lazy-mount-option
>
> Having a 64GB-256GB NVMe device for the journal, handling most of the small
> IO directly in the journal, and only periodically flushing to the filesystem on
> HDD would really make those SMR disks more usable, since they are starting to
> creep into consumer/NAS devices, even when users aren't really aware of it:
>
> https://blocksandfiles.com/2020/04/14/wd-red-nas-drives-shingled-magnetic-recording/
>
> >> So the questions that come to my mind are:
> >> 1. Why do writes without journaling have longer latencies compared to
> >> write requests with metadata and data journaling?
> >> 2. Since metadata journaling involves relatively fewer journal writes than
> >> data journaling, why are writes with data journaling faster than in the
> >> no-journaling and metadata-journaling modes?
> >> 3. If there is an optimization that allows data journaling to be so fast
> >> without any risk of data loss, why is the same optimization not used for
> >> metadata journaling?
> >>
> >> On Thu, Dec 3, 2020 at 12:20 AM Martin Steigerwald <martin@xxxxxxxxxxxx> wrote:
> >>>
> >>> lokesh jaliminche - 03.12.20, 08:28:49 CET:
> >>>> I have been doing experiments to analyze the impact of data journaling
> >>>> on IO latencies. Theoretically, data journaling should show longer
> >>>> latencies compared to metadata journaling. However, I observed
> >>>> that when I enable data journaling I see improved performance. Is
> >>>> there any specific optimization for data journaling in the write
> >>>> path?
> >>>
> >>> This has been discussed before, as Andrew Morton found that data
> >>> journalling would be surprisingly fast with interactive write workloads.
> >>> I would need to look it up in my performance training slides or use an
> >>> internet search to find the reference to that discussion again.
> >>>
> >>> AFAIR even Andrew had no explanation for that. So I thought, why would I
> >>> have one? However, an idea came to my mind: the journal is a sequential
> >>> area on the disk. This could help with hard disks, I thought, at least if
> >>> the I/O is mostly to the same, not too big, location/file – as you did not
> >>> post it, I don't know exactly what your fio job file is doing. However, the
> >>> latencies you posted as well as the device name certainly point to fast
> >>> flash storage :).
> >>>
> >>> Another idea that just came to my mind is: AFAIK ext4 uses quite a lot of
> >>> delayed logging and relogging.
> >>> That means if a block in the journal is changed again within a certain
> >>> time frame, Ext4 changes it in memory before the journal block is written
> >>> out to disk. Thus, if the same block is overwritten again and again within
> >>> a short time, at least some of the updates would only happen in RAM. That
> >>> might help latencies even with NVMe flash, as RAM is usually still faster.
> >>>
> >>> Of course I bet that the Ext4 maintainers have a more accurate or detailed
> >>> explanation than I do. But that was at least my idea about this.
> >>>
> >>> Best,
> >>> --
> >>> Martin
> >>>
> > --
> > Jan Kara <jack@xxxxxxxx>
> > SUSE Labs, CR
>
> Cheers, Andreas
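
To close the loop from my side: if I understand Jan's explanation correctly,
the fio job above essentially boils down to a loop like the rough sketch
below, with a single 4k O_DIRECT write in flight at any time, so without data
journalling the device only ever sees one outstanding request from the job.
With data=journal the O_DIRECT flag is effectively ignored, the writes go
through the page cache, and jbd2 can batch them into large sequential journal
IO. (The path comes from the fio job; the file size and iteration count here
are just placeholders.)

/*
 * Illustrative sketch only: roughly the IO pattern generated by the fio job
 * above (ioengine=sync, iodepth=1, bs=4k, rw=randwrite, direct=1).
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSZ   4096UL
#define FILESZ  (1UL << 30)     /* placeholder; the fio job used size=100G */

int main(void)
{
        void *buf;
        int fd = open("/mnt/ext4/test", O_WRONLY | O_DIRECT);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (posix_memalign(&buf, BLKSZ, BLKSZ) != 0) {  /* O_DIRECT wants aligned buffers */
                fprintf(stderr, "posix_memalign failed\n");
                return 1;
        }
        memset(buf, 0xab, BLKSZ);

        for (long i = 0; i < 100000; i++) {
                off_t off = (random() % (FILESZ / BLKSZ)) * BLKSZ;

                /* ioengine=sync with iodepth=1: the next write is only issued
                 * after this one completes, so the device never has more than
                 * one request outstanding from this job. */
                if (pwrite(fd, buf, BLKSZ, off) != (ssize_t)BLKSZ) {
                        perror("pwrite");
                        return 1;
                }
        }
        free(buf);
        close(fd);
        return 0;
}

Which matches Jan's point that the better numbers with data journalling come
from effectively doing buffered IO under the hood.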