Re: [PATCH] ovl: skip overlayfs superblocks at global sync

Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> · Thu, 9 Apr 2020 15:04:39 +0300

On 09/04/2020 14.48, Amir Goldstein wrote:
On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov
<khlebnikov@xxxxxxxxxxxxxx> wrote:

On 09/04/2020 13.23, Amir Goldstein wrote:
On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
<khlebnikov@xxxxxxxxxxxxxx> wrote:

Stacked filesystems like overlayfs have no writeback of their own, but they
have to forward syncfs() requests to the backend to preserve data integrity.

During a global sync() each overlayfs instance calls ->sync_fs() for its
backend, although the backend is itself in the global list of superblocks too.
As a result, a single sync() syscall can write one superblock several times
and send multiple disk barriers.

This patch adds the flag SB_I_SKIP_SYNC to sb->s_iflags to avoid that.

Reported-by: Dmitry Monakhov <dmtrmonakhov@xxxxxxxxxxxxxx>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
---
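
To make the duplicate-sync problem described in the patch concrete, here is a
minimal userspace sketch. It is not the patch itself and not kernel code:
struct toy_sb, toy_sync_fs(), toy_global_sync() and toy_backend_syncs() are
hypothetical names, and the skip_sync field only models the idea of an
SB_I_SKIP_SYNC bit. It assumes one backend superblock shared by several
overlay instances.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_OVERLAYS 8

/* Toy stand-in for a superblock; not the kernel's struct super_block. */
struct toy_sb {
    const char *name;
    bool skip_sync;          /* models the SB_I_SKIP_SYNC idea */
    struct toy_sb *backend;  /* non-NULL for a stacked fs */
    int sync_count;          /* times a sync actually reached this sb */
};

/* Models ->sync_fs(): a stacked fs only forwards to its backend. */
void toy_sync_fs(struct toy_sb *sb)
{
    if (sb->backend)
        toy_sync_fs(sb->backend);
    else
        sb->sync_count++;
}

/* Models the global sync() walk over the list of all superblocks. */
void toy_global_sync(struct toy_sb **sbs, size_t n, bool honor_skip)
{
    for (size_t i = 0; i < n; i++) {
        if (honor_skip && sbs[i]->skip_sync)
            continue;  /* the backend is in the list and syncs itself */
        toy_sync_fs(sbs[i]);
    }
}

/* One backend plus `overlays` stacked instances; returns how many
 * times the backend was synced by a single global sync. */
int toy_backend_syncs(int overlays, bool honor_skip)
{
    struct toy_sb backend = { "backend", false, NULL, 0 };
    struct toy_sb ovl[TOY_MAX_OVERLAYS];
    struct toy_sb *all[TOY_MAX_OVERLAYS + 1];

    assert(overlays >= 0 && overlays <= TOY_MAX_OVERLAYS);
    all[0] = &backend;
    for (int i = 0; i < overlays; i++) {
        ovl[i] = (struct toy_sb){ "overlay", true, &backend, 0 };
        all[i + 1] = &ovl[i];
    }
    toy_global_sync(all, (size_t)overlays + 1, honor_skip);
    return backend.sync_count;
}
```

In this model, toy_backend_syncs(3, false) counts four syncs of the backend
(once directly, three times via the overlays), while toy_backend_syncs(3, true)
counts one.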

Seems reasonable.
You may add:
Reviewed-by: Amir Goldstein <amir73il@xxxxxxxxx>

+CC: containers list

Thanks

This brings up old memories.
I posted this way back to fix handling of emergency_remount() in the
presence of loop mounted fs:
https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@xxxxxxxxxxxxxx/

But it seems to me that emergency_sync() and sync(2) are equally broken
for this use case.

I wonder if anyone cares enough about resilience of loop mounted fs to try
and change the iterate_* functions to iterate supers/bdevs in reverse order...

Now I see the reason behind "sync; sync; sync; reboot" =)

Iterating in old -> new order avoids missing new items if the list is
modified during the walk. That might be important for some users.

That's not the reason I suggested reverse order.
The reason is that with loop mounted fs, the correct order of flushing is:
1. sync loop mounted fs inodes => writes to loop image file
2. sync loop mounted fs sb => fsyncs the loop image file
3. sync the loop image host fs sb

With forward sb iteration order, #3 happens before #1, so the
loop mounted fs changes are not really being made durable by
a single sync(2) call.

If the fs on the loop device is mounted with barriers, then sync_fs will issue
REQ_OP_FLUSH to the loop device and trigger fsync() on the image file.
sync() might write something twice, but the data should be safe.
Without barriers this scenario is broken for sure.
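
The ordering argument above can be checked with a small userspace model. This
is only a sketch under simplifying assumptions: struct toy_state and the
toy_* functions are hypothetical, one unit of dirty data starts in the
loop-mounted fs, and "barriers" means the loop fs sync also forces the image
file to disk (the REQ_OP_FLUSH -> fsync() path).

```c
#include <stdbool.h>

struct toy_state {
    int loop_dirty;   /* dirty data inside the loop-mounted fs */
    int image_dirty;  /* written into the image file, not yet on disk */
    int on_disk;      /* reached stable storage */
};

/* Steps 1 and 2: write loop fs data into the image file; with
 * barriers, the flush also fsyncs the image file. */
void toy_sync_loop_fs(struct toy_state *s, bool barriers)
{
    s->image_dirty += s->loop_dirty;
    s->loop_dirty = 0;
    if (barriers) {
        s->on_disk += s->image_dirty;
        s->image_dirty = 0;
    }
}

/* Step 3: the host fs writes out whatever is dirty in the image file. */
void toy_sync_host_fs(struct toy_state *s)
{
    s->on_disk += s->image_dirty;
    s->image_dirty = 0;
}

/* Runs one modelled sync(2) and returns how much data became durable. */
int toy_durable_after_sync(bool host_first, bool barriers)
{
    struct toy_state s = { 1, 0, 0 };

    if (host_first) {           /* forward sb order: #3 before #1 */
        toy_sync_host_fs(&s);
        toy_sync_loop_fs(&s, barriers);
    } else {                    /* reverse sb order: #1, #2, #3 */
        toy_sync_loop_fs(&s, barriers);
        toy_sync_host_fs(&s);
    }
    return s.on_disk;
}
```

In this model only the combination of forward order and no barriers leaves
the data non-durable: toy_durable_after_sync(true, false) returns 0, and the
other three combinations return 1.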

Emergency remount R/O is another matter. It really needs reverse order.

bdev iteration already seems to be reversed: inode_sb_list_add() adds to the head
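
Why head insertion gives a reversed (new -> old) walk can be seen in a tiny
userspace sketch. struct toy_node, toy_list_add() and toy_visit_at() are
hypothetical names; the only assumption taken from the line above is that
new entries are added at the head of the list.

```c
#include <stddef.h>

struct toy_node {
    int id;
    struct toy_node *next;
};

/* Insert at the head, so the newest entry is always visited first. */
void toy_list_add(struct toy_node **head, struct toy_node *n)
{
    n->next = *head;
    *head = n;
}

/* Adds ids 1..count in that order, walks from the head, and returns
 * the id seen at 0-based position pos, or -1 if out of range. */
int toy_visit_at(int count, int pos)
{
    struct toy_node nodes[16];
    struct toy_node *head = NULL;

    if (count < 0 || count > 16)
        return -1;
    for (int i = 0; i < count; i++) {
        nodes[i].id = i + 1;
        toy_list_add(&head, &nodes[i]);
    }
    for (struct toy_node *n = head; n; n = n->next, pos--) {
        if (pos == 0)
            return n->id;
    }
    return -1;
}
```

With three entries added as 1, 2, 3, the walk visits 3 first and 1 last.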

I think bdev iteration order will not make a difference in this case.
Flushing /dev/loopX will not be needed, and it happens too late
anyway.

Thanks,
Amir.