Re: Inconsistent behavior of fsync in btrfs


Hi Chris,

We are using software we developed called CrashMonkey [1]. It
simulates the state of storage after a crash (taking the FLUSH and
FUA flags into account). Slides from a talk explaining how it works
are available at [2].

It is similar to dm-log-writes if you have used that in the past.
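
For anyone who wants to issue the same system calls by hand, here is a
minimal Python sketch of Workloads 1 and 3 from the report quoted below.
It only replays the operations and lists the resulting directory tree. It
does not simulate the crash, so it still needs CrashMonkey, dm-log-writes
or something similar underneath to capture the post-crash state. The
helper names (fsync_path, creat, snapshot) are made up for this
illustration and are not part of CrashMonkey; the "expected" states are
the ones described in the report.

```python
import os
import tempfile


def fsync_path(path):
    """fsync a file or directory by path (read-only open works for both on Linux)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def creat(path):
    """Create an empty file, like creat(2)."""
    os.close(os.open(path, os.O_CREAT | os.O_WRONLY, 0o644))


def workload_1(root):
    """mkdir A; sync; rename(A, B); creat B/foo; fsync B/foo; fsync B."""
    a, b = os.path.join(root, "A"), os.path.join(root, "B")
    os.mkdir(a)
    os.sync()
    os.rename(a, b)
    creat(os.path.join(b, "foo"))
    fsync_path(os.path.join(b, "foo"))
    fsync_path(b)
    # The crash point is here; it has to be injected by an external tool.


def workload_3(root):
    """mkdir A; mkdir B; creat A/foo; link(A/foo, B/foo); fsync A/foo; fsync B/foo."""
    a_foo = os.path.join(root, "A", "foo")
    b_foo = os.path.join(root, "B", "foo")
    os.mkdir(os.path.join(root, "A"))
    os.mkdir(os.path.join(root, "B"))
    creat(a_foo)
    os.link(a_foo, b_foo)
    fsync_path(a_foo)
    fsync_path(b_foo)
    # The crash point is here; it has to be injected by an external tool.


def snapshot(root):
    """Map each top-level directory under root to its sorted entries."""
    return {
        name: sorted(os.listdir(os.path.join(root, name)))
        for name in sorted(os.listdir(root))
        if os.path.isdir(os.path.join(root, name))
    }


# States the report expects to find after recovery.
EXPECTED = {
    "workload_1": {"B": ["foo"]},
    "workload_3": {"A": ["foo"], "B": ["foo"]},
}

if __name__ == "__main__":
    for fn in (workload_1, workload_3):
        # Assumption: in a real test, root would be a directory on the
        # btrfs mount under test rather than a temporary directory.
        root = tempfile.mkdtemp()
        fn(root)
        print(fn.__name__, snapshot(root), "expected:", EXPECTED[fn.__name__])
```

Run without a crash, the snapshot trivially matches the expected state.
The comparison only becomes interesting when snapshot() is taken on the
filesystem image recovered after the simulated crash.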

[1] https://github.com/utsaslab/crashmonkey
[2] http://www.cs.utexas.edu/~vijay/papers/hotstorage17-crashmonkey-slides.pdf

Thanks,
Vijay Chidambaram

On Tue, Apr 24, 2018 at 10:07 PM, Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On Tue, Apr 24, 2018 at 8:35 PM, Jayashree Mohan <jayashree2912@xxxxxxxxx> wrote:
>>
>> Hi,
>>
>> While investigating crash consistency bugs on btrfs, we came across
>> workloads that demonstrate inconsistent behavior of fsync.
>>
>> Consider the following workload where fsync on the directory did not persist it.
>>
>> Workload 1:
>>
>> mkdir A
>> sync
>> rename (A, B)
>> creat B/foo
>> fsync B/foo
>> fsync B
>> ---crash---
>>
>> In this case, both directory B and file B/foo are missing after the
>> crash. What's more worrying is the state we see on recovery. We
>> expect the directory contents to be
>>
>> Dir A : should not exist
>> Dir B :
>>     foo
>>
>> But instead, what we see is that:
>> Dir A :
>>     foo
>> Dir B : doesn't exist
>>
>>
>> This state would be acceptable if we had created the file foo in dir A
>> and then renamed the directory - in that case it would mean the rename
>> did not persist. However, what we see here is that a file created in
>> directory B falsely appears in A, which is incorrect.
>>
>> However, if we do not persist the initial creation of directory A, i.e.,
>>
>> Workload 2:
>>
>> mkdir A
>> rename (A, B)
>> creat B/foo
>> fsync B/foo
>> fsync B
>> ---crash---
>>
>> the directory B and its entry both get persisted in this case.
>>
>> Is this something to do with the directory entry A being already
>> present in the FS/subvolume tree and then the changes to the directory
>> inode going into the fsync log?
>>
>> We do not clearly understand the reason for such inconsistent
>> behavior, but it does seem incorrect.
>>
>> Consider another case where we found inconsistent behavior in the way
>> fsync is handled.
>>
>> Workload 3:
>>
>> mkdir A
>> mkdir B
>> creat A/foo
>> link (A/foo, B/foo)
>> fsync A/foo
>> fsync B/foo
>> ---crash---
>>
>> In this case, file A/foo is persisted, but in spite of an explicit
>> fsync on B/foo, that file goes missing.
>>
>> Workload 4:
>>
>> mkdir A
>> mkdir B
>> creat A/foo
>> link (A/foo, B/foo)
>> fsync B/foo
>> fsync A/foo
>> ---crash---
>>
>> Note that the only difference between workloads 3 and 4 is the order
>> of fsync on files A/foo and B/foo. In this case, the file B/foo is
>> persisted, but A/foo is missing.
>>
>> Our interpretation of the above workloads is that the second fsync
>> behaves like a no-op, and in either case, only the file that is
>> fsynced first gets persisted. If we insert a sleep(45) between the two
>> fsyncs in the workloads above, we see both the files A/foo and B/foo
>> being persisted.
>>
>> No matter how many more links we create and fsync, only the first
>> fsync persists the file. For example,
>>
>> Workload 5:
>>
>> mkdir A
>> mkdir B
>> mkdir C
>> creat A/foo
>> link (A/foo, B/foo)
>> link (A/foo, C/foo)
>> fsync B/foo
>> fsync A/foo
>> fsync C/foo
>> ---crash---
>>
>> Only file B/foo gets persisted, and both A/foo and C/foo are missing.
>>
>> This seems like inconsistent behavior, as only the first fsync persists
>> the file, while none of the others seem to. Do you agree that this is
>> indeed incorrect and needs fixing?
>>
>> All the above tests pass on ext4 and xfs.
>>
>> Please let us know what you feel about such inconsistency.
>>
>
> I don't have an answer to your question, but I'm curious exactly how you simulate a crash. For my own really rudimentary testing I've been doing crazy things like:
>
> # grub-mkconfig -o /boot/efi && echo b > /proc/sysrq-trigger
>
> And seeing what makes it to disk - or not. And I'm finding that some non-deterministic results are possible even in a VM, which is a bit confusing. I'm sure with real hardware I'd find even more inconsistency.
>
>
> --
> Chris Murphy