On 04/17/2015 02:20 PM, Filipe Manana wrote:
> If we have concurrent fsync calls against files living in the same
> subvolume, we have some time window where we don't add the collected
> ordered extents to the running transaction's list of ordered extents
> and return success to userspace. This can result in data loss if the
> ordered extents complete after the current transaction commits and a
> power failure happens after the current transaction commits and before
> the next one commits.
>
> A sequence of steps that leads to this:
>
>            CPU 0                                  CPU 1
>
>  btrfs_sync_file(inode A)               btrfs_sync_file(inode B)
>
>   btrfs_log_inode_parent()              btrfs_log_inode_parent()
>
>    start_log_trans()
>      lock root->log_mutex
>      ctx->log_transid = root->log_transid = N
>      unlock root->log_mutex
>
>                                          start_log_trans()
>                                            lock root->log_mutex
>                                            ctx->log_transid = root->log_transid = N
>                                            unlock root->log_mutex
>
>    btrfs_log_inode()                     btrfs_log_inode()
>
>     btrfs_get_logged_extents()            btrfs_get_logged_extents()
>      --> gets ordered extent A             --> gets ordered extent B
>          into local list logged_list           into local list logged_list
>
>     write items into the log tree         write items into the log tree
>
>    btrfs_submit_logged_extents(&logged_list)
>      --> splices logged_list into
>          log_root->logged_list[N % 2]
>          (N == log_root->log_transid)
>
>   btrfs_sync_log()
>     lock root->log_mutex
>     atomic_set(&root->log_commit[N % 2], 1)
>     (N == ctx->log_transid)
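As a concrete illustration of the window the quoted message describes, here is a minimal userspace sketch. This is not kernel code: only the names mirror the commit message, and every type and helper is invented for illustration. Two fsyncs join log transaction N, each splices its private logged_list into the shared logged_list[N % 2], and the worry is that the log commit could set log_commit[N % 2] and walk that shared list before the second splice lands.

/*
 * Minimal userspace sketch of the window described above; NOT kernel
 * code. The names mirror the commit message, but every type and helper
 * here is invented for illustration. Build: cc -pthread race.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct extent { int id; struct extent *next; };

static struct extent *logged_list[2];   /* log_root->logged_list[N % 2] */
static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_int log_commit[2];        /* root->log_commit[N % 2]      */
static const int log_transid = 7;       /* the shared transid N         */

/* stands in for btrfs_submit_logged_extents(): splice a private list
 * into the shared per-log-transaction list */
static void splice_logged_extents(struct extent *e, int transid)
{
	pthread_mutex_lock(&log_mutex);
	e->next = logged_list[transid % 2];
	logged_list[transid % 2] = e;
	pthread_mutex_unlock(&log_mutex);
}

static void *fsync_thread(void *arg)
{
	int transid = log_transid;      /* ctx->log_transid = N */

	/* ... btrfs_log_inode(): write items into the log tree ... */
	splice_logged_extents(arg, transid);
	return NULL;
}

int main(void)
{
	struct extent a = { .id = 1 }, b = { .id = 2 };
	pthread_t t1, t2;

	pthread_create(&t1, NULL, fsync_thread, &a);
	pthread_join(t1, NULL);

	/*
	 * The claimed data-loss window: btrfs_sync_log() on CPU 0 sets
	 * log_commit[N % 2] and walks logged_list[N % 2] at this point.
	 * Extent B below is spliced only afterwards, so it would never
	 * make it onto the transaction's ordered-extent list.
	 */
	atomic_store(&log_commit[log_transid % 2], 1);

	pthread_create(&t2, NULL, fsync_thread, &b);
	pthread_join(t2, NULL);

	for (struct extent *e = logged_list[log_transid % 2]; e; e = e->next)
		printf("extent %d is on logged_list[%d]\n",
		       e->id, log_transid % 2);
	return 0;
}

Whether this interleaving is actually reachable is the question below.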
Except this can't happen: we have a wait_for_writer() in between here that will wait for CPU 1 to finish doing its logging, since it has already done its start_log_trans(). Thanks,
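The same style of sketch for that handshake: the function names below come from fs/btrfs/tree-log.c, but the bodies are heavily simplified (the real code also takes root->log_mutex and handles full-commit fallbacks). Every fsync that joins log transaction N bumps a writer count in start_log_trans() and drops it only after its logging, including the logged_list splice, is finished, so a committer that waits for the count to reach zero cannot walk logged_list[N % 2] too early.

/*
 * Illustrative userspace model of the log_writers handshake; the
 * function names come from fs/btrfs/tree-log.c but the bodies are
 * heavily simplified for this sketch. Build: cc -pthread writers.c
 */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int log_writers;

/* every fsync joining the log transaction bumps the writer count */
static void start_log_trans(void) { atomic_fetch_add(&log_writers, 1); }

/* dropped only after logging, including the logged_list splice */
static void end_log_trans(void)   { atomic_fetch_sub(&log_writers, 1); }

/* the committer proceeds only once every concurrent logger is done */
static void wait_for_writer(void)
{
	while (atomic_load(&log_writers) > 0)
		sched_yield();
}

static void *cpu1_logging(void *arg)
{
	(void)arg;
	/* ... btrfs_log_inode(), including the logged_list splice ... */
	end_log_trans();        /* drops the count taken in main() */
	return NULL;
}

int main(void)
{
	pthread_t t;

	start_log_trans();      /* CPU 0 joins log transaction N */
	start_log_trans();      /* CPU 1 has already joined too, which is
	                           exactly the precondition noted above */
	pthread_create(&t, NULL, cpu1_logging, NULL);

	/* ... CPU 0's own logging and splice ... */
	end_log_trans();

	wait_for_writer();      /* cannot return before CPU 1 ran
	                           end_log_trans(), i.e. after its splice */
	puts("safe to walk logged_list[N % 2] now");
	pthread_join(t, NULL);
	return 0;
}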
Josef