We can have multiple fsync operations against the same file during the same transaction and they can collect the same ordered extents while they don't complete (still accessible from the inode's ordered tree). If this happens, those ordered extents will never get their reference counts decremented to 0, leading to memory leaks and inode leaks (an iput for an ordered extent's inode is scheduled only when the ordered extent's refcount drops to 0). The following sequence diagram explains this race: CPU 1 CPU 2 btrfs_sync_file() btrfs_sync_file() mutex_lock(inode->i_mutex) btrfs_log_inode() btrfs_get_logged_extents() --> collects ordered extent X --> increments ordered extent X's refcount btrfs_submit_logged_extents() mutex_unlock(inode->i_mutex) mutex_lock(inode->i_mutex) btrfs_sync_log() btrfs_wait_logged_extents() --> list_del_init(&ordered->log_list) btrfs_log_inode() btrfs_get_logged_extents() --> Adds ordered extent X to logged_list because at this point: list_empty(&ordered->log_list) && test_bit(BTRFS_ORDERED_LOGGED, &ordered->flags) == 0 --> Increments ordered extent X's refcount --> check if ordered extent's io is finished or not, start it if necessary and wait for it to finish --> sets bit BTRFS_ORDERED_LOGGED on ordered extent X's flags and adds it to trans->ordered btrfs_sync_log() finishes btrfs_submit_logged_extents() btrfs_log_inode() finishes mutex_unlock(inode->i_mutex) btrfs_sync_file() finishes btrfs_sync_log() btrfs_wait_logged_extents() --> Sees ordered extent X has the bit BTRFS_ORDERED_LOGGED set in its flags --> X's refcount is untouched btrfs_sync_log() finishes btrfs_sync_file() finishes btrfs_commit_transaction() --> called by transaction kthread for e.g. btrfs_wait_pending_ordered() --> waits for ordered extent X to complete --> decrements ordered extent X's refcount by 1 only, corresponding to the increment done by the fsync task ran by CPU 1 In the scenario of the above diagram, after the transaction commit, the ordered extent will remain with a refcount of 1 forever, leaking the ordered extent structure and preventing the i_count of its inode from ever decreasing to 0, since the delayed iput is scheduled only when the ordered extent's refcount drops to 0, preventing the inode from ever being evicted by the VFS. Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to mean that an ordered extent is already being processed by an fsync call, which will attach it to the current transaction, preventing it from being collected by subsequent fsync operations against the same inode. This race was introduced with the following change (added in 3.19 and backported to stable 3.18 and 3.17): Btrfs: make sure logged extents complete in the current transaction V3 commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f CC: <stable@xxxxxxxxxxxxxxx> # 3.19, 3.18 and 3.17 Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx> --- V2: No code changes, re-worded commit message and diagram. CC'ed stable too. V3: No code changes, only adjusted commit message and its diagram to be more accurate. V4: Simplified code by removing unnecessary code. Updated comment and diagram. fs/btrfs/ordered-data.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 534544e..157cc54 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -452,9 +452,7 @@ void btrfs_get_logged_extents(struct inode *inode, continue; if (entry_end(ordered) <= start) break; - if (!list_empty(&ordered->log_list)) - continue; - if (test_bit(BTRFS_ORDERED_LOGGED, &ordered->flags)) + if (test_and_set_bit(BTRFS_ORDERED_LOGGED, &ordered->flags)) continue; list_add(&ordered->log_list, logged_list); atomic_inc(&ordered->refs); @@ -511,8 +509,7 @@ void btrfs_wait_logged_extents(struct btrfs_trans_handle *trans, wait_event(ordered->wait, test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags)); - if (!test_and_set_bit(BTRFS_ORDERED_LOGGED, &ordered->flags)) - list_add_tail(&ordered->trans_list, &trans->ordered); + list_add_tail(&ordered->trans_list, &trans->ordered); spin_lock_irq(&log->log_extents_lock[index]); } spin_unlock_irq(&log->log_extents_lock[index]); -- 2.1.3 -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html