Re: [PATCH 5/5] radix tree: fix multi-order iteration race

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu 03-05-18 13:24:30, Ross Zwisler wrote:
> Fix a race in the multi-order iteration code which causes the kernel to hit
> a GP fault.  This was first seen with a production v4.15 based kernel
> (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used order 9 PMD
> DAX entries.
> 
> The race has to do with how we tear down multi-order sibling entries when
> we are removing an item from the tree.  Remember for example that an order
> 2 entry looks like this:
> 
> struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
> 
> where 'entry' is in some slot in the struct radix_tree_node, and the three
> slots following 'entry' contain sibling pointers which point back to
> 'entry.'
> 
> When we delete 'entry' from the tree, we call :
>   radix_tree_delete()
>     radix_tree_delete_item()
>       __radix_tree_delete()
>         replace_slot()
> 
> replace_slot() first removes the siblings in order from the first to the
> last, then at then replaces 'entry' with NULL.  This means that for a brief
> period of time we end up with one or more of the siblings removed, so:
> 
> struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
> 
> This causes an issue if you have a reader iterating over the slots in the
> tree via radix_tree_for_each_slot() while only under
> rcu_read_lock()/rcu_read_unlock() protection.  This is a common case in
> mm/filemap.c.
> 
> The issue is that when __radix_tree_next_slot() => skip_siblings() tries to
> skip over the sibling entries in the slots, it currently does so with an
> exact match on the slot directly preceding our current slot.  Normally this
> works:
>                                     V preceding slot
> struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
>                                             ^ current slot
> 
> This lets you find the first sibling, and you skip them all in order.
> 
> But in the case where one of the siblings is NULL, that slot is skipped and
> then our sibling detection is interrupted:
> 
>                                            V preceding slot
> struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
>                                                   ^ current slot
> 
> This means that the sibling pointers aren't recognized since they point all
> the way back to 'entry', so we think that they are normal internal radix
> tree pointers.  This causes us to think we need to walk down to a struct
> radix_tree_node starting at the address of 'entry'.
> 
> In a real running kernel this will crash the thread with a GP fault when
> you try and dereference the slots in your broken node starting at 'entry'.
> 
> We fix this race by fixing the way that skip_siblings() detects sibling
> nodes.  Instead of testing against the preceding slot we instead look for
> siblings via is_sibling_entry() which compares against the position of the
> struct radix_tree_node.slots[] array.  This ensures that sibling entries
> are properly identified, even if they are no longer contiguous with the
> 'entry' they point to.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> Reported-by: CR, Sapthagirish <sapthagirish.cr@xxxxxxxxx>
> Fixes: commit 148deab223b2 ("radix-tree: improve multiorder iterators")
> Cc: <stable@xxxxxxxxxxxxxxx>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

								Honza

> ---
>  lib/radix-tree.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index da9e10c827df..43e0cbedc3a0 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -1612,11 +1612,9 @@ static void set_iter_tags(struct radix_tree_iter *iter,
>  static void __rcu **skip_siblings(struct radix_tree_node **nodep,
>  			void __rcu **slot, struct radix_tree_iter *iter)
>  {
> -	void *sib = node_to_entry(slot - 1);
> -
>  	while (iter->index < iter->next_index) {
>  		*nodep = rcu_dereference_raw(*slot);
> -		if (*nodep && *nodep != sib)
> +		if (*nodep && !is_sibling_entry(iter->node, *nodep))
>  			return slot;
>  		slot++;
>  		iter->index = __radix_tree_iter_add(iter, 1);
> @@ -1631,7 +1629,7 @@ void __rcu **__radix_tree_next_slot(void __rcu **slot,
>  				struct radix_tree_iter *iter, unsigned flags)
>  {
>  	unsigned tag = flags & RADIX_TREE_ITER_TAG_MASK;
> -	struct radix_tree_node *node = rcu_dereference_raw(*slot);
> +	struct radix_tree_node *node;
>  
>  	slot = skip_siblings(&node, slot, iter);
>  
> -- 
> 2.14.3
> 
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR



[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux