Eric Dumazet a écrit : > Nick Piggin a écrit : >> On Friday 12 December 2008 09:40, Eric Dumazet wrote: >>> From: Christoph Lameter <cl@xxxxxxxxxxxxxxxxxxxx> >>> >>> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU >>> >>> Currently we schedule RCU frees for each file we free separately. That has >>> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which >>> did not require RCU callbacks: >>> >>> 1. Excessive number of RCU callbacks can be generated causing long RCU >>> queues that in turn cause long latencies. We hit SLUB page allocation >>> more often than necessary. >>> >>> 2. The cache hot object is not preserved between free and realloc. A close >>> followed by another open is very fast with the RCUless approach because >>> the last freed object is returned by the slab allocator that is >>> still cache hot. RCU free means that the object is not immediately >>> available again. The new object is cache cold and therefore open/close >>> performance tests show a significant degradation with the RCU >>> implementation. >>> >>> One solution to this problem is to move the RCU freeing into the Slab >>> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation >>> time. The slab allocator will do RCU frees only when it is necessary >>> to dispose of slabs of objects (rare). So with that approach we can cut >>> out the RCU overhead significantly. >>> >>> However, the slab allocator may return the object for another use even >>> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means >>> there is the (unlikely) possibility that the object is going to be >>> switched under us in sections protected by rcu_read_lock() and >>> rcu_read_unlock(). So we need to verify that we have acquired the correct >>> object after establishing a stable object reference (incrementing the >>> refcounter does that). >>> >>> >>> Signed-off-by: Christoph Lameter <cl@xxxxxxxxxxxxxxxxxxxx> >>> Signed-off-by: Eric Dumazet <dada1@xxxxxxxxxxxxx> >>> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> >>> --- >>> Documentation/filesystems/files.txt | 21 ++++++++++++++-- >>> fs/file_table.c | 33 ++++++++++++++++++-------- >>> include/linux/fs.h | 5 --- >>> 3 files changed, 42 insertions(+), 17 deletions(-) >>> >>> diff --git a/Documentation/filesystems/files.txt >>> b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644 >>> --- a/Documentation/filesystems/files.txt >>> +++ b/Documentation/filesystems/files.txt >>> @@ -78,13 +78,28 @@ the fdtable structure - >>> that look-up may race with the last put() operation on the >>> file structure. This is avoided using atomic_long_inc_not_zero() >>> on ->f_count : >>> + As file structures are allocated with SLAB_DESTROY_BY_RCU, >>> + they can also be freed before a RCU grace period, and reused, >>> + but still as a struct file. >>> + It is necessary to check again after getting >>> + a stable reference (ie after atomic_long_inc_not_zero()), >>> + that fcheck_files(files, fd) points to the same file. >>> >>> rcu_read_lock(); >>> file = fcheck_files(files, fd); >>> if (file) { >>> - if (atomic_long_inc_not_zero(&file->f_count)) >>> + if (atomic_long_inc_not_zero(&file->f_count)) { >>> *fput_needed = 1; >>> - else >>> + /* >>> + * Now we have a stable reference to an object. >>> + * Check if other threads freed file and reallocated it. >>> + */ >>> + if (file != fcheck_files(files, fd)) { >>> + *fput_needed = 0; >>> + put_filp(file); >>> + file = NULL; >>> + } >>> + } else >>> /* Didn't get the reference, someone's freed */ >>> file = NULL; >>> } >>> @@ -95,6 +110,8 @@ the fdtable structure - >>> atomic_long_inc_not_zero() detects if refcounts is already zero or >>> goes to zero during increment. If it does, we fail >>> fget()/fget_light(). >>> + The second call to fcheck_files(files, fd) checks that this filp >>> + was not freed, then reused by an other thread. >>> >>> 6. Since both fdtable and file structures can be looked up >>> lock-free, they must be installed using rcu_assign_pointer() >>> diff --git a/fs/file_table.c b/fs/file_table.c >>> index a46e880..3e9259d 100644 >>> --- a/fs/file_table.c >>> +++ b/fs/file_table.c >>> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly; >>> >>> static struct percpu_counter nr_files __cacheline_aligned_in_smp; >>> >>> -static inline void file_free_rcu(struct rcu_head *head) >>> -{ >>> - struct file *f = container_of(head, struct file, f_u.fu_rcuhead); >>> - kmem_cache_free(filp_cachep, f); >>> -} >>> - >>> static inline void file_free(struct file *f) >>> { >>> percpu_counter_dec(&nr_files); >>> file_check_state(f); >>> - call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); >>> + kmem_cache_free(filp_cachep, f); >>> } >>> >>> /* >>> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd) >>> rcu_read_unlock(); >>> return NULL; >>> } >>> + /* >>> + * Now we have a stable reference to an object. >>> + * Check if other threads freed file and re-allocated it. >>> + */ >>> + if (unlikely(file != fcheck_files(files, fd))) { >>> + put_filp(file); >>> + file = NULL; >>> + } >> This is a non-trivial change, because that put_filp may drop the last >> reference to the file. So now we have the case where we free the file >> from a context in which it had never been allocated. > > If we got at this point, we : > > Found a non NULL pointer in our fd table. > Then, another thread came, closed the file while we not yet added our reference. > This file was freed (kmem_cache_free(filp_cachep, file)) > This file was reused and inserted on another thread fd table. > We added our reference on refcount. > We checked if this file is still ours (in our fd tab). > We found this file is not anymore the file we wanted. > Calling put_filp() here is our only choice to safely remove the reference on > a truly allocated file. At this point the file is > a truly allocated file but not anymore ours. > Unfortunatly we added a reference on it : we must release it. > If the other thread already called put_filp() because it wanted to close its new file, > we must see f_refcnt going to zero, and we must call __fput(), to perform > all the relevant file cleanup ourself. Reading again this mail I realise we call put_filp(file), while this should be fput(file) or put_filp(file), we dont know. Damned, this patch is wrong as is. Christoph, Paul, do you see the problem ? In fget()/fget_light() we dont know if the other thread (the one who re-allocated the file, and tried to close it while we got a reference on file) had to call put_filp() or fput() to release its own reference. So we call atomic_long_dec_and_test() and cannot take the appropriate action (calling the full __fput() version or the small one, that some systems use to 'close' an not really opened file. void put_filp(struct file *file) { if (atomic_long_dec_and_test(&file->f_count)) { security_file_free(file); file_kill(file); file_free(file); } } void fput(struct file *file) { if (atomic_long_dec_and_test(&file->f_count)) __fput(file); } I believe put_filp() is only called on slowpath (error cases). Should we just zap it and always call fput() ? -- To unsubscribe from this list: send the line "unsubscribe kernel-testers" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html