Re: [Cluster-devel] [PATCH 2/2] GFS2: Add a gfs2-specific prune_icache_sb

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 29 Jun 2016 08:47:40 +1000

On Tue, Jun 28, 2016 at 10:13:32AM +0100, Steven Whitehouse wrote:
> Hi,
> 
> On 28/06/16 03:08, Dave Chinner wrote:
> >On Fri, Jun 24, 2016 at 02:50:11PM -0500, Bob Peterson wrote:
> >>This patch adds a new prune_icache_sb function for the VFS slab
> >>shrinker to call. Trying to directly free the inodes from memory
> >>might deadlock because it evicts inodes, which calls into DLM
> >>to acquire the glock. The DLM, in turn, may block on a pending
> >>fence operation, which may already be blocked on memory allocation
> >>that caused the slab shrinker to be called in the first place.
> >>
> >>Signed-off-by: Bob Peterson <rpeterso@xxxxxxxxxx>
> >All sorts of problems with this.
.....
> >Shrinker contexts are complex and the VFS shrinkers are much more
> >complex than most people understand. I strongly recommend that
> >filesystems leave inode cache reclaim to the superblock shrinker for
> >the above reasons - it's far too easy for people to get badly wrong.
> >
> >If there are specific limitations on how inodes can be freed, then
> >move the parts of inode *freeing* that cause problems to a different
> >context via the ->evict/destroy callouts and trigger that external
> >context processing on demand. That external context can just do bulk
> >"if it is on the list then free it" processing, because the reclaim
> >policy has already been executed to place that inode on the reclaim
> >list.
> That causes problems, unfortunately. As soon as the inode is taken
> out of the hash, then a new inode can be created by another process
> doing a lookup (for whatever reason - NFS fh lookup, scan for
> unlinked but still allocated inodes) and we'd land up with two in
> memory inodes referring to the same on disk inode at the same time.

Yes, that's a problem XFS has to deal with, too. the lookup needs to
search the reclaim list, too, and if it is found there recycle that
inode rather than allocate a new one. This generaly involves
clearing the reclaim state,  calling init_inode_always(inode) on the
reclaimed inode and then treating it like a freshly allocated struct
inode...

> We did look at
> leaving the inode hashed and making the evict async, but that didn't
> work either - the exact issue escapes me, but Bob's memory is
> probably better than mine on this issue.

There's no need to play games with hashing if you layer the reclaim
list as per above. The VFS state is always consistent, and the only
times you need to care about inodes in the "limbo" state of waiting
for reclaim is when a new lookup comes in or a superblock state
transition such as freeze, remount-ro and unmount...

> The issue is that there is a dependency loop involving the glocks
> required for inode deallocation, so we have:
> 
>  1. Event occurs which requires recovery somewhere in the cluster,
> which memory is short
>  2. Userspace gets blocked waiting for memory (on inode shrinker)
> due to requirement for a glock, to complete eviction
>  3. The glock is blocked on the DLM
>  4. The DLM is blocked until recovery is complete -> deadlock
>
> So we have various option on where to break this loop, the big
> question being where is the best place to do it, as the "obvious"
> ones seem to have various disadvantages to them,

Yes, I understand the problem. Essentially you have a "non-blocking"
requirement for reclaiming inodes because the DLM dependency means
that reclaim progress cannot be guaranteed. The superblock shrinker
has always assumed that unreferenced inodes can be reclaimed at any
time when GFP_KERNEL reclaim context is given. This is clearly not
true for gfs2 - it is only safe to attempt reclaim on inodes whose
glock can be obtained without blocking.

There are two ways of dealing with this. The first is as I described
above - defer final cleanup/freeing of specific inodes to a separate
list and context out of sight of the VFS inode life cycle and hook
your inode lookup code into it. This way the reclaim of the inode
can be deferred until the glock can be obtained without affecting
normal VFS level operations.

The second is to enable the shrinker to skip inodes that cannot be
immediately reclaimed. The shrinker already does this based on the
internal VFS inode state in inode_lru_isolate(). It would seem to me
that we could add a callout in this function to allow a non-blocking
determination of inode reclaimability by the filesystem. It would
truly have to be non-blocking - I think we could only sanely call it
under the lru and inode spinlocks because if we can't reclaim it
immediately we have to skip the inode. Dropping the locks mean we
can't move the inode in the LRU, and traversal must restart. Hence
we'd replace a deadlock with a livelock if we can't skip inodes we
can't reclaim because we dropped the lru lock.

I don't know whether you can grab a glock in a truly non-blocking
fashion, which is why I didn't initially suggest it. It seems like a
lot more surgery to both the inode reclaim code and the GFS2 code to
enable a hook into reclaim in this way...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html