Prioritizing readdirplus/getattr/lookup

Andrew Klaassen <clawsoon@xxxxxxxxx> · Mon, 4 Apr 2011 06:31:09 -0700 (PDT)

How difficult would it be to make nfsd give priority to the calls generated by "ls -l" (i.e. readdirplus, getattr, lookup) over read and write calls?  Is it a matter of tweaking a couple of sysctls or changing a few lines of code, or would it mean a major re-write?

I'm working in an environment where it's important to have reasonably good throughput for the HPC farm (50-200 machines reading and writing 5-10MB files as fast as they can pump them through), while simultaneously providing snappy responses to "ls -l" and equivalents for people reviewing file sizes and times and browsing through the filesystem and constructing new jobs and whatnot.

I've tried a small handful of server OSes (Solaris, Exastore, various Linux flavours and tunings and nfsd counts) that do great on the throughput side but horrible on the "ls -l" under load side (as mentioned in my previous emails).

However, I know what I need is possible, because Netapp GX on very similar hardware (similar processor, memory, and spindle count) does slightly worse (20% or so) on total throughput but much better (10-100 times better than Solaris/Linux/Exastore) on under-load "ls -l" responsiveness.

In the Linux case, I think I've narrowed the problem down to nfsd rather than the filesystem or VM system.  It's not the filesystem or VM system, because when the server is under heavy local load equivalent to my HPC farm load, both local and remote "ls -l" commands are fast.  It's not that the NFS load overwhelms the server, because when the server is under heavy HPC farm load, local "ls -l" commands are still fast.

It's only when there's an NFS load and an NFS "ls -l" that the "ls -l" is slow.  Like so:

                                  throughput     ls -l
                                  ==========     =====
Heavy local load, local ls -l     fast           fast
Heavy local load, NFS ls -l       fast           fast
Heavy NFS load, local ls -l       fast           fast
Heavy NFS load, NFS ls -l         fast           very slow

This suggests to me that it's nfsd that's slowing down the ls -l response times rather than the filesystem or VM system.
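
For anyone wanting to reproduce the "ls -l" column of the table above, here is a minimal sketch of a timing helper.  It is not the tool used to get the results above; the function name time_listing is made up for illustration, and it assumes that one readdir pass plus one lstat() per entry is a fair stand-in for what "ls -l" does.

```python
import os
import time


def time_listing(path):
    """Return (entry_count, seconds) for one "ls -l"-style pass over path.

    Hypothetical helper: it approximates "ls -l" by reading the directory
    and then fetching attributes for every entry.
    """
    start = time.monotonic()
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # "ls -l" needs attributes for every name.  On Linux this call
            # issues one lstat() per entry (symlinks are not followed).
            entry.stat(follow_symlinks=False)
            count += 1
    return count, time.monotonic() - start
```

Calling time_listing() once on a local path and once on the NFS mount of the same directory, with and without the farm load running, would fill in the four cells of the "ls -l" column.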

Would fixing the bottom-right-corner case - even if it meant a modest throughput slowdown - be an easy tweak/patch?  Or major re-write?  (Or just a kernel upgrade?)

I know it's doable because the Netapp does it; the question is how large a job would it be on Linux.

Thanks again.

FWIW, here's what I've tried so far, without success, to make this problem go away:

Server side:

kernels (all x86_64): 2.6.32-[something] on Scientific Linux 6.0, 2.6.32.4 on Slackware, 2.6.37.5 on Slackware
filesystems: xfs, ext4
nfsd counts: 8,32,64,127,128,256,1024
schedulers: cfq,deadline
export options: async,no_root_squash

Client side:
kernel: 2.6.31.14-0.6-desktop, x86_64, from openSUSE 11.3
hard,intr,noatime,vers=3,mountvers=3 # always on
rsize,wsize:  32768,65536
proto:        tcp,udp
nolock        # on or off
noac          # on or off
actimeo:      3,5,60,240,600  # I had really hoped this would help
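
A rough sketch of how the client-side combinations listed above could be enumerated as mount option strings for scripted testing.  Assumptions, none of which are stated in the list itself: each rsize/wsize value is applied to both rsize and wsize together, every option is crossed with every other, and the function name mount_option_sets is made up.  Some combinations may be redundant (for instance noac together with actimeo).

```python
from itertools import product

# Values taken from the client-side list above.
ALWAYS = "hard,intr,noatime,vers=3,mountvers=3"
SIZES = [32768, 65536]            # assumption: same value for rsize and wsize
PROTOS = ["tcp", "udp"]
NOLOCK = [False, True]
NOAC = [False, True]
ACTIMEO = [3, 5, 60, 240, 600]


def mount_option_sets():
    """Yield one comma-separated mount option string per combination."""
    for size, proto, nolock, noac, actimeo in product(
            SIZES, PROTOS, NOLOCK, NOAC, ACTIMEO):
        opts = [ALWAYS, f"rsize={size}", f"wsize={size}", f"proto={proto}"]
        if nolock:
            opts.append("nolock")
        if noac:
            opts.append("noac")
        opts.append(f"actimeo={actimeo}")
        yield ",".join(opts)
```

Each yielded string could be passed to "mount -o" for one test run; the full cross product is 80 strings, so in practice a subset is probably enough.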

Andrew

--- On Thu, 3/31/11, Andrew Klaassen <clawsoon@xxxxxxxxx> wrote:

> Setting actimeo=600 gave me part of
> the behaviour I expected; on the first directory listing,
> the calls were all readdirplus and no getattr.
> 
> However, there were now long stretches where nothing was
> happening.  During a single directory listing to a
> loaded server, there'd be:
> 
>  ~10 seconds of readdirplus calls and replies, followed by
>  ~70 seconds of nothing, followed by
>  ~10 seconds of readdirplus calls and replies, followed by
>  ~100 seconds of nothing, followed by
>  ~10 seconds of readdirplus calls and replies, followed by
>  ~110 seconds of nothing, followed by
>  ~2 seconds of readdirplus calls and replies
> 
> Why the long stretches of nothing?  If I'm reading my
> tshark output properly, it doesn't seem like the client was
> waiting for a server response.  Here are a couple of
> lines before and after a long stretch of nothing:
> 
>  28.575537 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
>  28.593943 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 358)
>     random_1168.exr random_2159.exr random_2188.exr random_0969.exr
>     random_1662.exr random_0022.exr random_0785.exr random_2316.exr
>     random_0831.exr random_0443.exr random_1203.exr random_1907.exr
>  28.594006 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
>  28.623736 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 362)
>     random_1575.exr random_0492.exr random_0335.exr random_2460.exr
>     random_0754.exr random_1114.exr random_2001.exr random_2298.exr
>     random_1858.exr random_1889.exr random_2249.exr random_0782.exr
> 103.811801 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
> 103.883930 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 2311)
>     random_0025.exr random_1665.exr random_2311.exr random_1204.exr
>     random_0444.exr random_0836.exr random_0332.exr random_0495.exr
>     random_1572.exr random_1900.exr random_2467.exr random_1113.exr
> 103.884014 192.168.10.158 -> 192.168.10.5 NFS V3 READDIRPLUS Call, FH:0xa216e302
> 103.965167 192.168.10.5 -> 192.168.10.158 NFS V3 READDIRPLUS Reply (Call In 2316)
>     random_0753.exr random_2006.exr random_0216.exr random_1824.exr
>     random_1456.exr random_1790.exr random_1037.exr random_0677.exr
>     random_2122.exr random_0101.exr random_1741.exr random_2235.exr
> 
> Calls are sent and replies received at the 28 second mark,
> and then... nothing... until the 103 second mark.  I'm
> sure the server must be somehow telling the client that it's
> busy, but - at least with the tools I'm looking at - I don't
> see how.  Is tshark just hiding TCP delays and
> retransmits from me?
> 
> Thanks again.
> 
> Andrew
> 
> 
> --- On Thu, 3/31/11, Andrew Klaassen <clawsoon@xxxxxxxxx> wrote:
> 
> > Interesting.  So the reason it's switching back and forth between
> > readdirplus and getattr during the same ls command is because the
> > command is taking so long to run that the cache is periodically
> > expiring as the command is running?
> > 
> > I'll do some playing with actimeo to see if I'm actually
> > understanding this.
> > 
> > Thanks!
> > 
> > Andrew
> > 
> > 
> > --- On Thu, 3/31/11, Steven Procter <steven@xxxxxxxxxxxxxx> wrote:
> > 
> > > This is due to client caching.  When the second ls -l runs the cache
> > > contains an entry for the directory.  The client can check if the
> > > cached directory data is still valid by issuing a GETATTR on the
> > > directory.
> > > 
> > > But this only validates the names, not the attributes, which are not
> > > actually part of the directory.  Those must be refetched.  So the
> > > client issues a GETATTR for each entry in the directory.  It issues
> > > them sequentially, probably as ls calls readdir() and then stat()
> > > sequentially on the directory entries.
> > > 
> > > This takes so long that the cache entry times out and the next time
> > > you run ls -l the client reloads the directory using READDIRPLUS.
> > > 
> > > --Steven
> > > 
> > > > X-Mailer: YahooMailClassic/12.0.2 YahooMailWebService/0.8.109.295617
> > > > Date:    Thu, 31 Mar 2011 15:24:15 -0700 (PDT)
> > > > From:    Andrew Klaassen <clawsoon@xxxxxxxxx>
> > > > Subject: readdirplus/getattr
> > > > To:    linux-nfs@xxxxxxxxxxxxxxx
> > > > Sender:    linux-nfs-owner@xxxxxxxxxxxxxxx
> > > > 
> > > > Hi,
> > > > 
> > > > I've been trying to get my Linux NFS clients to be a little
> > > > snappier about listing large directories from heavily-loaded
> > > > servers.  I found the following fascinating behaviour (this is
> > > > with 2.6.31.14-0.6-desktop, x86_64, from openSUSE 11.3, Solaris
> > > > Express 11 NFS server):
> > > > 
> > > > With "ls -l --color=none" on a directory with 2500 files:
> > > > 
> > > >              |      rdirplus   |   nordirplus    |
> > > >              |1st  |2nd  |1st  |1st  |2nd  |1st  |
> > > >              |run  |run  |run  |run  |run  |run  |
> > > >              |light|light|heavy|light|light|heavy|
> > > >              |load |load |load |load |load |load |
> > > > --------------------------------------------------
> > > > readdir      |   0 |   0 |   0 |  25 |   0 |  25 |
> > > > readdirplus  | 209 |   0 | 276 |   0 |   0 |   0 |
> > > > lookup       |  16 |   0 |  10 |2316 |   0 |2473 |
> > > > getattr      |   1 |2501 |2452 |   1 |2465 |   1 |
> > > > 
> > > > The most interesting case is with rdirplus specified as a mount
> > > > option to a heavily loaded server.  The NFS client keeps switching
> > > > back and forth between readdirplus and getattr:
> > > > 
> > > >  ~10 seconds doing ~70 readdirplus calls, followed by
> > > >  ~150 seconds doing ~800 getattr calls, followed by
> > > >  ~12 seconds doing ~70 readdirplus calls, followed by
> > > >  ~200 seconds doing ~800 getattr calls, followed by
> > > >  ~20 seconds doing ~130 readdirplus calls, followed by
> > > >  ~220 seconds doing ~800 getattr calls
> > > > 
> > > > All the calls appear to get reasonably prompt replies (never more
> > > > than a second or so), which makes me wonder why it keeps switching
> > > > back and forth between the strategies.  (Especially since I've
> > > > specified rdirplus as a mount option.)
> > > > 
> > > > Is it supposed to do that?
> > > > 
> > > > I'd really like to see how it does with readdirplus ~only~, no
> > > > getattr calls, since it's spending only 40 seconds in total on
> > > > readdirplus calls compared to 570 seconds in total on (redundant,
> > > > I think, based on the lightly-loaded case) getattr calls.
> > > > 
> > > > It'd also be nice to be able to force readdirplus calls instead of
> > > > getattr calls for second and subsequent listings of a directory.
> > > > 
> > > > I saw a recent thread talking about readdirplus changes in 2.6.37,
> > > > so I'll give that a try when I get a chance to see how it behaves.
> > > > 
> > > > Andrew
> > > > 
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
