Re: nfs client: Now you see it, now you don't (aka spurious ESTALE errors)

On Fri, Jul 26, 2013 at 09:12:25AM -0400, Jeff Layton wrote:
> On Fri, 26 Jul 2013 12:41:01 +0000
> Larry Keegan <lk@xxxxxxxxxxxxxxx> wrote:
> 
> > On Thu, 25 Jul 2013 14:18:28 -0400
> > Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > On Thu, 25 Jul 2013 17:05:26 +0000
> > > Larry Keegan <lk@xxxxxxxxxxxxxxx> wrote:
> > > 
> > > > On Thu, 25 Jul 2013 10:11:43 -0400
> > > > Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > On Thu, 25 Jul 2013 13:45:15 +0000
> > > > > Larry Keegan <lk@xxxxxxxxxxxxxxx> wrote:
> > > > > 
> > > > > > Dear Chaps,
> > > > > > 
> > > > > > I am experiencing some inexplicable NFS behaviour which I would
> > > > > > like to run past you.
> > > > > 
> > > > > Were these machines running older kernels before this started
> > > > > happening? What kernel did you upgrade from if so?
> > > > > 
> > > > [snip out my long rambling reply]
> > > > > What might be helpful is to do some network captures when the
> > > > > problem occurs. What we want to know is whether the ESTALE errors
> > > > > are coming from the server, or if the client is generating them.
> > > > > That'll narrow down where we need to look for problems.
> > > > 
> > > > As it was giving me gyp while I was typing, I tried to capture some
> > > > NFS traffic. Unfortunately claws-mail started a mailbox check in the
> > > > middle of this and the problem disappeared! Normally it's claws
> > > > that starts this. It'll come along again soon enough and I'll send
> > > > a trace.
> > > > 
> > > Ok, we had a number of changes to how ESTALE errors are handled over
> > > the last few releases. When you mentioned 3.10, I had assumed that you
> > > might be hitting a regression in one of those, but those went in well
> > > after the 3.4 series.
> > > 
> > > Captures are probably your best bet. My suspicion is that the server
> > > is returning these ESTALE errors occasionally, but it would be best to
> > > have you confirm that. They may also help make sense of why it's
> > > occurring...
> > > 
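A minimal sketch of the sort of capture Jeff is suggesting, run as root
on the client; the interface name and capture file are illustrative, and
the NFS traffic in this setup is plain TCP to port 2049 on the server
(10.1.1.173). Capture while reproducing the problem, then summarize just
the NFS traffic:

# tcpdump -i eth0 -s 0 -w nfs-trace.pcap host 10.1.1.173 and port 2049
# tshark -r nfs-trace.pcap nfs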
> > 
> > Dear Jeff,
> > 
> > I now have a good and a bad packet capture. I can run them through
> > tshark -V, but the output is really long, so I'm wondering how best
> > to post them. I've posted the summaries below.
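
One way to keep what gets posted short might be to filter the verbose
decode down to just the interesting replies, e.g. those carrying
NFS4ERR_STALE (status 70), which would also settle whether the server is
returning it at all. The field name below is from memory of Wireshark's
NFS dissector, so it may need checking against your tshark version:

# tshark -V -r bad 'nfs.nfsstat4 == 70'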
> > 
> > The set-up is as follows: I'm running a few xterms on my desktop (the
> > affected client) as well as claws-mail using the mailmbox plugin.
> > Claws keeps a cache of the mailbox in .claws-mail/tagsdb/<foldername>.
> > From time to time I blast a load of mail into these mailboxes using
> > procmail. This seems to demonstrate the problem most of the time. After
> > a few minutes everything gets back to normal.
> > 
> > The actual mail is being delivered on my file server pair directly
> > into /home/larry/Mail/<foldername>. Both file servers use automount to
> > mount the same filesystem

Wait, I'm confused: that sounds like you're mounting the same ext4
filesystem from two different machines?

--b.

> > and attempt to deliver mail into the boxes
> > simultaneously. Clearly the .lock files stop them stomping on each
> > other. This works well.
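
A rough sketch of the kind of procmail recipe in play; the condition and
folder name are invented for illustration, and the second colon on the
':0:' line is what makes procmail take a <mailbox>.lock lockfile around
the delivery, so the two servers can't write the same mbox at once:

:0:
* ^List-Id:.*linux-nfs
/home/larry/Mail/linux-nfs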
> > 
> > When it's in the mood to work, the test session on my desktop looks
> > like this:
> > 
> > # ls .claws-mail/tagsdb
> > #mailmbox  #mh
> > # _
> > 
> > When it doesn't it looks like this:
> > 
> > # ls .claws-mail/tagsdb
> > ls: cannot open directory .claws-mail/tagsdb: Stale NFS file handle
> > # _
> > 
> > I captured the packets on the desktop. All else was quiet on the
> > network, at least as far as TCP traffic was concerned. Here are the
> > summaries:
> > 
> > # tshark -r good tcp
> >  10 1.304139000   10.1.1.139 -> 10.1.1.173   NFS 238 V4 Call ACCESS FH:0x4e5465ab, [Check: RD LU MD XT DL]
> >  11 1.304653000   10.1.1.173 -> 10.1.1.139   NFS 194 V4 Reply (Call In 10) ACCESS, [Allowed: RD LU MD XT DL]
> >  12 1.304694000   10.1.1.139 -> 10.1.1.173   TCP 66 gdoi > nfs [ACK] Seq=173 Ack=129 Win=3507 Len=0 TSval=119293240 TSecr=440910222
> >  13 1.304740000   10.1.1.139 -> 10.1.1.173   NFS 250 V4 Call LOOKUP DH:0x4e5465ab/tagsdb
> >  14 1.305225000   10.1.1.173 -> 10.1.1.139   NFS 310 V4 Reply (Call In 13) LOOKUP
> >  15 1.305283000   10.1.1.139 -> 10.1.1.173   NFS 238 V4 Call ACCESS FH:0x4e5465ab, [Check: RD LU MD XT DL]
> >  16 1.305798000   10.1.1.173 -> 10.1.1.139   NFS 194 V4 Reply (Call In 15) ACCESS, [Allowed: RD LU MD XT DL]
> >  17 1.305835000   10.1.1.139 -> 10.1.1.173   NFS 250 V4 Call LOOKUP DH:0x4e5465ab/tagsdb
> >  18 1.306330000   10.1.1.173 -> 10.1.1.139   NFS 310 V4 Reply (Call In 17) LOOKUP
> >  19 1.306373000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0x445c531a
> >  20 1.306864000   10.1.1.173 -> 10.1.1.139   NFS 262 V4 Reply (Call In 19) GETATTR
> >  21 1.346003000   10.1.1.139 -> 10.1.1.173   TCP 66 gdoi > nfs [ACK]  Seq=877 Ack=941 Win=3507 Len=0 TSval=119293282 TSecr=440910225
> > # tshark -r bad tcp
> >  14 2.078769000   10.1.1.139 -> 10.1.1.173   NFS 238 V4 Call ACCESS FH:0x76aee435, [Check: RD LU MD XT DL]
> >  15 2.079266000   10.1.1.173 -> 10.1.1.139   NFS 194 V4 Reply (Call In 14) ACCESS, [Allowed: RD LU MD XT DL]
> >  16 2.079296000   10.1.1.139 -> 10.1.1.173   TCP 66 gdoi > nfs [ACK] Seq=173 Ack=129 Win=3507 Len=0 TSval=180576023 TSecr=502193004
> >  17 2.079338000   10.1.1.139 -> 10.1.1.173   NFS 238 V4 Call ACCESS FH:0x4e5465ab, [Check: RD LU MD XT DL]
> >  18 2.079797000   10.1.1.173 -> 10.1.1.139   NFS 194 V4 Reply (Call In 17) ACCESS, [Allowed: RD LU MD XT DL]
> >  19 2.079834000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0xb12cdc45
> >  20 2.080331000   10.1.1.173 -> 10.1.1.139   NFS 262 V4 Reply (Call In 19) GETATTR
> >  21 2.080410000   10.1.1.139 -> 10.1.1.173   NFS 250 V4 Call LOOKUP DH:0x4e5465ab/tagsdb
> >  22 2.080903000   10.1.1.173 -> 10.1.1.139   NFS 310 V4 Reply (Call In 21) LOOKUP
> >  23 2.080982000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
> >  24 2.081477000   10.1.1.173 -> 10.1.1.139   NFS 162 V4 Reply (Call In 23) GETATTR
> >  25 2.081509000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0xb12cdc45
> >  26 2.082010000   10.1.1.173 -> 10.1.1.139   NFS 178 V4 Reply (Call In 25) GETATTR
> >  27 2.082040000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
> >  28 2.082542000   10.1.1.173 -> 10.1.1.139   NFS 142 V4 Reply (Call In 27) GETATTR
> >  29 2.089525000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
> >  30 2.089996000   10.1.1.173 -> 10.1.1.139   NFS 162 V4 Reply (Call In 29) GETATTR
> >  31 2.090028000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0xb12cdc45
> >  32 2.090529000   10.1.1.173 -> 10.1.1.139   NFS 262 V4 Reply (Call In 31) GETATTR
> >  33 2.090577000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0x4e5465ab
> >  34 2.091061000   10.1.1.173 -> 10.1.1.139   NFS 262 V4 Reply (Call In 33) GETATTR
> >  35 2.091110000   10.1.1.139 -> 10.1.1.173   NFS 250 V4 Call LOOKUP DH:0x4e5465ab/tagsdb
> >  36 2.091593000   10.1.1.173 -> 10.1.1.139   NFS 310 V4 Reply (Call In 35) LOOKUP
> >  37 2.091657000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
> >  38 2.092126000   10.1.1.173 -> 10.1.1.139   NFS 162 V4 Reply (Call In 37) GETATTR
> >  39 2.092157000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0xb12cdc45
> >  40 2.092658000   10.1.1.173 -> 10.1.1.139   NFS 178 V4 Reply (Call In 39) GETATTR
> >  41 2.092684000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
> >  42 2.093150000   10.1.1.173 -> 10.1.1.139   NFS 142 V4 Reply (Call In 41) GETATTR
> >  43 2.100520000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
> >  44 2.101014000   10.1.1.173 -> 10.1.1.139   NFS 162 V4 Reply (Call In 43) GETATTR
> >  45 2.101040000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0xb12cdc45
> >  46 2.101547000   10.1.1.173 -> 10.1.1.139   NFS 262 V4 Reply (Call In 45) GETATTR
> >  47 2.141500000   10.1.1.139 -> 10.1.1.173   TCP 66 gdoi > nfs [ACK] Seq=2657 Ack=2289 Win=3507 Len=0 TSval=180576086 TSecr=502193026
> > # _
> > 
> > The first thing that strikes me is that the bad trace is much longer.
> > That seems reasonable because, as well as the ESTALE problem, I've
> > noticed that the whole system seems sluggish. claws-mail is
> > particularly sluggish because it keeps saving my typing into a drafts
> > mailbox, and because claws doesn't really understand traditional
> > mboxes, it spends an inordinate amount of time locking and unlocking
> > the boxes for each message in them. Claws also spews tracebacks
> > frequently and crashes from time to time, something it never did
> > before the ESTALE problem started.
> > 
> > Yours,
> > 
> > Larry
> 
> I'm afraid I can't tell much from the above output. I don't see any
> ESTALE errors there, but you can get similar issues if (for instance)
> certain attributes of a file change. You mentioned that this is a DRBD
> cluster; are you "floating" IP addresses between cluster nodes here? If
> so, do your problems occur around the times when that's happening?
> 
> Also, what sort of filesystem is being exported here?
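
In case it helps to answer that, something like the following on each
server should show the active exports with their options and the
filesystem type behind the exported path (the grep pattern is only
illustrative):

# exportfs -v
# grep /home /proc/mounts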
> 
> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>



