Re: nfs client: Now you see it, now you don't (aka spurious ESTALE errors)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 25 Jul 2013 14:18:28 -0400
Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> On Thu, 25 Jul 2013 17:05:26 +0000
> Larry Keegan <lk@xxxxxxxxxxxxxxx> wrote:
> 
> > On Thu, 25 Jul 2013 10:11:43 -0400
> > Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > On Thu, 25 Jul 2013 13:45:15 +0000
> > > Larry Keegan <lk@xxxxxxxxxxxxxxx> wrote:
> > > 
> > > > Dear Chaps,
> > > > 
> > > > I am experiencing some inexplicable NFS behaviour which I would
> > > > like to run past you.
> > > 
> > > Were these machines running older kernels before this started
> > > happening? What kernel did you upgrade from if so?
> > > 
> > [snip out my long rambling reply]
> > > What might be helpful is to do some network captures when the
> > > problem occurs. What we want to know is whether the ESTALE errors
> > > are coming from the server, or if the client is generating them.
> > > That'll narrow down where we need to look for problems.
> > 
> > As it was giving me gyp during typing I tried to capture some NFS
> > traffic. Unfortunately claws-mail started a mail box check in the
> > middle of this and the problem disappeared! Normally it's claws
> > which starts this. It'll come along again soon enough and I'll send
> > a trace.
> > 
> Ok, we had a number of changes to how ESTALE errors are handled over
> the last few releases. When you mentioned 3.10, I had assumed that you
> might be hitting a regression in one of those, but those went in well
> after the 3.4 series.
> 
> Captures are probably your best bet. My suspicion is that the server
> is returning these ESTALE errors occasionally, but it would be best to
> have you confirm that. They may also help make sense of why it's
> occurring...
> 

Dear Jeff,

I now have a good and a bad packet capture. I can run them through
tshark -V but if I do this, they're really long, so I'm wondering how
best to post them. I've posted the summaries below.

The set-up is as follows: I'm running a few xterms on my desktop (the
affected client) as well as claws-mail using the mailmbox plugin.
Claws keeps a cache of the mailbox in .clawsmail/tagsdb/<foldername>.
>From time to time I blast a load of mail into these mail boxes using
procmail. This seems to demonstrate the problem most of the time. After
a few minutes everything gets back to normal.

The actual mail is being delivered on my file server pair directly
into /home/larry/Mail/<foldername>. Both file servers use automount to
mount the same filesystem and attempt to deliver mail into the boxes
simultaneously. Clearly the .lock files stop them stomping on each
other. This works well.

When it's in the mood to work, the test session on my desktop looks
like this:

# ls .claws-mail/tagsdb
#mailmbox  #mh
# _

When it doesn't it looks like this:

# ls .claws-mail/tagsdb
ls: cannot open directory .claws-mail/tagsdb: Stale NFS file handle
# _

I captured the packets on the network desktop. All else was quiet on
the network, at least as far as TCP traffic was concerned. Here are the
summaries:

# tshark -r good tcp
 10 1.304139000   10.1.1.139 -> 10.1.1.173   NFS 238 V4 Call ACCESS FH:0x4e5465ab, [Check: RD LU MD XT DL]
 11 1.304653000   10.1.1.173 -> 10.1.1.139   NFS 194 V4 Reply (Call In 10) ACCESS, [Allowed: RD LU MD XT DL]
 12 1.304694000   10.1.1.139 -> 10.1.1.173   TCP 66 gdoi > nfs [ACK] Seq=173 Ack=129 Win=3507 Len=0 TSval=119293240 TSecr=440910222
 13 1.304740000   10.1.1.139 -> 10.1.1.173   NFS 250 V4 Call LOOKUP DH:0x4e5465ab/tagsdb
 14 1.305225000   10.1.1.173 -> 10.1.1.139   NFS 310 V4 Reply (Call In 13) LOOKUP
 15 1.305283000   10.1.1.139 -> 10.1.1.173   NFS 238 V4 Call ACCESS FH:0x4e5465ab, [Check: RD LU MD XT DL]
 16 1.305798000   10.1.1.173 -> 10.1.1.139   NFS 194 V4 Reply (Call In 15) ACCESS, [Allowed: RD LU MD XT DL]
 17 1.305835000   10.1.1.139 -> 10.1.1.173   NFS 250 V4 Call LOOKUP DH:0x4e5465ab/tagsdb
 18 1.306330000   10.1.1.173 -> 10.1.1.139   NFS 310 V4 Reply (Call In 17) LOOKUP
 19 1.306373000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0x445c531a
 20 1.306864000   10.1.1.173 -> 10.1.1.139   NFS 262 V4 Reply (Call In 19) GETATTR
 21 1.346003000   10.1.1.139 -> 10.1.1.173   TCP 66 gdoi > nfs [ACK]  Seq=877 Ack=941 Win=3507 Len=0 TSval=119293282 TSecr=440910225
# tshark -r bad tcp
 14 2.078769000   10.1.1.139 -> 10.1.1.173   NFS 238 V4 Call ACCESS FH:0x76aee435, [Check: RD LU MD XT DL]
 15 2.079266000   10.1.1.173 -> 10.1.1.139   NFS 194 V4 Reply (Call In 14) ACCESS, [Allowed: RD LU MD XT DL]
 16 2.079296000   10.1.1.139 -> 10.1.1.173   TCP 66 gdoi > nfs [ACK] Seq=173 Ack=129 Win=3507 Len=0 TSval=180576023 TSecr=502193004
 17 2.079338000   10.1.1.139 -> 10.1.1.173   NFS 238 V4 Call ACCESS FH:0x4e5465ab, [Check: RD LU MD XT DL]
 18 2.079797000   10.1.1.173 -> 10.1.1.139   NFS 194 V4 Reply (Call In 17) ACCESS, [Allowed: RD LU MD XT DL]
 19 2.079834000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0xb12cdc45
 20 2.080331000   10.1.1.173 -> 10.1.1.139   NFS 262 V4 Reply (Call In 19) GETATTR
 21 2.080410000   10.1.1.139 -> 10.1.1.173   NFS 250 V4 Call LOOKUP DH:0x4e5465ab/tagsdb
 22 2.080903000   10.1.1.173 -> 10.1.1.139   NFS 310 V4 Reply (Call In 21) LOOKUP
 23 2.080982000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
 24 2.081477000   10.1.1.173 -> 10.1.1.139   NFS 162 V4 Reply (Call In 23) GETATTR
 25 2.081509000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0xb12cdc45
 26 2.082010000   10.1.1.173 -> 10.1.1.139   NFS 178 V4 Reply (Call In 25) GETATTR
 27 2.082040000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
 28 2.082542000   10.1.1.173 -> 10.1.1.139   NFS 142 V4 Reply (Call In 27) GETATTR
 29 2.089525000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
 30 2.089996000   10.1.1.173 -> 10.1.1.139   NFS 162 V4 Reply (Call In 29) GETATTR
 31 2.090028000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0xb12cdc45
 32 2.090529000   10.1.1.173 -> 10.1.1.139   NFS 262 V4 Reply (Call In 31) GETATTR
 33 2.090577000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0x4e5465ab
 34 2.091061000   10.1.1.173 -> 10.1.1.139   NFS 262 V4 Reply (Call In 33) GETATTR
 35 2.091110000   10.1.1.139 -> 10.1.1.173   NFS 250 V4 Call LOOKUP DH:0x4e5465ab/tagsdb
 36 2.091593000   10.1.1.173 -> 10.1.1.139   NFS 310 V4 Reply (Call In 35) LOOKUP
 37 2.091657000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
 38 2.092126000   10.1.1.173 -> 10.1.1.139   NFS 162 V4 Reply (Call In 37) GETATTR
 39 2.092157000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0xb12cdc45
 40 2.092658000   10.1.1.173 -> 10.1.1.139   NFS 178 V4 Reply (Call In 39) GETATTR
 41 2.092684000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
 42 2.093150000   10.1.1.173 -> 10.1.1.139   NFS 142 V4 Reply (Call In 41) GETATTR
 43 2.100520000   10.1.1.139 -> 10.1.1.173   NFS 226 V4 Call GETATTR FH:0xb12cdc45
 44 2.101014000   10.1.1.173 -> 10.1.1.139   NFS 162 V4 Reply (Call In 43) GETATTR
 45 2.101040000   10.1.1.139 -> 10.1.1.173   NFS 230 V4 Call GETATTR FH:0xb12cdc45
 46 2.101547000   10.1.1.173 -> 10.1.1.139   NFS 262 V4 Reply (Call In 45) GETATTR
 47 2.141500000   10.1.1.139 -> 10.1.1.173   TCP 66 gdoi > nfs [ACK] Seq=2657 Ack=2289 Win=3507 Len=0 TSval=180576086 TSecr=502193026
# _

The first thing that strikes me is the bad trace is much longer. This
strikes me as reasonable because as well as the ESTALE problem I've
noticed that the whole system seems sluggish. claws-mail is
particularly so because it keeps saving my typing into a drafts
mailbox, and because claws doesn't really understand traditional
mboxes, it spends an inordinate amount of time locking and unlocking
the boxes for each message in them. Claws also spews tracebacks
frequently and it crashes from time to time, something it never did
before the ESTALE problem occurred.

Yours,

Larry
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux