Re: Cephfs and ERESTARTSYS on writes

Ilya Dryomov <idryomov@xxxxxxxxx> · Fri, 24 Jul 2015 16:53:12 +0300

On Fri, Jul 24, 2015 at 4:29 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Fri, Jul 24, 2015 at 3:54 PM, Vedran Furač <vedran.furac@xxxxxxxxx> wrote:
>> On 07/24/2015 09:54 AM, Ilya Dryomov wrote:
>>>
>>> I don't know - looks like nginx isn't setting SA_RESTART, so it should
>>> be repeating the write()/writev() itself.  That said, if it happens
>>> only on cephfs, we need to track it down.
>>
>> Correct, this is first time I see such an error. I've never seen
>> something like this when writing on working local disks (unless it run
>> out space or inodes). It's indicative that it almost always happens as
>> soon as first 4MBs are written and that's the Ceph object size.
>>
>>> Try enabling nginx debug logging?  I seem to have a vague recollection
>>> that it records return values.  That should give us something to start
>>> with.
>>
>> Unfortunately it doesn't look very useful:
>>
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 pipe length: 1758913
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 input buf #40
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 free: 000000000343E750
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 pipe write chain
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 write: 48,
>> 000000000304FD60, 1758913, 54583296
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 pipe write downstream: 0
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 pipe write downstream
>> flush out
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http output filter "/139000
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http copy filter: "/139000
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 image filter
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 xslt filter body
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http postpone filter
>> "/139000 0000000001A60450
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 write old buf t:0 f:1
>> 0000000000000000, pos 0000000000000000, size: 0 file: 24338425, size: 856071
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 write old buf t:0 f:1
>> 0000000000000000, pos 0000000000000000, size: 0 file: 25194496, size:
>> 2101248
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 write new buf t:0 f:1
>> 0000000000000000, pos 0000000000000000, size: 0 file: 27295744, size:
>> 29046465
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http write filter: l:0
>> f:0 s:32003784
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http write filter limit 0
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http write filter
>> 0000000001A61948
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http copy filter: -2 "/139
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 pipe write downstream done
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 event timer del: 47:
>> 1437742479805
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 event timer: 46, old:
>> 1437742179605, new: 1437742179866
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http file cache update
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http file cache rename:
>> "/home/ssd/proxy_temp/1/0/0000014501" to
>> "/home/ceph/1d/8b/5/ed04823384c648e574f2f078ec258b1d"
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 malloc: 0000000001F97CA0:78
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 malloc:
>> 0000000002AB08F0:65536
>> 2015/07/24 12:44:39 [alert] 26722#26722: *29428 write()
>> "/home/ceph/1d/8b/5/ed04823384c648e574f2f078ec258b1d.0000014503" failed
>> (4: Interrupted system call) while reading upstream
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http upstream exit:
>> 0000000000000000
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 finalize http upstream
>> request: 0
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 finalize http proxy request
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 free rr peer 1 0
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 close http upstream
>> connection: 47
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 free: 0000000001F816A0,
>> unused: 48
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 reusable connection: 0
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http upstream temp fd: 48
>
> A quick grep reveals this in src/core/ngx_file.c:
>
>  799     len = cf->buf_size ? cf->buf_size : 65536;
>
>  840         n = ngx_write_fd(nfd, buf, len);
>  841
>  842         if (n == -1) {
>  843             ngx_log_error(NGX_LOG_ALERT, cf->log, ngx_errno,
>  844                           ngx_write_fd_n " \"%s\" failed", to);
>  845             goto failed;
>  846         }
>
>  868 failed:
>  869
>  870     if (nfd != NGX_INVALID_FILE) {
>  871         if (ngx_close_file(nfd) == NGX_FILE_ERROR) {
>  872             ngx_log_error(NGX_LOG_ALERT, cf->log, ngx_errno,
>  873                           ngx_close_file_n " \"%s\" failed", to);
>  874         }
>  875     }
>  876
>  877     if (fd != NGX_INVALID_FILE) {
>  878         if (ngx_close_file(fd) == NGX_FILE_ERROR) {
>  879             ngx_log_error(NGX_LOG_ALERT, cf->log, ngx_errno,
>  880                           ngx_close_file_n " \"%s\" failed", from);
>  881         }
>  882     }
>
> which matches strace:
>
> 25382 18:11:24.674786 read(60, "*\213\331"..., 65536) = 65536 <0.000020>
> 25382 18:11:24.674835 write(61, "*\213\331"..., 65536) = 65536 <0.000233>
> 25382 18:11:24.675093 read(60, "\315\317\""..., 65536) = 65536 <0.000018>
> 25382 18:11:24.675129 write(61, "\315\317\""..., 65536) = ?
> ERESTARTSYS (To be restarted) <0.012122>
> 25382 18:11:24.687290 --- SIGALRM (Alarm clock) @ 0 (0) ---
> 25382 18:11:24.687442 rt_sigreturn(0xe) = -1 EINTR (Interrupted system
> call) <0.000006>
> 25382 18:11:24.687473 gettid()          = 25382 <0.000006>
> 25382 18:11:24.687502 write(21, "201"..., 641) = 641 <0.000014>
> 25382 18:11:24.687538 close(61)         = 0 <0.000009>
> 25382 18:11:24.687561 close(60)         = 0 <0.000006>
> 25382 18:11:24.687583 unlink("/home/ssd/proxy_temp/6/9/0000191396") =
> 0 <0.000020>
> 25382 18:11:24.687625 close(55)         = 0 <0.000042>
>
> ngx_write_fd() is just a write(), which, when interrupted by SIGALRM,
> fails with EINTR because SA_RESTART is not set.  We can try digging
> further, but I think nginx should retry in this case.

Funny thing, it looks like signal(7) has been recently updated to
stress that a non-local "disk" can't be considered a "fast" device.

It used to say:

If  a blocked call to one of the following interfaces is interrupted by
a signal handler, then the call will be automatically  restarted  after
the  signal  handler returns if the SA_RESTART flag was used; otherwise
the call will fail with the error EINTR:

    * read(2), readv(2), write(2), writev(2),  and  ioctl(2)  calls  on
      "slow"  devices.   A  "slow" device is one where the I/O call may
      block for an indefinite time, for example, a terminal,  pipe,  or
      socket.   (A  disk is not a slow device according to this defini‐
      tion.)  If an I/O call on a slow device has  already  transferred
      some data by the time it is interrupted by a signal handler, then
      the call will return a success status (normally,  the  number  of
      bytes transferred).

and now http://man7.org/linux/man-pages/man7/signal.7.html says:

      Note that a (local) disk is not a slow device according to this
      definition; I/O operations on disk devices are not interrupted by
      signals.

The discussion is at https://bugzilla.kernel.org/show_bug.cgi?id=97721.

cephfs is not really a disk, and further, not a local one, so I don't
think cephfs is to blame here...

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com