On Fri, Jul 24, 2015 at 4:29 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Fri, Jul 24, 2015 at 3:54 PM, Vedran Furač <vedran.furac@xxxxxxxxx> wrote:
>> On 07/24/2015 09:54 AM, Ilya Dryomov wrote:
>>>
>>> I don't know - looks like nginx isn't setting SA_RESTART, so it should
>>> be repeating the write()/writev() itself. That said, if it happens
>>> only on cephfs, we need to track it down.
>>
>> Correct, this is the first time I've seen such an error. I've never seen
>> anything like this when writing to working local disks (unless they ran
>> out of space or inodes). It's telling that it almost always happens as
>> soon as the first 4 MB are written, and that's the Ceph object size.
>>
>>> Try enabling nginx debug logging? I seem to have a vague recollection
>>> that it records return values. That should give us something to start
>>> with.
>>
>> Unfortunately it doesn't look very useful:
>>
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 pipe length: 1758913
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 input buf #40
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 free: 000000000343E750
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 pipe write chain
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 write: 48, 000000000304FD60, 1758913, 54583296
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 pipe write downstream: 0
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 pipe write downstream flush out
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http output filter "/139000
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http copy filter: "/139000
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 image filter
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 xslt filter body
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http postpone filter "/139000 0000000001A60450
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 write old buf t:0 f:1 0000000000000000, pos 0000000000000000, size: 0 file: 24338425, size: 856071
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 write old buf t:0 f:1 0000000000000000, pos 0000000000000000, size: 0 file: 25194496, size: 2101248
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 write new buf t:0 f:1 0000000000000000, pos 0000000000000000, size: 0 file: 27295744, size: 29046465
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http write filter: l:0 f:0 s:32003784
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http write filter limit 0
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http write filter 0000000001A61948
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http copy filter: -2 "/139
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 pipe write downstream done
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 event timer del: 47: 1437742479805
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 event timer: 46, old: 1437742179605, new: 1437742179866
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http file cache update
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http file cache rename: "/home/ssd/proxy_temp/1/0/0000014501" to "/home/ceph/1d/8b/5/ed04823384c648e574f2f078ec258b1d"
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 malloc: 0000000001F97CA0:78
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 malloc: 0000000002AB08F0:65536
>> 2015/07/24 12:44:39 [alert] 26722#26722: *29428 write() "/home/ceph/1d/8b/5/ed04823384c648e574f2f078ec258b1d.0000014503" failed (4: Interrupted system call) while reading upstream
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http upstream exit: 0000000000000000
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 finalize http upstream request: 0
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 finalize http proxy request
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 free rr peer 1 0
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 close http upstream connection: 47
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 free: 0000000001F816A0, unused: 48
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 reusable connection: 0
>> 2015/07/24 12:44:39 [debug] 26722#26722: *29428 http upstream temp fd: 48
>
> A quick grep reveals this in src/core/ngx_file.c:
>
> 799         len = cf->buf_size ? cf->buf_size : 65536;
>
> 840         n = ngx_write_fd(nfd, buf, len);
> 841
> 842         if (n == -1) {
> 843             ngx_log_error(NGX_LOG_ALERT, cf->log, ngx_errno,
> 844                           ngx_write_fd_n " \"%s\" failed", to);
> 845             goto failed;
> 846         }
>
> 868 failed:
> 869
> 870     if (nfd != NGX_INVALID_FILE) {
> 871         if (ngx_close_file(nfd) == NGX_FILE_ERROR) {
> 872             ngx_log_error(NGX_LOG_ALERT, cf->log, ngx_errno,
> 873                           ngx_close_file_n " \"%s\" failed", to);
> 874         }
> 875     }
> 876
> 877     if (fd != NGX_INVALID_FILE) {
> 878         if (ngx_close_file(fd) == NGX_FILE_ERROR) {
> 879             ngx_log_error(NGX_LOG_ALERT, cf->log, ngx_errno,
> 880                           ngx_close_file_n " \"%s\" failed", from);
> 881         }
> 882     }
>
> which matches strace:
>
> 25382 18:11:24.674786 read(60, "*\213\331"..., 65536) = 65536 <0.000020>
> 25382 18:11:24.674835 write(61, "*\213\331"..., 65536) = 65536 <0.000233>
> 25382 18:11:24.675093 read(60, "\315\317\""..., 65536) = 65536 <0.000018>
> 25382 18:11:24.675129 write(61, "\315\317\""..., 65536) = ? ERESTARTSYS (To be restarted) <0.012122>
> 25382 18:11:24.687290 --- SIGALRM (Alarm clock) @ 0 (0) ---
> 25382 18:11:24.687442 rt_sigreturn(0xe) = -1 EINTR (Interrupted system call) <0.000006>
> 25382 18:11:24.687473 gettid() = 25382 <0.000006>
> 25382 18:11:24.687502 write(21, "201"..., 641) = 641 <0.000014>
> 25382 18:11:24.687538 close(61) = 0 <0.000009>
> 25382 18:11:24.687561 close(60) = 0 <0.000006>
> 25382 18:11:24.687583 unlink("/home/ssd/proxy_temp/6/9/0000191396") = 0 <0.000020>
> 25382 18:11:24.687625 close(55) = 0 <0.000042>
>
> ngx_write_fd() is just a write(), which, when interrupted by SIGALRM,
> fails with EINTR because SA_RESTART is not set. We can try digging
> further, but I think nginx should retry in this case.

Funny thing, it looks like signal(7) has recently been updated to stress
that a non-local "disk" can't be considered a "fast" device. It used to say:

    If a blocked call to one of the following interfaces is interrupted by
    a signal handler, then the call will be automatically restarted after
    the signal handler returns if the SA_RESTART flag was used; otherwise
    the call will fail with the error EINTR:

        * read(2), readv(2), write(2), writev(2), and ioctl(2) calls on
          "slow" devices.  A "slow" device is one where the I/O call may
          block for an indefinite time, for example, a terminal, pipe, or
          socket.  (A disk is not a slow device according to this
          definition.)  If an I/O call on a slow device has already
          transferred some data by the time it is interrupted by a signal
          handler, then the call will return a success status (normally,
          the number of bytes transferred).

and now http://man7.org/linux/man-pages/man7/signal.7.html says:

    Note that a (local) disk is not a slow device according to this
    definition; I/O operations on disk devices are not interrupted by
    signals.
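To illustrate what I mean by "retry", here is a minimal sketch of an
EINTR-aware write loop in plain C. This is not nginx code; the function
name write_full() is made up for the example, and it just shows the
general pattern:

    /* Minimal sketch, not nginx code: keep calling write(2) until the
     * whole buffer is out, restarting by hand when the call is
     * interrupted by a signal.  This is the retry the kernel won't do
     * for us when the handler was installed without SA_RESTART. */

    #include <errno.h>
    #include <unistd.h>

    static ssize_t
    write_full(int fd, const char *buf, size_t len)
    {
        size_t  done = 0;

        while (done < len) {
            ssize_t  n = write(fd, buf + done, len - done);

            if (n == -1) {
                if (errno == EINTR) {
                    continue;       /* interrupted, nothing written: retry */
                }
                return -1;          /* real error */
            }

            done += (size_t) n;     /* short write: write the rest */
        }

        return (ssize_t) done;
    }

With that kind of loop around the copy, a SIGALRM landing mid-write would
be harmless no matter how slow the underlying filesystem is.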
The discussion is at https://bugzilla.kernel.org/show_bug.cgi?id=97721.

cephfs is not really a disk, and further, not a local one, so I don't
think cephfs is to blame here...

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com