> > > Hi,
> > >
> > > I've had a few occasions where tapdisk has segfaulted:
> > >
> > > tapdisk[9180]: segfault at 7f7e3a5c8c10 ip 00007f7e387532d4 sp 00007f7e3a5c8c10 error 4 in libpthread-2.13.so[7f7e38748000+17000]
> > > tapdisk:9180 blocked for more than 120 seconds.
> > > tapdisk D ffff88043fc13540 0 9180 1 0x00000000
> > >
> > > and then like:
> > >
> > > end_request: I/O error, dev tdc, sector 472008
> > >
> > > I can't be sure but I suspect that when this happened either one OSD was
> > > offline, or the cluster lost quorum briefly.
> >
> > Interesting. There might be an issue if a request ends in error, I'll
> > have to check that.
> > I'll have a look on Monday.
> >
>
> You say in tdrbd_finish_aiocb:
>
>     while (1) {
>         /* POSIX says write will be atomic or blocking */
>         rv = write(prv->pipe_fds[1], (void*)&req, sizeof(req));
>
> but from what I've read in "man 7 pipe", the statement about being atomic
> only applies if the pipe is open in non-blocking mode, and you open it with a
> call to pipe() (same as pipe2(,0)) and you never call fcntl to change it. This
> would be consistent with the random crashes I'm seeing - I thought they
> were related to transient errors but my ceph cluster has been perfectly
> stable for a few days now and it's still happening.
>
> What do you think?
>

Actually maybe not. What I was reading only applies to large numbers of bytes
written to the pipe, and even then I got confused by the double negatives.

Sorry for the noise.

James
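
For reference, the pipe(7) rule discussed above can be sketched as follows. Per pipe(7), a write of at most PIPE_BUF bytes is atomic whether or not O_NONBLOCK is set: in blocking mode it may block until there is room, and in non-blocking mode it fails with EAGAIN instead. Only writes larger than PIPE_BUF may be interleaved or partial. This is a minimal illustration and not tapdisk code: `struct demo_request`, `is_atomic_pipe_write`, `send_request_ptr` and `recv_request_ptr` are hypothetical names invented for the example, and the real request type is not shown in this thread.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the driver's request type (assumption). */
struct demo_request {
    int id;
};

/* Return 1 if a write of n bytes to this pipe is guaranteed atomic,
 * i.e. n does not exceed the pipe's PIPE_BUF; return 0 otherwise or
 * when the limit cannot be determined. */
int is_atomic_pipe_write(int fd, size_t n)
{
    long pipe_buf = fpathconf(fd, _PC_PIPE_BUF);
    if (pipe_buf < 0)
        return 0;
    return n <= (size_t)pipe_buf;
}

/* Send the request pointer itself (not the struct) through the pipe,
 * mirroring the write() quoted above. Retries only on EINTR.
 * Returns 0 on success, -1 on any other failure. */
int send_request_ptr(int wfd, struct demo_request *req)
{
    assert(req != NULL);
    for (;;) {
        ssize_t rv = write(wfd, (void *)&req, sizeof(req));
        if (rv == (ssize_t)sizeof(req))
            return 0;
        if (rv < 0 && errno == EINTR)
            continue;
        return -1;
    }
}

/* Read one request pointer back from the pipe.
 * Returns NULL on EOF, error, or a short read. */
struct demo_request *recv_request_ptr(int rfd)
{
    struct demo_request *req = NULL;
    for (;;) {
        ssize_t rv = read(rfd, (void *)&req, sizeof(req));
        if (rv == (ssize_t)sizeof(req))
            return req;
        if (rv < 0 && errno == EINTR)
            continue;
        return NULL;
    }
}
```

A caller would create the pipe with pipe(), check `is_atomic_pipe_write(fds[1], sizeof(struct demo_request *))`, and then pass pointers with `send_request_ptr` and `recv_request_ptr`. Since a pointer is only a few bytes and POSIX requires PIPE_BUF to be at least 512, a write of this size should not be the source of a torn message.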