On Tue, 20 Nov 2012, Nick Bartos wrote: > Since I now have a decent script which can reproduce this, I decided > to re-test with the same 3.5.7 kernel, but just not applying the > patches from the wip-3.5 branch. With the patches, I can only go 2 > builds before I run into a hang. Without the patches, I have gone 9 > consecutive builds (and still going) without seeing the hang. So it > seems like a reasonable assumption that the problem was introduced in > one of those patches. > > We started seeing the problem before applying all the 3.5 patches, so > it seems like one of these is the culprit: > > 1-libceph-encapsulate-out-message-data-setup.patch > 2-libceph-dont-mark-footer-complete-before-it-is.patch > 3-libceph-move-init-of-bio_iter.patch > 4-libceph-dont-use-bio_iter-as-a-flag.patch > 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch > 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch > 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch > 8-libceph-protect-ceph_con_open-with-mutex.patch > 9-libceph-reset-connection-retry-on-successfully-negotiation.patch > 10-rbd-only-reset-capacity-when-pointing-to-head.patch > 11-rbd-set-image-size-when-header-is-updated.patch > 12-libceph-fix-crypto-key-null-deref-memory-leak.patch > 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch > 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch > 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch > 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch > 17-libceph-check-for-invalid-mapping.patch > 18-ceph-propagate-layout-error-on-osd-request-creation.patch > 19-rbd-BUG-on-invalid-layout.patch > 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch > 21-ceph-avoid-32-bit-page-index-overflow.patch > 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch > > I'll start doing some other builds to try and narrow down the patch > introducing the problem more specifically. Thanks for hunting this down. I'm very curious what the culprit is... sage > > > On Tue, Nov 20, 2012 at 1:53 PM, Nick Bartos <nick@xxxxxxxxxxxxxxx> wrote: > > I reproduced the problem and got several sysrq states captured. > > During this run, the monitor running on the host complained a few > > times about the clocks being off, but all messages were for under 0.55 > > seconds. > > > > Here are the kernel logs. Note that there are several traces, I > > thought multiple during the incident may help: > > https://raw.github.com/gist/4121395/a6dda7552ed8a45725ee5d632fe3ba38703f8cfc/gistfile1.txt > > > > > > On Mon, Nov 19, 2012 at 3:34 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote: > >> Hmm, yep ? that param is actually only used for the warning; I guess > >> we forgot what it actually covers. :( > >> > >> Have your monitor clocks been off by more than 5 seconds at any point? > >> > >> On Mon, Nov 19, 2012 at 3:04 PM, Nick Bartos <nick@xxxxxxxxxxxxxxx> wrote: > >>> Making 'mon clock drift allowed' very small (0.00001) does not > >>> reliably reproduce the hang. I started looking at the code for 0.48.2 > >>> and it looks like this is only used in Paxos::warn_on_future_time, > >>> which only handles the warning, nothing else. > >>> > >>> > >>> On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil <sage@xxxxxxxxxxx> wrote: > >>>> On Fri, 16 Nov 2012, Nick Bartos wrote: > >>>>> Should I be lowering the clock drift allowed, or the lease interval to > >>>>> help reproduce it? > >>>> > >>>> clock drift allowed. > >>>> > >>>> > >>>> > >>>>> > >>>>> On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil <sage@xxxxxxxxxxx> wrote: > >>>>> > You can safely set the clock drift allowed as high as 500ms. The real > >>>>> > limitation is that it needs to be well under the lease interval, which is > >>>>> > currently 5 seconds by default. > >>>>> > > >>>>> > You might be able to reproduce more easily by lowering the threshold... > >>>>> > > >>>>> > sage > >>>>> > > >>>>> > > >>>>> > On Fri, 16 Nov 2012, Nick Bartos wrote: > >>>>> > > >>>>> >> How far off do the clocks need to be before there is a problem? It > >>>>> >> would seem to be hard to ensure a very large cluster has all of it's > >>>>> >> nodes synchronized within 50ms (which seems to be the default for "mon > >>>>> >> clock drift allowed"). Does the mon clock drift allowed parameter > >>>>> >> change anything other than the log messages? Are there any other > >>>>> >> tuning options that may help, assuming that this is the issue and it's > >>>>> >> not feasible to get the clocks more than 500ms in sync between all > >>>>> >> nodes? > >>>>> >> > >>>>> >> I'm trying to get a good way of reproducing this and get a trace on > >>>>> >> the ceph processes to see what they're waiting on. I'll let you know > >>>>> >> when I have more info. > >>>>> >> > >>>>> >> > >>>>> >> On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil <sage@xxxxxxxxxxx> wrote: > >>>>> >> > I just realized I was mixing up this thread with the other deadlock > >>>>> >> > thread. > >>>>> >> > > >>>>> >> > On Fri, 16 Nov 2012, Nick Bartos wrote: > >>>>> >> >> Turns out we're having the 'rbd map' hang on startup again, after we > >>>>> >> >> started using the wip-3.5 patch set. How critical is the > >>>>> >> >> libceph_protect_ceph_con_open_with_mutex commit? That's the one I > >>>>> >> >> removed before which seemed to get rid of the problem (although I'm > >>>>> >> >> not completely sure if it completely got rid of it, at least seemed to > >>>>> >> >> happen much less often). > >>>>> >> >> > >>>>> >> >> It seems like we only started having this issue after we started > >>>>> >> >> patching the 3.5 ceph client (we started patching to try and get rid > >>>>> >> >> of a kernel oops, which the patches seem to have fixed). > >>>>> >> > > >>>>> >> > Right. That patch fixes a real bug. It also seems pretty unlikely that > >>>>> >> > this patch is related to the startup hang. The original log showed clock > >>>>> >> > drift on the monitor that could very easily cause this sort of hang. Can > >>>>> >> > you confirm that that isn't the case with this recent instance of the > >>>>> >> > problem? And/or attach a log? > >>>>> >> > > >>>>> >> > Thanks- > >>>>> >> > sage > >>>>> >> > > >>>>> >> > > >>>>> >> >> > >>>>> >> >> > >>>>> >> >> On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@xxxxxxxxxxx> wrote: > >>>>> >> >> > On Thu, 15 Nov 2012, Nick Bartos wrote: > >>>>> >> >> >> Sorry I guess this e-mail got missed. I believe those patches came > >>>>> >> >> >> from the ceph/linux-3.5.5-ceph branch. I'm now using the wip-3.5 > >>>>> >> >> >> branch patches, which seem to all be fine. We'll stick with 3.5 and > >>>>> >> >> >> this backport for now until we can figure out what's wrong with 3.6. > >>>>> >> >> >> > >>>>> >> >> >> I typically ignore the wip branches just due to the naming when I'm > >>>>> >> >> >> looking for updates. Where should I typically look for updates that > >>>>> >> >> >> aren't in released kernels? Also, is there anything else in the wip* > >>>>> >> >> >> branches that you think we may find particularly useful? > >>>>> >> >> > > >>>>> >> >> > You were looking in the right place. The problem was we weren't super > >>>>> >> >> > organized with our stable patches, and changed our minds about what to > >>>>> >> >> > send upstream. These are 'wip' in the sense that they were in preparation > >>>>> >> >> > for going upstream. The goal is to push them to the mainline stable > >>>>> >> >> > kernels and ideally not keep them in our tree at all. > >>>>> >> >> > > >>>>> >> >> > wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but > >>>>> >> >> > we're keeping it so that ubuntu can pick it up for quantal. > >>>>> >> >> > > >>>>> >> >> > I'll make sure these are more clearly marked as stable. > >>>>> >> >> > > >>>>> >> >> > sage > >>>>> >> >> > > >>>>> >> >> > > >>>>> >> >> >> > >>>>> >> >> >> > >>>>> >> >> >> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@xxxxxxxxxxx> wrote: > >>>>> >> >> >> > On Mon, 12 Nov 2012, Nick Bartos wrote: > >>>>> >> >> >> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it > >>>>> >> >> >> >> seems we no longer have this hang. > >>>>> >> >> >> > > >>>>> >> >> >> > Hmm, that's a bit disconcerting. Did this series come from our old 3.5 > >>>>> >> >> >> > stable series? I recently prepared a new one that backports *all* of the > >>>>> >> >> >> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git. I would > >>>>> >> >> >> > be curious if you see problems with that. > >>>>> >> >> >> > > >>>>> >> >> >> > So far, with these fixes in place, we have not seen any unexplained kernel > >>>>> >> >> >> > crashes in this code. > >>>>> >> >> >> > > >>>>> >> >> >> > I take it you're going back to a 3.5 kernel because you weren't able to > >>>>> >> >> >> > get rid of the sync problem with 3.6? > >>>>> >> >> >> > > >>>>> >> >> >> > sage > >>>>> >> >> >> > > >>>>> >> >> >> > > >>>>> >> >> >> > > >>>>> >> >> >> >> > >>>>> >> >> >> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote: > >>>>> >> >> >> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote: > >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> We are seeing a somewhat random, but frequent hang on our systems > >>>>> >> >> >> >> >> during startup. The hang happens at the point where an "rbd map > >>>>> >> >> >> >> >> <rbdvol>" command is run. > >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> I've attached the ceph logs from the cluster. The map command happens > >>>>> >> >> >> >> >> at Nov 8 18:41:09 on server 172.18.0.15. The process which hung can > >>>>> >> >> >> >> >> be seen in the log as 172.18.0.15:0/1143980479. > >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> It appears as if the TCP socket is opened to the OSD, but then times > >>>>> >> >> >> >> >> out 15 minutes later, the process gets data when the socket is closed > >>>>> >> >> >> >> >> on the client server and it retries. > >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> Please help. > >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> We are using ceph version 0.48.2argonaut > >>>>> >> >> >> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe). > >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> We are using a 3.5.7 kernel with the following list of patches applied: > >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> 1-libceph-encapsulate-out-message-data-setup.patch > >>>>> >> >> >> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch > >>>>> >> >> >> >> >> 3-libceph-move-init-of-bio_iter.patch > >>>>> >> >> >> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch > >>>>> >> >> >> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch > >>>>> >> >> >> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch > >>>>> >> >> >> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch > >>>>> >> >> >> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch > >>>>> >> >> >> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch > >>>>> >> >> >> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch > >>>>> >> >> >> >> >> 11-rbd-set-image-size-when-header-is-updated.patch > >>>>> >> >> >> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch > >>>>> >> >> >> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch > >>>>> >> >> >> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch > >>>>> >> >> >> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch > >>>>> >> >> >> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch > >>>>> >> >> >> >> >> 17-libceph-check-for-invalid-mapping.patch > >>>>> >> >> >> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch > >>>>> >> >> >> >> >> 19-rbd-BUG-on-invalid-layout.patch > >>>>> >> >> >> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch > >>>>> >> >> >> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch > >>>>> >> >> >> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch > >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> Any suggestions? > >>>>> >> >> >> >> > > >>>>> >> >> >> >> > > >>>>> >> >> >> >> > The log shows your monitors don't have time sychronized enough among > >>>>> >> >> >> >> > them to make much progress (including authenticating new connections). > >>>>> >> >> >> >> > That's probably the real issue. 0.2s is pretty large clock drift. > >>>>> >> >> >> >> > > >>>>> >> >> >> >> > > >>>>> >> >> >> >> >> One thought is that the following patch (which we could not apply) is > >>>>> >> >> >> >> >> what is required: > >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch > >>>>> >> >> >> >> > > >>>>> >> >> >> >> > > >>>>> >> >> >> >> > This is certainly useful too, but I don't think it's the cause of > >>>>> >> >> >> >> > the delay in this case. > >>>>> >> >> >> >> > > >>>>> >> >> >> >> > Josh > >>>>> >> >> >> >> > -- > >>>>> >> >> >> >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >>>>> >> >> >> >> > the body of a message to majordomo@xxxxxxxxxxxxxxx > >>>>> >> >> >> >> > More majordomo info at http://vger.kernel.org/majordomo-info.html > >>>>> >> >> >> >> -- > >>>>> >> >> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >>>>> >> >> >> >> the body of a message to majordomo@xxxxxxxxxxxxxxx > >>>>> >> >> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>>>> >> >> >> >> > >>>>> >> >> >> >> > >>>>> >> >> >> -- > >>>>> >> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >>>>> >> >> >> the body of a message to majordomo@xxxxxxxxxxxxxxx > >>>>> >> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>>>> >> >> >> > >>>>> >> >> >> > >>>>> >> >> -- > >>>>> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >>>>> >> >> the body of a message to majordomo@xxxxxxxxxxxxxxx > >>>>> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>>>> >> >> > >>>>> >> >> > >>>>> >> > >>>>> >> > >>>>> -- > >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx > >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>>>> > >>>>> > >>> -- > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >>> the body of a message to majordomo@xxxxxxxxxxxxxxx > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html