rbd map command hangs for 15 minutes during system start up

Mandell Degerness <mandell@xxxxxxxxxxxxxxx> · Thu, 8 Nov 2012 14:10:41 -0800

We are seeing a somewhat random, but frequent hang on our systems
during startup.  The hang happens at the point where an "rbd map
<rbdvol>" command is run.

I've attached the ceph logs from the cluster.  The map command happens
at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
be seen in the log as 172.18.0.15:0/1143980479.

It appears as if the TCP socket is opened to the OSD, but then times
out 15 minutes later, the process gets data when the socket is closed
on the client server and it retries.

Please help.

We are using ceph version 0.48.2argonaut
(commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).

We are using a 3.5.7 kernel with the following list of patches applied:

1-libceph-encapsulate-out-message-data-setup.patch
2-libceph-dont-mark-footer-complete-before-it-is.patch
3-libceph-move-init-of-bio_iter.patch
4-libceph-dont-use-bio_iter-as-a-flag.patch
5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
8-libceph-protect-ceph_con_open-with-mutex.patch
9-libceph-reset-connection-retry-on-successfully-negotiation.patch
10-rbd-only-reset-capacity-when-pointing-to-head.patch
11-rbd-set-image-size-when-header-is-updated.patch
12-libceph-fix-crypto-key-null-deref-memory-leak.patch
13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
17-libceph-check-for-invalid-mapping.patch
18-ceph-propagate-layout-error-on-osd-request-creation.patch
19-rbd-BUG-on-invalid-layout.patch
20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
21-ceph-avoid-32-bit-page-index-overflow.patch
23-ceph-fix-dentry-reference-leak-in-encode_fh.patch

Any suggestions?

One thought is that the following patch (which we could not apply) is
what is required:

22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch

Regards,
Mandell Degerness
Attachment:
hanglog_ceph.log.gz

Description: GNU Zip compressed data