On Wed, 21 Nov 2012, Nick Bartos wrote:
> FYI the build which included all 3.5 backports except patch #50 is
> still going strong after 21 builds.

Okay, that one at least makes some sense.  I've opened

	http://tracker.newdream.net/issues/3519

How easy is this to reproduce?  If it is something you can trigger with
debugging enabled
('echo module libceph +p > /sys/kernel/debug/dynamic_debug/control')
that would help tremendously.

I'm guessing that during this startup time the OSDs are still in the
process of starting?

Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
thrashing OSDs could hit this.

Thanks!
sage

>
> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@xxxxxxxxxxxxxxx> wrote:
> > With 8 successful installs already done, I'm reasonably confident that
> > it's patch #50.  I'm making another build which applies all patches
> > from the 3.5 backport branch, excluding that specific one.  I'll let
> > you know if that turns up any unexpected failures.
> >
> > What will the potential fall out be for removing that specific patch?
> >
> >
> > On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@xxxxxxxxxxxxxxx> wrote:
> >> It's really looking like it's the
> >> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
> >> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
> >> So far I have gone through 4 successful installs with no hang with
> >> only 1-49 applied.  I'm still leaving my test run to make sure it's
> >> not a fluke, but since previously it hangs within the first couple of
> >> builds, it really looks like this is where the problem originated.
> >>
> >> 1-libceph_eliminate_connection_state_DEAD.patch
> >> 2-libceph_kill_bad_proto_ceph_connection_op.patch
> >> 3-libceph_rename_socket_callbacks.patch
> >> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
> >> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
> >> 6-libceph_start_separating_connection_flags_from_state.patch
> >> 7-libceph_start_tracking_connection_socket_state.patch
> >> 8-libceph_provide_osd_number_when_creating_osd.patch
> >> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
> >> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
> >> 11-libceph_drop_connection_refcounting_for_mon_client.patch
> >> 12-libceph_init_monitor_connection_when_opening.patch
> >> 13-libceph_fully_initialize_connection_in_con_init.patch
> >> 14-libceph_tweak_ceph_alloc_msg.patch
> >> 15-libceph_have_messages_point_to_their_connection.patch
> >> 16-libceph_have_messages_take_a_connection_reference.patch
> >> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
> >> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
> >> 19-libceph_fix_overflow_in___decode_pool_names.patch
> >> 20-libceph_fix_overflow_in_osdmap_decode.patch
> >> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
> >> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
> >> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
> >> 24-libceph_use_con_get_put_methods.patch
> >> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
> >> 26-libceph_encapsulate_out_message_data_setup.patch
> >> 27-libceph_encapsulate_advancing_msg_page.patch
> >> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
> >> 29-libceph_move_init_bio__functions_up.patch
> >> 30-libceph_move_init_of_bio_iter.patch
> >> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
> >> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
> >> 33-libceph_don_t_change_socket_state_on_sock_event.patch
> >> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
> >> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
> >> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
> >> 37-libceph_clear_NEGOTIATING_when_done.patch
> >> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
> >> 39-libceph_separate_banner_and_connect_writes.patch
> >> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
> >> 41-libceph_small_changes_to_messenger.c.patch
> >> 42-libceph_add_some_fine_ASCII_art.patch
> >> 43-libceph_set_peer_name_on_con_open_not_init.patch
> >> 44-libceph_initialize_mon_client_con_only_once.patch
> >> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
> >> 46-libceph_initialize_msgpool_message_types.patch
> >> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
> >> 48-libceph_report_socket_read_write_error_message.patch
> >> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
> >> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
> >>
> >>
> >> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> >>> Thanks for hunting this down.  I'm very curious what the culprit is...
> >>>
> >>> sage
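
For anyone repeating this kind of hunt: the narrowing-down described in the
quoted messages (apply patches 1..N to 3.5.7, rebuild, see whether the hang
shows up) amounts to a search over the ordered patch series. A minimal sketch
of that search follows. It is not the procedure actually used in this thread.
The names first_bad_patch, hangs_with and hangs_in_any are hypothetical, and
it assumes that once the hang appears for patches 1..N it also appears for
every longer prefix. Because the hang is intermittent, the callback should
run several installs before calling a prefix good.

```python
from typing import Callable, Optional, Sequence


def hangs_in_any(run_once: Callable[[], bool], attempts: int) -> bool:
    """Return True if any of `attempts` trial installs hangs.

    `run_once` is a hypothetical callback that performs one build/install
    and returns True when the hang was observed.
    """
    return any(run_once() for _ in range(attempts))


def first_bad_patch(patches: Sequence[str],
                    hangs_with: Callable[[Sequence[str]], bool]) -> Optional[str]:
    """Return the first patch whose inclusion makes the hang appear.

    `hangs_with(prefix)` is a hypothetical callback that applies the given
    prefix of the series, builds, and reports whether the hang shows up.
    Assumes the base kernel with no patches applied does not hang.
    """
    if not patches or not hangs_with(patches):
        return None  # the full series is fine, nothing to find

    # Invariant: the prefix of length `good` does not hang,
    # the prefix of length `bad` does.
    good, bad = 0, len(patches)
    while bad - good > 1:
        mid = (good + bad) // 2
        if hangs_with(patches[:mid]):
            bad = mid
        else:
            good = mid
    return patches[bad - 1]
```

With a 50-patch series this needs about six builds of the predicate instead
of one per patch, at the cost of trusting the monotonicity assumption above.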