It's very easy to reproduce now with my automated install script. The
most I've seen it succeed with that patch is 2 builds in a row before
hanging on the 3rd, and it hangs on most builds, so it shouldn't take
much to get it to do it again. I'll try to get to that tomorrow, when
I'm a bit more rested and my brain is working better.

Yes, during this the OSDs are probably all syncing up. All the osd and
mon daemons have started by the time the rbd commands are run, though.

On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Wed, 21 Nov 2012, Nick Bartos wrote:
>> FYI the build which included all 3.5 backports except patch #50 is
>> still going strong after 21 builds.
>
> Okay, that one at least makes some sense. I've opened
>
> http://tracker.newdream.net/issues/3519
>
> How easy is this to reproduce? If it is something you can trigger with
> debugging enabled
> ('echo module libceph +p > /sys/kernel/debug/dynamic_debug/control')
> that would help tremendously.
>
> I'm guessing that during this startup time the OSDs are still in the
> process of starting?
>
> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
> thrashing OSDs could hit this.
>
> Thanks!
> sage
>
>
>>
>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@xxxxxxxxxxxxxxx> wrote:
>> > With 8 successful installs already done, I'm reasonably confident that
>> > it's patch #50. I'm making another build which applies all patches
>> > from the 3.5 backport branch, excluding that specific one. I'll let
>> > you know if that turns up any unexpected failures.
>> >
>> > What will the potential fall out be for removing that specific patch?
>> >
>> >
>> > On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@xxxxxxxxxxxxxxx> wrote:
>> >> It's really looking like it's the
>> >> libceph_resubmit_linger_ops_when_pg_mapping_changes commit. When
>> >> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>> >> So far I have gone through 4 successful installs with no hang with
>> >> only 1-49 applied. I'm still leaving my test run to make sure it's
>> >> not a fluke, but since previously it hangs within the first couple of
>> >> builds, it really looks like this is where the problem originated.
>> >>
>> >> 1-libceph_eliminate_connection_state_DEAD.patch
>> >> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>> >> 3-libceph_rename_socket_callbacks.patch
>> >> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>> >> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>> >> 6-libceph_start_separating_connection_flags_from_state.patch
>> >> 7-libceph_start_tracking_connection_socket_state.patch
>> >> 8-libceph_provide_osd_number_when_creating_osd.patch
>> >> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>> >> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>> >> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>> >> 12-libceph_init_monitor_connection_when_opening.patch
>> >> 13-libceph_fully_initialize_connection_in_con_init.patch
>> >> 14-libceph_tweak_ceph_alloc_msg.patch
>> >> 15-libceph_have_messages_point_to_their_connection.patch
>> >> 16-libceph_have_messages_take_a_connection_reference.patch
>> >> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>> >> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>> >> 19-libceph_fix_overflow_in___decode_pool_names.patch
>> >> 20-libceph_fix_overflow_in_osdmap_decode.patch
>> >> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>> >> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>> >> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>> >> 24-libceph_use_con_get_put_methods.patch
>> >> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>> >> 26-libceph_encapsulate_out_message_data_setup.patch
>> >> 27-libceph_encapsulate_advancing_msg_page.patch
>> >> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>> >> 29-libceph_move_init_bio__functions_up.patch
>> >> 30-libceph_move_init_of_bio_iter.patch
>> >> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>> >> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>> >> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>> >> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>> >> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>> >> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>> >> 37-libceph_clear_NEGOTIATING_when_done.patch
>> >> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>> >> 39-libceph_separate_banner_and_connect_writes.patch
>> >> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>> >> 41-libceph_small_changes_to_messenger.c.patch
>> >> 42-libceph_add_some_fine_ASCII_art.patch
>> >> 43-libceph_set_peer_name_on_con_open_not_init.patch
>> >> 44-libceph_initialize_mon_client_con_only_once.patch
>> >> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>> >> 46-libceph_initialize_msgpool_message_types.patch
>> >> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>> >> 48-libceph_report_socket_read_write_error_message.patch
>> >> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>> >> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>> >>
>> >>
>> >> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>> >>> Thanks for hunting this down. I'm very curious what the culprit is...
>> >>>
>> >>> sage
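
A rough way to put numbers on the clean-build counts quoted in this thread (4, then 8, then 21 installs with no hang): if a build with patch #50 applied hangs with probability p, independently of the other builds, the chance of n clean builds in a row is (1 - p)^n. The sketch below assumes p = 0.5 as a conservative reading of "hangs on most builds". That rate and the independence assumption are illustrative guesses, not measurements from the thread, and the function name is made up for this example.

```python
def prob_all_clean(hang_rate: float, builds: int) -> float:
    """Chance that `builds` independent builds all finish without a hang.

    hang_rate is the assumed per-build probability of a hang (0..1).
    """
    if not 0.0 <= hang_rate <= 1.0:
        raise ValueError("hang_rate must be between 0 and 1")
    if builds < 0:
        raise ValueError("builds must be non-negative")
    return (1.0 - hang_rate) ** builds


if __name__ == "__main__":
    # Assumption: "hangs on most builds" is read as a hang rate of at least 50%.
    ASSUMED_HANG_RATE = 0.5
    for n in (4, 8, 21):
        chance = prob_all_clean(ASSUMED_HANG_RATE, n)
        print(f"{n:2d} clean builds in a row by luck: {chance:.2e}")
```

Under that assumed rate, 4 clean builds would still happen by luck about 6% of the time, while 21 clean builds would be rarer than one in a million, which is why the run without patch #50 is the more convincing result.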