On Mon, 16 Nov 2015, Guang Yang wrote:
> I spoke to a leveldb expert, and it looks like this is a known pattern in
> LSM-tree data structures: the tail latency for a range scan can be far
> longer than the average/median, since it might need to mmap several sst
> files to get the record.
>
> Hi Sage,
> Do you see any harm in increasing the default value for this setting
> (e.g. to 20 minutes)? Or should I add it as advice for monitor
> troubleshooting?

The timeout is just for a single round trip of the sync process, right? I
think increasing it a bit (2x or 3x?) is okay, but 20 minutes for a single
chunk is a lot.

The underlying problem in your case is that your store is huge (by ~2
orders of magnitude), so I'm not sure we should tune against that :)

sage

>
> Thanks,
> Guang
>
> On Fri, Nov 13, 2015 at 9:07 PM, Guang Yang <guangyy@xxxxxxxxx> wrote:
> > Thanks Sage! I will definitely try those patches.
> >
> > For this one, I finally managed to bring the new monitor in by
> > increasing mon_sync_timeout from its default of 60 to 60000, to make
> > sure the syncing does not restart and end up in an infinite loop.
> >
> > On Fri, Nov 13, 2015 at 5:04 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> On Fri, 13 Nov 2015, Guang Yang wrote:
> >>> Thanks Sage!
> >>>
> >>> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>> > On Fri, 13 Nov 2015, Guang Yang wrote:
> >>> >> I was wrong in the previous analysis; it was not that the iterator
> >>> >> got reset. The problem I can see now is that during the syncing, a
> >>> >> new round of election kicked off and thus it needs to probe the
> >>> >> newly added monitor; however, since that monitor hasn't been synced
> >>> >> yet, it will restart the syncing from there.
> >>> >
> >>> > What version is this? I think this is something we fixed a while back?
> >>> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578). Is there
> >>> a commit I can take a look at?
> >>
> >> Hrm, I guess it was way before that.
> >> I'm thinking of b8af38b6fc161691d637631d9ce8ab84fb3d27c7, which was
> >> pre-firefly. So I'm not sure exactly why an election would be
> >> restarting the sync in your case.
> >>
> >> You mentioned elsewhere that your mon store was very large, though (more
> >> than tens of GB), which suggests you might be hitting the
> >> min_last_epoch_clean problem (which prevents osdmap trimming); see
> >> b41408302b6529a7856a3b0a08c35e5fa284882e. This was backported to hammer
> >> and firefly but not giant.
> >>
> >> sage
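For reference, a minimal sketch of the mon_sync_timeout workaround discussed in this thread, assuming a ceph.conf of that era where the option is given in seconds (default 60, as stated above). The value 180 is only an illustration of the modest 2x-3x increase Sage suggests, not a tested recommendation; the 60000 Guang used was a one-off to let a monitor with an unusually large store finish syncing.

```ini
# Hypothetical ceph.conf excerpt (illustrative value, not a recommendation).
# Placing the option under [mon] applies it to every monitor daemon that
# reads this file. mon_sync_timeout is in seconds (default 60); 180 is the
# "2x or 3x" increase discussed in the thread.
[mon]
    mon sync timeout = 180
```

As Sage points out, raising the timeout only papers over the real issue here: a mon store roughly two orders of magnitude larger than expected, possibly caused by osdmaps not being trimmed.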