Re: Could be due to a race condition... Re: Help with understanding throttle.finish()

On Fri, May 6, 2016 at 8:29 AM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
> On 28-4-2016 20:15, Willem Jan Withagen wrote:
>>
>> Hi,
>>
>> I'm running a rather simple setup on my FreeBSD port.
>>
>> function TEST_simple() {
>>     # run the simplest config, and run a benchmark on it.
>>     local dir=$1
>>
>>     run_mon $dir a || return 1
>>     run_osd $dir 0 || return 1
>>
>>     #
>>     # default values should work
>>     #
>>     ceph tell osd.0 bench || return 1
>>
>> }
>>
>> This in the end crashes with:
>>     8059eec00 -1 FileStore: sync_entry timed out after 600 seconds.
>> exactly 10 minutes after startup.
>> This thread does just about exactly nothing: it initialises the time,
>> and then traps after 10 minutes.
>> # grep 8059eec00  testdir/osd-bench/osd.0.log
>> 2016-04-28 19:51:44.444689 8059eec00 -1 FileStore: sync_entry timed out after 600 seconds.
>> 2016-04-28 19:51:44.487104 8059eec00 -1 os/filestore/FileStore.cc: In function 'virtual void SyncEntryTimeout::finish(int)' thread 8059eec00 time
>
>
> I haven't made much progress with this problem.
> I rebased, but that did not bring in any "fixes".
>
> One extra data point:
> I've run the OSD through truss (aka strace in Linux speak), and that run
> does complete.
>
> What truss/strace does is augment kernel entry and exit with monitoring
> code, and as such it can (and will) change the micro-timing. As a
> consequence, it could also change the order in which threads interact.
> It could very well be a difference in the semantics of locks/mutexes
> between Linux and FreeBSD, but I have not really found anything that
> points in that direction.
>
> The fact that with truss/strace the OSD does not crash
> (not even with: --filestore-commit-timeout=10)
> is an indication that this could very likely be either a deadlock or
> another lock-related issue hiding somewhere under the lid of the OSD.
>
> What are people using to analyze timing/locking/deadlocking issues in the
> Ceph code?

Our Mutex implementations have a custom lockdep built in. That should
be checking for anything using those...

But I'd be inclined to just check exactly what the thread is doing. I
think it's a lot more likely to be getting an unexpected syscall value
and just sitting still or something.
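
As a first step, something like the following could pull out what a single
thread logged before the timeout fired. It is only a minimal sketch: it
assumes the log layout matches the excerpt quoted above (date, time, thread
id, level, message), and thread_lines is a made-up helper name, not part of
the Ceph test scripts.

```shell
# thread_lines LOGFILE THREAD_ID [COUNT]
# Print the last COUNT (default 20) log lines written by THREAD_ID.
# Assumed line layout: <date> <time> <thread-id> <level> <message...>
# The body runs in a subshell, so its variables do not leak to the caller.
thread_lines() (
    log=$1
    tid=$2
    count=${3:-20}
    # Appending "" forces a string comparison, so a hex-looking thread id
    # is never compared as a number.
    awk -v tid="$tid" '($3 "") == (tid "")' "$log" | tail -n "$count"
)
```

For the log quoted above that would be
`thread_lines testdir/osd-bench/osd.0.log 8059eec00`. The log only shows what
the thread chose to print; attaching a debugger and dumping all thread
backtraces (for example gdb's `thread apply all bt`) would show where it is
actually sitting, assuming a working debugger on the FreeBSD build.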
-Greg