Re: Deterministic thrashing

Loic Dachary <loic@xxxxxxxxxxx> · Mon, 07 Apr 2014 19:13:59 +0200

On 07/04/2014 18:55, Gregory Farnum wrote:
> This would be really nice but there are unfortunately even more
> hiccups than you've noted here:
> 1) Thrashing is both time and disk access sensitive, and hardware differs
> 2) The teuthology thrashing is triggered largely based on PG state
> events (eg, "all PGs are clean, so restart an OSD")
> 3) The actual failures tend to involve a combination of PG state and
> inbound client operations, and I can't think of any realistic way to
> coordinate those.
> 
> Those problems look technically insurmountable to me, but maybe I'm
> missing something?

Is there no easy way to use the logs / events to significantly reduce the randomness of the workload? I honestly have no clue ;-)
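
To make the record/replay idea a bit more concrete, here is a minimal sketch in Python. It is not teuthology code: the action names, the event layout (a delay in seconds plus an action name) and the functions record_events, save_events, load_events and replay are all assumptions made up for illustration. It only pins down the order of the events and the number of seconds between them.

```python
import json
import random

# Hypothetical action names, for illustration only; they are not
# taken from the real thrashosds task.
ACTIONS = ["kill_osd", "revive_osd", "grow_pgnum", "fix_pgpnum"]


def record_events(seed, count, max_delay=10.0):
    """Draw `count` events from a seeded generator and return them as a list."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the sequence
    events = []
    for _ in range(count):
        events.append({
            "delay": round(rng.uniform(0.0, max_delay), 3),  # seconds before the action
            "action": rng.choice(ACTIONS),
        })
    return events


def save_events(events, path):
    """Write the recorded events to a JSON file."""
    with open(path, "w") as f:
        json.dump(events, f, indent=2)


def load_events(path):
    """Read back a list of events written by save_events."""
    with open(path) as f:
        return json.load(f)


def replay(events, handlers, sleep):
    """Run each recorded action, in order, after waiting its recorded delay."""
    for event in events:
        sleep(event["delay"])
        handlers[event["action"]]()
```

Replaying with time.sleep and handlers that really restart OSDs would repeat the same sequence with the same spacing. It would not address the points above about PG state triggers, hardware timing or concurrent client operations.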

Cheers

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Sun, Apr 6, 2014 at 3:29 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>> Hi Ceph,
>>
>> It would be nice to have a way to replay the random events injected by stanzas such as
>>
>> - thrashosds:
>>     chance_pgnum_grow: 2
>>     chance_pgpnum_fix: 1
>>
>> When a teuthology workload (such as tracker.ceph.com/issues/7914#note-34) crashes once a week and the error is not obvious, it would increase the probability of reproducing the crash. Instead of "thrashosds" we could have something like "recorded-thrashosds: thrashosd.events", and instead of being random the events would happen more deterministically (same number of events and same number of seconds between events?).
>>
>> I realize this is non-trivial to implement, but maybe someone has already thought about it and has a better idea?
>>
>> Cheers
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>

-- 
Loïc Dachary, Artisan Logiciel Libre
