On 07/04/2014 18:55, Gregory Farnum wrote:
> This would be really nice but there are unfortunately even more
> hiccups than you've noted here:
> 1) Thrashing is both time and disk access sensitive, and hardware differs
> 2) The teuthology thrashing is triggered largely based on PG state
> events (eg, "all PGs are clean, so restart an OSD")
> 3) The actual failures tend to involve a combination of PG state and
> inbound client operations, and I can't think of any realistic way to
> coordinate those.
>
> Those problems look technically insurmountable to me, but maybe I'm
> missing something?

Is there no easy way to use the logs / events to significantly reduce the randomness of the workload? I honestly have no clue ;-)

Cheers

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Sun, Apr 6, 2014 at 3:29 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>> Hi Ceph,
>>
>> It would be nice to have a way to replay the random events injected by stanzas such as
>>
>> - thrashosds:
>>     chance_pgnum_grow: 2
>>     chance_pgpnum_fix: 1
>>
>> When a teuthology workload (such as tracker.ceph.com/issues/7914#note-34) crashes once a week and the error is not obvious, it would increase the probability of reproducing the crash. Instead of "thrashosds" we could have something like "recorded-thrashosds: thrashosd.events", and instead of being random the events would happen more deterministically (same number of events and same number of seconds between events?).
>>
>> I realize this is non-trivial to implement but maybe someone already thought about that and has a better idea?
>>
>> Cheers
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>

--
Loïc Dachary, Artisan Logiciel Libre
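
To make the record-and-replay idea quoted above a bit more concrete, here is a minimal Python sketch. It is not teuthology code: the function names, the event format and the weights dictionary are all assumptions made up for illustration. It only pins down the order of the injected events and the delays between them, so it does nothing about the hardware timing, PG state triggers or client operations that Greg lists.

```python
import json
import random


def record_events(actions, count, seed, max_delay=5.0):
    """Pick `count` weighted random actions and return them as a list of events.

    `actions` maps a (hypothetical) action name to a relative weight,
    in the spirit of the chance_* values in the thrashosds stanza.
    """
    rng = random.Random(seed)  # private RNG: the seed fully determines the sequence
    names = list(actions)
    weights = [actions[name] for name in names]
    events = []
    for _ in range(count):
        events.append({
            # seconds to wait after the previous event
            "delay": round(rng.uniform(0, max_delay), 3),
            "action": rng.choices(names, weights=weights)[0],
        })
    return events


def replay_events(events, handlers, sleep):
    """Run recorded events in order, calling sleep(delay) before each one."""
    for event in events:
        sleep(event["delay"])
        handlers[event["action"]]()


def save_events(events):
    """Serialize events to a JSON string (could be written to an events file)."""
    return json.dumps(events)


def load_events(text):
    """Inverse of save_events."""
    return json.loads(text)
```

A run would call record_events() once, keep the output of save_events() next to the job logs, and a later run would pass load_events() of that text to replay_events() with time.sleep and a dictionary of callables that perform each action. Passing `sleep` in as a parameter keeps the replay loop testable without actually waiting.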