On Tue, Jul 23, 2019 at 6:58 AM Marc Schöchlin <ms@xxxxxxxxxx> wrote:
>
> On 23.07.19 at 07:28, Marc Schöchlin wrote:
> >
> > Okay, I already experimented with high timeouts (i.e. 600 seconds). As far
> > as I can remember, this led to a pretty unusable system when I put high
> > amounts of I/O on the EC volume.
> > This system also runs a krbd volume which saturates the system with
> > ~30-60% iowait - that volume never had a problem.
> >
> > A commenter on https://tracker.ceph.com/issues/40822#change-141205
> > suggested that I reduce the rbd cache.
> > What do you think about that?
>
> Tests with a reduced rbd cache still fail, so I ran further tests with the
> rbd cache disabled:
>
> - disabled the rbd cache with "rbd cache = false"
> - unmounted and unmapped the image
> - mapped and mounted the image
> - re-executed my test:
>   find /srv_ec -type f -name "*.sql" -exec gzip -v {} \;
>
> It took several hours, but in the end I hit the same error situation.

Can you please test a consistent Ceph release with a known working kernel
release? It sounds like you have changed two variables, so it's hard to know
which one is broken. We need *you* to isolate which specific Ceph or kernel
release causes the break.

We really haven't made many changes to rbd-nbd, but the kernel has had major
changes to the nbd driver. As Mike pointed out on the tracker ticket, one of
those major changes effectively capped the number of devices at 256.

Can you repeat this with a single device? Can you repeat this on Ceph
rbd-nbd 12.2.11 with an older kernel?

--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
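
[Editor's note: a minimal sketch of the "cache disabled" retest described in
the quoted message, for readers who want to reproduce it. The pool/image name
"rbd/srv_ec" and the /dev/nbd0 device are assumptions made for illustration;
only the /srv_ec mount point, the "rbd cache = false" setting, and the find
command come from the thread.]

    # /etc/ceph/ceph.conf on the client -- disable the librbd cache
    [client]
    rbd cache = false

    # unmount and unmap, then re-map and re-mount so the new setting takes effect
    umount /srv_ec
    rbd-nbd unmap /dev/nbd0
    rbd-nbd map rbd/srv_ec
    mount /dev/nbd0 /srv_ec

    # re-run the workload that triggers the error
    find /srv_ec -type f -name "*.sql" -exec gzip -v {} \;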