Re: [PATCH] mark rbd requiring stable pages

Ilya Dryomov <idryomov@xxxxxxxxx> · Fri, 30 Oct 2015 12:36:52 +0100

On Fri, Oct 23, 2015 at 9:06 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Fri, Oct 23, 2015 at 9:00 PM, ronny.hegewald@xxxxxxxxx
> <ronny.hegewald@xxxxxxxxx> wrote:
>>> Could you share the entire log snippet for those 10 minutes?
>>
>> That's all that's in the logs.  But if more information would be useful, tell me
>> which logs to enable and I will give it another run.  At least this part is easy
>> to reproduce.
>>
>>> Which kernel was this on?
>>
>> The latest kernel I used that produced the corruption was 3.19.8.
>> The earliest one was 3.11.
>
> No need for now, I'll poke around and report back.

So the "bad crc" errors are of course easily reproducible, but
I haven't managed to reproduce the ext4 corruptions.  I amended your
patch to require stable pages only when we actually compute checksums,
see
https://github.com/ceph/ceph-client/commit/4febcceb866822c1a1aee2836c9c810e3ef29bbf.
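
To illustrate why checksums and unstable pages don't mix, here is
a rough userspace sketch.  It is Python, not kernel code: zlib.crc32
stands in for the real data CRC, and all function names are made up
for illustration.  The idea is simply that if a page is rewritten after
its CRC has been computed but before the data goes out, the receiver
sees a mismatch, and that only matters when data CRCs are computed at
all.

```python
import zlib


def send_with_crc(page, modify_in_flight=None):
    """Checksum a page, then 'transmit' it (hypothetical helper).

    modify_in_flight, if given, is called between the CRC computation
    and the transmit, mimicking a page that is not stable under writeback.
    """
    crc = zlib.crc32(page)           # sender checksums the page
    if modify_in_flight is not None:
        modify_in_flight(page)       # page changes while I/O is in flight
    return crc, bytes(page)          # CRC plus what actually hits the wire


def receiver_accepts(crc, payload):
    """Receiver recomputes the CRC over the payload it actually got."""
    return zlib.crc32(payload) == crc


def needs_stable_pages(data_crc_enabled):
    """Assumed policy: stable pages are only required if CRCs are computed."""
    return data_crc_enabled


def scribble(page):
    page[0] ^= 0xFF                  # flip bits in the first byte


page = bytearray(4096)
stable_ok = receiver_accepts(*send_with_crc(page))
unstable_ok = receiver_accepts(*send_with_crc(page, scribble))
print(stable_ok, unstable_ok)
```

The second send is the "bad crc" case: the payload no longer matches
the checksum taken before the page was modified.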

Any other data points you can share?  Can you describe your cluster
(boxes, OSDs, clients, rbds mapped - where, how many, ext4 mkfs and
mount options, etc) in more detail?  Is there anything special about
your setup that you can think of?

You've mentioned that the best test case in your experience is kernel
compilation.  What .config are you using, how many threads (make -jX),
and how long does it take to build a kernel with that .config and that
number of threads?  You have more than one rbd device mapped on the
same box.  How many exactly, and do you put any load on the rest while
the kernel is compiling on one of them?  What about rbd devices mapped
on other boxes?

You get the idea - every bit counts.

Thanks,

                Ilya