On Thu, May 02, 2019 at 07:45:16PM +0200, Andre Noll wrote:
> On Thu, May 02, 18:52, Greg Kroah-Hartman wrote
> > On Thu, May 02, 2019 at 05:27:36PM +0200, Andre Noll wrote:
> > > On Thu, May 02, 16:10, Greg Kroah-Hartman wrote
> > > > Ok, then how about we hold off on this patch for 4.9.y then. "no one"
> > > > should be using 4.9.y in a "server system" anymore, unless you happen to
> > > > have an enterprise kernel based on it. So we should be fine as the
> > > > users of the older kernels don't run xfs.
> > >
> > > Well, we do run xfs on top of bcache on vanilla 4.9 kernels on a few
> > > dozen production servers here. Mainly because we ran into all sorts
> > > of issues with newer kernels (not necessarily related to xfs). 4.9,
> > > OTOH, appears to be rock solid for our workload.
> >
> > Great, but what is wrong with 4.14.y or better yet, 4.19.y? Do those
> > also work for your workload? If not, we should fix that, and soon :)
>
> Some months ago we tried 4.14 and it was a real disaster: random
> crashes with nothing in the logs on the file servers, and unkillable
> hung processes on the compute machines. The thing is, I can't afford
> extended downtime on these production systems, so I can't test patches
> or enable debugging options that slow the systems down too much. Also,
> 10 of the compute nodes load the nvidia module, so all bets are off
> there anyway. But we have also seen the hung processes on non-GPU
> nodes where the nvidia module is not loaded.
>
> As for 4.19, xfs on bcache was broken until a couple of weeks ago.
> The fix (e578f90d8a9c) has since gone in, so I briefly benchmarked
> 4.19.x on one system. To my surprise, the results were *worse* than
> with 4.9. This looks like another cache bypass issue, but I need to
> take a closer look and gather more reliable numbers.
Is this something you can reproduce outside of those 10 magical
machines?
--
Thanks,
Sasha