Is this mailing list an appropriate place to report apparent
cephfs bugs? I haven't gotten traction over on ceph-users.
A couple of weeks ago we attempted to switch our compute
cluster's shared file system from Lustre to Cephfs, but had to
roll back because users began reporting problems:
1) Some writes failing silently, resulting in 0-size files.
2) Some writes hanging indefinitely. In my experiments, the
first 4 MB (4194304 bytes) would be written out fine, but then
the process would get stuck.
I've generally been unable to trigger these bugs myself, except
for (2), which seems to affect only some systems but can be
reproduced every time on an affected system, at least for a
while.
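For reference, the kind of test I've been running for (2) boils
down to a plain write loop like the sketch below. The mount
point, file name, chunk size, and total size are placeholders,
not our actual paths. On an affected client the write() loop
stalls after roughly 4 MB without returning an error; on other
clients everything here succeeds.

/* cephfs-write-test.c -- illustrative only; path and sizes are placeholders */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/cephfs/write-test";   /* placeholder mount/file */
    const size_t chunk = 1 << 20;                  /* 1 MiB per write() */
    char *buf = malloc(chunk);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 'x', chunk);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 16; i++) {                 /* 16 MiB total */
        ssize_t n = write(fd, buf, chunk);
        if (n < 0) { perror("write"); return 1; }
        fprintf(stderr, "chunk %d: wrote %zd bytes\n", i, n);
        /* on an affected client this loop never gets past ~4 MB */
    }
    if (fsync(fd) < 0) { perror("fsync"); return 1; }
    if (close(fd) < 0) { perror("close"); return 1; }

    struct stat st;
    if (stat(path, &st) != 0) { perror("stat"); return 1; }
    /* if (1) really is silent, nothing above reports an error
     * even when the file ends up empty */
    printf("final size reported by stat(): %lld bytes\n",
           (long long)st.st_size);
    free(buf);
    return 0;
}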
We have close to 200 kernel cephfs clients (trying fuse
mounts resulted in hangs). They mostly run kernels between
3.10.0-957.27.2.el7 and 3.10.0-1160.62.1.el7. A few machines
have 4.18.0-348.20.1.el8_5.
The cluster is running 16.2.7 and consists of 20 OSD servers
with 24-26 disks each. The cephfs metadata pool is stored
across 12 OSDs backed by NVMe flash on 3 servers, and there is
a single MDS daemon.
Do the problems we've experienced sound like any known bugs?
The MDS was complaining about slow IO while users were
experiencing issues; could that explain the empty files?
Vlad