Yes, they can hold up reads to the same object. Depending on where
they're stuck, they may be blocking other requests as well if they're
e.g. taking up all the filestore threads. Waiting for subops means
they're waiting for replicas to acknowledge the write and commit it to
disk. The real cause of the slowness of those ops is the replicas. If you
enable 'debug osd = 25', 'debug filestore = 25', and 'debug journal = 20' you
can trace through the logs to see exactly what's happening with the
subops for those requests.
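For anyone following along, those settings can go into the [osd] section of
ceph.conf (with the osds restarted afterwards to pick them up) - a minimal
sketch, assuming the usual config layout:

    [osd]
        debug osd = 25
        debug filestore = 25
        debug journal = 20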
Despite my promise to crank up logging and investigate what's going on, I
was unable to. Ceph ceased to work before I was able to do anything
reasonable (I've got about an hour of logging, which I will look through
shortly). But then I started to get slow request warnings delayed by
thousands of seconds. VMs came to a standstill. I restarted all
ceph subsystems a few times - no banana. ceph health reported that
everything was OK when clearly it was not running. So I ended up
rebooting the hosts, and that's where the fun began: btrfs failed to unmount,
and on boot-up it spat out "btrfs: free space inode generation (0) did not
match free space cache generation (177431)". I did not start ceph and
made an attempt to umount, and the umount just froze. Another reboot: same
story. I rebooted the second host and it came back with the same error.
So in effect I was unable to mount btrfs and read it: no wonder
ceph was unable to run. Actually, according to the mons ceph was OK - all osd
daemons were in place and running - but the underlying filesystem had given up the
ghost. Which leads to a suggestion: if an osd daemon is unable to obtain
any data from the underlying fs for some period of time (failure to mount,
disk failure, etc.), then perhaps it should terminate so the rest of ceph would
not be held up and the problem would be immediately apparent in ceph health.
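As an aside on that "free space cache generation" message: it points at a stale
btrfs free space cache, and one thing that might have been worth trying before
blowing the filesystems away is mounting with the clear_cache option so btrfs
rebuilds the cache - a rough sketch, with the device and mount point as
placeholders:

    mount -o clear_cache /dev/sdX /mnt/osdN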
But ceph, being ceph, served its purpose as it should: I lost two
osds out of four, but because I had set replication to 3 I could afford the
loss of two osds. After destroying the faulty btrfs filesystems, all VMs started
as usual and so far have had no issues. Ceph is rebuilding the two osds in the
meantime.
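For reference, the replication factor is the pool's "size" setting; a rough
sketch of checking and setting it (the pool name here is just an example -
adjust to whichever pool backs the VMs):

    # replication size shows up in the pool lines of the osd map dump
    ceph osd dump | grep pool
    # bump replication to 3 on the pool in question
    ceph osd pool set rbd size 3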
Looking back at the beginning of the thread, I can now conclude what
happened:
1. One of our customers ran a random-write-intensive task (MySQL updates +
a lot of temporary files created/removed).
2. Over a period of two days the performance of the underlying btrfs started
to deteriorate and I started to see noticeable latency (at this point I
emailed the list).
3. While I was trying to ascertain the origin of the latency, the intensive
random writes continued, and so the latency kept increasing to the point where
ceph started to complain about slow requests.
4. And finally the state of btrfs went beyond the point where it could run,
and so the osds just locked up completely.
Now I have blown away the btrfs filesystems, made new ones with a leafsize of
64K (Calvin - this one is for you - let's see where it lands me) and am
rebuilding them. I will blow away the other two osds to have totally fresh
btrfs filesystems all around (this one goes to Tommi - it looks like I just
followed his observations).
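For completeness, the mkfs invocation is roughly the following (device name is
a placeholder; -l sets the btrfs leafsize):

    mkfs.btrfs -l 64k /dev/sdX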
And of course hats off to Josh and the ceph team, as I now have a clear idea of
what to do when I need to debug latency (and other internal stuff).
But that leaves me with one final question: should we rely on btrfs at
this point, given that it has such major faults? What if I were to use the
time-tested ext4 instead?
Regards,
Vladimir