On Wed, 18 Jul 2018, Oliver Freyermuth wrote:
> On 18.07.2018 at 14:20, Sage Weil wrote:
> > On Wed, 18 Jul 2018, Linh Vu wrote:
> >> Thanks for all your hard work in putting out the fixes so quickly! :)
> >>
> >> We have a cluster on 12.2.5 with BlueStore and an EC pool, but for
> >> CephFS, not RGW. The release notes say RGW is a risk, especially the
> >> garbage collection, and the recommendation is to either pause IO or
> >> disable RGW garbage collection.
> >>
> >> In our case with CephFS, not RGW, is it a lot less risky to perform
> >> the upgrade to 12.2.7 without the need to pause IO?
> >>
> >> What does pausing IO do? Do current sessions just get queued up, with
> >> IO resuming normally with no problem after unpausing?
> >>
> >> If we have to pause IO, is it better to do something like: pause IO,
> >> restart the OSDs on one node, unpause IO - repeated for all the nodes
> >> involved in the EC pool?
>
> Hi!
>
> Sorry for asking again, but...
>
> > CephFS can generate a problematic rados workload too when files are
> > deleted or truncated. If that isn't happening in your workload then
> > you're probably fine. If deletes are mixed in, then you might consider
> > pausing IO for the upgrade.
> >
> > FWIW, if you have been running 12.2.5 for a while and haven't
> > encountered the OSD FileStore crashes with
> >
> >   src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must exist")
> >
> > but have had OSDs go up/down, then you are probably okay.
>
> => Does this issue only affect FileStore, or also BlueStore?
>
> In your "IMPORTANT" warning mail, you wrote, concerning this issue:
> "It seems to affect filestore and busy clusters with this specific
> workload."
> However, the release notes do not explicitly state that only FileStore
> is affected.
>
> Both Linh Vu and I are using BlueStore (exclusively). Are we
> potentially affected unless we pause I/O during the upgrade?

The bug should apply to both FileStore and BlueStore, but we have only
seen crashes with FileStore, and I'm not entirely sure why that is. One
theory is that the FileStore apply timing is different, which makes the
bug more likely to surface. Another is that FileStore splitting is a
"good" source of exactly the kind of latency that tends to trigger it.

If it were me, I would err on the safe side. :)

sage
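
For anyone who wants to follow the "pause IO, restart OSDs, unpause"
suggestion, a minimal sketch of that sequence is below. It assumes a
systemd-based deployment, default log locations under /var/log/ceph,
and the noout + pause flag combination; adjust service names, paths,
and flags to match your environment.

  # Check whether any OSD on this node has already hit the FileStore
  # assert mentioned above (log path is the assumed default).
  grep -l 'FAILED assert(0 == "ERROR: source must exist")' /var/log/ceph/ceph-osd.*.log

  # Keep CRUSH from marking restarting OSDs out and rebalancing.
  ceph osd set noout

  # Block client IO cluster-wide (sets the pauserd/pausewr flags).
  # Client requests are held, not failed, and resume after unset.
  ceph osd set pause

  # Restart the OSDs on this node, then wait for the cluster to
  # report them back up before moving on.
  systemctl restart ceph-osd.target
  ceph -s

  # Resume client IO and normal rebalancing behaviour.
  ceph osd unset pause
  ceph osd unset noout

  # Confirm which daemons are running 12.2.7.
  ceph versions

The same set/unset pair can also be wrapped around each node's restart
individually, as suggested above, rather than held for the entire
upgrade.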