On 18.07.2018 at 16:20, Sage Weil wrote:
> On Wed, 18 Jul 2018, Oliver Freyermuth wrote:
>> On 18.07.2018 at 14:20, Sage Weil wrote:
>>> On Wed, 18 Jul 2018, Linh Vu wrote:
>>>> Thanks for all your hard work in putting out the fixes so quickly! :)
>>>>
>>>> We have a cluster on 12.2.5 with Bluestore and an EC pool, but for CephFS,
>>>> not RGW. The release notes say RGW is a risk, especially the garbage
>>>> collection, and the recommendation is to either pause IO or disable RGW
>>>> garbage collection.
>>>>
>>>> In our case with CephFS, not RGW, is it a lot less risky to perform the
>>>> upgrade to 12.2.7 without the need to pause IO?
>>>>
>>>> What does pausing IO do? Do current sessions just get queued up, with IO
>>>> resuming normally with no problem after unpausing?
>>>>
>>>> If we have to pause IO, is it better to do something like: pause IO,
>>>> restart the OSDs on one node, unpause IO - repeated for all the nodes
>>>> involved in the EC pool?
>>
>> Hi!
>>
>> sorry for asking again, but...
>>
>>> CephFS can generate a problematic RADOS workload too when files are deleted
>>> or truncated. If that isn't happening in your workload then you're probably
>>> fine. If deletes are mixed in, then you might consider pausing IO for the
>>> upgrade.
>>>
>>> FWIW, if you have been running 12.2.5 for a while and haven't encountered
>>> the OSD FileStore crashes with
>>>
>>>    src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must exist")
>>>
>>> but have had OSDs go up/down, then you are probably okay.
>>
>> => Does this issue only affect FileStore, or also BlueStore?
>> In your "IMPORTANT" warning mail, you wrote, concerning this issue:
>> "It seems to affect filestore and busy clusters with this specific
>> workload."
>> However, the release notes do not state explicitly that only FileStore is affected.
>>
>> Both Linh Vu and I are using BlueStore (exclusively).
>> Are we potentially affected unless we pause I/O during the upgrade?
>
> The bug should apply to both FileStore and BlueStore, but we have only
> seen crashes with FileStore. I'm not entirely sure why that is. One
> theory is that the filestore apply timing is different and that makes the
> bug more likely to happen. Another is that filestore splitting is a
> "good" source of the latency that tends to trigger the bug easily.
>
> If it were me I would err on the safe side. :)

That's certainly the choice of a sage ;-).
We'll do that, too - we have just informed our users that I/O will be blocked
for thirty minutes or so to give us some leeway for the upgrade...
They will certainly survive the pause with the nice weather outside :-).

Cheers and many thanks,
Oliver

> sage
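P.S.: For the archives, here is roughly the sequence we intend to run. This is only a sketch - it assumes a systemd-based deployment where all OSD daemons on a host are covered by ceph-osd.target, and that stopping client I/O cluster-wide via the pause flag is acceptable for the duration of the upgrade:

    # keep OSDs from being marked out while they restart
    ceph osd set noout
    # stop all client reads and writes cluster-wide (sets pauserd/pausewr)
    ceph osd set pause

    # on each OSD host, after installing the 12.2.7 packages:
    systemctl restart ceph-osd.target
    # wait until all PGs are active+clean again before moving to the next host
    ceph -s

    # once every OSD runs 12.2.7, let client I/O resume
    ceph osd unset pause
    ceph osd unset noout

Adjust this to your own orchestration; the package upgrade and restart steps will of course look different with ceph-deploy, ceph-ansible or containerized daemons.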