On 18.07.2018 at 16:20, Sage Weil wrote:
> On Wed, 18 Jul 2018, Oliver Freyermuth wrote:
>> On 18.07.2018 at 14:20, Sage Weil wrote:
>>> On Wed, 18 Jul 2018, Linh Vu wrote:
>>>> Thanks for all your hard work in putting out the fixes so quickly! :)
>>>>
>>>> We have a cluster on 12.2.5 with Bluestore and an EC pool, but for CephFS,
>>>> not RGW. The release notes say RGW is a risk, especially the garbage
>>>> collection, and the recommendation is to either pause IO or disable RGW
>>>> garbage collection.
>>>>
>>>> In our case with CephFS, not RGW, is it a lot less risky to perform the
>>>> upgrade to 12.2.7 without the need to pause IO?
>>>>
>>>> What does pausing IO do? Do current sessions just get queued up, with IO
>>>> resuming normally with no problem after unpausing?
>>>>
>>>> If we have to pause IO, is it better to do something like: pause IO,
>>>> restart the OSDs on one node, unpause IO - repeated for all the nodes
>>>> involved in the EC pool?
>>
>> Hi!
>>
>> sorry for asking again, but...
>>
>>> CephFS can generate a problematic RADOS workload too when files are deleted
>>> or truncated. If that isn't happening in your workload then you're probably
>>> fine. If deletes are mixed in, then you might consider pausing IO for the
>>> upgrade.
>>>
>>> FWIW, if you have been running 12.2.5 for a while and haven't encountered
>>> the OSD FileStore crashes with
>>>
>>>    src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must exist")
>>>
>>> but have had OSDs go up/down, then you are probably okay.
>>
>> => Does this issue only affect FileStore, or also BlueStore?
>> In your "IMPORTANT" warning mail, you wrote, concerning this issue:
>> "It seems to affect filestore and busy clusters with this specific
>> workload."
>> However, the release notes do not state explicitly that only FileStore is affected.
>>
>> Both Linh Vu and I are using BlueStore (exclusively).
>> Are we potentially affected unless we pause I/O during the upgrade?
>
> The bug should apply to both FileStore and BlueStore, but we have only
> seen crashes with FileStore. I'm not entirely sure why that is. One
> theory is that the filestore apply timing is different and that makes the
> bug more likely to happen. Another is that filestore splitting is a
> "good" source of the latency that tends to trigger the bug easily.
>
> If it were me I would err on the safe side. :)

That's certainly the choice of a sage ;-).
We'll do that, too - we have just informed our users that I/O will be blocked
for thirty minutes or so to give us some leeway for the upgrade...
They will certainly survive the pause with the nice weather outside :-).

Cheers and many thanks,
Oliver

> sage
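P.S.: For the archives, here is roughly the sequence we intend to run. This is only a sketch - it assumes a systemd-based deployment where all OSD daemons on a host are covered by ceph-osd.target, and that stopping client I/O cluster-wide via the pause flag is acceptable for the duration of the upgrade:

    # keep OSDs from being marked out while they restart
    ceph osd set noout
    # stop all client reads and writes cluster-wide (sets pauserd/pausewr)
    ceph osd set pause

    # on each OSD host, after installing the 12.2.7 packages:
    systemctl restart ceph-osd.target
    # wait until all PGs are active+clean again before moving to the next host
    ceph -s

    # once every OSD runs 12.2.7, let client I/O resume
    ceph osd unset pause
    ceph osd unset noout

Adjust this to your own orchestration; the package upgrade and restart steps will of course look different with ceph-deploy, ceph-ansible or containerized daemons.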