Re: "issue pool application warning even if pool is empty" change

On Thu, Aug 31, 2023 at 9:26 PM Prashant Dhange <pdhange@xxxxxxxxxx> wrote:
>
> Hi Ilya,
>
> We discussed this topic in yesterday's RADOS meeting. The overall sentiment is not to revert PR#47560 until we have a viable solution from the RGW and orchestrator side. Similar problems can occur with applications built on top of the librados APIs that fail to enable an application for the pool. End users may find it difficult to debug why the pool is not writable.
>
> We believe the solution may lie outside RADOS, but the end solution should be less intrusive and backward-compatible. RGW was silently failing to create buckets. We had to debug the issue through RGW debug logs, which was time-consuming and not at all user-friendly. Reference: https://bugzilla.redhat.com/show_bug.cgi?id=2028999. One of Ceph's users had a major production outage lasting more than 24 hours because RGW was failing to create buckets after a cluster upgrade due to the enforcement of the tag in OSD caps. Alternatively, we can mute the POOL_APP_NOT_ENABLED warning if HEALTH_WARN is too annoying for newly created pools.

Hi Prashant,

Thanks for providing the context.

I can't say I agree with the approach.  There are many other ways to
screw up OSD caps (especially if one tries to lock down as tight as
possible) none of which would be similarly highlighted in "ceph
status", so this doesn't address the general lack of user-friendliness
in this area.

>
> Let us know if you would like to join the next RADOS meeting (or a separate meeting) to discuss a feasible solution with the RGW and cephadm teams. I will invite all the stakeholders to the meeting.

That said, I don't really have a stake in the ground here.  Given that
creating pools is a rare operation, perhaps a bogus health alert that
shows up only briefly is acceptable.

                Ilya

>
> Regards,
> Prashant
>
> On Tue, Aug 29, 2023 at 2:20 PM Vikhyat Umrao <vikhyat@xxxxxxxxxx> wrote:
>>
>>
>>
>> On Tue, Aug 29, 2023 at 4:37 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>
>>> On Mon, Aug 28, 2023 at 11:22 PM Prashant Dhange <pdhange@xxxxxxxxxx> wrote:
>>> >
>>> > Hi Ilya and Vikhyat,
>>> >
>>> > On Mon, Aug 28, 2023 at 9:06 AM Vikhyat Umrao <vikhyat@xxxxxxxxxx> wrote:
>>> >>
>>> >> Ilya and Prashant - Or could we have a feature in RADOS where the pool create command also takes the application as input? The application not being set has caused problems that are hard to troubleshoot.
>>> >
>>> > This could be an alternative approach to avoid the BZ#2029585 issue: specifying the application name at pool creation time. Let's discuss this in the next RADOS meeting.
>>> >
>>> >>
>>> >>
>>> >> On Fri, Aug 25, 2023 at 3:21 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>> >>>
>>> >>> On Fri, Aug 25, 2023 at 9:41 AM Prashant Dhange <pdhange@xxxxxxxxxx> wrote:
>>> >>> >
>>> >>> > Hi Ilya,
>>> >>> >
>>> >>> > G'day.
>>> >>> >
>>> >>> > We were seeing RGW bucket creation failures when the application was
>>> >>> > not enabled for the RGW control pool, and ceph status was not reporting
>>> >>> > the warning message "x pool(s) do not have an application enabled
>>> >>> > (POOL_APP_NOT_ENABLED)".
>>> >>>
>>> >>> Hi Prashant,
>>> >>>
>>> >>> Could RGW be improved to emit a better log message in this case?
>>> >>>
>>> >>> > We also observed the RGW daemon crash when the application was not
>>> >>> > enabled for the pool. There was no way to know the reason behind
>>> >>> > the RGW bucket creation failure. This issue was raised in
>>> >>> > BZ#2029585.
>>> >>>
>>> >>> I assume the crash is the following:
>>> >>>
>>> >>>     debug     -5> 2022-08-10T12:10:55.410+0000 7f6b90b27700 10
>>> >>> monclient: get_auth_request con 0x5652391ac000 auth_method 0
>>> >>>     debug     -4> 2022-08-10T12:10:55.532+0000 7f6ba64b2440  0 rgw
>>> >>> main: ERROR: notify_obj.operate() returned r=-1
>>> >>>     debug     -3> 2022-08-10T12:10:55.532+0000 7f6ba64b2440 -1 ERROR:
>>> >>> failed to initialize watch: (1) Operation not permitted
>>> >>>     debug     -2> 2022-08-10T12:10:55.532+0000 7f6ba64b2440  0 rgw
>>> >>> main: ERROR: failed to start notify service ((1) Operation not
>>> >>> permitted
>>> >>>     debug     -1> 2022-08-10T12:10:55.532+0000 7f6ba64b2440  0 rgw
>>> >>> main: ERROR: failed to init services (ret=(1) Operation not permitted)
>>> >>>     debug      0> 2022-08-10T12:10:55.539+0000 7f6ba64b2440 -1 ***
>>> >>> Caught signal (Segmentation fault) **
>>> >>>      in thread 7f6ba64b2440 thread_name:radosgw
>>> >>>
>>> >>>      ceph version 16.2.7-98.el8cp
>>> >>> (b20d33c3b301e005bed203d3cad7245da3549f80) pacific (stable)
>>> >>>      1: /lib64/libpthread.so.0(+0x12c20) [0x7f6b9ab19c20]
>>> >>>      2: /lib64/librados.so.2(+0xada95) [0x7f6ba4ecaa95]
>>> >>>      3: /lib64/librados.so.2(+0x9dfd8) [0x7f6ba4ebafd8]
>>> >>>      4: (RGWSI_Notify::unwatch(RGWSI_RADOS::Obj&, unsigned long)+0x2e)
>>> >>> [0x7f6ba5cac99e]
>>> >>>      5: (RGWSI_Notify::finalize_watch()+0x40) [0x7f6ba5cad290]
>>> >>>      6: (RGWSI_Notify::shutdown()+0x22) [0x7f6ba5cad302]
>>> >>>      7: (RGWServices_Def::shutdown()+0x4e) [0x7f6ba57abcde]
>>> >>>      8: (RGWServices_Def::~RGWServices_Def()+0x12) [0x7f6ba57abd62]
>>> >>>      9: (RGWRados::~RGWRados()+0x80) [0x7f6ba5b8e990]
>>> >>>      10: (RGWStoreManager::init_storage_provider(DoutPrefixProvider
>>> >>> const*, ceph::common::CephContext*, bool, bool, bool, bool, bool,
>>> >>> bool, bool)+0x137) [0x7f6ba5b8d277]
>>> >>>      11: (radosgw_Main(int, char const**)+0x154b) [0x7f6ba574a33b]
>>> >>>      12: __libc_start_main()
>>> >>>      13: _start()
>>> >>>      NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>> >>> needed to interpret this.
>>> >>>
>>> >>> It's not in RGW per se, but could be caused by RGW passing an invalid
>>> >>> pointer to librados.  Was this reported to the RGW team?
>>> >
>>> > Yes, this was the RGW crash. We reported it in tracker#54719. I was not able to debug the issue further because the coredump was missing,
>>> > and I was also unable to reproduce it in another attempt.
>>> >
>>> >>>
>>> >>> >
>>> >>> > My opinion was that if we create a pool, then we must specify the
>>> >>> > application for the pool even if the pool is not in use, to avoid
>>> >>> > unnecessary creation of pools.
>>> >>>
>>> >>> As I said in the previous message, unfortunately it doesn't work this
>>> >>> way because creating a pool and specifying an application are separate
>>> >>> steps.  With this change the cluster can temporarily go to HEALTH_WARN
>>> >>> on any pool creation, even if operator is following up with "ceph osd
>>> >>> pool application enable" command immediately.  The "in use" check was
>>> >>> put in place because there appeared to be no other (easy) way to avoid
>>> >>> a bogus health alert.
>>> >
>>> > Would it be a good approach, in your view, to make specifying the application name compulsory at the time of pool
>>> > creation, as suggested by Vikhyat?
>>>
>>> Hi Prashant,
>>>
>>> I'm pretty sure there were $REASONS why it wasn't made compulsory back
>>> when support for application tags/metadata was being added.  The major
>>> one was definitely backwards compatibility, since changing an existing
>>> monitor command to require a parameter that isn't even there is tough.
>>
>>
>> I am not saying that setting the app during pool creation should be compulsory. I am saying we keep it optional while avoiding the two-step approach (one step for pool creation and one for setting the app) that Ilya is pointing to, in order to fix all the tests. This will help protect backward compatibility, as Ilya pointed out. On the orchestration side, we can make it compulsory: whether it is cephadm or Rook, when the orchestrator creates a pool it should make sure it sets the app for the type of use the pool is intended for, and if daemon code creates the pools (RGW, for example), the daemon should make sure it sets the app during pool creation. So my vote would be for having an --app kind of flag in the pool create commands and documenting the app enforcement for consumers like Rook, cephadm, and the respective daemons.
>>
>> --vikhyat
>>
>>
>>>
>>>
>>> Further, you would need to extend librados C/C++ API and also all
>>> bindings because none of the rados_pool_create variants allow passing
>>> arbitrary parameters.
>>>
>>> Overall, it doesn't seem worth the effort (and trouble) to me.
>>>
>>> Thanks,
>>>
>>>                 Ilya
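
The window discussed in this thread (pool creation and application enable being separate steps) can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not Ceph code, the names Pool, create_pool and pools_without_app are hypothetical, and the optional app argument stands in for the "--app kind of flag" proposed above, which does not exist in the thread's description of the current commands.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Pool:
    # Hypothetical stand-in for a pool; not a Ceph data structure.
    name: str
    applications: Set[str] = field(default_factory=set)
    num_objects: int = 0


def create_pool(pools: List[Pool], name: str, app: Optional[str] = None) -> Pool:
    """Create a pool; 'app' models the proposed optional one-step tagging."""
    pool = Pool(name)
    if app is not None:
        pool.applications.add(app)
    pools.append(pool)
    return pool


def pools_without_app(pools: List[Pool], require_in_use: bool) -> List[str]:
    """Names of pools that would raise the warning in this toy model.

    require_in_use=True models the older behaviour with the "in use" check
    (empty pools are skipped); False models warning even if the pool is empty.
    """
    flagged = []
    for pool in pools:
        if pool.applications:
            continue  # an application is enabled, nothing to warn about
        if require_in_use and pool.num_objects == 0:
            continue  # empty pool, skipped by the "in use" check
        flagged.append(pool.name)
    return flagged


pools: List[Pool] = []

# Step 1: create the pool without an application (two-step approach).
demo = create_pool(pools, "demo-pool")
window_with_check = pools_without_app(pools, require_in_use=True)
window_without_check = pools_without_app(pools, require_in_use=False)

# Step 2: enable the application; the warning clears in either mode.
demo.applications.add("rgw")
after_enable = pools_without_app(pools, require_in_use=False)

# One-step creation with the (hypothetical) optional app argument.
create_pool(pools, "one-step-pool", app="rbd")
one_step = pools_without_app(pools, require_in_use=False)
```

In this model the empty, untagged pool is flagged between step 1 and step 2 only when the "in use" check is dropped, which is the brief HEALTH_WARN described above; creating the pool with the application in one step never opens that window.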