Hi Frank,
Thanks for the inputs. Kindly find my answers inline below...
On Wed, Mar 22, 2023 at 2:05 AM Frank Schilder <frans@xxxxxx> wrote:
Note: replying as a ceph cluster admin. Hope that is OK.
Hi Prashant,
that sounds like a very interesting idea. I have a few questions/concerns/suggestions from the point of view of a cluster admin.
Short version:
- please (!!) keep these logs on the dedicated MON storage below /var/lib/ceph
- however: take the logs out of the MON DB and write them to their own DB/file
- make the last-log size a configuration parameter (the log file becomes a ring buffer);
  the config could be elastic, a combination of max_size and max_age
- optional: make filtering rules a config option (filter by type/debug level)
Long version:
1) What is the actual problem?
If I recall the cases about "MON store growing rapidly" correctly, I believe the problem was not that the logs go to the MONs; the problem was that the logs don't get trimmed unless health is HEALTH_OK. The MONs apparently had no (performance) problem receiving the logs, but a capacity problem storing them in case of health failures. If the logs are really just used for having the last entries available, why not look at the trimming first? Also, there is nothing in the logs stored on the MONs that isn't in the syslog, so losing something here is not really a problem to begin with.
Yes, you are right. Even in the HEALTH_OK case, logm trimming hit a corner case caused by potential corruption of the committed versions (https://tracker.ceph.com/issues/53485). And if we trim logm (cluster log) entries aggressively whenever an excessive number of them accumulates, there is little point in storing them at all, because they would be trimmed before they could be fetched via "ceph log last" or the mgr dashboard.
2) .mgr pool
2.1) I have become really tired of these administrative pools that are created on the fly without any regard to device classes, available capacity, PG allocation and the like. The first one that showed up without warning was device_health_metrics, which turned the cluster to HEALTH_ERR right away because the on-the-fly pool creation is, well, not exactly smart.
We don't even have drives below the default root. We have a lot of different pools on different (custom!) device classes with different replication schemes to accommodate a large variety of use cases. Administrative pools showing up randomly somewhere in the tree are a real pain. There are ceph-users cases where people deleted and recreated such a pool, only to render the device health module useless, because it seems to store the pool ID and there is no way to tell it to use the new pool.
I am not sure, but doesn't switching the device health metrics to a new pool with "ceph config set mgr mgr/devicehealth/pool_name <new-pool>" work? Maybe we can address this device health module issue through a tracker?
If you really think about adding a pool for that, please please make the pool creation part of the upgrade instructions, with some hints on sizing, PGs and realistic (!!!) IOP/s requirements. I personally use the host syslog and have drives with reasonable performance and capacity in the hosts to be able to pull debug logs at high debug settings. All host logs are also aggregated to an rsyslogd instance. I don't see *any* need to aggregate these logs to a ceph pool.
2.2) Using a ceph pool for logging is not reliable during critical situations. The whole point of the logging is to provide information in case of disaster. In case of disaster, we can safely assume that an .mgr pool will not be available. The logging has to be on alternative infrastructure that is not affected by ceph storage outages/health problems. Having it in the MON stores on local storage is such an alternative infrastructure. Why not just separate the logging storage from the actual MON DB store and make its max_size configurable?
Agreed on 2.1 and 2.2. I really appreciate your effort to document these concerns in detail. The other caveat with this solution is that if the mgr pool storing the cluster logs is not writable (because of full OSDs, network issues, etc.), we would need an alternative way to get hold of the cluster logs for troubleshooting purposes.
I would propose to keep it on the local dedicated MON storage (however, outside of the MON DB), also to keep setting up a ceph cluster simple. If we now needed an additional MGR store as well, things would be more complicated. Just tell people that 60G is not enough for a MON store, and at the same time make the last-log size a config option (it should really be a ring buffer with a configurable fixed maximum number of entries).
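For illustration only, here is a minimal Python sketch of such a ring buffer, bounded by both a maximum entry count and a maximum age. The class and parameter names are made up for this sketch and are not existing Ceph code or config options.

import collections
import time

class ClusterLogRing:
    """Sketch of the proposed ring-buffer semantics: entries are dropped
    once either a maximum count or a maximum age is exceeded, so the
    store stays bounded even during a log storm on an unhealthy cluster."""

    def __init__(self, max_entries=10000, max_age_sec=86400):
        # Hypothetical knobs, i.e. what a max_size/max_age config pair
        # could map to; these are not existing Ceph option names.
        self.max_age_sec = max_age_sec
        self.entries = collections.deque(maxlen=max_entries)  # count bound

    def add(self, level, message):
        self.entries.append((time.time(), level, message))
        self._expire()

    def _expire(self):
        # Age bound: drop everything older than max_age_sec.
        cutoff = time.time() - self.max_age_sec
        while self.entries and self.entries[0][0] < cutoff:
            self.entries.popleft()

    def last(self, n=20):
        # Roughly what "ceph log last <n>" would return.
        return list(self.entries)[-n:]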
3) MGR performance
While it would possibly make sense to let the MGRs do more work, there is the problem that this work is not distributed (only 1 MGR does anything) and that MGR modules seem not really performance-optimized (too much Python). If one wanted to outsource additional functionality to the MGRs, a good start would be to make all MGRs active and distribute the work (like a small distributed-memory compute cluster). A bit more module-crash resilience and some performance improvements would also be welcome.
Yes, the mgr is not distributed and a single active mgr is responsible for the entire mgr workload. A major job of the mgr is to offload the MONs as much as possible. Another concern here is that if the active mgr handles cluster logging through the new pool, we would miss cluster logs during any timeframe when all mgrs are down.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Regards,
Prashant
________________________________________
From: Prashant Dhange <pdhange@xxxxxxxxxx>
Sent: 22 March 2023 06:35:36
To: dev@xxxxxxx
Subject: Moving cluster log storage from monstore db
Hi All,
We are looking for input on a new feature to move clog message storage out of the monstore db; refer to the trello card [1] for more details on this topic.
Currently, every clog message goes to the monstore db, and debug/warning conditions can generate clog messages thousands of times per second, which leads to the monstore db growing very rapidly in a catastrophic failure situation.
The primary use cases for the logm entries in the monstore db are:
* For "ceph log last" commands to get historical clog entries
* The Ceph dashboard (the mgr is a subscriber of log-info, which propagates clog entries to the dashboard module)
@Patrick Donnelly <pdonnell@xxxxxxxxxx> suggested a viable solution: move the cluster log storage to a new mgr module which handles the "ceph log last" command. The clog data can be stored in the .mgr pool via libcephsqlite.
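As a rough sketch of how that could look, assuming the documented libcephsqlite usage pattern (load the extension, then open a RADOS-backed database through the "ceph" VFS); the namespace, database name and table layout below are made up for illustration and are not an agreed design:

import sqlite3

# Register the "ceph" VFS by loading the libcephsqlite extension.
# This needs a ceph.conf/keyring reachable by the process.
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('libcephsqlite.so')
db.close()

# Open (or create) a database stored in the .mgr pool.
# URI format: file:///<pool>:<namespace>/<dbname>?vfs=ceph
db = sqlite3.connect('file:///.mgr:clog/cluster_log.db?vfs=ceph', uri=True)

db.execute("""
    CREATE TABLE IF NOT EXISTS clog (
        stamp   REAL,   -- message timestamp
        channel TEXT,   -- cluster / audit
        level   TEXT,   -- DBG / INF / WRN / ERR
        message TEXT
    )
""")

def store(stamp, channel, level, message):
    db.execute("INSERT INTO clog VALUES (?, ?, ?, ?)",
               (stamp, channel, level, message))
    db.commit()

def log_last(n=20):
    # Roughly what the new mgr module would serve for "ceph log last <n>".
    return db.execute(
        "SELECT stamp, channel, level, message FROM clog "
        "ORDER BY stamp DESC LIMIT ?", (n,)).fetchall()

Trimming old rows would then be an ordinary DELETE ... WHERE stamp < cutoff, rather than a mon store compaction concern.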
Alternatively, if we do not want to get rid of logm storage in the monstore db, then the other solutions would be:
* Stop writing logm entries to the mon db if an excessive number of entries is being generated
* Filter out clog DBG entries and only log WRN/INF/ERR entries.
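For the second alternative, a small sketch of the kind of severity filtering that could be applied before entries are persisted; the level names follow the usual clog levels, and the function is purely illustrative:

# Clog severities in increasing order of importance.
SEVERITY = {'DBG': 0, 'INF': 1, 'WRN': 2, 'SEC': 3, 'ERR': 3}

def should_store(level, min_level='INF'):
    # Drop DBG (or anything below min_level) before it is persisted,
    # so debug-message storms cannot grow the store.
    return SEVERITY.get(level, 0) >= SEVERITY[min_level]

# Example: only INF/WRN/ERR are kept, DBG is filtered out.
entries = [('DBG', 'osd.3 heartbeat'), ('WRN', 'slow ops'), ('ERR', 'osd.5 down')]
kept = [e for e in entries if should_store(e[0])]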
Looking forward to additional perspectives on this topic. Feel free to add your input to the trello card [1] or reply to this email thread.
[1] https://trello.com/c/oCGGFfTs/822-better-handling-of-cluster-log-messages-from-monstore-perspective
Regards,
Prashant
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx