Re: LARGE_OMAP_OBJECTS: any proper action possible?

On Thu, Aug 26, 2021 at 9:49 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan,
>
> he he, I built a large omap object cluster, we are up to 5 now :)
>
> It is possible that our meta-data pool became a bottleneck. I'm re-deploying OSDs on these disks at the moment, increasing the OSD count per disk from 1 to 4. The disks I use require high-concurrency access to get close to their spec performance, and a single OSD per disk doesn't get close to saturation (they're Intel enterprise NVMe-SSD SAS drives with really good performance specs). Therefore, I don't see the disks themselves as a bottleneck in iostat or atop, but it is very well possible that the OSD daemon is at its limit. It will take a couple of days to complete this and I will report back.
>
> > This covers the topic and relevant config:
> > https://docs.ceph.com/en/latest/cephfs/dirfrags/
>
> This is a classic ceph documentation page: just numbers without units (size of 10000 what??) and no explanation of how this relates to object sizes and/or key counts :) After reading it, I don't think we are looking at dirfrags. The key count is simply too large and the size probably as well. Could it be MDS journals? What other objects might become large? Or, how could I check what it is, for example, by looking at a hexdump?
>

Taking one example:
2021-08-25 11:17:06.866726 osd.37 osd.37 192.168.32.77:6850/12306 644 : cluster [WRN] Large omap object found. Object: 12:05982a7e:::1000d7fd167.02800000:head PG: 12.7e5419a0 (12.20) Key count: 2293816 Size (bytes): 1078093520

This is inode 1000d7fd167, i.e. 1099738108263 in decimal.
You can find this dir in the fs with `find /cephfs -type d -inum
1099738108263`. I expect it to be a huge directory.

You can observe the contents of the dir via rados:

  rados -p cephfs_metadata listomapkeys 1000d7fd167.02800000
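
Putting the above together, a rough shell sketch (the pool name, object
name, and mount point are just the ones from this example -- adjust for
your setup):

  # object name is <inode-hex>.<fragment>; convert the hex inode to decimal
  ino_hex=1000d7fd167
  ino_dec=$((16#$ino_hex))            # -> 1099738108263
  # locate the directory on a mounted cephfs (may take a while on a big fs)
  find /cephfs -type d -inum "$ino_dec"
  # count the omap keys (one per dentry) rather than listing them all
  rados -p cephfs_metadata listomapkeys "${ino_hex}.02800000" | wc -l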


> I should mention that we have a bunch of super-aggressive clients on the FS. Currently, I'm running 4 active MDS daemons and they seem to have distributed the client load very well between each other by now. The aggressive clients are probably OpenFOAM or similar jobs that create millions and millions of small files in a very short time. I have seen peaks of 4-8K requests per second to the MDSes. On our old Lustre system they managed to run out of inodes long before the storage capacity was reached; it's probably the worst data-to-inode ratio one can think of. One of the advantages of ceph is its unlimited inode capacity and it seems to cope with the usage pattern reasonably well - modulo the problems I seem to observe here.

Advise your clients to spread these millions of small files across
many directories. In my experience, users start to suffer once there
are more than a few hundred thousand files in a directory. ("suffer"
-- creating/deleting files and listing the dir start to slow down
substantially, especially if they are working in the same dir from
many clients.)
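
In case it's useful to pass along to those users, here's a minimal
sketch of the usual workaround -- bucket the output files into
subdirectories so that no single directory grows to hundreds of
thousands of entries (the path and bucket count below are made up for
illustration):

  # spread 1M files over 256 subdirectories instead of one flat directory
  outdir=/cephfs/project/run42    # hypothetical job output directory
  for i in $(seq 1 1000000); do
      bucket=$(printf '%03d' $(( i % 256 )))
      mkdir -p "$outdir/$bucket"
      touch "$outdir/$bucket/file_$i"
  done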

-- dan



>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> Sent: 25 August 2021 15:46:27
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re:  LARGE_OMAP_OBJECTS: any proper action possible?
>
> Hi,
>
> On Wed, Aug 25, 2021 at 2:37 PM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Hi Dan,
> >
> > > [...] Do you have some custom mds config in this area?
> >
> > none that I'm aware of. What MDS config parameters should I look for?
>
> This covers the topic and relevant config:
> https://docs.ceph.com/en/latest/cephfs/dirfrags/
>
> Here in our clusters we've never had to tune any of these options --
> it works well with the defaults on our hw/workloads.
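>
> If you want to confirm you're running the defaults, you can dump the
> dirfrag-related options straight from a running MDS (a rough sketch;
> the option names are from the dirfrags doc above, and the daemon name
> will differ on your cluster):
>
>   ceph daemon mds.<name> config show | grep -E 'mds_bal_(split|merge|fragment)'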
>
> I recently seem to have had problems with very slow dirfrag operations that made an MDS unresponsive long enough for a MON to kick it out. I had to increase the MDS beacon timeout to get out of an MDS restart loop (it also had an oversized cache by the time I discovered the problem). The dirfrag operation was reported as a slow op warning.
>
> That sounds related. In our env I've never noticed slow dirfrag ops.
> Do you have any underlying slowness or overload on your metadata osds?
>
> -- dan
>
>
>
> >
> > Thanks and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> > Sent: 25 August 2021 14:05:00
> > To: Frank Schilder
> > Cc: ceph-users
> > Subject: Re:  LARGE_OMAP_OBJECTS: any proper action possible?
> >
> > Those are probably large directories; each omap key is a file/subdir
> > in the directory.
> >
> > Normally the mds fragments dirs across several objects, so you
> > shouldn't have a huge number of omap entries in any single object.
> > Do you have some custom mds config in this area?
> >
> > -- dan
> >
> > On Wed, Aug 25, 2021 at 2:01 PM Frank Schilder <frans@xxxxxx> wrote:
> > >
> > > Hi Dan,
> > >
> > > thanks for looking at this. Here are the lines from health detail and ceph.log:
> > >
> > > [root@gnosis ~]# ceph health detail
> > > HEALTH_WARN 4 large omap objects
> > > LARGE_OMAP_OBJECTS 4 large omap objects
> > >     4 large objects found in pool 'con-fs2-meta1'
> > >     Search the cluster log for 'Large omap object found' for more details.
> > >
> > > The search gives:
> > >
> > > 2021-08-25 11:17:00.675474 osd.21 osd.21 192.168.32.77:6846/12302 651 : cluster [WRN] Large omap object found. Object: 12:373fb013:::1000eec35f5.01000000:head PG: 12.c80dfcec (12.6c) Key count: 216000 Size (bytes): 101520000
> > > 2021-08-25 11:17:06.866726 osd.37 osd.37 192.168.32.77:6850/12306 644 : cluster [WRN] Large omap object found. Object: 12:05982a7e:::1000d7fd167.02800000:head PG: 12.7e5419a0 (12.20) Key count: 2293816 Size (bytes): 1078093520
> > > 2021-08-25 11:17:11.152671 osd.37 osd.37 192.168.32.77:6850/12306 645 : cluster [WRN] Large omap object found. Object: 12:05da1450:::1000e118c0a.00000000:head PG: 12.a285ba0 (12.20) Key count: 220612 Size (bytes): 103687640
> > > 2021-08-25 11:17:36.603664 osd.36 osd.36 192.168.32.75:6848/11882 1243 : cluster [WRN] Large omap object found. Object: 12:0b298d19:::1000eec35f7.04e00000:head PG: 12.98b194d0 (12.50) Key count: 657212 Size (bytes): 308889640
> > >
> > > They are all in the fs meta-data pool.
> > >
> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > ________________________________________
> > > From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> > > Sent: 25 August 2021 13:57:44
> > > To: Frank Schilder
> > > Cc: ceph-users
> > > Subject: Re:  LARGE_OMAP_OBJECTS: any proper action possible?
> > >
> > > Hi Frank,
> > >
> > > Which objects are large? (You should see this in ceph.log when the
> > > large obj was detected).
> > >
> > > -- dan
> > >
> > > On Wed, Aug 25, 2021 at 12:27 PM Frank Schilder <frans@xxxxxx> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I have the notorious "LARGE_OMAP_OBJECTS: 4 large omap objects" warning and am again wondering if there is any proper action one can take except "wait it out and deep-scrub (numerous ceph-users threads)" or "ignore (https://docs.ceph.com/en/latest/rados/operations/health-checks/#large-omap-objects)". Only for RGWs is a proper action described, but mine come from MDSes. Is there any way to ask an MDS to clean up or split the objects?
> > > >
> > > > The disks with the meta-data pool can easily deal with objects of this size. My question is more along the lines of: if I can't do anything anyway, why the warning? If there is a warning, I would assume that one can do something to prevent large omap objects from being created by an MDS in the first place. What is it?
> > > >
> > > > Best regards,
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
> > > > _______________________________________________
> > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



