Re: Ceph cache pool full

Hi Christian,

I set those via CLI:
# ceph osd pool set cephfs_cache target_max_bytes 1099511627776
# ceph osd pool set cephfs_cache target_max_objects 1000000

but manual flushing doesn't appear to work:
# rados -p cephfs_cache cache-flush-evict-all
        1000000046a.00000ca6

It just gets stuck there for a long time.
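
One way to check whether the flush is actually making progress, assuming the
pool name above, is to poll the pool stats from a second shell; if OBJECTS
for cephfs_cache keeps falling, the flush is working:
# watch -n 10 'rados df'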

Any suggestion? Do I need to restart the daemons or reboot the nodes?

Thanks,
Shaw



On Fri, Oct 6, 2017 at 9:31 AM, Christian Balzer <chibi@xxxxxxx> wrote:
On Fri, 6 Oct 2017 09:14:40 -0700 Shawfeng Dong wrote:

> I found the command: rados -p cephfs_cache cache-flush-evict-all
>
That's not what you want/need.
Though it will fix your current "full" issue.

> The documentation
> (http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/) has
> been improved a lot since I last checked it a few weeks ago!
>
The need to set max_bytes and max_objects has been documented for ages
(since Hammer).

more below...

> -Shaw
>
> On Fri, Oct 6, 2017 at 9:10 AM, Shawfeng Dong <shaw@xxxxxxxx> wrote:
>
> > Thanks, Luis.
> >
> > I've just set max_bytes and max_objects:
How?
Editing the conf file won't help until a restart.

> > target_max_objects: 1000000 (1M)
> > target_max_bytes: 1099511627776 (1TB)
>
I'd lower that or the cache_target_full_ratio by another 10%.
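
As commands, assuming the pool above, that would be something like:
# ceph osd pool set cephfs_cache cache_target_full_ratio 0.7
or, shaving roughly 10% off the byte target:
# ceph osd pool set cephfs_cache target_max_bytes 989560464998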

Christian
> >
> > but nothing appears to be happening. Is there a way to force flushing?
> >
> > Thanks,
> > Shaw
> >
> > On Fri, Oct 6, 2017 at 8:55 AM, Luis Periquito <periquito@xxxxxxxxx> wrote:
> >
> >> Not looking at anything else, you didn't set the max_bytes or
> >> max_objects for it to start flushing...
> >>
> >> On Fri, Oct 6, 2017 at 4:49 PM, Shawfeng Dong <shaw@xxxxxxxx> wrote:
> >> > Dear all,
> >> >
> >> > Thanks a lot for the very insightful comments/suggestions!
> >> >
> >> > There are 3 OSD servers in our pilot Ceph cluster, each with 2x 1TB SSDs
> >> > (boot disks), 12x 8TB SATA HDDs and 2x 1.2TB NVMe SSDs. We use the
> >> > bluestore backend, with the first NVMe as the WAL and DB devices for
> >> > OSDs on the HDDs. And we try to create a cache tier out of the second
> >> > NVMes.
> >> >
> >> > Here are the outputs of the commands suggested by David:
> >> >
> >> > 1) # ceph df
> >> > GLOBAL:
> >> >     SIZE     AVAIL     RAW USED     %RAW USED
> >> >     265T      262T        2847G          1.05
> >> > POOLS:
> >> >     NAME                ID     USED      %USED     MAX AVAIL     OBJECTS
> >> >     cephfs_data         1          0          0         248T           0
> >> >     cephfs_metadata     2      8515k          0         248T          24
> >> >     cephfs_cache        3      1381G     100.00            0      355385
> >> >
> >> > 2) # ceph osd df
> >> > ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL  %USE  VAR  PGS
> >> >  0   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 174
> >> >  1   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 169
> >> >  2   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 173
> >> >  3   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 159
> >> >  4   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 173
> >> >  5   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 162
> >> >  6   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 149
> >> >  7   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 179
> >> >  8   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 163
> >> >  9   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 194
> >> > 10   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 185
> >> > 11   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 168
> >> > 36  nvme 1.09149  1.00000 1117G  855G   262G 76.53 73.01  79
> >> > 12   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 180
> >> > 13   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 168
> >> > 14   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 178
> >> > 15   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 170
> >> > 16   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 149
> >> > 17   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 203
> >> > 18   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 173
> >> > 19   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 158
> >> > 20   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 154
> >> > 21   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 160
> >> > 22   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 167
> >> > 23   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 188
> >> > 37  nvme 1.09149  1.00000 1117G 1061G 57214M 95.00 90.63  98
> >> > 24   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 187
> >> > 25   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 200
> >> > 26   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 147
> >> > 27   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 171
> >> > 28   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 162
> >> > 29   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 152
> >> > 30   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 174
> >> > 31   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 176
> >> > 32   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 182
> >> > 33   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 155
> >> > 34   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 166
> >> > 35   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 176
> >> > 38  nvme 1.09149  1.00000 1117G  857G   260G 76.71 73.18  79
> >> >                     TOTAL  265T 2847G   262T  1.05
> >> > MIN/MAX VAR: 0.03/90.63  STDDEV: 22.81
> >> >
> >> > 3) # ceph osd tree
> >> > ID CLASS WEIGHT    TYPE NAME            STATUS REWEIGHT PRI-AFF
> >> > -1       265.29291 root default
> >> > -3        88.43097     host pulpo-osd01
> >> >  0   hdd   7.27829         osd.0            up  1.00000 1.00000
> >> >  1   hdd   7.27829         osd.1            up  1.00000 1.00000
> >> >  2   hdd   7.27829         osd.2            up  1.00000 1.00000
> >> >  3   hdd   7.27829         osd.3            up  1.00000 1.00000
> >> >  4   hdd   7.27829         osd.4            up  1.00000 1.00000
> >> >  5   hdd   7.27829         osd.5            up  1.00000 1.00000
> >> >  6   hdd   7.27829         osd.6            up  1.00000 1.00000
> >> >  7   hdd   7.27829         osd.7            up  1.00000 1.00000
> >> >  8   hdd   7.27829         osd.8            up  1.00000 1.00000
> >> >  9   hdd   7.27829         osd.9            up  1.00000 1.00000
> >> > 10   hdd   7.27829         osd.10           up  1.00000 1.00000
> >> > 11   hdd   7.27829         osd.11           up  1.00000 1.00000
> >> > 36  nvme   1.09149         osd.36           up  1.00000 1.00000
> >> > -5        88.43097     host pulpo-osd02
> >> > 12   hdd   7.27829         osd.12           up  1.00000 1.00000
> >> > 13   hdd   7.27829         osd.13           up  1.00000 1.00000
> >> > 14   hdd   7.27829         osd.14           up  1.00000 1.00000
> >> > 15   hdd   7.27829         osd.15           up  1.00000 1.00000
> >> > 16   hdd   7.27829         osd.16           up  1.00000 1.00000
> >> > 17   hdd   7.27829         osd.17           up  1.00000 1.00000
> >> > 18   hdd   7.27829         osd.18           up  1.00000 1.00000
> >> > 19   hdd   7.27829         osd.19           up  1.00000 1.00000
> >> > 20   hdd   7.27829         osd.20           up  1.00000 1.00000
> >> > 21   hdd   7.27829         osd.21           up  1.00000 1.00000
> >> > 22   hdd   7.27829         osd.22           up  1.00000 1.00000
> >> > 23   hdd   7.27829         osd.23           up  1.00000 1.00000
> >> > 37  nvme   1.09149         osd.37           up  1.00000 1.00000
> >> > -7        88.43097     host pulpo-osd03
> >> > 24   hdd   7.27829         osd.24           up  1.00000 1.00000
> >> > 25   hdd   7.27829         osd.25           up  1.00000 1.00000
> >> > 26   hdd   7.27829         osd.26           up  1.00000 1.00000
> >> > 27   hdd   7.27829         osd.27           up  1.00000 1.00000
> >> > 28   hdd   7.27829         osd.28           up  1.00000 1.00000
> >> > 29   hdd   7.27829         osd.29           up  1.00000 1.00000
> >> > 30   hdd   7.27829         osd.30           up  1.00000 1.00000
> >> > 31   hdd   7.27829         osd.31           up  1.00000 1.00000
> >> > 32   hdd   7.27829         osd.32           up  1.00000 1.00000
> >> > 33   hdd   7.27829         osd.33           up  1.00000 1.00000
> >> > 34   hdd   7.27829         osd.34           up  1.00000 1.00000
> >> > 35   hdd   7.27829         osd.35           up  1.00000 1.00000
> >> > 38  nvme   1.09149         osd.38           up  1.00000 1.00000
> >> >
> >> > 4) # ceph osd pool get cephfs_cache all
> >> > min_size: 2
> >> > crash_replay_interval: 0
> >> > pg_num: 128
> >> > pgp_num: 128
> >> > crush_rule: pulpo_nvme
> >> > hashpspool: true
> >> > nodelete: false
> >> > nopgchange: false
> >> > nosizechange: false
> >> > write_fadvise_dontneed: false
> >> > noscrub: false
> >> > nodeep-scrub: false
> >> > hit_set_type: bloom
> >> > hit_set_period: 14400
> >> > hit_set_count: 12
> >> > hit_set_fpp: 0.05
> >> > use_gmt_hitset: 1
> >> > auid: 0
> >> > target_max_objects: 0
> >> > target_max_bytes: 0
> >> > cache_target_dirty_ratio: 0.4
> >> > cache_target_dirty_high_ratio: 0.6
> >> > cache_target_full_ratio: 0.8
> >> > cache_min_flush_age: 0
> >> > cache_min_evict_age: 0
> >> > min_read_recency_for_promote: 0
> >> > min_write_recency_for_promote: 0
> >> > fast_read: 0
> >> > hit_set_grade_decay_rate: 0
> >> >
> >> > Do you see anything wrong? We had written some small files to the CephFS
> >> > before we tried to write the big 1TB file. What is puzzling to me is that
> >> > no data has been written back to the data pool.
> >> >
> >> > Best,
> >> > Shaw
> >> >
> >> > On Fri, Oct 6, 2017 at 6:46 AM, David Turner <drakonstein@xxxxxxxxx> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On Fri, Oct 6, 2017, 1:05 AM Christian Balzer <chibi@xxxxxxx> wrote:
> >> >>>
> >> >>>
> >> >>> Hello,
> >> >>>
> >> >>> On Fri, 06 Oct 2017 03:30:41 +0000 David Turner wrote:
> >> >>>
> >> >>> > You're missing most all of the important bits. What the osds in your
> >> >>> > cluster look like, your tree, and your cache pool settings.
> >> >>> >
> >> >>> > ceph df
> >> >>> > ceph osd df
> >> >>> > ceph osd tree
> >> >>> > ceph osd pool get cephfs_cache all
> >> >>> >
> >> >>> Especially the last one.
> >> >>>
> >> >>> My money is on not having set target_max_objects and target_max_bytes
> >> >>> to sensible values along with the ratios.
> >> >>> In short, not having read the (albeit spotty) documentation.
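
For reference, the knobs Christian is referring to are set per pool, shown
here with placeholder values; they appear in the pool dump further down:
# ceph osd pool set <cachepool> target_max_bytes <bytes>
# ceph osd pool set <cachepool> target_max_objects <count>
# ceph osd pool set <cachepool> cache_target_dirty_ratio 0.4
# ceph osd pool set <cachepool> cache_target_full_ratio 0.8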
> >> >>>
> >> >>> > You have your writeback cache on 3 nvme drives. It looks like you
> >> >>> > have 1.6TB available between them for the cache. I don't know the
> >> >>> > behavior of a writeback cache tier on cephfs for large files, but I
> >> >>> > would guess that it can only hold full files and not flush partial
> >> >>> > files.
> >> >>>
> >> >>> I VERY much doubt that, if so it would be a massive flaw.
> >> >>> One assumes that cache operations work on the RADOS object level, no
> >> >>> matter what.
> >> >>
> >> >> I hope that it is on the rados level, but not a single object had been
> >> >> flushed to the backing pool. So I hazarded a guess. Seeing his settings
> >> >> will shed more light.
> >> >>>
> >> >>>
> >> >>> > That would mean your cache needs to have enough space for any file
> >> >>> > being written to the cluster. In this case a 1.3TB file with 3x
> >> >>> > replication would require 3.9TB (more than double what you have
> >> >>> > available) of available space in your writeback cache.
> >> >>> >
> >> >>> > There are very few use cases that benefit from a cache tier. The
> >> >>> > docs for Luminous warn as much.
> >> >>> You keep repeating that like a broken record.
> >> >>>
> >> >>> And while certainly not false, I for one wouldn't be able to use
> >> >>> (justify using) Ceph w/o cache tiers in our main use case.
> >> >>>
> >> >>>
> >> >>> In this case I assume they were following an old cheat sheet or such,
> >> >>> suggesting the previously required cache tier with EC pools.
> >> >>
> >> >>
> >> >> http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/
> >> >>
> >> >> I know I keep repeating it, especially recently as there have been a
> >> >> lot of people asking about it. The Luminous docs added a large section
> >> >> about how it is probably not what you want. Like me, it is not saying
> >> >> that there are no use cases for it. There was no information provided
> >> >> about the use case and I made some suggestions/guesses. I'm also
> >> >> guessing that they are following a guide where a writeback cache was
> >> >> necessary for CephFS to use EC prior to Luminous. I also usually add
> >> >> that people should test it out and find what works best for them. I
> >> >> will always defer to your practical use of cache tiers as well,
> >> >> especially when using rbds.
> >> >>
> >> >> I manage a cluster that I intend to continue running a writeback cache
> >> >> in front of CephFS on the same drives as the EC pool. The use case
> >> >> receives a good enough benefit from the cache tier that it isn't even
> >> >> required to use flash media to see it. It is used for video editing and
> >> >> the files are usually modified and read within the first 24 hours and
> >> >> then left in cold storage until deleted. I have the cache timed to keep
> >> >> everything in it for 24 hours and then evict it, by using a minimum
> >> >> time to flush and evict at 24 hours and a target max bytes of 0. All
> >> >> files are in there for that time and then it never has to decide what
> >> >> to keep, as it doesn't keep anything longer than that. Luckily read
> >> >> performance from cold storage is not a requirement of this cluster, as
> >> >> any read operation has to first read it from EC storage, write it to
> >> >> replica storage, and then read it from replica storage... Yuck.
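
A sketch of the time-based policy David describes above, with a hypothetical
pool name (86400 seconds = 24 hours); the last line mirrors his "target max
bytes of 0":
# ceph osd pool set my_cache cache_min_flush_age 86400
# ceph osd pool set my_cache cache_min_evict_age 86400
# ceph osd pool set my_cache target_max_bytes 0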
> >> >>>
> >> >>>
> >> >>> Christian
> >> >>>
> >> >>> > What is your goal by implementing this cache? If the answer is to
> >> >>> > utilize extra space on the nvmes, then just remove it and say thank
> >> >>> > you. The better use of nvmes in that case is as part of the
> >> >>> > bluestore stack, giving your osds larger DB partitions. Keeping your
> >> >>> > metadata pool on nvmes is still a good idea.
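
For illustration, giving an OSD a larger DB partition on NVMe at creation
time might look like this with ceph-volume (device names hypothetical):
# ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1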
> >> >>> >
> >> >>> > On Thu, Oct 5, 2017, 7:45 PM Shawfeng Dong <shaw@xxxxxxxx> wrote:
> >> >>> >
> >> >>> > > Dear all,
> >> >>> > >
> >> >>> > > We just set up a Ceph cluster, running the latest stable release
> >> >>> > > Ceph v12.2.0 (Luminous):
> >> >>> > > # ceph --version
> >> >>> > > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
> >> >>> > >
> >> >>> > > The goal is to serve a Ceph filesystem, for which we created 3
> >> >>> > > pools:
> >> >>> > > # ceph osd lspools
> >> >>> > > 1 cephfs_data,2 cephfs_metadata,3 cephfs_cache,
> >> >>> > > where
> >> >>> > > * cephfs_data is the data pool (36 OSDs on HDDs), which is
> >> >>> > >   erasure-coded;
> >> >>> > > * cephfs_metadata is the metadata pool;
> >> >>> > > * cephfs_cache is the cache tier (3 OSDs on NVMes) for cephfs_data.
> >> >>> > >   The cache-mode is writeback.
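
For context, the standard writeback-tier wiring from the cache-tiering docs,
using the pool names above, is:
# ceph osd tier add cephfs_data cephfs_cache
# ceph osd tier cache-mode cephfs_cache writeback
# ceph osd tier set-overlay cephfs_data cephfs_cache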
> >> >>> > >
> >> >>> > > Everything had worked fine, until today when we tried to copy a
> >> >>> > > 1.3TB file to the CephFS. We got the "No space left on device"
> >> >>> > > error!
> >> >>> > >
> >> >>> > > 'ceph -s' says some OSDs are full:
> >> >>> > > # ceph -s
> >> >>> > >   cluster:
> >> >>> > >     id:     e18516bf-39cb-4670-9f13-88ccb7d19769
> >> >>> > >     health: HEALTH_ERR
> >> >>> > >             full flag(s) set
> >> >>> > >             1 full osd(s)
> >> >>> > >             1 pools have many more objects per pg than average
> >> >>> > >
> >> >>> > >   services:
> >> >>> > >     mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
> >> >>> > >     mgr: pulpo-mds01(active), standbys: pulpo-admin, pulpo-mon01
> >> >>> > >     mds: pulpos-1/1/1 up  {0=pulpo-mds01=up:active}
> >> >>> > >     osd: 39 osds: 39 up, 39 in
> >> >>> > >          flags full
> >> >>> > >
> >> >>> > >   data:
> >> >>> > >     pools:   3 pools, 2176 pgs
> >> >>> > >     objects: 347k objects, 1381 GB
> >> >>> > >     usage:   2847 GB used, 262 TB / 265 TB avail
> >> >>> > >     pgs:     2176 active+clean
> >> >>> > >
> >> >>> > >   io:
> >> >>> > >     client:   19301 kB/s rd, 2935 op/s rd, 0 op/s wr
> >> >>> > >
> >> >>> > > And indeed the cache pool is full:
> >> >>> > > # rados df
> >> >>> > > POOL_NAME       USED  OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED   RD_OPS    RD  WR_OPS     WR
> >> >>> > > cephfs_cache    1381G  355385      0 710770                  0       0        0 10004954 1522G 1398063  1611G
> >> >>> > > cephfs_data         0       0      0      0                  0       0        0        0     0       0      0
> >> >>> > > cephfs_metadata 8515k      24      0     72                  0       0        0        3  3072    3953 10541k
> >> >>> > >
> >> >>> > > total_objects    355409
> >> >>> > > total_used       2847G
> >> >>> > > total_avail      262T
> >> >>> > > total_space      265T
> >> >>> > >
> >> >>> > > However, the data pool is completely empty! So it seems that data
> >> >>> > > has only been written to the cache pool, but not written back to
> >> >>> > > the data pool.
> >> >>> > >
> >> >>> > > I am really at a loss whether this is due to a setup error on my
> >> >>> > > part, or a Luminous bug. Could anyone shed some light on this?
> >> >>> > > Please let me know if you need any further info.
> >> >>> > >
> >> >>> > > Best,
> >> >>> > > Shaw
> >> >>> > >
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Christian Balzer        Network/Systems Engineer
> >> >>> chibi@xxxxxxx           Rakuten Communications
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >>
> >
> >


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
