On Fri, 6 Oct 2017 16:55:31 +0100 Luis Periquito wrote:

> Not looking at anything else, you didn't set the max_bytes or
> max_objects for it to start flushing...
>
Precisely!
He says, cackling, as he goes to cash in his bet. ^o^

> On Fri, Oct 6, 2017 at 4:49 PM, Shawfeng Dong <shaw@xxxxxxxx> wrote:
> > Dear all,
> >
> > Thanks a lot for the very insightful comments/suggestions!
> >
> > There are 3 OSD servers in our pilot Ceph cluster, each with 2x 1TB SSDs
> > (boot disks), 12x 8TB SATA HDDs and 2x 1.2TB NVMe SSDs. We use the
> > bluestore backend, with the first NVMe as the WAL and DB device for the
> > OSDs on the HDDs, and we are trying to create a cache tier out of the
> > second NVMes.
> >
> > Here are the outputs of the commands suggested by David:
> >
> > 1) # ceph df
> > GLOBAL:
> >     SIZE     AVAIL     RAW USED     %RAW USED
> >     265T     262T        2847G          1.05
> > POOLS:
> >     NAME                ID     USED      %USED     MAX AVAIL     OBJECTS
> >     cephfs_data         1          0         0          248T           0
> >     cephfs_metadata     2      8515k         0          248T          24
> >     cephfs_cache        3      1381G    100.00             0      355385
> >
> > 2) # ceph osd df
> >  0   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  174
> >  1   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  169
> >  2   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  173
> >  3   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  159
> >  4   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  173
> >  5   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  162
> >  6   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  149
> >  7   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  179
> >  8   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  163
> >  9   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  194
> > 10   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  185
> > 11   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  168
> > 36  nvme  1.09149  1.00000  1117G   855G   262G   76.53  73.01  79
> > 12   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  180
> > 13   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  168
> > 14   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  178
> > 15   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  170
> > 16   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  149
> > 17   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  203
> > 18   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  173
> > 19   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  158
> > 20   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  154
> > 21   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  160
> > 22   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  167
> > 23   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  188
> > 37  nvme  1.09149  1.00000  1117G  1061G  57214M  95.00  90.63  98
> > 24   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  187
> > 25   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  200
> > 26   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  147
> > 27   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  171
> > 28   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  162
> > 29   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  152
> > 30   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  174
> > 31   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  176
> > 32   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  182
> > 33   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  155
> > 34   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  166
> > 35   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  176
> > 38  nvme  1.09149  1.00000  1117G   857G   260G   76.71  73.18  79
> >                    TOTAL    265T   2847G   262T   1.05
> > MIN/MAX VAR: 0.03/90.63  STDDEV: 22.81
> >
> > 3) # ceph osd tree
> > -1       265.29291 root default
> > -3        88.43097     host pulpo-osd01
> >  0   hdd   7.27829         osd.0        up  1.00000  1.00000
> >  1   hdd   7.27829         osd.1        up  1.00000  1.00000
> >  2   hdd   7.27829         osd.2        up  1.00000  1.00000
> >  3   hdd   7.27829         osd.3        up  1.00000  1.00000
> >  4   hdd   7.27829         osd.4        up  1.00000  1.00000
> >  5   hdd   7.27829         osd.5        up  1.00000  1.00000
> >  6   hdd   7.27829         osd.6        up  1.00000  1.00000
> >  7   hdd   7.27829         osd.7        up  1.00000  1.00000
> >  8   hdd   7.27829         osd.8        up  1.00000  1.00000
> >  9   hdd   7.27829         osd.9        up  1.00000  1.00000
> > 10   hdd   7.27829         osd.10       up  1.00000  1.00000
> > 11   hdd   7.27829         osd.11       up  1.00000  1.00000
> > 36  nvme   1.09149         osd.36       up  1.00000  1.00000
> > -5        88.43097     host pulpo-osd02
> > 12   hdd   7.27829         osd.12       up  1.00000  1.00000
> > 13   hdd   7.27829         osd.13       up  1.00000  1.00000
> > 14   hdd   7.27829         osd.14       up  1.00000  1.00000
> > 15   hdd   7.27829         osd.15       up  1.00000  1.00000
> > 16   hdd   7.27829         osd.16       up  1.00000  1.00000
> > 17   hdd   7.27829         osd.17       up  1.00000  1.00000
> > 18   hdd   7.27829         osd.18       up  1.00000  1.00000
> > 19   hdd   7.27829         osd.19       up  1.00000  1.00000
> > 20   hdd   7.27829         osd.20       up  1.00000  1.00000
> > 21   hdd   7.27829         osd.21       up  1.00000  1.00000
> > 22   hdd   7.27829         osd.22       up  1.00000  1.00000
> > 23   hdd   7.27829         osd.23       up  1.00000  1.00000
> > 37  nvme   1.09149         osd.37       up  1.00000  1.00000
> > -7        88.43097     host pulpo-osd03
> > 24   hdd   7.27829         osd.24       up  1.00000  1.00000
> > 25   hdd   7.27829         osd.25       up  1.00000  1.00000
> > 26   hdd   7.27829         osd.26       up  1.00000  1.00000
> > 27   hdd   7.27829         osd.27       up  1.00000  1.00000
> > 28   hdd   7.27829         osd.28       up  1.00000  1.00000
> > 29   hdd   7.27829         osd.29       up  1.00000  1.00000
> > 30   hdd   7.27829         osd.30       up  1.00000  1.00000
> > 31   hdd   7.27829         osd.31       up  1.00000  1.00000
> > 32   hdd   7.27829         osd.32       up  1.00000  1.00000
> > 33   hdd   7.27829         osd.33       up  1.00000  1.00000
> > 34   hdd   7.27829         osd.34       up  1.00000  1.00000
> > 35   hdd   7.27829         osd.35       up  1.00000  1.00000
> > 38  nvme   1.09149         osd.38       up  1.00000  1.00000
> >
> > 4) # ceph osd pool get cephfs_cache all
> > min_size: 2
> > crash_replay_interval: 0
> > pg_num: 128
> > pgp_num: 128
> > crush_rule: pulpo_nvme
> > hashpspool: true
> > nodelete: false
> > nopgchange: false
> > nosizechange: false
> > write_fadvise_dontneed: false
> > noscrub: false
> > nodeep-scrub: false
> > hit_set_type: bloom
> > hit_set_period: 14400
> > hit_set_count: 12
> > hit_set_fpp: 0.05
> > use_gmt_hitset: 1
> > auid: 0
> > target_max_objects: 0
> > target_max_bytes: 0
> > cache_target_dirty_ratio: 0.4
> > cache_target_dirty_high_ratio: 0.6
> > cache_target_full_ratio: 0.8
> > cache_min_flush_age: 0
> > cache_min_evict_age: 0
> > min_read_recency_for_promote: 0
> > min_write_recency_for_promote: 0
> > fast_read: 0
> > hit_set_grade_decay_rate: 0
> >
> > Do you see anything wrong? We had written some small files to the CephFS
> > before we tried to write the big 1TB file. What is puzzling to me is that
> > no data has been written back to the data pool.
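
For the record, that output confirms it: target_max_objects and target_max_bytes
are both 0, and the cache_target_dirty/full ratios are interpreted relative to
those targets, so the tiering agent never has a threshold to flush or evict
against. Something along these lines should give it one; the values are purely
illustrative and need to be sized to what the NVMe cache pool can actually hold:

# ceph osd pool set cephfs_cache target_max_bytes 1000000000000
# ceph osd pool set cephfs_cache target_max_objects 1000000

With a 1TB target_max_bytes, the existing cache_target_dirty_ratio of 0.4 means
flushing should start at roughly 400GB of dirty data, and the full ratio of 0.8
should cap the pool at about 800GB before eviction kicks in. If the pool is
already wedged full, a manual "rados -p cephfs_cache cache-flush-evict-all"
(or simply letting the agent catch up) should drain it once the targets are set.
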
> >
> > Best,
> > Shaw
> >
> > On Fri, Oct 6, 2017 at 6:46 AM, David Turner <drakonstein@xxxxxxxxx> wrote:
> >>
> >> On Fri, Oct 6, 2017, 1:05 AM Christian Balzer <chibi@xxxxxxx> wrote:
> >>>
> >>> Hello,
> >>>
> >>> On Fri, 06 Oct 2017 03:30:41 +0000 David Turner wrote:
> >>>
> >>> > You're missing most all of the important bits. What the osds in your
> >>> > cluster look like, your tree, and your cache pool settings.
> >>> >
> >>> > ceph df
> >>> > ceph osd df
> >>> > ceph osd tree
> >>> > ceph osd pool get cephfs_cache all
> >>> >
> >>> Especially the last one.
> >>>
> >>> My money is on not having set target_max_objects and target_max_bytes to
> >>> sensible values along with the ratios.
> >>> In short, not having read the (albeit spotty) documentation.
> >>>
> >>> > You have your writeback cache on 3 nvme drives. It looks like you have
> >>> > 1.6TB available between them for the cache. I don't know the behavior
> >>> > of a writeback cache tier on cephfs for large files, but I would guess
> >>> > that it can only hold full files and not flush partial files.
> >>>
> >>> I VERY much doubt that, if so it would be a massive flaw.
> >>> One assumes that cache operations work on the RADOS object level, no
> >>> matter what.
> >>
> >> I hope that it is on the rados level, but not a single object had been
> >> flushed to the backing pool, so I hazarded a guess. Seeing his settings
> >> will shed more light.
> >>>
> >>> > That would mean your cache needs to have enough space for any file
> >>> > being written to the cluster. In this case a 1.3TB file with 3x
> >>> > replication would require 3.9TB (more than double what you have
> >>> > available) of available space in your writeback cache.
> >>> >
> >>> > There are very few use cases that benefit from a cache tier. The docs
> >>> > for Luminous warn as much.
> >>> You keep repeating that like a broken record.
> >>>
> >>> And while certainly not false, I for one wouldn't be able to use (justify
> >>> using) Ceph w/o cache tiers in our main use case.
> >>>
> >>> In this case I assume they were following an old cheat sheet or such,
> >>> suggesting the previously required cache tier with EC pools.
> >>
> >> http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/
> >>
> >> I know I keep repeating it, especially recently as there have been a lot
> >> of people asking about it. The Luminous docs added a large section about
> >> how it is probably not what you want. Like me, the docs are not saying
> >> that there are no use cases for it. There was no information provided
> >> about the use case, so I made some suggestions/guesses. I'm also guessing
> >> that they are following a guide where a writeback cache was necessary for
> >> CephFS to use EC prior to Luminous. I also usually add that people should
> >> test it out and find what works best for them. I will always defer to your
> >> practical use of cache tiers as well, especially when using RBDs.
> >>
> >> I manage a cluster on which I intend to continue running a writeback cache
> >> in front of CephFS on the same drives as the EC pool. The use case receives
> >> a good enough benefit from the cache tier that it isn't even required to
> >> use flash media to see it. It is used for video editing and the files are
> >> usually modified and read within the first 24 hours and then left in cold
> >> storage until deleted.
> >> I have the cache timed to keep everything in it for 24 hours and then
> >> evict it, by using a minimum time to flush and evict of 24 hours and a
> >> target max bytes of 0. All files are in there for that time, and then it
> >> never has to decide what to keep, as it doesn't keep anything longer than
> >> that. Luckily, read performance from cold storage is not a requirement of
> >> this cluster, as any read operation has to first read the data from EC
> >> storage, write it to replica storage, and then read it from replica
> >> storage... Yuck.
> >>>
> >>> Christian
> >>>
> >>> > What is your goal in implementing this cache? If the answer is to
> >>> > utilize extra space on the nvmes, then just remove it and say thank
> >>> > you. The better use of nvmes in that case is as part of the bluestore
> >>> > stack, giving your osds larger DB partitions. Keeping your metadata
> >>> > pool on nvmes is still a good idea.
> >>> >
> >>> > On Thu, Oct 5, 2017, 7:45 PM Shawfeng Dong <shaw@xxxxxxxx> wrote:
> >>> >
> >>> > > Dear all,
> >>> > >
> >>> > > We just set up a Ceph cluster, running the latest stable release Ceph
> >>> > > v12.2.0 (Luminous):
> >>> > > # ceph --version
> >>> > > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
> >>> > >
> >>> > > The goal is to serve a Ceph filesystem, for which we created 3 pools:
> >>> > > # ceph osd lspools
> >>> > > 1 cephfs_data,2 cephfs_metadata,3 cephfs_cache,
> >>> > > where
> >>> > > * cephfs_data is the data pool (36 OSDs on HDDs), which is erasure-coded;
> >>> > > * cephfs_metadata is the metadata pool;
> >>> > > * cephfs_cache is the cache tier (3 OSDs on NVMes) for cephfs_data.
> >>> > >   The cache-mode is writeback.
> >>> > >
> >>> > > Everything had worked fine, until today when we tried to copy a 1.3TB
> >>> > > file to the CephFS. We got the "No space left on device" error!
> >>> > >
> >>> > > 'ceph -s' says some OSDs are full:
> >>> > > # ceph -s
> >>> > >   cluster:
> >>> > >     id:     e18516bf-39cb-4670-9f13-88ccb7d19769
> >>> > >     health: HEALTH_ERR
> >>> > >             full flag(s) set
> >>> > >             1 full osd(s)
> >>> > >             1 pools have many more objects per pg than average
> >>> > >
> >>> > >   services:
> >>> > >     mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
> >>> > >     mgr: pulpo-mds01(active), standbys: pulpo-admin, pulpo-mon01
> >>> > >     mds: pulpos-1/1/1 up {0=pulpo-mds01=up:active}
> >>> > >     osd: 39 osds: 39 up, 39 in
> >>> > >          flags full
> >>> > >
> >>> > >   data:
> >>> > >     pools:   3 pools, 2176 pgs
> >>> > >     objects: 347k objects, 1381 GB
> >>> > >     usage:   2847 GB used, 262 TB / 265 TB avail
> >>> > >     pgs:     2176 active+clean
> >>> > >
> >>> > >   io:
> >>> > >     client:   19301 kB/s rd, 2935 op/s rd, 0 op/s wr
> >>> > >
> >>> > > And indeed the cache pool is full:
> >>> > > # rados df
> >>> > > POOL_NAME       USED   OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS   RD    WR_OPS  WR
> >>> > > cephfs_cache    1381G   355385      0 710770                  0       0        0 10004954 1522G 1398063 1611G
> >>> > > cephfs_data         0        0      0      0                  0       0        0        0     0       0     0
> >>> > > cephfs_metadata 8515k       24      0     72                  0       0        0        3  3072    3953 10541k
> >>> > >
> >>> > > total_objects    355409
> >>> > > total_used       2847G
> >>> > > total_avail      262T
> >>> > > total_space      265T
> >>> > >
> >>> > > However, the data pool is completely empty!
> >>> > > So it seems that data has only been written to the cache pool, but
> >>> > > not written back to the data pool.
> >>> > >
> >>> > > I am really at a loss whether this is due to a setup error on my
> >>> > > part, or a Luminous bug. Could anyone shed some light on this?
> >>> > > Please let me know if you need any further info.
> >>> > >
> >>> > > Best,
> >>> > > Shaw
> >>>
> >>> --
> >>> Christian Balzer        Network/Systems Engineer
> >>> chibi@xxxxxxx           Rakuten Communications
> >>
> >
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com