I found the command: rados -p cephfs_cache cache-flush-evict-all
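For the archives, there is also a non-blocking variant and per-object forms if something gets stuck, e.g.:
  rados -p cephfs_cache cache-try-flush-evict-all
  rados -p cephfs_cache cache-flush <object-name>
  rados -p cephfs_cache cache-evict <object-name>
(<object-name> is a placeholder, of course.)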
The documentation (http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/) has been improved a lot since I last checked it a few weeks ago!
-Shaw
On Fri, Oct 6, 2017 at 9:10 AM, Shawfeng Dong <shaw@xxxxxxxx> wrote:
Thanks, Luis. I've just set max_bytes and max_objects:
target_max_objects: 1000000 (1M)
target_max_bytes: 1099511627776 (1TB)
but nothing appears to be happening. Is there a way to force flushing?

Thanks,
Shaw

On Fri, Oct 6, 2017 at 8:55 AM, Luis Periquito <periquito@xxxxxxxxx> wrote:
Not looking at anything else, you didn't set the max_bytes or
max_objects for it to start flushing...
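(They're per-pool settings, e.g. something along the lines of:
  ceph osd pool set cephfs_cache target_max_bytes 1000000000000   # ~1 TB, size it to your tier
  ceph osd pool set cephfs_cache target_max_objects 1000000
The tiering agent sizes its flush/evict thresholds relative to those targets, so with both left at 0 it never kicks in.)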
On Fri, Oct 6, 2017 at 4:49 PM, Shawfeng Dong <shaw@xxxxxxxx> wrote:
> Dear all,
>
> Thanks a lot for the very insightful comments/suggestions!
>
> There are 3 OSD servers in our pilot Ceph cluster, each with 2x 1TB SSDs
> (boot disks), 12x 8TB SATA HDDs and 2x 1.2TB NVMe SSDs. We use the bluestore
> backend, with the first NVMe as the WAL and DB device for the OSDs on the HDDs,
> and we created a cache tier out of the second NVMe in each server.
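> (For reference, each HDD OSD was created along these lines -- exact device
> names here are illustrative:
>   ceph-disk prepare --bluestore --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1 /dev/sdb
> i.e. the DB and WAL for the 12 HDD OSDs on a host live on the first NVMe.)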
>
> Here are the outputs of the commands suggested by David:
>
> 1) # ceph df
> GLOBAL:
> SIZE AVAIL RAW USED %RAW USED
> 265T 262T 2847G 1.05
> POOLS:
> NAME ID USED %USED MAX AVAIL OBJECTS
> cephfs_data 1 0 0 248T 0
> cephfs_metadata 2 8515k 0 248T 24
> cephfs_cache 3 1381G 100.00 0 355385
>
> 2) # ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR   PGS
> 0 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 174
> 1 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 169
> 2 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 173
> 3 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 159
> 4 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 173
> 5 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 162
> 6 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 149
> 7 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 179
> 8 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 163
> 9 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 194
> 10 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 185
> 11 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 168
> 36 nvme 1.09149 1.00000 1117G 855G 262G 76.53 73.01 79
> 12 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 180
> 13 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 168
> 14 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 178
> 15 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 170
> 16 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 149
> 17 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 203
> 18 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 173
> 19 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 158
> 20 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 154
> 21 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 160
> 22 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 167
> 23 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 188
> 37 nvme 1.09149 1.00000 1117G 1061G 57214M 95.00 90.63 98
> 24 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 187
> 25 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 200
> 26 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 147
> 27 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 171
> 28 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 162
> 29 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 152
> 30 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 174
> 31 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 176
> 32 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 182
> 33 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 155
> 34 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 166
> 35 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 176
> 38 nvme 1.09149 1.00000 1117G 857G 260G 76.71 73.18 79
> TOTAL 265T 2847G 262T 1.05
> MIN/MAX VAR: 0.03/90.63 STDDEV: 22.81
>
> 3) # ceph osd tree
> -1 265.29291 root default
> -3 88.43097 host pulpo-osd01
> 0 hdd 7.27829 osd.0 up 1.00000 1.00000
> 1 hdd 7.27829 osd.1 up 1.00000 1.00000
> 2 hdd 7.27829 osd.2 up 1.00000 1.00000
> 3 hdd 7.27829 osd.3 up 1.00000 1.00000
> 4 hdd 7.27829 osd.4 up 1.00000 1.00000
> 5 hdd 7.27829 osd.5 up 1.00000 1.00000
> 6 hdd 7.27829 osd.6 up 1.00000 1.00000
> 7 hdd 7.27829 osd.7 up 1.00000 1.00000
> 8 hdd 7.27829 osd.8 up 1.00000 1.00000
> 9 hdd 7.27829 osd.9 up 1.00000 1.00000
> 10 hdd 7.27829 osd.10 up 1.00000 1.00000
> 11 hdd 7.27829 osd.11 up 1.00000 1.00000
> 36 nvme 1.09149 osd.36 up 1.00000 1.00000
> -5 88.43097 host pulpo-osd02
> 12 hdd 7.27829 osd.12 up 1.00000 1.00000
> 13 hdd 7.27829 osd.13 up 1.00000 1.00000
> 14 hdd 7.27829 osd.14 up 1.00000 1.00000
> 15 hdd 7.27829 osd.15 up 1.00000 1.00000
> 16 hdd 7.27829 osd.16 up 1.00000 1.00000
> 17 hdd 7.27829 osd.17 up 1.00000 1.00000
> 18 hdd 7.27829 osd.18 up 1.00000 1.00000
> 19 hdd 7.27829 osd.19 up 1.00000 1.00000
> 20 hdd 7.27829 osd.20 up 1.00000 1.00000
> 21 hdd 7.27829 osd.21 up 1.00000 1.00000
> 22 hdd 7.27829 osd.22 up 1.00000 1.00000
> 23 hdd 7.27829 osd.23 up 1.00000 1.00000
> 37 nvme 1.09149 osd.37 up 1.00000 1.00000
> -7 88.43097 host pulpo-osd03
> 24 hdd 7.27829 osd.24 up 1.00000 1.00000
> 25 hdd 7.27829 osd.25 up 1.00000 1.00000
> 26 hdd 7.27829 osd.26 up 1.00000 1.00000
> 27 hdd 7.27829 osd.27 up 1.00000 1.00000
> 28 hdd 7.27829 osd.28 up 1.00000 1.00000
> 29 hdd 7.27829 osd.29 up 1.00000 1.00000
> 30 hdd 7.27829 osd.30 up 1.00000 1.00000
> 31 hdd 7.27829 osd.31 up 1.00000 1.00000
> 32 hdd 7.27829 osd.32 up 1.00000 1.00000
> 33 hdd 7.27829 osd.33 up 1.00000 1.00000
> 34 hdd 7.27829 osd.34 up 1.00000 1.00000
> 35 hdd 7.27829 osd.35 up 1.00000 1.00000
> 38 nvme 1.09149 osd.38 up 1.00000 1.00000
>
> 4) # ceph osd pool get cephfs_cache all
> min_size: 2
> crash_replay_interval: 0
> pg_num: 128
> pgp_num: 128
> crush_rule: pulpo_nvme
> hashpspool: true
> nodelete: false
> nopgchange: false
> nosizechange: false
> write_fadvise_dontneed: false
> noscrub: false
> nodeep-scrub: false
> hit_set_type: bloom
> hit_set_period: 14400
> hit_set_count: 12
> hit_set_fpp: 0.05
> use_gmt_hitset: 1
> auid: 0
> target_max_objects: 0
> target_max_bytes: 0
> cache_target_dirty_ratio: 0.4
> cache_target_dirty_high_ratio: 0.6
> cache_target_full_ratio: 0.8
> cache_min_flush_age: 0
> cache_min_evict_age: 0
> min_read_recency_for_promote: 0
> min_write_recency_for_promote: 0
> fast_read: 0
> hit_set_grade_decay_rate: 0
>
> Do you see anything wrong? We had written some small files to the CephFS
> before we tried to write the big 1TB file. What is puzzling to me is that no
> data has been written back to the data pool.
>
> Best,
> Shaw
>
> On Fri, Oct 6, 2017 at 6:46 AM, David Turner <drakonstein@xxxxxxxxx> wrote:
>>
>>
>>
>> On Fri, Oct 6, 2017, 1:05 AM Christian Balzer <chibi@xxxxxxx> wrote:
>>>
>>>
>>> Hello,
>>>
>>> On Fri, 06 Oct 2017 03:30:41 +0000 David Turner wrote:
>>>
>>> > You're missing almost all of the important bits: what the osds in your
>>> > cluster look like, your tree, and your cache pool settings.
>>> >
>>> > ceph df
>>> > ceph osd df
>>> > ceph osd tree
>>> > ceph osd pool get cephfs_cache all
>>> >
>>> Especially the last one.
>>>
>>> My money is on not having set target_max_objects and target_max_bytes to
>>> sensible values along with the ratios.
>>> In short, not having read the (albeit spotty) documentation.
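>>> (Concretely, something like:
>>>   ceph osd pool set cephfs_cache cache_target_dirty_ratio 0.4
>>>   ceph osd pool set cephfs_cache cache_target_full_ratio 0.8
>>> on top of sane target_max_bytes/target_max_objects -- the ratios are
>>> evaluated against those targets. Values above are just an illustration.)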
>>>
>>> > You have your writeback cache on 3 nvme drives. It looks like you have
>>> > 1.6TB available between them for the cache. I don't know the behavior
>>> > of a
>>> > writeback cache tier on cephfs for large files, but I would guess that
>>> > it
>>> > can only hold full files and not flush partial files.
>>>
>>> I VERY much doubt that, if so it would be a massive flaw.
>>> One assumes that cache operations work on the RADOS object level, no
>>> matter what.
>>
>> I hope that it is on the rados level, but not a single object had been
>> flushed to the backing pool. So I hazarded a guess. Seeing his settings will
>> shed more light.
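>> (One quick check, assuming Luminous behaves as I remember: "ceph df detail"
>> shows a DIRTY column for the cache pool, which tells you whether anything is
>> even queued to be flushed.)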
>>>
>>>
>>> > That would mean your
>>> > cache needs to have enough space for any file being written to the
>>> > cluster.
>>> > In this case a 1.3TB file with 3x replication would require 3.9TB (more
>>> > than double what you have available) of available space in your
>>> > writeback
>>> > cache.
>>> >
>>> > There are very few use cases that benefit from a cache tier. The docs
>>> > for
>>> > Luminous warn as much.
>>> You keep repeating that like a broken record.
>>>
>>> And while certainly not false I for one wouldn't be able to use (justify
>>> using) Ceph w/o cache tiers in our main use case.
>>>
>>>
>>> In this case I assume they were following an old cheat sheet or such,
>>> suggesting the previously required cache tier with EC pools.
>>
>>
>> http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/
>>
>> I know I keep repeating it, especially recently as there have been a lot
>> of people asking about it. The Luminous docs added a large section about how
>> it is probably not what you want. Like me, it is not saying that there are
>> no use cases for it. There was no information provided about the use case
>> and I made some suggestions/guesses. I'm also guessing that they are
>> following a guide where a writeback cache was necessary for CephFS to use EC
>> prior to Luminous. I also usually add that people should test it out and
>> find what works best for them. I will always defer to your practical use of
>> cache tiers as well, especially when using rbds.
>>
>> I manage a cluster that I intend to continue running a writeback cache in
>> front of CephFS on the same drives as the EC pool. The use case receives a
>> good enough benefit from the cache tier that it isn't even required to use
>> flash media to see it. It is used for video editing and the files are
>> usually modified and read within the first 24 hours and then left in cold
>> storage until deleted. I have the cache timed to keep everything in it for
>> 24 hours and then evict it by using a minimum time to flush and evict at 24
>> hours and a target max bytes of 0. All files are in there for that time and
>> then it never has to decide what to keep as it doesn't keep anything longer
>> than that. Luckily read performance from cold storage is not a requirement
>> of this cluster as any read operation has to first read it from EC storage,
>> write it to replica storage, and then read it from replica storage... Yuck.
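>> (In terms of knobs, that is roughly:
>>   ceph osd pool set <cachepool> cache_min_flush_age 86400
>>   ceph osd pool set <cachepool> cache_min_evict_age 86400
>> i.e. 24 hours in seconds; <cachepool> stands in for our actual pool name.)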
>>>
>>>
>>> Christian
>>>
>>> > What is your goal by implementing this cache? If the
>>> > answer is to utilize extra space on the nvmes, then just remove it and
>>> > say
>>> > thank you. The better use of nvmes in that case is as part of the
>>> > bluestore stack, giving your osds larger DB partitions. Keeping your
>>> > metadata pool on nvmes is still a good idea.
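>>> > (Pinning the metadata pool to the nvme device class is just a crush rule, e.g.:
>>> >   ceph osd crush rule create-replicated nvme_only default host nvme
>>> >   ceph osd pool set cephfs_metadata crush_rule nvme_only
>>> > the rule name here is made up.)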
>>> >
>>> > On Thu, Oct 5, 2017, 7:45 PM Shawfeng Dong <shaw@xxxxxxxx> wrote:
>>> >
>>> > > Dear all,
>>> > >
>>> > > We just set up a Ceph cluster, running the latest stable release Ceph
>>> > > v12.2.0 (Luminous):
>>> > > # ceph --version
>>> > > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
>>> > > luminous
>>> > > (rc)
>>> > >
>>> > > The goal is to serve Ceph filesystem, for which we created 3 pools:
>>> > > # ceph osd lspools
>>> > > 1 cephfs_data,2 cephfs_metadata,3 cephfs_cache,
>>> > > where
>>> > > * cephfs_data is the data pool (36 OSDs on HDDs), which is erasure-coded;
>>> > > * cephfs_metadata is the metadata pool;
>>> > > * cephfs_cache is the cache tier (3 OSDs on NVMes) for cephfs_data. The
>>> > > cache-mode is writeback.
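>>> > > (The tier was wired up with what I believe is the standard sequence from
>>> > > the cache-tiering docs:
>>> > >   ceph osd tier add cephfs_data cephfs_cache
>>> > >   ceph osd tier cache-mode cephfs_cache writeback
>>> > >   ceph osd tier set-overlay cephfs_data cephfs_cache
>>> > > )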
>>> > >
>>> > > Everything had worked fine, until today when we tried to copy a 1.3TB
>>> > > file
>>> > > to the CephFS. We got the "No space left on device" error!
>>> > >
>>> > > 'ceph -s' says some OSDs are full:
>>> > > # ceph -s
>>> > > cluster:
>>> > > id: e18516bf-39cb-4670-9f13-88ccb7d19769
>>> > > health: HEALTH_ERR
>>> > > full flag(s) set
>>> > > 1 full osd(s)
>>> > > 1 pools have many more objects per pg than average
>>> > >
>>> > > services:
>>> > > mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
>>> > > mgr: pulpo-mds01(active), standbys: pulpo-admin, pulpo-mon01
>>> > > mds: pulpos-1/1/1 up {0=pulpo-mds01=up:active}
>>> > > osd: 39 osds: 39 up, 39 in
>>> > > flags full
>>> > >
>>> > > data:
>>> > > pools: 3 pools, 2176 pgs
>>> > > objects: 347k objects, 1381 GB
>>> > > usage: 2847 GB used, 262 TB / 265 TB avail
>>> > > pgs: 2176 active+clean
>>> > >
>>> > > io:
>>> > > client: 19301 kB/s rd, 2935 op/s rd, 0 op/s wr
>>> > >
>>> > > And indeed the cache pool is full:
>>> > > # rados df
>>> > > POOL_NAME       USED  OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS   RD    WR_OPS  WR
>>> > > cephfs_cache    1381G  355385      0 710770                  0       0        0 10004954 1522G 1398063 1611G
>>> > > cephfs_data         0       0      0      0                  0       0        0        0     0       0 0
>>> > > cephfs_metadata 8515k      24      0     72                  0       0        0        3  3072    3953 10541k
>>> > >
>>> > > total_objects 355409
>>> > > total_used    2847G
>>> > > total_avail   262T
>>> > > total_space   265T
>>> > >
>>> > > However, the data pool is completely empty! So it seems that data has
>>> > > only
>>> > > been written to the cache pool, but not written back to the data
>>> > > pool.
>>> > >
>>> > > I am really at a loss whether this is due to a setup error on my
>>> > > part, or
>>> > > a Luminous bug. Could anyone shed some light on this? Please let me
>>> > > know if
>>> > > you need any further info.
>>> > >
>>> > > Best,
>>> > > Shaw
>>> > > _______________________________________________
>>> > > ceph-users mailing list
>>> > > ceph-users@xxxxxxxxxxxxxx
>>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> > >
>>>
>>>
>>> --
>>> Christian Balzer Network/Systems Engineer
>>> chibi@xxxxxxx Rakuten Communications
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com