Dear all,
Thanks a lot for the very insightful comments/suggestions!
There are 3 OSD servers in our pilot Ceph cluster, each with 2x 1TB SSDs (boot disks), 12x 8TB SATA HDDs and 2x 1.2TB NVMe SSDs. We use the BlueStore backend, with the first NVMe in each server acting as the WAL and DB device for the OSDs on the HDDs, and we created a cache tier out of the second NVMes.
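For context, the HDD OSD layout corresponds roughly to a ceph-volume invocation like the one below (device paths are placeholders; this is only meant to illustrate the layout, not the exact commands we ran):
# ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2
with the DB and WAL partitions carved out of the first NVMe, and the second NVMe set up as a plain BlueStore OSD for the cache pool.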
Here are the outputs of the commands suggested by David:
1) # ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
265T 262T 2847G 1.05
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
cephfs_data 1 0 0 248T 0
cephfs_metadata 2 8515k 0 248T 24
cephfs_cache 3 1381G 100.00 0 355385
2) # ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL  %USE  VAR   PGS
0 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 174
1 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 169
2 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 173
3 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 159
4 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 173
5 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 162
6 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 149
7 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 179
8 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 163
9 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 194
10 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 185
11 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 168
36 nvme 1.09149 1.00000 1117G 855G 262G 76.53 73.01 79
12 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 180
13 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 168
14 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 178
15 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 170
16 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 149
17 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 203
18 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 173
19 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 158
20 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 154
21 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 160
22 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 167
23 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 188
37 nvme 1.09149 1.00000 1117G 1061G 57214M 95.00 90.63 98
24 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 187
25 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 200
26 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 147
27 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 171
28 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 162
29 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 152
30 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 174
31 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 176
32 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 182
33 hdd 7.27829 1.00000 7452G 2072M 7450G 0.03 0.03 155
34 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 166
35 hdd 7.27829 1.00000 7452G 2076M 7450G 0.03 0.03 176
38 nvme 1.09149 1.00000 1117G 857G 260G 76.71 73.18 79
TOTAL 265T 2847G 262T 1.05
MIN/MAX VAR: 0.03/90.63 STDDEV: 22.81
3) # ceph osd tree
ID CLASS WEIGHT    TYPE NAME            STATUS REWEIGHT PRI-AFF
-1 265.29291 root default
-3 88.43097 host pulpo-osd01
0 hdd 7.27829 osd.0 up 1.00000 1.00000
1 hdd 7.27829 osd.1 up 1.00000 1.00000
2 hdd 7.27829 osd.2 up 1.00000 1.00000
3 hdd 7.27829 osd.3 up 1.00000 1.00000
4 hdd 7.27829 osd.4 up 1.00000 1.00000
5 hdd 7.27829 osd.5 up 1.00000 1.00000
6 hdd 7.27829 osd.6 up 1.00000 1.00000
7 hdd 7.27829 osd.7 up 1.00000 1.00000
8 hdd 7.27829 osd.8 up 1.00000 1.00000
9 hdd 7.27829 osd.9 up 1.00000 1.00000
10 hdd 7.27829 osd.10 up 1.00000 1.00000
11 hdd 7.27829 osd.11 up 1.00000 1.00000
36 nvme 1.09149 osd.36 up 1.00000 1.00000
-5 88.43097 host pulpo-osd02
12 hdd 7.27829 osd.12 up 1.00000 1.00000
13 hdd 7.27829 osd.13 up 1.00000 1.00000
14 hdd 7.27829 osd.14 up 1.00000 1.00000
15 hdd 7.27829 osd.15 up 1.00000 1.00000
16 hdd 7.27829 osd.16 up 1.00000 1.00000
17 hdd 7.27829 osd.17 up 1.00000 1.00000
18 hdd 7.27829 osd.18 up 1.00000 1.00000
19 hdd 7.27829 osd.19 up 1.00000 1.00000
20 hdd 7.27829 osd.20 up 1.00000 1.00000
21 hdd 7.27829 osd.21 up 1.00000 1.00000
22 hdd 7.27829 osd.22 up 1.00000 1.00000
23 hdd 7.27829 osd.23 up 1.00000 1.00000
37 nvme 1.09149 osd.37 up 1.00000 1.00000
-7 88.43097 host pulpo-osd03
24 hdd 7.27829 osd.24 up 1.00000 1.00000
25 hdd 7.27829 osd.25 up 1.00000 1.00000
26 hdd 7.27829 osd.26 up 1.00000 1.00000
27 hdd 7.27829 osd.27 up 1.00000 1.00000
28 hdd 7.27829 osd.28 up 1.00000 1.00000
29 hdd 7.27829 osd.29 up 1.00000 1.00000
30 hdd 7.27829 osd.30 up 1.00000 1.00000
31 hdd 7.27829 osd.31 up 1.00000 1.00000
32 hdd 7.27829 osd.32 up 1.00000 1.00000
33 hdd 7.27829 osd.33 up 1.00000 1.00000
34 hdd 7.27829 osd.34 up 1.00000 1.00000
35 hdd 7.27829 osd.35 up 1.00000 1.00000
38 nvme 1.09149 osd.38 up 1.00000 1.00000
4) # ceph osd pool get cephfs_cache all
min_size: 2
crash_replay_interval: 0
pg_num: 128
pgp_num: 128
crush_rule: pulpo_nvme
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
hit_set_type: bloom
hit_set_period: 14400
hit_set_count: 12
hit_set_fpp: 0.05
use_gmt_hitset: 1
auid: 0
target_max_objects: 0
target_max_bytes: 0
cache_target_dirty_ratio: 0.4
cache_target_dirty_high_ratio: 0.6
cache_target_full_ratio: 0.8
cache_min_flush_age: 0
cache_min_evict_age: 0
min_read_recency_for_promote: 0
min_write_recency_for_promote: 0
fast_read: 0
hit_set_grade_decay_rate: 0
Do you see anything wrong? We had written some small files to the CephFS before we tried to write the big 1TB file. What is puzzling to me is that no data has been written back to the data pool.
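(For what it's worth, my understanding from the cache-tiering docs is that the cache can also be flushed and evicted by hand with something like
# rados -p cephfs_cache cache-flush-evict-all
but presumably the tiering agent should be doing that on its own once the thresholds are set sensibly.)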
Best,
Shaw
On Fri, Oct 6, 2017 at 6:46 AM, David Turner <drakonstein@xxxxxxxxx> wrote:
On Fri, Oct 6, 2017, 1:05 AM Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Fri, 06 Oct 2017 03:30:41 +0000 David Turner wrote:
> You're missing most all of the important bits. What the osds in your
> cluster look like, your tree, and your cache pool settings.
>
> ceph df
> ceph osd df
> ceph osd tree
> ceph osd pool get cephfs_cache all
>
Especially the last one.
My money is on not having set target_max_objects and target_max_bytes to
sensible values along with the ratios.
In short, not having read the (albeit spotty) documentation.
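Something along these lines, e.g. (the numbers are only ballpark examples for a cache pool with roughly 1TB of usable space; adjust them to what the NVMe pool can actually hold after replication):
# ceph osd pool set cephfs_cache target_max_bytes 800000000000
# ceph osd pool set cephfs_cache target_max_objects 1000000
# ceph osd pool set cephfs_cache cache_target_dirty_ratio 0.4
# ceph osd pool set cephfs_cache cache_target_full_ratio 0.8
Without target_max_bytes/target_max_objects the dirty/full ratios have nothing to be relative to, so the agent never flushes or evicts anything.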
> You have your writeback cache on 3 nvme drives. It looks like you have
> 1.6TB available between them for the cache. I don't know the behavior of a
> writeback cache tier on cephfs for large files, but I would guess that it
> can only hold full files and not flush partial files.
I VERY much doubt that; if so, it would be a massive flaw.
One assumes that cache operations work on the RADOS object level, no matter what.

I hope that it is on the RADOS level, but not a single object had been flushed to the backing pool, so I hazarded a guess. Seeing his settings will shed more light.
> That would mean your
> cache needs to have enough space for any file being written to the cluster.
> In this case a 1.3TB file with 3x replication would require 3.9TB (more
> than double what you have available) of available space in your writeback
> cache.
>
> There are very few use cases that benefit from a cache tier. The docs for
> Luminous warn as much.
You keep repeating that like a broken record.
And while certainly not false, I for one wouldn't be able to use (justify using) Ceph w/o cache tiers in our main use case.
In this case I assume they were following an old cheat sheet or such, suggesting the previously required cache tier with EC pools.

http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/
I know I keep repeating it, especially recently, as there have been a lot of people asking about it. The Luminous docs added a large section about how it is probably not what you want. Like me, the docs are not saying that there are no use cases for it. There was no information provided about the use case, so I made some suggestions/guesses. I'm also guessing that they are following a guide where a writeback cache was necessary for CephFS to use EC prior to Luminous. I also usually add that people should test it out and find what works best for them. I will always defer to your practical use of cache tiers as well, especially with RBDs.
I manage a cluster where I intend to keep running a writeback cache in front of CephFS, on the same drives as the EC pool. The use case gets enough benefit from the cache tier that flash media isn't even required to see it. It is used for video editing: files are usually modified and read within the first 24 hours and then left in cold storage until deleted. I have the cache tuned to keep everything for 24 hours and then evict it, by setting the minimum flush and evict ages to 24 hours and target max bytes to 0. All files stay in the cache for that time, and it never has to decide what to keep because it doesn't keep anything longer than that. Luckily, read performance from cold storage is not a requirement of this cluster, as any read operation has to first read from EC storage, write to replica storage, and then read from replica storage... Yuck.
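Concretely, the 24-hour knobs I mean are just these two pool settings (pool name made up here, values in seconds):
# ceph osd pool set my_cache_pool cache_min_flush_age 86400   # don't flush objects younger than 24h
# ceph osd pool set my_cache_pool cache_min_evict_age 86400   # don't evict objects younger than 24h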
Christian
> What is your goal by implementing this cache? If the
> answer is to utilize extra space on the nvmes, then just remove it and say
> thank you. The better use of nvmes in that case are as a part of the
> bluestore stack and give your osds larger DB partitions. Keeping your
> metadata pool on nvmes is still a good idea.
>
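For reference, tearing down a writeback tier is roughly the following sequence (double-check the current docs before running it, and note that the cache has to be fully drained first):
# ceph osd tier cache-mode cephfs_cache forward --yes-i-really-mean-it
# rados -p cephfs_cache cache-flush-evict-all
# ceph osd tier remove-overlay cephfs_data
# ceph osd tier remove cephfs_data cephfs_cache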
> On Thu, Oct 5, 2017, 7:45 PM Shawfeng Dong <shaw@xxxxxxxx> wrote:
>
> > Dear all,
> >
> > We just set up a Ceph cluster, running the latest stable release Ceph
> > v12.2.0 (Luminous):
> > # ceph --version
> > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous
> > (rc)
> >
> > The goal is to serve Ceph filesystem, for which we created 3 pools:
> > # ceph osd lspools
> > 1 cephfs_data,2 cephfs_metadata,3 cephfs_cache,
> > where
> > * cephfs_data is the data pool (36 OSDs on HDDs), which is erasure-coded;
> > * cephfs_metadata is the metadata pool
> > * cephfs_cache is the cache tier (3 OSDs on NVMes) for cephfs_data. The
> > cache-mode is writeback.
> >
> > Everything had worked fine, until today when we tried to copy a 1.3TB file
> > to the CephFS. We got the "No space left on device" error!
> >
> > 'ceph -s' says some OSDs are full:
> > # ceph -s
> > cluster:
> > id: e18516bf-39cb-4670-9f13-88ccb7d19769
> > health: HEALTH_ERR
> > full flag(s) set
> > 1 full osd(s)
> > 1 pools have many more objects per pg than average
> >
> > services:
> > mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
> > mgr: pulpo-mds01(active), standbys: pulpo-admin, pulpo-mon01
> > mds: pulpos-1/1/1 up {0=pulpo-mds01=up:active}
> > osd: 39 osds: 39 up, 39 in
> > flags full
> >
> > data:
> > pools: 3 pools, 2176 pgs
> > objects: 347k objects, 1381 GB
> > usage: 2847 GB used, 262 TB / 265 TB avail
> > pgs: 2176 active+clean
> >
> > io:
> > client: 19301 kB/s rd, 2935 op/s rd, 0 op/s wr
> >
> > And indeed the cache pool is full:
> > # rados df
> > POOL_NAME       USED  OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS   RD    WR_OPS  WR
> > cephfs_cache    1381G 355385  0      710770 0                  0       0        10004954 1522G 1398063 1611G
> > cephfs_data     0     0       0      0      0                  0       0        0        0     0       0
> > cephfs_metadata 8515k 24      0      72     0                  0       0        3        3072  3953    10541k
> >
> > total_objects 355409
> > total_used 2847G
> > total_avail 262T
> > total_space 265T
> >
> > However, the data pool is completely empty! So it seems that data has only
> > been written to the cache pool, but not written back to the data pool.
> >
> > I am really at a loss whether this is due to a setup error on my part, or
> > a Luminous bug. Could anyone shed some light on this? Please let me know if
> > you need any further info.
> >
> > Best,
> > Shaw
> >
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com