Hi Igor,
thanks! Here is a sample extract for one OSD, with the time stamp (+%F-%H%M%S) in the file name. For the second collection, I let it run for about 10 minutes after the reset:
perf_dump_2020-07-29-142739.osd181: "bluestore_write_big": 10216689,
perf_dump_2020-07-29-142739.osd181: "bluestore_write_big_bytes": 992602882048,
perf_dump_2020-07-29-142739.osd181: "bluestore_write_big_blobs": 10758603,
perf_dump_2020-07-29-142739.osd181: "bluestore_write_small": 63863813,
perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_bytes": 1481631167388,
perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_unused": 17279108,
perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_deferred": 13629951,
perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_pre_read": 13629951,
perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_new": 32954754,
perf_dump_2020-07-29-142739.osd181: "compress_success_count": 1167212,
perf_dump_2020-07-29-142739.osd181: "compress_rejected_count": 1493508,
perf_dump_2020-07-29-142739.osd181: "bluestore_compressed": 149993487447,
perf_dump_2020-07-29-142739.osd181: "bluestore_compressed_allocated": 206610432000,
perf_dump_2020-07-29-142739.osd181: "bluestore_compressed_original": 362672914432,
perf_dump_2020-07-29-142739.osd181: "bluestore_extent_compress": 24431903,
perf_dump_2020-07-29-143836.osd181: "bluestore_write_big": 10736,
perf_dump_2020-07-29-143836.osd181: "bluestore_write_big_bytes": 1363214336,
perf_dump_2020-07-29-143836.osd181: "bluestore_write_big_blobs": 12291,
perf_dump_2020-07-29-143836.osd181: "bluestore_write_small": 67527,
perf_dump_2020-07-29-143836.osd181: "bluestore_write_small_bytes": 1591140352,
perf_dump_2020-07-29-143836.osd181: "bluestore_write_small_unused": 17528,
perf_dump_2020-07-29-143836.osd181: "bluestore_write_small_deferred": 13854,
perf_dump_2020-07-29-143836.osd181: "bluestore_write_small_pre_read": 13854,
perf_dump_2020-07-29-143836.osd181: "bluestore_write_small_new": 36145,
perf_dump_2020-07-29-143836.osd181: "compress_success_count": 1641,
perf_dump_2020-07-29-143836.osd181: "compress_rejected_count": 2341,
perf_dump_2020-07-29-143836.osd181: "bluestore_compressed": 150044304023,
perf_dump_2020-07-29-143836.osd181: "bluestore_compressed_allocated": 206654210048,
perf_dump_2020-07-29-143836.osd181: "bluestore_compressed_original": 362729676800,
perf_dump_2020-07-29-143836.osd181: "bluestore_extent_compress": 24979,
If necessary, the full outputs for 3 OSDs can be found here:
Before reset:
https://pastebin.com/zNgRwuNv
https://pastebin.com/NDzdbhWc
https://pastebin.com/mpra6PAS
After reset:
https://pastebin.com/Ywrwscea
https://pastebin.com/sLjxK1Jw
https://pastebin.com/ik3n7Xtz
I do see an unreasonable number of small (re-)writes with an average size of ca. 20K, which does not seem to be due to compression. Unfortunately, I can't see anything about the alignment of writes.
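As a rough cross-check of the average write size, the byte counters can be divided by the matching op counters. This is only a minimal sketch: the values are copied by hand from the after-reset extract for osd.181 above, and the function names are made up for illustration.

```shell
#!/bin/bash

# Average bytes per write op: total bytes divided by op count (integer division).
avg_write_size() {
    local ops=$1 bytes=$2
    if [ "$ops" -eq 0 ]; then
        echo 0
        return
    fi
    echo $(( bytes / ops ))
}

# Share of write ops counted as "small", as an integer percentage.
small_write_share() {
    local small=$1 big=$2
    echo $(( 100 * small / (small + big) ))
}

# bluestore_write_small / bluestore_write_small_bytes and
# bluestore_write_big / bluestore_write_big_bytes after the reset.
echo "avg small write: $(avg_write_size 67527 1591140352) bytes"
echo "avg big write:   $(avg_write_size 10736 1363214336) bytes"
echo "small share:     $(small_write_share 67527 10736) %"
```

With these values the small writes average roughly 23 KB, the big writes 124 KiB, and about 86% of all write ops are small.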
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Igor Fedotov <ifedotov@xxxxxxx>
Sent: 29 July 2020 14:04:34
To: Frank Schilder; ceph-users
Subject: Re: mimic: much more raw used than reported
Hi Frank,
you might want to proceed with perf counters' dump analysis in the
following way:
For 2-3 arbitrary osds
- save current perf counter dump
- reset perf counters
- leave OSD under the regular load for a while.
- dump perf counters again
- share both saved and new dumps and/or check stats on 'big' writes vs.
'small' ones.
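The steps above could be sketched roughly as follows. The function only prints the commands so they can be reviewed first. The name perf_cycle_plan is hypothetical, the file naming follows the +%F-%H%M%S time stamp convention, and "perf reset all" is assumed to be accepted by the OSD admin socket on your release (check with "ceph daemon osd.N help").

```shell
#!/bin/bash

# Print (not run) the commands for one dump/reset/dump cycle on one OSD.
# Review the output, then pipe it into bash on the host that runs the OSD.
perf_cycle_plan() {
    local osd=$1 wait_secs=$2
    # The date call is kept unexpanded on purpose, so that each dump gets
    # its own time stamp when the printed commands are executed.
    local name='perf_dump_$(date +%F-%H%M%S)'".osd${osd}"
    echo "ceph daemon osd.${osd} perf dump > ${name}"
    echo "ceph daemon osd.${osd} perf reset all"
    echo "sleep ${wait_secs}"
    echo "ceph daemon osd.${osd} perf dump > ${name}"
}

# Example: osd.181, ten minutes under regular load between reset and second dump.
perf_cycle_plan 181 600
```

Printing the plan instead of executing it keeps the sketch harmless to try out; on the OSD host it would be run as "perf_cycle_plan 181 600 | bash".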
Thanks,
Igor
On 7/29/2020 2:49 PM, Frank Schilder wrote:
Dear Igor,
please find below data from "ceph osd df tree" and per-OSD bluestore stats pasted together with the script for extraction for reference. We have now:
df USED: 142 TB
bluestore_stored: 190.9TB (142*8/6 = 189, so matches)
bluestore_allocated: 275.2TB
osd df tree USE: 276.1 (so matches with bluestore_allocated as well)
The situation has gotten worse; the mismatch between raw used and stored is now 85TB. Compression is almost irrelevant. This matches my earlier report with data taken from "ceph osd df tree" alone. Compared with my previous report, what I seem to see is that a sequential write of 22TB (user data) causes an excess of 16TB (raw). This does not make sense and is not explained by the partial overwrite amplification you referred me to.
The real question I still have is how I can find out how much of the excess usage is attributable to the issue you pointed me to, and how much might be due to something else. I would probably need a way to find the objects that are affected by partial overwrite amplification and add up their total to see how much of the excess they explain, ideally in a way that lets me identify the RBD images responsible.
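For reference, the arithmetic behind these figures can be reproduced with a few lines of awk. This is just a sketch with hypothetical function names; the inputs are the totals quoted above (df USED, bluestore_stored, bluestore_allocated) and the 6+2 EC profile of the pool.

```shell
#!/bin/bash
export LC_ALL=C   # keep "." as the decimal separator in awk output

# Expected stored volume for a k+m erasure-coded pool: user data * (k+m)/k.
ec_expected() {
    awk -v used="$1" -v k="$2" -v m="$3" \
        'BEGIN { printf "%.1f\n", used * (k + m) / k }'
}

# Allocated space above stored: absolute value and percentage of stored.
excess() {
    awk -v alloc="$1" -v stored="$2" \
        'BEGIN { printf "%.1f %.0f\n", alloc - stored, 100 * (alloc - stored) / stored }'
}

ec_expected 142 6 2     # compare with bluestore_stored = 190.9
excess 275.2 190.9      # excess in TB, then in percent of stored
```

The first call gives about 189, matching bluestore_stored; the second gives an excess of roughly 84TB, or about 44% of the stored volume.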
I do *not* believe that *all* this extra usage is due to the partial overwrite amplification. We do not have the use case simulated with the subsequent dd commands in your post https://lists.ceph.io/hyperkitty/list/dev@xxxxxxx/thread/OHPO43J54TPBEUISYCK3SRV55SIZX2AT/, overwriting old data with an offset. On these images, we store very large files (15GB) that are written *only* *once* and not modified again. We currently do nothing else but sequential writes to a file system.
The only objects that might see a partial overwrite could be at the tail of such a file, when the beginning of a new file is written to an object that already holds a tail, and potentially objects holding file system meta data. With an RBD object size of 4M, this amounts to a comparably small number of objects that almost certainly cannot explain the observed 44% excess even assuming worst case amplification.
The data:
NAME ID USED %USED MAX AVAIL OBJECTS
sr-rbd-data-one-hdd 11 142 TiB 71.12 58 TiB 37415413
osd df tree blue stats
ID SIZE USE alloc store
84 8.9 6.2 6.1 4.3
145 8.9 5.6 5.5 3.7
156 8.9 6.3 6.2 4.2
168 8.9 6.1 6.0 4.1
181 8.9 6.6 6.6 4.4
74 8.9 5.2 5.2 3.7
144 8.9 5.9 5.9 4.0
157 8.9 6.6 6.5 4.5
169 8.9 6.4 6.3 4.4
180 8.9 6.6 6.6 4.5
60 8.9 5.7 5.6 4.0
146 8.9 5.9 5.8 4.0
158 8.9 6.7 6.7 4.6
170 8.9 6.5 6.5 4.4
182 8.9 5.8 5.7 4.0
63 8.9 5.8 5.8 4.1
148 8.9 6.5 6.4 4.4
159 8.9 4.9 4.9 3.3
172 8.9 6.4 6.3 4.4
183 8.9 6.5 6.4 4.4
229 8.9 5.6 5.6 3.8
232 8.9 6.3 6.2 4.3
235 8.9 5.0 4.9 3.3
238 8.9 6.6 6.5 4.4
259 11 7.5 7.4 5.1
231 8.9 6.2 6.1 4.2
233 8.9 6.7 6.6 4.5
236 8.9 6.3 6.2 4.2
239 8.9 5.2 5.1 3.5
263 11 6.5 6.5 4.4
228 8.9 6.3 6.3 4.3
230 8.9 6.0 5.9 4.0
234 8.9 6.5 6.4 4.4
237 8.9 6.0 5.9 4.1
260 11 6.6 6.5 4.5
0 8.9 6.3 6.3 4.3
2 8.9 6.4 6.4 4.5
72 8.9 5.4 5.4 3.7
76 8.9 6.2 6.1 4.3
86 8.9 5.6 5.5 3.9
1 8.9 6.0 5.9 4.1
3 8.9 5.7 5.7 4.0
73 8.9 6.1 6.0 4.3
85 8.9 6.8 6.7 4.6
87 8.9 6.1 6.1 4.3
SUM 406.8 276.1 275.2 190.9
The script:
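#!/bin/bash

# Convert an integer number of GiB to TiB with one decimal place.
format_TB() {
    tmp=$(($1/1024))
    echo "${tmp}.$(( (10*($1-tmp*1024))/1024 ))"
}

# Print bluestore_allocated and bluestore_stored for each OSD given as
# argument, followed by the totals over all of them.
blue_stats() {
    al_tot=0
    st_tot=0
    printf "%12s\n" "blue stats"
    printf "%5s %5s\n" "alloc" "store"
    for o in "$@" ; do
        # Look up the host of the OSD and read its perf counters over ssh.
        host_ip="$(ceph osd find "$o" | jq -r '.ip' | cut -d ":" -f1)"
        bs_data="$(ssh "$host_ip" ceph daemon "osd.$o" perf dump | jq '.bluestore')"
        # Counters are in bytes; convert to GiB before summing.
        bs_alloc=$(( $(echo "$bs_data" | jq '.bluestore_allocated') /1024/1024/1024 ))
        al_tot=$(( $al_tot+$bs_alloc ))
        bs_store=$(( $(echo "$bs_data" | jq '.bluestore_stored') /1024/1024/1024 ))
        st_tot=$(( $st_tot+$bs_store ))
        printf "%5s %5s\n" "$(format_TB $bs_alloc)" "$(format_TB $bs_store)"
    done
    printf "%5s %5s\n" "$(format_TB $al_tot)" "$(format_TB $st_tot)"
}

# Extract SIZE and USE of the hdd OSDs below "datacenter ServerRoom".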
df_tree_data="$(ceph osd df tree | sed -e "s/ *$//g" | awk 'BEGIN {printf("%18s\n", "osd df tree")} /root default/ {o=0} /datacenter ServerRoom/ {o=1} (o==1 && $2=="hdd") {s+=$5;u+=$7;printf("%4s %5s %5s\n", $1, $5, $7)} f==0 {printf("%4s %5s %5s\n", $1, $5, $6);f=1} END {printf("%4s %5.1f %5.1f\n", "SUM", s, u)}')"
OSDS=( $(echo "$df_tree_data" | tail -n +3 | awk '/SUM/ {next} {print $1}') )
bs_data="$(blue_stats "${OSDS[@]}")"
paste -d " " <(echo "$df_tree_data") <(echo "$bs_data")
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Igor Fedotov <ifedotov@xxxxxxx>
Sent: 27 July 2020 13:31
To: Frank Schilder; ceph-users
Subject: Re: mimic: much more raw used than reported
Frank,
I suggest starting with the perf counter analysis as per the second part
of my previous email...
Thanks,
Igor
On 7/27/2020 2:30 PM, Frank Schilder wrote:
Hi Igor,
thanks for your answer. I was thinking about that, but as far as I understood, hitting this bug actually requires a partial rewrite to happen. However, these are disk images in storage servers with basically static files, many of which are very large (15GB). Therefore, I believe, the vast majority of objects are written only once and should not be affected by the amplification bug.
Is there any way to confirm/rule out that/check how much amplification is happening?
I'm wondering if I might be observing something else. Since "ceph osd df tree" does report the actual utilization and I have only one pool on these OSDs, there is no problem with attributing allocated storage to a pool. I know it's all used by this one pool. I'm more wondering whether it's not the known amplification but something else that (at least partly) plays a role here.
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Igor Fedotov <ifedotov@xxxxxxx>
Sent: 27 July 2020 12:54:02
To: Frank Schilder; ceph-users
Subject: Re: mimic: much more raw used than reported
Hi Frank,
you might be affected by https://tracker.ceph.com/issues/44213
In short, the root causes are significant space overhead due to the large
bluestore allocation unit (64K) and the EC overwrite design.
This is fixed for upcoming Pacific release by using 4K alloc unit but it
is unlikely to be backported to earlier releases due to its complexity.
To say nothing about the need for OSD redeployment. Hence please expect
no fix for mimic.
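To get a feeling for why a 64K allocation unit hurts small writes, here is a deliberately simplified model. It only rounds each write up to whole allocation units; it ignores blob reuse, deferred writes and EC striping, so it is an illustration and not how BlueStore actually places data. The function name and the sample sizes are chosen for this example.

```shell
#!/bin/bash

# Space allocated for a write of $1 bytes when allocations are rounded up
# to whole units of $2 bytes (simplified model, see the assumptions above).
allocated_for() {
    local bytes=$1 unit=$2
    echo $(( (bytes + unit - 1) / unit * unit ))
}

unit=$(( 64 * 1024 ))
for size in 4096 23563 65536 126976; do
    echo "$size bytes written -> $(allocated_for "$size" "$unit") bytes allocated"
done
```

In this model anything up to 64K occupies a full 64K unit, so a write of about 20K would take roughly three times its size on disk, while the same write with a 4K unit would waste at most 4K.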
And your raw usage reports might still be not that good since mimic
lacks per-pool stats collection https://github.com/ceph/ceph/pull/19454.
I.e. your actual raw space usage is higher than reported. To estimate
proper raw usage one can use bluestore perf counters (namely
bluestore_stored and bluestore_allocated). Summing bluestore_allocated
over all involved OSDs will give actual RAW usage. Summing
bluestore_stored will provide actual data volume after EC processing,
i.e. presumably it should be around 158TiB.
Thanks,
Igor
On 7/26/2020 8:43 PM, Frank Schilder wrote:
Dear fellow cephers,
I observe a weird problem on our mimic-13.2.8 cluster. We have an EC RBD pool backed by HDDs. These disks are not in any other pool. I noticed that the total capacity (=USED+MAX AVAIL) reported by "ceph df detail" has shrunk recently from 300TiB to 200TiB. Part, but by no means all, of this can be explained by imbalance of the data distribution.
When I compare the output of "ceph df detail" and "ceph osd df tree", I find 69TiB raw capacity used but not accounted for; see calculations below. These 69TiB raw are equivalent to 20% usable capacity and I really need it back. Together with the imbalance, we lose about 30% capacity.
What is using these extra 69TiB and how can I get it back?
Some findings:
These are the 5 largest images in the pool, accounting for a total of 97TiB out of 119TiB usage:
# rbd du :
NAME PROVISIONED USED
one-133 25 TiB 14 TiB
NAME PROVISIONED USED
one-153@222 40 TiB 14 TiB
one-153@228 40 TiB 357 GiB
one-153@235 40 TiB 797 GiB
one-153@241 40 TiB 509 GiB
one-153@242 40 TiB 43 GiB
one-153@243 40 TiB 16 MiB
one-153@244 40 TiB 16 MiB
one-153@245 40 TiB 324 MiB
one-153@246 40 TiB 276 MiB
one-153@247 40 TiB 96 MiB
one-153@248 40 TiB 138 GiB
one-153@249 40 TiB 1.8 GiB
one-153@250 40 TiB 0 B
one-153 40 TiB 204 MiB
<TOTAL> 40 TiB 16 TiB
NAME PROVISIONED USED
one-391@3 40 TiB 432 MiB
one-391@9 40 TiB 26 GiB
one-391@15 40 TiB 90 GiB
one-391@16 40 TiB 0 B
one-391@17 40 TiB 0 B
one-391@18 40 TiB 0 B
one-391@19 40 TiB 0 B
one-391@20 40 TiB 3.5 TiB
one-391@21 40 TiB 5.4 TiB
one-391@22 40 TiB 5.8 TiB
one-391@23 40 TiB 8.4 TiB
one-391@24 40 TiB 1.4 TiB
one-391 40 TiB 2.2 TiB
<TOTAL> 40 TiB 27 TiB
NAME PROVISIONED USED
one-394@3 70 TiB 1.4 TiB
one-394@9 70 TiB 2.5 TiB
one-394@15 70 TiB 20 GiB
one-394@16 70 TiB 0 B
one-394@17 70 TiB 0 B
one-394@18 70 TiB 0 B
one-394@19 70 TiB 383 GiB
one-394@20 70 TiB 3.3 TiB
one-394@21 70 TiB 5.0 TiB
one-394@22 70 TiB 5.0 TiB
one-394@23 70 TiB 9.0 TiB
one-394@24 70 TiB 1.6 TiB
one-394 70 TiB 2.5 TiB
<TOTAL> 70 TiB 31 TiB
NAME PROVISIONED USED
one-434 25 TiB 9.1 TiB
The large 70TiB images one-391 and one-394 are currently being written to at ca. 5TiB per day.
Output of "ceph df detail" with some columns removed:
NAME ID USED %USED MAX AVAIL OBJECTS RAW USED
sr-rbd-data-one-hdd 11 119 TiB 58.45 84 TiB 31286554 158 TiB
Pool is EC 6+2.
USED is correct: 31286554*4MiB=119TiB.
RAW USED is correct: 119*8/6=158TiB.
Most of this data is freshly copied onto large RBD images.
Compression is enabled on this pool (aggressive,snappy).
However, when looking at "ceph osd df tree", I get:
The combined raw capacity of OSDs backing this pool is 406.8TiB (sum over SIZE).
Summing up column USE over all OSDs gives 227.5TiB.
This gives a difference of 69TiB (=227-158) that is not accounted for.
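The calculation above can be repeated as a small sketch. It assumes that every object is a full 4MiB RBD object and uses the EC 6+2 profile; the function names are hypothetical and the numbers are the ones quoted above.

```shell
#!/bin/bash
export LC_ALL=C   # keep "." as the decimal separator in awk output

# Pool USED in TiB if every object were a full object of obj_mib MiB.
pool_used_tib() {
    awk -v n="$1" -v obj_mib="$2" \
        'BEGIN { printf "%.1f\n", n * obj_mib / 1024 / 1024 }'
}

# Raw space in TiB that the OSDs report on top of used * (k+m)/k.
unaccounted_tib() {
    awk -v use="$1" -v used="$2" -v k="$3" -v m="$4" \
        'BEGIN { printf "%.1f\n", use - used * (k + m) / k }'
}

pool_used_tib 31286554 4         # OBJECTS from "ceph df detail", 4MiB each
unaccounted_tib 227.5 119 6 2    # sum of USE, pool USED, EC profile k and m
```

Without intermediate rounding, the unaccounted raw space comes out just under the 69TiB quoted above.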
Here is the output of "ceph osd df tree", limited to the drives backing the pool:
ID CLASS WEIGHT REWEIGHT SIZE USE DATA OMAP META AVAIL %USE VAR PGS TYPE NAME
84 hdd 8.90999 1.00000 8.9 TiB 5.0 TiB 5.0 TiB 180 MiB 16 GiB 3.9 TiB 56.43 1.72 103 osd.84
145 hdd 8.90999 1.00000 8.9 TiB 4.6 TiB 4.6 TiB 144 MiB 14 GiB 4.3 TiB 51.37 1.57 87 osd.145
156 hdd 8.90999 1.00000 8.9 TiB 5.2 TiB 5.1 TiB 173 MiB 16 GiB 3.8 TiB 57.91 1.77 100 osd.156
168 hdd 8.90999 1.00000 8.9 TiB 5.0 TiB 5.0 TiB 164 MiB 16 GiB 3.9 TiB 56.31 1.72 98 osd.168
181 hdd 8.90999 1.00000 8.9 TiB 5.5 TiB 5.4 TiB 121 MiB 17 GiB 3.5 TiB 61.26 1.87 105 osd.181
74 hdd 8.90999 1.00000 8.9 TiB 4.2 TiB 4.2 TiB 148 MiB 13 GiB 4.7 TiB 46.79 1.43 85 osd.74
144 hdd 8.90999 1.00000 8.9 TiB 4.7 TiB 4.7 TiB 106 MiB 15 GiB 4.2 TiB 53.17 1.62 94 osd.144
157 hdd 8.90999 1.00000 8.9 TiB 5.8 TiB 5.8 TiB 192 MiB 18 GiB 3.1 TiB 65.02 1.99 111 osd.157
169 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 172 MiB 16 GiB 3.8 TiB 56.99 1.74 102 osd.169
180 hdd 8.90999 1.00000 8.9 TiB 5.8 TiB 5.8 TiB 131 MiB 18 GiB 3.1 TiB 65.04 1.99 111 osd.180
60 hdd 8.90999 1.00000 8.9 TiB 4.5 TiB 4.5 TiB 155 MiB 14 GiB 4.4 TiB 50.40 1.54 93 osd.60
146 hdd 8.90999 1.00000 8.9 TiB 4.8 TiB 4.8 TiB 139 MiB 15 GiB 4.1 TiB 53.70 1.64 92 osd.146
158 hdd 8.90999 1.00000 8.9 TiB 5.6 TiB 5.5 TiB 183 MiB 17 GiB 3.4 TiB 62.30 1.90 109 osd.158
170 hdd 8.90999 1.00000 8.9 TiB 5.7 TiB 5.6 TiB 205 MiB 18 GiB 3.2 TiB 63.53 1.94 112 osd.170
182 hdd 8.90999 1.00000 8.9 TiB 4.7 TiB 4.6 TiB 105 MiB 14 GiB 4.3 TiB 52.27 1.60 92 osd.182
63 hdd 8.90999 1.00000 8.9 TiB 4.7 TiB 4.7 TiB 156 MiB 15 GiB 4.2 TiB 52.74 1.61 98 osd.63
148 hdd 8.90999 1.00000 8.9 TiB 5.2 TiB 5.1 TiB 119 MiB 16 GiB 3.8 TiB 57.82 1.77 100 osd.148
159 hdd 8.90999 1.00000 8.9 TiB 4.0 TiB 4.0 TiB 89 MiB 12 GiB 4.9 TiB 44.61 1.36 79 osd.159
172 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 173 MiB 16 GiB 3.8 TiB 57.22 1.75 98 osd.172
183 hdd 8.90999 1.00000 8.9 TiB 6.0 TiB 6.0 TiB 135 MiB 19 GiB 2.9 TiB 67.35 2.06 118 osd.183
229 hdd 8.90999 1.00000 8.9 TiB 4.6 TiB 4.6 TiB 127 MiB 15 GiB 4.3 TiB 52.05 1.59 93 osd.229
232 hdd 8.90999 1.00000 8.9 TiB 5.2 TiB 5.2 TiB 158 MiB 17 GiB 3.7 TiB 58.22 1.78 101 osd.232
235 hdd 8.90999 1.00000 8.9 TiB 4.1 TiB 4.1 TiB 103 MiB 13 GiB 4.8 TiB 45.96 1.40 79 osd.235
238 hdd 8.90999 1.00000 8.9 TiB 5.4 TiB 5.4 TiB 120 MiB 17 GiB 3.5 TiB 60.47 1.85 104 osd.238
259 hdd 10.91399 1.00000 11 TiB 6.2 TiB 6.2 TiB 140 MiB 19 GiB 4.7 TiB 56.54 1.73 120 osd.259
231 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 114 MiB 16 GiB 3.8 TiB 56.90 1.74 101 osd.231
233 hdd 8.90999 1.00000 8.9 TiB 5.5 TiB 5.5 TiB 123 MiB 17 GiB 3.4 TiB 61.78 1.89 106 osd.233
236 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 114 MiB 16 GiB 3.8 TiB 57.53 1.76 101 osd.236
239 hdd 8.90999 1.00000 8.9 TiB 4.2 TiB 4.2 TiB 95 MiB 13 GiB 4.7 TiB 47.41 1.45 86 osd.239
263 hdd 10.91399 1.00000 11 TiB 5.3 TiB 5.3 TiB 178 MiB 17 GiB 5.6 TiB 48.73 1.49 102 osd.263
228 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 113 MiB 16 GiB 3.8 TiB 57.10 1.74 96 osd.228
230 hdd 8.90999 1.00000 8.9 TiB 4.9 TiB 4.9 TiB 144 MiB 16 GiB 4.0 TiB 55.20 1.69 99 osd.230
234 hdd 8.90999 1.00000 8.9 TiB 5.6 TiB 5.6 TiB 164 MiB 18 GiB 3.3 TiB 63.29 1.93 109 osd.234
237 hdd 8.90999 1.00000 8.9 TiB 4.8 TiB 4.8 TiB 110 MiB 15 GiB 4.1 TiB 54.33 1.66 97 osd.237
260 hdd 10.91399 1.00000 11 TiB 5.4 TiB 5.4 TiB 152 MiB 17 GiB 5.5 TiB 49.35 1.51 104 osd.260
0 hdd 8.90999 1.00000 8.9 TiB 5.2 TiB 5.2 TiB 157 MiB 16 GiB 3.7 TiB 58.28 1.78 102 osd.0
2 hdd 8.90999 1.00000 8.9 TiB 5.3 TiB 5.2 TiB 122 MiB 16 GiB 3.6 TiB 59.05 1.80 106 osd.2
72 hdd 8.90999 1.00000 8.9 TiB 4.4 TiB 4.4 TiB 145 MiB 14 GiB 4.5 TiB 49.89 1.52 89 osd.72
76 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 178 MiB 16 GiB 3.8 TiB 56.89 1.74 102 osd.76
86 hdd 8.90999 1.00000 8.9 TiB 4.6 TiB 4.5 TiB 155 MiB 14 GiB 4.3 TiB 51.18 1.56 94 osd.86
1 hdd 8.90999 1.00000 8.9 TiB 4.9 TiB 4.9 TiB 141 MiB 15 GiB 4.0 TiB 54.73 1.67 95 osd.1
3 hdd 8.90999 1.00000 8.9 TiB 4.7 TiB 4.7 TiB 156 MiB 15 GiB 4.2 TiB 52.40 1.60 94 osd.3
73 hdd 8.90999 1.00000 8.9 TiB 5.0 TiB 4.9 TiB 146 MiB 16 GiB 3.9 TiB 55.68 1.70 102 osd.73
85 hdd 8.90999 1.00000 8.9 TiB 5.6 TiB 5.5 TiB 192 MiB 18 GiB 3.3 TiB 62.46 1.91 109 osd.85
87 hdd 8.90999 1.00000 8.9 TiB 5.0 TiB 5.0 TiB 189 MiB 16 GiB 3.9 TiB 55.91 1.71 102 osd.87
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14