Re: mimic: much more raw used than reported

Dear Igor,

please find below the data from "ceph osd df tree" and the per-OSD bluestore stats pasted side by side, together with the extraction script for reference. We now have:

df USED: 142 TiB
bluestore_stored: 190.9 TiB (142*8/6 = 189, so it matches)
bluestore_allocated: 275.2 TiB
osd df tree USE: 276.1 TiB (so it matches bluestore_allocated as well)

The situation has gotten worse: the mismatch between raw used and stored is now 85 TiB. Compression is almost irrelevant. This matches my earlier report, which was based on data from "ceph osd df tree" alone. Compared with my previous report, a sequential write of 22 TiB (user data) appears to have caused an excess of 16 TiB (raw). This does not make sense and is not explained by the partial overwrite amplification you referred me to.
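
The consistency checks above can be reproduced with a few lines of shell. This is only a sketch: the function names ec_raw and excess are made up for illustration, the pool is assumed to be EC 6+2 as stated in the original report, and the excess is taken relative to bluestore_allocated (using the "osd df tree" USE value of 276.1 instead gives the 85 quoted above).

```bash
#!/bin/bash

# Expected volume after EC processing for a k+m pool: used * (k+m) / k.
# (Hypothetical helper, not part of the extraction script.)
ec_raw() {
	awk -v used="$1" -v k="$2" -v m="$3" \
		'BEGIN { printf("%.1f\n", used * (k + m) / k) }'
}

# Excess of allocated over stored: absolute value and percent of stored.
excess() {
	awk -v alloc="$1" -v stored="$2" \
		'BEGIN { printf("%.1f %.0f\n", alloc - stored, 100 * (alloc - stored) / stored) }'
}

ec_raw 142 6 2       # prints 189.3, close to bluestore_stored = 190.9
excess 275.2 190.9   # prints "84.3 44", i.e. about 44% more allocated than stored
```

The floating-point work is delegated to awk because bash arithmetic is integer-only.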

The real question I still have is how I can find out how much of the excess usage is attributable to the issue you pointed me to, and how much might be due to something else. I would probably need a way to find the objects that are affected by partial overwrite amplification and sum up their sizes to see how much of the excess they explain, ideally in a way that lets me identify the RBD images responsible.

I do *not* believe that *all* this extra usage is due to the partial overwrite amplification. We do not have the use case you simulated with the successive dd commands in your post https://lists.ceph.io/hyperkitty/list/dev@xxxxxxx/thread/OHPO43J54TPBEUISYCK3SRV55SIZX2AT/, i.e. overwriting old data at an offset. On these images, we store very large files (15GB) that are written *only* *once* and not modified again. We currently do nothing but sequential writes to a file system.

The only objects that might see a partial overwrite are those at the tail of such a file, when the beginning of a new file is written to an object that already holds a tail, and potentially objects holding file system metadata. With an RBD object size of 4M, this amounts to a comparatively small number of objects that almost certainly cannot explain the observed 44% excess, even assuming worst-case amplification.
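
A rough back-of-the-envelope version of this argument, as a sketch under loudly stated assumptions: files are treated as exactly 15 GiB, objects as 4 MiB, files are laid out back to back so that at most one object per file is shared with the next file, and file system metadata objects are ignored. The helper names boundary_pct and required_factor are hypothetical.

```bash
#!/bin/bash

# Assumptions (illustrative only): 15 GiB files, 4 MiB RBD objects,
# at most one shared boundary object per file.
file_gib=15
obj_mib=4
objs_per_file=$(( file_gib * 1024 / obj_mib ))

# Share of objects that could see a partial overwrite, in percent.
boundary_pct() {
	awk -v n="$1" 'BEGIN { printf("%.3f\n", 100 / n) }'
}

# Extra space each boundary object would have to waste, in multiples of
# the object size, to explain an excess given in percent of stored data.
required_factor() {
	awk -v n="$1" -v ex="$2" 'BEGIN { printf("%.0f\n", ex / 100 * n) }'
}

echo "$objs_per_file"                  # prints 3840
boundary_pct "$objs_per_file"          # prints 0.026
required_factor "$objs_per_file" 44    # prints 1690
```

Under these assumptions roughly 0.026% of the objects are candidates, and each of them would have to waste on the order of 1690 times its own size to account for a 44% excess.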

The data:

NAME                     ID     USED        %USED     MAX AVAIL     OBJECTS  
sr-rbd-data-one-hdd      11     142 TiB     71.12        58 TiB     37415413

       osd df tree   blue stats
  ID   SIZE    USE alloc  store
  84    8.9    6.2   6.1    4.3
 145    8.9    5.6   5.5    3.7
 156    8.9    6.3   6.2    4.2
 168    8.9    6.1   6.0    4.1
 181    8.9    6.6   6.6    4.4
  74    8.9    5.2   5.2    3.7
 144    8.9    5.9   5.9    4.0
 157    8.9    6.6   6.5    4.5
 169    8.9    6.4   6.3    4.4
 180    8.9    6.6   6.6    4.5
  60    8.9    5.7   5.6    4.0
 146    8.9    5.9   5.8    4.0
 158    8.9    6.7   6.7    4.6
 170    8.9    6.5   6.5    4.4
 182    8.9    5.8   5.7    4.0
  63    8.9    5.8   5.8    4.1
 148    8.9    6.5   6.4    4.4
 159    8.9    4.9   4.9    3.3
 172    8.9    6.4   6.3    4.4
 183    8.9    6.5   6.4    4.4
 229    8.9    5.6   5.6    3.8
 232    8.9    6.3   6.2    4.3
 235    8.9    5.0   4.9    3.3
 238    8.9    6.6   6.5    4.4
 259     11    7.5   7.4    5.1
 231    8.9    6.2   6.1    4.2
 233    8.9    6.7   6.6    4.5
 236    8.9    6.3   6.2    4.2
 239    8.9    5.2   5.1    3.5
 263     11    6.5   6.5    4.4
 228    8.9    6.3   6.3    4.3
 230    8.9    6.0   5.9    4.0
 234    8.9    6.5   6.4    4.4
 237    8.9    6.0   5.9    4.1
 260     11    6.6   6.5    4.5
   0    8.9    6.3   6.3    4.3
   2    8.9    6.4   6.4    4.5
  72    8.9    5.4   5.4    3.7
  76    8.9    6.2   6.1    4.3
  86    8.9    5.6   5.5    3.9
   1    8.9    6.0   5.9    4.1
   3    8.9    5.7   5.7    4.0
  73    8.9    6.1   6.0    4.3
  85    8.9    6.8   6.7    4.6
  87    8.9    6.1   6.1    4.3
 SUM  406.8  276.1 275.2  190.9
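
To check whether the excess is spread evenly over the OSDs, the allocated-to-stored ratio can be computed per row of the table above. A minimal sketch, assuming the five-column layout shown (ID, SIZE, USE, alloc, store); the function name alloc_ratio is made up for illustration.

```bash
#!/bin/bash

# Reads rows "ID SIZE USE alloc store" on stdin and prints ID and alloc/store.
# Header lines are skipped because their first field is neither a number nor SUM.
alloc_ratio() {
	awk 'NF == 5 && $1 ~ /^([0-9]+|SUM)$/ { printf("%4s  %.2f\n", $1, $4 / $5) }'
}

# Three rows copied from the table above.
alloc_ratio <<'EOF'
  84    8.9    6.2   6.1    4.3
 159    8.9    4.9   4.9    3.3
 SUM  406.8  276.1 275.2  190.9
EOF
```

For these three rows the ratios come out as 1.42, 1.48 and 1.44, so at least for this sample the overhead looks similar on every OSD rather than concentrated on a few.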

The script:

#!/bin/bash

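# Convert an integer GiB value to TiB with one decimal place
# (integer arithmetic only, so the decimal is truncated, not rounded).
format_TB() {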
	tmp=$(($1/1024))
	echo "${tmp}.$(( (10*($1-tmp*1024))/1024 ))"
}

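# For each OSD ID given as argument, read bluestore_allocated and
# bluestore_stored from the OSD's perf counters (via ssh to its host)
# and print them in TiB, followed by the totals over all OSDs.
blue_stats() {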
	al_tot=0
	st_tot=0
	printf "%12s\n" "blue stats"
	printf "%5s  %5s\n" "alloc" "store"
	for o in "$@" ; do
		host_ip="$(ceph osd find "$o" | jq -r '.ip' | cut -d ":" -f1)"
		bs_data="$(ssh "$host_ip" ceph daemon "osd.$o" perf dump | jq '.bluestore')"
		bs_alloc=$(( $(echo "$bs_data" | jq '.bluestore_allocated') /1024/1024/1024 ))
		al_tot=$(( $al_tot+$bs_alloc ))
		bs_store=$(( $(echo "$bs_data" | jq '.bluestore_stored') /1024/1024/1024 ))
		st_tot=$(( $st_tot+$bs_store ))
		printf "%5s  %5s\n" "$(format_TB $bs_alloc)" "$(format_TB $bs_store)"
	done
	printf "%5s  %5s\n" "$(format_TB $al_tot)" "$(format_TB $st_tot)"
}

# Extract ID, SIZE and USE of all hdd OSDs below the CRUSH bucket
# "datacenter ServerRoom" (site-specific) and append the column sums.
df_tree_data="$(ceph osd df tree | sed -e "s/  *$//g" | awk 'BEGIN {printf("%18s\n", "osd df tree")} /root default/ {o=0} /datacenter ServerRoom/ {o=1} (o==1 && $2=="hdd") {s+=$5;u+=$7;printf("%4s  %5s  %5s\n", $1, $5, $7)} f==0 {printf("%4s  %5s  %5s\n", $1, $5, $6);f=1} END {printf("%4s  %5.1f  %5.1f\n", "SUM", s, u)}')"

OSDS=( $(echo "$df_tree_data" | tail -n +3 | awk '/SUM/ {next} {print $1}') )

bs_data="$(blue_stats "${OSDS[@]}")"

paste -d " " <(echo "$df_tree_data") <(echo "$bs_data")

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <ifedotov@xxxxxxx>
Sent: 27 July 2020 13:31
To: Frank Schilder; ceph-users
Subject: Re:  mimic: much more raw used than reported

Frank,

I suggest starting with the perf counter analysis described in the second
part of my previous email...


Thanks,

Igor

On 7/27/2020 2:30 PM, Frank Schilder wrote:
> Hi Igor,
>
> thanks for your answer. I was thinking about that, but as far as I understand, hitting this bug actually requires a partial rewrite to happen. However, these are disk images in storage servers with basically static files, many of which are very large (15GB). Therefore, I believe the vast majority of objects are written only once and should not be affected by the amplification bug.
>
> Is there any way to confirm or rule this out, or to check how much amplification is happening?
>
> I'm wondering if I might be observing something else. Since "ceph osd df tree" does report the actual utilization and I have only one pool on these OSDs, there is no problem with attributing allocated storage to a pool. I know it's all used by this one pool. I'm more wondering whether it's not the known amplification but something else that (at least partly) plays a role here.
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Igor Fedotov <ifedotov@xxxxxxx>
> Sent: 27 July 2020 12:54:02
> To: Frank Schilder; ceph-users
> Subject: Re:  mimic: much more raw used than reported
>
> Hi Frank,
>
> you might be being hit by https://tracker.ceph.com/issues/44213
>
> In short, the root causes are significant space overhead due to the large
> bluestore allocation unit (64K) and the EC overwrite design.
>
> This is fixed for the upcoming Pacific release by using a 4K alloc unit, but it
> is unlikely to be backported to earlier releases due to its complexity,
> to say nothing of the need for OSD redeployment. Hence please expect
> no fix for mimic.
>
>
> Also, your raw usage reports might still be inaccurate, since mimic
> lacks per-pool stats collection (https://github.com/ceph/ceph/pull/19454),
> i.e. your actual raw space usage is higher than reported. To estimate
> the proper raw usage one can use the bluestore perf counters (namely
> bluestore_stored and bluestore_allocated). Summing bluestore_allocated
> over all involved OSDs will give the actual RAW usage. Summing
> bluestore_stored will give the actual data volume after EC processing,
> i.e. presumably it should be around 158TiB.
>
>
> Thanks,
>
> Igor
>
> On 7/26/2020 8:43 PM, Frank Schilder wrote:
>> Dear fellow cephers,
>>
>> I observe a weird problem on our mimic-13.2.8 cluster. We have an EC RBD pool backed by HDDs. These disks are not in any other pool. I noticed that the total capacity (=USED+MAX AVAIL) reported by "ceph df detail" has recently shrunk from 300TiB to 200TiB. Part, but by no means all, of this can be explained by imbalance of the data distribution.
>>
>> When I compare the output of "ceph df detail" and "ceph osd df tree", I find 69TiB of raw capacity used but not accounted for; see the calculations below. These 69TiB raw are equivalent to 20% of the usable capacity and I really need them back. Together with the imbalance, we lose about 30% of the capacity.
>>
>> What is using these extra 69TiB and how can I get it back?
>>
>>
>> Some findings:
>>
>> These are the 5 largest images in the pool, accounting for a total of 97TiB out of 119TiB usage:
>>
>> # rbd du :
>> NAME    PROVISIONED   USED
>> one-133      25 TiB 14 TiB
>> NAME        PROVISIONED    USED
>> one-153@222      40 TiB  14 TiB
>> one-153@228      40 TiB 357 GiB
>> one-153@235      40 TiB 797 GiB
>> one-153@241      40 TiB 509 GiB
>> one-153@242      40 TiB  43 GiB
>> one-153@243      40 TiB  16 MiB
>> one-153@244      40 TiB  16 MiB
>> one-153@245      40 TiB 324 MiB
>> one-153@246      40 TiB 276 MiB
>> one-153@247      40 TiB  96 MiB
>> one-153@248      40 TiB 138 GiB
>> one-153@249      40 TiB 1.8 GiB
>> one-153@250      40 TiB     0 B
>> one-153          40 TiB 204 MiB
>> <TOTAL>          40 TiB  16 TiB
>> NAME       PROVISIONED    USED
>> one-391@3       40 TiB 432 MiB
>> one-391@9       40 TiB  26 GiB
>> one-391@15      40 TiB  90 GiB
>> one-391@16      40 TiB     0 B
>> one-391@17      40 TiB     0 B
>> one-391@18      40 TiB     0 B
>> one-391@19      40 TiB     0 B
>> one-391@20      40 TiB 3.5 TiB
>> one-391@21      40 TiB 5.4 TiB
>> one-391@22      40 TiB 5.8 TiB
>> one-391@23      40 TiB 8.4 TiB
>> one-391@24      40 TiB 1.4 TiB
>> one-391         40 TiB 2.2 TiB
>> <TOTAL>         40 TiB  27 TiB
>> NAME       PROVISIONED    USED
>> one-394@3       70 TiB 1.4 TiB
>> one-394@9       70 TiB 2.5 TiB
>> one-394@15      70 TiB  20 GiB
>> one-394@16      70 TiB     0 B
>> one-394@17      70 TiB     0 B
>> one-394@18      70 TiB     0 B
>> one-394@19      70 TiB 383 GiB
>> one-394@20      70 TiB 3.3 TiB
>> one-394@21      70 TiB 5.0 TiB
>> one-394@22      70 TiB 5.0 TiB
>> one-394@23      70 TiB 9.0 TiB
>> one-394@24      70 TiB 1.6 TiB
>> one-394         70 TiB 2.5 TiB
>> <TOTAL>         70 TiB  31 TiB
>> NAME    PROVISIONED    USED
>> one-434      25 TiB 9.1 TiB
>>
>> The large images one-391 (40TiB) and one-394 (70TiB) are currently being written to at ca. 5TiB per day.
>>
>> Output of "ceph df detail" with some columns removed:
>>
>> NAME                     ID     USED        %USED     MAX AVAIL     OBJECTS      RAW USED
>> sr-rbd-data-one-hdd      11     119 TiB     58.45        84 TiB     31286554      158 TiB
>>
>> Pool is EC 6+2.
>> USED is correct: 31286554*4MiB=119TiB.
>> RAW USED is correct: 119*8/6=158TiB.
>> Most of this data is freshly copied onto large RBD images.
>> Compression is enabled on this pool (aggressive,snappy).
>>
>> However, when looking at "ceph osd df tree", I get:
>>
>> The combined raw capacity of OSDs backing this pool is 406.8TiB (sum over SIZE).
>> Summing up column USE over all OSDs gives 227.5TiB.
>>
>> This gives a difference of 69TiB (=227-158) that is not accounted for.
>>
>> Here is the output of "ceph osd df tree", limited to the drives backing the pool:
>>
>> ID   CLASS    WEIGHT     REWEIGHT SIZE    USE     DATA    OMAP    META     AVAIL   %USE  VAR  PGS TYPE NAME
>>     84      hdd    8.90999  1.00000 8.9 TiB 5.0 TiB 5.0 TiB 180 MiB   16 GiB 3.9 TiB 56.43 1.72 103                     osd.84
>>    145      hdd    8.90999  1.00000 8.9 TiB 4.6 TiB 4.6 TiB 144 MiB   14 GiB 4.3 TiB 51.37 1.57  87                     osd.145
>>    156      hdd    8.90999  1.00000 8.9 TiB 5.2 TiB 5.1 TiB 173 MiB   16 GiB 3.8 TiB 57.91 1.77 100                     osd.156
>>    168      hdd    8.90999  1.00000 8.9 TiB 5.0 TiB 5.0 TiB 164 MiB   16 GiB 3.9 TiB 56.31 1.72  98                     osd.168
>>    181      hdd    8.90999  1.00000 8.9 TiB 5.5 TiB 5.4 TiB 121 MiB   17 GiB 3.5 TiB 61.26 1.87 105                     osd.181
>>     74      hdd    8.90999  1.00000 8.9 TiB 4.2 TiB 4.2 TiB 148 MiB   13 GiB 4.7 TiB 46.79 1.43  85                     osd.74
>>    144      hdd    8.90999  1.00000 8.9 TiB 4.7 TiB 4.7 TiB 106 MiB   15 GiB 4.2 TiB 53.17 1.62  94                     osd.144
>>    157      hdd    8.90999  1.00000 8.9 TiB 5.8 TiB 5.8 TiB 192 MiB   18 GiB 3.1 TiB 65.02 1.99 111                     osd.157
>>    169      hdd    8.90999  1.00000 8.9 TiB 5.1 TiB 5.1 TiB 172 MiB   16 GiB 3.8 TiB 56.99 1.74 102                     osd.169
>>    180      hdd    8.90999  1.00000 8.9 TiB 5.8 TiB 5.8 TiB 131 MiB   18 GiB 3.1 TiB 65.04 1.99 111                     osd.180
>>     60      hdd    8.90999  1.00000 8.9 TiB 4.5 TiB 4.5 TiB 155 MiB   14 GiB 4.4 TiB 50.40 1.54  93                     osd.60
>>    146      hdd    8.90999  1.00000 8.9 TiB 4.8 TiB 4.8 TiB 139 MiB   15 GiB 4.1 TiB 53.70 1.64  92                     osd.146
>>    158      hdd    8.90999  1.00000 8.9 TiB 5.6 TiB 5.5 TiB 183 MiB   17 GiB 3.4 TiB 62.30 1.90 109                     osd.158
>>    170      hdd    8.90999  1.00000 8.9 TiB 5.7 TiB 5.6 TiB 205 MiB   18 GiB 3.2 TiB 63.53 1.94 112                     osd.170
>>    182      hdd    8.90999  1.00000 8.9 TiB 4.7 TiB 4.6 TiB 105 MiB   14 GiB 4.3 TiB 52.27 1.60  92                     osd.182
>>     63      hdd    8.90999  1.00000 8.9 TiB 4.7 TiB 4.7 TiB 156 MiB   15 GiB 4.2 TiB 52.74 1.61  98                     osd.63
>>    148      hdd    8.90999  1.00000 8.9 TiB 5.2 TiB 5.1 TiB 119 MiB   16 GiB 3.8 TiB 57.82 1.77 100                     osd.148
>>    159      hdd    8.90999  1.00000 8.9 TiB 4.0 TiB 4.0 TiB  89 MiB   12 GiB 4.9 TiB 44.61 1.36  79                     osd.159
>>    172      hdd    8.90999  1.00000 8.9 TiB 5.1 TiB 5.1 TiB 173 MiB   16 GiB 3.8 TiB 57.22 1.75  98                     osd.172
>>    183      hdd    8.90999  1.00000 8.9 TiB 6.0 TiB 6.0 TiB 135 MiB   19 GiB 2.9 TiB 67.35 2.06 118                     osd.183
>>    229      hdd    8.90999  1.00000 8.9 TiB 4.6 TiB 4.6 TiB 127 MiB   15 GiB 4.3 TiB 52.05 1.59  93                     osd.229
>>    232      hdd    8.90999  1.00000 8.9 TiB 5.2 TiB 5.2 TiB 158 MiB   17 GiB 3.7 TiB 58.22 1.78 101                     osd.232
>>    235      hdd    8.90999  1.00000 8.9 TiB 4.1 TiB 4.1 TiB 103 MiB   13 GiB 4.8 TiB 45.96 1.40  79                     osd.235
>>    238      hdd    8.90999  1.00000 8.9 TiB 5.4 TiB 5.4 TiB 120 MiB   17 GiB 3.5 TiB 60.47 1.85 104                     osd.238
>>    259      hdd   10.91399  1.00000  11 TiB 6.2 TiB 6.2 TiB 140 MiB   19 GiB 4.7 TiB 56.54 1.73 120                     osd.259
>>    231      hdd    8.90999  1.00000 8.9 TiB 5.1 TiB 5.1 TiB 114 MiB   16 GiB 3.8 TiB 56.90 1.74 101                     osd.231
>>    233      hdd    8.90999  1.00000 8.9 TiB 5.5 TiB 5.5 TiB 123 MiB   17 GiB 3.4 TiB 61.78 1.89 106                     osd.233
>>    236      hdd    8.90999  1.00000 8.9 TiB 5.1 TiB 5.1 TiB 114 MiB   16 GiB 3.8 TiB 57.53 1.76 101                     osd.236
>>    239      hdd    8.90999  1.00000 8.9 TiB 4.2 TiB 4.2 TiB  95 MiB   13 GiB 4.7 TiB 47.41 1.45  86                     osd.239
>>    263      hdd   10.91399  1.00000  11 TiB 5.3 TiB 5.3 TiB 178 MiB   17 GiB 5.6 TiB 48.73 1.49 102                     osd.263
>>    228      hdd    8.90999  1.00000 8.9 TiB 5.1 TiB 5.1 TiB 113 MiB   16 GiB 3.8 TiB 57.10 1.74  96                     osd.228
>>    230      hdd    8.90999  1.00000 8.9 TiB 4.9 TiB 4.9 TiB 144 MiB   16 GiB 4.0 TiB 55.20 1.69  99                     osd.230
>>    234      hdd    8.90999  1.00000 8.9 TiB 5.6 TiB 5.6 TiB 164 MiB   18 GiB 3.3 TiB 63.29 1.93 109                     osd.234
>>    237      hdd    8.90999  1.00000 8.9 TiB 4.8 TiB 4.8 TiB 110 MiB   15 GiB 4.1 TiB 54.33 1.66  97                     osd.237
>>    260      hdd   10.91399  1.00000  11 TiB 5.4 TiB 5.4 TiB 152 MiB   17 GiB 5.5 TiB 49.35 1.51 104                     osd.260
>>      0      hdd    8.90999  1.00000 8.9 TiB 5.2 TiB 5.2 TiB 157 MiB   16 GiB 3.7 TiB 58.28 1.78 102                     osd.0
>>      2      hdd    8.90999  1.00000 8.9 TiB 5.3 TiB 5.2 TiB 122 MiB   16 GiB 3.6 TiB 59.05 1.80 106                     osd.2
>>     72      hdd    8.90999  1.00000 8.9 TiB 4.4 TiB 4.4 TiB 145 MiB   14 GiB 4.5 TiB 49.89 1.52  89                     osd.72
>>     76      hdd    8.90999  1.00000 8.9 TiB 5.1 TiB 5.1 TiB 178 MiB   16 GiB 3.8 TiB 56.89 1.74 102                     osd.76
>>     86      hdd    8.90999  1.00000 8.9 TiB 4.6 TiB 4.5 TiB 155 MiB   14 GiB 4.3 TiB 51.18 1.56  94                     osd.86
>>      1      hdd    8.90999  1.00000 8.9 TiB 4.9 TiB 4.9 TiB 141 MiB   15 GiB 4.0 TiB 54.73 1.67  95                     osd.1
>>      3      hdd    8.90999  1.00000 8.9 TiB 4.7 TiB 4.7 TiB 156 MiB   15 GiB 4.2 TiB 52.40 1.60  94                     osd.3
>>     73      hdd    8.90999  1.00000 8.9 TiB 5.0 TiB 4.9 TiB 146 MiB   16 GiB 3.9 TiB 55.68 1.70 102                     osd.73
>>     85      hdd    8.90999  1.00000 8.9 TiB 5.6 TiB 5.5 TiB 192 MiB   18 GiB 3.3 TiB 62.46 1.91 109                     osd.85
>>     87      hdd    8.90999  1.00000 8.9 TiB 5.0 TiB 5.0 TiB 189 MiB   16 GiB 3.9 TiB 55.91 1.71 102                     osd.87
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx