Re: why is there heavy read traffic during object delete?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 11 Feb 2016 14:12:39 -0800



Looks like it to me, yeah. Not sure why it took so long to get noticed
though (that is, is that behavior present in the release you're using,
or is it a new bug)?
-Greg

On Thu, Feb 11, 2016 at 12:11 PM, Stephen Lord <Steve.Lord@xxxxxxxxxxx> wrote:
>
> I saw this go by in the commit log:
>
> commit cc2200c5e60caecf7931e546f6522b2ba364227f
> Merge: f8d5807 12c083e
> Author: Sage Weil <sage@xxxxxxxxxx>
> Date:   Thu Feb 11 08:44:35 2016 -0500
>
>     Merge pull request #7537 from ifed01/wip-no-promote-for-delete-fix
>
>     osd: fix unnecessary object promotion when deleting from cache pool
>
>     Reviewed-by: Sage Weil <sage@xxxxxxxxxx>
>
>
> Is there any chance that I was basically seeing the same thing from the filesystem standpoint?
>
> Thanks
>
>   Steve
>
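The promotion-on-delete behavior named in the commit title above can be shown with a toy model. This is a minimal sketch, not Ceph's OSD code. The `ToyCacheTier` class, its `promote_on_delete` flag, and the 4 MiB object size are hypothetical stand-ins. The model assumes that a delete always leaves a zero-sized whiteout in the cache. It also assumes that the old behavior copied the whole object up from the base pool before deleting it, which would show up as the heavy read traffic reported in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyCacheTier:
    """Hypothetical, highly simplified writeback cache tier over a base pool.

    Object contents are represented only by their size in bytes; a value of
    None in the cache marks a whiteout (the object no longer exists logically).
    """
    base: Dict[str, int] = field(default_factory=dict)
    cache: Dict[str, Optional[int]] = field(default_factory=dict)
    promote_on_delete: bool = True  # hypothetical switch for the old behavior
    bytes_read_from_base: int = 0

    def delete(self, name: str) -> None:
        if self.promote_on_delete and name not in self.cache and name in self.base:
            # Old behavior (as assumed here): promote the object into the cache
            # first, which means reading all of its data from the base pool.
            self.bytes_read_from_base += self.base[name]
        # Either way, the cache is left holding a zero-sized whiteout.
        self.cache[name] = None

    def flush_all(self) -> None:
        """Flush every cached entry to the base pool, then evict it."""
        for name, value in self.cache.items():
            if value is None:
                self.base.pop(name, None)  # a flushed whiteout deletes the base copy
            else:
                self.base[name] = value
        self.cache.clear()


def demo(promote_on_delete: bool, n_objects: int = 10) -> int:
    """Delete n already-flushed 4 MiB objects and return the bytes read from base."""
    tier = ToyCacheTier(promote_on_delete=promote_on_delete)
    for i in range(n_objects):
        tier.base[f"obj{i}"] = 4 * 1024 * 1024
    for i in range(n_objects):
        tier.delete(f"obj{i}")
    tier.flush_all()
    assert not tier.base  # the base pool is empty after the flush in both cases
    return tier.bytes_read_from_base


print(demo(True))   # 41943040 (10 x 4 MiB read just to delete)
print(demo(False))  # 0
```

In this model both variants reach the same final state after a flush, and only the read traffic differs. That is consistent with the observation later in the thread that the same delete against an untiered replicated pool read nothing.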
>> On Feb 5, 2016, at 8:42 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>
>> On Fri, Feb 5, 2016 at 6:39 AM, Stephen Lord <Steve.Lord@xxxxxxxxxxx> wrote:
>>>
>>> I looked at this system this morning, and it had actually finished what it was
>>> doing. The erasure coded pool still contains all the data and the cache
>>> pool has about a million zero-sized objects:
>>>
>>>
>>> GLOBAL:
>>>    SIZE       AVAIL     RAW USED     %RAW USED     OBJECTS
>>>    15090G     9001G        6080G         40.29       2127k
>>> POOLS:
>>>    NAME                ID     CATEGORY     USED       %USED     MAX AVAIL     OBJECTS     DIRTY     READ       WRITE
>>>    cache-data          21     -                 0         0         7962G     1162258     1057k      22969     3220k
>>>    cephfs-data         22     -             3964G     26.27         5308G     1014840      991k       891k     1143k
>>>
>>> Definitely seems like a bug since I removed all references to these from the filesystem
>>> which created them.
>>>
>>> I originally wrote 4.5 Tbytes of data into the file system, the erasure coded
>>> pool is set up as 4+2, and the cache has a size limit of 1 Tbyte. It looks like not
>>> all the data made it out of the cache tier before I removed the content. The delete
>>> removed the content that was only present in the cache tier and created a zero-sized
>>> object in the cache for all of the content. The used capacity is somewhat consistent
>>> with this.
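The capacity reasoning above can be checked with a little arithmetic. This is a minimal sketch built on three assumptions. First, a k+m erasure-coded pool stores (k + m) / k raw bytes per logical byte. Second, the `G` units in `ceph df` and the "Tbytes" above are both binary (1 T = 1024 G). Third, journal and filesystem overhead are ignored. `ec_raw_usage` is a hypothetical helper.

```python
def ec_raw_usage(logical_gb: float, k: int, m: int) -> float:
    """Raw capacity consumed by logical_gb of data in a k+m erasure-coded pool."""
    return logical_gb * (k + m) / k


ec_logical = 3964.0        # cephfs-data USED from the ceph df output
reported_raw = 6080.0      # GLOBAL RAW USED from the ceph df output
written_gb = 4.5 * 1024    # "4.5 Tbytes", assuming binary units
cache_limit_gb = 1024.0    # 1 Tbyte cache size limit, same assumption

expected_raw = ec_raw_usage(ec_logical, k=4, m=2)
never_flushed = written_gb - ec_logical

print(f"expected raw for EC data: {expected_raw:.0f}G")                     # 5946G
print(f"raw not explained by EC data: {reported_raw - expected_raw:.0f}G")  # 134G
print(f"written but never flushed: {never_flushed:.0f}G")                   # 644G
print(never_flushed < cache_limit_gb)                                       # True
```

Under these assumptions, the roughly 0.6 Tbytes that never reached the EC pool fits within the 1 Tbyte cache, which agrees with the reading above. The ~134G gap in raw usage may be journal and filesystem overhead, but this calculation alone cannot confirm that.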
>>>
>>> I tried to look at the extended attributes on one of the zero-sized objects with ceph-dencoder,
>>> but it failed:
>>>
>>> error: buffer::malformed_input: void object_info_t::decode(ceph::buffer::list::iterator&) unknown encoding version > 15
>>>
>>> Same error on one of the objects in the erasure coded pool.
>>>
>>> Looks like I am a little too bleeding edge for this, or the contents of the .ceph_ attribute are not an object_info_t.
>>
>> ghobject_info_t
>>
>> You can get the EC stuff actually deleted by getting the cache pool to
>> flush everything. That's discussed in the docs and in various mailing
>> list archives.
>> -Greg
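The flush Greg describes can be scripted. This is a minimal sketch, assuming the `rados` CLI is installed with admin credentials and that the cache pool is named `cache-data`, as in the `ceph df` output above. `flush_cache_pool` and `flush_evict_cmd` are hypothetical helpers. `cache-flush-evict-all` is the `rados` subcommand that the cache-tiering documentation uses for this step. The helper defaults to a dry run because flushing a large cache generates heavy I/O against the base pool.

```python
import shlex
import subprocess
from typing import List


def flush_evict_cmd(cache_pool: str) -> List[str]:
    """Build the rados command that flushes and then evicts every object in a cache pool."""
    return ["rados", "-p", cache_pool, "cache-flush-evict-all"]


def flush_cache_pool(cache_pool: str, dry_run: bool = True) -> List[str]:
    """Flush a cache pool so that whiteouts reach (and delete from) the base pool.

    Hypothetical helper: with dry_run=True it only prints the command.
    """
    cmd = flush_evict_cmd(cache_pool)
    if dry_run:
        print("would run:", shlex.join(cmd))
    else:
        # check=True raises CalledProcessError if rados exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


flush_cache_pool("cache-data")  # prints: would run: rados -p cache-data cache-flush-evict-all
```

Per Greg's note, once the whiteouts are flushed, the corresponding objects in the erasure-coded pool should actually be deleted.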
>>
>>>
>>>
>>>
>>> Steve
>>>
>>>> On Feb 4, 2016, at 7:10 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>>>
>>>> On Thu, Feb 4, 2016 at 5:07 PM, Stephen Lord <Steve.Lord@xxxxxxxxxxx> wrote:
>>>>>
>>>>>> On Feb 4, 2016, at 6:51 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> I presume we're doing reads in order to gather some object metadata
>>>>>> from the cephfs-data pool; and the (small) newly-created objects in
>>>>>> cache-data are definitely whiteout objects indicating the object no
>>>>>> longer exists logically.
>>>>>>
>>>>>> What kinds of reads are you actually seeing? Does it appear to be
>>>>>> transferring data, or merely doing a bunch of seeks? I thought we were
>>>>>> trying to avoid doing reads-to-delete, but perhaps the way we're
>>>>>> handling snapshots or something is invoking behavior that isn't
>>>>>> amenable to a full-FS delete.
>>>>>>
>>>>>> I presume you're trying to characterize the system's behavior, but of
>>>>>> course if you just want to empty it out entirely you're better off
>>>>>> deleting the pools and the CephFS instance entirely and then starting
>>>>>> it over again from scratch.
>>>>>> -Greg
>>>>>
>>>>> I believe it is reading all the data, just from the volume of traffic, and
>>>>> the CPU load on the OSDs suggests it may be doing more than
>>>>> just that.
>>>>>
>>>>> iostat is showing a lot of data moving; I am seeing about the same volume
>>>>> of read and write activity here. Because the same OSDs sit underneath both pools
>>>>> (I know that’s not exactly optimal), it is hard to tell which pool is responsible
>>>>> for which I/O. Large reads and small writes suggest it is reading up all the data
>>>>> from the objects; the write traffic is, I presume, all journal activity relating
>>>>> to deleting objects and creating the empty ones.
>>>>>
>>>>> The 9:1 ratio between things being deleted and created seems odd though.
>>>>>
>>>>> A previous version of this exercise with just a regular replicated data pool
>>>>> did not read anything; there was just a lot of write activity, and eventually the
>>>>> content disappeared. So this is definitely related to the pool configuration here
>>>>> and probably not to the filesystem layer.
>>>>
>>>> Sam, does this make any sense to you in terms of how RADOS handles deletes?
>>>> -Greg
>>>
>>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com