Re: Performance issues with writing files to Ceph via S3 API

> On Feb 8, 2024, at 07:05, Renann Prado <prado.renann@xxxxxxxxx> wrote:
> 
> Hello Anthony,
> 
> Sorry for the late reply.
> My thought process behind it was that maybe there's some kind of indexing
> that Ceph does under the hood, and perhaps the bucket structure could
> influence that.

Absolutely, that's why I asked the questions.


> But if you say it's not the case, then I was on the wrong path.
> 
> Sorry for the delay, but I also wanted to gather info.
> 
>> How many millions?
> 
> About 75 million.

In a single bucket???

> 
>> How big are they?
> 
> They vary from ~500 KB to a couple of megabytes, say 5 MB. I couldn't tell
> you whether most files are closer to 5 MB or to 500 KB, but if that's
> important I can try to figure it out.

No, that's fine.  Ceph, like many other object storage systems, has a harder time with small objects.  If they're much smaller than that you can end up with wasted space.  But at ~500 KB, metadata operations rival the cost of storing the data itself, so they can become a bottleneck and a hotspot.

> 
>> Are you writing them to a single bucket?
> 
> Yes. All these files are in a single bucket.


Yikes.  Any chance you could refactor the application to spread the objects across several smaller buckets?
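
If you do go that route, one common pattern -- purely a sketch, the bucket count and naming scheme here are made up -- is to hash the object key to one of N buckets so the mapping stays deterministic:

    import hashlib

    NUM_BUCKETS = 64              # hypothetical; sized so each bucket stays around ~1M objects
    BUCKET_PREFIX = "mydata-"     # hypothetical naming scheme

    def bucket_for_key(key: str) -> str:
        # Deterministically map an object key to one of NUM_BUCKETS buckets.
        shard = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS
        return f"{BUCKET_PREFIX}{shard:03d}"

    # e.g. bucket_for_key("3fa4c0de-...-uuid") might return "mydata-017"

At 75M objects, 64 buckets works out to roughly 1.2M objects each, which is far friendlier to the bucket index.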

> 
>> How is the index pool configured?  On what media?
>> Same with the bucket pool.
> 
> I wouldn't be able to answer that unfortunately.
> 
>> Which Ceph release?
> 
> Pacific (https://docs.ceph.com/en/pacific/).
> 
>> Sharding config?
>> Are you mixing in bucket list operations?
> 
> We don't use list operations on this bucket, but the Ceph infrastructure is
> shared across multiple companies and we are aware that there are others
> using list operations *on other buckets*. I can also say that list
> operations on this bucket are, IIRC, failing (to the point where we don't
> have an exact count of how many objects are in the bucket).

Could be a timeout.  The S3 list call returns at most 1,000 objects per request, so for a bucket this large the client has to iterate with continuation tokens.
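
If you ever do need to enumerate the bucket, that iteration looks roughly like this -- a minimal boto3 sketch, with a made-up endpoint and bucket name:

    import boto3

    # hypothetical endpoint and bucket; credentials come from the usual env/config
    s3 = boto3.client("s3", endpoint_url="https://rgw.example.com")

    paginator = s3.get_paginator("list_objects_v2")
    total = 0
    # Each page holds at most 1,000 keys; the paginator follows the
    # continuation tokens for you.
    for page in paginator.paginate(Bucket="your-bucket"):
        total += page.get("KeyCount", 0)

    print(total)

On a bucket with 75M objects that is still 75,000+ round trips, so it will be slow even when it doesn't time out -- another argument for keeping per-bucket object counts down.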

> The provider has a Prometheus exporter which currently fails to export the
> metrics in production.


> 
>> Do you have the ability to utilize more than one bucket? If you can limit
>> the number of objects in a bucket that might help.
> 
> Technically it should be possible, but I'd assume that Ceph can abstract
> this complexity away from the bucket user so that we don't have to care
> about it. If we do it, I would see it more as a workaround than a real
> solution.

I don't recall the succession of changes to bucket sharding.  With your Pacific release it could be that auto-resharding isn't enabled or isn't functioning.  I suspect that bucket sharding is the heart of the issue.
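
Whoever operates the cluster can check this quickly.  Something along these lines (illustrative only; the bucket name is a placeholder):

    radosgw-admin bucket stats --bucket=<your-bucket>   # note num_shards and num_objects
    radosgw-admin bucket limit check                    # objects_per_shard and fill_status per bucket
    radosgw-admin reshard list                          # any resharding queued or in progress?

If tens of millions of objects are sitting on only a handful of index shards, that by itself would explain slow and timed-out operations.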

> 
>> If your application keeps track of object names you might try indexless
>> buckets.
> 
> I didn't know there was this possibility.
> 
> I don't know how Ceph works under the hood, but assuming that all files are
> ultimately written to the same folder on disk, could that be a problem?

It doesn't work that way.  Ceph has an abstracted foundation layer called RADOS, and the data isn't stored on disk as traditional files.

> In the past I have struggled with a Linux filesystem getting too slow due
> to too many files written to the same folder.

It could be a similar but not identical issue.

When a Ceph cluster runs RGW to provide object storage, it has a dedicated pool that stores bucket indexes.  At any real scale this pool must live on fast media (SSDs), spread across enough separate drives and with enough placement groups.
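
The operator can verify that with standard commands.  The pool is typically named something like default.rgw.buckets.index, depending on the zone setup:

    ceph osd pool ls detail | grep index                 # pg_num and crush rule for the index pool
    ceph osd pool get default.rgw.buckets.index pg_num
    ceph osd crush rule dump <rule-name>                 # confirm it maps to SSD/NVMe device classes

If the index pool lands on the same spinning disks as the data, writes into a huge bucket will crawl.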

Each bucket's index is broken into "shards".  With older releases that sharding was manual -- for very large buckets one would have to manually reshard the index, or pre-shard it in advance for the eventual size of the bucket.

Recent releases have dynamic bucket index resharding, which does this automatically if it's enabled.
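
Whether it's enabled, and at what threshold (rgw_max_objs_per_shard, 100k objects per shard by default), is something the provider can check; a manual reshard is also possible.  Roughly -- placeholders, and the exact config invocation depends on how the gateways are deployed:

    # is dynamic resharding on?
    ceph config get client.rgw rgw_dynamic_resharding

    # manual reshard: pick num-shards for the bucket's eventual size,
    # e.g. 75M objects / 100k per shard => on the order of 750+ shards
    radosgw-admin bucket reshard --bucket=<your-bucket> --num-shards=<N>

IIRC writes to the bucket are blocked while a reshard runs, and on a bucket this size it would take a while, so it's something to schedule carefully.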

My command of these dynamics is limited, so others on the list may be able to chime in with refinements.


> 
> Thanks for the help already!
> 
> Best regards,
> *Renann Prado*
> 
> 
> On Sat, Feb 3, 2024 at 7:13 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
> wrote:
> 
>> The slashes don’t mean much if anything to Ceph.  Buckets are not
>> hierarchical filesystems.
>> 
>> You speak of millions of files.  How many millions?
>> 
>> How big are they?  Very small objects stress any object system.  Very
>> large objects may be multi part uploads that stage to slow media or
>> otherwise add overhead.
>> 
>> Are you writing them to a single bucket?
>> 
>> How is the index pool configured?  On what media?
>> Same with the bucket pool.
>> 
>> Which Ceph release? Sharding config?
>> Are you mixing in bucket list operations?
>> 
>> It could be that you have an older release or a cluster set up on an older
>> release that doesn’t effectively auto-reshard the bucket index.  If the
>> index pool is set up poorly - slow media, too few OSDs, too few PGs - that
>> may contribute.
>> 
>> In some circumstances pre-sharding might help.
>> 
>> Do you have the ability to utilize more than one bucket? If you can limit
>> the number of objects in a bucket that might help.
>> 
>> If your application keeps track of object names you might try indexless
>> buckets.
>> 
>>> On Feb 3, 2024, at 12:57 PM, Renann Prado <prado.renann@xxxxxxxxx> wrote:
>>> 
>>> Hello,
>>> 
>>> I have an issue at my company where we have an underperforming Ceph
>>> instance.
>>> The issue that we have is that sometimes writing files to Ceph via S3 API
>>> (our only option) takes up to 40s, which is too long for us.
>>> We are a bit limited in what we can do to investigate why it's performing
>>> so badly, because we have a service provider in between, so getting to the
>>> bottom of this really is not that easy.
>>> 
>>> That being said, the way we use the S3 API (again, Ceph under the hood) is
>>> by writing all files (multiple millions) to the root, so we don't use any
>>> folder-like structure, e.g. we write */<uuid>* instead of
>>> */this/that/<uuid>*.
>>> 
>>> The question is:
>>> 
>>> Does anybody know whether Ceph has performance gains when you create a
>>> folder structure vs when you don't?
>>> Looking at Ceph's documentation I could not find such information.
>>> 
>>> Best regards,
>>> 
>>> *Renann Prado*
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



