Search Postgresql Archives

Re: UPDATE on two large datasets is very slow

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



The problem with this approach is really that the tracker + website combination is designed to show really detailed statistics for every user on every torrent, which (as you mentioned) complicates things a great deal. It's also able to store all the history information for all the clients, which is periodically combined into a type of "Old Torrents" row, one per user, used as a placeholder for old statistics. The tracker was designed around the concept of the website, so it's not something that can easily be changed. In addition, the data inside the tracker is very transient; it sort of "garbage collects" its internal structures to save memory, which means a database-like system is needed for any kind of persistence.

As an aside, the tracker itself is designed with the asynchronous approach you mentioned, but it has a worker thread which handles the database updates so the main thread can continue running as fast as possible. The current production version, still using MySQL (sadly), handles ~500-800 connections/sec on average, which equates to about 2.7 MiB of raw query data needing to be updated in the database. However, benchmarking indicates it can handle up to 6,800 connections/sec with around 1,000 parallel connections on a fairly high end system, and at that rate the volume of SQL data being generated would be utterly enormous (I believe it'd be around 22.5 MiB), probably so much so that keeping statistics would be impossible no matter what RDBMS was used, which is a problem that will need solving eventually.

All that said, I'm really just looking for a way to make Postgres work for me as best as I can get it to, without having to re-engineer the fundamental ways in which the tracker operates, but I do appreciate the advice dearly.

Thanks for the reply.

Steve Gerhardt

Listmail wrote:
No, he can't, I also coded a bittorrent tracker of the same kind, and the problem is :

    - each user has N torrents active
    - torrent client does N tracker requests
    - tracker only does 1 UDPATE to update user's stats

So if you do instantaneous updates you multiply your query load by N (on average between 6 and 8).

    Besides, these kinds of trackers face several problems :

- they are accessed by clients which have near saturated connections since they're leeching illegal prOn like crazy
    - therefore these HTTP connections are very slow
    - therefore you have a hell of a lot of concurrent connections.

Therefore using a threaded server for this kind of load is asking for trouble. All decent torrent trackers are designed like lighttpd : select() / poll() or other variants, and no threads. No threads means, database queries are something long and to be avoided. Hosting providers will delete your account if they see a php torrent tracker on it, and for good reason.

    600 hits/s = 600 connections = 600 apache and PG process = you cry.

Anyway my tracker was in Python with select/poll asynchronous HTTP model. It handled 200 HTTP requests per second using 10% CPU on a Core 2. I guess thats pretty decent.

(I do NOT work on it anymore, DO NOT ASK for sources, it is illegal now in my country to code trackers so I have completely dropped the project, but I guess helping a fellow living in a free country is OK)

    Back to databases.

    You complain that postgres is slow for your application.
Yes, it is a lot slower than MyISAM *on this application* (but try InnoDB and cry). But PG is a real database. It is simply not the right tool to your application.
    You have to choose between in-place updates and transactions.

    (besides, your website is locked while MySQL does your big UPDATE).

    Here is how you can do it :

    Your problem is that you put the peers in the database.
    Ask yourself why ?

    You need seeders / leechers count for each torrent ?
-> Two INTEGERs in your torrents table, updated in batch by the tracker every hour.

You need to have all peers saved somewhere so that you may exit and restart your tracker ?
    -> code your tracker in python and be able to reload running code
    -> or just save it when you exit
-> or don't save it, it's not like it's your bank accounting data, who cares -> the FBI will be happy to have all that data when they seize your server (see: piratebay laughs as all data was in RAM and police had to unplug the server to seize it.)

    So, DO NOT put the peers in the database.
IF you put the peers info in the database you get one UPDATE per user per torrent.
    If you only update the user stats you only get one UPDATE per user.
And the tracker never inserts users and torrents (hopefully) so you only get UPDATES to users and to torrents tables, never inserts.

    Now you need to display realtime info on the user's and torrents pages.
This is easily done : your tracker is a HTTP server, it can serve data via HTTP (php serialized, JSON, whatever) that is inserted via AJAX of PHP in your webpages.

From my stats my tracker needs about 100 microseconds to serve a HTTP web page with the peer counts for a torrent.

So, you don't need Postgres for your tracker ! Use it for your website instead...


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux