fíam

(rhymes with liam)

  • The limits of Django: the answers

    April 20, 2008 at 23:43:48 CEST

    Since my last post about the limits of Django ended up being reddited and received a lot of comments, I've decided to write a new post answering all your questions. I'll try to reply to every question posted here or at reddit.com, but feel free to ask for more clarification or more questions on any subject.

    Let's start from the most simple to the more complex. All the tests were executed using lighttpd+FCGI with settings.DEBUG = False.

    On the other hand, for those who have pointed that ordering should be done at database layer, let me say that's not practical for this problem. Take into account that ordering these stories implies some calculations which can't be done by the database server. Furthermore, do you think that fetching 5000 stories joined to 500000 votes can be fast?

    Others have said that if I'm not displaying 5000 items per page, which is obviusly not the case, I should apply some LIMIT clause to my queries for fetching only the first stories. As far as I know, I need to order all the stories before I can split them in pages, because the ordering depends solely on the algorithm (you can read about it in "how ffloat.it works").

    Now, let's see the iterations the code went over before reaching its final stage.

    First, I implemented the algorithm using Django's ORM, using select_related() as appropiate, and then I rewrote it using only values. If memory serves correctly, I didn't change anymore. So, as Django developers crearly state on the docs, creating the objects introduces a non trivial overhead.

    Next step was the Python implementation using the native database driver. I this case I also introduced some optimizations to the algorithm which I couldn't do using the ORM, mostly in the SQL layer. I don't have any numbers, but the total number of queries was a bit lower. Note also that this code totally skipped the database abstraction machinery provided in Django, which should add some overhead.

    After my failed attemps using Python, it was time to move to C. The first approach was writing a C extension. This let my made some tricks not available in Python, like using bit level logic for some aritmetic operations. However, after some code profiling, I - unsurprisingly - discovered that the worst bottleneck in this code was fetching the ffloating stories from the database, which led me to the next stage.

    One of the possible solutions for the bottleneck was caching the stories using the excellent cache framework provided by Django. However, this would mean passing objects around between Python and C code, which would introduce additional complexity and overhead. So I decided to implement a daemon which could query the database for the story list and keep it for one or two minutes. I've already had the algorithm written in C, so adding a daemon on top of it was an easy task. Furthermore, Python provides the struct module, which makes passing numbers between the webapp and the daemon trivial. Maybe I'm cheated a bit here, since this first implementation did a bit of caching on the stories.

    At this point I was at 0.8 seconds for 5000 stories and 500000 votes. Maybe this is was fast enough, since the ffloating stories are the ones submitted in the last 24 hours. If I ever reach the point when users submit so much stories in 24 hours, I could justify paying for a better hosting. However, being a passionate of embedded Linux devices and having written code for them for the last four years, I though I could do better. And I was right.

    After some benchmarking, I ended arranging the full story list in memory using binary search trees. This way I could find a story in O(log n) and iterate over them in O(n). Every new story or vote was sent from the webapp to the daemon, so I could skip the database layer. Other dense data structures were stored as arrays, since they're blanzingly fast and the overhead of using them in dense sets is negligible. Then I added some more tricks like a biyective transformation between floats and integers, so I could sort them using radix, cache prefetching and SSE3 instructions (for those not familiar with SIMD, you can process multiple data using a single CPU instruction). With all these tricks in place, I could order 5000 stories in just 0.05 seconds.

    I don't think most of these optimizations can't be done in Python, but feel free to correct me if I'm wrong. On the other hand, I realize not every web application programmer is capable of doing something like this. Note, however, that this approach has some tradeoffs. For example, the daemon can segfault, so I have to run a watchdog for restarting it and emailing me a backtrace in case something goes wrong.

    Let me end this post with a curiosity. When I wrote the daemon I decided to write a compile-time database abstraction layer (providing me with zero runtime overhead), so I could benchmark the two most popular opensource database servers. One of them was almost twice slower than the other. Since I don't want to start a flamewar, I'm not telling you which one was faster.

    UPDATE: Just to clarify three things:

    When I say "I realize not every web application programmer is capable of doing something like this" I don't mean I did something extremely difficult. I mean not every programmer out there writing web applications knows C and assembler.

    On the other hand, for those that suggest denormalizing the data, how can I do it in real time if the order is different for every user and calculating it takes 93 seconds for every one of them?? In addition, 1000 users x 5000 stories = 5000000 potential writes to the database every time a user votes on something doesn't sound good to me. Please, don't try to tell me how to do things If you don't undertstand what I'm doing here.

    And finally, for those suggesting I've no fucking clue or I'm a bad designer or I'm smoking something or this article made you vomit, show some respect. I'm confortable with C and assembler, so I took that route and it worked for me.

  • The limits of Django

    April 18, 2008 at 18:34:37 CEST

    I myself love Django, it lets you develop websites really quickly. But there are some tasks which Django is not well suited for, and I recently ran into one of them: number crunching.

    As some of you may be aware of, I've recently launched ffloat.it, a recommendation based social news site. One of the most difficult task is ordering all the news received in the last 24 hours when the user hits the site, and you need to do it quickly. It involves a lot of mathematical operations, even when you precalculate as much as possible, since some things need to be done on demand because storing the order of the news for every user in the database and recalculating it every time a new vote is submitted doesn't make much sense.

    My first approach was coding it like the rest of the site, using the Django ORM. And it was fast enough, until I tried to order more than 50 stories. My tests with 5000 stories and 500000 votes took 93 seconds to complete for only a given user. That wasn't acceptable, so I started rewriting the code in a lower level. I tried some different approachs:

    • Django ORM using objects: 93 seconds
    • Django ORM using values: 78 seconds
    • Python: 51 seconds
    • C extension for Python: 22 seconds
    • Separate daemon in C communicating through UNIX sockets: 0.8 seconds
    • C daemon highly optimized, multihreaded, using in-process caching and SSE3 assembler: 0.05 seconds

    In addition, when I got to C the memory comsumption was reduced by one order of magnitude, even after introducing caching. All these test were run in my development environment, which consists in a virtual machine running on top of Mac OS X 10.5 on a Core2 @ 2.2 Macbook.

    Well, the point is sometimes you can't use all the abstractions Django provides, because that way your code is too slow. And that's even more important when you are running a web application - as opposed to desktop applications -, since the user can't wait one more second for the page to complete loading.

  • 12 hours until the first ffloat

    April 15, 2008 at 02:55:42 CEST

    After four months of hard work, I'm finally launching my social news site. Another one, like the millions of Digg clones? Well, definitely no. There are lots of differences, but let's talk about the key points:

    • Recommendation system: The ffloating section (think of it like the upcoming section in other sites) is always ordered according to your preferences, calculated from your previous ffloats and ssinks. You can't order the stories in any other way, and that's not a bug, it's a feature.

    • Hierarchicaly geolocated stories: Some sites of this type have been offering some options for browsing your local news. However, all you can do is enter something like your zip code or your city name and get the news tagged there. I'm doing something better: whenever you submit a new story, I ask you to geolocate it in a place defined by the geonames database (there's a map and a search input box for making this as easy as possible) and then, thanks to the geonames database, you can browse by any territorial division you want. Do you want to see the news happening in your state? No problem, browse to its surface and every story geolocated in a location inside your state will be listed. Do you prefer to see only stories for America? That's also possible. Or do you want to list only stories happening in your city? ffloat.it also supports it.

    • Hierarchical promotion: Well, now we have a nice upcoming queue, where you can browse by topic, by location or by both. But what about the promoted news? That's called the emerged section in ffloat.it and also works hierarchically. Let's suppose you've submitted a story concerning your city and it's now ffloating around. First people seing it will be the ones interested in news happening in your city or in the topics the story belongs to. When users start voting it, and given it gets points enough - let's say x -, it will first emerge in your city (however, it will go on ffloating until it gets 24 hours old). And now what? The story may be important enough for people interested in news happening in your state, so it may get more points - let's say x+y - and then emerge in your state also. And this continues until the story gets to a surface where it doesn't get points enough for emerge or when ir reaches the root surface: the main frontpage. What did I achieve with this system? Well, it doesn't matter if only a small userbase cares for a given story, it will have its place a ffloat.it anyway. Maybe it's not important enough for reaching the main frontpage, but it may be important enough for the frontpage for Europe or even for your city. Everything is relevant. In addition, everything I've said to locations applies also to topics, so your story may emerge in Globe/Technology, but not in Globe.

    • Multi-Language: I'm not a native English speaker, but I can read English news without a problem (as readers of my blog ,you may have noticed the same is not true for writing). In fact, there are a lot of news I do care for that are only available in English, but I'm also interested in news happening in my country, Spain. So I've decided to launch the site in two initial languages: English and Spanish. The English site will be at ffloat.it, whereas the Spanish one will be at fflotalo.com. Other languages like French, German and Catalan should follow shortly (Catalan flavour will probably appear as soon as next week).

    • Fair voting: I've been studying how other sites managed to determine when a story should be promoted and I found them to be really far from fair. So, when designing this site, I tried to be as fair as possible, and that's easy when you have a nice recommendation system in place. The more likely you are to like a story, the less your positive vote counts and the more likely you are to dislike a story, the less your negative vote counts (note however that the reverse is not true, if yo are likely to dislike a story, your positive vote doesn't get any bonus points, since this would make gaming the system trivial ). If you rely on your network of 1000 friends for voting on your stories and making them popular, I'm sorry for you, because this won't work at ffloat.it. Their votes toward your stories will start counting less and less, eventually reaching zero.

    I could continue writing about more features of ffloat.it, but I think you'll have more fun discovering them when the site opens tomorrow April 15th at 14:00 GMT. Remember there aren't irrelevant stories in ffloat.it, so feel free to submit whatever you like (read the ToS first, some things like porn are not allowed). ffloat.it, in contrast to any other social news site on the Internet, doesn't care about the stories which matter for 90% of the users, it cares about stories important to you.

  • New feature in Blango

    April 9, 2008 at 20:55:28 CEST

    I've started receiving some spam in comments. I thought it would take longer, since I'm running a custom software, but the bots have been posting no less than five a comments a day since the last week.

    My idea is implementing full spam detection, but it's something I don't currently have time for. So, in the meantime, I've implemented disabling comments on a given entry. Usually spammers take some days before starting to hit an entry, so users can leave comments and, when the entry starts receiving only spam comments, I can close it.

  • Integrating Django in Google App Engine

    April 8, 2008 at 16:59:53 CEST

    As some of you may be aware of, Google has launched a framework for developing and publishing web applications. Currently, it only supports Python, altough they say they're working on integrating more languages. The interesting thing is their framework it's really similar to Django - it even includes Django's template engine - diffing only on a few points:

    • The database layer: Google App Engine apps can't use a standard database, they use the Datastore API.
    • User authentication: Google provides a standard API for handling authentication based on Google/Gmail accounts. However, you can still use the Django auth system, provided we have some glue for the database layer.
    • Deployment: Google App Engine runs only under CGI, which Django doesn't support. However, there's a (refused) patch in Trac for addressing this (I haven't tested it yet, it may not work).
    • Static files: Every file or directory containing static files needs to be listed into a manifest-style file.

    Looking at this, running Django apps in Google App Engine seems really easy. We only need a wrapper translating calls from Django ORM to Google Datastore, write a manifest and, perhaps, a bit of tweaking to the mentioned patch. In fact, I'm planning to start working on all these things as soon as I have time. On the other hand, I'm finally launching ffloat.it on April 15th, so I'm a bit busy this week.