-
Building a website API with Django. Part 4: Complex functions
Aug. 19, 2008 at 14:17:07 CESTWe've talked about how the API functions are defined in WAPI, but we've only scratched the surface. This entry will go on details about function namespacing, optional parameters, parameter sets, parameter validation and allowed request methods, showing some real examples from the code I wrote for byNotes.
Namespacing your functions
As we've seen in the first post in this series, you API is loaded at an URL, for example http://example.com/api/1.0/rest/, so if you define a function named foo, it will be accessible from http://example.com/api/1.0/rest/foo.xml (or json or yaml). This is enough if your API consists only of a few functions. But what happens when your API has 50 functions?
WAPI provides support for namespacing your API. You only need to prepend your method with your namespace name and the binding will take care of splitting them. For example, the ReST binding uses the pattern namespace/function_name, while the upcoming XML-RPC binding will use namespace.function_name. Let's see an example:
class BynotesApi(object): ... @required_parameter('id', int, _('The geoname id you want to retrieve information for.')) def geo__hierarchy(self, request, dct): """Returns the hierarchy for the given geoname id, starting from itself.""" try: geoname = Geoname.objects.get(pk=dct['id']) except Geoname.DoesNotExist: return SingleSerializableResponse(None) return SerializableResponse([geoname] + geoname.hierarchy)
As you see, the namespace separator in WAPI is "__" (two underscores). So when this method gets exposed with the rest binding it will live at geo/geoname.{ format }, while the XML-RPC binding will serve it from geo.geoname.
Optional parameters and parameter sets
Sometimes your functions will accept optional parameters, so WAPI also provides support for working with them and lets you provide a default parameter. Note that if you don't specify a default value, there's always an implicit default value: None. Id est, the key in the dct parameter will always be present, so you don't need to check for it. The syntax for optional parameters is:
@required_parameter('id', int, _('The geoname id you want to retrieve information for.')) @optional_parameter('parent', bool, _('Wheter to include information for the parent location.'), default=False) def geo__geoname(self, request, dct): """Returns information about a given geoname, like class and code, coordinates, ...""" try: geoname = Geoname.objects.get(pk=dct.get('id')) except Geoname.DoesNotExist: return SingleSerializableResponse(None) method = None if dct['parent']: method = 'parent' return SingleSerializableResponse(geoname, method=method)
Well, you can also see that this serialization is receiving an additional parameter we haven't seen before (method). That's related to the complex serializations we'll see in the next article.
Another common use case is having multiple functions which receive the same parameters (or a subset of them). In WAPI, this is implemented using parameter sets. For example, there are multiple API functions at byNotes which support pagination, so I've defined a parameter set I add to all those functions, saving me from DRY:
from wapi.parameters import FunctionParameter, FunctionParameterSet PAGINATION_PARAM_SET = FunctionParameterSet( FunctionParameter('since_id', int, _('Return only notes with id greater than since_id.')), FunctionParameter('offset', int, _('Skip the first offset notes.'), default=0, validators=RangeValidator(min_value=0)), FunctionParameter('limit', int, _('The maximum number of notes returned.'), default=30, validators=RangeValidator(min_value=1, max_value=50)), )
Ignore the validators argument for now, we'll see it in the next paragraph. As you can see, parameter sets don't specify if the arguments are required or optional, since this approach provides more flexibility. Adding them to one of you functions uses the same syntax we've seen before:
@optional_parameter('geoname', int, _('Geoname identifier for the area you want to restrict notes to.')) @optional_parameter(PAGINATION_PARAM_SET) def timeline(self, request, dct): """Returns latest public notes. If a geoname is provided, only notes posted from it and its children will be returned.""" since_id, start, end = get_range(dct) nset = Note.live.order_by('-id').filter(id__gt=since_id, public=True) if dct.get('geoname'): geoname = get_object_or_404(Geoname, id=dct.get('geoname')) nset = nset.filter(Note.in_geoname_q(geoname)[0]) return SerializableResponse(nset[start:end])
optional_parameter and required_parameter decorators know how to deal with parameter sets, so you don't need to learn a new syntax. Parameter sets also support iteration and slicing, just in case you want to use only a subset of them.
Extended parameter validation
As we saw in part1, WAPI does basic type validation and conversion to all the parameters a function receives. However, sometimes you want to perform additional validation and that's when wapi.validators comes handy. First, let me say this validators are not same as django.core.validators (altought some wapi.validators call django.core.validators) and are incompatible with them, so you can't pass a validator from django.core.validators to a WAPI function. However, by the next release of WAPI I may add some code to "automagically" convert them, so you'll be able to pass any validator.
Currently WAPI provides validators for string length, unicode, range, comma separated values and multiple choices. If you want another validator added before the next release, ping me as soon as possible with the details.
Validators are specified using the validators keyword when creating a function parameter, and it can be a single instance of wapi.validators.Validator or an iterable containing some of them. For example:
@required_parameter('place_name', unicode, _('Place name for the event (this is only for display, this parameter is not for the full address).'), validators=StringValidator(max_length=50))
Allowed request methods
By default, functions in WAPI are readwrite. Id est, RestBinding exposes them as GET and POST methods. There are two decorators for altering this behavior, readonly and writeonly. As you surely have guessed, readonly methods get exposed only as GET, while writeonly methods only work with POST. As a final note, keep in mind that WAPI does not mix GET and POST, parameters passed by GET are ignored in POST requests.
Next article in this series
In the next article, I'll write about object serialization using wapi.serializers and how you can chain and include other serializations and pass parameters to them.
-
The limits of Django: the answers
April 20, 2008 at 23:43:48 CESTSince my last post about the limits of Django ended up being reddited and received a lot of comments, I've decided to write a new post answering all your questions. I'll try to reply to every question posted here or at reddit.com, but feel free to ask for more clarification or more questions on any subject.
Let's start from the most simple to the more complex. All the tests were executed using lighttpd+FCGI with settings.DEBUG = False.
On the other hand, for those who have pointed that ordering should be done at database layer, let me say that's not practical for this problem. Take into account that ordering these stories implies some calculations which can't be done by the database server. Furthermore, do you think that fetching 5000 stories joined to 500000 votes can be fast?
Others have said that if I'm not displaying 5000 items per page, which is obviusly not the case, I should apply some LIMIT clause to my queries for fetching only the first stories. As far as I know, I need to order all the stories before I can split them in pages, because the ordering depends solely on the algorithm (you can read about it in "how ffloat.it works").
Now, let's see the iterations the code went over before reaching its final stage.
First, I implemented the algorithm using Django's ORM, using select_related() as appropiate, and then I rewrote it using only values. If memory serves correctly, I didn't change anymore. So, as Django developers crearly state on the docs, creating the objects introduces a non trivial overhead.
Next step was the Python implementation using the native database driver. I this case I also introduced some optimizations to the algorithm which I couldn't do using the ORM, mostly in the SQL layer. I don't have any numbers, but the total number of queries was a bit lower. Note also that this code totally skipped the database abstraction machinery provided in Django, which should add some overhead.
After my failed attemps using Python, it was time to move to C. The first approach was writing a C extension. This let my made some tricks not available in Python, like using bit level logic for some aritmetic operations. However, after some code profiling, I - unsurprisingly - discovered that the worst bottleneck in this code was fetching the ffloating stories from the database, which led me to the next stage.
One of the possible solutions for the bottleneck was caching the stories using the excellent cache framework provided by Django. However, this would mean passing objects around between Python and C code, which would introduce additional complexity and overhead. So I decided to implement a daemon which could query the database for the story list and keep it for one or two minutes. I've already had the algorithm written in C, so adding a daemon on top of it was an easy task. Furthermore, Python provides the struct module, which makes passing numbers between the webapp and the daemon trivial. Maybe I'm cheated a bit here, since this first implementation did a bit of caching on the stories.
At this point I was at 0.8 seconds for 5000 stories and 500000 votes. Maybe this is was fast enough, since the ffloating stories are the ones submitted in the last 24 hours. If I ever reach the point when users submit so much stories in 24 hours, I could justify paying for a better hosting. However, being a passionate of embedded Linux devices and having written code for them for the last four years, I though I could do better. And I was right.
After some benchmarking, I ended arranging the full story list in memory using binary search trees. This way I could find a story in O(log n) and iterate over them in O(n). Every new story or vote was sent from the webapp to the daemon, so I could skip the database layer. Other dense data structures were stored as arrays, since they're blanzingly fast and the overhead of using them in dense sets is negligible. Then I added some more tricks like a biyective transformation between floats and integers, so I could sort them using radix, cache prefetching and SSE3 instructions (for those not familiar with SIMD, you can process multiple data using a single CPU instruction). With all these tricks in place, I could order 5000 stories in just 0.05 seconds.
I don't think most of these optimizations can't be done in Python, but feel free to correct me if I'm wrong. On the other hand, I realize not every web application programmer is capable of doing something like this. Note, however, that this approach has some tradeoffs. For example, the daemon can segfault, so I have to run a watchdog for restarting it and emailing me a backtrace in case something goes wrong.
Let me end this post with a curiosity. When I wrote the daemon I decided to write a compile-time database abstraction layer (providing me with zero runtime overhead), so I could benchmark the two most popular opensource database servers. One of them was almost twice slower than the other. Since I don't want to start a flamewar, I'm not telling you which one was faster.
UPDATE: Just to clarify three things:
When I say "I realize not every web application programmer is capable of doing something like this" I don't mean I did something extremely difficult. I mean not every programmer out there writing web applications knows C and assembler.
On the other hand, for those that suggest denormalizing the data, how can I do it in real time if the order is different for every user and calculating it takes 93 seconds for every one of them?? In addition, 1000 users x 5000 stories = 5000000 potential writes to the database every time a user votes on something doesn't sound good to me. Please, don't try to tell me how to do things If you don't undertstand what I'm doing here.
And finally, for those suggesting I've no fucking clue or I'm a bad designer or I'm smoking something or this article made you vomit, show some respect. I'm confortable with C and assembler, so I took that route and it worked for me.
-
The limits of Django
April 18, 2008 at 18:34:37 CESTI myself love Django, it lets you develop websites really quickly. But there are some tasks which Django is not well suited for, and I recently ran into one of them: number crunching.
As some of you may be aware of, I've recently launched ffloat.it, a recommendation based social news site. One of the most difficult task is ordering all the news received in the last 24 hours when the user hits the site, and you need to do it quickly. It involves a lot of mathematical operations, even when you precalculate as much as possible, since some things need to be done on demand because storing the order of the news for every user in the database and recalculating it every time a new vote is submitted doesn't make much sense.
My first approach was coding it like the rest of the site, using the Django ORM. And it was fast enough, until I tried to order more than 50 stories. My tests with 5000 stories and 500000 votes took 93 seconds to complete for only a given user. That wasn't acceptable, so I started rewriting the code in a lower level. I tried some different approachs:
- Django ORM using objects: 93 seconds
- Django ORM using values: 78 seconds
- Python: 51 seconds
- C extension for Python: 22 seconds
- Separate daemon in C communicating through UNIX sockets: 0.8 seconds
- C daemon highly optimized, multihreaded, using in-process caching and SSE3 assembler: 0.05 seconds
In addition, when I got to C the memory comsumption was reduced by one order of magnitude, even after introducing caching. All these test were run in my development environment, which consists in a virtual machine running on top of Mac OS X 10.5 on a Core2 @ 2.2 Macbook.
Well, the point is sometimes you can't use all the abstractions Django provides, because that way your code is too slow. And that's even more important when you are running a web application - as opposed to desktop applications -, since the user can't wait one more second for the page to complete loading.
-
12 hours until the first ffloat
April 15, 2008 at 02:55:42 CESTAfter four months of hard work, I'm finally launching my social news site. Another one, like the millions of Digg clones? Well, definitely no. There are lots of differences, but let's talk about the key points:
Recommendation system: The ffloating section (think of it like the upcoming section in other sites) is always ordered according to your preferences, calculated from your previous ffloats and ssinks. You can't order the stories in any other way, and that's not a bug, it's a feature.
Hierarchicaly geolocated stories: Some sites of this type have been offering some options for browsing your local news. However, all you can do is enter something like your zip code or your city name and get the news tagged there. I'm doing something better: whenever you submit a new story, I ask you to geolocate it in a place defined by the geonames database (there's a map and a search input box for making this as easy as possible) and then, thanks to the geonames database, you can browse by any territorial division you want. Do you want to see the news happening in your state? No problem, browse to its surface and every story geolocated in a location inside your state will be listed. Do you prefer to see only stories for America? That's also possible. Or do you want to list only stories happening in your city? ffloat.it also supports it.
Hierarchical promotion: Well, now we have a nice upcoming queue, where you can browse by topic, by location or by both. But what about the promoted news? That's called the emerged section in ffloat.it and also works hierarchically. Let's suppose you've submitted a story concerning your city and it's now ffloating around. First people seing it will be the ones interested in news happening in your city or in the topics the story belongs to. When users start voting it, and given it gets points enough - let's say x -, it will first emerge in your city (however, it will go on ffloating until it gets 24 hours old). And now what? The story may be important enough for people interested in news happening in your state, so it may get more points - let's say x+y - and then emerge in your state also. And this continues until the story gets to a surface where it doesn't get points enough for emerge or when ir reaches the root surface: the main frontpage. What did I achieve with this system? Well, it doesn't matter if only a small userbase cares for a given story, it will have its place a ffloat.it anyway. Maybe it's not important enough for reaching the main frontpage, but it may be important enough for the frontpage for Europe or even for your city. Everything is relevant. In addition, everything I've said to locations applies also to topics, so your story may emerge in Globe/Technology, but not in Globe.
Multi-Language: I'm not a native English speaker, but I can read English news without a problem (as readers of my blog ,you may have noticed the same is not true for writing). In fact, there are a lot of news I do care for that are only available in English, but I'm also interested in news happening in my country, Spain. So I've decided to launch the site in two initial languages: English and Spanish. The English site will be at ffloat.it, whereas the Spanish one will be at fflotalo.com. Other languages like French, German and Catalan should follow shortly (Catalan flavour will probably appear as soon as next week).
Fair voting: I've been studying how other sites managed to determine when a story should be promoted and I found them to be really far from fair. So, when designing this site, I tried to be as fair as possible, and that's easy when you have a nice recommendation system in place. The more likely you are to like a story, the less your positive vote counts and the more likely you are to dislike a story, the less your negative vote counts (note however that the reverse is not true, if yo are likely to dislike a story, your positive vote doesn't get any bonus points, since this would make gaming the system trivial ). If you rely on your network of 1000 friends for voting on your stories and making them popular, I'm sorry for you, because this won't work at ffloat.it. Their votes toward your stories will start counting less and less, eventually reaching zero.
I could continue writing about more features of ffloat.it, but I think you'll have more fun discovering them when the site opens tomorrow April 15th at 14:00 GMT. Remember there aren't irrelevant stories in ffloat.it, so feel free to submit whatever you like (read the ToS first, some things like porn are not allowed). ffloat.it, in contrast to any other social news site on the Internet, doesn't care about the stories which matter for 90% of the users, it cares about stories important to you.
-
The road to ffloat.it - I feel karmastic
March 22, 2008 at 17:41:02 CETAs some of you may know, I've been working on a social news site for some time. Now it's almost finished, so I'll be posting some articles about things I found interesting while developing it. This article is the first, and I'll be writing about one of the best ideas with the worst implementation: karma.
Latelly, some web 2.0 sites have been using karma not only for giving their users some recognition, but also for giving them more power in decision taking. The bad thing (tm) it's almost every site has done it in a wrong way.
It's not white. Nor black. Nor even gray.
It's in the color of your choice. That's what happens with every decission based in your opinion. You can't be right nor wrong, it's simply your opinion. The problem is some sites insist on measuring how right your opinion is and they find it by comparing your opinion against the majority. The result is either you think like the majority, your opinion gets promoted and your karma increases, giving more weight to your opinion or you think different (tm), your opinion gets silenced, your karma decreases and finally you can't post to the site.
In addition, your karma also influences how much karma the user receiving your votes get or lose. So the system feeds itself, giving an advantage to user groups voting themselves positively.
It's that karma thing, I swear it works!
Sure, it works wonderfully, but not for measuring how right is an opinion. However, there are some things which balance well with a karma system. I'm sure by now you've already guessed it: if it's black or white, karma works.
For example, suppose some news happens in London, but you geotag it in Japan. You're wrong, and there's no other way. However, if you karma is decreased because of that wrong action, you're opinion should weight the same. It's your ability to geotag which should get decreased!
Merito what?
Some sites justify karma saying it makes the site a meritocracy. They say increasing your karma is easy, the more time you spend on the site, the bigger your karma gets. So if you can't spend three hours a day in their site, you don't deserve a nice user experience. Only thing you deserve it's getting silenced.
In addition, their karma rule of thumb is a lie. For increasing your karma, you also need to think like the majority (or make you look like that). But they won't tell you that.
A minority controls the majority
Because of the karma rating system, every site implementing it will end as an information ghetto. Id est, the only promoted articles and comments will all be biased towards one type of opinion. The worst thing is these articles and comments will be published by a minority formed by users with the highest karmas, which increased it by voting themselves.
A living example
There's a well known Spanish site using this karma system: meneame.net. If you spend some time looking at it, you'll have no problem verifying the points exposed in this article.
In the next part, I'll discuss my thoughts in other social news sites voting paradigm: all votes weight equally.