Scalability in new startups

January 26th, 2008 6 comments

Ever since I started building YouAre.com, I knew that scalability was an important problem to solve. Scalability often matters more for your wallet, and for the success of your startup, than you might first think. According to Google, slow performance could cost you 20% of your revenue. If you are starting a new company, keep in mind that any savings on servers can accelerate its growth. These costs include hardware, software, human resources and time (for many people the most valued resource). Apart from the monetary costs, it has been reported that a half-second delay in page load time can kill user satisfaction.

Scalability is a relative problem that depends on many things: the technology used, fault tolerance and the availability of programming staff. Many people think that scalability = performance, but they are wrong, as there are more aspects to consider. For me, scalability means maintaining the balance between resources and the number of users as the size of the problem increases. The size of the problem here is the growth in the number of users and in the resources needed to serve them. A graph representing good scalability could look like the following:

[Figure: Scalability]

We can see how well the growth in the number of users (n) has been handled: the amount of required resources grows only logarithmically.
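To make "grows logarithmically" concrete, here is a small sketch. The function `servers_needed` and its rule (one extra server each time the user count grows tenfold) are hypothetical and chosen only for illustration; they are not measurements from YouAre.com.

```python
def servers_needed(users: int, base_servers: int = 1) -> int:
    """Hypothetical model: add one server per tenfold increase in users."""
    if users < 1:
        return base_servers
    # For a positive integer, the digit count minus one equals floor(log10(users)).
    tenfold_steps = len(str(users)) - 1
    return base_servers + tenfold_steps


if __name__ == "__main__":
    for users in (10, 1_000, 100_000, 10_000_000):
        print(f"{users:>10} users -> {servers_needed(users)} servers")
```

Under this assumed model, a millionfold increase in users (from 10 to 10,000,000) raises the server count from 2 to 8, which is the kind of curve the graph is meant to show.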

Some good points for scalability that should be considered:

  • Good database design: Normalize the database, select a suitable DBMS, consider the users’ needs, …
  • Search engines: Use a search engine for your application. Lucene is a very high-performance text search engine library. You can also consider Nutch or Solr, both based on Lucene but oriented to web applications. If you are looking for a more basic engine, take a look at Sphinx.
  • The Keepalive problem: Enabling Keepalive for images and external files (such as CSS) is very good for clients, but bad for servers. Turning Keepalive off greatly reduces the server’s memory usage. A good solution is to serve images from a separate server, which has the added benefit of higher browser concurrency with multiple hostnames (it lets the browser load images in parallel). At YouAre, we use Amazon Simple Storage Service to store our images.
  • Cache: Cache as much of your dynamic content as possible :) Memcached could be a great option.
  • Take care of your code: Take care of your code and it will take care of you ;)
  • Use GNU/Linux: GNU/Linux uses spare memory to cache files read from disk. This means much faster I/O.
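
As an illustration of the caching advice above, here is a minimal sketch of the usual "check the cache first, otherwise compute and store" pattern. It uses a plain in-memory dictionary as a stand-in for an external cache such as memcached; the class `TTLCache`, the key `"profile:42"` and the function `render_profile` are hypothetical names invented for this example, not part of any real library or of YouAre's code.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory stand-in for an external cache (illustration only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps key -> (expiry time, cached value).
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value, or compute and store it if missing or expired."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the expensive work is skipped
        value = compute()    # cache miss: do the expensive work once
        self._store[key] = (now + self._ttl, value)
        return value


calls = []  # records how many times the page was actually rendered


def render_profile() -> str:
    """Hypothetical expensive page render (e.g. several database queries)."""
    calls.append(1)
    return "<html>profile</html>"


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_compute("profile:42", render_profile)
second = cache.get_or_compute("profile:42", render_profile)  # served from cache
```

The second request never reaches `render_profile`, which is where the savings come from. With a real shared cache the same pattern applies, but the store lives outside the web server process, so all servers benefit from each other's work.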

More information | Rico Mariani
More information | Shiflett
More information | No VC required

6 Responses to Scalability in new startups

  • DraXus

    Great post! Thanks :)

  • cvander

    Very useful information. If possible, a follow-up on the search engine libraries would be appreciated.

    And loved the graphic.

  • Alfonso Jiménez

    DraXus: Thank you

    cvander: This summer I have been testing some search engines (I even wrote a post about it). I recommend Solr because it's powerful (Digg is currently using Solr), but if you are looking for just a basic search engine, then I would recommend Sphinx :) I will write a post about it.

    Regards

  • Otis Gospodnetic

    Uh, uh. Don’t mix Nutch and Solr/Lucene. Nutch is quite a bit different and aimed at a different problem (typically web-wide crawling/fetching + document parsing + indexing + searching). Lucene and Solr know nothing about crawling/fetching nor parsing.

  • Alfonso Jiménez

    I know, I know. I’ve tested both :) I mean that you can use whichever one fits your needs.

    Thanks

  • Julian

    I have been trying Lucene and Sphinx that you mention in this article. Lucene seems good but it takes too many hours to index the data (10m rows).

    Do you know if it is normal that it takes soooo long? What can I do to reduce the indexing time of Lucene?
