Improving performance by avoiding LIMIT in large data sets
Whether you are using MySQL, MongoDB or another data store, you will have some kind of paging mechanism to scroll through your data set.
This mostly boils down to:
- Order the data set in some way;
- Skip the first X records (the offset);
- Select the next Y records (the limit).
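The steps above can be sketched as follows, using SQLite and a hypothetical items table (table and column names are illustrative, not from the original post):

```python
import sqlite3

# In-memory sample data set (hypothetical `items` table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [(f"item-{i}",) for i in range(1, 1001)])

def fetch_page(page, page_size=100):
    offset = (page - 1) * page_size
    # Order the set, skip the offset, select the next `page_size` records.
    return conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()

print(fetch_page(2)[0])  # first record of page 2: (101, 'item-101')
```

Note that the database cannot jump straight to the offset: it still has to produce and discard every skipped row.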
It looks as if only a slice of the data set is returned, but on the server side the database still has to walk through all the skipped records.
This works well for the beginning of the data set, but performance drops quickly once you skip 100,000 rows to select the next 100 records. For paginating search results this is acceptable, because most users never get past page 6, but for iterating over a complete data set it is disastrous.
Instead of using paging, you can implement some kind of scrolling or cursoring.
If you look at Twitter, they use a max_id parameter to enable this, and Elasticsearch has a scan and scroll mechanism for the same purpose.
Now you have two options to implement this for your local database:
- Completely do the selection with a WHERE clause;
- Or combine a WHERE clause with a LIMIT;
WHERE clause only
The easiest way to scroll through your data set is to fetch MAX(id) once, and then request fixed id ranges until you reach that max id.
With this approach you can only stop fetching once you reach the max id, because some of the ranges may return fewer records, or none at all, when ids are missing.
It does not need an ORDER clause, because the id range itself defines which slice you get back.
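A minimal sketch of this range-based scrolling, again using SQLite and an illustrative items table; gaps are deliberately left in the ids to show that some ranges come back short or empty:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
# Skip every 7th id so some ranges yield fewer (or no) rows.
conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1, 1001) if i % 7 != 0])

def scroll_by_range(page_size=100):
    """Walk the table in fixed id ranges until MAX(id) is reached."""
    max_id = conn.execute("SELECT MAX(id) FROM items").fetchone()[0]
    low = 0
    while low < max_id:
        # No ORDER BY needed: the id range itself defines the slice.
        rows = conn.execute(
            "SELECT id, name FROM items WHERE id > ? AND id <= ?",
            (low, low + page_size),
        ).fetchall()
        yield rows  # may be empty if every id in the range is missing
        low += page_size

total = sum(len(rows) for rows in scroll_by_range())
print(total)  # 858 rows in total (1000 minus the 142 skipped ids)
```

Because the ranges are fixed, you cannot stop early on a short page; only reaching MAX(id) tells you that you are done.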
WHERE clause combined with LIMIT
This approach is better suited to pagination on a website, because every page except the last one is full; you cannot have empty pages.
The drawback here is that you have to sort the data set, but this is still much quicker than the OFFSET,LIMIT approach.
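This combined approach (often called keyset pagination) can be sketched like so, reusing the same illustrative SQLite table with gaps in the ids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
# Skip every 7th id, leaving 858 rows with gaps.
conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1, 1001) if i % 7 != 0])

def scroll_keyset(page_size=100):
    """Yield full pages: remember the last seen id and continue after it."""
    last_id = 0
    while True:
        # Sorted on id, but no OFFSET: the WHERE clause does the skipping.
        rows = conn.execute(
            "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            break
        yield rows  # every page except the last one is full
        last_id = rows[-1][0]

pages = list(scroll_keyset())
print(len(pages), len(pages[0]))  # 9 pages, the first one full with 100 rows
```

With an index on id, each page is a cheap range scan that starts right after the last seen id, no matter how deep into the data set you are.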