Tutorial: Indexing and Query performance

Indexing and Query performance

Loki.js has always been a fast, in-memory database solution. In fact, recent benchmarks indicate that its primary get() operation is about 1.4 million operations per second fast on a mid-range Core i5 running under node.js. The get() operation utilizes an auto generated '$loki' id column with its own auto generated binary index. If you wish to supply your own unique key, you can use add a single unique index to the collection to be used along with the collection.by() method. This method is every bit as fast as using the built in $loki id. So out of the gate if you intend to do single object lookups you get this performance.

Example lookup using autogenerated $loki column :

var users = db.addCollection("users");
var resultObj = users.insert({username:"Heimdallr"});

// now that our object has been inserted, it will have a $loki property added onto it
var heimdallr = users.get(resultObj.$loki);

Example object lookup specifying your own unique index :

var users = db.addCollection("users", {
    unique: ['username']
});

// after inserting records you might retrieve your record using coll.by() 
var result = users.by("username", "Heimdallr");

'Find' filtering

A more versatile way to query is to use collection.find() which accepts a mongo-style query object. If you do not index the column you are searching against, you can expect about 20k ops/sec under node.js (browser performance may vary but this serves as a good order of magnitude). For most purposes that is probably more performance than is needed, but you can now apply loki.js binary indexes on your object properties as well. Using the collection.ensureIndex(propertyName) method, you can create an index which can be used by various find() operations such as collection.find(). For our test benchmark, this increased performance to about 500k ops/sec.

Binary indices are can be used with range ops returning multiple document results for the given property/range. If you have applied a binary index to a property, you can utilize that index by calling collection.find() with a query object containing that property. The find() ops which are able to utilize binary indices include $eq, $aeq, $lt, $lte, $gt, $gte, $between, $in.

By default, if you have a binary index applied on a property and you insert a document containing a javascript Date object as a value for that property, loki will replace it with a serializable-safe epoch time format date (integer). This is to prevent the index from becoming corrupt if we were to save it as a Date and load it as a string (they should be sorted differently). If you do not intend to save your database (using entirely in memory), you may pass a 'serializableIndices:false' collection option during collection creation and we will not alter your dates.

Binary index example :

var coll = db.addCollection('users', {
  indices: ['location']
});

// after inserting records you might use equality or range ops,
// such as this implicit (strict) $eq op :
var results = users.find({ location:  'Himinbjörg' });

Loki Sorting and Ranges

Native javascript provides == (abstract) equality, === (strict) equality, < (abstract) less than, > (abstract) greater than, etc. Javascript deals with alot of mixed types, so you might want a numeric 4 to be (abstractly) equal to string '4'. If you wanted to test for 'less than' 4, it would by default convert to string so if you did not want strings you would have to test using 'typeof' or other type detection to manually filter out other types. Loki would prefer to deal with pure clean data but has had to evolve to support various levels of 'dirty' data which are found in the wild. As such we have tried to adapt our concept of 'ranges' as it pertains to mixed types on properties so that they accommodate mixed types and provide similar find() results whether you are using a binary index or not.

  • All values in loki are interpreted by loki as being 'less than', 'greater than', or 'equal' to any other value for the purposes of our find() op. This is different than javascript, so loki establishes a unified range ordering among two values.
  • 4 is $aeq (abstractly equal to) '4', as is '3' $lt 4, and 4 is $gte '3', so mixing numbers and 'number-like' strings range in loki find as numbers.
  • 'Number-like' strings are kept in different range (with the numbers) than 'non-number-like' strings, so 9999 will be $lt '111asdf"
  • objects are all equal (unless you use dot notation to query an objects properties)
  • Dates are sorted as numbers as their epoch time... those numbers are large so generally at the high end of number range
  • $type op can be used to filter out results which do not match a specific javascript type
  • $finite op can be used to filter out results which are either 'number-like' or 'non-number-like'

'Where' filters

'Where' filters (javascript filter functions) should be used sparingly if performance is of concern. They are unable to utilize indexes, so performance will be no better than an unindexed find, and depending on the complexity of your filter function even less so. Unindex queries and where filters always require a full array scan but they can be useful if thousands of ops/sec are sufficient or if used later in a query chain or dynamic view filter pipeline with less penalty.

Indexing in Query chains

The Resultset class introduced method chaining as an option for querying. You might use this method chaining to apply several find operations in succession or mix find(), where(), and sort() operations into a sequential chained pipe. For simplicity, an example of this might be (where users is a collection object) :

    users.chain().find(queryObj).where(queryFunc).simplesort('name').data();

Examining this statement, if queryObj (a mongo-style query object) were { 'age': { '$gt': 30 } }, then that age column would be best to apply an index on, and that find() chain operation should come first in the chain. In chained operations, only the first chained filter can utilize the indexes for filtering. If it filtered out a sufficient number of records, the impact of the (where) query function will be less. The overhead of maintaining the filtered result set reduces performance by about 20% over collection.find, but they enable much more versatility. In our benchmarks this is still about 400k ops/sec.

Indexes and sorting

When no filters are applied and a binary index exists on (for example) a 'name' property, the following can fully utilize the binary index :

coll.chain().simplesort('name').data();

If filtering has occurred we will detect whether we can leverage the index within an 'index intersect' algorithm to speed up the sorting over a typical loki sort. This 'index intersect' algorithm will only be enabled if your resultset has more than 10% of the total documents within its resultset, otherwise a standard loki sort will be determined to be the faster method. The performance advantages of 'index intersect' are somewhat inversely proportional to your filter quality, so it leverages the binary index to help to reduce 'worst case' sorting penalties.

Loki sorting is used not just for sorting but for building the binary indices, but if you do not need it's more unified sorting across mixed types, you might be able to shave additional milliseconds off your sort call by calling :

coll.chain().simplesort('name', { useJavascriptSorting: true }).data();

Which (if a binary index exists on that 'name' property) we will use the index intersect algorithm unless resultset has 10% or less total document count, at which point we will fallback to javascript sorting on the property passed to simplesort. If you do not have a binary index applied on that property, we would always use javascript sorting if that option were passed.

Dynamic View pipelines

Dynamic Views behave similarly to resultsets in that you want to utilize an index, your first filter must be applied using

  var userview = users.addDynamicView("over30");
  userview.applyFind({'Age': {'$gte':30}});

  // at any time later you can grab the latest view results
  var results = userview.data();

  // or branch the results for further filtering
  results = userview.branchResultset().find({'Country': 'JP'}).data();

That find filter should ideally refer to a field which you have applied an index to ('Age' in this case). Dynamic Views run their filters once however, so even non performant query pipelines are fast after they are set up. This is due to re-evaluation of those filters on single objects as they are inserted, updated, or deleted from the collection. Being single object evaluations there is no array scan penalty which occurs during the first evaluation. The overhead of dynamic views, which ride on top of the resultset, reduces performance of the first evaluation by about 40%, however subsequent queries are highly optimized (faster than collection.find). Even with that overhead, our benchmarks show roughly 300k ops/sec performance on initial evaluation. Depending on update frequency, subsequent evaluations can scale up to over 1 million ops/sec.

In loki.js, Dynamic Views have currently have two options, 'persistent' (default is false) and 'sortPriority' (default is 'passive').

The 'persistent' option indicates that results will be kept in an internal array (in addition to normal resultset). This 'resultdata' array is filtered and sorted according to your specifications. This copying of results into the internal array occurs during the first data() evaluation or once filters or sorts are dirty (documents inserted, updated, removed from view). This options adds memory overhead, but possibly optimizes data() calls.

The 'sortPriority' option can be either 'passive' or 'active'. By default sorting occurs lazily ('passive') when data() is called and the sorts are flagged as dirty. If you wish the sorting cost to be 'up-front', you can specify 'active' sortPriority. With active sortPriority, once an insert/update/delete flags a sort as dirty, we will queue and throttle an async sort to run when the thread yields. So with lower update frequency, or isolated batched modifications, you can pay the performance cost up-front to ensure optimal data() retrieval speed later. If your data modifications are frequent and sporadic, an active sortPriority might waste computation sorting if no one ever reads the data.