Caching
HarperDB has integrated support for caching data. With built-in caching capabilities and distributed, high-performance, low-latency responsiveness, HarperDB makes an ideal data caching server. HarperDB can store cached data as queryable structured data, so data can easily be consumed in one format (for example, JSON or CSV) and provided to end users in different formats with different selected properties (for example, MessagePack with a subset of selected properties), or even with customized querying capabilities. HarperDB also manages and provides timestamps/tags for proper caching control, facilitating further downstream caching. With these combined capabilities, HarperDB is an extremely fast, interoperable, flexible, and customizable caching server.
Configuring Caching
To set up caching, you will first need to define a table to be used as your cache (to store the cached data). You can review the introduction to building applications for more information on setting up an application (and the defining schemas documentation), but once you have defined an application folder with a schema, you can add a table for caching to your `schema.graphql`:
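For example, a caching table could be declared like this (a sketch: the table name is illustrative, and this assumes the `expiration` directive property is given in seconds, so 3600 is the one-hour expiration referenced later in this section):

```graphql
# A table to hold cached entries; expiration is the time-to-live (one hour here)
type MyCache @table(expiration: 3600) @export {
	id: ID @primaryKey
}
```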
You may also note that we can define a time-to-live (TTL) expiration on the table, indicating when table records/entries should expire. This is generally necessary for "passive" caches, where there is no active notification of when entries expire. It is not needed if you provide a means of notification when data is invalidated or changed.
While you can provide a single expiration time, there are actually several expiration timings that are potentially relevant, and they can be independently configured. These settings are available as directive properties on the table configuration (like `expiration` above):

- stale expiration: the point when a request for a record should trigger a request to the origin (but might possibly return the current stale record, depending on policy).
- must-revalidate expiration: the point when a request for a record must make a request to the origin first and return the latest value from the origin.
- eviction expiration: the point when a record is actually removed from the caching table.
You can provide a single expiration that defines the behavior for all three, or you can provide the three settings separately, through table directives:

- `expiration` - the amount of time until a record goes stale.
- `eviction` - the amount of time after expiration before a record can be evicted (defaults to zero).
- `scanInterval` - the interval for scanning for expired records (defaults to one quarter of the total of expiration and eviction).
Define External Data Source
Next, you need to define the source for your cache. External data sources could be HTTP APIs, other databases, microservices, or any other source of data. A source can be defined as a resource class in your application's `resources.js` module, extending the `Resource` class (which is available as a global variable in the HarperDB environment). The first method to implement is `get()`, which defines how to retrieve the source data. For example, if we were caching an external HTTP API, we might define it like this:
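As a minimal sketch (the API URL is hypothetical; in the HarperDB environment `Resource` is a global, and the one-line fallback stub below only exists so the snippet runs standalone):

```javascript
// resources.js — in HarperDB, Resource is a global; the stub is only for standalone runs
const { Resource = class { getId() { return this.id; } } } = globalThis;

class ThirdPartyAPI extends Resource {
	async get() {
		// Retrieve the requested record from the (hypothetical) origin API
		const response = await fetch(`https://some-api.com/records/${this.getId()}`);
		if (!response.ok) throw new Error(`Origin request failed: ${response.status}`);
		return response.json();
	}
}
```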
Next, we define this external data resource as the "source" for the caching table we defined above:
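The registration could look like this (a sketch, assuming a `MyCache` caching table and a `ThirdPartyAPI` resource class that implements `get()` against the external API; `tables` and `Resource` are HarperDB globals, stubbed here so the snippet runs standalone):

```javascript
// resources.js — register the external resource as the source for the caching table
const { Resource = class {}, tables = { MyCache: class { static sourcedFrom() {} } } } = globalThis;

class ThirdPartyAPI extends Resource {
	async get() {
		// retrieve the record from the external API here
	}
}

// Missing or expired MyCache entries will now be loaded through ThirdPartyAPI
tables.MyCache.sourcedFrom(ThirdPartyAPI);
```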
Now we have a fully configured and connected cache. If you access data from `MyCache` (for example, through the REST API at `/MyCache/some-id`), HarperDB will check whether the requested entry is in the table and return it if it is available (and hasn't expired). If there is no entry, or it has expired (it is older than one hour in this case), HarperDB will go to the source, calling the `get()` method, which retrieves the requested entry. Once the entry is retrieved, it is saved/cached in the caching table (for one hour, based on our expiration time).
HarperDB handles waiting for an existing cache resolution to finish and uses its result. This prevents a "cache stampede" when entries expire, ensuring that multiple requests to a cache entry will all wait on a single request to the data source.
Cache tables with an expiration are periodically pruned for expired entries. Because this is done periodically, there is usually some amount of time between when a record has expired and when the record is actually evicted (the cached data is removed). But when a record is checked for availability, the expiration time is used to determine if the record is fresh (and the cache entry can be used).
Eviction with Indexing
Eviction is the removal of a locally cached copy of data, but it does not imply the deletion of the actual data from the canonical or origin data source. Because evicted records still exist (just not in the local cache), if a caching table uses expiration (and eviction), and has indexing on certain attributes, the data is not removed from the indexes. The indexes that reference the evicted record are preserved, along with the attribute data necessary to maintain these indexes. Therefore eviction means the removal of non-indexed data (in this case evictions are stored as "partial" records). Eviction only removes the data that can be safely removed from a cache without affecting the integrity or behavior of the indexes. If a search query is performed that matches this evicted record, the record will be requested on-demand to fulfill the search query.
Specifying a Timestamp
In the example above, we simply retrieved data to fulfill a cache request. We may want to supply the timestamp of the record we are fulfilling as well. This can be set on the context for the request:
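For example, if the origin exposes a `Last-Modified` header, the handler can record it on the context. This is a sketch: the `lastModified` context property name should be verified against the current HarperDB documentation, and the URL is hypothetical.

```javascript
const { Resource = class { getId() {} getContext() { return (this._ctx ??= {}); } } } = globalThis;

class ThirdPartyAPI extends Resource {
	async get() {
		const response = await fetch(`https://some-api.com/records/${this.getId()}`);
		// Supply the origin's timestamp for this record on the request context
		const lastModified = response.headers.get('Last-Modified');
		if (lastModified) this.getContext().lastModified = Date.parse(lastModified);
		return response.json();
	}
}
```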
Specifying an Expiration
In addition, we can also specify when a cached record "expires". When a cached record expires, this means that a request for that record will trigger a request to the data source again. This does not necessarily mean that the cached record has been evicted (removed), although expired records will be periodically evicted. If the cached record still exists, the data source can revalidate it and return it. For example:
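A sketch of a per-record expiration, assuming the context accepts an `expiresAt` property in milliseconds since the epoch (verify the exact property name against the current HarperDB documentation; the URL is hypothetical):

```javascript
const { Resource = class { getId() {} getContext() { return (this._ctx ??= {}); } } } = globalThis;

class ThirdPartyAPI extends Resource {
	async get() {
		const response = await fetch(`https://some-api.com/records/${this.getId()}`);
		// Expire this particular entry 10 minutes from now, overriding the table default
		this.getContext().expiresAt = Date.now() + 10 * 60 * 1000;
		return response.json();
	}
}
```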
Active Caching and Invalidation
The cache we have created above is a "passive" cache; it only pulls data from the data source as needed, and has no knowledge of if and when data from the data source has actually changed, so it must rely on timer-based expiration to periodically retrieve possibly updated data. This means that it is possible that the cache may have stale data for a while (if the underlying data has changed, but the cached data hasn't expired), and the cache may have to refresh more than necessary if the data source data hasn't changed. Consequently it can be significantly more effective to implement an "active" cache, in which the data source is monitored and notifies the cache when any data changes. This ensures that when data changes, the cache can immediately load the updated data, and unchanged data can remain cached much longer (or indefinitely).
Invalidate
One way to provide more active caching is to specifically invalidate individual records. Invalidation is useful when you know the source data has changed and the cache needs to re-retrieve data from the source the next time that record is accessed. This can be done by executing the `invalidate()` method on a resource. For example, you could extend a table (in your `resources.js`) and provide a custom POST handler that performs invalidation:
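A sketch of such a handler, assuming a `MyCache` caching table (`tables` is a HarperDB global, stubbed here so the snippet runs standalone):

```javascript
// resources.js — extend the caching table and invalidate the record on POST
const { tables = { MyCache: class { invalidate() {} } } } = globalThis;

class MyCache extends tables.MyCache {
	async post() {
		// Force the next read of this record to go back to the data source
		this.invalidate();
	}
}
// In a real resources.js, you would also `export` this class to expose the endpoint.
```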
(Note that if you are now exporting this endpoint through `resources.js`, you don't necessarily need to directly export the table separately in your `schema.graphql`.)
Subscriptions
We can provide more control of an active cache with subscriptions. If there is a way to receive notifications of data changes from the external data source, we can implement the data source as an "active" data source for our cache by implementing a `subscribe` method. A `subscribe` method should return an asynchronous iterable that iterates and returns events indicating the updates. One straightforward way of creating an asynchronous iterable is to define the `subscribe` method as an asynchronous generator. If we had an endpoint that we could poll for changes, we could implement it like this:
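A sketch of a polling subscription, assuming a hypothetical changes endpoint and response shape (`Resource` is a HarperDB global, stubbed here so the snippet runs standalone):

```javascript
const { Resource = class {} } = globalThis;

class ThirdPartyAPI extends Resource {
	async *subscribe() {
		while (true) {
			// Poll the (hypothetical) changes feed of the origin API
			const response = await fetch('https://some-api.com/changes');
			for (const change of await response.json()) {
				// Each yielded event notifies the cache of an update
				yield { type: 'put', id: change.id, value: change.data, timestamp: change.time };
			}
			await new Promise((resolve) => setTimeout(resolve, 1000)); // wait before the next poll
		}
	}
}
```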
Notification events should always include an `id` to indicate the primary key of the updated record. The event should have a `value` for the `put` and `message` event types. The `timestamp` is optional and can be used to indicate the exact timestamp of the change. The following event `type`s are supported:
- `put` - Indicates that the record has been updated, and provides the new value of the record.
- `invalidate` - Alternately, you can notify with an event type of `invalidate` to indicate that the data has changed, but without the overhead of actually sending the data (the `value` property is not needed), so the data only needs to be sent if and when it is requested through the cache. An `invalidate` will evict the entry and update the timestamp to indicate that there is new data that should be requested (if needed).
- `delete` - Indicates that the record has been deleted.
- `message` - Indicates a message is being passed through the record. The record value has not changed, but this is used for publish/subscribe messaging.
- `transaction` - Indicates that there are multiple writes that should be treated as a single atomic transaction. These writes should be included as an array of data notification events in the `writes` property.
And the following properties can be defined on event objects:

- `type`: The event type, as described above.
- `id`: The primary key of the record that was updated.
- `value`: The new value of the record that was updated (for `put` and `message`).
- `writes`: An array of event objects that are part of a transaction (used in conjunction with the `transaction` event type).
- `table`: The name of the table with the record that was updated. This can be used with events within a transaction to specify events across multiple tables.
- `timestamp`: The timestamp of when the data change occurred.
With an active external data source that has a `subscribe` method, the data source will proactively notify the cache, ensuring a fresh and efficient active cache. Note that with an active data source, we still use the `sourcedFrom` method to register the source for a caching table, and the table will automatically detect and call the `subscribe` method on the data source.
By default, HarperDB will only run the `subscribe` method on one thread. HarperDB is multi-threaded and normally runs many concurrent worker threads, but running a subscription on multiple threads can introduce overlapping notifications and race conditions, so running a subscription on a single thread is typically preferable. However, if you want to enable subscriptions on multiple threads, you can define a static `subscribeOnThisThread` method to specify whether the subscription should run on the current thread:
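A sketch (the thread-index argument is an assumption; check the HarperDB docs for the exact signature):

```javascript
const { Resource = class {} } = globalThis;

class ThirdPartyAPI extends Resource {
	// Called per worker thread; return true to run the subscription on that thread
	static subscribeOnThisThread(threadIndex) {
		return threadIndex < 2; // run the subscription on the first two threads only
	}
	async *subscribe() {
		// poll or listen for changes and yield data notification events
	}
}
```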
An alternative to using asynchronous generators is to use a subscription stream and send events to it. A default subscription stream (that doesn't generate its own events) is available from the Resource's default subscribe method:
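One possible shape for this, hedged: this sketch assumes the default `subscribe` returns a stream object with a `send` method, and the change-feed helper is hypothetical; verify both against the current HarperDB documentation.

```javascript
const { Resource = class { subscribe() { return { send() {} }; } } } = globalThis;

// Hypothetical helper that registers a listener with the origin's change feed
function onExternalChange(listener) { /* wire up the external feed here */ }

class ThirdPartyAPI extends Resource {
	subscribe(options) {
		// The base class provides a default stream that we can push events into
		const subscription = super.subscribe(options);
		onExternalChange((change) => {
			subscription.send({ type: 'put', id: change.id, value: change.data });
		});
		return subscription;
	}
}
```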
Downstream Caching
It is highly recommended that you use the REST interface for accessing caching tables, as it facilitates downstream caching for clients. Timestamps are recorded with all cached entries and are used for incoming REST requests to specify the `ETag` in the response. Clients can cache data themselves and send requests using the `If-None-Match` header to conditionally get a 304 response and preserve their cached data, based on the timestamp/`ETag` of the entries that are cached in HarperDB. Caching tables also have subscription capabilities, which means that downstream caches can be fully "layered" on top of HarperDB, as either passive or active caches.
Write-Through Caching
The cache we have defined so far only has data flowing from the data source to the cache. However, you may wish to support write methods, so that writes to the cache table can flow through to the underlying canonical data source as well as populate the cache. This can be accomplished by implementing the standard write methods, like `put` and `delete`. If you were using an API with standard RESTful methods, you could pass writes through to the data source like this:
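A write-through sketch (the origin URL is hypothetical; `Resource` is a HarperDB global, stubbed here so the snippet runs standalone):

```javascript
const { Resource = class { getId() { return this.id; } } } = globalThis;

class ThirdPartyAPI extends Resource {
	async get() {
		// retrieve the record from the origin API, as in the passive cache setup
	}
	async put(data) {
		// Pass the write through to the origin; the cache stores the new value too
		await fetch(`https://some-api.com/records/${this.getId()}`, {
			method: 'PUT',
			headers: { 'Content-Type': 'application/json' },
			body: JSON.stringify(data),
		});
	}
	async delete() {
		await fetch(`https://some-api.com/records/${this.getId()}`, { method: 'DELETE' });
	}
}
```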
When doing an insert or update to the `MyCache` table, the data will be sent to the underlying data source through the `put` method, and the new record value will be stored in the cache as well.
Loading from Source in Methods
When you are using a caching table, it is important to remember that resource methods other than `get()` will not automatically load data from the source. If you have defined a `put()`, `post()`, or `delete()` method and you need the source data, you can ensure it is loaded by calling the `ensureLoaded()` method. For example, if you want to modify the existing record from the source, adding a property to it:
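A sketch, assuming a `MyCache` caching table and a hypothetical property being added (`tables` is a HarperDB global, stubbed here so the snippet runs standalone):

```javascript
// resources.js
const { tables = { MyCache: class { async ensureLoaded() {} } } } = globalThis;

class MyCache extends tables.MyCache {
	async post(data) {
		// Methods other than get() do not load source data automatically, so load it first
		await this.ensureLoaded();
		// With the record loaded, we can read and modify its properties
		this.lastComment = data.comment; // hypothetical property
	}
}
```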
Subscribing to Caching Tables
You can subscribe to a caching table just like any other table. The one difference is that normal tables do not usually have `invalidate` events, but an active caching table may have them. Again, this event type gives listeners an opportunity to choose whether or not to actually retrieve the value that changed.
Caching with Replication
Caching tables can be configured to replicate in HarperDB clusters. When replicating caching tables, there are a couple of options. If each node will be separately connected to the data source and you do not need the subscription data notification events to replicate, you can set `replicationSource` to `false`. In this case, only data requests (those that come through standard requests like the REST interface or operations API) will be replicated. However, if data notifications will only be delivered to a single node (at a time) and you need the subscription data notification events to replicate, you can set `replicationSource` to `true`, and the incoming events from the subscription will be replicated to all other nodes:
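The registration might look like this; note that passing `replicationSource` as an option to `sourcedFrom` is an assumption in this sketch, so verify the exact configuration form against the current HarperDB documentation:

```javascript
const { Resource = class {}, tables = { MyCache: class { static sourcedFrom() {} } } } = globalThis;

class ThirdPartyAPI extends Resource {
	async *subscribe() { /* yield data notification events from the origin */ }
}

// Replicate the incoming subscription events to the other nodes in the cluster
tables.MyCache.sourcedFrom(ThirdPartyAPI, { replicationSource: true });
```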
Passive-Active Updates
With our passive update examples, we have provided a data source handler with a `get()` method that returns the specific requested record as the response. However, we can also actively update other records in our response handler (if our data source provides data that should be propagated to other related records). This can be done transactionally, to ensure that all updates occur atomically. The context that is provided to the data source holds the transaction information, so we can simply pass the context to any update/write methods that we call. For example, let's say we are loading a blog post that also includes comment records:
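A sketch: the blog API URL, response shape, and `Comment` table are illustrative, and the two-argument `put(record, context)` form is an assumption to verify against the HarperDB docs (`Resource` and `tables` are HarperDB globals, stubbed here so the snippet runs standalone):

```javascript
const {
	Resource = class { getId() {} getContext() { return {}; } },
	tables = { Comment: class { static async put() {} } },
} = globalThis;

class BlogAPI extends Resource {
	async get() {
		const response = await fetch(`https://blog-api.com/posts/${this.getId()}`);
		const post = await response.json();
		const context = this.getContext(); // holds the current transaction
		for (const comment of post.comments ?? []) {
			// Passing the context makes these writes part of the same atomic transaction
			await tables.Comment.put(comment, context);
		}
		return post;
	}
}
```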
Here both the update to the post and the update to the comments will be atomically/transactionally committed together with the same timestamp.
Cache-Control header
When interacting with cached data, you can also use the `Cache-Control` request header to specify certain caching behaviors. When performing a PUT (or POST) request, you can use the `max-age` directive to indicate how long the resource should be cached (until it goes stale):
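A client-side sketch (the host, port, record id, and body are illustrative):

```javascript
// Ask the cache to treat this value as fresh for one day (86,400 seconds)
async function putWithMaxAge() {
	return fetch('http://localhost:9926/MyCache/some-id', {
		method: 'PUT',
		headers: {
			'Content-Type': 'application/json',
			'Cache-Control': 'max-age=86400',
		},
		body: JSON.stringify({ name: 'some data' }),
	});
}
```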
You can use the `only-if-cached` directive on GET requests to return a resource only if it is cached (otherwise a 504 is returned). Note that if the entry is not cached, this will still trigger a request for the source data from the data source. If you do not want source data retrieved, you can add the `no-store` directive. You can also use the `no-cache` directive if you do not want to use the cached resource. For example, to check whether there is a cached resource without triggering a request to the data source:
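A client-side sketch (the host, port, and record id are illustrative):

```javascript
// only-if-cached returns a 504 when the entry is not cached, and no-store
// prevents the cache from fetching it from the data source
async function getOnlyIfCached() {
	const response = await fetch('http://localhost:9926/MyCache/some-id', {
		headers: { 'Cache-Control': 'only-if-cached, no-store' },
	});
	return response.status === 504 ? undefined : response.json();
}
```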
You may also use the `stale-if-error` directive to indicate that it is acceptable to return a stale cached resource when the data source returns an error (a network connection error, or a 500, 502, 503, or 504). The `must-revalidate` directive indicates that a stale cached resource cannot be returned, even when the data source has an error (by default, a stale cached resource is returned when there is a network connection error).