To create a cluster, you must have two or more nodes* (aka instances) of HarperDB running.

*A node is a single instance/installation of HarperDB. A node of HarperDB can operate independently with clustering on or off.
On the following pages we'll walk you through the steps required, in order, to set up a HarperDB cluster.
HarperDB clustering is the process of connecting multiple HarperDB databases together to create a database mesh network that enables users to define data replication patterns.
HarperDB’s clustering engine replicates data between instances of HarperDB using a highly performant, bi-directional pub/sub model on a per-table basis. Data replicates asynchronously with eventual consistency across the cluster, following the defined pub/sub configuration. Individual transactions are sent in the order in which they were transacted; once received by the destination instance, they are processed in an ACID-compliant manner. Conflict resolution follows a last-writer-wins model, comparing the timestamp recorded on the incoming transaction with the timestamp on the existing record on the receiving node.
A common use case is an edge application collecting and analyzing sensor data that creates an alert if a sensor value exceeds a given threshold:

- For security purposes, the edge application should not make outbound HTTP requests.
- There may not be a reliable network connection.
- Not all sensor data will be sent to the cloud, either because of the unreliable network connection or because storing all of it there is unnecessary.
- The edge node should be inaccessible from outside the firewall.
- The edge node will send alerts to the cloud with a snippet of sensor data containing the offending sensor readings.
HarperDB simplifies the architecture of such an application with its bi-directional, table-level replication:

- The edge instance subscribes to a “thresholds” table on the cloud instance, so the application only makes localhost calls to get the thresholds.
- The application continually pushes sensor data into a “sensor_data” table via the localhost API, comparing it to the threshold values as it does so.
- When a threshold violation occurs, the application adds a record to the “alerts” table.
- The application appends to that alert record the “sensor_data” entries for the 60 seconds (or minutes, or days) leading up to the threshold violation.
- The edge instance publishes the “alerts” table up to the cloud instance.
By letting HarperDB focus on the fault-tolerant logistics of transporting your data, you get to write less code. By moving data only when and where it’s needed, you lower storage and bandwidth costs. And by restricting your app to only making local calls to HarperDB, you reduce the overall exposure of your application to outside forces.
Clustering does not run by default; it needs to be enabled. To enable clustering, the `clustering.enabled` configuration element in the `harperdb-config.yaml` file must be set to `true`.

There are multiple ways to update this element:

- Directly editing the `harperdb-config.yaml` file and setting `enabled` to `true`. Note: When making any changes to the `harperdb-config.yaml` file, HarperDB must be restarted for the changes to take effect.
- Calling `set_configuration` through the operations API. Note: When making any changes to HarperDB configuration, HarperDB must be restarted for the changes to take effect.
- Using command line variables.
- Using environment variables.
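For example, enabled directly in `harperdb-config.yaml`:

```yaml
clustering:
  enabled: true
```

Or via the operations API; a sketch of a `set_configuration` call, assuming the flattened `clustering_enabled` parameter name:

```json
{
  "operation": "set_configuration",
  "clustering_enabled": true
}
```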
An efficient way to install HarperDB, create the cluster user, set the node name and enable clustering in one operation is to combine the steps using command line and/or environment variables. Here is an example using command line variables.
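A sketch of such a command; the uppercase variable names mirror the configuration elements they set and, along with the credential values, are assumptions here:

```bash
# Install HarperDB, create the cluster user, set the node name,
# and enable clustering in a single command (values are placeholders)
harperdb install \
  --TC_AGREEMENT yes \
  --HDB_ADMIN_USERNAME HDB_ADMIN \
  --HDB_ADMIN_PASSWORD password \
  --CLUSTERING_ENABLED true \
  --CLUSTERING_USER cluster_user \
  --CLUSTERING_PASSWORD password \
  --CLUSTERING_NODENAME Node1
```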
Additional information that will help you define your clustering topology.
Transactions that are replicated across the cluster are:

- Insert
- Update
- Upsert
- Delete
- Bulk loads
  - CSV data load
  - CSV file load
  - CSV URL load
  - Import from S3
When adding or updating a node, any schemas and tables in the subscription that don’t exist on the remote node will be automatically created.

Destructive schema operations do not replicate across a cluster. Those operations include `drop_schema`, `drop_table`, and `drop_attribute`. If the desired outcome is to drop schema information from any nodes, the operation(s) will need to be run on each node independently.
Users and roles are not replicated across the cluster.
HarperDB has built-in resiliency for when network connectivity is lost within a subscription. When connections are reestablished, a catchup routine is executed to ensure data that was missed, specific to the subscription, is sent/received as defined.
HarperDB clustering creates a mesh network between nodes, giving end users the ability to define virtually any topology. Subscription topologies can be as simple or as complex as needed.
A subscription defines how data should move between two nodes. Subscriptions are exclusively table level and operate independently of one another. A subscription connects a table on one node to a table on another node; it applies to the matching schema name and table name on both nodes.
Note: ‘local’ and ‘remote’ are referred to often in these docs. ‘Local’ is the node that receives the API request to create or update a subscription; ‘remote’ is the other node named in that request, the node on the other end of the subscription.
A subscription consists of:

- `schema` - the name of the schema that the table you are creating the subscription for belongs to.
- `table` - the name of the table the subscription will apply to.
- `publish` - a boolean which determines if transactions on the local table should be replicated on the remote table.
- `subscribe` - a boolean which determines if transactions on the remote table should be replicated on the local table.
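Put together, a subscription is a small JSON object like the following (the `dev` schema and `dog` table names are illustrative):

```json
{
  "schema": "dev",
  "table": "dog",
  "publish": false,
  "subscribe": true
}
```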
This diagram is an example of a `publish` subscription from the perspective of Node1. The record with id 2 has been inserted in the dog table on Node1; after that insert completes, it is sent to Node2 and inserted in the dog table there.
This diagram is an example of a `subscribe` subscription from the perspective of Node1. The record with id 3 has been inserted in the dog table on Node2; after that insert completes, it is sent to Node1 and inserted there.
This diagram shows both `subscribe` and `publish` defined, with `publish` set to false. Because `subscribe` is true, the insert on Node2 is replicated on Node1; because `publish` is false, the insert on Node1 is not replicated on Node2.

This diagram shows both `subscribe` and `publish` set to true. The insert on Node1 is replicated on Node2, and the update on Node2 is replicated on Node1.
Node name is the name given to a node. It is how nodes are identified within the cluster and must be unique to the cluster.
The name cannot contain any of the following characters: dot (`.`), comma (`,`), asterisk (`*`), greater than (`>`), or whitespace.
The name is set in the `harperdb-config.yaml` file using the `clustering.nodeName` configuration element.
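For example (`Node1` is a placeholder):

```yaml
clustering:
  nodeName: Node1
```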
Note: If you want to change the node name, make sure there are no subscriptions in place before doing so. After the name has been changed, a full restart is required.
There are multiple ways to update this element:

- Directly editing the `harperdb-config.yaml` file. Note: When making any changes to the `harperdb-config.yaml` file, HarperDB must be restarted for the changes to take effect.
- Calling `set_configuration` through the operations API.
- Using command line variables.
- Using environment variables.
A route is a connection between two nodes. It is how the clustering network is established.
Routes do not need to cross-connect all nodes in the cluster. You can select one leader node (or a few) that all other nodes connect to, you can chain nodes together, and so on. As long as there is one route connecting a node to the cluster, all other nodes should be able to reach that node.

Using routes, the clustering servers will create a mesh network between nodes. This mesh network ensures that if a node drops out, all other nodes can still communicate with each other. That said, we recommend designing your routing with failover in mind: rather than storing all your routes on one node, disperse them throughout the network.

A simple example is a two-node topology: if Node1 adds a route connecting it to Node2, Node2 does not need to add a route back to Node1. That one route configuration is all that’s needed to establish a bidirectional connection between the nodes.
A route consists of a `port` and a `host`:

- `port` - the clustering port of the remote instance you are creating the connection with. This will be the `clustering.hubServer.cluster.network.port` in the HarperDB configuration on the node you are connecting with.
- `host` - the host of the remote instance you are creating the connection with. This can be an IP address or a URL.
Routes are set in the `harperdb-config.yaml` file using the `clustering.hubServer.cluster.network.routes` element, which expects an object array, where each object has two properties, `port` and `host`.
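For example, matching the three-node layout described below, Node1’s config might contain (the host and port values are placeholders):

```yaml
clustering:
  hubServer:
    cluster:
      network:
        routes:
          - host: 3.22.181.22      # route to Node2
            port: 9932
          - host: node3.example.com # route to Node3
            port: 9932
```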
This diagram shows one way of using routes to connect a network of nodes. Node2 and Node3 do not reference any routes in their config. Node1 contains routes for Node2 and Node3, which is enough to establish a network between all three nodes.
There are multiple ways to set routes:

- Directly editing the `harperdb-config.yaml` file (refer to the code snippet above).
- Calling `cluster_set_routes` through the API. Note: When making any changes to HarperDB configuration, HarperDB must be restarted for the changes to take effect.
- From the command line.
- Using environment variables.
The API also has `cluster_get_routes` for getting all routes in the config and `cluster_delete_routes` for deleting routes.
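A sketch of a `cluster_set_routes` call; the `server` target value `hub` and the route values are assumptions here:

```json
{
  "operation": "cluster_set_routes",
  "server": "hub",
  "routes": [
    { "host": "3.22.181.22", "port": 9932 }
  ]
}
```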
Inter-node authentication takes place via HarperDB users. There is a special role type called `cluster_user` that exists by default and limits the user to only clustering functionality.

A `cluster_user` must be created and added to the `harperdb-config.yaml` file for clustering to be enabled. All nodes that are intended to be clustered together need to share the same `cluster_user` credentials (i.e. username and password).
There are multiple ways a `cluster_user` can be created:

- Through the operations API by calling `add_user`. When using the API to create a cluster user, the `harperdb-config.yaml` file must be updated with the username of the new cluster user. This can be done through the API by calling `set_configuration` or by editing the `harperdb-config.yaml` file: under the top-level `clustering` element there is a `user` element; set it to the name of the cluster user. Note: When making any changes to the `harperdb-config.yaml` file, HarperDB must be restarted for the changes to take effect. (See the sketch after this list.)
- Upon installation, using command line variables. This will automatically set the user in the `harperdb-config.yaml` file.
- Upon installation, using environment variables. This will automatically set the user in the `harperdb-config.yaml` file.

Note: Using command line or environment variables for setting the cluster user only works on install.
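A sketch of both API steps; the username and password are placeholders, and the example assumes the built-in `cluster_user` role can be referenced by name in `add_user`:

```json
{
  "operation": "add_user",
  "role": "cluster_user",
  "username": "cluster_account",
  "password": "letsCluster123!",
  "active": true
}
```

Then point the config at that user in `harperdb-config.yaml`:

```yaml
clustering:
  user: cluster_account
```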
Subscriptions can be added, updated, or removed through the API.
Note: The schema and tables in the subscription must exist on either the local or the remote node. Any schema and tables that do not exist on one of the nodes (for example, the local node) will be automatically created on that node.
To add a single node and create one or more subscriptions, use `add_node`.
This is an example of adding Node2 to your local node. Subscriptions are created for two tables, dog and chicken.
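A sketch of that call; the `dev` schema name and the publish/subscribe choices are assumptions here:

```json
{
  "operation": "add_node",
  "node_name": "Node2",
  "subscriptions": [
    { "schema": "dev", "table": "dog", "publish": true, "subscribe": true },
    { "schema": "dev", "table": "chicken", "publish": false, "subscribe": true }
  ]
}
```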
To update one or more subscriptions with a single node, use `update_node`.
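A sketch of such a call, with the same assumed `dev` schema:

```json
{
  "operation": "update_node",
  "node_name": "Node2",
  "subscriptions": [
    { "schema": "dev", "table": "dog", "publish": true, "subscribe": false }
  ]
}
```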
This call will update the subscription with the dog table. Any other subscriptions with Node2 will not change.
To add or update subscriptions with one or more nodes in one API call, use `configure_cluster`.
Note: `configure_cluster` will override any and all existing subscriptions defined on the local node. This means that before going through the connections in the request and adding those subscriptions, it will first remove all existing subscriptions on the local node. To see all existing subscriptions, use `cluster_status`.
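A sketch of a `configure_cluster` call, assuming a `connections` array of per-node subscription sets (node names, the `dev` schema, and tables are placeholders):

```json
{
  "operation": "configure_cluster",
  "connections": [
    {
      "node_name": "Node2",
      "subscriptions": [
        { "schema": "dev", "table": "dog", "publish": true, "subscribe": true }
      ]
    },
    {
      "node_name": "Node3",
      "subscriptions": [
        { "schema": "dev", "table": "chicken", "publish": false, "subscribe": true }
      ]
    }
  ]
}
```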
There is an optional property called `start_time` that can be passed in the subscription. This property accepts an ISO formatted UTC date. `start_time` can be used to set from what time you would like to source transactions from a table when creating or updating a subscription.
This example will get all transactions on Node2’s dog table starting from `2022-09-02T20:06:35.993Z` and replicate them locally on the dog table.
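A sketch of that call, with `start_time` added to the subscription (the `dev` schema is again an assumption):

```json
{
  "operation": "update_node",
  "node_name": "Node2",
  "subscriptions": [
    {
      "schema": "dev",
      "table": "dog",
      "publish": false,
      "subscribe": true,
      "start_time": "2022-09-02T20:06:35.993Z"
    }
  ]
}
```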
If no start time is passed, it defaults to the current time.

Note: `start_time` utilizes clustering to source past transactions. For this reason, it can only source transactions that occurred while clustering was enabled.
To remove a node and all its subscriptions, use `remove_node`.
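For example (`Node2` is a placeholder):

```json
{
  "operation": "remove_node",
  "node_name": "Node2"
}
```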
To get the status of all connected nodes and see their subscriptions, use `cluster_status`.
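For example:

```json
{
  "operation": "cluster_status"
}
```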