1 of 7

Reference

This section contains technical details and reference materials for HarperDB.

Storage Algorithm
Dynamic Schema
Headers
Limitations
Content Types
Data Types

Storage Algorithm

The HarperDB storage algorithm is fundamental to the HarperDB core functionality, enabling the Dynamic Schema and all other user-facing functionality. HarperDB is built on top of Lightning Memory-Mapped Database (LMDB), a key-value store offering industry leading performance and functionality, which allows for our storage algorithm to store data in tables as rows/objects. This document will provide additional details on how data is stored within HarperDB.

Query Language Agnostic

The HarperDB storage algorithm was designed to abstract the data storage from any individual query language. HarperDB currently supports both SQL and NoSQL on top of this storage algorithm, with the ability to add additional query languages in the future. This means data can be inserted via NoSQL and read via SQL while hitting the same underlying data storage.

ACID Compliant

Utilizing Multi-Version Concurrency Control (MVCC) through LMDB, HarperDB offers ACID compliance independently on each node. Readers and writers operate independently of each other, meaning readers don’t block writers and writers don’t block readers. Each HarperDB table has a single writer process, avoiding deadlocks and assuring that writes are executed in the order in which they were received. HarperDB tables can have multiple reader processes operating at the same time for consistent, high scale reads.

Universally Indexed

All top level attributes are automatically indexed immediately upon ingestion. The HarperDB Dynamic Schema reflexively creates both the attribute and index reflexively as new schema metadata comes in. Indexes are agnostic of datatype, honoring the following order: booleans, numbers ordered naturally, strings ordered lexically. Within the LMDB implementation, table records are grouped together into a single LMDB environment file, where each attribute index is a sub-database (dbi) inside said environment file. An example of the indexing scheme can be seen below.

Additional LMDB Benefits

HarperDB inherits both functional and performance benefits by implementing LMDB as the underlying key-value store. Data is memory-mapped, which enables quick data access without data duplication. All writers are fully serialized, making writes deadlock-free. LMDB is built to maximize operating system features and functionality, fully exploiting buffer cache and built to run in CPU cache. To learn more about LMDB, visit their documentation.

HarperDB Indexing Example (Single Table)

Dynamic Schema

HarperDB is built to make data ingestion simple. A primary driver of that is the Dynamic Schema. The purpose of this document is to provide a detailed explanation of the dynamic schema specifically related to schema definition and data ingestion.

The dynamic schema provides the structure of schema and table namespaces while simultaneously providing the flexibility of a data-defined schema. Individual attributes are reflexively created as data is ingested, meaning the table will adapt to the structure of data ingested. HarperDB tracks the metadata around schemas, tables, and attributes allowing for describe table, describe schema, and describe all operations.

Schemas

HarperDB schemas are analogous to a namespace that groups tables together. A schema is required to create a table.

Tables

HarperDB tables group records together with a common data pattern. To create a table users must provide a table name and a primary key.

Table Name: Used to identify the table.
Primary Key: This is a required attribute that serves as the unique identifier for a record and is also known as the hash_attribute in HarperDB.

Primary Key

The primary key (also referred to as the hash_attribute) is used to uniquely identify records. Uniqueness is enforced on the primary; inserts with the same primary key will be rejected. If a primary key is not provided on insert, a GUID will be automatically generated and returned to the user. The HarperDB Storage Algorithm utilizes this value for indexing.

Standard Attributes

Additional attributes are reflexively added via insert and update operations (in both SQL and NoSQL) when new attributes are included in the data structure provided to HarperDB. As a result, schemas are additive, meaning new attributes are created in the underlying storage algorithm as additional data structures are provided. HarperDB offers create_attribute and drop_attribute operations for users who prefer to manually define their data model independent of data ingestion. When new attributes are added to tables with existing data the value of that new attribute will be assumed null for all existing records.

Audit Attributes

HarperDB automatically creates two audit attributes used on each record.

__createdtime__: The time the record was created in Unix Epoch with milliseconds format.
__updatedtime__: The time the record was updated in Unix Epoch with milliseconds format.

Dynamic Schema Example

To better understand the behavior let’s take a look at an example. This example utilizes HarperDB API operations.

Create a Schema

{
    "operation": "create_schema",
    "schema": "dev"
}

Create a Table

Notice the schema name, table name, and hash attribute name are the only required parameters.

{
    "operation": "create_table",
    "schema": "dev",
    "table": "dog",
    "hash_attribute": "id"
}

At this point the table does not have structure beyond what we provided, so the table looks like this:

dev.dog

Insert Record

To define attributes we do not need to do anything beyond sending them in with an insert operation.

{
    "operation": "insert",
    "schema": "dev",
    "table": "dog",
    "records": [
      {"id": 1, "dog_name": "Penny", "owner_name": "Kyle"}
    ]
}

With a single record inserted and new attributes defined, our table now looks like this:

dev.dog

Indexes have been automatically created for dog_name and owner_name attributes.

Insert Additional Record

If we continue inserting records with the same data schema no schema updates are required. One record will omit the hash attribute from the insert to demonstrate GUID generation.

{
    "operation": "insert",
    "schema": "dev",
    "table": "dog",
    "records": [
        {"id": 2, "dog_name": "Monk", "owner_name": "Aron"},
        {"dog_name": "Harper","owner_name": "Stephen"}
    ]
}

In this case, there is no change to the schema. Our table now looks like this:

dev.dog

Update Existing Record

In this case, we will update a record with a new attribute not previously defined on the table.

{
    "operation": "update",
    "schema": "dev",
    "table": "dog",
    "records": [
      {"id": 2, "weight_lbs": 35}
    ]
}

Now we have a new attribute called weight_lbs. Our table now looks like this:

dev.dog

Query Table with SQL

Now if we query for all records where weight_lbs is null we expect to get back two records.

{
    "operation": "sql",
    "sql": "SELECT * FROM dev.dog WHERE weight_lbs IS NULL"
}

This results in the expected two records being returned.

Data Types

HarperDB supports a rich set of data types for use in records in databases. Various data types can be used from both direct JavaScript interfaces in Custom Functions and the HTTP operations APIs. Using JSON for communication naturally limits the data types to those available in JSON (HarperDB’s supports all of JSON data types), but JavaScript code and alternate data formats facilitate the use of additional data types. As of v4.1, HarperDB supports MessagePack and CBOR, which allows for all of HarperDB supported data types. This includes:

Boolean

true or false.

String

Strings, or text, are a sequence of any unicode characters and are internally encoded with UTF-8.

Number

Numbers can be stored as signed integers up to 64-bit or floating point with 64-bit floating point precision, and numbers are automatically stored using the most optimal type. JSON is parsed by JS, so the maximum safe (precise) integer is 9007199254740991 (larger numbers can be stored, but aren’t guaranteed integer precision). Custom Functions may use BigInt numbers to store/access larger 64-bit integers, but integers beyond 64-bit can’t be stored with integer precision (will be stored as standard double-precision numbers).

Object/Map

Objects, or maps, that hold a set named properties can be stored in HarperDB. When provided as JSON objects or JavaScript objects, all property keys are stored as strings. The order of properties is also preserved in HarperDB’s storage. Duplicate property keys are not allowed (they are dropped in parsing any incoming data).

Array

Arrays hold an ordered sequence of values and can be stored in HarperDB. There is no support for sparse arrays, although you can use objects to store data with numbers (converted to strings) as properties.

Null

A null value can be stored in HarperDB property values as well.

Date

Dates can be stored as a specific data type. This is not supported in JSON, but is supported by MessagePack and CBOR. Custom Functions can also store and use Dates using JavaScript Date instances.

Binary Data

Binary data can be stored in property values as well. JSON doesn’t have any support for encoding binary data, but MessagePack and CBOR support binary data in data structures, and this will be preserved in HarperDB. Custom Functions can also store binary data by using NodeJS’s Buffer or Uint8Array instances to hold the binary data.

Explicit Map/Set

Explicit instances of JavaScript Maps and Sets can be stored and preserved in HarperDB as well. This can’t be represented with JSON, but can be with CBOR.

Content Types/Data Formats

HarperDB supports several different content types (or MIME types) for both HTTP request bodies (describing operations) as well as for serializing content into HTTP response bodies. HarperDB follows HTTP standards for specifying both request body content types and acceptable response body content types. Any of these content types can be used with any of the standard HarperDB operations.

For request body content, the content type should be specified with the Content-Type header. For example with JSON, use Content-Type: application/json and for CBOR, include Content-Type: application/cbor. To request that the response body be encoded with a specific content type, use the Accept header. If you want the response to be in JSON, use Accept: application/json. If you want the response to be in CBOR, use Accept: application/cbor.

The following content types are supported:

JSON - application/json

JSON is the most widely used content type, and is relatively readable and easy to work with. However, JSON does not support all the data types that are supported by HarperDB, and can't be used to natively encode data types like binary data or explicit Maps/Sets. Also, JSON is not as efficient as binary formats. When using JSON, compression is recommended (this also follows standard HTTP protocol with the Accept-Encoding header) to improve network transfer performance (although there is server performance overhead). JSON is a good choice for web development and when standard JSON types are sufficient and when combined with compression and debuggability/observability is important.

CBOR - application/cbor

CBOR is a highly efficient binary format, and is a recommended format for most production use cases with HarperDB. CBOR supports the full range of HarperDB data types, including binary data, typed dates, and explicit Maps/Sets. CBOR is very performant and space efficient even without compression. Compression will still yield better network transfer size/performance, but compressed CBOR is generally not any smaller than compressed JSON. CBOR also natively supports streaming for optimal performance (using indefinite length arrays). The CBOR format has excellent standardization and HarperDB's CBOR provides an excellent balance of performance and size efficiency.

MessagePack - application/x-msgpack

MessagePack is another efficient binary format like CBOR, with a support for all HarperDB data types. MessagePack generally has wider adoption than CBOR and can be useful in systems that don't have CBOR support (or good support). However, MessagePack does not have native support for streaming of arrays of data (for query results), and so query results are returned as a (concatenated) sequence of MessagePack objects/maps. MessagePack decoders used with HarperDB's MessagePack must be prepared to decode a direct sequence of MessagePack values to properly read responses.

Comma-separated Values (CSV) - text/csv

Comma-separated values is an easy to use and understand format that can be readily imported into spreadsheets or used for data processing. CSV lacks hierarchical structure most data types, and shouldn't be used for frequent/production use, but when you need it, it is available.

HarperDB Headers

All HarperDB API responses include headers that are important for interoperability and debugging purposes. The following headers are returned with all HarperDB API responses:

HarperDB Limits

This document outlines limitations of HarperDB.

Schema Naming Restrictions

Case Sensitivity

HarperDB schema metadata (schema names, table names, and attribute/column names) are case sensitive. Meaning schemas, tables, and attributes can differ only by the case of their characters.

Restrictions on Schema Metadata Names

HarperDB schema metadata (schema names, table names, and attribute names) cannot contain the following UTF-8 characters:

/`¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Additionally, they cannot contain the first 31 non-printing characters. Spaces are allowed, but not recommended as best practice. The regular expression used to verify a name is valid is:

^[\x20-\x2E|\x30-\x5F|\x61-\x7E]*$

Table Limitations

Attribute Maximum

HarperDB limits number of attributes to 10,000 per table.

Dynamic Schema

Schemas

HarperDB schemas are analogous to a namespace that groups tables together. A schema is required to create a table.

Tables

HarperDB tables group records together with a common data pattern. To create a table users must provide a table name and a primary key.

Table Name: Used to identify the table.
Primary Key: This is a required attribute that serves as the unique identifier for a record and is also known as the hash_attribute in HarperDB.

Primary Key

Standard Attributes

Audit Attributes

HarperDB automatically creates two audit attributes used on each record.

__createdtime__: The time the record was created in Unix Epoch with milliseconds format.
__updatedtime__: The time the record was updated in Unix Epoch with milliseconds format.

Dynamic Schema Example

To better understand the behavior let’s take a look at an example. This example utilizes HarperDB API operations.

Create a Schema

{
    "operation": "create_schema",
    "schema": "dev"
}

Create a Table

Notice the schema name, table name, and hash attribute name are the only required parameters.

{
    "operation": "create_table",
    "schema": "dev",
    "table": "dog",
    "hash_attribute": "id"
}

At this point the table does not have structure beyond what we provided, so the table looks like this:

dev.dog

Insert Record

To define attributes we do not need to do anything beyond sending them in with an insert operation.

{
    "operation": "insert",
    "schema": "dev",
    "table": "dog",
    "records": [
      {"id": 1, "dog_name": "Penny", "owner_name": "Kyle"}
    ]
}

With a single record inserted and new attributes defined, our table now looks like this:

dev.dog

Indexes have been automatically created for dog_name and owner_name attributes.

Insert Additional Record

If we continue inserting records with the same data schema no schema updates are required. One record will omit the hash attribute from the insert to demonstrate GUID generation.

{
    "operation": "insert",
    "schema": "dev",
    "table": "dog",
    "records": [
        {"id": 2, "dog_name": "Monk", "owner_name": "Aron"},
        {"dog_name": "Harper","owner_name": "Stephen"}
    ]
}

In this case, there is no change to the schema. Our table now looks like this:

dev.dog

Update Existing Record

In this case, we will update a record with a new attribute not previously defined on the table.

{
    "operation": "update",
    "schema": "dev",
    "table": "dog",
    "records": [
      {"id": 2, "weight_lbs": 35}
    ]
}

Now we have a new attribute called weight_lbs. Our table now looks like this:

dev.dog

Query Table with SQL

Now if we query for all records where weight_lbs is null we expect to get back two records.

{
    "operation": "sql",
    "sql": "SELECT * FROM dev.dog WHERE weight_lbs IS NULL"
}

This results in the expected two records being returned.