Storage Format

This document describes how backups generated by kopia are stored and how encryption is used to secure them.

The bulk of the backup data is stored in a Repository, while meta-information about backups (typically orders of magnitude smaller) is stored in a Vault.

This design allows the Repository and the Vault to be stored in two physically separate locations. The Repository is typically stored in highly durable and available cloud storage, such as Google Cloud Storage. The Vault is small enough to be carried on a thumb drive, but it is typically also stored in the cloud, together with the Repository.

Repositories can be shared among users, while a Vault is typically owned by a single user.

BLOB Storage

Kopia stores all data (both vault and repository) in BLOB storage, which stores Blocks of unstructured binary data. Currently supported storage mechanisms include:

  • filesystem
  • Google Cloud Storage
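
As a rough illustration of what a BLOB storage backend has to provide, here is a minimal filesystem-backed sketch. The class and method names (`FilesystemBlockStorage`, `put_block`, `get_block`, `block_exists`) are hypothetical and are not Kopia's actual API; the only thing taken from the text is that a backend stores named Blocks of opaque bytes.

```python
import os
import tempfile


class FilesystemBlockStorage:
    """Hypothetical block store: one file per block under a root directory."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put_block(self, block_id, data):
        # Write to a temporary file, then rename, so readers never see a partial block.
        tmp_path = os.path.join(self.root, block_id + ".tmp")
        with open(tmp_path, "wb") as f:
            f.write(data)
        os.replace(tmp_path, os.path.join(self.root, block_id))

    def get_block(self, block_id):
        with open(os.path.join(self.root, block_id), "rb") as f:
            return f.read()

    def block_exists(self, block_id):
        return os.path.exists(os.path.join(self.root, block_id))


if __name__ == "__main__":
    storage = FilesystemBlockStorage(tempfile.mkdtemp())
    storage.put_block("a76999788386641a3ec798554f1fe7e6", b"unstructured bytes")
    print(storage.get_block("a76999788386641a3ec798554f1fe7e6"))
```

A cloud backend would expose the same three operations, with the block name used as the object key.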

Repository

The Repository is content-addressable storage that holds arbitrary Objects (files, directory listings, etc.). Objects are referenced by Object Identifiers, which are human-readable strings.

When an object is stored in a repository, the user does not pick its identifier. Instead, the repository computes the identifier as a function of the object's contents. This gives data deduplication: identical objects always have the same object ID, so they need to be stored only once, saving upload time and storage space.
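
The deduplication property can be seen in a toy in-memory store. This is only a sketch: `ContentAddressableStore` is a hypothetical name, and plain SHA-256 stands in for whichever keyed hash a real repository is configured with.

```python
import hashlib


class ContentAddressableStore:
    """Toy store: the ID is derived from the content, never chosen by the caller."""

    def __init__(self):
        self.blocks = {}
        self.writes = 0  # how many times data was actually stored

    def put(self, content):
        # Assumption: unkeyed SHA-256 is used here purely for illustration.
        object_id = "D" + hashlib.sha256(content).hexdigest()
        if object_id not in self.blocks:
            self.blocks[object_id] = content
            self.writes += 1
        return object_id

    def get(self, object_id):
        return self.blocks[object_id]


if __name__ == "__main__":
    store = ContentAddressableStore()
    first = store.put(b"same bytes")
    second = store.put(b"same bytes")
    print(first == second, store.writes)
```

Storing the same bytes twice returns the same identifier and performs only one write.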

  • For very small binary objects (up to a few kilobytes), the object identifier is B followed by the base-64 encoded contents of the object.

    EXAMPLE: An object with 4 bytes 01 02 03 04 has identifier BAQIDBA.

  • For very small text objects (ASCII files, up to a few kilobytes), the object identifier is T followed by the contents of the object.

    EXAMPLE: The text quick brown fox has identifier Tquick brown fox

  • For medium-sized objects (less than about 20MB), Kopia applies a cryptographic hash function to compute the message digest (typically 128-512 bits long) and stores the object contents in a Block whose name is the digest. The object identifier for medium-sized objects is D{digest}.

    EXAMPLE: Da76999788386641a3ec798554f1fe7e6 could be the identifier of an object whose cryptographic hash is a76999788386641a3ec798554f1fe7e6. In certain formats (see below) the object identifier is followed by a base-16 encoded per-object encryption key, such as Da76999788386641a3ec798554f1fe7e6.82f948a549f7b791a5b41915ee4d1ec3935357e4e2317250d0372afa2ebeeb3a

  • Larger objects are split into chunks of 20MB, which are stored as medium-sized objects. The list of object IDs representing the chunks is then stored in a JSON object, which is also stored in the repository. The resulting object ID is L1{digest of list object}.

  • Sometimes smaller objects are combined into bundles. This allows all small files in a directory to be stored in a single medium-sized object. To refer to a section of a bundle object, the form S{offset},{length},{bundle} is used.

    EXAMPLE: S2000,5000,Da76999788386641a3ec798554f1fe7e6 refers to the section of Da76999788386641a3ec798554f1fe7e6 from byte 2000 through byte 6999.
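
The identifier forms above can be sketched in a few lines. The helper names are hypothetical, and two details are assumptions: the standard base-64 alphabet is used, and the trailing `=` padding is stripped because that is what the `BAQIDBA` example suggests.

```python
import base64
from typing import NamedTuple


def inline_binary_id(content):
    # "B" + base-64 of the contents, with "=" padding removed (assumed from the example).
    return "B" + base64.b64encode(content).decode("ascii").rstrip("=")


def decode_inline_binary(object_id):
    payload = object_id[1:]
    payload += "=" * (-len(payload) % 4)  # restore the stripped padding
    return base64.b64decode(payload)


def inline_text_id(text):
    return "T" + text


class Section(NamedTuple):
    offset: int
    length: int
    bundle_id: str


def parse_section_id(object_id):
    if not object_id.startswith("S"):
        raise ValueError("not a section identifier: " + object_id)
    # Split only twice so the bundle identifier is kept intact.
    offset, length, bundle_id = object_id[1:].split(",", 2)
    return Section(int(offset), int(length), bundle_id)


if __name__ == "__main__":
    print(inline_binary_id(bytes([1, 2, 3, 4])))
    section = parse_section_id("S2000,5000,Da76999788386641a3ec798554f1fe7e6")
    print(section.offset, section.offset + section.length - 1)
```

The last byte of a section is `offset + length - 1`, which is why a section of length 5000 starting at byte 2000 ends at byte 6999.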

Sharing and Encryption

Depending on data sharing and encryption needs, Kopia supports three encryption modes:

  • Unencrypted - object data in the repository is stored unencrypted

    • This mode is recommended for repositories that are stored in trusted locations
    • Any user with access to the repository can see the contents of all the files, although they can't easily find the names of files.
    • Single-user and shared repositories are supported.
  • Single-key encryption - object data is encrypted using a shared key

    • The encryption key is shared among users and stored in each user's Vault.
    • Any user with access to the repository and the shared key can decrypt all files, although they can't easily find the names of files.
    • A per-object initialization vector (synthetic IV, or SIV) is derived from the object contents and a secret that is also stored in the Vault.
    • Object identifiers are typically short, for example: Da76999788386641a3ec798554f1fe7e6.
    • This mode is recommended for data owned by a single user or a set of trusted users, where key sharing is possible.
  • Per-object encryption - object data is encrypted using a per-object key

    • The per-object encryption key is derived from the object contents and a secret stored in the user's Vault.
    • The encryption key is stored as part of the object identifier and is required to decrypt the object.
    • Object identifiers are relatively long because they include 256-bit encryption keys, for example: Da76999788386641a3ec798554f1fe7e6.82f948a549f7b791a5b41915ee4d1ec3935357e4e2317250d0372afa2ebeeb3a
    • Access to the repository and knowledge of the shared secret are not enough to decrypt files; the per-object encryption key is also required.
    • This mode is recommended when more than one user shares a repository and sharing an encryption key is not feasible.

Object Formats

The following formats are supported:

| ID | Mode | ObjectID Length | Encryption |
|----|------|-----------------|------------|
| UNENCRYPTED_HMAC_SHA256 | Unencrypted | 65 | None |
| UNENCRYPTED_HMAC_SHA256_128 | Unencrypted | 33 | None |
| ENCRYPTED_HMAC_SHA256_AES256_SIV | Single-key | 33 | AES-256 |
| ENCRYPTED_HMAC_SHA512_384_AES256 | Per-object-key | 98 | AES-256 |
| ENCRYPTED_HMAC_SHA512_AES256 | Per-object-key | 130 | AES-256 |

The default format is ENCRYPTED_HMAC_SHA256_AES256_SIV, which is best suited for single-user deployments.
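
The ObjectID lengths listed for each format appear to follow from the identifier layout: a one-character `D` prefix, two hex characters per digest byte, and, for per-object-key formats, a `.` separator plus two hex characters per key byte. The sketch below checks that arithmetic; the function name `object_id_length` and the per-format byte counts are inferred from the example identifiers and the format names, not taken from Kopia's code.

```python
def object_id_length(digest_bytes, key_bytes=0):
    # "D" prefix plus two hex characters per digest byte.
    length = 1 + 2 * digest_bytes
    if key_bytes:
        # "." separator plus two hex characters per key byte.
        length += 1 + 2 * key_bytes
    return length


# (digest bytes used in the block name, key bytes appended to the identifier)
LENGTHS = {
    "UNENCRYPTED_HMAC_SHA256": object_id_length(32),
    "UNENCRYPTED_HMAC_SHA256_128": object_id_length(16),
    "ENCRYPTED_HMAC_SHA256_AES256_SIV": object_id_length(16),
    "ENCRYPTED_HMAC_SHA512_384_AES256": object_id_length(16, 32),
    "ENCRYPTED_HMAC_SHA512_AES256": object_id_length(32, 32),
}

if __name__ == "__main__":
    for name, length in LENGTHS.items():
        print(name, length)
```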

  • UNENCRYPTED_HMAC_SHA256:

    • contents are stored unencrypted in a block named:
        blockID := BASE16(HMACSHA256(secret,content))
    
    • per-repository secret is stored in the Vault
  • UNENCRYPTED_HMAC_SHA256_128:

    • contents are stored unencrypted in a block named:
        blockID := BASE16(TRUNCATE(HMACSHA256(secret,content),16))
    
    • per-repository secret is stored in the Vault
  • ENCRYPTED_HMAC_SHA256_AES256_SIV:

    • block contents are encrypted with AES-256 in CTR mode with a synthetic IV derived from the content:
        iv := TRUNCATE(HMACSHA256(secret,content),16)
        cipherText := AES256CTR(encryptionKey,iv,content)
        blockID := BASE16(iv)
    
    • per-repository encryptionKey and secret are stored in the Vault
  • ENCRYPTED_HMAC_SHA512_384_AES256:

    • block contents are encrypted with AES-256 in CTR mode with a key derived from the content and a constant IV:
        digest := HMACSHA512384(secret,content)
        blockID := BASE16(digest[0:16])
        encryptionKey := digest[16:48]
        iv := "kopiakopiakopiak"
        cipherText := AES256CTR(encryptionKey,iv,content)
    
    • the per-repository secret is stored in the Vault; the per-object encryptionKey is derived from the content and carried in the object identifier
  • ENCRYPTED_HMAC_SHA512_AES256:

    • block contents are encrypted with AES-256 in CTR mode with a key derived from the content and a constant IV:
        digest := HMACSHA512(secret,content)
        blockID := BASE16(digest[0:32])
        encryptionKey := digest[32:64]
        iv := "kopiakopiakopiak"
        cipherText := AES256CTR(encryptionKey,iv,content)
    
    • the per-repository secret is stored in the Vault; the per-object encryptionKey is derived from the content and carried in the object identifier
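
The block-naming and key-derivation steps of the formats above can be sketched with the Python standard library. This is not Kopia's implementation: the function names are hypothetical, the AES-256-CTR step is left out because the standard library has no AES, and treating the 384-bit variant as HMAC over SHA-384 is an assumption about what the pseudocode means.

```python
import hashlib
import hmac


def unencrypted_sha256_block_id(secret, content):
    return hmac.new(secret, content, hashlib.sha256).hexdigest()


def unencrypted_sha256_128_block_id(secret, content):
    # Keep only the first 16 bytes (128 bits) of the HMAC.
    return hmac.new(secret, content, hashlib.sha256).digest()[:16].hex()


def siv_block_id_and_iv(secret, content):
    # The synthetic IV doubles as the block name in the single-key format.
    iv = hmac.new(secret, content, hashlib.sha256).digest()[:16]
    return iv.hex(), iv


def per_object_sha512_384(secret, content):
    # Assumption: the 48-byte digest is HMAC-SHA384.
    digest = hmac.new(secret, content, hashlib.sha384).digest()
    return digest[0:16].hex(), digest[16:48]  # (blockID, encryptionKey)


def per_object_sha512(secret, content):
    digest = hmac.new(secret, content, hashlib.sha512).digest()
    return digest[0:32].hex(), digest[32:64]  # (blockID, encryptionKey)


if __name__ == "__main__":
    secret = b"per-repository secret (example value)"
    block_id, key = per_object_sha512_384(secret, b"some file contents")
    # Object identifier in the per-object-key layout: D{blockID}.{key in base-16}
    print("D" + block_id + "." + key.hex())
```

Because the key in the per-object formats depends on the content, a constant IV never pairs the same key with two different plaintexts, which is presumably why those formats can use one.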

Vault Format

The Vault provides storage for backup metadata, which is typically encrypted with a per-user key.

Each vault contains an unencrypted block named format describing the vault encryption format and key derivation algorithm:

    {
      "version": "1",
      "uniqueID": "Rig5PvhA5HxHcfBV7MwY7US6XXwm40Sz5RzL1hEc4LM=",
      "keyAlgo": "scrypt-65536-8-1",
      "encryption": "AES256_GCM"
    }

All other vault blocks are encrypted using AES-256 in Galois/Counter Mode. The encryption key and the authenticated data are derived from a master key. The master key is either provided by the user or derived from a password using scrypt.
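
Deriving a master key from a password can be sketched with `hashlib.scrypt`. Several things here are assumptions rather than facts from the format description: that `scrypt-65536-8-1` encodes the scrypt parameters in the order N, r, p; that the salt is supplied by the caller (the text does not say what Kopia uses as salt); and the helper names themselves. How the encryption key and authenticated data are then derived from the master key is not specified, so that step is not shown.

```python
import hashlib


def parse_key_algo(key_algo):
    # Assumed layout: "scrypt-{N}-{r}-{p}".
    name, n, r, p = key_algo.split("-")
    if name != "scrypt":
        raise ValueError("unsupported key derivation algorithm: " + key_algo)
    return int(n), int(r), int(p)


def derive_master_key(password, salt, key_algo, key_len=32):
    n, r, p = parse_key_algo(key_algo)
    # scrypt needs about 128 * r * (n + p + 2) bytes; raise the memory cap with headroom.
    max_mem = 2 * 128 * r * (n + p + 2)
    return hashlib.scrypt(
        password.encode("utf-8"),
        salt=salt,
        n=n,
        r=r,
        p=p,
        maxmem=max_mem,
        dklen=key_len,
    )


if __name__ == "__main__":
    key = derive_master_key("correct horse", b"example salt", "scrypt-65536-8-1")
    print(len(key))
```

With N=65536 and r=8 the derivation needs roughly 64 MiB of memory, which is above the default cap in `hashlib.scrypt` and is the reason `maxmem` is set explicitly.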

One encrypted block is of particular importance, the block named repo, which describes the location and format of the repository:

```json
    {
      "connection": {
        "type": "filesystem",
        "config": {
          "path": "/tmp/kopia-test-repo"
        }
      },
      "format": {
        "version": 1,
        "objectFormat": "ENCRYPTED_HMAC_SHA512_384_AES256",
        "secret": "TzQzQDQ7jfBf6/RGNJAIXYZMRbc4Ty8270wiLTfBUHU=",
        "maxInlineContentLength": 32768,
        "maxBlockSize": 20971520,
        "masterKey": "h1jU2A+tSnzRot2Me5ZQNdjjox6KUTqd8H9TqZvtypw="
      }
    }
```
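
Once decrypted, the `repo` block is ordinary JSON and can be read with any JSON parser. The sketch below loads the example values shown above and decodes the base-64 fields; `load_repo_format` and the keys of the dictionary it returns are hypothetical names chosen for illustration.

```python
import base64
import json

# Example values copied from the repo block shown above.
REPO_BLOCK = """
{
  "connection": {
    "type": "filesystem",
    "config": {"path": "/tmp/kopia-test-repo"}
  },
  "format": {
    "version": 1,
    "objectFormat": "ENCRYPTED_HMAC_SHA512_384_AES256",
    "secret": "TzQzQDQ7jfBf6/RGNJAIXYZMRbc4Ty8270wiLTfBUHU=",
    "maxInlineContentLength": 32768,
    "maxBlockSize": 20971520,
    "masterKey": "h1jU2A+tSnzRot2Me5ZQNdjjox6KUTqd8H9TqZvtypw="
  }
}
"""


def load_repo_format(raw):
    block = json.loads(raw)
    fmt = block["format"]
    return {
        "storage_type": block["connection"]["type"],
        "object_format": fmt["objectFormat"],
        "secret": base64.b64decode(fmt["secret"]),
        "master_key": base64.b64decode(fmt["masterKey"]),
        "max_inline": fmt["maxInlineContentLength"],
        "max_block_size": fmt["maxBlockSize"],
    }


if __name__ == "__main__":
    info = load_repo_format(REPO_BLOCK)
    print(info["object_format"], len(info["secret"]), info["max_block_size"])
```

In this example both `secret` and `masterKey` decode to 32 bytes, `maxBlockSize` is 20 MiB, and `maxInlineContentLength` is 32 KiB.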

Directory Format

A directory is represented as a JSON object, which can be examined using:

$ kopia show <object id>

The object lists all directory entries, sorted lexicographically, together with entry attributes such as length and permissions. Each entry carries the identifier of an object (obj) that contains the file contents or, in the case of a subdirectory, the JSON object with the subdirectory's entries.

Note that the directory's own name is not stored as part of the object. This preserves the object IDs of directories that have been moved but not modified.

{
  "stream":"kopia:directory",
  "entries":[
    {"name":"IMG_0032.JPG","type":"f","mode":"0600","size":1690375,
     "mtime":"2016-11-06T00:01:05Z","uid":501,"gid":20,
      "obj":"D38861041c27cfeb5fb2b03b69579b3ce"},
    {"name":"IMG_0032.MOV","type":"f","mode":"0600","size":3325165,
     "mtime":"2016-11-06T00:01:05Z","uid":501,"gid":20,
      "obj":"Dd1ed2787f0c3f975afd4cbd733f79533"},
    {"name":"IMG_0033.JPG","type":"f","mode":"0600","size":1591460,
     "mtime":"2016-11-06T00:01:05Z","uid":501,"gid":20,
      "obj":"D6f6c202a0074074bbfe49bbf69d8a1bf"},
...
    {"name":"bundle-1","type":"b","size":"465",
     "mtime":"2016-11-06T00:01:06Z","obj":"D4cb4013f0cb66d24e6569119e0a122aa",
     "bundled":[
       {"name":"IMG_0130.JPG","type":"f","mode":"0600","size":"124",
        "mtime":"2016-11-06T00:01:06Z","uid":501,"gid":20},
       {"name":"IMG_0131.JPG","type":"f","mode":"0600","size":"341",
        "mtime":"2016-11-06T00:01:06Z","uid":501,"gid":20}
     ]}
]}
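
A directory object like the one above can be walked to map each file name to the object that holds its contents. The sketch below uses an abbreviated copy of the listing. The function name `iter_file_objects` is hypothetical, and the way bundled files are mapped to sections is an assumption: it supposes the files are laid out back to back in the bundle, in the order listed, which is consistent with the sizes 124 and 341 adding up to the bundle size of 465 but is not stated by the format description.

```python
import json

# Abbreviated copy of the directory listing shown above.
DIRECTORY_JSON = """
{
  "stream": "kopia:directory",
  "entries": [
    {"name": "IMG_0032.JPG", "type": "f", "mode": "0600", "size": 1690375,
     "mtime": "2016-11-06T00:01:05Z", "uid": 501, "gid": 20,
     "obj": "D38861041c27cfeb5fb2b03b69579b3ce"},
    {"name": "bundle-1", "type": "b", "size": "465",
     "mtime": "2016-11-06T00:01:06Z", "obj": "D4cb4013f0cb66d24e6569119e0a122aa",
     "bundled": [
       {"name": "IMG_0130.JPG", "type": "f", "mode": "0600", "size": "124",
        "mtime": "2016-11-06T00:01:06Z", "uid": 501, "gid": 20},
       {"name": "IMG_0131.JPG", "type": "f", "mode": "0600", "size": "341",
        "mtime": "2016-11-06T00:01:06Z", "uid": 501, "gid": 20}
     ]}
  ]
}
"""


def iter_file_objects(directory):
    """Yield (file name, object identifier) pairs for one directory object."""
    for entry in directory["entries"]:
        if entry["type"] == "b":
            # Assumed layout: bundled files are stored back to back in listed order.
            offset = 0
            for item in entry["bundled"]:
                length = int(item["size"])  # sizes inside bundles appear as strings
                yield item["name"], "S%d,%d,%s" % (offset, length, entry["obj"])
                offset += length
        else:
            yield entry["name"], entry["obj"]


if __name__ == "__main__":
    for name, object_id in iter_file_objects(json.loads(DIRECTORY_JSON)):
        print(name, object_id)
```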