mirror/spacedrive

Fork 0

mirror of https://github.com/spacedriveapp/spacedrive.git synced 2026-06-04 05:37:16 -04:00

Files

History

…

README.md

…

README.md

Archive System

Archive is Spacedrive's data archival system for indexing external data sources beyond the filesystem. While the VDFS manages files, Archive handles everything else: emails, notes, messages, bookmarks, calendar events, contacts, and more.

Features

Universal Indexing - Adapters ingest data from Gmail, Slack, Obsidian, Chrome, Safari, GitHub, Apple Notes, Calendar, Contacts, and more via a script-based protocol
Hybrid Search - Combines full-text search (SQLite FTS5) with semantic vector search (LanceDB + FastEmbed) merged via Reciprocal Rank Fusion
Safety Screening - Prompt Guard 2 classifies indexed text for injection attacks before it enters the search index
Schema-Driven Sources - Each data source is self-contained with its own SQLite database, vector index, and TOML schema
AI-Ready - Spacebot queries archived data through structured search APIs with built-in safety metadata
P2P Sync - Source metadata syncs across devices via library sync

Quick Start

1. Create a Source

// Create a Gmail source
const source = await core.sources.create({
  name: "Work Gmail",
  adapter_id: "gmail",
  trust_tier: "external",
  config: {
    email: "work@example.com",
    // OAuth flow happens automatically
  }
});

2. Sync Data

// Trigger sync job
const jobId = await core.sources.sync({
  source_id: source.id
});

// Monitor progress
core.jobs.subscribe(jobId, (progress) => {
  console.log(`Synced ${progress.current}/${progress.total} items`);
});

3. Search

// Hybrid search across all sources
const results = await core.sources.search({
  query: "budget proposal Q4",
  source_ids: [source.id],
  limit: 20
});

// Results include both FTS and vector matches
results.forEach(result => {
  console.log(`${result.title} (score: ${result.score})`);
  console.log(`Trust: ${result.trust_tier}, Safe: ${result.safety_verdict}`);
});

Architecture

Components

┌─────────────────────────────────────────────────────────┐
│                    Spacedrive Library                    │
├─────────────────────────────────────────────────────────┤
│  VDFS (Files)              Archive (Everything Else)    │
│  ├─ Locations              ├─ Sources                   │
│  ├─ Entries                │  ├─ Gmail                  │
│  ├─ Content IDs            │  ├─ Slack                  │
│  └─ Sidecars               │  ├─ Obsidian               │
│                            │  └─ Chrome History         │
│                            │                             │
│                            ├─ Hybrid Search              │
│                            │  ├─ FTS5 (keywords)         │
│                            │  └─ LanceDB (semantic)      │
│                            │                             │
│                            └─ Safety Pipeline            │
│                               ├─ Prompt Guard 2          │
│                               ├─ Trust Tiers             │
│                               └─ Quarantine              │
└─────────────────────────────────────────────────────────┘

Storage Layout

Each library contains a sources/ directory alongside the VDFS:

.sdlibrary/
├─ library.db              # VDFS + source metadata
├─ sidecars/               # VDFS sidecars
└─ sources/                # Archive sources
   ├─ registry.db          # Optional separate registry
   └─ {source-uuid}/
      ├─ data.db           # Generated from TOML schema
      ├─ embeddings.lance/ # Vector index
      ├─ schema.toml       # Data type definition
      ├─ state/            # Adapter cursor state
      └─ cache/            # Adapter-specific caches

Adapters

Adapters are script-based data source connectors that communicate via stdin/stdout JSONL protocol.

Built-in Adapters

Gmail - Emails, threads, labels
Obsidian - Notes, links, tags
Slack - Messages, threads, channels
Chrome Bookmarks - Bookmarks, folders
Chrome History - Browsing history
Safari History - Browsing history
Apple Notes - Notes, attachments
Apple Calendar - Events, reminders
Apple Contacts - Contacts, groups
GitHub - Issues, PRs, commits
OpenCode - Code snippets, projects

Creating an Adapter

1. Create adapter manifest (adapters/my-adapter/adapter.toml):

[adapter]
id = "my-adapter"
name = "My Adapter"
version = "1.0.0"
trust_tier = "external"

[sync]
command = "python3"
args = ["sync.py"]

[schema]
inline = """
[type]
name = "MyRecord"
fields = [
  { name = "title", type = "String", indexed = true },
  { name = "content", type = "Text", indexed = true, embedded = true },
  { name = "created_at", type = "DateTime" }
]
"""

2. Create sync script (adapters/my-adapter/sync.py):

#!/usr/bin/env python3
import json
import sys

def sync():
    # Read config from stdin
    config = json.loads(sys.stdin.readline())

    # Fetch data from source
    records = fetch_from_api(config)

    # Emit records as JSONL
    for record in records:
        print(json.dumps({
            "op": "upsert",
            "id": record["id"],
            "data": {
                "title": record["title"],
                "content": record["content"],
                "created_at": record["timestamp"]
            }
        }))
        sys.stdout.flush()

if __name__ == "__main__":
    sync()

3. Install adapter:

# Adapters are auto-discovered from adapters/ directory
# Just place your adapter folder in adapters/ and restart

Operations

Sources

// Create
core.sources.create(input: CreateSourceInput): SourceInfo

// List
core.sources.list(): SourceInfo[]

// Get
core.sources.get(id: Uuid): SourceInfo

// Update
core.sources.update(id: Uuid, updates: SourceUpdates): SourceInfo

// Delete
core.sources.delete(id: Uuid): void

// Sync
core.sources.sync(id: Uuid): JobId

// Sync all
core.sources.sync_all(): JobId[]

// Search
core.sources.search(query: SearchInput): SearchResult[]

Records

// List records in a source
core.sources.records.list(source_id: Uuid, limit?: number): Record[]

// Get specific record
core.sources.records.get(source_id: Uuid, record_id: string): Record

// Delete record
core.sources.delete_record(source_id: Uuid, record_id: string): void

Quarantine

// List quarantined records
core.sources.quarantine.list(source_id: Uuid): QuarantinedRecord[]

// Release from quarantine
core.sources.release_quarantined(source_id: Uuid, record_id: string): void

Adapters

// List available adapters
core.sources.adapters.list(): AdapterInfo[]

// Get adapter details
core.sources.adapters.get(id: string): AdapterInfo

// List schemas
core.sources.schemas.list(): SchemaInfo[]

Safety & Trust

Trust Tiers

Sources are assigned trust tiers that determine screening strictness:

authored - Content you created (Obsidian notes, drafts)
collaborative - Shared workspaces (Slack channels, shared docs)
external - Public or untrusted sources (Gmail, GitHub issues)

Safety Pipeline

Adapter Sync
    ↓
Screening (Prompt Guard 2)
    ├─ Safe → Continue
    └─ Flagged → Quarantine
         ↓
Classification (optional)
    ↓
Embedding (FastEmbed)
    ↓
Searchable

Quarantine

Flagged records are:

Excluded from search results by default
Visible in quarantine UI for review
Can be manually released or deleted
Never exposed to AI agents

Development

Running Tests

# Test the archive crate
cargo test -p sd-archive

# Test core integration
cargo test -p spacedrive-core -- sources::

# Test specific adapter
python3 adapters/gmail/test.py

Adding a Job

Jobs live in core/src/ops/sources/ alongside their operations:

// core/src/ops/sources/my_job.rs
use crate::infra::job::prelude::*;

#[derive(Debug, Serialize, Deserialize)]
pub struct MyJob {
    pub source_id: Uuid,
}

impl Job for MyJob {
    const NAME: &'static str = "my_job";
    const RESUMABLE: bool = true;
}

#[async_trait]
impl JobHandler for MyJob {
    type Output = MyJobOutput;

    async fn run(&mut self, ctx: JobContext<'_>) -> JobResult<Self::Output> {
        // Get source manager
        let mgr = ctx.library.source_manager()
            .ok_or_else(|| JobError::Internal("Source manager not initialized".into()))?;

        // Do work with progress reporting
        ctx.report_progress(MyProgress { current: 10, total: 100 }).await?;

        // Return output
        Ok(MyJobOutput { ... })
    }
}

Debugging

Enable verbose logging:

RUST_LOG=sd_archive=debug,spacedrive_core::data=debug cargo run

View source database:

sqlite3 ~/.sdlibrary/MyLibrary/sources/{source-uuid}/data.db
.schema
SELECT * FROM records LIMIT 10;

Inspect vector index:

import lancedb
db = lancedb.connect("~/.sdlibrary/MyLibrary/sources/{source-uuid}/embeddings.lance")
table = db.open_table("embeddings")
print(table.schema)

FAQ

Q: How is this different from the VDFS?

A: VDFS manages files on disk with content identity and cross-device awareness. Archive manages structured data from external sources (emails, notes, etc.) that aren't files.

Q: Do adapters run in a sandbox?

A: Adapters run as subprocess with limited privileges. They receive config via stdin and emit records via stdout. No filesystem or network access unless explicitly granted.

Q: Can I sync the same source to multiple devices?

A: Yes. Source metadata syncs via library sync. Each device can independently sync data from the source, or you can configure one device to sync and distribute snapshots.

Q: What happens if an adapter crashes?

A: The sync job tracks progress via cursor state. Resume from the last successful checkpoint. Partial syncs don't corrupt the database.

Q: Can I search across both files and sources?

A: Not yet. Currently file search and source search are separate. Unified federated search is planned for a future release.

Q: How do I handle OAuth secrets?

A: Secrets are stored encrypted in Spacedrive's KeyManager (OS keychain + redb). Adapters receive decrypted secrets as environment variables during sync.

Q: What's the performance impact?

A: Archive runs as background jobs. Embeddings are generated incrementally. Search is fast (FTS5 + LanceDB are both optimized for low-latency queries). Typical overhead: <5% CPU during sync, <100MB RAM per source.

Contributing

See CONTRIBUTING.md for general guidelines.

For adapter contributions, see ADAPTERS.md.

License

FSL-1.1-ALv2 - See LICENSE for details.