mirror of https://github.com/spacedriveapp/spacedrive.git synced 2026-05-19 13:55:40 -04:00

sd-archive

Archive engine for Spacedrive - indexes external data sources beyond the filesystem.

Overview

This crate provides the core archival engine that powers Spacedrive's data source integration. It is designed to be used as a standalone library or integrated into Spacedrive's core.

Key features:

Schema-driven SQLite databases generated from TOML schemas
Script-based adapter runtime (stdin/stdout JSONL protocol)
Hybrid search (FTS5 + LanceDB vector search + RRF merging)
Safety screening (Prompt Guard 2 for injection detection)
Portable sources (copy folder, it works)
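The "RRF merging" step in the hybrid search feature refers to Reciprocal Rank Fusion, which combines ranked lists by summing 1 / (k + rank) for each document across lists. The following is a minimal sketch of that technique using only the standard library. The function name `rrf_merge` and the choice of k = 60 in the usage note are assumptions for illustration, not the crate's actual API or tuning.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over several ranked lists of document ids.
/// Each list contributes 1 / (k + rank) per document, with a 1-based rank.
/// Returns (id, fused score) pairs sorted by score, highest first.
/// Hypothetical helper: the crate's real search router may differ.
pub fn rrf_merge(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (index, id) in ranking.iter().enumerate() {
            let rank = (index + 1) as f64;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut merged: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by score descending; break ties by id so the output is deterministic.
    merged.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    merged
}
```

Calling `rrf_merge(&[fts_ids, vector_ids], 60.0)` with one list from FTS5 and one from the vector index favors documents that rank well in both lists, without needing the two scoring scales to be comparable.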

Usage

Standalone

use std::path::PathBuf;

use sd_archive::{Engine, EngineConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize engine
    let config = EngineConfig {
        data_dir: PathBuf::from("./data"),
    };
    let engine = Engine::new(config).await?;

    // Create source from adapter
    let source_id = engine.create_source(
        "my-gmail",
        "gmail",
        serde_json::json!({
            "email": "user@example.com"
        })
    ).await?;

    // Sync data
    let report = engine.sync_source(&source_id, |progress| {
        println!("Progress: {}/{}", progress.current, progress.total);
    }).await?;

    // Search
    let results = engine.search(&source_id, "budget proposal", 10).await?;
    for result in results {
        println!("{}: {} (score: {})", result.id, result.title, result.score);
    }

    Ok(())
}

Integrated with Spacedrive

See core/src/data/ for the Spacedrive integration wrapper that adds:

Library-scoped lifecycle
Job system integration
Event bus integration
KeyManager for secrets
Operation/query registration

Architecture

Components

Engine - Top-level coordinator
Schema - TOML parser, SQL codegen, migrations
SourceDb - SQLite database per source
Registry - Source metadata management
Adapter - Script subprocess runtime
Search - Hybrid search router (FTS + vector)
Safety - Prompt Guard 2 screening
Embedding - FastEmbed vector generation

Data Flow

Adapter (script)
    ↓ JSONL
ScriptAdapter
    ↓ Records
SourceDb (upsert/delete)
    ↓
Safety Screening
    ↓
Embedding Generation
    ↓
Search Index (FTS5 + LanceDB)

Features

Default Features

None. The crate compiles with minimal dependencies by default.

Optional Features

safety-screening - Enable Prompt Guard 2 safety classifier
- Adds: ort, tokenizers, hf-hub
- Enables: safety::PromptGuard module
- Use when: Building with AI safety features
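For callers that need to compile with and without the optional feature, a common Rust pattern is to gate the screening path behind `cfg(feature = ...)` and provide a fallback. The sketch below illustrates only that pattern. `Verdict` and `screen` are hypothetical names, and the body of the feature-enabled branch is a stand-in string check, not the real `safety::PromptGuard` API.

```rust
/// Hypothetical screening result; not a type exported by the crate.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Safe,
    Flagged,
    NotScreened,
}

/// Compiled only when the `safety-screening` feature is enabled.
/// The check below is a placeholder where a classifier call would go.
#[cfg(feature = "safety-screening")]
pub fn screen(text: &str) -> Verdict {
    if text.to_lowercase().contains("ignore previous instructions") {
        Verdict::Flagged
    } else {
        Verdict::Safe
    }
}

/// Fallback used when the feature is disabled: records pass through unscreened.
#[cfg(not(feature = "safety-screening"))]
pub fn screen(_text: &str) -> Verdict {
    Verdict::NotScreened
}
```

Returning an explicit `NotScreened` value, rather than `Safe`, keeps it visible downstream that no classifier ran in a default build.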

Schema Format

Sources are defined by TOML schemas:

[[type]]
name = "Email"
fields = [
  { name = "subject", type = "String", indexed = true },
  { name = "body", type = "Text", indexed = true, embedded = true },
  { name = "from", type = "String" },
  { name = "to", type = "String" },
  { name = "received_at", type = "DateTime" },
]

[[type]]
name = "Attachment"
fields = [
  { name = "filename", type = "String" },
  { name = "size", type = "Integer" },
  { name = "email_id", type = "ForeignKey", references = "Email" }
]

Field types:

String - Short text (up to 1KB)
Text - Long text (unlimited)
Integer - i64
Float - f64
Boolean - bool
DateTime - ISO 8601 timestamp
Json - Arbitrary JSON
ForeignKey - Reference to another type

Field flags:

indexed = true - Include the field in the FTS5 full-text index
embedded = true - Generate vector embeddings for semantic search
unique = true - Enforce a uniqueness constraint
nullable = false - Require non-null values
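To make the "SQL codegen" idea concrete, here is a minimal sketch of turning the field types and flags above into a SQLite `CREATE TABLE` statement. Everything in it is an assumption for illustration: the `Field` and `FieldType` definitions, the type-to-column mapping, the implicit `id` primary key, and the lowercase table names are not taken from the crate. The `indexed` and `embedded` flags are left out because they would drive separate FTS5 and vector indexes rather than column definitions.

```rust
/// Field types from the schema format. Hypothetical Rust mirror of the TOML names.
#[derive(Debug, Clone, PartialEq)]
pub enum FieldType {
    String,
    Text,
    Integer,
    Float,
    Boolean,
    DateTime,
    Json,
    ForeignKey { references: std::string::String },
}

/// One schema field with the flags that affect the column definition.
pub struct Field {
    pub name: &'static str,
    pub ty: FieldType,
    pub unique: bool,
    pub nullable: bool,
}

/// Assumed mapping onto SQLite storage classes.
fn sql_type(ty: &FieldType) -> &'static str {
    match ty {
        FieldType::String | FieldType::Text | FieldType::DateTime | FieldType::Json => "TEXT",
        FieldType::Integer | FieldType::Boolean => "INTEGER",
        FieldType::Float => "REAL",
        FieldType::ForeignKey { .. } => "TEXT",
    }
}

/// Builds a CREATE TABLE statement for one schema type.
/// Identifiers are double-quoted because names such as "from" and "to" are SQL keywords.
pub fn create_table_sql(type_name: &str, fields: &[Field]) -> String {
    let mut columns = vec!["\"id\" TEXT PRIMARY KEY".to_string()];
    for field in fields {
        let mut column = format!("\"{}\" {}", field.name, sql_type(&field.ty));
        if !field.nullable {
            column.push_str(" NOT NULL");
        }
        if field.unique {
            column.push_str(" UNIQUE");
        }
        if let FieldType::ForeignKey { references } = &field.ty {
            column.push_str(&format!(" REFERENCES \"{}\"(\"id\")", references.to_lowercase()));
        }
        columns.push(column);
    }
    format!(
        "CREATE TABLE IF NOT EXISTS \"{}\" (\n  {}\n);",
        type_name.to_lowercase(),
        columns.join(",\n  ")
    )
}
```

Passing the `Attachment` fields from the schema example would yield a table whose `email_id` column references the `email` table's `id` column.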

Adapter Protocol

Adapters communicate via stdin/stdout using line-delimited JSON.

Input (stdin)

Config object sent once at startup:

{"email": "user@example.com", "cursor": "abc123"}

Output (stdout)

Stream of operation objects:

{"op": "upsert", "id": "msg-1", "data": {"subject": "Hello", "body": "..."}}
{"op": "upsert", "id": "msg-2", "data": {"subject": "Re: Hello", "body": "..."}}
{"op": "delete", "id": "msg-3"}

Operations:

upsert - Insert or update record
delete - Delete record
link - Create relationship between records
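The effect of an operation stream on a source can be sketched with an in-memory map standing in for SourceDb. This is an illustration under stated assumptions: the `Op` enum and `apply_ops` function are hypothetical, record data is kept as an opaque JSON string, and `link` is omitted because its fields are not specified here.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory form of the `upsert` and `delete` operations.
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    Upsert { id: String, data: String },
    Delete { id: String },
}

/// Applies operations in order to a store keyed by record id.
/// Returns (number of upserts applied, number of records actually deleted).
pub fn apply_ops(store: &mut BTreeMap<String, String>, ops: &[Op]) -> (usize, usize) {
    let (mut upserted, mut deleted) = (0, 0);
    for op in ops {
        match op {
            // Insert a new record or overwrite the existing one with the same id.
            Op::Upsert { id, data } => {
                store.insert(id.clone(), data.clone());
                upserted += 1;
            }
            // Deleting an unknown id is a no-op rather than an error.
            Op::Delete { id } => {
                if store.remove(id).is_some() {
                    deleted += 1;
                }
            }
        }
    }
    (upserted, deleted)
}
```

Because `upsert` overwrites by id and `delete` tolerates missing ids, replaying the same stream leaves the store in the same final state, which is useful if a sync is interrupted and restarted.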

Cursor State

Adapters maintain cursor state for incremental sync:

{"op": "cursor", "value": "next-page-token-xyz"}

The engine persists the cursor and passes it back to the adapter in the config object (the "cursor" field shown under Input) on the next sync.
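The cursor round trip can be sketched as follows: the engine records the cursor an adapter emits, then merges it into the config for the next run. The names `SourceState`, `Event`, `finish_sync`, and `config_for_next_sync` are hypothetical, the config is simplified to string key/value pairs, and the rule that the last cursor emitted wins is an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-source state persisted between syncs.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct SourceState {
    pub cursor: Option<String>,
}

/// Simplified adapter output: a record operation or a cursor update.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Record { id: String },
    Cursor { value: String },
}

/// Consumes one sync's output and returns the number of record operations seen.
/// Assumption: if several cursors are emitted, the last one is kept.
pub fn finish_sync(state: &mut SourceState, events: &[Event]) -> usize {
    let mut records = 0;
    for event in events {
        match event {
            Event::Record { .. } => records += 1,
            Event::Cursor { value } => state.cursor = Some(value.clone()),
        }
    }
    records
}

/// Builds the config sent on stdin for the next run, adding "cursor" when one is stored.
pub fn config_for_next_sync(
    base: &BTreeMap<String, String>,
    state: &SourceState,
) -> BTreeMap<String, String> {
    let mut config = base.clone();
    if let Some(cursor) = &state.cursor {
        config.insert("cursor".to_string(), cursor.clone());
    }
    config
}
```

On a first sync there is no stored cursor, so the adapter receives only its base config and would be expected to start from the beginning. A real engine would presumably store the cursor only after the records preceding it have been committed.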

Dependencies

Core:

sqlx - SQLite database operations
toml - Schema parsing
serde / serde_json - Serialization
tokio - Async runtime
uuid - Source IDs
blake3 - Content hashing

Search:

lancedb - Vector database
fastembed - Embedding model

Safety (optional):

ort - ONNX Runtime for Prompt Guard 2
tokenizers - Text tokenization
hf-hub - Model downloads

Performance

Benchmarks (M2 Max, 10k emails):

Schema parsing: ~1ms
Schema migration: ~50ms (first time), ~5ms (no-op)
Adapter sync: ~2000 records/sec (I/O bound)
FTS5 search: ~5ms (p95)
Vector search: ~20ms (p95)
Hybrid search (RRF): ~30ms (p95)
Embedding generation: ~100 records/sec (CPU bound)

Memory:

Engine overhead: ~10MB
Per-source overhead: ~5MB
LanceDB cache: ~50MB
FastEmbed model: ~100MB (shared across sources)
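As a rough worked example of what these figures imply, the sketch below turns them into back-of-envelope estimates. The constants are copied from the lists above. The function names are hypothetical, and the memory formula assumes the LanceDB cache and FastEmbed model are each counted once per process, which the figures do not state for the cache.

```rust
// Throughput and memory figures copied from the benchmark lists (M2 Max, 10k emails).
const SYNC_RECORDS_PER_SEC: f64 = 2000.0;
const EMBED_RECORDS_PER_SEC: f64 = 100.0;
const ENGINE_MB: u64 = 10;
const PER_SOURCE_MB: u64 = 5;
const LANCEDB_CACHE_MB: u64 = 50;
const FASTEMBED_MODEL_MB: u64 = 100;

/// Returns (sync seconds, embedding seconds) for a given record count,
/// assuming throughput stays constant as the source grows.
pub fn ingest_estimate_secs(records: u64) -> (f64, f64) {
    let n = records as f64;
    (n / SYNC_RECORDS_PER_SEC, n / EMBED_RECORDS_PER_SEC)
}

/// Returns an approximate resident memory figure in MB for a number of sources.
/// Assumption: the cache and the embedding model are shared, not per source.
pub fn memory_estimate_mb(sources: u64) -> u64 {
    ENGINE_MB + PER_SOURCE_MB * sources + LANCEDB_CACHE_MB + FASTEMBED_MODEL_MB
}
```

For 10,000 records this gives about 5 seconds of adapter sync and about 100 seconds of embedding generation, so embedding is the step that dominates a full ingest at these rates. Three sources would come to roughly 175 MB under the stated assumption.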

Testing

# Run all tests
cargo test -p sd-archive

# Run with safety features
cargo test -p sd-archive --features safety-screening

# Run specific test
cargo test -p sd-archive schema::tests::parse_simple_schema

# Benchmark
cargo bench -p sd-archive

License

FSL-1.1-ALv2 - See ../../LICENSE for details.