mirror of https://github.com/tranxuanthang/lrcget.git synced 2026-02-23 18:57:27 -05:00

tranxuanthang 8e67b2880f Add incremental library scanning support

- Added file_size, modified_time, content_hash, and scan_status columns to tracks table
- Added database migration to version 8 with new indexes for scan performance
- Implemented fingerprint matching (mtime+size) and content hash for move/rename detection

2026-02-23 08:20:07 +07:00

File System Refactoring Plan for LRCGET

Status: Implementation Complete
Last Updated: 2026-02-22

Overview

The library import and scan logic was refactored to support incremental (partial) scanning, which improves performance for large libraries.

What Was Implemented

New Directory Structure

src-tauri/src/
  scanner/
    mod.rs          # Public API exports
    hasher.rs       # Content hashing (xxhash)
    metadata.rs     # Audio metadata extraction (replaces fs_track.rs)
    models.rs       # ScanResult, ScanProgress types
    scan.rs         # Main scan implementation
  db.rs             # Extended with scan-related transaction functions

Database Schema (Migration v8)

-- Core columns for incremental scanning
ALTER TABLE tracks ADD COLUMN file_size INTEGER;
ALTER TABLE tracks ADD COLUMN modified_time INTEGER;
ALTER TABLE tracks ADD COLUMN content_hash TEXT;
ALTER TABLE tracks ADD COLUMN scan_status INTEGER DEFAULT 1;  -- 0=pending, 1=processed

-- Indexes for fast lookups
CREATE INDEX idx_tracks_file_path ON tracks(file_path);
CREATE INDEX idx_tracks_content_hash ON tracks(content_hash);
CREATE INDEX idx_tracks_fingerprint ON tracks(modified_time, file_size);
CREATE INDEX idx_tracks_scan_status ON tracks(scan_status);

Detection Methods

Two detection strategies are implemented:

Hash (Default): Computes an xxHash3 (xxh3) hash over 64 KB of file content
- Pros: Detects all moves, works across all filesystems
- Cons: Slower (must read file content)
Metadata: Uses file mtime + size only
- Pros: Very fast (just a stat() call)
- Cons: May create duplicates if a file is moved in a way that changes its mtime or size
- Best for: Large static libraries on a single filesystem
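The two strategies can be illustrated with a minimal, std-only sketch. The function names `fingerprint` and `content_hash` are hypothetical, the choice to hash the first 64 KiB of the file is an assumption (the plan only states the size), and `DefaultHasher` stands in for xxh3 so the example needs no external crate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::{self, File};
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::Path;
use std::time::UNIX_EPOCH;

/// Number of bytes hashed per file (assumption: the first 64 KiB).
const HASH_WINDOW: u64 = 64 * 1024;

/// Metadata mode: (mtime in seconds, size in bytes) from one metadata lookup.
pub fn fingerprint(path: &Path) -> io::Result<(i64, u64)> {
    let meta = fs::metadata(path)?;
    let mtime = meta
        .modified()?
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() as i64)
        .unwrap_or(0); // pre-1970 timestamps collapse to 0 in this sketch
    Ok((mtime, meta.len()))
}

/// Hash mode: hash at most HASH_WINDOW bytes of file content.
/// DefaultHasher is a std stand-in for xxh3 here, not the real algorithm.
pub fn content_hash(path: &Path) -> io::Result<String> {
    let mut buf = Vec::new();
    File::open(path)?.take(HASH_WINDOW).read_to_end(&mut buf)?;
    let mut hasher = DefaultHasher::new();
    hasher.write(&buf);
    Ok(format!("{:016x}", hasher.finish()))
}
```

Two files with identical content produce the same `content_hash` regardless of path, which is what makes move detection possible; `fingerprint` never opens the file, which is why Metadata mode is cheaper.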

Scan Algorithm

pub fn scan_library(
    directories: &[String],
    conn: &mut Connection,
    progress_callback: &dyn Fn(ScanProgress),
    detection_method: DetectionMethod,  // Hash (default) or Metadata
) -> Result<ScanResult>

Process:

1. Mark all existing tracks as "pending"
2. Single-pass streaming: Discover and process files simultaneously
   - Processes files in batches of 100
   - Emits progress after each batch showing "Processing files: X/Y"
3. For each file:
   - Hash mode: Check by content hash → unchanged/moved/new
   - Metadata mode: Check by mtime+size → unchanged/moved/new (no hash fallback)
4. Delete tracks still marked "pending" (files no longer on disk)
5. Clean up orphaned albums/artists
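The mark-pending, classify, and delete-pending cycle above can be sketched with an in-memory map in place of the tracks table. All type and method names here (`Library`, `Track`, `Outcome`, `process_file`, and so on) are hypothetical and only model Hash mode; the real implementation runs these steps as database transactions.

```rust
use std::collections::HashMap;

/// Mirrors the scan_status column: 0 = pending, 1 = processed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScanStatus {
    Pending = 0,
    Processed = 1,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Unchanged,
    Moved,
    New,
}

#[derive(Debug)]
pub struct Track {
    pub path: String,
    pub status: ScanStatus,
}

/// Hypothetical in-memory stand-in for the tracks table, keyed by content hash.
#[derive(Default)]
pub struct Library {
    pub by_hash: HashMap<String, Track>,
}

impl Library {
    /// Mark every known track as pending before the scan starts.
    pub fn mark_all_pending(&mut self) {
        for track in self.by_hash.values_mut() {
            track.status = ScanStatus::Pending;
        }
    }

    /// Classify one discovered file by content hash and mark it processed.
    pub fn process_file(&mut self, path: &str, hash: &str) -> Outcome {
        match self.by_hash.get_mut(hash) {
            Some(track) => {
                track.status = ScanStatus::Processed;
                if track.path == path {
                    Outcome::Unchanged
                } else {
                    track.path = path.to_string(); // same content, new location
                    Outcome::Moved
                }
            }
            None => {
                let track = Track { path: path.to_string(), status: ScanStatus::Processed };
                self.by_hash.insert(hash.to_string(), track);
                Outcome::New
            }
        }
    }

    /// Delete tracks still pending after the scan; returns how many were removed.
    pub fn delete_pending(&mut self) -> usize {
        let before = self.by_hash.len();
        self.by_hash.retain(|_, t| t.status == ScanStatus::Processed);
        before - self.by_hash.len()
    }
}
```

Keying the map by hash keeps the sketch short, but it means two distinct files with identical content would collapse into one entry; a real table lookup would have to handle that case.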

Key Features

Move detection: Files moved/renamed are detected via hash matching
Batch processing: 100 files per transaction batch
Single-pass streaming: Discovers and processes files simultaneously (no double traversal)
Real-time progress: Shows processed files count after each batch
Progress events: Emits scan-progress and scan-complete events
Configurable detection: Frontend can choose Hash or Metadata mode
Resumable: Interrupted scans resume naturally on next run

Performance Improvements

Problem: The old approach performed two full directory traversals:

First pass: Count all files (for progress percentage)
Second pass: Process files (hash, DB operations)

For 100K files on HDDs, this added 60-90 seconds of overhead.

Solution: Single-pass streaming scan

Processes files in batches of 100
Emits progress after each batch showing processed count
Saves 30-90 seconds on HDDs, 5-10 seconds on SSDs
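The single-pass, batched flow can be sketched as a function that consumes a lazy iterator of discovered files, so no separate counting pass is needed. `scan_in_batches` and its callback parameters are hypothetical names for illustration; only the batch size of 100 comes from the plan.

```rust
/// Batch size taken from the plan.
pub const BATCH_SIZE: usize = 100;

/// Consumes `files` lazily, handing each full (or final partial) batch to
/// `process_batch` and then reporting the running count via `on_progress`.
/// Returns the total number of files seen.
pub fn scan_in_batches<I, T>(
    files: I,
    mut process_batch: impl FnMut(&[T]),
    mut on_progress: impl FnMut(usize),
) -> usize
where
    I: IntoIterator<Item = T>,
{
    let mut batch: Vec<T> = Vec::with_capacity(BATCH_SIZE);
    let mut processed = 0;
    for file in files {
        batch.push(file);
        if batch.len() == BATCH_SIZE {
            process_batch(&batch);
            processed += batch.len();
            on_progress(processed);
            batch.clear(); // reuse the allocation for the next batch
        }
    }
    // Flush the final partial batch, if any.
    if !batch.is_empty() {
        process_batch(&batch);
        processed += batch.len();
        on_progress(processed);
    }
    processed
}
```

Because only one batch is held at a time, memory stays bounded by the batch size rather than by the number of files in the library.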

Database Functions Added (db.rs)

All transaction-based functions use the _tx suffix and take the transaction (tx) as their last argument:

// Scan operations
pub fn mark_all_tracks_pending(conn: &mut Connection) -> Result<()>
pub fn find_track_by_fingerprint_tx(mtime, size, tx) -> Result<Option<ScanTrackInfo>>
pub fn find_track_by_hash_tx(hash, tx) -> Result<Option<ScanTrackInfo>>
pub fn mark_track_processed_tx(track_id, tx) -> Result<()>
pub fn update_track_path_tx(track_id, new_path, tx) -> Result<()>
pub fn update_track_path_and_fingerprint_tx(track_id, path, size, mtime, hash, tx) -> Result<()>
pub fn insert_track_from_metadata_tx(metadata, lyrics, size, mtime, hash, artist_id, album_id, tx) -> Result<()>
pub fn delete_unprocessed_tracks(conn: &mut Connection) -> Result<usize>

// Transaction versions of existing functions
pub fn find_artist_tx(name, tx) -> Result<i64>
pub fn add_artist_tx(name, tx) -> Result<i64>
pub fn find_album_tx(name, artist, tx) -> Result<i64>
pub fn add_album_tx(name, artist, tx) -> Result<i64>

Frontend Integration

Command:

// Hash detection (default - accurate but slower)
const result = await invoke('scan_library_incremental', { useHashDetection: true });

// Metadata detection (fast but may create duplicates)
const result = await invoke('scan_library_incremental', { useHashDetection: false });

Progress Phases:

processing: Emitted after each batch is processed (shows "Processing files: X/Y")
updating: Emitted during database cleanup phase

Events:

scan-progress: Emitted during processing and updating phases
scan-complete: Emitted with ScanResult when finished

ScanResult type:

{
  totalFiles: number;
  added: number;
  modified: number;  // Always 0 in current implementation
  deleted: number;
  moved: number;
  unchanged: number;
  isInitialScan: boolean;
  durationMs: number;
}

Dependencies

[dependencies]
xxhash-rust = { version = "0.8", features = ["xxh3"] }

Notes

Old fs_track.rs deprecated but kept for backward compatibility
Module renamed from fs to scanner for clarity (avoids confusion with std::fs)
No file watching implemented (scan at startup/manual trigger only)
Hard deletes used (tracks deleted from DB when files removed)
Metadata extraction moved to scanner/metadata.rs with better error handling
All database queries moved to db.rs with transaction support
estimate_file_count() is deprecated - use single-pass scan_library() instead

Memory Usage

For 110,000 files:

Previous approach: ~200MB (loads all paths into memory)
Optimized: ~10MB (batch processing; batch size configurable if needed)

Frontend Migration Plan

Current Issues

The frontend currently performs inefficient full scans in multiple scenarios:

First launch: initialize_library called, which uses deprecated fs_track module
Refresh: refresh_library does uninitialize + initialize = double full scan
Directory change: After saving directories, the app shows Library without triggering a scan
Event format mismatch: Frontend expects initialize-progress, new backend emits scan-progress

Changes Required

1. Replace Library Initialization Flow

Current (src/components/Library.vue):

// Uses: initialize_library, refresh_library commands
// Listens to: initialize-progress event
// Shows: "filesScanned/filesCount files scanned"

New:

// Uses: scan_library command
// Listens to: scan-progress event
// Shows: "Processing files: X/Y" after each batch
// Listens to: scan-complete event for final results

2. Update Event Handling

Current event format:

{
  filesScanned: number;
  filesCount: number;
}

New event format (scan-progress):

{
  phase: 'processing' | 'updating';
  progress: number;  // 0.0 to 1.0
  filesProcessed: number;
  filesTotal: number;
  message: string;  // Human-readable status (e.g., "Processing files: 1234/5000")
}
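On the backend side, a payload with these fields could be built as in the sketch below. The struct layout and the `processing` constructor are assumptions for illustration; in a real Tauri command the struct would also need to be serializable with camelCase field names (for example via serde's `rename_all = "camelCase"`), which is omitted here to keep the example dependency-free.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Processing,
    Updating,
}

/// Hypothetical Rust counterpart of the scan-progress payload.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanProgress {
    pub phase: Phase,
    pub progress: f64, // 0.0 to 1.0
    pub files_processed: usize,
    pub files_total: usize,
    pub message: String,
}

impl ScanProgress {
    /// Builds a "processing" update after a batch has been handled.
    pub fn processing(files_processed: usize, files_total: usize) -> Self {
        // Guard against division by zero when the total is not known yet.
        let progress = if files_total == 0 {
            0.0
        } else {
            (files_processed as f64 / files_total as f64).min(1.0)
        };
        ScanProgress {
            phase: Phase::Processing,
            progress,
            files_processed,
            files_total,
            message: format!("Processing files: {}/{}", files_processed, files_total),
        }
    }
}
```

Deriving `message` and `progress` in one place keeps the human-readable text and the numeric fraction from drifting apart.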

New completion event (scan-complete):

{
  totalFiles: number;
  added: number;
  modified: number;
  deleted: number;
  moved: number;
  unchanged: number;
  isInitialScan: boolean;
  durationMs: number;
}

3. Progress Display Updates

Template changes (Library.vue):

<!-- Current -->
<div v-if="initializeProgress">
  {{ initializeProgress.filesScanned }}/{{ initializeProgress.filesCount }} files scanned
</div>

<!-- New -->
<div v-if="scanProgress">
  {{ scanProgress.message }}
  <div v-if="scanProgress.filesTotal">
    Progress: {{ Math.round(scanProgress.progress * 100) }}%
  </div>
</div>

4. Remove refresh_library Command Usage

Current:

const refreshLibrary = async () => {
  await invoke('refresh_library')  // Does uninit + init
}

New:

const refreshLibrary = async () => {
  await invoke('scan_library', { useHashDetection: true })
  // No need for uninitialize - scan_library handles incremental updates
}

5. Fix Directory Change Flow

Current (ChooseDirectory.vue):

Saves directories → emits progressStep → shows Library
Bug: Doesn't trigger scan after directory change

New:

Save directories
Emit progressStep
Trigger scan (parent App.vue should call scan when init is true but no tracks exist)

6. Clean Up Obsolete Code

Remove after migration:

initialize_library command usage
refresh_library command usage
initialize-progress event listeners
initializeProgress ref and related template code

Migration Order

Phase 1: Update event listeners and progress display to use new format
Phase 2: Replace initialize_library with scan_library in Library.vue
Phase 3: Replace refresh_library with scan_library
Phase 4: Fix directory change flow to trigger scan
Phase 5: Remove old event/command references
Phase 6: Test all scenarios (first launch, refresh, directory change)

Future Considerations

Could add "deep scan" option to re-hash all files for integrity check
Could cache parsed metadata to avoid re-reading files
Could add parallel processing within batches
Could add "smart refresh" that only scans changed directories
