Perl's numeric == and != operators coerce both operands to NV (double-precision float).
BIGINT DiskSpace columns can exceed 2^53 (~9 PB), beyond which not every
integer is representable and distinct values can collapse to the same double. The "already correct, skip"
checks for both the Event_Summaries CAS loop and the Storage CAS loop
were comparing scalars with numeric operators, so two distinct
DiskSpace values above 2^53 could be treated as equal and the resync
skipped — and skipped again on every subsequent pass because the
collapse is deterministic, so drift would persist indefinitely.
Switch both checks to string equality. DBI binds these scalars as
strings anyway, so this matches what the database will compare against
when the WHERE clause runs.
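A small Python illustration of the collapse. It models the numeric path as a conversion to a C double, which is how the text above describes Perl's ==. The helper names are hypothetical.

```python
def equal_as_double(a: str, b: str) -> bool:
    """Numeric comparison after converting both sides to a C double."""
    return float(a) == float(b)


def equal_as_string(a: str, b: str) -> bool:
    """Digit-for-digit comparison, as the fixed skip check does."""
    return a == b


# Two distinct BIGINT values straddling 2**53 = 9007199254740992.
low, high = str(2**53), str(2**53 + 1)
print(equal_as_double(low, high))  # True: both round to the same double
print(equal_as_string(low, high))  # False: the strings differ
```

Because the rounding is deterministic, the numeric check gives the same wrong answer on every pass, which is why the drift never self-corrects.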
zmaudit's per-monitor UPDATE writes absolute snapshot values across all
12 counter columns. Without a CAS guard, any change a concurrent writer
makes to Event_Summaries (ES) between our snapshot and our UPDATE gets
clobbered. The TotalEvents /
ArchivedEvents columns are particularly exposed because zmstats doesn't
maintain them (it only touches Hour/Day/Week/Month), so any drift
introduced by zmaudit racing event_delete_trigger or the zmc insert
path persists until the next zmaudit cycle.
Read the current ES row alongside the aggregates, skip monitors whose
snapshot matches the current row (no X-lock for no-op writes), and
guard each UPDATE with null-safe equality (MariaDB <=>) on every
column we're writing. Track CAS-deferred and skipped counts separately
from failures in the audit log.
zmstats does not need the same treatment: its UPDATE runs inside a TX
that already holds the bucket X-locks that gate the trigger writers
updating its column set.
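A sketch of the skip-or-guard decision for one monitor, in Python for illustration. It only builds the SQL; `plan_cas_update`, `same`, and the column list are hypothetical, and `<=>` is MariaDB's null-safe equality as described above. The SET values come from the snapshot, the guard values from the current ES row read alongside it; an UPDATE that then affects zero rows would be counted as CAS-deferred rather than failed.

```python
from typing import Any, List, Mapping, Optional, Sequence, Tuple


def same(a: Any, b: Any) -> bool:
    """Null-safe, string-based equality: None only matches None."""
    if a is None or b is None:
        return a is None and b is None
    return str(a) == str(b)


def plan_cas_update(
    monitor_id: int,
    columns: Sequence[str],
    snapshot: Mapping[str, Any],
    current: Mapping[str, Any],
) -> Optional[Tuple[str, List[Any]]]:
    """Return None for a no-op, else (sql, binds) for a guarded UPDATE."""
    if all(same(snapshot[c], current[c]) for c in columns):
        return None  # already correct: skip the UPDATE and its X-lock
    set_clause = ", ".join(f"{c}=?" for c in columns)
    # Guard every written column against the value we read as "current".
    guard = " AND ".join(f"{c} <=> ?" for c in columns)
    sql = f"UPDATE Event_Summaries SET {set_clause} WHERE MonitorId=? AND {guard}"
    binds = [snapshot[c] for c in columns] + [monitor_id] + [current[c] for c in columns]
    return sql, binds
```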
The DELETE WHERE EventId IN (?,?,...) is intentional: it locks each row
via the primary key, keeping the lock range minimal and preserving the
canonical lock order that this PR's deadlock fix relies on. But a single
IN-list with tens of thousands of placeholders (Events_Month after weeks
of accumulation) can hit max_allowed_packet and max_prepared_stmt_count.
Split the EventId list into 1000-row batches and loop. PK-based locking
is preserved; SQL/packet size stays bounded. Switching to a predicate-
based DELETE would re-introduce range locks on the bucket index and
undo the deadlock work.
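A minimal Python sketch of the batching, assuming `?` placeholders and the 1000-row batch size from the text; `batched_deletes` is a hypothetical helper. Each statement still names rows by primary key, so the locking behavior described above is unchanged, only the statement size is bounded.

```python
from typing import Iterator, List, Sequence, Tuple

BATCH_SIZE = 1000  # rows per DELETE statement


def batched_deletes(
    table: str, event_ids: Sequence[int], batch_size: int = BATCH_SIZE
) -> Iterator[Tuple[str, List[int]]]:
    """Yield (sql, binds) pairs, each deleting at most batch_size rows by PK."""
    for start in range(0, len(event_ids), batch_size):
        chunk = list(event_ids[start:start + batch_size])
        placeholders = ",".join("?" * len(chunk))  # "?,?,...,?"
        yield f"DELETE FROM {table} WHERE EventId IN ({placeholders})", chunk
```

The table name is interpolated into the SQL, so it should come from a fixed list of bucket tables, never from user input.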
Unlike Event_Summaries, Storage.DiskSpace has no trigger-based
incremental maintenance — Event::delete and the event-finalize paths do
their own +/- adjustments in application code. The previous absolute
snapshot+overwrite could undo a concurrent Event::delete adjustment
that committed between our Events SUM SELECT and our Storage UPDATE,
making accounting transiently wrong under normal load until the next
zmaudit pass.
Read the current Storage.DiskSpace alongside the aggregate, skip rows
that are already correct, and UPDATE with a null-safe equality guard
(MariaDB <=>) so a concurrent writer's newer value blocks the
overwrite. Track CAS-deferred rows separately from failures and
surface both counts in the audit log.
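A runnable sketch of the read, skip, and guarded-UPDATE sequence for one Storage row. It swaps MariaDB for Python's built-in sqlite3, whose `IS` operator is SQLite's null-safe equality and plays the role of `<=>` here; a single connection plus a hypothetical `before_update` hook stands in for a concurrent Event::delete. The function name and return labels are illustrative.

```python
import sqlite3
from typing import Callable, Optional


def resync_disk_space(
    conn: sqlite3.Connection,
    storage_id: int,
    computed: Optional[int],
    before_update: Optional[Callable[[], object]] = None,
) -> str:
    """Return 'skipped', 'updated', or 'deferred' for one Storage row."""
    row = conn.execute(
        "SELECT DiskSpace FROM Storage WHERE Id = ?", (storage_id,)
    ).fetchone()
    current = row[0] if row else None
    if (current is None and computed is None) or (
        current is not None and computed is not None and str(current) == str(computed)
    ):
        return "skipped"  # already correct: no write, no lock
    if before_update is not None:
        before_update()  # test hook standing in for a concurrent Event::delete
    cur = conn.execute(
        # SQLite's IS is its null-safe equality, playing the role of MariaDB's <=>.
        "UPDATE Storage SET DiskSpace = ? WHERE Id = ? AND DiskSpace IS ?",
        (computed, storage_id, current),
    )
    conn.commit()
    return "updated" if cur.rowcount == 1 else "deferred"
```

When the concurrent adjustment lands first, the guard no longer matches, the UPDATE touches zero rows, and the newer value survives until the next pass.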
SET TRANSACTION ISOLATION LEVEL applies to the very next transaction on
the connection. zmDbDo's success Debug INSERT INTO Logs is a real
statement on the same $dbh; with database debug logging enabled, that
INSERT becomes the "next transaction" and silently consumes the
isolation directive. The intended READ COMMITTED then never applies to
the prune/resync/delete TX that follows.
Call $dbh->do directly for SET TRANSACTION in both Event::delete and
zmstats.pl, bypassing zmDbDo's logging. SET TRANSACTION can't deadlock,
so zmDbDo's retry offered no benefit here anyway.
Same hazard as the failure-path Debug: ZoneMinder::Logger->logPrint
INSERTs into Logs using the same $dbh, so a success Debug fires an
extra write inside a TX that's trying to minimize lock interactions —
and any err/errstr change it provokes is visible to the caller.
The autocommit path keeps the success Debug (it's a separate TX, no
caller interaction).
Every row in the previous arrayref-of-arrayref carried the same single
bind value (the event Id), so the [$sql, $$event{Id}] wrapping and the
my ($sql, @bind) = @$stmt unpacking were doing no work. Iterate over the
SQL strings directly and pass $$event{Id} as the one bind value.
Without this, an enumerate failure (SELECT MonitorId FROM Event_Summaries)
left @skipped empty and the per-monitor UPDATE loop ran for whatever rows
the bucket aggregates returned — but monitors that exist only in
Event_Summaries (no current bucket rows) never got zeroed, while the
audit log claimed a full resync. Track enumerate success and report
partial resync when it fails.
Two issues with the previous implementation:
1. Aggregate SELECTs ran GROUP BY MonitorId across the full bucket
tables every zmstats cycle (default 60s). Events_Month grows for
weeks; this turned the stats daemon into a constant full-scan
workload on busy installs.
2. The per-monitor UPDATE loop X-locked every Event_Summaries row on
every cycle even when nothing changed, adding avoidable contention
with the trigger writers this rewrite is supposed to protect.
Capture MonitorIds as we SELECT bucket rows for pruning, then skip the
resync entirely if no rows were pruned. When rows were pruned, restrict
the aggregate SELECTs (WHERE MonitorId IN ...) and the per-monitor
UPDATEs to that touched set. zmaudit remains the periodic deep-resync
safety net for drift in untouched monitors.
Also capture errstr before rollback so the gave-up Error reports the
actual reason instead of an empty string on drivers that clear errstr
on rollback.
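The touched-set restriction described above can be sketched in Python as follows, assuming the prune SELECT yields (EventId, MonitorId) pairs; `plan_targeted_resync` is a hypothetical helper and only one bucket table is shown.

```python
from typing import Iterable, List, Optional, Tuple


def plan_targeted_resync(
    pruned_rows: Iterable[Tuple[int, int]], bucket: str = "Events_Hour"
) -> Optional[Tuple[str, List[int]]]:
    """pruned_rows holds (EventId, MonitorId) pairs captured before the DELETE."""
    touched = sorted({monitor_id for _event_id, monitor_id in pruned_rows})
    if not touched:
        return None  # nothing pruned: skip the aggregate SELECTs and UPDATEs
    placeholders = ",".join("?" * len(touched))
    sql = (
        f"SELECT MonitorId, COUNT(*), SUM(DiskSpace) FROM {bucket} "
        f"WHERE MonitorId IN ({placeholders}) GROUP BY MonitorId"
    )
    return sql, touched
```

The same `touched` list would then bound the per-monitor UPDATE loop, leaving untouched monitors to zmaudit.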
ZoneMinder::Logger->logPrint runs INSERT INTO Logs on the same dbh.
Calling Debug()/Error() from zmDbDo's failure path inside a caller-managed
transaction would execute another statement on the connection, clearing
the err/errstr state the caller needs to see for rollback/retry. The
result could be a caller observing err=0 after a deadlock-victim TX and
committing what looks like success but is actually a rolled-back no-op.
Bail silently from zmDbDo when AutoCommit is off; the caller owns the
retry loop and is responsible for logging. Logging in the autocommit
path is still safe because each statement is its own TX.
Previously zmaudit logged "Finished resyncing Event_Summaries" /
"Finished updating Storage DiskSpace" unconditionally as long as the
aggregate SELECTs succeeded, masking per-row UPDATE failures (e.g.
zmDbDo exhausting its deadlock retries) and skipped aggregate column
groups. Track which aggregates were skipped and which per-monitor /
per-storage UPDATEs failed (zmDbDo returns undef on failure), and
surface that in the audit log instead of claiming the resync is
complete.
The previous comment claimed each UPDATE couldn't hold any bucket lock
that would deadlock with the trigger path, which conflated statement-
level locks with TX-level locks. By the time we reach this loop the TX
already holds bucket-row X-locks from the earlier DELETEs plus any ES
X-locks acquired by the bucket DELETE triggers cascading. Rewrite the
comment to distinguish those TX-held locks from the locks acquired by
the new UPDATE statement and to be explicit that the TX's lock
acquisition direction is preserved.
zmDbDo suppresses its Error log on 1213 inside a caller-managed TX (the
caller owns the retry), and the previous fallthrough at the end of the
retry loop just `return`ed silently. After 5 failed attempts on persistent
contention the event was effectively un-deleted with no record of the
failure. Capture errstr before rollback (some drivers clear it) and emit
an Error on the bail path.
A concurrent trigger writer can adjust Event_Summaries between our
snapshot SELECTs and the per-monitor UPDATE; the UPDATE then overwrites
that adjustment with the older snapshot. Drift is bounded by the
zmstats/zmaudit interval and corrected on the next pass, because the
incremental triggers continue to maintain ES correctly between resyncs.
Locking ES before reading aggregates would invert the canonical lock
order and re-introduce the deadlock cycle the resync rewrite eliminated.
When zmDbDo is called inside a caller-managed transaction (AutoCommit off),
max_attempts is 1 and the loop falls through to Error on a 1213 deadlock —
which is misleading, because the caller (Event::delete, the zmstats
prune+resync TX) has its own outer retry loop that will roll back and
succeed. Downgrade to a Debug message in that path; Error is still emitted
for non-deadlock failures and for autocommit calls that exhaust their
retries.
The non-HTML branch of sendTheEmail() declared Content-Transfer-Encoding
as quoted-printable but passed the body to MIME::Lite unencoded. Mail
clients then QP-decoded literal '=NN' digit pairs in URLs, eating
characters from substitution tags like %EP%, %EPS%, %EPI%. For example
&eid=1947908 was decoded as &eid<0x19>47908 and rendered as eid47908.
Match the HTML branch's pattern by encoding the body via
MIME::QuotedPrint::encode_qp before attaching.
fixes #4822
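A Python reproduction of the failure and the fix, using the standard `quopri` module in place of MIME::QuotedPrint; the host name in the link is hypothetical.

```python
import quopri


def decode_unencoded(body: bytes) -> bytes:
    """What a client does when the header says QP but the body was sent raw."""
    return quopri.decodestring(body)


def decode_encoded(body: bytes) -> bytes:
    """The fixed path: encode as QP first, then let the client decode."""
    return quopri.decodestring(quopri.encodestring(body))


link = b"https://zm.example/index.php?view=event&eid=1947908"
print(decode_unencoded(link))  # "=19" is consumed as the byte 0x19
print(decode_encoded(link) == link)  # True: "=" travels as "=3D"
```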
Inline comments at every occurrence so future readers don't have to look up
the errno. No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
selectcol_arrayref returns undef on a DB error. The previous code only
checked truthiness of the result before deciding to DELETE, so a transient
SELECT failure would silently skip the prune for that bucket and let the
transaction continue to commit an incomplete state. Capture $dbh->err()
after the SELECT and bail out the same way the DELETE error path does.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, if any of the five aggregate SELECTs (Events, Events_Hour/Day/
Week/Month) failed transiently, the per-monitor UPDATE phase still ran and
wrote 0 (via `// 0`) to every column from those failed groups, destroying valid
counters across all monitors.
Track per-aggregate success and build the UPDATE SET clause from only the
column groups whose SELECT succeeded. zmaudit re-runs on its normal
interval, so a missed group is corrected on the next pass instead of being
overwritten with zeros now.
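A Python sketch of building the SET clause from only the successful groups. The mapping of each aggregate SELECT to its Event_Summaries columns is an assumption (only HourEvents, TotalEvents, and ArchivedEvents are named in this log), as is the `build_set_clause` helper.

```python
from typing import Dict, List, Mapping, Tuple

# Assumed mapping from each aggregate SELECT to the Event_Summaries columns it
# feeds; the exact column names are illustrative.
COLUMN_GROUPS: Dict[str, Tuple[str, ...]] = {
    "Events": ("TotalEvents", "TotalEventDiskSpace",
               "ArchivedEvents", "ArchivedEventDiskSpace"),
    "Events_Hour": ("HourEvents", "HourEventDiskSpace"),
    "Events_Day": ("DayEvents", "DayEventDiskSpace"),
    "Events_Week": ("WeekEvents", "WeekEventDiskSpace"),
    "Events_Month": ("MonthEvents", "MonthEventDiskSpace"),
}


def build_set_clause(succeeded: Mapping[str, bool]) -> Tuple[str, List[str]]:
    """Return the SET clause and column order for groups whose SELECT succeeded."""
    columns = [
        col
        for group, group_columns in COLUMN_GROUPS.items()
        if succeeded.get(group, False)
        for col in group_columns
    ]
    return ", ".join(f"{c}=?" for c in columns), columns
```

An empty column list means every aggregate failed, in which case the per-monitor UPDATE should be skipped entirely rather than issued with an empty SET.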
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous comment block claimed event_update_trigger fires BEFORE UPDATE
and that the lock order was buckets -> Event_Summaries -> Events. Neither
matches the code: triggers.sql defines event_update_trigger as AFTER UPDATE,
and InnoDB X-locks the matched Events row during WHERE evaluation before
either BEFORE or AFTER trigger bodies fire — so the canonical chain is
Events[Id] -> buckets[EventId] -> Event_Summaries[MonitorId]
which is what triggers.sql already documents. Comment-only change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous resync code in zmstats and zmaudit used multi-table
UPDATEs against Event_Summaries that joined the bucket tables:
    UPDATE Event_Summaries es
      LEFT JOIN (SELECT ... FROM Events_Hour ...) h ON ...
      LEFT JOIN (SELECT ... FROM Events_Day ...) d ON ...
      ... SET es.HourEvents = h.c, ...
zmaudit additionally used scalar correlated subqueries against Events
for the Total/Archived columns and against Events for Storage.DiskSpace.
MariaDB takes S-locks on the joined and sub-queried rows for the
duration of any multi-table UPDATE statement, regardless of isolation
level. event_update_trigger and event_delete_trigger hold X-locks on
those same bucket rows while they walk the trigger body, so the resync
deadlocks against active event lifecycle traffic. Captured a textbook
example in SHOW ENGINE INNODB STATUS:
    TX(1) zma:     HOLDS X Events_Hour[42229643]
                   WAITS X Event_Summaries[28]
    TX(2) zmstats: HOLDS X Event_Summaries[2,4,5,...,28,...,73]
                   WAITS S Events_Hour[42229643]
Replace the JOIN/subquery pattern in both scripts with a snapshot phase
followed by per-monitor UPDATEs:
1. SELECT MonitorId, COUNT(*), SUM(DiskSpace) FROM each bucket
(and the equivalent Total/Archived aggregate from Events).
Plain SELECTs do consistent reads and take no row locks.
2. SELECT MonitorId FROM Event_Summaries to widen the universe so
monitors with empty buckets still get zeroed out.
3. For each monitor, UPDATE Event_Summaries SET ... WHERE MonitorId=?.
Each UPDATE only X-locks one ES row and reads no other table.
zmaudit's five separate UPDATEs collapse to one snapshot phase plus
one UPDATE per monitor. Storage DiskSpace gets the same treatment.
zmstats keeps the same outer transaction (BEGIN ... COMMIT, RC
isolation, retry on 1213) so the bucket DELETEs and the resync stay
atomic, but the resync no longer reads the bucket tables under lock.
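The three-step snapshot resync can be sketched with Python's sqlite3, restricted to the Events_Hour bucket. SQLite stands in for MariaDB, so the locking remarks in the comments restate the InnoDB behavior claimed above rather than anything SQLite does; the HourEventDiskSpace column name and the `resync_hour_bucket` helper are assumptions.

```python
import sqlite3


def resync_hour_bucket(conn: sqlite3.Connection) -> int:
    """Resync Event_Summaries' hour columns from Events_Hour; return monitors visited."""
    # 1. Snapshot the aggregates with a plain SELECT (a consistent, lock-free
    #    read under InnoDB, per the description above).
    agg = {
        mid: (count, space)
        for mid, count, space in conn.execute(
            "SELECT MonitorId, COUNT(*), SUM(DiskSpace) FROM Events_Hour GROUP BY MonitorId"
        )
    }
    # 2. Widen to every monitor in Event_Summaries so empty buckets are zeroed.
    monitors = set(agg) | {
        mid for (mid,) in conn.execute("SELECT MonitorId FROM Event_Summaries")
    }
    # 3. One single-row UPDATE per monitor: no JOIN, no subquery.
    for mid in sorted(monitors):
        count, space = agg.get(mid, (0, 0))
        conn.execute(
            "UPDATE Event_Summaries SET HourEvents = ?, HourEventDiskSpace = ? "
            "WHERE MonitorId = ?",
            (count, space if space is not None else 0, mid),
        )
    conn.commit()
    return len(monitors)
```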
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues in the existing Stats/Event_Data/Frames/Events delete
sequence:
- On any zmDbDo error inside the transaction the code called
dbh->commit() instead of dbh->rollback(). The server-side
transaction was already rolled back when InnoDB picked us as the
deadlock victim, so the commit() was effectively against a fresh
auto-started TX, but the bug pattern leaked through as confusing
state and prevented any retry.
- There was no retry. errno 1213 is expected under contention with
zmstats and zma touching the same Event_Summaries[MonitorId] row,
and the loser is supposed to re-run.
- At REPEATABLE READ, two concurrent filter workers deleting events
with adjacent EventIds take next-key/gap locks on each other's
rows in the bucket tables.
Rewrite the delete block as a retry loop: SET TRANSACTION ISOLATION
LEVEL READ COMMITTED, begin_work, run the four DELETEs, commit on
success. On any error rollback (was: commit). On errno 1213 retry up
to 5 times with backoff. Skip both the isolation switch and the
rollback-then-retry when the caller is managing its own transaction
(in_transaction); Event::delete would be acting at the wrong scope.
Falls through to the storage DiskSpace adjustment only on commit, so
a deadlocked delete leaves the event for the next filter pass instead
of orphaning the row with stale storage accounting.
Note: do NOT pre-lock Event_Summaries[MonitorId] FOR UPDATE here, even
though the trigger touches it last. Pre-locking puts ES before
buckets[Id] in the lock acquisition order, which inverts against zma's
event_update_trigger path (Events[A] -> buckets[A] -> ES[N]) and
re-introduces the cycle the rest of this work is removing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deadlock detection (errno 1213) is part of normal InnoDB operation
under contention; the engine rolls back the loser and expects the
caller to re-run the statement on a fresh transaction. Most callers
into zmDbDo go through autocommit, where there's no caller-managed
transaction state for a retry to disturb.
When AutoCommit is on, retry the statement up to 5 times with
exponential backoff (~100ms -> ~1.6s, jittered). When AutoCommit is
off, the caller owns the transaction and a unilateral retry would
silently succeed against a TX that no longer reflects the work the
caller staged before this statement; preserve the existing behavior
of logging and returning undef so the caller can rebuild the TX
itself.
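A Python sketch of the backoff schedule described above: five retries whose nominal delays double from 100ms to 1.6s. The plus-or-minus 25% jitter and the `backoff_delays` helper are assumptions; the text only says the delays are jittered.

```python
import random
from typing import List, Optional

MAX_RETRIES = 5
BASE_DELAY = 0.1  # seconds; doubles on each retry


def backoff_delays(
    retries: int = MAX_RETRIES,
    base: float = BASE_DELAY,
    jitter: float = 0.25,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Nominal delays 0.1, 0.2, 0.4, 0.8, 1.6 s, each scaled by +/- jitter."""
    rng = rng or random.Random()
    return [base * (2 ** n) * rng.uniform(1 - jitter, 1 + jitter) for n in range(retries)]
```

Jitter keeps two deadlock victims that retry at the same moment from colliding again on every attempt.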
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zmDbDo built log messages by running s/\?/'%s'/g on the SQL and then passing
the result to sprintf with the bind values. Any literal % in the SQL
(LIKE '%foo%' patterns, or the disk-percent substitution used by
dynamic filters) was interpreted as a sprintf format spec, producing
garbage output or an uncaught sprintf error.
Replace the two-step approach with a single regex that substitutes
bind values directly, so literal % in the SQL is preserved verbatim.
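A Python sketch of single-pass substitution. Rendering None as NULL and leaving surplus placeholders untouched are choices made for this sketch, not documented zmDbDo behavior; `interpolate_for_log` is a hypothetical name.

```python
import re
from typing import Sequence

_MISSING = object()


def interpolate_for_log(sql: str, binds: Sequence[object]) -> str:
    """Replace each ? with the next bind value in one regex pass."""
    values = iter(binds)

    def substitute(match: "re.Match[str]") -> str:
        value = next(values, _MISSING)
        if value is _MISSING:
            return match.group(0)  # more ? than binds: leave the placeholder
        return "NULL" if value is None else f"'{value}'"

    # A function replacement is inserted literally, and the SQL never passes
    # through a format string, so any literal % survives untouched.
    return re.sub(r"\?", substitute, sql)
```

Like the Perl version, this is log formatting only; a `?` inside a quoted SQL literal would also be substituted.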
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Follow-up to 419846c87 (GHSA-g66m-77fq-79v9). The Device path check was
applied to all monitor Types in three places, but the Device column is
only passed to a shell for Type='Local'. Non-Local monitors (Ffmpeg,
Remote, Libvlc, cURL, VNC) may legitimately hold legacy values such as
an RTSP URL in that column and should not be rejected or warned about.
- scripts/ZoneMinder/lib/ZoneMinder/Monitor.pm: control() dropped the
spurious Warning for non-Local monitors that was flooding zmwatch
logs. The Error/early-return path is preserved for Local.
- web/includes/actions/monitor.php: save action only runs
validDevicePath() when Type=='Local'.
- web/api/app/Model/Monitor.php: replaced the unconditional regex rule
with a validDevicePath() method that checks Type before enforcing
the /dev/ pattern.
Also add client-side validation matching the server rule, so Local
monitors get immediate feedback instead of a round-trip error:
- web/skins/classic/views/monitor.php: HTML5 pattern attribute on the
Device input. Escaped for the v-flag regex engine used by pattern=.
- web/skins/classic/views/js/monitor.js.php: validateForm() now also
rejects Device values that don't match the /dev/ pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug 376 (2006) closed all FDs starting from 0 when daemonizing. This
caused FD reuse problems: libx264 writing to a reused stderr FD led to
memory corruption (fixed in child spawn by 66f11435b, but not in the
parent's run()). It also meant children inherited closed FDs 0-2, so
any Perl die/warn output was silently lost, making daemon crashes
impossible to diagnose.
Redirect 0-2 to /dev/null (standard daemon practice) and close only
FDs 3+ for inherited sockets/DB connections. Children now inherit valid
FDs that won't crash or corrupt on write.
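The same daemon pattern, sketched in Python with `os.dup2` and `os.closerange`. The helper name and the 1024 fallback when the open-file limit is unknown are assumptions.

```python
import os
from typing import Optional


def redirect_std_fds_and_close_rest(max_fd: Optional[int] = None) -> None:
    """Point FDs 0-2 at /dev/null and close every inherited FD from 3 up."""
    if max_fd is None:
        try:
            max_fd = os.sysconf("SC_OPEN_MAX")
        except (ValueError, OSError):
            max_fd = -1
        if max_fd < 0:
            max_fd = 1024  # assumed fallback when the limit is unknown
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        if fd != devnull:
            os.dup2(devnull, fd)  # a valid FD, so stray writes go nowhere
    if devnull > 2:
        os.close(devnull)
    os.closerange(3, max_fd)  # inherited sockets, DB handles, etc.
```

If FD 0 was already closed, `os.open` returns 0 and the loop skips duplicating it onto itself, so all three standard FDs still end up on /dev/null.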
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Warn() is a natural shorthand that's easy to reach for. Its absence
caused a silent crash in zmfilter when Warn() resolved to Perl's
built-in warn(), which wrote to a closed stderr under zmdc and killed
the process.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each filter now has its own ExecuteInterval column, making the global
ZM_FILTER_EXECUTE_INTERVAL unused. The DB row will be cleaned up
automatically by zmupdate.pl on the next upgrade.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sleep delay was overwritten by each filter in the loop, so only the
last filter's timing controlled the sleep. Now tracks the minimum delay
across all filters so no filter oversleeps its ExecuteInterval.
Also removes the unused ZM_FILTER_EXECUTE_INTERVAL reference since each
filter has its own ExecuteInterval, and fixes the overdue warning to
check the unclamped filter_delay and include the filter name.
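A Python sketch of the corrected scheduling loop. `FilterTiming`, its fields, and the 60-second upper bound on the sleep are hypothetical stand-ins for the per-filter ExecuteInterval bookkeeping.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class FilterTiming:
    name: str
    execute_interval: float  # seconds between runs
    last_run: float          # epoch seconds of the previous run


def plan_sleep(
    filters: Sequence[FilterTiming], now: float, max_sleep: float = 60.0
) -> Tuple[float, List[str]]:
    """Sleep until the soonest-due filter; warn about any that are overdue."""
    delay = max_sleep
    warnings: List[str] = []
    for f in filters:
        filter_delay = f.last_run + f.execute_interval - now  # unclamped
        if filter_delay < 0:
            warnings.append(f"Filter {f.name} is overdue by {-filter_delay:.0f}s")
        delay = min(delay, max(filter_delay, 0.0))  # minimum, not last-wins
    return delay, warnings
```

The overdue check reads `filter_delay` before it is clamped to zero; checking the clamped value could never observe a negative delay.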
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SQL changes every cycle due to zmDiskPercent/zmDiskBlocks/zmSystemLoad
string substitution with live values, so prepare_cached never actually
returns a cached handle. Instead it leaks one cache entry per distinct
substituted value over the process lifetime.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PostSQLConditions, HasDiskPercent, HasDiskBlocks, and HasSystemLoad were
set during Sql() term processing but never cleared between rebuilds.
Each Execute() cycle pushed duplicate ExistsInFileSystem terms onto the
PostSQLConditions array, causing it to grow unboundedly over the process
lifetime.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reolink cameras may return a 302 redirect from HTTP to HTTPS.
LWP::UserAgent does not follow redirects for POST requests, so the
login would fail. Detect the redirect, update protocol/host/port from
the Location header, disable SSL verification for self-signed camera
certs, and retry the POST to the HTTPS endpoint. Subsequent API calls
use the updated protocol.
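A Python sketch of deriving the new protocol, host, and port from the redirect's Location header with `urllib.parse.urlsplit`; `endpoint_from_location` and the default-port fallback are illustrative, not the module's actual code.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def endpoint_from_location(location: str) -> Tuple[str, Optional[str], int]:
    """Extract (protocol, host, port) from a redirect's Location header."""
    parts = urlsplit(location)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.scheme, parts.hostname, port
```

A relative Location value would first need resolving against the original request URL, for example with `urllib.parse.urljoin`.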
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>