Add a ZM_DB_SSL_VERIFY_SERVER_CERT setting so a database connection that uses
ZM_DB_SSL_CA_CERT can talk to a server with a self-signed or otherwise
non-matching certificate. When enabled, verification is by identity (the cert
must chain to the CA and its CN/SAN must match ZM_DB_HOST), consistent across
the C++ daemons, the PHP web interface, the CakePHP API and the Perl scripts.
This re-does the reverted #3817. That PR broke the build because it called
mysql_options(MYSQL_OPT_SSL_VERIFY_SERVER_CERT, ...), and that enum was removed
from the MySQL 8.0 C client in favour of MYSQL_OPT_SSL_MODE; it also passed a
c_str() where a my_bool* was expected, and referenced the PHP constant
unconditionally (fatal on PHP 8 for an upgraded install whose zm.conf predates
the option).
The option that controls server-cert verification differs by client library and
the symbols are enum values, not macros, so CMake feature-detects them by
compiling:
- HAVE_MYSQL_OPT_SSL_MODE (MySQL 5.7.11+/8.0, MariaDB Connector/C 3.1+)
- HAVE_MYSQL_OPT_SSL_VERIFY_SERVER_CERT (older MariaDB/MySQL)
zm_db.cpp uses SSL_MODE_VERIFY_IDENTITY / SSL_MODE_REQUIRED when the former is
available, else falls back to the latter with a proper my_bool.
Value handling is three-way in every layer: a truthy value verifies, a falsy
value (0/false/no/off) skips verification, and an empty/unset value leaves the
client default in place so existing installs are unchanged on upgrade. PHP, the
API datasource (via PDO flags) and the Perl DSN are all guarded with defined()
checks. Fresh installs default to 1.
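As an illustration only, the three-way rule above could be sketched in Python like this. The function name is hypothetical and this is not the ZoneMinder code, which applies the rule separately in C++, PHP and Perl; treating every other non-empty value as truthy is an assumption.

```python
from typing import Optional

# False-y spellings listed in the commit message above.
FALSY = {"0", "false", "no", "off"}


def parse_verify_setting(raw: Optional[str]) -> Optional[bool]:
    """Map a raw config value to True (verify), False (skip) or None.

    None means "leave the client library default in place", which is what
    keeps an upgraded install whose zm.conf lacks the option unchanged.
    """
    if raw is None:
        return None
    value = raw.strip().lower()
    if value == "":
        return None
    if value in FALSY:
        return False
    # Assumption: any other non-empty value counts as truthy.
    return True
```

A caller would only touch the client's verification option when the result is not None.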
Documents the full ZM_DB_* connection and SSL settings, including the hostname
verification gotcha when connecting by IP, in docs/userguide/configfiles.rst.
refs #3816
open() is contracted to return true or false so callers (zmcontrol.pl,
zmwatch.pl) can tell whether the camera is reachable, but it always ended
with an assignment that evaluated truthy and reported success regardless.
When the initial login probe failed it also rebuilt BASE_URL in the old
basic-auth style and returned without ever testing that connection.
Return 0 on a failed probe and 1 only after a successful exchange. Check
is_success() on the authcode response in the modern path, and actually issue
and check a request on the old-style URL in the fallback path so success
means we can talk to the camera. Log the previously unused ResCode.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two spots wrote temp files outside ZoneMinder's controlled temp tree:
- zmvideo.pl concat mode built its ffmpeg concat list at a predictable
  path, /tmp/<concat_name>.concat.lst, in world-writable /tmp. A
  predictable name there is open to a symlink/race and leaks monitor and
  event names. Create it instead with File::Temp (randomized name, atomic
  O_EXCL) inside ZM_TMPDIR. The list entries are absolute paths, so its
  location does not affect ffmpeg's resolution.
- web/ajax/training.php created its detection scratch image with
  tempnam(sys_get_temp_dir(), ...), escaping ZM's temp tree and its
  cleanup. Use tempnam(ZM_DIR_TEMP, ...) so it stays under the configured
  temp dir.
Both now resolve to the per-distro temp dir (e.g. /var/lib/zoneminder/temp
on RedHat, /var/tmp/zm on Debian), keeping scratch files inside the tree
that packaging and systemd hardening already cover.
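The same idea (randomized name, exclusive create, inside a configured directory) can be sketched in Python with tempfile.mkstemp. This is an analogue of the File::Temp change, not the Perl code itself; the function name and the file prefix are hypothetical.

```python
import os
import tempfile
from typing import List


def create_concat_list(tmp_dir: str, entries: List[str]) -> str:
    """Write an ffmpeg concat list to a randomly named file in tmp_dir.

    mkstemp creates the file with O_EXCL and owner-only permissions, so a
    pre-planted symlink at a guessed path cannot be followed.
    """
    fd, path = tempfile.mkstemp(prefix="concat_", suffix=".lst", dir=tmp_dir)
    with os.fdopen(fd, "w") as handle:
        for entry in entries:
            # Entries are absolute paths, so the list location is irrelevant.
            handle.write(f"file '{entry}'\n")
    return path
```

The caller is responsible for removing the file once ffmpeg has finished with it.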
refs #2915
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The ZoneMinder Perl modules carried a GPL-2-or-later notice in their file
header but ended with an h2xs-scaffolding POD footer stating the code could
be used "under the same terms as Perl itself" (Artistic-1.0-or-GPL-1+). The
two statements contradict each other.
Every affected file has the same copyright holder in both the header and the
footer, and was contributed to ZoneMinder under the GPL-2+ header; the
Perl-terms line is inherited boilerplate that was never the intended license.
The project as a whole is GPL-2+ (see COPYING).
Replace the contradictory POD paragraph with a one-line pointer to the GPL so
each file states a single, consistent license without duplicating the full
notice already present in the header. Copyright and author attribution lines
are left unchanged.
Affects 60 .pm and .pm.in files under scripts/ZoneMinder/lib/ZoneMinder.
fixes #817
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
recover_timestamps() left Length null for events consisting of only an
mp4 plus a snapshot.jpg, because the mp4 branch parsed ffprobe's stderr
Duration: line and produced a bogus value when the regex did not match.
Length is decimal(10,2) NOT NULL, so the recovered row failed to insert.
Add mp4_duration() which asks ffprobe for the machine-readable
format=duration, falling back to the human-readable Duration: line, and
always set Length via sprintf('%.2f', ...) defaulting to 0 on failure.
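A rough Python sketch of the parsing order described above (machine-readable value first, Duration: line second, 0 last). It works on already-captured ffprobe output strings rather than invoking ffprobe, and the function name is hypothetical.

```python
import math
import re

# Matches the human-readable "Duration: HH:MM:SS.ss" line.
_DURATION_LINE = re.compile(r"Duration:\s*(\d+):(\d{2}):(\d{2}(?:\.\d+)?)")


def duration_to_length(machine_output: str, human_output: str = "") -> str:
    """Return a duration formatted for a decimal(10,2) NOT NULL column."""
    seconds = None
    try:
        # Preferred source: the bare number from format=duration.
        seconds = float(machine_output.strip())
    except ValueError:
        match = _DURATION_LINE.search(human_output)
        if match:
            hours, minutes, secs = match.groups()
            seconds = int(hours) * 3600 + int(minutes) * 60 + float(secs)
    if seconds is None or not math.isfinite(seconds) or seconds < 0:
        seconds = 0.0  # never hand NULL or garbage to the INSERT
    return "%.2f" % seconds
```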
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
zmcontrol calls open() once at startup, then issues light commands much later.
By then the camera has dropped the kept-alive HTTP connection and rejects the
stale digest token with 401. The write path (PutXML) already retried, but the
GET path did not, so lightStatus returned undef (toggle button never updated)
and lightOn failed silently.
Add GetWithRetry (rebuild UA + re-auth once on 401, same workaround as
PutCmd/PutXML) and use it in supplementLightDoc/supplementLightModes. Log a
failed supplementLight GET instead of returning undef silently. Verified live:
the Light toggle now drives the white light and reflects state.
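The retry-once-on-401 pattern can be sketched in Python as below. The callables stand in for the module's HTTP GET and its rebuild-UA/re-auth step; both names are hypothetical and no real HTTP client is involved.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status, body)


def get_with_retry(do_get: Callable[[], Response],
                   reauth: Callable[[], None]) -> Response:
    """Issue a GET; on 401, re-authenticate once and try again."""
    status, body = do_get()
    if status == 401:
        reauth()                 # rebuild the client and log in again
        status, body = do_get()  # single retry; a second 401 is returned
    return status, body
```

Returning the final status rather than None lets the caller log a failed GET instead of silently dropping it.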
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add lightOn/lightOff/lightStatus to the HikVision control module, driving the
camera's ISAPI supplement light (ISAPI/Image/channels/<n>/supplementLight).
"On" selects the white light where the model has one (colorVuWhiteLight), else
IR; "off" restores the camera's smart/auto default (eventIntelligence) so night
IR keeps working. The methods GET the current document and rewrite only
<supplementLightMode>, preserving the std-cgi namespace and the sibling
brightness fields the firmware rejects PUTs without. Mode selection is driven by
the model's advertised supplementLight/capabilities. lightStatus returns
{ WhiteLight => 'On'|'Off'|undef } in the shape the existing web toggle consumes,
so no web changes are needed (the CanLight UI is already generic).
Add a model-specific Controls row for the LTS CMIP1342WE-28MDA (a fixed ColorVu
camera: white light, reboot, no PTZ/focus/iris), in zm_create.sql.in and
migration zm_update-1.39.16.sql. Bump version to 1.39.16.
The pure mode-selection/XML helpers are covered by t/hikvision_light.t. Verified
live against a CMIP1342WE-28MDA: the GET-modify-PUT round-trip turns the white
light on and restores the prior mode.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
These cameras enforce an absolute session lifetime (~30 min) that keepAlive
cannot extend, so a periodic keepAlive failure followed by re-authentication is
normal, not an error. Log the (re-)login at Debug instead of Info so it no
longer spams the logs every ~30 min, and raise the requested keepAlive timeout
from 20s to 60s so it comfortably exceeds the daemon's 30s ping interval.
refs #4875
The credential-handling rewrite of the address captured only the host
([^:/]+) and rebuilt the address without the port or path, and the
http/https branches below then forced the port to 80/443 regardless of
what the operator configured. Any ControlAddress with credentials and a
non-standard port resolved to port 80.
Capture the full remainder after the @ so port and path survive, and
take the port from URI->port() which returns the explicit port or the
scheme default.
The pre-escape applied to legacy non-url-encoded passwords also escaped
% itself, so a url-encoded password (e.g. %40 for @) round-tripped
still encoded instead of decoding, and ua->credentials() then received
the wrong password for basic auth. Leave % unescaped so encoded
passwords decode while legacy raw passwords still pass through
unchanged.
Verified with parse_ControlAddress over: plain credentials with default
and non-standard ports, url-encoded @ in password, legacy raw passwords
containing space and literal %, full URL with path, credential-less
host:port, and https default port.
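For comparison, the intended behaviour (keep port and path, decode url-encoded credentials, fall back to the scheme default port) looks like this with Python's urllib. It is an analogue, not a port of the Perl parser; the function name is hypothetical and assuming http:// when no scheme is given is this sketch's choice.

```python
from urllib.parse import unquote, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_control_address(address: str) -> dict:
    """Split a control address into scheme, host, port, path, credentials."""
    if "://" not in address:
        address = "http://" + address  # assumption: bare addresses are http
    parts = urlsplit(address)
    # Explicit port wins; otherwise use the scheme default.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    username = unquote(parts.username) if parts.username is not None else None
    password = unquote(parts.password) if parts.password is not None else None
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
        "username": username,
        "password": password,
    }
```

unquote leaves a literal % that is not followed by two hex digits untouched, which is why a legacy raw password such as 100% still passes through.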
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a CanIndicatorLight capability and status-aware Indicator toggle button.
The indicator LED on the ASH21-B, ADC2W and ASH42-B is controlled via the
LightGlobal config (configManager get/setConfig); add
indicatorLightOn/Off/Status to Dahua_RPC and a model-specific
'Amcrest ASH42-B RPC' Controls row, with the capability also enabled on the
ASH21-B/ADC2W/generic rows. Migration zm_update-1.39.13.sql adds the column.
Add Dahua_RPC keepAlive (global.keepAlive) wired into a 30s zmcontrol idle
tick, plus session-expiry re-login retry in set_config and the status
queries, so the long-running control daemon does not silently fail after the
~60s session timeout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a CanLight control capability rendering a single status-aware Light
toggle button. The ADC2W white light is driven via CoaxialControlIO.control
(Type 1, numeric IO); the button queries live state and reflects it (amber
when on).
To get device state to the browser, add an opt-in two-way response path to
the control protocol: zmcontrol writes a JSON result back only when a request
sets wants_response (fire-and-forget commands unchanged, SIGPIPE-safe);
Monitor::sendControlCommandWithResponse and ajax/control.php return it.
Also adds get_config/set_config/probe to Dahua_RPC for characterising
cameras, the CanLight column (migration zm_update-1.39.12.sql), edit-UI
checkbox, and a config unit test.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a control protocol module for cameras built on the HiSilicon
Hi3510 SoC which expose cgi-bin/hi3510/ptzctrl.cgi rather than the
older Foscam decoder_control.cgi interface. Contributed by Turgut
Kalfaoglu on the forums, tested on a Tenvis TH661.
Supports continuous pan/tilt with auto-stop, emulated diagonals,
presets 1-8, and horizontal/vertical patrol via presets 9/10.
Credential and host parsing uses the Control.pm guess_credentials
helper; the camera-tested wire format (usr/pwd query parameters)
is preserved.
Adds the Controls table row to zm_create.sql.in and an idempotent
zm_update-1.39.15.sql migration, and bumps version to 1.39.15.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add ZoneMinder::Control::Dahua_RPC, a PTZ control module driving Amcrest
Smart Home (ASH21/ASH42/ADC2W) and Dahua cameras over the JSON-RPC /RPC2
interface, since their cgi-bin API is disabled and ONVIF exposes no PTZ
service. Two-stage MD5-challenge login with session reuse and self-healing
re-login; continuous pan/tilt + diagonals, stop, presets, zoom, focus,
reboot.
Adds a generic 'Dahua/Amcrest RPC' Controls row plus model-specific rows for
the ASH21-B (pan/tilt only) and ADC2W (reboot only), the ASH21-B/ADC2W
models, migration zm_update-1.39.11.sql, and a login-hash unit test.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
zmdc wrote ZM_PID only once at startup and removed it only at shutdown, so
if the pid file was removed out-of-band while the daemon kept running (a
partial/aborted stop that orphaned it, or /run tmpfiles cleanup), systemd
(Type=forking, PIDFile=) lost its handle: the service showed dead/failed,
stop no-op'd the orphaned tree, and start failed because ZM was already up.
Add ensure_pid(), called each iteration of the main server loop, which
rewrites ZM_PID only when it is missing or its contents do not match our
pid. Shutdown-safe: the shutdown unlink runs only after the loop exits.
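A minimal Python sketch of the ensure_pid() check described above: rewrite the pid file only when it is missing or holds a different pid. The real function lives in zmdc (Perl); the signature here is an assumption made for the example.

```python
def ensure_pid(pid_path: str, pid: int) -> bool:
    """Rewrite pid_path if missing or stale; return True when rewritten."""
    expected = str(pid)
    try:
        with open(pid_path) as handle:
            if handle.read().strip() == expected:
                return False  # file is present and correct: no write
    except FileNotFoundError:
        pass  # removed out-of-band, fall through and recreate it
    with open(pid_path, "w") as handle:
        handle.write(expected + "\n")
    return True
```

Calling this once per main-loop iteration costs one small read in the common case and no write.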
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
zmfilter and the web filter UI generated SQL like
    to_days(E.StartDateTime) = to_days('2026-05-06 09:42:56')
which prevents MySQL from using the StartDateTime index, forcing a
full table scan. With many filter daemons against a large Events
table this saturates mysqld and makes the system unresponsive.
Rewrite the SQL generation in ZoneMinder::Filter (Perl) and
ZM\FilterTerm (PHP) so Date/StartDate/EndDate attrs emit range
expressions against the underlying datetime column:
    E.StartDateTime >= '2026-05-06 00:00:00'
    AND E.StartDateTime < '2026-05-07 00:00:00'
Covers =, !=, >, >=, <, <=, IS, IS NOT, IN, NOT IN, and the
CURDATE()/NOW() values (which use INTERVAL 1 DAY for the upper
bound). EXPLAIN now reports type=range on Events_StartDateTime_idx
where it previously reported type=ALL.
CurrentDate (the constant left-hand expression to_days(NOW()))
keeps its existing form since it does not touch the indexed column.
Add Perl and PHP unit tests under tests/perl/ and tests/php/
exercising the generated SQL across operators.
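To make the operator mapping concrete, here is a Python sketch that turns a day-granularity comparison into a range on the datetime column. The mapping is derived from the semantics of to_days() comparisons and is not copied from the Perl/PHP code; the function name is hypothetical, it covers literal dates only, and the parentheses are this sketch's choice.

```python
from datetime import datetime, timedelta


def date_term(column: str, op: str, value: str) -> str:
    """Emit an index-friendly range expression for a date comparison."""
    # Parsing normalises the value, so only a YYYY-MM-DD date is interpolated.
    day = datetime.strptime(value[:10], "%Y-%m-%d").date()
    start = f"'{day} 00:00:00'"
    end = f"'{day + timedelta(days=1)} 00:00:00'"
    templates = {
        "=": f"({column} >= {start} AND {column} < {end})",
        "!=": f"({column} < {start} OR {column} >= {end})",
        "<": f"{column} < {start}",
        "<=": f"{column} < {end}",
        ">": f"{column} >= {end}",
        ">=": f"{column} >= {start}",
    }
    return templates[op]
```

Because the column appears bare on the left-hand side, the optimizer is free to use a range scan on its index.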
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Perl's != and == operators coerce both sides to NV (double-precision float).
BIGINT DiskSpace columns can exceed 2^53 (~9 PB), at which point distinct
integers collapse to the same double. The "already correct, skip"
checks for both the Event_Summaries CAS loop and the Storage CAS loop
were comparing scalars with numeric operators, so two distinct
DiskSpace values above 2^53 could be treated as equal and the resync
skipped — and skipped again on every subsequent pass because the
collapse is deterministic, so drift would persist indefinitely.
Switch both checks to string equality. DBI binds these scalars as
strings anyway, so this matches what the database will compare against
when the WHERE clause runs.
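The collapse is easy to reproduce with any IEEE-754 double. The Python sketch below compares two adjacent integers just above 2^53 as floats and as strings; it demonstrates the float behaviour the message describes and is not the Perl code.

```python
def float_equal(a: str, b: str) -> bool:
    """Compare two decimal strings after converting to double precision."""
    return float(a) == float(b)


def string_equal(a: str, b: str) -> bool:
    """Compare the same values as strings, with no numeric coercion."""
    return a == b


low = str(2 ** 53)        # 9007199254740992
high = str(2 ** 53 + 1)   # 9007199254740993, not representable as a double

collapsed = float_equal(low, high)   # both round to the same double
distinct = not string_equal(low, high)
```

Here collapsed and distinct are both true, which is exactly the case where a numeric "already correct" check would wrongly skip the resync.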
zmaudit's per-monitor UPDATE writes absolute snapshot values across all
12 counter columns. Without a CAS guard, a concurrent writer touching ES
between our snapshot and our UPDATE gets clobbered. The TotalEvents /
ArchivedEvents columns are particularly exposed because zmstats doesn't
maintain them (it only touches Hour/Day/Week/Month), so any drift
introduced by zmaudit racing event_delete_trigger or the zmc insert
path persists until the next zmaudit cycle.
Read the current ES row alongside the aggregates, skip monitors whose
snapshot matches the current row (no X-lock for no-op writes), and
guard each UPDATE with null-safe equality (MariaDB <=>) on every
column we're writing. Track CAS-deferred and skipped counts separately
from failures in the audit log.
zmstats does not need the same treatment: its UPDATE runs inside a TX
that already holds the bucket X-locks that gate the trigger writers
updating its column set.
The DELETE WHERE EventId IN (?,?,...) is intentional: it locks each row
via the primary key, keeping the lock range minimal and preserving the
canonical lock order that this PR's deadlock fix relies on. But a single
IN-list with tens of thousands of placeholders (Events_Month after weeks
of accumulation) can hit max_allowed_packet and max_prepared_stmt_count.
Split the EventId list into 1000-row batches and loop. PK-based locking
is preserved; SQL/packet size stays bounded. Switching to a predicate-
based DELETE would re-introduce range locks on the bucket index and
undo the deadlock work.
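A short Python sketch of the batching: split the id list into fixed-size chunks and build one placeholder list per chunk. The generator name is hypothetical; the 1000-row default comes from the message above.

```python
from typing import Iterator, List, Sequence, Tuple


def batched_deletes(table: str, event_ids: Sequence[int],
                    batch_size: int = 1000) -> Iterator[Tuple[str, List[int]]]:
    """Yield (sql, binds) pairs, each deleting at most batch_size rows."""
    for start in range(0, len(event_ids), batch_size):
        batch = list(event_ids[start:start + batch_size])
        placeholders = ",".join("?" * len(batch))
        # Still an IN-list of ids, so the row-by-row locking is unchanged.
        yield (f"DELETE FROM {table} WHERE EventId IN ({placeholders})", batch)
```

Each pair would be executed inside the same transaction, so atomicity is unchanged while statement size stays bounded.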
Unlike Event_Summaries, Storage.DiskSpace has no trigger-based
incremental maintenance — Event::delete and the event-finalize paths do
their own +/- adjustments in application code. The previous absolute
snapshot+overwrite could undo a concurrent Event::delete adjustment
that committed between our Events SUM SELECT and our Storage UPDATE,
making accounting transiently wrong under normal load until the next
zmaudit pass.
Read the current Storage.DiskSpace alongside the aggregate, skip rows
that are already correct, and UPDATE with a null-safe equality guard
(MariaDB <=>) so a concurrent writer's newer value blocks the
overwrite. Track CAS-deferred rows separately from failures and
surface both counts in the audit log.
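The compare-and-swap guard can be sketched with Python's sqlite3 module. SQLite's IS operator stands in for MariaDB's null-safe <=> here, and the function name and the three result labels are assumptions made for the example.

```python
import sqlite3
from typing import Optional


def cas_update_disk_space(conn: sqlite3.Connection, storage_id: int,
                          seen: Optional[int],
                          new_value: Optional[int]) -> str:
    """Write new_value only if the row still holds the value we read."""
    if seen == new_value:
        return "skipped"  # already correct: take no write lock at all
    cursor = conn.execute(
        # IS is null-safe in SQLite; MariaDB would use <=> instead.
        "UPDATE Storage SET DiskSpace = ? WHERE Id = ? AND DiskSpace IS ?",
        (new_value, storage_id, seen),
    )
    conn.commit()
    # Zero rows changed means a concurrent writer got there first.
    return "updated" if cursor.rowcount == 1 else "deferred"
```

A "deferred" row is left for the next audit pass rather than overwritten with a stale snapshot.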
SET TRANSACTION ISOLATION LEVEL applies to the very next transaction on
the connection. zmDbDo's success Debug INSERT INTO Logs is a real
statement on the same $dbh; with database debug logging enabled, that
INSERT becomes the "next transaction" and silently consumes the
isolation directive. The intended READ COMMITTED then never applies to
the prune/resync/delete TX that follows.
Call $dbh->do directly for SET TRANSACTION in both Event::delete and
zmstats.pl, bypassing zmDbDo's logging. SET TRANSACTION can't deadlock
so zmDbDo's retry was no benefit here anyway.
Same hazard as the failure-path Debug: ZoneMinder::Logger->logPrint
INSERTs into Logs using the same $dbh, so a success Debug fires an
extra write inside a TX that's trying to minimize lock interactions —
and any err/errstr change it provokes is visible to the caller.
The autocommit path keeps the success Debug (it's a separate TX, no
caller interaction).
Every row in the previous arrayref-of-arrayref carried the same single
bind value (the event Id), so the [$sql, $$event{Id}] wrapping and the
my ($sql, @bind) = @$stmt unpacking were doing no work. Iterate over the
SQL strings directly and pass $$event{Id} as the one bind value.
Without this, an enumerate failure (SELECT MonitorId FROM Event_Summaries)
left @skipped empty and the per-monitor UPDATE loop ran for whatever rows
the bucket aggregates returned — but monitors that exist only in
Event_Summaries (no current bucket rows) never got zeroed, while the
audit log claimed a full resync. Track enumerate success and report
partial resync when it fails.
Two issues with the previous implementation:
1. Aggregate SELECTs ran GROUP BY MonitorId across the full bucket
   tables every zmstats cycle (default 60s). Events_Month grows for
   weeks; this turned the stats daemon into a constant full-scan
   workload on busy installs.
2. The per-monitor UPDATE loop X-locked every Event_Summaries row on
   every cycle even when nothing changed, adding avoidable contention
   with the trigger writers this rewrite is supposed to protect.
Capture MonitorIds as we SELECT bucket rows for pruning, then skip the
resync entirely if no rows were pruned. When rows were pruned, restrict
the aggregate SELECTs (WHERE MonitorId IN ...) and the per-monitor
UPDATEs to that touched set. zmaudit remains the periodic deep-resync
safety net for drift in untouched monitors.
Also capture errstr before rollback so the gave-up Error reports the
actual reason instead of an empty string on drivers that clear errstr
on rollback.
ZoneMinder::Logger->logPrint runs INSERT INTO Logs on the same dbh.
Calling Debug()/Error() from zmDbDo's failure path inside a caller-managed
transaction would execute another statement on the connection, clearing
the err/errstr state the caller needs to see for rollback/retry. The
result could be a caller observing err=0 after a deadlock-victim TX and
committing what looks like success but is actually a rolled-back no-op.
Bail silently from zmDbDo when AutoCommit is off; the caller owns the
retry loop and is responsible for logging. Logging in the autocommit
path is still safe because each statement is its own TX.
Previously zmaudit logged "Finished resyncing Event_Summaries" /
"Finished updating Storage DiskSpace" unconditionally as long as the
aggregate SELECTs succeeded, masking per-row UPDATE failures (e.g.
zmDbDo exhausting its deadlock retries) and skipped aggregate column
groups. Track which aggregates were skipped and which per-monitor /
per-storage UPDATEs failed (zmDbDo returns undef on failure), and
surface that in the audit log instead of claiming the resync is
complete.
The previous comment claimed each UPDATE couldn't hold any bucket lock
that would deadlock with the trigger path, which conflated statement-
level locks with TX-level locks. By the time we reach this loop the TX
already holds bucket-row X-locks from the earlier DELETEs plus any ES
X-locks acquired by the bucket DELETE triggers cascading. Rewrite the
comment to distinguish those TX-held locks from the locks acquired by
the new UPDATE statement and to be explicit that the TX's lock
acquisition direction is preserved.
zmDbDo suppresses its Error log on 1213 inside a caller-managed TX (the
caller owns the retry), and the previous fallthrough at the end of the
retry loop just `return`ed silently. After 5 failed attempts on persistent
contention the event was effectively un-deleted with no record of the
failure. Capture errstr before rollback (some drivers clear it) and emit
an Error on the bail path.
A concurrent trigger writer can adjust Event_Summaries between our
snapshot SELECTs and the per-monitor UPDATE; the UPDATE then overwrites
that adjustment with the older snapshot. Drift is bounded by the
zmstats/zmaudit interval and corrected on the next pass, because the
incremental triggers continue to maintain ES correctly between resyncs.
Locking ES before reading aggregates would invert the canonical lock
order and re-introduce the deadlock cycle the resync rewrite eliminated.
When zmDbDo is called inside a caller-managed transaction (AutoCommit off),
max_attempts is 1 and the loop falls through to Error on a 1213 deadlock —
which is misleading, because the caller (Event::delete, the zmstats
prune+resync TX) has its own outer retry loop that will roll back and
succeed. Downgrade to a Debug message in that path; Error is still emitted
for non-deadlock failures and for autocommit calls that exhaust their
retries.
The non-HTML branch of sendTheEmail() declared Content-Transfer-Encoding
as quoted-printable but passed the body to MIME::Lite unencoded. Mail
clients then QP-decoded literal '=NN' digit pairs in URLs, eating
characters from substitution tags like %EP%, %EPS%, %EPI%. For example
&eid=1947908 was decoded as &eid<0x19>47908 and rendered as eid47908.
Match the HTML branch's pattern by encoding the body via
MIME::QuotedPrint::encode_qp before attaching.
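The misdecoding can be reproduced with Python's quopri module, used here as a stand-in for MIME::QuotedPrint. The sample string is the one quoted above.

```python
import quopri

body = b"&eid=1947908"

# What a mail client does to an unencoded body labelled quoted-printable:
# "=19" is read as an escape for byte 0x19, so two digits vanish.
misread = quopri.decodestring(body)

# Encoding first escapes the literal "=" so decoding restores the original.
encoded = quopri.encodestring(body)
round_trip = quopri.decodestring(encoded)
```

Here misread is no longer the original body, while round_trip is byte-for-byte identical to it.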
fixes #4822
Add inline comments at every occurrence so future readers don't have to look
up the errno. No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
selectcol_arrayref returns undef on a DB error. The previous code only
checked truthiness of the result before deciding to DELETE, so a transient
SELECT failure would silently skip the prune for that bucket and let the
transaction continue to commit an incomplete state. Capture \$dbh->err()
after the SELECT and bail out the same way the DELETE error path does.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, if any of the five aggregate SELECTs (Events, Events_Hour/Day/
Week/Month) failed transiently, the per-monitor UPDATE phase still ran and
wrote `// 0` for every column from those failed groups, destroying valid
counters across all monitors.
Track per-aggregate success and build the UPDATE SET clause from only the
column groups whose SELECT succeeded. zmaudit re-runs on its normal
interval, so a missed group is corrected on the next pass instead of being
overwritten with zeros now.
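A Python sketch of building the SET clause from only the groups that succeeded. The snapshot layout (group name mapped to a column/value dict, or None when that SELECT failed) and the function name are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Snapshot = Dict[str, Optional[Dict[str, int]]]


def build_summary_update(monitor_id: int,
                         snapshot: Snapshot) -> Optional[Tuple[str, List[int]]]:
    """Return (sql, binds) covering only groups whose SELECT succeeded."""
    assignments: List[str] = []
    binds: List[int] = []
    for columns in snapshot.values():
        if columns is None:
            continue  # failed aggregate: leave these counters untouched
        for column, value in columns.items():
            assignments.append(f"{column} = ?")
            binds.append(value)
    if not assignments:
        return None  # nothing trustworthy to write for this monitor
    sql = ("UPDATE Event_Summaries SET " + ", ".join(assignments)
           + " WHERE MonitorId = ?")
    return sql, binds + [monitor_id]
```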
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous comment block claimed event_update_trigger fires BEFORE UPDATE
and that the lock order was buckets -> Event_Summaries -> Events. Neither
matches the code: triggers.sql defines event_update_trigger as AFTER UPDATE,
and InnoDB X-locks the matched Events row during WHERE evaluation before
either BEFORE or AFTER trigger bodies fire — so the canonical chain is
    Events[Id] -> buckets[EventId] -> Event_Summaries[MonitorId]
which is what triggers.sql already documents. Comment-only change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous resync code in zmstats and zmaudit used multi-table
UPDATEs against Event_Summaries that joined the bucket tables:
    UPDATE Event_Summaries es
      LEFT JOIN (SELECT ... FROM Events_Hour ...) h ON ...
      LEFT JOIN (SELECT ... FROM Events_Day ...) d ON ...
      ... SET es.HourEvents = h.c, ...
zmaudit additionally used scalar correlated subqueries against Events
for the Total/Archived columns and against Events for Storage.DiskSpace.
MariaDB takes S-locks on the joined and sub-queried rows for the
duration of any multi-table UPDATE statement, regardless of isolation
level. event_update_trigger and event_delete_trigger hold X-locks on
those same bucket rows while they walk the trigger body, so the resync
deadlocks against active event lifecycle traffic. Captured a textbook
example in SHOW ENGINE INNODB STATUS:
    TX(1) zma:     HOLDS X Events_Hour[42229643]
                   WAITS X Event_Summaries[28]
    TX(2) zmstats: HOLDS X Event_Summaries[2,4,5,...,28,...,73]
                   WAITS S Events_Hour[42229643]
Replace the JOIN/subquery pattern in both scripts with a snapshot phase
followed by per-monitor UPDATEs:
1. SELECT MonitorId, COUNT(*), SUM(DiskSpace) FROM each bucket
   (and the equivalent Total/Archived aggregate from Events).
   Plain SELECTs do consistent reads and take no row locks.
2. SELECT MonitorId FROM Event_Summaries to widen the universe so
   monitors with empty buckets still get zeroed out.
3. For each monitor, UPDATE Event_Summaries SET ... WHERE MonitorId=?.
   Each UPDATE only X-locks one ES row and reads no other table.
zmaudit's five separate UPDATEs collapse to one snapshot phase plus
one UPDATE per monitor. Storage DiskSpace gets the same treatment.
zmstats keeps the same outer transaction (BEGIN ... COMMIT, RC
isolation, retry on 1213) so the bucket DELETEs and the resync stay
atomic, but the resync no longer reads the bucket tables under lock.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues in the existing Stats/Event_Data/Frames/Events delete
sequence:
- On any zmDbDo error inside the transaction the code called
  dbh->commit() instead of dbh->rollback(). The server-side
  transaction was already rolled back when InnoDB picked us as the
  deadlock victim, so the commit() was effectively against a fresh
  auto-started TX, but the bug pattern leaked through as confusing
  state and prevented any retry.
- There was no retry. errno 1213 is expected under contention with
  zmstats and zma touching the same Event_Summaries[MonitorId] row,
  and the loser is supposed to re-run.
- At REPEATABLE READ, two concurrent filter workers deleting events
  with adjacent EventIds take next-key/gap locks on each other's
  rows in the bucket tables.
Rewrite the delete block as a retry loop: SET TRANSACTION ISOLATION
LEVEL READ COMMITTED, begin_work, run the four DELETEs, commit on
success. On any error rollback (was: commit). On errno 1213 retry up
to 5 times with backoff. Skip both the isolation switch and the
rollback-then-retry when the caller is managing their own transaction
(in_transaction); they would be the wrong scope to act in.
Falls through to the storage DiskSpace adjustment only on commit, so
a deadlocked delete leaves the event for the next filter pass instead
of orphaning the row with stale storage accounting.
Note: do NOT pre-lock Event_Summaries[MonitorId] FOR UPDATE here, even
though the trigger touches it last. Pre-locking puts ES before
buckets[Id] in the lock acquisition order, which inverts against zma's
event_update_trigger path (Events[A] -> buckets[A] -> ES[N]) and
re-introduces the cycle the rest of this work is removing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deadlock detection (errno 1213) is part of normal InnoDB operation
under contention; the engine rolls back the loser and expects the
caller to re-run the statement on a fresh transaction. Most callers
into zmDbDo go through autocommit, where there's no caller-managed
transaction state for a retry to disturb.
When AutoCommit is on, retry the statement up to 5 times with
exponential backoff (~100ms -> ~1.6s, jittered). When AutoCommit is
off, the caller owns the transaction and a unilateral retry would
silently succeed against a TX that no longer reflects the work the
caller staged before this statement; preserve the existing behavior
of logging and returning undef so the caller can rebuild the TX
itself.
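The retry policy can be sketched in Python as follows. DeadlockError stands in for errno 1213, returning None stands in for undef, and reading "up to 5 times" as five retries after the first attempt (delays of roughly 0.1, 0.2, 0.4, 0.8 and 1.6 s) is one interpretation of the message. The jitter range and all names are assumptions.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class DeadlockError(Exception):
    """Stand-in for a driver error carrying errno 1213."""


def run_with_deadlock_retry(statement: Callable[[], T], autocommit: bool,
                            retries: int = 5, base_delay: float = 0.1,
                            sleep: Callable[[float], None] = time.sleep
                            ) -> Optional[T]:
    """Run statement, retrying deadlocks only when autocommit is on."""
    # Inside a caller-managed transaction the caller owns the retry.
    allowed = retries if autocommit else 0
    for attempt in range(allowed + 1):
        try:
            return statement()
        except DeadlockError:
            if attempt == allowed:
                return None  # give up; the caller decides what to do next
            # Exponential backoff with +/-25% jitter (assumed range).
            sleep(base_delay * (2 ** attempt) * random.uniform(0.75, 1.25))
    return None
```

Passing the sleep function as a parameter keeps the sketch testable without real delays.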
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>