Monitoring

Monitor rstmdb using Prometheus metrics and logging.

Metrics Endpoint

Enable metrics in configuration:

metrics:
  enabled: true
  bind_addr: "0.0.0.0:9090"

Access metrics:

curl http://localhost:9090/metrics
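For scripted checks it can help to read the endpoint programmatically. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format without trailing timestamps; `parse_metrics` and `fetch_metrics` are hypothetical helper names, not part of rstmdb.

```python
import urllib.request


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition format into {series: value}.

    Simplified: comment lines (# HELP / # TYPE) are skipped and each
    sample line is assumed to end with its value (no timestamp).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token; everything
        # before it is the metric name plus optional {labels}.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url: str = "http://localhost:9090/metrics") -> dict:
    """Fetch and parse the metrics endpoint (URL taken from the example above)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

With a server running, `fetch_metrics().get("rstmdb_connections_active")` would return the current gauge value as a float. For anything beyond quick checks, a full Prometheus client parser is the safer choice.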

Available Metrics

Connection Metrics

| Metric | Type | Description |
|---|---|---|
| rstmdb_connections_active | Gauge | Current active connections |
| rstmdb_connections_total | Counter | Total connections since start |
| rstmdb_connections_rejected | Counter | Rejected connections (limit reached) |

Request Metrics

| Metric | Type | Description |
|---|---|---|
| rstmdb_requests_total{op} | Counter | Total requests by operation |
| rstmdb_requests_duration_seconds{op} | Histogram | Request duration by operation |
| rstmdb_requests_errors_total{op,code} | Counter | Errors by operation and code |

State Machine Metrics

| Metric | Type | Description |
|---|---|---|
| rstmdb_machines_total | Gauge | Total machine definitions |
| rstmdb_instances_total | Gauge | Total instances |
| rstmdb_instances_by_state{machine,state} | Gauge | Instances by machine and state |
| rstmdb_events_applied_total{machine} | Counter | Events applied by machine |
| rstmdb_transitions_total{machine,from,to} | Counter | Transitions by machine and states |

WAL Metrics

| Metric | Type | Description |
|---|---|---|
| rstmdb_wal_offset | Gauge | Current WAL offset |
| rstmdb_wal_size_bytes | Gauge | Total WAL size |
| rstmdb_wal_segments | Gauge | Number of WAL segments |
| rstmdb_wal_writes_total | Counter | WAL writes |
| rstmdb_wal_write_duration_seconds | Histogram | WAL write duration |
| rstmdb_wal_fsyncs_total | Counter | Fsync operations |

Subscription Metrics

| Metric | Type | Description |
|---|---|---|
| rstmdb_subscriptions_active | Gauge | Active subscriptions |
| rstmdb_events_broadcast_total | Counter | Events broadcast to subscribers |

System Metrics

| Metric | Type | Description |
|---|---|---|
| rstmdb_uptime_seconds | Gauge | Server uptime |
| rstmdb_memory_used_bytes | Gauge | Memory usage |

Prometheus Configuration

Add to prometheus.yml:

scrape_configs:
  - job_name: 'rstmdb'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s

With Service Discovery

scrape_configs:
  - job_name: 'rstmdb'
    dns_sd_configs:
      - names:
          - '_rstmdb._tcp.service.consul'

Grafana Dashboard

Import the pre-built dashboard or create custom panels.

Key Panels

Request Rate

rate(rstmdb_requests_total[5m])

Request Latency (p99)

histogram_quantile(0.99, rate(rstmdb_requests_duration_seconds_bucket[5m]))
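To make the p99 query less of a black box, here is a minimal sketch of how a quantile is estimated from cumulative histogram buckets. It follows the linear interpolation PromQL applies to classic histograms, with simplifications: bucket bounds are assumed to be positive, `q` is restricted to (0, 1], and the bucket values in the usage note are invented for illustration rather than taken from rstmdb.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` holds one (le, cumulative_count) pair per bucket and must
    include the +Inf bucket, as the `_bucket` series of a histogram do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1] for this simplified sketch")
    buckets = sorted(buckets)  # ascending by upper bound; +Inf sorts last
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan  # not enough buckets, or no observations
    rank = q * buckets[-1][1]  # position of the quantile among observations
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)
```

For example, with hypothetical buckets `[(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]`, the 0.75 quantile lands in the 0.1 to 0.5 second bucket and is estimated at about 0.35 seconds. The estimate can never be more precise than the bucket boundaries, which is why bucket layout matters for latency panels and alerts.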

Error Rate

rate(rstmdb_requests_errors_total[5m])

Instance Count

rstmdb_instances_total

WAL Size (MB)

rstmdb_wal_size_bytes / 1024 / 1024

Events per Second

rate(rstmdb_events_applied_total[1m])

Sample Dashboard JSON

{
  "dashboard": {
    "title": "rstmdb Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(rstmdb_requests_total[5m])) by (op)",
            "legendFormat": "{{op}}"
          }
        ]
      },
      {
        "title": "Request Latency (p99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(rstmdb_requests_duration_seconds_bucket[5m])) by (le, op))",
            "legendFormat": "{{op}}"
          }
        ]
      },
      {
        "title": "Active Connections",
        "type": "stat",
        "targets": [
          {
            "expr": "rstmdb_connections_active"
          }
        ]
      },
      {
        "title": "Instance Count",
        "type": "stat",
        "targets": [
          {
            "expr": "rstmdb_instances_total"
          }
        ]
      },
      {
        "title": "WAL Size (MB)",
        "type": "stat",
        "targets": [
          {
            "expr": "rstmdb_wal_size_bytes / 1024 / 1024"
          }
        ]
      }
    ]
  }
}

Alerting Rules

Prometheus Alerting

groups:
  - name: rstmdb
    rules:
      - alert: RstmdbDown
        expr: up{job="rstmdb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "rstmdb is down"

      - alert: RstmdbHighErrorRate
        expr: rate(rstmdb_requests_errors_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in rstmdb"

      - alert: RstmdbHighLatency
        expr: histogram_quantile(0.99, rate(rstmdb_requests_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency in rstmdb (p99 > 1s)"

      - alert: RstmdbWALGrowing
        # delta() rather than increase(): rstmdb_wal_size_bytes is a gauge,
        # and increase() is only meaningful for counters.
        expr: delta(rstmdb_wal_size_bytes[1h]) > 1073741824
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "WAL growing rapidly (>1GB/hour)"

      - alert: RstmdbConnectionsNearLimit
        # rstmdb_connections_limit does not appear in the metric tables above;
        # if it is not exported, replace it with the configured connection limit.
        expr: rstmdb_connections_active / rstmdb_connections_limit > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Connections near limit (>80%)"
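Each rule above pairs an expression with a `for:` duration, and the alert only fires once the expression has stayed true for that long. The sketch below illustrates that pending-to-firing behavior as a small state machine; `AlertState` is a hypothetical name for illustration and is not how Prometheus is implemented internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    hold_seconds: float                    # the rule's `for:` duration
    active_since: Optional[float] = None   # time the condition first held

    def evaluate(self, now: float, condition: bool) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if not condition:
            # A single false evaluation resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"
```

With `AlertState(hold_seconds=60)`, a condition that is true at t=0 and t=30 is reported as pending and becomes firing at t=60; one false evaluation in between starts the wait again. This is why a longer `for:` value suppresses short spikes at the cost of slower notification.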

Logging

Structured Logging

Configure JSON logging:

logging:
  format: "json"
  level: "info"

Log output:

{"timestamp":"2024-01-15T10:30:00.123Z","level":"INFO","target":"rstmdb_server","message":"Request completed","op":"APPLY_EVENT","duration_ms":5,"instance_id":"order-001"}
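Because each log line is a self-contained JSON object, ad hoc analysis needs very little code. The following is a minimal sketch that filters for slow requests, assuming every request log carries the `duration_ms` field shown in the sample line; `slow_requests` and the 100 ms default threshold are hypothetical choices for illustration.

```python
import json
from typing import Iterable, Iterator


def slow_requests(lines: Iterable[str], threshold_ms: float = 100.0) -> Iterator[dict]:
    """Yield parsed log records whose duration_ms exceeds threshold_ms."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(record, dict):
            continue
        # Records without a duration (e.g. startup messages) are ignored.
        if record.get("duration_ms", 0) > threshold_ms:
            yield record
```

To scan a log stream, pass any iterable of lines, for example `slow_requests(sys.stdin, threshold_ms=50)` after `import sys`, and print the `op` and `instance_id` fields of each yielded record.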

Log Levels

| Level | Description |
|---|---|
| error | Errors only |
| warn | Warnings and errors |
| info | Normal operations (default) |
| debug | Detailed debugging |
| trace | Very verbose tracing |

Set via environment:

RUST_LOG=debug rstmdb

Log Aggregation

Fluentd

# fluent.conf
<source>
  @type forward
  port 24224
</source>

<filter rstmdb.**>
  @type parser
  key_name log
  <parse>
    @type json
  </parse>
</filter>

<match rstmdb.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  index_name rstmdb
</match>

Vector

# vector.toml
[sources.rstmdb]
type = "docker_logs"
include_containers = ["rstmdb"]

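# The json_parser transform has been removed from recent Vector releases;
# a remap transform with VRL is its replacement. Note that this variant
# replaces the whole event with the parsed JSON log record.
[transforms.parse_json]
type = "remap"
inputs = ["rstmdb"]
source = '''
. = parse_json!(.message)
'''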

[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["parse_json"]
endpoint = "http://elasticsearch:9200"
index = "rstmdb-%Y-%m-%d"

Health Checks

CLI Health Check

#!/bin/bash
# health-check.sh

if rstmdb-cli -s localhost:7401 ping > /dev/null 2>&1; then
  echo "OK"
  exit 0
else
  echo "FAIL"
  exit 1
fi

HTTP Health Check

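The HTTP server that exposes metrics also answers health checks at /health. With -f, curl exits non-zero when the server returns an HTTP error status: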

curl -f http://localhost:9090/health
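The same check can be embedded in a script or test harness. This is a minimal sketch, assuming the `/health` URL from the curl example; `is_healthy` is a hypothetical helper name, and it roughly mirrors `curl -f` by treating connection failures and HTTP error statuses as unhealthy.

```python
import urllib.error
import urllib.request


def is_healthy(url: str = "http://localhost:9090/health", timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx responses
        # (urlopen raises HTTPError, a URLError subclass, for those).
        return False
```

A deployment script could poll `is_healthy()` in a loop with a short sleep until it returns True before routing traffic to a newly started server.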

Kubernetes Probes

livenessProbe:
  exec:
    command:
      - rstmdb-cli
      - -s
      - localhost:7401
      - ping
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  exec:
    command:
      - rstmdb-cli
      - -s
      - localhost:7401
      - ping
  initialDelaySeconds: 5
  periodSeconds: 5