
Monitoring & Observability

This guide covers monitoring Milvaion in production, including health checks, metrics, logging, and alerting.

Health Checks

API Health Endpoints

| Endpoint | Purpose | Use For |
| --- | --- | --- |
| /health/live | Is the process running? | Kubernetes liveness probe |
| /health/ready | Are dependencies healthy? | Kubernetes readiness probe |
| /health | Full health with details | Debugging, dashboards |

Liveness Check

curl http://localhost:5000/api/v1/healthcheck/live

Response:

{
  "status": "Healthy",
  "timestamp": "2026-01-14T17:55:12.5466734Z",
  "uptime": "16.05:34:46.9359664"
}

Readiness Check

curl http://localhost:5000/api/v1/healthcheck/ready

Response:

{
  "status": "Healthy",
  "duration": "00:00:00.0015505",
  "timestamp": "2026-01-14T17:55:39.7914455Z",
  "checks": [
    {
      "name": "PostgreSQL",
      "status": "Healthy",
      "description": "PostgreSQL database connection is healthy",
      "duration": "00:00:00.0014398",
      "tags": ["database", "sql"],
      "data": {
        "DatabaseName": "MilvaionDb",
        "ConnectionStatus": "Connected",
        "ProviderName": "Npgsql.EntityFrameworkCore.PostgreSQL"
      }
    },
    {
      "name": "Redis",
      "status": "Healthy",
      "description": "Redis connection is healthy",
      "duration": "00:00:00.0004801",
      "tags": ["redis", "cache"],
      "data": {
        "ConnectionStatus": "Connected",
        "Database": "0"
      }
    },
    {
      "name": "RabbitMQ",
      "status": "Healthy",
      "description": "RabbitMQ connection is healthy",
      "duration": "00:00:00.0000037",
      "tags": ["rabbitmq", "messaging"],
      "data": {
        "ConnectionStatus": "Connected",
        "Host": "rabbitmq",
        "Port": "5672",
        "IsOpen": "True"
      }
    }
  ]
}
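
For scripted checks, the readiness payload can be filtered down to failing dependencies. A minimal sketch using the response shape above (assumes jq is installed):

# Print any dependency check that is not Healthy (empty output means all checks passed)
curl -s http://localhost:5000/api/v1/healthcheck/ready \
  | jq -r '.checks[] | select(.status != "Healthy") | "\(.name): \(.description)"'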

Kubernetes Probes

spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /api/v1/healthcheck/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /api/v1/healthcheck/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3

Worker Health

Workers support two health check approaches: file-based (for Console Workers) and HTTP endpoint-based (for API Workers).

Configuration

Enable health checks in appsettings.json:

{
  "Worker": {
    "HealthCheck": {
      "Enabled": true,
      "LiveFilePath": "/tmp/live",   // File-based health check only (Console Workers)
      "ReadyFilePath": "/tmp/ready", // File-based health check only (Console Workers)
      "IntervalSeconds": 30
    }
  }
}

| Setting | Default | Description |
| --- | --- | --- |
| Enabled | false | Enable/disable health checks |
| LiveFilePath | /tmp/live | File path for liveness probe |
| ReadyFilePath | /tmp/ready | File path for readiness probe |
| IntervalSeconds | 30 | Health check interval |

Option 1: Console Worker (File-Based)

For workers without HTTP endpoints, use AddFileHealthCheck():

using Microsoft.Extensions.Hosting;
using Milvasoft.Milvaion.Sdk.Worker;

var builder = Host.CreateApplicationBuilder(args);

// Register Worker SDK
builder.Services.AddMilvaionWorkerWithJobs(builder.Configuration);

// Add file-based health checks (Redis + RabbitMQ)
builder.Services.AddFileHealthCheck(builder.Configuration);

var host = builder.Build();
await host.RunAsync();

Kubernetes probes:

spec:
  containers:
    - name: worker
      livenessProbe:
        exec:
          command: ["test", "-f", "/tmp/live"]
        initialDelaySeconds: 30
        periodSeconds: 30
        failureThreshold: 3
      readinessProbe:
        exec:
          command: ["test", "-f", "/tmp/ready"]
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3

Docker Compose healthcheck:

worker:
  healthcheck:
    test: ["CMD", "test", "-f", "/tmp/live"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s

Option 2: API Worker (HTTP Endpoints)

For workers with HTTP endpoints, use AddHealthCheckEndpoints() and UseHealthCheckEndpoints():

using Milvasoft.Milvaion.Sdk.Worker;
using Milvasoft.Milvaion.Sdk.Worker.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Add health checks (Redis + RabbitMQ)
builder.Services.AddHealthChecks()
    .AddCheck<RedisHealthCheck>("Redis", tags: ["redis", "cache"])
    .AddCheck<RabbitMQHealthCheck>("RabbitMQ", tags: ["rabbitmq", "messaging"]);

// Register Worker SDK
builder.Services.AddMilvaionWorkerWithJobs(builder.Configuration);

// Register health check endpoint services
builder.Services.AddHealthCheckEndpoints(builder.Configuration);

var app = builder.Build();

// Map health check endpoints
app.UseHealthCheckEndpoints(builder.Configuration);

await app.RunAsync();

Available endpoints:

| Endpoint | Purpose | Response |
| --- | --- | --- |
| /health | Simple check | "Ok" |
| /health/live | Liveness probe | { status, timestamp, uptime } |
| /health/ready | Readiness probe | { status, duration, checks[] } |
| /health/startup | Startup probe | { status, timestamp, uptime } |

Kubernetes probes:

spec:
  containers:
    - name: api-worker
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /health/startup
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 30

Health Check Behavior

The SDK includes built-in health checks for:

| Check | What it Verifies |
| --- | --- |
| Redis | Connection status via PING command |
| RabbitMQ | Connection status via IConnectionMonitor |

File-based health check logic:

  • Live file exists: Worker process is running and not completely unhealthy
  • Ready file exists: All health checks (Redis, RabbitMQ) are healthy
  • Files are deleted on graceful shutdown
  • Files are updated every IntervalSeconds
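
To spot-check this behavior from outside the container, the marker files can be inspected directly. A minimal sketch (the container name my-worker is illustrative):

# The ready file should exist and have been refreshed within the last IntervalSeconds
docker exec my-worker sh -c 'test -f /tmp/ready && echo "ready (updated: $(date -r /tmp/ready))" || echo "not ready"'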

Dashboard Metrics

Built-in Statistics

Access via Dashboard home or API:

curl http://localhost:5000/api/v1/dashboard

{
  "isSuccess": true,
  "statusCode": 200,
  "messages": [
    {
      "key": "",
      "message": "Operation successful!",
      "type": 1
    }
  ],
  "data": {
    "totalExecutions": 3546,
    "queuedJobs": 0,
    "completedJobs": 3513,
    "failedOccurrences": 32,
    "cancelledJobs": 1,
    "timedOutJobs": 0,
    "runningJobs": 0,
    "averageDuration": 2509.362653003131,
    "successRate": 99.06937394247038,
    "totalWorkers": 1,
    "totalWorkerInstances": 1,
    "workerCurrentJobs": 0,
    "workerMaxCapacity": 128,
    "workerUtilization": 0,
    "executionsPerMinute": 10,
    "executionsPerSecond": 0, // 0 means the rate is below 1 per second
    "peakExecutionsPerMinute": 10
  },
  "metadatas": []
}
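
The same endpoint can back a simple scripted check. A hedged example that exits non-zero when successRate drops below a threshold (the 95% value is illustrative, and jq is assumed to be available):

# Pull successRate from the dashboard stats and fail if it drops below 95%
RATE=$(curl -s http://localhost:5000/api/v1/dashboard | jq '.data.successRate')
echo "Success rate: ${RATE}%"
awk -v r="$RATE" 'BEGIN { exit (r < 95) ? 1 : 0 }'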

Worker Status

curl http://localhost:5000/api/v1/workers

[
  {
    "isSuccess": true,
    "statusCode": 200,
    "messages": [],
    "data": [
      {
        "workerId": "sample-worker",
        "displayName": "sample-worker (sample-worker-172a5243)",
        "routingPatterns": {
          "AlwaysFailingJob": "alwaysfailing.*",
          "LongRunningTestJob": "longrunningtest.*",
          "NonParallelJob": "nonparallel.*",
          "SendEmailJob": "sendemail.*",
          "TestJob": "test.*"
        },
        "jobNames": [
          "AlwaysFailingJob",
          "LongRunningTestJob",
          "NonParallelJob",
          "SendEmailJob",
          "TestJob"
        ],
        "currentJobs": 0,
        "status": "Active",
        "lastHeartbeat": "2026-01-14T18:04:52.0539333+00:00",
        "registeredAt": "2026-01-14T18:01:27.1426831+00:00",
        "version": "1.0.0.0",
        "metadata": "{\"ProcessorCount\":16,\"OSVersion\":\"Unix 6.6.87.1\",\"RuntimeVersion\":\"10.0.1\",\"JobConfigs\":[{\"JobType\":\"AlwaysFailingJob\",\"ConsumerId\":\"alwaysfailing-consumer\",\"MaxParallelJobs\":8,\"ExecutionTimeoutSeconds\":30},{\"JobType\":\"LongRunningTestJob\",\"ConsumerId\":\"longrunning-consumer\",\"MaxParallelJobs\":8,\"ExecutionTimeoutSeconds\":10},{\"JobType\":\"NonParallelJob\",\"ConsumerId\":\"nonparallel-consumer\",\"MaxParallelJobs\":1,\"ExecutionTimeoutSeconds\":30},{\"JobType\":\"SendEmailJob\",\"ConsumerId\":\"email-consumer\",\"MaxParallelJobs\":16,\"ExecutionTimeoutSeconds\":600},{\"JobType\":\"TestJob\",\"ConsumerId\":\"test-consumer\",\"MaxParallelJobs\":32,\"ExecutionTimeoutSeconds\":120}]}",
        "instances": [
          {
            "instanceId": "sample-worker-172a5243",
            "hostName": "1fc7768572fd",
            "ipAddress": "172.18.0.6",
            "currentJobs": 0,
            "status": 0,
            "lastHeartbeat": "2026-01-14T18:04:52.0539333+00:00",
            "registeredAt": "2026-01-14T18:01:27.1454977+00:00"
          }
        ]
      }
    ],
    "metadatas": []
  }
]
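
For a quick overview without the full payload, the response can be reduced to one line per worker. A minimal jq sketch based on the shape shown above:

# Worker id, status, current jobs, and last heartbeat, tab-separated
curl -s http://localhost:5000/api/v1/workers \
  | jq -r '.[].data[] | [.workerId, .status, .currentJobs, .lastHeartbeat] | @tsv'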

Logging

Milvaion uses Serilog for structured logging. By default, logs are written to the console. You can optionally enable Seq integration for centralized log management.

Default Behavior

  • Console output: All logs are written to console by default
  • Structured format: Logs include contextual properties for filtering
  • Automatic enrichment: Each log entry includes AppName and Environment properties
  • Path filtering: Health check and metrics endpoints are excluded from logs

Seq Integration (Optional)

To send logs to Seq, enable it in appsettings.json:

{
  "MilvaionConfig": {
    "Logging": {
      "Seq": {
        "Enabled": true,
        "Uri": "http://seq:5341"
      }
    }
  }
}

| Setting | Default | Description |
| --- | --- | --- |
| Enabled | false | Enable/disable Seq logging |
| Uri | - | Seq server URL |

Log Level Configuration

Configure log levels via the standard Serilog configuration:

{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft.AspNetCore": "Information",
        "System": "Warning",
        "Microsoft.AspNetCore.Mvc": "Warning",
        "Microsoft.AspNetCore.Cors": "Warning",
        "Microsoft.AspNetCore.Routing": "Warning",
        "Microsoft.AspNetCore.Hosting.Diagnostics": "Warning",
        "Microsoft.EntityFrameworkCore.Database.Command": "Warning",
        "Microsoft.EntityFrameworkCore.Update": "Warning",
        "Microsoft.AspNetCore.Authentication.JwtBearer.JwtBearerHandler": "Warning"
      }
    }
  }
}

| Level | Use For |
| --- | --- |
| Verbose | Detailed debugging (very noisy) |
| Debug | Development diagnostics |
| Information | General operational events |
| Warning | Unexpected but handled events |
| Error | Failures requiring attention |
| Fatal | Critical failures |

Log Enrichment

All log entries are automatically enriched with:

| Property | Description |
| --- | --- |
| AppName | Application identifier (milvaion-api) |
| Environment | Current deployment environment (MILVA_ENV environment variable) |

Log Correlation

Use CorrelationId to trace a job across services. In Seq:

CorrelationId = "corr-789"
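
If Seq is not enabled and logs only go to the console, the same correlation value can be grepped from the container output (corr-789 is the example value above):

# Find all console log lines for a single job run
docker logs milvaion-api 2>&1 | grep "corr-789"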

OpenTelemetry Integration

Milvaion includes built-in OpenTelemetry support for metrics and distributed tracing. Metrics are exposed via a Prometheus-compatible HTTP endpoint.

Configuration

Enable and configure OpenTelemetry in appsettings.json:

{
  "MilvaionConfig": {
    "OpenTelemetry": {
      "Enabled": true,
      "ExportPath": "/api/metrics",
      "Service": "milvaion-api",
      "Environment": "production",
      "Job": "api",
      "Instance": "milvaion-prod-01"
    }
  }
}

| Setting | Default | Description |
| --- | --- | --- |
| Enabled | false | Enable/disable OpenTelemetry observability |
| ExportPath | /api/metrics | Prometheus scraping endpoint path |
| Service | milvaion-api | Service name for resource identification |
| Environment | MILVA_ENV env var | Environment label (e.g., production, staging) |
| Job | api | Job label for Prometheus |
| Instance | Machine name | Instance identifier for multi-instance deployments |

Accessing Metrics

Prometheus metrics are exposed at the configured ExportPath:

curl http://localhost:5000/api/metrics

This endpoint returns metrics in Prometheus text format, ready for scraping.
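
To see which metric names this instance actually exposes (they depend on the enabled instrumentation), the exposition's HELP lines can be listed:

# List exposed metric names and descriptions
curl -s http://localhost:5000/api/metrics | grep "^# HELP"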

Collected Metrics

The following metric sources are automatically instrumented:

| Source | Description |
| --- | --- |
| ASP.NET Core | HTTP request duration, status codes, active requests |
| HTTP Client | Outbound HTTP request metrics |
| Process | CPU, memory, GC, thread pool metrics |
| Npgsql (PostgreSQL) | Database query metrics |
| Entity Framework Core | ORM-level database metrics |
| System.Net.Http | HTTP client diagnostics |
| System.Net.NameResolution | DNS resolution metrics |
| System.Threading | Thread pool and synchronization metrics |
| System.Runtime | .NET runtime metrics |

Distributed Tracing

Tracing is automatically configured for:

  • ASP.NET Core requests (with exception recording)
  • HTTP client calls
  • PostgreSQL queries (via Npgsql)
  • Entity Framework Core operations
  • Milvaion internal activity sources

Trace data includes resource attributes for correlation:

| Attribute | Description |
| --- | --- |
| service | Service name |
| environment | Deployment environment |
| job | Job identifier |
| instance | Instance identifier |

Prometheus Scrape Configuration

Add Milvaion to your Prometheus configuration:

scrape_configs:
  - job_name: 'milvaion-api'
    scrape_interval: 15s
    static_configs:
      - targets: ['milvaion-api:5000']
    metrics_path: '/api/metrics'

Alerting

Critical Alerts

| Condition | Severity | Action |
| --- | --- | --- |
| API health check failing | Critical | Page on-call |
| All workers offline | Critical | Page on-call |
| DLQ depth > 100 | High | Investigate failures |
| Success rate < 95% | High | Check failing jobs |
| Queue depth growing | Medium | Scale workers |
| Zombie jobs detected | Medium | Check worker health |

Prometheus Alert Rules

groups:
  - name: milvaion
    rules:
      - alert: MilvaionApiDown
        expr: up{job="milvaion-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Milvaion API is down"

      - alert: MilvaionNoActiveWorkers
        expr: count(milvaion_worker_active) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "No active Milvaion workers"

      - alert: MilvaionHighFailureRate
        expr: |
          rate(milvaion_jobs_failed[5m]) /
          rate(milvaion_jobs_completed[5m]) > 0.1
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Job failure rate > 10%"

      - alert: MilvaionDLQGrowing
        expr: milvaion_dlq_depth > 50
        for: 10m
        labels:
          severity: high
        annotations:
          summary: "Dead letter queue has {{ $value }} messages"

Grafana Dashboard

Key panels to include:

  1. Jobs Overview

    • Jobs dispatched/minute
    • Success vs failure rate
    • Average duration
  2. Worker Health

    • Active workers count
    • Jobs per worker
    • Heartbeat status
  3. Queue Status

    • RabbitMQ queue depth
    • Messages published/consumed
    • DLQ depth
  4. Infrastructure

    • Redis memory/connections
    • PostgreSQL connections
    • API response times

RabbitMQ Monitoring

Management UI

Access at http://localhost:15672:

  • Overview: Message rates, connections
  • Queues: Depth, consumers, message rates
  • Exchanges: Routing statistics

Key Metrics

| Metric | Healthy Range | Action if Exceeded |
| --- | --- | --- |
| Queue depth | < 1000 | Scale workers |
| Unacked messages | < 100 | Check worker health |
| Memory usage | < 80% | Add RAM or scale |
| Disk alarm | Not triggered | Add disk space |

CLI Monitoring

# Queue status
docker exec milvaion-rabbitmq rabbitmqctl list_queues name messages consumers

# Connection count
docker exec milvaion-rabbitmq rabbitmqctl list_connections
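
Dead-letter queue depth can be read the same way. A hedged example that assumes DLQ names contain "dlq" (adjust the filter to your queue naming convention):

# Dead letter queue depth (name pattern is an assumption)
docker exec milvaion-rabbitmq rabbitmqctl list_queues name messages | grep -i dlq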

Redis Monitoring

Key Metrics

| Metric | Command | Healthy Range |
| --- | --- | --- |
| Memory used | INFO memory | < 80% maxmemory |
| Connected clients | INFO clients | < 10000 |
| Commands/sec | INFO stats | Varies |
| Keyspace | INFO keyspace | Growing slowly |

CLI Commands

# Memory info
docker exec milvaion-redis redis-cli INFO memory

# Slow queries
docker exec milvaion-redis redis-cli SLOWLOG GET 10

# Active keys
docker exec milvaion-redis redis-cli DBSIZE
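
The scheduler's sorted set (key name taken from the troubleshooting section below) can also be sized directly:

# Number of jobs currently waiting in the Redis schedule
docker exec milvaion-redis redis-cli ZCARD "Milvaion:JobScheduler:scheduled_jobs"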

Database Monitoring

PostgreSQL Key Metrics

| Metric | Query | Threshold |
| --- | --- | --- |
| Active connections | SELECT count(*) FROM pg_stat_activity | < max_connections |
| Long-running queries | SELECT * FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '5 minutes' | 0 |
| Table size | SELECT pg_size_pretty(pg_total_relation_size('JobOccurrences')) | Monitor growth |

Useful Queries

-- Occurrence count by status (last 24h)
SELECT "Status", COUNT(*)
FROM "JobOccurrences"
WHERE "CreatedAt" > NOW() - INTERVAL '24 hours'
GROUP BY "Status";

-- Slowest jobs (avg duration)
SELECT j."JobType", AVG(o."DurationMs") as avg_ms, COUNT(*) as count
FROM "JobOccurrences" o
JOIN "ScheduledJobs" j ON o."JobId" = j."Id"
WHERE o."Status" = 2 -- Completed
GROUP BY j."JobType"
ORDER BY avg_ms DESC
LIMIT 10;

-- Failed jobs by type (last 7 days)
SELECT j."JobType", COUNT(*) as failures
FROM "JobOccurrences" o
JOIN "ScheduledJobs" j ON o."JobId" = j."Id"
WHERE o."Status" = 3 -- Failed
AND o."CreatedAt" > NOW() - INTERVAL '7 days'
GROUP BY j."JobType"
ORDER BY failures DESC;

Troubleshooting

Jobs Not Executing

  1. Check dispatcher is running: docker logs milvaion-api | grep -i dispatch
  2. Check workers are registered: curl http://localhost:5000/api/v1/workers
  3. Check RabbitMQ queues: http://localhost:15672
  4. Check Redis scheduled jobs:
    docker exec milvaion-redis redis-cli ZRANGE "Milvaion:JobScheduler:scheduled_jobs" 0 -1 WITHSCORES

High Memory Usage

  1. Check for large job payloads
  2. Check occurrence log sizes
  3. Enable database cleanup jobs
  4. Check for memory leaks in custom jobs

Slow Dashboard

  1. Check PostgreSQL query performance
  2. Add indexes if missing
  3. Increase API connection pool
  4. Enable response caching

What's Next?