MessWala Monitoring & Alerting Setup Guide¶

Overview¶

This guide provides comprehensive monitoring and alerting setup for MessWala production deployment. Monitoring is essential for detecting issues early, tracking performance, and ensuring system reliability.

Built-in Monitoring Endpoints¶

Your application includes several built-in monitoring endpoints that can be integrated with monitoring platforms:

Health & Status Endpoints¶

GET /api/health
  - Detailed health status including database, memory, cache
  - Response: { status, database, memory, cache, overall }
  - Use for: Manual health checks, basic monitoring

GET /api/ready
  - Readiness probe (used by Kubernetes/Docker)
  - Returns 200 if database is connected
  - Returns 503 if service not ready
  - Use for: Container orchestration, load balancer health checks

GET /api/live
  - Liveness probe (used by Kubernetes/Docker)
  - Returns 200 if process is alive
  - Use for: Detecting hung processes, automatic restarts

GET /api/metrics
  - Performance metrics and system stats
  - Response: { uptime, memory, performance }
  - Use for: Performance monitoring dashboards

GET /api/admin/health-summary
  - Comprehensive system health summary
  - Response: { status, cache, jobQueue, rateLimiter }
  - Authentication: Admin required
  - Use for: Admin dashboards, alerting rules

Option 1: Render.com Monitoring (Built-in)¶

If deploying to Render.com, monitoring is included automatically:

Features¶

✅ Automatic health checks
✅ CPU, memory, disk monitoring
✅ Uptime tracking
✅ Application logs
✅ Deployment history

Setup¶

Deploy to Render following PRODUCTION_DEPLOYMENT_GUIDE.md
In Render dashboard, go to Service Settings
Configure Health Check:
HTTP Method: GET
Path: /api/ready
Port: 5000
Timeout: 30 seconds
Interval: 30-60 seconds
View metrics in Analytics tab:
CPU usage
Memory usage
Request count
Response times
Error rates

Alerting with Render¶

Go to Notifications in Render dashboard
Configure notification channels:
Email (free)
Slack (free)
PagerDuty (premium)
Create alert rules:
Service down (automated)
High error rate
High CPU usage
High memory usage

Option 2: Docker/Self-Hosted Monitoring with Prometheus¶

For self-hosted deployments using Docker:

Install Prometheus¶

Add Prometheus to docker-compose.yml:

prometheus:
  image: prom/prometheus:latest
  container_name: messwala-prometheus
  ports:
    - "9090:9090"
  volumes:
    - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
    - prometheus_data:/prometheus
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
  networks:
    - messwala-network

Create monitoring/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/alert-rules.yml

scrape_configs:
  - job_name: 'messwala-backend'
    static_configs:
      - targets: ['messwala-backend:5000']
    metrics_path: '/api/metrics'
    scrape_interval: 30s

  - job_name: 'messwala-frontend'
    static_configs:
      - targets: ['messwala-frontend:80']
    scrape_interval: 60s

Install Grafana (Visualization)¶

Add Grafana to docker-compose.yml:

grafana:
  image: grafana/grafana:latest
  container_name: messwala-grafana
  ports:
    - "3001:3000"
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=admin
    - GF_USERS_ALLOW_SIGN_UP=false
  volumes:
    - grafana_data:/var/lib/grafana
    - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
  networks:
    - messwala-network

Create Grafana datasource configuration:
Add Prometheus as data source
URL: http://prometheus:9090
Scrape interval: 15s
Import pre-built dashboards:
Node Exporter (system metrics)
Redis (cache metrics)
MongoDB (database metrics)

Install AlertManager¶

Add AlertManager to docker-compose.yml:

alertmanager:
  image: prom/alertmanager:latest
  container_name: messwala-alertmanager
  ports:
    - "9093:9093"
  volumes:
    - ./monitoring/alertmanager-config.yml:/etc/alertmanager/alertmanager-config.yml
    - alertmanager_data:/alertmanager
  command:
    - '--config.file=/etc/alertmanager/alertmanager-config.yml'
  networks:
    - messwala-network

Create monitoring/alert-rules.yml:

groups:
  - name: messwala
    rules:
      # High error rate (>5% error rate for 5 minutes)
      - alert: HighErrorRate
        expr: |
          (rate(messwala_errors_total[5m]) / rate(messwala_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # High CPU usage (>80%)
      - alert: HighCPUUsage
        expr: |
          container_cpu_usage_seconds_total{job="messwala-backend"} > 0.8
        for: 5m
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is {{ $value | humanizePercentage }}"

      # High memory usage (>80%)
      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes{job="messwala-backend"} / container_spec_memory_limit_bytes > 0.8
        for: 5m
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      # Database connection failures
      - alert: DatabaseConnectionFailure
        expr: |
          messwala_database_connection_errors_total > 10
        for: 2m
        annotations:
          summary: "Database connection failures"
          description: "{{ $value }} connection errors in 2 minutes"

      # Service down
      - alert: ServiceDown
        expr: |
          up{job="messwala-backend"} == 0
        for: 1m
        annotations:
          summary: "MessWala API service is down"
          description: "Service has been unreachable for 1 minute"

      # Notification queue buildup
      - alert: NotificationQueueBacklog
        expr: |
          messwala_notification_queue_size > 1000
        for: 10m
        annotations:
          summary: "Notification queue backing up"
          description: "{{ $value }} notifications pending"

      # Cache hit rate low
      - alert: LowCacheHitRate
        expr: |
          messwala_cache_hits / (messwala_cache_hits + messwala_cache_misses) < 0.5
        for: 15m
        annotations:
          summary: "Low cache hit rate"
          description: "Cache effectiveness is {{ $value | humanizePercentage }}"

Create monitoring/alertmanager-config.yml:

global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

receivers:
  - name: 'default'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    email_configs:
      - to: 'alerts@yourdomain.com'
        from: 'prometheus@yourdomain.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your-email@gmail.com'
        auth_password: 'your-app-password'

Start Monitoring Stack¶

# Add to docker-compose.yml and start
docker-compose up -d prometheus grafana alertmanager

# Access dashboards
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin/admin)
# AlertManager: http://localhost:9093

Option 3: External Monitoring with Datadog¶

Setup Datadog Agent¶

Create Datadog account at datadog.com
Add Datadog agent to docker-compose.yml:

datadog:
  image: datadog/agent:latest
  container_name: messwala-datadog
  environment:
    - DD_API_KEY=${DATADOG_API_KEY}
    - DD_SITE=datadoghq.com
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock:ro
    - /proc:/host/proc:ro
    - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
  networks:
    - messwala-network

Configure APM (Application Performance Monitoring):

// In backend/server.js, add at the top:
if (process.env.DD_ENABLED === 'true') {
  const tracer = require('dd-trace').init();
}

In Datadog dashboard:
Set up monitors for key metrics
Configure alert notifications
Create custom dashboards
Set up log aggregation

Option 4: Manual Application Logging & Alerting¶

Your application includes comprehensive logging support:

Application Logs¶

All logs are written to backend/logs/ directory:

logs/
├── application.log      # General application logs
├── error.log           # Error logs
├── performance.log     # Performance metrics
└── access.log          # HTTP access logs

Production Logging Setup¶

Configure log level in .env:

LOG_LEVEL=warn    # Production: warn, debug only in development

Send logs to external service:

Option A: Send to AWS CloudWatch

npm install winston-cloudwatch

Option B: Send to ELK Stack

npm install winston-elasticsearch

Option C: Send to Papertrail/Datadog

npm install winston-datadog-transport

Monitor Logs for Alerts¶

Create scripts to monitor logs for critical errors:

#!/bin/bash
# monitor-logs.sh - Alert on critical errors

tail -f backend/logs/error.log | while read line; do
  if [[ $line == *"CRITICAL"* ]] || [[ $line == *"FATAL"* ]]; then
    # Send alert
    curl -X POST https://slack-webhook.com \
      -d "{\"text\": \"⚠️ ALERT: $line\"}"
  fi
done

Application Metrics Available¶

Your application provides these metrics:

Performance Metrics¶

GET /api/metrics

Response:
{
  "uptime": 3600,
  "memory": {
    "rss": 100000000,
    "heapTotal": 50000000,
    "heapUsed": 30000000
  },
  "performance": {
    "avgResponseTime": 45,
    "p95ResponseTime": 120,
    "p99ResponseTime": 250,
    "requestsPerSecond": 10,
    "errorRate": 0.01
  }
}

System Health Summary¶

GET /api/admin/health-summary

Response:
{
  "status": "healthy",
  "cache": {
    "hits": 5000,
    "misses": 1000,
    "hitRate": "83.3%"
  },
  "jobQueue": {
    "pending": 5,
    "failed": 0,
    "processed": 10000
  },
  "rateLimiter": {
    "blockedRequests": 10,
    "totalRequests": 100000
  }
}

Recommended Alert Thresholds¶

These thresholds work well for MessWala:

Metric	Warning	Critical	Action
Error Rate	>2%	>5%	Investigate logs, scale service
CPU Usage	>70%	>90%	Scale up, optimize queries
Memory Usage	>75%	>90%	Restart, add memory, optimize
Response Time (p95)	>500ms	>2s	Check database, optimize queries
Database Latency	>100ms	>300ms	Check indexes, scale database
Notification Queue	>500 items	>2000 items	Check notification service
Cache Hit Rate	<60%	<40%	Adjust cache settings
Service Downtime	-	1+ minute	Auto-restart, page on-call

Monitoring Checklist¶

Daily Tasks¶

[ ] Check health endpoint responses
[ ] Review error logs for new patterns
[ ] Monitor request rates and response times
[ ] Verify backup completion

Weekly Tasks¶

[ ] Review performance trends
[ ] Check resource usage patterns
[ ] Analyze slow query logs
[ ] Review alert history
[ ] Update alert thresholds if needed

Monthly Tasks¶

[ ] Full health assessment report
[ ] Performance optimization review
[ ] Capacity planning check
[ ] Security audit of logs
[ ] Escalation policy review

Testing Monitoring¶

Simulate High Load¶

# Install artillery for load testing
npm install -g artillery

# Create load test config
cat > load-test.yml << 'EOF'
config:
  target: "https://your-backend-domain.com"
  phases:
    - duration: 60
      arrivalRate: 10

scenarios:
  - name: "API Load Test"
    flow:
      - get:
          url: "/api/health"
      - get:
          url: "/api/hostels"
      - post:
          url: "/api/meals"
          json:
            hostelId: "123"
EOF

# Run load test
artillery quick --count 100 --num 10 https://your-backend-domain.com/api/health

Test Alerts¶

# Trigger a test alert
curl -X POST https://your-domain.com/admin/test-alert

Production Monitoring Checklist¶

[ ] Health endpoints responding correctly
[ ] Monitoring dashboard accessible
[ ] Alert rules configured and tested
[ ] Notification channels working (Email, Slack, SMS)
[ ] Logs being collected and archived
[ ] Performance baselines established
[ ] Escalation procedures documented
[ ] On-call schedule configured
[ ] Historical metrics being retained
[ ] Dashboards showing real-time data

Support Resources¶

Render Docs: https://render.com/docs
Prometheus Docs: https://prometheus.io/docs
Grafana Docs: https://grafana.com/grafana/documentation/
Datadog Docs: https://docs.datadoghq.com
Health Check Best Practices: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

Document Version: 1.0
Status: ✅ Ready for Production
Last Updated: March 29, 2026