Elasticsearch Handbook

Monitoring & Alerting

Monitor the health of a production cluster and detect problems proactively.

Level: Expert. This section assumes production experience.

[Diagram: monitoring pipeline. ES Cluster (_nodes/stats, _cluster/health, _cat/thread_pool, JVM metrics, slow logs, index stats) → Exporter (elasticsearch-exporter with metrics on :9114, Metricbeat, Stack Monitoring) → Prometheus (TSDB storage, PromQL queries, alert rules, 15s scrape) → Grafana (dashboards, visualize + explore) and Alertmanager (route + silence, PagerDuty/Slack) → SRE / NOC on-call (P1/P2)]

Decision Guide

Option | Suitable for | Example or rationale
Stack Monitoring (Kibana) | Elastic Cloud, quick setup | Managed cluster
Prometheus + Grafana | Existing infra, custom dashboards | Multi-system monitoring
elasticsearch-exporter | Prometheus scrape target | Self-managed + Prometheus
Metricbeat | ES-native, Kibana dashboards | ELK-native monitoring
Alertmanager | Route/silence/group alerts | PagerDuty escalation
Watcher (ES) | ES-native alerting | Legacy setup

Critical Metrics

Metric | Danger threshold | Action
Cluster status | YELLOW > 5 min | Investigate unassigned shards
Cluster status | RED | Immediate response
JVM heap | > 85% | GC pressure; scale up
Disk usage | > 85% | ILM / delete data / add disk
Search latency p99 | > 500ms | Profile + optimize
Indexing rate | Drop > 50% | Check for bulk rejections
Thread pool rejected | > 0 | Increase queue size / scale
Circuit breaker | Trips | Reduce query complexity
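As a rough illustration of how the heap and disk thresholds above could be checked from the command line, the sketch below filters `_cat/nodes` output. The helper name `flag_nodes` is hypothetical, and it assumes node names contain no spaces and that the input has the three columns `name`, `heap.percent`, `disk.used_percent` with no header row.

```shell
#!/usr/bin/env bash
# flag_nodes [LIMIT]
# Reads lines of "name heap.percent disk.used_percent" on stdin and prints
# one line per value above LIMIT (default 85, the danger threshold in the table).
# flag_nodes is an illustrative helper, not part of any Elasticsearch tooling.
flag_nodes() {
  awk -v limit="${1:-85}" '
    $2 + 0 > limit { print $1 ": heap " $2 "% above " limit "%" }
    $3 + 0 > limit { print $1 ": disk " $3 "% above " limit "%" }
  '
}
```

One possible use, assuming a cluster reachable on localhost: `curl -s "http://localhost:9200/_cat/nodes?h=name,heap.percent,disk.used_percent" | flag_nodes`.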
REST API
# Key monitoring endpoints
curl -s "http://localhost:9200/_cluster/health?pretty"
curl -s "http://localhost:9200/_cat/nodes?v&h=name,role,heap.percent,disk.used_percent,cpu,load_1m"
curl -s "http://localhost:9200/_cat/indices?v&h=index,health,pri,rep,docs.count,store.size&s=store.size:desc"
curl -s "http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc"
curl -s "http://localhost:9200/_nodes/stats/jvm,os,process?pretty"

# Pending tasks (cluster stability)
curl -s "http://localhost:9200/_cluster/pending_tasks?pretty"

# Hot threads (CPU debug)
curl -s "http://localhost:9200/_nodes/hot_threads"
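The health endpoint can also drive a simple scripted check. The following is a minimal sketch, assuming the Nagios-style convention of 0 = OK, 1 = warning, 2 = critical, 3 = unknown; the function names and the `ES_URL` variable are illustrative, not part of Elasticsearch. It reads the plain-text `status` column from `_cat/health` so that no JSON parser is needed.

```shell
#!/usr/bin/env bash
# ES_URL is an assumed environment variable; it defaults to the handbook's localhost URL.
ES_URL="${ES_URL:-http://localhost:9200}"

# Map a cluster status string to a Nagios-style code (hypothetical helper).
status_to_code() {
  case "$1" in
    green)  echo 0 ;;
    yellow) echo 1 ;;
    red)    echo 2 ;;
    *)      echo 3 ;;   # empty or unexpected value: cluster unreachable or unknown
  esac
}

# Query the cluster once and return the mapped code.
# Defined here but not invoked, so sourcing this file makes no network call.
check_cluster() {
  local status
  status="$(curl -s --max-time 5 "$ES_URL/_cat/health?h=status" | tr -d '[:space:]')"
  echo "cluster status: ${status:-unknown}"
  return "$(status_to_code "$status")"
}
```

Calling `check_cluster` from cron or a monitoring agent would then surface YELLOW and RED as non-zero exit codes.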
.NET Client
// Health check for ASP.NET Core
using Elastic.Clients.Elasticsearch;
using Microsoft.Extensions.Diagnostics.HealthChecks;
// Alias avoids a name clash with Microsoft.Extensions.Diagnostics.HealthChecks.HealthStatus
using EsHealthStatus = Elastic.Clients.Elasticsearch.HealthStatus;

public class ElasticsearchHealthCheck : IHealthCheck
{
    private readonly ElasticsearchClient _client;

    public ElasticsearchHealthCheck(ElasticsearchClient client) => _client = client;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken ct = default)
    {
        try
        {
            // Ask the cluster for its health, giving up after 5 seconds
            var response = await _client.Cluster.HealthAsync(h => h
                .Timeout(TimeSpan.FromSeconds(5)), ct);

            if (!response.IsValidResponse)
                return HealthCheckResult.Unhealthy("ES unreachable");

            // Map cluster colour to ASP.NET Core health states
            return response.Status switch
            {
                EsHealthStatus.Green => HealthCheckResult.Healthy("Cluster green"),
                EsHealthStatus.Yellow => HealthCheckResult.Degraded(
                    $"Cluster yellow: {response.UnassignedShards} unassigned"),
                _ => HealthCheckResult.Unhealthy("Cluster RED!")
            };
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("ES connection failed", ex);
        }
    }
}

// Register in DI
builder.Services.AddHealthChecks()
    .AddCheck<ElasticsearchHealthCheck>("elasticsearch");

Example: in production, monitor ES with Prometheus + Grafana: JVM heap, GC pause, indexing rate, search latency, thread pool rejections. PagerDuty alerts: cluster RED = P1 incident, heap > 90% = P2.
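The priority mapping in this example can be sketched as a small shell function. The name `incident_priority` and the `none` fallback are illustrative assumptions; only the two rules (RED = P1, heap > 90% = P2) come from the example itself.

```shell
#!/usr/bin/env bash
# incident_priority STATUS HEAP_PERCENT
# Prints P1 for a RED cluster, P2 for heap above 90%, otherwise "none".
# Hypothetical helper; a real setup would usually encode this in alert labels.
incident_priority() {
  local status="$1" heap="$2"
  if [ "$status" = "red" ]; then
    echo "P1"
  elif awk -v h="$heap" 'BEGIN { exit !(h > 90) }'; then
    # awk handles decimal heap values that [ -gt ] would reject
    echo "P2"
  else
    echo "none"
  fi
}
```

RED is checked first so that a cluster which is both RED and heap-saturated is still paged as P1.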

Grafana Dashboard (Ready-to-Import)

Elasticsearch Cluster Dashboard JSON (4 Panels)
{
  "dashboard": {
    "title": "Elasticsearch Cluster Overview",
    "tags": ["elasticsearch", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Cluster Health Status",
        "type": "stat",
        "gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
        "targets": [{
          "expr": "elasticsearch_cluster_health_status{color=\"green\"}",
          "legendFormat": "Green"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "color": "red", "value": 0 },
                { "color": "green", "value": 1 }
              ]
            },
            "mappings": [
              { "type": "value", "options": { "0": { "text": "RED/YELLOW" }, "1": { "text": "GREEN" } } }
            ]
          }
        },
        "description": "Cluster health. GREEN=all shards assigned. YELLOW=replicas missing. RED=primaries missing."
      },
      {
        "title": "JVM Heap Usage (%)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
        "targets": [{
          "expr": "elasticsearch_jvm_memory_used_bytes{area=\"heap\"} / elasticsearch_jvm_memory_max_bytes{area=\"heap\"} * 100",
          "legendFormat": "{{name}}"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "yellow", "value": 75 },
                { "color": "red", "value": 85 }
              ]
            }
          }
        },
        "description": "JVM heap per node. Alert threshold: >85% sustained = GC pressure. >90% = OOM risk. Max 30GB heap."
      },
      {
        "title": "Indexing & Search Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
        "targets": [
          {
            "expr": "rate(elasticsearch_indices_indexing_index_total[5m])",
            "legendFormat": "Indexing/s {{name}}"
          },
          {
            "expr": "rate(elasticsearch_indices_search_query_total[5m])",
            "legendFormat": "Search/s {{name}}"
          }
        ],
        "fieldConfig": { "defaults": { "unit": "ops" } },
        "description": "Indexing and search operations per second. Sudden drops indicate bulk rejections or circuit breaker trips."
      },
      {
        "title": "Thread Pool Rejections",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 12 },
        "targets": [
          {
            "expr": "rate(elasticsearch_thread_pool_rejected_count{type=~\"search|write|bulk\"}[5m])",
            "legendFormat": "{{type}} rejected {{name}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "ops",
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "red", "value": 1 }
              ]
            }
          }
        },
        "description": "Thread pool rejections indicate overload. Any rejection >0 = capacity issue. Scale up or reduce load."
      }
    ],
    "time": { "from": "now-1h", "to": "now" },
    "refresh": "30s"
  }
}
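Because the JSON above is wrapped in a top-level `dashboard` key, it matches the payload shape of Grafana's `POST /api/dashboards/db` HTTP API. The sketch below shows one way to upload it. `GRAFANA_URL`, `GRAFANA_TOKEN`, the function name, and the file name in the usage note are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Assumed settings: adjust to your Grafana instance and API token.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

# import_dashboard FILE
# Uploads a dashboard JSON file through Grafana's dashboard HTTP API.
# Hypothetical helper; it only checks that the file exists before posting.
import_dashboard() {
  local file="$1"
  if [ ! -f "$file" ]; then
    echo "dashboard file not found: $file" >&2
    return 1
  fi
  curl -sS -X POST "$GRAFANA_URL/api/dashboards/db" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${GRAFANA_TOKEN:-}" \
    --data-binary "@$file"
}
```

For example, after saving the JSON as `es-dashboard.json` (a hypothetical file name): `GRAFANA_TOKEN=... import_dashboard es-dashboard.json`.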

Alert thresholds (Prometheus rules):

Metric | Warning | Critical | Action
elasticsearch_cluster_health_status{color="red"} | - | == 1 for 1m | P1: immediate response
jvm_heap_percent | > 80% for 5m | > 90% for 2m | Scale up / reduce load
thread_pool_rejected | > 0 for 1m | > 10/s for 1m | Queue size / scale nodes
disk_used_percent | > 80% | > 85% | Delete / add disk / ILM
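The cluster, heap, and thread pool rows of this table could be turned into a Prometheus rules file roughly as follows. This is a sketch, not a definitive rule set: the file name, group name, alert names, and `severity` labels are assumptions, and the expressions reuse the exporter metric names from the dashboard above. The disk row is left out because the table gives no duration for it.

```shell
#!/usr/bin/env bash
# Write an example rules file; RULES_FILE is an assumed, overridable path.
RULES_FILE="${RULES_FILE:-es_alerts.yml}"

# Quoted 'EOF' keeps the shell from expanding $labels inside the template.
cat > "$RULES_FILE" <<'EOF'
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster status is RED (P1, immediate response)"
      - alert: ElasticsearchHeapHigh
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap above 80% on {{ $labels.name }}"
      - alert: ElasticsearchHeapCritical
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "JVM heap above 90% on {{ $labels.name }}"
      - alert: ElasticsearchThreadPoolRejections
        expr: rate(elasticsearch_thread_pool_rejected_count[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Thread pool rejections on {{ $labels.name }}"
      - alert: ElasticsearchThreadPoolRejectionsCritical
        expr: rate(elasticsearch_thread_pool_rejected_count[5m]) > 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "More than 10 rejections/s on {{ $labels.name }}"
EOF
echo "wrote $RULES_FILE"
```

If `promtool` is installed, `promtool check rules es_alerts.yml` can validate the file before Prometheus loads it.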