EXPERT

Monitoring & Alerting

Monitor the health of the production cluster and detect problems proactively.

Level: Expert. This section requires production experience.

Decision Guide

Option	Best suited for	Example or rationale
Stack Monitoring (Kibana)	Elastic Cloud, quick setup	Managed cluster
Prometheus + Grafana	Existing infrastructure, custom dashboards	Multi-system monitoring
elasticsearch-exporter	Prometheus scrape target	Self-managed + Prometheus
Metricbeat	ES-native, Kibana dashboards	ELK-native monitoring
Alertmanager	Routing, silencing, and grouping alerts	PagerDuty escalation
Watcher (ES)	ES-native alerting	Legacy setup

Critical Metrics

Metric	Danger threshold	Action
Cluster status	YELLOW > 5 min	Investigate unassigned shards
Cluster status	RED	Respond immediately
JVM heap	> 85%	GC pressure: scale up
Disk usage	> 85%	Apply ILM, delete old data, or add disk
Search latency p99	> 500ms	Profile and optimize queries
Indexing rate drop	> 50%	Check for bulk rejections
Thread pool rejected	> 0	Check queue size or scale out
Circuit breaker	Trips	Reduce query complexity
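
The heap and disk thresholds in the table above can be checked directly against `_cat/nodes` output. The following is a minimal sketch, not a definitive implementation: the function name `flag_hot_nodes` and the `HEAP_MAX` / `DISK_MAX` variables are hypothetical, and it assumes node names contain no whitespace.

```shell
#!/bin/sh
# Flag nodes whose heap or disk usage exceeds the thresholds from the table.
# Input (stdin): one node per line, "<name> <heap.percent> <disk.used_percent>",
# as printed by _cat/nodes?h=name,heap.percent,disk.used_percent (no header row).
# HEAP_MAX and DISK_MAX are hypothetical overrides; both default to 85.
flag_hot_nodes() {
    awk -v heap_max="${HEAP_MAX:-85}" -v disk_max="${DISK_MAX:-85}" '
        $2 + 0 > heap_max { printf "%s heap %s%%\n", $1, $2 }
        $3 + 0 > disk_max { printf "%s disk %s%%\n", $1, $3 }
    '
}

# Usage against a live cluster (assumes the localhost URL used in this section):
#   curl -s "http://localhost:9200/_cat/nodes?h=name,heap.percent,disk.used_percent" | flag_hot_nodes
```

An empty output means no node is over either threshold, so the function can feed a cron job or a simple check script that only reports when something is printed.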

REST API

# Key monitoring endpoints
curl -s "http://localhost:9200/_cluster/health?pretty"
curl -s "http://localhost:9200/_cat/nodes?v&h=name,role,heap.percent,disk.used_percent,cpu,load_1m"
curl -s "http://localhost:9200/_cat/indices?v&h=index,health,pri,rep,docs.count,store.size&s=store.size:desc"
curl -s "http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc"
curl -s "http://localhost:9200/_nodes/stats/jvm,os,process?pretty"

# Pending tasks (cluster stability)
curl -s "http://localhost:9200/_cluster/pending_tasks?pretty"

# Hot threads (CPU debug)
curl -s "http://localhost:9200/_nodes/hot_threads"
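
The endpoints above return plain text that is easy to post-process in a script. As a rough sketch, the two helpers below turn a cluster status into an exit code and list thread pools with rejections. The function names are hypothetical, and the exit code convention (0/1/2/3) is an assumption borrowed from Nagios-style checks, not something Elasticsearch defines.

```shell
#!/bin/sh
# Map a cluster status string to an exit code:
# green -> 0 (OK), yellow -> 1 (WARNING), red -> 2 (CRITICAL), other -> 3 (UNKNOWN).
health_exit_code() {
    case "$1" in
        green)  return 0 ;;
        yellow) return 1 ;;
        red)    return 2 ;;
        *)      return 3 ;;
    esac
}

# Print thread pools whose rejected counter is non-zero.
# Input (stdin): "<node_name> <name> <active> <queue> <rejected>" per line,
# as printed by _cat/thread_pool?h=node_name,name,active,queue,rejected (no header row).
rejected_pools() {
    awk '$5 + 0 > 0 { printf "%s/%s rejected=%s queue=%s\n", $1, $2, $5, $4 }'
}

# Usage against a live cluster (assumes the localhost URL used in this section):
#   status=$(curl -s "http://localhost:9200/_cat/health?h=status" | tr -d '[:space:]')
#   health_exit_code "$status"
#   curl -s "http://localhost:9200/_cat/thread_pool?h=node_name,name,active,queue,rejected" | rejected_pools
```

Note that `rejected` is a cumulative counter since node start, so a non-zero value shows that rejections happened at some point, not that they are happening now. Comparing two samples taken a minute apart gives a better picture.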

.NET Client

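// Health check for ASP.NET Core
using Elastic.Clients.Elasticsearch;
using Microsoft.Extensions.Diagnostics.HealthChecks;
// Both namespaces define a HealthStatus type; alias the Elasticsearch one to avoid ambiguity.
using EsHealthStatus = Elastic.Clients.Elasticsearch.HealthStatus;

public class ElasticsearchHealthCheck : IHealthCheck
{
    private readonly ElasticsearchClient _client;

    public ElasticsearchHealthCheck(ElasticsearchClient client) => _client = client;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken ct = default)
    {
        try
        {
            // Ask the cluster for its health; give up after 5 seconds.
            var response = await _client.Cluster.HealthAsync(h => h
                .Timeout(TimeSpan.FromSeconds(5)), ct);

            if (!response.IsValidResponse)
                return HealthCheckResult.Unhealthy("ES unreachable");

            // Map cluster color to ASP.NET health status: green/yellow/red -> Healthy/Degraded/Unhealthy.
            return response.Status switch
            {
                EsHealthStatus.Green => HealthCheckResult.Healthy("Cluster green"),
                EsHealthStatus.Yellow => HealthCheckResult.Degraded(
                    "Cluster yellow: " + response.UnassignedShards + " unassigned"),
                _ => HealthCheckResult.Unhealthy("Cluster RED!")
            };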
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("ES connection failed", ex);
        }
    }
}

// Register in DI
builder.Services.AddHealthChecks()
    .AddCheck<ElasticsearchHealthCheck>("elasticsearch");

Example: In production, monitor ES with Prometheus + Grafana, tracking JVM heap, GC pauses, indexing rate, search latency, and thread pool rejections. PagerDuty alerts: cluster RED = P1 incident, heap > 90% = P2.

Grafana Dashboard (Ready-to-Import)

Elasticsearch Cluster Dashboard JSON (4 Panels)

{
  "dashboard": {
    "title": "Elasticsearch Cluster Overview",
    "tags": ["elasticsearch", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Cluster Health Status",
        "type": "stat",
        "gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
        "targets": [{
          "expr": "elasticsearch_cluster_health_status{color=\"green\"}",
          "legendFormat": "Green"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "color": "red", "value": 0 },
                { "color": "green", "value": 1 }
              ]
            },
            "mappings": [
              { "type": "value", "options": { "0": { "text": "RED/YELLOW" }, "1": { "text": "GREEN" } } }
            ]
          }
        },
        "description": "Cluster health. GREEN=all shards assigned. YELLOW=replicas missing. RED=primaries missing."
      },
      {
        "title": "JVM Heap Usage (%)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
        "targets": [{
          "expr": "elasticsearch_jvm_memory_used_bytes{area=\"heap\"} / elasticsearch_jvm_memory_max_bytes{area=\"heap\"} * 100",
          "legendFormat": "{{name}}"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "yellow", "value": 75 },
                { "color": "red", "value": 85 }
              ]
            }
          }
        },
        "description": "JVM heap per node. Alert threshold: >85% sustained = GC pressure. >90% = OOM risk. Max 30GB heap."
      },
      {
        "title": "Indexing & Search Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
        "targets": [
          {
            "expr": "rate(elasticsearch_indices_indexing_index_total[5m])",
            "legendFormat": "Indexing/s {{name}}"
          },
          {
            "expr": "rate(elasticsearch_indices_search_query_total[5m])",
            "legendFormat": "Search/s {{name}}"
          }
        ],
        "fieldConfig": { "defaults": { "unit": "ops" } },
        "description": "Indexing and search operations per second. Sudden drops indicate bulk rejections or circuit breaker trips."
      },
      {
        "title": "Thread Pool Rejections",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 12 },
        "targets": [
          {
            "expr": "rate(elasticsearch_thread_pool_rejected_count{type=~\"search|write|bulk\"}[5m])",
            "legendFormat": "{{type}} rejected {{name}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "ops",
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "red", "value": 1 }
              ]
            }
          }
        },
        "description": "Thread pool rejections indicate overload. Any rejection >0 = capacity issue. Scale up or reduce load."
      }
    ],
    "time": { "from": "now-1h", "to": "now" },
    "refresh": "30s"
  }
}

Alert thresholds (Prometheus rules):

Metric	Warning	Critical	Action
`elasticsearch_cluster_health_status{color="red"}`	—	== 1 for 1m	P1: immediate response
`jvm_heap_percent`	> 80% for 5m	> 90% for 2m	Scale up / reduce load
`thread_pool_rejected`	> 0 for 1m	> 10/s for 1m	Queue size / scale nodes
`disk_used_percent`	> 80%	> 85%	Delete / add disk / ILM
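
The warning and critical tiers in the table can be illustrated with a small classifier. This is a sketch under stated assumptions: `severity` is a hypothetical helper name, the comparison is strictly greater-than as in the table, and it evaluates a single sample only. The "for 5m" style durations are not modeled here; in Prometheus they would be expressed with the `for` clause of an alerting rule.

```shell
#!/bin/sh
# severity VALUE WARN CRIT
# Prints "critical" if VALUE > CRIT, "warning" if VALUE > WARN, otherwise "ok".
# awk is used so that decimal values such as 84.5 compare correctly.
severity() {
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v + 0 > c + 0) { print "critical" }
        else if (v + 0 > w + 0) { print "warning" }
        else { print "ok" }
    }'
}

# Examples using the thresholds from the table:
#   severity 92 80 90   # heap at 92% against warning 80 / critical 90
#   severity 83 80 85   # disk at 83% against warning 80 / critical 85
```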