Architecture Design
This document details the architecture of a Highly Available Centralized Logging System based on the Grafana Loki Stack, deployed on an on-premise Kubernetes cluster.
1. Objectives of the Architecture
- Achieve end-to-end log traceability for services (Node.js, Go, Python)
- Ensure High Availability (HA) of log collection, storage, and querying
- Design for on-premise deployment: no dependency on external cloud services
- Use open source, Kubernetes-native tools
- Enable log search, dashboarding, and alerting
- Allow easy scaling and fault recovery
2. Architectural Components
🔹 2.1. Log Shipper: Promtail / Fluent Bit
- Deployment: DaemonSet (one per node)
- Function:
- Tails logs from /var/log/containers/*.log
- Enriches logs with metadata: namespace, pod, container, labels
- Sends logs to Loki Distributor via HTTP Push
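To make the shipper's role concrete, here is a minimal sketch of the enrich-and-push step. It assumes the kubelet's usual file naming of `<pod>_<namespace>_<container>-<container id>.log` and builds the JSON body accepted by Loki's `/loki/api/v1/push` endpoint. The function names are hypothetical. A real shipper such as Promtail or Fluent Bit gets richer metadata (for example pod labels) from the Kubernetes API rather than from the file name alone.

```python
import json
import re
import time
from typing import Optional

# Assumed kubelet naming: <pod>_<namespace>_<container>-<64 hex container id>.log
_LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)"
    r"-(?P<container_id>[0-9a-f]{64})\.log$"
)


def labels_from_path(path: str) -> dict:
    """Derive namespace/pod/container labels from a /var/log/containers file name."""
    name = path.rsplit("/", 1)[-1]
    match = _LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name}")
    return {
        "namespace": match["namespace"],
        "pod": match["pod"],
        "container": match["container"],
    }


def build_push_payload(labels: dict, lines: list, ts_ns: Optional[int] = None) -> str:
    """Build the JSON body for a POST to Loki's /loki/api/v1/push endpoint."""
    ts_ns = time.time_ns() if ts_ns is None else ts_ns
    # Each entry is a [nanosecond timestamp as string, log line] pair.
    values = [[str(ts_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})
```

The resulting string would be sent with `Content-Type: application/json` to the Distributor's Service address.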
🔹 2.2. Loki Components
a. Distributor
- Type: Deployment (≥ 2 replicas)
- Responsibility: Accepts incoming logs and routes them to available ingesters using consistent hashing
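The routing idea can be illustrated with a toy consistent-hash ring. This is a simplified sketch, not Loki's actual ring implementation: the class name, the token count, and the choice of SHA-256 are all assumptions made for illustration, and Loki's real ring also tracks ingester state.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 32-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


class HashRing:
    """Toy consistent-hash ring in which each ingester owns several tokens."""

    def __init__(self, ingesters, tokens_per_ingester=64):
        self._ring = sorted(
            (_hash(f"{name}#{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._tokens = [token for token, _ in self._ring]

    def lookup(self, stream_key: str, replicas: int = 1) -> list:
        """Walk clockwise from the key's hash, collecting distinct ingesters."""
        start = bisect.bisect_right(self._tokens, _hash(stream_key))
        owners = []
        for offset in range(len(self._ring)):
            name = self._ring[(start + offset) % len(self._ring)][1]
            if name not in owners:
                owners.append(name)
            if len(owners) == replicas:
                break
        return owners
```

Because each ingester's tokens depend only on its own name, adding or removing one ingester moves only the streams adjacent to its tokens; all other streams keep their owner.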
b. Ingester
- Type: StatefulSet with Persistent Volume (PVC)
- Responsibility: Buffers incoming logs in memory and on local disk (the PVC), then flushes them as chunks to long-term object storage (MinIO/S3)
c. Querier
- Type: Deployment (≥ 2 replicas)
- Responsibility: Executes queries against object storage (and against ingesters for recent data not yet flushed) and returns search results to Grafana
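As an illustration of the read path, the sketch below builds the kind of request a client such as Grafana sends to Loki's `/loki/api/v1/query_range` endpoint. The Service host name is hypothetical; 3100 is Loki's default HTTP port. The helper name and default limit are also assumptions.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster address; 3100 is Loki's default HTTP port.
LOKI_URL = "http://loki-querier.logging.svc:3100"


def query_range_url(logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a URL for a GET request to Loki's /loki/api/v1/query_range."""
    params = {
        "query": logql,            # LogQL expression, e.g. a label selector plus filter
        "start": start_ns,         # range start as a nanosecond Unix timestamp
        "end": end_ns,             # range end as a nanosecond Unix timestamp
        "limit": limit,            # maximum number of entries to return
        "direction": "backward",   # newest entries first
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"
```

For example, `query_range_url('{namespace="prod"} |= "error"', start_ns, end_ns)` would request lines containing "error" from every stream in the `prod` namespace.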
d. Compactor / Index Gateway (Optional but recommended for performance)
🔹 2.3. MinIO (S3-compatible)
- Deployment: StatefulSet (4+ nodes recommended for redundancy)
- Responsibility: Long-term object storage backend for Loki
- Alternative: Ceph with S3 gateway
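The trade-off between redundancy and usable capacity in an erasure-coded store can be worked out with simple arithmetic. The sketch below is illustrative only: the parity level and drive size are assumptions rather than values from this design, and it models a single erasure set without covering write-quorum rules.

```python
def erasure_profile(total_drives: int, parity: int, drive_tb: float) -> dict:
    """Summarise capacity and read tolerance for one erasure set."""
    if not 0 < parity <= total_drives // 2:
        raise ValueError("parity must be between 1 and half the drive count")
    data = total_drives - parity
    return {
        "data_shards": data,
        "parity_shards": parity,
        "usable_tb": data * drive_tb,
        "storage_efficiency": data / total_drives,
        # An object stays readable while at least `data` shards survive.
        "tolerated_drive_failures_for_reads": parity,
    }
```

With four 1 TB drives and an assumed parity of 2, this gives 2 TB usable (50% efficiency) while objects remain readable after losing up to two drives.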
🔹 2.4. Grafana
- Deployment: Deployment + PVC (optional)
- Responsibility:
- Queries Loki via querier
- Provides dashboards and alerting
- Enforces RBAC and folder-based access control
3. High Availability Strategy
Component | Strategy |
---|---|
Promtail | DaemonSet (one per node, tolerates pod loss) |
Distributor | ≥ 2 replicas, load balanced via Service |
Ingester | StatefulSet + PVC + anti-affinity |
Querier | ≥ 2 replicas, stateless, scalable |
MinIO | Distributed mode, data redundancy via erasure coding |
Grafana | Stateless + optional PVC for dashboards |
Services | Kubernetes Service (ClusterIP / LoadBalancer) |
Ingress | NGINX Ingress Controller with TLS termination, exposed through a MetalLB LoadBalancer address |