Overview
Purpose
This document presents the research, comparison, and architectural design of a Highly Available (HA) Centralized Logging System for our on-premise Kubernetes infrastructure.
The goal is to ensure that logs from all services (Node.js, Go, Python) are collected, aggregated, stored, and queryable reliably, even under partial network or node failures.
Background
In a distributed system like Kubernetes, log data is fragmented across nodes and ephemeral containers. Without a centralized logging architecture, it's difficult to debug, monitor, or audit systems reliably.
This issue becomes more critical in an on-premise environment, where we don't have access to managed cloud logging services (e.g., Amazon CloudWatch Logs, Google Cloud Logging). Therefore, we must design a solution that is:
- Self-hosted
- Fault-tolerant
- Kubernetes-native
- Scalable
Objectives
- Evaluate and compare centralized logging stacks compatible with Kubernetes and on-premise deployments.
- Design an architecture that avoids single points of failure.
- Support structured logs (JSON) with labels and contextual fields.
- Provide a Helm-based deployment method for ease of rollout.
- Ensure log storage and retention are handled reliably on internal storage systems (e.g., MinIO, NFS, Ceph).
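To make the structured-logging objective above concrete, here is a minimal sketch of a Python service emitting one JSON object per log line, using only the standard library. The field names (`ts`, `level`, `logger`, `message`), the `context` key passed through `extra`, and the `checkout` / `order_id` values are illustrative assumptions, not a schema defined by this document.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: contextual fields arrive via
        # extra={"context": {...}} and are merged into the top-level object.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


log = build_logger("checkout")
log.info(
    "order created",
    extra={"context": {"service": "checkout", "order_id": "A-1001"}},
)
```

Writing to stdout keeps the service unaware of the logging backend: the node-level agent (Promtail or Fluent Bit) is expected to pick up the container output and attach Kubernetes labels.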
Scope
- Platform: On-premise Kubernetes
- Agents: Promtail / Fluent Bit
- Storage: MinIO, NFS, Ceph
- Stacks considered: Loki, ELK Stack, Graylog
- Out of scope: Hosted and public cloud logging services (e.g., Datadog, CloudWatch)
Success Criteria
- Logging system survives one or more node failures
- Logs from all services are collected consistently
- Querying and alerting are responsive and reliable
- Deployment is reproducible using Helm charts
- Access controls and log retention policies are in place
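The node-failure criterion above can be tied to a replica count with some simple arithmetic. The sketch below assumes the chosen stack accepts writes only when a strict majority of replicas is reachable. That is an assumption made for illustration, not a property this document has established for any of the candidate stacks, and `tolerated_failures` is a hypothetical helper name.

```python
def tolerated_failures(replicas: int) -> int:
    """Number of replicas that can be lost while a strict majority survives.

    Assumes majority-quorum replication (an illustrative assumption).
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return (replicas - 1) // 2


for n in (1, 2, 3, 5):
    print(f"replicas={n} tolerated_failures={tolerated_failures(n)}")
```

Under this assumption, two replicas tolerate no more failures than one, so three replicas spread across three nodes is the smallest layout that survives a single node failure.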
Target Audience
This documentation is intended for:
- DevOps / SRE engineers maintaining observability
- Backend engineers needing to troubleshoot services
- Infrastructure team responsible for log storage
- Security team interested in log retention & auditing
Structure
The documentation consists of:
- Stack Comparison
- Architecture Design
- Deployment Strategy (Helm)
- Scaling & HA Practices
- Security & Access
- Monitoring & Alerting
- Backup & Restore Plan