Overview

Purpose

This document presents the research, comparison, and architectural design of a Highly Available (HA) Centralized Logging System for our on-premise Kubernetes infrastructure.

The goal is to ensure that logs from all services (Node.js, Go, Python) are reliably collected, aggregated, stored, and queryable, even under partial network or node failures.

Background

In a distributed system like Kubernetes, log data is fragmented across nodes and ephemeral containers. Without a centralized logging architecture, it's difficult to debug, monitor, or audit systems reliably.

This problem is more acute in an on-premise environment, where we don't have access to cloud-native log services (e.g., AWS CloudWatch, Google Cloud Logging). Therefore, we must design a solution that is:

Self-hosted
Fault-tolerant
Kubernetes-native
Scalable

Objectives

Evaluate and compare centralized logging stacks compatible with Kubernetes and on-premise deployments.
Design an architecture that avoids single points of failure.
Support structured logs (JSON) with labels and contextual fields.
Provide a Helm-based deployment method for ease of rollout.
Ensure log storage and retention are handled reliably on internal storage systems (MinIO, NFS, etc.).
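
To make the structured-logging objective concrete, here is a minimal sketch of a service emitting one JSON object per line using only the Python standard library. The field names (`ts`, `level`, `labels`, `context`), the label values, and the helper names `JsonFormatter`, `build_logger`, and `emit_sample` are illustrative assumptions, not a schema this document prescribes.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, labels):
        super().__init__()
        # Static labels attached to every line (hypothetical values;
        # in a real pod these would likely come from the environment).
        self.labels = dict(labels)

    def format(self, record):
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "labels": self.labels,
            # Per-call contextual fields passed via extra={"context": {...}}.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(name, labels, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output limited to our handler
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(labels))
    logger.addHandler(handler)
    return logger


def emit_sample():
    """Log one sample event into a buffer and return the raw line."""
    buffer = io.StringIO()
    logger = build_logger(
        "orders", {"service": "orders-api", "namespace": "prod"}, buffer
    )
    logger.info(
        "order created",
        extra={"context": {"order_id": "A-1001", "latency_ms": 42}},
    )
    return buffer.getvalue().strip()


if __name__ == "__main__":
    print(emit_sample())
```

In a real deployment the stream would typically be `sys.stdout`, so that a node-level agent such as Promtail or Fluent Bit can pick the lines up from the container log files. The in-memory buffer is used here only to keep the sketch self-contained.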

Scope

Platform: On-premise Kubernetes
Agents: Promtail / Fluent Bit
Storage: MinIO, NFS, Ceph
Stacks considered: Loki, ELK Stack, Graylog
Out of scope: Public cloud logging solutions (e.g., Datadog, CloudWatch)

Success Criteria

Logging system survives one or more node failures
Logs from all services are collected consistently
Querying and alerting are responsive and reliable
Deployment is reproducible using Helm charts
Secure access and log retention policies are in place

Target Audience

This documentation is intended for:

DevOps / SRE engineers maintaining observability
Backend engineers needing to troubleshoot services
Infrastructure team responsible for log storage
Security team interested in log retention & auditing

Structure

The documentation consists of:

Stack Comparison
Architecture Design
Deployment Strategy (Helm)
Scaling & HA Practices
Security & Access
Monitoring & Alerting
Backup & Restore Plan
