Overview

Purpose

This document presents the research, comparison, and architectural design of a Highly Available (HA) Centralized Logging System for our on-premise Kubernetes infrastructure.

The goal is to ensure that logs from all services (Node.js, Go, Python) are collected, aggregated, stored, and reliably queryable, even under partial network or node failures.


Background

In a distributed system like Kubernetes, log data is fragmented across nodes and ephemeral containers. Without a centralized logging architecture, it's difficult to debug, monitor, or audit systems reliably.

The problem is more acute in an on-premise environment, where we don't have access to managed cloud log services (e.g., Amazon CloudWatch, Google Cloud Logging). Therefore, we must design a solution that is:

  • Self-hosted
  • Fault-tolerant
  • Kubernetes-native
  • Scalable

Objectives

  • Evaluate and compare centralized logging stacks compatible with Kubernetes and on-premise deployments.
  • Design an architecture that avoids single points of failure.
  • Support structured logs (JSON) with labels and contextual fields.
  • Provide a Helm-based deployment method for ease of rollout.
  • Ensure log storage and retention are handled reliably on internal storage systems (e.g., MinIO, NFS, Ceph).
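
To make the structured-logging objective concrete, the following is a minimal sketch of how a Python service might emit one JSON object per line to stdout, where a node-level agent such as Promtail or Fluent Bit could pick it up. It uses only the standard library. The field names (`ts`, `level`, `service`, `msg`), the `context` key passed through `extra`, and the `checkout-api` service name are illustrative assumptions, not a schema defined by this document.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service  # assumed label identifying the emitting service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Contextual fields arrive via extra={"context": {...}} (a convention
        # assumed for this sketch, not a logging-module feature).
        entry.update(getattr(record, "context", {}))
        if record.exc_info:
            entry["exc"] = self.formatException(record.exc_info)
        # default=str keeps the logger from raising on non-JSON values.
        return json.dumps(entry, default=str)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout-api")  # hypothetical service name
    log.info(
        "order created",
        extra={"context": {"order_id": "A-1001", "latency_ms": 42}},
    )
```

Writing to stdout rather than to a file keeps the service unaware of the logging backend, so the same output could be shipped to any of the stacks under evaluation.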

Scope

  • Platform: On-premise Kubernetes
  • Agents: Promtail / Fluent Bit
  • Storage: MinIO, NFS, Ceph
  • Stacks considered: Loki, ELK Stack, Graylog
  • Out of scope: Managed cloud and SaaS logging services (e.g., Datadog, CloudWatch)

Success Criteria

  • Logging system survives one or more node failures
  • Logs from all services are collected consistently
  • Querying and alerting are responsive and reliable
  • Deployment is reproducible using Helm charts
  • Secure access and log retention policies are in place

Target Audience

This documentation is intended for:

  • DevOps / SRE engineers maintaining observability
  • Backend engineers needing to troubleshoot services
  • Infrastructure team responsible for log storage
  • Security team interested in log retention & auditing

Structure

The documentation consists of:

  1. Stack Comparison
  2. Architecture Design
  3. Deployment Strategy (Helm)
  4. Scaling & HA Practices
  5. Security & Access
  6. Monitoring & Alerting
  7. Backup & Restore Plan