Kubernetes Architecture

This document provides a deep dive into the architecture of Kubernetes (K8s), the leading open-source system for container orchestration. The goal is to help experienced developers understand the internal workings, major components, data flow, and key design decisions of the Kubernetes codebase. We’ll focus on the high-level structure and critical interactions, referencing specific files and directories where appropriate.

High-Level Architecture

Kubernetes follows a distributed systems architecture with distinct control plane and data plane components. The control plane manages the cluster state, while the data plane executes workloads on individual nodes. Below is a simplified view of the major components and their interactions.

graph TD
    subgraph Control Plane
        API[API Server]
        CM[Controller Manager]
        SCH[Scheduler]
        ETCD[(etcd)]
    end

    subgraph Node 1
        K1[Kubelet]
        P1[Pod 1]
        P2[Pod 2]
        KP1[Kube-Proxy]
    end

    subgraph Node 2
        K2[Kubelet]
        P3[Pod 3]
        P4[Pod 4]
        KP2[Kube-Proxy]
    end

    API -->|Stores State| ETCD
    CM -->|Updates| API
    SCH -->|Binds Pods| API
    API -->|Pod Assignments| K1
    API -->|Pod Assignments| K2
    K1 -->|Reports| API
    K2 -->|Reports| API
    K1 --> P1
    K1 --> P2
    K2 --> P3
    K2 --> P4
    KP1 -.-> P1
    KP1 -.-> P2
    KP2 -.-> P3
    KP2 -.-> P4

This diagram illustrates the separation of concerns between the control plane (centralized management) and the node-level components (workload execution). The API Server acts as the central hub for communication, while etcd persists the cluster state. Nodes run workloads in Pods, managed by the Kubelet, with networking handled by Kube-Proxy. Note that neither the Scheduler nor the Controller Manager talks to nodes directly: they write decisions to the API Server, and the Kubelets pick them up from there.

Component Breakdown

1. API Server (cmd/kube-apiserver)

  • Responsibility: The API Server is the central management interface for the Kubernetes cluster. It exposes the Kubernetes API, handles RESTful requests, and updates the state of cluster objects in etcd. It authenticates and authorizes requests, acting as the gateway for all interactions (a minimal client interaction is sketched after this list).
  • Key Files/Directories:
    • cmd/kube-apiserver/apiserver.go: Entry point for the API Server, initializing the server with configuration and starting it.
    • pkg/apis: Defines the core API types and structures for Kubernetes resources.
  • Interfaces:
    • Communicates with etcd for state persistence.
    • Interacts with the Controller Manager and Scheduler for state updates and pod assignments.
    • Receives status updates from Kubelets on nodes.
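
Every other component is a client of this API. As a concrete illustration, here is a minimal sketch using client-go that lists Pods the same way built-in components read cluster state; the kubeconfig path and namespace are assumptions for a local setup, not anything mandated by the API Server:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path is an assumption;
	// in-cluster callers would use rest.InClusterConfig instead).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Every read goes through the API Server's REST endpoint
	// (GET /api/v1/namespaces/default/pods); etcd is never contacted directly.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Println(p.Name, p.Status.Phase)
	}
}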

2. Controller Manager (cmd/kube-controller-manager)

  • Responsibility: Runs controller processes that regulate the state of the cluster. Controllers like the Replication Controller, Node Controller, and Service Account Controller continuously reconcile the desired state (as defined in API objects) with the actual state (a schematic of this loop follows the list below).
  • Key Files/Directories:
    • cmd/kube-controller-manager/controller-manager.go: Main entry point for starting the controller manager.
    • pkg/controller: Contains implementations of various controllers (e.g., pkg/controller/replicaset for managing ReplicaSets).
  • Interfaces:
    • Queries and updates the API Server to monitor and adjust cluster resources.
    • Works indirectly with Kubelets via the API Server to enforce desired pod states.
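
The controllers differ in what they manage, but nearly all share the same shape. A schematic sketch of that loop follows; the type and its fields are illustrative stand-ins, not the real pkg/controller code:

package main

import "fmt"

// replicaReconciler is an illustrative stand-in for a controller that
// keeps the number of running Pods equal to a desired replica count.
type replicaReconciler struct {
	desired int // from the API object's spec
	actual  int // from the observed cluster state
}

// reconcile compares desired and actual state and issues the minimal
// corrective action; real controllers do this via API Server calls.
func (r *replicaReconciler) reconcile() {
	switch diff := r.desired - r.actual; {
	case diff > 0:
		fmt.Printf("creating %d pod(s)\n", diff)
		r.actual += diff
	case diff < 0:
		fmt.Printf("deleting %d pod(s)\n", -diff)
		r.actual += diff
	default:
		fmt.Println("in sync; nothing to do")
	}
}

func main() {
	r := replicaReconciler{desired: 3, actual: 1}
	r.reconcile() // creating 2 pod(s)
	r.reconcile() // in sync; nothing to do
}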

3. Scheduler (cmd/kube-scheduler)

  • Responsibility: Assigns Pods to nodes based on resource requirements, hardware/software constraints, and user-defined policies. It evaluates node suitability with filtering and scoring passes followed by a binding step (a condensed version is sketched after this list).
  • Key Files/Directories:
    • cmd/kube-scheduler/scheduler.go: Entry point for the scheduler binary.
    • pkg/scheduler: Core scheduling logic, including filtering and scoring plugins.
  • Interfaces:
    • Reads pod and node information from the API Server.
    • Binds Pods to nodes by updating the API Server state, which Kubelets then act upon.
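
A condensed sketch of that two-phase decision follows: filter out infeasible nodes, then score the survivors and pick the best. The types here are illustrative stand-ins, not the real framework types in pkg/scheduler:

package main

import "fmt"

type node struct {
	name         string
	freeMilliCPU int64
}

// schedule mimics the filter-then-score pass: predicates eliminate nodes
// that cannot fit the Pod, and the highest-scoring survivor wins.
func schedule(requestMilliCPU int64, nodes []node) (string, bool) {
	best, bestScore := "", int64(-1)
	for _, n := range nodes {
		if n.freeMilliCPU < requestMilliCPU {
			continue // filter: node cannot fit the request
		}
		if score := n.freeMilliCPU - requestMilliCPU; score > bestScore {
			best, bestScore = n.name, score // score: most free CPU wins
		}
	}
	return best, bestScore >= 0
}

func main() {
	nodes := []node{{"node-1", 500}, {"node-2", 2000}}
	if name, ok := schedule(750, nodes); ok {
		// The real scheduler records this decision by POSTing a Binding
		// object to the API Server rather than contacting the node.
		fmt.Println("bind pod to", name)
	}
}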

4. Kubelet (cmd/kubelet)

  • Responsibility: The primary agent running on each node, responsible for managing Pods and their containers. It ensures containers are running as expected, handles pod lifecycle events, and reports node status back to the API Server (a minimal CRI round-trip is sketched after this list).
  • Key Files/Directories:
    • cmd/kubelet/kubelet.go: Main entry point for the Kubelet binary.
    • pkg/kubelet: Core Kubelet logic, including pod management (pkg/kubelet/pod/pod_manager.go) and lifecycle handling (pkg/kubelet/active_deadline.go).
  • Interfaces:
    • Communicates with the API Server to receive pod assignments and report status.
    • Interacts with container runtimes (e.g., containerd, CRI-O) via the Container Runtime Interface (CRI).
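
The CRI is a gRPC API, so the Kubelet's runtime interaction can be demonstrated directly. Below is a minimal sketch using the published k8s.io/cri-api definitions; the containerd socket path is an assumption, and other runtimes expose different endpoints:

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Dial the runtime's CRI socket (containerd's default path is assumed).
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// The Kubelet speaks this same gRPC API to create sandboxes and
	// containers; Version is the simplest round-trip to demonstrate.
	client := runtimeapi.NewRuntimeServiceClient(conn)
	resp, err := client.Version(ctx, &runtimeapi.VersionRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("runtime: %s %s\n", resp.RuntimeName, resp.RuntimeVersion)
}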

5. Kube-Proxy (cmd/kube-proxy)

  • Responsibility: Runs on each node to manage network rules, enabling communication between Pods, Services, and external traffic. It implements the Service abstraction by maintaining iptables or IPVS rules for load balancing (a toy version of this mapping is sketched after this list).
  • Key Files/Directories:
    • cmd/kube-proxy: Entry point for the proxy binary.
    • pkg/proxy: Contains the iptables, IPVS, and other proxying backends.
  • Interfaces:
    • Reads Service and EndpointSlice information from the API Server.
    • Configures network rules to route traffic to Pods.
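
Conceptually, kube-proxy maintains a mapping from a stable Service virtual IP to a changing set of Pod backends. A toy userspace-style version of that mapping is sketched below (the IPs are made up; the real iptables mode encodes this as probabilistic DNAT rules, and IPVS delegates to kernel-level schedulers):

package main

import (
	"fmt"
	"sync/atomic"
)

// serviceEntry models what kube-proxy encodes into kernel rules:
// a Service ClusterIP fanning out to Pod endpoints.
type serviceEntry struct {
	clusterIP string
	endpoints []string // Pod IP:port pairs from the EndpointSlice API
	next      uint64
}

// pick returns the next backend round-robin.
func (s *serviceEntry) pick() string {
	n := atomic.AddUint64(&s.next, 1)
	return s.endpoints[int(n)%len(s.endpoints)]
}

func main() {
	svc := serviceEntry{
		clusterIP: "10.96.0.10",
		endpoints: []string{"10.244.1.5:8080", "10.244.2.7:8080"},
	}
	for i := 0; i < 4; i++ {
		fmt.Printf("%s -> %s\n", svc.clusterIP, svc.pick())
	}
}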

6. etcd (External Dependency)

  • Responsibility: A distributed key-value store that holds the cluster’s configuration and state data. It provides consistency and durability for all Kubernetes objects (Pods, Services, ConfigMaps, etc.); a direct read against the same key layout is sketched after this list.
  • Key Files/Directories: Not part of the Kubernetes codebase directly but integrated via its Go client (go.etcd.io/etcd in go.mod).
  • Interfaces:
    • Primarily accessed by the API Server for reading and writing cluster state.
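
For illustration, here is a minimal sketch reading the keyspace the API Server writes to, using the etcd clientv3 library; the endpoint is an assumption for a local, non-TLS etcd, whereas production clusters require client certificates:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect the way the API Server does (endpoint assumed).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Kubernetes stores objects under /registry/<resource>/<namespace>/<name>.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, "/registry/pods/default/",
		clientv3.WithPrefix(), clientv3.WithKeysOnly())
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Println(string(kv.Key))
	}
}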

Data Flow

Below is a simplified flowchart showing the data flow for a typical Pod creation operation, from user request to workload execution on a node.

graph LR
    User[User] -->|1| API[API Server]
    API -->|2| ETCD[(etcd)]
    API -->|3| SCH[Scheduler]
    SCH -->|4| API
    API -->|5| ETCD
    API -->|6| K[Kubelet]
    K -->|7| CR[Runtime]
    K -->|8| API
    API -->|9| ETCD
    User -->|10| API

Flow Steps:

  1. User submits Pod spec
  2. API Server validates and stores in etcd
  3. Scheduler observes the unassigned Pod via watch
  4. Scheduler selects node and updates API
  5. API Server updates Pod binding in etcd
  6. Kubelet receives Pod assignment
  7. Kubelet creates containers via runtime
  8. Kubelet reports status to API Server
  9. API Server updates state in etcd
  10. User queries Pod status

Flow Explanation:

  1. A user submits a Pod creation request (e.g., via kubectl apply) to the API Server (cmd/kube-apiserver/apiserver.go).
  2. The API Server authenticates, authorizes, and validates the request, then stores the Pod spec in etcd.
  3. The Scheduler (cmd/kube-scheduler/scheduler.go) observes the new, unassigned Pod through a watch on the API Server.
  4. The Scheduler evaluates candidate nodes using its scheduling algorithm (pkg/scheduler), selects a suitable node, and writes the binding to the API Server.
  5. The API Server persists the Pod’s node binding in etcd.
  6. The Kubelet on the selected node (cmd/kubelet/kubelet.go) detects the Pod assignment through its own watch on the API Server.
  7. The Kubelet instructs the container runtime (e.g., containerd) via the CRI to create and start containers as per the Pod spec.
  8. The Kubelet reports the Pod’s status (e.g., Running, Failed) back to the API Server.
  9. The API Server updates the state in etcd.
  10. The user queries the Pod status through the API Server to confirm the operation’s success.

This flow highlights the decoupled, event-driven nature of Kubernetes, where components react to state changes rather than directly invoking each other.
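
The same watch mechanism is available to external clients. A minimal sketch using client-go that waits for a Pod to reach Running, just as the Kubelet and Scheduler wait on their own inputs (the pod name and namespace are illustrative):

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Open a long-lived event stream for a single Pod.
	w, err := clientset.CoreV1().Pods("default").Watch(context.TODO(),
		metav1.ListOptions{FieldSelector: "metadata.name=nginx"})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		fmt.Println(ev.Type, pod.Status.Phase)
		if pod.Status.Phase == corev1.PodRunning {
			return // observed the state we were waiting for
		}
	}
}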

Key Design Decisions

1. Architectural Pattern: Distributed Microservices

  • Pattern: Kubernetes is designed as a set of loosely coupled microservices (API Server, Scheduler, Controller Manager, Kubelet, etc.), each with a specific responsibility. This aligns with a microservices architecture, where components communicate over well-defined APIs (REST via the API Server).
  • Trade-offs:
    • Pros: High modularity allows independent development and deployment of components. It also supports scalability by distributing workload across nodes.
    • Cons: Increased complexity in debugging and tracing issues across components. Network latency between components can impact performance.
  • Notable Choice: The decision to centralize state management in the API Server (backed by etcd) ensures a single source of truth, simplifying consistency but creating a potential bottleneck. This is mitigated by etcd’s distributed nature and API Server caching.

2. Event-Driven Reconciliation Loops

  • Pattern: Controllers in the Controller Manager (pkg/controller) and the Scheduler (pkg/scheduler) operate on a reconciliation loop model. They continuously observe the desired state (via API Server) and take actions to align the actual state, rather than reacting to one-off commands.
  • Trade-offs:
    • Pros: Resilience to failures—controllers retry until the desired state is achieved. This also supports self-healing (e.g., restarting failed Pods).
    • Cons: Can lead to resource contention or thrashing if multiple controllers act on overlapping resources. Requires careful tuning of retry intervals and rate limiting (see the workqueue sketch after this list).
  • Notable Choice: This design avoids tight coupling between components, as seen in pkg/kubelet/active_deadline.go, where the Kubelet independently enforces pod deadlines without direct Scheduler intervention.
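
In practice, much of that rate limiting lives in client-go’s workqueue, which controllers use to retry failed reconciles with exponential backoff instead of hammering the API Server. A minimal sketch (the object key and reconcile function are illustrative):

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func reconcile(key string) error {
	fmt.Println("reconciling", key)
	return nil
}

func main() {
	// Keys (namespace/name) flow through a rate-limited queue.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	q.Add("default/nginx")

	key, shutdown := q.Get()
	if shutdown {
		return
	}
	if err := reconcile(key.(string)); err != nil {
		q.AddRateLimited(key) // retry later, with exponential backoff
	} else {
		q.Forget(key) // clear backoff tracking on success
	}
	q.Done(key)
}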

3. Extensibility via Plugins and Interfaces

  • Pattern: Kubernetes emphasizes extensibility through interfaces like the Container Runtime Interface (CRI), Container Network Interface (CNI), and scheduler plugins. This allows third-party integrations without modifying core code.
  • Trade-offs:
    • Pros: Encourages a vibrant ecosystem (e.g., different container runtimes or network providers). See pkg/kubelet/cri for the Kubelet’s CRI client integration.
    • Cons: Increases complexity for new developers to understand plugin points and introduces variability in behavior depending on chosen plugins.
  • Notable Choice: The scheduler’s plugin framework (pkg/scheduler/framework) allows custom scoring and filtering logic, balancing flexibility with the risk of inconsistent scheduling decisions across clusters (a skeleton plugin is sketched below).
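
A skeleton Filter plugin against the interfaces in pkg/scheduler/framework is sketched below. The signatures match recent releases but drift between versions, the zone label is made up, and registration with the scheduler binary (via app.NewSchedulerCommand and app.WithPlugin) is omitted:

package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// zoneFilter vetoes nodes missing a (hypothetical) zone label.
type zoneFilter struct{}

var _ framework.FilterPlugin = zoneFilter{}

func (zoneFilter) Name() string { return "IllustrativeZoneFilter" }

func (zoneFilter) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if _, ok := nodeInfo.Node().Labels["example.com/zone"]; !ok {
		return framework.NewStatus(framework.Unschedulable, "node has no zone label")
	}
	return nil // nil status means the node passes this filter
}

func main() {} // registration with the scheduler command is omitted here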

4. Monolithic Repository with Modular Binaries

  • Pattern: The Kubernetes codebase is a monolithic repository (k8s.io/kubernetes), but it produces modular binaries (e.g., kube-apiserver, kubelet) that can be deployed independently.
  • Trade-offs:
    • Pros: Simplifies dependency management and ensures version compatibility during development (see go.mod for pinned dependencies).
    • Cons: The large repository can be intimidating and slow to clone and build, though this is mitigated by the repository’s Makefile targets for building individual components.
  • Notable Choice: This balances the need for tight integration during development with the operational flexibility of deploying components separately.

Conclusion

Kubernetes’ architecture is a masterclass in distributed systems design, balancing modularity, scalability, and resilience. By understanding the roles of the control plane (API Server, Controller Manager, Scheduler) and node components (Kubelet, Kube-Proxy), developers can trace how state flows through the system. The event-driven reconciliation model and extensibility mechanisms are particularly clever, enabling self-healing and customization at the cost of added complexity. As you dive into specific files (e.g., cmd/kube-apiserver/apiserver.go or pkg/kubelet), keep this big picture in mind to contextualize the code’s purpose within the broader system.