Logical Breakdown of CI/CD and Operational Practices
1. CI/CD Basic Concepts
CI/CD consists of CI (Continuous Integration) and CD (Continuous Delivery/Deployment).
Why CI is needed: when several features are developed in parallel, their changes can conflict with one another. Integrating continuously surfaces those conflicts (and broken builds) early, when they are cheapest to fix.
2. What CI Should Include
A proper CI process includes:
Code formatting checks (e.g., Prettier)
→ Static analysis (logic errors, type issues, unused variables, potential vulnerabilities)
→ Automated testing (ensures correctness of business logic and prevents regressions).
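For illustration, here is a minimal sketch of the automated-testing stage, assuming a Node/TypeScript project and Node's built-in node:test runner; applyDiscount is a hypothetical piece of business logic invented for this example, not taken from any particular codebase.

```ts
// ci-test-sketch.test.ts — a unit test CI would run on every push or pull request.
// Assumes Node's built-in test runner (node:test); applyDiscount is hypothetical.
import { test } from "node:test";
import assert from "node:assert/strict";

// Example business logic that CI should protect against regressions.
function applyDiscount(price: number, percent: number): number {
  if (percent < 0 || percent > 100) {
    throw new RangeError("percent must be between 0 and 100");
  }
  return price * (1 - percent / 100);
}

test("applyDiscount computes the discounted price", () => {
  assert.equal(applyDiscount(200, 50), 100);
});

test("applyDiscount rejects invalid percentages", () => {
  assert.throws(() => applyDiscount(200, 150), RangeError);
});
```

CI would run tests like these on every change and block the merge if any of them fail.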
3. SRE (Site Reliability Engineer) and Monitoring
An SRE (Site Reliability Engineer) is responsible for keeping production systems reliable, and monitoring is one of the main tools for doing so.
Monitoring enables engineering teams to observe system status in real time. By collecting key metrics (latency, traffic, error rate, resource utilization), teams can discover issues early instead of waiting for users to report them.
Monitoring requires clear metrics, alert thresholds, incident severity levels (P0/P1), defined response and resolution times, and fast acknowledgment (Ack) when an alert fires. Engineers must investigate, roll back if necessary, and continuously communicate progress.
The overall goal is to improve system stability so engineers can handle anomalies immediately and prevent issues from escalating.
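As a rough sketch of how metrics, thresholds, and severity levels might be encoded, the metric names, threshold values, and ack windows below are assumptions; real values come from a service's SLOs.

```ts
// alerting-sketch.ts — illustrative only; metric names and thresholds are assumptions.
type Severity = "P0" | "P1";

interface AlertRule {
  metric: string;           // e.g., "error_rate" or "p99_latency_ms"
  threshold: number;        // fire when the observed value exceeds this
  severity: Severity;       // P0 = page immediately, P1 = lower urgency
  ackWithinMinutes: number; // how quickly the on-call engineer must Ack
}

const rules: AlertRule[] = [
  { metric: "error_rate", threshold: 0.05, severity: "P0", ackWithinMinutes: 5 },
  { metric: "p99_latency_ms", threshold: 1500, severity: "P1", ackWithinMinutes: 30 },
];

// Compare the latest metric readings against the rules and return the alerts to fire.
function evaluate(readings: Record<string, number>): AlertRule[] {
  return rules.filter((rule) => (readings[rule.metric] ?? 0) > rule.threshold);
}

// Example: an 8% error rate breaches the P0 rule and should page immediately.
console.log(evaluate({ error_rate: 0.08, p99_latency_ms: 900 }));
```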
4. On-call Responsibilities
Modern teams follow “You build it, you own it”: developers participate in on-call rotations and are responsible for the services they build.
On-call rotations are usually weekly, with a primary and secondary on-call engineer to ensure alerts are always handled.
When an alert fires, the process is: promptly Ack → determine if it’s a real incident → communicate internally (and externally if needed) → mitigate impact first (e.g., quick rollback) → then investigate root cause.
Core principle: “Mitigate first, diagnose later” to minimize user and business impact.
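A toy sketch of the primary/secondary escalation idea follows; the engineer names, timeout, and the notify/page helpers are made up for illustration.

```ts
// escalation-sketch.ts — rotation names, timeout, and notify() are illustrative assumptions.
interface Rotation {
  primary: string;
  secondary: string;
  ackTimeoutMinutes: number; // escalate to secondary if primary has not acked in time
}

const thisWeek: Rotation = { primary: "alice", secondary: "bob", ackTimeoutMinutes: 5 };

function notify(engineer: string, alert: string): void {
  console.log(`paging ${engineer}: ${alert}`);
}

// Page the primary first; if the alert is not acknowledged in time, escalate to the
// secondary so that every alert is guaranteed a responder.
function page(alert: string, ackedWithinTimeout: boolean): void {
  notify(thisWeek.primary, alert);
  if (!ackedWithinTimeout) {
    notify(thisWeek.secondary, `escalated: ${alert}`);
  }
}

page("error_rate above threshold", false);
```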
5. Incident Review / Postmortem
The purpose of incident reviews is not to assign blame, but to systematically analyze incidents, identify root causes, improve processes, and prevent recurrence.
Key practices: focus on “How to prevent this from happening again” and “How to detect/resolve it earlier next time,” while clearly documenting and sharing learnings to improve organizational resilience.
High-quality postmortems require a “blameless” culture. Reviews must focus on systems and processes rather than individuals; otherwise engineers will hide problems, transparency will drop, and the organization as a whole will be weakened.
6. Feature Flags and Deployment Risk Control
Feature flags allow teams to dynamically enable or disable features for selected users without redeploying, enabling canary releases (gradual traffic rollout).
This avoids risky full rollouts and limits impact to small traffic segments. If issues arise, rollout can be halted or rolled back immediately.
Essentially, feature flags are an engineering safety mechanism that prevents large-scale incidents caused by configuration errors or human mistakes, and they are a critical part of modern deployment quality.
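Below is a minimal sketch of a percentage-based flag check with deterministic user bucketing; the flag name, rollout percentage, and hashing scheme are assumptions, and production systems typically use a dedicated flag service rather than hand-rolled code.

```ts
// feature-flag-sketch.ts — illustrative; flag names and rollout percentages are assumptions.
import { createHash } from "node:crypto";

// Current rollout configuration, e.g. fetched from a config service at runtime
// so it can be changed without redeploying.
const rollout: Record<string, number> = {
  "new-checkout-flow": 5, // expose to roughly 5% of users (canary)
};

// Deterministically bucket a user into 0..99 so the same user always gets the same decision.
function bucket(userId: string, flag: string): number {
  const digest = createHash("sha256").update(`${flag}:${userId}`).digest();
  return digest.readUInt32BE(0) % 100;
}

function isEnabled(flag: string, userId: string): boolean {
  const percent = rollout[flag] ?? 0;    // unknown flags default to off
  return bucket(userId, flag) < percent; // raise percent gradually; set to 0 to kill the feature
}

// Example: only the canary segment sees the new flow.
console.log(isEnabled("new-checkout-flow", "user-1234"));
```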
7. Multi-Layer Deployment Safeguards
Software deployment is not “write code → deploy.” It is a multi-layered defense process.
Core idea: gradually increase traffic and validate at each stage so potential issues surface early, keeping risks tightly controlled and ensuring safe, stable, and predictable deployment.
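One way to picture the layered process is as a sequence of stages with a validation gate between them; the stage names, traffic percentages, and the checkHealth stub below are illustrative assumptions, not a prescribed pipeline.

```ts
// staged-rollout-sketch.ts — stages, percentages, and checkHealth are illustrative assumptions.
interface Stage {
  name: string;
  trafficPercent: number; // share of production traffic receiving the new version
}

const stages: Stage[] = [
  { name: "staging", trafficPercent: 0 },
  { name: "canary", trafficPercent: 1 },
  { name: "partial", trafficPercent: 10 },
  { name: "full", trafficPercent: 100 },
];

// Hypothetical gate: in practice this would query monitoring for error rate and latency.
async function checkHealth(stage: Stage): Promise<boolean> {
  console.log(`validating ${stage.name} at ${stage.trafficPercent}% traffic...`);
  return true; // pretend the metrics look healthy
}

async function deploy(): Promise<void> {
  for (const stage of stages) {
    const healthy = await checkHealth(stage);
    if (!healthy) {
      console.log(`rollback: ${stage.name} failed validation`);
      return; // stop the rollout before more users are affected
    }
  }
  console.log("rollout complete");
}

deploy();
```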
8. Observability (o11y)
Observability allows engineers not only to know something is wrong (monitoring’s role) but also to quickly understand where and why it is happening.
Its essence is improving visibility into internal system state—similar to how a dentist needs X-rays, not just the symptom of “toothache,” to identify the root cause.
In modern complex systems (especially multi-service, dependency-heavy architectures), lack of observability leads to guesswork debugging, which is slow and costly. Building observability is therefore essential for efficient issue diagnosis and resolution.
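As a small sketch of the kind of structured, correlated telemetry that makes “where and why” answerable, the handleRequest function and log field names below are assumptions made for illustration.

```ts
// o11y-sketch.ts — illustrative; handleRequest and the log field names are assumptions.
import { randomUUID } from "node:crypto";

// Emit one structured log line per request so logs can be filtered, aggregated,
// and correlated across services by requestId.
async function handleRequest(path: string): Promise<void> {
  const requestId = randomUUID(); // propagate this ID to downstream calls
  const start = Date.now();
  try {
    // ... the real request handling would happen here ...
    console.log(JSON.stringify({ requestId, path, status: 200, durationMs: Date.now() - start }));
  } catch (err) {
    console.log(JSON.stringify({ requestId, path, status: 500, error: String(err), durationMs: Date.now() - start }));
    throw err;
  }
}

handleRequest("/checkout");
```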