Final Round
introduction
Thank you for the opportunity.
I have 3.8 years of experience as a DevOps Engineer. In my current role,
I’ve worked extensively with Docker and Kubernetes — managing EKS clusters, deploying Helm charts, and handling autoscaling with Karpenter.
I have also managed AWS infrastructure and automated provisioning using Terraform and Ansible.
I have implemented CI/CD pipelines for deploying containerized workloads on Docker and Kubernetes.
Additionally, I’ve set up monitoring and logging using Prometheus, Grafana, and Loki to ensure system availability and proactively handle incidents.
I believe this experience will help me contribute to delivering secure, scalable, and cost-effective solutions at ZS.
Thank you for the opportunity.
I have around 3.8 years of experience as a DevOps Engineer. In my current role, I work extensively with Docker and Kubernetes, managing EKS clusters, deploying Helm charts, and handling autoscaling with Karpenter. I also manage AWS infrastructure and automate provisioning using Terraform and Ansible. I’ve implemented CI/CD pipelines for containerized workloads and set up monitoring and logging solutions using Prometheus, Grafana, and Loki to ensure system availability and proactively address incidents. I’m excited about the opportunity at ZS, and I believe my experience can really help contribute to delivering secure, scalable, and cost-effective cloud solutions.
Break into chunks
Don’t try to say it in one long sentence. Break it into 3–4 thought chunks, each with a small pause:
security - Vault, container scanning, WAF rules; secure internet access using NAT Gateways and VPC Endpoints to control internet access for private resources; SGs and NACLs; Cloudflare WAF, Tunnel, WARP
scalability - autoscaling with HPA and Karpenter
reliability - monitoring and alerting
cost optimization - Spot and Reserved EC2, gp3 volumes, t3a.medium instances
automation - Terraform and modules, Ansible
Tech questions
Way to answer
What is .. ?
What it is: Blue-green is a deployment strategy where we run two identical environments: one currently serving traffic (blue) and one idle (green).
How it works: We deploy the new version to the green environment, validate it with tests, and then switch user traffic from blue to green. If an issue occurs, traffic can be instantly routed back to blue.
Purpose: It is used to minimize downtime and reduce risk during releases.
When to use: For critical production systems and major version upgrades where downtime is unacceptable and a fast rollback is needed.
Tradeoffs: It comes at a higher cost because we run two identical environments.
Example
Q. What is blue-green strategy?
A. Blue-green is a deployment strategy where we run two identical environments: one currently serving traffic (blue) and one in an idle state (green). We deploy the new version to the green environment, validate it with tests, and then switch user traffic from the blue environment to the green environment. If an issue occurs, traffic can be instantly routed back to blue. This minimizes downtime and reduces risk during releases. It is used for critical production systems and major version upgrades where downtime is unacceptable and a fast rollback is needed, but it comes at a higher cost because we run two identical environments.
Current prod (Blue) runs v1.
Deploy v2 to Green, test it.
Switch traffic → Green (v2).
If stable, next release (v3) will go to Blue (now idle).
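If asked how the traffic switch actually happens, a minimal sketch for a label-based blue/green setup behind one Kubernetes Service (the service name, labels, and manifest file are illustrative assumptions; on AWS the same switch can be done with weighted ALB target groups or Route 53):
kubectl apply -f deployment-green.yaml   # deploy v2 (green) alongside the running blue deployment
kubectl patch service my-app -p '{"spec":{"selector":{"app":"my-app","version":"green"}}}'   # switch traffic to green
kubectl patch service my-app -p '{"spec":{"selector":{"app":"my-app","version":"blue"}}}'    # instant rollback to blue if needed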
“Difference between X and Y?”
A Deployment is a Kubernetes controller that manages stateless applications by maintaining a specified number of pod replicas. It provides features like rolling updates, rollbacks, scaling, and self-healing, ensuring high availability and declarative updates for applications. Deployments are ideal for managing applications that require consistent uptime and easy version management.
while
scenario-based questions
Creating and following SOPs
SOPs (standard operating procedures) are pre-approved, repeatable playbooks for handling operations.
SOP = repeatable, known issue/process
An SOP is a documented set of steps that engineers follow to handle recurring issues or tasks in a consistent and reliable way.
Title → Clear name of the procedure.
Purpose → Why this SOP exists.
Scope → When/where to use it (what systems/issues it covers).
Pre-requisites → Tools, permissions, or access needed.
Steps → Detailed resolution steps (step-by-step, no assumptions).
Verification → How to confirm the issue is resolved.
Escalation → What to do if steps fail (who to contact, next level).
Post-Incident → Logging, documentation, RCA if needed.
Title: Handling “Node NotReady” Issue in Amazon EKS
Purpose: To quickly identify and resolve situations where an EKS worker node is stuck in the NotReady state, impacting workloads.
Scope: Applies to all EKS clusters running on AWS where nodes may enter NotReady state.
Pre-requisites:
AWS CLI configured
kubectl access to the cluster
IAM permissions for EC2 and EKS operations
Procedure (Steps):
Step 1: Verify Node Status
kubectl get nodes
Step 2: Check EC2 Instance Health
Step 3: Check kubelet status and logs on the node
Step 4: Common Fixes
If kubelet has failed, restart it.
If the EC2 instance has failed, restart or replace it.
Verification: confirm the issue is resolved and the system is working correctly (node back in Ready state, workloads running).
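A sketch of the commands behind the steps above (node name and instance ID are placeholders):
kubectl get nodes                                              # Step 1: node status
kubectl describe node <node-name>                              # node conditions and recent events
aws ec2 describe-instance-status --instance-ids <instance-id>  # Step 2: EC2 health checks
sudo systemctl status kubelet && sudo journalctl -u kubelet --since "30 min ago"   # Step 3: on the node via SSM/SSH
sudo systemctl restart kubelet                                 # Step 4: if kubelet is unhealthy
aws ec2 reboot-instances --instance-ids <instance-id>          # Step 4: if the instance itself is unhealthy
kubectl get nodes                                              # Verification: node should return to Ready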
Escalation
If issue persists → escalate to Cloud SRE Lead / AWS Support.
Provide logs, steps followed, and snapshots (kubectl describe, kubelet logs, EC2 status).
Post-Incident
Document root cause (e.g., CNI crash, disk full, EC2 hardware issue).
Update runbook if new resolution steps are discovered.
Technical
Interviewer: “If a pod keeps crashing with ‘CrashLoopBackOff’, what’s the first thing you check?” You (not 100% sure of exact order): “I’d start with kubectl describe pod to check events and the last state, then jump into logs with kubectl logs --previous. If it’s a startup failure, I’d suspect a misconfigured livenessProbe or missing ConfigMaps/Secrets.”
Interviewer: “Are you sure that’s the right order?” 👊 Your Harvey Move: “I’m sure enough to start there — because 80% of CrashLoop cases are surfaced in events or logs. But if I hit a dead end? I’d immediately pull up the deployment manifest and compare against known-good configs. In production, guessing costs uptime — so I verify as I go.”
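The underlying triage commands for that answer, as a sketch (pod, namespace, and deployment names are placeholders):
kubectl describe pod <pod> -n <namespace>              # events, last state, exit code, probe failures
kubectl logs <pod> -n <namespace> --previous           # logs from the last crashed container
kubectl get deployment <name> -n <namespace> -o yaml   # compare against a known-good manifest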
Q. What is blue-green deployment?
You sure?
I’m sure about the strategy/logic/structure/concepts
How do you handle conflict
I handle conflict by listening to all perspectives and keeping the discussion focused on facts and project needs rather than opinions. I encourage the team to list pros and cons objectively, align the decision with project requirements, and aim for a solution that everyone can support.
I remember we had a conflict in our team while choosing the right cluster autoscaler for our workloads. Our workloads experienced high traffic spikes during events, and we needed scaling that was both fast and cost-efficient.
One side argued for Cluster Autoscaler because it’s mature, stable, and widely adopted. The other side pushed for Karpenter, which provisions nodes much faster and supports Spot instances to reduce costs.
I handled the conflict by first listening to both sides and then driving a fact-based discussion. We listed out the pros and cons of each tool: Cluster Autoscaler offered maturity but slower scaling, while Karpenter was newer but optimized better for bursty traffic. After evaluating our workload patterns and SLAs, we decided to go with Karpenter for production since it fit our scaling needs, while still retaining Cluster Autoscaler for some legacy workloads.
By focusing on our requirements rather than personal preferences, we resolved the conflict, and the decision was accepted by the entire team.
how do you handle escalation?
When an issue arises, I first try to resolve it myself by following monitoring dashboards, logs, and documented SOPs. If I can’t resolve it within the expected SLA or if it’s a critical incident impacting production, I escalate according to our incident process — usually to the team lead or the specialist team like DBA or Networking. While escalating, I make sure to provide all the details — error messages, metrics, and steps already taken — so they don’t have to start from scratch. I also keep communication open with stakeholders until the issue is resolved. Afterward, I contribute to documenting or improving the SOP so that next time, the resolution is faster and maybe doesn’t require escalation at all.
I first try to resolve it myself by following monitoring dashboards, logs, and documented SOPs.
If I can’t resolve it within the expected SLA or if it’s a critical incident impacting production
escalate according to our incident process — usually to the team lead or the specialist team (like DBA or Networking), so it reaches the right team
I make sure to provide all the details — error messages, metrics, and steps already taken with snapshot, so they don’t have to start from scratch.
I also keep communication open with stakeholders until the issue is resolved.
Afterward, I contribute to documenting or improving the SOP so that next time, the resolution is faster and maybe doesn’t require escalation at all.
How do you perform RCA?
First, acknowledge and mitigate the issue (restore service, if possible, apply quick fix)
Gather logs, metrics, traces, check system health
Note timeline of events: when issue started, when alerts fired, when fixes applied.
Identify the Root Cause
Ask the 5 Whys (keep drilling down until you find the underlying cause, not just symptoms).
Example:
Service crashed → Why? OOM.
Why OOM? Memory leak.
Why memory leak? Unbounded cache.
Why unbounded cache? No eviction policy.
Root cause = bad cache implementation.
Document Findings
Create RCA report including:
Summary of incident.
Timeline of events.
Impact (users, systems, revenue).
Root cause.
Resolution steps taken.
Action items (permanent fixes, monitoring improvements, automation).
Prevent Recurrence
Update SOPs/runbooks.
Add monitoring/alerts if gaps found.
Automate recovery if possible (self-healing).
Share lessons learned in postmortem meeting.
"I perform RCA by first limiting the impact to restore service quickly. Then I collect logs, metrics, and timelines to understand what happened. I analyze the data, often using methods like the 5 Whys, to trace the problem back to the true root cause rather than just the symptoms. I document the findings in an RCA report with impact, timeline, and corrective actions. Finally, I ensure permanent fixes are implemented — like adding monitoring, automation, or code/configuration changes — so the issue doesn’t repeat."
ROOT CAUSE OF EKS NODE NOT-READY ISSUE
Let’s do a real-time 5 Whys analysis for the Kubernetes event:
❗ “Node Not Ready”
This is a common production issue. We’ll drill down as if we’re live-debugging in an incident war room.
🚨 Problem: Node is reporting NotReady status in kubectl get nodes
❓ Why #1: Why is the node NotReady?
Because the kubelet on that node has stopped posting its heartbeat (NodeStatus) to the Kubernetes API server within the configured timeout (default: ~40s).
✅ So it’s not necessarily that the machine is dead — just that the control plane hasn’t heard from kubelet recently.
❓ Why #2: Why has kubelet stopped posting heartbeats?
Because kubelet is either crashed, unresponsive, or can’t communicate with the API server.
Let’s check logs:
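For example, on the affected node (via SSM or SSH), assuming a systemd-managed kubelet:
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "30 min ago" | tail -n 100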
Possible findings:
Kubelet process is restarting/crashing
Network timeouts to API server
Certificate expiry
Disk pressure / resource starvation
👉 Let’s assume we see this log:
"Failed to update node status, will retry: error updating node status, Liveness probe failed"
Or maybe:
"server certificate has expired"
But let’s pick the most common real-world cause → Kubelet is under memory pressure and got OOMKilled.
❓ Why #3: Why was kubelet OOMKilled?
Because the node ran out of memory, and the Linux OOM Killer chose kubelet (or a critical system component) to kill to free up memory.
Check:
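A likely set of checks (node name is a placeholder):
kubectl describe node <node-name> | grep -A 10 Conditions
kubectl get events --field-selector involvedObject.name=<node-name>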
You might see:
MemoryPressure: True
NodeHasInsufficientMemory (kubelet, <node>): kubelet has insufficient memory available
Also, check node resource usage:
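For example (requires metrics-server; node name is a placeholder):
kubectl top node <node-name>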
→ You see 95%+ memory usage.
❓ Why #4: Why did the node run out of memory?
Because workloads scheduled on the node consumed more memory than allocatable, and/or system daemons or kernel buffers used up remaining memory.
Drill into pods on that node:
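A likely command (node name is a placeholder):
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>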
Then check memory per pod:
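For example (requires metrics-server):
kubectl top pod --all-namespaces --sort-by=memory | head -n 20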
You find:
A misconfigured app pod with no memory limit.
Or a runaway Java app with -Xmx8g on a 4GB node.
Or a DaemonSet (like a logging agent) leaking memory over time.
👉 Let’s say: A critical application pod had no memory limit and consumed all available memory.
❓ Why #5: Why was there a pod with no memory limit consuming unlimited memory?
Because resource requests/limits were not enforced in the cluster, and no policy/gatekeeper prevented deployment of such pods.
Root cause uncovered:
✅ No ResourceQuota or LimitRange policies + lack of governance allowed a pod to be deployed without memory limits → pod consumed all node memory → kubelet OOMKilled → node became NotReady
🎯 ROOT CAUSE:
Missing cluster-level guardrails (ResourceQuota/LimitRange/OPA/Gatekeeper) allowed deployment of an unbounded pod that exhausted node memory, triggering OOMKill of kubelet and causing Node NotReady.
🛠️ Recommended Fixes (Address Root Cause)
✅ Enforce memory limits using LimitRange in every namespace (see the sketch after this list).
✅ Apply ResourceQuotas per namespace to cap total usage.
✅ Use Policy Engines like Kyverno or OPA Gatekeeper to block deployments missing limits.
✅ Enable Vertical Pod Autoscaler (VPA) or set proper requests based on profiling.
✅ Monitor & alert on the MemoryPressure node condition and the kube_node_status_condition metric from KSM.
✅ Add node-level alerts for sustained high memory usage before it leads to OOM.
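A minimal LimitRange sketch for the first fix above, applied with kubectl (the namespace name and memory values are illustrative assumptions):
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits
spec:
  limits:
  - type: Container
    default:
      memory: 512Mi        # default limit applied when a container sets none
    defaultRequest:
      memory: 256Mi        # default request applied when a container sets none
EOF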
🔁 Bonus: Alternative Paths in 5 Whys
Depending on what you find in logs, your 5 Whys path may vary:
If network issue: Why? → CNI plugin crashed → Why? → ConfigMap corrupted → Why? → No config validation → Root cause = lack of GitOps/config drift protection.
If certificate expiry: Why? → Auto-renewal failed → Why? → kubelet cert rotation misconfigured → Root cause = insecure/default kubelet setup.
If disk pressure: Why? → Logs filled disk → Why? → No log rotation → Root cause = missing DaemonSet for fluent-bit/logrotate.
🧭 Pro Tip: Use KSM Metrics for Early Detection
Create an alert based on the kube_node_status_condition metric, for example kube_node_status_condition{condition="Ready",status="true"} == 0 (node not reporting Ready).
Or: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1 firing for several minutes.
✅ Solving “Node Not Ready” isn’t about restarting kubelet — it’s about preventing the conditions that lead to it.
How do you handle issues and tasks
First, I understand the problem and check the impact. Then I look at logs, metrics, and monitoring dashboards to narrow down whether it’s infra, app, or external. If it’s a known issue, I follow our SOP. If it’s new, I investigate deeper using system tools and collaborate with other teams if needed. Once resolved, I validate the fix and document it so it becomes easier to handle in the future."
For recurring issues, I focus on automation and permanent fixes — I follow SOPs, perform RCA, and eliminate the root cause so it doesn’t repeat.
When a new problem comes up, I first check what’s going on and stop it from getting worse to reduce impact, then investigate deeply, escalate if needed, and finally document the solution so it becomes an SOP for future use
how did you debug/troubleshoot issues?
🔹 How to Answer: “How do you debug/troubleshoot issues?”
✅ Step 1: Understand the Problem
Start with the symptoms: error messages, alerts, logs.
Ask: What exactly is broken? Who/what is impacted?
✅ Step 2: Reproduce (if possible)
Try to reproduce issue in staging/test environment.
Helps confirm if issue is systemic or environment-specific.
✅ Step 3: Check Observability Data
Use logs, metrics, traces, dashboards (Grafana, CloudWatch, Prometheus, Loki).
Identify abnormal spikes (CPU, memory, disk, latency).
✅ Step 4: Narrow Down the Scope
Is it app-level (code bug), infra-level (node/DB/network), or external (AWS service, DNS)?
Divide & conquer → rule out healthy components.
✅ Step 5: Apply Known Fixes / SOPs
If it’s a known recurring issue → follow runbook/SOP.
Example: Node NotReady → check kubelet, restart or replace.
✅ Step 6: Deep Investigation (if unknown)
Use debugging tools:
kubectl describe pod/node (K8s)
journalctl, dmesg (Linux)
EXPLAIN ANALYZE (Postgres slow queries)
Tracing requests across microservices.
✅ Step 7: Collaborate / Escalate
If out of scope → escalate to specialist (DBA, networking, AWS support).
Share logs + steps already tried (saves time).
✅ Step 8: Verify Resolution
After fix → confirm system is healthy.
Validate via monitoring dashboards + user experience.
✅ Step 9: Document & Improve
Update SOP/runbook.
Add monitoring/alerting if gap found.
Feed into postmortem/RCA if major issue.
🔹 Interview-Ready Answer
"I troubleshoot issues in a structured way. First, I understand the problem and check the impact. Then I look at logs, metrics, and monitoring dashboards to narrow down whether it’s infra, app, or external. If it’s a known issue, I follow our SOP. If it’s new, I investigate deeper using system tools and collaborate with other teams if needed. Once resolved, I validate the fix and document it so it becomes easier to handle in the future."
Behavioral-scenario question
Follow STAR Approach - Situation → Task → Action → Result.
personality check
Harvey personality
Harvey walks into every room like he owns it
You think you’re smart? Fine. But I’m smarter because I know how to use it.
preparation - He reads files, studies opponents, rehearses lines, and anticipates moves.
Preparation builds confidence. The more you prep, the more comfortable you are admitting you don’t know something — because you know you can find out fast.
Harvey: When I got here I dominated; nobody questions my ability to get the job done.
First impression is the last impression.
He doesn’t get it: just doing good work isn’t the whole job.
Measures Greatness by Results, Not Knowledge
In interviews, results-oriented people matter more than knowledge alone.
don't know the answer
“Let me think about that for a sec” or
“That’s a great question — let me make sure I address it properly.”
Recall related knowledge.
Structure your answer.
You sure?
Based on what I know right now? Yes, I’m sure
Why? Do you see any flaw in that approach? I’d love to hear your take — maybe there’s a nuance I haven’t considered yet.
I’m sure about the logic, less sure about the edge case
Ansible: Yes, I’ve got hands-on experience with Ansible — I’ve used it for tasks like provisioning servers and writing playbooks. I don’t work with it daily, but I’m comfortable using it when needed.
ArgoCD: Not yet, but I’ve read about its workflow — ArgoCD continuously monitors Git for changes and applies them to the cluster. If I get the chance, I’d love to implement it, as it builds on the skills I already have with Kubernetes and GitOps.
do you have any question
Questions
What are the most common production challenges you face with client workloads?
How do you prioritize between speed of delivery and platform stability/reliability?
What KPIs matter most to you for measuring cloud success?
What skills do you see as most critical for cloud engineers to develop in the next few years?
let them talk about cost, security, reliability, compliance, speed, etc.
If he already mentions the CCoE team
👉 “What are the primary focus areas for the CCoE team when balancing cost, security, reliability, and delivery speed?”
If he hasn’t mentioned the CCoE team yet
👉 “I’d like to understand more about the Cloud Center of Excellence team. What are the primary focus areas when balancing cost, security, reliability, and delivery speed?”
ending ..
No, that covers everything I wanted to ask. Thank you for taking the time to explain the team and its priorities. I’m looking forward to contributing to the CCoE team, working at scale, and helping deliver secure, reliable, and scalable cloud solutions. Thank you!
Note: whenever you prepare for theory prepare scenario QA as well
Non technical
from and family
I come from Jalgaon and have a supportive family.
My younger brother works in IT, my father is a farmer, and my mother is a homemaker.
They have always encouraged me to focus on education and personal growth, which inspired me to pursue a career in technology and DevOps.
strength
So, my strength is that I am a result-oriented person and I handle situations quickly.
weakness
I have a curiosity to learn new technologies, but sometimes it distracts me from the tasks I need to focus on.
Why do you want to join ZS Associates?
ZS is a global consulting firm with a strong presence in healthcare and life sciences. I’m excited about this opportunity because my skills align well with the role, and I heard the CCoE team operates around 200 EKS clusters, which would be exciting to work with at that scale. ZS is also an AWS partner with AI and cloud solutions in the AWS Marketplace, which excites me about joining ZS.
Where do you see yourself in 5 years?
In 5 years, I see myself as a senior cloud engineer or architect, who build and design highly scalable and reliable cloud solution and keep learning emerging technologies
Why should we hire you?
I bring 3.8 years of hands-on experience in the cloud infrastructure, Kubernetes, and DevOps space. My experience aligns well with the responsibilities of this role, and I am confident I can contribute to delivering secure, scalable, and cost-effective cloud solutions at ZS.
My interests match ZS’s tech stack, so I believe I am the right candidate for ZS.
Tell me about a challenge you faced at work and how you handled it
In my previous role, we faced a challenge with sudden spikes in workload on our Kubernetes cluster. The existing Horizontal Pod Autoscaler wasn’t sufficient to scale nodes quickly, which sometimes caused pods to stay in Pending state. I implemented Karpenter as a dynamic cluster autoscaler, which automatically provisions and scales nodes based on workload demand. This solution reduced pod scheduling delays, improved application reliability, and optimized resource utilization across the cluster."
How do you handle stress or tight deadlines?
I prioritize tasks, break them into smaller steps, and focus on one task at a time. I also communicate proactively with my team and managers about progress and challenges. Using this approach, I’ve successfully delivered critical projects under tight deadlines without compromising quality."
How do you work in a team?
I enjoy collaborating with team members, sharing knowledge, and learning from others. I make sure to communicate clearly, help resolve conflicts, and focus on the team’s goals. I also mentor new members to help them get up to speed quickly."
What motivates you?
"I’m motivated by solving complex problems, learning new technologies, and seeing the impact of my work on system performance and reliability. I also enjoy mentoring others and contributing to team growth.
What are your hobbies/interests?
"I enjoy learning travelling
How do you handle mistakes at work?
"I acknowledge mistakes immediately, analyze it, and take corrective actions to prevent recurrence.
Kubernetes scenario QA
QA
These 5 conditions are very common Kubernetes failure scenarios. Interviewers usually test them in scenario-based questions where you explain in STAR (Situation – Task – Action – Result) format.
Here’s how you can frame interview-ready answers for each case:
1. Node Not Ready
Situation:
In our EKS cluster, one of the worker nodes went into NotReady state during a production release.
Task: My task was to quickly identify the cause and bring the node back online without disrupting critical workloads.
Action:
Checked node status: kubectl describe node <node-name>
Found that kubelet was not reporting due to high disk usage.
Pods producing high log output (e.g., debug mode, verbose logs) filled /var/lib/docker/containers/.../*.log, and the node reported disk pressure.
Cleared unused Docker images and logs.
sudo truncate -s 0 /var/lib/docker/containers/*/*.log
Discussed with the development team and advised them to reduce verbose logging and expose only necessary logs.
If needed, cordoned and drained the node, then restarted kubelet service.
Alternative: if unable to SSH into the node, drain the node, delete it, and let the ASG create a new one; this is also a valid approach.
Result:
Node transitioned back to Ready, pods were rescheduled, and we lowered the disk-usage alert threshold to proactively prevent future issues.
2. Pod Pending
Situation:
Some pods in the default namespace were stuck in Pending state during an autoscaling event.
Task: I had to find why the scheduler wasn’t placing pods and fix it quickly.
Action:
Ran kubectl describe pod <pod> → saw an Insufficient CPU error.
Confirmed with kubectl get nodes -o wide that nodes were fully utilized.
Increased node group size using Cluster Autoscaler.
Updated pod resource requests to realistic values to avoid over-requesting.
Result: Pods got scheduled successfully, and autoscaling handled future workload spikes smoothly.
3. CrashLoopBackOff
Situation:
A new microservice deployment was failing repeatedly and showing CrashLoopBackOff.
Task: I needed to identify the cause and stabilize the application.
Action:
Checked logs: kubectl logs <pod> --previous
Found the container was exiting due to a missing environment variable (DB password).
ConfigMap/Secret not mounted
Wrong command/Code bug - process exits with non-zero status.
Fixed ConfigMap and Secret references in the Deployment YAML.
Rolled out a new deployment with kubectl rollout restart.
Result: Application came up successfully, and we introduced pre-deployment validation checks for required env variables.
4. OOMKilled
Situation:
A data processing job pod was repeatedly killed with status OOMKilled.
Task: I had to optimize the pod resource allocation and prevent service impact.
Action:
Described pod: kubectl describe pod <pod> → saw memory limit exceeded.
Increased memory limits in the Deployment manifest from 512Mi → 2Gi.
Also optimized the application code to use streaming instead of loading entire dataset in memory.
Result: Pod ran successfully without OOM kills, and resource requests/limits were adjusted for all workloads to prevent overcommitment.
5. ImagePullBackOff
Situation:
During a deployment, pods were stuck in ImagePullBackOff state.
Task: My job was to ensure the correct image could be pulled and deployed.
Action:
Described pod: kubectl describe pod <pod> → found an “image not found” error for the private ECR repository.
Verified the image existed and corrected the tag in the manifest.
Checked IAM permissions for worker nodes to access ECR and fixed missing policy.
Re-deployed after fixing the image reference and registry access.
Result: Pods pulled images successfully, and we set up CI/CD pipeline automation to tag and push images correctly before deployment.
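A sketch of the checks behind this scenario (repository, tag, account ID, and region are placeholders):
aws ecr describe-images --repository-name <repo> --image-ids imageTag=<tag>   # confirm the image/tag actually exists
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com   # verify registry auth from a node or CI runner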
🔥 Pro tip for interviews:
When you present these, keep your Action focused on tools/commands (kubectl, CloudWatch, Grafana, etc.) and your Result on impact reduction and long-term preventive measures.
build failure
QA
Let’s go deeper on build failures in GitHub Actions pipelines (before deployment). These are the most common CI/CD build-time issues you’ll see:
🔹 Common Build Failures in GitHub Actions (CI Stage)
Missing Dependencies: wrong or missing entries in requirements.txt, package.json, pom.xml, etc. Symptom: ModuleNotFoundError: No module named 'flask' or npm ERR! missing script: build. Fix: update the dependency file, lock versions, run pip freeze > requirements.txt.
Version Conflicts: a library requires a specific version not compatible with another. Symptom: ERROR: Cannot install - version conflict. Fix: use dependency pinning (e.g., requests==2.32.3), upgrade/downgrade packages.
Private Repo Dependency: code depends on a private GitHub/NPM/PyPI repo. Symptom: npm ERR! 403 Forbidden or pip install ERROR: 401 Unauthorized. Fix: use the GitHub Actions GITHUB_TOKEN or deploy keys for authentication.
Caching Issues: the GitHub Actions cache stores old dependencies. Symptom: builds pass locally but fail in CI. Fix: clear/rebuild the cache, use actions/cache properly with hashFiles('**/requirements.txt').
Build Tool Missing: a required CLI is not installed in the runner. Symptom: make: command not found or mvn: not found. Fix: install the tool in the workflow (sudo apt-get install -y build-essential).
OS-Specific Dependency: works on the dev machine but not in the Linux runner. Symptom: cannot find /usr/lib/... or failed to build wheel. Fix: add OS packages (apt-get install libpq-dev), use multi-platform builds.
Docker Build Failure: wrong Dockerfile instructions or missing files. Symptom: COPY failed: file not found in build context. Fix: verify Dockerfile paths, use .dockerignore correctly.
Authentication Failure for Registry: pushing an image to ECR/GHCR without login. Symptom: denied: User not authorized to perform ecr:PutImage. Fix: use aws-actions/amazon-ecr-login@v2 or docker/login-action.
Tests Fail in CI: unit/integration tests not passing. Symptom: pytest FAILED or npm test failed. Fix: fix the code/test mismatch, mock external dependencies.
Secrets Missing in CI: a test requires an env var not set in GitHub Secrets. Symptom: KeyError: DB_PASSWORD or undefined process.env.API_KEY. Fix: define secrets under repo → Settings → Secrets and reference ${{ secrets.DB_PASSWORD }}.
🔹 How to Prevent Build Failures
Dependency pinning (package==version) → avoid version drift.
Use caching properly → speeds up installs, but bust the cache when requirements.txt changes.
Run builds locally in the same environment as the GitHub runner (Ubuntu, not Windows/Mac); see the sketch after this list.
Fail fast → run linting/tests before heavy build steps.
Use matrix builds → test across versions (Python 3.9, 3.10 etc.).
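One way to apply the “same environment as the runner” point above, as a minimal sketch (the image tag and install/test commands are assumptions; adjust to the project):
docker run --rm -it -v "$PWD":/src -w /src python:3.11-slim \
  bash -c "pip install -r requirements.txt && pytest"   # assumes pytest is listed in requirements.txt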
🎤 Interview-ready way to explain:
“Most build failures in GitHub Actions come from missing dependencies, version conflicts, or misconfigured build tools. For example, I’ve seen Python builds fail with ModuleNotFoundError because requirements.txt was incomplete, or Docker builds fail because COPY referenced a missing file. I usually fix these by verifying dependency files, pinning versions, using caching, and aligning local dev with the GitHub runner environment.”
Terraform
QA
Now let’s switch focus to failures during terraform apply (execution time, not just plan time). These are the common real-world Terraform apply issues you’ll encounter:
🔹 Common Failures During terraform apply
Insufficient IAM Permissions: the user/role doesn’t have the required AWS/GCP/Azure permissions. Symptom: Error: AccessDenied: User is not authorized to perform... Fix: update the IAM policy with least-privilege permissions, test with aws iam simulate-principal-policy.
Resource Already Exists: the resource was created manually outside Terraform. Symptom: Error: [Conflict] resource already exists. Fix: import the resource (terraform import) or delete the manual resource.
Quota / Limit Exceeded: hitting cloud resource limits (e.g., VPC, EIP, EKS nodes). Symptom: Error: Quota exceeded. Fix: request a quota increase, clean up unused resources.
Invalid/Conflicting Config: wrong VPC ID, subnet mismatch, or a dependency not yet created. Symptom: Error: InvalidSubnetID.NotFound or DependencyViolation. Fix: fix the config, add depends_on if needed.
State Lock Error: another apply is running or the lock was not released. Symptom: Error: Error acquiring the state lock. Fix: use remote state with locking (S3 + DynamoDB for AWS), run terraform force-unlock only if safe.
Timeouts / API Errors: provider timeout or cloud API flakiness. Symptom: Error: timeout while waiting for state to become 'available'. Fix: use a timeouts {} block, retry the apply.
Destroy/Recreate Conflicts: in-use resources can’t be destroyed (e.g., a Security Group still attached). Symptom: Error: DependencyViolation: resource in use. Fix: detach dependencies first, refactor the resource lifecycle.
Wrong Credentials / Expired Token: expired session token or misconfigured credentials. Symptom: Error: NoCredentialProviders. Fix: refresh credentials, configure GitHub OIDC/AWS profile correctly.
Provisioner Script Failures: remote-exec / local-exec fails inside a resource. Symptom: Error: remote-exec provisioner error. Fix: debug the script, test it manually, use Ansible/cloud-init instead.
Drift Detected: infrastructure drift causes unexpected changes. Symptom: Error: resource already changed outside Terraform. Fix: run terraform refresh, import drifted resources.
🔹 Interview-Ready STAR Examples
IAM Permission Issue
S: While applying Terraform for an S3 bucket, AccessDenied occurred.
T: Needed to provision resources with the correct IAM permissions.
A: Updated the pipeline IAM role with s3:CreateBucket and s3:PutBucketPolicy.
R: Apply succeeded; added IAM permission checks in the pipeline.
Resource Already Exists
S: Apply failed because an RDS instance was manually created before Terraform.
A: Used terraform import to bring it into state (see the command sketch after these examples).
R: Future applies managed the resource correctly.
Quota Limit
S: EKS worker node scaling failed during apply.
A: Identified hitting EC2 instance quota.
R: Requested AWS quota increase + added monitoring.
Dependency Conflict
S: Apply failed creating Security Group due to VPC dependency.
A: Added an explicit depends_on to ensure ordering.
R: Apply succeeded.
Timeout
S: Apply hung creating RDS DB.
A: Increased timeouts in the resource block.
R: Apply completed without manual retries.
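A sketch of the commands referenced in these examples (the resource address, DB identifier, and lock ID are placeholders):
terraform import aws_db_instance.legacy <db-instance-identifier>   # bring a manually created RDS instance under Terraform state
terraform force-unlock <lock-id>    # release a stuck state lock, only after confirming no other apply is running
terraform plan -refresh-only        # review drift against real infrastructure before reconciling (Terraform >= 0.15.4)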