Final Round
introduction
Thank you for the opportunity.
I have 3.8 years of experience as a DevOps Engineer. In my current role,
I’ve worked extensively with Docker and Kubernetes — managing EKS clusters, deploying Helm charts, and handling autoscaling with Karpenter.
I have also managed AWS infrastructure and automated provisioning using Terraform and Ansible.
I have implemented CI/CD pipelines for deploying containerized workloads on Docker and Kubernetes.
Additionally, I’ve set up monitoring and logging using Prometheus, Grafana, and Loki to ensure system availability and proactively handle incidents.
I believe this experience will help me contribute to delivering secure, scalable, and cost-effective solutions at ZS.
Thank you for the opportunity.
I have around 3.8 years of experience as a DevOps Engineer. In my current role, I work extensively with Docker and Kubernetes, managing EKS clusters, deploying Helm charts, and handling autoscaling with Karpenter. I also manage AWS infrastructure and automate provisioning using Terraform and Ansible. I’ve implemented CI/CD pipelines for containerized workloads and set up monitoring and logging solutions using Prometheus, Grafana, and Loki to ensure system availability and proactively address incidents. I’m excited about the opportunity at ZS, and I believe my experience can really help contribute to delivering secure, scalable, and cost-effective cloud solutions.
Break into chunks
Don’t try to say it in one long sentence. Break it into 3–4 thought chunks, each with a small pause:
security - Vault, container scanning, WAF rules; secure internet access using NAT Gateways and VPC Endpoints to control internet access for private resources; SGs and NACLs; Cloudflare WAF, Tunnel, WARP
scalability - autoscaling with HPA and Karpenter
reliability - monitoring and alerting
cost optimization - Spot and Reserved EC2, gp3 volumes, t3a.medium instances
automation - Terraform and modules, Ansible
Tech questions
Way to answer
What is .. ?
What it is: Blue-green is a deployment strategy where we run two identical environments: one currently serving traffic (blue) and one idle (green).
How it works: We deploy the new version to the green environment, validate it with tests, and then switch user traffic from blue to green. If an issue occurs, traffic can be instantly routed back to blue.
Purpose: It is used to minimize downtime and reduce risk during releases.
When to use: For critical production systems and major version upgrades where downtime is unacceptable and a fast rollback is needed.
Tradeoffs: It comes at a higher cost because we run two identical environments.
Example
Q. What is blue-green strategy?
A. Blue-green is a deployment strategy where we run two identical environments: one currently serving traffic (blue) and one in an idle state (green). We deploy the new version to the green environment, validate it with tests, and then switch user traffic from the blue environment to the green environment. If an issue occurs, traffic can be instantly routed back to blue. This minimizes downtime and reduces risk during releases. It is used for critical production systems and major version upgrades where downtime is unacceptable and a fast rollback is needed, but it comes at a higher cost because we run two identical environments.
Current prod (Blue) runs v1.
Deploy v2 to Green, test it.
Switch traffic → Green (v2).
If stable, next release (v3) will go to Blue (now idle).
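If asked how the traffic switch actually happens, a minimal sketch for a label-based blue/green setup behind one Kubernetes Service (the service name, labels, and manifest file are illustrative assumptions; on AWS the same switch can be done with weighted ALB target groups or Route 53):
kubectl apply -f deployment-green.yaml   # deploy v2 (green) alongside the running blue deployment
kubectl patch service my-app -p '{"spec":{"selector":{"app":"my-app","version":"green"}}}'   # switch traffic to green
kubectl patch service my-app -p '{"spec":{"selector":{"app":"my-app","version":"blue"}}}'    # instant rollback to blue if needed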
“Difference between X and Y?”
A Deployment is a Kubernetes controller that manages stateless applications by maintaining a specified number of pod replicas. It provides features like rolling updates, rollbacks, scaling, and self-healing, ensuring high availability and declarative updates for applications. Deployments are ideal for managing applications that require consistent uptime and easy version management.
while
scenario-based questions
Creating and following SOPs
SOPs (standard operating procedures) are pre-approved, repeatable playbooks for handling operations.
SOP = repeatable, known issue/process
An SOP is a documented set of steps that engineers follow to handle recurring issues or tasks in a consistent and reliable way.
Title → Clear name of the procedure.
Purpose → Why this SOP exists.
Scope → When/where to use it (what systems/issues it covers).
Pre-requisites → Tools, permissions, or access needed.
Steps → Detailed resolution steps (step-by-step, no assumptions).
Verification → How to confirm the issue is resolved.
Escalation → What to do if steps fail (who to contact, next level).
Post-Incident → Logging, documentation, RCA if needed.
Title: Handling “Node NotReady” Issue in Amazon EKS
Purpose: To quickly identify and resolve situations where an EKS worker node is stuck in the NotReady state, impacting workloads.
Scope: Applies to all EKS clusters running on AWS where nodes may enter NotReady state.
Pre-requisites:
AWS CLI configured
kubectl access to the cluster
IAM permissions for EC2 and EKS operations
Procedure (Steps):
Step 1: Verify Node Status
kubectl get nodes
Step 2: Check EC2 Instance Health
Step 3: Check kubelet status and logs on the node
Step 4: Common Fixes
If kubelet has failed, restart it.
If the EC2 instance has failed, restart or replace it.
Verification: confirm the issue is resolved and the system is working correctly (node back in Ready state, workloads running).
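A sketch of the commands behind the steps above (node name and instance ID are placeholders):
kubectl get nodes                                              # Step 1: node status
kubectl describe node <node-name>                              # node conditions and recent events
aws ec2 describe-instance-status --instance-ids <instance-id>  # Step 2: EC2 health checks
sudo systemctl status kubelet && sudo journalctl -u kubelet --since "30 min ago"   # Step 3: on the node via SSM/SSH
sudo systemctl restart kubelet                                 # Step 4: if kubelet is unhealthy
aws ec2 reboot-instances --instance-ids <instance-id>          # Step 4: if the instance itself is unhealthy
kubectl get nodes                                              # Verification: node should return to Ready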
Escalation
If issue persists → escalate to Cloud SRE Lead / AWS Support.
Provide logs, steps followed, and snapshots (kubectl describe, kubelet logs, EC2 status).
Post-Incident
Document root cause (e.g., CNI crash, disk full, EC2 hardware issue).
Update runbook if new resolution steps are discovered.
Technical
Interviewer: “If a pod keeps crashing with ‘CrashLoopBackOff’, what’s the first thing you check?” You (not 100% sure of exact order): “I’d start with kubectl describe pod to check events and the last state, then jump into logs with kubectl logs --previous. If it’s a startup failure, I’d suspect a misconfigured livenessProbe or missing ConfigMaps/Secrets.”
Interviewer: “Are you sure that’s the right order?” 👊 Your Harvey Move: “I’m sure enough to start there — because 80% of CrashLoop cases are surfaced in events or logs. But if I hit a dead end? I’d immediately pull up the deployment manifest and compare against known-good configs. In production, guessing costs uptime — so I verify as I go.”
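The underlying triage commands for that answer, as a sketch (pod, namespace, and deployment names are placeholders):
kubectl describe pod <pod> -n <namespace>              # events, last state, exit code, probe failures
kubectl logs <pod> -n <namespace> --previous           # logs from the last crashed container
kubectl get deployment <name> -n <namespace> -o yaml   # compare against a known-good manifest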
Q. What is blue-green deployment?
You sure?
I’m sure about the strategy/logic/structure/concepts
How do you handle conflict
I handle conflict by listening to all perspectives and keeping the discussion focused on facts and project needs rather than opinions. I encourage the team to list pros and cons objectively, align the decision with project requirements, and aim for a solution that everyone can support.
I remember we had a conflict in our team while choosing the right cluster autoscaler for our workloads. Our workloads experienced high traffic spikes during events, and we needed scaling that was both fast and cost-efficient.
One side argued for Cluster Autoscaler because it’s mature, stable, and widely adopted. The other side pushed for Karpenter, which provisions nodes much faster and supports Spot instances to reduce costs.
I handled the conflict by first listening to both sides and then driving a fact-based discussion. We listed out the pros and cons of each tool: Cluster Autoscaler offered maturity but slower scaling, while Karpenter was newer but optimized better for bursty traffic. After evaluating our workload patterns and SLAs, we decided to go with Karpenter for production since it fit our scaling needs, while still retaining Cluster Autoscaler for some legacy workloads.
By focusing on our requirements rather than personal preferences, we resolved the conflict, and the decision was accepted by the entire team.
how do you handle escalation?
When an issue arises, I first try to resolve it myself by following monitoring dashboards, logs, and documented SOPs. If I can’t resolve it within the expected SLA or if it’s a critical incident impacting production, I escalate according to our incident process — usually to the team lead or the specialist team like DBA or Networking. While escalating, I make sure to provide all the details — error messages, metrics, and steps already taken — so they don’t have to start from scratch. I also keep communication open with stakeholders until the issue is resolved. Afterward, I contribute to documenting or improving the SOP so that next time, the resolution is faster and maybe doesn’t require escalation at all.
I first try to resolve it myself by following monitoring dashboards, logs, and documented SOPs.
If I can’t resolve it within the expected SLA or if it’s a critical incident impacting production
escalate according to our incident process — usually to the team lead or the specialist team (like DBA or Networking), so it reaches the right team
I make sure to provide all the details — error messages, metrics, and steps already taken with snapshot, so they don’t have to start from scratch.
I also keep communication open with stakeholders until the issue is resolved.
Afterward, I contribute to documenting or improving the SOP so that next time, the resolution is faster and maybe doesn’t require escalation at all.
How do you perform RCA?
First, acknowledge and mitigate the issue (restore service, if possible, apply quick fix)
Gather logs, metrics, traces, check system health
Note timeline of events: when issue started, when alerts fired, when fixes applied.
Identify the Root Cause
Ask the 5 Whys (keep drilling down until you find the underlying cause, not just symptoms).
Example:
Service crashed → Why? OOM.
Why OOM? Memory leak.
Why memory leak? Unbounded cache.
Why unbounded cache? No eviction policy.
Root cause = bad cache implementation.
Document Findings
Create RCA report including:
Summary of incident.
Timeline of events.
Impact (users, systems, revenue).
Root cause.
Resolution steps taken.
Action items (permanent fixes, monitoring improvements, automation).
Prevent Recurrence
Update SOPs/runbooks.
Add monitoring/alerts if gaps found.
Automate recovery if possible (self-healing).
Share lessons learned in postmortem meeting.
"I perform RCA by first limiting the impact to restore service quickly. Then I collect logs, metrics, and timelines to understand what happened. I analyze the data, often using methods like the 5 Whys, to trace the problem back to the true root cause rather than just the symptoms. I document the findings in an RCA report with impact, timeline, and corrective actions. Finally, I ensure permanent fixes are implemented — like adding monitoring, automation, or code/configuration changes — so the issue doesn’t repeat."
ROOT CAUSE OF EKS NODE NOT-READY ISSUE
Let’s do a real-time 5 Whys analysis for the Kubernetes event:
❗ “Node Not Ready”
This is a common production issue. We’ll drill down as if we’re live-debugging in an incident war room.
🚨 Problem: Node is reporting NotReady status in kubectl get nodes
❓ Why #1: Why is the node NotReady?
Because the kubelet on that node has stopped posting its heartbeat (NodeStatus) to the Kubernetes API server within the configured timeout (default: ~40s).
✅ So it’s not necessarily that the machine is dead — just that the control plane hasn’t heard from kubelet recently.
❓ Why #2: Why has kubelet stopped posting heartbeats?
Because kubelet is either crashed, unresponsive, or can’t communicate with the API server.
Let’s check logs:
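For example, on the affected node (via SSM or SSH), assuming a systemd-managed kubelet:
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "30 min ago" | tail -n 100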
Possible findings:
Kubelet process is restarting/crashing
Network timeouts to API server
Certificate expiry
Disk pressure / resource starvation
👉 Let’s assume we see this log:
"Failed to update node status, will retry: error updating node status, Liveness probe failed"
Or maybe:
"server certificate has expired"
But let’s pick the most common real-world cause → Kubelet is under memory pressure and got OOMKilled.
❓ Why #3: Why was kubelet OOMKilled?
Because the node ran out of memory, and the Linux OOM Killer chose kubelet (or a critical system component) to kill to free up memory.
Check:
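A likely set of checks (node name is a placeholder):
kubectl describe node <node-name> | grep -A 10 Conditions
kubectl get events --field-selector involvedObject.name=<node-name>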
You might see:
MemoryPressure: True
NodeHasInsufficientMemory (kubelet, <node>): kubelet has insufficient memory available
Also, check node resource usage:
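For example (requires metrics-server; node name is a placeholder):
kubectl top node <node-name>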
→ You see 95%+ memory usage.
❓ Why #4: Why did the node run out of memory?
Because workloads scheduled on the node consumed more memory than allocatable, and/or system daemons or kernel buffers used up remaining memory.
Drill into pods on that node:
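A likely command (node name is a placeholder):
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>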
Then check memory per pod:
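For example (requires metrics-server):
kubectl top pod --all-namespaces --sort-by=memory | head -n 20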
You find:
A misconfigured app pod with no memory limit.
Or a runaway Java app with -Xmx8g on a 4GB node.
Or a DaemonSet (like a logging agent) leaking memory over time.
👉 Let’s say: A critical application pod had no memory limit and consumed all available memory.
❓ Why #5: Why was there a pod with no memory limit consuming unlimited memory?
Because resource requests/limits were not enforced in the cluster, and no policy/gatekeeper prevented deployment of such pods.
Root cause uncovered:
✅ No ResourceQuota or LimitRange policies + lack of governance allowed a pod to be deployed without memory limits → pod consumed all node memory → kubelet OOMKilled → node became NotReady
🎯 ROOT CAUSE:
Missing cluster-level guardrails (ResourceQuota/LimitRange/OPA/Gatekeeper) allowed deployment of an unbounded pod that exhausted node memory, triggering OOMKill of kubelet and causing Node NotReady.
🛠️ Recommended Fixes (Address Root Cause)
✅ Enforce memory limits using LimitRange in every namespace (see the sketch after this list).
✅ Apply ResourceQuotas per namespace to cap total usage.
✅ Use Policy Engines like Kyverno or OPA Gatekeeper to block deployments missing limits.
✅ Enable Vertical Pod Autoscaler (VPA) or set proper requests based on profiling.
✅ Monitor & alert on the MemoryPressure node condition and the kube_node_status_condition metric from KSM.
✅ Add node-level alerts for sustained high memory usage before it leads to OOM.
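A minimal LimitRange sketch for the first fix above, applied with kubectl (the namespace name and memory values are illustrative assumptions):
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits
spec:
  limits:
  - type: Container
    default:
      memory: 512Mi        # default limit applied when a container sets none
    defaultRequest:
      memory: 256Mi        # default request applied when a container sets none
EOF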
🔁 Bonus: Alternative Paths in 5 Whys
Depending on what you find in logs, your 5 Whys path may vary:
If network issue: Why? → CNI plugin crashed → Why? → ConfigMap corrupted → Why? → No config validation → Root cause = lack of GitOps/config drift protection.
If certificate expiry: Why? → Auto-renewal failed → Why? → kubelet cert rotation misconfigured → Root cause = insecure/default kubelet setup.
If disk pressure: Why? → Logs filled disk → Why? → No log rotation → Root cause = missing DaemonSet for fluent-bit/logrotate.
🧭 Pro Tip: Use KSM Metrics for Early Detection
Create an alert based on the kube_node_status_condition metric, for example kube_node_status_condition{condition="Ready",status="true"} == 0 (node not reporting Ready).
Or: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1 firing for several minutes.
✅ Solving “Node Not Ready” isn’t about restarting kubelet — it’s about preventing the conditions that lead to it.
How do you handle issues and tasks
First, I understand the problem and check the impact. Then I look at logs, metrics, and monitoring dashboards to narrow down whether it’s infra, app, or external. If it’s a known issue, I follow our SOP. If it’s new, I investigate deeper using system tools and collaborate with other teams if needed. Once resolved, I validate the fix and document it so it becomes easier to handle in the future."
For recurring issues, I focus on automation and permanent fixes — I follow SOPs, perform RCA, and eliminate the root cause so it doesn’t repeat.
When a new problem comes up, I first check what’s going on and stop it from getting worse to reduce impact, then investigate deeply, escalate if needed, and finally document the solution so it becomes an SOP for future use
how did you debug/troubleshoot issues?
🔹 How to Answer: “How do you debug/troubleshoot issues?”
✅ Step 1: Understand the Problem
Start with the symptoms: error messages, alerts, logs.
Ask: What exactly is broken? Who/what is impacted?
✅ Step 2: Reproduce (if possible)
Try to reproduce issue in staging/test environment.
Helps confirm if issue is systemic or environment-specific.
✅ Step 3: Check Observability Data
Use logs, metrics, traces, dashboards (Grafana, CloudWatch, Prometheus, Loki).
Identify abnormal spikes (CPU, memory, disk, latency).
✅ Step 4: Narrow Down the Scope
Is it app-level (code bug), infra-level (node/DB/network), or external (AWS service, DNS)?
Divide & conquer → rule out healthy components.
✅ Step 5: Apply Known Fixes / SOPs
If it’s a known recurring issue → follow runbook/SOP.
Example: Node NotReady → check kubelet, restart or replace.
✅ Step 6: Deep Investigation (if unknown)
Use debugging tools:
kubectl describe pod/node (K8s)
journalctl, dmesg (Linux)
EXPLAIN ANALYZE (Postgres slow queries)
Tracing requests across microservices.
✅ Step 7: Collaborate / Escalate
If out of scope → escalate to specialist (DBA, networking, AWS support).
Share logs + steps already tried (saves time).
✅ Step 8: Verify Resolution
After fix → confirm system is healthy.
Validate via monitoring dashboards + user experience.
✅ Step 9: Document & Improve
Update SOP/runbook.
Add monitoring/alerting if gap found.
Feed into postmortem/RCA if major issue.
🔹 Interview-Ready Answer
"I troubleshoot issues in a structured way. First, I understand the problem and check the impact. Then I look at logs, metrics, and monitoring dashboards to narrow down whether it’s infra, app, or external. If it’s a known issue, I follow our SOP. If it’s new, I investigate deeper using system tools and collaborate with other teams if needed. Once resolved, I validate the fix and document it so it becomes easier to handle in the future."
Behavioral-scenario question
Follow STAR Approach - Situation → Task → Action → Result.
personality check
Harvey personality
Harvey walks into every room like he owns it
You think you’re smart? Fine. But I’m smarter because I know how to use it.
preparation - He reads files, studies opponents, rehearses lines, and anticipates moves.
Preparation builds confidence. The more you prep, the more comfortable you are admitting you don’t know something — because you know you can find out fast.
Harvey: When I got here I dominated; nobody questions my ability to get the job done.
First impression is the last impression.
He doesn’t get it: just doing good work isn’t the whole job.
Measures Greatness by Results, Not Knowledge
In interviews, results-oriented people matter more than knowledge alone.
don't know the answer
“Let me think about that for a sec” or
“That’s a great question — let me make sure I address it properly.”
Recall related knowledge.
Structure your answer.
You sure?
Based on what I know right now? Yes, I’m sure
Why? Do you see any flaw in that approach? I’d love to hear your take — maybe there’s a nuance I haven’t considered yet.
I’m sure about the logic, less sure about the edge case
Ansible: Yes, I’ve got hands-on experience with Ansible — I’ve used it for tasks like provisioning servers and writing playbooks. I don’t work with it daily, but I’m comfortable using it when needed.
ArgoCD: Not yet, but I’ve read about its workflow — ArgoCD continuously monitors Git for changes and applies them to the cluster. If I get the chance, I’d love to implement it, as it builds on the skills I already have with Kubernetes and GitOps.
do you have any question
Questions
What are the most common production challenges you face with client workloads?
How do you prioritize between speed of delivery and platform stability/reliability?
What KPIs matter most to you for measuring cloud success?
What skills do you see as most critical for cloud engineers to develop in the next few years?
let them talk about cost, security, reliability, compliance, speed, etc.
If he already mentions the CCoE team
👉 “What are the primary focus areas for the CCoE team when balancing cost, security, reliability, and delivery speed?”
If he hasn’t mentioned the CCoE team yet
👉 “I’d like to understand more about the Cloud Center of Excellence team. What are the primary focus areas when balancing cost, security, reliability, and delivery speed?”
ending ..
No, that covers everything I wanted to ask. Thank you for taking the time to explain the team and its priorities. I’m looking forward to contributing to the CCoE team, working at scale, and helping deliver secure, reliable, and scalable cloud solutions. Thank you!
Note: whenever you prepare for theory prepare scenario QA as well
Non technical
from and family
I come from Jalgaon and have a supportive family.
My younger brother works in IT, my father is a farmer, and my mother is a homemaker.
They have always encouraged me to focus on education and personal growth, which inspired me to pursue a career in technology and DevOps.
strength
So, my strength is that I am a result-oriented person and I handle situations quickly.
weakness
I have a curiosity to learn new technologies, but sometimes it distracts me from the tasks I need to focus on.
Why do you want to join ZS Associates?
ZS is a global consulting firm with a strong presence in healthcare and life sciences. I’m excited about this opportunity because my skills align well with the role, and I heard the CCoE team operates around 200 EKS clusters, which would be exciting to work with at that scale. ZS is also an AWS partner with AI and cloud solutions in the AWS Marketplace, which excites me about joining ZS.
Where do you see yourself in 5 years?
In 5 years, I see myself as a senior cloud engineer or architect, who build and design highly scalable and reliable cloud solution and keep learning emerging technologies
Why should we hire you?
I bring 3.8 years of hands-on experience in the cloud infrastructure, Kubernetes, and DevOps space. My experience aligns well with the responsibilities of this role, and I am confident I can contribute to delivering secure, scalable, and cost-effective cloud solutions at ZS.
My interests match ZS’s tech stack, so I believe I am the right candidate for ZS.
Tell me about a challenge you faced at work and how you handled it
In my previous role, we faced a challenge with sudden spikes in workload on our Kubernetes cluster. The existing Horizontal Pod Autoscaler wasn’t sufficient to scale nodes quickly, which sometimes caused pods to stay in Pending state. I implemented Karpenter as a dynamic cluster autoscaler, which automatically provisions and scales nodes based on workload demand. This solution reduced pod scheduling delays, improved application reliability, and optimized resource utilization across the cluster."
How do you handle stress or tight deadlines?
I prioritize tasks, break them into smaller steps, and focus on one task at a time. I also communicate proactively with my team and managers about progress and challenges. Using this approach, I’ve successfully delivered critical projects under tight deadlines without compromising quality."
How do you work in a team?
I enjoy collaborating with team members, sharing knowledge, and learning from others. I make sure to communicate clearly, help resolve conflicts, and focus on the team’s goals. I also mentor new members to help them get up to speed quickly."
What motivates you?
"I’m motivated by solving complex problems, learning new technologies, and seeing the impact of my work on system performance and reliability. I also enjoy mentoring others and contributing to team growth.
What are your hobbies/interests?
"I enjoy learning travelling
How do you handle mistakes at work?
"I acknowledge mistakes immediately, analyze it, and take corrective actions to prevent recurrence.
Kubernetes scenario QA
QA
These 5 conditions are very common Kubernetes failure scenarios. Interviewers usually test them in scenario-based questions where you explain in STAR (Situation – Task – Action – Result) format.
Here’s how you can frame interview-ready answers for each case:
1. Node Not Ready
Situation:
In our EKS cluster, one of the worker nodes went into NotReady state during a production release.
Task: My task was to quickly identify the cause and bring the node back online without disrupting critical workloads.
Action:
Checked node status: kubectl describe node <node-name>
Found that kubelet was not reporting due to high disk usage.
Pods producing high log output (e.g., debug mode, verbose logs) filled /var/lib/docker/containers/.../*.log, and the node reported disk pressure.
Cleared unused Docker images and logs.
sudo truncate -s 0 /var/lib/docker/containers/*/*.log
Discussed with the development team and advised them to reduce verbose logging and expose only necessary logs.
If needed, cordoned and drained the node, then restarted kubelet service.
Alternative: if unable to SSH into the node, drain the node, delete it, and let the ASG create a new one; this is also a valid approach.
Result:
Node transitioned back to Ready, pods were rescheduled, and we lowered the disk-usage alert threshold to proactively prevent future issues.
2. Pod Pending
Situation:
Some pods in the default namespace were stuck in Pending state during an autoscaling event.
Task: I had to find why the scheduler wasn’t placing pods and fix it quickly.
Action:
Ran kubectl describe pod <pod> → saw an Insufficient CPU error.
Confirmed with kubectl get nodes -o wide that nodes were fully utilized.
Increased node group size using Cluster Autoscaler.
Updated pod resource requests to realistic values to avoid over-requesting.
Result: Pods got scheduled successfully, and autoscaling handled future workload spikes smoothly.
3. CrashLoopBackOff
Situation:
A new microservice deployment was failing repeatedly and showing CrashLoopBackOff.
Task: I needed to identify the cause and stabilize the application.
Action:
Checked logs: kubectl logs <pod> --previous
Found the container was exiting due to a missing environment variable (DB password).
ConfigMap/Secret not mounted
Wrong command/Code bug - process exits with non-zero status.
Fixed ConfigMap and Secret references in the Deployment YAML.
Rolled out a new deployment with kubectl rollout restart.
Result: Application came up successfully, and we introduced pre-deployment validation checks for required env variables.
4. OOMKilled
Situation:
A data processing job pod was repeatedly killed with status OOMKilled.
Task: I had to optimize the pod resource allocation and prevent service impact.
Action:
Described pod: kubectl describe pod <pod> → saw memory limit exceeded.
Increased memory limits in the Deployment manifest from 512Mi → 2Gi.
Also optimized the application code to use streaming instead of loading entire dataset in memory.
Result: Pod ran successfully without OOM kills, and resource requests/limits were adjusted for all workloads to prevent overcommitment.
5. ImagePullBackOff
Situation:
During a deployment, pods were stuck in ImagePullBackOff state.
Task: My job was to ensure the correct image could be pulled and deployed.
Action:
Described pod: kubectl describe pod <pod> → found an “image not found” error for the private ECR repository.
Verified the image existed and corrected the tag in the manifest.
Checked IAM permissions for worker nodes to access ECR and fixed missing policy.
Re-deployed after fixing the image reference and registry access.
Result: Pods pulled images successfully, and we set up CI/CD pipeline automation to tag and push images correctly before deployment.
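A sketch of the checks behind this scenario (repository, tag, account ID, and region are placeholders):
aws ecr describe-images --repository-name <repo> --image-ids imageTag=<tag>   # confirm the image/tag actually exists
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com   # verify registry auth from a node or CI runner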
🔥 Pro tip for interviews:
When you present these, keep your Action focused on tools/commands (kubectl, CloudWatch, Grafana, etc.) and your Result on impact reduction and long-term preventive measures.
build failure
QA
Let’s go deeper on build failures in GitHub Actions pipelines (before deployment). These are the most common CI/CD build-time issues you’ll see:
🔹 Common Build Failures in GitHub Actions (CI Stage)
Missing Dependencies: wrong or missing entries in requirements.txt, package.json, pom.xml, etc. Symptom: ModuleNotFoundError: No module named 'flask' or npm ERR! missing script: build. Fix: update the dependency file, lock versions, run pip freeze > requirements.txt.
Version Conflicts: a library requires a specific version not compatible with another. Symptom: ERROR: Cannot install - version conflict. Fix: use dependency pinning (e.g., requests==2.32.3), upgrade/downgrade packages.
Private Repo Dependency: code depends on a private GitHub/NPM/PyPI repo. Symptom: npm ERR! 403 Forbidden or pip install ERROR: 401 Unauthorized. Fix: use the GitHub Actions GITHUB_TOKEN or deploy keys for authentication.
Caching Issues: the GitHub Actions cache stores old dependencies. Symptom: builds pass locally but fail in CI. Fix: clear/rebuild the cache, use actions/cache properly with hashFiles('**/requirements.txt').
Build Tool Missing: a required CLI is not installed in the runner. Symptom: make: command not found or mvn: not found. Fix: install the tool in the workflow (sudo apt-get install -y build-essential).
OS-Specific Dependency: works on the dev machine but not in the Linux runner. Symptom: cannot find /usr/lib/... or failed to build wheel. Fix: add OS packages (apt-get install libpq-dev), use multi-platform builds.
Docker Build Failure: wrong Dockerfile instructions or missing files. Symptom: COPY failed: file not found in build context. Fix: verify Dockerfile paths, use .dockerignore correctly.
Authentication Failure for Registry: pushing an image to ECR/GHCR without login. Symptom: denied: User not authorized to perform ecr:PutImage. Fix: use aws-actions/amazon-ecr-login@v2 or docker/login-action.
Tests Fail in CI: unit/integration tests not passing. Symptom: pytest FAILED or npm test failed. Fix: fix the code/test mismatch, mock external dependencies.
Secrets Missing in CI: a test requires an env var not set in GitHub Secrets. Symptom: KeyError: DB_PASSWORD or undefined process.env.API_KEY. Fix: define secrets under repo → Settings → Secrets and reference ${{ secrets.DB_PASSWORD }}.
🔹 How to Prevent Build Failures
Dependency pinning (package==version) → avoid version drift.
Use caching properly → speeds up installs, but bust the cache when requirements.txt changes.
Run builds locally in the same environment as the GitHub runner (Ubuntu, not Windows/Mac); see the sketch after this list.
Fail fast → run linting/tests before heavy build steps.
Use matrix builds → test across versions (Python 3.9, 3.10 etc.).
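One way to apply the “same environment as the runner” point above, as a minimal sketch (the image tag and install/test commands are assumptions; adjust to the project):
docker run --rm -it -v "$PWD":/src -w /src python:3.11-slim \
  bash -c "pip install -r requirements.txt && pytest"   # assumes pytest is listed in requirements.txt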
🎤 Interview-ready way to explain:
“Most build failures in GitHub Actions come from missing dependencies, version conflicts, or misconfigured build tools. For example, I’ve seen Python builds fail with ModuleNotFoundError because requirements.txt was incomplete, or Docker builds fail because COPY referenced a missing file. I usually fix these by verifying dependency files, pinning versions, using caching, and aligning local dev with the GitHub runner environment.”
Terraform
QA
Now let’s switch focus to failures during terraform apply (execution time, not just plan time). These are the common real-world Terraform apply issues you’ll encounter:
🔹 Common Failures During terraform apply
Insufficient IAM Permissions: the user/role doesn’t have the required AWS/GCP/Azure permissions. Symptom: Error: AccessDenied: User is not authorized to perform... Fix: update the IAM policy with least-privilege permissions, test with aws iam simulate-principal-policy.
Resource Already Exists: the resource was created manually outside Terraform. Symptom: Error: [Conflict] resource already exists. Fix: import the resource (terraform import) or delete the manual resource.
Quota / Limit Exceeded: hitting cloud resource limits (e.g., VPC, EIP, EKS nodes). Symptom: Error: Quota exceeded. Fix: request a quota increase, clean up unused resources.
Invalid/Conflicting Config: wrong VPC ID, subnet mismatch, or a dependency not yet created. Symptom: Error: InvalidSubnetID.NotFound or DependencyViolation. Fix: fix the config, add depends_on if needed.
State Lock Error: another apply is running or the lock was not released. Symptom: Error: Error acquiring the state lock. Fix: use remote state with locking (S3 + DynamoDB for AWS), run terraform force-unlock only if safe.
Timeouts / API Errors: provider timeout or cloud API flakiness. Symptom: Error: timeout while waiting for state to become 'available'. Fix: use a timeouts {} block, retry the apply.
Destroy/Recreate Conflicts: in-use resources can’t be destroyed (e.g., a Security Group still attached). Symptom: Error: DependencyViolation: resource in use. Fix: detach dependencies first, refactor the resource lifecycle.
Wrong Credentials / Expired Token: expired session token or misconfigured credentials. Symptom: Error: NoCredentialProviders. Fix: refresh credentials, configure GitHub OIDC/AWS profile correctly.
Provisioner Script Failures: remote-exec / local-exec fails inside a resource. Symptom: Error: remote-exec provisioner error. Fix: debug the script, test it manually, use Ansible/cloud-init instead.
Drift Detected: infrastructure drift causes unexpected changes. Symptom: Error: resource already changed outside Terraform. Fix: run terraform refresh, import drifted resources.
🔹 Interview-Ready STAR Examples
IAM Permission Issue
S: While applying Terraform for an S3 bucket, AccessDenied occurred.
T: Needed to provision resources with the correct IAM permissions.
A: Updated the pipeline IAM role with s3:CreateBucket and s3:PutBucketPolicy.
R: Apply succeeded; added IAM permission checks in the pipeline.
Resource Already Exists
S: Apply failed because an RDS instance was manually created before Terraform.
A: Used terraform import to bring it into state (see the command sketch after these examples).
R: Future applies managed the resource correctly.
Quota Limit
S: EKS worker node scaling failed during apply.
A: Identified hitting EC2 instance quota.
R: Requested AWS quota increase + added monitoring.
Dependency Conflict
S: Apply failed creating Security Group due to VPC dependency.
A: Added an explicit depends_on to ensure ordering.
R: Apply succeeded.
Timeout
S: Apply hung creating RDS DB.
A: Increased timeouts in the resource block.
R: Apply completed without manual retries.
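A sketch of the commands referenced in these examples (the resource address, DB identifier, and lock ID are placeholders):
terraform import aws_db_instance.legacy <db-instance-identifier>   # bring a manually created RDS instance under Terraform state
terraform force-unlock <lock-id>    # release a stuck state lock, only after confirming no other apply is running
terraform plan -refresh-only        # review drift against real infrastructure before reconciling (Terraform >= 0.15.4)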