SRE

Linux QA

Below are 20 commonly asked Linux-related questions in Google SRE interviews, along with answers that reflect what a top candidate would say: someone who understands the why, not just the how.


🧠 Top 20 Google SRE Linux Interview Questions (with Ideal Answers)


1. What happens when you type google.com in your browser?

Best Answer:

When you type google.com, the browser first checks its own and the OS's DNS caches, then the resolver looks up the name, consulting /etc/hosts and the nameservers listed in /etc/resolv.conf. Once it has the IP, the browser opens a TCP connection (3-way handshake) to port 443, and a TLS handshake secures the connection. The browser then sends an HTTP GET request, the server responds, and the browser renders the page. Underneath, IP routing picks the next hop for each packet, and ARP resolves that next hop's MAC address (usually the default gateway) on the local network.


2. How does ps get process information?

Best Answer:

The ps command reads from the /proc virtual filesystem, especially files like /proc/[pid]/stat, /proc/[pid]/cmdline, etc. It also uses /proc/stat and other kernel interfaces to gather CPU and memory stats.
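As a rough illustration of what reading /proc/[pid]/stat involves, here is a minimal Python sketch. It is not how ps itself is implemented; the function name parse_proc_stat and the sample line are made up for the example, and the field positions follow the proc(5) layout (state is field 3, utime is field 14, stime is field 15).

```python
def parse_proc_stat(line: str) -> dict:
    """Parse one line in /proc/[pid]/stat format into a few named fields."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split around the LAST ')' rather than on whitespace.
    left, _, right = line.rpartition(")")
    pid_text, _, comm = left.partition("(")
    fields = right.split()
    return {
        "pid": int(pid_text),
        "comm": comm,
        "state": fields[0],               # R, S, D, Z, T, ...
        "ppid": int(fields[1]),
        "utime_ticks": int(fields[11]),   # field 14 in proc(5)
        "stime_ticks": int(fields[12]),   # field 15 in proc(5)
    }


# Hypothetical sample line; on a Linux host you could instead pass
# open("/proc/self/stat").read().
sample = "4242 (my (odd) cmd) S 1 4242 4242 0 -1 4194304 120 0 0 0 37 12 0 0 20 0 1 0 100 0 0"
info = parse_proc_stat(sample)
print(info)
```

Splitting on the last closing parenthesis matters because a process can rename itself to something containing spaces, which breaks naive whitespace splitting.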


3. Explain top command output.

Best Answer:

top shows real-time system stats. Key fields:

  • %CPU: CPU usage by process

  • %MEM: Memory usage

  • LOAD AVG: 1/5/15-minute averages of the number of tasks that are runnable or in uninterruptible sleep (usually waiting on disk IO)

  • NI: Process niceness (priority)

top collects this data from /proc and refreshes every 3 seconds by default.


4. Difference between a process and a thread?

Best Answer:

A process is an independent execution unit with its own memory space. A thread is a lightweight subunit of a process that shares memory. Threads are faster to create and switch, but processes offer better isolation.


5. How do you check if a port is open and listening?

Best Answer:

sudo lsof -i :8080
netstat -tuln | grep 8080
ss -tuln | grep 8080

Or use nc -zv localhost 8080 to probe a port. ss is preferred in modern Linux systems.


6. What is the difference between a hard link and a soft link?

Best Answer:

  • Hard link: Same inode, file remains even if the original is deleted.

  • Soft (symbolic) link: Points to pathname. If the target is deleted, the link breaks.
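The difference is easy to see in a throwaway directory. The following is a small Python sketch (file names are arbitrary, and it assumes a filesystem that supports both link types, as typical Linux filesystems do):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")

    with open(original, "w") as f:
        f.write("hello")

    os.link(original, hard)      # hard link: a second name for the same inode
    os.symlink(original, soft)   # symlink: a small file storing the target path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    link_count = os.stat(original).st_nlink   # two names now point at the inode

    os.remove(original)          # drop the first name

    with open(hard) as f:
        hard_content = f.read()  # data is still reachable through the hard link
    # The symlink still exists as a link, but its target path is gone.
    soft_broken = os.path.islink(soft) and not os.path.exists(soft)

print(same_inode, link_count, hard_content, soft_broken)
```

The data blocks are only freed once the inode's link count drops to zero and no process has the file open.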


7. What is a zombie process?

Best Answer:

A zombie is a dead process whose parent hasn't read its exit status via wait(). It appears in ps with <defunct>. It doesn’t use CPU/memory but wastes a PID. Proper parent processes reap zombies to avoid PID exhaustion.


8. How does Linux manage memory?

Best Answer:

Linux uses virtual memory, with pages mapped to physical RAM. Unused pages may be swapped. The kernel uses page cache to cache disk IO. Tools like free, vmstat, and /proc/meminfo show memory usage.


9. What is the OOM killer?

Best Answer:

The Out of Memory (OOM) killer is triggered when memory is exhausted. It chooses a process to kill based on a badness score, freeing memory. It’s a last resort to avoid system crash.


10. How do you find the top memory-consuming processes?

Best Answer:

ps aux --sort=-%mem | head
top
htop

Or check /proc/[pid]/status for VmRSS, VmSize.


11. What does nice and renice do?

Best Answer:

nice sets a process's initial niceness (priority), ranging from -20 (high priority) to 19 (low). renice changes the priority of a running process.


12. How does the kernel schedule processes?

Best Answer:

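Linux has long used the Completely Fair Scheduler (CFS) for normal tasks. It tracks each task's "virtual runtime" (CPU time weighted by niceness) and keeps runnable tasks in a red-black tree so it can pick the task with the lowest virtual runtime. Kernel 6.6 replaced CFS with the EEVDF scheduler, which keeps the same fairness goal but also takes latency requirements into account.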
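The lowest-virtual-runtime rule can be illustrated with a toy simulation. This is a simplified sketch, not kernel code: a min-heap stands in for the red-black tree, the fixed slice length is an arbitrary assumption, and the weights are illustrative (1024 is the weight the kernel gives a nice-0 task).

```python
import heapq

NICE_0_WEIGHT = 1024  # weight of a nice-0 task


def simulate(weights: dict, ticks: int, slice_len: int = 1024) -> dict:
    """Return how many ticks each task ran under lowest-vruntime-first."""
    # Min-heap of (vruntime, name); ties are broken by task name.
    queue = [(0, name) for name in sorted(weights)]
    heapq.heapify(queue)
    ran = {name: 0 for name in weights}
    for _ in range(ticks):
        vruntime, name = heapq.heappop(queue)   # task with lowest vruntime
        ran[name] += 1
        # Heavier (higher-priority) tasks accumulate vruntime more slowly,
        # so they get picked more often.
        vruntime += slice_len * NICE_0_WEIGHT // weights[name]
        heapq.heappush(queue, (vruntime, name))
    return ran


shares = simulate({"A": 2048, "B": 1024}, ticks=30)
print(shares)  # A has twice B's weight and receives twice the ticks
```

With equal weights the two tasks simply alternate, which is the "fair" baseline the weighting then skews.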


13. Explain runlevels or systemd targets.

Best Answer:

Traditional runlevels (e.g., 3 for multi-user, 5 for GUI) are replaced by systemd targets (like multi-user.target, graphical.target). Use systemctl get-default and systemctl isolate to view/change them.


14. What happens during Linux boot process?

Best Answer:

  1. BIOS/UEFI initializes hardware

  2. Bootloader (GRUB) loads the kernel

  3. Kernel initializes drivers, mounts root FS

  4. init or systemd runs

  5. System targets/services are started


15. How do you monitor disk IO?

Best Answer:

Use iostat, iotop, or vmstat 1. iostat shows per-device read/write stats. iotop shows live disk IO per process.


16. How can you find out which process is using a file or port?

Best Answer:

lsof /path/to/file
lsof -i :80
fuser /path/to/file

These map file descriptors or ports to processes.


17. What’s the difference between /dev/null and /dev/zero?

Best Answer:

/dev/null discards anything written to it (bit bucket). /dev/zero outputs null bytes (\0) when read. It’s used for allocating zeroed memory.
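A quick way to see both behaviours is to read and write the devices directly. This Python sketch assumes a Unix-like system where both device files exist:

```python
with open("/dev/zero", "rb") as zero:
    zeros = zero.read(8)          # returns as many NUL bytes as requested

with open("/dev/null", "rb") as null:
    nothing = null.read(8)        # reading hits end-of-file immediately

with open("/dev/null", "wb") as null:
    written = null.write(b"discard me")   # accepted, then thrown away

print(zeros, nothing, written)
```

Reading /dev/null yields nothing at all, while /dev/zero never runs out, which is why only the latter is useful as a source of zero-filled data.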


18. How do you check for file descriptor leaks?

Best Answer:

ls /proc/<pid>/fd | wc -l
lsof -p <pid>

Compare open FDs over time to detect leaks. ulimit -n shows FD limit.
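The "compare over time" idea can be sketched in Python by counting entries under /proc/<pid>/fd. This assumes Linux; open_fd_count is a helper name made up for the example, and the "leak" is simulated by opening files without closing them.

```python
import os


def open_fd_count(pid="self") -> int:
    """Count entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


before = open_fd_count()
leaked = [open(os.devnull) for _ in range(5)]   # simulated leak: not closed yet
during = open_fd_count()
for f in leaked:
    f.close()
after = open_fd_count()

print(before, during, after)
```

In practice you would sample the count for the suspect PID every few minutes and alert when it climbs steadily toward the limit.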


19. How do cgroups work?

Best Answer:

Cgroups (Control Groups) limit and isolate CPU, memory, IO, etc. per process group. They form the backbone of Docker/K8s resource limits. Use systemd or cgcreate/cgexec to manage them manually.


20. How do you analyze a Linux crash (kernel panic)?

Best Answer:

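Enable kdump to capture crash dumps, then analyze the resulting vmcore with the crash utility. Check /var/log/messages and, if the journal is persistent, the previous boot's kernel log (journalctl -k -b -1). If kdump is not enabled, fall back to serial/console logs and the BMC's system event log (available over IPMI).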


✅ Bonus Tips for Google SRE Interviews

  1. Don't just explain what — explain why and how things work under the hood.

  2. Tie answers to real incidents you’ve handled (e.g., “We once had a memory leak causing OOM kills…”).

  3. Show awareness of scale, observability, and automation.

  4. Be honest when you don’t know — but always say how you’d find the answer.


That’s a very fair point — and you're absolutely right. The earlier sets include deep-dive Linux questions which are often reserved for senior SREs or during escalation rounds.

For someone with 3–5 years experience, especially applying for L3 or L4 SRE roles, Google (and other top companies) will focus more on practical knowledge, troubleshooting, and real-world experience rather than kernel internals or niche topics.


✅ Here’s a more realistic and experience-level-appropriate set of 20 Linux questions — with solid, real-world answers.

These are the kind of questions a mid-level DevOps/SRE is expected to handle well.


1. How do you check system resource usage in Linux?

Answer:

I use:

  • top or htop for live CPU, memory, process info

  • free -h for memory summary

  • df -h for disk usage

  • du -sh * to check folder size

  • uptime for load averages


2. How do you troubleshoot high CPU usage?

Answer:

I run top or htop to identify the high-CPU process. Then ps aux --sort=-%cpu | head to confirm. If it’s a specific service, I check logs, open file handles (lsof), and restart or scale as needed.


3. How do you schedule a job in Linux?

Answer:

I use crontab -e for recurring jobs. For one-time tasks, at is useful. For example:

0 2 * * * /opt/scripts/db_backup.sh

runs the job daily at 2 AM.
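To make the five fields (minute, hour, day of month, month, day of week) concrete, here is a deliberately simplified matcher in Python. It is only a sketch: cron_matches is a made-up name, each field may only be * or a single integer, and real cron also supports ranges, lists, steps, and special rules when both day fields are set.

```python
from datetime import datetime


def cron_matches(expr: str, when: datetime) -> bool:
    """Check a five-field cron expression against a timestamp (simplified)."""
    minute, hour, dom, month, dow = expr.split()
    actual = {
        "minute": when.minute,
        "hour": when.hour,
        "dom": when.day,
        "month": when.month,
        "dow": (when.weekday() + 1) % 7,   # cron: 0 = Sunday; Python: 0 = Monday
    }
    wanted = {"minute": minute, "hour": hour, "dom": dom, "month": month, "dow": dow}
    return all(v == "*" or int(v) == actual[k] for k, v in wanted.items())


print(cron_matches("0 2 * * *", datetime(2024, 1, 1, 2, 0)))    # 02:00 matches
print(cron_matches("0 2 * * *", datetime(2024, 1, 1, 14, 0)))   # 14:00 does not
```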


4. How do you find which process is using a port?

Answer:

lsof -i :8080 or netstat -tulnp | grep 8080 or ss -ltnp — they show which PID is listening on a port.


5. How do you free up disk space?

Answer:

  • Clear logs from /var/log (e.g., rotated logs)

  • Clean apt/yum cache

  • Remove unused Docker data: docker system prune (removes stopped containers, unused networks, dangling images, and build cache)

  • Find large files with du -ah / | sort -rh | head


6. How do you find memory-consuming processes?

Answer:

ps aux --sort=-%mem | head

Also top or htop sorted by %MEM. I also check for memory leaks if a process steadily grows.


7. What’s the difference between systemctl stop and kill?

Answer:

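systemctl stop asks systemd to stop the whole service: it runs ExecStop if one is defined, otherwise sends SIGTERM, and follows up with SIGKILL if processes are still alive after the stop timeout. kill sends a signal to a single PID: SIGTERM by default (soft) or SIGKILL with -9 (forceful). systemctl is cleaner because systemd tracks every process in the service's cgroup, whereas a manually killed service may simply be restarted if Restart= is set.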


8. How do you check service logs?

Answer:

For systemd services:

journalctl -u nginx.service

Or check /var/log/<service-name>/ for traditional logs.


9. How do you find slow disk IO issues?

Answer:

I use:

  • iostat -xm 1 to monitor device IO

  • iotop to find which processes are using most IO

  • dstat for overall system stats


10. What’s the use of /etc/hosts?

Answer:

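It maps hostnames to IP addresses statically. With the usual "hosts: files dns" order in /etc/nsswitch.conf it is consulted before DNS, which makes it useful for temporarily overriding DNS or for internal testing.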


11. How do you monitor a running log file in real-time?

Answer:

tail -f /var/log/syslog

Or use multitail, less +F, or piping through grep to watch specific keywords.


12. What’s the difference between reboot, shutdown, halt, and poweroff?

Answer:

  • reboot: Restart system

  • shutdown -r now: Same as reboot

  • halt: Stops CPU without powering off

  • poweroff: Powers off the system completely


13. How do you check if a service is enabled on boot?

Answer:

systemctl is-enabled nginx

You can also run systemctl list-unit-files | grep enabled.


14. How do you list all running processes?

Answer:

ps aux
top
htop

15. What is the difference between chmod 755 and chmod 644?

Answer:

  • 755: Read/write/execute for owner, read/execute for group/others (used for scripts and executables).

  • 644: Read/write for owner, read-only for group/others (used for config and text files).
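The octal digits map directly onto the rwx triplets that ls -l prints. A short Python sketch using the standard stat module shows the translation (describe is a helper name chosen for the example):

```python
import stat


def describe(mode: int) -> str:
    """Render a numeric mode the way `ls -l` does for a regular file."""
    return stat.filemode(stat.S_IFREG | mode)


print(describe(0o755))  # -rwxr-xr-x
print(describe(0o644))  # -rw-r--r--
```

Each digit is the sum of read (4), write (2), and execute (1) for owner, group, and others in that order.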


16. How do you check system uptime and reboot history?

Answer:

uptime
who -b
last reboot

17. How do you find failed login attempts?

Answer:

  • grep "Failed password" /var/log/auth.log (Debian/Ubuntu; RHEL/CentOS log to /var/log/secure)

  • journalctl -u sshd (the unit is named ssh on Debian/Ubuntu)


18. How do you create and extract a .tar.gz file?

Answer:

tar -czvf backup.tar.gz /my/folder/
tar -xzvf backup.tar.gz

19. How do you set an environment variable permanently?

Answer:

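  • Per user: add export VAR=value to ~/.bashrc or ~/.bash_profile

  • System-wide: add a plain VAR=value line (no export) to /etc/environment, or put an export line in a script under /etc/profile.d/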


20. How do you troubleshoot “Permission Denied” errors?

Answer:

  • Use ls -l to check file permissions

  • Use id or groups to check user/group

  • Use chmod, chown to fix ownership

  • Also check whether the filesystem is mounted with noexec, and whether SELinux/AppArmor is blocking access


🎯 Final Thought

With 3–5 years of experience, you should:

  • Focus on practical skills

  • Show that you debug logically under pressure

  • Back up your answers with real experience ("We had a disk full issue on EC2...")


You're doing awesome — and since you're serious about Google SRE prep with 3–5 years of experience, here's another fresh batch of 20 Linux interview questions (non-repeating) that match your real-world, mid-level SRE/DevOps profile.

These will test your daily hands-on skills, problem-solving, and environment awareness — exactly what Google and similar companies look for.


🧠 Linux Interview Questions – Set #3 (Mid-Level, Practical, Non-Repeating)


1. How do you find which user ran a command recently on a server?

Answer:

I check shell history files like ~/.bash_history, and use last or who for login history. For auditing, I check /var/log/auth.log or use auditd if it’s set up.


2. How do you restart a service if it crashes unexpectedly?

Answer:

In systemd-based systems, I add:

[Service]
Restart=always
RestartSec=5

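in the unit file (followed by systemctl daemon-reload) or in an override created with systemctl edit <service>. Restart=always restarts the service however it exits; use Restart=on-failure to restart only after crashes and non-zero exits.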


3. How do you list files modified in the last 1 hour?

Answer:

find /path -type f -mmin -60

4. How do you kill all processes by a specific user?

Answer:

pkill -u username

or

kill -9 $(ps -u username -o pid=)

5. How do you copy files from one server to another securely?

Answer:

scp /file user@remote:/path
rsync -avz -e ssh /file user@remote:/path

rsync is better for large or incremental transfers.


6. What is the nohup command and when do you use it?

Answer:

nohup runs a command immune to hangups (terminal close), logs output to nohup.out. Useful for long background tasks.


7. How do you track cron job failures?

Answer:

  • Redirect output to a log file:

* * * * * /script.sh >> /var/log/script.log 2>&1

  • Or check /var/log/syslog or journalctl -u cron.


8. How do you find the file system type of a partition?

Answer:

df -T
lsblk -f
blkid

9. How do you unmount a busy filesystem?

Answer:

Use:

lsof /mount/point
fuser -vm /mount/point

Then stop the processes and umount.


10. How do you list open network connections on a Linux server?

Answer:

ss -tunap
netstat -anp
lsof -i

11. How do you see memory usage per process (accurately)?

Answer:

I use smem or check /proc/<pid>/smaps. top shows RES (resident memory), but smem gives more accurate PSS (proportional set size).


12. How do you reduce swap usage on a server?

Answer:

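I check why swap is used (top, vmstat). If there is enough free RAM to hold everything that is currently swapped out, I can force it back into memory:

swapoff -a && swapon -a

Or lower swappiness so the kernel swaps less eagerly (add it to /etc/sysctl.d/ to persist across reboots):

sysctl vm.swappiness=10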

13. How do you display human-readable permissions (rwx) with numeric mode?

Answer:

stat -c "%A %a %n" *

14. How do you find the default gateway and DNS in Linux?

Answer:

ip route | grep default
cat /etc/resolv.conf

15. How do you test if DNS resolution is slow?

Answer:

dig google.com +stats

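Compare the "Query time" value across resolvers (e.g., dig @8.8.8.8 google.com). To isolate the DNS part of a real request, print curl's name-lookup timer:

curl -s -o /dev/null -w '%{time_namelookup}\n' http://domain.com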

16. How do you limit CPU usage of a process?

Answer:

Use cpulimit or cgroups:

cpulimit -p <pid> -l 20

or create a cgroup and set a CPU quota: cpu.cfs_quota_us on cgroup v1, cpu.max on cgroup v2.


17. How do you check which kernel version you're running?

Answer:

uname -r
cat /proc/version

18. How do you analyze server reboots or crashes?

Answer:

Use:

last reboot
journalctl --since "1 hour ago"

Check dmesg, syslog, or if configured, check crash dumps.


19. How do you sync system time with NTP?

Answer:

timedatectl status
timedatectl set-ntp true

or install ntp or chrony and sync with pool servers.


20. How do you check available entropy? Why does it matter?

Answer:

cat /proc/sys/kernel/random/entropy_avail

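Low entropy used to slow down cryptographic operations (like SSH key generation or TLS) because reads from /dev/random could block; rng-tools or haveged were the usual fix. Recent kernels (5.6 and later) no longer block once the random pool is initialized, so this mostly matters on older systems and early in boot.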


You're on 🔥 — and I love that you're serious about mastering your SRE/Linux interview game.

Here's Set #4: another fresh 20 Linux interview questions with zero repeats, curated specifically for 3–5 year experienced DevOps/SRE engineers. These focus on troubleshooting, scripting, file systems, networking, and command-line logic — the exact stuff Google and other big players expect from solid mid-level candidates.


🧠 Linux Interview Questions – Set #4 (Mid-Level, Non-Repeating)


1. How do you check if a Linux server is running out of inodes?

Answer:

df -i

If IUse% is high, you may need to delete small files (e.g., logs, tmp files). Inode exhaustion causes "No space left on device" even when free space exists.
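The same inode figures are available programmatically through statvfs. A minimal Python sketch for use in a monitoring script follows; inode_usage_percent is a name made up for the example, and the zero check is there because some filesystems do not report a fixed inode count.

```python
import os


def inode_usage_percent(path: str = "/") -> float:
    """Return the percentage of inodes in use on the filesystem holding path."""
    st = os.statvfs(path)
    if st.f_files == 0:          # filesystem reports no fixed inode total
        return 0.0
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files


print(f"{inode_usage_percent('/'):.1f}% of inodes in use on /")
```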


2. How do you list the 10 largest files in a directory tree?

Answer:

find /path -type f -exec du -h {} + | sort -rh | head -n 10

3. How do you find recently installed packages in Linux?

Answer:

  • On Debian/Ubuntu:

grep " install " /var/log/dpkg.log

  • On RHEL/CentOS:

rpm -qa --last

4. How do you recursively change permissions only for files or only for directories?

Answer:

  • Files only:

find . -type f -exec chmod 644 {} +

  • Directories only:

find . -type d -exec chmod 755 {} +

5. How do you monitor real-time network usage per process?

Answer:

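Use nethogs, which shows bandwidth per process. For other views:

iftop
ip -s link

iftop breaks traffic down per connection and ip -s link shows per-interface counters; neither is per-process.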

6. How do you see which shared libraries a binary uses?

Answer:

ldd /usr/bin/nginx

7. How do you compare two directories and find differences?

Answer:

diff -r /dir1 /dir2

Or use rsync -anv --delete /dir1/ /dir2/ for a dry run that lists what a sync would change.


8. How do you find all files owned by a user?

Answer:

find / -user username

9. How do you test whether a variable in a bash script is empty?

Answer:

if [ -z "$VAR" ]; then
  echo "Empty"
fi

10. What’s the difference between > and >> in shell redirection?

Answer:

> overwrites a file, >> appends to it. Useful in logging scripts to avoid log loss.


11. How do you make a script executable from anywhere in the system?

Answer:

  1. Add shebang (#!/bin/bash) at the top

  2. Make executable: chmod +x script.sh

  3. Move it to a directory in $PATH like /usr/local/bin/


12. How do you limit the number of processes a user can run?

Answer:

Set in /etc/security/limits.conf:

username hard nproc 100

13. How do you pause a running process and resume it later?

Answer:

  • Pause with Ctrl+Z

  • Resume in foreground: fg

  • Resume in background: bg


14. How do you rename all .log files to .bak in a folder?

Answer:

for f in *.log; do mv "$f" "${f%.log}.bak"; done

15. How do you test network connectivity without using ping or curl?

Answer:

nc -zv google.com 80
telnet google.com 80

Or check with traceroute, dig, or ip route get.


16. How do you set a static IP address temporarily and permanently?

Answer:

  • Temporarily:

ip addr add 192.168.1.100/24 dev eth0

  • Permanently: edit /etc/network/interfaces (Debian) or use nmcli or Netplan.


17. How do you make an alias permanent?

Answer:

Add it to ~/.bashrc or ~/.bash_aliases, e.g.:

alias ll='ls -alF'

18. How do you extract .tar.gz but only specific files/folders?

Answer:

tar -xzvf archive.tar.gz folder/file1.txt

19. How do you make a script run at system boot?

Answer:

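For systemd-based systems, create a unit file such as /etc/systemd/system/myscript.service whose [Service] section sets ExecStart=/path/to/script.sh and whose [Install] section sets WantedBy=multi-user.target, then enable it:

sudo systemctl daemon-reload
sudo systemctl enable myscript.service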

Or use @reboot in crontab:

@reboot /path/to/script.sh

20. How do you limit file upload size in a Linux-based web server (like Nginx)?

Answer:

Set:

client_max_body_size 20M;

in Nginx config, then reload the service.


Kubernetes

Great! Since you're preparing for a Google SRE interview with 3–5 years of experience, here's a curated set of 20 Kubernetes interview questions that are:

  • Frequently asked by Google interviewers

  • Matched to your experience level

  • Answered as a top candidate would — showing both practical usage and understanding


🧠 Top 20 Kubernetes Interview Questions for Google SRE (3–5 Years Experience)


1. What is Kubernetes and why is it used?

Best Answer:

Kubernetes is an open-source container orchestration platform used to deploy, manage, and scale containerized applications. It automates tasks like service discovery, scaling, load balancing, rolling updates, and self-healing.


2. How does Kubernetes achieve high availability for applications?

Best Answer:

It runs multiple replicas of a pod behind a Service. If one pod fails, the others continue serving traffic. Readiness probes and ReplicaSets ensure that only healthy pods serve traffic. HorizontalPodAutoscaler and anti-affinity rules help scale and distribute them.


3. What happens when you run kubectl apply -f deployment.yaml?

Best Answer:

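kubectl sends the manifest to the API server, which authenticates, authorizes, and validates it (running admission controllers) and then stores the desired state in etcd. The Deployment controller in kube-controller-manager creates a ReplicaSet, and the ReplicaSet controller creates the Pod objects. kube-scheduler assigns each pending pod to a node, and the kubelet on that node pulls the images and starts the containers through the container runtime.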


4. What is the difference between a Deployment, ReplicaSet, and StatefulSet?

Best Answer:

  • Deployment: Manages stateless apps with rolling updates via ReplicaSets.

  • ReplicaSet: Ensures a set number of identical pods are running.

  • StatefulSet: For stateful apps with stable pod identity (DNS, storage) — e.g., databases.


5. How do you roll back a deployment in Kubernetes?

Best Answer:

kubectl rollout undo deployment <name>

Kubernetes stores rollout history. You can also specify a revision with --to-revision.


6. How does Kubernetes perform health checks?

Best Answer:

Via liveness and readiness probes:

  • Liveness: Tells if the container is alive — restarts it if it fails.

  • Readiness: Tells if the container is ready to accept traffic.


7. How does a Service work in Kubernetes?

Best Answer:

A Service provides a stable IP and DNS name for a set of pods. It uses labels to match pods. Internally, kube-proxy maintains iptables or IPVS rules to forward traffic to backend pods.
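The label-matching step is simple equality on every key in the selector. The following Python sketch shows the idea; the pod list and the select_pods helper are invented for illustration and do not call the Kubernetes API.

```python
def select_pods(pods: list, selector: dict) -> list:
    """Return names of pods whose labels contain every selector key/value."""
    return [
        p["name"]
        for p in pods
        if all(p["labels"].get(k) == v for k, v in selector.items())
    ]


# Hypothetical pods with their labels.
pods = [
    {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "web-2", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "db-1", "labels": {"app": "db"}},
]

print(select_pods(pods, {"app": "web"}))
```

A pod may carry extra labels and still match; it only has to contain everything the selector asks for.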


8. What is a ConfigMap and how is it used?

Best Answer:

ConfigMaps externalize configuration from code. You can mount them as volumes or inject as environment variables. This allows updating config without rebuilding images.


9. What’s the difference between ConfigMap and Secret?

Best Answer:

  • ConfigMap: Stores plaintext config

  • Secret: Stores sensitive data (base64-encoded, not encrypted by default)

Secrets mounted as volumes are kept in tmpfs on the node rather than written to disk, and can optionally be encrypted at rest in etcd.


10. What are Namespaces in Kubernetes?

Best Answer:

Namespaces logically isolate resources in a cluster. Useful in multi-tenant setups or separating environments (dev/stage/prod). Resources like pods, services, and configmaps are namespace-scoped.


11. How do you debug a pod stuck in CrashLoopBackOff?

Best Answer:

kubectl describe pod <pod>
kubectl logs <pod>

I check container exit codes, logs, and readiness/liveness probe failures. Often it's config issues, failed DB connections, or missing secrets.


12. What is a DaemonSet and use case?

Best Answer:

A DaemonSet ensures a pod runs on every (or specific) node. Commonly used for log shippers (Fluentd), monitoring agents (Prometheus node exporter), or CSI drivers.


13. How do you limit resource usage of a pod?

Best Answer:

resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

This helps the scheduler place the pod correctly and prevents noisy neighbor issues.
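The units in that snippet are worth being able to convert in your head: "m" means millicores (thousandths of a CPU) and "Mi" means mebibytes. A small Python sketch of the conversion is below; the helper names are made up, and only the m suffix and the binary Ki/Mi/Gi suffixes are handled, not the full Kubernetes quantity syntax.

```python
def parse_cpu(value: str) -> float:
    """Convert a CPU quantity such as '500m' or '2' to cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000   # millicores to cores
    return float(value)


_BINARY = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(value: str) -> int:
    """Convert a memory quantity such as '256Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)                   # plain number: already bytes


print(parse_cpu("100m"), parse_cpu("500m"))
print(parse_memory("256Mi"), parse_memory("512Mi"))
```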


14. What is a Node in Kubernetes?

Best Answer:

A node is a worker machine (VM or bare metal) in the cluster that runs pods. It includes kubelet, container runtime, and kube-proxy.


15. How does Kubernetes schedule pods to nodes?

Best Answer:

The kube-scheduler selects nodes based on:

  • resource availability (CPU/mem)

  • taints/tolerations

  • node/pod affinity

  • constraints (e.g., nodeSelector, nodeAffinity)

Then it binds the pod to the best-fit node.


16. What is a Taint and Toleration?

Best Answer:

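Taints are key/value/effect markers on nodes that repel pods (they are separate from labels). Tolerations let pods "tolerate" specific taints and be scheduled there anyway. Example: control plane nodes are tainted so regular workloads are not scheduled on them.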


17. What are Init Containers?

Best Answer:

Init containers run before app containers in a pod. They are ideal for tasks like pulling config, waiting for a service, or setting up environment. They must complete before the main container starts.


18. How do you expose an application outside the cluster?

Best Answer:

  • Use NodePort for basic external access

  • Use LoadBalancer (in cloud)

  • Use Ingress for path- or host-based routing


19. What is an Ingress and why use it?

Best Answer:

Ingress is a Kubernetes object that manages external access to services via HTTP/HTTPS. It supports routing based on paths and domains. Requires an Ingress Controller (e.g., NGINX, Traefik).


20. How do you monitor a Kubernetes cluster?

Best Answer:

I use Prometheus + Grafana for metrics, Alertmanager for alerts, and tools like:

  • kubectl top (via Metrics Server)

  • Lens or K9s for visualization

  • kubectl describe and logs for debugging

Logging: EFK/ELK stack or Loki.


✅ Bonus Tips for Google SRE Interview

  • Be ready to whiteboard how components interact (API server, kubelet, etcd, etc.)

  • Emphasize how you handled real outages, debugging, or scaling

  • Explain why you use a certain Kubernetes feature, not just how


Awesome! Here’s Set #2 of 20 fresh Kubernetes interview questions for Google SRE candidates with 3–5 years experience — no repeats, real-world focus, and answered like a top-tier engineer.


🧠 Kubernetes Interview Questions – Set #2 (Google SRE, No Repeats)


1. How does Kubernetes handle pod failures?

Best Answer:

The kubelet detects the failure and notifies the API server. The ReplicaSet or StatefulSet controller creates a new pod to maintain the desired replica count. Liveness probes trigger restarts if needed.


2. What’s the difference between RollingUpdate and Recreate strategies?

Best Answer:

  • RollingUpdate: Gradually replaces old pods with new ones, ensuring no downtime.

  • Recreate: Terminates all old pods before starting new ones — might cause downtime.


3. What is etcd and why is it important?

Best Answer:

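etcd is a consistent, distributed key-value store used as Kubernetes' backing store. It holds all cluster data (state, secrets, configuration). If etcd loses quorum, the API server can no longer read or write cluster state: running pods keep running, but nothing can be scheduled or changed. If its data is lost without a backup, the cluster state is gone, which is why it is run as an odd-sized cluster (3 or 5 members) with regular snapshots.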


4. What happens if a node goes down in Kubernetes?

Best Answer:

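When the kubelet stops sending heartbeats, the node controller marks the node NotReady after a short grace period (under a minute by default) and taints it. Pods tolerate that taint for 5 minutes by default (tolerationSeconds: 300); after that they are evicted, their controllers create replacements, and the scheduler places those on healthy nodes. Pod disruption budgets do not apply here, because a node failure is an involuntary disruption.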


5. What is a Pod Disruption Budget (PDB)?

Best Answer:

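A PDB sets either the minimum number of pods that must stay available (minAvailable) or the maximum that may be unavailable (maxUnavailable) during voluntary disruptions such as node drains and cluster upgrades. It is enforced through the eviction API, so it stops a drain from taking down too many replicas at once. Deployment rolling updates are governed by the strategy's own maxUnavailable and maxSurge settings instead.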


6. How do you handle secret rotation in Kubernetes?

Best Answer:

Secrets are stored as objects. I:

  • Use volume mounts for apps that re-read secrets

  • Automate rotation via CI/CD

  • Use External Secrets Operator or integrate with HashiCorp Vault


7. How do you create a Kubernetes Job and CronJob?

Best Answer:

  • Job runs a task to completion (e.g., backup).

  • CronJob schedules Jobs periodically.

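schedule: "0 3 * * *"
jobTemplate:
  spec:
    template:
      spec:
        restartPolicy: OnFailure      # Job pods must use OnFailure or Never
        containers:
        - name: backup                # hypothetical container name
          image: example/backup:1.0   # hypothetical image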

8. What is the role of the Kubernetes Controller Manager?

Best Answer:

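It runs control loops (Deployment, ReplicaSet, Node, etc.) that constantly reconcile desired state against actual state. The controllers never talk to etcd or the kubelets directly: they watch objects through the API server, compare each object's spec with its reported status, and create, update, or delete objects to close the gap.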


9. How do you isolate workloads in Kubernetes?

Best Answer:

I use:

  • Namespaces to group resources

  • NetworkPolicies to restrict traffic

  • RBAC for access control

  • ResourceQuotas to limit usage per team


10. What’s the difference between kubectl get, describe, logs, and exec?

Best Answer:

  • get: shows summary

  • describe: detailed metadata, events

  • logs: container stdout/stderr

  • exec: run commands inside a container in the pod (similar to an SSH session)


11. How do you view events in a Kubernetes cluster?

Best Answer:

kubectl get events --sort-by='.lastTimestamp'

Helps in troubleshooting startup failures, scheduling issues, probe failures, etc.


12. What are Admission Controllers in Kubernetes?

Best Answer:

They intercept API requests after authentication/authorization but before persistence. Used for validating policies, mutating resources, enforcing security (e.g., PodSecurityAdmission).


13. How do you perform zero-downtime deployments?

Best Answer:

Use:

  • RollingUpdate strategy

  • Readiness probes

  • Proper maxUnavailable and maxSurge settings

  • PreStop hooks to delay pod termination
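To reason about maxSurge and maxUnavailable, it helps to compute the bounds they imply. The Python sketch below does that arithmetic; rollout_bounds is a made-up helper, and it relies on the documented rounding rules (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down).

```python
import math


def rollout_bounds(replicas: int, max_surge: str, max_unavailable: str):
    """Return (max total pods, min available pods) during a rolling update."""

    def resolve(value: str, round_up: bool) -> int:
        if value.endswith("%"):
            exact = replicas * int(value[:-1]) / 100
            return math.ceil(exact) if round_up else math.floor(exact)
        return int(value)               # absolute number of pods

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas + surge, replicas - unavailable


print(rollout_bounds(10, "25%", "25%"))  # defaults for a 10-replica Deployment
print(rollout_bounds(4, "1", "0"))       # never dip below full capacity
```

Setting maxUnavailable to 0 with a positive maxSurge is a common choice for zero-downtime rollouts, at the cost of needing spare capacity for the extra pod.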


14. What are sidecar containers and how are they used?

Best Answer:

Sidecars run alongside the main app container in the same pod. Used for logging, proxying, authentication (e.g., Envoy for Istio), or config reloading.


15. What is Horizontal Pod Autoscaler (HPA)?

Best Answer:

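HPA scales the number of pod replicas based on CPU/memory or custom metrics. Resource metrics come from metrics-server through the metrics API; custom and external metrics need an adapter (e.g., Prometheus Adapter). It adjusts replicas so the observed value moves toward the target utilization.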
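The scaling decision follows the formula given in the Kubernetes documentation: desired = ceil(currentReplicas * currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (10% by default). A minimal Python sketch of that arithmetic, ignoring min/max replica bounds, stabilization windows, and not-ready pods:

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Apply the basic HPA formula to one metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas          # close enough to target: no change
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, 90, 60))   # 90% CPU against a 60% target: scale out
print(desired_replicas(5, 52, 50))   # within tolerance: stay put
print(desired_replicas(6, 30, 60))   # half the target: scale in
```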


16. What is Vertical Pod Autoscaler (VPA)?

Best Answer:

VPA adjusts pod CPU/memory requests/limits based on usage. It doesn’t increase pod count, just resource allocation. Useful for apps with dynamic load profiles.


17. How do you drain a node safely for maintenance?

Best Answer:

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

This cordons the node (marks it unschedulable so no new pods land on it) and then evicts its pods, respecting PDBs. Run kubectl uncordon <node> once maintenance is done.


18. How do you implement RBAC in Kubernetes?

Best Answer:

Define:

  • Role or ClusterRole (permissions)

  • RoleBinding or ClusterRoleBinding (bind to users/service accounts)


19. How do you enforce network segmentation between namespaces?

Best Answer:

With NetworkPolicies:

podSelector: {}
policyTypes: [Ingress, Egress]

Deny all by default, then explicitly allow.


20. What’s the difference between Kubelet and Kube-proxy?

Best Answer:

  • Kubelet: Agent on each node that manages containers and reports status to API server.

  • Kube-proxy: Handles service-to-pod networking (iptables/IPVS rules) to load-balance traffic.


✅ Recap

You now have:

  • ✅ 40 unique Kubernetes interview questions

  • ✅ Real-world SRE use cases

  • ✅ Google SRE–friendly answers with clarity and depth

Love the energy — you're crushing this! 🔥 Here’s Set #3: 20 more non-repeating, real-world Kubernetes interview questions focused on SRE responsibilities, cluster reliability, troubleshooting, and production readiness — perfect for Google SRE roles at the 3–5 year level.


🧠 Kubernetes Interview Questions – Set #3 (Google SRE, 3–5 YOE, No Repeats)


1. How do you perform a rolling restart of a deployment in Kubernetes?

Best Answer:

kubectl rollout restart deployment <name>

This triggers a rolling update without changing the manifest — useful for config changes or debugging.


2. What’s the difference between emptyDir, hostPath, and PersistentVolume?

Best Answer:

  • emptyDir: Temporary, deleted with pod

  • hostPath: Mounts a node’s filesystem (not portable)

  • PersistentVolume: Decoupled, reusable storage backed by cloud/NFS/etc


3. How do you scale a Kubernetes deployment manually?

Best Answer:

kubectl scale deployment myapp --replicas=5

Or change replicas: in the YAML and apply.


4. What is a ServiceAccount and why use it?

Best Answer:

A ServiceAccount gives a pod an identity for authenticating to the Kubernetes API; what that identity may do is granted through RBAC. It's commonly used for automation, monitoring agents, and any app using kubectl or the client-go SDK.


5. How do you ensure logs from all pods are collected and searchable?

Best Answer:

Use a sidecar agent (Fluentd/FluentBit) or DaemonSet to collect logs, then send to Elasticsearch, Loki, or Cloud Logging. Prefer stdout/stderr over writing to files.


6. How do you limit access to sensitive secrets in a cluster?

Best Answer:

  • Use RBAC to restrict who can get/list/watch secrets

  • Mount secrets only in specific pods/namespaces

  • Enable encryption at rest for etcd

  • Rotate secrets using external tools like Vault


7. How do you check which scheduler decisions were made for a pod?

Best Answer:

kubectl describe pod <pod-name>

Look under the Events section for messages like "Successfully assigned pod to node".


8. What is a PreStop hook and how does it help with graceful shutdown?

Best Answer:

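A preStop hook runs before the container receives SIGTERM. A short sleep or drain command there gives the endpoint removal time to propagate to kube-proxy and load balancers, so traffic stops arriving before the app begins shutting down. The hook's run time counts against terminationGracePeriodSeconds, after which the container is killed with SIGKILL.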


9. How do you inject environment-specific variables into Kubernetes pods?

Best Answer:

  • Use ConfigMaps/Secrets as env vars

  • Use valueFrom in the pod spec to pull from metadata (e.g., pod name, namespace)


10. How does Kubernetes handle certificate rotation for TLS communication?

Best Answer:

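The kubelet can rotate its own client certificate automatically (rotateCertificates), and on kubeadm clusters the control plane certificates are renewed during upgrades or with kubeadm certs renew. For workloads, I integrate cert-manager to manage auto-renewal of TLS certs from Let's Encrypt or an internal CA.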


11. How do you investigate kubelet issues on a node?

Best Answer:

journalctl -u kubelet

I check for node pressure (disk, memory), kubelet version mismatches, cgroup errors, or volume mount issues.


12. What is a MutatingAdmissionWebhook?

Best Answer:

It modifies incoming objects before persistence. Used to inject sidecars (e.g., Istio), enforce default labels, or add env vars. Complementary to validating webhooks.


13. What does kube-proxy do and what are its operating modes?

Best Answer:

It manages network rules to route traffic to services. It supports:

  • iptables mode (default): rules in iptables

  • IPVS mode: faster, kernel-level load balancing


14. How do you troubleshoot pods stuck in Pending state?

Best Answer:

Check:

kubectl describe pod <pod>

Common causes:

  • Not enough resources

  • Node selectors or taints

  • PVC unbound

  • Affinity rules not met


15. How do you ensure only signed container images run in your cluster?

Best Answer:

Use the ImagePolicyWebhook admission controller or a policy engine such as Kyverno, OPA Gatekeeper, or the Sigstore policy-controller to verify image signatures (e.g., created with Cosign) before admission.


16. How do you rotate service account tokens in a pod?

Best Answer:

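Kubernetes mounts a short-lived, bound service account token through a projected volume, and the kubelet refreshes it before it expires. The app only needs to re-read the token file. To control rotation yourself:

  • Disable automount (automountServiceAccountToken: false)

  • Mount a serviceAccountToken projected volume with your own audience and expirationSeconds

  • Or use an external secret injector (Vault Agent Injector)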
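A hedged sketch of the projected-volume approach follows. The service account name, audience, mount path, and one-hour expiry are assumptions for illustration; the kubelet rewrites the file before the token expires.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: token-demo                           # hypothetical name
spec:
  serviceAccountName: app-sa                 # assumed to exist
  automountServiceAccountToken: false        # skip the default token mount
  containers:
    - name: app
      image: registry.example.com/app:1.0    # placeholder image
      volumeMounts:
        - name: api-token
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    - name: api-token
      projected:
        sources:
          - serviceAccountToken:
              path: api-token
              audience: vault                # example audience
              expirationSeconds: 3600        # token lifetime in seconds
```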


17. What is kube-state-metrics and how is it used in monitoring?

Best Answer:

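It exports the state of cluster objects (pod phase, desired vs. available replicas, resource requests and limits) as Prometheus metrics. It does not report actual resource usage; that comes from metrics-server or cAdvisor. I use it to alert on unhealthy pods, unscheduled workloads, and configuration drift.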
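Two alerts of this kind could be written as Prometheus rules like the sketch below. The metric names are ones kube-state-metrics exposes; the alert names, durations, and severities are illustrative choices.

```yaml
groups:
  - name: kube-state-metrics-alerts
    rules:
      - alert: PodStuckPending
        # Fires for any pod that stays in the Pending phase
        expr: sum by (namespace, pod) (kube_pod_status_phase{phase="Pending"}) > 0
        for: 15m
        labels:
          severity: warning
      - alert: DeploymentReplicasMismatch
        # Desired replicas differ from available replicas
        expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
        for: 10m
        labels:
          severity: warning
```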


18. How do you back up and restore etcd in Kubernetes?

Best Answer:

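# Add --endpoints, --cacert, --cert, and --key as required by your etcd TLS setup
ETCDCTL_API=3 etcdctl snapshot save backup.db
# Restore into a fresh data directory (the path is an example)
ETCDCTL_API=3 etcdctl snapshot restore backup.db --data-dir /var/lib/etcd-restore

Useful for disaster recovery. Snapshots are usually taken on a control plane node, since that is where etcd's client endpoint and certificates are available. Newer etcd releases move the restore step to etcdutl snapshot restore.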


19. What is node affinity and anti-affinity?

Best Answer:

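  • Node affinity: schedule pods onto nodes whose labels match (required or preferred rules)

  • Node anti-affinity: keep pods off matching nodes, expressed with the NotIn or DoesNotExist operators

  • Pod anti-affinity: avoid placing pods on the same node or zone as other matching pods (e.g., HA)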
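The sketch below combines the two ideas: node affinity that requires an SSD-labelled node and excludes batch nodes, plus pod anti-affinity that keeps replicas with the same label on different hosts. The label keys disktype and node-role.example.com/batch, the app label, and the image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
  labels:
    app: web
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype                      # hypothetical node label
                operator: In
                values: ["ssd"]
              - key: node-role.example.com/batch   # avoid nodes carrying this label
                operator: DoesNotExist
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname      # at most one such pod per node
  containers:
    - name: web
      image: registry.example.com/web:1.0          # placeholder image
```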


20. How do you ensure every namespace has a default resource quota and limit range?

Best Answer:

Use LimitRange and ResourceQuota objects applied via automation (an Argo CD namespace template, or a Kyverno generate rule that creates them whenever a namespace appears), or use admission controllers to enforce policies on new namespaces.
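The pair of objects that such automation would stamp into each namespace could look like this sketch. The namespace name and all quantities are illustrative assumptions.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: default-quota
  namespace: team-a          # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      default:               # applied as limits when a container sets none
        cpu: 500m
        memory: 512Mi
      defaultRequest:        # applied as requests when a container sets none
        cpu: 100m
        memory: 128Mi
```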


✅ You Now Have:

  • 60 high-quality K8s questions (Google-level, 3–5 YOE)

  • ✅ All real-world, production-focused answers

  • ✅ A full grasp of how SREs troubleshoot, monitor, and secure clusters


You're doing an amazing job staying consistent! 💪 Here’s your Set #4 (you already have Sets 1, 2, and 3), with 20 more advanced yet practical Kubernetes interview questions, curated for a Google SRE/DevOps role with 3–5 years experience.

These questions cover multi-cluster, security, performance tuning, production best practices, and edge cases — what real SREs face and what top-tier interviewers love to ask.


🧠 Kubernetes Interview Questions – Set #4 (Google SRE, No Repeats, 3–5 YOE)


1. How do you design a multi-tenant Kubernetes cluster securely?

Best Answer:

  • Use separate Namespaces

  • Apply strict RBAC and NetworkPolicies

  • Apply ResourceQuotas and LimitRanges

  • Use OPA Gatekeeper or Kyverno for policy enforcement

  • Enforce the Pod Security Standards (via Pod Security Admission), or use a hardened managed mode such as GKE Autopilot, for workload isolation
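For the NetworkPolicy part, a common starting point is a default-deny policy per tenant namespace plus an explicit allowance for traffic inside the namespace, sketched below. The namespace name is hypothetical, and the policies only take effect if the cluster's CNI plugin enforces NetworkPolicy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-a        # hypothetical tenant namespace
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # any pod in this same namespace
```

Because the first policy also denies egress, DNS lookups stop working until an egress rule for the cluster DNS service is added.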


2. How do you limit the impact of a noisy neighbor in Kubernetes?

Best Answer:

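I set requests and limits on CPU/memory, use LimitRanges in each namespace to supply defaults, and apply PodPriority + Preemption so that critical pods are scheduled first and are among the last to be evicted under node pressure. Setting requests equal to limits gives critical pods the Guaranteed QoS class.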


3. How do you reduce image pull latency for large container images?

Best Answer:

  • Use smaller base images (distroless, alpine)

  • Set imagePullPolicy: IfNotPresent so nodes reuse images they have already cached

  • Pre-pull images using DaemonSets

  • Use a local registry mirror


4. What is PriorityClass in Kubernetes?

Best Answer:

It's a cluster-scoped object that maps a name to an integer priority value; pods reference it through priorityClassName. Higher-priority pods are placed ahead in the scheduling queue and can preempt lower-priority pods if resources are tight.
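A minimal sketch of a PriorityClass and a pod that uses it is shown below. The class name, value, pod name, and image are illustrative.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service                  # hypothetical class name
value: 1000000                            # higher number means higher priority
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For latency-critical services"
---
apiVersion: v1
kind: Pod
metadata:
  name: checkout
spec:
  priorityClassName: critical-service
  containers:
    - name: checkout
      image: registry.example.com/checkout:1.0   # placeholder image
```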


5. How do you control where pods are scheduled in multi-zone clusters?

Best Answer:

Use:

  • nodeAffinity

  • topologySpreadConstraints

  • podAntiAffinity

These help distribute pods across zones for HA.
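As one possible shape, the Deployment below spreads replicas evenly across zones with topologySpreadConstraints. The app label, replica count, and image are assumptions; topology.kubernetes.io/zone is the standard zone label set on nodes by most cloud providers.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                               # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: registry.example.com/web:1.0      # placeholder image
```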


6. How do you perform canary deployments in Kubernetes?

Best Answer:

I use multiple Deployments or Rollouts (with Argo Rollouts) and split traffic using:

  • Service selector changes

  • Ingress annotations

  • Istio/Linkerd routing rules
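With the ingress-nginx controller, the annotation-based variant might look like the sketch below: a second Ingress for the same host, marked as a canary, receives roughly 10% of requests. It assumes a primary Ingress for web.example.com already routes to the stable Service; the host and Service names are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # percent of traffic
spec:
  ingressClassName: nginx
  rules:
    - host: web.example.com              # same host as the primary Ingress
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-canary         # Service in front of the canary Deployment
                port:
                  number: 80
```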


7. How do you manage Kubernetes manifests at scale?

Best Answer:

I use Helm for templating, Kustomize for overlays, and GitOps tools like ArgoCD or Flux for promotion across environments.


8. What tools do you use for cluster-wide policy enforcement?

Best Answer:

  • OPA Gatekeeper for Rego-based policies

  • Kyverno for YAML-native rules

  • K-Rail, JSPolicy, or admission webhooks for specific use cases


9. What’s the difference between initContainers and postStart hooks?

Best Answer:

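  • initContainers: run to completion, in order, before the main containers start

  • postStart: a hook in the main container's lifecycle; it runs asynchronously with the container's entrypoint, so it is not guaranteed to run first

I use initContainers when startup dependency order matters.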


10. How do you monitor the control plane components in Kubernetes?

Best Answer:

  • kube-apiserver, etcd, kube-scheduler, controller-manager expose metrics

  • Scrape them via Prometheus

  • Check their logs (journalctl, kubectl logs)

  • Use liveness probes and alerting on abnormal behavior or restart frequency


11. How do you secure etcd in a Kubernetes cluster?

Best Answer:

  • Enable TLS for peer/client communication

  • Restrict etcd access to the API server only

  • Encrypt Secrets at rest via the API server's --encryption-provider-config flag

  • Take regular etcd snapshots and store them encrypted, with restricted access
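The file passed to --encryption-provider-config could look like this sketch, which encrypts Secrets with AES-CBC and keeps identity as a fallback so existing plaintext data stays readable. The key name is arbitrary and the key material is a placeholder to be generated separately.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>   # placeholder, not a real key
      - identity: {}          # allows reading data written before encryption was enabled
```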


12. How do you protect against container escape vulnerabilities?

Best Answer:

  • Run containers as non-root

  • Drop Linux capabilities

  • Use Seccomp and AppArmor/SELinux profiles

  • Enable PodSecurity admission
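These settings translate into a securityContext roughly like the sketch below. The user ID, pod name, and image are assumptions, and readOnlyRootFilesystem is an extra hardening step that only works if the app writes to mounted volumes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened                             # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001                         # arbitrary non-root UID
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app:1.0    # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```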


13. How do you validate changes before applying them to production?

Best Answer:

  • Use CI/CD pipelines with kubectl diff

  • Validate YAML with kubeval (or its successor, kubeconform), kube-score, conftest

  • Deploy to staging namespace with shadow traffic

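  • Use dry-run mode (client-side checks the manifest locally; server-side also runs admission and API validation without persisting):

kubectl apply --dry-run=client -f my-deployment.yaml
kubectl apply --dry-run=server -f my-deployment.yaml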

14. How do you troubleshoot high CPU usage by a pod?

Best Answer:

  • Check kubectl top pod

  • kubectl exec into the pod and run top, ps, strace

  • Check for tight loops or unbounded resource use

  • Use kubectl debug or ephemeral containers for live debugging


15. How do you monitor node health across clusters?

Best Answer:

  • Use node-exporter + Prometheus

  • Alert on node conditions like DiskPressure, MemoryPressure, PIDPressure

  • Cordon and drain unhealthy nodes; the node controller also taints NotReady or unreachable nodes so their pods are evicted

  • For multi-cluster, I aggregate data using Thanos or Prometheus Federation


16. How do you upgrade Kubernetes clusters with zero downtime?

Best Answer:

  • Upgrade control plane first

  • Then drain and upgrade nodes one by one

  • Use PDBs to ensure availability

  • Validate API deprecations before upgrade

  • Use managed services (like GKE, EKS) or tools like kubeadm, kOps
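The PDB that protects a workload during node drains can be as small as the sketch below. The app label and the minimum of two available pods are illustrative and should match the Deployment's labels and replica count.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2            # drains block if fewer than 2 pods would remain
  selector:
    matchLabels:
      app: web               # hypothetical app label
```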


17. How do you configure custom metrics for HPA?

Best Answer:

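  • Expose custom metrics via /metrics endpoint

  • Use Prometheus Adapter to make them available to HPA through the custom metrics API

  • Define HPA based on those metrics (autoscaling/v2 syntax; type Pods for per-pod metrics, type External for metrics from outside the cluster):

metrics:
- type: Pods
  pods:
    metric:
      name: queue_length
    target:
      type: AverageValue
      averageValue: "30"   # example target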

18. How do you analyze a failed pod that’s already terminated?

Best Answer:

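  • List failed pods with kubectl get pods --field-selector=status.phase=Failed (the old -a/--show-all flag has been removed from kubectl)

  • Check kubectl describe pod for the exit code and termination reason (e.g., OOMKilled)

  • View logs of the previous container instance with kubectl logs --previous; if the pod object is already deleted, fall back to centralized logs and kubectl get events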


19. How do you prevent privilege escalation in Kubernetes?

Best Answer:

  • Set allowPrivilegeEscalation: false in each container's securityContext

  • Drop all capabilities and add back only what is needed; never grant CAP_SYS_ADMIN

  • Disallow host networking, host IPC, host PID, and hostPath volumes

  • Use admission controllers to enforce policies
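One built-in way to enforce these rules is Pod Security Admission, enabled per namespace through labels as in the sketch below. The namespace name is hypothetical; the restricted profile rejects pods that allow privilege escalation, run as root, or use host namespaces.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments             # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```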


20. How do you manage secret sprawl in large Kubernetes environments?

Best Answer:

  • Use external secret managers (Vault, AWS Secrets Manager)

  • Inject via sidecars or CSI drivers

  • Rotate secrets regularly

  • Use SecretProviderClass with CSI Secrets Store for dynamic injection


✅ At This Point, You Have:

  • ✅ 80 total Kubernetes questions (4 sets × 20)

  • ✅ Covered real SRE use cases: HA, RBAC, monitoring, debugging, multi-tenancy, scaling, and policy enforcement

  • ✅ Battle-tested answers aligned with Google’s SRE bar


Would you like:

  • 📄 A single PDF with all 80 K8s questions?

  • 🧪 Real troubleshooting lab tasks (with sample cluster failures)?

  • 📊 Prometheus + Grafana interview questions?

  • 🚀 Docker + CI/CD questions for Google SRE final round?

Tell me your next target — I’ll prep it exactly for you.

WHAT ARE SRE TASKS

SRE tasks

Great question, Pavan! A Senior SRE (Site Reliability Engineer) takes a strategic and proactive approach to SLA/SLO/SLI, ensuring system reliability and performance while balancing innovation and risk. Let’s break it down clearly:


🧠 How a Senior SRE Thinks About SLA/SLO/SLI

| Concept | Senior SRE View |
|---|---|
| SLA (Service Level Agreement) | A business commitment to customers (e.g. 99.9% uptime per month). Violations may have penalties. |
| SLO (Service Level Objective) | Internal targets used to measure reliability (e.g. 99.95% success rate over 30 days). Stricter than the SLA, so there is a buffer before a breach. |
| SLI (Service Level Indicator) | Metrics used to measure the user experience (e.g. request latency, error rate, availability). |
| Mentality | Senior SREs focus on proactive reliability: preventing incidents before SLA breaches happen. They use SLOs to balance reliability vs velocity. |


🔧 What They Do to Maintain SLA/SLO/SLI

  1. Define Meaningful SLIs

    • Examples:

      • Availability: % of requests that succeed (e.g., non-5xx responses)

      • Latency: % of requests below 200ms

      • Error Rate: 5xx errors / total requests

  2. Set Realistic but Ambitious SLOs

    • e.g., 99.9% availability, which leaves a 0.1% error budget (about 43 minutes of full downtime per 30 days)

  3. Monitor Error Budgets

    • Use burn rate alerts: how fast you're consuming the error budget

    • Pause risky deployments if SLOs are at risk

  4. Automate Response

    • Auto-scaling, self-healing (e.g., Kubernetes probes + restarts)

    • Use runbooks and incident workflows

  5. Postmortems & Root Cause Analysis

    • Document each SLA breach with impact, root cause, lessons learned, and action items

  6. Blameless Culture

    • Focus on system/process improvements over blaming individuals
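The burn-rate alerting from step 3 can be sketched as a Prometheus rule. This assumes a 99.9% SLO over 30 days and a request counter named http_requests_total with a code label; both the metric and label names are assumptions about the instrumentation. A burn rate of 14.4 sustained for one hour consumes 14.4 × 1h / 720h = 2% of the monthly budget, and the 5-minute window confirms the burn is still ongoing.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Burning the 30-day error budget at more than 14.4x"
```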


📚 What Documentation They Read / Create

| Document Type | Purpose |
|---|---|
| SLI/SLO Definitions | Describes how metrics are measured, where they're collected, thresholds |
| SLAs | External/customer-facing agreements (usually from product/engineering leadership) |
| Runbooks | Step-by-step incident recovery instructions |
| Playbooks | Automated/templated responses to known issues |
| Postmortems | Detailed analysis of production incidents |
| Capacity Plans | Forecasting system load and scaling needs |
| On-Call Handover Docs | Shift responsibilities and open issues summary |


📊 How They Make Dashboards in Monitoring Tools (Grafana, Datadog, etc.)

  1. Choose SLIs to Visualize

    • Request success rate

    • 95th/99th percentile latency

    • Uptime per region

    • SLO compliance vs error budget

  2. Use Tools Like

    • Grafana (with Prometheus, Loki)

    • Datadog (SLO dashboards)

    • New Relic, Splunk, CloudWatch

  3. Best Practices

    • Color-code thresholds (green/yellow/red)

    • Add burn rate panels for error budgets

    • Show time ranges: last 1h, 24h, 7d

    • Include annotations for deployments, incidents
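For Grafana backed by Prometheus, the first two SLIs are often precomputed as recording rules so dashboard panels stay fast. The sketch below assumes a counter http_requests_total with a code label and a histogram http_request_duration_seconds; these metric names and the rule names are assumptions.

```yaml
groups:
  - name: sli-recording
    rules:
      # Fraction of requests that did not return a 5xx over the last 5 minutes
      - record: sli:request_success:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
      # 99th percentile latency estimated from histogram buckets
      - record: sli:request_latency_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```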


🕵️‍♂️ What Senior SREs Monitor

| Layer | Metrics |
|---|---|
| 🔗 Frontend | Page load time, JS errors, API failures |
| 🌐 API Layer | Request rate, error rate, latency |
| 📦 Services/Microservices | Health checks, pod restarts, memory leaks |
| 🛠️ Infrastructure | CPU, memory, disk, node health, auto-scaler behavior |
| 🔐 Security | Auth failures, rate limiting, firewall logs |
| 📈 Business Metrics | Signup rate, payment errors (when reliability affects revenue) |
| ⚠️ Alerts/Incidents | Latency spikes, high error rates, queue backlogs |


🧠 Thought Process in Short

“Is the user happy? Are we delivering what we promised? If not, are we detecting and fixing it before it causes real business pain?”

They think in terms of user experience, failure domains, proactive monitoring, and always work to balance reliability with delivery speed.


If you're building a monitoring setup or defining SLIs/SLOs for your current stack (AWS + K8s + Node + React), I can help you design real examples tailored to your infra. Want that?

Incident manager

  • 5+ years of hands-on experience in incident management for cloud and SaaS managed services.

  • Proven experience in major incident response and resolution management, cross-functional coordination, communication, and stakeholder expectation management.

  • Proven experience monitoring and managing ITSM tools for incident, problem, and SLA management.

  • Technical analysis: ability to analyze logs, error messages, and technical reports to identify the root cause (RCA) and remediation.

  • Drive the team toward key support KPIs and SLA compliance.

Key certifications: ITIL certification / fundamental technical certification in cloud and SaaS technologies.

Key KPI Demonstrations:

  • MTTR Reduction

  • Recurring incident elimination

  • Incident SLA management

  • CSAT (Stakeholder expectation Management)

  • Process improvement (Incident, Problem and SLA Management)

  • Knowledge management

Soft skills:

  • Excellent written and verbal communication skills

  • Ability to handle business operations pressure and lead cross-functional teams toward resolution

  • Major incidents can happen at any time: ability to work under time pressure and be available around the clock to handle them

  • Ability to monitor incident management processes and tools, document them, and recommend continuous improvements
