SRE

Linux QA

Absolutely! Below are 20 commonly asked Linux-related questions in Google SRE interviews, along with answers that reflect what a top candidate would say — someone who understands the why, not just the how.

🧠 Top 20 Google SRE Linux Interview Questions (with Ideal Answers)

1. What happens when you type `google.com` in your browser?

Best Answer:

When you type google.com, the browser first checks its own cache, then the OS resolver consults /etc/hosts and the nameservers listed in /etc/resolv.conf (in the order set by /etc/nsswitch.conf). Once it has the IP, the browser opens a TCP connection (3-way handshake) to port 443, and a TLS handshake secures it. The browser then sends an HTTP GET request, the server responds, and the browser renders the page. Underneath, IP routing chooses the next hop for each packet, and ARP resolves the MAC address of that next hop on the local network.

2. How does `ps` get process information?

Best Answer:

The ps command reads from the /proc virtual filesystem, especially files like /proc/[pid]/stat, /proc/[pid]/cmdline, etc. It also uses /proc/stat and other kernel interfaces to gather CPU and memory stats.
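As a rough illustration of what ps does with those files, here is a minimal Python sketch that parses one /proc/[pid]/stat line. The function name parse_stat is made up for this example, and only the first few fields are extracted; it is not how ps itself is implemented.

```python
from pathlib import Path


def parse_stat(line: str) -> dict:
    """Parse one /proc/<pid>/stat line into a few named fields."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')' characters, so split on the LAST closing parenthesis.
    left, right = line.rsplit(")", 1)
    pid, comm = left.split(" (", 1)
    rest = right.split()
    return {
        "pid": int(pid),
        "comm": comm,
        "state": rest[0],      # R, S, D, Z, T, ...
        "ppid": int(rest[1]),
    }


# Linux only: inspect the current process if /proc is available.
stat_path = Path("/proc/self/stat")
if stat_path.exists():
    print(parse_stat(stat_path.read_text()))
```

Splitting on the last parenthesis matters because a process can rename itself to something containing spaces or brackets, which breaks a naive split on whitespace.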

3. Explain `top` command output.

Best Answer:

top shows real-time system stats. Key fields:

%CPU: CPU usage by process
%MEM: Memory usage
LOAD AVG: 1/5/15-minute averages of tasks that are runnable or in uninterruptible sleep (usually waiting on IO)
NI: Process niceness (priority)

It uses /proc to collect this data and refreshes every few seconds.

4. Difference between a process and a thread?

Best Answer:

A process is an independent execution unit with its own memory space. A thread is a lightweight subunit of a process that shares memory. Threads are faster to create and switch, but processes offer better isolation.

5. How do you check if a port is open and listening?

Best Answer:

sudo lsof -i :8080
netstat -tuln | grep 8080
ss -tuln | grep 8080

Or use nc -zv localhost 8080 to probe a port. ss is preferred in modern Linux systems.

6. What’s the difference between hard and soft links?

Best Answer:

Hard link: Same inode, file remains even if the original is deleted.

Soft (symbolic) link: Points to pathname. If the target is deleted, the link breaks.

7. What is a zombie process?

Best Answer:

A zombie is a dead process whose parent hasn't read its exit status via wait(). It appears in ps with <defunct>. It doesn’t use CPU/memory but wastes a PID. Proper parent processes reap zombies to avoid PID exhaustion.

8. How does Linux manage memory?

Best Answer:

Linux uses virtual memory, with pages mapped to physical RAM. Unused pages may be swapped. The kernel uses page cache to cache disk IO. Tools like free, vmstat, and /proc/meminfo show memory usage.

9. What is the OOM killer?

Best Answer:

The Out of Memory (OOM) killer is triggered when memory is exhausted. It chooses a process to kill based on a badness score, freeing memory. It’s a last resort to avoid system crash.

10. How do you find the top memory-consuming processes?

Best Answer:

ps aux --sort=-%mem | head
top
htop

Or check /proc/[pid]/status for VmRSS, VmSize.

11. What does `nice` and `renice` do?

Best Answer:

nice sets a process's initial niceness (priority), ranging from -20 (high priority) to 19 (low). renice changes the priority of a running process.

12. How does the kernel schedule processes?

Best Answer:

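Linux has long used the Completely Fair Scheduler (CFS) for normal tasks. It tracks each task's "virtual runtime", weighted by niceness, so that every task gets a fair share of CPU time, and it keeps runnable tasks in a red-black tree so it can pick the one with the lowest virtual runtime. Kernels from 6.6 onward replace CFS with EEVDF, which keeps the same fairness goal but also accounts for latency.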
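The "lowest virtual runtime runs next" idea can be sketched in a few lines of Python. This is a toy model, not kernel code: it uses a binary heap instead of a red-black tree, a fixed time slice, and a made-up weight of 512 for the lower-priority task (1024 is the weight the kernel gives nice 0).

```python
import heapq

NICE_0_WEIGHT = 1024  # weight the kernel assigns to a nice-0 task


def simulate(weights: dict[str, int], ticks: int, slice_ns: int = 1_000_000) -> dict[str, int]:
    """Return how many slices each task received after `ticks` scheduling picks."""
    # Heap entries are (vruntime, name); the smallest vruntime runs next.
    heap = [(0.0, name) for name in sorted(weights)]
    heapq.heapify(heap)
    ran = {name: 0 for name in weights}
    for _ in range(ticks):
        vruntime, name = heapq.heappop(heap)
        ran[name] += 1
        # Heavier (higher-priority) tasks accumulate virtual runtime more slowly,
        # so they reach the front of the queue more often.
        vruntime += slice_ns * NICE_0_WEIGHT / weights[name]
        heapq.heappush(heap, (vruntime, name))
    return ran


# Hypothetical tasks: "a" at nice-0 weight, "b" at half that weight.
print(simulate({"a": 1024, "b": 512}, ticks=300))
```

With these weights, task "a" ends up with twice as many slices as "b", which is the proportional sharing the answer describes.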

13. Explain runlevels or systemd targets.

Best Answer:

Traditional runlevels (e.g., 3 for multi-user, 5 for GUI) are replaced by systemd targets (like multi-user.target, graphical.target). Use systemctl get-default to view the default target, systemctl set-default to change it, and systemctl isolate to switch targets immediately.

14. What happens during Linux boot process?

Best Answer:

BIOS/UEFI initializes hardware
Bootloader (GRUB) loads the kernel
Kernel initializes drivers, mounts root FS
init or systemd runs
System targets/services are started

15. How do you monitor disk IO?

Best Answer:

Use iostat, iotop, or vmstat 1. iostat shows per-device read/write stats. iotop shows live disk IO per process.

16. How can you find out which process is using a file or port?

Best Answer:

lsof /path/to/file
lsof -i :80
fuser /path/to/file

These map file descriptors or ports to processes.

17. What’s the difference between `/dev/null` and `/dev/zero`?

Best Answer:

/dev/null discards anything written to it (bit bucket). /dev/zero outputs null bytes (\0) when read. It’s used for allocating zeroed memory.

18. How do you check for file descriptor leaks?

Best Answer:

ls /proc/<pid>/fd | wc -l
lsof -p <pid>

Compare open FDs over time to detect leaks. ulimit -n shows FD limit.
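The "compare over time" step could be automated along these lines. This is only a sketch: the helper names and the growth threshold are assumptions for illustration, and count_open_fds works only on Linux, where /proc/<pid>/fd exists.

```python
import os


def count_open_fds(pid: str = "self") -> int:
    """Count entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def looks_like_leak(samples: list[int], min_growth: int = 10) -> bool:
    """Heuristic: the FD count never drops and grows by at least `min_growth`."""
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_drops and samples[-1] - samples[0] >= min_growth
```

In practice you would sample count_open_fds for the target PID every few minutes and feed the list to looks_like_leak; a healthy process usually shows a count that rises and falls with load rather than climbing steadily.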

19. How do cgroups work?

Best Answer:

Cgroups (Control Groups) limit and isolate CPU, memory, IO, etc. per process group. They form the backbone of Docker/K8s resource limits. Use systemd or cgcreate/cgexec to manage them manually.

20. How do you analyze a Linux crash (kernel panic)?

Best Answer:

Enable kdump to capture crash dumps. Analyze the resulting vmcore with the crash utility. Check /var/log/messages and dmesg logs. If kdump is not enabled, fall back to serial or BMC console logs and the hardware event log available over IPMI.

✅ Bonus Tips for Google SRE Interviews

Don't just explain what — explain why and how things work under the hood.
Tie answers to real incidents you’ve handled (e.g., “We once had a memory leak causing OOM kills…”).
Show awareness of scale, observability, and automation.
Be honest when you don’t know — but always say how you’d find the answer.

That’s a very fair point — and you're absolutely right. The earlier sets include deep-dive Linux questions which are often reserved for senior SREs or during escalation rounds.

For someone with 3–5 years experience, especially applying for L3 or L4 SRE roles, Google (and other top companies) will focus more on practical knowledge, troubleshooting, and real-world experience rather than kernel internals or niche topics.

✅ Here’s a more realistic and experience-level-appropriate set of 20 Linux questions — with solid, real-world answers.

These are the kind of questions a mid-level DevOps/SRE is expected to handle well.

1. How do you check system resource usage in Linux?

Answer:

I use:
top or htop for live CPU, memory, process info
free -h for memory summary
df -h for disk usage
du -sh * to check folder size
uptime for load averages

2. How do you troubleshoot high CPU usage?

Answer:

I run top or htop to identify the high-CPU process. Then ps aux --sort=-%cpu | head to confirm. If it’s a specific service, I check logs, open file handles (lsof), and restart or scale as needed.

3. How do you schedule a job in Linux?

Answer:

I use crontab -e for recurring jobs. For one-time tasks, at is useful. For example:

0 2 * * * /opt/scripts/db_backup.sh

runs the job daily at 2 AM.
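To make the five fields concrete, here is a small Python sketch that checks whether a timestamp matches a cron expression. It is deliberately simplified and is not how cron itself is implemented: it supports only "*", single numbers, and comma lists, and it requires every field to match (real cron treats day-of-month and day-of-week as alternatives when both are restricted).

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Match one cron field; supports only '*', 'N', and 'N,M' lists."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expr: str, when: datetime) -> bool:
    """True if `when` matches a 5-field expression: minute hour dom month dow."""
    minute, hour, dom, month, dow = expr.split()
    # cron counts Sunday as 0; datetime.weekday() counts Monday as 0.
    cron_dow = (when.weekday() + 1) % 7
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(dom, when.day)
        and field_matches(month, when.month)
        and field_matches(dow, cron_dow)
    )


print(cron_matches("0 2 * * *", datetime(2024, 1, 15, 2, 0)))
```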

4. How do you find which process is using a port?

Answer:

lsof -i :8080 or netstat -tulnp | grep 8080 or ss -ltnp — they show which PID is listening on a port.

5. How do you free up disk space?

Answer:

Clear logs from /var/log (e.g., rotated logs)
Clean apt/yum cache
Remove unused Docker data: docker system prune (stopped containers, unused networks, dangling images, build cache)
Find large files with du -ah / | sort -rh | head

6. How do you find memory-consuming processes?

Answer:

ps aux --sort=-%mem | head

Also top or htop sorted by %MEM. I also check for memory leaks if a process steadily grows.

7. What’s the difference between `systemctl stop` and `kill`?

Answer:

systemctl stop gracefully stops the service via systemd. kill sends a signal to the process — can be SIGTERM (soft) or SIGKILL (forceful). systemctl is cleaner as it handles dependencies.

8. How do you check service logs?

Answer:

For systemd services:

journalctl -u nginx.service

Or check /var/log/<service-name>/ for traditional logs.

9. How do you find slow disk IO issues?

Answer:

I use:

iostat -xm 1 to monitor device IO
iotop to find which processes are using most IO
dstat for overall system stats

10. What’s the use of `/etc/hosts`?

Answer:

It maps hostnames to IP addresses and, with the usual hosts order in /etc/nsswitch.conf, is consulted before DNS. Useful for overriding DNS temporarily or internal testing.

11. How do you monitor a running log file in real-time?

Answer:

tail -f /var/log/syslog

Or use multitail, less +F, or piping through grep to watch specific keywords.

12. What’s the difference between reboot, shutdown, halt, and poweroff?

Answer:

reboot: Restart system
shutdown -r now: Same as reboot
halt: Stops CPU without powering off
poweroff: Powers off the system completely

13. How do you check if a service is enabled on boot?

Answer:

systemctl is-enabled nginx

You can also run systemctl list-unit-files | grep enabled.

14. How do you list all running processes?

Answer:

ps aux
top
htop

15. What is the difference between `chmod 755` and `chmod 644`?

Answer:

755: Read/write/execute for owner, read/execute for group/others (used for scripts and executables).
644: Read/write for owner, read-only for group/others (used for config and text files).
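One quick way to see how the octal digits map to rwx bits is Python's standard stat module. This is just an illustrative check; the helper name describe is made up for this example.

```python
import stat


def describe(mode: int) -> str:
    """Render an octal permission mode the way `ls -l` shows it."""
    # S_IFREG marks a regular file, so filemode() prints a leading '-'.
    return stat.filemode(stat.S_IFREG | mode)


print(describe(0o755))
print(describe(0o644))
```

Each octal digit is the sum of read (4), write (2), and execute (1) for owner, group, and others in turn.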

16. How do you check system uptime and reboot history?

Answer:

uptime
who -b
last reboot

17. How do you check failed SSH login attempts?

Answer:

grep "Failed password" /var/log/auth.log
journalctl -u sshd

18. How do you create and extract a `.tar.gz` file?

Answer:

tar -czvf backup.tar.gz /my/folder/
tar -xzvf backup.tar.gz

19. How do you set an environment variable permanently?

Answer:

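For a single user, add export VAR=value to ~/.bashrc or ~/.bash_profile.

For all users, add a plain VAR=value line to /etc/environment (it is not a shell script, so no export), or put an export line in a file under /etc/profile.d/.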

20. How do you troubleshoot “Permission Denied” errors?

Answer:

Use ls -l to check file permissions
Use id or groups to check user/group
Use chmod or chown to fix permissions or ownership
Also check whether the filesystem is mounted noexec or whether SELinux/AppArmor is blocking access

🎯 Final Thought

With 3–5 years of experience, you should:

Focus on practical skills
Show that you debug logically under pressure
Back up your answers with real experience ("We had a disk full issue on EC2...")

You're doing awesome — and since you're serious about Google SRE prep with 3–5 years of experience, here's another fresh batch of 20 Linux interview questions (non-repeating) that match your real-world, mid-level SRE/DevOps profile.

These will test your daily hands-on skills, problem-solving, and environment awareness — exactly what Google and similar companies look for.

🧠 Linux Interview Questions – Set #3 (Mid-Level, Practical, Non-Repeating)

1. How do you find which user ran a command recently on a server?

Answer:

I check shell history files like ~/.bash_history, and use last or who for login history. For auditing, I check /var/log/auth.log or use auditd if it’s set up.

2. How do you restart a service if it crashes unexpectedly?

Answer:

In systemd-based systems, I add:

[Service]
Restart=always
RestartSec=5

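in the unit file or a drop-in override (systemctl edit <service>). Restart=always restarts the service however it exits; use Restart=on-failure to restart only after crashes or non-zero exits.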

3. How do you list files modified in the last 1 hour?

Answer:

find /path -type f -mmin -60

4. How do you kill all processes by a specific user?

Answer:

pkill -u username

or

kill -9 $(ps -u username -o pid=)

5. How do you copy files from one server to another securely?

Answer:

scp /file user@remote:/path
rsync -avz -e ssh /file user@remote:/path

rsync is better for large or incremental transfers.

6. What is the `nohup` command and when do you use it?

Answer:

nohup runs a command immune to hangups (terminal close), logs output to nohup.out. Useful for long background tasks.

7. How do you track cron job failures?

Answer:

Redirect output to a log file:

* * * * * /script.sh >> /var/log/script.log 2>&1

Or check /var/log/syslog or journalctl -u cron.

8. How do you find the file system type of a partition?

Answer:

df -T
lsblk -f
blkid

9. How do you unmount a busy filesystem?

Answer:

Use:

lsof /mount/point
fuser -vm /mount/point

Then stop the processes and umount.

10. How do you list open network connections on a Linux server?

Answer:

ss -tunap
netstat -anp
lsof -i

11. How do you see memory usage per process (accurately)?

Answer:

I use smem or check /proc/<pid>/smaps. top shows RES (resident memory), but smem gives more accurate PSS (proportional set size).
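The difference between RSS and PSS is easiest to see with a little arithmetic. The sketch below assumes a simplified model in which each mapping is described by its resident size and the number of processes sharing it; the numbers are invented for illustration.

```python
def rss_kib(mappings: list[tuple[int, int]]) -> int:
    """RSS counts every resident page in full, shared or not."""
    return sum(resident for resident, _ in mappings)


def pss_kib(mappings: list[tuple[int, int]]) -> float:
    """PSS divides each shared mapping by the number of processes sharing it."""
    return sum(resident / sharers for resident, sharers in mappings)


# Hypothetical process: 10 MiB private heap plus a 30 MiB library shared by 3 processes.
mappings = [(10 * 1024, 1), (30 * 1024, 3)]
print(rss_kib(mappings), pss_kib(mappings))
```

Summing RSS across all three processes would count the shared library three times, while summing PSS counts it once, which is why PSS is the better figure for capacity planning.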

12. How do you reduce swap usage on a server?

Answer:

I check why swap is used (top, vmstat). Then, if there is enough free RAM to hold the swapped pages, I cycle swap to move them back into memory:

swapoff -a && swapon -a

Or tune swappiness:

sysctl vm.swappiness=10

13. How do you display human-readable permissions (rwx) with numeric mode?

Answer:

stat -c "%A %a %n" *

14. How do you find the default gateway and DNS in Linux?

Answer:

ip route | grep default
cat /etc/resolv.conf

15. How do you test if DNS resolution is slow?

Answer:

dig google.com +stats

Compare timing. Also try:

time curl http://domain.com

16. How do you limit CPU usage of a process?

Answer:

Use cpulimit or cgroups:

cpulimit -p <pid> -l 20

or create a cgroup and set a CPU quota (cpu.cfs_quota_us on cgroup v1, cpu.max on cgroup v2).

17. How do you check which kernel version you're running?

Answer:

uname -r
cat /proc/version

18. How do you analyze server reboots or crashes?

Answer:

Use:

last reboot
journalctl --since "1 hour ago"

Check dmesg, syslog, or if configured, check crash dumps.

19. How do you sync system time with NTP?

Answer:

timedatectl status
timedatectl set-ntp true

or install ntp or chrony and sync with pool servers.

20. How do you check available entropy? Why does it matter?

Answer:

cat /proc/sys/kernel/random/entropy_avail

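Low entropy can slow down cryptographic operations (like SSH, SSL) on older kernels, where reads from /dev/random can block. Fix using rng-tools or haveged. Recent kernels no longer block once the random pool is initialized, so this is mostly a concern on legacy systems.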

You're on 🔥 — and I love that you're serious about mastering your SRE/Linux interview game.

Here's Set #4: another fresh 20 Linux interview questions with zero repeats, curated specifically for 3–5 year experienced DevOps/SRE engineers. These focus on troubleshooting, scripting, file systems, networking, and command-line logic — the exact stuff Google and other big players expect from solid mid-level candidates.

🧠 Linux Interview Questions – Set #4 (Mid-Level, Non-Repeating)

1. How do you check if a Linux server is running out of inodes?

Answer:

df -i

If IUse% is high, you may need to delete large numbers of small files (e.g., logs, tmp files). Inode exhaustion causes "No space left on device" even when free space exists.
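The same figure that df -i reports can be read programmatically, which is handy for a monitoring check. This sketch assumes a Unix system, where os.statvfs is available; the helper names are made up for this example.

```python
import os


def percent_used(total_inodes: int, free_inodes: int) -> float:
    """Share of inodes in use, as a percentage."""
    if total_inodes == 0:
        # Some filesystems do not report a fixed inode count.
        return 0.0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def inode_usage(path: str = "/") -> float:
    """Read inode totals for the filesystem holding `path` (Unix only)."""
    st = os.statvfs(path)
    return percent_used(st.f_files, st.f_ffree)
```

Alerting on this value alongside disk space catches the case where millions of tiny files fill a filesystem that still shows free blocks.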

2. How do you list the 10 largest files in a directory tree?

Answer:

find /path -type f -exec du -h {} + | sort -rh | head -n 10

3. How do you find recently installed packages in Linux?

Answer:

On Debian/Ubuntu:

grep " install " /var/log/dpkg.log

On RHEL/CentOS:

rpm -qa --last

4. How do you recursively change permissions only for files or only for directories?

Answer:

Files only:

find . -type f -exec chmod 644 {} +

Directories only:

find . -type d -exec chmod 755 {} +

5. How do you monitor real-time network usage per process?

Answer:

Use nethogs for per-process bandwidth. For per-connection or per-interface views, use:

iftop
ip -s link

6. How do you see which shared libraries a binary uses?

Answer:

ldd /usr/bin/nginx

7. How do you compare two directories and find differences?

Answer:

diff -r /dir1 /dir2

Or use rsync -an --delete /dir1/ /dir2/ for a dry run of syncing changes.

8. How do you find all files owned by a user?

Answer:

find / -user username

9. How do you test whether a variable in a bash script is empty?

Answer:

if [ -z "$VAR" ]; then
  echo "Empty"
fi

10. What’s the difference between `>` and `>>` in shell redirection?

Answer:

> overwrites a file, >> appends to it. Useful in logging scripts to avoid log loss.

11. How do you make a script executable from anywhere in the system?

Answer:

Add shebang (#!/bin/bash) at the top
Make executable: chmod +x script.sh
Move it to a directory in $PATH like /usr/local/bin/

12. How do you limit the number of processes a user can run?

Answer:

Set in /etc/security/limits.conf:

username hard nproc 100

13. How do you pause a running process and resume it later?

Answer:

Pause with Ctrl+Z
Resume in foreground: fg
Resume in background: bg

14. How do you rename all `.log` files to `.bak` in a folder?

Answer:

for f in *.log; do mv "$f" "${f%.log}.bak"; done
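For comparison, the same rename can be done from Python, which is sometimes easier to extend with logging or dry-run logic. This is a sketch; rename_logs is a hypothetical helper name, and like the shell loop it silently overwrites an existing .bak file on Linux.

```python
from pathlib import Path


def rename_logs(directory: Path) -> list[Path]:
    """Rename every *.log file in `directory` to *.bak and return the new paths."""
    renamed = []
    for path in sorted(directory.glob("*.log")):
        target = path.with_suffix(".bak")  # same stem, new extension
        path.rename(target)
        renamed.append(target)
    return renamed
```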

15. How do you test network connectivity without using `ping` or `curl`?

Answer:

nc -zv google.com 80
telnet google.com 80

Or check with traceroute, dig, or ip route get.

16. How do you set a static IP address temporarily and permanently?

Answer:

Temporarily:

ip addr add 192.168.1.100/24 dev eth0

Permanently: edit /etc/network/interfaces (Debian) or use nmcli or Netplan.

17. How do you make an alias permanent?

Answer:

Add it to ~/.bashrc or ~/.bash_aliases, e.g.:

alias ll='ls -alF'

18. How do you extract `.tar.gz` but only specific files/folders?

Answer:

tar -xzvf archive.tar.gz folder/file1.txt

19. How do you make a script run at system boot?

Answer:

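For systemd-based systems, create a unit file such as /etc/systemd/system/myscript.service:

[Unit]
Description=Run myscript at boot

[Service]
Type=oneshot
ExecStart=/path/to/script.sh

[Install]
WantedBy=multi-user.target

Then reload systemd and enable the unit:

sudo systemctl daemon-reload
sudo systemctl enable myscript.service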

Or use @reboot in crontab:

@reboot /path/to/script.sh

20. How do you limit file upload size in a Linux-based web server (like Nginx)?

Answer:

Set:

client_max_body_size 20M;

in Nginx config, then reload the service.

Kubernetes

Great! Since you're preparing for a Google SRE interview with 3–5 years of experience, here's a curated set of 20 Kubernetes interview questions that are:

Frequently asked by Google interviewers
Matched to your experience level
Answered as a top candidate would — showing both practical usage and understanding

🧠 Top 20 Kubernetes Interview Questions for Google SRE (3–5 Years Experience)

1. What is Kubernetes and why is it used?

Best Answer:

Kubernetes is an open-source container orchestration platform used to deploy, manage, and scale containerized applications. It automates tasks like service discovery, scaling, load balancing, rolling updates, and self-healing.

2. How does Kubernetes achieve high availability for applications?

Best Answer:

It runs multiple replicas of a pod behind a Service. If one pod fails, the others continue serving traffic. Readiness probes and ReplicaSets ensure that only healthy pods serve traffic. HorizontalPodAutoscaler and anti-affinity rules help scale and distribute them.

3. What happens when you run `kubectl apply -f deployment.yaml`?

Best Answer:

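kubectl sends the manifest to the API server, which authenticates and validates it and stores the desired state in etcd. The Deployment controller (in kube-controller-manager) creates a ReplicaSet, and the ReplicaSet controller creates the Pod objects. kube-scheduler assigns each pending pod to a node, and the kubelet on that node starts the containers through the container runtime.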

4. What is the difference between a Deployment, ReplicaSet, and StatefulSet?

Best Answer:

Deployment: Manages stateless apps with rolling updates via ReplicaSets.
ReplicaSet: Ensures a set number of identical pods are running.
StatefulSet: For stateful apps with stable pod identity (DNS, storage) — e.g., databases.

5. How do you roll back a deployment in Kubernetes?

Best Answer:

kubectl rollout undo deployment <name>

Kubernetes stores rollout history. You can also specify a revision with --to-revision.

6. How does Kubernetes perform health checks?

Best Answer:

Via liveness and readiness probes:

Liveness: Tells if the container is alive — restarts it if it fails.
Readiness: Tells if the container is ready to accept traffic.

7. How does a Service work in Kubernetes?

Best Answer:

A Service provides a stable IP and DNS name for a set of pods. It uses labels to match pods. Internally, kube-proxy maintains iptables or IPVS rules to forward traffic to backend pods.
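Label matching is the core of how a Service finds its backends. The sketch below models it in Python under simplifying assumptions: only equality-based selectors, and pods represented as plain dictionaries with made-up fields. It is not Kubernetes code.

```python
def selects(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """Equality-based selector: every selector pair must be present on the pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def ready_endpoints(selector: dict[str, str], pods: list[dict]) -> list[str]:
    """IPs of ready pods whose labels satisfy the selector."""
    return [
        pod["ip"]
        for pod in pods
        if pod["ready"] and selects(selector, pod["labels"])
    ]


# Hypothetical pods for illustration.
pods = [
    {"ip": "10.0.0.1", "ready": True, "labels": {"app": "web", "tier": "frontend"}},
    {"ip": "10.0.0.2", "ready": False, "labels": {"app": "web"}},
    {"ip": "10.0.0.3", "ready": True, "labels": {"app": "db"}},
]
print(ready_endpoints({"app": "web"}, pods))
```

Extra labels on a pod do not prevent a match; only the keys named in the selector are compared, and pods that fail readiness are left out of the backend list.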

8. What is a ConfigMap and how is it used?

Best Answer:

ConfigMaps externalize configuration from code. You can mount them as volumes or inject as environment variables. This allows updating config without rebuilding images.

9. What’s the difference between ConfigMap and Secret?

Best Answer:

ConfigMap: Stores plaintext config
Secret: Stores sensitive data (base64 encoded, not encrypted by default)

Secrets can be locked down separately with RBAC, are delivered to pods through tmpfs-backed volumes, and can optionally be encrypted at rest.

10. What are Namespaces in Kubernetes?

Best Answer:

Namespaces logically isolate resources in a cluster. Useful in multi-tenant setups or separating environments (dev/stage/prod). Resources like pods, services, and configmaps are namespace-scoped.

11. How do you debug a pod stuck in `CrashLoopBackOff`?

Best Answer:

kubectl describe pod <pod>
kubectl logs <pod>

I check container exit codes, logs, and readiness/liveness probe failures. Often it's config issues, failed DB connections, or missing secrets.

12. What is a DaemonSet and use case?

Best Answer:

A DaemonSet ensures a pod runs on every (or specific) node. Commonly used for log shippers (Fluentd), monitoring agents (Prometheus node exporter), or CSI drivers.

13. How do you limit resource usage of a pod?

Best Answer:

resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

This helps the scheduler place the pod correctly and prevents noisy neighbor issues.
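The units in that snippet trip people up, so here is a small Python sketch that converts them. It handles only the forms shown above (millicores, plain cores, and the Ki/Mi/Gi binary suffixes); the real Kubernetes quantity format accepts more suffixes than this.

```python
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_to_millicores(quantity: str) -> int:
    """'100m' -> 100, '0.5' -> 500, '2' -> 2000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """'256Mi' -> 268435456; a bare number is taken as bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


print(cpu_to_millicores("500m"), memory_to_bytes("512Mi"))
```

So a request of 100m is a tenth of one CPU core, and 256Mi is 256 × 2^20 bytes rather than 256 million bytes.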

14. What is a Node in Kubernetes?

Best Answer:

A node is a worker machine (VM or bare metal) in the cluster that runs pods. It includes kubelet, container runtime, and kube-proxy.

15. How does Kubernetes schedule pods to nodes?

Best Answer:

The kube-scheduler selects nodes based on:

resource availability (CPU/mem)
taints/tolerations
node/pod affinity
constraints (e.g., nodeSelector, nodeAffinity)

It then binds the pod to the best-fit node.

16. What is a Taint and Toleration?

Best Answer:

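Taints are key/value/effect markers on nodes that repel pods (they are separate from labels). Tolerations let a pod "tolerate" matching taints so it can still be scheduled there. Example: control plane nodes are tainted to keep regular workloads off them.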
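The matching rules can be sketched in Python as follows. This is a simplified model with taints and tolerations as plain dictionaries, not scheduler code; it covers the Equal and Exists operators and treats PreferNoSchedule as a soft preference that never blocks placement.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores the value; an empty key with Exists matches any taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def schedulable(taints: list[dict], tolerations: list[dict]) -> bool:
    """A pod fits if every hard taint on the node is tolerated."""
    hard = [t for t in taints if t["effect"] != "PreferNoSchedule"]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)
```

A toleration only permits scheduling onto a tainted node; it does not attract the pod there, which is why taints are usually paired with node affinity for dedicated node pools.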

17. What are Init Containers?

Best Answer:

Init containers run before app containers in a pod. They are ideal for tasks like pulling config, waiting for a service, or setting up environment. They must complete before the main container starts.

18. How do you expose an application outside the cluster?

Best Answer:

Use NodePort for basic external access

Use LoadBalancer (in cloud)
Use Ingress for path- or host-based routing

19. What is an Ingress and why use it?

Best Answer:

Ingress is a Kubernetes object that manages external access to services via HTTP/HTTPS. It supports routing based on paths and domains. Requires an Ingress Controller (e.g., NGINX, Traefik).

20. How do you monitor a Kubernetes cluster?

Best Answer:

I use Prometheus + Grafana for metrics, Alertmanager for alerts, and tools like:

kubectl top (via Metrics Server)
Lens or K9s for visualization
kubectl describe and logs for debugging

Logging: EFK/ELK stack or Loki.

✅ Bonus Tips for Google SRE Interview

Be ready to whiteboard how components interact (API server, kubelet, etcd, etc.)
Emphasize how you handled real outages, debugging, or scaling
Explain why you use a certain Kubernetes feature, not just how

Awesome! Here’s Set #2 of 20 fresh Kubernetes interview questions for Google SRE candidates with 3–5 years experience — no repeats, real-world focus, and answered like a top-tier engineer.

🧠 Kubernetes Interview Questions – Set #2 (Google SRE, No Repeats)

1. How does Kubernetes handle pod failures?

Best Answer:

The kubelet detects the failure and notifies the API server. The ReplicaSet or StatefulSet controller creates a new pod to maintain the desired replica count. Liveness probes trigger restarts if needed.

2. What’s the difference between RollingUpdate and Recreate strategies?

Best Answer:

RollingUpdate: Gradually replaces old pods with new ones, ensuring no downtime.
Recreate: Terminates all old pods before starting new ones — might cause downtime.

3. What is etcd and why is it important?

Best Answer:

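etcd is a consistent, distributed key-value store used as Kubernetes' backing store. It holds all cluster data (state, secrets, configuration). If etcd is unavailable, the API server cannot read or persist changes, although already-running pods keep running; if its data is lost without a backup, the cluster state is gone.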

4. What happens if a node goes down in Kubernetes?

Best Answer:

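The node controller marks the node NotReady once it misses heartbeats for the node monitor grace period (well under a minute by default) and taints it. Pods are evicted after their toleration for that taint expires (300 seconds by default), and their controllers create replacements that the scheduler places on healthy nodes. PodDisruptionBudgets do not apply here, because a node failure is an involuntary disruption.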

5. What is a Pod Disruption Budget (PDB)?

Best Answer:

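A PDB sets the minimum number of pods that must stay available (minAvailable), or the maximum that may be unavailable (maxUnavailable), during voluntary disruptions such as node drains and cluster upgrades. It prevents a drain from taking down every replica at once; a Deployment's own rolling update is governed by maxUnavailable and maxSurge instead.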

6. How do you handle secret rotation in Kubernetes?

Best Answer:

Secrets are stored as objects. I:

Use volume mounts for apps that re-read secrets
Automate rotation via CI/CD
Use External Secrets Operator or integrate with HashiCorp Vault

7. How do you create a Kubernetes Job and CronJob?

Best Answer:

Job runs a task to completion (e.g., backup).

CronJob schedules Jobs periodically.

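schedule: "0 3 * * *"              # daily at 03:00
jobTemplate:
  spec:
    template:
      spec:
        restartPolicy: OnFailure   # a Job's pod template needs OnFailure or Never
        containers:
        - name: backup             # placeholder name
          image: backup-image:tag  # placeholder image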

8. What is the role of the Kubernetes Controller Manager?

Best Answer:

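It runs control loops (Deployment, ReplicaSet, Node, etc.) that constantly reconcile desired state with actual state. Both are read through the API server: desired state from object specs stored in etcd, actual state from the status that kubelets and other components report.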

9. How do you isolate workloads in Kubernetes?

Best Answer:

I use:

Namespaces to group resources
NetworkPolicies to restrict traffic
RBAC for access control
ResourceQuotas to limit usage per team

10. What’s the difference between `kubectl get`, `describe`, `logs`, and `exec`?

Best Answer:

get: shows summary

describe: detailed metadata, events
logs: container stdout/stderr
exec: run commands inside the pod (like SSH)

11. How do you view events in a Kubernetes cluster?

Best Answer:

kubectl get events --sort-by='.lastTimestamp'

Helps in troubleshooting startup failures, scheduling issues, probe failures, etc.

12. What are Admission Controllers in Kubernetes?

Best Answer:

They intercept API requests after authentication/authorization but before persistence. Used for validating policies, mutating resources, enforcing security (e.g., PodSecurityAdmission).

13. How do you perform zero-downtime deployments?

Best Answer:

Use:

RollingUpdate strategy
Readiness probes
Proper maxUnavailable and maxSurge settings
PreStop hooks to delay pod termination
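The effect of maxSurge and maxUnavailable is easier to reason about with numbers. The sketch below assumes both are given as percentages and applies the documented rounding (maxSurge rounds up, maxUnavailable rounds down); the function name is made up for this example.

```python
import math


def rollout_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int) -> dict:
    """Pod-count limits Kubernetes enforces during a RollingUpdate."""
    surge = math.ceil(replicas * max_surge_pct / 100)               # rounds up
    unavailable = math.floor(replicas * max_unavailable_pct / 100)  # rounds down
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10, 25, 25))
```

With 10 replicas and the default 25%/25%, the rollout may run up to 13 pods at once and must keep at least 8 available; setting maxUnavailable to 0 forces every new pod to become ready before an old one is removed.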

14. What are sidecar containers and how are they used?

Best Answer:

Sidecars run alongside the main app container in the same pod. Used for logging, proxying, authentication (e.g., Envoy for Istio), or config reloading.

15. What is Horizontal Pod Autoscaler (HPA)?

Best Answer:

HPA scales pods based on CPU/memory or custom metrics. It queries metrics-server and adjusts replicas using a target utilization percentage.
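The core scaling rule from the Kubernetes documentation is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A minimal Python version, leaving out the tolerance band, min/max replica bounds, and stabilization windows that the real controller also applies:

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core HPA formula, without tolerance, bounds, or stabilization."""
    return math.ceil(current_replicas * current_metric / target_metric)


# 4 pods averaging 90% CPU against a 60% target scale out to 6.
print(desired_replicas(4, 90, 60))
```

Because the result is rounded up, the autoscaler errs on the side of slightly more capacity rather than less.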

16. What is Vertical Pod Autoscaler (VPA)?

Best Answer:

VPA adjusts pod CPU/memory requests/limits based on usage. It doesn’t increase pod count, just resource allocation. Useful for apps with dynamic load profiles.

17. How do you drain a node safely for maintenance?

Best Answer:

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

This evicts pods, respecting PDBs. drain cordons the node first, so no new pods are scheduled onto it; kubectl uncordon <node> returns it to service afterwards.

18. How do you implement RBAC in Kubernetes?

Best Answer:

Define:

Role or ClusterRole (permissions)
RoleBinding or ClusterRoleBinding (bind to users/service accounts)

19. How do you enforce network segmentation between namespaces?

Best Answer:

With NetworkPolicies:

podSelector: {}
policyTypes: [Ingress, Egress]

Deny all by default, then explicitly allow.

20. What’s the difference between Kubelet and Kube-proxy?

Best Answer:

Kubelet: Agent on each node that manages containers and reports status to API server.

Kube-proxy: Handles service-to-pod networking (iptables/IPVS rules) to load-balance traffic.

✅ Recap

You now have:

✅ 40 unique Kubernetes interview questions
✅ Real-world SRE use cases
✅ Google SRE–friendly answers with clarity and depth

Love the energy — you're crushing this! 🔥 Here’s Set #3: 20 more non-repeating, real-world Kubernetes interview questions focused on SRE responsibilities, cluster reliability, troubleshooting, and production readiness — perfect for Google SRE roles at the 3–5 year level.

🧠 Kubernetes Interview Questions – Set #3 (Google SRE, 3–5 YOE, No Repeats)

1. How do you perform a rolling restart of a deployment in Kubernetes?

Best Answer:

kubectl rollout restart deployment <name>

This triggers a rolling update without changing the manifest — useful for config changes or debugging.

2. What’s the difference between `emptyDir`, `hostPath`, and `PersistentVolume`?

Best Answer:

emptyDir: Temporary, deleted with pod
hostPath: Mounts a node’s filesystem (not portable)
PersistentVolume: Decoupled, reusable storage backed by cloud/NFS/etc

3. How do you scale a Kubernetes deployment manually?

Best Answer:

kubectl scale deployment myapp --replicas=5

Or change replicas: in the YAML and apply.

4. What is a ServiceAccount and why use it?

Best Answer:

A ServiceAccount allows a pod to access the Kubernetes API securely. It's commonly used for automation, monitoring agents, and any app using kubectl or client-go SDK.

5. How do you ensure logs from all pods are collected and searchable?

Best Answer:

Use a sidecar agent (Fluentd/FluentBit) or DaemonSet to collect logs, then send to Elasticsearch, Loki, or Cloud Logging. Prefer stdout/stderr over writing to files.

6. How do you limit access to sensitive secrets in a cluster?

Best Answer:

Use RBAC to restrict who can get/list/watch secrets
Mount secrets only in specific pods/namespaces
Enable encryption at rest for etcd
Rotate secrets using external tools like Vault

7. How do you check which scheduler decisions were made for a pod?

Best Answer:

kubectl describe pod <pod-name>

Look under the Events section for messages like "Successfully assigned pod to node".

8. What is a PreStop hook and how does it help with graceful shutdown?

Best Answer:

A PreStop hook runs before the container receives SIGTERM, giving the app time to clean up. A short delay in the hook, within terminationGracePeriodSeconds, lets the pod be removed from Service endpoints before the process exits, so in-flight requests are not cut off.

9. How do you inject environment-specific variables into Kubernetes pods?

Best Answer:

Use ConfigMaps/Secrets as env vars
Use valueFrom with fieldRef (the Downward API) in the pod spec to pull from metadata (e.g., pod name, namespace)

10. How does Kubernetes handle certificate rotation for TLS communication?

Best Answer:

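The kubelet can rotate its own client certificate automatically, and on kubeadm clusters the control plane certificates are renewed during upgrades or on demand with kubeadm certs renew. For workloads, I integrate cert-manager to manage auto-renewal of TLS certs from Let’s Encrypt or an internal CA.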

11. How do you investigate kubelet issues on a node?

Best Answer:

journalctl -u kubelet

I check for node pressure (disk, memory), kubelet version mismatches, cgroup errors, or volume mount issues.

12. What is a MutatingAdmissionWebhook?

Best Answer:

It modifies incoming objects before persistence. Used to inject sidecars (e.g., Istio), enforce default labels, or add env vars. Complementary to validating webhooks.

13. What does `kube-proxy` do and what are its operating modes?

Best Answer:

It manages network rules to route traffic to services. It supports:

iptables mode (default): rules in iptables
IPVS mode: faster, kernel-level load balancing

14. How do you troubleshoot pods stuck in `Pending` state?

Best Answer:

Check:

kubectl describe pod <pod>

Common causes:

Not enough resources
Node selectors or taints
PVC unbound
Affinity rules not met

15. How do you ensure only signed container images run in your cluster?

Best Answer:

Use the ImagePolicyWebhook admission controller or tools like Kyverno, OPA Gatekeeper, or Cosign with Sigstore to enforce signature verification before admission.

16. How do you rotate service account tokens in a pod?

Best Answer:

Kubernetes automatically mounts and refreshes tokens. To rotate manually:

Disable automount
Use external secret injector (Vault Agent Injector)
Refresh token via projected volumes

17. What is `kube-state-metrics` and how is it used in monitoring?

Best Answer:

It exports cluster object state (pod status, replica counts, resource requests and limits) as Prometheus metrics; actual CPU and memory usage comes from metrics-server or cAdvisor instead. I use it to alert on unhealthy pods, unscheduled workloads, and configuration drift.

18. How do you back up and restore etcd in Kubernetes?

Best Answer:

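ETCDCTL_API=3 etcdctl snapshot save backup.db
ETCDCTL_API=3 etcdctl snapshot restore backup.db --data-dir <new-data-dir>

Useful for disaster recovery. On a TLS-secured cluster, snapshot save also needs --endpoints, --cacert, --cert, and --key, so it is normally run from a control plane node that has the etcd client certificates.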

19. What is node affinity and anti-affinity?

Best Answer:

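Node affinity: schedule pods onto nodes whose labels match (a NotIn or DoesNotExist operator gives the node anti-affinity effect)

Pod anti-affinity: avoid placing pods on the same node or zone as other matching pods (e.g., spreading replicas for HA)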

20. How do you ensure every namespace has a default resource quota and limit range?

Best Answer:

Use LimitRange and ResourceQuota objects applied via automation (ArgoCD, Kyverno), or admission controllers to enforce policies on new namespaces.

✅ You Now Have:

✅ 60 high-quality K8s questions (Google-level, 3–5 YOE)
✅ All real-world, production-focused answers
✅ A full grasp of how SREs troubleshoot, monitor, and secure clusters

You're doing an amazing job staying consistent! 💪 Here’s your Set #4 (you already have Sets 1, 2, and 3), with 20 more advanced yet practical Kubernetes interview questions, curated for a Google SRE/DevOps role with 3–5 years experience.

These questions cover multi-cluster, security, performance tuning, production best practices, and edge cases — what real SREs face and what top-tier interviewers love to ask.

🧠 Kubernetes Interview Questions – Set #4 (Google SRE, No Repeats, 3–5 YOE)

1. How do you design a multi-tenant Kubernetes cluster securely?

Best Answer:

Use separate Namespaces
Apply strict RBAC and NetworkPolicies
Apply ResourceQuotas and LimitRanges
Use OPA Gatekeeper or Kyverno for policy enforcement
Prefer workload isolation using Pod Security Standards (enforced by Pod Security Admission) or a managed option such as GKE Autopilot

2. How do you limit the impact of a noisy neighbor in Kubernetes?

Best Answer:

I set requests and limits on CPU/memory, apply LimitRanges in each namespace for defaults, and use PriorityClasses with preemption so critical pods are scheduled first and are among the last evicted when a node runs short of resources.
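How requests and limits feed into eviction order can be illustrated through the QoS class Kubernetes derives from them. This is a minimal sketch of the documented rules, not the kubelet's implementation; the function name `qos_class` and the dict layout are assumptions made for illustration.

```python
def qos_class(containers):
    """Approximate a pod's QoS class from per-container requests and limits.

    `containers` is a list of dicts shaped like
    {"requests": {"cpu": "500m"}, "limits": {"cpu": "500m"}}
    (a hypothetical layout chosen for this sketch). Quantities are compared
    as plain strings, so "500m" and "0.5" would not be treated as equal.
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"
```

Under node pressure, pods whose usage exceeds their requests are evicted first, so giving critical workloads requests equal to limits (Guaranteed) makes them less likely to be the victim of a noisy neighbor.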

3. How do you reduce image pull latency for large container images?

Best Answer:

Use smaller base images (distroless, alpine)
Enable imagePullPolicy: IfNotPresent
Pre-pull images using DaemonSets
Use a local registry mirror

4. What is PriorityClass in Kubernetes?

Best Answer:

It's an object that assigns priority to pods. Higher-priority pods are scheduled first and can preempt lower-priority pods if resources are tight.

5. How do you control where pods are scheduled in multi-zone clusters?

Best Answer:

Use:

nodeAffinity
topologySpreadConstraints
podAntiAffinity

These help distribute pods across zones for HA.
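The idea behind topologySpreadConstraints can be sketched as a skew calculation: the scheduler compares how many matching pods each zone holds and rejects placements that would push the difference past maxSkew. The helper names below (`zone_skew`, `pick_zone`) are hypothetical, and the sketch ignores taints, capacity, and every other scheduling filter.

```python
from collections import Counter


def zone_skew(pod_zones, all_zones):
    """Return the difference between the fullest and emptiest zone.

    `pod_zones` lists the zone of each running pod; `all_zones` lists every
    eligible zone, including zones that currently hold no pods.
    """
    counts = Counter(pod_zones)
    per_zone = [counts.get(zone, 0) for zone in all_zones]
    return max(per_zone) - min(per_zone)


def pick_zone(pod_zones, all_zones):
    """Pick the zone with the fewest pods (ties broken alphabetically)."""
    counts = Counter(pod_zones)
    return min(all_zones, key=lambda zone: (counts.get(zone, 0), zone))
```

For example, with pods in zones `["a", "a", "b"]` and eligible zones `["a", "b", "c"]`, the skew is 2 and the next pod would land in `c`.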

6. How do you perform canary deployments in Kubernetes?

Best Answer:

I use multiple Deployments or Rollouts (with Argo Rollouts) and split traffic using:

Service selector changes
Ingress annotations
Istio/Linkerd routing rules
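When the split is done with a shared Service selector rather than a mesh, traffic is spread roughly evenly across all ready pods, so the canary share is approximately the canary's fraction of total replicas. A minimal sketch of that arithmetic, assuming even per-pod load balancing and using hypothetical helper names:

```python
def canary_replicas(stable_replicas, canary_percent):
    """Return the canary replica count needed for roughly `canary_percent` of traffic.

    Solves canary / (canary + stable) >= percent / 100 with integer maths,
    rounding up and never returning fewer than one replica.
    """
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be an integer between 1 and 99")
    remainder = 100 - canary_percent
    needed = (stable_replicas * canary_percent + remainder - 1) // remainder
    return max(1, needed)


def canary_share(stable_replicas, canary_replicas_count):
    """Return the fraction of pods (and so, approximately, traffic) on the canary."""
    return canary_replicas_count / (stable_replicas + canary_replicas_count)
```

Replica-based splitting is coarse: with 9 stable pods the smallest possible canary share is 10%. Finer splits need Ingress weights or mesh routing rules.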

7. How do you manage Kubernetes manifests at scale?

Best Answer:

I use Helm for templating, Kustomize for overlays, and GitOps tools like ArgoCD or Flux for promotion across environments.

8. What tools do you use for cluster-wide policy enforcement?

Best Answer:

OPA Gatekeeper for Rego-based policies
Kyverno for YAML-native rules
K-Rail, JSPolicy, or admission webhooks for specific use cases

9. What’s the difference between `initContainers` and `postStart` hooks?

Best Answer:

initContainers: run before main container starts
postStart: hook in the main container lifecycle; it runs asynchronously with the container's entrypoint, so ordering is not guaranteed
I use initContainers when startup dependency order matters.

10. How do you monitor the control plane components in Kubernetes?

Best Answer:

kube-apiserver, etcd, kube-scheduler, controller-manager expose metrics
Scrape them via Prometheus
Check their logs (journalctl, kubectl logs)
Use liveness probes and alerting on abnormal behavior or restart frequency

11. How do you secure etcd in a Kubernetes cluster?

Best Answer:

Enable TLS for peer/client communication
Restrict etcd access to the API server only
Encrypt Secrets at rest in etcd (API server flag --encryption-provider-config)
Use etcd snapshots for secure backups

12. How do you protect against container escape vulnerabilities?

Best Answer:

Run containers as non-root
Drop Linux capabilities
Use Seccomp and AppArmor/SELinux profiles
Enable PodSecurity admission

13. How do you validate changes before applying them to production?

Best Answer:

Use CI/CD pipelines with kubectl diff
Validate YAML with kubeval, kube-score, conftest
Deploy to staging namespace with shadow traffic
Use dry-run mode:

kubectl apply --dry-run=client -f my-deployment.yaml

14. How do you troubleshoot high CPU usage by a pod?

Best Answer:

Check kubectl top pod
kubectl exec into the pod and run top, ps, strace
Check for tight loops or unbounded resource use
Use kubectl debug or ephemeral containers for live debugging

15. How do you monitor node health across clusters?

Best Answer:

Use node-exporter + Prometheus
Alert on node conditions like DiskPressure, MemoryPressure, PIDPressure
Use taints (or cordon and drain) to keep workloads off unhealthy nodes
For multi-cluster, I aggregate data using Thanos or Prometheus Federation

16. How do you upgrade Kubernetes clusters with zero downtime?

Best Answer:

Upgrade control plane first
Then drain and upgrade nodes one by one
Use PDBs to ensure availability
Validate API deprecations before upgrade
Use managed services (like GKE, EKS) or tools like kubeadm, kOps
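The role of PDBs during a node drain can be sketched with the disruption budget arithmetic: an eviction is allowed only while the number of healthy pods stays at or above the PDB's minAvailable. This is a simplified sketch with a hypothetical function name; the real controller also handles maxUnavailable and percentage values.

```python
def disruptions_allowed(current_healthy, min_available):
    """Return how many pods may be evicted right now under a minAvailable PDB."""
    return max(0, current_healthy - min_available)
```

With 3 healthy replicas and minAvailable: 2, one pod can be evicted at a time; the drain then waits until a replacement becomes healthy elsewhere before evicting the next.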

17. How do you configure custom metrics for HPA?

Best Answer:

Expose custom metrics via /metrics endpoint
Use Prometheus Adapter to make them available to HPA
Define HPA based on those metrics:

metrics:
- type: External
  external:
    metric:
      name: queue_length
    target:
      type: AverageValue
      averageValue: "30"  # placeholder threshold, tune per workload
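How the HPA turns a metric reading into a replica count can be sketched with the scaling formula from the Kubernetes documentation: desired = ceil(current replicas × current value / target value), skipped when the ratio is within a tolerance (0.1 by default). The function name and example numbers are illustrative assumptions, and the sketch ignores min/max replica bounds and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Return the replica count the HPA formula would ask for.

    `current_value` and `target_value` are per-pod averages of the metric
    (for example, queue length per pod).
    """
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

For instance, 4 replicas averaging a queue length of 60 against a target of 30 would scale to 8, while an average of 31 stays at 4 because it falls inside the tolerance band.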

18. How do you analyze a failed pod that’s already terminated?

Best Answer:

Use kubectl get pods --field-selector=status.phase=Failed to list failed pods (the old -a/--show-all flag has been removed from kubectl)
Check kubectl describe for exit codes
View logs with kubectl logs --previous

19. How do you prevent privilege escalation in Kubernetes?

Best Answer:

Set allowPrivilegeEscalation: false in the PodSpec
Drop all capabilities and add back only those the workload needs (avoid CAP_SYS_ADMIN in particular)
Disallow host networking, IPC, PID, or hostPath volumes
Use admission controllers to enforce policies

20. How do you manage secret sprawl in large Kubernetes environments?

Best Answer:

Use external secret managers (Vault, AWS Secrets Manager)
Inject via sidecars or CSI drivers
Rotate secrets regularly
Use SecretProviderClass with CSI Secrets Store for dynamic injection

✅ At This Point, You Have:

✅ 80 total Kubernetes questions (4 sets × 20)
✅ Covered real SRE use cases: HA, RBAC, monitoring, debugging, multi-tenancy, scaling, and policy enforcement
✅ Battle-tested answers aligned with Google’s SRE bar

Would you like:

📄 A single PDF with all 80 K8s questions?
🧪 Real troubleshooting lab tasks (with sample cluster failures)?
📊 Prometheus + Grafana interview questions?
🚀 Docker + CI/CD questions for Google SRE final round?

Tell me your next target — I’ll prep it exactly for you.

WHAT ARE SRE TASKS

SRE tasks

Great question, Pavan! A Senior SRE (Site Reliability Engineer) takes a strategic and proactive approach to SLA/SLO/SLI, ensuring system reliability and performance while balancing innovation and risk. Let’s break it down clearly:

🧠 How a Senior SRE Thinks About SLA/SLO/SLI

| Concept | Senior SRE View |
| --- | --- |
| SLA (Service Level Agreement) | A business commitment to customers (e.g. 99.9% uptime per month). Violations may have penalties. |
| SLO (Service Level Objective) | Internal targets used to measure reliability (e.g. 99.95% success rate over 30 days). More aggressive than SLA. |
| SLI (Service Level Indicator) | Metrics used to measure the user experience (e.g. request latency, error rate, availability). |
| Mentality | Senior SREs focus on proactive reliability: preventing incidents before SLA breaches happen. They use SLOs to balance reliability vs velocity. |

🔧 What They Do to Maintain SLA/SLO/SLI

Define Meaningful SLIs
- Examples:
  - Availability: % of requests that return a successful (non-5xx) response
  - Latency: % of requests below 200ms
  - Error Rate: 5xx errors / total requests
Set Realistic but Ambitious SLOs
- e.g., 99.9% availability with a 0.1% error budget
Monitor Error Budgets
- Use burn rate alerts: how fast you're consuming the error budget
- Pause risky deployments if SLOs are at risk
Automate Response
- Auto-scaling, self-healing (e.g., Kubernetes probes + restarts)
- Use runbooks and incident workflows
Postmortems & Root Cause Analysis
- Document each SLA breach with impact, root cause, lessons learned, and action items
Blameless Culture
- Focus on system/process improvements over blaming individuals
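The error budget and burn rate ideas above reduce to simple arithmetic. A minimal sketch, assuming a 30-day window and hypothetical function names: the budget is the allowed unreliability (1 − SLO), and the burn rate is how many times faster than "exactly on budget" errors are currently occurring.

```python
def error_budget_minutes(slo, window_days=30):
    """Return the downtime allowed by `slo` (e.g. 0.999) over the window, in minutes."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_rate, slo):
    """Return how fast the budget is burning; 1.0 means it lasts exactly the window."""
    return observed_error_rate / (1 - slo)


def hours_until_exhausted(rate, window_days=30):
    """Return hours until a full budget is consumed at a constant burn rate."""
    return window_days * 24 / rate
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime. If 1% of requests are failing, the burn rate is roughly 10, so a full budget would be gone in about 72 hours, which is the kind of signal that justifies pausing risky deployments.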

📚 What Documentation They Read / Create

| Document Type | Purpose |
| --- | --- |
| ✅ SLI/SLO Definitions | Describes how metrics are measured, where they're collected, thresholds |
| ✅ SLAs | External/customer-facing agreements (usually from product/engineering leadership) |
| ✅ Runbooks | Step-by-step incident recovery instructions |
| ✅ Playbooks | Automated/templated responses to known issues |
| ✅ Postmortems | Detailed analysis of production incidents |
| ✅ Capacity Plans | Forecasting system load and scaling needs |
| ✅ On-Call Handover Docs | Shift responsibilities and open issues summary |

📊 How They Make Dashboards in Monitoring Tools (Grafana, Datadog, etc.)

Choose SLIs to Visualize
- Request success rate
- 95th/99th percentile latency
- Uptime per region
- SLO compliance vs error budget
Use Tools Like
- Grafana (with Prometheus, Loki)
- Datadog (SLO dashboards)
- New Relic, Splunk, CloudWatch
Best Practices
- Color-code thresholds (green/yellow/red)
- Add burn rate panels for error budgets
- Show time ranges: last 1h, 24h, 7d
- Include annotations for deployments, incidents
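What a 95th/99th percentile latency panel represents can be shown with a nearest-rank calculation over raw samples. This is only an illustration with a hypothetical helper name; monitoring backends typically estimate percentiles from histogram buckets rather than sorting every sample.

```python
def percentile(samples, p):
    """Return the nearest-rank p-th percentile of `samples` (p is an integer 1..100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least p% of samples at or below it.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Percentiles matter on dashboards because an average can look healthy while the slowest 1% of requests, often the ones users complain about, are far above the SLO threshold.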

🕵️‍♂️ What Senior SREs Monitor

| Layer | Metrics |
| --- | --- |
| 🔗 Frontend | Page load time, JS errors, API failures |
| 🌐 API Layer | Request rate, error rate, latency |
| 📦 Services/Microservices | Health checks, pod restarts, memory leaks |
| 🛠️ Infrastructure | CPU, memory, disk, node health, auto-scaler behavior |
| 🔐 Security | Auth failures, rate limiting, firewall logs |
| 📈 Business Metrics | Signup rate, payment errors (when reliability affects revenue) |
| ⚠️ Alerts/Incidents | Latency spikes, high error rates, queue backlogs |

🧠 Thought Process in Short

“Is the user happy? Are we delivering what we promised? If not, are we detecting and fixing it before it causes real business pain?”

They think in terms of user experience, failure domains, proactive monitoring, and always work to balance reliability with delivery speed.

If you're building a monitoring setup or defining SLIs/SLOs for your current stack (AWS + K8s + Node + React), I can help you design real examples tailored to your infra. Want that?

incident manager

5+ years of hands-on experience in incident management for cloud and SaaS managed services.
Proven experience in major incident response and resolution management, cross-functional coordination, communication, and stakeholder expectation management.
Proven experience with ITSM tool monitoring and management for incident, problem, and SLA management.
Technical analysis: ability to analyze logs, error messages, and technical reports, and to identify root cause (RCA) and remediation.
Drive the team toward key support KPIs and SLA compliance.

Key certifications: ITIL certified / fundamental technical certification in cloud and SaaS technologies.

Key KPI Demonstrations:

MTTR Reduction
Recurring incident elimination
Incident SLA management
CSAT (Stakeholder expectation Management)
Process improvement (Incident, Problem and SLA Management)
Knowledge management

Soft skills:

Excellent written and verbal communication skills
Ability to handle business operations pressure and lead cross-functional teams toward resolution
Major incidents can happen at any time: ability to work under time pressure and be available around the clock to handle them
Ability to monitor incident management processes and tools, document them, and recommend continuous improvements


Last updated 7 months ago

🧠 Top 20 Google SRE Linux Interview Questions (with Ideal Answers)

1. What happens when you type `google.com` in your browser?

2. How does `ps` get process information?

3. Explain `top` command output.

4. Difference between a process and a thread?

5. How do you check if a port is open and listening?

6. What’s the difference between hard and soft links?

7. What is a zombie process?

8. How does Linux manage memory?

9. What is the OOM killer?

10. How do you find the top memory-consuming processes?

11. What does `nice` and `renice` do?

12. How does the kernel schedule processes?

13. Explain runlevels or systemd targets.

14. What happens during Linux boot process?

15. How do you monitor disk IO?

16. How can you find out which process is using a file or port?

17. What’s the difference between `/dev/null` and `/dev/zero`?

18. How do you check for file descriptor leaks?

19. How do cgroups work?

20. How do you analyze a Linux crash (kernel panic)?

✅ Bonus Tips for Google SRE Interviews

✅ Here’s a more realistic and experience-level-appropriate set of 20 Linux questions — with solid, real-world answers.

1. How do you check system resource usage in Linux?

2. How do you troubleshoot high CPU usage?

3. How do you schedule a job in Linux?

4. How do you find which process is using a port?

5. How do you free up disk space?

6. How do you find memory-consuming processes?

7. What’s the difference between `systemctl stop` and `kill`?

8. How do you check service logs?

9. How do you find slow disk IO issues?

10. What’s the use of `/etc/hosts`?

11. How do you monitor a running log file in real-time?

12. What’s the difference between reboot, shutdown, halt, and poweroff?

13. How do you check if a service is enabled on boot?

14. How do you list all running processes?

15. What is the difference between `chmod 755` and `chmod 644`?

16. How do you check system uptime and reboot history?

17. How do you find failed login attempts?

18. How do you create and extract a `.tar.gz` file?

19. How do you set an environment variable permanently?

20. How do you troubleshoot “Permission Denied” errors?

🎯 Final Thought

🧠 Linux Interview Questions – Set #3 (Mid-Level, Practical, Non-Repeating)

1. How do you find which user ran a command recently on a server?

2. How do you restart a service if it crashes unexpectedly?

3. How do you list files modified in the last 1 hour?

4. How do you kill all processes by a specific user?

5. How do you copy files from one server to another securely?

6. What is the `nohup` command and when do you use it?

7. How do you track cron job failures?

8. How do you find the file system type of a partition?

9. How do you unmount a busy filesystem?

10. How do you list open network connections on a Linux server?

11. How do you see memory usage per process (accurately)?

12. How do you reduce swap usage on a server?

13. How do you display human-readable permissions (rwx) with numeric mode?

14. How do you find the default gateway and DNS in Linux?

15. How do you test if DNS resolution is slow?

16. How do you limit CPU usage of a process?

17. How do you check which kernel version you're running?

18. How do you analyze server reboots or crashes?

19. How do you sync system time with NTP?

20. How do you check available entropy? Why does it matter?

You're on 🔥 — and I love that you're serious about mastering your SRE/Linux interview game.

🧠 Linux Interview Questions – Set #4 (Mid-Level, Non-Repeating)

1. How do you check if a Linux server is running out of inodes?

2. How do you list the 10 largest files in a directory tree?

3. How do you find recently installed packages in Linux?

4. How do you recursively change permissions only for files or only for directories?

5. How do you monitor real-time network usage per process?

6. How do you see which shared libraries a binary uses?

7. How do you compare two directories and find differences?

8. How do you find all files owned by a user?

9. How do you test whether a variable in a bash script is empty?

10. What’s the difference between `>` and `>>` in shell redirection?

11. How do you make a script executable from anywhere in the system?

12. How do you limit the number of processes a user can run?

13. How do you pause a running process and resume it later?