Cloud Storage
Setup
We will be using a virtual machine in the faculty's cloud.
When creating a virtual machine in the Launch Instance window:
- Name your VM using the following convention: cc_lab<no>_<username>, where <no> is the lab number and <username> is your institutional account.
- Select Boot from image in the Instance Boot Source section
- Select CC Template in Image Name section
- Select the g.medium flavor.
In the base virtual machine:
- Download the laboratory archive from here. Use wget https://repository.grid.pub.ro/cs/cc/laboratoare/lab-storage.zip to download the archive.
- Extract the archive.
student@lab-storage:~$ # download the archive
student@lab-storage:~$ wget https://repository.grid.pub.ro/cs/cc/laboratoare/lab-storage.zip
student@lab-storage:~$ unzip lab-storage.zip
Creating a Kubernetes cluster
As in the previous laboratories, we will create a cluster on the lab machine, using the kind create cluster command:
student@lab-storage:~$ kind create cluster --config kind-config.yaml
Creating cluster "cc-storage" ...
 ✓ Ensuring node image (kindest/node:v1.34.0) 🖼
 ✓ Preparing nodes 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Joining worker nodes 🚜
Set kubectl context to "kind-cc-storage"
You can now use your cluster with:
kubectl cluster-info --context kind-cc-storage
Have a nice day! 👋
It is recommended that you use port-forwarding instead of X11 forwarding to interact with the UI.
Storage in Cloud
Storage is a critical part of any cloud application. This data can be anything from user-generated content, application logs, backups, or even machine learning models. Because an application is running in the cloud, it needs a way to access storage that is not tied to a specific machine or location. This is where cloud storage comes in.
Requirements for cloud storage include:
- Accessibility: Data should be easily accessible from anywhere, through APIs or other interfaces.
- Performance: Cloud storage should provide low latency and high throughput for data access.
- Scalability: The ability to handle increasing amounts of data without performance degradation.
- Durability: Ensuring that data is not lost and can be retrieved reliably (e.g., through replication).
On-Premises vs Cloud Storage
The need for a storage solution for a cloud application is obvious, but that leaves the question of why not deploy it on-premises.
On-premises storage refers to storage solutions that are physically located within an organization's premises, such as local hard drives or network-attached storage (NAS). In contrast, cloud storage is provided by third-party providers and accessed over the internet.
| | On-Premises Storage | Cloud Storage |
|---|---|---|
| Cost | High upfront costs, ongoing maintenance | Pay-as-you-go for storage and usage |
| Performance | Limited by local hardware and network | High performance with optimized infrastructure |
| Scalability | Limited, requires manual intervention | Can grow with demand |
| Durability | Prone to failure, requires backups | High durability, often with replication |
As a rough baseline, standard object storage costs approximately $0.02β0.025 per GB/month across providers (AWS, Azure, GCP), making them broadly comparable for storage alone. The real cost differences emerge from read/write operations and how tightly a workload is coupled to provider-specific features.
Providers
- AWS S3 - The most widely adopted object storage service, with the richest ecosystem of integrations and tooling
- GCP Cloud Storage - Tight integration with Google's data and ML services (BigQuery, Dataflow, Vertex AI)
- Azure Blob Storage - Best fit for organizations already in the Microsoft ecosystem (Active Directory, Office 365)
Storage in Kubernetes
Kubernetes provides integration with various storage backends, abstracting them with the following concepts:
- Persistent Volume Claim (PVC): a request for storage by a user. This request is fulfilled by finding a suitable Persistent Volume and binding it to the claim.
- Persistent Volume (PV): a piece of storage in the cluster that can be mounted by pods. It can be provisioned by an administrator manually or dynamically, using a Storage Class.
- Storage Class: it is configured to create Persistent Volumes on demand. It defines the provisioner (e.g., AWS EBS, GCE PD, Azure Disk) and parameters (e.g., type of disk, IOPS) for the PVs it creates.
This abstraction enables applications to use storage as they would with a local disk, while Kubernetes manages the underlying storage resources and their lifecycle. Changing the storage backend (e.g., switching from AWS EBS to Azure Disk) does not require changes to the application code, as long as the PVCs and PVs are properly configured.
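To illustrate this portability, here is a hedged sketch of two Storage Classes backed by different providers. The provisioner names follow the standard CSI driver naming (ebs.csi.aws.com, disk.csi.azure.com); the class names and parameters are illustrative, not from this lab:

```yaml
# Two illustrative Storage Classes; a PVC only references a class by name,
# so swapping backends means pointing storageClassName at a different class.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-aws
provisioner: ebs.csi.aws.com        # AWS EBS CSI driver
parameters:
  type: gp3                         # EBS volume type
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-azure
provisioner: disk.csi.azure.com     # Azure Disk CSI driver
parameters:
  skuName: Premium_LRS              # Azure managed disk SKU
```

A PVC that requests storageClassName: fast-aws can be repointed to fast-azure without touching the pod spec that mounts it.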
Persistent Volume Claim (PVC)
The Persistent Volume Claim is a request for storage by a user. It will be fulfilled by Kubernetes and bound to a suitable Persistent Volume. A typical PVC definition looks like this:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: my-pvc
spec:
# How the volume will be mounted by the pod. Available options are:
# - ReadWriteOnce: the volume can be mounted as read-write by a single node
# - ReadOnlyMany: the volume can be mounted as read-only by many nodes
# - ReadWriteMany: the volume can be mounted as read-write by many nodes
accessModes:
- ReadWriteOnce
# The minimum amount of storage that the volume should have.
resources:
requests:
storage: 8Gi
# Optional: the name of the Storage Class to use for dynamic provisioning.
storageClassName: nvme-ssd
# Alternatively, you can specify a specific PV to bind to by using the `volumeName` field.
# This will block the claim until the specified PV is available and matches the claim's requirements.
# volumeName: my-pv
The accessModes field in the PVC and PV definitions refers to nodes, not pods. This means that if a PV is created with ReadWriteOnce, it can only be mounted by one node at a time, but multiple pods on that node can access it simultaneously.
To use a PVC you have to mount it in a pod. This is done in two steps:
- First, you specify the PVC in the volumes section of the pod spec; this makes the volume available to the containers in the pod.
- Then, you specify the volumeMounts in the container spec to mount the volume at a specific path inside the container.
A typical pod definition that uses a PVC looks like this:
apiVersion: v1
kind: Pod
metadata:
name: my-pod
spec:
# The list of volumes that can be mounted by containers in this pod. Each volume must have a unique name.
volumes:
- name: my-volume
# The source of the volume. In this case we are using a PVC, but there are other options like ConfigMap, Secret, etc.
persistentVolumeClaim:
claimName: my-pvc
containers:
- name: my-container
image: nginx
# The list of volumes mounted into the container. Each volumeMount must reference a volume defined in the .spec.volumes.
volumeMounts:
- name: my-volume
mountPath: /usr/share/nginx/html
Persistent Volume (PV)
A Persistent Volume is an extension of the concept of a Volume in Docker. Both are used to persist data beyond the lifecycle of a container. In addition:
- the Persistent Volume is not tied to a specific node, meaning that a pod can be rescheduled to another node without losing data
- the Persistent Volume is not tied to a specific pod, meaning that multiple pods can mount it to share data
The Persistent Volume is a piece of storage in the cluster that can be mounted by pods. Unless there is a specific need to create PVs manually, it is recommended to use dynamic provisioning with Storage Classes, which simplifies the management of storage resources. A typical PV definition looks like this:
apiVersion: v1
kind: PersistentVolume
metadata:
name: my-pv
spec:
# How the volume will be mounted by the pod. Available options are:
# - ReadWriteOnce: the volume can be mounted as read-write by a single node
# - ReadOnlyMany: the volume can be mounted as read-only by many nodes
# - ReadWriteMany: the volume can be mounted as read-write by many nodes
accessModes:
- ReadWriteOnce
# The capacity of the volume. This is the total amount of storage that the PV provides.
capacity:
storage: 8Gi
# The policy for reclaiming the volume when it is released. Available options are:
# - Retain: the volume will be retained when the claim is deleted
# - Delete: the volume will be deleted when the claim is deleted
persistentVolumeReclaimPolicy: Retain
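For a manually provisioned PV, the spec must also name a storage backend. A minimal sketch using a hostPath backend follows (the PV name and path are illustrative; hostPath is suitable only for single-node or learning setups, not production):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-hostpath-pv
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 8Gi
  persistentVolumeReclaimPolicy: Retain
  # hostPath backs the PV with a directory on the node's filesystem.
  hostPath:
    path: /tmp/my-pv-data
```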
Exercise: Manual Provisioning
Storage Classes are the recommended way to manage storage in Kubernetes, but it is also possible to create Persistent Volumes manually. This exercise will help you understand how manual provisioning works and how to troubleshoot common issues.
An app was deployed but its pod is stuck in Pending. Figure out what is missing and fix it.
- Run the setup script to create the broken resources:

  student@lab-storage:~$ bash setup-manual-pvc.sh

- Investigate the status of the pod and the PVC:

  student@lab-storage:~$ kubectl describe pod manual-pv-pod
  student@lab-storage:~$ kubectl describe pvc manual-pvc

- Create the missing resource so the pod reaches Running.

  Tip: When creating the Persistent Volume you have to set up its storage backend. For this exercise you can use the .spec.hostPath.path: /tmp/manual-pv-data field, which will link the PV to a directory on the node. This is not recommended for production use, but it is useful for learning purposes.
Storage Class
A Storage Class is a way to define how storage is provisioned in the cluster. A cluster might have multiple Storage Classes, each representing a different type of storage (e.g., SSD, HDD, network storage) with different performance characteristics and costs. A typical Storage Class definition looks like this:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: nvme-ssd
# The provisioner that will create the underlying storage resource.
# Each cloud provider has its own provisioner (e.g., AWS EBS, GCE PD, Azure Disk).
# Use "rancher.io/local-path" for local storage.
provisioner: rancher.io/local-path
# The policy for reclaiming the volume when the PVC is deleted. Available options are:
# - Retain: the volume will be retained when the claim is deleted
# - Delete: the volume will be deleted when the claim is deleted
reclaimPolicy: Delete
# Allow PVCs to expand the volume after creation.
allowVolumeExpansion: true
# When to bind a Persistent Volume to a Persistent Volume Claim. Available options are:
# - Immediate: the PV will be bound to the PVC as soon as it is created
# - WaitForFirstConsumer: the PV will be bound to the PVC only when a Pod that uses the PVC is scheduled.
volumeBindingMode: WaitForFirstConsumer
Storage Provisioning in Kubernetes
The above diagram illustrates the process of provisioning a Persistent Volume by a Storage Class and mounting it to a Pod:
- The user creates a Persistent Volume Claim (PVC)
- The Storage Class detects the PVC, requests the provisioner to allocate the storage, and creates a Persistent Volume (PV) that satisfies the claim
- The PV is bound to the PVC, tying their lifecycle together
- The user creates a Pod that references the PVC, and Kubernetes mounts the PV to the Pod
- The Pod can now read/write data to the PV, and the data will persist
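The steps above can be sketched as a minimal manifest pair: a PVC that triggers dynamic provisioning through the nvme-ssd Storage Class defined earlier, and a Pod that consumes it (the resource names here are illustrative):

```yaml
# 1. The user creates a PVC; the Storage Class provisions a matching PV
#    and binds it to the claim.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nvme-ssd
  resources:
    requests:
      storage: 1Gi
---
# 2. A Pod references the PVC; Kubernetes mounts the bound PV into the container.
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-pvc
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "echo hello > /data/file && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
```

With volumeBindingMode: WaitForFirstConsumer, the PV is only provisioned and bound once demo-pod is scheduled.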
Deep Dive: How is the filesystem from cloud storage provisioned and mounted to a Pod?
The above diagram omitted some details to keep the focus on the high-level flow and interactions between the components. This section explores the underlying mechanisms of how the filesystem from cloud storage is provisioned and mounted to a Pod.
Once a Pod is scheduled to a node, the kubelet on that node detects that the Pod has a volume that needs to be mounted:
- The kubelet learns from the Pod which Persistent Volume Claim (PVC) it needs to mount, and from the PVC it learns which Persistent Volume (PV) is bound to it.
- The kubelet then interacts with the Container Storage Interface (CSI) driver associated with the PV's Storage Class to provision the storage if it hasn't been provisioned yet, and to mount the storage to the node. (upper part of the diagram)
- The kubelet then mounts the storage from the node to the container's filesystem, making it available for the application running in the Pod to read/write data. (lower part of the diagram)
Exercise: My Pod Won't Start
Investigate why a Pod is stuck in Pending state even though it has a PVC attached. Fix the issue and get the Pod running.
- Run the setup script to create the broken resources:

  student@lab-storage:~$ bash setup-broken-storage-class.sh

- Investigate the status of the Pod and the PVC:

  student@lab-storage:~$ kubectl describe pod -l app=broken-storage
  student@lab-storage:~$ kubectl describe pvc broken-pvc

- List the available Storage Classes and identify the correct one:

  student@lab-storage:~$ kubectl get storageclass
When working with local storage provisioners like rancher.io/local-path on a kind cluster, you must use the WaitForFirstConsumer volume binding mode.
Exercise: Sharing Storage Between Deployments
The goal of this exercise is to deploy a writer Deployment that writes to a volume, and a reader Deployment that reads from the same volume. There will be a single writer pod and multiple reader pods. The writer will append a timestamped message to a file every 5 seconds, while the reader pods will print the contents of that file.
- Apply the provided manifests to create a PVC and a writer Deployment:

  student@lab-storage:~$ kubectl apply -f shared-pvc-manifests.yaml
  persistentvolumeclaim/read-write-pvc created
  deployment.apps/writer created
  student@lab-storage:~$ kubectl logs -l app=writer
  Wed Mar 18 07:40:45 UTC 2026: hello from the other side
  Wed Mar 18 07:40:50 UTC 2026: hello from the other side
  Wed Mar 18 07:40:55 UTC 2026: hello from the other side

- Edit shared-pvc-manifests.yaml to add a reader Deployment with 2 replicas that mounts the same PVC and prints the contents of the file every 5 seconds.

  Tip: To read the contents of the file where the volume is mounted, you can use a busybox container with the command sh -c "tail -f /data/messages.txt".

- The reader pods are probably stuck in Pending state. Investigate the reason and fix the issue in shared-pvc-manifests.yaml.

  Tip: Some Kubernetes objects have immutable fields that cannot be changed after creation. If you need to change an immutable field, you must delete and recreate the object with the correct configuration. You can use kubectl delete -f shared-pvc-manifests.yaml to delete the existing resources, then kubectl apply -f shared-pvc-manifests.yaml to create them again with the updated configuration.

  Hint: Why are the reader pods stuck in Pending state? ReadWriteOnce means the volume can be attached to one node at a time. Are the reader and writer pods scheduled on the same node?

- Ensure the reader pods are running and check their logs to see the messages written by the writer pod:

  student@lab-storage:~$ kubectl logs -l app=reader
StatefulSet
A StatefulSet is a workload controller for pods that need a stable, persistent identity. Unlike a Deployment - where all replicas are identical and interchangeable - each pod in a StatefulSet has a unique, ordered identity (myapp-0, myapp-1, ...) that is preserved across restarts and rescheduling.
This distinction is crucial for applications that require stable network identities or persistent storage. A PostgreSQL replica, for example, must always come back as the same replica - same hostname, same storage. If that were not the case, from the application perspective, it would look like the replica lost all its data.
Deep Dive: Why do applications like databases require stable identities and persistent storage?
Modern applications are designed to be scalable and resilient, which often means they can run as multiple coordinated instances (replicas). For some applications, like stateless web servers, it doesn't matter which instance serves a request - any replica can handle it. For others, like databases, each instance has a specific role and state that must be preserved.
We will refer to a set of such instances as a cluster. Do not confuse this with a Kubernetes cluster.
Let's take the example of a database cluster and see why preserving identities and storage is important:
- Data routing - data is split into shards, each owned by a specific replica, so reads and writes can be routed directly to the right place. If a replica restarts with a different identity, the cluster treats it as a brand-new empty node and loses track of which shard it holds - requests for that data can no longer be routed, and read traffic can't be balanced across replicas either.
- Replication safety - to guard against failures, each shard is copied to multiple replicas (typically 3). If a replica restarts with a different identity, the cluster sees an unknown member and starts re-replicating data to it, potentially overwriting data that was still valid on that node or dropping in-flight writes.
- Coordination - replicas elect a leader that decides shard assignments, admission of new members, and failover. Leader election and role assignment are tied to stable identities. A replica that returns with a new identity looks like an unknown member joining the cluster, triggering unnecessary re-elections and shard rebalancing that cause downtime and churn.
Designing an application to be resilient to changing identities and storage is possible, but it adds significant complexity and overhead. StatefulSets provide a simple way to give applications the stable identities and persistent storage they need, without having to build that logic into the application itself.
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: myapp
spec:
# How many replicas to create.
replicas: 3
# How to identify the pods that belong to this StatefulSet.
selector:
matchLabels:
app: myapp
# How to create the pods.
template:
metadata:
labels:
app: myapp
spec:
containers:
- name: myapp
image: busybox
command: ["sh", "-c", "while true; do echo $(hostname): $(date) >> /data/log.txt; sleep 5; done"]
# Mount the volume created from the volumeClaimTemplates.
volumeMounts:
- name: data
mountPath: /data
# How to provision storage for each pod.
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes:
- ReadWriteOnce
storageClassName: standard
resources:
requests:
storage: 1Gi
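One detail worth noting: per-pod DNS names like myapp-0.myapp only work when the StatefulSet is governed by a headless Service, referenced through the StatefulSet's spec.serviceName field. A minimal sketch that would accompany the manifest above (assuming the same app: myapp labels; the port is illustrative):

```yaml
# Headless Service (clusterIP: None) that gives each StatefulSet pod
# a stable DNS entry: myapp-0.myapp, myapp-1.myapp, ...
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  clusterIP: None
  selector:
    app: myapp
  ports:
    - name: http
      port: 80
```

The StatefulSet would then set serviceName: myapp in its spec so its pods are registered under this Service.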
You might wonder whether you could achieve the same result by creating a Deployment and manually creating one PVC per replica. This is technically possible, but it falls apart quickly because a Deployment names its pods with random hashes (myapp-7d9f8b-xkz2p), so after a restart there is no reliable way to know which pod should mount which PVC.
Here is a quick comparison of the key differences between Deployments and StatefulSets:
| | Deployment | StatefulSet |
|---|---|---|
| Pod names | Random hash (myapp-7d9f8b-xkz2p) | Stable index (myapp-0, myapp-1) |
| Per-pod DNS | No | Yes (myapp-0.myapp.svc.cluster.local) |
| Storage | All replicas share one PVC | One PVC per pod via volumeClaimTemplates |
| Startup / shutdown order | Parallel, no guarantees | Sequential (0->1->2, teardown 2->1->0) |
| Pod replacement | New name, no PVC affinity | Same name, rebinds to original PVC |
Exercise: StatefulSet with Shared Shards
Deploy a StatefulSet with 3 replicas, where each pod periodically writes its hostname and timestamp to a shared file. The problem is that all pods are writing to the same file, instead of each pod having its own shard. Your task is to investigate why this is happening and fix the issue.
- Run the setup script to deploy the broken StatefulSet:

  student@lab-storage:~$ bash setup-shared-shards-statefulset.sh

- Inspect the shard data across pods and observe the problem:

  student@lab-storage:~$ kubectl exec shared-shards-sts-0 -- cat /data/shard.txt
  student@lab-storage:~$ kubectl exec shared-shards-sts-1 -- cat /data/shard.txt

- Investigate why all pods are writing to the same place. Check which PVC each pod is using:

  student@lab-storage:~$ kubectl describe pod shared-shards-sts-0

- Get the StatefulSet manifest, edit it to fix the issue, and apply the fix:

  # You can use `kubectl get -o yaml` to see the manifest and `kubectl edit` to make changes.
  student@lab-storage:~$ kubectl get statefulset shared-shards-sts -o yaml
  student@lab-storage:~$ kubectl edit statefulset shared-shards-sts

  Tip: A StatefulSet with a volumeClaimTemplates section does not need to specify a volumes section in the pod template. The PVCs created from the volumeClaimTemplates are automatically made available to the pods as volumes, and can be mounted by name in the container spec.