Introduction to CAP Theorem
The CAP Theorem is a fundamental principle in distributed systems that describes the trade-offs involved in designing a distributed database. It was proposed by Eric Brewer in 2000. The theorem states that a distributed system can guarantee at most two of the following three properties simultaneously:
- Consistency (C)
- Availability (A)
- Partition Tolerance (P)
These three properties are vital in the context of distributed data storage systems, but according to the CAP theorem, achieving all three is impossible. Hence, when designing distributed systems, you must prioritize two of these properties based on your system requirements.
Key Concepts of the CAP Theorem
Consistency (C):
Consistency means that every read operation returns the most recent write, i.e., all nodes in a distributed system see the same data at the same time, no matter which node a client connects to.
In simple terms, when data is written to the system, it must immediately be propagated to all nodes, ensuring that all users see the same information.
Example: In a bank transaction system, if one account is debited, all nodes must reflect that change immediately to avoid showing inconsistent balances.
Impact: To maintain consistency, a system may have to delay responses to ensure all replicas are updated.
Availability (A):
Availability means that the system is always able to serve requests.
Every request (read or write) gets a response, even if the system doesn't guarantee the most up-to-date data or even if some of the nodes are down.
Example: In a social media system, you might still be able to post updates or view older posts, even if the latest data from other users hasn’t yet been propagated to all nodes.
Impact: Prioritizing availability may lead to stale or outdated data being served during failures or delays in updates.
Partition Tolerance (P):
Partition indicates a communication break between two nodes. Partition Tolerance means that the system continues to operate even when there is a network partition, i.e., when communication between nodes in a distributed system is disrupted.
The system can still function despite certain nodes being isolated from each other.
Example: In a cloud service, if data centers in different regions lose connectivity, a partition-tolerant system will continue serving users with data from the available nodes, even if they can't communicate with others.
Impact: Partition tolerance is crucial in distributed systems, especially in wide-area networks, but it often forces the system to compromise on consistency or availability.
Trade-offs in CAP Theorem
Based on the CAP Theorem, distributed systems can only guarantee two of the three properties. Let’s break down the different combinations:
1. Consistency + Availability (CA)
If you prioritize both Consistency and Availability, the system will not be able to tolerate network partitions. Systems that ensure consistency and availability work well as long as there are no network partitions. If a partition occurs, the system will likely stop responding, as the system can no longer guarantee both consistency and availability.
Use Case: This combination is suitable for systems within a single data center where the likelihood of network partitions is minimal.
2. Consistency + Partition Tolerance (CP)
With Consistency and Partition Tolerance, the system may sacrifice availability. When a partition occurs, the system may become unavailable to maintain consistency.
Use Case: Suitable for systems where consistency is critical, such as financial systems or banking applications where every transaction must be accurate and up-to-date.
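One common way to favor consistency over availability is to accept a write only when a majority of replicas can be reached. The following is a minimal sketch of that idea, not a description of any particular database; the names `CPRegister` and `QuorumUnavailable` are hypothetical and the replicas are simulated in memory.

```python
class QuorumUnavailable(Exception):
    """Raised when too few replicas are reachable to accept a write safely."""


class CPRegister:
    """A single replicated value that refuses writes without a majority."""

    def __init__(self, replica_names):
        self.values = {name: None for name in replica_names}
        self.reachable = set(replica_names)

    def write(self, value):
        majority = len(self.values) // 2 + 1
        if len(self.reachable) < majority:
            # Give up availability rather than risk divergent replicas.
            raise QuorumUnavailable("cannot reach a majority of replicas")
        for name in self.reachable:
            self.values[name] = value


register = CPRegister(["n1", "n2", "n3"])
register.write("balance=100")      # all three replicas are reachable

register.reachable = {"n3"}        # simulate the minority side of a partition
try:
    register.write("balance=50")
    accepted = True
except QuorumUnavailable:
    accepted = False

print(accepted, register.values)
```

The rejected write is the cost of the CP choice: the minority side stays consistent with the last agreed value, but it cannot serve writes until the partition heals.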
3. Availability + Partition Tolerance (AP)
If you prioritize Availability and Partition Tolerance, you may sacrifice consistency. These systems remain available and partition-tolerant, but may return outdated or inconsistent data when there are network issues. This is often referred to as eventual consistency, where the system will eventually become consistent but not necessarily immediately.
Use Case: Ideal for systems where availability is more important than consistency, such as social media feeds, caching systems, or shopping cart systems in e-commerce platforms.
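The AP behavior described above can be sketched with replicas that always accept local writes and reconcile later. This example assumes a last-write-wins rule based on a caller-supplied timestamp, which is only one possible reconciliation strategy; the `APReplica` class and its methods are hypothetical.

```python
class APReplica:
    """A replica that always accepts writes and reconciles by timestamp."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.timestamp = 0

    def write(self, value, timestamp):
        # Accept the write locally without contacting other replicas.
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

    def read(self):
        return self.value

    def merge(self, other):
        # Last-write-wins: keep whichever value carries the newer timestamp.
        if other.timestamp > self.timestamp:
            self.value, self.timestamp = other.value, other.timestamp


a = APReplica("a")
b = APReplica("b")

a.write("cart=[book]", timestamp=1)
b.merge(a)                         # replicas agree before the partition

# During the partition, only replica a sees the new write.
a.write("cart=[book, pen]", timestamp=2)
stale = b.read()                   # b still answers, but with old data

# After the partition heals, the replicas exchange state and converge.
b.merge(a)
print(stale, "->", b.read())
```

Last-write-wins is simple but can silently discard concurrent updates, so real systems often use richer merge rules for data such as shopping carts.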
CAP Theorem in Real-World Distributed Systems
In real-world systems, we cannot avoid network partitions, so we must choose between consistency and availability. Consider a cluster of three nodes, n1, n2, and n3, that replicate the same data.
If n3 becomes unreachable and cannot communicate with n1 and n2, then:
- Any writes sent to n1 or n2 cannot be forwarded to n3.
- Any writes sent to n3 cannot reach n1 or n2, so n1 and n2 will have stale data.
Choosing Consistency (CP): We must block all write operations on n1 and n2 during the partition. This prevents inconsistent data, but it also means the system becomes unavailable for writes until n3 is reachable again.
Choosing Availability (AP): The system continues to accept both reads and writes, even though some nodes may return outdated data. n1 and n2 keep processing writes, and once the network partition heals, the data is reconciled and synced with n3.
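The two choices for the n1, n2, n3 partition can be compared side by side. The sketch below is illustrative only: `handle_write` is a hypothetical helper, the "CP" policy follows the text by blocking writes whenever any node is unreachable, and the "AP" policy writes to whichever nodes can be reached.

```python
def handle_write(policy, reachable, all_nodes, store, value):
    """Try to apply a write on one side of a partition.

    policy "CP": reject the write unless every node is reachable.
    policy "AP": accept the write on whichever nodes are reachable.
    Returns True if the write was accepted.
    """
    if policy == "CP" and set(reachable) != set(all_nodes):
        return False
    for node in reachable:
        store[node] = value
    return True


nodes = ["n1", "n2", "n3"]
reachable = ["n1", "n2"]           # n3 is cut off by the partition

cp_store = {node: "v1" for node in nodes}
cp_accepted = handle_write("CP", reachable, nodes, cp_store, "v2")

ap_store = {node: "v1" for node in nodes}
ap_accepted = handle_write("AP", reachable, nodes, ap_store, "v2")

print(cp_accepted, cp_store)       # write refused, all nodes still agree
print(ap_accepted, ap_store)       # write accepted, n3 is now stale
```

Under the CP policy every node still holds the same value but the write was refused; under the AP policy the write succeeded and n3 must be brought up to date after the partition heals.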
Strong Consistency: Every read returns the most recent write, as if there is a single up-to-date copy of the data. After a write completes, all nodes in the distributed system immediately see the same value.
Eventual Consistency: Nodes may temporarily return stale data, but if no new updates occur, all replicas will converge to the same value over time. The system does not guarantee immediate consistency, only eventual agreement.
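One way to see the difference between the two models is through read and write quorums. With N replicas, if a write must reach W replicas and a read consults R replicas, then W + R > N guarantees that every read overlaps the latest write. The sketch below assumes each replica stores a (version, value) pair; the helper names are hypothetical.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """True when every read quorum must intersect every write quorum."""
    return w + r > n


def read_latest(replicas, read_set):
    """Return the value with the highest version among the replicas consulted."""
    return max((replicas[name] for name in read_set), key=lambda pair: pair[0])[1]


# A write with W=2 reached n1 and n2; n3 has not received it yet.
replicas = {"n1": (2, "new"), "n2": (2, "new"), "n3": (1, "old")}

# With R=2, every possible pair of replicas includes an updated node.
strong_reads = {read_latest(replicas, pair) for pair in combinations(replicas, 2)}

# With R=1, a read that lands on n3 alone returns the stale value.
weak_reads = {read_latest(replicas, (name,)) for name in replicas}

print(quorums_overlap(3, 2, 2), strong_reads)
print(quorums_overlap(3, 2, 1), weak_reads)
```

With R=1 the system answers from any single node, which is faster and more available, but readers may see the old value until n3 catches up.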
CAP Theorem in System Design Interviews
In system design interviews, CAP theorem is an important concept because it helps you understand trade-offs in designing distributed systems. When discussing a system, you should be able to explain:
- What kind of data consistency is required for the system (e.g., strong or eventual consistency)?
- How important availability is — does the system need to serve requests under all conditions, or is temporary unavailability acceptable?
- How the system can handle network partitions, and whether the system should continue to function during partition events.
When asked about CAP in a system design interview, follow this approach:
1. Clarify Requirements
Ask clarifying questions to understand the business or technical requirements. Does the system need high availability (e.g., a social media feed)? Or is consistency more critical (e.g., financial transactions, or a booking system where double bookings must be avoided)?
2. Identify Trade-offs
Explain how the system you propose will handle trade-offs between consistency, availability, and partition tolerance. For example, if you’re designing a social media platform, you might prioritize availability and partition tolerance, with eventual consistency being acceptable.
3. Use Real-World Examples
Reference real-world systems and databases like Cassandra, DynamoDB, or traditional SQL systems to highlight how different distributed systems make trade-offs based on CAP theorem.
4. Discuss Failures and Recovery
Address what happens in the event of network partitions and how your system would recover. Mention concepts like eventual consistency, leader election, and replication strategies to strengthen your explanation.
Cloud Databases Following the CAP Theorem
- Amazon DynamoDB (AP)
- Google Cloud Spanner (CP)
- Amazon Aurora (CA)
- Cassandra (AP)
- Firebase Realtime Database (AP)
- Amazon RDS (CA)