Understanding CAP Theorem and Database Sharding

CAP THEOROM :

C – Consistency

A – Availability

P – Partitioning

The CAP Theorem is a fundamental concept in distributed system design, helping engineers understand the trade-offs between Consistency, Availability, and Partition Tolerance in any distributed data store.

Here’s a brief explanation of each:

Consistency (C): Every read receives the most recent write or an error. All nodes return the same data at the same time.
Availability (A): Every request (read/write) gets a response, even if it's not the most up-to-date data.
Partition Tolerance (P): The system continues to operate despite network partitions (i.e., communication between nodes may fail).

Key Insight of CAP Theorem:

You can only guarantee two of the three properties (Consistency, Availability, Partition Tolerance) in a distributed system. This is often summarized as “Choose two out of three.”

For instance:

CA Systems: Prioritize consistency and availability but struggle with partition tolerance. Example: Traditional relational databases.
CP Systems: Focus on consistency and partition tolerance, sacrificing some availability during partition events. Example: HBase.
AP Systems: Ensure availability and partition tolerance but may not always be consistent. Example: Cassandra, DynamoDB.

Real-World Application:

In practice, no system can completely sacrifice partition tolerance (since network failures are inevitable), so the choice typically boils down to balancing consistency and availability.

Understanding the CAP Theorem helps engineers make informed decisions about the right database or system architecture based on application requirements and trade-offs. It's particularly important when scaling systems across multiple regions or designing high-availability architectures.

Distributed System : A System consisting of a group of machines working in coordination so as to appear as a single coherent system to the end-user

Consistency : Any Read that is happening after a latest write , all the nodes should return the latest value of that Write.

Availability : Every Available node in the system should respond in a non error format to any Read request without the guarantee of returning the latest Write

Partition Tolerance : System will be responding to all Read and Write if the communication channel (or middleware) between nodes is broken ( or partitioned)

In CAP THEOROM generally will support any two of them in CAP , like C&A , A&C , A&P , P&C any one of the thing we need to be left , not all the time the Distributed will support C,A,P will accept anyone of the thing we have to left.

DB SHARDING :

Understanding Database Sharding | DigitalOcean

Horizontal Partitioning Is same as Sharding used to store the large data with highly available and scaling

Local Sharding – Physical Machine (Physical DB)

Physical Sharding have Multiple Number of Local Sharding and the Data should be Stored in Multiple Physical Machines ( DB)

Advantage of Sharding :

Query Optimization

Better Performance

Reduce Latency

There are two types of Sharding

1.Algorithmic Sharding – App/Client knows which Shard the query has to go that’s Algorithmic Sharding

2.Dynamic Sharding – Other Module / Part which Client has to query in order to findout which Shard the query has to go to its called Dynamic Sharding

Disadvantage of Sharding :

Partition could be proper

Recover the old data , with Sharding to Non Sharding

Key Based Sharding :

Shared Key its should be an Static it won’t be change

We can select Multiple Column as Shared Key

First step here to pickup an Shard Key , Shard Key not equal to Primary Key

A Primary Key of the Table can be an Shard Key but the Vice Versa its Not Possible

Choosing the Shard Key means Pickup an Column on bases of which you are going to divide the data

Ex : UserId your picking up as Shard Key

After selecting the Shard(UserId) key It will use the Hash function so based on the Shard Key it will be give the Shard Value

Its an Algorithmic Sharding

Advantage :

Evenly Distributed Data

Disadvantage :

Adding new Shard because the Hashing function might Change

Range Based Sharding :

You have to store the Data in Shard in basis of Range

Its an Algorithmic Sharding

Ex : You are storing the value in Shard based on Month Range , like Each 2 Months Data should be Stored in One Shard

Advantage :

Same Database Schema for all Logical & Physical Shard

Disadvantage :

Increased Storage on Specific Shard This is called as an HOTSPOT

Data will not be uniformly distributed , like its an Range Based Shard so In Specific Range only it will be Shard to Specific Shard

Directory Based Sharding (Dynamic Sharding) :

Here Dividing your Shard Based on the Specific Column

Ex : Here you are Dividing Shard Based on Zone

Zone : 1 2 3 4 Shard : A B C D

Here Zone 1 —> Shard A , 2 —> B , 3 —> C , 4 —> D

These will be stores in Lookup Table based on that Zone the Shard could be Directed.

Advantage :

We can add Shard without Touching previous Shard

Remove Shard when ever you want

Data Evenly Distributed

Disadvantage :

For Every Read & Write the Application It will be first Connect with Lookup Table then Lookup Table Routs Where Read & Write has to go.

Lookup Table Crashes the Whole System will Crashes

Cap Theorom & Db Sharding

Key Insight of CAP Theorem:

Real-World Application: