Reducing Object Storage Usage with Hash-Based Media Deduplication
How I optimized media storage by implementing hash-based deduplication to prevent duplicate image uploads and reduce bandwidth and storage consumption.
introduction
While building media upload functionality, I noticed a common inefficiency: identical images being uploaded multiple times by different users. Since object storage services charge for storage and bandwidth, repeatedly storing identical files wastes resources. To address this, I implemented a hash-based deduplication system that ensures identical media files are stored only once.
problem statement
If every uploaded image is blindly stored in object storage, the system quickly accumulates duplicate files. Even when the content is identical, each upload consumes additional storage and increases bandwidth usage. In systems with limited free storage tiers, such as the 5GB free tier in Krutrim Object Storage, duplicate uploads can exhaust available capacity much faster than necessary.
naive approach and limitations
- Every upload creates a new object in storage.
- Identical files are stored multiple times.
- Storage quota is consumed unnecessarily.
- Bandwidth usage increases due to repeated uploads.
design decision
{
"core_feature": "Content-based media deduplication",
"identification_method": "SHA-256 hashing of file contents",
"storage_strategy": "Single stored object referenced by multiple posts",
"database_tracking": "Dedicated media table storing hash and metadata"
}
architecture overview
The system computes a cryptographic hash of the uploaded file before storing it. This hash uniquely represents the file’s content. The backend checks a media table to determine whether a file with the same hash already exists. If it does, the existing media reference is reused instead of uploading a new file. If the hash does not exist, the file is uploaded to object storage and a new media record is created.
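The hashing step itself is small. A minimal Python sketch using the standard hashlib module (the function name is illustrative, not from the original system):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the file's binary content."""
    return hashlib.sha256(data).hexdigest()

# Identical bytes always produce the same digest, regardless of filename,
# which is what lets the backend detect a re-upload of the same image.
h1 = content_hash(b"same image bytes")
h2 = content_hash(b"same image bytes")
assert h1 == h2
```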
implementation details
{
"hash_generation": "When a file is uploaded, the server generates a SHA-256 hash from the binary buffer. This hash acts as the unique identifier for the file content.",
"media_table_design": "A dedicated media table stores metadata including the hash, object storage key, URL, file size, and usage count.",
"duplicate_detection": "Before uploading a file, the backend queries the media table using the generated hash. If a match exists, the system skips the upload and reuses the existing storage URL.",
"reference_tracking": "Each media record tracks how many posts reference it through a usage counter, ensuring safe deletion when no posts use the file."
}
example query pattern
The upload pipeline first checks if the hash exists in the media table. If a match is found, the system reuses the stored URL. If not, the file is uploaded and a new media record is inserted.
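A minimal sketch of that check-then-reuse flow, using an in-memory SQLite table standing in for the real media table and object storage (all table, column, and URL names here are illustrative assumptions):

```python
import hashlib
import sqlite3

# In-memory database standing in for the real media table.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE media (
        hash       TEXT PRIMARY KEY,   -- SHA-256 hex digest, enforced unique
        object_key TEXT NOT NULL,
        url        TEXT NOT NULL,
        size       INTEGER NOT NULL,
        ref_count  INTEGER NOT NULL DEFAULT 1
    )
""")

def store_media(data: bytes) -> str:
    """Return the URL for this content, uploading only if the hash is new."""
    digest = hashlib.sha256(data).hexdigest()
    row = db.execute("SELECT url FROM media WHERE hash = ?", (digest,)).fetchone()
    if row:
        # Duplicate: skip the upload, bump the reference counter.
        db.execute("UPDATE media SET ref_count = ref_count + 1 WHERE hash = ?",
                   (digest,))
        return row[0]
    # New content: this is where the real object-storage upload would happen.
    url = f"https://cdn.example.com/media/{digest}"
    db.execute(
        "INSERT INTO media (hash, object_key, url, size) VALUES (?, ?, ?, ?)",
        (digest, f"media/{digest}", url, len(data)),
    )
    return url
```

Uploading the same bytes twice returns the same URL and leaves a single row in the table, with the reference counter recording both uses.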
performance considerations
- The hash column must be indexed so duplicate detection stays fast as the table grows.
- A UNIQUE constraint on the hash column lets the database itself reject duplicate rows, even under concurrent uploads.
- Skipping uploads reduces both storage writes and network bandwidth.
- Deduplication improves upload speed for previously uploaded media.
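One way to make the UNIQUE constraint do double duty is an upsert, which also closes the race between the existence check and the insert when two clients upload the same file at the same moment. A sketch in SQLite syntax (table and column names are assumptions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# The UNIQUE constraint makes a duplicate hash a database-level conflict,
# and the index it creates keeps the existence check a fast lookup.
db.execute("""
    CREATE TABLE media (
        id        INTEGER PRIMARY KEY,
        hash      TEXT NOT NULL UNIQUE,
        url       TEXT NOT NULL,
        ref_count INTEGER NOT NULL DEFAULT 1
    )
""")

digest = "ab" * 32  # stand-in for a real SHA-256 hex digest

# First call inserts; a concurrent duplicate bumps the counter instead of
# raising, so check-then-insert races cannot create a second row.
for _ in range(2):
    db.execute("""
        INSERT INTO media (hash, url) VALUES (?, ?)
        ON CONFLICT(hash) DO UPDATE SET ref_count = ref_count + 1
    """, (digest, f"https://cdn.example.com/{digest}"))
```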
security and validation
- Used SHA-256 to minimize hash collision risks.
- Validated file type and size before hashing.
- Prevented direct object access without proper authorization.
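A sketch of pre-hash validation, checking a size cap and magic bytes before any hashing work is done (the limit and accepted formats here are assumptions, not values from the original system):

```python
MAX_SIZE = 5 * 1024 * 1024  # assumed 5 MB cap

# Magic-byte prefixes for a couple of common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
}

def validate_upload(data: bytes) -> str:
    """Reject oversized or non-image payloads before hashing or uploading."""
    if len(data) > MAX_SIZE:
        raise ValueError("file too large")
    for sig, kind in SIGNATURES.items():
        if data.startswith(sig):
            return kind
    raise ValueError("unsupported file type")
```

Validating first means malformed or oversized uploads are rejected cheaply, before the server spends CPU hashing them or bandwidth storing them.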
tradeoffs
{
"pros": [
"Prevents duplicate storage of identical files.",
"Reduces bandwidth usage by skipping duplicate uploads.",
"Improves storage efficiency.",
"Simple architecture with minimal infrastructure overhead."
],
"cons": [
"Hash computation adds minor CPU overhead during upload.",
"Does not detect visually similar images with different binary content.",
"Requires additional database lookup before upload."
]
}
why not store everything
- Duplicate files waste storage capacity.
- Bandwidth costs increase with repeated uploads.
- Storage quotas are exhausted faster.
lessons learned
- Small backend optimizations can significantly improve resource efficiency.
- Tracking media metadata separately simplifies storage management.
- Hash-based deduplication is a simple but powerful technique for media-heavy systems.
- Database constraints can enforce correctness in deduplication logic.
future improvements
- Introduce perceptual hashing to detect visually similar images.
- Add image compression before upload.
- Use CDN caching to further reduce bandwidth usage.
- Support video deduplication for large media files.
conclusion
Hash-based media deduplication ensures that identical files are stored only once while still allowing multiple posts to reference the same object. This approach significantly reduces storage usage and bandwidth consumption while keeping the system architecture simple and scalable.