The Problem
A backend management system needs to efficiently retrieve files from S3 based on filename patterns containing UIDs and dates. The challenge: no traditional database should be used for maintaining file metadata. The system requirements include:
- 10,000+ S3 buckets with ~20,000 files each (10MB per file)
- Files stored for 1 year (~20 files added daily per bucket)
- Search capabilities by UID or date range
- No database maintenance overhead
- Cost-effective solution
At first glance, this seems like a perfect use case for a traditional database with indexed metadata. However, the constraint of avoiding database maintenance led us to explore alternative architectures.
Investigation
Initial Approach: Direct S3 ListObjects
The first instinct was to use S3's native ListObjectsV2 API for file discovery:
// Naive approach - scanning entire buckets
const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const s3 = new AWS.S3();
const listAllFiles = async (bucketName) => {
  const params = {
    Bucket: bucketName,
    MaxKeys: 1000
  };
  let allFiles = [];
  let continuationToken = null;
  do {
    if (continuationToken) {
      params.ContinuationToken = continuationToken;
    }
    const result = await s3.listObjectsV2(params).promise();
    // Guard against a page that carries no Contents
    allFiles.push(...(result.Contents || []));
    continuationToken = result.NextContinuationToken;
  } while (continuationToken);
  return allFiles;
};
Performance Analysis
Testing this approach revealed several issues:
- Scalability Problems: With 20,000 files per bucket, listing requires ~20 API calls (1,000 files per page)
- Cost Concerns: Each ListObjects request incurs charges, making frequent searches expensive
- Response Time: Full bucket scans take several seconds, creating poor user experience
- Resource Intensive: No caching mechanism leads to repeated expensive operations
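The request counts above follow from simple arithmetic, sketched below. The price constant is an assumption (the published S3 Standard rate for LIST requests in some regions); check current pricing for your region before relying on the figure.

```javascript
// Assumption: $0.005 per 1,000 LIST requests; verify for your region.
const LIST_PRICE_PER_1000_USD = 0.005;
const PAGE_SIZE = 1000; // ListObjectsV2 returns at most 1,000 keys per call

// Number of ListObjectsV2 calls needed to scan one bucket
const listCallsPerScan = (fileCount, pageSize = PAGE_SIZE) =>
  Math.ceil(fileCount / pageSize);

// Estimated LIST cost in USD for a given number of full-bucket scans
const listCostUsd = (fileCount, scans) =>
  (listCallsPerScan(fileCount) * scans * LIST_PRICE_PER_1000_USD) / 1000;

console.log(listCallsPerScan(20000)); // 20
// One full scan of each of 10,000 buckets
console.log(listCostUsd(20000, 10000).toFixed(2)); // 1.00
```

Under that assumed price, a single pass over every bucket is cheap, but repeating full scans for every search multiplies both the request count and the latency.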
Root Cause
The core issue was treating S3 as a database when it is fundamentally an object storage service. S3's ListObjects API is designed for:
- Prefix-based filtering (directory-style navigation)
- Occasional listing operations
It is not designed for complex queries or frequent searches.
The mismatch between our search requirements and S3's capabilities created the performance bottleneck.
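To make the mismatch concrete, here is a small in-memory stand-in for the Prefix parameter. It is only an illustration: the keys are made up, and the date-then-UID layout is an assumed example of how keys might be organized.

```javascript
// In-memory stand-in for S3's Prefix parameter: a key matches only if it
// starts with the prefix. The keys below are made-up examples.
const keys = [
  '2024/01/05/user123_report.pdf',
  '2024/01/05/user456_invoice.pdf',
  '2024/01/06/user123_summary.pdf',
];

const listWithPrefix = (allKeys, prefix) =>
  allKeys.filter((key) => key.startsWith(prefix));

// A date is a leading part of the key, so a prefix can select it
console.log(listWithPrefix(keys, '2024/01/05/').length); // 2

// The UID sits after the date, so no single prefix selects "user123" across days
console.log(listWithPrefix(keys, 'user123').length); // 0
```

Whatever field comes first in the key can be filtered by prefix; every other field needs either a full scan or a separate index.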
Solution
Architecture Overview
We developed a serverless, database-free solution using AWS native services:
┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│   Next.js   │───▶│   AWS API    │───▶│   Lambda    │
│  Frontend   │    │   Gateway    │    │  Functions  │
└─────────────┘    └──────────────┘    └─────────────┘
                                             │
┌─────────────┐    ┌──────────────┐          │
│     S3      │◀───│    Index     │◀─────────┘
│ File Store  │    │    Files     │
└─────────────┘    └──────────────┘
                          │
                   ┌──────────────┐
                   │   Cognito    │
                   │     Auth     │
                   └──────────────┘
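In the API Gateway to Lambda hop of the diagram, search parameters arrive as query strings. The sketch below shows one way a Lambda function might validate them before touching the index. `parseSearchQuery` and its return shape are hypothetical; only the `queryStringParameters` field comes from the API Gateway proxy event format.

```javascript
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Hypothetical helper: validate search parameters from an API Gateway
// proxy event before they reach the search logic.
const parseSearchQuery = (event) => {
  // queryStringParameters is null when the request has no query string
  const q = event.queryStringParameters || {};
  const errors = [];
  if (!q.bucketName) errors.push('bucketName is required');
  for (const field of ['startDate', 'endDate']) {
    if (q[field] && !ISO_DATE.test(q[field])) {
      errors.push(`${field} must be YYYY-MM-DD`);
    }
  }
  // ISO dates compare correctly as plain strings
  if (q.startDate && q.endDate && q.startDate > q.endDate) {
    errors.push('startDate must not be after endDate');
  }
  if (errors.length > 0) return { ok: false, errors };
  return {
    ok: true,
    query: {
      bucketName: q.bucketName,
      uid: q.uid || null,
      startDate: q.startDate || null,
      endDate: q.endDate || null,
    },
  };
};
```

Rejecting malformed input at this point keeps the later filtering code free of date-parsing edge cases.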
1. Hierarchical File Organization
Implemented a strategic naming convention to leverage S3's prefix filtering:
// File naming pattern
const generateFilePath = (date, uid, filename) => {
  const year = date.getFullYear();
  const month = String(date.getMonth() + 1).padStart(2, '0');
  const day = String(date.getDate()).padStart(2, '0');
  return `${year}/${month}/${day}/${uid}_${filename}`;
};
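One detail worth noting: the pattern above reads the date with local-time getters, so the same upload instant can land under different day prefixes depending on the time zone of the machine that builds the key. A possible variant, sketched here as an assumption rather than what the system actually does, pins the key to UTC. `generateFilePathUtc` and the sample values are hypothetical.

```javascript
// Variant of the naming pattern that uses UTC getters, so the same instant
// maps to the same key regardless of the runtime's time zone.
const generateFilePathUtc = (date, uid, filename) => {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, '0');
  const day = String(date.getUTCDate()).padStart(2, '0');
  return `${year}/${month}/${day}/${uid}_${filename}`;
};

// 23:30 UTC on 5 January 2024; 'user123' and 'report.pdf' are made-up values
const uploadedAt = new Date(Date.UTC(2024, 0, 5, 23, 30));
console.log(generateFilePathUtc(uploadedAt, 'user123', 'report.pdf'));
// 2024/01/05/user123_report.pdf
```

Whichever convention is chosen, writers and searchers need to agree on it, otherwise date-range queries miss files uploaded near midnight.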
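// Efficient date-range queries using prefixes
// Day prefix (YYYY/MM/DD/) matching the naming pattern
const generateDatePrefix = (date) => {
  const pad = (n) => String(n).padStart(2, '0');
  return `${date.getFullYear()}/${pad(date.getMonth() + 1)}/${pad(date.getDate())}/`;
};
// Paginated listing restricted to a single prefix
const listObjectsWithPrefix = async (bucketName, prefix) => {
  const files = [];
  let token;
  do {
    const page = await s3.listObjectsV2({ Bucket: bucketName, Prefix: prefix, ContinuationToken: token }).promise();
    files.push(...(page.Contents || []));
    token = page.NextContinuationToken;
  } while (token);
  return files;
};
const searchByDateRange = async (bucketName, startDate, endDate) => {
  const searches = [];
  // Iterate over a copy so the caller's startDate is not mutated
  for (let date = new Date(startDate); date <= endDate; date.setDate(date.getDate() + 1)) {
    searches.push(listObjectsWithPrefix(bucketName, generateDatePrefix(date)));
  }
  const results = await Promise.all(searches);
  return results.flat();
};
2. Lambda-Based Index Generation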
Created a lightweight indexing system using AWS Lambda:
// Lambda function for index generation
exports.generateIndex = async (event) => {
  const bucketName = event.bucketName;
  const indexData = [];
  try {
    // List all objects in bucket (reuses the paginated listAllFiles helper shown earlier)
    const objects = await listAllFiles(bucketName);
    // Extract metadata from filenames; keys that do not match the pattern
    // (including the index file itself) are skipped
    objects.forEach(obj => {
      const parsed = parseFileName(obj.Key);
      if (parsed) {
        indexData.push({
          key: obj.Key,
          uid: parsed.uid,
          date: parsed.date,
          size: obj.Size,
          lastModified: obj.LastModified
        });
      }
    });
    // Sort by date for efficient range queries
    indexData.sort((a, b) => new Date(a.date) - new Date(b.date));
    // Store index file in S3
    const indexKey = `_index/${bucketName}_index.json`;
    await s3.putObject({
      Bucket: bucketName,
      Key: indexKey,
      Body: JSON.stringify(indexData),
      ContentType: 'application/json'
    }).promise();
    return {
      statusCode: 200,
      body: JSON.stringify({
        message: 'Index generated successfully',
        fileCount: indexData.length
      })
    };
  } catch (error) {
    console.error('Index generation failed:', error);
    throw error;
  }
};
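Because the index is written sorted by date, a date-range lookup does not have to scan every entry. The following is a minimal sketch of that idea using binary search. `boundary`, `sliceByDateRange`, and the sample entries are hypothetical, and it assumes dates are stored as 'YYYY-MM-DD' strings as in the index entries above.

```javascript
// First position in `index` whose date is >= target (or > target when
// `strict` is true). Dates are 'YYYY-MM-DD' strings, which sort
// chronologically under plain string comparison.
const boundary = (index, target, strict) => {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    const before = strict ? index[mid].date <= target : index[mid].date < target;
    if (before) lo = mid + 1;
    else hi = mid;
  }
  return lo;
};

// Entries with startDate <= date <= endDate, without scanning the whole index
const sliceByDateRange = (index, startDate, endDate) =>
  index.slice(boundary(index, startDate, false), boundary(index, endDate, true));

// Made-up index entries, already sorted by date
const sampleIndex = [
  { key: '2024/01/05/user123_a.pdf', uid: 'user123', date: '2024-01-05' },
  { key: '2024/01/06/user456_b.pdf', uid: 'user456', date: '2024-01-06' },
  { key: '2024/01/06/user123_c.pdf', uid: 'user123', date: '2024-01-06' },
  { key: '2024/01/09/user789_d.pdf', uid: 'user789', date: '2024-01-09' },
];
console.log(sliceByDateRange(sampleIndex, '2024-01-06', '2024-01-08').length); // 2
```

For an index of roughly 20,000 entries a linear filter is also fast, so this matters mainly if indexes grow or many searches share one loaded index.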
// Filename parsing utility
const parseFileName = (key) => {
  // Pattern: YYYY/MM/DD/UID_filename.ext
  // The UID ends at the first underscore, so UIDs must not contain '_'
  const match = key.match(/^(\d{4})\/(\d{2})\/(\d{2})\/([^_]+)_(.+)/);
  if (match) {
    const [, year, month, day, uid, filename] = match;
    return {
      uid,
      date: `${year}-${month}-${day}`,
      filename,
      fullPath: key
    };
  }
  return null;
};
3. Automated Index Maintenance
Set up event-driven index updates:
// CloudWatch Events (EventBridge) trigger for daily index refresh.
// With the SDK, ScheduleExpression and State are passed to putRule and Targets to putTargets.
const scheduleRule = {
  ScheduleExpression: 'cron(0 2 * * ? *)', // Daily at 02:00 UTC
  State: 'ENABLED',
  Targets: [
    {
      Id: 'IndexGeneratorTarget',
      Arn: indexGeneratorLambdaArn,
      Input: JSON.stringify({ bucketName: 'your-bucket-name' })
    }
  ]
};
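A daily rebuild leaves files uploaded since the last run out of the index. One way an update function could close that gap is to merge S3 ObjectCreated events into the existing index, sketched below. `entriesFromS3Event` and `mergeIntoIndex` are hypothetical names; the record fields (`Records[].s3.object.key`, `size`) follow the S3 event notification format, and the key pattern mirrors the naming convention used in this article.

```javascript
// Same key pattern the article uses: YYYY/MM/DD/UID_filename
const KEY_PATTERN = /^(\d{4})\/(\d{2})\/(\d{2})\/([^_]+)_(.+)$/;

// Turn the records of an S3 ObjectCreated event into index entries.
// Object keys arrive URL-encoded, with spaces encoded as '+'.
const entriesFromS3Event = (event) =>
  (event.Records || []).flatMap((record) => {
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    const match = key.match(KEY_PATTERN);
    if (!match) return []; // skip keys outside the convention, e.g. the index file
    const [, year, month, day, uid] = match;
    return [{ key, uid, date: `${year}-${month}-${day}`, size: record.s3.object.size }];
  });

// Merge new entries into an existing index, replacing duplicates by key
// and keeping the date ordering.
const mergeIntoIndex = (index, newEntries) => {
  const byKey = new Map(index.map((entry) => [entry.key, entry]));
  for (const entry of newEntries) byKey.set(entry.key, entry);
  return [...byKey.values()].sort((a, b) => a.date.localeCompare(b.date));
};
```

A real handler would read the index file, merge, and write it back. Concurrent uploads can race on that read-modify-write cycle, so the scheduled full rebuild remains useful as a periodic correction.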
// S3 event trigger for real-time updates (optional)
// Shape expected by putBucketNotificationConfiguration (the Bucket parameter is passed alongside)
const s3EventConfig = {
  NotificationConfiguration: {
    LambdaFunctionConfigurations: [
      {
        Events: ['s3:ObjectCreated:*'],
        LambdaFunctionArn: indexUpdateLambdaArn
      }
    ]
  }
};
4. Next.js Frontend with AWS Integration
Built a responsive management interface:
// pages/api/search.js - Next.js API route
// Assumes an S3 client `s3` (AWS SDK v2) is initialized in this module
export default async function handler(req, res) {
  const { bucketName, uid, startDate, endDate } = req.query;
  try {
    // Load index file
    const indexKey = `_index/${bucketName}_index.json`;
    const indexObject = await s3.getObject({
      Bucket: bucketName,
      Key: indexKey
    }).promise();
    const index = JSON.parse(indexObject.Body.toString());
    // Filter based on search criteria
    let filteredResults = index;
    if (uid) {
      // Substring match on the UID
      filteredResults = filteredResults.filter(item =>
        item.uid.includes(uid)
      );
    }
    if (startDate || endDate) {
      filteredResults = filteredResults.filter(item => {
        const itemDate = new Date(item.date);
        const start = startDate ? new Date(startDate) : new Date('1900-01-01');
        const end = endDate ? new Date(endDate) : new Date('2100-01-01');
        return itemDate >= start && itemDate <= end;
      });
    }
    res.status(200).json({
      results: filteredResults,
      total: filteredResults.length
    });
  } catch (error) {
    // Log the cause server-side; the client only sees a generic message
    console.error('Search failed:', error);
    res.status(500).json({ error: 'Search failed' });
  }
}
5. Authentication with Cognito
Integrated AWS Cognito for user management:
// lib/auth.js
import { CognitoIdentityProvider } from '@aws-sdk/client-cognito-identity-provider';
const cognito = new CognitoIdentityProvider({ region: 'us-west-2' });
export const authenticateUser = async (username, password) => {
  try {
    const response = await cognito.initiateAuth({
      AuthFlow: 'USER_PASSWORD_AUTH',
      ClientId: process.env.COGNITO_CLIENT_ID,
      AuthParameters: {
        USERNAME: username,
        PASSWORD: password,
      },
    });
    // Cognito returns a challenge instead of tokens in some cases (for example, a forced password change)
    if (!response.AuthenticationResult) {
      return { success: false, error: `Challenge required: ${response.ChallengeName}` };
    }
    return {
      success: true,
      accessToken: response.AuthenticationResult.AccessToken,
      refreshToken: response.AuthenticationResult.RefreshToken,
    };
  } catch (error) {
    return { success: false, error: error.message };
  }
};
6. Deployment on AWS Amplify
Configured automatic deployment:
# amplify.yml
version: 1
frontend:
  phases:
    preBuild:
      commands:
        - npm ci
    build:
      commands:
        - npm run build
  artifacts:
    baseDirectory: .next
    files:
      - '**/*'
  cache:
    paths:
      - node_modules/**/*
# Environment variables the build expects (configured in the Amplify console):
#   COGNITO_USER_POOL_ID
#   COGNITO_CLIENT_ID
#   AWS_REGION
Results and Performance
Before vs After Comparison
| Metric | Before (Direct S3) | After (Indexed) | Improvement |
|---|---|---|---|
| Search Time | 5-10 seconds | 200-500ms | 95% faster |
| API Calls per Search | 20+ | 1-2 | 90% reduction |
| Monthly Cost | $50-100 | $5-10 | ~90% savings |
| User Experience | Poor | Excellent | Dramatically improved |
Cost Analysis
- Lambda Costs: ~$2/month for daily index generation across all buckets
- S3 Costs: Minimal additional storage for index files (~1MB per bucket)
- Amplify Hosting: ~$1/month for static site hosting
- Cognito: Free tier covers typical usage
Total monthly cost: ~$5-10 vs $50-100+ with frequent direct S3 queries
