Makuhari Development Corporation
6 min read, 1110 words, last updated: 2025/8/10

The Problem

A backend management system needs to efficiently retrieve files from S3 based on filename patterns containing UIDs and dates. The challenge: no traditional database should be used for maintaining file metadata. The system requirements include:

  • 10,000+ S3 buckets with ~20,000 files each (10MB per file)
  • Files stored for 1 year (~20 files added daily per bucket)
  • Search capabilities by UID or date range
  • No database maintenance overhead
  • Cost-effective solution

At first glance, this seems like a perfect use case for a traditional database with indexed metadata. However, the constraint of avoiding database maintenance led us to explore alternative architectures.

Investigation

Initial Approach: Direct S3 ListObjects

The first instinct was to use S3's native ListObjectsV2 API for file discovery:

// Naive approach - scanning entire buckets
const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const s3 = new AWS.S3();

const listAllFiles = async (bucketName) => {
  const params = {
    Bucket: bucketName,
    MaxKeys: 1000 // S3 returns at most 1,000 keys per page
  };

  const allFiles = [];
  let continuationToken;

  do {
    if (continuationToken) {
      params.ContinuationToken = continuationToken;
    }

    const result = await s3.listObjectsV2(params).promise();
    // Guard in case Contents is absent (for example, an empty result)
    allFiles.push(...(result.Contents || []));
    continuationToken = result.NextContinuationToken;
  } while (continuationToken);

  return allFiles;
};

Performance Analysis

Testing this approach revealed several issues:

  1. Scalability Problems: With 20,000 files per bucket, listing requires ~20 API calls (1,000 files per page)
  2. Cost Concerns: Each ListObjects request incurs charges, making frequent searches expensive
  3. Response Time: Full bucket scans take several seconds, creating poor user experience
  4. Resource Intensive: No caching mechanism leads to repeated expensive operations
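
To put the first two points in numbers, the sketch below estimates how many ListObjectsV2 requests a full scan needs and what repeated scans add up to. The request price (USD 0.005 per 1,000 LIST requests) and the search volume are illustrative assumptions, not figures from our measurements; check current S3 pricing for your region.

```javascript
// Rough estimate of LIST request volume and cost for full-bucket scans.
// ASSUMPTION: listPricePer1000 is an illustrative price, not a quoted rate.
const PAGE_SIZE = 1000; // ListObjectsV2 returns at most 1,000 keys per request

// Number of paginated requests needed to list a bucket once.
const listCallsPerScan = (fileCount, pageSize = PAGE_SIZE) =>
  Math.max(1, Math.ceil(fileCount / pageSize));

// Total requests and cost for repeated full scans of one bucket.
const monthlyListCost = ({ fileCount, scansPerDay, listPricePer1000 = 0.005, days = 30 }) => {
  const requests = listCallsPerScan(fileCount) * scansPerDay * days;
  return { requests, costUsd: (requests / 1000) * listPricePer1000 };
};

// Hypothetical workload: 1,000 full scans per day against one 20,000-file bucket
const estimate = monthlyListCost({ fileCount: 20000, scansPerDay: 1000 });
console.log(listCallsPerScan(20000)); // 20
console.log(estimate.requests); // 600000
```

The request count grows linearly with both bucket size and search frequency, which is why scanning on every search does not hold up across thousands of buckets.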

Root Cause

The core issue was treating S3 as a database when it is fundamentally an object storage service. S3's ListObjects API is designed for:

  • Prefix-based filtering (directory-style navigation)
  • Occasional listing operations

It is not designed for complex queries or frequent searches.

The mismatch between our search requirements and S3's capabilities created the performance bottleneck.

Solution

Architecture Overview

We developed a serverless, database-free solution using AWS native services:

┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│   Next.js   │───▶│   AWS API    │───▶│   Lambda    │
│  Frontend   │    │   Gateway    │    │  Functions  │
└─────────────┘    └──────────────┘    └─────────────┘
                                              │
┌─────────────┐    ┌──────────────┐          │
│    S3       │◀───│    Index     │◀─────────┘
│  File Store │    │    Files     │
└─────────────┘    └──────────────┘
                          │
                   ┌──────────────┐
                   │   Cognito    │
                   │     Auth     │
                   └──────────────┘

1. Hierarchical File Organization

Implemented a strategic naming convention to leverage S3's prefix filtering:

// File naming pattern
const generateFilePath = (date, uid, filename) => {
  const year = date.getFullYear();
  const month = String(date.getMonth() + 1).padStart(2, '0');
  const day = String(date.getDate()).padStart(2, '0');
  
  return `${year}/${month}/${day}/${uid}_${filename}`;
};
 
// Efficient date-range queries using prefixes.
// generateDatePrefix and listObjectsWithPrefix are helpers not shown here.
const searchByDateRange = async (bucketName, startDate, endDate) => {
  const searches = [];

  // Generate one date prefix per day in the range.
  // Copy startDate so the loop does not mutate the caller's Date object.
  for (let date = new Date(startDate); date <= endDate; date.setDate(date.getDate() + 1)) {
    const prefix = generateDatePrefix(date);
    searches.push(listObjectsWithPrefix(bucketName, prefix));
  }

  const results = await Promise.all(searches);
  return results.flat();
};
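
searchByDateRange relies on two helpers, generateDatePrefix and listObjectsWithPrefix, that are not shown above. The following is a minimal sketch of what they might look like, not our production code. It assumes an AWS SDK v2 style client passed in as a third `s3` argument; that extra parameter is our addition so the helper can be exercised with a stub.

```javascript
// Sketch of the two helpers used by searchByDateRange.
// ASSUMPTION: `s3` is an AWS SDK v2 style client (listObjectsV2(...).promise()),
// passed in explicitly so the function can be tested with a stub.

// Builds the "YYYY/MM/DD/" prefix that matches generateFilePath's key layout.
const generateDatePrefix = (date) => {
  const year = date.getFullYear();
  const month = String(date.getMonth() + 1).padStart(2, '0');
  const day = String(date.getDate()).padStart(2, '0');
  return `${year}/${month}/${day}/`;
};

// Lists every object under a prefix, following continuation tokens.
const listObjectsWithPrefix = async (bucketName, prefix, s3) => {
  const objects = [];
  let continuationToken;

  do {
    const page = await s3
      .listObjectsV2({
        Bucket: bucketName,
        Prefix: prefix,
        ContinuationToken: continuationToken
      })
      .promise();

    objects.push(...(page.Contents || []));
    continuationToken = page.NextContinuationToken;
  } while (continuationToken);

  return objects;
};
```

With this layout, a one-week range issues seven prefixed list calls, each returning roughly a day's worth of files, instead of paging through the whole bucket.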

2. Lambda-Based Index Generation

Created a lightweight indexing system using AWS Lambda:

// Lambda function for index generation
exports.generateIndex = async (event) => {
  const bucketName = event.bucketName;
  const indexData = [];
  
  try {
    // List all objects in the bucket (a paginated listing helper, as in listAllFiles above)
    const objects = await listAllObjectsPaginated(bucketName);
    
    // Extract metadata from filenames
    objects.forEach(obj => {
      const parsed = parseFileName(obj.Key);
      if (parsed) {
        indexData.push({
          key: obj.Key,
          uid: parsed.uid,
          date: parsed.date,
          size: obj.Size,
          lastModified: obj.LastModified
        });
      }
    });
    
    // Sort by date for efficient range queries
    indexData.sort((a, b) => new Date(a.date) - new Date(b.date));
    
    // Store index file in S3
    const indexKey = `_index/${bucketName}_index.json`;
    await s3.putObject({
      Bucket: bucketName,
      Key: indexKey,
      Body: JSON.stringify(indexData),
      ContentType: 'application/json'
    }).promise();
    
    return {
      statusCode: 200,
      body: JSON.stringify({ 
        message: 'Index generated successfully',
        fileCount: indexData.length
      })
    };
    
  } catch (error) {
    console.error('Index generation failed:', error);
    throw error;
  }
};
 
// Filename parsing utility
const parseFileName = (key) => {
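  // Pattern: YYYY/MM/DD/UID_filename.ext (the UID itself must not contain "_" or "/")
  // Anchored so that only keys following the full naming convention are indexed
  const match = key.match(/^(\d{4})\/(\d{2})\/(\d{2})\/([^_\/]+)_(.+)$/);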
  
  if (match) {
    const [, year, month, day, uid, filename] = match;
    return {
      uid,
      date: `${year}-${month}-${day}`,
      filename,
      fullPath: key
    };
  }
  return null;
};

3. Automated Index Maintenance

Set up scheduled and event-driven index updates:

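// CloudWatch Events (EventBridge) rule for the daily index refresh.
// The rule and its targets are registered with separate calls:
// putRule(scheduleRule), then putTargets(scheduleTargets).
const scheduleRule = {
  Name: 'DailyIndexRefresh', // placeholder rule name
  ScheduleExpression: 'cron(0 2 * * ? *)', // Daily at 2 AM UTC
  State: 'ENABLED'
};

const scheduleTargets = {
  Rule: scheduleRule.Name,
  Targets: [
    {
      Id: 'IndexGeneratorTarget',
      Arn: indexGeneratorLambdaArn,
      Input: JSON.stringify({ bucketName: 'your-bucket-name' })
    }
  ]
};

// S3 event trigger for real-time updates (optional).
// Passed to s3.putBucketNotificationConfiguration.
const s3EventConfig = {
  Bucket: 'your-bucket-name',
  NotificationConfiguration: {
    LambdaFunctionConfigurations: [
      {
        Events: ['s3:ObjectCreated:*'],
        LambdaFunctionArn: indexUpdateLambdaArn
      }
    ]
  }
};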
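
The handler behind indexUpdateLambdaArn is not shown here. As a rough sketch of the incremental update it could perform, the pure function below folds the records of an S3 ObjectCreated event into an existing index array. The function name, and the choice to skip keys that do not match the naming pattern (such as the index file itself), are our assumptions; reading and writing the index object is left out.

```javascript
// Sketch: merge S3 "ObjectCreated" event records into an in-memory index.
// ASSUMPTIONS: applyS3EventToIndex is a hypothetical helper; index entries use
// the same shape as generateIndex ({ key, uid, date, size, lastModified }).
const KEY_PATTERN = /^(\d{4})\/(\d{2})\/(\d{2})\/([^_\/]+)_(.+)$/;

const applyS3EventToIndex = (index, event) => {
  // Key the entries by object key so a re-upload replaces its old entry
  const byKey = new Map(index.map((entry) => [entry.key, entry]));

  for (const record of event.Records || []) {
    // Object keys in S3 event notifications are URL-encoded, with spaces as "+"
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    const match = key.match(KEY_PATTERN);
    if (!match) continue; // not a data file, e.g. the _index/ object itself

    const [, year, month, day, uid] = match;
    byKey.set(key, {
      key,
      uid,
      date: `${year}-${month}-${day}`,
      size: record.s3.object.size,
      lastModified: record.eventTime
    });
  }

  // Keep the index sorted by date, as generateIndex does
  return [...byKey.values()].sort((a, b) => a.date.localeCompare(b.date));
};
```

Two invocations that read, modify, and rewrite the same index object at the same time can overwrite each other's changes, which is one reason to keep the daily full rebuild as a backstop.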

4. Next.js Frontend with AWS Integration

Built a responsive management interface:

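// pages/api/search.js - Next.js API route
import AWS from 'aws-sdk'; // AWS SDK for JavaScript v2

const s3 = new AWS.S3();

export default async function handler(req, res) {
  const { bucketName, uid, startDate, endDate } = req.query;

  // Reject requests that do not name a bucket before calling S3
  if (!bucketName) {
    return res.status(400).json({ error: 'bucketName is required' });
  }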
  
  try {
    // Load index file
    const indexKey = `_index/${bucketName}_index.json`;
    const indexObject = await s3.getObject({
      Bucket: bucketName,
      Key: indexKey
    }).promise();
    
    const index = JSON.parse(indexObject.Body.toString());
    
    // Filter based on search criteria
    let filteredResults = index;
    
    if (uid) {
      filteredResults = filteredResults.filter(item => 
        item.uid.includes(uid)
      );
    }
    
    if (startDate || endDate) {
      filteredResults = filteredResults.filter(item => {
        const itemDate = new Date(item.date);
        const start = startDate ? new Date(startDate) : new Date('1900-01-01');
        const end = endDate ? new Date(endDate) : new Date('2100-01-01');
        
        return itemDate >= start && itemDate <= end;
      });
    }
    
    res.status(200).json({
      results: filteredResults,
      total: filteredResults.length
    });
    
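  } catch (error) {
    // Log the underlying error server-side; return a generic message to the client
    console.error('Search failed:', error);
    res.status(500).json({ error: 'Search failed' });
  }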
}
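
Because generateIndex stores entries sorted by date, the date filter does not have to scan every entry. Below is a minimal sketch of a binary-search range lookup, not the code we deployed. It assumes dates are the zero-padded YYYY-MM-DD strings produced by parseFileName, which compare correctly as plain strings. The function names are ours.

```javascript
// Sketch: date-range lookup on an index that is sorted by `date` (ascending).
// ASSUMPTION: dates are zero-padded "YYYY-MM-DD" strings, so string comparison
// matches chronological order.

// Returns the first position whose date is >= target.
const lowerBound = (index, target) => {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (index[mid].date < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
};

// Returns the first position whose date is > target.
const upperBound = (index, target) => {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (index[mid].date <= target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
};

// Entries with startDate <= date <= endDate (both ends inclusive).
const sliceByDateRange = (index, startDate, endDate) =>
  index.slice(lowerBound(index, startDate), upperBound(index, endDate));

// Made-up sample entries for illustration
const sample = [
  { key: '2025/01/01/u1_a.bin', date: '2025-01-01' },
  { key: '2025/01/02/u2_b.bin', date: '2025-01-02' },
  { key: '2025/01/02/u1_c.bin', date: '2025-01-02' },
  { key: '2025/01/05/u3_d.bin', date: '2025-01-05' }
];
console.log(sliceByDateRange(sample, '2025-01-02', '2025-01-04').map((e) => e.key));
// [ '2025/01/02/u2_b.bin', '2025/01/02/u1_c.bin' ]
```

At around 20,000 entries per bucket a linear filter is already fast, so this mainly matters if the index grows well beyond that size.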

5. Authentication with Cognito

Integrated AWS Cognito for user management:

// lib/auth.js
import { CognitoIdentityProvider } from '@aws-sdk/client-cognito-identity-provider';
 
// Prefer the region from the environment; fall back to us-west-2
const cognito = new CognitoIdentityProvider({ region: process.env.AWS_REGION || 'us-west-2' });
 
export const authenticateUser = async (username, password) => {
  try {
    const response = await cognito.initiateAuth({
      AuthFlow: 'USER_PASSWORD_AUTH',
      ClientId: process.env.COGNITO_CLIENT_ID,
      AuthParameters: {
        USERNAME: username,
        PASSWORD: password,
      },
    });
    
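    // AuthenticationResult is absent when Cognito answers with a challenge
    // (for example NEW_PASSWORD_REQUIRED) instead of tokens
    if (!response.AuthenticationResult) {
      return { success: false, error: `Challenge required: ${response.ChallengeName}` };
    }

    return {
      success: true,
      accessToken: response.AuthenticationResult.AccessToken,
      refreshToken: response.AuthenticationResult.RefreshToken,
    };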
  } catch (error) {
    return { success: false, error: error.message };
  }
};

6. Deployment on AWS Amplify

Configured automatic deployment:

# amplify.yml
version: 1
frontend:
  phases:
    preBuild:
      commands:
        - npm ci
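    build:
      commands:
        # Copy variables set in the Amplify console into .env.production
        # so the Next.js server runtime can read them
        - env | grep -e COGNITO_USER_POOL_ID -e COGNITO_CLIENT_ID -e AWS_REGION >> .env.production
        - npm run build
  artifacts:
    baseDirectory: .next
    files:
      - '**/*'
  cache:
    paths:
      - node_modules/**/*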

Results and Performance

Before vs After Comparison

Metric                 Before (Direct S3)   After (Indexed)   Improvement
Search Time            5-10 seconds         200-500ms         95% faster
API Calls per Search   20+                  1-2               90% reduction
Monthly Cost           $50-100              $5-10             85% savings
User Experience        Poor                 Excellent         Dramatically improved

Cost Analysis

  • Lambda Costs: ~$2/month for daily index generation across all buckets
  • S3 Costs: Minimal additional storage for index files (~1MB per bucket)
  • Amplify Hosting: ~$1/month for static site hosting
  • Cognito: Free tier covers typical usage

Total monthly cost: ~$5-10 vs $50-100+ with frequent direct S3 queries
