A comparative analysis of application-level and system-level container runtimes for state-of-the-art data deduplication techniques

Abstract

Containerization has become a cornerstone of modern software deployment, offering lightweight isolation and rapid scalability across diverse environments. However, the growing variety of container runtimes introduces uncertainty regarding their behavior under data-intensive workloads such as deduplication, where computational efficiency and resource utilization directly affect scalability and responsiveness. To investigate this, we design a structured experimental pipeline that executes three hash-based deduplication algorithms (CRC32, MD5, and SHA-256) within three container runtimes: Docker, LXC, and Podman. Each algorithm is run ten times across datasets of 1M, 5M, and 10M records to ensure statistical consistency, generating over 3,700 performance samples consolidated into 180+ representative instances. Building on this pipeline, we develop a holistic scalability assessment framework that quantifies container efficiency through throughput trends, variability in CPU and memory usage, and collision rates, offering a comprehensive perspective on runtime behavior. Experimental findings show that Docker maintains balanced scalability with stable throughput growth through efficient daemon-managed scheduling, while LXC delivers superior computational efficiency under heavy workloads due to its direct kernel namespace access. Podman, though optimized for lightweight and security-focused tasks, demonstrates performance variability when scaled. Finally, we introduce a decision tree to assist in selecting optimal container–algorithm configurations tailored to workload requirements. This work establishes an empirical foundation for understanding container performance in deduplication contexts, providing actionable insights for building efficient and resilient cloud-native data processing infrastructures.
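
To make the abstract's notion of hash-based deduplication concrete, the following is a minimal Python sketch of the general technique, not the thesis's actual pipeline. The function names `digest` and `deduplicate`, the byte-string record format, and the definition of a collision (two records with the same fingerprint but different content) are all assumptions made for illustration.

```python
import hashlib
import zlib


def digest(record: bytes, algorithm: str) -> bytes:
    """Return the fingerprint of one record under the chosen hash (hypothetical helper)."""
    if algorithm == "crc32":
        # CRC32 yields a 32-bit checksum; pack it into 4 bytes.
        return zlib.crc32(record).to_bytes(4, "big")
    if algorithm == "md5":
        return hashlib.md5(record).digest()
    if algorithm == "sha256":
        return hashlib.sha256(record).digest()
    raise ValueError(f"unsupported algorithm: {algorithm}")


def deduplicate(records, algorithm):
    """Keep the first record seen for each fingerprint.

    Returns (unique_records, duplicates, collisions). A collision is counted
    (by assumption) when a record's fingerprint matches a stored record whose
    content differs; such a record is not kept, mirroring what a purely
    fingerprint-based deduplicator would do.
    """
    seen = {}  # fingerprint -> first record with that fingerprint
    unique = []
    duplicates = 0
    collisions = 0
    for record in records:
        key = digest(record, algorithm)
        first = seen.get(key)
        if first is None:
            seen[key] = record
            unique.append(record)
        elif first == record:
            duplicates += 1
        else:
            collisions += 1  # same fingerprint, different bytes
    return unique, duplicates, collisions
```

Calling `deduplicate(records, "crc32")` on a list of byte strings returns the retained records together with the duplicate and collision counts. The shorter the fingerprint, the more likely collisions become as the record count grows, which is one reason a collision rate is a natural metric alongside throughput.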

Description

Cataloged from PDF version of thesis.
Includes bibliographical references (pages 45-49).
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2025.

Type

Thesis