Building a Smart Offsite Sync and MD5 Verification Script with AzCopy

When you sync files to Azure Blob Storage, you need confidence that what arrives in the container matches what you sent. Azure's AzCopy tool syncs a local directory with Blob Storage efficiently, and an extra layer of MD5 verification on top confirms that each transferred file arrived intact.

In this post, I will walk you through how I built a smart sync-and-verify backup script using AzCopy. The script syncs files to Azure Blob Storage and verifies the integrity of changed files by comparing MD5 hashes between local and Azure-stored files.

Problem Statement

When syncing large directories of files to cloud storage, especially during backups, it is essential to verify that transferred files are identical to the local files by comparing their MD5 hashes.

Solution

By integrating AzCopy's sync functionality with a custom shell script, we can:

  1. Sync files from a local directory to Azure Blob Storage.
  2. Extract details of the transferred files from the AzCopy job logs.
  3. Perform MD5 verification to ensure the files are consistent on both ends.

Tools Used

  • AzCopy: A command-line tool that helps manage and transfer data to/from Azure Blob Storage.
  • Bash: The script is written in Bash and runs in a Linux environment.
  • Azure CLI: Used to retrieve MD5 hashes from Azure Blob Storage.

Shell Script Workflow

Step 1: Setting Up Variables

Before diving into the logic, we define key variables, including the local directory to be synced, Azure Storage account details, and the location for logs. Here's an example setup:

STORAGE_ACCOUNT="myStorageAccount"
STORAGE_CONTAINER="backup"
LOCAL_DIR="/media/backup"
LOG_FILE="/var/log/azure_backup_md5_check_$(date +%Y%m%d).log"

Make sure $AZURE_BACKUP_SAS_TOKEN and $AZURE_BACKUP_ACCOUNT_KEY are exported in the environment of the user running the script. AzCopy authenticates with the SAS token in Step 2, and the Azure CLI call in Step 4 uses the account key.
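
A small guard near the top of the script can stop the run early when a credential is missing. The following is a minimal sketch, assuming the two variable names used in this post; require_env is a made-up helper name, not part of AzCopy or the Azure CLI:

```shell
#!/usr/bin/env bash

# require_env is a hypothetical helper: it prints a message to stderr for
# every named variable that is unset or empty, and returns 1 if any were.
require_env() {
    local name missing=0
    for name in "$@"; do
        # ${!name} is Bash indirect expansion: the value of the variable
        # whose name is stored in $name.
        if [ -z "${!name:-}" ]; then
            echo "Missing required environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

One way to use it is to call require_env AZURE_BACKUP_SAS_TOKEN AZURE_BACKUP_ACCOUNT_KEY || exit 1 before the sync, so a missing credential is reported by name rather than surfacing later as an authentication error.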

Step 2: Syncing Files Using AzCopy

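The script uses the AzCopy sync command to upload files that are new or have changed in the local directory. The --put-md5 flag tells AzCopy to compute each file's MD5 hash and store it as the blob's Content-MD5 property, which Step 4 reads back, and --delete-destination=true removes blobs that no longer exist locally. Here's the command that performs the sync: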

sync_output=$(azcopy sync "$LOCAL_DIR" "https://$STORAGE_ACCOUNT.blob.core.windows.net/$STORAGE_CONTAINER?$AZURE_BACKUP_SAS_TOKEN" --delete-destination=true --put-md5)

After the sync, the script extracts the job ID from the AzCopy output:

job_id=$(echo "$sync_output" | awk '/Job/{print $2; exit}')

This job ID is crucial for locating the corresponding job log file.
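
Because the awk pattern takes the second field of the first line containing "Job", it is worth checking that the result looks like a job ID before using it to find a log file. Here is a sketch of that check, assuming job IDs are UUID-shaped; extract_job_id is a hypothetical helper, and the sample output is illustrative text rather than a capture from a real AzCopy run:

```shell
#!/usr/bin/env bash

# extract_job_id is a hypothetical helper. It reads AzCopy's console output
# on stdin, applies the same rule as the awk one-liner (second field of the
# first line containing "Job"), and prints the value only if it is UUID-shaped.
extract_job_id() {
    local candidate
    local uuid_re='^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
    candidate=$(awk '/Job/{print $2; exit}')
    if [[ "$candidate" =~ $uuid_re ]]; then
        printf '%s\n' "$candidate"
    else
        return 1
    fi
}

# Illustrative text only, not captured from a real AzCopy run.
sample_output='INFO: Scanning...
Job 3f2b8c1e-7a4d-4e6b-9c0f-1a2b3c4d5e6f has started'

extract_job_id <<<"$sample_output"
```

If the function returns non-zero, the script can log the raw output and stop instead of searching for a log file that does not exist.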

Step 3: Parsing the AzCopy Job Log

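The job log provides details of the files that were transferred. AzCopy writes one log per job, named after the job ID (by default under ~/.azcopy); the snippet below assumes $log_file already holds that path. We capture the lines that mark the start of a file transfer and extract the local source file path from each:

changed_files=()
while IFS= read -r line; do
    # Pull the quoted path that follows 'Source "'; skip lines without one
    local_file=$(grep -oP 'Source "\K[^"]+' <<<"$line") || continue
    changed_files+=("$local_file")
done < <(sudo grep "Starting transfer:" "$log_file")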

By using this method, we ensure we only consider the files that were actually transferred during the sync.
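
To see what the grep -oP pattern captures, it helps to run it against a single line. The sketch below uses a made-up log line in the shape the pattern expects (real AzCopy log lines may carry more fields), and extract_source_path is a hypothetical wrapper name. It requires GNU grep with -P support:

```shell
#!/usr/bin/env bash

# extract_source_path is a hypothetical helper wrapping the pattern from the
# loop: \K discards the matched prefix 'Source "', so only the text up to
# the next double quote is printed. Returns non-zero when nothing matches.
extract_source_path() {
    grep -oP 'Source "\K[^"]+' <<<"$1"
}

# Made-up line in the shape the pattern expects; real log lines may differ.
sample_line='INFO: Starting transfer: Source "/media/backup/photos/img 001.jpg" Destination "https://example.invalid/backup/photos/img%20001.jpg"'

extract_source_path "$sample_line"
# prints: /media/backup/photos/img 001.jpg
```

Because the match stops at the closing quote, paths containing spaces come through intact, which is why the loop reads each line with IFS= read -r and quotes every expansion.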

Step 4: MD5 Hash Verification

Once we have the list of changed files, we compare the MD5 hash of each file locally and on Azure Blob Storage. To get the MD5 hash of the local file, we use the md5sum command:

calculate_md5() {
    local file="$1"
    md5sum "$file" | awk '{print $1}'
}

For the Azure Blob Storage file, we query the ContentMD5 property using the Azure CLI:

get_blob_md5() {
    local blob_name="$1"
    az storage blob show --account-name "$STORAGE_ACCOUNT" --container-name "$STORAGE_CONTAINER" --name "$blob_name" \
    --query properties.contentSettings.contentMd5 --output tsv --account-key "$AZURE_BACKUP_ACCOUNT_KEY"
}

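# blob_name is the file's path relative to $LOCAL_DIR
azure_md5=$(get_blob_md5 "$blob_name")
# ContentMD5 is base64-encoded; convert it to hex to compare with md5sum output
azure_md5_hex=$(echo "$azure_md5" | base64 --decode | xxd -p)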

We then compare the two hashes and log any mismatches.
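
The comparison needs two small conversions: from a local path to a blob name, and from the base64 digest Azure returns to the hex string md5sum prints. The following is a rough sketch of both, with hypothetical helper names. It swaps xxd for od, on the assumption that od is more likely to be installed, and treats a blob with no stored MD5 as a mismatch:

```shell
#!/usr/bin/env bash

# Hypothetical helpers; these names are not from the original script.

# Strip the local root so "/media/backup/a/b.txt" becomes "a/b.txt".
# $1 = local file path, $2 = local root (trailing slash optional).
blob_name_for() {
    local local_file="$1" root="${2%/}"
    printf '%s\n' "${local_file#"$root"/}"
}

# Convert a base64-encoded digest to lowercase hex.
# od prints space-separated hex bytes; tr removes the spaces and newline.
base64_to_hex() {
    printf '%s' "$1" | base64 --decode | od -An -tx1 | tr -d ' \n'
}

# Succeed only when the blob has a stored MD5 and it equals the local digest.
# $1 = hex digest from md5sum, $2 = base64 digest from the blob properties.
md5_matches() {
    local local_hex="$1" azure_b64="$2"
    [ -n "$azure_b64" ] || return 1
    [ "$local_hex" = "$(base64_to_hex "$azure_b64")" ]
}
```

In the verification loop this could be called as md5_matches "$(calculate_md5 "$file")" "$(get_blob_md5 "$(blob_name_for "$file" "$LOCAL_DIR")")", appending $file to mismatched_files whenever it fails.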

Step 5: Logging and Summary

The script logs the entire process, ensuring transparency in file transfers and MD5 verifications. A summary is printed at the end, detailing which files matched and which had mismatches:

log_message "Summary:"
if [ ${#mismatched_files[@]} -eq 0 ]; then
    log_message "✅ All changed files matched successfully!"
else
    log_message "❌ The following changed files had MD5 mismatches:"
    for file in "${mismatched_files[@]}"; do
        log_message " - $file"
    done
fi
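
The summary relies on a log_message helper that is not shown above. One plausible shape for it, offered as an assumption rather than the script's actual definition, is a timestamped line written to both the terminal and $LOG_FILE. The fallback path under /tmp is also an assumption, used here so the sketch runs without root:

```shell
#!/usr/bin/env bash

# Assumed fallback; the post points LOG_FILE at /var/log, which normally
# requires root to write.
LOG_FILE="${LOG_FILE:-/tmp/azure_backup_md5_check.log}"

# Hypothetical implementation of log_message: prefix the message with a
# timestamp, print it, and append the same line to $LOG_FILE.
log_message() {
    printf '%s - %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" | tee -a "$LOG_FILE"
}
```

Using tee -a keeps earlier entries from the same day in place, which matches the date-stamped file name chosen in Step 1.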

Complete Script

Here’s the complete script after incorporating the sync, log parsing, and MD5 verification logic.

This shell script integrates AzCopy to sync files from a local directory to Azure Blob Storage, ensuring that only changed files are uploaded. After syncing, the script verifies file integrity by comparing the MD5 hashes of the uploaded files with their local counterparts. It parses the AzCopy job logs to identify transferred files and performs the MD5 comparison, logging any mismatches.

This project demonstrates how combining AzCopy's powerful sync functionality with Bash scripting can result in a robust solution for transferring and verifying file integrity in Azure Blob Storage. By leveraging job logs, we can accurately identify changed files and ensure that all transferred files match their local counterparts using MD5 verification.