Building a Smart Offsite Sync and MD5 Verification Script with AzCopy

When working with cloud storage, particularly when syncing files to Azure Blob Storage, ensuring file integrity is critical. Azure's AzCopy tool offers powerful features for syncing files between a local directory and Azure Blob Storage, and while it handles transfers efficiently, adding a layer of MD5-based verification confirms that each file arrived intact.

In this post, I will walk you through how I built a smart sync-and-verify backup script using AzCopy. The script syncs files to Azure Blob Storage and verifies the integrity of changed files by comparing MD5 hashes between local and Azure-stored files.

Problem Statement

When syncing large directories to cloud storage, especially during backups, a successful transfer status alone doesn't guarantee that the uploaded files are byte-for-byte identical to the local originals. We need an automated way to verify each transferred file by comparing its MD5 hash on both ends.

Solution

By integrating AzCopy's sync functionality with a custom shell script, we can:

  1. Sync files from a local directory to Azure Blob Storage.
  2. Extract details of the transferred files from the AzCopy job logs.
  3. Perform MD5 verification to ensure the files are consistent on both ends.

Tools Used

  • AzCopy: A command-line tool that helps manage and transfer data to/from Azure Blob Storage.
  • Bash: The script was built in Bash, enabling seamless integration with the Linux environment.
  • Azure CLI: Used to retrieve MD5 hashes from Azure Blob Storage.

Shell Script Workflow

Step 1: Setting Up Variables

Before diving into the logic, we define key variables, including the local directory to be synced, Azure Storage account details, and the location for logs. Here's an example setup:

STORAGE_ACCOUNT="myStorageAccount"
STORAGE_CONTAINER="backup"
LOCAL_DIR="/media/backup"
LOG_FILE="/var/log/azure_backup_md5_check_$(date +%Y%m%d).log"

Make sure to export AZURE_BACKUP_SAS_TOKEN and AZURE_BACKUP_ACCOUNT_KEY in the user's environment so the script can authenticate with Azure Blob Storage.
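For example, the credentials can be exported in the shell that will invoke the script (the values below are placeholders, not a real SAS token or account key):

```shell
# Placeholder values -- substitute your actual SAS token and storage account key.
export AZURE_BACKUP_SAS_TOKEN='sv=2022-11-02&ss=b&srt=co&sp=rwl&sig=REPLACE_ME'
export AZURE_BACKUP_ACCOUNT_KEY='REPLACE_WITH_BASE64_ACCOUNT_KEY=='
```

Adding these lines to the invoking user's shell profile (or a root-only environment file, given the SAS token is a secret) keeps them out of the script itself.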

Step 2: Syncing Files Using AzCopy

The script uses the AzCopy sync command to upload files that have changed or are new in the local directory. Here's the command that performs the sync:

sync_output=$(azcopy sync "$LOCAL_DIR" "https://$STORAGE_ACCOUNT.blob.core.windows.net/$STORAGE_CONTAINER?$AZURE_BACKUP_SAS_TOKEN" --delete-destination=true --put-md5)

After the sync, the script extracts the job ID from the AzCopy output:

job_id=$(echo "$sync_output" | awk '/Job/{print $2; exit}')

This job ID is crucial for locating the corresponding job log file.
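AzCopy writes each job's log under the invoking user's .azcopy directory; since this script runs as root, that resolves to /root/.azcopy. A quick sketch of deriving the path (the job ID shown is a made-up example, not real AzCopy output):

```shell
# Hypothetical job ID for illustration; azcopy prints the real one after "Job".
job_id="b5e2f7a1-1234-5678-9abc-def012345678"

# AzCopy stores per-job logs as <job_id>.log in the user's .azcopy directory.
log_file="/root/.azcopy/${job_id}.log"
echo "$log_file"
```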

Step 3: Parsing the AzCopy Job Log

The job log provides details of the files that were transferred. We need to capture lines that indicate the start of a file transfer and extract the local source file path:

while IFS= read -r line; do
    local_file=$(echo "$line" | grep -oP 'Source "\K[^"]+')
    changed_files+=("$local_file")
done < <(sudo grep "Starting transfer:" "$log_file")

This way, only files that were actually transferred during the sync are considered for verification.
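To see the extraction in isolation, here is the same grep pattern run against a mocked-up log line (the line format below is an approximation of AzCopy's log output, not captured verbatim):

```shell
# A mocked "Starting transfer" log line -- format approximated for illustration.
line='INFO: Starting transfer: Source "/media/backup/vol-0001.bsr" Destination "https://myStorageAccount.blob.core.windows.net/backup/vol-0001.bsr"'

# \K discards everything matched so far, so -o prints only the quoted source path.
local_file=$(echo "$line" | grep -oP 'Source "\K[^"]+')
echo "$local_file"   # -> /media/backup/vol-0001.bsr
```

Note that the -P (Perl-compatible regex) flag requires GNU grep.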

Step 4: MD5 Hash Verification

Once we have the list of changed files, we compare the MD5 hash of each file locally and on Azure Blob Storage. To get the MD5 hash of the local file, we use the md5sum command:

calculate_md5() {
    local file="$1"
    md5sum "$file" | awk '{print $1}'
}

For the Azure Blob Storage file, we query the ContentMD5 property using the Azure CLI:

get_blob_md5() {
    local blob_name="$1"
    az storage blob show --account-name "$STORAGE_ACCOUNT" --container-name "$STORAGE_CONTAINER" --name "$blob_name" \
    --query properties.contentSettings.contentMd5 --output tsv --account-key "$AZURE_BACKUP_ACCOUNT_KEY"
}

azure_md5=$(get_blob_md5 "$blob_name")
azure_md5_hex=$(echo "$azure_md5" | base64 --decode | xxd -p)

We then compare the two hashes and log any mismatches.
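The Base64-to-hex conversion can be sanity-checked with a known value: the MD5 of empty input is d41d8cd98f00b204e9800998ecf8427e, and Azure stores that digest Base64-encoded:

```shell
# Base64 encoding of the raw 16-byte MD5 digest of empty input,
# as Azure would report it in ContentMD5.
azure_md5='1B2M2Y8AsgTpgAmY7PhCfg=='

# Decode to raw bytes, then re-encode as a lowercase hex string.
azure_md5_hex=$(echo "$azure_md5" | base64 --decode | xxd -p)

# md5sum of /dev/null produces the same digest directly in hex.
local_md5=$(md5sum /dev/null | awk '{print $1}')

[ "$azure_md5_hex" = "$local_md5" ] && echo "match"   # -> match
```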

Step 5: Logging and Summary

The script logs the entire process, ensuring transparency in file transfers and MD5 verifications. A summary is printed at the end, detailing which files matched and which had mismatches:

log_message "Summary:"
if [ ${#mismatched_files[@]} -eq 0 ]; then
    log_message "✅ All changed files matched successfully!"
else
    log_message "❌ The following changed files had MD5 mismatches:"
    for file in "${mismatched_files[@]}"; do
        log_message " - $file"
    done
fi

Complete Script

Here’s the complete script after incorporating the sync, log parsing, and MD5 verification logic.

#!/bin/bash

# This script performs an offsite sync of the locally stored Bacula backup files and does an MD5
# checksum comparison between local files and their corresponding Azure Blob Storage counterparts.
# It ensures that the files have been correctly uploaded to Azure Blob Storage by comparing their MD5 hashes.

# Prerequisites:
# - Azure CLI and azcopy must be installed on the system.
# - The environment variable AZURE_BACKUP_ACCOUNT_KEY must be set with the Azure Storage account key.
# - The environment variable AZURE_BACKUP_SAS_TOKEN must be set with the SAS token for accessing the Azure Storage container.

# Variables:
# STORAGE_ACCOUNT: The name of the Azure Storage account.
# STORAGE_CONTAINER: The name of the Azure Storage container.
# LOCAL_DIR: The local directory containing the files to be uploaded and checked.
# LOG_FILE: The path to the log file where the script's output will be logged.

# Functions:
# log_message: Logs messages to both the console and the log file.
# calculate_md5: Calculates the MD5 hash of a local file.
# get_blob_md5: Retrieves the ContentMD5 property of a blob from Azure Blob Storage.
# get_uploaded_files_from_log: Parses the azcopy job log to get the list of uploaded files.

# Workflow:
# 1. Ensure that Azure CLI and azcopy are installed.
# 2. Start logging the process.
# 3. Sync local files with Azure Blob Storage using azcopy.
# 4. Parse the azcopy job log to get the list of uploaded files.
# 5. Loop through the uploaded files and compare their local MD5 hashes with the MD5 hashes of the corresponding blobs in Azure Blob Storage.
# 6. Log the results of the MD5 comparisons.
# 7. Print a summary of the comparison results.

# Exit Codes:
# 0: Success
# 1: Failure (e.g., missing prerequisites, errors during execution)

STORAGE_ACCOUNT="myStorageAccount"
STORAGE_CONTAINER="backup"
LOCAL_DIR="/media/backup"

LOG_FILE="/var/log/azure_backup_md5_check_$(date +%Y%m%d).log"

# Ensure Azure CLI and azcopy are installed
if ! command -v az &> /dev/null || ! command -v azcopy &> /dev/null; then
    echo "Azure CLI or azcopy could not be found. Please install them to continue."
    exit 1
fi

# Function to log messages to both console and log file
log_message() {
    local message="$1"
    echo "$message" | tee -a "$LOG_FILE"
}

# Function to calculate MD5 hash of a local file
calculate_md5() {
    local file="$1"
    md5sum "$file" | awk '{print $1}'
}

# Function to get ContentMD5 from Azure Blob Storage
get_blob_md5() {
    local blob_name="$1"
    az storage blob show --account-name "$STORAGE_ACCOUNT" --container-name "$STORAGE_CONTAINER" --name "$blob_name" \
    --query properties.contentSettings.contentMd5 --output tsv --account-key "$AZURE_BACKUP_ACCOUNT_KEY"
}

# Function to parse azcopy job log and get the list of uploaded files
get_uploaded_files_from_log() {
    local job_id="$1"
    local log_file="/root/.azcopy/${job_id}.log"

    if [ ! -f "$log_file" ]; then
        log_message "Error: Log file for job $job_id not found at $log_file."
        exit 1
    fi

    log_message "Parsing azcopy log $log_file for uploaded files..."

    # Use process substitution to read the grep output directly
    while IFS= read -r line; do
        # Extract the source local file path
        local_file=$(echo "$line" | grep -oP 'Source "\K[^"]+')
        changed_files+=("$local_file")
    done < <(sudo grep "Starting transfer:" "$log_file")

    # Output the changed_files array for verification
    for file in "${changed_files[@]}"; do
        echo "$file"
    done
}

# Array to track mismatched files
mismatched_files=()
changed_files=()

# Start logging
log_message "MD5 comparison started at $(date)"
log_message "Local directory: $LOCAL_DIR"
log_message "Storage account: $STORAGE_ACCOUNT"
log_message "-----------------------------------"

# Sync local files with Azure Blob Storage (using azcopy)
log_message "Starting azcopy sync between $LOCAL_DIR and Azure Blob Storage..."
sync_output=$(azcopy sync "$LOCAL_DIR" "https://$STORAGE_ACCOUNT.blob.core.windows.net/$STORAGE_CONTAINER?$AZURE_BACKUP_SAS_TOKEN" --delete-destination=true --put-md5)

# Extract the first job ID from the sync output
job_id=$(echo "$sync_output" | awk '/Job/{print $2; exit}')

if [ -z "$job_id" ]; then
    log_message "Error: Unable to extract the job ID from the azcopy sync output."
    exit 1
fi

log_message "azcopy sync completed. Job ID: $job_id"
log_message "-----------------------------------"

# Parse the azcopy job log to get the list of uploaded files
get_uploaded_files_from_log "$job_id"

log_message "Found ${#changed_files[@]} changed or new files."
log_message "-----------------------------------"

# Loop through all changed files for MD5 comparison
for local_file in "${changed_files[@]}"; do
    # Strip the LOCAL_DIR prefix to get the relative path
    relative_path="${local_file#$LOCAL_DIR/}"

    # The relative path will match the Azure blob path, so we can directly use it as the blob name
    blob_name=$(echo "$relative_path" | sed 's|\\|/|g')

    log_message "Comparing $local_file with Azure blob $blob_name..."

    # Calculate local MD5
    local_md5=$(calculate_md5 "$local_file")

    # Get the Azure Blob MD5
    azure_md5=$(get_blob_md5 "$blob_name")

    # Convert Azure's Base64-encoded MD5 to hexadecimal for comparison
    if [ -n "$azure_md5" ]; then
        azure_md5_hex=$(echo "$azure_md5" | base64 --decode | xxd -p)
    else
        log_message "Azure Blob for $blob_name not found or no MD5 available."
        log_message "-----------------------------------"
        continue
    fi

    # Compare the MD5 hashes
    if [[ "$local_md5" == "$azure_md5_hex" ]]; then
        log_message "✅ MD5 matches for $relative_path"
    else
        log_message "❌ MD5 mismatch for $relative_path"
        mismatched_files+=("$relative_path")
    fi

    log_message "Local MD5:  $local_md5"
    log_message "Azure MD5:  $azure_md5_hex"
    log_message "-----------------------------------"
done

# Print summary at the end
log_message "Summary:"
if [ ${#mismatched_files[@]} -eq 0 ]; then
    log_message "✅ All changed files matched successfully!"
else
    log_message "❌ The following changed files had MD5 mismatches:"
    for file in "${mismatched_files[@]}"; do
        log_message " - $file"
    done
fi

log_message "MD5 comparison finished at $(date)"

This project demonstrates how combining AzCopy's sync functionality with Bash scripting yields a robust solution for transferring and verifying files in Azure Blob Storage. By leveraging the job logs, we can accurately identify changed files and confirm, via MD5 verification, that every transferred file matches its local counterpart.