Python DSA for Data Engineering: Arrays, Binary Search & Strings

A comprehensive guide to essential data structures and algorithms for data engineers

Introduction

Arrays, binary search, and string manipulation are fundamental skills for data engineers. These concepts are crucial when working with log processing, data validation, ETL pipelines, and data transformation tasks.

Why These Matter for Data Engineers

  • Log Processing: Parsing and analyzing application logs
  • Data Validation: Checking data integrity and format
  • ETL Operations: Transforming and cleaning data
  • Performance: Efficient searching and scanning of large datasets

Core Concepts

1. Arrays and Lists

In Python, we primarily use lists (dynamic arrays). Time complexity for common operations:

  • Access by index: O(1)
  • Search: O(n)
  • Insertion: O(n) in general, amortized O(1) at the end
  • Deletion: O(n) in general, O(1) at the end

# Creating arrays/lists
data = [1, 2, 3, 4, 5]
mixed_data = [1, "hello", 3.14, True]

# Common operations
data.append(6)          # Amortized O(1) - add to end
data.insert(0, 0)       # O(n) - Insert at index
data.pop()              # O(1) - Remove from end
data.remove(2)          # O(n) - Remove specific value
len(data)               # O(1) - Get length
data[0]                 # O(1) - Access by index

2. Binary Search

Binary search is efficient for searching in sorted arrays with O(log n) time complexity:

def binary_search(arr, target):
    """
    Binary search implementation
    Works only on SORTED arrays
    
    Time: O(log n)
    Space: O(1)
    """
    left, right = 0, len(arr) - 1
    
    while left <= right:
        mid = left + (right - left) // 2  # Overflow-safe form by convention; Python ints never overflow
        
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    
    return -1  # Not found

# Usage
data = [1, 3, 5, 7, 9, 11, 13, 15]
index = binary_search(data, 7)  # Returns 3
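
Outside of interviews, Python's standard-library bisect module gives the same lower-bound search without hand-rolling the loop. A minimal sketch (binary_search_stdlib is an illustrative name):

from bisect import bisect_left

def binary_search_stdlib(arr, target):
    """Same O(log n) lookup via the standard library"""
    i = bisect_left(arr, target)  # Leftmost position where target could be inserted
    if i < len(arr) and arr[i] == target:
        return i
    return -1  # Not found

print(binary_search_stdlib([1, 3, 5, 7, 9, 11, 13, 15], 7))  # 3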

3. Two-Pointer Technique

Efficient pattern for array problems with O(n) time and O(1) space:

def two_pointer_example(arr, target_sum):
    """
    Find two numbers that sum to target
    Array must be sorted
    
    Time: O(n)
    Space: O(1)
    """
    left, right = 0, len(arr) - 1
    
    while left < right:
        current_sum = arr[left] + arr[right]
        
        if current_sum == target_sum:
            return [left, right]
        elif current_sum < target_sum:
            left += 1
        else:
            right -= 1
    
    return None
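
A quick usage example with the function above:

# Usage
data = [1, 3, 5, 7, 9]
print(two_pointer_example(data, 12))  # [1, 4] because 3 + 9 == 12
print(two_pointer_example(data, 2))   # None - no pair sums to 2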

Practice Questions

Question 1: Parse CSV Data (Easy)

Problem: Given a CSV string, parse it into a list of dictionaries.

def parse_csv(csv_data):
    """
    Parse CSV data into list of dictionaries
    
    Time Complexity: O(n * m) where n is rows, m is columns
    Space Complexity: O(n * m)
    
    Example:
    Input: "name,age,city\\nAlice,30,NYC\\nBob,25,LA"
    Output: [
        {"name": "Alice", "age": "30", "city": "NYC"},
        {"name": "Bob", "age": "25", "city": "LA"}
    ]
    """
    lines = csv_data.strip().split("\n")
    if not lines:
        return []
    
    # Get headers
    headers = lines[0].split(",")
    
    # Parse rows
    result = []
    for line in lines[1:]:
        values = line.split(",")
        row_dict = dict(zip(headers, values))  # Pair headers with values positionally
        result.append(row_dict)
    
    return result

# Test
csv_data = """name,age,city
Alice,30,NYC
Bob,25,LA"""

print(parse_csv(csv_data))

Explanation: This is a common pattern in ETL pipelines where we split by newlines to get rows, use the first row as headers, and create dictionaries for each subsequent row.
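
One caveat for production ETL: a plain split(",") breaks on quoted fields that contain embedded commas. The standard-library csv module handles quoting for you; a minimal sketch of the same parse with csv.DictReader (parse_csv_stdlib is an illustrative name):

import csv
import io

def parse_csv_stdlib(csv_data):
    """Parse CSV text via the standard library, which also
    handles quoted fields and embedded commas"""
    reader = csv.DictReader(io.StringIO(csv_data))
    return [dict(row) for row in reader]

print(parse_csv_stdlib("name,age,city\nAlice,30,NYC\nBob,25,LA"))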

Question 2: Find Duplicate Records (Easy-Medium)

Problem: Find duplicate records in a log file based on a unique identifier.

def find_duplicates(records, key_field):
    """
    Find duplicate records based on a key field
    
    Time Complexity: O(n)
    Space Complexity: O(n)
    """
    seen = {}
    duplicates = []
    
    for record in records:
        key_value = record.get(key_field)
        
        if key_value in seen:
            duplicates.append(record)
        else:
            seen[key_value] = record
    
    return duplicates

# Test
records = [
    {"id": "123", "timestamp": "2024-01-01", "event": "login"},
    {"id": "456", "timestamp": "2024-01-02", "event": "logout"},
    {"id": "123", "timestamp": "2024-01-03", "event": "login"}  # Duplicate
]

duplicates = find_duplicates(records, "id")
print(f"Found {len(duplicates)} duplicate(s)")

Explanation: Using a hash map (dictionary) provides O(1) lookup time, making this efficient for data quality checks.
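
A common variant in data-quality reports is to surface every record that shares a duplicated key, not only the second and later occurrences. One way to sketch that with collections.defaultdict (group_duplicates is an illustrative name):

from collections import defaultdict

def group_duplicates(records, key_field):
    """Group records by key, keeping only keys that occur more than once"""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(key_field)].append(record)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}

print(group_duplicates(records, "id"))  # {'123': [both login records]}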

Question 3: Search Sorted Logs by Timestamp (Medium)

Problem: Given sorted log entries by timestamp, find logs within a time range using binary search.

def binary_search_first(logs, target_time):
    """Find first occurrence of target_time or greater"""
    left, right = 0, len(logs) - 1
    result = len(logs)
    
    while left <= right:
        mid = left + (right - left) // 2
        
        if logs[mid]["timestamp"] >= target_time:
            result = mid
            right = mid - 1
        else:
            left = mid + 1
    
    return result

def binary_search_after(logs, target_time):
    """Find first index whose timestamp is strictly greater than target_time"""
    left, right = 0, len(logs) - 1
    result = len(logs)
    
    while left <= right:
        mid = left + (right - left) // 2
        
        if logs[mid]["timestamp"] > target_time:
            result = mid
            right = mid - 1
        else:
            left = mid + 1
    
    return result

def search_time_range(logs, start_time, end_time):
    """
    Search logs within an inclusive time range using binary search
    Assumes logs are sorted by timestamp
    
    Time Complexity: O(log n + k) where k is result size
    Space Complexity: O(k)
    """
    start_idx = binary_search_first(logs, start_time)  # First index >= start_time
    end_idx = binary_search_after(logs, end_time)      # First index > end_time
    return logs[start_idx:end_idx]

# Test
logs = [
    {"timestamp": "2024-01-01 09:00", "message": "System started"},
    {"timestamp": "2024-01-01 10:00", "message": "User login"},
    {"timestamp": "2024-01-01 11:00", "message": "Data processed"},
    {"timestamp": "2024-01-01 12:00", "message": "Backup completed"}
]

results = search_time_range(logs, "2024-01-01 10:00", "2024-01-01 12:00")
print(f"Found {len(results)} logs in range")

Explanation: Binary search works because the logs are sorted by timestamp, so a range query costs O(log n + k) instead of a full linear scan. That difference matters for large log files, and it is the same pattern used in real log-analysis systems. Note that the end boundary needs an upper-bound search (first entry strictly after end_time); reusing the lower-bound search there would wrongly include an entry past the range when end_time itself is absent.
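
On Python 3.10 and newer, the bisect module accepts a key= argument, so the same inclusive range query can be expressed without hand-written bounds. A minimal sketch (search_time_range_stdlib is an illustrative name):

from bisect import bisect_left, bisect_right

def search_time_range_stdlib(logs, start_time, end_time):
    """Inclusive time-range query via bisect (requires Python 3.10+ for key=)"""
    ts = lambda log: log["timestamp"]
    start_idx = bisect_left(logs, start_time, key=ts)  # First index >= start_time
    end_idx = bisect_right(logs, end_time, key=ts)     # First index > end_time
    return logs[start_idx:end_idx]

print(len(search_time_range_stdlib(logs, "2024-01-01 10:00", "2024-01-01 12:00")))  # 3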

Summary

Key Takeaways

  1. Arrays are fundamental for data storage and manipulation
  2. Binary Search provides O(log n) lookup in sorted data
  3. Two-Pointer technique solves many array problems efficiently
  4. String operations are crucial for log processing and ETL

When to Use These Patterns

  • Linear Scan: Unsorted data, need to check all elements
  • Binary Search: Sorted data, need fast lookup
  • Two-Pointer: Finding pairs, removing duplicates, partitioning
  • Sliding Window: Moving aggregations, substring problems (see the sketch below)
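
Sliding window was not covered in the questions above, so as a taste, here is a minimal sketch of a fixed-size moving sum, the kind of rolling aggregation that shows up in metrics pipelines (moving_sums is an illustrative name):

def moving_sums(values, window):
    """Sums over each window of the given size in O(n):
    add the entering element, subtract the leaving one"""
    if window <= 0 or window > len(values):
        return []
    current = sum(values[:window])
    sums = [current]
    for i in range(window, len(values)):
        current += values[i] - values[i - window]
        sums.append(current)
    return sums

print(moving_sums([1, 2, 3, 4, 5], 3))  # [6, 9, 12]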

📚 More Questions: This post covered three questions; see the full version for all six practice questions with complete solutions!


Tags: Python, DSA, Data Engineering, Arrays, Binary Search, Strings, ETL, Data Processing