Python DSA for Data Engineering: Hashing, Sets & Dictionaries

Master fast lookups, deduplication, and data grouping for efficient data processing

Introduction

Hashing, sets, and dictionaries provide O(1) average-case performance for lookups, insertions, and deletions, making them essential for processing large datasets efficiently.

Why These Matter for Data Engineers

  • Fast Lookups: O(1) access time for data retrieval
  • Deduplication: Efficiently remove duplicate records
  • Grouping: Aggregate data by keys
  • Counting: Track frequency of events, errors, or entities
  • Caching: Store computed results for quick access

Core Concepts

1. Dictionaries (Hash Tables)

Python's dict is a hash table implementation with O(1) average operations:

# Creating dictionaries
user_data = {"name": "Alice", "age": 30, "role": "Data Engineer"}

# Common operations
user_data["name"]           # O(1) - Access
user_data["city"] = "NYC"   # O(1) - Insert/Update
del user_data["age"]        # O(1) - Delete
"name" in user_data         # O(1) - Membership check
user_data.get("age", 25)    # O(1) - Safe access with default
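
In pipelines, a dictionary is often built once as a lookup table and then probed repeatedly during enrichment. A minimal sketch of that pattern (the record and field names here are illustrative, not from the examples above):

# Build a lookup table keyed by user id (field names are illustrative)
users = [
    {"id": "u1", "name": "Alice"},
    {"id": "u2", "name": "Bob"},
]
users_by_id = {u["id"]: u for u in users}  # O(n) to build once

# Each enrichment lookup is then O(1) on average
event = {"user_id": "u2", "event": "login"}
event["user_name"] = users_by_id.get(event["user_id"], {}).get("name", "unknown")
# event -> {'user_id': 'u2', 'event': 'login', 'user_name': 'Bob'}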

2. Sets

Sets are unordered collections of unique, hashable elements with O(1) average membership tests:

# Creating sets
unique_ids = {1, 2, 3, 4, 5}
from_list = set([1, 2, 2, 3, 3, 3])  # {1, 2, 3}

# Set operations
set_a = {1, 2, 3, 4}
set_b = {3, 4, 5, 6}

set_a | set_b  # Union: {1, 2, 3, 4, 5, 6}
set_a & set_b  # Intersection: {3, 4}
set_a - set_b  # Difference: {1, 2}
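
For data engineers, these same operators answer reconciliation questions such as "which IDs appear in today's extract but not yesterday's?". A short illustrative sketch (the ID values are made up):

# Illustrative reconciliation between two daily ID extracts
yesterday_ids = {"A001", "A002", "A003"}
today_ids = {"A002", "A003", "A004"}

new_ids = today_ids - yesterday_ids        # {'A004'}  appeared today
missing_ids = yesterday_ids - today_ids    # {'A001'}  gone since yesterday
still_present = today_ids & yesterday_ids  # {'A002', 'A003'}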

3. Counter (from collections)

from collections import Counter

# Count frequency
data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counter = Counter(data)
# Counter({'apple': 3, 'banana': 2, 'orange': 1})

# Most common
counter.most_common(2)  # [('apple', 3), ('banana', 2)]
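
Counter objects also support arithmetic, which is handy when counts arrive in separate batches and need to be merged. A brief example with illustrative data:

# Merging counts from two batches (illustrative data)
batch_1 = Counter({"login": 10, "error": 2})
batch_2 = Counter({"login": 7, "logout": 3})

combined = batch_1 + batch_2
# Counter({'login': 17, 'logout': 3, 'error': 2})
combined.most_common(1)  # [('login', 17)]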

4. defaultdict (from collections)

from collections import defaultdict

# Create with default factory
word_groups = defaultdict(list)
error_counts = defaultdict(int)

# No KeyError - automatically creates default value
word_groups['fruits'].append('apple')
error_counts['404'] += 1  # Starts from 0
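
One caveat worth knowing: merely reading a missing key from a defaultdict also creates it. A short sketch, continuing from the variables above (the '500' key is purely illustrative):

# Reading a missing key inserts the default value as a side effect
error_counts["500"]           # returns 0 - and '500' is now a key
dict(error_counts)            # {'404': 1, '500': 0}

# Convert back to a plain dict before returning or serializing
snapshot = dict(word_groups)  # {'fruits': ['apple']}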

Practice Questions

Question 1: Count Log Events by Type (Easy)

Problem: Count how many times each event type appears in logs.

def count_events(logs):
    """
    Count event types in logs
    
    Time Complexity: O(n)
    Space Complexity: O(k) where k is unique event types
    """
    event_counts = {}
    
    for log in logs:
        event_type = log.get("event")
        if event_type:
            event_counts[event_type] = event_counts.get(event_type, 0) + 1
    
    return event_counts

# Alternative: Using Counter
from collections import Counter

def count_events_v2(logs):
    return Counter(log["event"] for log in logs if log.get("event"))

# Test
logs = [
    {"event": "login", "user": "alice"},
    {"event": "logout", "user": "bob"},
    {"event": "login", "user": "charlie"},
    {"event": "error", "user": "alice"}
]

counts = count_events(logs)
print("Event counts:", counts)

Explanation: The hash map gives O(1) insert and lookup per log entry, so the whole pass is O(n); Counter is a cleaner alternative for the same job. This is a common pattern in log analysis.
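
Sorting the resulting counts is a typical last step when producing a report; a small follow-up on the `counts` dict from the test above:

# Sort event types by frequency, highest first
for event, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{event}: {count}")
# login: 2
# logout: 1
# error: 1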

Question 2: Deduplicate Records (Easy-Medium)

Problem: Remove duplicate records based on unique identifier.

def deduplicate_records(records, key_field):
    """
    Remove duplicate records keeping first occurrence
    
    Time Complexity: O(n)
    Space Complexity: O(n)
    """
    seen_keys = set()
    unique_records = []
    
    for record in records:
        key_value = record.get(key_field)
        
        if key_value not in seen_keys:
            seen_keys.add(key_value)
            unique_records.append(record)
    
    return unique_records

# Test
records = [
    {"id": "A001", "product": "Laptop", "price": 1200},
    {"id": "A002", "product": "Mouse", "price": 25},
    {"id": "A001", "product": "Laptop Duplicate", "price": 1250},
    {"id": "A003", "product": "Keyboard", "price": 75}
]

unique = deduplicate_records(records, "id")
print(f"Original: {len(records)} records")
print(f"After deduplication: {len(unique)} records")

Explanation: The set answers "have I seen this key?" in O(1) average time, and appending in input order preserves the first occurrence of each key. This is a common data-cleaning operation.
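
The same pattern extends to composite keys: when uniqueness is defined by several fields, a tuple of those field values can serve as the set member. A sketch of one way to do this (the helper name and example fields are illustrative):

def deduplicate_by_fields(records, key_fields):
    """Keep the first record per combination of key_fields (illustrative helper)."""
    seen = set()
    unique_records = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)  # tuples are hashable
        if key not in seen:
            seen.add(key)
            unique_records.append(record)
    return unique_records

# e.g. deduplicate_by_fields(records, ["id", "product"])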

Question 3: Group Sales by Region (Medium)

Problem: Group sales records by region and calculate totals.

from collections import defaultdict

def group_sales_by_region(sales):
    """
    Group sales by region and calculate totals
    
    Time Complexity: O(n)
    Space Complexity: O(k) where k is unique regions
    """
    grouped = defaultdict(lambda: {
        "total": 0,
        "count": 0,
        "sales": []
    })
    
    for sale in sales:
        region = sale.get("region", "Unknown")
        amount = sale.get("amount", 0)
        
        grouped[region]["total"] += amount
        grouped[region]["count"] += 1
        grouped[region]["sales"].append(sale)
    
    return dict(grouped)

# Test
sales_data = [
    {"region": "North", "amount": 1000, "product": "Laptop"},
    {"region": "South", "amount": 1500, "product": "Phone"},
    {"region": "North", "amount": 2000, "product": "Tablet"}
]

grouped_sales = group_sales_by_region(sales_data)
for region, data in grouped_sales.items():
    print(f"{region}: Total=${data['total']}, Count={data['count']}")

Explanation: defaultdict eliminates the need to check whether a key already exists, so the data is grouped in a single O(n) pass. This pattern shows up constantly in data warehousing and reporting.
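
Derived metrics such as averages then fall out of the grouped structure in a second, much smaller pass; a short follow-up on the `grouped_sales` result above:

# Average sale amount per region, derived from the grouped totals
for region, data in grouped_sales.items():
    avg = data["total"] / data["count"] if data["count"] else 0
    print(f"{region}: Average=${avg:.2f}")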

Summary

Key Takeaways

  1. Hash Tables (Dictionaries): O(1) average operations for key-value storage
  2. Sets: Fast membership testing and set operations
  3. Counter: Specialized for frequency counting
  4. defaultdict: Eliminates key existence checks

When to Use

Data Structure    Use Case
dict              Key-value mapping, lookup tables
set               Unique elements, membership testing
Counter           Counting occurrences
defaultdict       Grouping, automatic initialization

⚡ Performance Tip: Convert data to a set for deduplication when element order doesn't matter and the elements are hashable. For membership testing, prefer sets and dictionaries over lists: O(1) average lookups instead of O(n) scans.
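
A quick, informal way to see the gap is timeit; the exact numbers depend on the machine, but the difference between list and set membership grows with the size of the collection:

import timeit

data_list = list(range(100_000))
data_set = set(data_list)

# Membership test for an element near the end of the collection
list_time = timeit.timeit(lambda: 99_999 in data_list, number=1_000)
set_time = timeit.timeit(lambda: 99_999 in data_set, number=1_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")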


Tags: Python, DSA, Data Engineering, Hashing, Sets, Dictionaries, Hash Tables, Performance
