Python DSA for Data Engineering: Hashing, Sets & Dictionaries
Master fast lookups, deduplication, and data grouping for efficient data processing
Introduction
Hashing, sets, and dictionaries provide O(1) average-case performance for lookups, insertions, and deletions, making them essential for processing large datasets efficiently.
Why These Matter for Data Engineers
- Fast Lookups: O(1) average access time for data retrieval (see the timing sketch after this list)
- Deduplication: Efficiently remove duplicate records
- Grouping: Aggregate data by keys
- Counting: Track frequency of events, errors, or entities
- Caching: Store computed results for quick access
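To make the lookup claim concrete, here is a minimal timing sketch comparing O(n) list membership against O(1) set membership (the sizes and repetition counts are illustrative, not from this article):

```python
import timeit

items = list(range(100_000))
as_set = set(items)

# Membership near the end of the list: the list scans linearly,
# the set computes a hash and jumps straight to the bucket.
list_time = timeit.timeit(lambda: 99_999 in items, number=1_000)
set_time = timeit.timeit(lambda: 99_999 in as_set, number=1_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```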
Core Concepts
1. Dictionaries (Hash Tables)
Python's dict is a hash table implementation with O(1) average-case operations:
```python
# Creating a dictionary
user_data = {"name": "Alice", "age": 30, "role": "Data Engineer"}

# Common operations
user_data["name"]          # O(1) - Access
user_data["city"] = "NYC"  # O(1) - Insert/Update
del user_data["age"]       # O(1) - Delete
"name" in user_data        # O(1) - Membership check
user_data.get("age", 25)   # O(1) - Safe access with default
```
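Dictionaries also double as in-memory lookup tables. As a minimal sketch (the field names here are made up for illustration), a dict comprehension can index reference records by ID so each event row is enriched with a single O(1) lookup:

```python
# Hypothetical reference data and event stream
users = [
    {"user_id": 1, "name": "Alice"},
    {"user_id": 2, "name": "Bob"},
]
events = [{"user_id": 1, "event": "login"}, {"user_id": 2, "event": "logout"}]

# Index users by ID once, then enrich each event in O(1)
user_lookup = {u["user_id"]: u for u in users}
enriched = [
    {**e, "name": user_lookup[e["user_id"]]["name"]}
    for e in events
    if e["user_id"] in user_lookup  # skip events with no matching user
]
print(enriched)
```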
2. Sets
Sets are unordered collections of unique elements:
```python
# Creating sets
unique_ids = {1, 2, 3, 4, 5}
from_list = set([1, 2, 2, 3, 3, 3])  # {1, 2, 3}

# Set operations
set_a = {1, 2, 3, 4}
set_b = {3, 4, 5, 6}
set_a | set_b  # Union: {1, 2, 3, 4, 5, 6}
set_a & set_b  # Intersection: {3, 4}
set_a - set_b  # Difference: {1, 2}
```
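Set difference maps directly onto reconciliation tasks, such as checking which records made it from a source system to a target. A small sketch with made-up IDs:

```python
# Hypothetical IDs from a source table and its loaded copy
source_ids = {"A001", "A002", "A003", "A004"}
target_ids = {"A001", "A003"}

missing_from_target = source_ids - target_ids   # {'A002', 'A004'}
unexpected_in_target = target_ids - source_ids  # set()
print("Missing:", sorted(missing_from_target))
```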
3. Counter (from collections)
```python
from collections import Counter

# Count frequency
data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counter = Counter(data)
# Counter({'apple': 3, 'banana': 2, 'orange': 1})

# Most common
counter.most_common(2)  # [('apple', 3), ('banana', 2)]
```
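Counter also supports arithmetic, which comes in handy when merging counts from separate batches or log files:

```python
from collections import Counter

# Illustrative counts from two log batches
batch_1 = Counter({"login": 5, "error": 2})
batch_2 = Counter({"login": 3, "logout": 4})

combined = batch_1 + batch_2
# Counter({'login': 8, 'logout': 4, 'error': 2})
print(combined.most_common())
```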
4. defaultdict (from collections)
```python
from collections import defaultdict

# Create with a default factory
word_groups = defaultdict(list)
error_counts = defaultdict(int)

# No KeyError - missing keys get the factory's default value
word_groups['fruits'].append('apple')
error_counts['404'] += 1  # Starts from 0
```
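One caveat worth knowing: a defaultdict inserts the default on every indexed access, including reads, so a common habit is to convert back to a plain dict before returning results. A short sketch:

```python
from collections import defaultdict

index = defaultdict(set)  # user -> set of event types seen
for user, event in [("alice", "login"), ("alice", "error"), ("bob", "login")]:
    index[user].add(event)

# Plain dict: looking up an unknown user now raises KeyError
# instead of silently inserting an empty set.
result = dict(index)
print(result)
```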
Practice Questions
Question 1: Count Log Events by Type (Easy)
Problem: Count how many times each event type appears in logs.
```python
def count_events(logs):
    """
    Count event types in logs.
    Time Complexity: O(n)
    Space Complexity: O(k), where k is the number of unique event types
    """
    event_counts = {}
    for log in logs:
        event_type = log.get("event")
        if event_type:
            event_counts[event_type] = event_counts.get(event_type, 0) + 1
    return event_counts

# Alternative: using Counter
from collections import Counter

def count_events_v2(logs):
    return Counter(log["event"] for log in logs if "event" in log)

# Test
logs = [
    {"event": "login", "user": "alice"},
    {"event": "logout", "user": "bob"},
    {"event": "login", "user": "charlie"},
    {"event": "error", "user": "alice"},
]
counts = count_events(logs)
print("Event counts:", counts)
```
Explanation: A hash map gives O(1) insert and lookup per log entry, and Counter is a cleaner one-liner for the same job. This is a common pattern in log analysis.
Question 2: Deduplicate Records (Easy-Medium)
Problem: Remove duplicate records based on unique identifier.
```python
def deduplicate_records(records, key_field):
    """
    Remove duplicate records, keeping the first occurrence.
    Time Complexity: O(n)
    Space Complexity: O(n)
    """
    seen_keys = set()
    unique_records = []
    for record in records:
        key_value = record.get(key_field)
        if key_value not in seen_keys:
            seen_keys.add(key_value)
            unique_records.append(record)
    return unique_records

# Test
records = [
    {"id": "A001", "product": "Laptop", "price": 1200},
    {"id": "A002", "product": "Mouse", "price": 25},
    {"id": "A001", "product": "Laptop Duplicate", "price": 1250},
    {"id": "A003", "product": "Keyboard", "price": 75},
]
unique = deduplicate_records(records, "id")
print(f"Original: {len(records)} records")
print(f"After deduplication: {len(unique)} records")
```
Explanation: The set tracks seen keys with O(1) membership checks, and the output preserves the order of first occurrence. This is a staple data-cleaning operation.
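If the requirement were instead to keep the last occurrence (say, the newest version of each record), one sketch is to key a dict by the identifier and let later records overwrite earlier ones:

```python
def deduplicate_keep_last(records, key_field):
    """Keep the last occurrence of each key; later records overwrite earlier ones."""
    by_key = {}
    for record in records:
        by_key[record.get(key_field)] = record
    # Keys keep their first-insertion position, so overall order stays stable
    return list(by_key.values())
```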
Question 3: Group Sales by Region (Medium)
Problem: Group sales records by region and calculate totals.
```python
from collections import defaultdict

def group_sales_by_region(sales):
    """
    Group sales by region and calculate totals.
    Time Complexity: O(n)
    Space Complexity: O(n), since every sale is stored in its region's group
    """
    grouped = defaultdict(lambda: {"total": 0, "count": 0, "sales": []})
    for sale in sales:
        region = sale.get("region", "Unknown")
        amount = sale.get("amount", 0)
        grouped[region]["total"] += amount
        grouped[region]["count"] += 1
        grouped[region]["sales"].append(sale)
    return dict(grouped)

# Test
sales_data = [
    {"region": "North", "amount": 1000, "product": "Laptop"},
    {"region": "South", "amount": 1500, "product": "Phone"},
    {"region": "North", "amount": 2000, "product": "Tablet"},
]
grouped_sales = group_sales_by_region(sales_data)
for region, data in grouped_sales.items():
    print(f"{region}: Total=${data['total']}, Count={data['count']}")
```
Explanation: defaultdict removes the need to check whether a region key already exists, so the data is grouped in a single pass. This pattern shows up constantly in data warehousing and reporting.
Summary
Key Takeaways
- Hash Tables (Dictionaries): O(1) average operations for key-value storage
- Sets: Fast membership testing and set operations
- Counter: Specialized for frequency counting
- DefaultDict: Eliminates key existence checks
When to Use
| Data Structure | Use Case |
|---|---|
| `dict` | Key-value mapping, lookup tables |
| `set` | Unique elements, membership testing |
| `Counter` | Counting occurrences |
| `defaultdict` | Grouping, automatic initialization |
⚡ Performance Tip: Convert a list to a set before deduplication or repeated membership tests, keeping in mind that sets drop ordering and require hashable elements. Prefer dicts and sets over lists whenever you check membership inside a loop!
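A related idiom when order matters: `dict.fromkeys` deduplicates while preserving first-seen order, since dict keys are unique and insertion-ordered:

```python
items = ["b", "a", "b", "c", "a"]
deduped = list(dict.fromkeys(items))  # ['b', 'a', 'c']
```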
Tags: Python, DSA, Data Engineering, Hashing, Sets, Dictionaries, Hash Tables, Performance