Python DSA for Data Engineering: Arrays, Binary Search & Strings
A comprehensive guide to essential data structures and algorithms for data engineers
Introduction
Arrays, binary search, and string manipulation are fundamental skills for data engineers. These concepts are crucial when working with log processing, data validation, ETL pipelines, and data transformation tasks.
Why These Matter for Data Engineers
- Log Processing: Parsing and analyzing application logs
- Data Validation: Checking data integrity and format
- ETL Operations: Transforming and cleaning data
- Performance: Efficient searching and scanning of large datasets
Core Concepts
1. Arrays and Lists
In Python, we primarily use lists (dynamic arrays). Time complexity for common operations:
- Access: O(1)
- Search: O(n)
- Insertion: O(n) in the middle, amortized O(1) append at the end
- Deletion: O(n) in the middle, O(1) pop from the end
# Creating arrays/lists
data = [1, 2, 3, 4, 5]
mixed_data = [1, "hello", 3.14, True]
# Common operations
data.append(6) # Amortized O(1) - Add to end
data.insert(0, 0) # O(n) - Insert at index
data.pop() # O(1) - Remove from end
data.remove(2) # O(n) - Remove specific value
len(data) # O(1) - Get length
data[0] # O(1) - Access by index
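If a pipeline needs fast inserts and removals at the front of a sequence (for example a processing queue), collections.deque avoids the O(n) cost of list.insert(0, ...) and list.pop(0); a minimal sketch with illustrative values:

from collections import deque

queue = deque([1, 2, 3, 4, 5])
queue.appendleft(0)      # O(1), versus O(n) for list.insert(0, 0)
queue.append(6)          # O(1), same as list.append
first = queue.popleft()  # O(1), versus O(n) for list.pop(0)
print(list(queue))       # [1, 2, 3, 4, 5, 6]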
2. Binary Search
Binary search is efficient for searching in sorted arrays with O(log n) time complexity:
def binary_search(arr, target):
    """
    Binary search implementation
    Works only on SORTED arrays
    Time: O(log n)
    Space: O(1)
    """
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = left + (right - left) // 2  # Midpoint (this form avoids overflow in fixed-width integer languages)
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1  # Not found
# Usage
data = [1, 3, 5, 7, 9, 11, 13, 15]
index = binary_search(data, 7) # Returns 3
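The standard library's bisect module implements the same search; a quick sanity check against the hand-written version, reusing the sorted data list above:

import bisect

i = bisect.bisect_left(data, 7)            # Leftmost position where 7 could be inserted
print(i, i < len(data) and data[i] == 7)   # 3 True, matching binary_search above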
3. Two-Pointer Technique
Efficient pattern for array problems with O(n) time and O(1) space:
def two_pointer_example(arr, target_sum):
    """
    Find two numbers that sum to target
    Array must be sorted
    Time: O(n)
    Space: O(1)
    """
    left, right = 0, len(arr) - 1
    while left < right:
        current_sum = arr[left] + arr[right]
        if current_sum == target_sum:
            return [left, right]
        elif current_sum < target_sum:
            left += 1
        else:
            right -= 1
    return None
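A quick usage example with an illustrative sorted list and target:

# Usage
data = [1, 3, 5, 7, 9, 11]
pair = two_pointer_example(data, 16)  # Returns [2, 5] because data[2] + data[5] == 5 + 11 == 16
print(pair)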
Practice Questions
Question 1: Parse CSV Data (Easy)
Problem: Given a CSV string, parse it into a list of dictionaries.
def parse_csv(csv_data):
    """
    Parse CSV data into list of dictionaries
    Time Complexity: O(n * m) where n is rows, m is columns
    Space Complexity: O(n * m)
    Example:
        Input: "name,age,city\nAlice,30,NYC\nBob,25,LA"
        Output: [
            {"name": "Alice", "age": "30", "city": "NYC"},
            {"name": "Bob", "age": "25", "city": "LA"}
        ]
    """
    if not csv_data.strip():
        return []
    lines = csv_data.strip().split("\n")
    # First row holds the column headers
    headers = lines[0].split(",")
    # Each remaining row becomes one dictionary
    result = []
    for line in lines[1:]:
        values = line.split(",")
        row_dict = {headers[i]: values[i] for i in range(len(headers))}
        result.append(row_dict)
    return result
# Test
csv_data = """name,age,city
Alice,30,NYC
Bob,25,LA"""
print(parse_csv(csv_data))
Explanation: This is a common pattern in ETL pipelines where we split by newlines to get rows, use the first row as headers, and create dictionaries for each subsequent row.
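For production data, a plain split breaks on quoted fields and embedded commas; the standard-library csv module handles those cases. Below is a minimal sketch of the same parse using csv.DictReader (the parse_csv_stdlib name is just for illustration), reusing the csv_data test string above:

import csv
import io

def parse_csv_stdlib(csv_data):
    """Same output shape as parse_csv, but delegates parsing to csv.DictReader"""
    reader = csv.DictReader(io.StringIO(csv_data.strip()))
    return [dict(row) for row in reader]

print(parse_csv_stdlib(csv_data))  # Same result as parse_csv above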
Question 2: Find Duplicate Records (Easy-Medium)
Problem: Find duplicate records in a log file based on a unique identifier.
def find_duplicates(records, key_field):
    """
    Find duplicate records based on a key field
    Time Complexity: O(n)
    Space Complexity: O(n)
    """
    seen = {}
    duplicates = []
    for record in records:
        key_value = record.get(key_field)
        if key_value in seen:
            duplicates.append(record)
        else:
            seen[key_value] = record
    return duplicates
# Test
records = [
    {"id": "123", "timestamp": "2024-01-01", "event": "login"},
    {"id": "456", "timestamp": "2024-01-02", "event": "logout"},
    {"id": "123", "timestamp": "2024-01-03", "event": "login"}  # Duplicate
]
duplicates = find_duplicates(records, "id")
print(f"Found {len(duplicates)} duplicate(s)")
Explanation: Using a hash map (dictionary) provides O(1) lookup time, making this efficient for data quality checks.
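If you only need the duplicated keys rather than the full records, collections.Counter gives a compact variant; a short sketch reusing the records list from the test above:

from collections import Counter

key_counts = Counter(r.get("id") for r in records)
duplicate_keys = [k for k, count in key_counts.items() if count > 1]
print(duplicate_keys)  # ['123']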
Question 3: Search Sorted Logs by Timestamp (Medium)
Problem: Given sorted log entries by timestamp, find logs within a time range using binary search.
def binary_search_first(logs, target_time):
    """Find index of the first log with timestamp >= target_time"""
    left, right = 0, len(logs) - 1
    result = len(logs)
    while left <= right:
        mid = left + (right - left) // 2
        if logs[mid]["timestamp"] >= target_time:
            result = mid
            right = mid - 1
        else:
            left = mid + 1
    return result

def search_time_range(logs, start_time, end_time):
    """
    Search logs within an inclusive time range using binary search
    Assumes logs are sorted by timestamp
    Time Complexity: O(log n + k) where k is result size
    Space Complexity: O(k)
    """
    start_idx = binary_search_first(logs, start_time)
    end_idx = binary_search_first(logs, end_time)
    # Include the boundary log only if its timestamp equals end_time exactly
    if end_idx < len(logs) and logs[end_idx]["timestamp"] == end_time:
        end_idx += 1
    return logs[start_idx:end_idx]
# Test
logs = [
    {"timestamp": "2024-01-01 09:00", "message": "System started"},
    {"timestamp": "2024-01-01 10:00", "message": "User login"},
    {"timestamp": "2024-01-01 11:00", "message": "Data processed"},
    {"timestamp": "2024-01-01 12:00", "message": "Backup completed"}
]
results = search_time_range(logs, "2024-01-01 10:00", "2024-01-01 12:00")
print(f"Found {len(results)} logs in range")
Explanation: Binary search works here because the logs are already sorted by timestamp, so a range query costs O(log n + k) instead of a full linear scan over the file. This is a common pattern in log analysis systems.
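On Python 3.10 and newer, bisect accepts a key argument, so the same inclusive range query can lean on the standard library; a sketch reusing the logs list above:

from bisect import bisect_left, bisect_right

# key= requires Python 3.10+
start_idx = bisect_left(logs, "2024-01-01 10:00", key=lambda log: log["timestamp"])
end_idx = bisect_right(logs, "2024-01-01 12:00", key=lambda log: log["timestamp"])
print(len(logs[start_idx:end_idx]))  # 3, same entries as search_time_range above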
Summary
Key Takeaways
- Arrays are fundamental for data storage and manipulation
- Binary Search provides O(log n) lookup in sorted data
- Two-Pointer technique solves many array problems efficiently
- String operations are crucial for log processing and ETL
When to Use These Patterns
- Linear Scan: Unsorted data, need to check all elements
- Binary Search: Sorted data, need fast lookup
- Two-Pointer: Finding pairs, removing duplicates, partitioning
- Sliding Window: Moving aggregations, substring problems (see the sketch below)
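The sliding-window pattern listed above isn't covered by the practice questions; here is a minimal sketch of a fixed-size moving average (the values and window size are illustrative):

def moving_average(values, window_size):
    """
    Fixed-size sliding window: O(n) time, O(1) extra space
    Maintains a running sum instead of re-summing each window
    """
    if window_size <= 0 or window_size > len(values):
        return []
    window_sum = sum(values[:window_size])
    averages = [window_sum / window_size]
    for i in range(window_size, len(values)):
        window_sum += values[i] - values[i - window_size]  # Slide: add the new value, drop the old one
        averages.append(window_sum / window_size)
    return averages

print(moving_average([10, 20, 30, 40, 50], 3))  # [20.0, 30.0, 40.0]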
📚 More Questions: This post covered 3 questions. See the full comprehensive version for 6 total practice questions with complete solutions!
Tags: Python, DSA, Data Engineering, Arrays, Binary Search, Strings, ETL, Data Processing