# Arrays and Hashing
Arrays and hash tables are the two most fundamental data structures in programming. This file covers how they work under the hood, then builds up the key problem-solving patterns (two pointers, sliding window, prefix sums, and hash-based lookups) through progressively harder problems, with common pitfalls at each step.
- If you understand arrays and hash maps deeply, you can solve ~40% of all coding interview problems. These two structures appear everywhere because they offer the two things algorithms need most: fast indexed access (arrays) and fast lookup by key (hash maps).
- This file teaches patterns, not solutions. The goal is that when you see a new problem, you recognise which pattern applies and why, rather than trying to recall a memorised solution.
## Arrays
- An array is a contiguous block of memory where elements are stored at fixed offsets. Accessing element \(i\) costs \(O(1)\) because the address is simply `base + i * element_size`. This is the fastest possible data access, and it is why arrays are the default choice.
- Dynamic arrays (Python's `list`, Java's `ArrayList`, C++'s `vector`) grow automatically when full. The strategy is amortised doubling: when the array is full, allocate a new array of double the size and copy everything over. The copy costs \(O(n)\), but it happens so rarely (roughly once every \(n\) insertions) that the amortised cost per insertion is \(O(1)\).
- Cache locality is why arrays are fast in practice, not just in theory. Because elements are stored contiguously, accessing one element loads nearby elements into the CPU cache (chapter 13). Iterating through an array is cache-friendly; following pointers in a linked list is not. This constant-factor difference can be 10-100x in practice.
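The doubling strategy can be sketched with a toy dynamic array. This is a simplified model for illustration, not how real implementations work internally (CPython's `list`, for instance, uses a tuned growth factor rather than exact doubling):

```python
class DynamicArray:
    """Toy dynamic array illustrating amortised doubling."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.data = [None] * self.capacity

    def append(self, value):
        if self.size == self.capacity:
            # full: allocate double the capacity and copy everything (O(n), but rare)
            self.capacity *= 2
            new_data = [None] * self.capacity
            for i in range(self.size):
                new_data[i] = self.data[i]
            self.data = new_data
        self.data[self.size] = value
        self.size += 1


arr = DynamicArray()
for x in range(10):
    arr.append(x)
# copies happened only when size hit 1, 2, 4, 8 -> capacity is now 16
assert arr.capacity == 16 and arr.size == 10
```

Ten appends triggered only four copies; that is the amortisation in action.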
| Operation | Array | Dynamic Array |
|---|---|---|
| Access by index | \(O(1)\) | \(O(1)\) |
| Append | n/a | \(O(1)\) amortised |
| Insert at position \(i\) | \(O(n)\) | \(O(n)\) |
| Delete at position \(i\) | \(O(n)\) | \(O(n)\) |
| Search (unsorted) | \(O(n)\) | \(O(n)\) |
- Pitfall: inserting or deleting in the middle of an array is \(O(n)\) because all subsequent elements must be shifted. If you need frequent middle insertions, consider a linked list or a different approach entirely.
## Strings
- A string is an array of characters. In Python, strings are immutable: every concatenation creates a new string. Building a string character by character in a loop is \(O(n^2)\) because each concatenation copies the entire string so far.
```python
# BAD: O(n^2) string concatenation
s = ""
for c in characters:
    s += c  # copies the entire string each time

# GOOD: O(n) using a list then join
parts = []
for c in characters:
    parts.append(c)
s = "".join(parts)
```
- Pitfall: in Python, `s += c` inside a loop is one of the most common performance bugs. Always collect into a list and `"".join()` it.
- Encoding: ASCII uses 7 bits (128 characters). UTF-8 is variable-length: ASCII characters use 1 byte, accented Latin characters use 2, most Chinese/Japanese characters use 3, emojis use 4. When a problem says "lowercase English letters," the alphabet size is 26, which means you can use a fixed-size array instead of a hash map.
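As a sketch of the fixed-size-array trick (assuming input restricted to lowercase ASCII letters):

```python
def char_counts(s):
    """Count lowercase letters with a fixed 26-slot array instead of a dict."""
    counts = [0] * 26
    for c in s:
        counts[ord(c) - ord('a')] += 1  # map 'a'..'z' to indexes 0..25
    return counts


counts = char_counts("banana")
# index 0 is 'a', index 1 is 'b', index 13 is 'n'
assert counts[0] == 3 and counts[1] == 1 and counts[13] == 2
```

The array avoids hashing entirely and makes the key space explicit, which is why interview problems so often state the alphabet up front.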
## Hash Tables
- A hash table maps keys to values with \(O(1)\) average-case lookup, insertion, and deletion. It works by computing a hash function \(h(\text{key})\) that converts the key to an array index.
- The hash function must be deterministic (the same key always gives the same hash), uniform (distributes keys evenly across buckets), and fast to compute.
- Collisions occur when two different keys hash to the same index. There are two main strategies:
- Chaining: each bucket stores a linked list of key-value pairs. On collision, append to the list. Worst case (all keys hash to the same bucket): \(O(n)\). Average case with a good hash function: \(O(1)\).
- Open addressing: on collision, probe for the next empty slot. Linear probing checks the next slot, then the next, and so on. It is cache-friendly but suffers from clustering (long runs of occupied slots). Robin Hood hashing reduces the variance by displacing entries that are "closer to home."
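A minimal open-addressing table with linear probing might look like this. It is a sketch only: it supports insert and lookup, and omits deletion and resizing, both of which need extra machinery (tombstones, rehashing):

```python
class LinearProbingMap:
    """Minimal open-addressing hash map with linear probing (no delete/resize)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def _slot(self, key):
        i = hash(key) % self.capacity
        # probe forward until we find the key or an empty slot
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.capacity
        return i

    def put(self, key, value):
        i = self._slot(key)
        self.keys[i] = key
        self.values[i] = value

    def get(self, key):
        i = self._slot(key)
        return self.values[i] if self.keys[i] == key else None


m = LinearProbingMap()
m.put("a", 1)
m.put("b", 2)
m.put("a", 3)  # overwrites the existing entry for "a"
assert m.get("a") == 3 and m.get("b") == 2 and m.get("c") is None
```

Note how a lookup for an absent key stops at the first empty slot it probes; this is exactly why deletion needs tombstones rather than simply clearing a slot.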
- The load factor \(\alpha = n / m\) (items / buckets) determines performance. When \(\alpha\) exceeds a threshold (typically 0.75), the table rehashes: allocate a larger table and re-insert all elements. This costs \(O(n)\) but happens infrequently.
- Hash maps (`dict` in Python, `HashMap` in Java) store key-value pairs. Hash sets (`set` in Python, `HashSet` in Java) store only keys, and are used for fast membership testing.
| Operation | Average | Worst Case |
|---|---|---|
| Lookup | \(O(1)\) | \(O(n)\) |
| Insert | \(O(1)\) | \(O(n)\) |
| Delete | \(O(1)\) | \(O(n)\) |
- Bloom filters are space-efficient probabilistic sets. They can tell you "definitely not in the set" or "probably in the set" (with a tunable false-positive rate). They use \(k\) hash functions and a bit array. They are used in databases (to avoid disk reads for absent keys), web caches, and spell checkers.
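A toy Bloom filter makes the mechanics concrete. Deriving the \(k\) index functions from two base hashes is one common scheme (the Kirsch-Mitzenmacher trick); real implementations choose \(m\) and \(k\) from the desired false-positive rate:

```python
class BloomFilter:
    """Toy Bloom filter: k derived hash functions over an m-bit array."""

    def __init__(self, m=1024, k=5):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _indexes(self, item):
        # derive k indexes from two base hashes (Kirsch-Mitzenmacher)
        h1 = hash(item)
        h2 = hash(("salt", item))
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = True

    def might_contain(self, item):
        # False means definitely absent; True means probably present
        return all(self.bits[i] for i in self._indexes(item))


bf = BloomFilter()
for word in ["apple", "banana", "cherry"]:
    bf.add(word)
assert bf.might_contain("apple")       # added items always report True
assert bf.might_contain("cherry")
```

With only 3 items in 1024 bits, a query for an absent word is overwhelmingly likely to report False, but the filter can never prove presence, only absence.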
- When to reach for a hash map: whenever you need to answer "have I seen this before?" or "what is the count/index/value associated with this key?" in \(O(1)\). If you are doing repeated linear scans looking for something, a hash map almost certainly makes it faster.
## Pattern: Hash Map Lookups
- The most basic pattern: use a hash map to replace \(O(n)\) scans with \(O(1)\) lookups.
### Easy: Two Sum
- Problem: given an array of integers and a target, return the indices of two numbers that add up to the target.
- Brute force \(O(n^2)\): check every pair.
- Pattern insight: for each number `num`, you need `target - num` to exist somewhere in the array. Instead of scanning the array for it, store previously seen numbers in a hash map.
```python
def two_sum(nums, target):
    seen = {}  # value -> index
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
```
- Why this works: one pass through the array. For each element, the hash map lookup is \(O(1)\). Total: \(O(n)\) time, \(O(n)\) space.
- Pitfall: do not add the current number to the hash map before checking for the complement, or you might match an element with itself. The order in the code above is correct: check first, then insert.
### Medium: Group Anagrams
- Problem: given a list of strings, group the anagrams together. ("eat", "tea", "ate") form one group.
- Pattern insight: anagrams have the same characters in different orders. If you sort each string, anagrams produce the same sorted key. Use that sorted key as the hash map key.
```python
from collections import defaultdict

def group_anagrams(strs):
    groups = defaultdict(list)
    for s in strs:
        key = tuple(sorted(s))  # or use a character-count tuple
        groups[key].append(s)
    return list(groups.values())
```
- Optimisation: sorting each string costs \(O(k \log k)\) where \(k\) is the string length. For a faster key, count character frequencies and use the count tuple as the key:
```python
def group_anagrams_fast(strs):
    groups = defaultdict(list)
    for s in strs:
        count = [0] * 26
        for c in s:
            count[ord(c) - ord('a')] += 1
        groups[tuple(count)].append(s)
    return list(groups.values())
```
- This is \(O(k)\) per string instead of \(O(k \log k)\). The character-count tuple is a canonical form: a representation that is the same for all members of a group.
- Pitfall: in Python, lists are not hashable (they cannot be dict keys). You must convert to a tuple. This trips people up when they try `groups[count].append(s)`.
### Hard: Longest Consecutive Sequence
- Problem: given an unsorted array, find the length of the longest run of consecutive integers (e.g., [100, 4, 200, 1, 3, 2] → 4, because of [1, 2, 3, 4]).
- Brute force \(O(n \log n)\): sort the array, then scan for consecutive runs.
- Pattern insight: put all numbers in a hash set for \(O(1)\) lookup. For each number, check whether it is the start of a sequence (i.e., `num - 1` is not in the set). If so, count how far the sequence extends.
```python
def longest_consecutive(nums):
    num_set = set(nums)
    best = 0
    for num in num_set:
        # only start counting from the beginning of a sequence
        if num - 1 not in num_set:
            length = 1
            while num + length in num_set:
                length += 1
            best = max(best, length)
    return best
```
- Why \(O(n)\): the inner `while` loop runs at most \(n\) times in total across all iterations (each number is visited at most twice: once in the outer loop, once in a `while` extension). The `if num - 1 not in num_set` guard ensures we only start counting from sequence beginnings.
- Pitfall: without the `if num - 1 not in num_set` check, you would start counting from every element, making it \(O(n^2)\) in the worst case (e.g., [1, 2, 3, ..., n] would scan the rest of the sequence from each starting point).
## Pattern: Two Pointers
- The two pointers pattern uses two indices that move through the array, usually from opposite ends or from the same end at different speeds. It works when the array is sorted or when you need to compare pairs.
- When to use: the problem involves pairs, subarrays, or partitioning, and the array is sorted (or can be sorted without losing needed information).
### Easy: Valid Palindrome
- Problem: determine whether a string is a palindrome, considering only alphanumeric characters and ignoring case.
- Pattern: one pointer at the start, one at the end. Move them inward, comparing characters.
```python
def is_palindrome(s):
    left, right = 0, len(s) - 1
    while left < right:
        # skip non-alphanumeric characters
        while left < right and not s[left].isalnum():
            left += 1
        while left < right and not s[right].isalnum():
            right -= 1
        if s[left].lower() != s[right].lower():
            return False
        left += 1
        right -= 1
    return True
```
- Pitfall: forgetting the `left < right` check inside the inner `while` loops. Without it, a pointer can run out of bounds on strings like "!!!" (all non-alphanumeric).
### Medium: Three Sum
- Problem: find all unique triplets in the array that sum to zero.
- Pattern: sort the array. Fix one element, then use two pointers on the remaining portion to find pairs that sum to the negation of the fixed element.
```python
def three_sum(nums):
    nums.sort()
    result = []
    for i in range(len(nums) - 2):
        # skip duplicate fixed elements
        if i > 0 and nums[i] == nums[i - 1]:
            continue
        left, right = i + 1, len(nums) - 1
        target = -nums[i]
        while left < right:
            total = nums[left] + nums[right]
            if total < target:
                left += 1
            elif total > target:
                right -= 1
            else:
                result.append([nums[i], nums[left], nums[right]])
                # skip duplicates on both sides
                while left < right and nums[left] == nums[left + 1]:
                    left += 1
                while left < right and nums[right] == nums[right - 1]:
                    right -= 1
                left += 1
                right -= 1
    return result
```
- Why this works: sorting is \(O(n \log n)\). For each fixed element, the two-pointer scan is \(O(n)\). Total: \(O(n^2)\), which is believed to be essentially optimal for this problem (the 3SUM conjecture), and the output alone can be quadratic in size.
- Pitfall: handling duplicates is the hardest part. Without the duplicate-skipping logic (both for the fixed element and for the two-pointer results), you will return duplicate triplets. The `if i > 0 and nums[i] == nums[i - 1]: continue` line is critical.
### Hard: Trapping Rain Water
- Problem: given an elevation map (an array of non-negative integers), compute how much water it can trap after rain.
- Pattern insight: for each position, the water level is determined by the minimum of the maximum height to its left and the maximum height to its right, minus the current height. Two pointers from both ends track these running maxima.
```python
def trap(height):
    left, right = 0, len(height) - 1
    left_max, right_max = 0, 0
    water = 0
    while left < right:
        if height[left] < height[right]:
            if height[left] >= left_max:
                left_max = height[left]
            else:
                water += left_max - height[left]
            left += 1
        else:
            if height[right] >= right_max:
                right_max = height[right]
            else:
                water += right_max - height[right]
            right -= 1
    return water
```
- Why this works: the key insight is that if `height[left] < height[right]`, the water at position `left` is bounded by `left_max` (we know there is a taller bar on the right side, so the right side cannot be the bottleneck). We always process the shorter side, guaranteed that the other side has a taller bar.
- Pitfall: many people precompute `left_max[i]` and `right_max[i]` arrays first, which works but uses \(O(n)\) space; the two-pointer approach achieves \(O(1)\) space. Also, confusing `>=` with `>` in the max updates can cause off-by-one water calculations.
## Pattern: Sliding Window
- The sliding window pattern maintains a window (a contiguous subarray) that expands and contracts as you iterate. It works for problems asking about subarrays or substrings that satisfy some condition.
- When to use: the problem asks for the longest/shortest subarray or substring satisfying a constraint, and the constraint is monotonic with respect to the window: adding elements can only make it harder to satisfy, and removing elements can only make it easier (or vice versa), never both.
- Template:
```python
def sliding_window(arr):
    left = 0
    state = ...  # window state (counts, sum, etc.)
    best = ...
    for right in range(len(arr)):
        # expand: add arr[right] to the window state
        update_state(state, arr[right])
        # contract: shrink from the left while the constraint is violated
        while constraint_violated(state):
            remove_from_state(state, arr[left])
            left += 1
        # update the answer
        best = max(best, right - left + 1)  # or min, depending on the problem
    return best
```
### Easy: Best Time to Buy and Sell Stock
- Problem: given daily prices, find the maximum profit from one buy and one sell (buy before sell).
- Pattern: track the minimum price seen so far (the left boundary of the window) and compute the profit at each day.
```python
def max_profit(prices):
    min_price = float('inf')
    best_profit = 0
    for price in prices:
        min_price = min(min_price, price)
        best_profit = max(best_profit, price - min_price)
    return best_profit
```
- This is a degenerate sliding window: the left pointer (min price) only moves forward when a new minimum is found. \(O(n)\) time, \(O(1)\) space.
### Medium: Longest Substring Without Repeating Characters
- Problem: find the length of the longest substring without any repeating characters.
- Pattern: expand the window by moving `right`. When a duplicate is found, contract from the left until the duplicate is removed.
```python
def length_of_longest_substring(s):
    char_index = {}  # character -> its most recent index
    left = 0
    best = 0
    for right, char in enumerate(s):
        if char in char_index and char_index[char] >= left:
            left = char_index[char] + 1  # jump past the duplicate
        char_index[char] = right
        best = max(best, right - left + 1)
    return best
```
- Why `char_index[char] >= left`: the character might be in the map from before the current window started. Without this check, you would incorrectly shrink the window for a character that is not actually in the current window.
- Pitfall: using a set and removing characters one by one from the left is correct but slower. The hash map approach jumps the left pointer directly to the right position.
### Hard: Minimum Window Substring
- Problem: given strings `s` and `t`, find the minimum window in `s` that contains all characters of `t`.
- Pattern: expand the window to include all required characters, then contract from the left to find the minimum valid window.
```python
from collections import Counter

def min_window(s, t):
    if not t or not s:
        return ""
    need = Counter(t)     # characters we need and their counts
    have = 0              # unique characters present in sufficient quantity
    required = len(need)  # unique characters we need
    left = 0
    best = (float('inf'), 0, 0)  # (length, left, right)
    window_counts = {}
    for right in range(len(s)):
        char = s[right]
        window_counts[char] = window_counts.get(char, 0) + 1
        # check if this character's count now meets the requirement
        if char in need and window_counts[char] == need[char]:
            have += 1
        # contract from the left while the window is valid
        while have == required:
            # update best
            if (right - left + 1) < best[0]:
                best = (right - left + 1, left, right)
            # remove the leftmost character
            left_char = s[left]
            window_counts[left_char] -= 1
            if left_char in need and window_counts[left_char] < need[left_char]:
                have -= 1
            left += 1
    length, start, end = best
    return s[start:end + 1] if length != float('inf') else ""
```
- Pitfall: the `have` counter is the crucial optimisation. Without it, you would need to compare the entire `window_counts` dict with `need` at every step, which is \(O(|\text{unique chars}|)\) per step. The `have` counter makes the validity check \(O(1)\).
- Pitfall: checking `window_counts[char] == need[char]` (not `>=`) ensures we increment `have` exactly once per character. If we used `>=`, we would over-count.
## Pattern: Prefix Sums
- A prefix sum array stores cumulative sums: `prefix[i] = sum(arr[0:i])`. Once built in \(O(n)\), any subarray sum can be computed in \(O(1)\): `sum(arr[l:r]) = prefix[r] - prefix[l]`.
```python
def build_prefix(arr):
    prefix = [0] * (len(arr) + 1)
    for i in range(len(arr)):
        prefix[i + 1] = prefix[i] + arr[i]
    return prefix

# sum of arr[l:r] (inclusive l, exclusive r)
def range_sum(prefix, l, r):
    return prefix[r] - prefix[l]
```
- When to use: the problem involves multiple subarray sum queries, or finding subarrays with a specific sum.
### Easy: Range Sum Query
- Problem: given an array, answer multiple queries of "what is the sum from index \(l\) to \(r\)?"
- Without prefix sums, each query is \(O(n)\). With prefix sums: \(O(n)\) precomputation, then \(O(1)\) per query.
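A worked example of the precompute-then-query pattern (`build_prefix` is repeated here so the snippet runs standalone):

```python
def build_prefix(arr):
    # prefix[i] holds the sum of arr[0:i]
    prefix = [0] * (len(arr) + 1)
    for i in range(len(arr)):
        prefix[i + 1] = prefix[i] + arr[i]
    return prefix


arr = [3, 1, 4, 1, 5, 9, 2, 6]
prefix = build_prefix(arr)  # O(n), done once

# each query is now a single subtraction, O(1)
assert prefix[5] - prefix[2] == sum(arr[2:5])  # 4 + 1 + 5 = 10
assert prefix[8] - prefix[0] == sum(arr)       # whole array: 31
```

For \(q\) queries, this turns \(O(qn)\) total work into \(O(n + q)\).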
### Medium: Subarray Sum Equals K
- Problem: count the number of contiguous subarrays that sum to \(k\).
- Pattern insight: the sum of a subarray from index \(l\) to \(r\) equals `prefix[r + 1] - prefix[l]`. We want this to equal \(k\), so `prefix[l] = prefix[r + 1] - k`. For each position, count how many earlier prefix sums equal `current_prefix - k` using a hash map.
```python
def subarray_sum(nums, k):
    count = 0
    prefix = 0
    prefix_counts = {0: 1}  # the empty prefix has sum 0
    for num in nums:
        prefix += num
        # how many earlier prefix sums equal prefix - k?
        count += prefix_counts.get(prefix - k, 0)
        prefix_counts[prefix] = prefix_counts.get(prefix, 0) + 1
    return count
```
- This combines prefix sums with hash map lookups: \(O(n)\) time, \(O(n)\) space.
- Pitfall: forgetting to initialise `prefix_counts = {0: 1}`. The empty prefix (before any element) has sum 0. Without this, you miss subarrays starting at index 0.
### Hard: Product of Array Except Self
- Problem: given an array, return an array where each element is the product of all the other elements. You may not use division.
- Pattern: build prefix products from the left and suffix products from the right. Each position's answer is `left_product * right_product`.
```python
def product_except_self(nums):
    n = len(nums)
    result = [1] * n
    # left pass: result[i] = product of nums[0..i-1]
    prefix = 1
    for i in range(n):
        result[i] = prefix
        prefix *= nums[i]
    # right pass: multiply in the product of nums[i+1..n-1]
    suffix = 1
    for i in range(n - 1, -1, -1):
        result[i] *= suffix
        suffix *= nums[i]
    return result
```
- \(O(n)\) time, \(O(1)\) extra space (the output array does not count). This uses the output array itself to store intermediate prefix products, then multiplies in the suffix products in the second pass.
- Pitfall: if the array contains a zero, division-based approaches fail. This prefix/suffix approach handles zeros correctly because it never divides.
## Common Pitfalls Summary
| Pitfall | Example | Fix |
|---|---|---|
| Off-by-one in window size | `right - left` vs `right - left + 1` | Draw out a 2-element example |
| Mutable default in Python | `def f(seen={})` shares state across calls | Use `def f(seen=None)` |
| String concatenation in loop | `s += c` is \(O(n^2)\) in Python | Use `list.append` + `"".join` |
| Forgetting `{0: 1}` in prefix sum | Misses subarrays starting at index 0 | Always init with the empty prefix |
| Hash map insert before check | Two Sum: adding `num` before checking for the complement | Check first, then insert |
| Not handling duplicates | Three Sum returns duplicate triplets | Skip consecutive equal values |
| Integer overflow | Sum of large arrays in C++/Java | Use `long` or check bounds |
## Take-Home Problems (NeetCode)
Practice these in order. Each reinforces a pattern from this file.
### Hash Map Lookups
- Contains Duplicate — warm-up: hash set for seen-before
- Two Sum — complement lookup
- Group Anagrams — canonical form as key
- Top K Frequent Elements — hash map + bucket sort
- Longest Consecutive Sequence — hash set with start-of-sequence trick
- Encode and Decode Strings — design a serialisation scheme
### Two Pointers
- Valid Palindrome — inward pointers
- Two Sum II (sorted) — two pointers on sorted array
- Three Sum — fix + two pointers + dedup
- Container With Most Water — greedy two pointers
- Trapping Rain Water — two pointers with running max
### Sliding Window
- Best Time to Buy and Sell Stock — degenerate window
- Longest Substring Without Repeating Characters — expand/contract with hash map
- Longest Repeating Character Replacement — window + max frequency trick
- Minimum Window Substring — expand until valid, contract to minimise
### Prefix Sums
- Product of Array Except Self — prefix/suffix products