Arrays and Hashing

Arrays and hash tables are the two most fundamental data structures in programming. This file covers how they work under the hood, then builds up the key problem-solving patterns (two pointers, sliding window, prefix sums, and hash-based lookups) through progressively harder problems, with common pitfalls at each step.

  • If you understand arrays and hash maps deeply, you can solve ~40% of all coding interview problems. These two structures appear everywhere because they offer the two things algorithms need most: fast indexed access (arrays) and fast lookup by key (hash maps).

  • This file teaches patterns, not solutions. The goal is that when you see a new problem, you recognise which pattern applies and why, rather than trying to recall a memorised solution.

Arrays

  • An array is a contiguous block of memory where elements are stored at fixed offsets. Accessing element \(i\) costs \(O(1)\) because the address is simply base + i * element_size. This is the fastest possible data access, and it is why arrays are the default choice.

  • Dynamic arrays (Python's list, Java's ArrayList, C++'s vector) grow automatically when full. The strategy is amortised doubling: when the array is full, allocate a new array of double the size and copy everything over. The copy costs \(O(n)\), but it happens so rarely (every \(n\) insertions) that the amortised cost per insertion is \(O(1)\).
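
  • The doubling strategy can be sketched directly. This is a toy illustration, not how CPython actually implements list (which grows by a milder factor):
```python
class DynamicArray:
    """Toy dynamic array: doubles capacity when full."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.data = [None] * self.capacity

    def append(self, value):
        if self.size == self.capacity:
            self._resize(2 * self.capacity)  # O(n) copy, but amortised O(1)
        self.data[self.size] = value
        self.size += 1

    def _resize(self, new_capacity):
        # allocate a larger block and copy every element over
        new_data = [None] * new_capacity
        for i in range(self.size):
            new_data[i] = self.data[i]
        self.data = new_data
        self.capacity = new_capacity

    def __getitem__(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        return self.data[i]
```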

  • Cache locality is why arrays are fast in practice, not just in theory. Because elements are stored contiguously, accessing one element loads nearby elements into the CPU cache (chapter 13). Iterating through an array is cache-friendly; following pointers in a linked list is not. This constant-factor difference can be 10-100x in practice.

Operation                 Array      Dynamic Array
Access by index           \(O(1)\)   \(O(1)\)
Append                    n/a        \(O(1)\) amortised
Insert at position \(i\)  \(O(n)\)   \(O(n)\)
Delete at position \(i\)  \(O(n)\)   \(O(n)\)
Search (unsorted)         \(O(n)\)   \(O(n)\)
  • Pitfall: inserting or deleting in the middle of an array is \(O(n)\) because all subsequent elements must be shifted. If you need frequent middle insertions, consider a linked list or a different approach entirely.

Strings

  • A string is an array of characters. In Python, strings are immutable: every concatenation creates a new string. Building a string character by character in a loop is \(O(n^2)\) because each concatenation copies the entire string so far.
# BAD: O(n^2) string concatenation
s = ""
for c in characters:
    s += c  # copies entire string each time

# GOOD: O(n) using a list then join
parts = []
for c in characters:
    parts.append(c)
s = "".join(parts)
  • Pitfall: in Python, s += c inside a loop is one of the most common performance bugs. Always collect into a list and .join().

  • Encoding: ASCII uses 7 bits (128 characters). UTF-8 is variable-length: ASCII characters use 1 byte, accented characters use 2, Chinese/Japanese characters use 3, emojis use 4. When a problem says "lowercase English letters," the alphabet size is 26, which means you can use a fixed-size array instead of a hash map.
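
  • The fixed-size-array trick in practice: a sketch of an anagram check over lowercase English letters, where a 26-slot count array replaces the hash map entirely:
```python
def is_anagram(s, t):
    """Check whether two lowercase-letter strings are anagrams
    using a fixed 26-slot count array instead of a hash map."""
    if len(s) != len(t):
        return False
    counts = [0] * 26
    for c in s:
        counts[ord(c) - ord('a')] += 1   # count up for s
    for c in t:
        counts[ord(c) - ord('a')] -= 1   # count down for t
    return all(n == 0 for n in counts)   # anagrams cancel exactly
```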

Hash Tables

  • A hash table maps keys to values with \(O(1)\) average-case lookup, insertion, and deletion. It works by computing a hash function \(h(key)\) that converts the key to an array index.

  • The hash function must be: deterministic (same key always gives same hash), uniform (distributes keys evenly across buckets), and fast to compute.

  • Collisions occur when two different keys hash to the same index. Two main strategies:

    • Chaining: each bucket stores a linked list of key-value pairs. On collision, append to the list. Worst case (all keys hash to the same bucket): \(O(n)\). Average case with a good hash function: \(O(1)\).

    • Open addressing: on collision, probe for the next empty slot. Linear probing checks the next slot, then the next, etc. It is cache-friendly but suffers from clustering (long runs of occupied slots). Robin Hood hashing reduces the variance by displacing entries that are "closer to home."

  • The load factor \(\alpha = n / m\) (items / buckets) determines performance. When \(\alpha\) exceeds a threshold (typically 0.75), the table rehashes: allocate a larger table and re-insert all elements. This costs \(O(n)\) but happens infrequently.
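
  • Chaining plus load-factor-triggered rehashing can be sketched in a few lines. This is a toy model (buckets are Python lists standing in for linked lists), not how CPython's dict works, which uses open addressing:
```python
class ChainedHashMap:
    """Toy hash map: chaining for collisions, rehash past load factor 0.75."""

    def __init__(self, buckets=8):
        self.buckets = [[] for _ in range(buckets)]
        self.size = 0

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # collision handling: append to chain
        self.size += 1
        if self.size / len(self.buckets) > 0.75:  # load factor alpha = n / m
            self._rehash()

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def _rehash(self):
        # O(n) re-insert into a table twice the size; happens infrequently
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for key, value in bucket:
                self.buckets[self._index(key)].append((key, value))
```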

  • Hash maps (dict in Python, HashMap in Java) store key-value pairs. Hash sets (set in Python, HashSet in Java) store only keys (used for fast membership testing).

Operation  Average    Worst Case
Lookup     \(O(1)\)   \(O(n)\)
Insert     \(O(1)\)   \(O(n)\)
Delete     \(O(1)\)   \(O(n)\)
  • Bloom filters are space-efficient probabilistic sets. They can tell you "definitely not in the set" or "probably in the set" (with a tunable false positive rate). They use \(k\) hash functions and a bit array. Used in databases (avoid disk reads for absent keys), web caches, and spell checkers.
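
  • A toy Bloom filter, salting Python's built-in hash to stand in for \(k\) independent hash functions (real implementations use proper hashes such as MurmurHash):
```python
class BloomFilter:
    """Toy Bloom filter: k salted hashes over an m-bit array.

    Never gives false negatives; false positives occur at a rate
    controlled by m (bits) and k (hash functions).
    """

    def __init__(self, m=1024, k=3):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, item):
        # salting one hash function approximates k independent ones
        return [hash((salt, item)) % self.m for salt in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not present"; True means "probably present"
        return all(self.bits[pos] for pos in self._positions(item))
```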

  • When to reach for a hash map: whenever you need to answer "have I seen this before?" or "what is the count/index/value associated with this key?" in \(O(1)\). If you are doing repeated linear scans looking for something, a hash map almost certainly makes it faster.
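
  • A minimal before/after illustration of that speedup: detecting a duplicate with repeated linear scans versus a hash set:
```python
def has_duplicate_slow(nums):
    # repeated linear scans: O(n^2)
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] == nums[j]:
                return True
    return False

def has_duplicate(nums):
    # hash set membership: O(n)
    seen = set()
    for num in nums:
        if num in seen:   # "have I seen this before?" in O(1)
            return True
        seen.add(num)
    return False
```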


Pattern: Hash Map Lookups

  • The most basic pattern: use a hash map to replace \(O(n)\) scans with \(O(1)\) lookups.

Easy: Two Sum

  • Problem: given an array of integers and a target, return indices of two numbers that add up to the target.

  • Brute force \(O(n^2)\): check every pair.

  • Pattern insight: for each number num, you need target - num to exist somewhere in the array. Instead of scanning the array for it, store previously seen numbers in a hash map.

def two_sum(nums, target):
    seen = {}  # value -> index
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
  • Why this works: one pass through the array. For each element, the hash map lookup is \(O(1)\). Total: \(O(n)\) time, \(O(n)\) space.

  • Pitfall: do not add the current number to the hash map before checking for the complement, or you might match an element with itself. The order in the code above is correct: check first, then insert.

Medium: Group Anagrams

  • Problem: given a list of strings, group the anagrams together. For example, "eat", "tea", and "ate" form one group.

  • Pattern insight: anagrams have the same characters in different orders. If you sort each string, anagrams produce the same sorted key. Use that sorted key as the hash map key.

from collections import defaultdict

def group_anagrams(strs):
    groups = defaultdict(list)
    for s in strs:
        key = tuple(sorted(s))  # or use character count tuple
        groups[key].append(s)
    return list(groups.values())
  • Optimisation: sorting each string costs \(O(k \log k)\) where \(k\) is the string length. For a faster key, count character frequencies and use the count tuple as the key:
def group_anagrams_fast(strs):
    groups = defaultdict(list)
    for s in strs:
        count = [0] * 26
        for c in s:
            count[ord(c) - ord('a')] += 1
        groups[tuple(count)].append(s)
    return list(groups.values())
  • This is \(O(k)\) per string instead of \(O(k \log k)\). The character count tuple is a canonical form: a representation that is the same for all members of a group.

  • Pitfall: in Python, lists are not hashable (cannot be dict keys). You must convert to a tuple. This trips people up when they try groups[count].append(s).

Hard: Longest Consecutive Sequence

  • Problem: given an unsorted array, find the length of the longest consecutive sequence (e.g., [100, 4, 200, 1, 3, 2] → 4, because [1, 2, 3, 4]).

  • Brute force \(O(n \log n)\): sort the array, then scan for consecutive runs.

  • Pattern insight: put all numbers in a hash set for \(O(1)\) lookup. For each number, check if it is the start of a sequence (i.e., num - 1 is not in the set). If so, count how far the sequence extends.

def longest_consecutive(nums):
    num_set = set(nums)
    best = 0

    for num in num_set:
        # only start counting from the beginning of a sequence
        if num - 1 not in num_set:
            length = 1
            while num + length in num_set:
                length += 1
            best = max(best, length)

    return best
  • Why \(O(n)\): the inner while loop runs at most \(n\) times in total across all outer iterations, because each number can be reached by a while extension at most once. The if num - 1 not in num_set guard ensures we only start counting from sequence beginnings.

  • Pitfall: without the if num - 1 not in num_set check, you would start counting from every element, making it \(O(n^2)\) in the worst case (e.g., [1, 2, 3, ..., n] would scan the entire sequence from each starting point).


Pattern: Two Pointers

  • The two pointers pattern uses two indices that move through the array, usually from opposite ends or from the same end at different speeds. It works when the array is sorted or when you need to compare pairs.

  • When to use: the problem involves pairs, subarrays, or partitioning, and the array is sorted (or can be sorted without losing needed information).

Easy: Valid Palindrome

  • Problem: determine if a string is a palindrome, considering only alphanumeric characters and ignoring case.

  • Pattern: one pointer at the start, one at the end. Move them inward, comparing characters.

def is_palindrome(s):
    left, right = 0, len(s) - 1

    while left < right:
        # skip non-alphanumeric characters
        while left < right and not s[left].isalnum():
            left += 1
        while left < right and not s[right].isalnum():
            right -= 1

        if s[left].lower() != s[right].lower():
            return False

        left += 1
        right -= 1

    return True
  • Pitfall: forgetting the left < right check inside the inner while loops. Without it, the pointer can go out of bounds on strings like "!!!" (all non-alphanumeric).

Medium: Three Sum

  • Problem: find all unique triplets in the array that sum to zero.

  • Pattern: sort the array. Fix one element, then use two pointers on the remaining portion to find pairs that sum to the negation of the fixed element.

def three_sum(nums):
    nums.sort()
    result = []

    for i in range(len(nums) - 2):
        # skip duplicate fixed elements
        if i > 0 and nums[i] == nums[i - 1]:
            continue

        left, right = i + 1, len(nums) - 1
        target = -nums[i]

        while left < right:
            total = nums[left] + nums[right]
            if total < target:
                left += 1
            elif total > target:
                right -= 1
            else:
                result.append([nums[i], nums[left], nums[right]])
                # skip duplicates
                while left < right and nums[left] == nums[left + 1]:
                    left += 1
                while left < right and nums[right] == nums[right - 1]:
                    right -= 1
                left += 1
                right -= 1

    return result
  • Why this works: sorting is \(O(n \log n)\). For each fixed element, the two-pointer scan is \(O(n)\). Total: \(O(n^2)\), which is essentially optimal: the output alone can contain \(\Theta(n^2)\) triplets in the worst case.

  • Pitfall: handling duplicates is the hardest part. Without the duplicate-skipping logic (both for the fixed element and for the two-pointer results), you will return duplicate triplets. The if i > 0 and nums[i] == nums[i-1]: continue line is critical.

Hard: Trapping Rain Water

  • Problem: given an elevation map (array of non-negative integers), compute how much water it can trap after rain.

  • Pattern insight: for each position, the water level is determined by the minimum of the maximum height to its left and the maximum height to its right, minus the current height. Two pointers from both ends track these running maxima.

def trap(height):
    left, right = 0, len(height) - 1
    left_max, right_max = 0, 0
    water = 0

    while left < right:
        if height[left] < height[right]:
            if height[left] >= left_max:
                left_max = height[left]
            else:
                water += left_max - height[left]
            left += 1
        else:
            if height[right] >= right_max:
                right_max = height[right]
            else:
                water += right_max - height[right]
            right -= 1

    return water
  • Why this works: the key insight is that if height[left] < height[right], the water at position left is bounded by left_max (we know there is a taller bar on the right side, so the right side cannot be the bottleneck). We process the shorter side, guaranteed that the other side has a taller bar.

  • Pitfall: many people try to precompute left_max[i] and right_max[i] arrays first (which works but uses \(O(n)\) space). The two-pointer approach achieves \(O(1)\) space. Also, confusing >= with > in the max updates can cause off-by-one water calculations.


Pattern: Sliding Window

  • The sliding window pattern maintains a window (contiguous subarray) that expands and contracts as you iterate. It works for problems asking about subarrays or substrings that satisfy some condition.

  • When to use: the problem asks for the longest/shortest subarray or substring satisfying a constraint, and expanding/contracting the window is monotonic (adding elements can only make the constraint harder/easier to satisfy, not both).

  • Template:

def sliding_window(arr):
    left = 0
    state = ...  # window state (counts, sum, etc.)
    best = ...

    for right in range(len(arr)):
        # expand: add arr[right] to the window state
        update_state(state, arr[right])

        # contract: shrink from the left while constraint is violated
        while constraint_violated(state):
            remove_from_state(state, arr[left])
            left += 1

        # update answer
        best = max(best, right - left + 1)  # or min, depending on problem

    return best
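
  • A concrete instantiation of the template, on a made-up example problem (not from this file's problem set): longest subarray whose sum is at most a limit, assuming non-negative numbers so the constraint is monotonic:
```python
def longest_subarray_at_most(arr, limit):
    """Length of the longest contiguous subarray with sum <= limit.

    Assumes non-negative numbers, so expanding the window can only
    increase the sum (the monotonicity the pattern requires).
    """
    left = 0
    window_sum = 0   # window state
    best = 0

    for right in range(len(arr)):
        window_sum += arr[right]          # expand
        while window_sum > limit:         # contract while constraint violated
            window_sum -= arr[left]
            left += 1
        best = max(best, right - left + 1)

    return best
```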

Easy: Best Time to Buy and Sell Stock

  • Problem: given daily prices, find the maximum profit from one buy and one sell (buy before sell).

  • Pattern: track the minimum price seen so far (the left boundary of the window) and compute the profit at each day.

def max_profit(prices):
    min_price = float('inf')
    max_profit = 0

    for price in prices:
        min_price = min(min_price, price)
        max_profit = max(max_profit, price - min_price)

    return max_profit
  • This is a degenerate sliding window: the left pointer (min price) only moves forward when a new minimum is found. \(O(n)\) time, \(O(1)\) space.

Medium: Longest Substring Without Repeating Characters

  • Problem: find the length of the longest substring without any repeating characters.

  • Pattern: expand the window by moving right. When a duplicate is found, contract from the left until the duplicate is removed.

def length_of_longest_substring(s):
    char_index = {}  # character -> its most recent index
    left = 0
    best = 0

    for right, char in enumerate(s):
        if char in char_index and char_index[char] >= left:
            left = char_index[char] + 1  # jump past the duplicate

        char_index[char] = right
        best = max(best, right - left + 1)

    return best
  • Why char_index[char] >= left: the character might be in the map from before the current window started. Without this check, you would incorrectly shrink the window for a character that is not actually in the current window.

  • Pitfall: using a set and removing characters one by one from the left is correct but slower. The hash map approach jumps directly to the right position.

Hard: Minimum Window Substring

  • Problem: given strings s and t, find the minimum window in s that contains all characters of t.

  • Pattern: expand the window to include all required characters, then contract from the left to find the minimum valid window.

from collections import Counter

def min_window(s, t):
    if not t or not s:
        return ""

    need = Counter(t)       # characters we need and their counts
    have = 0                # how many unique characters we have in sufficient quantity
    required = len(need)    # how many unique characters we need

    left = 0
    best = (float('inf'), 0, 0)  # (length, left, right)

    window_counts = {}

    for right in range(len(s)):
        char = s[right]
        window_counts[char] = window_counts.get(char, 0) + 1

        # check if this character's count now meets the requirement
        if char in need and window_counts[char] == need[char]:
            have += 1

        # contract from the left while the window is valid
        while have == required:
            # update best
            if (right - left + 1) < best[0]:
                best = (right - left + 1, left, right)

            # remove leftmost character
            left_char = s[left]
            window_counts[left_char] -= 1
            if left_char in need and window_counts[left_char] < need[left_char]:
                have -= 1
            left += 1

    length, start, end = best
    return s[start:end + 1] if length != float('inf') else ""
  • Pitfall: the have counter is the crucial optimisation. Without it, you would need to compare the entire window_counts dict with need at every step, which is \(O(|\text{unique chars}|)\) per step. The have counter makes the validity check \(O(1)\).

  • Pitfall: checking window_counts[char] == need[char] (not >=) ensures we increment have exactly once per character. If we used >=, we would over-count.


Pattern: Prefix Sums

  • A prefix sum array stores cumulative sums: prefix[i] = sum(arr[0:i]). Once built in \(O(n)\), any subarray sum can be computed in \(O(1)\): sum(arr[l:r]) = prefix[r] - prefix[l].
def build_prefix(arr):
    prefix = [0] * (len(arr) + 1)
    for i in range(len(arr)):
        prefix[i + 1] = prefix[i] + arr[i]
    return prefix

# sum of arr[l:r] (inclusive l, exclusive r)
def range_sum(prefix, l, r):
    return prefix[r] - prefix[l]
  • When to use: the problem involves multiple subarray sum queries, or finding subarrays with a specific sum.

Easy: Range Sum Query

  • Problem: given an array, answer multiple queries of "what is the sum from index \(l\) to \(r\)?"

  • Without prefix sums: each query is \(O(n)\). With prefix sums: \(O(n)\) precomputation, then \(O(1)\) per query.
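
  • A self-contained sketch of the usual query-object shape for this problem (the class name RangeSum is ours; it mirrors the build_prefix/range_sum helpers above):
```python
class RangeSum:
    """Answer sum queries over a fixed array: O(n) setup, O(1) per query."""

    def __init__(self, arr):
        # prefix[i] holds sum(arr[0:i])
        self.prefix = [0] * (len(arr) + 1)
        for i, x in enumerate(arr):
            self.prefix[i + 1] = self.prefix[i] + x

    def query(self, l, r):
        # sum of arr[l:r] (inclusive l, exclusive r)
        return self.prefix[r] - self.prefix[l]
```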

Medium: Subarray Sum Equals K

  • Problem: count the number of contiguous subarrays that sum to \(k\).

  • Pattern insight: a subarray sum from index \(l\) to \(r\) equals prefix[r+1] - prefix[l]. We want this to equal \(k\), so prefix[l] = prefix[r+1] - k. For each position, count how many earlier prefix sums equal current_prefix - k using a hash map.

def subarray_sum(nums, k):
    count = 0
    prefix = 0
    prefix_counts = {0: 1}  # empty prefix sum

    for num in nums:
        prefix += num
        # how many earlier prefix sums equal prefix - k?
        count += prefix_counts.get(prefix - k, 0)
        prefix_counts[prefix] = prefix_counts.get(prefix, 0) + 1

    return count
  • This combines prefix sums with hash map lookups: \(O(n)\) time, \(O(n)\) space.

  • Pitfall: forgetting to initialise prefix_counts = {0: 1}. The empty prefix (before any element) has sum 0. Without this, you miss subarrays starting from index 0.

Hard: Product of Array Except Self

  • Problem: given an array, return an array where each element is the product of all other elements. You cannot use division.

  • Pattern: build prefix products from the left and suffix products from the right. Each position's answer is left_product * right_product.

def product_except_self(nums):
    n = len(nums)
    result = [1] * n

    # left pass: result[i] = product of nums[0..i-1]
    prefix = 1
    for i in range(n):
        result[i] = prefix
        prefix *= nums[i]

    # right pass: multiply by product of nums[i+1..n-1]
    suffix = 1
    for i in range(n - 1, -1, -1):
        result[i] *= suffix
        suffix *= nums[i]

    return result
  • \(O(n)\) time, \(O(1)\) extra space (the output array does not count). This uses the output array itself to store intermediate prefix products, then multiplies in the suffix products in the second pass.

  • Pitfall: if the array contains a zero, division-based approaches fail. This prefix/suffix approach handles zeros correctly because it never divides.


Common Pitfalls Summary

  • Off-by-one in window size: right - left vs right - left + 1. Fix: draw out a 2-element example.

  • Mutable default argument in Python: def f(seen={}) shares state across calls. Fix: use def f(seen=None).

  • String concatenation in a loop: s += c is \(O(n^2)\) in Python. Fix: use list.append and "".join().

  • Forgetting {0: 1} in prefix sums: misses subarrays starting at index 0. Fix: always initialise with the empty prefix.

  • Inserting into the hash map before checking: in Two Sum, adding num before checking for the complement can match an element with itself. Fix: check first, then insert.

  • Not handling duplicates: Three Sum returns duplicate triplets. Fix: skip consecutive equal values.

  • Integer overflow: summing large arrays in C++/Java. Fix: use long or check bounds.

Take-Home Problems (NeetCode)

Practice these in order. Each reinforces a pattern from this file.

Hash Map Lookups

Two Pointers

Sliding Window

Prefix Sums