The Best Data Structure For Storing Non-Duplicate Items In Python

As a full-stack developer, choosing the right data structure is crucial for writing efficient and bug-free code. One common scenario you‘ll encounter is needing to store a collection of items while ensuring there are no duplicates. In Python, the built-in set data type is hands-down the best choice for this use case.

Why Use a Set?

Python provides several built-in data structures including lists, tuples, dictionaries and sets. Lists allow duplicate elements and maintain insertion order. Tuples are immutable sequences that allow duplicates. Dictionaries store key-value pairs.

Sets, on the other hand, are unordered collections of unique hashable elements. The key characteristics that make sets ideal for storing non-duplicate items are:

  1. Duplicates are automatically eliminated. Adding an element that already exists does nothing.

  2. Membership testing is highly efficient. Checking if an element exists in a set is much faster than with lists or tuples.

  3. Provides useful operations for set theory like union, intersection, difference, etc.

  4. Elements are stored in a way that allows fast lookup, insertion and deletion.

Under the hood, sets are implemented using hash tables. When you add an element to a set, Python applies a hash function to determine the element‘s hash value. This hash is used as an index in an underlying array to store the element. Hash collisions are handled by storing elements in buckets.

This hashing is what enables sets to efficiently check if an element already exists, giving them a time complexity of O(1) on average for insertion, deletion, and membership testing. In contrast, lists and tuples have a time complexity of O(n) for membership testing in the worst case.

Using Sets in Python

Creating a set is straightforward – simply enclose comma-separated elements in curly braces:

colors = {‘red‘, ‘green‘, ‘blue‘}
print(colors)  # Output: {‘red‘, ‘blue‘, ‘green‘}

Notice that order is not maintained, since sets are unordered by definition. You can also create an empty set using the set() constructor:

empty_set = set()

Adding elements to a set is done using the add() method:

numbers = {1, 2, 3}
numbers.add(4)
numbers.add(4)  # Duplicates are ignored
print(numbers)  # Output: {1, 2, 3, 4}

As you can see, attempting to add a duplicate element does nothing – sets only store unique elements.

One of the most useful features of sets is the ability to efficiently remove duplicates from other collections like lists. You can convert a list to a set using the set() constructor:

fruits = [‘apple‘, ‘banana‘, ‘orange‘, ‘apple‘, ‘pear‘, ‘banana‘] 
fruits_set = set(fruits)
print(fruits_set)  # Output: {‘orange‘, ‘banana‘, ‘pear‘, ‘apple‘}

The resulting fruits_set contains only the unique elements from the original list. You can convert the set back to a list if needed:

unique_fruits = list(fruits_set)
print(unique_fruits)  # Output: [‘orange‘, ‘banana‘, ‘pear‘, ‘apple‘]

Keep in mind lists generated from sets may have a different order than the original, since sets don‘t maintain order.

Set Operations

Python‘s set type provides several handy operations based on mathematical set theory. Here are some of the most commonly used:

  • union(): Returns a new set containing elements from both sets
  • intersection(): Returns elements present in both sets
  • difference(): Returns elements in first set but not in second
  • symmetric_difference(): Returns elements in either set but not both
  • issubset(), issuperset(): Check if one set contains another

Here‘s an example demonstrating a few of these:

A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}

print(A | B)  # Union: {1, 2, 3, 4, 5, 6, 7, 8}
print(A & B)  # Intersection: {4, 5}
print(A - B)  # Difference: {1, 2, 3}
print(B - A)  # Difference: {8, 6, 7}
print(A ^ B)  # Symmetric Difference: {1, 2, 3, 6, 7, 8}

C = {1, 2, 3}
print(C.issubset(A))  # True
print(A.issuperset(C))  # True

These set operations make it easy to perform tasks like finding elements two sets have in common, items that are in one set but not the other, etc. And because of the underlying hash table implementation, these operations are quite efficient.

Membership Testing

Another area where sets shine is checking if an element exists within the set. The in keyword tests for membership:

vowels = {‘a‘, ‘e‘, ‘i‘, ‘o‘, ‘u‘}
print(‘a‘ in vowels)  # True 
print(‘x‘ in vowels)  # False

For sets, in checks the underlying hash table, resulting in fast O(1) lookups on average. In contrast, using in with a list or tuple requires scanning through the elements until a match is found, which is O(n) in the worst case.

This makes sets an excellent choice when you need to do many membership tests and performance is important. For example, let‘s say you have a large list of emails and want to efficiently check if a given email has already been seen:

seen_emails = set()

def check_email(email):
    if email in seen_emails:
        print(f"{email} already seen!")
    else:
        seen_emails.add(email)
        print(f"{email} added")

check_email("[email protected]")  # Output: [email protected] added  
check_email("[email protected]") # Output: [email protected] added
check_email("[email protected]")  # Output: [email protected] already seen!

Using a set for seen_emails allows for fast checks and avoids storing duplicate entries. If we used a list instead, the in checks would get slower and slower as the list grows larger.

Downsides of Sets

While sets are fantastic for storing unique elements and fast membership testing, there are a couple of potential downsides to be aware of:

  1. Sets are unordered. If you need to maintain the order elements were added in, use a list or tuple instead.

  2. Set elements must be hashable. This means mutable types like lists and dictionaries cannot be stored in a set, since they don‘t have a fixed hash value.

  3. Sets don‘t support indexing or slicing like sequences such as lists and tuples. You can‘t access elements by position.

Alternatives to Sets

If you need unique elements but also want to preserve insertion order, consider using a dictionary instead. Dictionaries remember the order in which key-value pairs were added (since Python 3.7). You can use dict.fromkeys() to create a dictionary with None values and your sequence as the keys:

items = ["apple", "banana", "pear", "apple"]
unique_items = dict.fromkeys(items)
print(unique_items)  # Output: {‘apple‘: None, ‘banana‘: None, ‘pear‘: None}

The resulting unique_items dictionary has the unique elements from the original list as its keys, in the same order they first appeared. You can access the elements by iterating the dict‘s keys.

Another option is to use the built-in collections.OrderedDict class. This is a dictionary subclass that remembers insertion order of keys and also provides methods for rearranging the order. However, it‘s been somewhat obsoleted by the fact that regular dicts are ordered by default since 3.7.

Conclusion

To summarize, Python‘s built-in set data type is the clear winner when you need to store a collection of unique elements. By eliminating duplicates and providing fast membership testing and convenient set operations, sets are a powerful tool to have in your Python toolbelt.

Just remember that sets are unordered and cannot contain unhashable types. If you need to preserve insertion order while still avoiding duplicates, a dictionary can be a good alternative. And if you want more control over the order of unique items, check out collections.OrderedDict.

With judicious use of sets and knowledge of their strengths and limitations, you‘ll be able to write cleaner, more efficient Python code. Whenever you‘re dealing with a collection that shouldn‘t contain duplicates, think sets!

Similar Posts