I just finished CS 325 - Analysis of Algorithms a few weeks ago, and because I don't want to lose the knowledge I gained in that class, I have decided to start working through some of the books I have on data structures and algorithms questions.

I am probably going to start blogging about the questions (interesting to me, at least) that I come across in this self-study.

One problem that I recently solved and really liked was titled "Is Unique": create an algorithm to determine whether a string has all unique characters.

My general approach to solving problems like this is to solve them first on paper before trying to write any code.

The first approach that came to me was to iterate through each character in the string, storing each character in a list. While looping, if the current character is already in the list, the function should return False, since that means the character has been encountered before and therefore is not unique.

Another thing I briefly thought about: if the string contains more characters than there are in whatever character set we are dealing with, the function should return False, since it would be impossible to have 201 unique characters in a string drawn from a character set that has only 200.
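As a sketch of what that early return could look like (assuming a 128-character ASCII set, which is my own assumption here, not something the problem specifies):

```python
def is_unique_with_bound(s, charset_size=128):
  # Pigeonhole principle: a string longer than the character set
  # must contain at least one repeated character.
  # charset_size=128 assumes plain ASCII; adjust for other sets.
  if len(s) > charset_size:
    return False
  char_seen = []
  for char in s:
    if char in char_seen:
      return False
    char_seen.append(char)
  return True
```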

The actual Python code I then wrote to solve the problem on my own is below. I chose not to handle returning False for a string longer than the character set, because I felt there would be no way to determine which character set I was dealing with!

def is_unique(s):
  # s is the input string; avoid naming it "input", which shadows a built-in
  char_seen = []
  for char in s:
    if char in char_seen:
      return False
    char_seen.append(char)
  return True

I was fairly happy with my solution, and then I curiously turned to the book's ideal solution to see how it compared.

One thing to keep in mind is that the ideal solution is never going to be better than O(n), because we have to check every character in the input string before we know whether each character is unique. There is no way around that.

My code above, though, is an example of where Python's high-level features can really trip you up if you are not careful. What I mean is that I am using the power of the in operator in Python to check whether an element is in a list. I was thinking that my original code above was O(n). In my head, the reasoning went: I am simply checking if an element is in a list, and doing that once for each of the n characters in the input string. That is one operation per character, so O(n).

But should that membership check really count as one operation/step in my analysis? Can I check whether an element is in an array/list in constant time? One thing I could do is look up the time complexity of the in operator in Python: https://wiki.python.org/moin/TimeComplexity.
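One quick, hands-on way to feel the difference is to time a membership test against a list versus a set (exact timings are machine-dependent; the comparison is what matters):

```python
import timeit

# Membership tests on a list scan elements one by one, while a set
# uses hashing. Checking for the last element makes the list's
# linear scan do the most work.
data_list = list(range(10000))
data_set = set(data_list)

t_list = timeit.timeit(lambda: 9999 in data_list, number=1000)
t_set = timeit.timeit(lambda: 9999 in data_set, number=1000)
print(t_list > t_set)  # the list scan should be noticeably slower
```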

Another thing I can do on my own, though, is solve the problem without using the in operator:

def is_unique(n):
  char_seen = []
  for char in n:
    # This inner loop is the work the "in" operator was doing for us
    for i in char_seen:
      if char == i:
        return False
    char_seen.append(char)
  return True

When we avoid the power of the in operator for checking whether elements exist in a list, it is easier to see that my original code is actually O(n^2), since I have nested for loops that are both driven by the size of the input.

So, is there a better approach that can give us a worst case running time better than O(n^2)? It turns out there is and it actually makes a lot of sense when you think about it.

The approach the book suggests is to use a hash table (a dictionary in Python) to store all the possible characters, with a boolean for each indicating whether that character has been encountered before. As an example, say we had a special four-character set of a, b, c, d; this is what our hash table could look like:

{
  'a': False,
  'b': False,
  'c': False,
  'd': False
}
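For a larger character set, that table doesn't have to be typed out by hand; the standard library's dict.fromkeys builds it in one line (using the lowercase alphabet here just as an example):

```python
import string

# Build a lookup table mapping every lowercase letter to False.
char_set = dict.fromkeys(string.ascii_lowercase, False)
print(char_set['a'])  # False
```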

Now we can use the quick lookup times of hash tables to avoid iterating through a list. The worst-case space complexity of this approach is also lower, since we only store a boolean for each character in the character set, rather than storing up to n characters in a list.

This new approach looks like it's O(n), but you could argue it's even better than that. The time complexity is arguably O(1) (constant time), since the algorithm never loops through more than c + 1 characters, where c is the number of characters in the character set: after seeing c unique characters, the next one must be a repeat. c is a fixed constant, so if, for example, we had 4 characters in our character set, the time complexity would be O(4), and due to the way Big O notation works, that reduces to O(1).

Below is the code I wrote that uses this quicker approach. For this example, I assumed we had a special small character set consisting only of the characters a, b, c, d.

def is_unique(n):
  # Every character in the set starts out unseen
  char_set = {'a': False,
              'b': False,
              'c': False,
              'd': False}

  for char in n:
    if not char_set[char]:
      char_set[char] = True
    else:
      return False
  return True
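If we don't know the character set in advance, a Python set gives the same average O(1) lookups without pre-listing every character. This is a variation on the book's approach rather than something from it:

```python
def is_unique_set(s):
  seen = set()
  for char in s:
    if char in seen:  # set membership is an average O(1) hash lookup
      return False
    seen.add(char)
  return True
```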

Well, that's it for this quick post!

The classes I am starting this Winter term at Oregon State are Assembly Language and Software Engineering 1, so expect to see more posts on those topics as well. I am also working heavily (when I have time) on a personal web app project that I hope to deploy soon.

