My original plan, after that last side-step post on using packages, was to begin having a look at feature scaling, encoding and such. But I can’t seem to get my head around doing so.

So I am going to start working on a post looking at some of Python’s built-in data types. It may never get published, but hopefully working on it will give my brain some time to come to grips with a return to working on machine learning and the Titanic dataset.

I am going to look mainly at text, sequence, set and mapping types. Other than numeric types they get used a lot and from my perspective they have many similarities. But in some cases the differences are significant.

Objects

When you instantiate an object in Python, it is assigned a unique object id. The type of the object is fixed at that point, i.e. at run time, and can not be changed. The variable label is linked to the object id.

But, to be sure, a variable label can always be linked to a different object. So a variable linked to a string object can later in the code be linked to a list object. But that string object can’t be turned into a list object.

t_var = 'test'
print(f"t_var = {t_var} ({id(t_var)})")
t_var = list(range(4))
print(f"t_var = {t_var} ({id(t_var)})")
(ani-3.10) PS R:\learn\py_play> python rek_quick_test.py
t_var = test (2079214320304)
t_var = [0, 1, 2, 3] (2079218594816)

Some of these objects are mutable and some are immutable. The former includes lists (list), dictionaries (dict) and sets (set). Most custom classes are also mutable. The latter, immutable, includes integers (int), floats (float), booleans (bool), ranges (range), strings (str), and tuples (tuple).

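A characteristic of the built-in immutable types is that they are hashable (a tuple only if everything in it is hashable too). They are compared by value. The built-in mutable containers are also compared by value, but they are not hashable. Instances of custom classes are, by default, compared by object identity and are hashable.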
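A minimal sketch of how comparison and hashing behave for a few built-in types. The helper is_hashable() and the variable names are made up for illustration. They are not part of Python.

```python
# Equal values compare equal and hash equal, even for separate objects.
point = (1, 2)
same_point = tuple([1, 2])          # equal value, built separately

values_equal = point == same_point
hashes_equal = hash(point) == hash(same_point)


def is_hashable(obj):
    """Return True if hash(obj) works, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(values_equal, hashes_equal)
print(is_hashable("text"), is_hashable((1, 2)), is_hashable(frozenset({1})))
print(is_hashable([1, 2]), is_hashable({"a": 1}), is_hashable({1, 2}))
print(is_hashable((1, [2])))        # a tuple holding a list is not hashable
```

The last line is the one that tends to surprise: a tuple is only as hashable as the things inside it.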

Text

I want to start with the Text Sequence Type because of its immutability. Not hard to get one’s head around, but I initially wondered why? The label would seem to indicate it has similarities to the Sequence Types (list, tuple, range). And, that a string object can be indexed/iterated over like the sequence types mentioned is probably the key similarity.

So, once created the contents/state of a string can not be changed. You can of course assign a new string to the same variable/label. But it does not alter the prior string linked to that variable name. It creates a new one. Confirming immutability should be easy enough.

In [3]:
# strings and immutability
# simple example
msg = "Welcome to Too Old to Code!"
print(msg[0:10])
msg[0:10] = "Hello from"
print(msg)
Welcome to
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-67eb98679154> in <module>
      3 msg = "Welcome to Too Old to Code!"
      4 print(msg[0:10])
----> 5 msg[0:10] = "Hello from"
      6 print(msg)

TypeError: 'str' object does not support item assignment

Okay, so I can’t change a character of a given string. That would seem to indicate it is immutable. Let’s have a look at the object ids for a few situations.

In [4]:
# let's look at object ids
msg = "Welcome to Too Old to Code!"
print(id(msg))
msg = "Welcome to Too Old to Code!"
print(id(msg))
msg2 = "Welcome to Too Old to Code!"
print(id(msg))
msg = msg.join(" Me")
print(id(msg))
1923301897280
1923301898080
1923301898080
1923301781264

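So, assigning exactly the same string to the same variable name does not result in the same object, at least not here. And building a new string from msg and assigning it to the same label clearly generates a new object. Do note that msg.join(" Me") does not append " Me" to msg. It uses msg as the separator between the characters of " Me". Appending would be msg + " Me". Either way a new object is created. Yup, immutable.

But do pay attention to the third id. That print repeats id(msg) rather than showing id(msg2), so the matching value only tells us msg was not touched in between. Whether two variables given equal string values end up sharing one object is an implementation detail. Python may save memory by re-using an existing string object with the same value, but that is not guaranteed. Will leave it to you to print id(msg2) and test that out.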
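For anyone who does want to test it, here is a small sketch that separates equality from identity. It builds the strings at run time so the compiler has no chance to merge equal literals. The variable names are just for illustration.

```python
import sys

# Build two equal strings at run time.
parts = ["Welcome to", "Too Old to Code!"]
first = " ".join(parts)
second = " ".join(parts)

print(first == second)   # same value
print(first is second)   # same object? an implementation detail, don't rely on it

# sys.intern() returns one canonical object per distinct string value.
first_i = sys.intern(first)
second_i = sys.intern(second)
print(first_i is second_i)
```

The practical upshot: compare strings with ==, and keep is for checks like `x is None`.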

Why Immutability?

As near as I can make out the primary reason is dictionaries. From the documentation:

A mapping object maps hashable values to arbitrary objects. Mappings are mutable objects.

And,

A dictionary’s keys are almost arbitrary values. Values that are not hashable, that is, values containing lists, dictionaries or other mutable types (that are compared by value rather than by object identity) may not be used as keys.

Again, pretty easy to understand why. If I use a variable as the key to create a key/value pair in a dictionary, the object the variable refers to is hashed, and that hash value is what Python uses to decide where to store, and later find, the entry. The hash function is designed to limit the number of hashing collisions. That is, different values generating the same hash value. All good.

However, let’s say I could change the content assigned to the variable name. If so, I would almost certainly no longer be able to use that variable name to access the data I originally stored in the dictionary. By design, that new content value will probably generate a completely new hash value. Under the hood, Python would be looking elsewhere for the item’s value based on the key’s new hash value.
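The built-in types won't let me try this, but a custom class will. The sketch below uses a hypothetical Key class whose hash follows a mutable attribute, which is exactly what one should not do in real code.

```python
class Key:
    """Hypothetical key class whose hash follows a mutable attribute."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Key) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


key = Key(1)
lookup = {key: "data"}
print(key in lookup)        # found while the hash still matches

key.value = 2               # mutate the key after it was stored

# Hashes like the stored entry, but no longer equals the stored key.
print(Key(1) in lookup)
# Equals the stored key, but its hash differs from the one recorded.
print(Key(2) in lookup)
# The entry is still in there, just unreachable by value.
print(len(lookup))
```

Both lookups fail even though the dictionary still holds one entry.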

Python’s tuples are also immutable and hashable. I have used that situation to create multi-dimensional spaces with a dictionary. The tuple represents the coordinates of a specific point in the space. And, in fact, when most points in the space are empty this is often a more efficient representation than trying to create multi-dimensional lists, or using NumPy arrays. But it would not have been possible if tuples weren’t immutable and hashable.
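A rough sketch of the tuple-as-coordinate idea. The space dictionary, its contents and the neighbours() helper are invented for this example.

```python
# Sparse 3-D space: only occupied points are stored.
space = {
    (0, 0, 0): "origin",
    (1, 0, 0): "beside origin",
    (12, 14, -8): "far away",
}


def neighbours(point):
    """Yield the six axis-aligned neighbours of a 3-D point."""
    x, y, z = point
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    for dx, dy, dz in steps:
        yield (x + dx, y + dy, z + dz)


# Empty points cost nothing; .get() supplies a default for them.
print(space.get((5, 5, 5), "empty"))

occupied = [p for p in neighbours((0, 0, 0)) if p in space]
print(occupied)
```

Negative coordinates and unbounded spaces come for free, which is awkward to arrange with nested lists.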

Container and Sequence Type Similarities

Let’s define some variables for use in the following examples.

In [5]:
# Let's define some variables for use in what follows.
a_str = "Welcome to Too Old to Code!"
b_str = "aka me"
a_list = ['python', 'perl', 'fortran', 'c', 'lisp']
b_list = [x for x in range(0, 97, 16)]
a_tuple = (12, 14, -8)
b_tuple = ('x', 'y', 'z')
a_set = {'Python', 'Git', 'Numpy', 'scikit-learn', 'pandas'}
b_set = {'calculus', 'algebra', 'trigonometry', 'statistics', 'probability'}
a_dict = {'fname': 'Harry', 'sname': 'Houdini', 'handcuffs': 25, 'rope': 50, 'locks': 75}
b_dict = {'cages': 12, 'barrels': 9, 'guns': 6}
a_rng = range(1, 38, 2)
b_rng = range(0, 257, 16)

Strings aren’t a container type, but they are a sequence type. All these sequence types are indexable and iterable. And, along with the container types, share some common methods/operations, with a few differences. Let’s start by looking at these common methods.

x in seq or x not in seq
The first is true if x is in the sequence, false otherwise. Vice versa for the latter.
For most sequences this is a simple one element test, but with some sequence types, e.g. strings, you can test for subsequences.
In [6]:
# common sequence methods/operations
print(f"'to' in a_str -> {'to' in a_str}")
print(f"'java' not in a_list -> {'java' not in a_list}")
print(f"16 in a_tuple -> {16 in a_tuple}")
print(f"'Git' not in a_set -> {'Git' not in a_set}")
print(f"'Harry' in a_dict -> {'Harry' in a_dict}")
print(f"'fname' in a_dict -> {'fname' in a_dict}")
print(f"'Harry' in a_dict.values() -> {'Harry' in a_dict.values()}")
print(f"'fname' not in a_dict.keys() -> {'fname' not in a_dict.keys()}")
print(f"5 not in a_rng -> {5 not in a_rng}")
'to' in a_str -> True
'java' not in a_list -> True
16 in a_tuple -> False
'Git' not in a_set -> False
'Harry' in a_dict -> False
'fname' in a_dict -> True
'Harry' in a_dict.values() -> True
'fname' not in a_dict.keys() -> False
5 not in a_rng -> False
s1 + s2
This concatenates the two sequences.
Do note that concatenating immutable sequences results in a new object. And sequence types that only support item sequences that follow a specific pattern, e.g. range, don't allow concatenation.
Dictionaries and sets are not sequence types, so do not support concatenation with the + operator.
In [7]:
print(f"a_str + ' ' + b_str -> {a_str + ' ' + b_str}")
print(f"a_list + b_list -> {a_list + b_list}")
print(f"b_tuple + a_tuple -> {b_tuple + a_tuple}")
# following will not work
# print(f"a_dict + b_dict -> {a_dict + b_dict}")
# print(f"a_set + b_set -> {a_set + b_set}")
# print(f"a_rng + b_rng -> {a_rng + b_rng}")
a_str + ' ' + b_str -> Welcome to Too Old to Code! aka me
a_list + b_list -> ['python', 'perl', 'fortran', 'c', 'lisp', 0, 16, 32, 48, 64, 80, 96]
b_tuple + a_tuple -> ('x', 'y', 'z', 12, 14, -8)
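The types that reject + can still be combined, just with other tools. A small sketch, using throwaway variables rather than the ones defined earlier:

```python
from itertools import chain

tools = {'Python', 'Git'}
maths = {'calculus', 'Git'}
act = {'fname': 'Harry', 'rope': 50}
props = {'rope': 60, 'guns': 6}

both_sets = tools | maths            # set union, duplicates collapse
both_dicts = {**act, **props}        # right-hand value wins on a key clash
# On Python 3.9+ the same merge can be written as: act | props

# Ranges have to be chained or materialised into another sequence type.
both_ranges = list(chain(range(3), range(10, 13)))

print(both_sets)
print(both_dicts)
print(both_ranges)
```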
seq * n or n * seq
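This repeats the sequence n times, i.e. it is equivalent to concatenating n copies of seq. The items themselves are not copied. They are referenced multiple times.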
Again, it will not work for ranges, sets or dictionaries.
In [8]:
'a' * 9
3 * [42]
b_tuple * 3
print(b_tuple)
Out[8]:
'aaaaaaaaa'
Out[8]:
[42, 42, 42]
Out[8]:
('x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z')
Out[8]:
('x', 'y', 'z')
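One thing to watch with repetition: the items are referenced again, not copied. With immutable items that is harmless, with nested lists it bites. A short sketch with made-up names:

```python
row = [0, 0]
grid_shared = [row] * 3          # three references to the same inner list
grid_shared[0][0] = 99           # the change shows up in every "row"

# A comprehension builds three distinct inner lists instead.
grid_separate = [[0, 0] for _ in range(3)]
grid_separate[0][0] = 99

print(grid_shared)
print(grid_separate)
```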
len(seq)
Returns the number of items in or the length of the sequence or container.
In [9]:
print(f"len('{a_str}') -> {len(a_str)}")
print(f"len({b_list}) -> {len(b_list)}")
print(f"len({a_tuple}) -> {len(a_tuple)}")
print(f"len({b_set}) -> {len(b_set)}")
print(f"len({a_dict}) -> {len(a_dict)}")
print(f"len({b_rng}) -> {len(b_rng)}")
len('Welcome to Too Old to Code!') -> 27
len([0, 16, 32, 48, 64, 80, 96]) -> 7
len((12, 14, -8)) -> 3
len({'trigonometry', 'statistics', 'calculus', 'probability', 'algebra'}) -> 5
len({'fname': 'Harry', 'sname': 'Houdini', 'handcuffs': 25, 'rope': 50, 'locks': 75}) -> 5
len(range(0, 257, 16)) -> 17
seq[i]
Get the ith item of the sequence. Zero based, i.e. the first item is `seq[0]`. Negative indexes start from the end of the sequence. That is `seq[-1]` would be the last element in the sequence.
Sets are *unordered* collections. As such, they do not support indexing, slicing or the like.
Not going to bother with any examples at this time.
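For completeness, a minimal sketch of indexing, including what happens when it goes wrong. The variables are stand-ins defined here so the snippet runs on its own.

```python
langs = ['python', 'perl', 'fortran', 'c', 'lisp']
word = "Code!"
tools = {'Git', 'pandas'}

print(langs[0], langs[-1])       # first and last item
print(word[1], word[-2])         # works the same way for strings
print(langs[-1] is langs[len(langs) - 1])   # seq[-i] is seq[len(seq) - i]

# An index past the end raises IndexError.
try:
    langs[5]
except IndexError as err:
    print(f"IndexError: {err}")

# Sets are unordered, so indexing them raises TypeError.
try:
    tools[0]
except TypeError as err:
    print(f"TypeError: {err}")
```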
seq[i:j] and seq[i:j:k]
Slices. The first returns the elements from position i to j-1. The second does the same but only returns every kth item within the range specified by i and j.
The latter slicing operation also provides a quick way to reverse a sequence.
In [10]:
print(f"{a_str}[11:26] -> '{a_str[11:26]}' (a_str[11]: '{a_str[11]}')")
print(f"{a_str}[11:26:2] -> '{a_str[11:26:2]}'")
print(f"{a_str}[11:26:-2] -> '{a_str[11:26:-2]}'")
print(f"{a_str}[-11:-26:-2] -> '{a_str[-11:-26:-2]}' (a_str[-11]: '{a_str[-11]}')")
print(f"{a_str}[-11:-26:2] -> '{a_str[-11:-26:2]}'")
print(f"{a_list}[2:4] -> {a_list[2:4]}")
print(f"{b_tuple}[-3:-1] -> {b_tuple[-3:-1]}")
print(f"{b_rng}[-5:] -> {b_rng[-5:]}")
print(f"list(b_rng[-5:]) -> {list(b_rng[-5:])}")
# won't work
# print(f"{b_set}[-2:-1] -> {b_set[-2:-1]}")
# print(f"{a_dict}[2:5] -> {a_dict[2:5]}")
# print(a_dict['sname':'rope'])

print('\nthere are ways to do most things, but not sure how useful this would ever be in production code')
print(f"\tlist(a_dict.values())[1:4] -> {list(a_dict.values())[1:4]}")

print("\nReverse a sequence:")
print(f"\t'{a_str}'[::-1] -> '{a_str[::-1]}'")
print(f"\t'{b_tuple}'[::-1] -> {b_tuple[::-1]}")
print(f"\t'{a_list[1:4]}'[::-1] -> {a_list[1:4][::-1]}")

Welcome to Too Old to Code![11:26] -> 'Too Old to Code'
      (a_str[11]: 'T')
Welcome to Too Old to Code![11:26:2] -> 'ToOdt oe'
Welcome to Too Old to Code![11:26:-2] -> ''
Welcome to Too Old to Code![-11:-26:-2] -> 'l o teol'
      (a_str[-11]: 'l')
Welcome to Too Old to Code![-11:-26:2] -> ''
['python', 'perl', 'fortran', 'c', 'lisp'][2:4] -> ['fortran', 'c']
('x', 'y', 'z')[-3:-1] -> ('x', 'y')
range(0, 257, 16)[-5:] -> range(192, 272, 16)
list(b_rng[-5:]) -> [192, 208, 224, 240, 256]

There are ways to do most things, but not sure how useful this would ever be in production code
      list(a_dict.values())[1:4] -> ['Houdini', 25, 50]

Reverse a sequence or sub-sequence:
      'Welcome to Too Old to Code!'[::-1] -> '!edoC ot dlO ooT ot emocleW'
      '('x', 'y', 'z')'[::-1] -> ('z', 'y', 'x')
      '['perl', 'fortran', 'c']'[::-1] -> ['c', 'fortran', 'perl']

Interesting what a slice on a range object returns. Another range object suitably modified.
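The two empty results above come from the step pointing the wrong way for the given start and stop. A small sketch of that, plus a reusable slice object. The names are illustrative only.

```python
text = "Welcome to Too Old to Code!"

forward = text[11:26:2]      # start < stop, positive step
empty = text[11:26:-2]       # start < stop, negative step: nothing to walk
backward = text[25:10:-2]    # start > stop, negative step

print(repr(forward), repr(empty), repr(backward))

# A slice can be built once and reused. slice.indices() shows the concrete
# (start, stop, step) it resolves to for a sequence of a given length.
tail = slice(-5, None)
rng = range(0, 257, 16)
resolved = tail.indices(len(rng))
print(resolved, list(rng[tail]))
```

With a negative step the start has to sit to the right of the stop, otherwise there is nothing to return.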
seq.index(x[, i[, j]])
Get the index of the first occurrence of x at or after index i and before index j. The returned index is relative to the start of the sequence, not the start of the slice, if one is specified. If x is not found in the sequence, a ValueError is raised.
With the usual caveats regarding the non-sequence container types: sets and dictionaries have no index() method.
In [11]:
print(f"a_str.index('O') -> {a_str.index('O')}")
print(f"a_str.index('o') -> {a_str.index('o')}")
print(f"a_str.index('o', 15) -> {a_str.index('o', 15)}")
print(f"a_str.index('o', 5, 15) -> {a_str.index('o', 5, 15)}")
print(f"a_str.index('o', 4, 15) -> {a_str.index('o', 4, 15)}")
print(f"a_list.index('perl') -> {a_list.index('perl')}")
print(f"b_rng.index(256) -> {b_rng.index(256)}")
a_str.index('O') -> 15
a_str.index('o') -> 4
a_str.index('o', 15) -> 20
a_str.index('o', 5, 15) -> 9
a_str.index('o', 4, 15) -> 4
a_list.index('perl') -> 1
b_rng.index(256) -> 16
Let's cause an error to be raised. Note: that is probably not how finally: would/should be used. I was just curious if the valid result would be available after the error was raised.
In [12]:
try:
  ndx_perl = a_list.index('perl')
  ndx_Python = a_list.index('Python')
except ValueError as err1:
  print(err1)
else:
  print(f"a_list.index('Python') -> {ndx_Python}")
finally:
  print(f"a_list.index('perl') -> {ndx_perl}")
'Python' is not in list
a_list.index('perl') -> 1
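If a missing item is an expected case rather than an error, one option is to wrap the lookup. The helper index_or_none() below is hypothetical, a sketch rather than anything from the standard library. Strings also offer find(), which reports a miss with -1 instead of raising.

```python
def index_or_none(seq, item):
    """Return seq.index(item), or None when the item is absent."""
    try:
        return seq.index(item)
    except ValueError:
        return None


langs = ['python', 'perl', 'fortran', 'c', 'lisp']
print(index_or_none(langs, 'perl'))      # 1
print(index_or_none(langs, 'Python'))    # None
print("Too Old".find('O'))               # 4
print("Too Old".find('z'))               # -1
```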
min(seq) and max(seq) and seq.count(x)
These pretty much do exactly what they say. And for strings, of course, the comparison is done character by character. If the two characters are different the one with the lower Unicode code point is considered the smaller of the two. For a dictionary, min() and max() look at the keys.
In [13]:
seqs_conts = {'a_str': a_str, 'a_list': a_list, 'a_tuple': a_tuple, 'a_set': a_set, 'a_dict': a_dict, 'a_rng': a_rng}
for sc, seq in seqs_conts.items():
  if sc in ['a_str', 'a_list', 'a_set', 'a_dict']:
    print(f"min({sc}) -> '{min(seq)}'; max({sc}) -> '{max(seq)}'")
  else:
    print(f"min({sc}) -> {min(seq)}; max({sc}) -> {max(seq)}")
min(a_str) -> ' '; max(a_str) -> 't'
min(a_list) -> 'c'; max(a_list) -> 'python'
min(a_tuple) -> -8; max(a_tuple) -> 14
min(a_set) -> 'Git'; max(a_set) -> 'scikit-learn'
min(a_dict) -> 'fname'; max(a_dict) -> 'sname'
min(a_rng) -> 1; max(a_rng) -> 37
Let's see if we can get the min/max for the values of a dictionary.
In [14]:
# to get min/max values of a dict
try:
  min_ad = min(a_dict.values())
  max_ad = max(a_dict.values())
except TypeError as err:
  print(err)
else:
  print(f"min(a_dict.values()) -> {min_ad}; max(a_dict.values()) -> {max_ad}")
'<' not supported between instances of 'int' and 'str'
Oops! The values are a mix of strings and integers, and Python will not order an int against a str.
In [15]:
# to get min/max values of a dict
print(f"min(b_dict.values()) -> {min(b_dict.values())}; max(b_dict.values()) -> {max(b_dict.values())}")
min(b_dict.values()) -> 6; max(b_dict.values()) -> 12
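Two ways around the mixed-type problem, sketched with local copies of the dictionaries so the snippet stands alone: filter down to the comparable values first, or use the key= argument to control what gets compared.

```python
act = {'fname': 'Harry', 'sname': 'Houdini', 'handcuffs': 25, 'rope': 50, 'locks': 75}
props = {'cages': 12, 'barrels': 9, 'guns': 6}

# Keep only the numeric values before comparing.
numbers = [v for v in act.values() if isinstance(v, (int, float))]
print(min(numbers), max(numbers))        # 25 75

# key= changes what is compared, not what is returned:
# here the dictionary keys are ranked by their values.
fewest = min(props, key=props.get)
most = max(props, key=props.get)
print(fewest, most)                      # guns cages
```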
The only variables with repeated items are the strings. But…
In [16]:
a_str.count('o')
b_str.count('a')
a_list.count('perl')
b_list.count('algol')
Out[16]:
6
Out[16]:
2
Out[16]:
1
Out[16]:
0
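When the counts for every item are wanted, calling count() once per item gets tedious. A short sketch using collections.Counter from the standard library instead:

```python
from collections import Counter

letters = Counter("Welcome to Too Old to Code!")

print(letters['o'])             # 6
print(letters['z'])             # 0, missing items count as zero
print(letters.most_common(1))   # [('o', 6)]
```

Counter works on any iterable of hashable items, so lists and tuples are fine too.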

Done

You know, I enjoyed this exercise. Even learned a thing or two. So, if my head stays a bit screwed about I may continue in this vein for a post or two more. Not sure what I will look at, but…

Not that any of the above really needed a notebook, I did use one. So feel free to download and play with my version of this post’s notebook.

Resources