My original plan, after that last side-step post on using packages, was to begin having a look at feature scaling, encoding and such. But I can’t seem to get my head around doing so.
So I am going to start working on a post looking at some of Python’s built-in data types. It may never get published, but hopefully working on it will give my brain some time to come to grips with a return to working on machine learning and the Titanic dataset.
I am going to look mainly at text, sequence, set and mapping types. Other than numeric types they get used a lot and from my perspective they have many similarities. But in some cases the differences are significant.
Objects
When you instantiate an object in Python, it is assigned a unique object id. The type of the object is fixed at that point, i.e. at run time, and cannot be changed. The variable label is linked to the object id.
But, to be sure, a variable label can always be linked to a different object. So a variable linked to a string object can later in the code be linked to a list object. But that string object can’t be turned into a list object.
t_var = 'test'
print(f"t_var = {t_var} ({id(t_var)})")
t_var = list(range(4))
print(f"t_var = {t_var} ({id(t_var)})")
(ani-3.10) PS R:\learn\py_play> python rek_quick_test.py
t_var = test (2079214320304)
t_var = [0, 1, 2, 3] (2079218594816)
Some of these objects are mutable and some are immutable. The former includes lists (list), dictionaries (dict) and sets (set). Most custom classes are also mutable. The latter, immutable, includes integers (int), floats (float), booleans (bool), ranges (range), strings (str), and tuples (tuple).
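A characteristic of the built-in immutable types is that they are hashable (a tuple only if everything inside it is hashable). They are compared by value, not by object identity. The built-in mutable containers are also compared by value, and that is exactly why they are not hashable. Objects that are compared by object identity, the default for instances of custom classes, are hashable as well.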
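A quick sketch of the difference between value comparison, identity and hashability, using only built-ins. The variable names are made up for illustration.

```python
# Equality (==) compares values; identity (is) compares object ids.
list_a = [1, 2, 3]
list_b = [1, 2, 3]
print(list_a == list_b)   # True: same values
print(list_a is list_b)   # False: two distinct list objects

# Equal immutable values produce equal hashes.
print(hash((1, 2, 3)) == hash((1, 2, 3)))  # True

# Mutable built-in containers refuse to be hashed.
try:
    hash(list_a)
except TypeError as err:
    list_hash_error = str(err)
    print(list_hash_error)
```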
Text
I want to start with the Text Sequence Type because of its immutability. Not hard to get one’s head around, but I initially wondered why? The label would seem to indicate it has similarities to the Sequence Types (list, tuple, range). And, that a string object can be indexed/iterated over like those sequence types is probably the key similarity.
So, once created the contents/state of a string can not be changed. You can of course assign a new string to the same variable/label. But it does not alter the prior string linked to that variable name. It creates a new one. Confirming immutability should be easy enough.
# strings and immutability
# simple example
msg = "Welcome to Too Old to Code!"
print(msg[0:10])
msg[0:10] = "Hello from"  # raises TypeError: 'str' object does not support item assignment
print(msg)
Okay, so I can’t change a character of a given string. That would seem to indicate it is immutable. Let’s have a look at the object ids for a few situations.
# let's look at object ids
msg = "Welcome to Too Old to Code!"
print(id(msg))
msg = "Welcome to Too Old to Code!"
print(id(msg))
msg2 = "Welcome to Too Old to Code!"
print(id(msg2))
msg = msg + " Me"  # concatenate; msg.join(" Me") would instead insert msg between the characters of " Me"
print(id(msg))
So, assigning exactly the same string to the same variable name need not give back the same object; whether it does is an implementation detail of the interpreter. And, clearly, joining two strings together and assigning the result to the label of the first generates a new object. Yup, immutable.
But do pay attention to what happened when I created msg2 with the same value as the preceding msg declaration. Python can save memory by re-using an existing immutable object with the same value, so both labels may end up linked to the same object id. But, if I assign a new value to msg that connection will/should be broken: msg2 stays linked to the original string. Will leave it to you to test that out.
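A minimal sketch of that test. The names `greeting` and `alias` are made up for illustration, and the second label is bound by plain assignment so the result does not depend on whether the interpreter happens to re-use equal string literals.

```python
greeting = "Welcome"
alias = greeting                 # both labels are linked to the same object
same_before = alias is greeting

greeting = greeting + "!"        # builds a new string and re-links the label
same_after = alias is greeting

print(same_before)               # True
print(same_after)                # False
print(alias)                     # Welcome
```

The original string is untouched; only the label `greeting` moved.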
Why Immutability?
As near as I can make out the primary reason is dictionaries. From the documentation:
A mapping object maps hashable values to arbitrary objects. Mappings are mutable objects.
And,
A dictionary’s keys are almost arbitrary values. Values that are not hashable, that is, values containing lists, dictionaries or other mutable types (that are compared by value rather than by object identity) may not be used as keys.
Again, pretty easy to understand why. If I use a variable as the key to create a key/value pair in a dictionary, the contents represented by the variable name is hashed in order to generate the key that Python really uses. The hash function is designed to limit the number of hashing collisions. That is, different values generating the same hash value. All good.
However, let’s say I could change the content assigned to the variable name. If so, I would almost certainly no longer be able to use that variable name to access the data I originally stored in the dictionary. By design, that new content value will probably generate a completely new hash value. Under the hood, Python would be looking elsewhere for the item’s value based on the key’s new hash value.
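To see the problem in action, here is a small sketch with a deliberately badly behaved key. `MutableKey` is a hypothetical class invented for this example: it is hashable, but its hash depends on an attribute that can be changed. None of the built-in types behave this way.

```python
class MutableKey:
    """Hypothetical key type whose hash depends on a changeable attribute."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, MutableKey) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


key = MutableKey(1)
lookup = {key: "stored data"}
found_before = key in lookup           # True

key.value = 2                          # the key's hash changes with it
found_after = key in lookup            # False: the lookup uses the new hash
found_fresh = MutableKey(1) in lookup  # False: old hash matches, but values differ

print(found_before, found_after, found_fresh)
print(len(lookup))                     # 1: the entry still exists, but is unreachable by key
```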
Python’s tuples are also immutable and, as long as their items are hashable, hashable. I have used that situation to create multi-dimensional spaces with a dictionary. The tuple represents the coordinates of a specific point in the space. And, in fact, this often is a more efficient representation than trying to create multi-dimensional lists, or using NumPy arrays, particularly when only a small fraction of the points hold data. But it would not have been possible if tuples weren’t immutable and hashable.
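A rough sketch of the tuple-as-coordinate idea. The names (`space`, `neighbours`) and the stored values are invented for illustration; this is not code from any of my earlier posts.

```python
# Sparse 3-D space: only the occupied points are stored.
space = {}
space[(0, 0, 0)] = "origin"
space[(12, 14, -8)] = "probe"

print(space[(12, 14, -8)])            # probe
print(space.get((1, 1, 1), "empty"))  # empty

# Which x-axis neighbours of a point are occupied?
x, y, z = 12, 14, -8
neighbours = [p for p in ((x - 1, y, z), (x + 1, y, z)) if p in space]
print(neighbours)                     # []
```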
Container and Sequence Type Similarities
Let’s define some variables for use in the following examples.
# Let's define some variables for use in what follows.
a_str = "Welcome to Too Old to Code!"
b_str = "aka me"
a_list = ['python', 'perl', 'fortran', 'c', 'lisp']
b_list = [x for x in range(0, 97, 16)]
a_tuple = (12, 14, -8)
b_tuple = ('x', 'y', 'z')
a_set = {'Python', 'Git', 'Numpy', 'scikit-learn', 'pandas'}
b_set = {'calculus', 'algebra', 'trigonometry', 'statistics', 'probability'}
a_dict = {'fname': 'Harry', 'sname': 'Houdini', 'handcuffs': 25, 'rope': 50, 'locks': 75}
b_dict = {'cages': 12, 'barrels': 9, 'guns': 6}
a_rng = range(1, 38, 2)
b_rng = range(0, 257, 16)
Strings aren’t a container type, but they are a sequence type. All these sequence types are indexable and iterable. And, along with the container types, share some common methods/operations, with a few differences. Let’s start by looking at these common methods.
x in seq or x not in seq
- The first is true if x is in the sequence, false otherwise. Vice versa for the latter.
- For most sequences this is a simple one element test, but with some sequence types, e.g. strings, you can test for subsequences.
# common sequence methods/operations
print(f"'to' in a_str -> {'to' in a_str}")
print(f"'java' not in a_list -> {'java' not in a_list}")
print(f"16 in a_tuple -> {16 in a_tuple}")
print(f"'Git' not in a_set -> {'Git' not in a_set}")
print(f"'Harry' in a_dict -> {'Harry' in a_dict}")
print(f"'fname' in a_dict -> {'fname' in a_dict}")
print(f"'Harry' in a_dict.values() -> {'Harry' in a_dict.values()}")
print(f"'fname' not in a_dict.keys() -> {'fname' not in a_dict.keys()}")
print(f"5 not in a_rng -> {5 not in a_rng}")
s1 + s2
- This concatenates the two sequences.
- Do note that concatenating immutable sequences results in a new object. And sequence types that only support item sequences that follow a specific pattern, e.g. range, don't allow concatenation.
- Dictionaries and sets are not sequence types, so do not support concatenation with the + operator.
print(f"a_str + ' ' + b_str -> {a_str + ' ' + b_str}")
print(f"a_list + b_list -> {a_list + b_list}")
print(f"b_tuple + a_tuple -> {b_tuple + a_tuple}")
# following will not work
# print(f"a_dict + b_dict -> {a_dict + b_dict}")
# print(f"a_set + b_set -> {a_set + b_set}")
# print(f"a_rng + b_rng -> {a_rng + b_rng}")
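Sets and dictionaries do have their own ways of being combined. A small sketch, using shortened, made-up variables rather than the ones defined earlier:

```python
tools = {'Python', 'Git'}
maths = {'calculus', 'Git'}
act = {'fname': 'Harry', 'rope': 50}
props = {'cages': 12, 'rope': 9}

union_set = tools | maths        # set union; duplicates collapse
merged_dict = {**act, **props}   # right-hand values win on duplicate keys

print(sorted(union_set))         # ['Git', 'Python', 'calculus']
print(merged_dict)               # {'fname': 'Harry', 'rope': 9, 'cages': 12}
```

From Python 3.9 on, `act | props` gives the same merged dictionary.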
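seq * n or n * seq
- This repeats the sequence n times; it is equivalent to adding the sequence to itself n times.
- Again, it will not work for ranges, sets or dictionaries.
print(f"'a' * 9 -> {'a' * 9}")
print(f"3 * [42] -> {3 * [42]}")
print(f"b_tuple * 3 -> {b_tuple * 3}")
# b_tuple itself is unchanged; the repetition produced a new tuple
print(b_tuple)
len(seq)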
- Returns the number of items in or the length of the sequence or container.
print(f"len('{a_str}') -> {len(a_str)}")
print(f"len({b_list}) -> {len(b_list)}")
print(f"len({a_tuple}) -> {len(a_tuple)}")
print(f"len({b_set}) -> {len(b_set)}")
print(f"len({a_dict}) -> {len(a_dict)}")
print(f"len({b_rng}) -> {len(b_rng)}")
seq[i]
- Get the ith item of the sequence. Zero based, i.e. the first item is `seq[0]`. Negative indexes start from the end of the sequence. That is `seq[-1]` would be the last element in the sequence.
- Sets are *unordered* collections. As such, they do not support indexing, slicing or the like.
- Not going to bother with any examples at this time.
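For completeness, a minimal indexing sketch anyway. The variables are re-declared here under made-up names so the snippet runs on its own.

```python
langs = ['python', 'perl', 'fortran', 'c', 'lisp']
steps = range(0, 257, 16)
tools = {'Git', 'Numpy'}

print(langs[0])      # python
print(langs[-1])     # lisp
print("Code!"[-1])   # !
print(steps[2])      # 32
print(steps[-1])     # 256

# Sets are unordered, so indexing raises a TypeError.
try:
    tools[0]
except TypeError as err:
    set_index_error = str(err)
    print(set_index_error)
```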
seq[i:j] and seq[i:j:k]
- Slices. The first returns the elements from position i to j-1. The second does the same but only returns every kth item within the range specified by i and j.
- The latter slicing operation also provides a quick way to reverse a sequence.
print(f"{a_str}[11:26] -> '{a_str[11:26]}' (a_str[11]: '{a_str[11]}')")
print(f"{a_str}[11:26:2] -> '{a_str[11:26:2]}'")
print(f"{a_str}[11:26:-2] -> '{a_str[11:26:-2]}'")
print(f"{a_str}[-11:-26:-2] -> '{a_str[-11:-26:-2]}' (a_str[-11]: '{a_str[-11]}')")
print(f"{a_str}[-11:-26:2] -> '{a_str[-11:-26:2]}'")
print(f"{a_list}[2:4] -> {a_list[2:4]}")
print(f"{b_tuple}[-3:-1] -> {b_tuple[-3:-1]}")
print(f"{b_rng}[-5:] -> {b_rng[-5:]}")
print(f"list(b_rng[-5:]) -> {list(b_rng[-5:])}")
# won't work
# print(f"{b_set}[-2:-1] -> {b_set[-2:-1]}")
# print(f"{a_dict}[2:5] -> {a_dict[2:5]}")
# print(a_dict['sname':'rope'])
print('\nthere are ways to do most things, but not sure how useful this would ever be in production code')
print(f"\tlist(a_dict.values())[1:4] -> {list(a_dict.values())[1:4]}")
print("\nReverse a sequence:")
print(f"\t'{a_str}'[::-1] -> '{a_str[::-1]}'")
print(f"\t'{b_tuple}'[::-1] -> {b_tuple[::-1]}")
print(f"\t'{a_list[1:4]}'[::-1] -> {a_list[1:4][::-1]}")
- Interesting what a slice on a range object returns. Another range object, suitably modified.
seq.index(x[, i[, j]])
- Get the index of the first occurrence of element x at or after index i and before index j. The returned index is relative to the start of the sequence, not the slice if specified. If x is not found in the sequence, a ValueError is raised.
- With the usual caveats regarding the non-sequence container types.
print(f"a_str.index('O') -> {a_str.index('O')}")
print(f"a_str.index('o') -> {a_str.index('o')}")
print(f"a_str.index('o', 15) -> {a_str.index('o', 15)}")
print(f"a_str.index('o', 5, 15) -> {a_str.index('o', 5, 15)}")
print(f"a_str.index('o', 4, 15) -> {a_str.index('o', 4, 15)}")
print(f"a_list.index('perl') -> {a_list.index('perl')}")
print(f"b_rng.index(256) -> {b_rng.index(256)}")
- Let's cause an error to be raised. Note: that is probably not how finally: would/should be used. I was just curious if the valid result would be available after the error was raised.
try:
    ndx_perl = a_list.index('perl')
    ndx_Python = a_list.index('Python')
except ValueError as err1:
    print(err1)
else:
    print(f"a_list.index('Python') -> {ndx_Python}")
finally:
    print(f"a_list.index('perl') -> {ndx_perl}")
min(seq), max(seq) and seq.count(x)
- These pretty much do exactly what they say. And for strings, of course, the comparison is done character by character. If the two characters are different the one with the lower Unicode value is considered the smaller of the two.
seqs_conts = {'a_str': a_str, 'a_list': a_list, 'a_tuple': a_tuple, 'a_set': a_set, 'a_dict': a_dict, 'a_rng': a_rng}
for sc, seq in seqs_conts.items():
    if sc in ['a_str', 'a_list', 'a_set', 'a_dict']:
        print(f"min({sc}) -> '{min(seq)}'; max({sc}) -> '{max(seq)}'")
    else:
        print(f"min({sc}) -> {min(seq)}; max({sc}) -> {max(seq)}")
- Let's see if we can get the min/max for the values of a dictionary.
# to get min/max values of a dict
try:
    min_ad = min(a_dict.values())
    max_ad = max(a_dict.values())
except TypeError as err:
    print(err)
else:
    print(f"min(a_dict.values()) -> {min_ad}; max(a_dict.values()) -> {max_ad}")
- Oops! Python 3 will not order values of unrelated types, such as the strings and integers in a_dict, so min() and max() raise a TypeError.
# to get min/max values of a dict
print(f"min(b_dict.values()) -> {min(b_dict.values())}; max(b_dict.values()) -> {max(b_dict.values())}")
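If a dictionary does mix value types, one possible workaround is to filter down to comparable values first. A small sketch, assuming only the integer values are of interest; `act`, `counts` and `numeric` are made-up names.

```python
act = {'fname': 'Harry', 'sname': 'Houdini', 'handcuffs': 25, 'rope': 50, 'locks': 75}

# Keep only the integer values before comparing.
counts = [v for v in act.values() if isinstance(v, int)]
print(min(counts), max(counts))       # 25 75

# Or ask for the key with the largest value among the numeric entries.
numeric = {k: v for k, v in act.items() if isinstance(v, int)}
largest_key = max(numeric, key=numeric.get)
print(largest_key)                    # locks
```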
- The only variables with repeated items are the strings. But…
print(f"a_str.count('o') -> {a_str.count('o')}")
print(f"b_str.count('a') -> {b_str.count('a')}")
print(f"a_list.count('perl') -> {a_list.count('perl')}")
print(f"b_list.count('algol') -> {b_list.count('algol')}")
Done
You know, I enjoyed this exercise. Even learned a thing or two. So, if my head stays a bit screwed about I may continue in this vein for a post or two more. Not sure what I will look at, but…
Not that any of the above really needed a notebook, but I did use one. So feel free to download and play with my version of this post’s notebook.