PET575, Data Analysis in Python

(Open in Playground mode. Shift+Enter runs a cell.)

What is Python?

  • Python is a programming language
  • It is completely free
  • It is the most popular language in Machine Learning and Data Analysis
  • One of the most popular languages in general

How is Python different?

No need to declare the type of a variable:
In [1]:
a = 5
b = 3.4
c = "test"

print("Variable 'a' is: " + str(type(a)))
print("Variable 'b' is: " + str(type(b)))
print("Variable 'c' is: " + str(type(c)))
Variable 'a' is: <class 'int'>
Variable 'b' is: <class 'float'>
Variable 'c' is: <class 'str'>
Function definitions and loops do not use braces, only whitespace (indentation). This ensures that code looks tidy.
In [2]:
def print_n_powers_of_2(n): #function definition
    for i in range(n):      #simple loop
        print(2**i)         #note that python starts counting at zero
        
print_n_powers_of_2(6)      #execute the function passing number 6 as parameter n
1
2
4
8
16
32
Python can be run in different ways: as standalone scripts, in an interactive interpreter, or in notebooks like this one.

Note: Use Python 3.x only. Python 2.x also exists, but is deprecated. Much Python 2 code runs unchanged under Python 3, but there are breaking differences.
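
For example (a quick illustration, not from the original notebook), two of the best-known breaking changes:

print("hello")   # Python 3: print is a function
print(7 / 2)     # Python 3: true division, prints 3.5

# In Python 2 the same ideas were written as:
# print "hello"  # print was a statement
# print 7 / 2    # integer division, printed 3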

Working with notebooks
  • memory state is not saved between sessions, but the output is
  • cells can be executed out of sequence
In [3]:
a = a + 1  #add one to the variable a defined earlier. Re-running this cell keeps incrementing a
print(a)
6
Why Python over Matlab?
  • Python has much higher adoption in the industry
  • Python is displacing MATLAB in academia too
  • Python is free, MATLAB is very much not. Many companies simply do not, and will not, own a MATLAB license.

Note: Python's ecosystem is very distributed in nature compared to MATLAB's compact package. If you are searching for a function, use your search engine of choice: Google/DuckDuckGo/Qwant

Python basics

Defining variables

Below are some basic examples of variable definition

In [4]:
a = 5        #python automatically knows if it is int, float, str, etc
b = "rabbit" #strings (text) come in quote marks
c = a + 2    #you can do operations at the same time as definition

Printing output

Printing is the basic way of showing data in Python. It is very forgiving.

In [5]:
print(a)
print(b)
print(c+a)
5
rabbit
12

Lists, NumPy

Lists, arrays, matrices - sequences of values

In [6]:
a = [0, 1, 2, 3]
print(a)
[0, 1, 2, 3]
In [7]:
a = [1, "two", 3.0] #feel free to mix types
print(type(a))
print()
print(type(a[0]))
print(type(a[1]))
print(type(a[2]))
<class 'list'>

<class 'int'>
<class 'str'>
<class 'float'>

To work with such data structures, it is best to use a library called NumPy.

In [8]:
import numpy as np
In [9]:
a = [0, 1, 2, 3]
a = np.asarray(a) #convert python list to a numpy array
print(a)          #note the different printing format than before
[0 1 2 3]

This allows us to do different operations easily:

In [10]:
print(a+2)
[2 3 4 5]
In [11]:
print(a*2)
[0 2 4 6]
In [12]:
b = np.asarray([5,6,5,6])
print(a)
print("    +")
print(b)
print("    =")

print(a+b)
[0 1 2 3]
    +
[5 6 5 6]
    =
[5 7 7 9]

Slices

To get a part of an array, slices are used.

In [13]:
print(f'This is a complete array,                    a:{a}')
print(f'This is a single element of an array,     a[2]:{a[2]}')
print(f'This is a slice of an array,            a[0:2]:{a[0:2]}') #note that a slice is [inclusive:exclusive]
print(f'This is the last element of an array,    a[-1]:{a[-1]}')
This is a complete array,                    a:[0 1 2 3]
This is a single element of an array,     a[2]:2
This is a slice of an array,            a[0:2]:[0 1]
This is the last element of an array,    a[-1]:3
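
Slices also accept an optional step (a small addition, not in the original cells):

print(a[::2])    #every second element: [0 2]
print(a[::-1])   #reversed:             [3 2 1 0]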

Arrays can be nested

In [14]:
#defining a nested list
a = [
    [1,2,3,4],
    [5,6,7,8]
    ]

Caution: Python lists behave differently than numpy arrays!

In [15]:
print(a[0][1])
2

Notice the different indexing notation when converted to a NumPy array

In [16]:
a = np.asarray(a)
print(a[0,1])
2

Slices are particularly useful here

In [17]:
print(a[1,1:3])
[6 7]

CAUTION!

Lists and arrays behave differently from simple variables when assigned.

In [18]:
a = 2
b = a
b = b + 2

print(f'a is equal to {a}')
print(f'b is equal to {b}')
a is equal to 2
b is equal to 4

But, if we do the same thing with lists:

In [19]:
a = [1,1,1,1,1]
b = a
b[1] = 3

print(f'List a looks like this: {a}')
print(f'List b looks like this: {b}')
List a looks like this: [1, 3, 1, 1, 1]
List b looks like this: [1, 3, 1, 1, 1]

Notice that list a also changed. Let's see how numpy arrays behave:

In [20]:
a = np.asarray([1,1,1,1,1])
b = a
b[1] = 3

print(f'Numpy array a looks like this: {a}')
print(f'Numpy array b looks like this: {b}')
Numpy array a looks like this: [1 3 1 1 1]
Numpy array b looks like this: [1 3 1 1 1]

Same thing. There may be differences though:

In [21]:
a = [1,1,1,1,1]
b = a[1:3]
b[1] = 3

print(f'List a looks like this: {a}')
print(f'List b looks like this: {b}')

a = np.asarray([1,1,1,1,1])
b = a[1:3]
b[1] = 3

print(f'Numpy array a looks like this: {a}')
print(f'Numpy array b looks like this: {b}')
List a looks like this: [1, 1, 1, 1, 1]
List b looks like this: [1, 3]
Numpy array a looks like this: [1 1 3 1 1]
Numpy array b looks like this: [1 3]

When slicing, the Python list created a copy, but the NumPy array kept a reference (a view into the original data).
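
If in doubt, NumPy can tell you whether two arrays share memory; a quick sketch:

a = np.asarray([1, 1, 1, 1, 1])
print(np.shares_memory(a, a[1:3]))      #True:  the slice is a view into a
print(np.shares_memory(a, np.copy(a)))  #False: an explicit copy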

Numpy can make a copy of an array if you are explicit about it

In [22]:
a = np.asarray([1,1,1,1,1])
b = np.copy(a)
b[1] = 3

print(f'Numpy array a looks like this: {a}')
print(f'Numpy array b looks like this: {b}')
Numpy array a looks like this: [1 1 1 1 1]
Numpy array b looks like this: [1 3 1 1 1]

NumPy continued

Let's generate some random numbers
  • Average: 10
  • Standard deviation: 2
  • Shape of output: 100

NOTE: you are NOT expected to memorize functions; use Google/DuckDuckGo/Qwant!

https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.normal.html

In [23]:
a = np.random.normal(10, 2, 100) #center, std, size
print(a)
[13.64148509  9.37027558 11.55503141  8.83709703  9.48983833  7.16315348
  8.65966235 10.63161198 10.05489806 10.67819031 10.1622397  10.38964733
 12.28941641  9.79891143  9.71739266 11.97647768  9.37573227  9.42148989
 10.55136732  8.95869513  7.49473412 10.43983755  8.89071675  5.83360665
 13.90892978  6.2087077   7.04445124  6.84085617 10.3794853   8.18177811
 10.3832782  10.95173682  7.256328    7.61397148 10.8075306  13.16793323
 12.54771582  7.33607166  9.76968608  8.97856717 11.83631658  8.22604541
  5.81946378 10.40637312 11.99922346  8.30670111  8.97677516  7.88026374
 10.75595371 10.1664446  11.2172536  12.35007311  8.85038553  7.81225402
 10.49074409 10.57742111 10.83404465  7.25457263 10.6038723   9.31629219
  8.24065177  6.34435188  9.20605846  7.9061485  10.54353189  8.07566364
 10.31787531 11.39487657  9.31413649 10.7234155   5.45672487 10.8183719
 12.37989951  6.59965447  8.43541314  9.02318332  8.23714223  8.80445471
  5.89617577 13.41059243  7.76289325  9.10151674  8.52072542  9.42389037
  7.52512661 10.75163073 11.70457519  5.30752053  8.23754211 11.35157485
  8.52781693 13.26589646  9.73581113  7.51540953 10.76267654  9.04549763
  7.48191845  9.22324419 11.28929172  9.53355454]

Numpy allows us to quickly calculate many attributes of a given array

In [24]:
a_max = a.max()
print(f'The maximum value is {a_max}') #note the f notation
print(f'The minimum value is {a.min()}') #we can do more in the curly brackets too
print(f'The sum is {a.sum()}')
print(f'The average is {a.mean()}')
print(f'The standard deviation is {a.std()}')
The maximum value is 13.90892977807637
The minimum value is 5.3075205309070315
The sum is 947.6354450008286
The average is 9.476354450008285
The standard deviation is 1.9149614193114533

To avoid too many decimals, we can simply round the numbers

In [25]:
np.round(a.std(),2)
Out[25]:
1.91

Functions

Functions work the same way as in most other programming languages

In [26]:
def a_to_b(a,b):
    return(a**b)

print(a_to_b(3,4))
81

Conditional statements

Same syntax rules as functions. Remember the double equals sign (==) for comparison!

In [27]:
def even_odd(a):
    if a%2 == 0:    #The percent symbol is modulo: divide by 2 and keep the remainder
        print(f'{a} is even')
    else:
        print(f'{a} is odd')
        
even_odd(3)
even_odd(4)
3 is odd
4 is even

Loops

for

The for loop can be used in a number of ways

In [28]:
for i in range(4):
    print(i)
0
1
2
3
In [29]:
for i in range(4,10):
    print(i)
4
5
6
7
8
9
In [30]:
text = "rabbit"
for i in text:
    print(i)
r
a
b
b
i
t
In [31]:
some_numbers = np.random.normal(1,1,10)
for i in some_numbers:
    print(np.round(i,2))
1.51
-0.22
1.41
2.47
1.58
0.77
2.68
-0.81
2.72
0.35

Exercise 1

You are playing a simple board game. You roll a die and advance forward by the number of dots. Easy.

Some fields are special, such as "wait one round" or "roll again".

One field is very special. The rules state:

  • roll dice 3 times
  • you move forward by the two lowest numbers
  • you move back by the highest number

Example: you roll 2, 6 and 4. You move forwards by 2 + 4 and move backwards by 6

Question: Is it a good, or a bad field? What is the mean and median result?

In [32]:
def monte_carlo(rolls):
    
    results = []   #initiate an empty list
    
    for i in range(rolls):  #Perform given number of simulations
        sample = np.random.randint(1,7,3) #Return random integers from low (inclusive) to high (exclusive).
        sample = np.sort(sample) #sort the values ascending
        results.append(sample[0]+sample[1]-sample[2])  #apply the rule and append it to the list
        
    return(results)  #return the results

Execute the simulation. You can see how increasing the number of iterations increases the calculation time

In [33]:
a = monte_carlo(10000)

Print requested values

In [34]:
print(f'The mean movement was {np.mean(a)}')
print(f'The median movement was {np.median(a)}')
print(f'Achieved values are {np.unique(a)}')
The mean movement was 0.6141
The median movement was 1.0
Achieved values are [-4 -3 -2 -1  0  1  2  3  4  5  6]

Data visualization

For making various charts, MatPlotLib is the most popular library. Its syntax is similar to MATLAB's.

Seaborn is an add-on to MatPlotLib that makes generating more complex charts easier.

First, let's import the libraries:

In [35]:
import matplotlib.pyplot as plt #matplotlib is the most common plotting library for Python.
import seaborn as sns   #seaborn is an extension to matplotlib


sns.set() #this call sets the plot styling to seaborn's defaults.

Let's plot a distribution plot

In [36]:
bins = len(np.unique(a))    #define variable "bins" by counting how many unique values there were in the simulation
sns.distplot(a, bins = bins, hist=True, kde=False, norm_hist=True) #make a histogram
plt.xticks(np.unique(a))    #defining which ticks should be plotted on x axis
plt.show()     #call to show the plot. Note that we mixed seaborn and matplotlib functions. 

Experiment on your own: Does it make a difference how many sides the die has?
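
One possible way to test this (a sketch; monte_carlo_n_sided is a hypothetical variant, not part of the original notebook):

def monte_carlo_n_sided(rolls, sides=6):   #hypothetical generalization of monte_carlo above
    results = []
    for i in range(rolls):
        sample = np.sort(np.random.randint(1, sides + 1, 3))  #three rolls of an n-sided die, sorted
        results.append(sample[0] + sample[1] - sample[2])     #same rule as before
    return results

for sides in [4, 6, 8, 12, 20]:
    print(f'{sides}-sided die: mean movement {np.mean(monte_carlo_n_sided(10000, sides)):.3f}')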

More visualization

Let's generate some data to plot

In [37]:
n = 10000
theta = np.random.normal(np.pi*3/4, 1, n)  #random numbers with normal distribution, centered at 3/4 pi
noise_x = np.random.normal(0,0.2,n) #generating numpy arrays with noise
noise_y = np.random.normal(0,0.2,n)
x = np.sin(theta)+noise_x #calculating x coordinate
y = np.cos(theta)+noise_y #calculating y coordinate

We can plot the data as a joint plot with a hex plot in the middle. This is very convenient for showing density data.

Seaborn makes creating such relatively complex charts very easy.

In [38]:
sns.jointplot(x, y, kind="hex", gridsize=50, cmap="inferno")
Out[38]:
<seaborn.axisgrid.JointGrid at 0x1ca9bc6b710>

Alternatively we can use kernel density estimate for a smooth plot.

In [39]:
sns.jointplot(x=x, y=y, kind="kde",n_levels=35, cmap="viridis", cbar=True)
Out[39]:
<seaborn.axisgrid.JointGrid at 0x1ca9d17bdd8>

Some plots can be combined. Note slightly different notation for generating the plot.

In [40]:
n = 1000
theta = np.random.normal(np.pi*3/4, 0.6, n)
noise_x = np.random.normal(0,0.1,n)
noise_y = np.random.normal(0,0.1,n)
x = np.sin(theta)+noise_x
y = np.cos(theta)+noise_y

noise_v = np.random.normal(0,0.2,n)
noise_w = np.random.normal(0,0.2,n)
v = np.sin(theta)*0.2+noise_v
w = np.cos(theta)*0.2+noise_w


ax = sns.kdeplot(x, y, shade=True, cmap="Blues", shade_lowest=False)
ax = sns.kdeplot(v, w, shade=True, cmap="Reds", shade_lowest=False)

Line plot, scatter plot

In [41]:
n = 100
x = np.linspace(0,10,n) #create an array that starts at 0, ends at 10 and has n evenly spaced elements
y = np.sin(x)

Very simple plotting with just MatPlotLib

In [42]:
plt.plot(y)
Out[42]:
[<matplotlib.lines.Line2D at 0x1ca9d420550>]

Seaborn requires both parameters. Notice the x-axis ticks: in the example above they correspond to the index of the NumPy array.

In [43]:
sns.lineplot(x,y)
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9d4569b0>

Best practice is to specify both axes; the example below shows why:

In [44]:
x = np.random.uniform(0,10,n)   #create an array with random, uniform distribution
                                #between 0 and 10
x = np.sort(x)                  #sort. This is now an unevenly sampled domain

y = np.sin(x)

plt.plot(y)
Out[44]:
[<matplotlib.lines.Line2D at 0x1ca9cf6b080>]
In [45]:
plt.plot(x,y)
Out[45]:
[<matplotlib.lines.Line2D at 0x1ca9bf054a8>]

The first chart is distorted due to uneven sampling, even though the signal is perfectly noise-free. When both x and y variables are specified, the signal is plotted correctly

Scatter plots

Very similar to line plots; they work better with noisy data

In [46]:
n = 200
x = np.linspace(0,10,n) #create an array that starts at 0, ends at 10 and has n evenly spaced elements
y = np.sin(x)+np.random.normal(0,0.1,n)

outlier_x = np.random.uniform(0,10,30)
outlier_y = np.random.uniform(-1.5, 1.5, 30)

x = np.append(x, outlier_x)
y = np.append(y, outlier_y)

sns.lineplot(x,y)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9d54ec88>

The line plot looks very bad; compare with the scatter plot

In [47]:
sns.scatterplot(x,y)
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9d5b0cf8>

One can plot multiple plots at the same time

In [48]:
n = 500
x = np.linspace(1,100,n)
y1 = np.cumsum(np.random.normal(0,1,n))
y2 = np.cumsum(np.random.normal(-0.1,1,n))
y3 = np.cumsum(np.random.normal(0,2,n))

sns.lineplot(x,y1, label="First one")
sns.lineplot(x,y2, label="Second one")
sns.lineplot(x,y3, label="Third one")
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9d544a58>

Other plot types

There are many more plot types; see the documentation for MatPlotLib and Seaborn for more information

In [49]:
n = 500
x = np.linspace(1,40,n)
y = 0.3*x+np.sin(x*10)

plt.polar(x,y)
Out[49]:
[<matplotlib.lines.Line2D at 0x1ca9d6a25c0>]

Pandas

Pandas is an extremely popular package in data science. In short, it deals with 2D tabular data structures called DataFrames and has a plethora of data science functions.

Let's start by exploring the Titanic dataset

In [50]:
import numpy as np  #import numpy
import pandas as pd  #import pandas
import seaborn as sns #import seaborn
import matplotlib.pyplot as plt #import matplotlib
df = sns.load_dataset('titanic') #load the titanic dataset to a pandas dataframe
sns.set() #seaborn's defaults

It is very simple to look around the dataframe. Use .head() to view the top 5 rows, or .head(n) for a chosen number of rows.

In [51]:
df.head()
Out[51]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True

.describe() returns basic statistical data for each column

In [52]:
df.describe()
Out[52]:
survived pclass age sibsp parch fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

To get a list of all the attributes (columns) in a dataset, call list(df).

In [53]:
list(df)
Out[53]:
['survived',
 'pclass',
 'sex',
 'age',
 'sibsp',
 'parch',
 'fare',
 'embarked',
 'class',
 'who',
 'adult_male',
 'deck',
 'embark_town',
 'alive',
 'alone']
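
An equivalent pandas idiom, for reference:

print(df.columns.tolist())  #same column names via the columns attribute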

Sorting.

It is sometimes convenient to sort the DataFrame by an attribute. Note that in the example below the original dataframe remains untouched (unsorted).

In [54]:
df.sort_values(by="age").head(15)
Out[54]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
803 1 3 male 0.42 0 1 8.5167 C Third child False NaN Cherbourg yes False
755 1 2 male 0.67 1 1 14.5000 S Second child False NaN Southampton yes False
644 1 3 female 0.75 2 1 19.2583 C Third child False NaN Cherbourg yes False
469 1 3 female 0.75 2 1 19.2583 C Third child False NaN Cherbourg yes False
78 1 2 male 0.83 0 2 29.0000 S Second child False NaN Southampton yes False
831 1 2 male 0.83 1 1 18.7500 S Second child False NaN Southampton yes False
305 1 1 male 0.92 1 2 151.5500 S First child False C Southampton yes False
827 1 2 male 1.00 0 2 37.0042 C Second child False NaN Cherbourg yes False
381 1 3 female 1.00 0 2 15.7417 C Third child False NaN Cherbourg yes False
164 0 3 male 1.00 4 1 39.6875 S Third child False NaN Southampton no False
183 1 2 male 1.00 2 1 39.0000 S Second child False F Southampton yes False
386 0 3 male 1.00 5 2 46.9000 S Third child False NaN Southampton no False
172 1 3 female 1.00 1 1 11.1333 S Third child False NaN Southampton yes False
788 1 3 male 1.00 1 2 20.5750 S Third child False NaN Southampton yes False
642 0 3 female 2.00 3 2 27.9000 S Third child False NaN Southampton no False
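
If you do want to keep the sorted order, assign the result to a (new) variable; a one-line sketch:

df_sorted = df.sort_values(by="age")  #sort_values returns a new, sorted DataFrame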

To see who was the oldest on the Titanic, sort descending. Note the additional .dropna() that drops rows containing missing values.

In [55]:
df.dropna().sort_values(by="age", ascending=False).head(15)
Out[55]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
630 1 1 male 80.0 0 0 30.0000 S First man True A Southampton yes True
96 0 1 male 71.0 0 0 34.6542 C First man True A Cherbourg no True
745 0 1 male 70.0 1 1 71.0000 S First man True B Southampton no False
54 0 1 male 65.0 0 1 61.9792 C First man True B Cherbourg no False
456 0 1 male 65.0 0 0 26.5500 S First man True E Southampton no True
438 0 1 male 64.0 1 4 263.0000 S First man True C Southampton no False
275 1 1 female 63.0 1 0 77.9583 S First woman False D Southampton yes False
252 0 1 male 62.0 0 0 26.5500 S First man True C Southampton no True
625 0 1 male 61.0 0 0 32.3208 S First man True D Southampton no True
170 0 1 male 61.0 0 0 33.5000 S First man True B Southampton no True
587 1 1 male 60.0 1 1 79.2000 C First man True B Cherbourg yes False
366 1 1 female 60.0 1 0 75.2500 C First woman False D Cherbourg yes False
11 1 1 female 58.0 0 0 26.5500 S First woman False C Southampton yes True
487 0 1 male 58.0 0 0 29.7000 C First man True B Cherbourg no True
268 1 1 female 58.0 0 1 153.4625 S First woman False C Southampton yes False

To drop NaNs in the age column only, a more complex operation is necessary.

Notice that we get more results: the deck attribute is full of NaNs.

In [56]:
df2 = df[np.isfinite(df['age'])]                        #create a new dataframe from the rows of df
                                                        #where df['age'] is finite, ie. not NaN or inf
                                                        #Note: boolean indexing returns a new DataFrame
df2.sort_values(by="age", ascending=False).head(15)
Out[56]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
630 1 1 male 80.0 0 0 30.0000 S First man True A Southampton yes True
851 0 3 male 74.0 0 0 7.7750 S Third man True NaN Southampton no True
493 0 1 male 71.0 0 0 49.5042 C First man True NaN Cherbourg no True
96 0 1 male 71.0 0 0 34.6542 C First man True A Cherbourg no True
116 0 3 male 70.5 0 0 7.7500 Q Third man True NaN Queenstown no True
672 0 2 male 70.0 0 0 10.5000 S Second man True NaN Southampton no True
745 0 1 male 70.0 1 1 71.0000 S First man True B Southampton no False
33 0 2 male 66.0 0 0 10.5000 S Second man True NaN Southampton no True
54 0 1 male 65.0 0 1 61.9792 C First man True B Cherbourg no False
280 0 3 male 65.0 0 0 7.7500 Q Third man True NaN Queenstown no True
456 0 1 male 65.0 0 0 26.5500 S First man True E Southampton no True
438 0 1 male 64.0 1 4 263.0000 S First man True C Southampton no False
545 0 1 male 64.0 0 0 26.0000 S First man True NaN Southampton no True
275 1 1 female 63.0 1 0 77.9583 S First woman False D Southampton yes False
483 1 3 female 63.0 0 0 9.5875 S Third woman False NaN Southampton yes True
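
An alternative sketch using pandas' own dropna with a subset (note that, unlike np.isfinite, this removes only NaN, not inf):

df2 = df.dropna(subset=['age'])  #drop rows where the 'age' column is NaN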

We can quickly analyse the age distribution among the passengers

In [57]:
df = df[np.isfinite(df['age'])]                                                 #get only results with finite age
sns.distplot(df[df['who'] == "man"]['age'], bins=10, hist=True, label="men")    #histogram of men
sns.distplot(df[df['who'] == "woman"]['age'], bins=10, hist=True, label="women")#histogram of women
plt.legend()                                                                    #add a legend to chart
Out[57]:
<matplotlib.legend.Legend at 0x1ca9e6f5e10>

We can also quickly show how that distribution is when split into different classes.

In [58]:
df1 = df[df['class']=="First"]      #defining sub-dataframes containing only passengers of first/second/third class
df2 = df[df['class']=="Second"]
df3 = df[df['class']=="Third"]

sns.distplot(df1[df1['who'] == "man"]['age'], hist=True, label="men, 1st class")
sns.distplot(df1[df1['who'] == "woman"]['age'], hist=True, label="women, 1st class")
plt.legend()
plt.show()               #necessary to show the chart. If we omit this, we will have everything in one chart.

sns.distplot(df2[df2['who'] == "man"]['age'], hist=True, label="men, 2nd class")
sns.distplot(df2[df2['who'] == "woman"]['age'], hist=True, label="women, 2nd class")
plt.legend()
plt.show()

sns.distplot(df3[df3['who'] == "man"]['age'], hist=True, label="men, 3rd class")
sns.distplot(df3[df3['who'] == "woman"]['age'], hist=True, label="women, 3rd class")
plt.legend()
plt.show()

Things to notice:

  • The bins are not equal
  • The bins do not overlap
  • Kernel Density Estimate can be misleading
In [59]:
sns.distplot(df1[df1['who'] == "woman"]['age'], bins= 10, hist=True, kde=False, norm_hist=True, label="Histogram")

sns.distplot(df1[df1['who'] == "woman"]['age'], hist=False, label="KDE, bw: 2",
             kde_kws={'gridsize' : 100, "bw" : 2})

sns.distplot(df1[df1['who'] == "woman"]['age'], hist=False, label="KDE, bw: 5",
             kde_kws={'gridsize' : 100, "bw" : 5})

sns.distplot(df1[df1['who'] == "woman"]['age'], hist=False, label="KDE, bw: 15",
             kde_kws={'gridsize' : 100, "bw" : 15})

plt.legend()
Out[59]:
<matplotlib.legend.Legend at 0x1ca9e7c9cf8>

Why not simply Excel?

Prep takes longer, but reuse is much faster with Python.

Let's load weather data from Lerwick weather station (Shetland Islands, Scotland).

In [60]:
df = pd.read_csv("http://www.ux.uis.no/~atunkiel/lerwickdata.csv", delim_whitespace=True)
                                                                #delim_whitespace - whitespace separated file
                #notice that we read csv straight from a website

#year and month are separate in this dataset. Let's merge it and tell pandas that it is a date.
df["date"] = df["mm"].map(str) + "." +df["yyyy"].map(str)
df['date'] = pd.to_datetime(df['date'],format = '%m.%Y')

Quick inspection of the dataframe

In [61]:
df.head()
Out[61]:
yyyy mm tmax[degC] tmin[degC] af[days] rain[mm] sun[hours] date
0 1930 12 7.0 3.6 0 122.4 13.1 1930-12-01
1 1931 1 4.9 0.2 13 108.0 29.7 1931-01-01
2 1931 2 4.4 -0.3 12 138.3 52.6 1931-02-01
3 1931 3 4.2 -1.0 17 18.2 135.7 1931-03-01
4 1931 4 8.1 2.5 1 70.9 134.7 1931-04-01

Let's inspect the maximum recorded temperatures

In [62]:
sns.scatterplot(df['date'],df['tmax[degC]'])
C:\Users\2921228\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\plotting\_converter.py:129: FutureWarning: Using an implicitly registered datetime converter for a matplotlib plotting method. The converter was registered by pandas on import. Future versions of pandas will require you to explicitly register matplotlib converters.

To register the converters:
	>>> from pandas.plotting import register_matplotlib_converters
	>>> register_matplotlib_converters()
  warnings.warn(msg, FutureWarning)
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9ea896d8>

Let's apply a rolling average for a cleaner look at the data.

How wide should the window be? Let's try 10, 20, 30, ... , 90 months

In [63]:
for i in range(1,10):
    sns.scatterplot(df['date'],df['tmax[degC]'].rolling(10*i, center=True).mean(), 
                    alpha=0.3, label=f'rolling average {10*i}')
    plt.show()

It seems that a window of 60 gives the best results. Why 60? Probably because there are 12 months in a year and 60 divides nicely by 12.

Let's check this theory by plotting rolling windows with widths of 12, 24, 36, ... , 108.

In [64]:
for i in range(1,10):
    sns.scatterplot(df['date'],df['tmax[degC]'].rolling(12*i, center=True).mean(),
                    alpha=0.3, label=f'rolling average {12*i}')
    plt.show()

This explains some articles from the 70s:

  • U.S. Scientist Sees New Ice Age Coming (The Washington Post, July 9, 1971)
  • Ice Age Around the Corner (Chicago Tribune, July 10, 1971)
  • New Ice Age Coming – It’s Already Getting Colder (L.A. Times, October 24, 1971)

Pollution was likely the driving factor in the cooling. Currently we have somewhat overcompensated with CO2 emissions.

We can explore some correlations, such as: how is temperature related to rain?

In [65]:
x= df['tmax[degC]']
y= df['rain[mm]']
sns.jointplot(x, y, kind="kde", levels=25, cmap="inferno")
Out[65]:
<seaborn.axisgrid.JointGrid at 0x1ca9ed0a5f8>

We can fairly easily plot LOWESS:

  • Locally
  • Weighted
  • Scatterplot
  • Smoothing
In [66]:
from statsmodels.nonparametric.smoothers_lowess import lowess #import library that will do all the heavy lifting
y = df['tmax[degC]']
x = np.arange(0,len(df['date']),1) #converting into linear space from 0 to length of the dataset
w = lowess(y, x, frac=1/5)         #calculating LOWESS
                                   #frac is the window width in terms of a fraction of full width.

And let's plot the results, with some arithmetic on the x values to recover the year.

In [67]:
sns.scatterplot(x/12+1930,y.rolling(12, center=True).mean(),linewidth=0,color="r",alpha=0.3)
sns.lineplot(w[:,0]/12+1930,w[:,1])
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9ecad940>

We can explore how much smoothing is too much smoothing

In [68]:
plt.figure(figsize=(15,10)) # bigger plot for visibility

sns.scatterplot(x,y.rolling(12, center=True).mean(),linewidth=0,color="r",alpha=0.3) #scatterplot for reference


for i in reversed(range(1,21)):  #note that reversed is used. This is to change the order of plotting of the charts.
    w = lowess(y, x, frac=1/(i*2))
    
    if i != 20:
        sns.lineplot(w[:,0],w[:,1], hue=i, palette='viridis',hue_norm=(1,20), legend=False)
    else:
        sns.lineplot(w[:,0],w[:,1], hue=i, palette='viridis',hue_norm=(1,20))

plt.plot()
Out[68]:
[]

Violin plots are often better than box plots, as they convey more information

In [69]:
x = df['rain[mm]']
y = df['mm']
sns.violinplot(y,x)
plt.show()
sns.boxplot(y,x)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9ebb44a8>

Real drilling data

Let's consider some real drilling data. We can import it from the web again, straight out of a zip file.

In [70]:
df=pd.read_csv("http://www.ux.uis.no/~atunkiel/Norway-NA-15_$47$_9-F-9%20A%20depth.zip")
In [71]:
list(df)
Out[71]:
['Unnamed: 0',
 'Unnamed: 0.1',
 'Measured Depth m',
 'TOFB s',
 'AVG_CONF unitless',
 'MIN_CONF unitless',
 'Average Rotary Speed rpm',
 'STUCK_RT unitless',
 'Corrected Surface Weight on Bit kkgf',
 'Corrected Total Hookload kkgf',
 'MWD Turbine RPM rpm',
 'MWD Raw Gamma Ray 1/s',
 'MWD Continuous Inclination dega',
 'Pump 2 Stroke Rate 1/min',
 'Rate of Penetration m/h',
 'Bit Drill Time h',
 'Corrected Hookload kkgf',
 'MWD Gravity Toolface dega',
 'Mud Density Out g/cm3',
 'MWD GR Bit Confidence Flag %',
 'Pump Time h',
 'PowerUP Shock Rate 1/s',
 'Total SPM 1/min',
 'HKLO kkgf',
 'Average Hookload kkgf',
 'Total Hookload kkgf',
 'DRET unitless',
 'Extrapolated Hole TVD m',
 'MWD Gamma Ray (API BH corrected) gAPI',
 'Pump 1 Strokes unitless',
 'ROPIH s/m',
 'EDRT unitless',
 'Pump 1 Stroke Rate 1/min',
 'Rate of penetration m/h',
 'Iso-butane (IC4) ppm',
 'Bit Drilling Run m',
 'MWD Total Shocks unitless',
 'Total Bit Revolutions unitless',
 'Nor-butane (NC4) ppm',
 'Lag Depth (TVD) m',
 'Pump 2 Strokes unitless',
 'Corr. Drilling Exponent unitless',
 'Total Vertical Depth m',
 'Ethane (C2) ppm',
 'TMP In degC',
 'Mud Density In g/cm3',
 'n-Penthane ppm',
 'Mud Density In g/cm3.1',
 'Weight on Bit kkgf',
 'Bit Revolutions  (cum) unitless',
 'Averaged WOB kkgf',
 'Hole Depth (TVD) m',
 'MWD Shock Risk unitless',
 'Bit run number unitless',
 'RHX_RT unitless',
 'Pump 3 Strokes unitless',
 'Total Strokes unitless',
 'Inverse ROP s/m',
 'Pass Name unitless',
 'Pump 4 Stroke Rate 1/min',
 'Iso-pentane (IC5) ppm',
 'Rig Mode unitless',
 'MWD Shock Peak m/s2',
 'Flow Pumps L/min',
 'SPN Sp_RigMode 2hz unitless',
 'Average Standpipe Pressure kPa',
 'Bit Depth m',
 'Rate of Penetration (5ft avg) m/h',
 'Gas (avg) %',
 'Propane (C3) ppm',
 'String weight (rot,avg) kkgf',
 'TOBO s',
 'AJAM_MWD unitless',
 'Tank volume (active) m3',
 '1/2ft ROP m/h',
 'Weight On Hook kkgf',
 'Hole depth (MD) m',
 'Mud Flow In L/min',
 'BHFG unitless',
 'Temperature Out degC',
 'Averaged TRQ kN.m',
 'MWD Continuous Azimuth dega',
 'RGX_RT unitless',
 'MWD DNI Temperature degC',
 'Bit Drilling Time h',
 'HKLI kkgf',
 'Average Surface Torque kN.m',
 'Methane (C1) ppm',
 'MWD Magnetic Toolface dega',
 'Total Downhole RPM rpm',
 'SHK3TM_RT min',
 'Elapsed time in-slips s',
 'Stand Pipe Pressure kPa',
 'Pump 3 Stroke Rate 1/min',
 'Averaged RPM rpm',
 'Pump 4 Strokes unitless',
 'Inverse ROP (5ft avg) s/m',
 'S1AC kPa',
 'S2AC kPa',
 'IMWT g/cm3',
 'OSTM s',
 'nameWellbore',
 'name',
 'IMP/ARC Attenuation Conductivity 40-in. at 2 MHz mS/m',
 'ARC Annular Pressure kPa',
 'MWD Collar RPM rpm',
 'IMP/ARC Non-BHcorr Phase-Shift Resistivity 28-in. at 2 MHz ohm.m',
 'IMP/ARC Phase-Shift Conductivity 40-in. at 2 MHz mS/m',
 'Annular Temperature degC',
 'IMP/ARC Non-BHcorr Phase-Shift Resistivity 40-in. at 2 MHz ohm.m',
 'ARC Gamma Ray (BH corrected) gAPI',
 'IMP/ARC Non-BHcorr Attenuation Resistivity 40-in. at 2 MHz ohm.m',
 'MWD Stick-Slip PKtoPK RPM rpm',
 'IMP/ARC Non-BHcorr Attenuation Resistivity 28-in. at 2 MHz ohm.m',
 'IMP/ARC Phase-Shift Conductivity 28-in. at 2 MHz mS/m']

Verify the contents of the DataFrame. Note that real-life datasets can be overrun with NaNs.

In [72]:
df.head()
Out[72]:
Unnamed: 0 Unnamed: 0.1 Measured Depth m TOFB s AVG_CONF unitless MIN_CONF unitless Average Rotary Speed rpm STUCK_RT unitless Corrected Surface Weight on Bit kkgf Corrected Total Hookload kkgf ... MWD Collar RPM rpm IMP/ARC Non-BHcorr Phase-Shift Resistivity 28-in. at 2 MHz ohm.m IMP/ARC Phase-Shift Conductivity 40-in. at 2 MHz mS/m Annular Temperature degC IMP/ARC Non-BHcorr Phase-Shift Resistivity 40-in. at 2 MHz ohm.m ARC Gamma Ray (BH corrected) gAPI IMP/ARC Non-BHcorr Attenuation Resistivity 40-in. at 2 MHz ohm.m MWD Stick-Slip PKtoPK RPM rpm IMP/ARC Non-BHcorr Attenuation Resistivity 28-in. at 2 MHz ohm.m IMP/ARC Phase-Shift Conductivity 28-in. at 2 MHz mS/m
0 0 0 273.101 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 1 273.253 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 2 273.406 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 3 273.558 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 4 4 273.710 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 115 columns
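
To quantify how NaN-ridden each column is, one quick sketch:

print(df.isna().mean().sort_values(ascending=False).head(10))  #fraction of NaNs per column, worst first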

We can explore the data through charts, tweaking them to best show the data

In [73]:
df = df[np.isfinite(df['Measured Depth m'])]
df = df[np.isfinite(df['MWD Continuous Inclination dega'])]


plt.figure(figsize=(20,10))
sns.scatterplot(df['Measured Depth m'], df['MWD Continuous Inclination dega'],
                linewidth=0,
                hue=df['Average Hookload kkgf'],
                hue_norm=(85,100),
                palette="viridis",
                size=df['Average Surface Torque kN.m'],
                size_norm=(0,10),
                sizes=(20,300)
                )
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9ebda898>

Utilize the .describe() function to gauge the normalization range

In [74]:
df['Average Hookload kkgf'].describe()
Out[74]:
count    1963.000000
mean       94.636537
std         3.698203
min        71.966965
25%        91.820703
50%        96.034577
75%        97.359066
max       104.303565
Name: Average Hookload kkgf, dtype: float64

A simple plot is also useful for understanding the data contents, such as outliers.

In [75]:
df['Average Hookload kkgf'].plot()
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9ebb46d8>

Color palettes

While grayscale is typically preferred in publications, when you do use color, use good color schemes. The examples below are good: the brightness of the colors is consistent with the color scale.

In [76]:
palettes = ['viridis','inferno', 'gray', 'Blues', 'Reds', 'hot', 'cividis']
for i in palettes:
    sns.palplot(sns.color_palette(i,10))

Diverging palettes are also possible, to highlight that something is below or above a certain level (like a physical map)

In [77]:
sns.palplot(sns.diverging_palette(220, 20, n=11))
sns.palplot(sns.diverging_palette(145, 280, s=85, l=25, n=11))
sns.palplot(sns.diverging_palette(255, 133, l=60, n=11, center="dark"))

Avoid the "jet" palette

A rainbow-like palette that is among the most commonly used despite being poor at representing data

https://jakevdp.github.io/blog/2014/10/16/how-bad-is-your-colormap/

https://pdfs.semanticscholar.org/ee79/2edccb2c88e927c81285344d2d88babfb86f.pdf

Seaborn will outright refuse to use it, raising ValueError: No.

In [78]:
#palettes = ['jet']
#for i in palettes:
#    sns.palplot(sns.color_palette(i))

The jet colormap is even more confusing when converted to grayscale.

Don't worry about the code below, it is there just to generate the chart.

In [79]:
def grayify_cmap(cmap):
    """Return a grayscale version of the colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    
    # convert RGBA to perceived greyscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]
    
    return cmap.from_list(cmap.name + "_grayscale", colors, cmap.N)



%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 6,300)
y = np.linspace(0, 3,300)[:, np.newaxis]
z = 10 * np.cos(x ** 2) * np.exp(-y)
cmaps = [plt.cm.jet, grayify_cmap('jet'), plt.cm.gray]
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
fig.subplots_adjust(wspace=0)

for cmap, ax in zip(cmaps, axes):
    im = ax.imshow(z, cmap=cmap)
    ax.set_title(cmap.name)
    fig.colorbar(im, ax=ax)

Use a palette that has lightness corresponding to value for better results

In [80]:
cmaps = [plt.cm.CMRmap, grayify_cmap('CMRmap'), plt.cm.gray]
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
fig.subplots_adjust(wspace=0)

for cmap, ax in zip(cmaps, axes):
    im = ax.imshow(z, cmap=cmap)
    ax.set_title(cmap.name)
    fig.colorbar(im, ax=ax)

Regression

In Python we can easily perform regression using any function.

Let's assume we have a signal governed by an equation \begin{equation*} y = -0.2x^3 + 3x^2 + 30\sin(x) \end{equation*}

This signal is noisy. Noise has a standard deviation of 30.

We have 100 samples in total to work with, but 20 of them are outliers

In [130]:
np.random.seed(10) #fixed random seed, so you always get the same set of random numbers

n = 20 #outlier count
x_0 = np.linspace(-10,10,100-n)  #generate 100 minus n evenly spaced values from -10 to 10
y_0 = -0.2*x_0**3+3*x_0**2+30*np.sin(x_0) #calculate y

sns.lineplot(x_0,y_0,color="red", label="ground truth") #plot clean function

x = x_0  #keep x_0 separate for plotting later
y = y_0 + np.random.normal(0,30,100-n)  #generate new function - same as clean one plus noise with sd=30
sns.scatterplot(x,y,label="truth+noise") #plot the noisy chart

outliers_x = np.random.uniform(-12,12,n) #generate n outliers, x coordinates
outliers_y = np.random.uniform(-200,500,n)#... y coordinates
sns.scatterplot(outliers_x,outliers_y,label="outliers") #plot outliers only

x = np.concatenate([x, outliers_x]) #add the outliers to the x array
y = np.concatenate([y, outliers_y]) #add the outliers to the y array

Let's look at the raw data. You can see that the data is very noisy and difficult to fit a regression to!

In [131]:
sns.scatterplot(x,y,label="outliers",linewidth=0)
Out[131]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caa167ae10>

Let's assume that we know the function itself, but we do not know the parameters; effectively, the known shape of the function is: \begin{equation*} y = a_0 x^3 + a_1 x^2 + a_2 \sin(x) \end{equation*}

We define a function in a parametric way that is suitable for the algorithm that will come later.

In [132]:
def fun(a, x, y):
    return a[0]*x**3+a[1]*x**2+a[2]*np.sin(x) - y

We can import an optimization function from SciPy based on the least squares method

In [133]:
from scipy.optimize import least_squares

Let's develop two models.

The first is optimized for outliers, so that they no longer carry such a heavy weight (a typical issue with the least squares method).

The second is a vanilla least squares method

In [134]:
a0 = np.ones(3) #initiating the parameters with ones

model = least_squares(fun, a0, args=(x,y),loss='soft_l1') #the soft_l1 is the more lenient loss function

model2 = least_squares(fun, a0, args=(x,y))

print(model.x) #show the calculated parameters
[-0.20871253  2.95637215 31.38991214]

The model returned:

$a_0 = -0.20871253$ (originally $-0.2$)

$a_1 = 2.95637215$ (originally $3$)

$a_2 = 31.38991214$ (originally $30$)

These are very close to the original values. Let's inspect the fit graphically. The match is nearly perfect.

In [135]:
plt.figure(figsize=(10,7)) #let's use slightly bigger chart
sns.scatterplot(x,y, label="input data",s=100)

sns.lineplot(x_0,y_0,color="red", lw=2,alpha=1,label="ground truth")

y2 = model.x[0]*x**3+model.x[1]*x**2+model.x[2]*np.sin(x) #calculate y based on the model
sns.lineplot(x, y2, color="green",lw=5,alpha=0.5, label="model, optimized")
Out[135]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caa17b8358>

In comparison, the vanilla least squares method did not perform as well:

In [136]:
plt.figure(figsize=(10,7))
sns.scatterplot(x,y, label="input data",s=100)

sns.lineplot(x_0,y_0,color="red", lw=2,alpha=1,label="ground truth")

y3 = model2.x[0]*x**3+model2.x[1]*x**2+model2.x[2]*np.sin(x)
sns.lineplot(x, y3, color="green",lw=2,dashes="-",alpha=1, label="model, vanilla")
Out[136]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caa11bde10>

Let's calculate $R^2$ values for both models.

Note: in this implementation $R^2$ value can be negative

In [137]:
from sklearn.metrics import r2_score  #in a typical Python fashion, we just import what is needed

print(f'R^2 value for optimized fitting is: {r2_score(y,y2)}' )
print(f'R^2 value for standard fitting is : {r2_score(y,y3)}' )
R^2 value for optimized fitting is: 0.06573837456233356
R^2 value for standard fitting is : 0.27094919231420334

The objectively much better model has a lower $R^2$ value. As you can see, this is not the ultimate metric.
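
One way to see this, as a sketch: since we happen to know the clean ground truth here, we can score both models against it instead of against the noisy samples; the optimized fit should then come out ahead.

y2_clean = model.x[0]*x_0**3 + model.x[1]*x_0**2 + model.x[2]*np.sin(x_0)    #optimized model on clean x
y3_clean = model2.x[0]*x_0**3 + model2.x[1]*x_0**2 + model2.x[2]*np.sin(x_0) #vanilla model on clean x
print(f'R^2 vs ground truth, optimized: {r2_score(y_0, y2_clean)}')
print(f'R^2 vs ground truth, vanilla  : {r2_score(y_0, y3_clean)}')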

Optimization - finding minimum

One class of optimization tasks is, in principle, finding the minimum of a function. Some examples:

  • Find a well path with minimum torque and drag
  • Find the lowest MSE to ROP ratio
  • Squaring the circle

Example 1

Let's find the minimum of the function that we explored before.

We will use a differential evolution algorithm, which is conveniently available in the SciPy library.

We will need to redefine the function slightly to make it work with the algorithm. Minimum-seeking algorithms also often need boundaries; we have to provide those as well.

In [138]:
from scipy.optimize import differential_evolution  #import differential evolution algorithm

bounds = [(-10, 10)]  #define bounds within which algorithm will seek minimum

def fun_min(x):
    return -0.2*x**3+3*x**2+30*np.sin(x)  #redefined function - it simply returns y.


result = differential_evolution(fun_min, bounds) #here the magic happens. We feed the function and the bounds 
                                                 #to the algorithm

sns.lineplot(x_0,y_0,color="gray", lw=2,alpha=1,label="ground truth", zorder=1) #plot the ground truth
                                                                                #zorder defines the plotting
                                                                                #sequence, aka. what is on top
plt.scatter(result.x[0], result.fun[0],marker="x",s=200,c="r", label="minimum", zorder=10)
plt.legend()
Out[138]:
<matplotlib.legend.Legend at 0x1caa160b7b8>

Example 2 - multiparameter function

For future reference, let's see how that works when a function has multiple parameters.

In [149]:
from mpl_toolkits.mplot3d import Axes3D  #importing some things necessary for 3D plot

def fun3d(vars):  #definition of a sample 3D function. See that it still takes only one variable
    x = vars[0]   #but now it is an array, which we have to deconstruct
    y = vars[1]
    return np.sin(x)*np.exp(-x**2-y**2)

x = np.linspace(-3,3,100)  #defining X and Y space
y = np.linspace(-3,3,100)
x, y = np.meshgrid(x, y)   #convert x and y to a mesh, so we get a point for each intersection of x and y
z = fun3d([x,y])           #feed the function

fig = plt.figure(figsize=(10,7))
ax = fig.gca(projection='3d')  #define 3D plot
ax.plot_surface(x, y, z, cmap=plt.cm.viridis, linewidth=0.2, alpha=1)  #let's plot the function to verify
                                                                       #that everything works
plt.show()
In [140]:
bounds = [(-3,3),(-3,3)]  #define the bounds
result = differential_evolution(fun3d, bounds) #feed the function and the bounds to DE algorithm

#and we plot everything again
fig = plt.figure(figsize=(10,7))
ax = fig.gca(projection='3d')
ax.plot_surface(x, y, z, cmap=plt.cm.viridis, linewidth=0.2, alpha=0.3)
ax.scatter(result.x[0], result.x[1], result.fun, c="r", marker="x",s=200)
plt.show()

Example 3 - squaring the circle

What is the side length of a square whose area equals that of a circle? People have tried to solve this since around 1800 BCE. Nowadays we need a few lines of Python.

In [141]:
def quadrature(x):               #function that will return the absolute difference between
                                 #the area of a circle, radius = 1, and square, length x
    return np.abs(np.pi - x**2)  #assume radius = 1, hence pi*1^2 = pi

bounds = [(0,2)]
result = differential_evolution(quadrature, bounds)

ax=plt.gca()

patch = plt.Circle((0,0), radius= 1, fill=False,color="r")
ax.add_patch(patch)

patch = plt.Rectangle((-0.5*result.x,-0.5*result.x),result.x,result.x, fill=False, color="b")
ax.add_patch(patch)

plt.axis('scaled')
plt.show()
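
As a sanity check (not in the original cells): the analytical answer is $\sqrt{\pi} \approx 1.7725$, since the circle has area $\pi \cdot 1^2$.

print(result.x[0])     #side length found by differential evolution
print(np.sqrt(np.pi))  #analytical answer, sqrt(pi)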

To visualize the actual function we minimized:

In [150]:
x = np.linspace(0,2,100)
y = quadrature(x)

sns.lineplot(x,y)
Out[150]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caa7da38d0>

Feature engineering

Oftentimes we want to use a parameter that is not delivered in the database; we need to calculate it. Below is a short example of calculating inclination-based dogleg severity, done on the dataset we used before.

Let's import the dataset again and plot inclination data

In [164]:
df=pd.read_csv("http://www.ux.uis.no/~atunkiel/Norway-NA-15_$47$_9-F-9%20A%20depth.zip")

sns.scatterplot(df['Measured Depth m'],df['MWD Continuous Inclination dega'],linewidth=0,size=1, alpha=0.3)
Out[164]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caa142fda0>

For clarity, let's develop two interim variables:

  • dDepth - depth change
  • dInc - inclination change

We will also smooth out the data a little bit with a rolling mean. Deltas will be calculated over a slightly larger step than one to avoid noise

In [247]:
df.ffill(inplace=True) #forward filling and backward filling to avoid NaNs in the dataset
df.bfill(inplace=True)

step_d=20    #steps for differentiation and rolling average
step_r = 100 #as variables for easy tweaking

df['dDepth'] = df['Measured Depth m'].rolling(step_r, center=True).mean().diff(step_d)
df['dInc'] = df['MWD Continuous Inclination dega'].rolling(step_r, center=True).mean().diff(step_d)
df['DLS'] = df['dDepth']/df['dInc']

sns.scatterplot(df['Measured Depth m'],df['MWD Continuous Inclination dega'],
                hue=df['DLS'],linewidth=0,size=1, alpha=0.3)
Out[247]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caab3272e8>

We immediately see an issue: the color scale that we wanted to use to visualize dogleg is out of whack.

We can inspect basic data parameters with the one-liner we learned about before.

In [248]:
df['dInc'].describe()
Out[248]:
count    16551.000000
mean         0.060214
std          0.148738
min         -0.518800
25%          0.000000
50%          0.006000
75%          0.111800
max          1.677000
Name: dInc, dtype: float64

and some extra info through a histogram

In [249]:
sns.distplot(df['dInc'].dropna())
Out[249]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caaccaeb70>

With this new information, we can specify our color range

In [250]:
plt.figure(figsize=(10,7)) 
sns.scatterplot(df['Measured Depth m'], df['MWD Continuous Inclination dega'],
                linewidth=0,
                hue=df['dInc'],
                hue_norm=(0,0.25),
                palette="viridis",
                )
Out[250]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caab3090f0>

NOTE: This is a simplified approach.

While everything looks correct at first glance, there are a number of potential issues here:

  • We did a rolling average and calculated differences with a fixed step, without checking the actual data frequency. This might have smoothed out some important data. We should resample the data for a consistent result; see the sketch after this list.

  • There are outliers in the dataset that should be taken out first.
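
A minimal sketch of such resampling, assuming Measured Depth increases monotonically (the 0.5 m grid spacing is an arbitrary choice):

depth = df['Measured Depth m'].values
inc = df['MWD Continuous Inclination dega'].values
grid = np.arange(depth.min(), depth.max(), 0.5)  #uniform 0.5 m depth grid (assumed spacing)
inc_uniform = np.interp(grid, depth, inc)        #linearly interpolate inclination onto the grid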

Correlation

Let's explore correlation. We will import the diamonds dataset: a data frame with 53,940 rows and 10 variables:

  • price - price in US dollars (326-18823)
  • carat - weight of the diamond (0.2--5.01)
  • cut - quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • color - diamond colour, from D (best) to J (worst)
  • clarity - a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • x - length in mm (0--10.74)
  • y - width in mm (0--58.9)
  • z - depth in mm (0--31.8)
  • depth - total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
  • table - width of top of diamond relative to widest point (43--95)
In [300]:
df = pd.read_csv('http://www.ux.uis.no/~atunkiel/diamonds.csv')
sns.scatterplot(df['carat'],df['price'])
Out[300]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cacb9f7e80>

The readability of this chart is poor: the points widely overlap each other. We can try a seaborn jointplot.

Note: the colors are log-normalized.

In [318]:
import matplotlib.colors as colors

sns.jointplot(df['carat'],df['price'], kind="hex", gridsize=75,
              cmap="inferno", joint_kws={"norm" : colors.LogNorm()})
Out[318]:
<seaborn.axisgrid.JointGrid at 0x1cad93c6b70>

Many correlations in life are logarithmic in nature. Let's quickly explore whether that is the case here.

You will clearly see that the log-log relationship is by far the most linear. Further analysis may benefit greatly from this finding, as linear correlations are much easier to model than logarithmic ones.
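
A quick sketch of exploiting this finding: fitting a straight line in log-log space amounts to a power-law model $price \approx e^{b} \cdot carat^{a}$.

slope, intercept = np.polyfit(np.log(df['carat']), np.log(df['price']), 1)  #linear fit in log-log space
print(f'log-log fit: slope {slope:.3f}, intercept {intercept:.3f}')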

In [312]:
fig, axs = plt.subplots(2, 2, figsize=(10,10))


axs[0, 0].hexbin(df['carat'],df['price'],xscale='linear', yscale='linear', norm = colors.LogNorm())
axs[0, 0].set_title('linear-linear')

axs[0, 1].hexbin(df['carat'],df['price'],xscale='linear', yscale='log', norm = colors.LogNorm())
axs[0, 1].set_title('linear-log')

axs[1, 0].hexbin(df['carat'],df['price'],xscale='log', yscale='linear', norm = colors.LogNorm())
axs[1, 0].set_title('log-linear')

axs[1, 1].hexbin(df['carat'],df['price'],xscale='log', yscale='log', norm = colors.LogNorm())
axs[1, 1].set_title('log-log')
Out[312]:
Text(0.5, 1.0, 'log-log')

Seaborn allows one to plot a useful pairplot to look at correlations between parameters

In [295]:
sns.pairplot(df.sample(100))
Out[295]:
<seaborn.axisgrid.PairGrid at 0x1cac8243240>

One can put a number on the correlation.

In [315]:
sns.heatmap(df.corr(), cmap="viridis")
Out[315]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cadaffaf98>
In [332]:
n = 1000

results = []
for i in range(n):
    if (np.average(np.random.randint(0,2,11)) <= 3):
        results.append(1)
    else:
        results.append(0)
print(results)