PET575, Data Analysis in Python

(Open in Playground mode. Shift+Enter runs a cell.)

What is Python?

  • Python is a programming language
  • It is completely free
  • It is the most popular language in Machine Learning and Data Analysis
  • One of the most popular languages in general

How is Python different?

No need to declare the type of a variable:
In [1]:
a = 5
b = 3.4
c = "test"

print("Variable 'a' is: " + str(type(a)))
print("Variable 'b' is: " + str(type(b)))
print("Variable 'c' is: " + str(type(c)))
Variable 'a' is: <class 'int'>
Variable 'b' is: <class 'float'>
Variable 'c' is: <class 'str'>
Function definitions and loops do not use braces, only whitespace (indentation). This ensures that code looks tidy.
In [2]:
def print_n_powers_of_2(n): #function definition
    for i in range(n):      #simple loop
        print(2**i)         #note that python starts counting at zero
        
print_n_powers_of_2(6)      #execute the function passing number 6 as parameter n
1
2
4
8
16
32
Python can be run in different ways: as standalone scripts, in an interactive interpreter, or in notebooks like this one.

Note: Use Python 3.x only. Python 2.x also exists, but is deprecated. Much Python 2 code runs unchanged under Python 3, but there are breaking differences.
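
For example (a quick illustration, not from the original notebook), two of the best-known breaking changes:

print("hello")   # Python 3: print is a function
print(7 / 2)     # Python 3: true division, prints 3.5

# In Python 2 the same ideas were written as:
# print "hello"  # print was a statement
# print 7 / 2    # integer division, printed 3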

Working with notebooks
  • memory state is not saved between sessions, but the output is
  • cells can be executed out of sequence
In [3]:
a = a + 1  #add one to the variable a defined earlier. Re-running this cell keeps incrementing a
print(a)
6
Why Python over Matlab?
  • Python has much higher adoption in the industry
  • Python is displacing MATLAB in academia too
  • Python is free, MATLAB is very much not. Many companies simply do not, and will not, own a MATLAB license.

Note: Python's ecosystem is very distributed in nature compared to MATLAB's compact package. If you are searching for a function, use your search engine of choice: Google/DuckDuckGo/Qwant

Python basics

Defining variables

Below are some basic examples of variable definition

In [4]:
a = 5        #python automatically knows if it is int, float, str, etc
b = "rabbit" #strings (text) come in quote marks
c = a + 2    #you can do operations at the same time as definition

Printing output

Printing is the basic way of showing data in Python. It is very forgiving.

In [5]:
print(a)
print(b)
print(c+a)
5
rabbit
12

Lists, NumPy

Lists, arrays, matrices - sequences of values

In [6]:
a = [0, 1, 2, 3]
print(a)
[0, 1, 2, 3]
In [7]:
a = [1, "two", 3.0] #feel free to mix types
print(type(a))
print()
print(type(a[0]))
print(type(a[1]))
print(type(a[2]))
<class 'list'>

<class 'int'>
<class 'str'>
<class 'float'>

To work with such data structures, it is best to use a library called NumPy.

In [8]:
import numpy as np
In [9]:
a = [0, 1, 2, 3]
a = np.asarray(a) #convert python list to a numpy array
print(a)          #note the different printing format than before
[0 1 2 3]

This allows us to do different operations easily:

In [10]:
print(a+2)
[2 3 4 5]
In [11]:
print(a*2)
[0 2 4 6]
In [12]:
b = np.asarray([5,6,5,6])
print(a)
print("    +")
print(b)
print("    =")

print(a+b)
[0 1 2 3]
    +
[5 6 5 6]
    =
[5 7 7 9]

Slices

To get a part of an array, slices are used.

In [13]:
print(f'This is a complete array,                    a:{a}')
print(f'This is a single element of an array,     a[2]:{a[2]}')
print(f'This is a slice of an array,            a[0:2]:{a[0:2]}') #note that a slice is [inclusive:exclusive]
print(f'This is the last element of an array,    a[-1]:{a[-1]}')
This is a complete array,                    a:[0 1 2 3]
This is a single element of an array,     a[2]:2
This is a slice of an array,            a[0:2]:[0 1]
This is the last element of an array,    a[-1]:3
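
Slices also accept an optional step (a small addition, not in the original cells):

print(a[::2])    #every second element: [0 2]
print(a[::-1])   #reversed:             [3 2 1 0]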

Arrays can be nested

In [14]:
#defining a nested list
a = [
    [1,2,3,4],
    [5,6,7,8]
    ]

Caution: Python lists behave differently than numpy arrays!

In [15]:
print(a[0][1])
2

Notice the different indexing notation when converted to a NumPy array

In [16]:
a = np.asarray(a)
print(a[0,1])
2

Slices are particularly useful here

In [17]:
print(a[1,1:3])
[6 7]

CAUTION!

Lists and arrays behave differently from simple variables when assigned.

In [18]:
a = 2
b = a
b = b + 2

print(f'a is equal to {a}')
print(f'b is equal to {b}')
a is equal to 2
b is equal to 4

But, if we do the same thing with lists:

In [19]:
a = [1,1,1,1,1]
b = a
b[1] = 3

print(f'List a looks like this: {a}')
print(f'List b looks like this: {b}')
List a looks like this: [1, 3, 1, 1, 1]
List b looks like this: [1, 3, 1, 1, 1]

Notice that list a also changed. Let's see how numpy arrays behave:

In [20]:
a = np.asarray([1,1,1,1,1])
b = a
b[1] = 3

print(f'Numpy array a looks like this: {a}')
print(f'Numpy array b looks like this: {b}')
Numpy array a looks like this: [1 3 1 1 1]
Numpy array b looks like this: [1 3 1 1 1]

Same thing. There may be differences though:

In [21]:
a = [1,1,1,1,1]
b = a[1:3]
b[1] = 3

print(f'List a looks like this: {a}')
print(f'List b looks like this: {b}')

a = np.asarray([1,1,1,1,1])
b = a[1:3]
b[1] = 3

print(f'Numpy array a looks like this: {a}')
print(f'Numpy array b looks like this: {b}')
List a looks like this: [1, 1, 1, 1, 1]
List b looks like this: [1, 3]
Numpy array a looks like this: [1 1 3 1 1]
Numpy array b looks like this: [1 3]

When slicing, the Python list created a copy, but the NumPy array kept a reference (a view into the original data).
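
If in doubt, NumPy can tell you whether two arrays share memory; a quick sketch:

a = np.asarray([1, 1, 1, 1, 1])
print(np.shares_memory(a, a[1:3]))      #True:  the slice is a view into a
print(np.shares_memory(a, np.copy(a)))  #False: an explicit copy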

Numpy can make a copy of an array if you are explicit about it

In [22]:
a = np.asarray([1,1,1,1,1])
b = np.copy(a)
b[1] = 3

print(f'Numpy array a looks like this: {a}')
print(f'Numpy array b looks like this: {b}')
Numpy array a looks like this: [1 1 1 1 1]
Numpy array b looks like this: [1 3 1 1 1]

NumPy continued

Let's generate some random numbers
  • Average: 10
  • Standard deviation: 2
  • Shape of output: 100

NOTE: you are NOT expected to memorize functions; use Google/DuckDuckGo/Qwant!

https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.normal.html

In [23]:
a = np.random.normal(10, 2, 100) #center, std, size
print(a)
[13.64148509  9.37027558 11.55503141  8.83709703  9.48983833  7.16315348
  8.65966235 10.63161198 10.05489806 10.67819031 10.1622397  10.38964733
 12.28941641  9.79891143  9.71739266 11.97647768  9.37573227  9.42148989
 10.55136732  8.95869513  7.49473412 10.43983755  8.89071675  5.83360665
 13.90892978  6.2087077   7.04445124  6.84085617 10.3794853   8.18177811
 10.3832782  10.95173682  7.256328    7.61397148 10.8075306  13.16793323
 12.54771582  7.33607166  9.76968608  8.97856717 11.83631658  8.22604541
  5.81946378 10.40637312 11.99922346  8.30670111  8.97677516  7.88026374
 10.75595371 10.1664446  11.2172536  12.35007311  8.85038553  7.81225402
 10.49074409 10.57742111 10.83404465  7.25457263 10.6038723   9.31629219
  8.24065177  6.34435188  9.20605846  7.9061485  10.54353189  8.07566364
 10.31787531 11.39487657  9.31413649 10.7234155   5.45672487 10.8183719
 12.37989951  6.59965447  8.43541314  9.02318332  8.23714223  8.80445471
  5.89617577 13.41059243  7.76289325  9.10151674  8.52072542  9.42389037
  7.52512661 10.75163073 11.70457519  5.30752053  8.23754211 11.35157485
  8.52781693 13.26589646  9.73581113  7.51540953 10.76267654  9.04549763
  7.48191845  9.22324419 11.28929172  9.53355454]

Numpy allows us to quickly calculate many attributes of a given array

In [24]:
a_max = a.max()
print(f'The maximum value is {a_max}') #note the f notation
print(f'The minimum value is {a.min()}') #we can do more in the curly brackets too
print(f'The sum is {a.sum()}')
print(f'The average is {a.mean()}')
print(f'The standard deviation is {a.std()}')
The maximum value is 13.90892977807637
The minimum value is 5.3075205309070315
The sum is 947.6354450008286
The average is 9.476354450008285
The standard deviation is 1.9149614193114533

To avoid too many decimals, we can simply round the numbers

In [25]:
np.round(a.std(),2)
Out[25]:
1.91

Functions

Functions work the same way as in most other programming languages

In [26]:
def a_to_b(a,b):
    return(a**b)

print(a_to_b(3,4))
81

Conditional statements

Same syntax rules as functions. Remember the double equals sign (==) for comparison!

In [27]:
def even_odd(a):
    if a%2 == 0:    #The percent symbol is modulo: divide by 2 and keep the remainder
        print(f'{a} is even')
    else:
        print(f'{a} is odd')
        
even_odd(3)
even_odd(4)
3 is odd
4 is even

Loops

for

The for loop can be used in a number of ways

In [28]:
for i in range(4):
    print(i)
0
1
2
3
In [29]:
for i in range(4,10):
    print(i)
4
5
6
7
8
9
In [30]:
text = "rabbit"
for i in text:
    print(i)
r
a
b
b
i
t
In [31]:
some_numbers = np.random.normal(1,1,10)
for i in some_numbers:
    print(np.round(i,2))
1.51
-0.22
1.41
2.47
1.58
0.77
2.68
-0.81
2.72
0.35

Exercise 1

You are playing a simple board game. You roll a die and advance forward by the number of dots. Easy.

Some fields are special, such as "wait one round" or "roll again".

One field is very special. The rules state:

  • roll dice 3 times
  • you move forward by the two lowest numbers
  • you move back by the highest number

Example: you roll 2, 6 and 4. You move forwards by 2 + 4 and move backwards by 6

Question: Is it a good, or a bad field? What is the mean and median result?

In [32]:
def monte_carlo(rolls):
    
    results = []   #initiate an empty list
    
    for i in range(rolls):  #Perform given number of simulations
        sample = np.random.randint(1,7,3) #Return random integers from low (inclusive) to high (exclusive).
        sample = np.sort(sample) #sort the values ascending
        results.append(sample[0]+sample[1]-sample[2])  #apply the rule and append it to the list
        
    return(results)  #return the results

Execute the simulation. You can see how increasing the number of iterations increases the calculation time

In [33]:
a = monte_carlo(10000)

Print requested values

In [34]:
print(f'The mean movement was {np.mean(a)}')
print(f'The median movement was {np.median(a)}')
print(f'Achieved values are {np.unique(a)}')
The mean movement was 0.6141
The median movement was 1.0
Achieved values are [-4 -3 -2 -1  0  1  2  3  4  5  6]

Data visualization

For making various charts, MatPlotLib is the most popular library. Its syntax is similar to MATLAB's.

Seaborn is an add-on to MatPlotLib that makes generating more complex charts easier.

First, let's import the libraries:

In [35]:
import matplotlib.pyplot as plt #matplotlib is the most common plotting library for Python.
import seaborn as sns   #seaborn is an extension to matplotlib


sns.set() #this call sets the plot styling to seaborn's defaults.

Let's plot a distribution plot

In [36]:
bins = len(np.unique(a))    #define variable "bins" by counting how many unique values there were in the simulation
sns.distplot(a, bins = bins, hist=True, kde=False, norm_hist=True) #make a histogram
plt.xticks(np.unique(a))    #defining which ticks should be plotted on x axis
plt.show()     #call to show the plot. Note that we mixed seaborn and matplotlib functions. 

Experiment on your own: Does it make a difference how many sides the die has?
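
One possible way to test this (a sketch; monte_carlo_n_sided is a hypothetical variant, not part of the original notebook):

def monte_carlo_n_sided(rolls, sides=6):   #hypothetical generalization of monte_carlo above
    results = []
    for i in range(rolls):
        sample = np.sort(np.random.randint(1, sides + 1, 3))  #three rolls of an n-sided die, sorted
        results.append(sample[0] + sample[1] - sample[2])     #same rule as before
    return results

for sides in [4, 6, 8, 12, 20]:
    print(f'{sides}-sided die: mean movement {np.mean(monte_carlo_n_sided(10000, sides)):.3f}')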

More visualization

Let's generate some data to plot

In [37]:
n = 10000
theta = np.random.normal(np.pi*3/4, 1, n)  #random numbers with normal distribution, centered at 3/4 pi
noise_x = np.random.normal(0,0.2,n) #generating numpy arrays with noise
noise_y = np.random.normal(0,0.2,n)
x = np.sin(theta)+noise_x #calculating x coordinate
y = np.cos(theta)+noise_y #calculating y coordinate

We can plot the data as a joint plot with a hex plot in the middle. This is very convenient for showing density data.

Seaborn makes creating such relatively complex charts very easy.

In [38]:
sns.jointplot(x, y, kind="hex", gridsize=50, cmap="inferno")
Out[38]:
<seaborn.axisgrid.JointGrid at 0x1ca9bc6b710>

Alternatively we can use kernel density estimate for a smooth plot.

In [39]:
sns.jointplot(x=x, y=y, kind="kde",n_levels=35, cmap="viridis", cbar=True)
Out[39]:
<seaborn.axisgrid.JointGrid at 0x1ca9d17bdd8>

Some plots can be combined. Note slightly different notation for generating the plot.

In [40]:
n = 1000
theta = np.random.normal(np.pi*3/4, 0.6, n)
noise_x = np.random.normal(0,0.1,n)
noise_y = np.random.normal(0,0.1,n)
x = np.sin(theta)+noise_x
y = np.cos(theta)+noise_y

noise_v = np.random.normal(0,0.2,n)
noise_w = np.random.normal(0,0.2,n)
v = np.sin(theta)*0.2+noise_v
w = np.cos(theta)*0.2+noise_w


ax = sns.kdeplot(x, y, shade=True, cmap="Blues", shade_lowest=False)
ax = sns.kdeplot(v, w, shade=True, cmap="Reds", shade_lowest=False)

Line plot, scatter plot

In [41]:
n = 100
x = np.linspace(0,10,n) #create an array that starts at 0, ends at 10 and has n evenly spaced elements
y = np.sin(x)

Very simple plotting with just MatPlotLib

In [42]:
plt.plot(y)
Out[42]:
[<matplotlib.lines.Line2D at 0x1ca9d420550>]

Seaborn requires both parameters. Notice the x-axis ticks: in the example above they correspond to the index of the NumPy array.

In [43]:
sns.lineplot(x,y)
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9d4569b0>

Best practice is to specify both axes; the example below shows why:

In [44]:
x = np.random.uniform(0,10,n)   #create an array with random, uniform distribution
                                #between 0 and 10
x = np.sort(x)                  #sort. This is now an unevenly sampled domain

y = np.sin(x)

plt.plot(y)
Out[44]:
[<matplotlib.lines.Line2D at 0x1ca9cf6b080>]
In [45]:
plt.plot(x,y)
Out[45]:
[<matplotlib.lines.Line2D at 0x1ca9bf054a8>]

The first chart is distorted due to uneven sampling, even though the signal is perfectly noise-free. When both x and y variables are specified, the signal is plotted correctly

Scatter plots

Very similar to line plots; they work better with noisy data

In [46]:
n = 200
x = np.linspace(0,10,n) #create an array that starts at 0, ends at 10 and has n evenly spaced elements
y = np.sin(x)+np.random.normal(0,0.1,n)

outlier_x = np.random.uniform(0,10,30)
outlier_y = np.random.uniform(-1.5, 1.5, 30)

x = np.append(x, outlier_x)
y = np.append(y, outlier_y)

sns.lineplot(x,y)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9d54ec88>

The line plot looks very bad; compare with the scatter plot

In [47]:
sns.scatterplot(x,y)
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9d5b0cf8>

One can plot multiple plots at the same time

In [48]:
n = 500
x = np.linspace(1,100,n)
y1 = np.cumsum(np.random.normal(0,1,n))
y2 = np.cumsum(np.random.normal(-0.1,1,n))
y3 = np.cumsum(np.random.normal(0,2,n))

sns.lineplot(x,y1, label="First one")
sns.lineplot(x,y2, label="Second one")
sns.lineplot(x,y3, label="Third one")
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9d544a58>

Other plot types

There are many more plot types; see the documentation for MatPlotLib and Seaborn for more information

In [49]:
n = 500
x = np.linspace(1,40,n)
y = 0.3*x+np.sin(x*10)

plt.polar(x,y)
Out[49]:
[<matplotlib.lines.Line2D at 0x1ca9d6a25c0>]

Pandas

Pandas is an extremely popular package in data science. In short, it deals with 2D tabular data structures called DataFrames and has a plethora of data science functions.

Let's start by exploring the Titanic dataset

In [50]:
import numpy as np  #import numpy
import pandas as pd  #import pandas
import seaborn as sns #import seaborn
import matplotlib.pyplot as plt #import matplotlib
df = sns.load_dataset('titanic') #load the titanic dataset to a pandas dataframe
sns.set() #seaborn's defaults

It is very simple to look around the dataframe. Use .head() to view the top 5 rows, or .head(n) for a chosen number of rows.

In [51]:
df.head()
Out[51]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True

.describe() returns basic statistical data for each column

In [52]:
df.describe()
Out[52]:
survived pclass age sibsp parch fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

To get a list of all the attributes (columns) in a dataset, call list(df).

In [53]:
list(df)
Out[53]:
['survived',
 'pclass',
 'sex',
 'age',
 'sibsp',
 'parch',
 'fare',
 'embarked',
 'class',
 'who',
 'adult_male',
 'deck',
 'embark_town',
 'alive',
 'alone']
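
An equivalent pandas idiom, for reference:

print(df.columns.tolist())  #same column names via the columns attribute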

Sorting.

It is sometimes convenient to sort the DataFrame by an attribute. Note that in the example below the original dataframe remains untouched (unsorted).

In [54]:
df.sort_values(by="age").head(15)
Out[54]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
803 1 3 male 0.42 0 1 8.5167 C Third child False NaN Cherbourg yes False
755 1 2 male 0.67 1 1 14.5000 S Second child False NaN Southampton yes False
644 1 3 female 0.75 2 1 19.2583 C Third child False NaN Cherbourg yes False
469 1 3 female 0.75 2 1 19.2583 C Third child False NaN Cherbourg yes False
78 1 2 male 0.83 0 2 29.0000 S Second child False NaN Southampton yes False
831 1 2 male 0.83 1 1 18.7500 S Second child False NaN Southampton yes False
305 1 1 male 0.92 1 2 151.5500 S First child False C Southampton yes False
827 1 2 male 1.00 0 2 37.0042 C Second child False NaN Cherbourg yes False
381 1 3 female 1.00 0 2 15.7417 C Third child False NaN Cherbourg yes False
164 0 3 male 1.00 4 1 39.6875 S Third child False NaN Southampton no False
183 1 2 male 1.00 2 1 39.0000 S Second child False F Southampton yes False
386 0 3 male 1.00 5 2 46.9000 S Third child False NaN Southampton no False
172 1 3 female 1.00 1 1 11.1333 S Third child False NaN Southampton yes False
788 1 3 male 1.00 1 2 20.5750 S Third child False NaN Southampton yes False
642 0 3 female 2.00 3 2 27.9000 S Third child False NaN Southampton no False
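
If you do want to keep the sorted order, assign the result to a (new) variable; a one-line sketch:

df_sorted = df.sort_values(by="age")  #sort_values returns a new, sorted DataFrame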

To see who was the oldest on the Titanic, sort descending. Note the additional .dropna() that drops rows containing missing values.

In [55]:
df.dropna().sort_values(by="age", ascending=False).head(15)
Out[55]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
630 1 1 male 80.0 0 0 30.0000 S First man True A Southampton yes True
96 0 1 male 71.0 0 0 34.6542 C First man True A Cherbourg no True
745 0 1 male 70.0 1 1 71.0000 S First man True B Southampton no False
54 0 1 male 65.0 0 1 61.9792 C First man True B Cherbourg no False
456 0 1 male 65.0 0 0 26.5500 S First man True E Southampton no True
438 0 1 male 64.0 1 4 263.0000 S First man True C Southampton no False
275 1 1 female 63.0 1 0 77.9583 S First woman False D Southampton yes False
252 0 1 male 62.0 0 0 26.5500 S First man True C Southampton no True
625 0 1 male 61.0 0 0 32.3208 S First man True D Southampton no True
170 0 1 male 61.0 0 0 33.5000 S First man True B Southampton no True
587 1 1 male 60.0 1 1 79.2000 C First man True B Cherbourg yes False
366 1 1 female 60.0 1 0 75.2500 C First woman False D Cherbourg yes False
11 1 1 female 58.0 0 0 26.5500 S First woman False C Southampton yes True
487 0 1 male 58.0 0 0 29.7000 C First man True B Cherbourg no True
268 1 1 female 58.0 0 1 153.4625 S First woman False C Southampton yes False

To drop NaNs in the age column only, a more complex operation is necessary.

Notice that we get more results: the deck attribute is full of NaNs.

In [56]:
df2 = df[np.isfinite(df['age'])]                        #create a new dataframe from the rows of df
                                                        #where df['age'] is finite, ie. not NaN or inf
                                                        #Note: boolean indexing returns a new DataFrame
df2.sort_values(by="age", ascending=False).head(15)
Out[56]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
630 1 1 male 80.0 0 0 30.0000 S First man True A Southampton yes True
851 0 3 male 74.0 0 0 7.7750 S Third man True NaN Southampton no True
493 0 1 male 71.0 0 0 49.5042 C First man True NaN Cherbourg no True
96 0 1 male 71.0 0 0 34.6542 C First man True A Cherbourg no True
116 0 3 male 70.5 0 0 7.7500 Q Third man True NaN Queenstown no True
672 0 2 male 70.0 0 0 10.5000 S Second man True NaN Southampton no True
745 0 1 male 70.0 1 1 71.0000 S First man True B Southampton no False
33 0 2 male 66.0 0 0 10.5000 S Second man True NaN Southampton no True
54 0 1 male 65.0 0 1 61.9792 C First man True B Cherbourg no False
280 0 3 male 65.0 0 0 7.7500 Q Third man True NaN Queenstown no True
456 0 1 male 65.0 0 0 26.5500 S First man True E Southampton no True
438 0 1 male 64.0 1 4 263.0000 S First man True C Southampton no False
545 0 1 male 64.0 0 0 26.0000 S First man True NaN Southampton no True
275 1 1 female 63.0 1 0 77.9583 S First woman False D Southampton yes False
483 1 3 female 63.0 0 0 9.5875 S Third woman False NaN Southampton yes True
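
An alternative sketch using pandas' own dropna with a subset (note that, unlike np.isfinite, this removes only NaN, not inf):

df2 = df.dropna(subset=['age'])  #drop rows where the 'age' column is NaN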

We can quickly analyse the age distribution among the passengers

In [57]:
df = df[np.isfinite(df['age'])]                                                 #get only results with finite age
sns.distplot(df[df['who'] == "man"]['age'], bins=10, hist=True, label="men")    #histogram of men
sns.distplot(df[df['who'] == "woman"]['age'], bins=10, hist=True, label="women")#histogram of women
plt.legend()                                                                    #add a legend to chart
Out[57]:
<matplotlib.legend.Legend at 0x1ca9e6f5e10>

We can also quickly show how that distribution is when split into different classes.

In [58]:
df1 = df[df['class']=="First"]      #defining sub-dataframes containing only passengers of first/second/third class
df2 = df[df['class']=="Second"]
df3 = df[df['class']=="Third"]

sns.distplot(df1[df1['who'] == "man"]['age'], hist=True, label="men, 1st class")
sns.distplot(df1[df1['who'] == "woman"]['age'], hist=True, label="women, 1st class")
plt.legend()
plt.show()               #necessary to show the chart. If we omit this, we will have everything in one chart.

sns.distplot(df2[df2['who'] == "man"]['age'], hist=True, label="men, 2nd class")
sns.distplot(df2[df2['who'] == "woman"]['age'], hist=True, label="women, 2nd class")
plt.legend()
plt.show()

sns.distplot(df3[df3['who'] == "man"]['age'], hist=True, label="men, 3rd class")
sns.distplot(df3[df3['who'] == "woman"]['age'], hist=True, label="women, 3rd class")
plt.legend()
plt.show()

Things to notice:

  • The bins are not equal
  • The bins do not overlap
  • Kernel Density Estimate can be misleading
In [59]:
sns.distplot(df1[df1['who'] == "woman"]['age'], bins= 10, hist=True, kde=False, norm_hist=True, label="Histogram")

sns.distplot(df1[df1['who'] == "woman"]['age'], hist=False, label="KDE, bw: 2",
             kde_kws={'gridsize' : 100, "bw" : 2})

sns.distplot(df1[df1['who'] == "woman"]['age'], hist=False, label="KDE, bw: 5",
             kde_kws={'gridsize' : 100, "bw" : 5})

sns.distplot(df1[df1['who'] == "woman"]['age'], hist=False, label="KDE, bw: 15",
             kde_kws={'gridsize' : 100, "bw" : 15})

plt.legend()
Out[59]:
<matplotlib.legend.Legend at 0x1ca9e7c9cf8>

Why not simply Excel?

Prep takes longer, but reuse is much faster with Python.

Let's load weather data from Lerwick weather station (Shetland Islands, Scotland).

In [60]:
df = pd.read_csv("http://www.ux.uis.no/~atunkiel/lerwickdata.csv", delim_whitespace=True)
                                                                #delim_whitespace - whitespace separated file
                #notice that we read csv straight from a website

#year and month are separate in this dataset. Let's merge it and tell pandas that it is a date.
df["date"] = df["mm"].map(str) + "." +df["yyyy"].map(str)
df['date'] = pd.to_datetime(df['date'],format = '%m.%Y')

Quick inspection of the dataframe

In [61]:
df.head()
Out[61]:
yyyy mm tmax[degC] tmin[degC] af[days] rain[mm] sun[hours] date
0 1930 12 7.0 3.6 0 122.4 13.1 1930-12-01
1 1931 1 4.9 0.2 13 108.0 29.7 1931-01-01
2 1931 2 4.4 -0.3 12 138.3 52.6 1931-02-01
3 1931 3 4.2 -1.0 17 18.2 135.7 1931-03-01
4 1931 4 8.1 2.5 1 70.9 134.7 1931-04-01

Let's inspect the maximum recorded temperatures

In [62]:
sns.scatterplot(df['date'],df['tmax[degC]'])
C:\Users\2921228\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\plotting\_converter.py:129: FutureWarning: Using an implicitly registered datetime converter for a matplotlib plotting method. The converter was registered by pandas on import. Future versions of pandas will require you to explicitly register matplotlib converters.

To register the converters:
	>>> from pandas.plotting import register_matplotlib_converters
	>>> register_matplotlib_converters()
  warnings.warn(msg, FutureWarning)
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9ea896d8>

Let's apply a rolling average for a cleaner look at the data.

How wide should the window be? Let's try 10, 20, 30, ... , 90 months

In [63]:
for i in range(1,10):
    sns.scatterplot(df['date'],df['tmax[degC]'].rolling(10*i, center=True).mean(), 
                    alpha=0.3, label=f'rolling average {10*i}')
    plt.show()

It seems that a window of 60 gives the best results. Why 60? Probably because there are 12 months in a year and 60 divides nicely by 12.

Let's check this theory by plotting rolling windows with widths of 12, 24, 36, ... , 108.

In [64]:
for i in range(1,10):
    sns.scatterplot(df['date'],df['tmax[degC]'].rolling(12*i, center=True).mean(),
                    alpha=0.3, label=f'rolling average {12*i}')
    plt.show()

This explains some articles from the 70s:

  • U.S. Scientist Sees New Ice Age Coming (The Washington Post, July 9, 1971)
  • Ice Age Around the Corner (Chicago Tribune, July 10, 1971)
  • New Ice Age Coming – It’s Already Getting Colder (L.A. Times, October 24, 1971)

Pollution was likely the driving factor in the cooling. Currently we have somewhat overcompensated with CO2 emissions.

We can explore some correlations, such as: how is temperature related to rain?

In [65]:
x= df['tmax[degC]']
y= df['rain[mm]']
sns.jointplot(x, y, kind="kde", levels=25, cmap="inferno")
Out[65]:
<seaborn.axisgrid.JointGrid at 0x1ca9ed0a5f8>

We can fairly easily plot LOWESS:

  • Locally
  • Weighted
  • Scatterplot
  • Smoothing
In [66]:
from statsmodels.nonparametric.smoothers_lowess import lowess #import library that will do all the heavy lifting
y = df['tmax[degC]']
x = np.arange(0,len(df['date']),1) #converting into linear space from 0 to length of the dataset
w = lowess(y, x, frac=1/5)         #calculating LOWESS
                                   #frac is the window width in terms of a fraction of full width.

And let's plot the results, with some arithmetic on the x values to recover the year.

In [67]:
sns.scatterplot(x/12+1930,y.rolling(12, center=True).mean(),linewidth=0,color="r",alpha=0.3)
sns.lineplot(w[:,0]/12+1930,w[:,1])
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9ecad940>

We can explore how much smoothing is too much smoothing

In [68]:
plt.figure(figsize=(15,10)) # bigger plot for visibility

sns.scatterplot(x,y.rolling(12, center=True).mean(),linewidth=0,color="r",alpha=0.3) #scatterplot for reference


for i in reversed(range(1,21)):  #note that reversed is used. This is to change the order of plotting of the charts.
    w = lowess(y, x, frac=1/(i*2))
    
    if i != 20:
        sns.lineplot(w[:,0],w[:,1], hue=i, palette='viridis',hue_norm=(1,20), legend=False)
    else:
        sns.lineplot(w[:,0],w[:,1], hue=i, palette='viridis',hue_norm=(1,20))

plt.plot()
Out[68]:
[]

Violin plots are often better than box plots, as they convey more information

In [69]:
x = df['rain[mm]']
y = df['mm']
sns.violinplot(y,x)
plt.show()
sns.boxplot(y,x)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9ebb44a8>

Real drilling data

Let's consider some real drilling data. We can import it from the web again, straight out of a zip file.

In [70]:
df=pd.read_csv("http://www.ux.uis.no/~atunkiel/Norway-NA-15_$47$_9-F-9%20A%20depth.zip")
In [71]:
list(df)
Out[71]:
['Unnamed: 0',
 'Unnamed: 0.1',
 'Measured Depth m',
 'TOFB s',
 'AVG_CONF unitless',
 'MIN_CONF unitless',
 'Average Rotary Speed rpm',
 'STUCK_RT unitless',
 'Corrected Surface Weight on Bit kkgf',
 'Corrected Total Hookload kkgf',
 'MWD Turbine RPM rpm',
 'MWD Raw Gamma Ray 1/s',
 'MWD Continuous Inclination dega',
 'Pump 2 Stroke Rate 1/min',
 'Rate of Penetration m/h',
 'Bit Drill Time h',
 'Corrected Hookload kkgf',
 'MWD Gravity Toolface dega',
 'Mud Density Out g/cm3',
 'MWD GR Bit Confidence Flag %',
 'Pump Time h',
 'PowerUP Shock Rate 1/s',
 'Total SPM 1/min',
 'HKLO kkgf',
 'Average Hookload kkgf',
 'Total Hookload kkgf',
 'DRET unitless',
 'Extrapolated Hole TVD m',
 'MWD Gamma Ray (API BH corrected) gAPI',
 'Pump 1 Strokes unitless',
 'ROPIH s/m',
 'EDRT unitless',
 'Pump 1 Stroke Rate 1/min',
 'Rate of penetration m/h',
 'Iso-butane (IC4) ppm',
 'Bit Drilling Run m',
 'MWD Total Shocks unitless',
 'Total Bit Revolutions unitless',
 'Nor-butane (NC4) ppm',
 'Lag Depth (TVD) m',
 'Pump 2 Strokes unitless',
 'Corr. Drilling Exponent unitless',
 'Total Vertical Depth m',
 'Ethane (C2) ppm',
 'TMP In degC',
 'Mud Density In g/cm3',
 'n-Penthane ppm',
 'Mud Density In g/cm3.1',
 'Weight on Bit kkgf',
 'Bit Revolutions  (cum) unitless',
 'Averaged WOB kkgf',
 'Hole Depth (TVD) m',
 'MWD Shock Risk unitless',
 'Bit run number unitless',
 'RHX_RT unitless',
 'Pump 3 Strokes unitless',
 'Total Strokes unitless',
 'Inverse ROP s/m',
 'Pass Name unitless',
 'Pump 4 Stroke Rate 1/min',
 'Iso-pentane (IC5) ppm',
 'Rig Mode unitless',
 'MWD Shock Peak m/s2',
 'Flow Pumps L/min',
 'SPN Sp_RigMode 2hz unitless',
 'Average Standpipe Pressure kPa',
 'Bit Depth m',
 'Rate of Penetration (5ft avg) m/h',
 'Gas (avg) %',
 'Propane (C3) ppm',
 'String weight (rot,avg) kkgf',
 'TOBO s',
 'AJAM_MWD unitless',
 'Tank volume (active) m3',
 '1/2ft ROP m/h',
 'Weight On Hook kkgf',
 'Hole depth (MD) m',
 'Mud Flow In L/min',
 'BHFG unitless',
 'Temperature Out degC',
 'Averaged TRQ kN.m',
 'MWD Continuous Azimuth dega',
 'RGX_RT unitless',
 'MWD DNI Temperature degC',
 'Bit Drilling Time h',
 'HKLI kkgf',
 'Average Surface Torque kN.m',
 'Methane (C1) ppm',
 'MWD Magnetic Toolface dega',
 'Total Downhole RPM rpm',
 'SHK3TM_RT min',
 'Elapsed time in-slips s',
 'Stand Pipe Pressure kPa',
 'Pump 3 Stroke Rate 1/min',
 'Averaged RPM rpm',
 'Pump 4 Strokes unitless',
 'Inverse ROP (5ft avg) s/m',
 'S1AC kPa',
 'S2AC kPa',
 'IMWT g/cm3',
 'OSTM s',
 'nameWellbore',
 'name',
 'IMP/ARC Attenuation Conductivity 40-in. at 2 MHz mS/m',
 'ARC Annular Pressure kPa',
 'MWD Collar RPM rpm',
 'IMP/ARC Non-BHcorr Phase-Shift Resistivity 28-in. at 2 MHz ohm.m',
 'IMP/ARC Phase-Shift Conductivity 40-in. at 2 MHz mS/m',
 'Annular Temperature degC',
 'IMP/ARC Non-BHcorr Phase-Shift Resistivity 40-in. at 2 MHz ohm.m',
 'ARC Gamma Ray (BH corrected) gAPI',
 'IMP/ARC Non-BHcorr Attenuation Resistivity 40-in. at 2 MHz ohm.m',
 'MWD Stick-Slip PKtoPK RPM rpm',
 'IMP/ARC Non-BHcorr Attenuation Resistivity 28-in. at 2 MHz ohm.m',
 'IMP/ARC Phase-Shift Conductivity 28-in. at 2 MHz mS/m']

Verify the contents of the DataFrame. Note that real-life datasets can be overrun with NaNs.

In [72]:
df.head()
Out[72]:
Unnamed: 0 Unnamed: 0.1 Measured Depth m TOFB s AVG_CONF unitless MIN_CONF unitless Average Rotary Speed rpm STUCK_RT unitless Corrected Surface Weight on Bit kkgf Corrected Total Hookload kkgf ... MWD Collar RPM rpm IMP/ARC Non-BHcorr Phase-Shift Resistivity 28-in. at 2 MHz ohm.m IMP/ARC Phase-Shift Conductivity 40-in. at 2 MHz mS/m Annular Temperature degC IMP/ARC Non-BHcorr Phase-Shift Resistivity 40-in. at 2 MHz ohm.m ARC Gamma Ray (BH corrected) gAPI IMP/ARC Non-BHcorr Attenuation Resistivity 40-in. at 2 MHz ohm.m MWD Stick-Slip PKtoPK RPM rpm IMP/ARC Non-BHcorr Attenuation Resistivity 28-in. at 2 MHz ohm.m IMP/ARC Phase-Shift Conductivity 28-in. at 2 MHz mS/m
0 0 0 273.101 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 1 273.253 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 2 273.406 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 3 273.558 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 4 4 273.710 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 115 columns
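
To quantify how NaN-ridden each column is, one quick sketch:

print(df.isna().mean().sort_values(ascending=False).head(10))  #fraction of NaNs per column, worst first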

We can explore the data through charts, tweaking them to best show the data

In [73]:
df = df[np.isfinite(df['Measured Depth m'])]
df = df[np.isfinite(df['MWD Continuous Inclination dega'])]


plt.figure(figsize=(20,10))
sns.scatterplot(df['Measured Depth m'], df['MWD Continuous Inclination dega'],
                linewidth=0,
                hue=df['Average Hookload kkgf'],
                hue_norm=(85,100),
                palette="viridis",
                size=df['Average Surface Torque kN.m'],
                size_norm=(0,10),
                sizes=(20,300)
                )
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9ebda898>

Utilize the .describe() function to gauge the normalization range

In [74]:
df['Average Hookload kkgf'].describe()
Out[74]:
count    1963.000000
mean       94.636537
std         3.698203
min        71.966965
25%        91.820703
50%        96.034577
75%        97.359066
max       104.303565
Name: Average Hookload kkgf, dtype: float64

A simple plot is also useful for understanding the data contents, such as outliers.

In [75]:
df['Average Hookload kkgf'].plot()
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ca9ebb46d8>

Color palettes

While grayscale is typically preferred in publications, when you do use color, use good color schemes. The examples below are good: the brightness of the colors is consistent with the color scale.

In [76]:
palettes = ['viridis','inferno', 'gray', 'Blues', 'Reds', 'hot', 'cividis']
for i in palettes:
    sns.palplot(sns.color_palette(i,10))

Diverging palettes are also possible, to highlight that something is below or above a certain level (like a physical map)

In [77]:
sns.palplot(sns.diverging_palette(220, 20, n=11))
sns.palplot(sns.diverging_palette(145, 280, s=85, l=25, n=11))
sns.palplot(sns.diverging_palette(255, 133, l=60, n=11, center="dark"))

Avoid the "jet" palette

A rainbow-like palette that is among the most commonly used despite being poor at representing data

https://jakevdp.github.io/blog/2014/10/16/how-bad-is-your-colormap/

https://pdfs.semanticscholar.org/ee79/2edccb2c88e927c81285344d2d88babfb86f.pdf

Seaborn will outright refuse to use it, raising ValueError: No.

In [78]:
#palettes = ['jet']
#for i in palettes:
#    sns.palplot(sns.color_palette(i))

The jet colormap is even more confusing when converted to grayscale.

Don't worry about the code below, it is there just to generate the chart.

In [79]:
def grayify_cmap(cmap):
    """Return a grayscale version of the colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    
    # convert RGBA to perceived greyscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]
    
    return cmap.from_list(cmap.name + "_grayscale", colors, cmap.N)



%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 6,300)
y = np.linspace(0, 3,300)[:, np.newaxis]
z = 10 * np.cos(x ** 2) * np.exp(-y)
cmaps = [plt.cm.jet, grayify_cmap('jet'), plt.cm.gray]
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
fig.subplots_adjust(wspace=0)

for cmap, ax in zip(cmaps, axes):
    im = ax.imshow(z, cmap=cmap)
    ax.set_title(cmap.name)
    fig.colorbar(im, ax=ax)

Use a palette that has lightness corresponding to value for better results

In [80]:
cmaps = [plt.cm.CMRmap, grayify_cmap('CMRmap'), plt.cm.gray]
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
fig.subplots_adjust(wspace=0)

for cmap, ax in zip(cmaps, axes):
    im = ax.imshow(z, cmap=cmap)
    ax.set_title(cmap.name)
    fig.colorbar(im, ax=ax)

Regression

In Python we can easily perform regression using any function.

Let's assume we have a signal governed by an equation \begin{equation*} y = -0.2x^3 + 3x^2 + 30\sin(x) \end{equation*}

This signal is noisy. Noise has a standard deviation of 30.

We have 100 samples in total to work with, but 20 of them are outliers

In [130]:
np.random.seed(10) #fixed random seed, so you always get the same set of random numbers

n = 20 #outlier count
x_0 = np.linspace(-10,10,100-n)  #generate 100 minus n evenly spaced values from -10 to 10
y_0 = -0.2*x_0**3+3*x_0**2+30*np.sin(x_0) #calculate y

sns.lineplot(x_0,y_0,color="red", label="ground truth") #plot clean function

x = x_0  #keep x_0 separate for plotting later
y = y_0 + np.random.normal(0,30,100-n)  #generate new function - same as clean one plus noise with sd=30
sns.scatterplot(x,y,label="truth+noise") #plot the noisy chart

outliers_x = np.random.uniform(-12,12,n) #generate n outliers, x coordinates
outliers_y = np.random.uniform(-200,500,n)#... y coordinates
sns.scatterplot(outliers_x,outliers_y,label="outliers") #plot outliers only

x = np.concatenate([x, outliers_x]) #add the outliers to the x array
y = np.concatenate([y, outliers_y]) #add the outliers to the y array

Let's look at the raw data. You can see that the data is very noisy and difficult to fit a regression to!

In [131]:
sns.scatterplot(x,y,label="outliers",linewidth=0)
Out[131]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caa167ae10>

Let's assume that we know the function itself, but we do not know the parameters; effectively, the known shape of the function is: \begin{equation*} y = a_0 x^3 + a_1 x^2 + a_2 \sin(x) \end{equation*}

We define a function in a parametric way that is suitable for the algorithm that will come later.

In [132]:
def fun(a, x, y):
    return a[0]*x**3+a[1]*x**2+a[2]*np.sin(x) - y

We can import an optimization function from SciPy based on the least squares method

In [133]:
from scipy.optimize import least_squares

Let's develop two models.

The first is optimized for outliers, so that they no longer carry such a heavy weight (a typical issue with the least squares method).

The second is a vanilla least squares method

In [134]:
a0 = np.ones(3) #initiating the parameters with ones

model = least_squares(fun, a0, args=(x,y),loss='soft_l1') #the soft_l1 is the more lenient loss function

model2 = least_squares(fun, a0, args=(x,y))

print(model.x) #show the calculated parameters
[-0.20871253  2.95637215 31.38991214]

The model returned:

$a_0 = -0.20871253$ (originally $-0.2$)

$a_1 = 2.95637215$ (originally $3$)

$a_2 = 31.38991214$ (originally $30$)

These are very close to the original values. Let's inspect the fit graphically. The match is nearly perfect.

In [135]:
plt.figure(figsize=(10,7)) #let's use slightly bigger chart
sns.scatterplot(x,y, label="input data",s=100)

sns.lineplot(x_0,y_0,color="red", lw=2,alpha=1,label="ground truth")

y2 = model.x[0]*x**3+model.x[1]*x**2+model.x[2]*np.sin(x) #calculate y based on the model
sns.lineplot(x, y2, color="green",lw=5,alpha=0.5, label="model, optimized")
Out[135]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caa17b8358>

In comparison, the vanilla least squares method did not perform as well:

In [136]:
plt.figure(figsize=(10,7))
sns.scatterplot(x,y, label="input data",s=100)

sns.lineplot(x_0,y_0,color="red", lw=2,alpha=1,label="ground truth")

y3 = model2.x[0]*x**3+model2.x[1]*x**2+model2.x[2]*np.sin(x)
sns.lineplot(x, y3, color="green",lw=2,dashes="-",alpha=1, label="model, vanilla")
Out[136]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caa11bde10>

Let's calculate $R^2$ values for both models.

Note: in this implementation $R^2$ value can be negative

In [137]:
from sklearn.metrics import r2_score  #in a typical Python fashion, we just import what is needed

print(f'R^2 value for optimized fitting is: {r2_score(y,y2)}' )
print(f'R^2 value for standard fitting is : {r2_score(y,y3)}' )
R^2 value for optimized fitting is: 0.06573837456233356
R^2 value for standard fitting is : 0.27094919231420334

The objectively much better model has a lower $R^2$ value. As you can see, this is not the ultimate metric.
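
One way to see this, as a sketch: since we happen to know the clean ground truth here, we can score both models against it instead of against the noisy samples; the optimized fit should then come out ahead.

y2_clean = model.x[0]*x_0**3 + model.x[1]*x_0**2 + model.x[2]*np.sin(x_0)    #optimized model on clean x
y3_clean = model2.x[0]*x_0**3 + model2.x[1]*x_0**2 + model2.x[2]*np.sin(x_0) #vanilla model on clean x
print(f'R^2 vs ground truth, optimized: {r2_score(y_0, y2_clean)}')
print(f'R^2 vs ground truth, vanilla  : {r2_score(y_0, y3_clean)}')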

Optimization - finding minimum

One class of optimization tasks is, in principle, finding the minimum of a function. Some examples:

  • Find a well path with minimum torque and drag
  • Find the lowest MSE to ROP ratio
  • Squaring the circle

Example 1

Let's find the minimum of the function that we explored before.

We will use a differential evolution algorithm, which is conveniently available in the SciPy library.

We will need to redefine the function slightly to make it work with the algorithm. Minimum-seeking algorithms also often need boundaries; we have to provide those as well.

In [138]:
from scipy.optimize import differential_evolution  #import differential evolution algorithm

bounds = [(-10, 10)]  #define bounds within which algorithm will seek minimum

def fun_min(x):
    return -0.2*x**3+3*x**2+30*np.sin(x)  #redefined function - it simply returns y.


result = differential_evolution(fun_min, bounds) #here the magic happens. We feed the function and the bounds 
                                                 #to the algorithm

sns.lineplot(x_0,y_0,color="gray", lw=2,alpha=1,label="ground truth", zorder=1) #plot the ground truth
                                                                                #zorder defines the plotting
                                                                                #sequence, aka. what is on top
plt.scatter(result.x[0], result.fun[0],marker="x",s=200,c="r", label="minimum", zorder=10)
plt.legend()
Out[138]:
<matplotlib.legend.Legend at 0x1caa160b7b8>

Example 2 - multiparameter function

For future reference, let's see how that works when a function has multiple parameters.

In [149]:
from mpl_toolkits.mplot3d import Axes3D  #importing some things necessary for 3D plot

def fun3d(vars):  #definition of a sample 3D function. See that it still takes only one variable
    x = vars[0]   #but now it is an array, which we have to deconstruct
    y = vars[1]
    return np.sin(x)*np.exp(-x**2-y**2)

x = np.linspace(-3,3,100)  #defining X and Y space
y = np.linspace(-3,3,100)
x, y = np.meshgrid(x, y)   #convert x and y to a mesh, so we get a point for each intersection of x and y
z = fun3d([x,y])           #feed the function

fig = plt.figure(figsize=(10,7))
ax = fig.gca(projection='3d')  #define 3D plot
ax.plot_surface(x, y, z, cmap=plt.cm.viridis, linewidth=0.2, alpha=1)  #let's plot the function to verify
                                                                       #that everything works
plt.show()
In [140]:
bounds = [(-3,3),(-3,3)]  #define the bounds
result = differential_evolution(fun3d, bounds) #feed the function and the bounds to DE algorithm

#and we plot everything again
fig = plt.figure(figsize=(10,7))
ax = fig.gca(projection='3d')
ax.plot_surface(x, y, z, cmap=plt.cm.viridis, linewidth=0.2, alpha=0.3)
ax.scatter(result.x[0], result.x[1], result.fun, c="r", marker="x",s=200)
plt.show()

Example 3 - squaring the circle

What is the side length of a square whose area equals that of a circle? People have tried to solve this since around 1800 BCE. Nowadays we need a few lines of Python.

In [141]:
def quadrature(x):               #function that will return the absolute difference between
                                 #the area of a circle, radius = 1, and square, length x
    return np.abs(np.pi - x**2)  #assume radius = 1, hence pi*1^2 = pi

bounds = [(0,2)]
result = differential_evolution(quadrature, bounds)

ax=plt.gca()

patch = plt.Circle((0,0), radius= 1, fill=False,color="r")
ax.add_patch(patch)

patch = plt.Rectangle((-0.5*result.x,-0.5*result.x),result.x,result.x, fill=False, color="b")
ax.add_patch(patch)

plt.axis('scaled')
plt.show()
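
As a sanity check (not in the original cells): the analytical answer is $\sqrt{\pi} \approx 1.7725$, since the circle has area $\pi \cdot 1^2$.

print(result.x[0])     #side length found by differential evolution
print(np.sqrt(np.pi))  #analytical answer, sqrt(pi)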

To visualize the actual function we minimized:

In [150]:
x = np.linspace(0,2,100)
y = quadrature(x)

sns.lineplot(x,y)
Out[150]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caa7da38d0>

Feature engineering

Oftentimes we want to use a parameter that is not delivered in the database; we need to calculate it. Below is a short example of calculating inclination-based dogleg severity, done on the dataset we used before.

Let's import the dataset again and plot inclination data

In [164]:
df=pd.read_csv("http://www.ux.uis.no/~atunkiel/Norway-NA-15_$47$_9-F-9%20A%20depth.zip")

sns.scatterplot(df['Measured Depth m'],df['MWD Continuous Inclination dega'],linewidth=0,size=1, alpha=0.3)
Out[164]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caa142fda0>

For clarity, let's develop two interim variables:

  • dDepth - depth change
  • dInc - inclination change

We will also smooth out the data a little bit with a rolling mean. Deltas will be calculated over a slightly larger step than one to avoid noise

In [247]:
df.ffill(inplace=True) #forward filling and backward filling to avoid NaNs in the dataset
df.bfill(inplace=True)

step_d=20    #steps for differentiation and rolling average
step_r = 100 #as variables for easy tweaking

df['dDepth'] = df['Measured Depth m'].rolling(step_r, center=True).mean().diff(step_d)
df['dInc'] = df['MWD Continuous Inclination dega'].rolling(step_r, center=True).mean().diff(step_d)
df['DLS'] = df['dDepth']/df['dInc']

sns.scatterplot(df['Measured Depth m'],df['MWD Continuous Inclination dega'],
                hue=df['DLS'],linewidth=0,size=1, alpha=0.3)
Out[247]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caab3272e8>

We immediately see an issue: the color scale that we wanted to use to visualize dogleg is out of whack.

We can inspect basic data parameters with the one-liner we learned about before.

In [248]:
df['dInc'].describe()
Out[248]:
count    16551.000000
mean         0.060214
std          0.148738
min         -0.518800
25%          0.000000
50%          0.006000
75%          0.111800
max          1.677000
Name: dInc, dtype: float64

and some extra info through a histogram

In [249]:
sns.distplot(df['dInc'].dropna())
Out[249]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caaccaeb70>

With this new information, we can specify our color range

In [250]:
plt.figure(figsize=(10,7)) 
sns.scatterplot(df['Measured Depth m'], df['MWD Continuous Inclination dega'],
                linewidth=0,
                hue=df['dInc'],
                hue_norm=(0,0.25),
                palette="viridis",
                )
Out[250]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caab3090f0>

NOTE: This is a simplified approach.

While everything looks correct at first glance, there are a number of potential issues here:

  • We did a rolling average and calculated differences with a fixed step, without checking the actual data frequency. This might have smoothed out some important data. We should resample the data for a consistent result; see the sketch after this list.

  • There are outliers in the dataset that should be taken out first.
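
A minimal sketch of such resampling, assuming Measured Depth increases monotonically (the 0.5 m grid spacing is an arbitrary choice):

depth = df['Measured Depth m'].values
inc = df['MWD Continuous Inclination dega'].values
grid = np.arange(depth.min(), depth.max(), 0.5)  #uniform 0.5 m depth grid (assumed spacing)
inc_uniform = np.interp(grid, depth, inc)        #linearly interpolate inclination onto the grid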

Correlation

Let's explore correlation. We will import the diamonds dataset: a data frame with 53,940 rows and 10 variables:

  • price - price in US dollars (326-18823)
  • carat - weight of the diamond (0.2--5.01)
  • cut - quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • color - diamond colour, from D (best) to J (worst)
  • clarity - a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • x - length in mm (0--10.74)
  • y - width in mm (0--58.9)
  • z - depth in mm (0--31.8)
  • depth - total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
  • table - width of top of diamond relative to widest point (43--95)
In [300]:
df = pd.read_csv('http://www.ux.uis.no/~atunkiel/diamonds.csv')
sns.scatterplot(df['carat'],df['price'])
Out[300]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cacb9f7e80>

The readability of this chart is poor: the points widely overlap each other. We can try a seaborn jointplot.

Note: the colors are log-normalized.

In [318]:
import matplotlib.colors as colors

sns.jointplot(df['carat'],df['price'], kind="hex", gridsize=75,
              cmap="inferno", joint_kws={"norm" : colors.LogNorm()})
Out[318]:
<seaborn.axisgrid.JointGrid at 0x1cad93c6b70>

Many correlations in life are logarithmic in nature. Let's quickly explore whether that is the case here.

You will clearly see that the log-log relationship is by far the most linear. Further analysis may benefit greatly from this finding, as linear correlations are much easier to model than logarithmic ones.
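
A quick sketch of exploiting this finding: fitting a straight line in log-log space amounts to a power-law model $price \approx e^{b} \cdot carat^{a}$.

slope, intercept = np.polyfit(np.log(df['carat']), np.log(df['price']), 1)  #linear fit in log-log space
print(f'log-log fit: slope {slope:.3f}, intercept {intercept:.3f}')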

In [312]:
fig, axs = plt.subplots(2, 2, figsize=(10,10))


axs[0, 0].hexbin(df['carat'],df['price'],xscale='linear', yscale='linear', norm = colors.LogNorm())
axs[0, 0].set_title('linear-linear')

axs[0, 1].hexbin(df['carat'],df['price'],xscale='linear', yscale='log', norm = colors.LogNorm())
axs[0, 1].set_title('linear-log')

axs[1, 0].hexbin(df['carat'],df['price'],xscale='log', yscale='linear', norm = colors.LogNorm())
axs[1, 0].set_title('log-linear')

axs[1, 1].hexbin(df['carat'],df['price'],xscale='log', yscale='log', norm = colors.LogNorm())
axs[1, 1].set_title('log-log')
Out[312]:
Text(0.5, 1.0, 'log-log')

Seaborn allows one to plot a useful pairplot to look at correlations between parameters

In [295]:
sns.pairplot(df.sample(100))
Out[295]:
<seaborn.axisgrid.PairGrid at 0x1cac8243240>

One can put a number on the correlation.

In [315]:
sns.heatmap(df.corr(), cmap="viridis")
Out[315]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cadaffaf98>
In [332]:
n = 1000

results = []
for i in range(n):
    if (np.average(np.random.randint(0,2,11)) <= 3):
        results.append(1)
    else:
        results.append(0)
print(results)