a = 5
b = 3.4
c = "test"
print("Variable 'a' is: " + str(type(a)))
print("Variable 'b' is: " + str(type(b)))
print("Variable 'c' is: " + str(type(c)))
def print_n_powers_of_2(n): #function definition
    for i in range(n): #simple loop
        print(2**i) #note that Python starts counting at zero
print_n_powers_of_2(6) #execute the function passing number 6 as parameter n
Note: Use Python 3.x only. Python 2.x also exists, but has been deprecated. For most intents and purposes Python 2 code will work in Python 3 with only minor changes.
a = a + 1 #add one to a defined earlier. Re-running this cell will keep incrementing a
print (a)
Note: Python is very distributed in nature compared to the compact package that is MATLAB. If you are searching for a function, use your search engine of choice: Google/DuckDuckGo/Qwant
a = 5 #python automatically knows if it is int, float, str, etc
b = "rabbit" #strings (text) comes in quote marks
c = a + 2 #you can do operations at the same time as definition
Printing is the basic way of showing data in Python. It is very forgiving.
print(a)
print(b)
print(c+a)
Lists, arrays, matrices - a sequence of variables
a = [0, 1, 2, 3]
print(a)
a = [1, "two", 3.0] #feel free to mix types
print(type(a))
print()
print(type(a[0]))
print(type(a[1]))
print(type(a[2]))
To work on such data structures, it is best to use a library called numpy.
import numpy as np
a = [0, 1, 2, 3]
a = np.asarray(a) #convert python list to a numpy array
print(a) #note the different printing format than before
This allows us to do different operations easily:
print(a+2)
print(a*2)
b = np.asarray([5,6,5,6])
print(a)
print(" +")
print(b)
print(" =")
print(a+b)
To get a part of an array, slices are used.
print(f'This is a complete array, a:{a}')
print(f'This is a a single element in an array, a[2]:{a[2]}')
print(f'This is a slice of an array, a[0:2]:{a[0:2]}') #note that a slice is [inclusive:exclusive]
print(f'This is the last element of an array, a[-1]:{a[-1]}')
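A couple of extra slice forms beyond those shown above - an optional step and negative indices (plain Python, nothing assumed):

```python
a = [0, 1, 2, 3, 4, 5]
print(a[::2])   # every second element: [0, 2, 4]
print(a[::-1])  # a reversed copy: [5, 4, 3, 2, 1, 0]
print(a[-3:])   # the last three elements: [3, 4, 5]
```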
Arrays can be nested
#defining a nested list
a = [
[1,2,3,4],
[5,6,7,8]
]
Caution: Python lists behave differently than numpy arrays!
print(a[0][1])
Notice the different notation once converted to a numpy array
a = np.asarray(a)
print(a[0,1])
Slices are particularly useful here
print(a[1,1:3])
Lists and arrays behave differently than variables.
a = 2
b = a
b = b + 2
print(f'a is equal to {a}')
print(f'b is equal to {b}')
But, if we do the same thing with lists:
a = [1,1,1,1,1]
b = a
b[1] = 3
print(f'List a looks like this: {a}')
print(f'List b looks like this: {b}')
Notice that list a also changed. Let's see how numpy arrays behave:
a = np.asarray([1,1,1,1,1])
b = a
b[1] = 3
print(f'Numpy array a looks like this: {a}')
print(f'Numpy array b looks like this: {b}')
Same thing. There may be differences though:
a = [1,1,1,1,1]
b = a[1:3]
b[1] = 3
print(f'List a looks like this: {a}')
print(f'List b looks like this: {b}')
a = np.asarray([1,1,1,1,1])
b = a[1:3]
b[1] = 3
print(f'Numpy array a looks like this: {a}')
print(f'Numpy array b looks like this: {b}')
When slicing, the Python list created a copy, but the numpy array kept a reference (numpy slices are views of the original array).
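This copy-versus-view behavior can be verified directly; the check below uses numpy's np.shares_memory, which is not used elsewhere in this notebook:

```python
import numpy as np

lst = [1, 1, 1, 1, 1]
lst_slice = lst[1:3]   # list slicing returns a new list (a copy)
lst_slice[0] = 99
print(lst)             # the original list is unchanged

arr = np.asarray([1, 1, 1, 1, 1])
arr_slice = arr[1:3]   # numpy slicing returns a view on the same memory
print(np.shares_memory(arr, arr_slice))  # True
```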
Numpy can make a copy of an array if you are explicit about it
a = np.asarray([1,1,1,1,1])
b = np.copy(a)
b[1] = 3
print(f'Numpy array a looks like this: {a}')
print(f'Numpy array b looks like this: {b}')
NOTE: you are NOT to memorize functions, use Google/DuckDuckGo/Qwant!
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.normal.html
a = np.random.normal(10, 2, 100) #center, std, size
print(a)
Numpy allows us to quickly calculate many attributes of a given array
a_max = a.max()
print(f'The maximum value is {a_max}') #note the f notation
print(f'The minimum value is {a.min()}') #we can do more in the curly brackets too
print(f'The sum is {a.sum()}')
print(f'The average is {a.mean()}')
print(f'The standard deviation is {a.std()}')
To avoid too many decimals, we can simply round the numbers
np.round(a.std(),2)
Functions work the same way as in most other programming languages
def a_to_b(a,b):
    return a**b
print(a_to_b(3,4))
Conditionals are indented the same way as functions. Remember the double equals sign (==) for comparison!
def even_odd(a):
    if a%2 == 0: #the percent sign is the modulo operator: here, divide by 2 and take the remainder
        print(f'{a} is even')
    else:
        print(f'{a} is odd')
even_odd(3)
even_odd(4)
for i in range(4):
    print(i)
for i in range(4,10):
    print(i)
text = "rabbit"
for i in text:
    print(i)
some_numbers = np.random.normal(1,1,10)
for i in some_numbers:
    print(np.round(i,2))
You are playing a simple board game. You roll a die and advance forward by the number of dots. Easy.
Some fields are special, such as "You wait one round" or "You roll again".
One field is very special. The rules state: roll three dice, move forward by the sum of the two lowest rolls, and move backward by the highest roll.
Example: you roll 2, 6 and 4. You move forward by 2 + 4 and move backward by 6.
Question: Is it a good field or a bad field? What are the mean and median results?
def monte_carlo(rolls):
    results = [] #initiate an empty list
    for i in range(rolls): #perform the given number of simulations
        sample = np.random.randint(1,7,3) #return random integers from low (inclusive) to high (exclusive)
        sample = np.sort(sample) #sort the values ascending
        results.append(sample[0]+sample[1]-sample[2]) #apply the rule and append to the list
    return results #return the results
Execute the simulation. You can see how increasing the number of iterations increases the calculation time.
a = monte_carlo(10000)
Print requested values
print(f'The mean movement was {np.mean(a)}')
print(f'The median movement was {np.median(a)}')
print(f'Achieved values are {np.unique(a)}')
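Because three dice have only 6^3 = 216 equally likely outcomes, the simulated values can be cross-checked by exact enumeration (the exact_stats helper below is a sketch added here, not part of the original notebook):

```python
import itertools
import numpy as np

def exact_stats():
    outcomes = []
    for roll in itertools.product(range(1, 7), repeat=3):  # all 216 rolls
        low, mid, high = sorted(roll)
        outcomes.append(low + mid - high)  # forward by the two lowest, back by the highest
    return np.mean(outcomes), np.median(outcomes)

mean, median = exact_stats()
print(mean, median)  # the exact mean is 7/12, so the field is good on average
```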
For making various charts, MatPlotLib is the most popular library. It has a similar syntax to MATLAB.
Seaborn is an add-on to MatPlotLib that makes generating more complex charts easier.
First, let's import the libraries:
import matplotlib.pyplot as plt #matplotlib is the most common plotting library for Python.
import seaborn as sns #seaborn is an extension to matplotlib
sns.set() #this call sets matplotlib's plotting defaults to seaborn's defaults.
Let's plot a distribution plot
bins = len(np.unique(a)) #define variable "bins" by counting how many unique values there were in the simulation
sns.distplot(a, bins = bins, hist=True, kde=False, norm_hist=True) #make a histogram
plt.xticks(np.unique(a)) #defining which ticks should be plotted on x axis
plt.show() #call to show the plot. Note that we mixed seaborn and matplotlib functions.
Experiment on your own: does it make a difference how many sides the die has?
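As a starting scaffold for that experiment, here is a hedged sketch that parameterizes the number of sides (the monte_carlo_sides name is invented here, not from the notebook):

```python
import numpy as np

def monte_carlo_sides(rolls, sides=6):
    results = []
    for _ in range(rolls):
        sample = np.sort(np.random.randint(1, sides + 1, 3))  # three rolls of an n-sided die
        results.append(sample[0] + sample[1] - sample[2])     # same board-game rule as before
    return results

for sides in (4, 6, 8, 20):
    print(sides, np.round(np.mean(monte_carlo_sides(20000, sides)), 3))
```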
Let's generate some data to plot
n = 10000
theta = np.random.normal(np.pi*3/4, 1, n) #random numbers with normal distribution, centered at 3/4 pi
noise_x = np.random.normal(0,0.2,n) #generating numpy arrays with noise
noise_y = np.random.normal(0,0.2,n)
x = np.sin(theta)+noise_x #calculating x coordinate
y = np.cos(theta)+noise_y #calculating y coordinate
We can plot the data as a joint plot with a hex plot in the middle. This is very convenient for showing density data.
Seaborn makes creating such relatively complex charts very easy.
sns.jointplot(x, y, kind="hex", gridsize=50, cmap="inferno")
Alternatively we can use kernel density estimate for a smooth plot.
sns.jointplot(x=x, y=y, kind="kde",n_levels=35, cmap="viridis", cbar=True)
Some plots can be combined. Note slightly different notation for generating the plot.
n = 1000
theta = np.random.normal(np.pi*3/4, 0.6, n)
noise_x = np.random.normal(0,0.1,n)
noise_y = np.random.normal(0,0.1,n)
x = np.sin(theta)+noise_x
y = np.cos(theta)+noise_y
noise_v = np.random.normal(0,0.2,n)
noise_w = np.random.normal(0,0.2,n)
v = np.sin(theta)*0.2+noise_v
w = np.cos(theta)*0.2+noise_w
ax = sns.kdeplot(x, y, shade=True, cmap="Blues", shade_lowest=False)
ax = sns.kdeplot(v, w, shade=True, cmap="Reds", shade_lowest=False)
n = 100
x = np.linspace(0,10,n) #create an array that starts at 0, ends at 10 and has n evenly spaced elements
y = np.sin(x)
Very simple plotting with just MatPlotLib
plt.plot(y)
Seaborn requires two parameters. Notice the x-axis ticks: in the example above they correspond to the indices of the numpy array.
sns.lineplot(x,y)
Best practice is to specify both axes; the example below shows why:
x = np.random.uniform(0,10,n) #create an array with random, uniform distribution
#between 0 and 10
x = np.sort(x) #sort. This is now an unevenly sampled domain
y = np.sin(x)
plt.plot(y)
plt.plot(x,y)
The first chart is distorted due to uneven sampling, even though the signal is perfectly noise free. When the x and y variables are both specified, the signal is plotted correctly.
Scatter plots are very similar to line plots, but work better with noisy data
n = 200
x = np.linspace(0,10,n) #create an array that starts at 0, ends at 10 and has n evenly spaced elements
y = np.sin(x)+np.random.normal(0,0.1,n)
outlier_x = np.random.uniform(0,10,30)
outlier_y = np.random.uniform(-1.5, 1.5, 30)
x = np.append(x, outlier_x)
y = np.append(y, outlier_y)
sns.lineplot(x,y)
The line plot looks very bad; compare with a scatter plot
sns.scatterplot(x,y)
One can plot multiple plots at the same time
n = 500
x = np.linspace(1,100,n)
y1 = np.cumsum(np.random.normal(0,1,n))
y2 = np.cumsum(np.random.normal(-0.1,1,n))
y3 = np.cumsum(np.random.normal(0,2,n))
sns.lineplot(x,y1, label="First one")
sns.lineplot(x,y2, label="Second one")
sns.lineplot(x,y3, label="Third one")
There are many more plot types, see documentation for MatPlotLib and Seaborn for more information
n = 500
x = np.linspace(1,40,n)
y = 0.3*x+np.sin(x*10)
plt.polar(x,y)
Pandas is an extremely popular package in data science. In short, it deals with 2D arrays called dataframes and has a plethora of data science functions.
Let's start by exploring the Titanic dataset
import numpy as np #import numpy
import pandas as pd #import pandas
import seaborn as sns #import seaborn
import matplotlib.pyplot as plt #import matplotlib
df = sns.load_dataset('titanic') #load the titanic dataset to a pandas dataframe
sns.set() #seaborn's defaults
It is very simple to look around the dataframe. Use .head() to view the top 5 rows. You can also use .head(a) for a defined number of rows.
df.head()
.describe() returns basic statistical data for each column
df.describe()
To get a list of all the attributes in a dataset call list(df).
list(df)
Sorting.
It is sometimes convenient to sort the DataFrame by an attribute. Note that in the example below the original dataframe remains untouched (unsorted).
df.sort_values(by="age").head(15)
To check who was the oldest on the Titanic, sort descending. Note the additional .dropna() that drops the rows containing missing values.
df.dropna().sort_values(by="age", ascending=False).head(15)
To drop NaNs existing only in the age column, a more complex operation is necessary.
Notice that we get more results - the deck attribute is full of NaNs
df2 = df[np.isfinite(df['age'])] #create a new dataframe from the rows of df where df['age']
#is finite, i.e. not NaN or inf
#Note: boolean indexing like this returns a new DataFrame (a copy), unlike the slicing views seen earlier
df2.sort_values(by="age", ascending=False).head(15)
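A more idiomatic pandas way to drop NaNs from a single column is dropna(subset=...), shown here on a small stand-in frame rather than the Titanic data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 58.0, np.nan],
                   "deck": [None, "B", None, "C"]})
df2 = df.dropna(subset=["age"])  # drop rows only where 'age' is missing
print(df2)
```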
We can quickly analyse the age distribution among the passengers
df = df[np.isfinite(df['age'])] #get only results with finite age
sns.distplot(df[df['who'] == "man"]['age'], bins=10, hist=True, label="men") #histogram of men
sns.distplot(df[df['who'] == "woman"]['age'], bins=10, hist=True, label="women")#histogram of women
plt.legend() #add a legend to chart
We can also quickly show what that distribution looks like when split into the different classes.
df1 = df[df['class']=="First"] #defining sub-dataframes containing only passengers of first/second/third class
df2 = df[df['class']=="Second"]
df3 = df[df['class']=="Third"]
sns.distplot(df1[df1['who'] == "man"]['age'], hist=True, label="men, 1st class")
sns.distplot(df1[df1['who'] == "woman"]['age'], hist=True, label="women, 1st class")
plt.legend()
plt.show() #necessary to show the chart. If we omit this, we will have everything in one chart.
sns.distplot(df2[df2['who'] == "man"]['age'], hist=True, label="men, 2nd class")
sns.distplot(df2[df2['who'] == "woman"]['age'], hist=True, label="women, 2nd class")
plt.legend()
plt.show()
sns.distplot(df3[df3['who'] == "man"]['age'], hist=True, label="men, 3rd class")
sns.distplot(df3[df3['who'] == "woman"]['age'], hist=True, label="women, 3rd class")
plt.legend()
plt.show()
Things to notice:
sns.distplot(df1[df1['who'] == "woman"]['age'], bins= 10, hist=True, kde=False, norm_hist=True, label="Histogram")
sns.distplot(df1[df1['who'] == "woman"]['age'], hist=False, label="KDE, bw: 2",
kde_kws={'gridsize' : 100, "bw" : 2})
sns.distplot(df1[df1['who'] == "woman"]['age'], hist=False, label="KDE, bw: 5",
kde_kws={'gridsize' : 100, "bw" : 5})
sns.distplot(df1[df1['who'] == "woman"]['age'], hist=False, label="KDE, bw: 15",
kde_kws={'gridsize' : 100, "bw" : 15})
plt.legend()
Prep takes longer, but reuse is much faster with Python.
Let's load weather data from Lerwick weather station (Shetland Islands, Scotland).
df = pd.read_csv("http://www.ux.uis.no/~atunkiel/lerwickdata.csv", delim_whitespace=True)
#delim_whitespace - whitespace separated file
#notice that we read csv straight from a website
#year and month are separate in this dataset. Let's merge it and tell pandas that it is a date.
df["date"] = df["mm"].map(str) + "." +df["yyyy"].map(str)
df['date'] = pd.to_datetime(df['date'],format = '%m.%Y')
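An alternative is pandas' column-assembly form of to_datetime, which accepts year/month/day components directly (sketched on stand-in values, not the Lerwick data):

```python
import pandas as pd

df = pd.DataFrame({"yyyy": [1930, 1930, 1931], "mm": [1, 2, 3]})
# a dict (or DataFrame) with 'year'/'month'/'day' keys is assembled into datetimes
dates = pd.to_datetime({"year": df["yyyy"], "month": df["mm"], "day": 1})
print(dates)
```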
Quick inspection of the dataframe
df.head()
Let's inspect the maximum recorded temperatures
sns.scatterplot(df['date'],df['tmax[degC]'])
Let's apply a rolling average for a cleaner look at the data.
How wide should the window be? Let's try windows of 10, 20, 30, ..., 90 samples
for i in range(1,10):
    sns.scatterplot(df['date'],df['tmax[degC]'].rolling(10*i, center=True).mean(),
                    alpha=0.3, label=f'rolling average {10*i}')
plt.show()
It seems that a window of 60 gives the best results. Why 60? Probably because there are 12 months in a year and 60 divides nicely by 12.
Let's check that theory by plotting rolling windows with widths of 12, 24, 36, ..., 108.
for i in range(1,10):
    sns.scatterplot(df['date'],df['tmax[degC]'].rolling(12*i, center=True).mean(),
                    alpha=0.3, label=f'rolling average {12*i}')
plt.show()
This explains some articles from the 70s:
Pollution was likely the driving factor in the cooling. Currently we have somewhat overcompensated with CO2 emissions.
We can explore some correlations, like how is temperature related to rain?
x= df['tmax[degC]']
y= df['rain[mm]']
sns.jointplot(x, y, kind="kde", levels=25, cmap="inferno")
We can fairly easily plot LOWESS:
from statsmodels.nonparametric.smoothers_lowess import lowess #import library that will do all the heavy lifting
y = df['tmax[degC]']
x = np.arange(0,len(df['date']),1) #converting into linear space from 0 to length of the dataset
w = lowess(y, x, frac=1/5) #calculating LOWESS
#frac is the window width in terms of a fraction of full width.
And let's plot the results. A small calculation on the x values recovers the year.
sns.scatterplot(x/12+1930,y.rolling(12, center=True).mean(),linewidth=0,color="r",alpha=0.3)
sns.lineplot(w[:,0]/12+1930,w[:,1])
We can explore how much smoothing is too much smoothing
plt.figure(figsize=(15,10)) # bigger plot for visibility
sns.scatterplot(x,y.rolling(12, center=True).mean(),linewidth=0,color="r",alpha=0.3) #scatterplot for reference
for i in reversed(range(1,21)): #note that reversed is used. This changes the plotting order of the charts.
    w = lowess(y, x, frac=1/(i*2))
    if i != 20:
        sns.lineplot(w[:,0],w[:,1], hue=i, palette='viridis',hue_norm=(1,20), legend=False)
    else:
        sns.lineplot(w[:,0],w[:,1], hue=i, palette='viridis',hue_norm=(1,20))
plt.plot()
Violin plots are often better than box plots, as they convey more information
x = df['rain[mm]']
y = df['mm']
sns.violinplot(y,x)
plt.show()
sns.boxplot(y,x)
Let's consider some real drilling data. We can import from web again, straight out of a zip file.
df=pd.read_csv("http://www.ux.uis.no/~atunkiel/Norway-NA-15_$47$_9-F-9%20A%20depth.zip")
list(df)
Verify the contents of the DataFrame. Note that real-life datasets can be overrun with NaNs.
df.head()
We can explore the data through charts, tweaking them to show data best
df = df[np.isfinite(df['Measured Depth m'])]
df = df[np.isfinite(df['MWD Continuous Inclination dega'])]
plt.figure(figsize=(20,10))
sns.scatterplot(df['Measured Depth m'], df['MWD Continuous Inclination dega'],
linewidth=0,
hue=df['Average Hookload kkgf'],
hue_norm=(85,100),
palette="viridis",
size=df['Average Surface Torque kN.m'],
size_norm=(0,10),
sizes=(20,300)
)
Utilize the .describe() function to gauge the normalization range
df['Average Hookload kkgf'].describe()
A simple plot is also useful for understanding the data contents, such as outliers.
df['Average Hookload kkgf'].plot()
While grayscale is typically preferred in publications, when using color choose good color schemes. The examples below are good - the brightness of the colors is consistent with the color scale.
palettes = ['viridis','inferno', 'gray', 'Blues', 'Reds', 'hot', 'cividis']
for i in palettes:
    sns.palplot(sns.color_palette(i,10))
Diverging palettes are also possible, to highlight that something is below or above a certain level (like a physical map)
sns.palplot(sns.diverging_palette(220, 20, n=11))
sns.palplot(sns.diverging_palette(145, 280, s=85, l=25, n=11))
sns.palplot(sns.diverging_palette(255, 133, l=60, n=11, center="dark"))
The ~rainbow 'jet' palette is the most commonly used despite being poor at representing data
https://jakevdp.github.io/blog/2014/10/16/how-bad-is-your-colormap/
https://pdfs.semanticscholar.org/ee79/2edccb2c88e927c81285344d2d88babfb86f.pdf
Seaborn will outright refuse to use it, returning ValueError: No.
#palettes = ['jet']
#for i in palettes:
# sns.palplot(sns.color_palette(i))
Jet colormap will be even more confusing when converted to grayscale.
Don't worry about the code below, it is there just to generate the chart.
def grayify_cmap(cmap):
    """Return a grayscale version of the colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    # convert RGBA to perceived greyscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]
    return cmap.from_list(cmap.name + "_grayscale", colors, cmap.N)
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 6,300)
y = np.linspace(0, 3,300)[:, np.newaxis]
z = 10 * np.cos(x ** 2) * np.exp(-y)
cmaps = [plt.cm.jet, grayify_cmap('jet'), plt.cm.gray]
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
fig.subplots_adjust(wspace=0)
for cmap, ax in zip(cmaps, axes):
    im = ax.imshow(z, cmap=cmap)
    ax.set_title(cmap.name)
    fig.colorbar(im, ax=ax)
Use a palette that has lightness corresponding to value for better results
cmaps = [plt.cm.CMRmap, grayify_cmap('CMRmap'), plt.cm.gray]
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
fig.subplots_adjust(wspace=0)
for cmap, ax in zip(cmaps, axes):
    im = ax.imshow(z, cmap=cmap)
    ax.set_title(cmap.name)
    fig.colorbar(im, ax=ax)
In Python we can easily perform regression using any function.
Let's assume we have a signal governed by the equation \begin{equation*} y = -0.2x^3 + 3x^2 + 30\sin(x) \end{equation*}
This signal is noisy. Noise has a standard deviation of 30.
We have a total of 100 samples to work with, but 20 of them are outliers
np.random.seed(10) #fixed random seed, so you always get the same set of random numbers
n = 20 #outlier count
x_0 = np.linspace(-10,10,100-n) #generate 100 minus n evenly spaced values from -10 to 10
y_0 = -0.2*x_0**3+3*x_0**2+30*np.sin(x_0) #calculate y
sns.lineplot(x_0,y_0,color="red", label="ground truth") #plot clean function
x = x_0 #keep x_0 separate for plotting later
y = y_0 + np.random.normal(0,30,100-n) #generate new function - same as clean one plus noise with sd=30
sns.scatterplot(x,y,label="truth+noise") #plot the noisy chart
outliers_x = np.random.uniform(-12,12,n) #generate n outliers, x coordinates
outliers_y = np.random.uniform(-200,500,n)#... y coordinates
sns.scatterplot(outliers_x,outliers_y,label="outliers") #plot outliers only
x = np.concatenate([x, outliers_x]) #add the outliers to the x array
y = np.concatenate([y, outliers_y]) #add the outliers to the y array
Let's look at the raw data. You can see that the data is very noisy and difficult to fit a regression to!
sns.scatterplot(x,y,label="outliers",linewidth=0)
Let's assume that we know the function itself, but we do not know the parameters; effectively, the known shape of the function is: \begin{equation*} y = a_0 x^3 + a_1 x^2 + a_2\sin(x) \end{equation*}
We define a function in a parametric way that is suitable for the algorithm that will come later.
def fun(a, x, y):
    return a[0]*x**3+a[1]*x**2+a[2]*np.sin(x) - y
We can import an optimization function from SciPy based on least squares method
from scipy.optimize import least_squares
Let's develop two models.
The first uses a robust loss function, so outliers no longer carry such heavy weight (a typical issue with the least squares method).
The second is the vanilla least squares method.
a0 = np.ones(3) #initiating the parameters with ones
model = least_squares(fun, a0, args=(x,y),loss='soft_l1') #the soft_l1 is the more lenient loss function
model2 = least_squares(fun, a0, args=(x,y))
print(model.x) #show the calculated parameters
Model returned:
$a_0 = -0.20864698$ (originally $-0.2$)
$a_1 = 2.95738833$ (originally $3$)
$a_2 = 31.35128843$ (originally $30$)
These are very close to the original values. Let's inspect the result graphically. The match is nearly perfect.
plt.figure(figsize=(10,7)) #let's use slightly bigger chart
sns.scatterplot(x,y, label="input data",s=100)
sns.lineplot(x_0,y_0,color="red", lw=2,alpha=1,label="ground truth")
y2 = model.x[0]*x**3+model.x[1]*x**2+model.x[2]*np.sin(x) #calculate y based on the model
sns.lineplot(x, y2, color="green",lw=5,alpha=0.5, label="model, optimized")
In comparison, vanilla Least Square method did not perform so well:
plt.figure(figsize=(10,7))
sns.scatterplot(x,y, label="input data",s=100)
sns.lineplot(x_0,y_0,color="red", lw=2,alpha=1,label="ground truth")
y3 = model2.x[0]*x**3+model2.x[1]*x**2+model2.x[2]*np.sin(x)
sns.lineplot(x, y3, color="green",lw=2,dashes="-",alpha=1, label="model, vanilla")
Let's calculate $R^2$ values for both models.
Note: in this implementation $R^2$ value can be negative
from sklearn.metrics import r2_score #in a typical Python fashion, we just import what is needed
print(f'R^2 value for optimized fitting is: {r2_score(y,y2)}' )
print(f'R^2 value for standard fitting is : {r2_score(y,y3)}' )
The objectively much better model has a lower $R^2$ value. As you can see, this is not the ultimate metric.
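The negative values mentioned above follow from the definition scikit-learn uses, R² = 1 − SS_res/SS_tot. A quick hand-rolled version (a sketch of the formula, not the library code) shows how a fit worse than predicting the mean goes negative:

```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0])
print(r_squared(y, y))                             # perfect fit: 1.0
print(r_squared(y, np.array([10.0, 10.0, 10.0])))  # worse than predicting the mean: negative
```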
One class of optimization tasks is, in principle, finding a minimum of a function. Some examples:
Let's find the minimum of the function that we explored before.
We will use a differential evolution algorithm, which is conveniently available in the scipy library.
We will need to redefine the function slightly to make it work with the algorithm. Minimum-seeking algorithms also often need boundaries - we have to provide these as well.
from scipy.optimize import differential_evolution #import differential evolution algorithm
bounds = [(-10, 10)] #define bounds within which algorithm will seek minimum
def fun_min(x):
    return -0.2*x**3+3*x**2+30*np.sin(x) #redefined function - it simply returns y
result = differential_evolution(fun_min, bounds) #here the magic happens. We feed the function and the bounds
#to the algorithm
sns.lineplot(x_0,y_0,color="gray", lw=2,alpha=1,label="ground truth", zorder=1) #plot the ground truth
#zorder defines the plotting
#sequence, aka. what is on top
plt.scatter(result.x[0], result.fun[0],marker="x",s=200,c="r", label="minimum", zorder=10)
plt.legend()
For future reference, let's see how that works when a function has multiple parameters.
from mpl_toolkits.mplot3d import Axes3D #importing some things necessary for 3D plot
def fun3d(vars): #definition of a sample 3D function. Note that it still takes only one argument,
    x = vars[0]  #but now it is an array, which we have to unpack
    y = vars[1]
    return np.sin(x)*np.exp(-x**2-y**2)
x = np.linspace(-3,3,100) #defining X and Y space
y = np.linspace(-3,3,100)
x, y = np.meshgrid(x, y) #convert x and y to a mesh, so we get a point for each intersection of x and y
z = fun3d([x,y]) #feed the function
fig = plt.figure(figsize=(10,7))
ax = fig.gca(projection='3d') #define 3D plot
ax.plot_surface(x, y, z, cmap=plt.cm.viridis, linewidth=0.2, alpha=1) #let's plot the function to verify
#that everything works
plt.show()
bounds = [(-3,3),(-3,3)] #define the bounds
result = differential_evolution(fun3d, bounds) #feed the function and the bounds to DE algorithm
#and we plot everything again
fig = plt.figure(figsize=(10,7))
ax = fig.gca(projection='3d')
ax.plot_surface(x, y, z, cmap=plt.cm.viridis, linewidth=0.2, alpha=0.3)
ax.scatter(result.x[0], result.x[1], result.fun, c="r", marker="x",s=200)
plt.show()
What is the side length of a square whose area equals that of a circle? People have tried to solve this since around 1800 BCE; nowadays we need just a few lines of Python.
def quadrature(x): #returns the absolute difference between the area of
                   #a circle (radius = 1) and a square (side length x)
    return np.abs(np.pi - x**2) #assume radius = 1, hence pi*1^2 = pi
bounds = [(0,2)]
result = differential_evolution(quadrature, bounds)
ax=plt.gca()
patch = plt.Circle((0,0), radius= 1, fill=False,color="r")
ax.add_patch(patch)
patch = plt.Rectangle((-0.5*result.x,-0.5*result.x),result.x,result.x, fill=False, color="b")
ax.add_patch(patch)
plt.axis('scaled')
plt.show()
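Analytically the side length should be √π ≈ 1.7725 (from x² = π), which we can verify against the optimizer (same scipy call as above, with a seed assumed for reproducibility):

```python
import numpy as np
from scipy.optimize import differential_evolution

def quadrature(x):
    # x arrives as a length-1 array; compare the circle area (pi) with the square area
    return abs(np.pi - x[0] ** 2)

result = differential_evolution(quadrature, [(0, 2)], seed=1)
print(result.x[0], np.sqrt(np.pi))  # both approximately 1.7725
```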
To visualize the actual function we minimized:
x = np.linspace(0,2,100)
y = quadrature(x)
sns.lineplot(x,y)
Oftentimes we want to use a parameter that is not delivered in the database, so we need to calculate it. Below is a short example of calculating dogleg severity based on inclination, done on the dataset we used before.
Let's import the dataset again and plot inclination data
df=pd.read_csv("http://www.ux.uis.no/~atunkiel/Norway-NA-15_$47$_9-F-9%20A%20depth.zip")
sns.scatterplot(df['Measured Depth m'],df['MWD Continuous Inclination dega'],linewidth=0,size=1, alpha=0.3)
For clarity, let's develop two interim variables:
We will also smooth out the data a little bit with a rolling mean. Deltas will be calculated over a step slightly larger than one to avoid noise.
df.ffill(inplace=True) #forward filling and backward filling to avoid NaNs in the dataset
df.bfill(inplace=True)
step_d=20 #steps for differentiation and rolling average
step_r = 100 #as variables for easy tweaking
df['dDepth'] = df['Measured Depth m'].rolling(step_r, center=True).mean().diff(step_d)
df['dInc'] = df['MWD Continuous Inclination dega'].rolling(step_r, center=True).mean().diff(step_d)
df['DLS'] = df['dInc']/df['dDepth'] #dogleg severity: change in inclination per unit depth
sns.scatterplot(df['Measured Depth m'],df['MWD Continuous Inclination dega'],
hue=df['DLS'],linewidth=0,size=1, alpha=0.3)
We immediately see an issue. The color scale that we wanted to use to visualize the dogleg is out of whack.
We can inspect basic data parameters with the one-liner we learned about before.
df['dInc'].describe()
and get some more info through a histogram
sns.distplot(df['dInc'].dropna())
With this new information, we can specify the limits of our color scale
plt.figure(figsize=(10,7))
sns.scatterplot(df['Measured Depth m'], df['MWD Continuous Inclination dega'],
linewidth=0,
hue=df['dInc'],
hue_norm=(0,0.25),
palette="viridis",
)
Everything looks correct at first glance. However, there are a number of potential issues here:
We did a rolling average and calculated the difference with a fixed step, without checking the actual data frequency. This might have smoothed out some important data. We should resample the data for a consistent result.
There are outliers in the dataset that should be taken out first.
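The resampling idea above can be sketched with np.interp on a uniform depth grid (synthetic stand-in data, not the real log):

```python
import numpy as np

# synthetic, unevenly sampled stand-in for the depth/inclination log
depth = np.sort(np.random.uniform(0, 1000, 500))
incl = depth * 0.06 + np.random.normal(0, 0.5, 500)

# resample onto a uniform 1 m grid before smoothing or differentiating
grid = np.arange(depth.min(), depth.max(), 1.0)
incl_uniform = np.interp(grid, depth, incl)
print(grid.shape, incl_uniform.shape)
```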
Let's explore correlation. We will import the diamonds dataset. It is a data frame with 53940 rows and 10 variables:
df = pd.read_csv('http://www.ux.uis.no/~atunkiel/diamonds.csv')
sns.scatterplot(df['carat'],df['price'])
The readability of this chart is poor - the points widely overlap each other. We can try a seaborn jointplot.
Note: colors were log-normalized.
import matplotlib.colors as colors
sns.jointplot(df['carat'],df['price'], kind="hex", gridsize=75,
cmap="inferno", joint_kws={"norm" : colors.LogNorm()})
Many correlations in life are logarithmic in nature. Let's quickly explore if it is usable here.
You will clearly see that the log-log relationship is by far the most linear. Further analysis may benefit greatly from this finding, as linear correlations are much easier to model than logarithmic ones.
fig, axs = plt.subplots(2, 2, figsize=(10,10))
axs[0, 0].hexbin(df['carat'],df['price'],xscale='linear', yscale='linear', norm = colors.LogNorm())
axs[0, 0].set_title('linear-linear')
axs[0, 1].hexbin(df['carat'],df['price'],xscale='linear', yscale='log', norm = colors.LogNorm())
axs[0, 1].set_title('linear-log')
axs[1, 0].hexbin(df['carat'],df['price'],xscale='log', yscale='linear', norm = colors.LogNorm())
axs[1, 0].set_title('log-linear')
axs[1, 1].hexbin(df['carat'],df['price'],xscale='log', yscale='log', norm = colors.LogNorm())
axs[1, 1].set_title('log-log')
Seaborn allows one to plot a useful pairplot to look at correlations between parameters
sns.pairplot(df.sample(100))
One can put a number on the correlation.
sns.heatmap(df.corr(), cmap="viridis")
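The same numbers the heatmap is built from can be printed directly with df.corr() (a small synthetic example, not the diamonds data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10.0)})
df["y"] = 2 * df["x"] + 1             # perfectly linearly correlated with x
df["z"] = np.random.normal(0, 1, 10)  # unrelated noise
print(df.corr())                      # Pearson correlation matrix
```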
n = 1000
results = []
for i in range(n):
    if (np.average(np.random.randint(0,2,11)) <= 3):
        results.append(1)
    else:
        results.append(0)