Learn Python in 3 days : Step by Step Guide
- Getting Started with Python
- Python 2.7 vs. 3.6
- Python for Data Science
- How to install Python?
- Spyder Shortcut keys
- Basic programs in Python
- Comparison, Logical and Assignment Operators
- Data Structures and Conditional Statements
- Python Data Structures
- Python Conditional Statements
- Python Libraries
- List of popular packages (comparison with R)
- Popular python commands
- How to import a package
- Data Manipulation using Pandas
- Pandas Data Structures – Series and DataFrame
- Important Pandas Functions (vs. R functions)
- Examples – Data analysis with Pandas
- Data Science with Python
- Logistic Regression
- Decision Tree
- Random Forest
- Grid Search – Hyper Parameter Tuning
- Cross Validation
- Preprocessing Steps
Google yields thousands of articles on this topic. Some bloggers oppose Python 2.7 and some favor it. If you filter your search to only recent articles (late 2016 onwards), you will see that the majority of bloggers are in favor of Python 3.6. See the following reasons to choose Python 3.6.
1. The official end-of-support date for Python 2.7 is the year 2020. After that, there will be no support from the community, so it makes little sense to start learning 2.7 today.
2. Python 3.6 supports 95% of the top 360 Python packages and almost 100% of the top packages for data science.
What’s new in Python 3.6
Python 3 is cleaner and faster, and it is the language of the future. It fixed major issues in the Python 2 series. Python 3 was first released in 2008, and robust versions of the 3.x series have been released over the nine years since.
There are several reasons to learn Python. Some of them are as follows –
- Python works well for automating the various steps of a predictive modeling workflow.
- Python has robust libraries for machine learning, natural language processing, deep learning, big data and artificial intelligence.
- Python wins over R when it comes to deploying machine learning models in production.
- It can be easily integrated with big data frameworks such as Spark and Hadoop.
- Python has a great online community support.
Did you know these sites are built with Python?
- YouTube
- Dropbox
- Disqus
How to Install Python
There are two ways to download and install Python –
- Download Anaconda. It comes with the Python software along with popular libraries preinstalled.
- Download Python from its official website. You will have to install libraries manually.
Recommended : Go for the first option and download Anaconda. It saves a lot of time in learning and coding Python. Anaconda ships with the following coding environments –
- Jupyter (Ipython) Notebook
- Spyder
Spyder – Python Coding Environment
Spyder Shortcut Keys
- Press F5 to run the entire script
- Press F9 to run selection or line
- Press Ctrl + 1 to comment / uncomment
- Place the cursor in front of a function name and press Ctrl + I to see the documentation of the function
- Run %reset -f to clean workspace
- Ctrl + Left click on object to see source code
- Ctrl+Enter executes the current cell.
- Shift+Enter executes the current cell and advances the cursor to the next cell
List of arithmetic operators with examples
Arithmetic Operators | Operation | Example |
---|---|---|
+ | Addition | 10 + 2 = 12 |
- | Subtraction | 10 - 2 = 8 |
* | Multiplication | 10 * 2 = 20 |
/ | Division | 10 / 2 = 5.0 |
% | Modulus (Remainder) | 10 % 3 = 1 |
** | Power | 10 ** 2 = 100 |
// | Floor Division | 17 // 3 = 5 |
(x + (d-1)) // d | Ceiling | (17 + (3-1)) // 3 = 6 |
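A quick sketch of the last two rows in action, using the math module only as a cross-check for the ceiling trick –
# Floor division vs. the integer ceiling trick from the table above
import math

x, d = 17, 3
print(x // d)               # 5 : floor division
print((x + (d - 1)) // d)   # 6 : ceiling computed with integer arithmetic
print(math.ceil(x / d))     # 6 : same result using math.ceil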
Basic Programs
Example 1
#Basics
x = 10
y = 3
print("10 divided by 3 is", x/y)
print("remainder after 10 divided by 3 is", x%y)
10 divided by 3 is 3.3333333333333335
remainder after 10 divided by 3 is 1
Example 2
x = 100
x > 80 and x <=95
x > 35 or x < 60
x > 80 and x <= 95 returns False
x > 35 or x < 60 returns True
Comparison & Logical Operators | Description | Example |
---|---|---|
> | Greater than | 5 > 3 returns True |
< | Less than | 5 < 3 returns False |
>= | Greater than or equal to | 5 >= 3 returns True |
<= | Less than or equal to | 5 <= 3 returns False |
== | Equal to | 5 == 3 returns False |
!= | Not equal to | 5 != 3 returns True |
and | True only if both conditions hold | x > 18 and x <= 35 |
or | True if at least one condition holds | x > 35 or x < 60 |
not | Opposite of the condition | not(x > 7) |
Assignment Operators
Assignment operators are used to assign a value to a variable, often combined with an arithmetic operation. For example, x += 25 means x = x + 25.
x = 100
y = 10
x += y
print(x)
Result : 110
In this case, x += y implies x = x + y, which is x = 100 + 10.
Similarly, you can use x -= y, x *= y and x /= y.
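For illustration, a quick sketch of these compound assignment operators in action –
x = 100
y = 10
x -= y     # x = x - y  -> 90
x *= y     # x = x * y  -> 900
x /= y     # x = x / y  -> 90.0 (division always returns a float)
print(x)   # 90.0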
Python Data Structure
In every programming language, it is important to understand the data structures. Following are some data structures used in Python.
1. List
A list is a sequence of multiple values. It allows us to store different types of data such as integers, floats and strings. See the examples of lists below. The first is an integer list containing only integers. The second is a string list containing only string values. The third is a mixed list containing integer, string and float values.
- x = [1, 2, 3, 4, 5]
- y = ['A', 'O', 'G', 'M']
- z = ['A', 4, 5.1, 'M']
Get List Item
We can extract list items using indexes. The index starts at 0 and ends at (number of elements - 1).
x = [1, 2, 3, 4, 5]
x[0]
x[1]
x[4]
x[-1]
x[-2]
x[0] returns 1
x[1] returns 2
x[4] returns 5
x[-1] returns 5
x[-2] returns 4
x[0] picks the first element from the list. A negative index tells Python to count from right to left, so x[-1] selects the last element of the list.
You can select multiple elements from a list using slicing –
x[:3] returns [1, 2, 3]
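A few more slicing patterns on the same list x, shown as a quick sketch –
x = [1, 2, 3, 4, 5]
x[:3]      # [1, 2, 3]  first three elements
x[1:4]     # [2, 3, 4]  elements at index 1, 2 and 3
x[-2:]     # [4, 5]     last two elements
x[::2]     # [1, 3, 5]  every second element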
2. Tuple
A tuple is similar to a list in the sense that it is a sequence of elements. The differences between a list and a tuple are as follows –
- A tuple cannot be changed once created, whereas a list can be modified.
- A tuple is created by placing comma-separated values inside parentheses ( ), whereas a list is created inside square brackets [ ].
Examples
K = (1, 2, 3)
City = ('Delhi', 'Mumbai', 'Bangalore')
Perform for loop on Tuple
for i in City:
    print(i)
Delhi
Mumbai
Bangalore
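To illustrate the immutability point above, here is a minimal sketch of what happens when you try to modify a tuple versus a list –
K = (1, 2, 3)
try:
    K[0] = 10              # tuples are immutable, so this raises an error
except TypeError as e:
    print(e)               # 'tuple' object does not support item assignment

L = [1, 2, 3]
L[0] = 10                  # lists are mutable, so this works
print(L)                   # [10, 2, 3]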
Functions
Like print(), you can create your own custom functions, also called user-defined functions. They help you automate repetitive tasks and call reusable code in an easier way.
Rules to define a function
- A function starts with the def keyword, followed by the function name and parentheses ( )
- The function body starts after a colon (:) and is indented
- The keyword return exits the function and passes back the value of the expression that follows it.
def sum_fun(a, b):
    result = a + b
    return result
z = sum_fun(10, 15)
Result : z = 25
Suppose you want Python to assume 0 as the default value if no value is specified for parameter b.
def sum_fun(a, b=0):
    result = a + b
    return result
z = sum_fun(10)
Result : z = 10
Conditional Statements (if else)
Conditional statements are commonly used in coding. The IF ELSE statement can be read as: "if a condition holds true, then execute something; else execute something else".
Example
k = 27
if k%5 == 0:
    print('Multiple of 5')
else:
    print('Not a Multiple of 5')
Result : Not a Multiple of 5
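You can also chain more than two branches with elif. A small sketch building on the example above –
k = 27
if k % 5 == 0:
    print('Multiple of 5')
elif k % 3 == 0:
    print('Multiple of 3')
else:
    print('Neither a multiple of 3 nor a multiple of 5')
Here the result would be Multiple of 3, since 27 is divisible by 3 but not by 5.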
Popular python packages for Data Analysis & Visualization
Some of the leading packages in Python along with equivalent libraries in R are as follows-
- pandas. For data manipulation and data wrangling. It provides a collection of functions to understand and explore data. It is the counterpart of the dplyr and reshape2 packages in R.
- NumPy. For numerical computing. It is a package for efficient array computations. It allows us to perform operations on an entire column or table in one line. It is roughly comparable to the Rcpp package in R, which addresses R's speed limitations.
- SciPy. For mathematical and scientific functions such as integration, interpolation, signal processing, linear algebra and statistics. It is built on NumPy.
- Scikit-learn. A collection of machine learning algorithms. It is built on NumPy and SciPy. It can perform the techniques that are done in R using the glm, knn, randomForest, rpart and e1071 packages.
- Matplotlib. For data visualization. It is the leading package for graphics in Python. It is equivalent to the ggplot2 package in R.
- Statsmodels. For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics. It allows users to run descriptive statistics, impute missing values, run statistical tests and export table output to HTML format.
- pandasql. It allows SQL users to write SQL queries in Python. It is very helpful for people who love writing SQL queries to manipulate data. It is equivalent to the sqldf package in R.
Most of the above packages come preinstalled with Anaconda.
Comparison of Python and R Packages by Data Mining Task
Task | Python Package | R Package |
---|---|---|
IDE | Rodeo / Spyder | Rstudio |
Data Manipulation | pandas | dplyr and reshape2 |
Machine Learning | Scikit-learn | glm, knn, randomForest, rpart, e1071 |
Data Visualization | ggplot + seaborn + bokeh | ggplot2 |
Character Functions | Built-In Functions | stringr |
Reproducibility | Jupyter | Knitr |
SQL Queries | pandasql | sqldf |
Working with Dates | datetime | lubridate |
Web Scraping | beautifulsoup | rvest |
Popular Python Commands
The commands below would help you to install and update new and existing packages. Let’s say, you want to install / uninstall pandas package.
Install Package
!pip install pandas
Uninstall Package
!pip uninstall pandas
Show Information about Installed Package
!pip show pandas
List of Installed Packages
!pip list
Upgrade a package
!pip install --upgrade pandas
How to Import a Package
There are multiple ways to import a package –
1. import pandas as pd
It imports the package pandas under the alias pd. A function from the package, say DataFrame, is then called as pd.DataFrame.
2. from pandas import *
It imports the whole package, and the function DataFrame is executed simply by typing DataFrame. This style sometimes creates confusion when the same function name exists in more than one package.
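A minimal sketch contrasting the two styles (the example column name 'x' is just an illustration) –
# Style 1 : alias the package and call functions through the alias
import pandas as pd
df1 = pd.DataFrame({'x': [1, 2, 3]})

# Style 2 : import everything into the current namespace
from pandas import *
df2 = DataFrame({'x': [1, 2, 3]})
The aliased style is generally preferred because it keeps it obvious which package a function comes from.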
In the pandas package, there are two data structures – Series and DataFrame. These structures are explained below in detail –
- Series is a one-dimensional array. You can access individual elements of a Series by position. It is similar to a vector in R.
import numpy as np
import pandas as pd
s1 = pd.Series(np.random.randn(5))
s1
0   -2.412015
1   -0.451752
2    1.174207
3    0.766348
4   -0.361815
dtype: float64
Extract first and second value
You can get a particular element of a series using its index value. For example, s1[0] returns the first value and s1[1] returns the second –
-2.412015
-0.451752
The first three values can be extracted with s1[:3] –
0   -2.412015
1   -0.451752
2    1.174207
Comparison of Data Type in Python and Pandas
The following table shows how Python and pandas package stores data.
Data Type | Pandas | Standard Python |
---|---|---|
For character variable | object | string |
For categorical variable | category | – |
For Numeric variable without decimals | int64 | int |
Numeric characters with decimals | float64 | float |
For date time variables | datetime64 | – |
Important Pandas Functions
The table below compares pandas functions with R functions for various data wrangling and manipulation tasks. It will help you memorise pandas functions. It is handy information for programmers who are new to Python, and it covers most of the frequently used data exploration tasks.
Functions | R | Python (pandas package) |
---|---|---|
Installing a package | install.packages('name') | !pip install name |
Loading a package | library(name) | import name as other_name |
Checking working directory | getwd() | import os; os.getcwd() |
Setting working directory | setwd() | os.chdir() |
List files in a directory | dir() | os.listdir() |
Remove an object | rm('name') | del object |
Select Variables | select(df, x1, x2) | df[['x1', 'x2']] |
Drop Variables | select(df, -(x1:x2)) | df.drop(['x1', 'x2'], axis = 1) |
Filter Data | filter(df, x1 >= 100) | df.query('x1 >= 100') |
Structure of a DataFrame | str(df) | df.info() |
Summarize dataframe | summary(df) | df.describe() |
Get row names of dataframe “df” | rownames(df) | df.index |
Get column names | colnames(df) | df.columns |
View Top N rows | head(df,N) | df.head(N) |
View Bottom N rows | tail(df,N) | df.tail(N) |
Get dimension of data frame | dim(df) | df.shape |
Get number of rows | nrow(df) | df.shape[0] |
Get number of columns | ncol(df) | df.shape[1] |
Length of data frame | length(df) | len(df) |
Get random 3 rows from dataframe | sample_n(df, 3) | df.sample(n=3) |
Get random 10% rows | sample_frac(df, 0.1) | df.sample(frac=0.1) |
Check Missing Values | is.na(df$x) | pd.isnull(df.x) |
Sorting | arrange(df, x1, x2) | df.sort_values(['x1', 'x2']) |
Rename Variables | rename(df, newvar = x1) | df.rename(columns={'x1': 'newvar'}) |
Data Manipulation with pandas – Examples
1. Import Required Packages
You can import required packages using the import statement. In the syntax below, we are asking Python to import the numpy and pandas packages. The 'as' keyword is used to alias the package name.
import numpy as np
import pandas as pd
2. Build DataFrame
We can build a dataframe using the DataFrame() function of the pandas package.
mydata = {'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
          'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
          'cost' : [1020, 1625.2, 1204, 1003.7, 1020, 1124]}
df = pd.DataFrame(mydata)
In this dataframe, we have three variables – productcode, sales, cost.
To import data from a CSV file
mydata = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
3. To see number of rows and columns
You can run the command below to find out number of rows and columns.
df.shape
Result : (6, 3). It means 6 rows and 3 columns.
4. To view first 3 rows
The df.head(N) function can be used to view the first N rows.
df.head(3)
     cost  productcode   sales
0  1020.0           AA  1010.0
1  1625.2           AA  1025.2
2  1204.0           AA  1404.2
5. Select or Drop Variables
df.productcode
df["productcode"]
df.loc[: , "productcode"]
To select variable by column position, you can use df.iloc function. In the example below, we are selecting second column. Column Index starts from 0. Hence, 1 refers to second column.
df.iloc[: , 1]
df[["productcode", "cost"]]
df.loc[ : , ["productcode", "cost"]]
Drop Variable
We can remove variables by using df.drop() function. See the example below –
df2 = df.drop(['sales'], axis = 1)
6. To summarize data frame
To summarize or explore data, you can submit the command below.
df.describe()
              cost       sales
count     6.000000     6.00000
mean   1166.150000  1242.65000
std     237.926793   230.46669
min    1003.700000  1010.00000
25%    1020.000000  1058.90000
50%    1072.000000  1205.85000
75%    1184.000000  1366.07500
max    1625.200000  1604.80000
To summarise all the character variables, you can use the following script.
df.describe(include=['O'])
Similarly, you can use df.describe(include=['float64']) to view the summary of all the numeric variables with decimals.
To select only a particular variable, you can write the following code –
df.productcode.describe()
OR
df[“productcode”].describe()
count          6
unique         2
top           BB
freq           3
Name: productcode, dtype: object
7. To calculate summary statistics
df.sales.mean()
df.sales.median()
df.sales.count()
df.sales.min()
df.sales.max()
8. Filter Data
Suppose you are asked to apply the condition – productcode is equal to "AA" and sales greater than or equal to 1250.
df1 = df[(df.productcode == "AA") & (df.sales >= 1250)]
It can also be written like this :
df1 = df.query('(productcode == "AA") & (sales >= 1250)')
In the second style, we do not need to repeat the DataFrame name along with each variable name.
9. Sort Data
In the code below, we arrange the data in ascending order by sales.
df.sort_values(['sales'])
10. Group By : Summary by Grouping Variable
Like SQL GROUP BY, you can summarize a continuous variable by a classification variable. In this case, we are calculating the average sales and cost by product code.
df.groupby(df.productcode).mean()
                    cost        sales
productcode
AA           1283.066667  1146.466667
BB           1049.233333  1338.833333
Instead of summarising multiple variables, you can run it for a single variable, i.e. sales. Submit the following script.
df["sales"].groupby(df.productcode).mean()
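If you need several summary statistics at once, the agg() method can be applied to the grouped data; a minimal sketch on the same dataframe –
# Multiple summary statistics of sales per product code
df.groupby('productcode')['sales'].agg(['mean', 'min', 'max'])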
11. Define Categorical Variable
Let’s create a classification variable – id which contains only 3 unique values – 1/2/3.
df0 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 2, 2]})
Let’s define as a categorical variable.
We can use astype() function to make id as a categorical variable.
df0.id = df0["id"].astype('category')
Summarize this classification variable to check descriptive statistics.
df0.describe()
        id
count    7
unique   3
top      2
freq     3
Frequency Distribution
You can calculate the frequency distribution of a categorical variable. It is one of the methods to explore a categorical variable.
df['productcode'].value_counts()
BB    3
AA    3
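To see relative frequencies instead of raw counts, value_counts also accepts normalize=True; a small sketch –
# Share of each product code instead of counts
df['productcode'].value_counts(normalize=True)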
12. Generate Histogram
df['sales'].hist()
You can also generate a box plot of the sales variable –
df.boxplot(column='sales')
Data Science using Python – Examples
In this section, we cover how to perform data mining and machine learning tasks with Python. sklearn is the most frequently used library for running data mining and machine learning algorithms. We will also cover the statsmodels library for regression techniques; statsmodels generates nicely formatted output that can be used in project reports and presentations.
1. Import the required libraries
Import the following libraries before reading or exploring data
#Import required libraries
import pandas as pd
import statsmodels.api as sm
import numpy as np
2. Download and import data into Python
With the pandas library, we can easily read data from the web into Python.
# Read data from web
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
Variables | Type | Description |
---|---|---|
gre | Continuous | Graduate Record Exam score |
gpa | Continuous | Grade Point Average |
rank | Categorical | Prestige of the undergraduate institution |
admit | Binary | Admission into graduate school |
The binary variable admit is a target variable.
3. Explore Data
Let’s explore data. We’ll answer the following questions –
- How many rows and columns in the data file?
- What are the distributions of the variables?
- Check if any outlier(s)
- If outlier(s), treat them
- Check if any missing value(s)
- Impute Missing values (if any)
# See no. of rows and columns
df.shape
Result : 400 rows and 4 columns
In the code below, we rename the variable rank to 'position', since rank is already the name of a pandas method.
# rename rank column
df = df.rename(columns={'rank': 'position'})
Summarize and plot all the columns.
# Summarize
df.describe()
# plot all of the columns
df.hist()
Categorical variable Analysis
It is important to check the frequency distribution of a categorical variable. It helps to answer the question of whether the data is skewed.
# Summarize
df.position.value_counts(ascending=True)
1     61
4     67
3    121
2    151
Generating Crosstab
By looking at the cross tabulation report, we can check whether we have enough events against each unique value of the categorical variable.
pd.crosstab(df['admit'], df['position'])
position   1   2   3   4
admit
0         28  97  93  55
1         33  54  28  12
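In recent pandas versions, crosstab also accepts a normalize argument, which makes the comparison across position levels easier; a small sketch –
# Proportion of admits within each position level
pd.crosstab(df['admit'], df['position'], normalize='columns')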
Number of Missing Values
We can write a simple loop to figure out the number of blank values in all variables in a dataset.
for i in list(df.columns):
    k = sum(pd.isnull(df[i]))
    print(i, k)
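For reference, pandas also offers an equivalent one-liner –
# Count missing values in every column in one step
df.isnull().sum()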
Logistic Regression
Logistic Regression is a special type of regression where the target variable is categorical in nature and the independent variables can be discrete or continuous. In this post, we demonstrate only binary logistic regression, in which the target variable takes only two values. Unlike linear regression, a logistic regression model returns the probability of the target variable. It assumes a binomial distribution of the dependent variable; in other words, it belongs to the binomial family.
In Python, we can write an R-style model formula y ~ x1 + x2 + x3 using the patsy and statsmodels libraries. In the formula, we need to mark the variable 'position' as categorical by wrapping it inside a capital C(). You can also define the reference category using the reference= option.
#Reference Category
from patsy import dmatrices, Treatment
y, X = dmatrices('admit ~ gre + gpa + C(position, Treatment(reference=4))', df, return_type = 'dataframe')
P | P_1 | P_2 | P_3 |
---|---|---|---|
3 | 0 | 0 | 1 |
3 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
4 | 0 | 0 | 0 |
4 | 0 | 0 | 0 |
2 | 0 | 1 | 0 |
Split Data into two parts
80% of the data goes to the training dataset, which is used for building the model, and 20% goes to the test dataset, which is used for validating the model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
By default, sm.Logit (used without the formula interface) does not add an intercept. In our case, dmatrices has already added an Intercept column to X, so it is carried into X_train as a predictor.
#Fit Logit model
logit = sm.Logit(y_train, X_train)
result = logit.fit()

#Summary of Logistic regression model
result.summary()
result.params
Logit Regression Results
==============================================================================
Dep. Variable:     admit              No. Observations:    320
Model:             Logit              Df Residuals:        315
Method:            MLE                Df Model:            4
Date:              Sat, 20 May 2017   Pseudo R-squ.:       0.03399
Time:              19:57:24           Log-Likelihood:      -193.49
converged:         True               LL-Null:             -200.30
                                      LLR p-value:         0.008627
=======================================================================================
                     coef    std err        z      P>|z|     [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
C(position)[T.1]   1.4933      0.440    3.392      0.001      0.630     2.356
C(position)[T.2]   0.6771      0.373    1.813      0.070     -0.055     1.409
C(position)[T.3]   0.1071      0.410    0.261      0.794     -0.696     0.910
gre                0.0005      0.001    0.442      0.659     -0.002     0.003
gpa               -0.4613      0.214   -2.152      0.031     -0.881    -0.041
=======================================================================================
Confusion Matrix and Odds Ratio
The odds ratio is the exponential of the parameter estimates.
#Confusion Matrix
result.pred_table()
#Odd Ratio
np.exp(result.params)
#prediction on test data
y_pred = result.predict(X_test)
Calculate Area under Curve (ROC)
# AUC on test data
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc(false_positive_rate, true_positive_rate)
Calculate Accuracy Score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, [1 if p > 0.5 else 0 for p in y_pred])
Decision Tree Model
Decision trees can have a continuous or categorical target variable. When the target is continuous, the tree is called a regression tree; when it is categorical, it is called a classification tree. At each step, the algorithm selects the variable that best splits the set of values. There are several criteria and algorithms to find the best split, such as Gini, entropy, C4.5 and chi-square. Decision trees have several advantages: they are simple to use and easy to understand, they require very few data preparation steps, they can handle mixed data – both categorical and continuous variables – and they are very fast.
#Drop Intercept from predictors for tree algorithms
X_train = X_train.drop(['Intercept'], axis = 1)
X_test = X_test.drop(['Intercept'], axis = 1)

#Decision Tree
from sklearn.tree import DecisionTreeClassifier
model_tree = DecisionTreeClassifier(max_depth=7)

#Fit the model:
model_tree.fit(X_train, y_train)

#Make predictions on test set
predictions_tree = model_tree.predict_proba(X_test)

#AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_tree[:,1])
auc(false_positive_rate, true_positive_rate)
Result : AUC = 0.664
Important Note
Feature engineering plays an important role in building predictive models. In the above case, we have not performed variable selection. We can also select the best parameters using the grid search fine-tuning technique.
Random Forest Model
A decision tree has the limitation of overfitting, which means it does not generalize patterns well and is very sensitive to small changes in the training data. To overcome this problem, random forest comes into the picture. It grows a large number of trees on randomised samples of the data and selects a random subset of variables to grow each tree, which makes it a more robust algorithm than a single decision tree. It is one of the most popular machine learning algorithms, is commonly used in data science competitions, and has become part of every data science toolkit.
#Random Forest
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(n_estimators=100, max_depth=7)

#Fit the model:
target = y_train['admit']
model_rf.fit(X_train, target)

#Make predictions on test set
predictions_rf = model_rf.predict_proba(X_test)

#AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
auc(false_positive_rate, true_positive_rate)

#Variable Importance
importances = pd.Series(model_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances)
importances.plot.bar()
Result : AUC = 0.6974
Grid Search – Hyper Parameters Tuning
The sklearn library makes hyper-parameter tuning very easy. Grid search is a strategy to select the best parameters for an algorithm. In scikit-learn, hyper-parameters are passed as arguments to the constructor of the estimator classes, for example max_features in RandomForestClassifier or alpha in Lasso.
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
target = y_train['admit']
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 3, 4]
}

CV_rfc = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='roc_auc')
CV_rfc.fit(X_train, target)

#Parameters with Scores (cv_results_ replaces the older grid_scores_ attribute)
CV_rfc.cv_results_

#Best Parameters
CV_rfc.best_params_
CV_rfc.best_estimator_

#Make predictions on test set
predictions_rf = CV_rfc.predict_proba(X_test)

#AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
auc(false_positive_rate, true_positive_rate)
Cross Validation
# Cross Validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict,cross_val_score
target = y['admit']
prediction_logit = cross_val_predict(LogisticRegression(), X, target, cv=10, method='predict_proba')
#AUC
cross_val_score(LogisticRegression(fit_intercept = False), X, target, cv=10, scoring='roc_auc')
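cross_val_score returns one AUC value per fold. To summarise the ten folds into a single number, you can average them; a small sketch reusing the objects defined above –
scores = cross_val_score(LogisticRegression(fit_intercept=False), X, target, cv=10, scoring='roc_auc')
print(scores.mean(), scores.std())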
Data Mining : PreProcessing Steps
1. The machine learning package sklearn requires all categorical variables in numeric form. Hence, we need to convert all character/categorical variables to numeric. This can be accomplished using the following script; sklearn's LabelEncoder provides a ready-made function for this step.
from sklearn.preprocessing import LabelEncoder

def ConverttoNumeric(df):
    cols = list(df.select_dtypes(include=['category','object']))
    le = LabelEncoder()
    for i in cols:
        try:
            df[i] = le.fit_transform(df[i])
        except:
            print('Error in Variable :'+i)
    return df

ConverttoNumeric(df)
2. Create Dummy Variables
Suppose you want to convert categorical variables into dummy variables. This differs from the previous example, as it creates dummy variables instead of converting the variable into a single numeric column.
productcode_dummy = pd.get_dummies(df["productcode"])
df2 = pd.concat([df, productcode_dummy], axis=1)
The output looks like below –
   AA  BB
0   1   0
1   1   0
2   1   0
3   0   1
4   0   1
5   0   1
Create k-1 Categories
productcode_dummy = pd.get_dummies(df["productcode"], prefix='pcode', drop_first=True)
df2 = pd.concat([df, productcode_dummy], axis=1)
3. Impute Missing Values
Fill missing values of a particular variable
# fill missing values with 0
df['var1'] = df['var1'].fillna(0)
# fill missing values with mean
df['var1'] = df['var1'].fillna(df['var1'].mean())
Apply imputation to the whole dataset
from sklearn.preprocessing import Imputer

# Set an imputer object
mean_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

# Train the imputer
mean_imputer = mean_imputer.fit(df)

# Apply imputation
df_new = mean_imputer.transform(df.values)
4. Outlier Treatment
There are many ways to handle or treat outliers (or extreme values). Some of the methods are as follows –
- Cap extreme values at the 95th / 99th percentile depending on the distribution (see the sketch after the log-transformation example below)
- Apply log transformation of variables. See below the implementation of log transformation in Python.
import numpy as np
df['var1'] = np.log(df['var1'])
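The capping approach mentioned in the first bullet could look like the sketch below (here 'var1' is just a placeholder column name) –
# Cap var1 at its 99th percentile
upper = df['var1'].quantile(0.99)
df['var1'] = df['var1'].clip(upper=upper)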
5. Standardization
In some algorithms, it is required to standardize variables before running the actual algorithm. Standardization refers to the process of transforming a variable so that it has zero mean and unit variance (a standard deviation of 1).
#load dataset
from sklearn.datasets import load_boston
dataset = load_boston()
predictors = dataset.data
target = dataset.target
df = pd.DataFrame(predictors, columns = dataset.feature_names)

#Apply Standardization
from sklearn.preprocessing import StandardScaler
k = StandardScaler()
df2 = k.fit_transform(df)