Pandas Basics: Machine Learning in Python

Divyansh Chaudhary
4 min read · Dec 31, 2020

Continuing the series “Machine Learning in Python”, we come to the next most commonly used Python library: Pandas. In the next few minutes, we will cover the basics of the Pandas library and how to get set up to explore the vast world of data.

Pandas Logo Creator: Marc Garcia

“In the future, I think that programming languages are going to diminish in importance relative to data itself and common computational libraries.”
— Wes McKinney (Creator — Pandas)

Pandas: A tool for Data Analysis in Python

The name Pandas is derived from the term “panel data” and is also a play on the phrase “Python data analysis”. Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.
Pandas does have its drawbacks: performance drops and runtimes grow when datasets exceed roughly 1 GB. Even so, it is still widely used in the data science community for processing small to large datasets.

Image Source: Jeffrey Czum from Pexels

Pandas works mainly with two data structures, the DataFrame and the Series. Raw data is converted into one of these before you work with it.

Image Source: pandas.pydata.org

A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R.
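To make the mixed-type idea concrete, here is a minimal sketch. The column names and values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data: each column holds a different kind of value
people = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],         # text
    "age": [36, 45, 28],                       # integers
    "team": pd.Categorical(["a", "b", "a"]),   # categorical data
})

# dtypes reports the type Pandas assigned to each column
print(people.dtypes)
```

Each column keeps its own type, which is what sets a DataFrame apart from a plain 2-dimensional array.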

Each column in a DataFrame is a Series. A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.

Image Source: pandas.pydata.org
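The relationship between a DataFrame and its columns can be checked directly. The sketch below uses made-up labels ("x", "y", "z") for the rows.

```python
import pandas as pd

df = pd.DataFrame(
    {"col_1": [3, 2, 1], "col_2": ["a", "b", "c"]},
    index=["x", "y", "z"],  # hypothetical row labels
)

col = df["col_1"]  # selecting one column returns a Series

print(type(col))        # a pandas Series
print(col.name)         # the Series keeps the column label as its name
print(list(col.index))  # the row labels are shared with the DataFrame
```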

With the concepts out of the way, let's get straight to the hands-on work.

Working with Pandas: How to get set up!

  • Like any other library, let's first install Pandas with:
pip install pandas
    For additional help: Install Pandas
  • Importing Pandas to your .py project:
import pandas
or
import pandas as pd
or
import pandas as <alias>
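A quick way to confirm that the installation and import worked is to print the installed version. This is only a sanity check, using the conventional `pd` alias.

```python
import pandas as pd

# If this prints a version string, Pandas is installed and importable
print(pd.__version__)
```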

Pandas gives you three different ways of getting data in, plus a way to write it back out:

  • Convert a Python list, dictionary or NumPy array to a Pandas DataFrame
import numpy as np
import pandas as pd
# List to DataFrame
lst = ["A", "B", "C"]
df = pd.DataFrame(lst)
# Dictionary to DataFrame (keys become column names)
dct = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame.from_dict(dct)
# NumPy array to DataFrame (one array column per DataFrame column)
data = np.array([[5.8, 2.8], [6.0, 2.2]])
df = pd.DataFrame({'Column1': data[:, 0], 'Column2': data[:, 1]})
  • Open a local file using Pandas, usually a CSV file, but could also be a delimited text file, Excel, etc.
pd.read_csv("../data_folder/data.csv")
or
pd.read_<filetype>()
  • Open a remote file or database, such as a CSV or JSON file on a website through a URL, or read from a SQL table/database
# Reading from a RAW URL
url = "https://raw.githubusercontent.com/.../data.csv"
c = pd.read_csv(url)
# Reading from a SQL database (assumes a table named my_table already exists)
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
# read_sql accepts either a query or a table name
pd.read_sql("SELECT * FROM my_table;", engine)
# read_sql_table reads a whole table by name
pd.read_sql_table('my_table', engine)
# read_sql_query runs a SQL query
pd.read_sql_query("SELECT * FROM my_table;", engine)
  • You can even save a DataFrame you’re working on to a different file format
df.to_<filetype>(<filename>)
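The reading and writing functions above can be tried end to end without any files on disk. The sketch below is only an illustration: it writes to an in-memory text buffer instead of a real CSV file, and it swaps the SQLAlchemy engine for Python's built-in `sqlite3` module so that nothing extra needs to be installed. The table name `my_table` is reused from the snippet above.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({"col_1": [3, 2, 1, 0], "col_2": ["a", "b", "c", "d"]})

# --- CSV round trip, using an in-memory buffer instead of a file ---
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False skips writing the row labels
buffer.seek(0)                  # rewind so read_csv starts from the top
from_csv = pd.read_csv(buffer)

# --- SQL round trip, using a throwaway in-memory SQLite database ---
conn = sqlite3.connect(":memory:")
df.to_sql("my_table", conn, index=False)  # create and fill the table
from_sql = pd.read_sql("SELECT * FROM my_table;", conn)
conn.close()

print(from_csv)
print(from_sql)
```

Both results should contain the same rows as the original `df`, which shows that `to_<filetype>` and `read_<filetype>` are designed to work as pairs.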

Other basic commands of Pandas

  • Creating a Series
srs = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
# If no index is passed, Pandas defaults to an integer index counting from 0
  • Set the Series name
srs.name = "Insert name"
  • Set index name
srs.index.name = "Index name"
  • Create a DataFrame
df = pd.DataFrame(
{"a" : [1 ,2, 3],
"b" : [7, 8, 9],
"c" : [10, 11, 12]}, index = [1, 2, 3])
  • Specify values row by row, with the column names passed separately
df = pd.DataFrame( 
[[1, 2, 3],
[4, 6, 8],
[10, 11, 12]],
index=[1, 2, 3],
columns=['a', 'b', 'c'])
  • Understanding your data
# To get the first 5 rows of your DataFrame
df.head()
# To get summary statistics for the numeric columns
df.describe()
  • Select a single value
# By position (row 0, column 0)
df.iloc[0, 0]
df.iat[0, 0]
# By label (row labelled 0, column labelled 'Label')
df.loc[0, 'Label']
df.at[0, 'Label']
  • Retrieve the number of rows and columns
# Returns a tuple: (rows, columns)
df.shape
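The difference between selecting by position and by label is easiest to see on a DataFrame whose row labels do not start at 0. The sketch below reuses the 3x3 DataFrame built earlier in this list; the variable names `by_pos`, `by_label` and `sub` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 6, 8], [10, 11, 12]],
    index=[1, 2, 3],
    columns=["a", "b", "c"],
)

# By position: first row, first column (positions count from zero)
by_pos = df.iat[0, 0]
# By label: the row labelled 1 and the column labelled "a"
by_label = df.at[1, "a"]

# Passing lists to iloc (or loc) returns a small DataFrame, not a single value
sub = df.iloc[[0], [0]]

print(by_pos, by_label)
print(type(sub))
print(df.shape)  # (number of rows, number of columns)
```

Here position 0 and label 1 point at the same row, so both lookups return the same cell. `iat` and `at` are meant for single values, while `iloc` and `loc` also accept lists and slices.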

You can read about other Pandas functions here.

This blog provides a small overview of the advantages and functionality of the Pandas library in Python. It is by no means a complete guide to Pandas, but a way to kickstart your Machine Learning journey with Pandas.

Thanks for reading.
Don’t forget to click on 👏!

Divyansh Chaudhary

Machine Learning and Python Student. Coding Enthusiast. Pursuing Bachelors in Computer Science.