Top level notes
- Pandas Documentation Homepage
- Installing (2.0) —
pip install pandas==2.0
- installing pandas will also install numpy
- Pandas helps in working with tabular data (databases, spreadsheets)
- Helps = explore, clean and process
- Anything you can do with SQL you can also do with Pandas.
- Group by operations
- Summary Statistics. eg., mean, median, std..
- The data table is called a DataFrame
- pandas can read (read_* methods) from various file formats like csv, xls, parquet, hdf5, json, sql (db) and generate output (to_* methods) in the same formats
- matplotlib is used to generate plots from data tables
- group by follows the split-apply-combine approach. Hmm.. map-reduce?
- melt() to convert from wide → long/tidy form and pivot() to convert from long to wide format
- combine multiple tables using the concat() function for column-wise or row-wise joining
- for database-like joining/merging, use the merge function. pd.merge(tbone, tbtwo, how="left", left_on="col1", right_on="col2"). also merge_asof
- pandas supports inner, outer, right and left joins.
- use the .shape attribute (no parentheses, it is not a method) to get the (rows, columns) tuple
- TODO: how vectorization is done in pandas
- TODO: where numpy is faster than pandas
- TODO: Pyspark with pandas
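The read_*/to_* note above can be sketched with an in-memory CSV round trip (uses io.StringIO so no real file is needed; the table contents are made up):

```python
import io
import pandas as pd

# hypothetical example table
df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

# to_* method: write the table out as CSV into a buffer
buf = io.StringIO()
df.to_csv(buf, index=False)

# read_* method: read the same CSV back into a DataFrame
buf.seek(0)
df2 = pd.read_csv(buf)
print(df2)
```

The same pattern works with real paths (`df.to_csv("data.csv")`, `pd.read_csv("data.csv")`) and with the other formats (parquet, json, sql, ...).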
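The group-by and summary-statistics notes above, as a small sketch (the city/price table is made up):

```python
import pandas as pd

# hypothetical example table
df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF"],
    "price": [10.0, 20.0, 30.0, 50.0],
})

# split-apply-combine: split rows by city, apply each
# aggregation per group, combine into one result table
stats = df.groupby("city")["price"].agg(["mean", "median", "std"])
print(stats)
```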
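A sketch of melt() (wide → long/tidy) and pivot() (long → wide) from the reshaping notes above; the country/year table is made up:

```python
import pandas as pd

# hypothetical wide-format table: one column per year
wide = pd.DataFrame({
    "country": ["US", "IN"],
    "2020": [100, 200],
    "2021": [110, 220],
})

# melt(): wide -> long/tidy form (one row per country-year pair)
long = wide.melt(id_vars="country", var_name="year", value_name="pop")

# pivot(): long -> wide format again
wide_again = long.pivot(index="country", columns="year", values="pop")
print(long)
print(wide_again)
```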
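The concat()/merge() notes above as a runnable sketch, reusing the tbone/tbtwo names from the snippet (their columns and values are made up):

```python
import pandas as pd

# hypothetical tables to combine
tbone = pd.DataFrame({"col1": [1, 2, 3], "name": ["a", "b", "c"]})
tbtwo = pd.DataFrame({"col2": [2, 3, 4], "score": [20, 30, 40]})

# row-wise stacking with concat (use axis=1 for column-wise)
stacked = pd.concat([tbone, tbone], axis=0, ignore_index=True)

# database-style left join: keep every row of tbone,
# match col1 against col2, fill NaN where no match exists
joined = pd.merge(tbone, tbtwo, how="left", left_on="col1", right_on="col2")
print(joined)
```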
Quick intro
import numpy as np
import pandas as pd
# create new Series
s = pd.Series([1, 2, 3, np.nan, 8, 121])
# np.nan = not a number (missing value)
# pd.date_range("YYYYMMDD", periods=n) produces a
# date range starting at YYYYMMDD with `n` entries
# dataframes are created with
df = pd.DataFrame(nparray, index=<indexvec>, columns=list(names))
# the nparray is a 2-d numpy array of mxn dimensions
# the df can also be created by passing a dict with keys as the
# column names and the value as the series.
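Putting the quick-intro pieces together as one runnable sketch (the index dates, column names, and array values are made up for illustration):

```python
import numpy as np
import pandas as pd

# a Series with a missing value
s = pd.Series([1, 2, 3, np.nan, 8, 121])

# 4 daily timestamps starting 2023-01-01
dates = pd.date_range("20230101", periods=4)

# DataFrame from a 2-d numpy array plus an index and column names
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=dates, columns=list("ABC"))

# equivalent: dict with column names as keys and the values as columns
df2 = pd.DataFrame(
    {"A": [0, 3, 6, 9], "B": [1, 4, 7, 10], "C": [2, 5, 8, 11]},
    index=dates,
)
print(df)
```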
Articles
- Practical SQL for Data Analysis | Haki Benita — compares SQL with pandas