-
pandas is a data wrangling library for Python that provides data ingestion, transformation, and export functions
-
Provides high-performance data structures and data analysis tools in Python, without having to switch to a language such as R
- Library is built upon NumPy
-
Data structures called Series and DataFrames
-
Series
-
is a one-dimensional labeled array capable of holding any data type
-
built on top of NumPy's array
-
each data point has an axis label unlike in NumPy
import pandas as pd

# Axis labels are collectively referred to as the index.
# To create a Series, call:
s = pd.Series(data, index=index)
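As a concrete illustration of the constructor call above, here is a minimal sketch that builds a Series from a list with explicit labels. The values and labels are made up for illustration and do not come from these notes.

```python
import pandas as pd

# Hypothetical sample data: three readings labeled "a", "b", "c".
data = [10.0, 20.0, 30.0]
index = ["a", "b", "c"]

s = pd.Series(data, index=index)

# Values can be looked up by label (through the index) or by position.
by_label = s["b"]
by_position = s.iloc[1]

# The underlying values are available as a NumPy array.
values = s.to_numpy()
```

Label-based lookup is what the index adds on top of a plain NumPy array, which only supports positional access.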
-
DataFrames
-
is a two-dimensional labeled data structure with columns of potentially different types
-
you can optionally pass index (row labels) and columns (labels) arguments
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
'''
Out[#]:
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
'''
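To illustrate "columns of potentially different types", the following sketch builds a small DataFrame whose columns hold text, floats, and booleans. The column names, row labels, and values are hypothetical.

```python
import pandas as pd

# Hypothetical records; each column holds a different kind of value.
df_mixed = pd.DataFrame(
    {
        "city": ["Oslo", "Lima"],
        "population_m": [0.7, 10.0],
        "is_capital": [True, True],
    },
    index=["row1", "row2"],  # explicit row labels
)

# Each column keeps its own dtype.
column_dtypes = df_mixed.dtypes

# A single cell can be addressed by row label and column label.
second_city = df_mixed.loc["row2", "city"]
```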
Reading
Basics
Getting Started
Basic DataFrame Operations
import numpy as np
import pandas as pd

# selecting a specific column
latitudes = df["Latitude"]
# filtering rows based on a condition
df_filtered = df[df["Latitude"] > 36]
# adding a new column with a calculation
df["Lat_Radians"] = np.radians(df["Latitude"])
# group by 'Country' and calculate the total population for each country
df_grouped = df.groupby("Country")["Population"].sum()
# merge two DataFrames on the 'City' column
df_merged = pd.merge(df1, df2, on="City")
# handling missing data
df_nan = pd.DataFrame(data_with_nan)
# fill missing values in the 'Population' column with that column's mean
df_filled = df_nan.fillna({"Population": df_nan["Population"].mean()})
# can also read and write data in various formats
df = pd.read_csv(url)
# we can calculate the total population of all cities in the dataset
np.sum(df["population"])
# pandas provides built-in plotting capabilities
df.plot()
# plot only the column of the data table that holds the data from Paris
df["station_paris"].plot()
# visually compare the values measured in London and Paris
df.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
# visualize each of the columns in a separate subplot
df.plot.area(figsize=(12,4), subplots=True)
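The snippets above rely on DataFrames that are not defined in these notes. The following self-contained sketch runs the same kinds of operations (filtering, derived columns, filling missing values, grouping, merging) on a small made-up table. All data, the `Area` column, and the `df_area` table are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame(
    {
        "City": ["A", "B", "C", "D"],
        "Country": ["X", "X", "Y", "Y"],
        "Latitude": [35.0, 40.0, 50.0, 10.0],
        "Population": [100.0, 200.0, np.nan, 400.0],
    }
)

# Filter rows on a condition.
df_filtered = df[df["Latitude"] > 36]

# Add a derived column.
df["Lat_Radians"] = np.radians(df["Latitude"])

# Fill missing values in one column with that column's mean.
pop_mean = df["Population"].mean()
df_filled = df.fillna({"Population": pop_mean})

# Total population per country, computed after filling.
df_grouped = df_filled.groupby("Country")["Population"].sum()

# Merge with a second (hypothetical) table on the shared "City" column.
df_area = pd.DataFrame({"City": ["A", "B"], "Area": [1.5, 2.5]})
df_merged = pd.merge(df_filled, df_area, on="City")
```

Passing a dict to `fillna` limits the fill to the named column, so a population mean is not written into unrelated columns.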
Advanced
PyArrow Functionality
-
Pandas can utilize PyArrow to extend functionality and improve the performance of various APIs
- more data types compared to NumPy
- missing data support for all data types
- performant IO reader integration
- interoperability with other DataFrame libraries based on the Apache Arrow specification
Data Structures Integration
-
a Series, Index, or the columns of a DataFrame can be directly backed by a pyarrow.ChunkedArray, which is similar to a NumPy array
- to construct these from the main pandas data structures, pass a string of the type followed by [pyarrow], e.g. "int64[pyarrow]", into the dtype parameter
import pandas as pd
ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")
idx = pd.Index([True, None], dtype="bool[pyarrow]")
df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")
-
For PyArrow types that accept parameters, you can pass in a PyArrow type with those parameters into ArrowDtype to use in the dtype parameter.
import pyarrow as pa
list_str_type = pa.list_(pa.string())
ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))
from datetime import time
idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us")))
from decimal import Decimal
decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))
data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]
df = pd.DataFrame(data, dtype=decimal_type)
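The string alias and the ArrowDtype wrapper are two spellings of the same kind of dtype. The sketch below checks this for a simple integer type. It assumes a pandas version that provides `pd.ArrowDtype` and guards the import because PyArrow is an optional dependency. The variable names are made up for illustration.

```python
import pandas as pd

try:
    import pyarrow as pa
except ImportError:  # PyArrow is an optional dependency of pandas
    pa = None

same_dtype = None
if pa is not None:
    # String alias form.
    ser_alias = pd.Series([1, 2, None], dtype="int64[pyarrow]")
    # Explicit ArrowDtype form wrapping the same PyArrow type.
    ser_explicit = pd.Series([1, 2, None], dtype=pd.ArrowDtype(pa.int64()))
    # Both Series end up with an equal dtype.
    same_dtype = ser_alias.dtype == ser_explicit.dtype
```

The alias is convenient for parameter-free types, while ArrowDtype is needed once the PyArrow type takes arguments, as in the list, time, and decimal examples above.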
-
to retrieve a pyarrow.ChunkedArray from a Series or Index
- you can call the pyarrow array constructor on the Series or Index
ser = pd.Series([1, 2, None], dtype="uint8[pyarrow]")
pa.array(ser) # pyarrow.lib.UInt8Array
idx = pd.Index(ser)
pa.array(idx) # pyarrow.lib.UInt8Array
-
to convert a pyarrow.Table to a DataFrame
- you can call the pyarrow.Table.to_pandas() method with types_mapper=pd.ArrowDtype
table = pa.table([pa.array([1, 2, 3], type=pa.int64())], names=["a"])
df = table.to_pandas(types_mapper=pd.ArrowDtype)
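To see what `types_mapper=pd.ArrowDtype` changes, this sketch converts the same table twice, once with the mapper and once without. The table contents are made up, and the import is guarded because PyArrow is an optional dependency.

```python
import pandas as pd

try:
    import pyarrow as pa
except ImportError:  # PyArrow is an optional dependency of pandas
    pa = None

if pa is not None:
    # Hypothetical two-column table with one missing string value.
    table = pa.table({"a": [1, 2, 3], "b": ["x", "y", None]})

    # With the mapper, columns keep Arrow-backed dtypes.
    df_arrow = table.to_pandas(types_mapper=pd.ArrowDtype)

    # Without it, columns are converted to the default pandas dtypes.
    df_default = table.to_pandas()
```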
Operations
-
PyArrow data structure integration is implemented through pandas' ExtensionArray interface, an abstract base class for custom 1-D array types
- the following are examples of operations accelerated by native PyArrow compute functions
import pyarrow as pa
ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]")
ser.mean()
ser + ser
ser > (ser + 1)
ser.dropna()
ser.isna()
ser.fillna(0)
ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))
ser_str.str.startswith("a")
from datetime import datetime
# ArrowDtype is useful if the data type contains parameters,
# like pyarrow.timestamp.
pa_type = pd.ArrowDtype(pa.timestamp("ns"))
ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)
ser_dt.dt.strftime("%Y-%m")
-
also included
- numeric aggregations, numeric arithmetic, numeric rounding, logical and comparison functions, string functionality, datetime functionality
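The "missing data support for all data types" point can be seen with integers. The sketch below compares a default NumPy-backed Series with a PyArrow-backed one built from the same made-up values. The import is guarded because PyArrow is an optional dependency.

```python
import pandas as pd

try:
    import pyarrow as pa
except ImportError:  # PyArrow is an optional dependency of pandas
    pa = None

# NumPy-backed: the missing value forces the integers to become floats.
numpy_backed = pd.Series([1, 2, None])

if pa is not None:
    # PyArrow-backed: the column stays an integer column with one missing value.
    ser = pd.Series([1, 2, None], dtype="int64[pyarrow]")

    total = ser.sum()          # aggregation skips the missing value
    missing_mask = ser.isna()  # marks the missing position
    filled = ser.fillna(0)     # replaces the missing value with 0
    doubled = ser + ser        # arithmetic keeps the Arrow-backed dtype
```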
IO Reading
-
PyArrow also provides IO reading functionality that has been integrated into several pandas IO readers
-
the following functions provide an engine keyword that dispatches to PyArrow to accelerate reading from an IO source
- read_csv()
- read_json()
- read_orc()
- read_feather()
-
by default, these and all other IO readers return NumPy-backed data
-
these readers can return PyArrow-backed data by specifying the parameter dtype_backend="pyarrow"
- setting the engine keyword is not required for this
import io
data = io.StringIO("""a,b,c
1,2.5,True
3,4.5,False
""")
df = pd.read_csv(data, engine="pyarrow")
data.seek(0)  # rewind the buffer so it can be read a second time
df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")
# Several non-IO reader functions can also use the dtype_backend
# argument to return PyArrow-backed data.
s = pd.Series([1.0, 2.1, 3.0], dtype="Float64")
pd.to_numeric(s, downcast="float", dtype_backend="pyarrow")
dfn = df.convert_dtypes(dtype_backend="pyarrow")
s.convert_dtypes(dtype_backend="pyarrow")
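The difference between the default backend and dtype_backend="pyarrow" shows up in the resulting column dtypes. This sketch reads the same CSV text both ways, using a fresh buffer for each read. It assumes a pandas version that supports the dtype_backend parameter (2.0 or later) and guards the PyArrow import because it is an optional dependency.

```python
import io

import pandas as pd

try:
    import pyarrow as pa
except ImportError:  # PyArrow is an optional dependency of pandas
    pa = None

csv_text = "a,b,c\n1,2.5,True\n3,4.5,False\n"

# Default: NumPy-backed columns.
df_numpy = pd.read_csv(io.StringIO(csv_text))

if pa is not None:
    # Same data, but the resulting columns are PyArrow-backed.
    df_pyarrow = pd.read_csv(io.StringIO(csv_text), dtype_backend="pyarrow")
    pyarrow_dtypes = df_pyarrow.dtypes
```

Creating a new `io.StringIO` per call avoids reading from a buffer that an earlier call has already consumed.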