-
GeoPandas adds support for geographic data to pandas objects
-
it implements GeoSeries and GeoDataFrame
- which are subclasses of pandas.Series and pandas.DataFrame
-
GeoPandas objects can act on shapely geometry objects and perform geometric operations
-
GeoPandas geometry operations are cartesian
- the coordinate reference system can be stored as an attribute of an object
-
GeoPandas geometry operations are cartesian
-
it implements GeoSeries and GeoDataFrame
-
GeoPandas is written in pure Python, but has several dependencies written in C
- GEOS, GDAL, PROJ
also depends on
- pandas
- shapely
- pyogrio
- pyproj
- packaging
optional
- matplotlib
First Example
import geopandas as gpd
from shapely.geometry import Polygon
p1 = Polygon([(0, 0), (1, 0), (1, 1)])
p2 = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
p3 = Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])
g = gpd.GeoSeries([p1, p2, p3])
#0 POLYGON ((0 0, 1 0, 1 1, 0 0))
#1 POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))
#2 POLYGON ((2 0, 3 0, 3 1, 2 1, 2 0))
g.area # returns a pandas.Series containing area for each item in GeoSeries
#0 0.5
#1 1.0
#2 1.0
#dtype: float64
g.buffer(0.5) # returns GeoPandas objects
#0 POLYGON ((-0.3535533905932737 0.35355339059327...
#1 POLYGON ((-0.5 0, -0.5 1, -0.4975923633360985 ...
#2 POLYGON ((1.5 0, 1.5 1, 1.502407636663901 1.04...
#dtype: geometry
g.plot() # use matplotlib to generate plot for GeoSeries
#########
# GeoPandas implements constructors to read any data format
# recognized by pyogrio
# SHP, GPKG, GEOJSON, FLATGEOBUF, ...
#########
import geodatasets
# externally hosted datasets for educational purposes
nybb_path = geodatasets.get_path('nybb')
boros = gpd.read_file(nybb_path)
boros.set_index('Borocode', inplace=True)
boros.sort_index(inplace=True)
# boroughs boundaries of NYC
boros['geometry'].convex_hull
Basics
# pip install geopandas
# install other necessary libraries
import pandas as pd
import geopandas gpd
import matplotlib.pyplot as plt
Creating GeoDataFrames
-
GeoDataFrame is a tabular data structure that contains a geometry column
data = { "City": ["Tokyo", "New York", "London", "Paris"], "Latitude": [35.6895, 40.7128, 51.5074, 48.8566], "Longitude": [139.6917, -74.0060, -0.1278, 2.3522], } df = pd.DataFrame(data) gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
Reading and Writing a GeoJSON
url = "https://github.com/opengeos/datasets/releases/download/vector/nybb.geojson"
gdf = gpd.read_file(url)
output_file = "nyc_boroughs.geojson"
gdf.to_file(output_file, driver="GeoJSON")
# or write to other formats
output_file = "nyc_boroughs.shp"
gdf.to_file(output_file)
output_file = "nyc_boroughs.gpkg"
gdf.to_file(output_file, driver="GPKG")
Simple Accessors
# Set BoroName as the index for easier reference
gdf = gdf.set_index("BoroName")
# Calculate the area
gdf["area"] = gdf.area
# Get the boundary of each polygon
gdf["boundary"] = gdf.boundary
# Get the centroid of each polygon
gdf["centroid"] = gdf.centroid
# Use Manhattan's centroid as the reference point
manhattan_centroid = gdf.loc["Manhattan", "centroid"]
# Calculate the distance from each centroid to Manhattan's centroid
gdf["distance_to_manhattan"] = gdf["centroid"].distance(manhattan_centroid)
mean_distance = gdf["distance_to_manhattan"].mean()
gdf.plot("area", legend=True, figsize=(10, 6))
plt.title("NYC Boroughs by Area")
plt.show()
# Plot the boundaries and centroids
ax = gdf["geometry"].plot(figsize=(10, 6), edgecolor="black")
gdf["centroid"].plot(ax=ax, color="red", markersize=50)
plt.title("NYC Borough Boundaries and Centroids")
plt.show()
gdf.explore("area", legend=False)
# explore your data interactively w/ Leaflet
Geometry Manipulation
# Buffer the boroughs by 10000 feet
gdf["buffered"] = gdf.buffer(10000)
# Plot the buffered geometries
gdf["buffered"].plot(alpha=0.5, edgecolor="black")
plt.title("Buffered NYC Boroughs (10,000 feet)")
plt.show()
# Calculate convex hull
gdf["convex_hull"] = gdf.convex_hull
# Plot the convex hulls
gdf["convex_hull"].plot(alpha=0.5, color="lightblue", edgecolor="black")
plt.title("Convex Hull of NYC Boroughs")
plt.show()
Spatial Queries and Relations
# Get the geometry of Manhattan
manhattan_geom = gdf.loc["Manhattan", "geometry"]
# Check which buffered boroughs intersect with Manhattan's geometry
gdf["intersects_manhattan"] = gdf["buffered"].intersects(manhattan_geom)
# Check if centroids are within the original borough geometries
gdf["centroid_within_borough"] = gdf["centroid"].within(gdf["geometry"])
Projections and Coordinate Reference Systems
print(gdf.crs)
# Reproject the GeoDataFrame to WGS84 (EPSG:4326)
gdf_4326 = gdf.to_crs(epsg=4326)
# Plot the reprojected geometries
gdf_4326.plot(figsize=(10, 6), edgecolor="black")
plt.title("NYC Boroughs in WGS84 (EPSG:4326)")
plt.show()
Advanced
Spatial Indexing
-
when you want to know a spatial predicate/relationship between a sets of geometries
-
An efficient approach utilizing a spatial index
-
for pre-filtering comparisons of geometries
- before using more computationally expensive spatial predicates
-
for pre-filtering comparisons of geometries
-
An efficient approach utilizing a spatial index
-
GeoPandas exposes the Sort-Tile-Recursive R-Tree from shapely
-
on any GeoDataFrame
- or GeoSeries (using the GeoSeries.sindex property)
-
on any GeoDataFrame
-
GeoPandas automatically uses spatial indexes in sjoin() overlay(), or clip()
- a more direct approach
import geopandas
import matplotlib.pyplot as plt
import shapely
from geodatasets import get_path
nyc = geopandas.read_file(get_path("geoda nyc"))
##############
# R-Tree Principle
##############
"""
Any R-Tree index
builds a hierarchical collection
of bounding boxes/envelopes representing
first individual geometries
and then
their most efficient combinations
"""
point_inside = shapely.Point(950000, 155000)
point_outside = shapely.Point(1050000, 150000)
points = geopandas.GeoSeries([point_inside, point_outside], crs=nyc.crs)
fig, axs = plt.subplots(1, 2, sharey=True, figsize=(8, 4))
nyc.plot(ax=axs[0], edgecolor="black", linewidth=1)
nyc.envelope.boundary.plot(ax=axs[1], color='black')
points.plot(ax=axs[0], color="limegreen")
points.plot(ax=axs[1], color="limegreen");
Scalar Query
-
use the sindex property to query the index
- the query() method returns positions of all geometry whose bounding boxes intersect the bounding box of the input geometry
nyc.sindex.validquerypredicates {None, 'contains', 'containsproperly', 'coveredby', 'covers', 'crosses', 'dwithin', 'intersects', 'overlaps', 'touches', 'within'}
bbox_query_inside = nyc.sindex.query(point_inside)
bbox_query_outside = nyc.sindex.query(point_outside)
bbox_query_inside, bbox_query_outside
## plots
fig, axs = plt.subplots(1, 2, sharey=True, figsize=(8, 4))
nyc.plot(ax=axs[0], edgecolor="black", linewidth=1)
nyc.envelope.boundary.plot(ax=axs[1], color='black')
points.plot(ax=axs[0], color="limegreen", zorder=3, edgecolor="black", linewidth=.5)
points.plot(ax=axs[1], color="limegreen", zorder=3, edgecolor="black", linewidth=.5)
nyc.iloc[bbox_query_inside].plot(ax=axs[0], color='orange')
nyc.iloc[bbox_query_outside].plot(ax=axs[0], color='orange')
nyc.envelope.iloc[bbox_query_inside].plot(ax=axs[1], color='orange')
nyc.envelope.iloc[bbox_query_outside].plot(ax=axs[1], color='orange');
##############
# The spatial index allows filtering based on the actual geometry
# ... the is first queried and afterwards each hit is checked
# using a spatial predicate
#############
pred_inside = nyc.sindex.query(point_inside, predicate="intersects")
pred_outside = nyc.sindex.query(point_outside, predicate="intersects")
# "intersects" here is a spatial query predicate
## plots
fig, axs = plt.subplots(1, 2, sharey=True, figsize=(8, 4))
nyc.plot(ax=axs[0], edgecolor="black", linewidth=1)
nyc.envelope.boundary.plot(ax=axs[1], color='black')
points.plot(ax=axs[0], color="limegreen", zorder=3, edgecolor="black", linewidth=.5)
points.plot(ax=axs[1], color="limegreen", zorder=3, edgecolor="black", linewidth=.5)
nyc.iloc[pred_inside].plot(ax=axs[0], color='orange')
nyc.envelope.iloc[pred_inside].plot(ax=axs[1], color='orange');
Array Query
-
Checking a single geometry against the tree is not very efficient for many-to-many relationships
bbox_array_query = nyc.sindex.query(points) #the method returns a 2-D array of indices # where the query found a hit # where the subarrays correspond to the indices # of the input geometries and indices # of the tree geometries associated with each. bbox_array_query # array([[ 0, 1], # [ 1, 16]]) # the 0-th geometry in the points GeoSeries # intersects the bounding box of the geometry # at the position 1 from the nyc GeoDataFrame # the geometry 1 in the points matches geometry 16 in the nyc bbox_array_query_dense = nyc.sindex.query(points, output_format="dense") bbox_array_query_dense # return a boolean array with shape (len(tree), n) # boolean values marking whether the bounding box of a geometry # in the tree intersects a bounding box of a given geometry ##### # can either be dense numpy array or a sparse scipy array ##### it is recommended to use the sparse format #array([[False, False], # [ True, False], # ... # [False, False], # [False, False], # [False, False]]) ###### # scipy.sparse.coo_array ###### bbox_array_query_sparse = nyc.sindex.query(points, output_format="sparse") bbox_array_query_sparse #<COOrdinate sparse array of dtype 'bool' # with 2 stored elements and shape (55, 2)> # to find the number of neighboring geometries for each subborough # you can use the spatial index to compare all geometries against each other neighbors = nyc.sindex.query(nyc.geometry, predicate="intersects", output_format="dense") # a geometry always intersects itself, you need to subtract one n_neighbors = neighbors.sum(axis=1) - 1 ## plots nyc.plot(n_neighbors, legend=True);
Nearest geometry query
-
GeoPandas allows you to use the spatial index to find the nearest geometry
nearest_indices = nyc.sindex.nearest(points) # returns the indices representation nearest_indices, distance = nyc.sindex.nearest(points, return_distance=True) # returns also tuple of arrays of distances