- Python library designed for working with multi-dimensional labeled datasets
  - e.g. climate science, oceanography, remote sensing
  - its data structures can be thought of as extensions of NumPy arrays
  - supports labeled axes (dimensions), coordinates, and metadata
  - suited to datasets varying across time and space, n-dimensions, etc.
Why xarray?
- xarray adds labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience
What labels enable
- N-dimensional arrays, sometimes called tensors, are an essential part of computational science
  - physics, astronomy, geoscience, bioinformatics, engineering, finance, deep learning, etc.
- NumPy provides ndarrays
- Xarray uses labels for concise interfacing
  - some examples
    - apply operations over dimensions by name: x.sum('time')
    - select values by label instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01')
    - mathematical operations vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape
    - use the split-apply-combine paradigm with groupby: x.groupby('time.dayofyear').mean()
    - database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer')
    - keep track of arbitrary metadata in the form of a Python dictionary: x.attrs
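The label-based operations listed above can be tried on a small toy array. This is only a sketch: the array `x`, its `time`/`station` dimensions, and all values are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 4 daily time steps at 3 stations.
times = pd.date_range("2014-01-01", periods=4, freq="D")
x = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "station"),
    coords={"time": times, "station": ["a", "b", "c"]},
    attrs={"units": "degC"},
)

# Reduce over a dimension by name instead of by axis number.
total = x.sum("time")

# Select by label rather than by integer position (two equivalent spellings).
first_day = x.sel(time="2014-01-01")
same_day = x.loc["2014-01-01"]

# Broadcasting follows dimension names: (time) * (station) -> (time, station).
t = xr.DataArray([1.0, 2.0], dims="time")
s = xr.DataArray([10.0, 20.0, 30.0], dims="station")
outer = t * s

# Split-apply-combine: group by day of year, then average each group.
by_day = x.groupby("time.dayofyear").mean()

# Outer alignment: both results share the union of the time labels,
# with NaN filled in where one input had no data.
left, right = xr.align(
    x.isel(time=slice(0, 3)), x.isel(time=slice(1, 4)), join="outer"
)

# Metadata is a plain dictionary.
units = x.attrs["units"]
```

Because every operation refers to `time` or `station` by name, none of the calls depends on the order of the axes in the underlying NumPy array.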
Core data structures
- xarray.DataArray: a labeled ndarray with coordinates and dims
  - generalization of a pandas.Series
    - convert to a pandas Series: x.to_series()
    - wrapper around numpy ndarrays
  - objects can have any number of dimensions
    - and their contents have fixed data types
  - parameters:
    - data (array-like): must be numpy.ndarray-compatible
      - a view of the array's data is used when possible instead of a copy
    - coords (sequence or dict of array-like, or Coordinates), optional
      - tick labels to use for indexing along each dim
    - dims (Hashable or sequence of Hashable), optional
      - names of the data dimension(s)
      - if this argument is omitted
        - dimension names are taken from coords (if possible)
        - otherwise they default to ['dim_0', … 'dim_n']
    - name (str or None), optional: name of the array
    - attrs (dict-like or None), optional: attributes to assign to the new instance
    - indexes (Indexes or dict-like), optional
      - for internal use only
      - to pass indexes objects to the new DataArray, use coords instead
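As a minimal sketch of the constructor parameters described above, the following builds one DataArray with everything spelled out and one with dims and coords omitted. The names `x`, `y`, `example` and the values are placeholders chosen for illustration.

```python
import numpy as np
import xarray as xr

data = np.random.default_rng(0).normal(size=(2, 3))

# Explicit dims, coords, name and attrs.
da = xr.DataArray(
    data,
    dims=("x", "y"),
    coords={"x": [10, 20], "y": ["a", "b", "c"]},
    name="example",
    attrs={"units": "m"},
)

# With dims and coords omitted, default dimension names are generated.
unnamed = xr.DataArray(data)
print(unnamed.dims)

# Check whether the DataArray wraps the original buffer rather than a copy.
print(np.shares_memory(da.values, data))

# A 1-D selection converts to a pandas Series indexed by its coordinate.
series = da.sel(x=10).to_series()
print(series)
```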
- xarray.Dataset: a multi-dimensional, in-memory array database
  - dict-like container of DataArray objects
    - aligned along any number of shared dimensions
    - generalization of pandas.DataFrame
  - resembles an in-memory representation of a NetCDF file
    - consists of variables, coordinates, and attributes
  - implements the mapping interface
    - with keys given by variable names
    - and values given by DataArray objects for each variable name
  - parameters:
    - data_vars (dict-like), optional
      - a mapping from variable names to DataArray objects, Variable objects, or tuples of the form (dims, data[, attrs])
        - the tuples can be used as args to create a new Variable
      - the following notations are accepted:
        - mapping {var name: DataArray}
        - mapping {var name: Variable}
        - mapping {var name: (dimension name, array-like)}
        - mapping {var name: (tuple of dimension names, array-like)}
        - mapping {dimension name: array-like} (if array-like is not a scalar it will be automatically moved to coords, see below)
    - coords (Coordinates or dict-like), optional
      - a Coordinates object or another mapping in a similar form as the data_vars arg
        - except that each item is saved on the dataset as a "coordinate"
      - these variables have an associated meaning
        - they describe constant/fixed/independent quantities
        - unlike the varying/measured/dependent quantities that belong in variables
      - the following notations are accepted for arbitrary mappings:
        - mapping {coord name: DataArray}
        - mapping {coord name: Variable}
        - mapping {coord name: (dimension name, array-like)}
        - mapping {coord name: (tuple of dimension names, array-like)}
        - mapping {dimension name: array-like} (the dimension name is implicitly set to be the same as the coord name)
    - attrs (dict-like), optional
      - global attributes to save on this dataset
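A short sketch of the accepted notations, assuming a made-up dataset with variables `air` and `n_obs` (the names and values are illustrative only, not taken from a real file):

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2013-01-01", periods=3, freq="D")

ds_toy = xr.Dataset(
    data_vars={
        # {var name: (tuple of dimension names, array-like, attrs)}
        "air": (("time", "lat"), np.full((3, 2), 280.0), {"units": "K"}),
        # {var name: (dimension name, array-like)}
        "n_obs": ("time", [1, 2, 3]),
    },
    coords={
        # {dimension name: array-like}: coord name doubles as dimension name
        "time": times,
        "lat": [40.0, 42.5],
        # {coord name: (dimension name, array-like)}: extra labels along lat
        "station": ("lat", ["s1", "s2"]),
    },
    attrs={"title": "toy dataset"},
)

# Mapping interface: keys are variable names, values are DataArray objects.
var_names = list(ds_toy.data_vars)
air = ds_toy["air"]
has_air = "air" in ds_toy
```

Here `station` is stored as a coordinate rather than a data variable because it labels positions along `lat` instead of being a measured quantity.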
Installing and Importing Xarray
# pip install xarray pooch
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
xr.set_options(keep_attrs=True, display_expand_data=False)
np.set_printoptions(threshold=10, edgeitems=2)
#########
# built-in access to tutorial datasets
#########
ds = xr.tutorial.open_dataset("air_temperature")
# stored in the netCDF format
# stored in a temporary cache directory
# Linux: ~/.cache/xarray_tutorial_data
# macOS: ~/Library/Caches/xarray_tutorial_data
# Windows: ~/AppData/Local/xarray_tutorial_data
Working with DataArrays
# Access a specific DataArray
temperature = ds["air"]  # equivalently: ds.air (dot notation)
#####
# DataArray Components
#####
# data stored in numpy array
temperature.values
# named axes of the data
temperature.dims
# labels for the values in each dimension
temperature.coords
# metadata associated with the data
temperature.attrs
########
# Indexing and Selecting Data
########
# Select data for a specific day and location
# (a date string matches every timestamp that falls on that day)
selected_data = temperature.sel(time="2013-01-01", lat=40.0, lon=260.0)
# Slice data across a range of times
time_slice = temperature.sel(time=slice("2013-01-01", "2013-01-31"))
########
# Performing Operations on DataArrays
########
# Calculate the mean temperature over time
mean_temperature = temperature.mean(dim="time")
# Subtract the mean temperature from the original data
anomalies = temperature - mean_temperature
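The snippets above need the tutorial download. As an offline sketch, the same calls can be tried on a synthetic stand-in for ds["air"]; the grid, the 6-hourly time axis, and the random values below are all made up, and `method="nearest"` is an extra option of `sel` not used in the notes above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for ds["air"]: 8 six-hourly steps on a 3 x 2 grid.
times = pd.date_range("2013-01-01", periods=8, freq=pd.Timedelta(hours=6))
rng = np.random.default_rng(42)
temperature = xr.DataArray(
    270.0 + 10.0 * rng.random((8, 3, 2)),
    dims=("time", "lat", "lon"),
    coords={"time": times, "lat": [37.5, 40.0, 42.5], "lon": [257.5, 260.0]},
    name="air",
)

# A date string selects every timestamp on that day at the given grid point.
one_day = temperature.sel(time="2013-01-01", lat=40.0, lon=260.0)

# Labels that are not exactly on the grid can be matched to the nearest point.
point = temperature.sel(lat=41.0, lon=259.0, method="nearest")

# Label-based slices include both endpoints.
window = temperature.sel(time=slice("2013-01-01 06:00", "2013-01-01 18:00"))

# Anomalies: the time mean is broadcast back along "time" by name.
mean_temperature = temperature.mean(dim="time")
anomalies = temperature - mean_temperature
```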
Visualization with Xarray
# Plot the mean temperature
mean_temperature.plot()
plt.show()
# Customize the appearance of plots by passing keyword arguments
mean_temperature.plot(cmap="jet", figsize=(10, 6))
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Mean Temperature")
plt.show()  # finish this figure so the next plot does not draw onto it
# Plot a time series for a specific location
temperature.sel(lat=40.0, lon=260.0).plot()
plt.show()
Working with Datasets
# List all variables in the dataset
print(ds.data_vars)
# Access a DataArray from the Dataset
temperature = ds["air"]
# Perform operations on the Dataset
mean_temp_ds = ds.mean(dim="time")
Reading and Writing files
# netCDF format (recommended)
xr.open_dataset("example.nc", engine="netcdf4")
- the "engine" provides a set of instructions that tell xarray how to read the data and pack them into a Dataset (or DataArray)
  - these instructions are stored in an underlying "backend"
- xarray comes with several backends
  - covering common data formats; many more backends are available via external libraries
- you can add a new backend for read support to xarray
  - create a class that inherits from xarray's BackendEntrypoint
    - and implements open_dataset()
  - declare this class as an external plugin in the project configuration
    - define an entry point in pyproject.toml or setup.py
      - group: xarray.backends
      - name: the name to be passed to open_dataset() as engine
      - object reference: the reference of the class that you have implemented
[project.entry-points."xarray.backends"]
my_engine = "my_package.my_module:MyBackendEntrypoint"
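A minimal sketch of such a backend class, assuming a hypothetical ".toy" file format with one float per line. The format, the variable name `value`, and the dimension name `index` are invented for illustration; a real backend would usually also handle lazy loading and decoding options. For a quick test the class is passed directly as `engine`, so no entry point needs to be registered.

```python
import os
import tempfile

import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint


class MyBackendEntrypoint(BackendEntrypoint):
    """Toy backend for a hypothetical format: one float per line."""

    description = "Toy backend for one-column text files"
    open_dataset_parameters = ["filename_or_obj", "drop_variables"]

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # Read the whole file eagerly into memory.
        values = np.loadtxt(filename_or_obj, ndmin=1)
        ds = xr.Dataset({"value": ("index", values)})
        if drop_variables:
            ds = ds.drop_vars(drop_variables, errors="ignore")
        return ds

    def guess_can_open(self, filename_or_obj):
        # Lets xarray pick this backend when no engine is given.
        return str(filename_or_obj).endswith(".toy")


# Usage: write a small file and open it through the custom backend.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.toy")
    with open(path, "w") as f:
        f.write("1.0\n2.5\n4.0\n")
    toy = xr.open_dataset(path, engine=MyBackendEntrypoint)

print(toy)
```

Once the entry point is declared in the package metadata and the package is installed, the same file could be opened with engine="my_engine" instead of passing the class.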