Introduction to Xarray
Xarray is a Python library that extends the features and functionality of NumPy, giving us the possibility to work with labeled arrays and datasets.
In fact, as they say on their website:
Xarray makes working with labeled multi-dimensional arrays in Python simple, efficient, and fun!
Even more:
Xarray introduces labels as dimensions, coordinates, and attributes on top of raw NumPy-like multidimensional arrays, allowing for a more intuitive, more concise, and less error-prone developer experience.
In other words, it extends the functionality of NumPy arrays by adding labels or coordinates to the array dimensions. These labels provide metadata and enable more advanced analysis and manipulation of multi-dimensional data.
For example, in NumPy, arrays are accessed using integer-based indexing.
Instead, in Xarray, each dimension can have a label associated with it, making it easier to understand and manipulate the data based on meaningful names.
For example, instead of accessing the data with arr[0, 1, 2], we can use arr.sel(x=0, y=1, z=2) in Xarray, where x, y, and z are the labeled dimensions.
It makes the code more readable!
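To make the contrast concrete, here is a minimal sketch that reads the same element three ways. The dimension names and coordinate values are made up for illustration. Note that sel() looks values up by coordinate label, while isel() is the named-dimension counterpart of NumPy's positional indexing.

```python
import numpy as np
import xarray as xr

# A small 2 x 3 x 4 array with known values
values = np.arange(24).reshape(2, 3, 4)

# Hypothetical labels, chosen only for this illustration
arr = xr.DataArray(
    values,
    dims=('x', 'y', 'z'),
    coords={'x': [10, 20], 'y': [0, 1, 2], 'z': [100, 200, 300, 400]},
)

by_position_np = values[0, 1, 2]              # plain NumPy positional indexing
by_position = arr.isel(x=0, y=1, z=2).item()  # positional, but with named dimensions
by_label = arr.sel(x=10, y=1, z=300).item()   # lookup by coordinate labels

print(by_position_np, by_position, by_label)
```

All three expressions refer to the same element; only the last one still reads correctly if the order of the dimensions changes.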
So let’s see some features of Xarray.
Some features of Xarray in action
As usual, to install it:
$ pip install xarray
Feature One: Working with Labeled Coordinates
Let’s say we want to create some data related to temperature and we want to label it with coordinates like latitude and longitude. We can do it like this:
import xarray as xr
import numpy as np
# Create temperature data
temperature = np.random.rand(100, 100) * 20 + 10
# Create coordinate arrays for latitude and longitude
latitudes = np.linspace(-90, 90, 100)
longitudes = np.linspace(-180, 180, 100)
# Create an Xarray data array with labeled coordinates
da = xr.DataArray(
    temperature,
    dims=('latitude', 'longitude'),
    coords={'latitude': latitudes, 'longitude': longitudes}
)
# Access data using labeled coordinates
subset = da.sel(latitude=slice(-45, 45), longitude=slice(-90, 0))
And if we print the subset, we get:
# Print data
print(subset)
>>>
<xarray.DataArray (latitude: 50, longitude: 25)>
array([[13.45064786, 29.15218061, 14.77363206, ..., 12.00262833,
        16.42712411, 15.61353963],
       [23.47498117, 20.25554247, 14.44056286, ..., 19.04096482,
        15.60398491, 24.69535367],
       [25.48971105, 20.64944534, 21.2263141 , ..., 25.80933737,
        16.72629302, 29.48307134],
       ...,
       [10.19615833, 17.106716 , 10.79594252, ..., 29.6897709 ,
        20.68549602, 29.4015482 ],
       [26.54253304, 14.21939699, 11.085207 , ..., 15.56702191,
        19.64285595, 18.03809074],
       [26.50676351, 15.21217526, 23.63645069, ..., 17.22512125,
        13.96942377, 13.93766583]])
Coordinates:
* latitude (latitude) float64 -44.55 -42.73 -40.91 ... 40.91 42.73 44.55
* longitude (longitude) float64 -89.09 -85.45 -81.82 ... -9.091 -5.455 -1.818
So, let’s look at the process step by step:
- We created the temperature values as a NumPy array.
- We defined the latitude and longitude values as NumPy arrays.
- We stored all the data in an Xarray data array with DataArray().
- We chose a subset of latitudes and longitudes with the sel() method, which selects the values we want for our subset.
The result is also easily readable, so labeling is really helpful in many cases.
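Label-based selection is not limited to slices. As a hedged sketch, the snippet below rebuilds the same kind of grid (with a seeded random generator, used here only for reproducibility) and picks a single point with method='nearest'. The requested values 45.0 and 12.5 are arbitrary examples that do not fall exactly on the grid.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
temperature = rng.random((100, 100)) * 20 + 10
latitudes = np.linspace(-90, 90, 100)
longitudes = np.linspace(-180, 180, 100)

da = xr.DataArray(
    temperature,
    dims=('latitude', 'longitude'),
    coords={'latitude': latitudes, 'longitude': longitudes},
)

# 45.0 and 12.5 are not grid values, so an exact lookup would raise a KeyError;
# method='nearest' picks the closest grid point instead.
point = da.sel(latitude=45.0, longitude=12.5, method='nearest')

print(point['latitude'].item(), point['longitude'].item(), point.item())
```

The result is a zero-dimensional data array that still carries the latitude and longitude of the grid point that was actually chosen.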
Feature Two: Handling Missing Data
Suppose we are collecting temperature data throughout the year. We want to know whether there are any missing values in our data. Here’s how we can do that:
import xarray as xr
import numpy as np
import pandas as pd
# Create temperature data with missing values
temperature = np.random.rand(365, 50, 50) * 20 + 10
temperature[0:10, :, :] = np.nan # Set the first 10 days as missing values
# Create time, latitude, and longitude coordinate arrays
times = pd.date_range('2023-01-01', periods=365, freq='D')
latitudes = np.linspace(-90, 90, 50)
longitudes = np.linspace(-180, 180, 50)
# Create an Xarray data array with missing values
da = xr.DataArray(
    temperature,
    dims=('time', 'latitude', 'longitude'),
    coords={'time': times, 'latitude': latitudes, 'longitude': longitudes}
)
# Count the number of missing values along the time dimension
missing_count = da.isnull().sum(dim='time')
# Print missing values
print(missing_count)
>>>
<xarray.DataArray (latitude: 50, longitude: 50)>
array([[10, 10, 10, ..., 10, 10, 10],
       [10, 10, 10, ..., 10, 10, 10],
       [10, 10, 10, ..., 10, 10, 10],
       ...,
       [10, 10, 10, ..., 10, 10, 10],
       [10, 10, 10, ..., 10, 10, 10],
       [10, 10, 10, ..., 10, 10, 10]])
Coordinates:
* latitude (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
* longitude (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0
And so we find that we have 10 missing values at every grid point.
Also, if we take a closer look at the code, we can see that Xarray supports pandas-style methods such as isnull() and sum(). Chained as isnull().sum(dim='time'), they count the number of missing values along the time dimension.
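Counting is usually only the first step. The following minimal sketch, on a deliberately tiny array with made-up values, shows three other DataArray methods that are commonly used with missing data: count(), fillna(), and dropna(). The fill value 0.0 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd
import xarray as xr

# 5 days on a 2 x 2 grid; the first 2 days are missing
temperature = np.full((5, 2, 2), 20.0)
temperature[0:2, :, :] = np.nan

da = xr.DataArray(
    temperature,
    dims=('time', 'latitude', 'longitude'),
    coords={
        'time': pd.date_range('2023-01-01', periods=5, freq='D'),
        'latitude': [-45.0, 45.0],
        'longitude': [0.0, 90.0],
    },
)

valid_count = da.count(dim='time')          # non-missing values per grid point
filled = da.fillna(0.0)                     # replace NaN with a constant
trimmed = da.dropna(dim='time', how='all')  # drop days where every value is NaN

print(valid_count.values)
print(trimmed.sizes['time'])
```

Which of these is appropriate depends on the analysis: filling with a constant changes later statistics, while dropping whole days shortens the time axis.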
Feature Three: Handling and Analyzing Multidimensional Data
The temptation to handle and analyze multi-dimensional data is high when we have the possibility to label our arrays. So, why not give it a try?
For example, let’s say we’re still collecting data on temperature at certain latitudes and longitudes.
We may want to calculate the mean, maximum, and minimum temperature. We can do it like this:
import xarray as xr
import numpy as np
import pandas as pd
# Create synthetic temperature data
temperature = np.random.rand(365, 50, 50) * 20 + 10
# Create time, latitude, and longitude coordinate arrays
times = pd.date_range('2023-01-01', periods=365, freq='D')
latitudes = np.linspace(-90, 90, 50)
longitudes = np.linspace(-180, 180, 50)
# Create an Xarray dataset
ds = xr.Dataset(
    {
        'temperature': (('time', 'latitude', 'longitude'), temperature),
    },
    coords={
        'time': times,
        'latitude': latitudes,
        'longitude': longitudes,
    }
)
# Perform statistical analysis on the temperature data
mean_temperature = ds['temperature'].mean(dim='time')
max_temperature = ds['temperature'].max(dim='time')
min_temperature = ds['temperature'].min(dim='time')
# Print values
print(f"mean temperature:\n {mean_temperature}\n")
print(f"max temperature:\n {max_temperature}\n")
print(f"min temperature:\n {min_temperature}\n")
>>>
mean temperature:
<xarray.DataArray 'temperature' (latitude: 50, longitude: 50)>
array([[19.99931701, 20.36395016, 20.04110699, ..., 19.98811842,
        20.08895803, 19.86064693],
       [19.84016491, 19.87077812, 20.27445405, ..., 19.8071972 ,
        19.62665953, 19.58231185],
       [19.63911165, 19.62051976, 19.61247548, ..., 19.85043831,
        20.13086891, 19.80267099],
       ...,
       [20.18590514, 20.05931149, 20.17133483, ..., 20.52858247,
        19.83882433, 20.66808513],
       [19.56455575, 19.90091128, 20.32566232, ..., 19.88689221,
        19.78811145, 19.91205212],
       [19.82268297, 20.14242279, 19.60842148, ..., 19.68290006,
        20.00327294, 19.68955107]])
Coordinates:
* latitude (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
* longitude (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0
max temperature:
<xarray.DataArray 'temperature' (latitude: 50, longitude: 50)>
array([[29.98465531, 29.97609171, 29.96821276, ..., 29.86639343,
        29.95069558, 29.98807808],
       [29.91802049, 29.92870312, 29.87625447, ..., 29.92519055,
        29.9964299 , 29.99792388],
       [29.96647016, 29.7934891 , 29.89731136, ..., 29.99174546,
        29.97267052, 29.96058079],
       ...,
       [29.91699117, 29.98920555, 29.83798369, ..., 29.90271746,
        29.93747041, 29.97244906],
       [29.99171911, 29.99051943, 29.92706773, ..., 29.90578739,
        29.99433847, 29.94506567],
       [29.99438621, 29.98798699, 29.97664488, ..., 29.98669576,
        29.91296382, 29.93100249]])
Coordinates:
* latitude (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
* longitude (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0
min temperature:
<xarray.DataArray 'temperature' (latitude: 50, longitude: 50)>
array([[10.0326431 , 10.07666029, 10.02795524, ..., 10.17215336,
        10.00264909, 10.05387097],
       [10.00355858, 10.00610942, 10.02567816, ..., 10.29100316,
        10.00861792, 10.16955806],
       [10.01636216, 10.02856619, 10.00389027, ..., 10.0929342 ,
        10.01504103, 10.06219179],
       ...,
       [10.00477003, 10.0303088 , 10.04494723, ..., 10.05720692,
        10.122994 , 10.04947012],
       [10.00422182, 10.0211205 , 10.00183528, ..., 10.03818058,
        10.02632697, 10.06722953],
       [10.10994581, 10.12445222, 10.03002468, ..., 10.06937041,
        10.04924046, 10.00645499]])
Coordinates:
* latitude (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
* longitude (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0
And we got what we wanted, in a clearly readable form.
And, as before, we used pandas-style methods applied to a labeled array to calculate the mean, maximum, and minimum values of the temperature.
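Because the dimensions are named, the same reductions can run along any axis without tracking axis numbers. As a hedged sketch, the snippet below rebuilds a dataset of the same shape (with a seeded random generator, used only for reproducibility), averages over space to get one value per day, and uses groupby('time.month') to get one map per month. The spatial mean here is a plain unweighted average, which is an assumption of this sketch and ignores the fact that grid cells near the poles cover less area.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
temperature = rng.random((365, 50, 50)) * 20 + 10

ds = xr.Dataset(
    {'temperature': (('time', 'latitude', 'longitude'), temperature)},
    coords={
        'time': pd.date_range('2023-01-01', periods=365, freq='D'),
        'latitude': np.linspace(-90, 90, 50),
        'longitude': np.linspace(-180, 180, 50),
    },
)

# Average over both spatial dimensions: one value per day (unweighted)
daily_mean = ds['temperature'].mean(dim=('latitude', 'longitude'))

# Group the days by calendar month and average: one 50 x 50 map per month
monthly_mean = ds['temperature'].groupby('time.month').mean()

print(daily_mean.sizes)
print(monthly_mean.sizes)
```

The grouped result gains a new month dimension in place of time, so a single month can then be picked with monthly_mean.sel(month=1).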











