4. Exploring EMIT data and inspecting spectra for alien species#
4.1. Overview#
In this notebook, we will start analysing satellite imaging spectroscopy data. We will open data from The Earth Surface Mineral Dust Source Investigation (EMIT) stored remotely on AWS and explore it. We will then extract some spectra for invasive plant species and visualize these.
4.2. Learning Objectives#
Understand how to access data direclty from AWS s3 storage
Load, mask, orthorectify, and visualize EMIT data
Extract spectra at locations with and without aline plant and inspect these
4.2.1. Load libraries#
#imports
import xarray as xr
import hvplot.xarray
import hvplot.pandas
import holoviews as hv
import pandas as pd
import numpy as np
import math
import geopandas as gpd
import panel as pn
import xvec
#our modules
import emit_tools #from https://github.com/nasa/EMIT-Data-Resources/blob/main/python/modules/emit_tools.py
# Function to adjust gamma across all bands - adjust brightness
def gamma_adjust(rgb_ds, bright=0.2, white_background=False):
array = rgb_ds.reflectance.data
gamma = math.log(bright)/math.log(np.nanmean(array)) # Create exponent for gamma scaling - can be adjusted by changing 0.2
scaled = np.power(array,gamma).clip(0,1) # Apply scaling and clip to 0-1 range
if white_background == True:
scaled = np.nan_to_num(scaled, nan = 1) # Assign NA's to 1 so they appear white in plots
rgb_ds.reflectance.data = scaled
return rgb_ds
4.2.2. Authenticate with NASA Earthdata portal#
Earthaccess is a python library to search, download or stream NASA Earth science data. You will also need an account on NASA’s Earthdata data portal https://search.earthdata.nasa.gov/
import earthaccess
auth = earthaccess.login(persist=True)
4.2.3. Accessing data on AWS S3#
The emit files are stored on Amazon S3 object storage. We will access these directly without needing to download them as we can stream s3 files. Also, because we are using xarray with dask and our data is stored in a chunked file format netcdf
, we only need to stream the chunks we are currently working on and not the entire file at once.
The best way to access the data is on a virtual machine located in us-west-2
. This is where the data is located and this mean that will get the fastest reading and writing of data possible, and all read/writes will be free. If your machine is not in us-west-2
or you are on a local machine, I do not recommend using S3, as it might be slower and there will be costs involved.
4.2.4. Finding data#
earthaccess has really convineint methods for searching and loading nasa data over S3 using earthaccess.search_data()
. Here is an example:
results = earthaccess.search_data(
short_name='EMITL2ARFL',
point=(19.03711,-33.94747),
temporal=('2023-01-18','2023-01-24'),
count=2
)
you can then load the results into an xarray
files = earthaccess.open(results)
ds = xr.open_dataset(files.f)
if you do not want to stream the data from S3 you can download the data directly using:
files = earthaccess.download(results, "./local_folder")
for more info on using earthaccess see the docs
4.2.5. Open the datset#
If you are in us-west-2: Authenticate with s3 and stream data
results = earthaccess.search_data(
short_name='EMITL2ARFL',
point=(19.03711,-33.94747),
temporal=('2023-01-18','2023-01-24'),
count=1
)
file_handlers = earthaccess.open(results)
file_handlers
Granules found: 2
Opening 1 granules, approx size: 3.5 GB
using endpoint: https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials
[<File-like object S3FileSystem, lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20230119T114247_2301907_005/EMIT_L2A_RFL_001_20230119T114247_2301907_005.nc>,
<File-like object S3FileSystem, lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20230119T114247_2301907_005/EMIT_L2A_RFLUNCERT_001_20230119T114247_2301907_005.nc>,
<File-like object S3FileSystem, lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20230119T114247_2301907_005/EMIT_L2A_MASK_001_20230119T114247_2301907_005.nc>]
Inspect the raw data
xr.open_dataset(file_handlers[0].f,engine='h5netcdf',chunks='auto')
<xarray.Dataset> Size: 2GB Dimensions: (downtrack: 1280, crosstrack: 1242, bands: 285) Dimensions without coordinates: downtrack, crosstrack, bands Data variables: reflectance (downtrack, crosstrack, bands) float32 2GB dask.array<chunksize=(538, 522, 119), meta=np.ndarray> Attributes: (12/38) ncei_template_version: NCEI_NetCDF_Swath_Template_v2.0 summary: The Earth Surface Mineral Dust Source ... keywords: Imaging Spectroscopy, minerals, EMIT, ... Conventions: CF-1.63 sensor: EMIT (Earth Surface Mineral Dust Sourc... instrument: EMIT ... ... southernmost_latitude: -34.15500797254596 spatialResolution: 0.000542232520256367 spatial_ref: GEOGCS["WGS 84",DATUM["WGS_1984",SPHER... geotransform: [ 1.84708746e+01 5.42232520e-04 -0.00... day_night_flag: Day title: EMIT L2A Estimated Surface Reflectance...
If you are not in us-west-2: Download locally
#get the files we want
results = earthaccess.search_data(
short_name='EMITL2ARFL',
point=(19.03711,-33.94747),
temporal=('2023-01-18','2023-01-24'),
count=1
)
#download them
files = earthaccess.download(results, "data/downloads")
Granules found: 2
Getting 1 granules, approx download size: 3.5 GB
Accessing cloud dataset using dataset endpoint credentials: https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials
Downloaded: data/downloads/EMIT_L2A_RFL_001_20230119T114247_2301907_005.nc
Downloaded: data/downloads/EMIT_L2A_RFLUNCERT_001_20230119T114247_2301907_005.nc
Downloaded: data/downloads/EMIT_L2A_MASK_001_20230119T114247_2301907_005.nc
# Open dataset with xarray
fp = 'data/downloads/EMIT_L2A_RFL_001_20230119T114247_2301907_005.nc'
fp_mask = 'data/downloads/EMIT_L2A_MASK_001_20230119T114247_2301907_005.nc'
ds = xr.open_dataset(fp,engine='h5netcdf',chunks='auto')
ds
<xarray.Dataset> Size: 2GB Dimensions: (downtrack: 1280, crosstrack: 1242, bands: 285) Dimensions without coordinates: downtrack, crosstrack, bands Data variables: reflectance (downtrack, crosstrack, bands) float32 2GB dask.array<chunksize=(538, 522, 119), meta=np.ndarray> Attributes: (12/38) ncei_template_version: NCEI_NetCDF_Swath_Template_v2.0 summary: The Earth Surface Mineral Dust Source ... keywords: Imaging Spectroscopy, minerals, EMIT, ... Conventions: CF-1.63 sensor: EMIT (Earth Surface Mineral Dust Sourc... instrument: EMIT ... ... southernmost_latitude: -34.15500797254596 spatialResolution: 0.000542232520256367 spatial_ref: GEOGCS["WGS 84",DATUM["WGS_1984",SPHER... geotransform: [ 1.84708746e+01 5.42232520e-04 -0.00... day_night_flag: Day title: EMIT L2A Estimated Surface Reflectance...
4.2.6. Load the mask#
The EMIT data is composed of 3 files, the actual data, a mask that tells us about various quality issues with the data, and the reflectance uncertainties (we will ignore the uncertainty for now). We can selecting which quality filter (mask) to use. Here we used the aggregate filter - the sum of all filters to be very strict about what data we include. The quality_mask
function provided by the EMIT team reads and processes this file for us.
flags=[7]
mask = emit_tools.quality_mask(file_handlers[2].f,flags)
Flags used: ['Aggregate Flag']
4.2.7. Load the data#
As you saw the raw data is the unprojected, unmasked focal plane array and does not include much metadata for us to tell what is going on. The emit_array
function provided by the EMIT team reads and processes this file for us. It can orthorectify the data and apply the mask too if we want.
Warning: in order to orthorectify the data, the entire array must be loaded into memory. This means you need a decent amount of RAM or the operation will fail. I found that 16GB is the minimum needed.
ds = emit_tools.emit_xarray(file_handlers[0].f, ortho=True, qmask=mask)
ds
<xarray.Dataset> Size: 5GB Dimensions: (latitude: 1901, longitude: 2346, wavelengths: 285) Coordinates: * wavelengths (wavelengths) float32 1kB 381.0 388.4 ... 2.493e+03 fwhm (wavelengths) float32 1kB ... good_wavelengths (wavelengths) float32 1kB ... * latitude (latitude) float64 15kB -33.12 -33.13 ... -34.15 -34.15 * longitude (longitude) float64 19kB 18.47 18.47 18.47 ... 19.74 19.74 elev (latitude, longitude) float32 18MB -9.999e+03 ... -9.99... spatial_ref int64 8B 0 Data variables: reflectance (latitude, longitude, wavelengths) float32 5GB -9.999e+... Attributes: (12/40) ncei_template_version: NCEI_NetCDF_Swath_Template_v2.0 summary: The Earth Surface Mineral Dust Source ... keywords: Imaging Spectroscopy, minerals, EMIT, ... Conventions: CF-1.63 sensor: EMIT (Earth Surface Mineral Dust Sourc... instrument: EMIT ... ... spatial_ref: GEOGCS["WGS 84",DATUM["WGS_1984",SPHER... geotransform: [ 1.84708746e+01 5.42232520e-04 -0.00... day_night_flag: Day title: EMIT L2A Estimated Surface Reflectance... granule_id: EMIT_L2A_RFL_001_20230119T114247_23019... Orthorectified: True
we will chunk the data so that we dont can apply any analysis to chunks rather than the entire array
ds = ds.chunk({'latitude':100,'longitude':100,'wavelengths':10})
The first thing we do is drop the bad bands (those in the atmospheric water absorption zones) and replace -9999 which is used to code no data with nan
ds = ds.where(ds.good_wavelengths.compute()==1,drop=True)
ds = ds.where(ds.reflectance != -9999)
4.2.8. Quick plot#
#select rgb
rgb = ds.sel(wavelengths=[650, 560, 470], method='nearest')
#make colors nicer
rgb = gamma_adjust(rgb)
plot natural colors
rgb.hvplot.rgb(x='longitude', y='latitude', bands='wavelengths', frame_width=400,rasterize=True, data_aspect=1)
plot a single wavelength
ds.sel(wavelengths=850, method='nearest').hvplot(cmap='Magma', geo=True,
tiles='ESRI', rasterize=True, alpha=0.8,
frame_height=600,clim =(0,0.7)
).opts(title=f"Reflectance at 850nm (Orthorectified)")
We can make nice interactive plots in python. Here we can use a slider to select the wavelength we want to view
def plot_reflectance(wavelength):
return ds.sel(wavelengths=wavelength, method='nearest').hvplot(
cmap='Magma', geo=True, tiles='ESRI', rasterize=True, alpha=0.8,
frame_height=600, clim=(0, 0.7)
).opts(title=f"Reflectance at {wavelength}nm (Orthorectified)")
wavelength_slider = pn.widgets.IntSlider(name='Wavelength (nm)', start=int(min(ds.wavelengths.values)), end=int(max(ds.wavelengths.values)), step=5, value=850)
interactive_plot = pn.bind(plot_reflectance, wavelength=wavelength_slider)
pn.Column(wavelength_slider, interactive_plot)
4.2.9. Extract points#
We will load a small dataset of points for Fynbos and Pine covered sites across the EMIT image we are working with.
#load points
gdf = gpd.read_file("/shared/users/gmoncrieff/emit_unmix.gpkg")
#explore the data
gdf[['Category','geometry']].explore('Category')
The xvec package will extract data at a point or area within an xarray and return the data as an xarray
extracted = ds.xvec.extract_points(gdf['geometry'], x_coords="longitude", y_coords="latitude",index=True)
extracted
<xarray.Dataset> Size: 113kB Dimensions: (geometry: 111, wavelengths: 244) Coordinates: * wavelengths (wavelengths) float32 976B 381.0 388.4 ... 2.493e+03 fwhm (wavelengths) float32 976B 8.415 8.415 ... 8.807 8.809 good_wavelengths (wavelengths) float32 976B 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 elev (geometry) float32 444B dask.array<chunksize=(111,), meta=np.ndarray> spatial_ref int64 8B 0 * geometry (geometry) object 888B POINT (19.2194 -33.9549) ... POI... index (geometry) int64 888B 0 1 2 3 4 5 ... 106 107 108 109 110 Data variables: reflectance (geometry, wavelengths) float32 108kB dask.array<chunksize=(111, 10), meta=np.ndarray> Indexes: geometry GeometryIndex (crs=EPSG:4326) Attributes: (12/40) ncei_template_version: NCEI_NetCDF_Swath_Template_v2.0 summary: The Earth Surface Mineral Dust Source ... keywords: Imaging Spectroscopy, minerals, EMIT, ... Conventions: CF-1.63 sensor: EMIT (Earth Surface Mineral Dust Sourc... instrument: EMIT ... ... spatial_ref: GEOGCS["WGS 84",DATUM["WGS_1984",SPHER... geotransform: [ 1.84708746e+01 5.42232520e-04 -0.00... day_night_flag: Day title: EMIT L2A Estimated Surface Reflectance... granule_id: EMIT_L2A_RFL_001_20230119T114247_23019... Orthorectified: True
Lets turn our xarray into a dataframe, and join it with the original dataset
extracted = (extracted
.to_dataframe() #turn into a dataframe
.reset_index() #convert the row names to a column
.merge( #merge with the original dataset
gdf.reset_index(),on='index')
)
#select the columns we want
extracted = extracted[['wavelengths','reflectance','geometry_x','Category','index']]
extracted
wavelengths | reflectance | geometry_x | Category | index | |
---|---|---|---|---|---|
0 | 381.005585 | 0.017709 | POINT (19.2194 -33.9549) | PINE | 0 |
1 | 388.409210 | 0.018134 | POINT (19.2194 -33.9549) | PINE | 0 |
2 | 395.815826 | 0.018555 | POINT (19.2194 -33.9549) | PINE | 0 |
3 | 403.225403 | 0.018993 | POINT (19.2194 -33.9549) | PINE | 0 |
4 | 410.638000 | 0.019613 | POINT (19.2194 -33.9549) | PINE | 0 |
... | ... | ... | ... | ... | ... |
27079 | 2463.381592 | 0.044974 | POINT (19.38343349154113 -33.445019637266505) | FYNBOS | 110 |
27080 | 2470.767822 | 0.045954 | POINT (19.38343349154113 -33.445019637266505) | FYNBOS | 110 |
27081 | 2478.153076 | 0.044455 | POINT (19.38343349154113 -33.445019637266505) | FYNBOS | 110 |
27082 | 2485.538574 | 0.042368 | POINT (19.38343349154113 -33.445019637266505) | FYNBOS | 110 |
27083 | 2492.923828 | 0.041585 | POINT (19.38343349154113 -33.445019637266505) | FYNBOS | 110 |
27084 rows × 5 columns
Finally lets select a class and plot the spectra for all points of that class
extracted_plot = extracted.query('Category == "FYNBOS"')
extracted_plot.hvplot.line(y='reflectance',x='wavelengths',by='index',
color='green',ylim=(0,0.5),alpha=0.5,legend=False)
4.2.10. credits#
This lesson has borrowed from: