4. Exploring EMIT data and inspecting spectra for alien species#

4.1. Overview#

In this notebook, we will start analysing satellite imaging spectroscopy data. We will open data from the Earth Surface Mineral Dust Source Investigation (EMIT) stored remotely on AWS and explore it. We will then extract spectra at sites with invasive plant species and visualize them.

4.2. Learning Objectives#

  1. Understand how to access data directly from AWS S3 storage

  2. Load, mask, orthorectify, and visualize EMIT data

  3. Extract spectra at locations with and without alien plants and inspect these

4.2.1. Load libraries#

#imports
import xarray as xr
import hvplot.xarray
import hvplot.pandas
import holoviews as hv
import pandas as pd
import numpy as np
import math
import geopandas as gpd
import panel as pn
import xvec
#our modules
import emit_tools #from https://github.com/nasa/EMIT-Data-Resources/blob/main/python/modules/emit_tools.py

# Function to adjust gamma across all bands - adjust brightness
def gamma_adjust(rgb_ds, bright=0.2, white_background=False):
    array = rgb_ds.reflectance.data
    gamma = math.log(bright)/math.log(np.nanmean(array)) # Exponent for gamma scaling - target brightness is set by the bright argument
    scaled = np.power(array,gamma).clip(0,1) # Apply scaling and clip to 0-1 range
    if white_background:
        scaled = np.nan_to_num(scaled, nan=1) # Assign NaNs to 1 so they appear white in plots
    rgb_ds.reflectance.data = scaled
    return rgb_ds
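To see what this scaling does, here is a tiny worked example with synthetic data: the exponent is chosen so a uniformly dark scene with mean reflectance 0.04 is lifted until its mean equals the bright target of 0.2.

```python
import math
import numpy as np

# Toy data: a uniform dark scene with mean reflectance 0.04
arr = np.full((2, 2, 3), 0.04)
bright = 0.2

# Same scaling as gamma_adjust above: pick gamma so the scene mean maps to bright
gamma = math.log(bright) / math.log(np.nanmean(arr))  # log(0.2)/log(0.04) = 0.5
scaled = np.power(arr, gamma).clip(0, 1)
print(scaled.mean())  # 0.2 - the dark mean is lifted to the target brightness
```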

4.2.2. Authenticate with NASA Earthdata portal#

earthaccess is a Python library for searching, downloading, and streaming NASA Earth science data. You will also need an account on NASA’s Earthdata portal: https://search.earthdata.nasa.gov/

import earthaccess
auth = earthaccess.login(persist=True)

4.2.3. Accessing data on AWS S3#

The EMIT files are stored on Amazon S3 object storage. We will access them directly without downloading, as S3 files can be streamed. Also, because we are using xarray with dask, and the data is stored in a chunked file format (NetCDF), we only need to stream the chunks we are currently working on and not the entire file at once.

The best way to access the data is from a virtual machine located in us-west-2, the AWS region where the data is hosted. This means you will get the fastest reading and writing of data possible, and all read/writes will be free. If your machine is not in us-west-2, or you are on a local machine, I do not recommend streaming from S3, as it might be slower and there will be costs involved.
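The chunked, lazy access pattern described above can be sketched with a toy dataset: slicing a dask-backed xarray only builds a task graph, and just the chunks that overlap the selection are read when you compute.

```python
import numpy as np
import xarray as xr

# Toy example (synthetic data): chunk a dataset, then slice it
ds = xr.Dataset(
    {"reflectance": (("y", "x"), np.random.rand(100, 100))}
).chunk({"y": 50, "x": 50})

subset = ds.reflectance[:10, :10]  # lazy - no chunk has been computed yet
values = subset.compute()          # only the one overlapping chunk is materialized
print(values.shape)
```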

4.2.4. Finding data#

earthaccess has really convenient methods for searching and loading NASA data over S3 using earthaccess.search_data(). Here is an example:

results = earthaccess.search_data(
    short_name='EMITL2ARFL',
    point=(19.03711,-33.94747),
    temporal=('2023-01-18','2023-01-24'),
    count=2
)

You can then load the results into an xarray dataset:

files = earthaccess.open(results)
ds = xr.open_dataset(files[0].f)

If you do not want to stream the data from S3, you can download it directly using:

files = earthaccess.download(results, "./local_folder")

For more info on using earthaccess, see the docs.

4.2.5. Open the dataset#

If you are in us-west-2: authenticate with S3 and stream the data

results = earthaccess.search_data(
    short_name='EMITL2ARFL',
    point=(19.03711,-33.94747),
    temporal=('2023-01-18','2023-01-24'),
    count=1
)
file_handlers = earthaccess.open(results)
file_handlers
Granules found: 2
Opening 1 granules, approx size: 3.5 GB
using endpoint: https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials
[<File-like object S3FileSystem, lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20230119T114247_2301907_005/EMIT_L2A_RFL_001_20230119T114247_2301907_005.nc>,
 <File-like object S3FileSystem, lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20230119T114247_2301907_005/EMIT_L2A_RFLUNCERT_001_20230119T114247_2301907_005.nc>,
 <File-like object S3FileSystem, lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20230119T114247_2301907_005/EMIT_L2A_MASK_001_20230119T114247_2301907_005.nc>]

Inspect the raw data

xr.open_dataset(file_handlers[0].f,engine='h5netcdf',chunks='auto')
<xarray.Dataset> Size: 2GB
Dimensions:      (downtrack: 1280, crosstrack: 1242, bands: 285)
Dimensions without coordinates: downtrack, crosstrack, bands
Data variables:
    reflectance  (downtrack, crosstrack, bands) float32 2GB dask.array<chunksize=(538, 522, 119), meta=np.ndarray>
Attributes: (12/38)
    ncei_template_version:             NCEI_NetCDF_Swath_Template_v2.0
    summary:                           The Earth Surface Mineral Dust Source ...
    keywords:                          Imaging Spectroscopy, minerals, EMIT, ...
    Conventions:                       CF-1.63
    sensor:                            EMIT (Earth Surface Mineral Dust Sourc...
    instrument:                        EMIT
    ...                                ...
    southernmost_latitude:             -34.15500797254596
    spatialResolution:                 0.000542232520256367
    spatial_ref:                       GEOGCS["WGS 84",DATUM["WGS_1984",SPHER...
    geotransform:                      [ 1.84708746e+01  5.42232520e-04 -0.00...
    day_night_flag:                    Day
    title:                             EMIT L2A Estimated Surface Reflectance...

If you are not in us-west-2: Download locally

#get the files we want
results = earthaccess.search_data(
    short_name='EMITL2ARFL',
    point=(19.03711,-33.94747),
    temporal=('2023-01-18','2023-01-24'),
    count=1
)
#download them
files = earthaccess.download(results, "data/downloads")
Granules found: 2
 Getting 1 granules, approx download size: 3.5 GB
Accessing cloud dataset using dataset endpoint credentials: https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials
Downloaded: data/downloads/EMIT_L2A_RFL_001_20230119T114247_2301907_005.nc
Downloaded: data/downloads/EMIT_L2A_RFLUNCERT_001_20230119T114247_2301907_005.nc
Downloaded: data/downloads/EMIT_L2A_MASK_001_20230119T114247_2301907_005.nc
# Open dataset with xarray
fp = 'data/downloads/EMIT_L2A_RFL_001_20230119T114247_2301907_005.nc'
fp_mask = 'data/downloads/EMIT_L2A_MASK_001_20230119T114247_2301907_005.nc'

ds = xr.open_dataset(fp,engine='h5netcdf',chunks='auto')
ds
<xarray.Dataset> Size: 2GB
Dimensions:      (downtrack: 1280, crosstrack: 1242, bands: 285)
Dimensions without coordinates: downtrack, crosstrack, bands
Data variables:
    reflectance  (downtrack, crosstrack, bands) float32 2GB dask.array<chunksize=(538, 522, 119), meta=np.ndarray>
Attributes: (12/38)
    ncei_template_version:             NCEI_NetCDF_Swath_Template_v2.0
    summary:                           The Earth Surface Mineral Dust Source ...
    keywords:                          Imaging Spectroscopy, minerals, EMIT, ...
    Conventions:                       CF-1.63
    sensor:                            EMIT (Earth Surface Mineral Dust Sourc...
    instrument:                        EMIT
    ...                                ...
    southernmost_latitude:             -34.15500797254596
    spatialResolution:                 0.000542232520256367
    spatial_ref:                       GEOGCS["WGS 84",DATUM["WGS_1984",SPHER...
    geotransform:                      [ 1.84708746e+01  5.42232520e-04 -0.00...
    day_night_flag:                    Day
    title:                             EMIT L2A Estimated Surface Reflectance...

4.2.6. Load the mask#

The EMIT data is composed of 3 files: the actual reflectance data, a mask that tells us about various quality issues with the data, and the reflectance uncertainties (we will ignore the uncertainties for now). We can select which quality filter (mask) to use. Here we use the aggregate filter - the combination of all individual filters - to be very strict about what data we include. The quality_mask function provided by the EMIT team reads and processes this file for us.

flags=[7]
mask = emit_tools.quality_mask(file_handlers[2].f,flags)
Flags used: ['Aggregate Flag']
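Under the hood, applying a quality mask boils down to setting flagged pixels to NaN. A toy sketch with synthetic data, using xarray's where():

```python
import numpy as np
import xarray as xr

# Synthetic reflectance and quality mask (1 = bad pixel)
refl = xr.DataArray(np.ones((2, 2)), dims=("y", "x"))
qmask = xr.DataArray([[0, 1], [0, 0]], dims=("y", "x"))

# Keep only pixels where the mask is clear; flagged pixels become NaN
masked = refl.where(qmask == 0)
print(int(masked.isnull().sum()))  # 1 pixel masked out
```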

4.2.7. Load the data#

As you saw, the raw data is the unprojected, unmasked focal plane array, and it does not include much metadata for us to tell what is going on. The emit_xarray function provided by the EMIT team reads and processes the file for us. It can also orthorectify the data and apply the mask if we want.

Warning: in order to orthorectify the data, the entire array must be loaded into memory. This means you need a decent amount of RAM or the operation will fail. I found that 16GB is the minimum needed.
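Before attempting the orthorectification, you can sanity-check your machine's memory. A minimal sketch using only the standard library (POSIX-only; the sysconf names below are not available on Windows):

```python
import os

# Estimate total physical RAM from page size x number of pages (POSIX only)
total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3

if total_gb < 16:
    print(f"Only {total_gb:.1f} GB RAM - orthorectification may fail")
else:
    print(f"{total_gb:.1f} GB RAM - should be enough")
```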

ds = emit_tools.emit_xarray(file_handlers[0].f, ortho=True, qmask=mask)
ds
<xarray.Dataset> Size: 5GB
Dimensions:           (latitude: 1901, longitude: 2346, wavelengths: 285)
Coordinates:
  * wavelengths       (wavelengths) float32 1kB 381.0 388.4 ... 2.493e+03
    fwhm              (wavelengths) float32 1kB ...
    good_wavelengths  (wavelengths) float32 1kB ...
  * latitude          (latitude) float64 15kB -33.12 -33.13 ... -34.15 -34.15
  * longitude         (longitude) float64 19kB 18.47 18.47 18.47 ... 19.74 19.74
    elev              (latitude, longitude) float32 18MB -9.999e+03 ... -9.99...
    spatial_ref       int64 8B 0
Data variables:
    reflectance       (latitude, longitude, wavelengths) float32 5GB -9.999e+...
Attributes: (12/40)
    ncei_template_version:             NCEI_NetCDF_Swath_Template_v2.0
    summary:                           The Earth Surface Mineral Dust Source ...
    keywords:                          Imaging Spectroscopy, minerals, EMIT, ...
    Conventions:                       CF-1.63
    sensor:                            EMIT (Earth Surface Mineral Dust Sourc...
    instrument:                        EMIT
    ...                                ...
    spatial_ref:                       GEOGCS["WGS 84",DATUM["WGS_1984",SPHER...
    geotransform:                      [ 1.84708746e+01  5.42232520e-04 -0.00...
    day_night_flag:                    Day
    title:                             EMIT L2A Estimated Surface Reflectance...
    granule_id:                        EMIT_L2A_RFL_001_20230119T114247_23019...
    Orthorectified:                    True

We will chunk the data so that we can apply any analysis to chunks rather than the entire array.

ds = ds.chunk({'latitude':100,'longitude':100,'wavelengths':10})

The first thing we do is drop the bad bands (those in the atmospheric water-absorption zones) and replace -9999, which is used to code no data, with NaN.

ds = ds.where(ds.good_wavelengths.compute()==1,drop=True)
ds = ds.where(ds.reflectance != -9999)

4.2.8. Quick plot#

#select rgb
rgb = ds.sel(wavelengths=[650, 560, 470], method='nearest')
#make colors nicer
rgb = gamma_adjust(rgb)

Plot natural colors:

rgb.hvplot.rgb(x='longitude', y='latitude', bands='wavelengths', frame_width=400,rasterize=True, data_aspect=1)

Plot a single wavelength:

ds.sel(wavelengths=850, method='nearest').hvplot(cmap='Magma', geo=True, 
                                                 tiles='ESRI', rasterize=True, alpha=0.8, 
                                                 frame_height=600,clim =(0,0.7)
                                                ).opts(title=f"Reflectance at 850nm (Orthorectified)")

We can make nice interactive plots in Python. Here we use a slider to select the wavelength we want to view.

def plot_reflectance(wavelength):
    return ds.sel(wavelengths=wavelength, method='nearest').hvplot(
        cmap='Magma', geo=True, tiles='ESRI', rasterize=True, alpha=0.8,
        frame_height=600, clim=(0, 0.7)
    ).opts(title=f"Reflectance at {wavelength}nm (Orthorectified)")

wavelength_slider = pn.widgets.IntSlider(name='Wavelength (nm)', start=int(min(ds.wavelengths.values)), end=int(max(ds.wavelengths.values)), step=5, value=850)

interactive_plot = pn.bind(plot_reflectance, wavelength=wavelength_slider)

pn.Column(wavelength_slider, interactive_plot)

4.2.9. Extract points#

We will load a small dataset of points for Fynbos and Pine covered sites across the EMIT image we are working with.

#load points
gdf = gpd.read_file("/shared/users/gmoncrieff/emit_unmix.gpkg")
#explore the data
gdf[['Category','geometry']].explore('Category')

The xvec package extracts data at points or areas within an xarray dataset and returns the result as an xarray dataset.

extracted = ds.xvec.extract_points(gdf['geometry'], x_coords="longitude", y_coords="latitude",index=True)
extracted
<xarray.Dataset> Size: 113kB
Dimensions:           (geometry: 111, wavelengths: 244)
Coordinates:
  * wavelengths       (wavelengths) float32 976B 381.0 388.4 ... 2.493e+03
    fwhm              (wavelengths) float32 976B 8.415 8.415 ... 8.807 8.809
    good_wavelengths  (wavelengths) float32 976B 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0
    elev              (geometry) float32 444B dask.array<chunksize=(111,), meta=np.ndarray>
    spatial_ref       int64 8B 0
  * geometry          (geometry) object 888B POINT (19.2194 -33.9549) ... POI...
    index             (geometry) int64 888B 0 1 2 3 4 5 ... 106 107 108 109 110
Data variables:
    reflectance       (geometry, wavelengths) float32 108kB dask.array<chunksize=(111, 10), meta=np.ndarray>
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
Attributes: (12/40)
    ncei_template_version:             NCEI_NetCDF_Swath_Template_v2.0
    summary:                           The Earth Surface Mineral Dust Source ...
    keywords:                          Imaging Spectroscopy, minerals, EMIT, ...
    Conventions:                       CF-1.63
    sensor:                            EMIT (Earth Surface Mineral Dust Sourc...
    instrument:                        EMIT
    ...                                ...
    spatial_ref:                       GEOGCS["WGS 84",DATUM["WGS_1984",SPHER...
    geotransform:                      [ 1.84708746e+01  5.42232520e-04 -0.00...
    day_night_flag:                    Day
    title:                             EMIT L2A Estimated Surface Reflectance...
    granule_id:                        EMIT_L2A_RFL_001_20230119T114247_23019...
    Orthorectified:                    True

Let’s turn our xarray into a dataframe and join it with the original dataset.

extracted = (extracted
             .to_dataframe() #turn into a dataframe
             .reset_index() #convert the row names to a column
             .merge( #merge with the original dataset
                 gdf.reset_index(),on='index')
            )
#select the columns we want
extracted = extracted[['wavelengths','reflectance','geometry_x','Category','index']]
extracted
wavelengths reflectance geometry_x Category index
0 381.005585 0.017709 POINT (19.2194 -33.9549) PINE 0
1 388.409210 0.018134 POINT (19.2194 -33.9549) PINE 0
2 395.815826 0.018555 POINT (19.2194 -33.9549) PINE 0
3 403.225403 0.018993 POINT (19.2194 -33.9549) PINE 0
4 410.638000 0.019613 POINT (19.2194 -33.9549) PINE 0
... ... ... ... ... ...
27079 2463.381592 0.044974 POINT (19.38343349154113 -33.445019637266505) FYNBOS 110
27080 2470.767822 0.045954 POINT (19.38343349154113 -33.445019637266505) FYNBOS 110
27081 2478.153076 0.044455 POINT (19.38343349154113 -33.445019637266505) FYNBOS 110
27082 2485.538574 0.042368 POINT (19.38343349154113 -33.445019637266505) FYNBOS 110
27083 2492.923828 0.041585 POINT (19.38343349154113 -33.445019637266505) FYNBOS 110

27084 rows × 5 columns

Finally, let’s select a class and plot the spectra for all points of that class.

extracted_plot = extracted.query('Category == "FYNBOS"')
extracted_plot.hvplot.line(y='reflectance',x='wavelengths',by='index',
                                    color='green',ylim=(0,0.5),alpha=0.5,legend=False)

4.2.10. Credits#

This lesson has borrowed from:

the EMIT-Data-Resources repository by LPDAAC/ the EMIT team