Subsetting GEDI L4A Footprints

Subsetting GEDI L4A Footprints#

This tutorial will demonstrate how to subset Global Ecosystem Dynamics Investigation (GEDI) L4A Footprint Level Aboveground Biomass Density (AGBD)[Dubayah et al., 2022] dataset to a study area of interest. GEDI L4A dataset is available for the period starting 2019-04-17 and covers latitudes of 52 North to 52 South. GEDI L4A data files are natively in HDF5 format, and each file represents one International Space Station (ISS) orbit.

The previous tutorial explains how to download GEDI L4A files for a study area of interest (bounding box or polygon) and a specific period. Once all the GEDI L4A files are downloaded, the global orbits of GEDI L4A can be clipped or subsetted to the study area of interest.

Learning Objectives

Learn how to download GEDI L4A data by space (study area).
Learn how to subset downloaded GEDI L4A dataset for an area of interest.
Create a map of the GEDI shots showing AGBD

Requirements#

Additional prerequisites are provided here.

import h5py
import pandas as pd
import geopandas as gpd
import contextily as ctx
import h5py
import numpy as np
import earthaccess
from glob import glob
from os import path
from shapely.geometry import MultiPolygon, Polygon
from shapely.ops import orient

1. Polygonal Area of Interest#

We will use the boundary of the Great Smokey Mountain National Park (GRSM) to demonstrate the spatial subsetting process. The boundary file is in ESRI Shapefile format in the folder called grsm in the repository. Let’s read the boundary file and print out its coordinate system.

poly = gpd.read_file('polygons/grsm/GRSM_BOUNDARY_POLYGON_fid17.shp').geometry
poly.crs

<Projected CRS: EPSG:26917>
Name: NAD83 / UTM zone 17N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: North America - between 84°W and 78°W - onshore and offshore. Canada - Nunavut; Ontario; Quebec. United States (USA) - Florida; Georgia; Kentucky; Maryland; Michigan; New York; North Carolina; Ohio; Pennsylvania; South Carolina; Tennessee; Virginia; West Virginia.
- bounds: (-84.0, 23.81, -78.0, 84.0)
Coordinate Operation:
- name: UTM zone 17N
- method: Transverse Mercator
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

As we see above, the boundary file is in the UTM 17N projection. Now let’s plot our study area over a base map.

m = poly.explore(color='red',  fill=False)
m

Make this Notebook Trusted to load map: File -> Trust Notebook

2. Searching and Downloading GEDI L4A Files#

We will search for all the GEDI L4A files with the orbits passing over the GRSM boundary using the earthaccess python module. Please refer to the previous tutorial for more details.

# converting to WGS84 coordinate system
poly = poly.to_crs(epsg=4326)

# orienting coordinates clockwise
poly = poly.apply(orient, args=(1,))

# reducing number of vertices in the polygon
# CMR has 1000000 bytes limit
xy = poly.simplify(0.005).get_coordinates()

granules = earthaccess.search_data(
    count=-1, # needed to retrieve all granules
    doi="10.3334/ORNLDAAC/2056", # GEDI L4A DOI 
    polygon=list(zip(xy.x, xy.y))
)
print(f"Total granules found: {len(granules)}")

Total granules found: 187

The granules object contains metadata about the granules overlapping the GRSM area, including the bounding geometry, publication dates, data providers, etc. Now, let’s convert the above granule metadata from json-formatted to geopandas dataframe. Converting to geopandas dataframe will let us generate plots of the granule geometry. Let’s plot the bounding geometries of the first few files.

def convert_umm_geometry(gpoly):
    """converts UMM geometry to multipolygons"""
    multipolygons = []
    for gl in gpoly:
        ltln = gl["Boundary"]["Points"]
        points = [(p["Longitude"], p["Latitude"]) for p in ltln]
        multipolygons.append(Polygon(points))
    return MultiPolygon(multipolygons)

def convert_list_gdf(datag):
    """converts List[] to geopandas dataframe"""
    # create pandas dataframe from json
    df = pd.json_normalize([vars(granule)['render_dict'] for granule in datag])
    # keep only last string of the column names
    df.columns=df.columns.str.split('.').str[-1]
    # convert polygons to multipolygonal geometry
    df["geometry"] = df["GPolygons"].apply(convert_umm_geometry)
    # return geopandas dataframe
    return gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")

# only keep three columns
gdf = convert_list_gdf(granules)[['GranuleUR', 'size', 'geometry']]
# plotting the first three granule geometries
gdf[:10].explore(m=m, color='green',  fillcolor='green')

Make this Notebook Trusted to load map: File -> Trust Notebook

As we see in the map above, spatial bounds of GEDI L4A files extend beyond the area of interest (GRSM).

3. Downloading the GEDI L4A files#

We will use earthaccess python module to download all files. Please refer to the previous tutorial for more details on downloading programmatically.

# authenticate using the EDL login
auth = earthaccess.login(strategy="netrc") 
if not auth.authenticated:
    # ask for EDL credentials and persist them in a .netrc file
    auth = earthaccess.login(strategy="interactive", persist=True)
# download files to full_orbits directory
downloaded_files = earthaccess.download(granules, local_path="full_orbits")

4. Subsetting the GEDI L4A files#

Once all the GEDI L4A files are downloaded, we can clip the full orbit files to retrieve the footprints that fall within the area of interest, i.e. GRSM boundary. We have downloaded all the files to a folder called full_orbits for this tutorial.

4a. Exploring the data structure#

Let’s first open one of the L4A file GEDI04_A_2019133103100_O02354_03_T00724_02_002_02_V002.h5 we just downloaded and print the root-level variable group.

hf = h5py.File('full_orbits/GEDI04_A_2019133103100_O02354_03_T00724_02_002_02_V002.h5', 'r')
hf.keys()

<KeysViewHDF5 ['ANCILLARY', 'BEAM0000', 'BEAM0001', 'BEAM0010', 'BEAM0011', 'BEAM0101', 'BEAM0110', 'BEAM1000', 'BEAM1011', 'METADATA']>

All science variables are organized by eight beams of the GEDI. Please refer to the GEDI L4A user guide and GEDI L4A data dictionary for details on file organization. Let’s look into one of the beam group BEAM0110 and print all the science dataset (SDS) variables within the group.

beam0110 = hf.get('BEAM0110')
beam0110.keys()

<KeysViewHDF5 ['agbd', 'agbd_pi_lower', 'agbd_pi_upper', 'agbd_prediction', 'agbd_se', 'agbd_t', 'agbd_t_se', 'algorithm_run_flag', 'beam', 'channel', 'degrade_flag', 'delta_time', 'elev_lowestmode', 'geolocation', 'l2_quality_flag', 'l4_quality_flag', 'land_cover_data', 'lat_lowestmode', 'lon_lowestmode', 'master_frac', 'master_int', 'predict_stratum', 'predictor_limit_flag', 'response_limit_flag', 'selected_algorithm', 'selected_mode', 'selected_mode_flag', 'sensitivity', 'shot_number', 'solar_elevation', 'surface_flag', 'xvar']>

In the above list of variables, 2 science dataset (SDS) are particulary useful for spatial subsetting: lat_lowestmode and lon_lowestmode, which represent ground location of each GEDI shot.

Let’s plot all the beams in the map.

lat_l = []
lon_l = []
beam_n = []
for var in list(hf.keys()):
    if var.startswith('BEAM'):
        beam = hf.get(var)
        lat = beam.get('lat_lowestmode')[:]
        lon = beam.get('lon_lowestmode')[:]
        lat_l.extend(lat.tolist()) # latitude
        lon_l.extend(lon.tolist()) # longitude
        n = lat.shape[0] # number of shots in the beam group
        beam_n.extend(np.repeat(str(var), n).tolist())
geo_arr = list(zip(beam_n,lat_l,lon_l))
l4adf = pd.DataFrame(geo_arr, columns=["beam", "lat_lowestmode", "lon_lowestmode"])
l4adf

	beam	lat_lowestmode	lon_lowestmode
0	BEAM0000	51.820311	-136.039649
1	BEAM0000	51.820308	-136.038820
2	BEAM0000	51.820305	-136.037992
3	BEAM0000	51.820302	-136.037163
4	BEAM0000	51.820299	-136.036335
...	...	...	...
1338244	BEAM1011	-0.243276	-52.168463
1338245	BEAM1011	-0.243705	-52.168161
1338246	BEAM1011	-0.244112	-52.167877
1338247	BEAM1011	-0.244559	-52.167559
1338248	BEAM1011	-0.244976	-52.167267

1338249 rows × 3 columns

The pandas dataframe l4adf contains beam names, latitude and longitude columns. This particular GEDI orbit recorded a total of 1,338,249 shots. Now we can convert l4adf to a geopandas dataframe l4agdf and clip the file by the boundary of the GRSM.

l4agdf = gpd.GeoDataFrame(l4adf, crs="EPSG:4326", 
                          geometry=gpd.points_from_xy(l4adf.lon_lowestmode, l4adf.lat_lowestmode))
l4agdf_gsrm = l4agdf[l4agdf['geometry'].within(poly.geometry[0])]    
l4agdf_gsrm

	beam	lat_lowestmode	lon_lowestmode	geometry
77915	BEAM0000	35.695545	-83.583207	POINT (-83.58321 35.69555)
77916	BEAM0000	35.695196	-83.582730	POINT (-83.58273 35.69520)
77917	BEAM0000	35.694867	-83.582310	POINT (-83.58231 35.69487)
77918	BEAM0000	35.694515	-83.581828	POINT (-83.58183 35.69452)
77919	BEAM0000	35.694179	-83.581389	POINT (-83.58139 35.69418)
...	...	...	...	...
1249364	BEAM1011	35.490195	-83.381600	POINT (-83.38160 35.49020)
1249365	BEAM1011	35.489855	-83.381146	POINT (-83.38115 35.48986)
1249366	BEAM1011	35.489514	-83.380690	POINT (-83.38069 35.48951)
1249367	BEAM1011	35.489177	-83.380240	POINT (-83.38024 35.48918)
1249368	BEAM1011	35.488839	-83.379788	POINT (-83.37979 35.48884)

4503 rows × 4 columns

The geopandas dataframe l4agdf_gsrm contains the shots (4,505 shots in total) that fall within the GSRM boundary. Now, let’s plot these into a map.

l4agdf_gsrm.explore(column='beam', legend=True)

Make this Notebook Trusted to load map: File -> Trust Notebook

The above map shows the swath coverage of GEDI - a typical swath is ~4200m wide. GEDI instrument produces eight ground tracks plotted in the map with different colors. The GEDI shots (or footprints) are separated by ~60 m along the track and ~600 m across the track. For more detailed look on GEDI L4A data, refer to this tutorial on exploring GEDI L4A.

4b. Subsetting all downloaded files#

In the Steps 2 and 3 above, we downloaded L4A files into the directory full_orbits. We will now loop over each of these files and create a clipped version of the files into a new directory subsets.

indir = 'full_orbits'
outdir = 'subsets'

for infile in glob(path.join(indir, 'GEDI04_A*.h5')):
    name, ext = path.splitext(path.basename(infile))
    subfilename = "{name}_sub{ext}".format(name=name, ext=ext)
    outfile = path.join(outdir, path.basename(subfilename))
    
    # if the subset file doesn't already exists
    if not path.isfile(outfile):
        hf_in = h5py.File(infile, 'r')
        hf_out = h5py.File(outfile, 'w')
        
        # copy ANCILLARY and METADATA groups
        var1 = ["/ANCILLARY", "/METADATA"]
        for v in var1:
            hf_in.copy(hf_in[v],hf_out)
        
        # loop through BEAMXXXX groups
        for v in list(hf_in.keys()):
            if v.startswith('BEAM'):
                beam = hf_in[v]
                # find the shots that overlays the area of interest (GRSM)
                lat = beam['lat_lowestmode'][:]
                lon = beam['lon_lowestmode'][:]
                i = np.arange(0, len(lat), 1) # index
                geo_arr = list(zip(lat,lon, i))
                l4adf = pd.DataFrame(geo_arr, columns=["lat_lowestmode", "lon_lowestmode", "i"])
                l4agdf = gpd.GeoDataFrame(l4adf, crs="EPSG:4326", 
                                          geometry=gpd.points_from_xy(l4adf.lon_lowestmode, l4adf.lat_lowestmode))
                l4agdf_gsrm = l4agdf[l4agdf['geometry'].within(poly.geometry[0])]  
                indices = l4agdf_gsrm.i
    
                # copy BEAMS to the output file
                for key, value in beam.items():
                    if isinstance(value, h5py.Group):
                        for key2, value2 in value.items():
                            group_path = value2.parent.name
                            group_id = hf_out.require_group(group_path)
                            dataset_path = group_path + '/' + key2
                            hf_out.create_dataset(dataset_path, data=value2[:][indices])
                            for attr in value2.attrs.keys():
                                hf_out[dataset_path].attrs[attr] = value2.attrs[attr]
                    else:
                        group_path = value.parent.name
                        group_id = hf_out.require_group(group_path)
                        dataset_path = group_path + '/' + key
                        hf_out.create_dataset(dataset_path, data=value[:][indices])
                        for attr in value.attrs.keys():
                            hf_out[dataset_path].attrs[attr] = value.attrs[attr]
    
        hf_in.close()
        hf_out.close()

Now, new the subset files are created in the subsets directory. We will use the subset files to create a map of above ground biomass density (the variable agbd inside BEAMXXXX groups) of the GRSM.

lat_l = []
lon_l = []
agbd = []
outdir = 'subsets'
for subfile in glob(path.join(outdir, 'GEDI04_A*.h5')):
    hf_in = h5py.File(subfile, 'r')
    for v in list(hf_in.keys()):
        if v.startswith('BEAM'):
            beam = hf_in[v]
            lat_l.extend(beam['lat_lowestmode'][:].tolist()) 
            lon_l.extend(beam['lon_lowestmode'][:].tolist()) 
            agbd.extend(beam['agbd'][:].tolist())
    hf_in.close()
geo_arr = list(zip(agbd,lat_l,lon_l))
df = pd.DataFrame(geo_arr, columns=["agbd", "lat_lowestmode", "lon_lowestmode"])
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon_lowestmode, df.lat_lowestmode), crs="EPSG:4326")
gdf

	agbd	lat_lowestmode	lon_lowestmode	geometry
0	-9999.0	35.717420	-83.486708	POINT (-83.48671 35.71742)
1	-9999.0	35.717073	-83.486231	POINT (-83.48623 35.71707)
2	-9999.0	35.716726	-83.485766	POINT (-83.48577 35.71673)
3	-9999.0	35.716377	-83.485258	POINT (-83.48526 35.71638)
4	-9999.0	35.716030	-83.484793	POINT (-83.48479 35.71603)
...	...	...	...	...
727205	-9999.0	35.447732	-83.705914	POINT (-83.70591 35.44773)
727206	-9999.0	35.447385	-83.705452	POINT (-83.70545 35.44739)
727207	-9999.0	35.447038	-83.704989	POINT (-83.70499 35.44704)
727208	-9999.0	35.446691	-83.704527	POINT (-83.70453 35.44669)
727209	-9999.0	35.446345	-83.704073	POINT (-83.70407 35.44634)

727210 rows × 4 columns

In the above table, we see there are some shots with fill value (-9999). We will exclude these shots in the map below. The following is a map of aboveground biomass density (Mg/ha) or agbd.

ax =gdf[gdf['agbd'] != -9999].to_crs(epsg=3857).plot(column='agbd', alpha=0.1, 
                                                     linewidth=0, legend=True, figsize=(22, 7))
ctx.add_basemap(ax)

../_images/be654eb04810fad5f29fa5278275ef1fe3df63c9da18068a9d2f442dbeb35442.png

4c. Saving the subsets to different formats#

In the step above, we created a HDF5 formatted output, which is the native format of GEDI L4A datasets. The HDF5 files can be output to geojson or csv or ESRI Shapefile using the geopandas.

outdir = 'subsets'
subset_df = pd.DataFrame()
for subfile in glob(path.join(outdir, 'GEDI04_A*_sub.h5')):
    hf_in = h5py.File(subfile, 'r')
    for v in list(hf_in.keys()):
        if v.startswith('BEAM'):
            col_names = []
            col_val = []
            beam = hf_in[v]
            # copy BEAMS 
            for key, value in beam.items():
                # looping through subgroups
                if isinstance(value, h5py.Group):
                    for key2, value2 in value.items():
                        if (key2 != "shot_number"):
                             # xvar variables have 2D
                            if (key2.startswith('xvar')):
                                for r in range(4):
                                    col_names.append(key2 + '_' + str(r+1))
                                    col_val.append(value2[:, r].tolist())
                            else:
                                col_names.append(key2)
                                col_val.append(value2[:].tolist())
                
                #looping through base group
                else:
                    # xvar variables have 2D
                    if (key.startswith('xvar')):
                        for r in range(4):
                            col_names.append(key + '_' + str(r+1))
                            col_val.append(value[:, r].tolist())
                    else:
                        col_names.append(key)
                        col_val.append(value[:].tolist())
            
            # create a pandas dataframe        
            beam_df = pd.DataFrame(map(list, zip(*col_val)), columns=col_names) 
            # Inserting BEAM names
            beam_df.insert(0, 'BEAM', np.repeat(str(v), len(beam_df.index)).tolist())
            
            # Appending to the subset_df dataframe
            if len(beam_df.index) > 0:
                subset_df = pd.concat([subset_df, beam_df])

    hf_in.close()

Now, all the variables are stored in a pandas dataframe subset_df. We can print the dataframe.

# Setting 'shot_number' as dataframe index. shot_number column is unique
subset_df = subset_df.set_index('shot_number') 
subset_df.head()

	BEAM	agbd	agbd_pi_lower	agbd_pi_upper	agbd_a1	agbd_a10	agbd_a2	agbd_a3	agbd_a4	agbd_a5	...	selected_algorithm	selected_mode	selected_mode_flag	sensitivity	solar_elevation	surface_flag	xvar_1	xvar_2	xvar_3	xvar_4
shot_number
41710000300413617	BEAM0000	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	...	1	0	0	0.315933	21.527996	0	-9999.0	-9999.0	-9999.0	-9999.0
41710000300413618	BEAM0000	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	...	1	0	0	78.354553	21.528458	0	-9999.0	-9999.0	-9999.0	-9999.0
41710000300413619	BEAM0000	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	...	1	0	0	2.461639	21.528910	0	-9999.0	-9999.0	-9999.0	-9999.0
41710000300413620	BEAM0000	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	...	1	0	0	-1.414248	21.529398	0	-9999.0	-9999.0	-9999.0	-9999.0
41710000300413621	BEAM0000	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	-9999.0	...	1	0	0	8.056547	21.529850	0	-9999.0	-9999.0	-9999.0	-9999.0

5 rows × 204 columns

The dataframe has 204 columns representing GEDI L4A variables. We can export it to a CSV file directly as:

subset_df.to_csv('subsets/grsm_subset.csv') # Export to CSV

If we want to save the file as one of the geospatial formats (such as GEOJSON, KML, ESRI Shapefile), we need to first convert the dataframe into a geopandas dataframe, and export to various formats.

subset_gdf = gpd.GeoDataFrame(subset_df, crs="EPSG:4326", 
                              geometry=gpd.points_from_xy(subset_df.lon_lowestmode,
                                                          subset_df.lat_lowestmode))
# convert object types columns to strings. object types are not supported
for c in subset_gdf.columns:
    if subset_gdf[c].dtype == 'object':
        subset_gdf[c] = subset_gdf[c].astype(str)

# Export to GeoJSON
subset_gdf.to_file('subsets/grsm_subset.geojson', driver='GeoJSON')
# Export to ESRI Shapefile
subset_gdf.to_file('subsets/grsm_subset.shp')

References#

[1]

R.O. Dubayah, J. Armston, J.R. Kellner, L. Duncanson, S.P. Healey, P.L. Patterson, S. Hancock, H. Tang, J. Bruening, M.A. Hofton, J.B. Blair, and S.B. Luthcke. GEDI L4A Footprint Level Aboveground Biomass Density, Version 2.1. 2022. doi:10.3334/ORNLDAAC/2056.

Subsetting GEDI L4A Footprints

Contents

Subsetting GEDI L4A Footprints#

Requirements#

1. Polygonal Area of Interest#

2. Searching and Downloading GEDI L4A Files#

3. Downloading the GEDI L4A files#

4. Subsetting the GEDI L4A files#

4a. Exploring the data structure#

4b. Subsetting all downloaded files#

4c. Saving the subsets to different formats#

References#