Discover CMS Datasets#

Learning Objectives

  • Learn how to search NASA Earth science data collections programmatically in Python

  • Learn how to restrict search results by space (study area) and time (project period).

  • Plot data extents/boundaries on a map

  • Download data files of interest.

In this tutorial, we will use the earthaccess Python module to search for CMS datasets.

import earthaccess
import pandas as pd
import geopandas as gpd
import re
import matplotlib.pyplot as plt
from shapely.geometry import box
from geodatasets import get_path
from collections import Counter
from wordcloud import WordCloud

Search datasets by project#

Let’s use the earthaccess module’s search_datasets API to search for datasets published under the CMS project. search_datasets returns a list of DataCollection objects from NASA’s Common Metadata Repository (CMR) API.

datasets = earthaccess.search_datasets(project="CMS")
print(f"Total Datasets found: {len(datasets)}")
Total Datasets found: 168

First, convert the above search results into a pandas dataframe, which will streamline querying the dataset search results.

def convert_list_gdf(datag):
    """converts a list of DataCollection objects to a pandas dataframe"""
    # create pandas dataframe from the collections' JSON metadata
    df = pd.json_normalize([vars(granule)['render_dict'] for granule in datag])
    # keep only the last part of each dotted column name
    df.columns = df.columns.str.split('.').str[-1]
    return df
    
df = convert_list_gdf(datasets)
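To see what pd.json_normalize does with nested metadata, here is a minimal sketch with a made-up record (the record shape is illustrative, not actual CMR output):

```python
import pandas as pd

# A toy record shaped like nested JSON metadata
record = {"umm": {"EntryTitle": "Example Dataset", "DOI": {"DOI": "10.1234/XYZ"}}}
demo = pd.json_normalize([record])
print(demo.columns.tolist())   # ['umm.EntryTitle', 'umm.DOI.DOI']

# keep only the last part of each dotted column name, as in convert_list_gdf
demo.columns = demo.columns.str.split('.').str[-1]
print(demo.columns.tolist())   # ['EntryTitle', 'DOI']
```

Nested dictionaries are flattened into dot-separated column names, which is why convert_list_gdf trims each name back to its last segment.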

Spatial extent of datasets#

NASA’s CMR also stores the bounding coordinates of the data collections. Now, let’s convert the pandas dataframe df we created above into a geopandas dataframe and plot the bounding box of each data collection.

def convert_umm_bbox(b):
    """converts a UMM bounding rectangle to a shapely box"""
    try:
        return box(b[0]['WestBoundingCoordinate'],
                   b[0]['SouthBoundingCoordinate'],
                   b[0]['EastBoundingCoordinate'],
                   b[0]['NorthBoundingCoordinate'])
    except TypeError:
        # collections without a bounding rectangle
        return None
        
df['geometry'] = df['BoundingRectangles'].apply(convert_umm_bbox)
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")
world = gpd.read_file(get_path('naturalearth.land'))
ax = world.boundary.plot(alpha=0.3, facecolor='gray', edgecolor='gray')
gdf.plot(ax=ax, alpha=0.02, facecolor='g', edgecolor="r")
plt.title('Spatial extent of CMS datasets')
plt.show()

As we see above, more CMS datasets are available over North America than over other parts of the world. Some datasets have global coverage as well.
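To see what convert_umm_bbox produces, we can feed it a hand-built bounding rectangle (the record below is illustrative, not actual CMR output):

```python
from shapely.geometry import box

def convert_umm_bbox(b):
    """converts a UMM bounding rectangle to a shapely box (as defined above)"""
    try:
        return box(b[0]['WestBoundingCoordinate'],
                   b[0]['SouthBoundingCoordinate'],
                   b[0]['EastBoundingCoordinate'],
                   b[0]['NorthBoundingCoordinate'])
    except TypeError:
        return None

# a global-coverage rectangle in UMM form
rect = [{'WestBoundingCoordinate': -180, 'SouthBoundingCoordinate': -90,
         'EastBoundingCoordinate': 180, 'NorthBoundingCoordinate': 90}]
print(convert_umm_bbox(rect).bounds)   # (-180.0, -90.0, 180.0, 90.0)
print(convert_umm_bbox(None))          # None, for collections without a rectangle
```

The try/except lets the .apply() above return None geometries instead of raising when a collection has no BoundingRectangles entry.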

Datasets by DAACs#

There is a provider-id tag in the CMR metadata, which identifies the DAAC responsible for archiving each dataset. CMS datasets are archived by the ORNL DAAC (ORNL_CLOUD) as well as by the GES DISC (GES_DISC). Let’s compute how many of the total datasets are archived by each DAAC.

gdf.groupby('provider-id').size().plot.pie(autopct='%1.1f%%')
plt.title('Distribution of CMS datasets across DAACs')
plt.ylabel('')
plt.show()

The ORNL DAAC accounts for ~73% of all CMS dataset publications.
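The counts behind the pie chart are a plain groupby-size; a small sketch with made-up provider IDs:

```python
import pandas as pd

# toy provider IDs standing in for the CMR provider-id column
providers = pd.DataFrame({'provider-id': ['ORNL_CLOUD'] * 3 + ['GES_DISC']})
counts = providers.groupby('provider-id').size()
print(counts)   # GES_DISC 1, ORNL_CLOUD 3
# the percentages that autopct renders on the pie slices
print((counts / counts.sum() * 100).round(1))   # 25.0 and 75.0
```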

Datasets publication dates#

NASA has published CMS datasets starting in 2014. The following chart shows the number of CMS datasets published each year.

def get_releasedates(ddates):
    """get data release dates"""
    for d in ddates:
        try:
            reldate = pd.to_datetime(d['ReleaseDate']).year
        except KeyError:
            # fall back to the first 4-digit year in the citation text
            reldate = int(re.findall(r'(\d{4})', d['OtherCitationDetails'])[0])
    return reldate
    
gdf['pub_date'] = gdf['CollectionCitations'].apply(get_releasedates)
gdf.groupby(['pub_date', 'provider-id']).size().unstack().plot(kind='bar', stacked=True)
plt.xlabel('Year')
plt.ylabel('Number of Datasets published')
plt.title('CMS Dataset Publication by Year')
plt.show()

As we see in the above chart, CMS datasets are still being actively produced and published.
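The fallback logic in get_releasedates — use ReleaseDate when present, otherwise pull the first four-digit year out of OtherCitationDetails — can be exercised in isolation (the citation entries below are made up):

```python
import re
import pandas as pd

def release_year(citation):
    """Year from ReleaseDate, else first 4-digit run in OtherCitationDetails."""
    try:
        return pd.to_datetime(citation['ReleaseDate']).year
    except KeyError:
        return int(re.findall(r'(\d{4})', citation['OtherCitationDetails'])[0])

print(release_year({'ReleaseDate': '2016-05-01T00:00:00Z'}))          # 2016
print(release_year({'OtherCitationDetails': 'Published in 2014.'}))   # 2014
```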

Datasets granule count#

Let’s plot some statistics of how many granules (or data files) are in each data collection.

gdf.boxplot(column=['granule-count'], by='provider-id', meanline=True, showmeans=True)
plt.ylabel('Number of Granules')
plt.title('Number of Granules by Datasets')
plt.show()

As we see above, a few data collections have many more granules than the rest. However, most collections have much fewer granules.

Let’s print the dataset with the largest number of granules.

# increase the max column width of pandas dataframe
pd.options.display.max_colwidth = 100
gdf.iloc[gdf['granule-count'].idxmax()][['EntryTitle', 'DOI', 'granule-count']]
EntryTitle       LiDAR Surveys over Selected Forest Research Sites, Brazilian Amazon, 2008-2018
DOI                                                                       10.3334/ORNLDAAC/1644
granule-count                                                                              3154
Name: 2, dtype: object

Instrument keywords of the dataset#

The CMR metadata of a data collection includes GCMD keywords, which provide information such as the science variables, instruments, and platforms of the data collection. Let’s plot a word cloud of the GCMD instrument keywords. This will give an indication of which instruments were used to collect CMS datasets.

def get_keywords(keywords):
    """get instrument long names from the platform keywords"""
    keys = []
    if 'Instruments' in keywords[0].keys():
        for instr in keywords[0]['Instruments']:
            if 'LongName' in instr.keys():
                keys.append(instr['LongName'])
    return keys

gdf['gcmd_keys'] = gdf['Platforms'].apply(get_keywords)
gcmd_var = gdf['gcmd_keys'].explode().to_list()
gcmd_var = [x.lower() for x in gcmd_var if x == x]  # drop NaNs (NaN != NaN)
gcmd_var_dict = Counter(gcmd_var)
wordcloud = WordCloud(background_color='white',
                      width=1000,
                      height=500).generate_from_frequencies(gcmd_var_dict)
plt.figure(figsize=(10,6))
plt.imshow(wordcloud)
plt.title('A Word Cloud of Instruments Keywords for CMS Datasets')
plt.axis("off")
plt.show()

As we see in the above word cloud, lidar (light detection and ranging) is one of the most common instrument keywords for the CMS data collections.
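The frequency counting that feeds the word cloud is plain collections.Counter; a sketch with hypothetical keyword values:

```python
from collections import Counter

# exploded, lower-cased instrument keywords (illustrative values)
keywords = ['lidar', 'lidar', 'gedi lidar', 'aircraft camera']
freqs = Counter(keywords)
print(freqs.most_common(1))   # [('lidar', 2)]
```

WordCloud.generate_from_frequencies accepts exactly this kind of word-to-count mapping, sizing each word by its frequency.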

Search by keyword#

Let’s search for all CMS datasets that have the keyword “lidar” in their GCMD keyword metadata.

datasets = earthaccess.search_datasets(project="CMS", keyword="lidar")
print(f"Total CMS Datasets with lidar keyword: {len(datasets)}")
Total CMS Datasets with lidar keyword: 32

Search by bounding box#

We can also limit the search results to a single area. Let’s use the bounds of Brazil to search only for CMS datasets with the “lidar” keyword.

# Brazil bounding box (west, south, east, north)
bound = (-73.99, -33.77, -34.73, 5.25)
datasets = earthaccess.search_datasets(project="CMS", keyword="lidar",
                                       bounding_box = bound)
print(f"Total CMS Datasets within Brazil: {len(datasets)}")
Total CMS Datasets within Brazil: 8
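As a quick sanity check of the (west, south, east, north) bounding box, a point inside Brazil should fall within it while a point outside should not (the test points are illustrative):

```python
from shapely.geometry import box, Point

# Brazil bounding box: box(minx, miny, maxx, maxy)
brazil = box(-73.99, -33.77, -34.73, 5.25)
print(brazil.contains(Point(-60.0, -3.1)))   # True: near Manaus, Brazil
print(brazil.contains(Point(2.35, 48.85)))   # False: Paris, France
```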

We can plot the bounding boxes of the CMS data collections over Brazil.

df = convert_list_gdf(datasets)
df['geometry'] = df['BoundingRectangles'].apply(convert_umm_bbox)
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")
world = gpd.read_file(get_path('naturalearth.land'))
ax = world.boundary.plot(alpha=0.3, facecolor='gray', edgecolor='gray')
gdf.plot(ax=ax, alpha=0.1, facecolor='g', edgecolor="r")
plt.title('CMS datasets in Brazil with Lidar Keyword')
plt.show()

We now print the title and DOI of the CMS datasets in Brazil with the “lidar” keyword.

def hyperlink_doi(doi):
    """make doi link"""
    return f'<a target="_blank" href="https://doi.org/{doi}">{doi}</a>'

# print dataset title and doi    
df[['EntryTitle', 'DOI']].style.format({'DOI': hyperlink_doi})
  EntryTitle DOI
0 CMS: LiDAR Data for Forested Areas in Paragominas, Para, Brazil, 2012-2014 10.3334/ORNLDAAC/1302
1 CMS: Global Mangrove Canopy Height Maps Derived from TanDEM-X, 2015 10.3334/ORNLDAAC/2251
2 LiDAR Surveys over Selected Forest Research Sites, Brazilian Amazon, 2008-2018 10.3334/ORNLDAAC/1644
3 LiDAR and PALSAR-Derived Forest Aboveground Biomass, Paragominas, Para, Brazil, 2012 10.3334/ORNLDAAC/1648
4 CMS: GLAS LiDAR-derived Global Estimates of Forest Canopy Height, 2004-2008 10.3334/ORNLDAAC/1271
5 Global Forest Aboveground Carbon Stocks and Fluxes from GEDI and ICESat-2, 2018-2021 10.3334/ORNLDAAC/2180
6 Pantropical Forest Height and Biomass from GEDI and TanDEM-X Data Fusion 10.3334/ORNLDAAC/2298
7 Aboveground Biomass Change for Amazon Basin, Mexico, and Pantropical Belt, 2003-2016 10.3334/ORNLDAAC/1824

In the next tutorial, we will download the lidar dataset “LiDAR Surveys over Selected Forest Research Sites, Brazilian Amazon, 2008-2018” and explore it in detail.