Prepare access to datacube#

This notebook handles the download of a pre-made datacube. It must be run first for the other notebooks to work. Eventually this step should be replaced by directly referencing a datacube in object storage.

If you run the code below without modification, it will generate two copies of the datacube, at:

geomagnetic_datacubes_dev/data/interim/SwA_20140501-20190501_proc1.nc
geomagnetic_datacubes_dev/notebooks/datacube_test.zarr

(about 5GB x2)

Note that this datacube has been prepared at only 10-second sampling (compared to the 1 Hz / 1-second original data, or the 50 Hz high-resolution data).

Environment setup#

import os
import pooch
import pandas as pd
import numpy as np
import xarray as xr
import dask as da
from dask.diagnostics import ProgressBar
import zarr
import holoviews as hv
import hvplot.xarray
import matplotlib.pyplot as plt

from chaosmagpy.plot_utils import nio_colormap

try:
    from src.data.proc_env import INT_FILE_PATHS
    from src.env import REFRAD, TMPDIR
except ImportError:
    print("Failed to import src...")
    # Fall back to the current directory and hard-coded defaults
    TMPDIR = os.getcwd()
    INT_FILE_PATHS = {"A": os.path.join(TMPDIR, "SwA_20140501-20190501_proc1.nc")}
    REFRAD = 6371200  # geomagnetic reference radius, in metres
    print("Using instead for cube download and scratch space:", TMPDIR)
if not os.path.exists(TMPDIR):
    print("Can't find scratch space:", TMPDIR)
    TMPDIR = os.getcwd()
    print("Using instead:", TMPDIR)

xr.set_options(
    display_expand_attrs=False,
    display_expand_data_vars=True
);
Can't find scratch space: /scratch/local/asmith44/geomagcubes
Using instead: /home/ash/code/geomagnetic_datacubes_dev/notebooks
print(
    "Using temporary working directory:",
    TMPDIR,
    "Is this a good location for data I/O? Configure this path in the file: geomagnetic_datacubes_dev/config.ini",
    "(or manually enter new paths above if not using the geomagcubes environment)",
    sep="\n"
)
Using temporary working directory:
/home/ash/code/geomagnetic_datacubes_dev/notebooks
Is this a good location for data I/O? Configure this path in the file: geomagnetic_datacubes_dev/config.ini
(or manually enter new paths above if not using the geomagcubes environment)

Load/prepare datacube#

Download pre-made datacube#

This part is to be refactored into the datacube generation pipeline once a permanent link is made available.

Download the file if we don’t already have it available here.

# The location where the downloaded data file will be stored
fpath = INT_FILE_PATHS["A"]
path, fname = os.path.split(fpath)
path, fname
('/home/ash/code/geomagnetic_datacubes_dev/data/interim',
 'SwA_20140501-20190501_proc1.nc')
# # Delete it if you want to redownload it
# os.remove(fpath)
if os.path.exists(fpath):
    # Skip the download if we already have the file
    print("Already found file:", fpath, sep="\n")
else:
    pooch.retrieve(
        url="https://nc.smithara.net/index.php/s/H5R923DsbtirfJy/download",
        known_hash="1b7a8cbc0cb1657f8d4444ae7f6bbab91841318e1a172fa1f8a487b9d9492912",
        path=path, fname=fname,
        progressbar=True,
    )
Downloading data from 'https://nc.smithara.net/index.php/s/H5R923DsbtirfJy/download' to file '/home/ash/code/geomagnetic_datacubes_dev/data/interim/SwA_20140501-20190501_proc1.nc'.
100%|█████████████████████████████████████| 5.14G/5.14G [00:00<00:00, 1.43TB/s]
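
If the download was skipped because the file was already present, its integrity can still be checked against the expected SHA256. This is an optional sketch (not part of the pipeline) using pooch’s file-hashing helper:

# Optional: verify the local copy against the expected SHA256 hash
expected_sha256 = "1b7a8cbc0cb1657f8d4444ae7f6bbab91841318e1a172fa1f8a487b9d9492912"
assert pooch.file_hash(fpath, alg="sha256") == expected_sha256, "Hash mismatch: consider re-downloading"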

Make a copy of the input datacube as a Zarr store#

Includes temporary fixes to the datacube

We want to make sure we don’t accidentally modify the input dataset, so let’s work on a copy. There are also opportunities with xarray, dask, and the Zarr format to increase performance by dividing the data into chunks and rearranging it in different ways - the input data format is not necessarily what we want to use for computation. So here we convert the data to the Zarr format.

(We could work with the .nc file just the same; it is not yet clear what the advantages of Zarr are.)

file_in = INT_FILE_PATHS["A"]
zarr_store = os.path.join(TMPDIR, "datacube_test.zarr")
print("Input file:", file_in, "Copying to:", zarr_store, sep="\n")
Input file:
/home/ash/code/geomagnetic_datacubes_dev/data/interim/SwA_20140501-20190501_proc1.nc
Copying to:
/home/ash/code/geomagnetic_datacubes_dev/notebooks/datacube_test.zarr
def clean_datacube(ds):
    # Drop the bulky "Sources" attribute (if present)
    ds.attrs.pop("Sources", None)
    # Generate residuals to use: measurements minus the full CHAOS model
    # (core + magnetospheric + high-degree static parts)
    ds["B_NEC_res_CHAOS-full"] = (
        ds["B_NEC"]
        - ds["B_NEC_CHAOS-MCO"]
        - ds["B_NEC_CHAOS-MMA"]
        - ds["B_NEC_CHAOS-Static_n16plus"]
    )
    # Identify unphysical outliers: residual vector magnitude above 2000 nT
    # (TODO: remove them from the datacube itself!)
    outliers = (ds["B_NEC_res_CHAOS-full"] ** 2).sum(axis=1) > 2000 ** 2
    idx_to_scrub = np.argwhere(outliers.data)[:, 0]
    # Replace the bad data with NaN
    ds["B_NEC"].data[idx_to_scrub, :] = np.nan
    ds["B_NEC_res_CHAOS-full"].data[idx_to_scrub, :] = np.nan
    return ds
if os.path.exists(zarr_store):
    print("Already found zarr:", zarr_store)
else:
    pbar = ProgressBar()
    with pbar:
        with xr.open_dataset(file_in, chunks=100000) as ds:
            print("Cleaning and storing as zarr:", zarr_store)
            ds = clean_datacube(ds)
            ds.to_zarr(zarr_store)
Cleaning and storing as zarr: /home/ash/code/geomagnetic_datacubes_dev/notebooks/datacube_test.zarr
[########################################] | 100% Completed |  1min 10.4s
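
As an optional sanity check (a sketch, not part of the pipeline), the new Zarr store can be reopened lazily to inspect its chunk layout and to count how many samples were scrubbed to NaN by clean_datacube:

# Optional: reopen the Zarr store lazily and inspect it
ds_check = xr.open_zarr(zarr_store)
print(ds_check.chunks)  # chunk sizes per dimension, as stored
# Count samples that clean_datacube replaced with NaN (computed lazily via dask)
n_scrubbed = int(ds_check["B_NEC"].isnull().data.any(axis=1).sum().compute())
print("Samples replaced with NaN:", n_scrubbed)
ds_check.close()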

NB: The dataset has been created in chunks of size 100,000 (each file within the Zarr store contains this number of measurements). This is unlikely to be optimal; finding better chunk sizes will require some experimentation.
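
For example, one way to experiment would be to regenerate the store from the netCDF file with a different chunk size. The sketch below is illustrative only: the chunk size and the target path datacube_rechunked.zarr are arbitrary choices, and it writes another ~5 GB copy.

# A sketch of regenerating the store with a different chunk size
# (the chunk size and output path here are arbitrary / hypothetical)
with ProgressBar():
    with xr.open_dataset(file_in, chunks=500000) as ds:
        ds = clean_datacube(ds)
        ds.to_zarr(os.path.join(TMPDIR, "datacube_rechunked.zarr"))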

# # To delete the zarr:

# from shutil import rmtree

# rmtree(zarr_store)