Demo looking at the relationship between residuals and the solar wind#

Prep access to packages and dataset#

import os
import datetime as dt
import pooch
import pandas as pd
import numpy as np
import xarray as xr
import dask
from dask.diagnostics import ProgressBar
import zarr
# import holoviews as hv
# import hvplot.xarray
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

from chaosmagpy.plot_utils import nio_colormap

from src.env import ICOS_FILE

TMPDIR = os.getcwd()
zarr_store = os.path.join(TMPDIR, "datacube_test.zarr")
print("Using:", zarr_store)

xr.set_options(
    display_expand_attrs=False,
    display_expand_data_vars=True
);
Using: /home/ash/code/geomagnetic_datacubes_dev/notebooks/datacube_test.zarr
ds = xr.open_dataset(
    zarr_store, engine="zarr",
    chunks="auto"
)

Load information about the grid points#

The grid coordinates are stored in a separate file. They are locations on a spherical (theta, phi) shell; the 40962 points are consistent with an icosphere subdivided six times (10 × 4⁶ + 2 = 40962). The grid_index matches up with the numbers given in the gridpoint_geo and gridpoint_qdmlt variables, so it can be used to identify the (theta, phi) coordinates of each bin.

  • For gridpoint_geo, (theta, phi) correspond to (90-Latitude, Longitude%360)

  • For gridpoint_qdmlt, (theta, phi) correspond to (90-QDLat, MLT*15)

# Load the coordinates, stored as "40962" within the HDF file.
gridcoords = pd.read_hdf(ICOS_FILE, key="40962")
# # Transform into a DataArray
# gridcoords = xr.DataArray(
#     data=gridcoords.values,
#     dims=("grid_index", "theta_phi"),
#     coords={
#         "grid_index": gridcoords.index,
#         "theta_phi": ["theta", "phi"]
#     }
# )
gridcoords["Latitude"] = 90 - gridcoords["theta"]
gridcoords["Longitude"] = np.vectorize(lambda x: x if x <= 180 else x - 360)(gridcoords["phi"])
gridcoords["QDLat"] = 90 - gridcoords["theta"]
gridcoords["MLT"] = gridcoords["phi"]/15
gridcoords
            theta         phi   Latitude   Longitude      QDLat        MLT
0       92.910599  125.730300  -2.910599  125.730300  -2.910599   8.382020
1       94.071637  127.340074  -4.071637  127.340074  -4.071637   8.489338
2       92.908770  127.341071  -2.908770  127.341071  -2.908770   8.489405
3       94.438861  126.233661  -4.438861  126.233661  -4.438861   8.415577
4       93.857171  125.428187  -3.857171  125.428187  -3.857171   8.361879
...           ...         ...        ...         ...        ...        ...
40957  148.282526   90.000000 -58.282526   90.000000 -58.282526   6.000000
40958  121.717474    0.000000 -31.717474    0.000000 -31.717474   0.000000
40959   58.282526    0.000000  31.717474    0.000000  31.717474   0.000000
40960  121.717474  180.000000 -31.717474  180.000000 -31.717474  12.000000
40961   58.282526  180.000000  31.717474  180.000000  31.717474  12.000000

40962 rows × 6 columns
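With these conversions in place, one can look up the grid point nearest to a given (QDLat, MLT). A rough sketch (using a simple Euclidean metric in these coordinates; it ignores the MLT wrap-around at 24 and the convergence of meridians at the poles):

# Rough nearest-gridpoint lookup in (QDLat, MLT) space (sketch only)
def nearest_qdmlt_gridpoint(qdlat, mlt):
    d2 = (gridcoords["QDLat"] - qdlat)**2 + (15*(gridcoords["MLT"] - mlt))**2
    return int(d2.idxmin())

nearest_qdmlt_gridpoint(31.7, 0.0)  # should land near grid point 40959 above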

Reduce down to what we might actually work with#

ds points to the full dataset. In some places we will instead use _ds (defined below), which points to a subset of the data.

We will only consider the variable B_NEC_res_CHAOS-full, which is the residual to the full CHAOS model (parameterising the core, crustal, and magnetospheric fields), i.e. approximately the magnetic disturbance created by the ionosphere.

# Select out some interesting parameters to work with
_ds = ds[
    [
        "B_NEC_res_CHAOS-full",
        "Latitude", "Longitude", "QDLat", "QDLon", "MLT",
        # "SunZenithAngle", "OrbitNumber",
        "gridpoint_geo", "gridpoint_qdmlt",
        "IMF_BY", "IMF_BZ", "IMF_Em", "IMF_V",
        # "Kp", "RC", "dRC"
    ]
]
# Downsample to every 60th sample (10 s cadence -> 10-minute sampling) to make prototyping easier
_ds = _ds.isel(Timestamp=slice(0, -1, 60))
_ds
<xarray.Dataset>
Dimensions:               (Timestamp: 257807, NEC: 3)
Coordinates:
  * NEC                   (NEC) object 'N' 'E' 'C'
  * Timestamp             (Timestamp) datetime64[ns] 2014-05-01 ... 2019-04-3...
Data variables:
    B_NEC_res_CHAOS-full  (Timestamp, NEC) float64 dask.array<chunksize=(91667, 3), meta=np.ndarray>
    Latitude              (Timestamp) float64 dask.array<chunksize=(257807,), meta=np.ndarray>
    Longitude             (Timestamp) float64 dask.array<chunksize=(257807,), meta=np.ndarray>
    QDLat                 (Timestamp) float64 dask.array<chunksize=(257807,), meta=np.ndarray>
    QDLon                 (Timestamp) float64 dask.array<chunksize=(257807,), meta=np.ndarray>
    MLT                   (Timestamp) float64 dask.array<chunksize=(257807,), meta=np.ndarray>
    gridpoint_geo         (Timestamp) int64 dask.array<chunksize=(257807,), meta=np.ndarray>
    gridpoint_qdmlt       (Timestamp) int64 dask.array<chunksize=(257807,), meta=np.ndarray>
    IMF_BY                (Timestamp) float64 dask.array<chunksize=(257807,), meta=np.ndarray>
    IMF_BZ                (Timestamp) float64 dask.array<chunksize=(257807,), meta=np.ndarray>
    IMF_Em                (Timestamp) float64 dask.array<chunksize=(257807,), meta=np.ndarray>
    IMF_V                 (Timestamp) float64 dask.array<chunksize=(257807,), meta=np.ndarray>
Attributes: (2)

Since we loaded the zarr using Dask, the dataset is not actually in memory. That laziness will be useful later for scaling to the full dataset, but here let's load it into memory so we can forget about Dask for now.
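The load can take a moment, so one option is to wrap it with the ProgressBar imported at the top to watch Dask work (a minimal sketch, equivalent to the plain .load() below):

# Optional: show a progress bar while Dask pulls the data into memory
with ProgressBar():
    _ds.load()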

_ds.load()
<xarray.Dataset>
Dimensions:               (Timestamp: 257807, NEC: 3)
Coordinates:
  * NEC                   (NEC) object 'N' 'E' 'C'
  * Timestamp             (Timestamp) datetime64[ns] 2014-05-01 ... 2019-04-3...
Data variables:
    B_NEC_res_CHAOS-full  (Timestamp, NEC) float64 -9.226 -2.571 ... 0.3528
    Latitude              (Timestamp) float64 8.566 -29.69 ... -49.74 -11.29
    Longitude             (Timestamp) float64 -174.7 -175.3 -172.9 ... 63.8 63.9
    QDLat                 (Timestamp) float64 6.269 -33.49 ... -57.65 -19.68
    QDLon                 (Timestamp) float64 -103.4 -95.71 ... 118.0 135.2
    MLT                   (Timestamp) float64 12.46 13.14 15.61 ... 2.929 4.244
    gridpoint_geo         (Timestamp) int64 19702 12894 29903 ... 10071 31762
    gridpoint_qdmlt       (Timestamp) int64 19689 33951 28888 ... 40786 9717
    IMF_BY                (Timestamp) float64 nan 7.135 ... -0.7425 -0.9943
    IMF_BZ                (Timestamp) float64 nan -5.606 -4.719 ... 1.106 1.083
    IMF_Em                (Timestamp) float64 nan nan nan ... 0.03124 0.0314
    IMF_V                 (Timestamp) float64 nan 321.1 316.5 ... 310.7 nan
Attributes: (2)

Some notes on those parameters:

  • The IMF_.. variables are created from OMNI data, which has already been time-shifted to Earth's bow shock, and then time-averaged over 20 minutes to give a smoothed input of energy to the magnetosphere and to account for typical lag times between the input at the magnetopause and the response in the ionosphere. This should be essentially the same input as that used in the AMPS model. For a better model we should use the full original OMNI data as input: both to provide the full time-history to drive the model, and to let the model account for varying lag times. We expect a prompter response on the dayside ("direct driving") and a more variable, lagged response on the nightside ("magnetospheric unloading").

  • IMF_Em is the merging electric field (a coupling function composed from BY, BZ & V)

  • gridpoint_geo and gridpoint_qdmlt are indices into spherical grids of 40962 points. Use the gridcoords dataframe to identify the coordinates of those points, as in the sketch below.
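For example, to recover the (QDLat, MLT) bin coordinates used by the first few samples (a minimal sketch using the objects defined above):

# Look up the bin coordinates used by the first few samples
sample_bins = _ds["gridpoint_qdmlt"].data[:5]
print(gridcoords.iloc[sample_bins][["QDLat", "MLT"]])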

Some inspection of the data#

_ds.plot.scatter(
    x="Longitude", y="Latitude", hue="B_NEC_res_CHAOS-full",
    s=0.1, cmap=nio_colormap(), col="NEC", robust=True
)
<xarray.plot.facetgrid.FacetGrid at 0x7f82e16b5fa0>
[Figure: geographic maps (Longitude vs Latitude) of B_NEC_res_CHAOS-full, one panel per NEC component]
_ds.plot.scatter(
    x="MLT", y="QDLat", hue="B_NEC_res_CHAOS-full",
    s=0.1, cmap=nio_colormap(), col="NEC", robust=True
)
<xarray.plot.facetgrid.FacetGrid at 0x7f82d8720fa0>
[Figure: QDLat vs MLT scatter of B_NEC_res_CHAOS-full, one panel per NEC component]
_ds.plot.scatter(
    x="Latitude", y="B_NEC_res_CHAOS-full", hue="IMF_Em",
    s=0.1, cmap="plasma", col="NEC", robust=True
)
<xarray.plot.facetgrid.FacetGrid at 0x7f82d864dac0>
[Figure: B_NEC_res_CHAOS-full vs Latitude, coloured by IMF_Em, one panel per NEC component]
def inspect_gridpoint(_ds=_ds, index=0, x="IMF_Em"):
    """Quick scatterplot of data in a given bin"""
    # Select data from given gridpoint
    __ds = _ds.where(_ds["gridpoint_qdmlt"]==index, drop=True)
    # Identify coordinates of that bin
    qdlat, mlt = gridcoords.iloc[index][["QDLat", "MLT"]]
    # Construct figure
    facetgrid = __ds.plot.scatter(x=x, y="B_NEC_res_CHAOS-full", col="NEC")
    # Add coordinates for the displayed gridpoint
    facetgrid.fig.suptitle(
        f"Grid point: QDLat={np.round(qdlat, 1)}, MLT={np.round(mlt, 1)}",
        verticalalignment="bottom"
    )
    return facetgrid


# Plot a few different bins. NB: We use the full dataset this time
for index in range(0, 5):
    inspect_gridpoint(_ds=ds, index=index)
[Figures: B_NEC_res_CHAOS-full vs IMF_Em scatterplots for grid points 0 through 4]

A simple model based on linear regressions within each bin#

Within each bin, we perform a linear regression of B_NEC_res_CHAOS-full against IMF_Em.

This time we use the full dataset ds, chopped in half: the first half for training and the second half for testing.

  • We ignore the radial variation of the data within each bin. This is okay here because the data come from a single satellite over five years, so they are collected at similar altitudes. We will need to re-think this when extending to longer durations and multiple satellites.

  • We use only the vertical component, to keep things simple. The vertical component is the more stable one, so perhaps easier to predict.

Divide the dataset in two:

midpoint = int(len(ds["Timestamp"])/2)
_ds_train = ds.isel(Timestamp=slice(0, midpoint, 1))  # used to build the model
_ds_test = ds.isel(Timestamp=slice(midpoint, -1, 1))  # used to test the model
_ds_train["Timestamp"].data
array(['2014-05-01T00:00:00.000000000', '2014-05-01T00:00:10.000000000',
       '2014-05-01T00:00:20.000000000', ...,
       '2016-10-28T04:45:40.000000000', '2016-10-28T04:45:50.000000000',
       '2016-10-28T04:46:00.000000000'], dtype='datetime64[ns]')
_ds_test["Timestamp"].data
array(['2016-10-28T04:46:10.000000000', '2016-10-28T04:46:20.000000000',
       '2016-10-28T04:46:30.000000000', ...,
       '2019-04-30T23:59:20.000000000', '2019-04-30T23:59:30.000000000',
       '2019-04-30T23:59:40.000000000'], dtype='datetime64[ns]')
from scipy.stats import linregress

A first attempt using a groupby-apply workflow didn't work out…

# def regress_xarray(ds, x="IMF_Em", y="B_NEC_res_CHAOS-full"):
#     """Regress variables within a dataset against each other

#     Returns a DataArray so that it can be used in a groupby-apply workflow...
#     Construction of this must be really slow - need to investigate how to do this properly
#     """
#     regression = linregress(ds[x], ds[y])
#     # return [regression.slope, regression.intercept]
#     return xr.DataArray(
#         data=[regression.slope, regression.intercept],
#         dims=["slope_intercept"], coords={"slope_intercept": ["slope", "intercept"]}
#     )

# %%time
# regression_results = _ds_train.sel(NEC="C").groupby("gridpoint_qdmlt").apply(regress_xarray)
def build_model(ds=_ds_train):
    """Returns a DataArray containing slopes & intercepts of each linear regression"""
    x = "IMF_Em"
    y = "B_NEC_res_CHAOS-full"
    N = len(ds["Timestamp"])
    # chunksize = 100000
    # ds = ds.chunk(chunksize)
    # Arrange the x-y data to be regressed
    # Read the data from the input Dataset (ds) and put in a simpler DataArray
    regression = xr.DataArray(
        data=np.empty((N, 2)),
        dims=["Timestamp", "dim_1"],
        coords={
            "gridpoint_qdmlt": ds["gridpoint_qdmlt"]
        }
    )#.chunk(chunksize)
    regression.data[:, 0] = ds[x].data
    regression.data[:, 1] = ds[y].sel(NEC="C").data
    # Load it into memory - not doing so makes it very slow
    regression.load()
    # Remove entries with NaNs so that linregress works
    regression = regression.where(~np.any(np.isnan(regression), axis=1), drop=True)

    def _regress(da):
        result = linregress(da.data[:, 0], da.data[:, 1])
        return [result.slope, result.intercept]

    regression_results = np.empty((40962, 2))
    regression_results[:] = np.nan
    # Split dataset into the bins and apply the regression within each bin.
    for i, __da in tqdm(regression.groupby("gridpoint_qdmlt")):
        regression_results[int(i)] = _regress(__da)

    regression_results = xr.DataArray(
        data=regression_results,
        dims=["gridpoint_qdmlt", "slope_intercept"],
        coords={
            "gridpoint_qdmlt": range(40962),
            "slope_intercept": ["slope", "intercept"]
        }
    )
    return regression_results
regression_results = build_model(ds=_ds_train)
regression_results
<xarray.DataArray (gridpoint_qdmlt: 40962, slope_intercept: 2)>
array([[ -0.91064602,   0.92471419],
       [ -0.35233481,   2.36205458],
       [ -1.08103889,   2.55874955],
       ...,
       [  0.76851364,  -0.45965504],
       [ -0.21283606,  11.07832005],
       [ -0.45352056, -12.87865929]])
Coordinates:
  * gridpoint_qdmlt  (gridpoint_qdmlt) int64 0 1 2 3 ... 40958 40959 40960 40961
  * slope_intercept  (slope_intercept) <U9 'slope' 'intercept'
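As an aside, the Python loop over bins in build_model could in principle be replaced by a closed-form, vectorised least squares, accumulating the per-bin sums with np.bincount. A sketch (not used in this notebook; assumes 1-D numpy arrays with NaNs already removed, and non-negative integer bin labels):

def binned_linregress(bins, x, y, nbins=40962):
    """Closed-form per-bin slopes/intercepts (sketch)"""
    n = np.bincount(bins, minlength=nbins)
    sx = np.bincount(bins, weights=x, minlength=nbins)
    sy = np.bincount(bins, weights=y, minlength=nbins)
    sxx = np.bincount(bins, weights=x*x, minlength=nbins)
    sxy = np.bincount(bins, weights=x*y, minlength=nbins)
    with np.errstate(divide="ignore", invalid="ignore"):
        # slope = cov(x, y) / var(x), formed from the accumulated sums
        slope = (n*sxy - sx*sy) / (n*sxx - sx**2)
        intercept = (sy - slope*sx) / n
    return slope, intercept  # NaN where a bin is empty or degenerate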

Apply the regression results to make the predictions over the test dataset#

def make_predictions(ds, regression_results):
    """Returns a DataArray of predictions for B_NEC_res based on IMF_Em"""
    # Create a DataArray to hold the predictions
    prediction = xr.DataArray(
        data=np.empty((len(ds["Timestamp"]), 2)),
        coords={
            "gridpoint_qdmlt": ds["gridpoint_qdmlt"].data,
            "slope_intercept": ["slope", "intercept"],
        },
        dims=("gridpoint_qdmlt", "slope_intercept")
    )
    # Reorganise the regression results to follow the order of the gridpoints in the dataset
    prediction.data = regression_results.reindex_like(prediction).data
    # Apply the regression results, inplace in the prediction DataArray
    m = prediction.sel({"slope_intercept": "slope"}).data
    c = prediction.sel({"slope_intercept": "intercept"}).data
    prediction.data[:, 0] = m * ds["IMF_Em"].data + c
    # Drop the unneeded dimension
    prediction = prediction.isel({"slope_intercept": 0}).drop("slope_intercept")
    # Set Timestamp as the coordinate
    prediction = prediction.rename({"gridpoint_qdmlt": "Timestamp"})
    prediction["Timestamp"] = ds["Timestamp"]
    return prediction
prediction = make_predictions(_ds_test, regression_results)
# # Remove some of the unreasonably large predictions
# prediction = prediction.where(~(np.fabs(prediction) > 1000))
# prediction
prediction
<xarray.DataArray (Timestamp: 7734201)>
array([ 8.95574744, 10.24684501, 10.36732238, ...,  0.97471219,
        1.33689136,  1.09296473])
Coordinates:
  * Timestamp  (Timestamp) datetime64[ns] 2016-10-28T04:46:10 ... 2019-04-30T...
fig, ax = plt.subplots(1, 1)
ax.scatter(_ds_test["B_NEC_res_CHAOS-full"].sel(NEC="C").data, prediction.data, s=0.1)
ax.set_xlabel("B_C_res (measured)")
ax.set_ylabel("B_C_res (predicted from merging electric field)");
[Figure: predicted vs measured B_C_res over the test set]
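To put a number on the scatter above, a quick skill check over the test set (a sketch, reusing the objects already defined):

measured = _ds_test["B_NEC_res_CHAOS-full"].sel(NEC="C")
valid = np.isfinite(prediction) & np.isfinite(measured)
rmse = float(np.sqrt(((prediction - measured)**2).where(valid).mean()))
corr = float(xr.corr(prediction.where(valid), measured.where(valid)))
print(f"RMSE: {rmse:.2f} nT; correlation: {corr:.3f}")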
_ds_test["prediction"] = prediction
_ds_test["prediction_residual"] = prediction - \
    _ds_test["B_NEC_res_CHAOS-full"].sel(NEC="C")

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20, 5))
_ds_test.isel(Timestamp=slice(0, -1, 60)).sel(NEC="C").plot.scatter(
    x="MLT", y="QDLat", hue="B_NEC_res_CHAOS-full",
    s=0.1, cmap=nio_colormap(), robust=True, ax=axes[0]
)
_ds_test.isel(Timestamp=slice(0, -1, 60)).plot.scatter(
    x="MLT", y="QDLat", hue="prediction",
    s=0.1, cmap=nio_colormap(), robust=True, ax=axes[1]
)
_ds_test.isel(Timestamp=slice(0, -1, 60)).plot.scatter(
    x="MLT", y="QDLat", hue="prediction_residual",
    s=0.1, cmap=nio_colormap(), robust=True, ax=axes[2]
)
<matplotlib.collections.PathCollection at 0x7f82e11b27f0>
[Figure: QDLat vs MLT maps of the measured residual, the prediction, and the prediction residual]

Repeat the above with only quiet data#

quiet = (ds["Kp"] < 3) & (ds["IMF_Em"] < 0.8)
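# NB: .where() keeps the full time series but masks non-quiet samples with NaN
# (those rows are then dropped inside build_model by its NaN filter)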
ds_quiet = ds.where(quiet)

midpoint = int(len(ds["Timestamp"])/2)
_ds_train_quiet = ds_quiet.isel(Timestamp=slice(0, midpoint, 1))
_ds_test_quiet = ds_quiet.isel(Timestamp=slice(midpoint, -1, 1))
# t_0 = ds["Timestamp"].isel(Timestamp=0)
# t_mid = ds["Timestamp"].isel(Timestamp=int(len(ds["Timestamp"])/2))
# t_mid
# _ds_train_quiet.isel(Timestamp=slice(0, -1, 60)).plot.scatter(
#     x="MLT", y="QDLat", hue="B_NEC_res_CHAOS-full",
#     s=0.1, cmap=nio_colormap(), col="NEC", robust=True
# )
regression_results_quiet = build_model(ds=_ds_train_quiet)
prediction_quiet = make_predictions(_ds_test_quiet, regression_results_quiet)
prediction_quiet
<xarray.DataArray (Timestamp: 7734201)>
array([       nan,        nan,        nan, ..., 1.05331635, 1.80006575,
       1.99285837])
Coordinates:
  * Timestamp  (Timestamp) datetime64[ns] 2016-10-28T04:46:10 ... 2019-04-30T...
fig, ax = plt.subplots(1, 1)
ax.scatter(_ds_test_quiet["B_NEC_res_CHAOS-full"].sel(NEC="C").data, prediction_quiet.data, s=0.1)
ax.set_xlabel("B_C_res (measured)")
ax.set_ylabel("B_C_res (predicted from merging electric field)");
[Figure: predicted vs measured B_C_res over the quiet test set]
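The same quick skill check for the quiet-time model (a sketch, mirroring the one above):

measured_q = _ds_test_quiet["B_NEC_res_CHAOS-full"].sel(NEC="C")
valid_q = np.isfinite(prediction_quiet) & np.isfinite(measured_q)
rmse_q = float(np.sqrt(((prediction_quiet - measured_q)**2).where(valid_q).mean()))
print(f"Quiet-time RMSE: {rmse_q:.2f} nT")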
_ds_test_quiet["prediction"] = prediction_quiet
_ds_test_quiet["prediction_residual"] = prediction_quiet - \
    _ds_test_quiet["B_NEC_res_CHAOS-full"].sel(NEC="C")

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20, 5))
_ds_test_quiet.isel(Timestamp=slice(0, -1, 60)).sel(NEC="C").plot.scatter(
    x="MLT", y="QDLat", hue="B_NEC_res_CHAOS-full",
    s=0.1, cmap=nio_colormap(), robust=True, ax=axes[0]
)
_ds_test_quiet.isel(Timestamp=slice(0, -1, 60)).plot.scatter(
    x="MLT", y="QDLat", hue="prediction",
    s=0.1, cmap=nio_colormap(), robust=True, ax=axes[1]
)
_ds_test_quiet.isel(Timestamp=slice(0, -1, 60)).plot.scatter(
    x="MLT", y="QDLat", hue="prediction_residual",
    s=0.1, cmap=nio_colormap(), robust=True, ax=axes[2]
)
<matplotlib.collections.PathCollection at 0x7f828574d1f0>
[Figure: QDLat vs MLT maps of the quiet-time measured residual, prediction, and prediction residual]