Software Architecture

Software Design

Unlike most InSAR processing software (e.g., StaMPS, MintPy) that prescribes a designated processing workflow, Moraine only provides a collection of Python functions and commands. The reason is that, in real applications, there is no perfect workflow that always generates satisfactory deformation results, especially when the coherence is poor and the atmospheric artifacts are strong. One needs to try many different methods, but they are generally implemented in different packages. Even worse, workflow-based software tends to be heavily encapsulated and to come without detailed documentation. It is frustrating when users need to save intermediate data from one software package and prepare it in the designated format and structure required by another. Sometimes it is necessary to read a lot of source code to understand what the outputs are, what their data structures are, and what kind of inputs are needed when the typical workflow is not followed. So, instead of providing a standard workflow, Moraine is designed as a collection of functions that implement specific InSAR processing techniques (e.g., calculating the dispersion index, doing phase linking), and users are encouraged to build their own workflow that suits their case. We provide the necessary infrastructure and your role is to be innovative! To make this easier, Moraine provides detailed documentation for each function that explains its usage. We also provide a tutorials section with some typical workflows for reference. In case users want to try methods that are not implemented in Moraine, the inputs and outputs are well explained in the documentation of every Moraine function.

Although we provide detailed documentation and reference workflows, we admit this software is not so easy that users can simply run it from the first step to the last. This does not mean we do not value user-friendliness, but user-friendliness should not come at the expense of flexibility and creativity.

Software Structure

Most functions in this package provide two kinds of API: an array-based API and a file-based API. The inputs and outputs of array-based functions are generally numpy or cupy arrays (simply put, cupy is a package that provides the same functions as numpy but runs on the GPU), while the inputs and outputs of file-based functions are strings holding the paths to arrays stored on disk. InSAR techniques that can be greatly accelerated by parallel processing are implemented with cupy or numba for better performance. The file-based functions are not simple wrappers of the array-based functions. Since Moraine aims at big-data processing, many array-based functions cannot be applied directly because of memory limitations. The file-based functions, however, support chunkwise processing with the help of dask, and multi-GPU processing is also supported.

For simplicity, we call the file-based functions CLI (command line interface) and the array-based functions API. (The name CLI is just a convention; we do not provide commands that can be run from the terminal.) The API and CLI functions are arranged in different namespaces: moraine.* and moraine.cli.*. The CLI functions support logging if a logger is created.
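
As a minimal sketch of the two call styles (the commented-out calls use placeholder inputs; a complete, runnable walk-through follows in the example section below):

import moraine as mr
import moraine.cli as mc

# array-based API: operates on in-memory numpy/cupy arrays
# ph, emi_quality = mr.emi(coh)

# file-based CLI: operates on paths to zarr arrays on disk and
# processes them chunk by chunk with dask (optionally on multiple GPUs)
logger = mc.get_logger()  # CLI functions log their progress once a logger is created
# mc.emi('coh.zarr', 'ph.zarr', 'emi_quality.zarr')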

Data format

Most of the data stored by this package is in the zarr format, a file storage format for chunked, compressed, N-dimensional arrays. The figure below shows the structure of a zarr array. Reading and writing are fast since the data are compressed. Before compression, the data are divided into chunks to be more flexible for parallel operation with dask. Generally, the file name is xxxxxx.zarr. You will find that it is in fact a directory in the file system, but just treat it as a single file in use.

[Figure: structure of a zarr array]

[Figure: structure of a dask array]

Note that the structure of a dask array is similar. Each chunk of a big dask array is just a numpy or cupy array. Independent operations on each chunk are automatically parallelized.
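
For instance, a zarr array can be opened lazily as a dask array and an operation can be mapped over its chunks; each chunk is then processed independently and in parallel (xxxxxx.zarr is just a placeholder name here):

import numpy as np
import dask.array as da

darr = da.from_zarr('xxxxxx.zarr')    # lazy dask array, chunk layout taken from the zarr store
print(darr.chunks)                    # chunk sizes along every dimension
angle = darr.map_blocks(np.angle)     # applied to each chunk independently, in parallel
angle.to_zarr('xxxxxx_angle.zarr')    # result written back chunk by chunk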

In this software, there are mainly two kinds of dataset: stacks of raster data and stacks of point cloud data. The raster dataset is divided into chunks along both the azimuth dimension and the range dimension. The point cloud dataset is divided into chunks along the spatial dimension. These two chunk sizes need to be determined by the user; the chunk sizes in the higher dimensions are determined automatically, so users do not need to care about them.

The chunk size affects the performance of the program. An improper chunk size slows down processing or even crashes the program: too small a chunk size causes too much inter-process communication and slows the program down, while too big a chunk size may crash the program due to memory limits. For raster data, it is good to make sure that the range chunk size of the last chunk is the same as the others, and it is preferred to divide the raster data along the azimuth direction rather than the range direction.
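
As an illustration only (the shapes, chunk sizes and dimension order below are made up for this sketch), the two chunking schemes could be set up like this when creating the zarr stores:

import numpy as np
import zarr

# raster stack: divided along azimuth (and range if necessary);
# the chunk size of the image dimension is handled automatically, here kept whole
raster = zarr.open('raster_stack.zarr', 'w',
                   shape=(17, 2500, 1800),   # (image, azimuth, range)
                   chunks=(17, 1000, 1800),  # chunked along azimuth, range kept in one chunk
                   dtype=np.complex64)

# point cloud stack: divided along the spatial (point) dimension
pc = zarr.open('pc_stack.zarr', 'w',
               shape=(740397, 17),           # (point, image)
               chunks=(200000, 17),
               dtype=np.complex64)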

An example

Here we provide a simple example. The API function moraine.emi implements the EMI phase linking method, and moraine.cli.emi is its file-based counterpart.

Import them first:

import moraine as mr
import moraine.cli as mc
from nbdev.showdoc import show_doc # this is just a function to show the documentation

source

emi

 emi (coh:numpy.ndarray, ref:int=0)
         Type     Default  Details
coh      ndarray           complex coherence matrix, dtype cupy.complex
ref      int      0        index of the reference image in the phase history output, optional
Returns  tuple             estimated phase history ph, dtype complex; quality (minimum eigenvalue, dtype float)

source

emi

 emi (coh:str, ph:str, emi_quality:str, ref:int=0, chunks:int=None,
      cuda:bool=False, processes=None, n_workers=None,
      threads_per_worker=None, rmm_pool_size=0.9, **dask_cluster_arg)

Phase linking with EMI estimator.

                    Type      Default  Details
coh                 str                coherence matrix
ph                  str                output, wrapped phase
emi_quality         str                output, pixel quality
ref                 int       0        reference image for the phase
chunks              int       None     chunk size of the output zarr dataset, optional; default: same as coh
cuda                bool      False    whether to use CUDA for processing, False by default
processes           NoneType  None     use processes rather than threads for dask workers; default is False for CPU, only applied if cuda==False
n_workers           NoneType  None     number of dask workers; default is 1 for CPU, the number of GPUs for CUDA
threads_per_worker  NoneType  None     number of threads per dask worker; default is 2 for CPU, only applied if cuda==False
rmm_pool_size       float     0.9      RMM pool size, only applied when cuda==True
dask_cluster_arg

To apply the emi API function:

import zarr
import numpy as np
from moraine.utils_ import is_cuda_available
if is_cuda_available():
    import cupy as cp
coh_zarr = zarr.open('./software_architecture/ds_can_coh.zarr/','r')
coh_zarr,coh_zarr.shape,coh_zarr.chunks,coh_zarr.dtype
(<zarr.core.Array (740397, 17, 17) complex64 read-only>,
 (740397, 17, 17),
 (200000, 17, 17),
 dtype('complex64'))

This is the coherence matrix for 740397 selected DS candidates, and there are 17 SAR images, so the coherence matrix for one pixel is 17 \(\times\) 17. The coherence matrix is stored in 4 chunks, and each chunk stores the data of 200000 DS candidates (the last chunk only has 140397 DS candidates).
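
The chunk layout can be verified directly from the shape and chunk size:

import math
n_pt, chunk_pt = coh_zarr.shape[0], coh_zarr.chunks[0]
n_chunk = math.ceil(n_pt/chunk_pt)       # 4 chunks
n_last = n_pt - (n_chunk-1)*chunk_pt     # 140397 DS candidates in the last chunk
n_chunk, n_last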

!ls -al ./software_architecture/ds_can_coh.zarr/ #It is a directory indeed!
total 1570400
drwxrwxr-x 2 kangl kangl      4096 Sep 28  2023 .
drwxrwxr-x 5 kangl kangl      4096 Apr 28 20:27 ..
-rw-rw-r-- 1 kangl kangl 434775676 Sep 28  2023 0.0.0
-rw-rw-r-- 1 kangl kangl 432578417 Sep 28  2023 1.0.0
-rw-rw-r-- 1 kangl kangl 434846911 Sep 28  2023 2.0.0
-rw-rw-r-- 1 kangl kangl 305857416 Sep 28  2023 3.0.0
-rw-rw-r-- 1 kangl kangl       398 Sep 28  2023 .zarray
coh = coh_zarr[:] # read as numpy array
if is_cuda_available():
    coh = cp.asarray(coh) # convert to cupy array
coh.shape
(740397, 17, 17)
%%time
# The processing is really fast!
if is_cuda_available():
    ph,emi_quality = mr.emi(coh)
CPU times: user 1.45 s, sys: 202 ms, total: 1.65 s
Wall time: 1.65 s
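
The returned ph and emi_quality live on the GPU as cupy arrays; their shapes and dtypes match what the CLI run below reports for its dask arrays:

if is_cuda_available():
    print(ph.shape, ph.dtype)                    # (740397, 17) complex64
    print(emi_quality.shape, emi_quality.dtype)  # (740397,) float32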

Now we apply the CLI function:

%%time
logger = mc.get_logger()
if is_cuda_available():
    mc.emi('./software_architecture/ds_can_coh.zarr/',
           './software_architecture/ds_can_ph.zarr',
           './software_architecture/ds_can_emi_quality.zarr',)
2024-04-30 22:49:13 - log_args - INFO - running function: emi
2024-04-30 22:49:13 - log_args - INFO - fetching args:
2024-04-30 22:49:13 - log_args - INFO - coh = './software_architecture/ds_can_coh.zarr/'
2024-04-30 22:49:13 - log_args - INFO - ph = './software_architecture/ds_can_ph.zarr'
2024-04-30 22:49:13 - log_args - INFO - emi_quality = './software_architecture/ds_can_emi_quality.zarr'
2024-04-30 22:49:13 - log_args - INFO - ref = 0
2024-04-30 22:49:13 - log_args - INFO - chunks = None
2024-04-30 22:49:13 - log_args - INFO - fetching args done.
2024-04-30 22:49:13 - zarr_info - INFO - ./software_architecture/ds_can_coh.zarr/ zarray shape: (740397, 17, 17)
2024-04-30 22:49:13 - zarr_info - INFO - ./software_architecture/ds_can_coh.zarr/ zarray chunks: (200000, 17, 17)
2024-04-30 22:49:13 - zarr_info - INFO - ./software_architecture/ds_can_coh.zarr/ zarray dtype: complex64
2024-04-30 22:49:13 - emi - INFO - starting dask CUDA local cluster.
2024-04-30 22:49:16 - emi - INFO - dask local CUDA cluster started.
2024-04-30 22:49:16 - darr_info - INFO - coh dask array shape: (740397, 17, 17)
2024-04-30 22:49:16 - darr_info - INFO - coh dask array chunksize: (200000, 17, 17)
2024-04-30 22:49:16 - darr_info - INFO - coh dask array dtype: complex64
2024-04-30 22:49:16 - emi - INFO - phase linking with EMI.
2024-04-30 22:49:16 - emi - INFO - got ph and emi_quality.
2024-04-30 22:49:16 - darr_info - INFO - ph dask array shape: (740397, 17)
2024-04-30 22:49:16 - darr_info - INFO - ph dask array chunksize: (200000, 17)
2024-04-30 22:49:16 - darr_info - INFO - ph dask array dtype: complex64
2024-04-30 22:49:16 - darr_info - INFO - emi_quality dask array shape: (740397,)
2024-04-30 22:49:16 - darr_info - INFO - emi_quality dask array chunksize: (200000,)
2024-04-30 22:49:16 - darr_info - INFO - emi_quality dask array dtype: float32
2024-04-30 22:49:16 - emi - INFO - rechunk ph
2024-04-30 22:49:16 - darr_info - INFO - ph dask array shape: (740397, 17)
2024-04-30 22:49:16 - darr_info - INFO - ph dask array chunksize: (200000, 17)
2024-04-30 22:49:16 - darr_info - INFO - ph dask array dtype: complex64
2024-04-30 22:49:16 - emi - INFO - saving ph and emi_quality.
2024-04-30 22:49:16 - emi - INFO - computing graph setted. doing all the computing.
2024-04-30 22:49:21 - emi - INFO - computing finished.
2024-04-30 22:49:22 - emi - INFO - dask cluster closed.
CPU times: user 578 ms, sys: 304 ms, total: 883 ms
Wall time: 8.61 s

The CLI function is slower than the API function since it needs to read and write the data and set up the dask CUDA cluster.
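
The written outputs are ordinary zarr arrays and can be opened for inspection in the same way as the input:

ph_zarr = zarr.open('./software_architecture/ds_can_ph.zarr', 'r')
emi_quality_zarr = zarr.open('./software_architecture/ds_can_emi_quality.zarr', 'r')
ph_zarr.shape, emi_quality_zarr.shape  # (740397, 17) and (740397,), as reported in the log above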

The CLI also includes functions for simple data manipulation (e.g., array slicing and point cloud merging). As these operations are very easy to do on numpy/cupy arrays, these CLI functions do not have corresponding API functions.
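
For in-memory arrays these manipulations are indeed one-liners with numpy (or cupy); a small self-contained sketch with made-up data:

import numpy as np

# stand-in for a point cloud stack of wrapped phases: (point, image)
ph = np.exp(1j*np.random.rand(1000, 17)).astype(np.complex64)

ph_sub = ph[:, :10]                                 # array slicing: keep the first 10 images
part1, part2 = ph[:600], ph[600:]                   # two point cloud stacks
ph_merged = np.concatenate([part1, part2], axis=0)  # point cloud merging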