Memory tests with `axis`: hijacking cbor output #316

@maxstack @bnlawrence @davidhassell as you folks are already aware, we have an issue with the larger payloads returned for a stat with `axis`: once unpacked by cbor, those payloads can take up quite a bit of memory. I've run a number of tests in which I "hijack" the payload as it is being decoded, using the native `cbor.CBORDecoder()` class with an `object_hook` that lets me manipulate the decoded payload, like so:

```python
from io import BytesIO

import cbor2 as cbor  # assumption: `cbor` in this snippet is the cbor2 package
import numpy as np


def object_hook(decoder, obj):
    """Called by the decoder for each decoded CBOR map; returns the object to use instead."""
    if not isinstance(obj, dict):
        return obj
    # Keys and value types of the decoded payload:
    # dict_keys(['bytes', 'dtype', 'shape', 'count', 'byte_order'])
    # [<class 'bytes'>, <class 'str'>, <class 'list'>, <class 'list'>, <class 'str'>]
    # Converting the values to arrays doesn't really change much, apart from,
    # of course, 'count', which is HUGE. Emptying it as below works,
    # but only when counts are not needed.
    new_obj = {}
    new_obj["bytes"] = obj["bytes"]
    new_obj["dtype"] = obj["dtype"]
    new_obj["count"] = []  # np.array(obj["count"]): worse than list
    new_obj["shape"] = obj["shape"]
    new_obj["byte_order"] = obj["byte_order"]
    return new_obj


def decode_result(response):
    """Decode a successful response, return as a 2-tuple of (numpy array or scalar, count)."""
    # The decoder already wraps the payload in BytesIO, so decode() reads it directly
    decoder = cbor.CBORDecoder(BytesIO(response.content), object_hook=object_hook)
    reduction_result = decoder.decode()
    dtype = reduction_result['dtype']
    shape = reduction_result['shape'] if "shape" in reduction_result else None

    # Result
    result = np.frombuffer(reduction_result['bytes'], dtype=dtype)
    result = result.reshape(shape)

    # Counts
    count = reduction_result['count']
    # TODO: When reductionist is ready, we need to fix 'count'

    # Mask the result
    result = np.ma.masked_where(count == 0, result)

    return result, count
```

`object_hook()` is a callable that lets you do stuff with the cbor `obj`; in this case I keep the original `obj` as is, apart from making the `count` list zero-length.

**Conclusions**

- The `count` list (`obj["count"]`) is the MAIN memory consumer: in many cases, setting it to zero length (for stats that don't need counts) can buy us 30-40% memory.
- Trying to manipulate it in the `object_hook` callable, e.g. converting it to a Numpy array, had no effect or a mildly adverse one, since the interpreter actually loads the list into memory in order to convert it; there is no lazy conversion. We could try smarter things with Dask for such a conversion, but I don't think we'd get anywhere significant.
- It'd be great if Reductionist encoded it differently, perhaps as an array or Bytes struct, so that it doesn't decode to a list.
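To put a rough number on the last point, the sketch below compares the footprint of a Python list of integers with the same values packed into a single bytes object. It uses only the standard library; `n` is arbitrary, and packing counts as 8-byte integers is just an assumed encoding for illustration, not something Reductionist does today.

```python
import sys
from array import array

n = 100_000  # arbitrary number of count values, for illustration only

# What decoding a CBOR array of integers yields: a Python list of int objects.
# Values start at 1000 to avoid CPython's shared small-int objects.
counts_list = [1000 + i for i in range(n)]
# Size of the list container plus one int object per element.
list_bytes = sys.getsizeof(counts_list) + sum(sys.getsizeof(c) for c in counts_list)

# Hypothetical alternative: the same values packed as 8-byte signed integers,
# which would arrive as one bytes object.
counts_packed = array("q", counts_list).tobytes()
packed_bytes = sys.getsizeof(counts_packed)

print(f"list of ints: {list_bytes / 1e6:.1f} MB")
print(f"packed bytes: {packed_bytes / 1e6:.1f} MB")

# Check that the packed form round-trips to the same values.
restored = array("q")
restored.frombytes(counts_packed)
assert restored.tolist() == counts_list
```

On the client side such a buffer could go through `np.frombuffer`, the same way `bytes` already does in `decode_result`.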
