
Memory tests with `axis`: hijacking cbor output #316

@valeriupredoi

@maxstack @bnlawrence @davidhassell as you folks are already aware, we have an issue with the larger payloads produced when returning a stat with axis: once unpacked by cbor, those payloads can take up quite a bit of memory. I've done a number of tests whereby I "hijack" the payload while it is being decoded, using the native cbor.CBORDecoder class and its object_hook argument, which lets me manipulate each decoded object before it is handed back, as such:

from io import BytesIO

import cbor2 as cbor  # assumption: `cbor` in this snippet is the cbor2 package
import numpy as np


def object_hook(decoder, obj):
    """Called by the decoder for each decoded map; empties the count list."""
    # Payload keys:  dict_keys(['bytes', 'dtype', 'shape', 'count', 'byte_order'])
    # Payload types: [<class 'bytes'>, <class 'str'>, <class 'list'>, <class 'list'>, <class 'str'>]
    # Converting the fields to arrays doesn't really impact much, apart from,
    # of course, count, which is HUGE. Emptying it works, but only for
    # stats where count is not needed.
    if not isinstance(obj, dict) or "count" not in obj:
        return obj
    new_obj = dict(obj)
    new_obj["count"] = []  # np.array(obj["count"]): worse than list
    return new_obj


def decode_result(response):
    """Decode a successful response, return as a 2-tuple of (numpy array or scalar, count)."""
    decoder = cbor.CBORDecoder(BytesIO(response.content), object_hook=object_hook)
    reduction_result = decoder.decode()
    dtype = reduction_result['dtype']
    shape = reduction_result.get('shape')

    # Result
    result = np.frombuffer(reduction_result['bytes'], dtype=dtype)
    if shape is not None:
        result = result.reshape(shape)

    # Counts (an empty list once object_hook has dropped them)
    count = reduction_result['count']
    # TODO: When reductionist is ready, we need to fix 'count'

    # Mask the result; comparing a plain list with 0 never masks anything,
    # so convert first, and skip masking when the counts were dropped
    count_array = np.asarray(count)
    if count_array.size == result.size:
        result = np.ma.masked_where(count_array.reshape(result.shape) == 0, result)

    return result, count

object_hook() is a callable that lets you do stuff with the decoded cbor obj; in this case I am keeping the original obj apart from making the count list zero-length.
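For anyone who wants to reproduce this kind of test without a Reductionist response at hand, here is a minimal sketch of the measurement. It is not the code used for the numbers in this issue: the standard-library json module stands in for cbor (its object_hook takes only the decoded dict, whereas the cbor one also receives the decoder), and drop_count, retained_bytes and the payload are hypothetical names and data made up for illustration.

```python
import json
import tracemalloc


def drop_count(obj):
    """json object_hook: receives each decoded dict and empties its count list."""
    if "count" in obj:
        obj["count"] = []
    return obj


def retained_bytes(payload, hook=None):
    """Decode payload; return (result, bytes still allocated once decoding is done)."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    result = json.loads(payload, object_hook=hook)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, after - before


# Hypothetical payload: small metadata fields plus a long list of counts.
payload = json.dumps(
    {"dtype": "float64", "shape": [200_000], "count": list(range(1000, 201_000))}
)

kept, kept_bytes = retained_bytes(payload)
dropped, dropped_bytes = retained_bytes(payload, hook=drop_count)

print(f"count kept:    {kept_bytes} bytes retained after decoding")
print(f"count dropped: {dropped_bytes} bytes retained after decoding")
```

Note that the hook only runs once the whole map, count list included, has been decoded, so in this sketch it lowers what is retained after decoding rather than the peak reached while decoding.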

Conclusions

  • the count list (obj["count"]) is the MAIN memory consumer - in many cases, setting it to zero length (for stats that don't need counts) can buy us 30-40% memory
  • trying to manipulate it in the object_hook callable, e.g. converting it to a Numpy array, had no effect or a mildly adverse one, since the interpreter has to load the whole list in memory before it can convert it - there is no lazy conversion; we could try smarter things with Dask for such a conversion, but I don't think that would get us anywhere significant
  • it'd be great if Reductionist encoded it differently - perhaps as an array or Bytes struct, so it doesn't decode to a list
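To illustrate the last point, here is a rough standard-library sketch of why a packed bytes field is cheaper to hold than a decoded list of Python ints. The 64-bit packing and the sample values are assumptions made for illustration, not Reductionist's actual or planned encoding, and list_footprint is a hypothetical helper.

```python
import sys
from array import array


def list_footprint(values):
    """Approximate bytes held by a list of ints: the list itself plus every int object."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


n = 100_000
# Hypothetical counts, in the form a decoded CBOR array takes: a list of Python ints.
counts_as_list = [1000 + i for i in range(n)]

# The same values packed as signed 64-bit integers ('q' is 8 bytes per item).
counts_as_bytes = array("q", counts_as_list).tobytes()

list_bytes = list_footprint(counts_as_list)
packed_bytes = sys.getsizeof(counts_as_bytes)

# The packed form round-trips without ever building a Python list.
restored = array("q")
restored.frombytes(counts_as_bytes)

print(f"list of ints: {list_bytes} bytes")
print(f"packed bytes: {packed_bytes} bytes")
```

On the client side a packed count could then be read the same way the result already is, with np.frombuffer on the bytes field, instead of going through a list first.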


Labels: enhancement
