
Job tracking #31

@nevali

Description


Optionally support a libsql connection URI which will be used to track jobs as they are processed by twine-writerd or twine-cli.

A job consists of:

  • A UUID to identify it
  • Optionally, a parent UUID
  • A URI to identify it (this may simply be a urn:uuid: representation of the job UUID if nothing else is suitable; otherwise it will be the canonical source or target URI, depending upon the processing pipeline, and workflow components may update it accordingly during processing)
  • Timestamps for added and updated
  • A status: WAITING, ACTIVE, ABORTED (by the user), COMPLETE, FAILED, ERRORS (partial failure)
  • A status annotation (free-text) which may be set to indicate the failure reason
  • If active, the cluster/instance details of the node processing the job (preserved for diagnosis once set)
  • Processing item x of y progress indicators (particularly for bulk ingests from filesystem sources)
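The fields above could be captured in a table along the following lines. This is only a sketch: the column names, types, and constraints are assumptions rather than an actual libtwine schema, and SQLite stands in for libsql here.

```python
import sqlite3
import uuid

# Hypothetical schema for the job table described above; names are
# illustrative assumptions, not the real libtwine schema.
SCHEMA = """
CREATE TABLE job (
    uuid TEXT PRIMARY KEY,              -- UUID identifying the job
    parent TEXT REFERENCES job(uuid),   -- optional parent job UUID
    uri TEXT NOT NULL,                  -- canonical URI (may be urn:uuid:...)
    added TEXT NOT NULL,                -- creation timestamp
    updated TEXT NOT NULL,              -- last-modified timestamp
    status TEXT NOT NULL CHECK (status IN
        ('WAITING','ACTIVE','ABORTED','COMPLETE','FAILED','ERRORS')),
    note TEXT,                          -- free-text status annotation
    node TEXT,                          -- cluster/instance processing the job
    progress_item INTEGER,              -- "item x ..."
    progress_total INTEGER              -- "... of y"
);
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

# Create a job whose URI is the urn:uuid: form of its own UUID, as the
# description allows when no better URI is available.
job_id = str(uuid.uuid4())
db.execute(
    "INSERT INTO job (uuid, uri, added, updated, status) "
    "VALUES (?, ?, datetime('now'), datetime('now'), 'WAITING')",
    (job_id, "urn:uuid:" + job_id),
)
```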

UUIDs should, where possible, be taken from the source if it incorporates one into its identification, or generated on the fly if this is not possible.

A job stack should be maintained internally to libtwine in order to track parent/child relationships, rather than requiring it to be made explicit.
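The internal job stack could be as simple as a LIFO: whichever job is on top when a new job is created becomes that job's implicit parent. A minimal sketch (the class and method names are assumptions, not libtwine API):

```python
# Sketch of an internal job stack: the topmost job implicitly becomes
# the parent of any job created while it is being processed.
class JobStack:
    def __init__(self):
        self._stack = []

    def push(self, job_uuid):
        # Entering processing of a job: it becomes the implicit parent.
        self._stack.append(job_uuid)

    def pop(self):
        # Processing of the topmost job is complete.
        return self._stack.pop()

    def current_parent(self):
        # Parent UUID for any newly-created job, or None at top level.
        return self._stack[-1] if self._stack else None

stack = JobStack()
assert stack.current_parent() is None
stack.push("ingest-job")          # hypothetical UUID of an ingest run
assert stack.current_parent() == "ingest-job"
```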

As an example, an ingest of N-Quads from a file, processing with spindle-correlate might yield the following:

  • A job is created in state WAITING with a newly-generated UUID and a file:/// URI
  • The N-Quads are parsed and the number of graphs determined; the job is updated to state ACTIVE, with progress set to 0 of number-of-graphs
  • For each graph that is correlated by Spindle, progress is updated, and a new child job is created in state WAITING, using the Spindle-generated UUID and URI
  • Once processing of the N-Quads is complete, the job status is updated to COMPLETE
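The ingest walkthrough above can be simulated end-to-end. This sketch uses plain dicts in place of database rows, a hypothetical file:/// path, and freshly-generated UUIDs standing in for the Spindle-generated ones:

```python
import uuid

def ingest(graph_uris):
    # A job is created in state WAITING with a new UUID and a file:/// URI
    # (the path here is purely illustrative).
    job = {
        "uuid": str(uuid.uuid4()),
        "uri": "file:///data/example.nq",
        "status": "WAITING",
        "children": [],
    }
    # The N-Quads are parsed and the number of graphs determined; the job
    # becomes ACTIVE with progress 0 of number-of-graphs.
    total = len(graph_uris)
    job["status"] = "ACTIVE"
    job["progress"] = (0, total)
    # For each correlated graph, progress is updated and a child job is
    # created in state WAITING (UUIDs here stand in for Spindle's).
    for n, graph_uri in enumerate(graph_uris, start=1):
        job["children"].append({
            "uuid": str(uuid.uuid4()),
            "uri": graph_uri,
            "status": "WAITING",
            "parent": job["uuid"],
        })
        job["progress"] = (n, total)
    # Once processing is complete, the job is marked COMPLETE.
    job["status"] = "COMPLETE"
    return job

job = ingest(["http://example.com/g1", "http://example.com/g2"])
```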

As spindle-generate later processes its queue of items, it performs the following:

  • A job is created in state WAITING using the Spindle-generated UUID and URI; if it already exists, its parentage is preserved (thus, if the job originated from an ingest as described above, the proxy-generation step maintains the parent-child relationship, allowing for ready visualisation)
  • As the proxy is generated, its status is updated accordingly

With this arrangement, a small number of relatively simple SQL queries can result in progress tracking and volumetrics across a processing cluster.
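One of the "relatively simple SQL queries" this enables is a status breakdown across all jobs, the basis for both progress tracking and volumetrics. A sketch against SQLite, with a deliberately minimal table (names are assumptions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job (uuid TEXT PRIMARY KEY, status TEXT NOT NULL)")
db.executemany("INSERT INTO job VALUES (?, ?)", [
    ("j1", "COMPLETE"),
    ("j2", "COMPLETE"),
    ("j3", "ACTIVE"),
    ("j4", "FAILED"),
])

# Volumetrics: how many jobs are in each state across the cluster.
counts = dict(db.execute(
    "SELECT status, COUNT(*) FROM job GROUP BY status"
).fetchall())
```

A similar GROUP BY over the node column would give per-instance load, and a join on parent would roll child progress up into the parent job.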

Open question: how would Twine know when to preserve versus replace the parent of a job?

Perhaps it could be as simple as user action (i.e., twine-cli) taking precedence over an ongoing process: a queue-driven twine-writerd will only set the parent of a job if it is newly created, whereas twine-cli will always override it. Both would create an overarching job for their processing runs, whether that's from a file or a queue.
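That precedence rule maps neatly onto a single upsert: insert the job with its parent, and on conflict only overwrite the parent when the caller is twine-cli. A sketch, again with SQLite standing in for libsql and an assumed two-column table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job (uuid TEXT PRIMARY KEY, parent TEXT)")

def ensure_job(db, job_uuid, parent, from_cli):
    # New jobs always get the parent; for existing jobs, the WHERE clause
    # on the upsert means only twine-cli (from_cli=True) overrides it.
    db.execute(
        "INSERT INTO job (uuid, parent) VALUES (?, ?) "
        "ON CONFLICT(uuid) DO UPDATE SET parent = excluded.parent "
        "WHERE ?",
        (job_uuid, parent, from_cli),
    )

def parent_of(db, job_uuid):
    return db.execute(
        "SELECT parent FROM job WHERE uuid = ?", (job_uuid,)
    ).fetchone()[0]

ensure_job(db, "job-1", "queue-run", from_cli=False)  # new: parent set
ensure_job(db, "job-1", "other-run", from_cli=False)  # existing: preserved
ensure_job(db, "job-1", "cli-run", from_cli=True)     # user action: overrides
```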

Tracked as RESDATA-1279
