Announcing WebSemaphore V1 and new features

1. Intro

We are pleased to release the next version of WebSemaphore. Sequentially it would be v2, but because the previous version had an incomplete feature set, it is more accurately described as a Beta. The current version is fully featured, which makes it v1.

Key changes in this release:

  • Fine-grained job state control
  • High-resolution automated job timeouts
  • Job archive
  • Improved UI

Below we introduce details on each of the key changes and conclude with a brief example to show how they serve a real-life scenario.

2. Fine-grained job state control

The Beta included a way to acquire a resource and pass control to a processor via an HTTP/WebSocket callback, limiting concurrent throughput as set in the semaphore configuration. However, the only action available from there was to release the semaphore once the job was done.
This presented multiple challenges for customer setups, so we set out on a refactoring mission to address them.

Job states

The job states fall into three groups: transient, final and archived. Table 1.1 describes each state and the actions that a client/processor can take to transition a job to another state.

| Group     | Status    | Explanation                                                                           | Possible actions                                      |
|-----------|-----------|---------------------------------------------------------------------------------------|-------------------------------------------------------|
| Transient | scheduled | Accepted and waiting for resource acquisition                                         | delete                                                |
| Transient | sending   | Acquired resource, waiting to be sent                                                 | [internal short-lived state, no actions on user side] |
| Transient | inflight  | Mapping executed and job was sent to processor                                        | release, cancel, delete                               |
| Final     | done      | Job was completed and semaphore released by processor                                 | requeue, delete                                       |
| Final     | error     | Job failed during mapping or callback                                                 | requeue, reschedule, delete                           |
| Final     | timeout   | Job was sent to processor but not released before timeout                             | requeue, reschedule, release, delete                  |
| Archived  | archived  | The job is considered deleted and will be moved to an archive storage or disposed of  |                                                       |

Table 1.1 - Job Statuses mapped to actions available to clients

State control API

There is no direct way for clients to change a job’s status. Instead, actions are available that apply only to jobs in a corresponding status. Each action will transition the job to a specific status according to the next table.

| Action     | From Status          | To Status | Explanation                                                       |
|------------|----------------------|-----------|-------------------------------------------------------------------|
| acquire    |                      | scheduled | Schedule a new job by client                                      |
| requeue    | done, error, timeout | scheduled | Reprocess with a new message id at the end of the channel queue   |
| reschedule | timeout, error       | scheduled | Reprocess preserving message id and in-channel order              |
| release    | inflight, timeout    | done      | Set status to done. If in flight, this will release the lock      |
| cancel     | inflight             | archived  | Move to archived state, keep original state in lastState          |
| delete     | all except archived  |           |                                                                   |

Table 1.2 - Actions available to client mapped to Job Statuses
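
The two tables can be read as a small state machine. The following TypeScript sketch encodes the client-facing transitions listed in Table 1.2. The names `Status`, `Action`, `transitions` and `applyAction` are illustrative assumptions and not part of the WebSemaphore API. The `acquire` and `delete` actions are left out because the table does not state a source status for the former or a target status for the latter.

```typescript
// Illustrative model of Table 1.2. All names are hypothetical,
// not the WebSemaphore API.
type Status =
  | "scheduled" | "sending" | "inflight" // transient
  | "done" | "error" | "timeout"         // final
  | "archived";

type Action = "requeue" | "reschedule" | "release" | "cancel";

// For each action: the statuses it may be applied from, and the resulting status.
const transitions: Record<Action, { from: readonly Status[]; to: Status }> = {
  requeue:    { from: ["done", "error", "timeout"], to: "scheduled" },
  reschedule: { from: ["timeout", "error"],         to: "scheduled" },
  release:    { from: ["inflight", "timeout"],      to: "done" },
  cancel:     { from: ["inflight"],                 to: "archived" },
};

// Returns the status a job moves to, or throws if the action is not
// allowed for the job's current status.
function applyAction(current: Status, action: Action): Status {
  const rule = transitions[action];
  if (rule.from.indexOf(current) === -1) {
    throw new Error(`Action "${action}" is not allowed in status "${current}"`);
  }
  return rule.to;
}
```

With this model, `applyAction("inflight", "release")` yields `"done"`, while `applyAction("scheduled", "release")` throws, mirroring the rule that actions apply only to jobs in a corresponding status.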

The state diagram below visualizes the information in the tables. The black and green arrows represent the happy path that was available in the Beta version. The purple arrows represent retry actions (requeue, reschedule), and the yellow arrows represent archive/delete transitions. Finally, the red and lime arrows represent error and timeout transitions that are not available to clients directly but are useful for understanding the job state space. Note that the color codes of the transition arrows correspond to the colors in the Actions column in Table 1.2.

Diagram 1.3 - Actions mapped to Job Statuses

3. High-resolution automated job timeouts

Detailed job tracking allows for the implementation of this highly requested feature. While simple in principle, it requires all of the job state machinery described above to function.

The timeout is a setting on a semaphore that defines the maximum time a job can remain in the “inflight” state. If the processor does not release the job within that time, the job transitions to the “timeout” state. The semaphore configuration then determines whether to:

  1. Deactivate the channel
  2. Deactivate the semaphore
  3. Drop (skip) the job and continue processing

The choice will depend on the use case.
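
As a rough illustration of the timeout rule and the three options above, the sketch below checks whether an inflight job has exceeded its timeout and reports which configured policy applies. The field names (`timeoutSeconds`, `onTimeout`, `sentAtMs`) and the policy identifiers are assumptions made for this example and are not taken from the WebSemaphore configuration schema.

```typescript
// Hypothetical names throughout; not the WebSemaphore configuration schema.
type TimeoutPolicy = "deactivateChannel" | "deactivateSemaphore" | "dropJob";

interface SemaphoreConfig {
  timeoutSeconds: number;   // maximum time a job may stay inflight
  onTimeout: TimeoutPolicy; // what to do once a job times out
}

interface InflightJob {
  id: string;
  sentAtMs: number;         // when the job entered the inflight state
}

type CheckResult =
  | { status: "inflight" }
  | { status: "timeout"; apply: TimeoutPolicy };

// A job times out once it has been inflight for at least the configured
// number of seconds; the configured policy then decides what happens next.
function checkJob(job: InflightJob, config: SemaphoreConfig, nowMs: number): CheckResult {
  const elapsedMs = nowMs - job.sentAtMs;
  if (elapsedMs < config.timeoutSeconds * 1000) {
    return { status: "inflight" };
  }
  return { status: "timeout", apply: config.onTimeout };
}
```

A sequence-critical flow would probably pick one of the deactivation policies so that no later job overtakes the timed-out one, whereas a flow that tolerates gaps could drop the job and continue.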

Notes:

  1. The timeout is configured in seconds, but the actual precision of the timer is currently about 20 seconds. A timeout will fire no earlier than the configured number of seconds but can be up to 20 seconds late. For example, a 60-second timeout fires between 60 and 80 seconds after the job goes inflight. This is expected to be improved based on demand.

  2. Note that the mapping is capable of detecting “poisoned pills” and dropping such messages via custom routing.

4. Archive

In high-traffic situations it’s neither practical nor economically viable to keep a full log of all the jobs ever received in the active storage. Additionally, it’s desirable to delete irrelevant messages altogether.
The archived state of a message can be thought of as the recycle bin of an operating system. In upcoming releases it will be possible to configure the period after which jobs in the done state are automatically moved to the archive or deleted. We expect the archive to be billed as long-term-grade storage, and we are happy to hear more customer feedback on the future of this direction.

5. Improved UI

The UI has been improved for a better overall UX and to accommodate the new features. Most notably, it is now possible to track individual jobs and act on them, in accordance with the rules laid out in parts 2 and 3 above.

Kanban View

This view is expected to be most useful for real-time monitoring and debugging. The jobs are color-hashed to make it easy to visually track a job across state changes. Clicking Details on a job shows its details and the actions that can be taken, following Tables 1.1 and 1.2 above.

Fig 4.1 - Kanban View

Fig 4.2 - Job details in the Kanban View

Standard (Segmented) View

This view is expected to be most useful during manual reconciliation. The information provided is the same as in the Kanban view, but it’s more appropriate for inspecting jobs in a specific state. To that end, the actions are available on the right side of the job details in both collapsed and expanded states.

Fig 4.3 - Job list and job details in the Standard View

6. Scenario

As an abstract scenario, let’s see what can be done in a flow where the job sequence is critical. To mitigate processing errors, processor downtime and other failures:

  1. Suspend on error: use the option to suspend the channel or semaphore upon an error or timeout. This minimizes further errors while preserving the job sequence.

  2. Reconcile: use the API or the new UI console to review and requeue, reschedule or archive the failed messages, alongside any adjustments in the downstream systems.

  3. Resume: activate the channel so that processing resumes from the point where it stopped, taking into account the changes made during the reconciliation.
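
The three steps above can be sketched with a small in-memory simulation. This is not the WebSemaphore API: the `Channel` class, its methods and the `downstreamHealthy` flag are all hypothetical and exist only to show how suspending on error and rescheduling preserve the job order.

```typescript
// In-memory simulation only. Every name here is hypothetical and
// none of it is the WebSemaphore API.
type JobStatus = "scheduled" | "done" | "error";

interface Job {
  id: string;
  status: JobStatus;
}

class Channel {
  active = true;

  constructor(public jobs: Job[]) {}

  // Process scheduled jobs in order; suspend the channel on the first failure.
  run(processor: (job: Job) => void): void {
    for (const job of this.jobs) {
      if (!this.active) return;
      if (job.status !== "scheduled") continue;
      try {
        processor(job);
        job.status = "done";
      } catch (e) {
        job.status = "error";
        this.active = false; // step 1: suspend on error
      }
    }
  }

  // Step 2: reconcile by putting a failed job back in its original position.
  reschedule(id: string): void {
    const job = this.jobs.find((j) => j.id === id);
    if (!job || job.status !== "error") {
      throw new Error(`Cannot reschedule job ${id}`);
    }
    job.status = "scheduled";
  }

  // Step 3: reactivate the channel so processing can resume.
  activate(): void {
    this.active = true;
  }
}

const channel = new Channel([
  { id: "a", status: "scheduled" },
  { id: "b", status: "scheduled" },
  { id: "c", status: "scheduled" },
]);

const processed: string[] = []; // ids in the order they completed
let downstreamHealthy = false;

const processor = (job: Job): void => {
  if (job.id === "b" && !downstreamHealthy) {
    throw new Error("downstream unavailable");
  }
  processed.push(job.id);
};

channel.run(processor);   // "a" completes, "b" fails, the channel suspends
downstreamHealthy = true; // adjust the downstream system
channel.reschedule("b");  // reconcile the failed job
channel.activate();       // resume processing
channel.run(processor);   // "b" and then "c" complete, in the original order
```

Because the channel stops at the first failure, job "c" is never attempted ahead of "b", which is the property a sequence-critical flow depends on.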

7. Conclusion

For more details on the new API and configuration options please head to the docs.
We hope users find these features a key enabler on their journey to stable and resilient distributed systems.