Zero-failure replication service

Requires EVA ICS Enterprise.

The zero-failure replication service solves a typical IoT problem: real-time data is lost when the pub/sub target is offline or a source temporarily has no connection to the pub/sub server.

The service provides a second replication layer in addition to the Replication service. It guarantees that all telemetry data is transferred to the target node, unless the data is deleted as expired.

The service helps to fill gaps in logs, charts or any other kind of archive data representation, collection or analysis.

Zero-failure replication schema

The service can work in 3 roles (only one can be defined in the deployment config):

Service roles

Collector

Collects real-time data for local items and stores it in blocks of the subscribed mailboxes. Each mailbox must be named after the remote node that collects its data.

The mailbox blocks use a compact, crash-resistant format with a serialize+CRC32 scheme, which allows processing all available frames in a block until a broken one is detected.
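
The on-disk layout of a block is not described here, so the following is only a minimal sketch of the general idea. It assumes a hypothetical frame layout (4-byte length, 4-byte CRC32, payload) and shows why per-frame checksums let a reader recover every frame written before a broken one. The names HEADER, write_frame and read_frames are illustrative and are not part of the service.

```python
import struct
import zlib

# Hypothetical frame layout (NOT the actual zfrepl format):
# 4-byte little-endian payload length, 4-byte CRC32 of the payload, payload.
HEADER = struct.Struct("<II")


def write_frame(payload: bytes) -> bytes:
    """Serialize a single frame with its length and CRC32 checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def read_frames(block: bytes) -> list[bytes]:
    """Return all valid frames, stopping at the first broken one."""
    frames = []
    pos = 0
    while pos + HEADER.size <= len(block):
        length, crc = HEADER.unpack_from(block, pos)
        start = pos + HEADER.size
        payload = block[start:start + length]
        # a truncated or corrupted frame ends processing of the block
        if len(payload) != length or zlib.crc32(payload) != crc:
            break
        frames.append(payload)
        pos = start + length
    return frames


# three frames; the last one is cut short, as after a crash in mid-write
block = write_frame(b"t=1 v=20.5") + write_frame(b"t=2 v=20.7")
block += write_frame(b"t=3 v=20.9")[:-3]
recovered = read_frames(block)
```

In this sketch the first two frames are recovered and the truncated third one is skipped, which mirrors the behaviour described above.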

Telemetry data is known to compress well, so it is highly recommended to compress blocks when they are transferred (the service client applies BZIP2 compression automatically).
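
To get a feeling for how well such data compresses, the sketch below applies BZIP2 from the Python standard library to synthetic telemetry. The frame fields and values are made up for illustration and do not reflect the real block format, so the resulting ratio is not representative of any particular deployment.

```python
import bz2
import json

# synthetic telemetry: repetitive keys and slowly changing values
# (an assumption made for illustration, real data and ratios vary)
frames = [
    {
        "oid": f"sensor:plant/temp{i % 10}",
        "status": 1,
        "value": 20 + (i % 7) * 0.1,
        "t": 1656445625 + i,
    }
    for i in range(5000)
]
raw = "\n".join(json.dumps(f) for f in frames).encode()

compressed = bz2.compress(raw)
ratio = len(raw) / len(compressed)

# compression is lossless: the original bytes are restored exactly
restored = bz2.decompress(compressed)
print(f"raw={len(raw)} compressed={len(compressed)} ratio={ratio:.1f}x")
```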

Additionally, if replication blocks are lost but there is a history database service on a local node (e.g. InfluxDB state history or SQL databases state history), the collector may be asked to fill a mailbox with blocks from the database (see mailbox.fill).

The service performing the collector role is always online.

Replicator

Allows setting up mailbox replication based on a flexible custom schedule (e.g. every minute, at night only etc.).

Automatically collects replication blocks from remote nodes and pushes them to the local bus replication archive topic (ST/RAR/<OID>).

Requires a Pub/Sub server (PSRT or MQTT). Both the source and the target node must share the same API key. The key is used only to check access to the mailbox mapped in the service configuration and can have an empty ACL. Although usually deployed together with the Replication service, the replicator uses a dedicated connection (or a dedicated server).

Transfers blocks compressed and encrypted.

Warning

The replicator role MUST be deployed on the same machine as the collector.

The replicator client may fetch both prepared-to-replicate blocks and the current collector block. In the latter case, the block is forcibly rotated. This means that if the mailbox replication schedule is continuous, the replication frequency is nearly equal to the configured block request interval.
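
The fetch-or-rotate behaviour can be pictured with a toy in-memory model. This is a sketch under assumptions, not the service's implementation: the Mailbox class and its methods are hypothetical, blocks are rotated by size only, and a block is dropped as soon as it is handed out, which is a simplification.

```python
from collections import deque


class Mailbox:
    """Toy in-memory model of a collector mailbox (hypothetical, for illustration)."""

    def __init__(self, max_block_size: int):
        self.max_block_size = max_block_size
        self.ready = deque()   # blocks prepared for replication, oldest first
        self.current = []      # frames of the block being written
        self.current_size = 0

    def push(self, frame: bytes) -> None:
        """Append a frame, rotating first if the block would exceed the limit."""
        if self.current and self.current_size + len(frame) > self.max_block_size:
            self.rotate()
        self.current.append(frame)
        self.current_size += len(frame)

    def rotate(self) -> None:
        """Close the current block and move it to the ready queue."""
        if self.current:
            self.ready.append(self.current)
            self.current = []
            self.current_size = 0

    def get_block(self):
        """Return the oldest ready block; if none, force-rotate the current one."""
        if not self.ready:
            self.rotate()
        return self.ready.popleft() if self.ready else None


mbox = Mailbox(max_block_size=10)
for frame in (b"aaaa", b"bbbb", b"cccc"):
    mbox.push(frame)

first = mbox.get_block()   # a prepared block
second = mbox.get_block()  # the current block, forcibly rotated
third = mbox.get_block()   # nothing left to fetch
```

With a continuous schedule, every request that finds no prepared block rotates the current one, so new data leaves the node roughly once per request interval.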

The service performing the replicator role is automatically restarted on pub/sub failures.

Standalone

Allows importing manually copied blocks only (see process_dir).

To process the block directory manually, use:

eva svc call eva.zfrepl.1.replicator \
    process_dir path=/path/to/blocks node=SOURCE_NAME delete=true
# or using the bus CLI client
/opt/eva4/sbin/bus /opt/eva4/var/bus.ipc rpc call eva.zfrepl.1.replicator \
    process_dir path=/path/to/blocks node=SOURCE_NAME delete=true

The service performing the standalone role is always online.

Recommendations

  • Large blocks may cause database service data-flooding on target nodes. Make sure these services have enough resources and a sufficient bus queue size.

  • Keep data blocks small (2-3 MB). Telemetry data typically compresses about 10x, but the ratio may vary depending on the setup.

  • If a large number of blocks is generated, increase the block_ttl_sec field of the collector mailbox.

  • mailbox.fill may cause significant disk/event queue overhead. Make sure the collector service has:

    • enough bus queue

    • enough file ops queue

  • If heavy network load from large amounts of real-time data is expected (e.g. equipment connected to the node is being reconfigured), a service running under the replicator role may be temporarily disabled:

eva svc call eva.zfrepl.1.replicator disable
# or using the bus CLI client
/opt/eva4/sbin/bus /opt/eva4/var/bus.ipc rpc call eva.zfrepl.1.replicator disable

When disabled, the service stops all local replication client tasks (which must be later triggered either by schedulers or manually) and forbids serving blocks via pub/sub for external clients. Other methods and tasks are not affected.

To re-enable the service, repeat the above command with the “enable” method or restart the service.

Untrusted nodes and zero-failure replication

The approach is similar to real-time replication: by default remote zero-failure replication mailboxes are trusted, which means all remotes can provide telemetry data for all items.

To set up zero-failure replication with an untrusted node, mark its mailbox with “trusted: false” in the replicator/client section of the service configuration and make sure the configured API key has an ACL with “write” permission for the allowed items.
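
The effect of the trusted flag can be sketched as a filter applied to incoming frames. The code below is an illustration only: oid_matches is a simplified mask matcher written for this example (it assumes "#" matches any remaining part and "+" matches one path segment), and filter_frames, the frame layout and the mask list are hypothetical, not the service's own ACL logic.

```python
def oid_matches(mask: str, oid: str) -> bool:
    """Simplified OID mask match: '#' = any remaining part, '+' = one segment."""
    if mask == "#":
        return True
    mask_kind, _, mask_path = mask.partition(":")
    oid_kind, _, oid_path = oid.partition(":")
    if mask_kind not in ("+", oid_kind):
        return False
    mask_parts = mask_path.split("/")
    oid_parts = oid_path.split("/")
    for i, part in enumerate(mask_parts):
        if part == "#":
            return True
        if i >= len(oid_parts) or part not in ("+", oid_parts[i]):
            return False
    return len(mask_parts) == len(oid_parts)


def filter_frames(frames, trusted: bool, write_masks):
    """Keep all frames for trusted mailboxes, otherwise only ACL-allowed ones."""
    if trusted:
        return list(frames)
    return [f for f in frames if any(oid_matches(m, f["oid"]) for m in write_masks)]


frames = [
    {"oid": "sensor:env/temp", "value": 21.5},
    {"oid": "unit:door/lock", "value": 1},
]
# an untrusted node whose key may write to sensor:env/# only
accepted = filter_frames(frames, trusted=False, write_masks=["sensor:env/#"])
```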

Setup

Use the template EVA_DIR/share/svc-tpl/svc-tpl-zfrepl.yml:

# EVA ICS zero-failure replication service
command: svc/eva-zfrepl
workers: 2
bus:
  path: var/bus.ipc
config:
  # the service can work in three roles:
  #
  # collector - collects data from the local node bus events to mailboxes,
  # always online. Must have the "collector" section
  #
  # standalone - allows only to import manually copied blocks from a local dir
  #
  # replicator - serves and collects the data from the mailboxes via pub/sub,
  # MUST be deployed on the same machine as the collector. Must have the
  # "replicator" section
  collector:
    # mailboxes location, relative to EVA_DIR or absolute. if running under a
    # restricted user account (default: eva), the directory MUST be created
    # manually and the effective account must have read/write/execute (list)
    # permissions to it
    path: runtime/zfrepl/spool
    mailboxes:
      node1:
        # max data block size (uncompressed)
        max_block_size: 2_000_000
        # block time-to-live (sec) before creating a new block
        block_ttl_sec: 600
        # keep unrequested blocks for (sec)
        keep: 86400
        # file ops max queue size, if full, incoming events are dropped
        queue_size: 512
        auto_flush: false
        # periodic collection interval
        interval: null
        # ignore real-time events
        ignore_events: false
        # oids to watch
        oids:
          - "#"
        # DANGEROUS, enable for multi-level clusters only
        #replicate_remote: true
  #standalone: {}
  #replicator:
    #pubsub:
      ## mqtt or psrt
      #proto: psrt
      ## path to CA certificate file. Enables SSL if set
      #ca_certs: null
      ## single or multiple hosts
      #host:
        #- 127.0.0.1:2873
      ## if more than a single host is specified, shuffle the list before connecting
      #cluster_hosts_randomize: false
      ## user name / password auth
      #username: null
      #password: null
      #ping_interval: 10
      ## pub/sub queue size
      #queue_size: 1024
      ## pub/sub QoS (not required for PSRT)
      #qos: 1
    ## the local key service, required both to make and process API calls via PubSub
    #key_svc: eva.aaa.localauth
    #client:
      ## watch the services, if any is down, client operations are suspended
      #watch_svcs:
        #- eva.db.i1
        #- eva.db.i2
      #mailboxes:
        ## collect data from the mailbox at node_remote (mailbox name = local system name)
        #node_remote:
            ## API key, required to open the mailbox
            #key_id: default
            ## a cron-like schedule, when the client is triggered:
            ## second minute hour day month weekday year
            ##
            ## the year field can be omitted
            ## to run the task every N, use */N
            #schedule: "* * * * * *"
            ## block requests interval (sec). it is recommended to set the interval
            ## lower than block ttl on the remote node collector
            #interval: 30
            ## client session duration (sec). after the specified period of time the
            ## client stops, until triggered again manually or by the scheduler
            #duration: 3600
            #timeout: 60 # override the default timeout
            #trusted: true
    #server:
      ## collector service
      #collector_svc: eva.zfrepl.default.collector
      #mailboxes:
        ## mailbox for the node_remote
        #node_remote:
          ## API key, required to open the mailbox
          #key_id: default
user: eva
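
The client schedule field in the template uses seven cron-like fields (second minute hour day month weekday year) with an optional year. The sketch below shows how such an expression could be read. It is an illustration only: it supports just "*", "*/N" and plain numbers, leaves out weekday matching because weekday numbering differs between cron flavours, and the function names are hypothetical.

```python
from datetime import datetime

FIELDS = ("second", "minute", "hour", "day", "month", "weekday", "year")


def parse_schedule(expr: str) -> dict:
    """Split a schedule into named fields; the year field may be omitted."""
    parts = expr.split()
    if len(parts) == len(FIELDS) - 1:
        parts.append("*")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def field_matches(expr: str, value: int) -> bool:
    """Support only '*', '*/N' and a plain number (a subset chosen for brevity)."""
    if expr == "*":
        return True
    if expr.startswith("*/"):
        return value % int(expr[2:]) == 0
    return value == int(expr)


def is_due(expr: str, moment: datetime) -> bool:
    """Check whether the schedule fires at the given moment."""
    fields = parse_schedule(expr)
    if fields["weekday"] != "*":
        # weekday numbering is not specified here, so it is left out
        raise NotImplementedError("weekday matching is not covered by this sketch")
    values = {
        "second": moment.second, "minute": moment.minute, "hour": moment.hour,
        "day": moment.day, "month": moment.month, "year": moment.year,
    }
    return all(field_matches(fields[name], value) for name, value in values.items())


every_second = is_due("* * * * * *", datetime(2022, 6, 28, 12, 0, 0))
every_5_min = is_due("0 */5 * * * *", datetime(2022, 6, 28, 12, 10, 0))
nightly = is_due("0 0 2 * * *", datetime(2022, 6, 28, 14, 0, 0))
```

Here "* * * * * *" corresponds to a continuous schedule, while "0 0 2 * * *" would trigger the client once a day at 02:00:00.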

Create the service using eva-shell:

eva svc create eva.zfrepl.N.collector|replicator /opt/eva4/share/svc-tpl/svc-tpl-zfrepl.yml

or using the bus CLI client:

cd /opt/eva4
cat DEPLOY.yml | ./bin/yml2mp | \
    ./sbin/bus ./var/bus.ipc rpc call eva.core svc.deploy -

(see eva.core::svc.deploy for more info)

EAPI methods

See EAPI commons for the common information about the bus, types, errors and RPC calls.

client.start

Description  [replicator] Trigger mailbox client startup
Parameters   required
Returns      nothing

Parameters

Name  Type    Description   Required
i     String  Mailbox name  yes

disable

Description  [replicator] Disable replication and kill all running tasks
Parameters   none
Returns      nothing

enable

Description  [replicator] Enable replication
Parameters   none
Returns      nothing

mailbox.delete_block

Description  [collector] Delete a block
Parameters   required
Returns      nothing

Parameters

Name      Type    Description   Required
i         String  Mailbox name  yes
block_id  String  Block ID      yes

mailbox.fill

Description  [collector] Fill blocks from a local database service
Parameters   required
Returns      nothing

Parameters

Name     Type             Description                                          Required
i        String           Mailbox name                                         yes
db_svc   String           Database service name                                yes
t_start  f64              Starting timestamp (default: last 24 hours)          no
t_end    f64              Ending timestamp (default: now)                      no
xopts    Map<String,Any>  Extra options, passed to the database service as-is  no
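
A small helper can assemble a mailbox.fill call for a given time window. This is a sketch under assumptions: it reuses the "eva svc call SERVICE METHOD key=value" form shown earlier for process_dir, assumes that numeric timestamps can be passed the same way, and uses placeholder service, mailbox and database names. The function fill_command is hypothetical.

```python
import shlex
import time


def fill_command(svc: str, mailbox: str, db_svc: str, hours: float, now: float = None) -> list:
    """Build an `eva svc call ... mailbox.fill` argument list for the last N hours."""
    t_end = time.time() if now is None else now
    t_start = t_end - hours * 3600
    return [
        "eva", "svc", "call", svc, "mailbox.fill",
        f"i={mailbox}", f"db_svc={db_svc}",
        f"t_start={t_start}", f"t_end={t_end}",
    ]


# service, mailbox and database names below are placeholders
cmd = fill_command("eva.zfrepl.1.collector", "node1", "eva.db.i1", hours=6, now=1656445625.0)
print(shlex.join(cmd))
```

Limiting the window with t_start and t_end keeps the number of generated blocks, and therefore the disk and queue load on the collector, under control.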

mailbox.get_block

Description  [collector] Get a ready-to-replicate block
Parameters   required
Returns      Block or nothing

Parameters

Name  Type    Description   Required
i     String  Mailbox name  yes

Return payload example:

{
    "block_id": "mbb_1656445625",
    "last": false,
    "path": "/opt/eva4/runtime/zfrepl/spool/rtest1/mbb_1656445625"
}

mailbox.list_blocks

Description  [collector] List ready-to-replicate blocks
Parameters   required
Returns      Block list

Parameters

Name  Type    Description   Required
i     String  Mailbox name  yes

Return payload example:

[
    {
        "block_id": "mbb_1656445625",
        "path": "/opt/eva4/runtime/zfrepl/spool/rtest1/mbb_1656445625",
        "size": 2983121
    },
    {
        "block_id": "mbb_1656445635",
        "path": "/opt/eva4/runtime/zfrepl/spool/rtest1/mbb_1656445635",
        "size": 2916
    }
]
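
A payload like the one above is easy to post-process, for example to watch how much data is waiting in a mailbox. The sketch below works on the example payload only. The summarize function is hypothetical, and reading the numeric suffix of block_id as a creation timestamp is an assumption based on how the example IDs look.

```python
import json

payload = json.loads("""
[
    {
        "block_id": "mbb_1656445625",
        "path": "/opt/eva4/runtime/zfrepl/spool/rtest1/mbb_1656445625",
        "size": 2983121
    },
    {
        "block_id": "mbb_1656445635",
        "path": "/opt/eva4/runtime/zfrepl/spool/rtest1/mbb_1656445635",
        "size": 2916
    }
]
""")


def summarize(blocks: list) -> dict:
    """Count pending blocks, sum their sizes and find the oldest one."""
    total = sum(b["size"] for b in blocks)
    # assumption: the block ID suffix is a Unix timestamp
    oldest = min(blocks, key=lambda b: int(b["block_id"].split("_", 1)[1]))
    return {"count": len(blocks), "total_bytes": total, "oldest": oldest["block_id"]}


summary = summarize(payload)
print(summary)
```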

mailbox.rotate

Description  [collector] Forcibly rotate the current mailbox block
Parameters   required
Returns      nothing

Parameters

Name  Type    Description   Required
i     String  Mailbox name  yes

process_dir

Description  [replicator/standalone] Process blocks from a local dir
Parameters   required
Returns      nothing

Parameters

Name    Type    Description                                         Required
path    String  Local path                                          yes
node    String  Source node name (any if not important)             yes
delete  bool    Delete processed blocks (r/w permissions required)  no

status

Description  [replicator] Replication status
Parameters   none
Returns      Status payload

Return payload example:

{
    "active_clients": ["node1"],
    "enabled": true
}
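
The status payload can be compared with the list of configured client mailboxes to spot those without a running client. This is a sketch only: inactive_clients is a hypothetical helper, and it assumes that active_clients lists the mailboxes whose client session is currently running.

```python
def inactive_clients(status: dict, configured: list) -> list:
    """Return configured mailboxes that have no running client."""
    if not status["enabled"]:
        # a disabled service runs no client tasks at all
        return sorted(configured)
    active = set(status["active_clients"])
    return sorted(m for m in configured if m not in active)


status = {"active_clients": ["node1"], "enabled": True}
# "node2" is a placeholder for a second configured mailbox
idle = inactive_clients(status, ["node1", "node2"])
```

An idle client is not necessarily a fault, since clients stop after their session duration and wait for the next scheduled or manual trigger.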