Labels¶

As already mentioned this system is based on labels to know what metrics to get and what rules to apply. This labels must be written in spec.template.metadata.labels within the deployment yaml file.

Overscaler labels¶

In addition to metrics and rules it is also necessary to add some extra labels for the correct operation of the system.

app: Stateful Set name.

overscaler: “true” or “false”, active or deactivate overscaler in this Stateful set.

current-count: Rescaling counter. During monitoring, this value is reduced until 0, then is possible to rescale.

autoscaler-count: Value to be assigned in “current-count” after rescaling.

min-replicas: Maximum number of replicas for this stateful set.

max-replicas: Minimum number of replicas for this stateful set.

rescaling: Flag to know when a Stateful Set is being rescaled.

Current-count and autoscaler-count labels play a key role. Each type of service requires a certain time after start to configure and start working in parallel with the other replicas. With these labels we guarantee that time.

Metrics¶

Overscaler is designed for a customizable monitoring through labels, adding a label for each metric to monitor, and there are different sets of node and pod metrics.

Label format:

metric-n: “metric-name”

Example:

metric-1: "cpu-usage-percent"

However, it is still possible to monitor the entire node or pod using the label “all-metrics: true”.

Node metrics¶

These metrics determine the status of the different nodes and are assigned by labels in the Google Kubernetes Engine.

Node metrics¶
Metric Name	Description
cpu-limit	Cpu hard limit in millicores.
cpu-node-capacity	Cpu capacity of a node.
cpu-node-allocatable	Cpu allocatable of a node.
cpu-node-reservation	Share of cpu that is reserved on the node allocatable.
cpu-node-utilization	Cpu utilization as a share of node allocatable.
cpu-request	Cpu request (the guaranteed amount of resources) in millicores.
cpu-usage	Cumulative cpu usage on all cores.
cpu-usage-rate	Cpu usage on all cores in millicores.
cpu-usage-percent	Cpu usage percent of total cpu Node.
memory-limit	Memory hard limit in bytes.
memory-major-page-faults	Number of major page faults.
memory-major-page-faults-rate	Number of major page faults per second.
memory-node-capacity	Memory capacity of a node.
memory-node-allocatable	Memory allocatable of a node.
memory-node-reservation	Share of memory that is reserved on the node allocatable.
memory-node-utilization	Memory utilization as a share of memory allocatable.
memory-page-faults	Number of page faults.
memory-page-faults-rate	Number of page faults per second.
memory-request	Memory request (the guaranteed amount of resources) in bytes.
memory-usage	Total memory usage.
memory-rss	RSS memory usage.
memory-working-set	Total working set usage. Working set is the memory being used and not easily dropped by the kernel.
memory-usage-percent	Memory usage percent of total memory Node.
network-rx	Cumulative number of bytes received over the network.
network-rx-errors	Cumulative number of errors while receiving over the network.
network-rx-errors-rate	Number of errors while receiving over the network per second.
network-rx-rate	Number of bytes received over the network per second.
network-tx	Cumulative number of bytes sent over the network
network-tx-errors	Cumulative number of errors while sending over the network
network-tx-errors-rate	Number of errors while sending over the network
network-tx-rate	Number of bytes sent over the network per second.
uptime	Number of milliseconds since the container was started.

Pod metrics¶

These metrics determine the status of any Pods and are assigned by labels in the different Stateful sets.

Pod metrics¶
Metric Name	Description
cpu-limit	Cpu hard limit in millicores.
cpu-request	Cpu request (the guaranteed amount of resources) in millicores.
cpu-usage-rate	Cpu usage on all cores in millicores.
cpu-usage-percent	Cpu usage percent of total node cpu.
memory-limit	Memory hard limit in bytes.
memory-major-page-faults-rate	Number of major page faults per second.
memory-page-faults-rate	Number of page faults per second.
memory-request	Memory request (the guaranteed amount of resources) in bytes.
memory-usage	Total memory usage.
memory-rss	RSS memory usage.
memory-working-set	Total working set usage. Working set is the memory being used and not easily dropped by the kernel.
memory-usage-percent	Memory usage percent of total node memory.
network-rx	Cumulative number of bytes received over the network.
network-rx-errors	Cumulative number of errors while receiving over the network.
network-rx-errors-rate	Number of errors while receiving over the network per second.
network-rx-rate	Number of bytes received over the network per second.
network-tx	Cumulative number of bytes sent over the network
network-tx-errors	Cumulative number of errors while sending over the network
network-tx-errors-rate	Number of errors while sending over the network
network-tx-rate	Number of bytes sent over the network per second.
uptime	Number of milliseconds since the container was started.

Rules¶

The rules for scaling are also assigned by labels and must have a specific syntax:

Label format:

rule-n: “metric_greater|lower_limit_scale|reduce”

metric: Previously established metrics.

greater or lower: “>” or “<” that limit.

limit: Number that establishes a limit

scale or reduce: Action to be realized when the limit is exceeded.

Example:

rule-1: "cpu-usage-percent_greater_90_scale"
rule-2: "memory-usage-percent_greater_90_scale"
rule-3: "cpu-usage-percent_lower_10_reduce"
rule-4: "memory-usage-percent_lower_10_reduce"