Troubleshooting Tools and Tips
Kafka Queue: Consumer Group Message Lag
You can use the logs shown below to identify issues with message processing and other parts of the TBMQ infrastructure. Since Kafka is used for MQTT message processing and for other major parts of the system, such as client sessions, client subscriptions, retained messages, etc., you can analyze the overall state of the broker.
TBMQ provides the ability to monitor whether the rate of producing messages to Kafka is faster than the rate of consuming and processing them. In such cases, you will experience a growing latency for message processing. To enable this functionality, ensure that Kafka consumer-stats are enabled (see the queue.kafka.consumer-stats section of the Configuration properties).
Once Kafka consumer-stats are enabled, logs (see Troubleshooting) about offset lag for consumer groups will be generated.
Here is an example of the log message:
2022-11-27 02:33:23,625 [kafka-consumer-stats-1-thread-1] INFO o.t.m.b.q.k.s.TbKafkaConsumerStatsService - [msg-all-consumer-group] Topic partitions with lag: [[topic=[tbmq.msg.all], partition=[2], lag=[5]]].
From this message, we can see that five messages were pushed to partition 2 of the tbmq.msg.all topic but have not yet been processed.
In general, the logs have the following structure:
TIME [STATS_PRINTING_THREAD_NAME] INFO o.t.m.b.q.k.s.TbKafkaConsumerStatsService - [CONSUMER_GROUP_NAME] Topic partitions with lag: [[topic=[KAFKA_TOPIC], partition=[KAFKA_TOPIC_PARTITION], lag=[LAG]],[topic=[ANOTHER_TOPIC], partition=[], lag=[]],...].
Where:
- CONSUMER_GROUP_NAME - name of the consumer group that is processing messages.
- KAFKA_TOPIC - name of the exact Kafka topic.
- KAFKA_TOPIC_PARTITION - number of the topic's partition.
- LAG - the number of unprocessed messages.
NOTE: Logs about consumer lag are printed only if there is a lag for this consumer group.
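You can also cross-check the reported lag directly against Kafka using the standard kafka-consumer-groups.sh tool that ships with Kafka (the bootstrap server address below is an assumption; adjust it to your deployment). The consumer group name is taken from the example log above:

kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group msg-all-consumer-group

The LAG column of the output should match the lag values reported in the TBMQ logs.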
CPU/Memory Usage
Sometimes, a problem arises due to a lack of resources for a particular service.
You can view CPU and memory usage by logging into your server/container/pod and executing the top Linux command.
For more convenient monitoring, it is better to configure Prometheus and Grafana.
If you see that a service is regularly using 100% of the CPU, you should either scale it horizontally by creating new nodes in the cluster or scale it vertically by increasing the amount of CPU available to it.
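For a quick one-off check, the following commands are a sketch of the typical options (the tbmq container name is an assumption; kubectl top requires the metrics-server to be installed in the cluster):

# On a plain server
top
# For a Docker container
docker stats tbmq
# For a Kubernetes pod
kubectl top pods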
Logs
Reading Logs
Regardless of the deployment type, TBMQ logs are stored in the following directory:
/var/log/thingsboard-mqtt-broker
Different deployment tools provide different ways to view logs:
For docker-compose deployments, view the latest logs at runtime:
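For example (assuming the broker service is named tbmq in your docker-compose.yml; the same name is assumed in the following examples):

docker compose logs -f tbmq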
You can use the grep command to show only the output containing a desired string. For example, you can use the following command to check whether there are any errors on the backend side:
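docker compose logs tbmq | grep ERROR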
Tip: you can redirect the logs to a file and then analyze them with any text editor:
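docker compose logs tbmq > tbmq.log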
Note: you can always log into the TBMQ container and view the logs there:
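A sketch (the path pattern is an assumption based on the logback configuration shown later in this guide, where each node writes under its TB_SERVICE_ID):

docker compose exec tbmq bash
tail -f /var/log/thingsboard-mqtt-broker/*/thingsboard-mqtt-broker.log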
For kubernetes deployments, view all pods of the cluster:
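For example (assuming kubectl is configured to point at the TBMQ cluster and namespace):

kubectl get pods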
View the latest logs for the desired pod:
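kubectl logs -f <pod-name>

where <pod-name> is a placeholder for the pod you are interested in.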
To view the TBMQ logs, use the command:
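For example (the tb-broker-0 pod name is an assumption; use the actual name from kubectl get pods; the same name is assumed in the following examples):

kubectl logs -f tb-broker-0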
You can use the grep command to show only the output containing a desired string. For example, you can use the following command to check whether there are any errors on the backend side:
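kubectl logs tb-broker-0 | grep ERROR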
If you have multiple nodes, you can redirect the logs from all nodes to files on your machine and then analyze them:
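kubectl logs tb-broker-0 > tb-broker-0.log
kubectl logs tb-broker-1 > tb-broker-1.log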
Note: you can always log into the TBMQ container and view the logs there:
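kubectl exec -it tb-broker-0 -- bash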
Enabling Certain Logs
To facilitate troubleshooting, TBMQ allows users to enable or disable logging for specific parts of the system. This can be achieved by modifying the logback.xml file, which is located in the following directory:
/usr/share/thingsboard-mqtt-broker/conf
Please note that there are separate files for k8s and Docker deployments.
Here’s an example of the logback.xml configuration:
<!DOCTYPE configuration>
<configuration scan="true" scanPeriod="10 seconds">
    <appender name="fileLogAppender"
              class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/thingsboard-mqtt-broker/${TB_SERVICE_ID}/thingsboard-mqtt-broker.log</file>
        <rollingPolicy
                class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>/var/log/thingsboard-mqtt-broker/${TB_SERVICE_ID}/thingsboard-mqtt-broker.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>3GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <logger name="org.thingsboard.mqtt.broker.actors.client.service.connect" level="TRACE"/>
    <logger name="org.thingsboard.mqtt.broker.actors.client.service.disconnect.DisconnectServiceImpl" level="INFO"/>
    <logger name="org.thingsboard.mqtt.broker.actors.DefaultTbActorSystem" level="OFF"/>

    <root level="INFO">
        <appender-ref ref="fileLogAppender"/>
    </root>
</configuration>
The configuration files contain loggers which are the most useful for troubleshooting, as they allow you to enable or disable logging for a certain class or group of classes.
In the example given above, the default logging level is set to INFO, which means that the logs will contain general information, warnings, and errors.
However, for the org.thingsboard.mqtt.broker.actors.client.service.connect package, the most detailed logging level (TRACE) is enabled. You can also completely disable logs for a part of the system, as is done for the org.thingsboard.mqtt.broker.actors.DefaultTbActorSystem class using the OFF log level.
To enable or disable logging for a certain part of the system, you need to add the appropriate <logger> configuration and wait for up to 10 seconds for it to be picked up (see the scanPeriod setting in the example above).
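For example, to get detailed DEBUG output from a particular package (the logger name here is illustrative; substitute the class or package you want to inspect), add a line such as:

<logger name="org.thingsboard.mqtt.broker.service" level="DEBUG"/>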
Different deployment tools have different ways to update logs:
For docker-compose deployments, the logback.xml configuration is mapped into the container from the host machine, so you can edit the file on the host and the changes will be picked up.
For kubernetes deployments, the ConfigMap kubernetes entity is used to provide the tb-broker pods with the logback configuration. So, in order to update logback.xml, you need to edit the corresponding ConfigMap and apply the change. After 10 seconds, the changes should be applied to the logging configuration.
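A sketch of that flow (the ConfigMap name below is an assumption; check the actual name with the first command):

kubectl get configmaps
kubectl edit configmap tb-broker-logback-config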
Metrics
To enable Prometheus metrics in TBMQ you must:
- Set the STATS_ENABLED environment variable to true.
- Set the METRICS_ENDPOINTS_EXPOSE environment variable to prometheus in the configuration file.
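For example, in the .env file or environment section used by your deployment, this could look like the following (a sketch):

STATS_ENABLED=true
METRICS_ENDPOINTS_EXPOSE=prometheus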
The metrics can then be accessed via the following path: https://<yourhostname>/actuator/prometheus, and scraped by Prometheus (authentication is not required).
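A minimal Prometheus scrape configuration for this endpoint might look like the following (a sketch; the job name and the target host and port are assumptions, adjust them to your deployment):

scrape_configs:
  - job_name: 'tbmq'
    scheme: https
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['<yourhostname>:8083']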
Prometheus metrics
The Spring Actuator in TBMQ can expose some internal state metrics to Prometheus.
Here is a list of the metrics that TBMQ pushes to Prometheus:
TBMQ-specific metrics:
- incomingPublishMsg_published (statsNames - totalMsgs, successfulMsgs, failedMsgs): stats about incoming Publish messages to be persisted in the general queue.
- incomingPublishMsg_consumed (statsNames - totalMsgs, successfulMsgs, timeoutMsgs, failedMsgs, tmpTimeout, tmpFailed, successfulIterations, failedIterations): stats about the processing of incoming Publish messages from the general queue.
- deviceProcessor (statsNames - successfulMsgs, failedMsgs, tmpFailed, successfulIterations, failedIterations): stats about DEVICE client message processing.
Some stats descriptions:
- failedMsgs: the number of messages that failed to be persisted in the database and were discarded afterwards
- tmpFailed: the number of messages that failed to be persisted in the database and were reprocessed later
- appProcessor (statsNames - successfulPublishMsgs, successfulPubRelMsgs, tmpTimeoutPublish, tmpTimeoutPubRel, timeoutPublishMsgs, timeoutPubRelMsgs, successfulIterations, failedIterations): stats about APPLICATION client messages processing.
Some stats descriptions:
- tmpTimeoutPubRel: the number of PubRel messages that timed out and were reprocessed later
- tmpTimeoutPublish: the number of Publish messages that timed out and were reprocessed later
- timeoutPubRelMsgs: the number of PubRel messages that timed out and were discarded afterwards
- timeoutPublishMsgs: the number of Publish messages that timed out and were discarded afterwards
- failedIterations: iterations of processing message packs where at least one message was not processed successfully
- appProcessor_latency (statsNames - puback, pubrec, pubcomp): stats about the APPLICATION processor latency for different message types.
- actors_processing (statsNames - MQTT_CONNECT_MSG, MQTT_PUBLISH_MSG, MQTT_PUBACK_MSG, etc.): stats about the average actor processing time for different message types.
- clientSubscriptionsConsumer (statsNames - totalSubscriptions, acceptedSubscriptions, ignoredSubscriptions): stats about the client subscriptions read from Kafka by the broker node.
Some stats descriptions:
- totalSubscriptions: total number of new subscriptions added to the broker cluster
- acceptedSubscriptions: number of new subscriptions persisted by the broker node
- ignoredSubscriptions: the number of subscriptions ignored because they had already been processed by the broker node
- retainedMsgConsumer (statsNames - totalRetainedMsgs, newRetainedMsgs, clearedRetainedMsgs): stats about retained message processing.
- subscriptionLookup: stats about the average time of client subscription lookups in the trie data structure.
- retainedMsgLookup: stats about the average time of retained message lookups in the trie data structure.
- clientSessionsLookup: stats about the average time of client session lookups from the client subscriptions found for a publish message.
- notPersistentMessagesProcessing: stats about the average time to process message delivery for non-persistent clients.
- persistentMessagesProcessing: stats about the average time to process message delivery for persistent clients.
- delivery: stats about the average time of message delivery to clients.
- subscriptionTopicTrieSize: stats about the client subscription count in the trie data structure.
- subscriptionTrieNodes: stats about the node count in the client subscription trie.
- retainMsgTrieSize: stats about the retained message count in the trie data structure.
- retainMsgTrieNodes: stats about the node count in the retained message trie.
- lastWillClients: stats about the count of clients with a last will message.
- connectedSessions: stats about the connected session count.
- allClientSessions: stats about the count of all client sessions.
- clientSubscriptions: stats about the client subscription count in the in-memory map.
- retainedMessages: stats about the retained message count in the in-memory map.
- activeAppProcessors: stats about the active APPLICATION processor count.
- activeSharedAppProcessors: stats about the active APPLICATION processor count for shared subscriptions.
- runningActors: stats about the running actor count.
PostgreSQL-specific metrics:
- sqlQueue_UpdatePacketTypeQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about updating the types of persisted packets in the database.
- sqlQueue_DeletePacketQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about deleting persisted packets from the database.
- sqlQueue_TimeseriesQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about persisting historical stats to the database.
Please note that, in order to achieve maximum performance, TBMQ uses several queues (threads) for each of the queue types specified above.
Getting help
The best way to contact our engineers and share your ideas with them is through our Gitter channel.
Q&A forum: For community support, we recommend visiting our user forum. It's a great place to connect with other users and find solutions to common issues.
Stack Overflow: The ThingsBoard team actively monitors posts tagged with "thingsboard" on Stack Overflow. If you can't find an existing question that addresses your issue, feel free to ask a new one. Our team will be happy to assist you.
If you are unable to find a solution to your problem from any of the guides provided above, please do not hesitate to contact the ThingsBoard team for further assistance.