In this article, we explore the techniques you can use to increase the Apache Kafka retention period and store more data, and we answer questions such as: how long does Kafka keep data, what are the maximum retention time and size, and how do you change them?
Apache Kafka is a distributed messaging system that handles large volumes of data in real time. It offers high throughput and low latency for real-time data processing. However, data retention in Kafka can be a challenging issue because of the sheer volume of data involved.
How Long Does Kafka Keep Data?
By default, Kafka retains data for seven days. This is specified by the `log.retention.hours` parameter in the broker configuration, which is set to 168 hours (7 days * 24 hours). However, this is not a hard limit, and the retention period can be adjusted according to the needs of your application.
You can configure the retention period to be based on time, size, or both. This means Kafka can delete data after a specified period has elapsed, once the size of a partition's log exceeds a certain limit, or, when both are configured, as soon as either condition is met.
Change data retention time
For time-based retention, you can set the retention period using `log.retention.minutes` or `log.retention.ms` for minute-level and millisecond-level precision, respectively. If more than one is set, the most precise unit takes precedence: `log.retention.ms` wins over `log.retention.minutes`, which wins over `log.retention.hours`.
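As an illustration, here is a minimal sketch using Kafka's Java AdminClient to change the cluster-wide default retention time dynamically, without restarting brokers. The bootstrap address and the 14-day value are placeholders, and the class name is hypothetical; an empty `ConfigResource` name is how the AdminClient addresses the cluster-wide broker default.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;

public class SetDefaultRetention {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(
                Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            // An empty resource name targets the cluster-wide default for all brokers.
            ConfigResource brokers = new ConfigResource(ConfigResource.Type.BROKER, "");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("log.retention.ms", "1209600000"), // 14 days in ms
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(brokers, List.of(setRetention)))
                 .all().get(); // block until the change is applied
        }
    }
}
```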
Change data retention size
For size-based retention, you can use the `log.retention.bytes` parameter. Once the size of the log for a partition exceeds this value, Kafka starts deleting the oldest segments.
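A similar sketch, this time overriding size-based retention for a single topic via the topic-level `retention.bytes` config; the topic name `events`, the broker address, and the 1 GiB limit are placeholders.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;

public class SetTopicRetentionBytes {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(
                Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            // Topic-level size limit; note it applies per partition, not per topic.
            AlterConfigOp setBytes = new AlterConfigOp(
                    new ConfigEntry("retention.bytes", String.valueOf(1024L * 1024 * 1024)),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setBytes))).all().get();
        }
    }
}
```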
What is the Kafka max retention period?
Apache Kafka does not set a hard limit for the maximum retention period. The retention period is configured by parameters such as `log.retention.hours`, `log.retention.minutes`, or `log.retention.ms` at the broker level, and these values can be overridden at the topic level. By default, the retention period is 168 hours (7 days), but it can be extended to any length of time based on your needs and the storage capacity of your Kafka brokers; setting `log.retention.ms` (or the topic-level `retention.ms`) to -1 disables time-based deletion entirely. However, a very large value essentially turns Kafka into an unbounded data retention system, which may not be practical due to storage limitations. It is therefore important to balance the retention period against your hardware resources and the operational cost of maintaining a large amount of data for a long period.
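Before extending retention, it can help to inspect what is currently in effect. Here is a minimal sketch using the Java AdminClient's `describeConfigs` to read a topic's effective retention settings; the topic name and broker address are placeholders.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;

public class ShowTopicRetention {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(
                Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            Config config = admin.describeConfigs(List.of(topic)).all().get().get(topic);
            // Print the retention settings currently in effect for the topic,
            // including broker defaults if no topic-level override exists.
            System.out.println("retention.ms    = " + config.get("retention.ms").value());
            System.out.println("retention.bytes = " + config.get("retention.bytes").value());
        }
    }
}
```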
What is the Kafka max retention size?
Kafka doesn’t inherently have a maximum retention size limit. Size-based retention is determined by the `log.retention.bytes` configuration setting, which controls the maximum size a partition's log may reach before old segments are deleted; it can be overridden per topic via `retention.bytes`. If this setting is not defined or is set to -1, Kafka applies only time-based retention.
However, it’s important to note that this is a maximum limit for each partition of the topic, not for the entire topic or the whole Kafka cluster. The overall data a topic can retain is therefore `log.retention.bytes * number of partitions`.
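To make the arithmetic concrete: with `retention.bytes` set to 1 GiB on a topic with 12 partitions, the topic can retain roughly 12 GiB of log data. And because every replica stores a full copy of its partition, a replication factor of 3 brings the on-disk footprint to about 36 GiB across the cluster.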
If `log.retention.bytes` is not set (or is set to -1), Kafka uses only time-based retention (`log.retention.hours`, `log.retention.minutes`, or `log.retention.ms`) to decide when to delete data.
In terms of physical storage, the maximum amount of data Kafka can store is determined by the storage capacity of the Kafka brokers in your cluster. Therefore, it’s crucial to plan your storage capacity based on your data production rate, retention configurations, and the replication factor of your topics. Overloading Kafka brokers can result in data loss or a halt in data ingestion.
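A rough sizing rule: required disk ≈ ingest rate * retention period * replication factor. For example, at 50 MB/s of incoming data with seven days of retention and a replication factor of 3, you need about 50 MB/s * 604,800 s * 3 ≈ 90 TB of cluster-wide storage, plus headroom for open segments, indexes, and operational slack.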
How To Improve Data Retention in Kafka
1. Increase Partition Size
One way to improve data retention in Kafka is to increase how much data each topic can hold. Because size-based retention applies per partition, you can either raise the per-partition limit (`retention.bytes`) or increase the number of partitions: the total a topic retains is roughly the per-partition limit multiplied by the partition count, as shown in the sketch below. However, more and larger partitions also increase overhead in the cluster, which can lead to a decrease in performance.
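For the partition-count route, here is a minimal sketch using the Java AdminClient's `createPartitions`; the topic name, target count, and broker address are placeholders.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Map;

public class AddPartitions {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(
                Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            // Raise the partition count of "events" to 12 in total.
            // Partition counts can only grow, and adding partitions changes
            // the key-to-partition mapping for keyed messages.
            admin.createPartitions(Map.of("events", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}
```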
2. Use Compression
Kafka supports message compression, which can significantly reduce the size of the messages stored in the Kafka cluster. By compressing messages, more data fits in the same amount of storage space. Compression is enabled by setting the `compression.type` configuration parameter, typically on the producer, though it also exists as a topic-level and broker-level setting.
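A minimal producer sketch with compression enabled; the broker address, topic name, and the choice of `lz4` are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Map;

public class CompressedProducer {
    public static void main(String[] args) {
        Map<String, Object> props = Map.of(
                ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092",
                ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName(),
                ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName(),
                // Batches are compressed on the producer before being sent;
                // valid values include gzip, snappy, lz4, and zstd.
                ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value"));
        } // close() flushes any pending records
    }
}
```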
3. Set Retention Policy
Kafka's retention policy determines how long messages are retained in the cluster. By default, Kafka retains messages for seven days, but this can be changed by setting the `retention.ms` configuration parameter at the topic level. Increasing the retention period increases the amount of data that Kafka stores.
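The same `incrementalAlterConfigs` pattern shown earlier applies here. A sketch that extends a hypothetical `events` topic's retention to 30 days:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;

public class ExtendTopicRetention {
    public static void main(String[] args) throws Exception {
        long thirtyDaysMs = 30L * 24 * 60 * 60 * 1000;
        try (Admin admin = Admin.create(
                Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            // retention.ms is the topic-level counterpart of log.retention.ms;
            // a value of -1 would disable time-based deletion entirely.
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", Long.toString(thirtyDaysMs)),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```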
4. Use Tiered Storage
Tiered storage is a technique that involves using different types of storage media for different types of data. For example, frequently accessed data can be stored on fast storage media, while less frequently accessed data can be stored on slower, cheaper storage media. This can help reduce the cost of storing data in Kafka while increasing the amount of data that can be retained.
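Apache Kafka itself added tiered storage (KIP-405) as an early-access feature starting with version 3.6; before that, tiering was only available in some vendor distributions. The sketch below assumes a cluster whose brokers already have remote storage enabled (e.g. `remote.log.storage.system.enable=true` with a configured RemoteStorageManager plugin); the topic name, broker address, and one-day local retention are placeholders.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;

public class EnableTopicTiering {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(
                Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            List<AlterConfigOp> ops = List.of(
                    // Offload closed log segments to the configured remote store (KIP-405).
                    new AlterConfigOp(new ConfigEntry("remote.storage.enable", "true"),
                            AlterConfigOp.OpType.SET),
                    // Keep only one day of data on local broker disks; older segments
                    // are served from remote storage until retention.ms expires.
                    new AlterConfigOp(new ConfigEntry("local.retention.ms", "86400000"),
                            AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```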
5. Increase Replication Factor
Increasing the replication factor also improves data retention in the sense of durability: multiple copies of each partition are stored on different brokers, so retained data survives even if some brokers fail. However, each additional replica multiplies disk usage and increases overhead in the cluster, which can lead to a decrease in performance.
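Raising the replication factor of an existing topic requires a partition reassignment (e.g. via the `kafka-reassign-partitions.sh` tool), but for new topics you simply set it at creation time. A minimal sketch, where the topic name, partition count, and broker address are placeholders:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(
                Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            // 12 partitions, each replicated to 3 brokers. Every extra replica
            // multiplies the disk space consumed by retained data.
            admin.createTopics(List.of(new NewTopic("events", 12, (short) 3)))
                 .all().get();
        }
    }
}
```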
Conclusion
Data retention in Kafka can be improved using different techniques: raising per-partition size limits or partition counts, using compression, tuning the retention policy, using tiered storage, and increasing the replication factor. These techniques can help reduce the cost of storing data in Kafka while increasing the amount of data that can be retained. Kafka is a powerful messaging system that can handle large volumes of data in real time, and by using these techniques, organizations can take full advantage of its capabilities.
Further Reading
- “Kafka Documentation – Retention” (https://kafka.apache.org/documentation/#retention) – This is the official documentation for Apache Kafka, which includes information on setting retention policies, managing disk space, and configuring topic-level retention.
- “Kafka Summit Talk – Tiered Storage and Backup for Apache Kafka” (https://www.youtube.com/watch?v=ryZj0eQMp6M) – This talk from the 2019 Kafka Summit covers the use of tiered storage and backups for Kafka, including the benefits of using different storage media and backup strategies for disaster recovery.