What is a Time Series Database (TSDB)?
Shimon Ben David. July 21, 2020
What is a time series database?
Many scenarios require storing massive amounts of data that arrive at a very high rate. For example, a manufacturing facility may have hundreds of assembly line machines, each sending its status and activity logs every second. Another example is an autonomous car whose sensors, cameras, radar, lidar and other equipment log its activity on the road for later analysis. A third example is financial markets: trading companies, hedge funds and stock exchanges are constantly updated with real-time market data and activity, including every transaction made for every product (stocks, bonds, etc.), and this data is used for trading decisions.
In the above scenarios the data itself is very important, but equally important is the timestamp of when it was generated. This timestamp allows us to get insights from the data in near real time or later on in batch processing. For example, the faster we can understand what a financial market is doing as it happens, the faster we can react to it and gain value. Likewise, the faster we can go over large amounts of historical market data, the faster we can verify whether a prediction model we created is valid.
This is where time series databases (TSDBs) come in. These are databases (see reference to what is structured and unstructured data) that are specifically designed and optimized for data that arrives at high rates together with its timestamp. This optimization is done in several ways. First, a TSDB uses a more efficient representation of a timestamp that takes less space than in a regular relational database, which can save a large amount of storage capacity. Second, a TSDB can write millions of records per second, a rate that would usually overload most non-time-series databases. Finally, unlike regular relational databases, which need to be generic and allow sorting and querying by many different columns, keys and indexes, TSDBs are built specifically for querying and sorting data by its timestamp, and are therefore much faster and more efficient than relational databases at that task.
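To make the two ideas above more concrete, here is a minimal Python sketch. It assumes delta encoding as the compact timestamp representation (one common technique, not necessarily what any particular TSDB does internally) and uses a binary search over sorted timestamps to answer a time range query. The function names and the sample sensor readings are hypothetical and exist only for illustration.

```python
import struct
from bisect import bisect_left, bisect_right


def delta_encode(timestamps):
    """Keep the first timestamp in full, then only the gap to the previous one."""
    if not timestamps:
        return []
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original timestamps with a running sum of the deltas."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def range_query(timestamps, values, start, end):
    """Return (timestamp, value) pairs with start <= timestamp <= end.

    Assumes timestamps are sorted ascending, so binary search finds
    both bounds without scanning every record.
    """
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return list(zip(timestamps[lo:hi], values[lo:hi]))


# Hypothetical sample: one sensor reading per second (Unix epoch seconds).
timestamps = [1595289600 + i for i in range(5)]
values = [20.1, 20.3, 20.2, 20.6, 20.4]

deltas = delta_encode(timestamps)

# Raw layout: every timestamp as an 8-byte integer.
raw = struct.pack(f"<{len(timestamps)}q", *timestamps)
# Encoded layout: one 8-byte integer, then 1 byte per gap
# (assumption: every gap fits in a single unsigned byte).
packed = struct.pack(f"<q{len(deltas) - 1}B", deltas[0], *deltas[1:])

print(len(raw), "bytes raw vs", len(packed), "bytes delta-encoded")
print(range_query(timestamps, values, 1595289601, 1595289603))
```

With regularly spaced readings the gaps are small and repetitive, which is why they can be stored in far fewer bytes than full timestamps. Keeping records ordered by time is what lets the range query jump straight to the matching slice.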
Another difference is that time series databases usually do not use the standard SQL query language found in relational databases. Instead, they usually have their own query language optimized for time-based queries.
There are many time series databases, and they vary in many aspects. Some are mostly in-memory databases while others are not. Each can scale to different sizes and capacities. Some require additional frameworks to be maintained. Some are open source while others are proprietary. Each represents the data with different structures on the underlying storage, requires either distributed filesystems or block devices, and behaves differently on each. As a result, they also differ in overall performance and management.
Some of the most commonly used time series databases* include InfluxDB, KDB+ by KX, Prometheus and Graphite.
* Source is https://db-engines.com/en/ranking/time+series+dbms