ClickHouse Basics
ClickHouse is a columnar DBMS that handles wide tables with many columns without the usual overhead, which makes it practical to store data in denormalized form.
Where ClickHouse is used
ClickHouse works effectively in projects of any scale, from MVPs to the analytical systems of large corporations. It is used in:
- Financial technologies
- Scientific research
- Analytical services (marketing, marketplaces)
- Warehouse and manufacturing operations (analysis of operations, equipment metrics)
ClickHouse is suitable for
- Fast queries over non-aggregated statistics
- Storing and reading logs
- SIEM (Security Information and Event Management)
- Storage and processing of denormalized data
- Storage of telemetry, metrics for clients and devices
- Storing statistics (for example, a data source in Grafana)
When ClickHouse is not needed
- Complex JOIN queries between large tables (they require a lot of memory and data transfer)
- Point lookups of individual rows with low latency
- OLTP load with frequent changes and real-time data updates
ClickHouse is designed to take advantage of available memory for caching frequently used data. For analytical queries over large volumes of data, response times are shorter than in most other databases.
Key Features of ClickHouse
- Column storage: The data of each column is stored in a separate file. When reading, only the necessary columns are decompressed, which speeds up the work.
- Vector engine: processing data by vectors (pieces of columns) using SIMD instructions and efficient CPU caching.
- Natural parallelization: large queries are automatically parallelized across all available server resources.
- Working in single-node and cluster configurations: allows you to scale data processing.
- Scalability: Adding new nodes to process petabytes of data.
- Decentralization and fault tolerance: Asynchronous multi-master replication with automatic data recovery.
- OLAP-oriented: real-time analytical data processing, without support for OLTP transactionality.
What are SIMD instructions and why are they needed in ClickHouse
SIMD (Single Instruction, Multiple Data) is a technology in which a processor can perform the same operation on multiple data simultaneously.
How it works
- Usually, the processor processes data one element at a time (for example, adds two numbers).
- With SIMD, the processor can, for example, add up 4 or 8 pairs of numbers at once in one command.
- This is achieved through special instructions and extended processor registers.
In ClickHouse, the use of SIMD instructions is especially important because of its columnar data storage structure. Since data is stored in columns, and analytical queries often require performing the same operation on a large number of values in one column (for example, filtering, aggregation), SIMD instructions can significantly speed up these calculations. Instead of the processor processing each value separately, SIMD allows it to perform a single operation on multiple values at once. For example, if you need to add two columns, each of which contains 1000 numbers, the processor will perform 1000 addition operations without SIMD. Using SIMD, when the processor can add, say, 8 pairs of numbers at the same time, the total number of operations will be reduced to 125, which significantly increases the processing speed.
ClickHouse operating modes
One node (Single-node)
- Easy installation and launch of the service.
- Suitable if:
- No fault tolerance required
- No need for horizontal scaling
- The amount of data fits on one server
- CPU and RAM resources are sufficient for the load
Cluster
- Requires ZooKeeper™ (3 or 5 nodes) or ClickHouse Keeper for coordination.
- ZooKeeper is Apache's coordination service, written in Java.
- ClickHouse Keeper is a ZooKeeper-compatible alternative.
- The coordinator stores cluster configuration, metadata, task queues, and information about replicated tables. It allows you to configure replication and sharding for scaling and increased fault tolerance.
- MPP support (executing a single query on multiple nodes at the same time).
- Replication and sharding can be used together or separately.
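For example, a replicated and sharded layout might be declared as in the sketch below. The cluster name my_cluster and the {shard} and {replica} macros are assumptions: they must match the remote_servers and macros sections of your server configuration.
-- One replicated table per shard; replicas coordinate through (Zoo)Keeper.
CREATE TABLE events ON CLUSTER my_cluster
(
    event_date Date,
    user_id UInt32,
    payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (event_date, user_id)

-- A Distributed table routes inserts to shards and fans queries out to them.
CREATE TABLE events_all ON CLUSTER my_cluster AS events
ENGINE = Distributed(my_cluster, currentDatabase(), events, rand())
Queries against events_all are executed on every shard in parallel (MPP), and the partial results are merged on the node that received the query.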
Differences between ClickHouse and other DBMS
- Optimized for OLAP tasks, not OLTP.
- Column storage and vector processing of data.
- There is no support for real-time transactions.
- High performance for analytical queries on large amounts of data.
Connection interfaces
ClickHouse supports two protocols:
Native TCP:
- Used in the command line client and for server-to-server communication
- Has less overhead
- Implemented in C++, Go, Python
HTTP:
- More versatile, implemented in almost all languages
- More limited compared to the native interface
- Has more overhead and lower performance
Both interfaces support TLS encryption.
Tools for connecting
Officially supported:
- Console client
- JDBC driver
- ODBC driver
- C++ Client Library
For most programming languages, there are libraries that implement both Native and HTTP connections.
Working with ClickHouse via DBeaver
DBeaver is a versatile tool with a graphical interface for working with various databases, including ClickHouse. To connect to ClickHouse via DBeaver:
Installing DBeaver:
- Download and install DBeaver from the official website (dbeaver.io)
- Versions are available for Windows, macOS, and Linux
Creating a connection:
- Click "New Connection" in the menu or on the toolbar
- Select "ClickHouse" from the list of available databases
- Enter the connection parameters:
- Host (ClickHouse server address)
- Port (default 8123 for HTTP or 9000 for TCP)
- Database (database name)
- Username and password
- Configure SSL/TLS for secure connection if necessary
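After saving the connection, it is worth verifying it with a trivial query in DBeaver's SQL editor, for example:
SELECT version()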
Creating tables in ClickHouse
As with most databases, ClickHouse logically groups tables into databases. Use the CREATE DATABASE command to create a new database in ClickHouse:
CREATE DATABASE IF NOT EXISTS helloworld
Similarly, use CREATE TABLE to define a new table. (If you do not specify a database name, the table is created in the default database.) The following table, called my_first_table, lives in the helloworld database:
CREATE TABLE helloworld.my_first_table
(
user_id UInt32,
message String,
timestamp DateTime,
metric Float32
)
ENGINE = MergeTree()
PRIMARY KEY (user_id, timestamp)
In the example above, my_first_table is a MergeTree table with four columns:
- user_id: a 32-bit unsigned integer (UInt32)
- message: a string (String)
- timestamp: a DateTime value that represents a point in time
- metric: a 32-bit floating-point number (Float32)
A brief introduction to primary keys
Before proceeding, it is important to understand how primary keys work in ClickHouse (the implementation of primary keys may seem unexpected!):
Attention
Primary keys in ClickHouse are not unique for each row in the table
The primary key of a ClickHouse table determines how the data is sorted when it is written to disk. Every 8,192 rows or 10 MB of data (called the index granularity) creates an entry in the primary key index file. This concept of granularity produces a sparse index that fits easily into memory, and the granules are a stripe of the smallest amount of column data that is processed during SELECT queries.
The primary key can be defined using the PRIMARY KEY parameter. If you define a table without specifying a PRIMARY KEY, the key becomes the tuple specified in the ORDER BY clause. If you specify both PRIMARY KEY and ORDER BY, the primary key must be a prefix of the sort order.
The primary key is also the sorting key, which here is the tuple (user_id, timestamp). Thus, the data stored in each column file will be sorted by user_id, then by timestamp.
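As an illustration, the sketch below (events_by_user is a hypothetical table name, and index_granularity is shown at its default value purely for illustration) indexes only a prefix of its sort order:
CREATE TABLE helloworld.events_by_user
(
    user_id UInt32,
    timestamp DateTime,
    message String
)
ENGINE = MergeTree()
-- The sparse index is built on user_id only...
PRIMARY KEY (user_id)
-- ...while the data on disk is sorted by user_id, then timestamp.
ORDER BY (user_id, timestamp)
SETTINGS index_granularity = 8192
A smaller primary key keeps the sparse index compact, while the wider ORDER BY still co-locates related rows on disk.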
Inserting data into ClickHouse
You can use the familiar INSERT INTO command with ClickHouse. Let's insert some data into the table we created.
INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
(101, 'Hello, ClickHouse!', now(), -1.0 ),
(102, 'Insert a lot of rows per batch', yesterday(), 1.41421 ),
(102, 'Sort your data based on your commonly-used queries', today(), 2.718 ),
(101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159 )
To make sure everything worked, run the following SELECT query:
SELECT * FROM helloworld.my_first_table
ClickHouse is an OLAP (Online Analytical Processing) system, designed specifically for analytics over large volumes of data with high speed and scalability. Thanks to its architecture, ClickHouse can ingest millions of rows per second. This level of performance is achieved through parallel data processing and efficient column compression. However, unlike classic transactional databases, ClickHouse does not guarantee immediate data consistency: the system is oriented toward eventual consistency and is optimized for workloads that primarily append new data.
In contrast, OLTP (Online Transaction Processing) databases such as PostgreSQL are designed for transactional work and provide strict ACID guarantees. PostgreSQL, for example, implements MVCC (Multi-Version Concurrency Control), which allows it to process concurrent transactions and maintain multiple versions of data simultaneously. Such databases are a great fit for scenarios where instant data consistency and complex modification operations matter, but as a trade-off, data insertion can be slower, especially at large volumes.
To maximize performance when inserting data into ClickHouse, it is important to take into account its features and adhere to certain recommendations. Following these rules allows you to avoid typical errors that may occur when trying to use ClickHouse in the same way as transactional OLTP databases, and to build an insertion process that is optimal for analytical workloads.
Best practices for inserting data into ClickHouse
Insert in large batches
Each INSERT creates a separate data part, so it is better to send fewer large batches (from 1,000 to 100,000 rows at a time) than many small ones. This reduces the load on the file system and improves performance.
Use asynchronous inserts for small batches
If it is not possible to collect large batches on the client side (for example, in monitoring systems), use ClickHouse's asynchronous inserts (see the sketch after this list). In this mode, data is first buffered and then flushed to storage in one piece, which allows small inserts to be aggregated efficiently on the server.
Ensure that repeated inserts are idempotent
ClickHouse inserts are idempotent by default: resending the same data does not lead to duplication if the order and content match. This is important for reliability in case of network failures.
Insert directly into MergeTree or replicated tables
It is best to insert data directly into MergeTree tables (or their replicated versions), distributing the load across shard nodes. For distributed tables, it is recommended to enable internal_replication=true.
Use the native format for maximum performance
For the fastest inserts, use ClickHouse's native data format, which minimizes the overhead of parsing and conversion. RowBinary is an alternative (optimal for row-oriented data), and JSONEachRow is convenient for quick integration but carries higher parsing costs.
Use the official ClickHouse client
Official clients for popular languages support optimal insertion modes, including asynchronous inserts.
Use the HTTP interface if necessary
ClickHouse supports an HTTP interface for inserting and reading data, which is convenient for load balancing and integration with external systems. However, the native protocol is usually somewhat faster.
Know the limitations of asynchronous inserts
Buffered data is unavailable to queries until it is flushed to the main storage. The buffer flush parameters can be configured.
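As a sketch, asynchronous insertion can be enabled per query through settings (async_insert and wait_for_async_insert are standard ClickHouse settings; the row values here are illustrative):
-- Buffer this insert on the server and flush it together with others.
INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric)
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (103, 'Buffered on the server before flushing', now(), 0.577)
With wait_for_async_insert = 1 the statement returns only after the buffer is flushed into a part; setting it to 0 acknowledges as soon as the data reaches the buffer.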
Selecting data
ClickHouse supports the SELECT queries you are already familiar with. For example:
SELECT *
FROM helloworld.my_first_table
ORDER BY timestamp
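Aggregations use the same familiar syntax. For example, a per-user summary over the sample rows inserted earlier (the column aliases are illustrative):
-- Count messages and average the metric for each user.
SELECT
    user_id,
    count() AS messages,
    avg(metric) AS avg_metric
FROM helloworld.my_first_table
GROUP BY user_id
ORDER BY user_id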
Updating data
Use the ALTER TABLE ... UPDATE command to update rows in a table:
ALTER TABLE [<database>.]<table> UPDATE <column> = <expression> WHERE <filter_expr>
<expression> is the new value for the column in rows where <filter_expr> holds. The <expression> must have the same data type as the column, or be convertible to it using the CAST operator. <filter_expr> must return a UInt8 value (zero or non-zero) for each row of data. Multiple UPDATE <column> statements can be combined in a single ALTER TABLE command, separated by commas.
Example:
ALTER TABLE helloworld.my_first_table
UPDATE message = 'Updated message', metric = metric * 2
WHERE user_id = 102
Deleting data
Use the ALTER TABLE ... DELETE command to delete rows:
ALTER TABLE [<database>.]<table> DELETE WHERE <filter_expr>
<filter_expr> must return a UInt8 value for each row of data.
Example:
ALTER TABLE helloworld.my_first_table
DELETE WHERE user_id = 101
Lightweight deletes
Another option for deleting rows is the DELETE FROM command, known as a lightweight delete. Deleted rows are immediately marked as deleted and automatically filtered out of all subsequent queries, so you do not have to wait for parts to merge or use the FINAL keyword. Data cleanup happens asynchronously in the background.
DELETE FROM [db.]table [ON CLUSTER cluster] [WHERE expr]
Example:
DELETE FROM helloworld.my_first_table
WHERE user_id = 102