June 2024 Newsletter

Welcome to the June ClickHouse newsletter, which will round up what’s happened in real-time data warehouses over the last month.

This month, we have the Dynamic data type in the 24.5 release, why HyperDX chose ClickHouse over Elasticsearch for observability data, and how to use ClickHouse to count unique users at scale.

 


Featured Community Member: Michael Driscoll

This month’s featured community member is Michael Driscoll, Co-Founder and CEO at Rill Data.

Michael has worked in the tech industry for two decades as a technologist, entrepreneur, and investor. Over the years, he has founded several companies, including Metamarkets, a real-time analytics platform for digital ad firms, which Snap, Inc. acquired in 2017.

His latest company is Rill, a cloud service for operational intelligence. The Rill and ClickHouse worlds collided when Michael met Alexey, ClickHouse’s Co-Founder and CTO, at FOSDEM earlier this year.

Alexey suggested running Rill on top of a ClickHouse-powered data set of Wikipedia traffic. Michael and his team got this working in a couple of days, and Michael joined the 24.2 Community Call to share Rill’s connector for ClickHouse. Michael also presented at the ClickHouse San Francisco meetup two weeks ago.

Follow Michael on LinkedIn

 


24.5 release

The journey to add a semi-structured data type to ClickHouse continues with the introduction of the Dynamic type. This release also brought performance improvements for CROSS JOINs and the ability to read files inside archives stored on S3.

Read the release post

 

Why HyperDX Chose Clickhouse Over Elasticsearch for Storing Observability Data

Michael Shi works on HyperDX, an open-source observability platform built on OpenTelemetry and ClickHouse. In this blog post, he explains why they use ClickHouse rather than Elasticsearch, pointing out that observability has become more of an analytics problem than a search problem. He identifies ClickHouse’s columnar data layout and sparse indexes as key differentiators.

Read the blog post

 

Python User-Defined Functions in ClickHouse

Tom Weisner has written a tutorial on using Python user-defined functions in ClickHouse. He starts with a simple function that reverses a string before moving on to a multi-argument function that adds minutes or hours to a provided DateTime. He concludes with a function that detects elevated heart rate activity in time-series data with help from numpy and scipy.

Read the blog post
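
To give a flavor of the first two functions described above, here is a minimal sketch of our own (it is not the code from the tutorial). It assumes the script would be registered as an executable UDF that receives rows in TabSeparated format on standard input and writes one result per line to standard output. The names reverse_string, add_time, and run are hypothetical, and the sketch ignores TabSeparated escaping of special characters.

```python
import io
from datetime import datetime, timedelta

# Assumed text form of a DateTime value, e.g. "2024-06-01 23:30:00".
FMT = "%Y-%m-%d %H:%M:%S"


def reverse_string(row: str) -> str:
    """Single-argument handler: return the input string reversed."""
    return row[::-1]


def add_time(row: str) -> str:
    """Multi-argument handler: row is 'datetime<TAB>amount<TAB>unit'."""
    ts, amount, unit = row.split("\t")
    if unit not in ("minutes", "hours"):
        raise ValueError(f"unsupported unit: {unit}")
    shifted = datetime.strptime(ts, FMT) + timedelta(**{unit: int(amount)})
    return shifted.strftime(FMT)


def run(handler, stream_in, stream_out) -> None:
    """Apply handler to each input line and write one result per line."""
    for line in stream_in:
        stream_out.write(handler(line.rstrip("\n")) + "\n")
        stream_out.flush()  # flush per row so the caller is not left waiting


# Small in-memory demonstration; no ClickHouse server is involved here.
demo_out = io.StringIO()
run(add_time, io.StringIO("2024-06-01 23:30:00\t45\tminutes\n"), demo_out)
print(demo_out.getvalue(), end="")
```

In a real UDF script, the entry point would call run with sys.stdin and sys.stdout instead of the in-memory streams used for the demonstration.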

 

Tweeq Data Platform: Journey and Lessons Learned: Clickhouse, dbt, Dagster, and Superset

Tweeq is a fintech startup building a highly scalable and flexible payments platform from scratch. ClickHouse is the data warehouse, and Tweeq uses the Kafka table engine to ingest data. In this blog post, Atheer Alabdullatif explains how they chose ClickHouse and the other tools that form part of the data platform.

Read the blog post

 

Using ClickHouse to count unique users at scale

Twilio Engage is an omnichannel customer engagement tool that lets users define customers’ journeys. The team wanted to show their users the overall stats per journey and provide more accurate step-level stats. Their approach worked well for all users except those storing vast amounts of data. In the blog post, they explain how they solved this problem using semantic sharding and the distributed_group_by_no_merge setting, and by reducing the size of grouping keys in the database.

Read the blog post
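
The idea behind semantic sharding can be illustrated with a small sketch of our own (it is not Twilio's implementation, and it simulates shards in memory rather than querying ClickHouse). The assumption is that every row for a given user is routed to the same shard. Under that assumption, no user is counted on two shards, so per-shard unique counts can simply be added together instead of merging the underlying sets, which is the property that makes skipping the merge step safe. The function names and the CRC32-based routing are hypothetical choices for illustration.

```python
import zlib
from collections import defaultdict


def shard_for(user_id: str, num_shards: int) -> int:
    """Route a user to a shard deterministically (hypothetical routing rule)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_shards


def unique_users_per_journey(events, num_shards: int) -> dict:
    """Count unique users per journey by summing per-shard unique counts."""
    # Each simulated shard keeps its own journey -> set of user ids.
    shards = [defaultdict(set) for _ in range(num_shards)]
    for journey_id, user_id in events:
        shards[shard_for(user_id, num_shards)][journey_id].add(user_id)

    # A user lives on exactly one shard, so the per-shard counts are
    # disjoint and can be added without merging the sets themselves.
    totals = defaultdict(int)
    for shard in shards:
        for journey_id, users in shard.items():
            totals[journey_id] += len(users)
    return dict(totals)


events = [
    ("welcome", "u1"), ("welcome", "u2"), ("welcome", "u1"),
    ("winback", "u2"), ("winback", "u3"), ("winback", "u3"),
]
print(unique_users_per_journey(events, num_shards=4))
```

If rows for the same user could land on different shards, the sums would overcount, and the sets (or their aggregate states) would have to be merged in one place instead.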

 

ClickHouse as part of the ETL/ELT process

Nikolai Potapov discusses the different ways in which ClickHouse can transform data in a data pipeline. We learn about parameterized views, materialized views, and various table engines.

Read the blog post

 

Post of the month

Our favorite post this month was by Pascal Senn, who’s having a great time working with ClickHouse.

Read the post