Question about production ML

A speech-AI pipeline needs fast analytical queries over training-data processing events. What requirements would you give DevOps before asking for ClickHouse?

Answer it yourself

First formulate your answer as you would in an interview, then open the breakdown and grade yourself.

Specify data volume, write pattern, query patterns, retention, schema, latency/SLA, backup/replication, access control, monitoring and expected growth. A database request should be an engineering spec, not just a tool name.
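One way to treat the request as an engineering spec is to write it down as a structured form and check which items are still unanswered before sending it to DevOps. The sketch below is only an illustration: the class `ClickHouseRequest`, its field names and the helper `missing_fields` are hypothetical, chosen to mirror the list above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ClickHouseRequest:
    """Hypothetical request form; every field name here is illustrative."""
    events_per_day: Optional[int] = None
    avg_row_bytes: Optional[int] = None
    write_pattern: Optional[str] = None          # e.g. "append-only, batched"
    query_patterns: Optional[List[str]] = None   # typical filters and aggregations
    retention_days: Optional[int] = None
    schema_draft: Optional[str] = None
    latency_sla_ms: Optional[int] = None
    backup_replication: Optional[str] = None
    access_control: Optional[str] = None
    monitoring: Optional[str] = None
    yearly_growth_factor: Optional[float] = None


def missing_fields(req: ClickHouseRequest) -> List[str]:
    """Return the names of spec items that have not been filled in yet."""
    return [f.name for f in fields(req) if getattr(req, f.name) is None]


if __name__ == "__main__":
    draft = ClickHouseRequest(events_per_day=50_000_000, retention_days=180)
    print("Still to specify:", missing_fields(draft))
```

A draft that only names the tool leaves every field empty, which is exactly the situation the answer warns against.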

Full breakdown

Start with the workload. Estimate events per day, average row size, retention period, peak ingest rate, the share of late-arriving data, and whether writes are append-only. Describe the query patterns: aggregations by model, dataset, language, processing stage, failure reason and time window. ClickHouse is a good fit when these queries are large analytical scans and aggregations rather than point lookups or frequent row updates.

Then specify the operational requirements: single node or cluster, replication, backups, disaster recovery, access control, network access, monitoring, disk size, expected compression, TTL/lifecycle policy and expected growth. If the data drives model-training decisions, restore and audit requirements are part of the spec.

Finally, propose a schema with partition and sort keys. For time-series pipeline events, partitioning by date and sorting by dataset/model/stage/time is often reasonable, but the sort key should follow the most common filters. DevOps can size the system only if these assumptions are explicit.
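The workload numbers and the schema proposal can be turned into something DevOps can check directly. The sketch below is a rough illustration under stated assumptions: the figures (50 million events per day, 200-byte rows, 180-day retention, 5x compression, 2 replicas, 1.5x growth, 3x peak-to-average ingest) are made up, the table and column names are hypothetical, and the compression ratio in particular has to be measured on real data. The DDL uses the standard ClickHouse `MergeTree` clauses `PARTITION BY`, `ORDER BY` and `TTL`.

```python
from typing import List


def estimate_storage_gb(events_per_day: int, avg_row_bytes: int,
                        retention_days: int, compression_ratio: float,
                        replicas: int, growth_factor: float = 1.0) -> float:
    """Disk needed across all replicas, in GB (10^9 bytes)."""
    raw_bytes = events_per_day * avg_row_bytes * retention_days
    compressed_bytes = raw_bytes / compression_ratio
    return compressed_bytes * replicas * growth_factor / 1e9


def peak_rows_per_second(events_per_day: int, peak_to_average: float) -> float:
    """Average ingest rate scaled by an assumed peak-to-average factor."""
    return events_per_day / 86_400 * peak_to_average


def render_ddl(table: str, sort_key: List[str], retention_days: int) -> str:
    """Render a draft CREATE TABLE statement; all columns are illustrative."""
    columns = [
        "event_time DateTime",
        "dataset LowCardinality(String)",
        "model LowCardinality(String)",
        "language LowCardinality(String)",
        "stage LowCardinality(String)",
        "failure_reason LowCardinality(String)",
        "duration_ms UInt32",
    ]
    column_sql = ",\n".join(f"    {c}" for c in columns)
    return (
        f"CREATE TABLE {table}\n(\n{column_sql}\n)\n"
        "ENGINE = MergeTree\n"
        # Daily partitions as in the text; toYYYYMM(event_time) is a
        # common alternative when retention is long.
        "PARTITION BY toDate(event_time)\n"
        f"ORDER BY ({', '.join(sort_key)})\n"
        f"TTL event_time + INTERVAL {retention_days} DAY"
    )


if __name__ == "__main__":
    gb = estimate_storage_gb(50_000_000, 200, 180,
                             compression_ratio=5, replicas=2, growth_factor=1.5)
    print(f"Estimated disk: {gb:.0f} GB")
    print(f"Peak ingest: {peak_rows_per_second(50_000_000, 3):.0f} rows/s")
    print(render_ddl("pipeline_events",
                     ["dataset", "model", "stage", "event_time"], 180))
```

Putting the sort key in a parameter keeps the point from the text visible: if most queries filter by language or failure reason first, the `ORDER BY` tuple should change, and the rest of the draft stays the same.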