Pass the interview: Diagnocat: ML System Design

Question 1 (10 min)

BatchNorm in training, inference, and multi-GPU

Answer without hints

First say your answer aloud or outline it in bullet points.

Write a draft

Formulas, solution plan, risks, and examples.

Compare with the walkthrough

Open the walkthrough only after your own attempt.

Short answer

BatchNorm normalizes activations with per-channel batch statistics, then applies a learned scale gamma and shift beta. At inference it uses running statistics. In multi-GPU training, SyncBN must aggregate enough raw statistics to compute the global mean and variance.

Detailed walkthrough

For image tensors, BatchNorm usually computes per-channel mean and variance across the batch and spatial dimensions. It normalizes by subtracting the mean and dividing by sqrt(variance + eps), then applies learned scale gamma and shift beta. There are 2C learned parameters for C channels.
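The per-channel computation described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `batchnorm_train`, the nested-list layout (N, C, S) with S as flattened spatial positions, and the toy values are all assumptions made for the example.

```python
import math

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm on a nested list of shape (N, C, S).

    S is the flattened spatial dimension. gamma and beta hold one learned
    scale and one learned shift per channel (2C parameters in total).
    """
    n_channels = len(x[0])
    out = [[None] * n_channels for _ in x]
    for c in range(n_channels):
        # Pool every value of channel c across the batch and spatial positions.
        values = [v for sample in x for v in sample[c]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)  # biased estimate
        inv_std = 1.0 / math.sqrt(var + eps)  # divide by std, not by variance
        for i, sample in enumerate(x):
            out[i][c] = [gamma[c] * (v - mean) * inv_std + beta[c] for v in sample[c]]
    return out

# Toy batch: 2 samples, 2 channels, 2 spatial positions each.
x = [
    [[1.0, 2.0], [10.0, 20.0]],
    [[3.0, 4.0], [30.0, 40.0]],
]
y = batchnorm_train(x, gamma=[1.0, 2.0], beta=[0.0, 5.0])
```

After the call, channel 0 of `y` has mean close to 0 and variance close to 1, while channel 1 has mean close to 5 and standard deviation close to 2, matching its gamma and beta.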

At inference, the layer should not depend on the current small batch. It uses running mean and running variance collected during training. In multi-GPU training, ordinary BatchNorm may use only local per-device statistics. SyncBatchNorm aggregates across workers so the layer behaves closer to a larger global batch.
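The train/inference split can be illustrated with a single-channel toy layer. This is a sketch under assumptions: the class name `BatchNorm1Channel` is hypothetical, the running statistics are kept as exponential moving averages with momentum 0.1, and the running variance is updated with the unbiased estimate, which mirrors PyTorch's documented behaviour as far as I know.

```python
import math

class BatchNorm1Channel:
    """Toy single-channel BatchNorm with running statistics (illustrative only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, values):
        if self.training:
            n = len(values)
            mean = sum(values) / n
            var = sum((v - mean) ** 2 for v in values) / n  # biased, used to normalize
            # Assumption: running_var tracks the unbiased estimate.
            unbiased = var * n / (n - 1) if n > 1 else var
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * unbiased
        else:
            # Inference: ignore the current batch, use the stored statistics.
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (v - mean) + self.beta for v in values]

bn = BatchNorm1Channel()
for batch in ([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]):
    bn(batch)  # training-mode calls update the running statistics

bn.training = False
alone = bn([3.0])[0]
with_others = bn([3.0, 100.0, -50.0])[0]
```

In inference mode `alone` and `with_others` are identical: the output for a sample no longer depends on what else happens to be in the batch.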

For variance, simply averaging local variances or standard deviations is wrong in general, because it ignores differences between the local means. Implementations communicate statistics such as sums, squared sums and counts, or equivalent sufficient statistics, then derive the global mean and variance.
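A small sketch of the aggregation argument, with two hypothetical workers holding different numbers of values for one channel. The helper names are invented for the example. The formula E[x^2] - E[x]^2 is used for brevity; real implementations may exchange equivalent statistics that are numerically safer.

```python
def local_stats(values):
    """Sufficient statistics one worker would send: sum, sum of squares, count."""
    return sum(values), sum(v * v for v in values), len(values)

def global_mean_var(per_worker_stats):
    """Combine per-worker (sum, sum_sq, count) into the global mean and biased variance."""
    total = sum(s for s, _, _ in per_worker_stats)
    total_sq = sum(q for _, q, _ in per_worker_stats)
    count = sum(n for _, _, n in per_worker_stats)
    mean = total / count
    var = total_sq / count - mean * mean  # E[x^2] - E[x]^2
    return mean, var

def local_var(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

worker_a = [0.0, 1.0, 2.0]
worker_b = [10.0, 11.0, 12.0, 13.0]

mean, var = global_mean_var([local_stats(worker_a), local_stats(worker_b)])

# Reference: statistics of the concatenated data, as one large batch would see them.
all_values = worker_a + worker_b
ref_mean = sum(all_values) / len(all_values)
ref_var = sum((v - ref_mean) ** 2 for v in all_values) / len(all_values)

# Naive alternative: average the local variances. It misses the gap between local means.
naive_var = (local_var(worker_a) + local_var(worker_b)) / 2
```

Here `mean` and `var` match the reference (7 and 28), while `naive_var` is below 1 because each worker's values are tightly clustered around a different local mean.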

Common mistakes

Divide by variance instead of standard deviation.
Assume averaging local variances is always enough.
Forget running stats at inference.

How to say it in the interview

Mention gamma and beta explicitly.
For SyncBN, say what statistics must be globally aggregated.

Question 2 (12 min)

Mixed-precision training, FP16/BF16, and memory

Short answer

Autocast runs selected ops in lower precision while keeping sensitive ops or master state in higher precision. FP16 has more mantissa precision but smaller range; BF16 has FP32-like range. FP16 often needs GradScaler.

Detailed walkthrough

Mixed precision means not every operation and state tensor uses the same dtype. In PyTorch, autocast chooses lower precision for many forward operations while keeping numerically sensitive operations in safer precision. Optimizers often keep master weights or optimizer state in FP32, even when forward computation uses FP16 or BF16.
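A back-of-the-envelope sketch of why lower-precision compute does not by itself shrink parameter state. The byte counts assume Adam with two FP32 moment buffers and, in the mixed-precision case, an FP32 master copy of the weights. Both are illustrative assumptions rather than a description of any particular framework configuration, and the function name is hypothetical.

```python
def bytes_per_param(weight_bytes, grad_bytes, master_bytes, adam_state_bytes=8):
    """Approximate persistent training state per parameter, in bytes.

    adam_state_bytes=8 assumes two FP32 moment buffers (4 bytes each).
    """
    return weight_bytes + grad_bytes + master_bytes + adam_state_bytes

# Plain FP32 training: FP32 weights and gradients, no separate master copy.
fp32_training = bytes_per_param(weight_bytes=4, grad_bytes=4, master_bytes=0)

# Mixed precision: 16-bit weights and gradients plus an FP32 master copy.
mixed_training = bytes_per_param(weight_bytes=2, grad_bytes=2, master_bytes=4)
```

Under these assumptions both setups come to 16 bytes per parameter, so the memory benefit of mixed precision comes mainly from smaller activations and faster kernels rather than from parameter and optimizer state.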

FP16 has more mantissa precision than BF16 but a much smaller exponent range. BF16 has less precision but a range similar to FP32, which often makes training more stable on supported hardware. With FP16, gradient underflow is common enough that PyTorch workflows usually use GradScaler to scale the loss before backward and unscale before optimizer step.
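The range/precision tradeoff and the idea behind loss scaling can be checked with the standard library alone. This is an emulation, not how a framework does it: `to_fp16` round-trips through the `struct` half-precision format, `to_bf16` approximates bfloat16 by truncating a float32 to its top 16 bits (real hardware typically rounds instead), and the scale factor 65536 is an arbitrary illustrative choice.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Range: a tiny gradient underflows in FP16 but survives in BF16.
tiny_grad = 1e-8
fp16_tiny = to_fp16(tiny_grad)   # flushed to zero
bf16_tiny = to_bf16(tiny_grad)   # kept, with a coarse mantissa

# Loss scaling: multiply before storing in FP16, divide afterwards in higher precision.
scale = 65536.0
recovered = to_fp16(tiny_grad * scale) / scale

# Range at the top end: 1e5 exceeds the FP16 maximum; struct signals this with an error.
try:
    to_fp16(1e5)
    fp16_overflow = False
except OverflowError:
    fp16_overflow = True
bf16_large = to_bf16(1e5)

# Precision: FP16 has 10 mantissa bits, BF16 has 7.
fp16_fine = to_fp16(1.0 + 2 ** -10)   # representable exactly
bf16_fine = to_bf16(1.0 + 2 ** -10)   # collapses to 1.0
```

The scaled round trip recovers a value close to the original gradient, which is the effect GradScaler relies on; the last two lines show the opposite side of the tradeoff, where FP16 resolves a small increment that BF16 cannot.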

Other memory/speed tools are separate: activation checkpointing trades compute for activation memory, DDP/FSDP/ZeRO shard or replicate parameters and optimizer state differently, and LoRA changes which parameters are trained.

Common mistakes

Call model.half() and assume training is solved.
Reverse the FP16/BF16 range tradeoff.
Forget activations often dominate memory in deep networks.

How to say it in the interview

Mention autocast and GradScaler as the practical PyTorch pair for FP16.
Separate precision, checkpointing and sharding in the answer.

Question 3 (10 min)

ROC-AUC, the ranking interpretation, and binarized predictions

Short answer

ROC-AUC measures ranking quality; equivalently, it is the area under the TPR-versus-FPR curve traced out as the threshold varies. PR-AUC is often more informative for rare positives. With binary predictions, the ROC curve has only one interior operating point.

Detailed walkthrough

ROC-AUC can be explained two ways. As a curve, sweep the decision threshold over model scores and plot TPR against FPR. As a ranking metric, ROC-AUC is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half.
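The pairwise-ranking reading translates directly into code. The sketch below is a brute-force O(P x N) illustration with a hypothetical function name and toy data; library implementations use sorting and are much faster.

```python
def roc_auc_pairwise(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(labels, scores)  # 3 of 4 pairs ranked correctly -> 0.75
```

Only the ordering of scores matters here: applying any strictly increasing transform to `scores` leaves the result unchanged, which is why ROC-AUC says nothing about calibration.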

For severe class imbalance, PR-AUC is often more useful when the positive class is the product-critical rare class, because precision directly reflects false positives among predicted positives. ROC-AUC can look deceptively good when there are many true negatives.
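A short worked example of why the same ROC operating point can look very different in precision terms. The TPR, FPR and prevalence values are made up for illustration, and `precision_at` is a hypothetical helper.

```python
def precision_at(tpr, fpr, prevalence):
    """Precision implied by an operating point (TPR, FPR) at a given positive rate."""
    tp = tpr * prevalence            # true positives per example
    fp = fpr * (1.0 - prevalence)    # false positives per example
    return tp / (tp + fp)

# Same ROC operating point, two different class balances.
balanced = precision_at(tpr=0.9, fpr=0.05, prevalence=0.5)   # about 0.95
rare = precision_at(tpr=0.9, fpr=0.05, prevalence=0.01)      # about 0.15
```

The ROC point (FPR 0.05, TPR 0.9) is identical in both cases, yet with 1% positives roughly five out of six alerts are false alarms. Precision, and therefore the PR curve, exposes that; the ROC curve does not.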

If you pass already-binarized predictions instead of scores, the ROC curve still can be computed, but it has only one meaningful interior operating point plus the endpoints. It loses threshold-ranking information and is usually less informative than using raw scores or probabilities.
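To see what binarization does, the sketch below computes a pairwise ROC-AUC on raw scores and on the same scores thresholded at 0.5. The data, the threshold and the helper name are assumptions for the example.

```python
def roc_auc_pairwise(labels, scores):
    """ROC-AUC as the fraction of correctly ranked (positive, negative) pairs; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
hard = [1 if s >= 0.5 else 0 for s in scores]  # already-binarized predictions

auc_scores = roc_auc_pairwise(labels, scores)
auc_hard = roc_auc_pairwise(labels, hard)

# The single operating point that the hard predictions define.
n_pos = sum(labels)
n_neg = len(labels) - n_pos
tpr = sum(1 for y, h in zip(labels, hard) if y == 1 and h == 1) / n_pos
fpr = sum(1 for y, h in zip(labels, hard) if y == 0 and h == 1) / n_neg
one_point_area = (tpr + 1.0 - fpr) / 2.0
```

With hard labels, the area collapses to `(TPR + 1 - FPR) / 2`, the area under the two straight segments through the single interior point, so `auc_hard` equals `one_point_area`. In this toy example it is also lower than `auc_scores`, because all ordering within the predicted-positive and predicted-negative groups has been replaced by ties.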

Common mistakes

Use class labels when scores are available.
Treat ROC-AUC as a calibrated probability metric.
Ignore PR-AUC for rare-positive tasks.

How to say it in the interview

Give both curve and pairwise-ranking interpretations.
Say what information binarization destroys.

Case 4 (18 min)

3D segmentation of dental lesions with limited labels

Short answer

Use tooth-aware cropping to control 3D memory and class imbalance, train a 3D segmentation/detection baseline, convert masks to instances with connected components or detection heads, and evaluate per-instance recall/precision with clinical slices.

Detailed walkthrough

A strong design starts with constraints. Full CBCT volumes can be too large for direct 3D U-Net training, and lesions are small and sparse. If tooth masks are available, crop around each tooth plus context. This makes inputs smaller, normalizes anatomy and turns one huge rare-object problem into many tooth-local problems.
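A quick memory estimate makes the constraint concrete. The 1000^3 volume size is the order of magnitude used elsewhere in this walkthrough; the 128^3 crop, the 32 feature channels and the FP32 storage are assumptions chosen only for illustration.

```python
def tensor_gib(shape, channels=1, bytes_per_value=4):
    """Size in GiB of one dense activation tensor with the given spatial shape."""
    voxels = 1
    for dim in shape:
        voxels *= dim
    return voxels * channels * bytes_per_value / 2 ** 30

full_input = tensor_gib((1000, 1000, 1000))                  # single-channel input, about 3.7 GiB
full_features = tensor_gib((1000, 1000, 1000), channels=32)  # one 32-channel feature map, about 119 GiB
crop_features = tensor_gib((128, 128, 128), channels=32)     # same feature map on a tooth crop, 0.25 GiB
```

A single full-resolution feature map already exceeds typical accelerator memory before counting the other layers or the activations saved for backward, whereas the tooth-sized crop is hundreds of times smaller.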

For a baseline, train a 3D U-Net-style semantic segmentation model per lesion class on high-quality voxel masks. Use class-balanced sampling, hard-negative mining, focal/Tversky/Dice-style losses, and augmentations that preserve clinical validity. Tooth-level weak labels can help with pretraining, auxiliary heads or sample mining, but should not replace voxel labels for final localization quality.
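As one concrete instance of the loss family mentioned above, here is a minimal soft Tversky loss on flattened voxel lists. It is a sketch: the function name, the convention that alpha weights false positives and beta weights false negatives, and the toy values are assumptions for the example. With alpha = beta = 0.5 it reduces to the soft Dice loss.

```python
def tversky_loss(probs, targets, alpha=0.5, beta=0.5, eps=1e-6):
    """Soft Tversky loss for one class over flattened voxels.

    alpha weights false positives, beta weights false negatives.
    """
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# Eight voxels, two of them lesion; the model finds one and mostly misses the other.
targets = [1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

dice = tversky_loss(probs, targets)                             # alpha = beta = 0.5
recall_weighted = tversky_loss(probs, targets, alpha=0.3, beta=0.7)
perfect = tversky_loss([float(t) for t in targets], targets)    # close to 0
```

Raising beta above alpha makes the missed lesion voxel cost more than it does under Dice, which is one way to push a model toward sensitivity on small, sparse lesions.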

To produce instances and probabilities, threshold class probability maps, run connected components in 3D, filter tiny islands, and aggregate voxel scores per component. A detection-first alternative is a 3D detector followed by local segmentation. Evaluation should include instance-level sensitivity/precision at IoU or overlap thresholds, per-class and per-tooth slices, lesion-size slices, false-positive burden per scan and clinician review.
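The mask-to-instance step can be sketched with a breadth-first search over a small nested-list volume. This is an illustration, not a production routine: the function name, 6-connectivity, the threshold, the minimum size and the mean-probability score are all assumptions, and in practice a library routine such as `scipy.ndimage.label` would usually handle the labelling.

```python
from collections import deque

def extract_instances(prob, threshold=0.5, min_voxels=2):
    """Threshold a [z][y][x] probability volume and return 3D connected components.

    Each instance carries its voxel list and the mean probability as a score.
    """
    depth, height, width = len(prob), len(prob[0]), len(prob[0][0])
    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen = set()
    instances = []
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if (z, y, x) in seen or prob[z][y][x] < threshold:
                    continue
                # Flood-fill one component starting from this foreground voxel.
                queue = deque([(z, y, x)])
                seen.add((z, y, x))
                voxels = []
                while queue:
                    cz, cy, cx = queue.popleft()
                    voxels.append((cz, cy, cx))
                    for dz, dy, dx in neighbors:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        if (0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                                and (nz, ny, nx) not in seen
                                and prob[nz][ny][nx] >= threshold):
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
                if len(voxels) >= min_voxels:  # drop tiny islands
                    score = sum(prob[a][b][c] for a, b, c in voxels) / len(voxels)
                    instances.append({"voxels": voxels, "score": score})
    return instances

# Toy 2x3x3 volume: one 3-voxel blob spanning two slices and one isolated voxel.
prob = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
prob[0][0][0], prob[0][0][1], prob[1][0][0] = 0.9, 0.8, 0.7
prob[1][2][2] = 0.6
instances = extract_instances(prob)
```

The isolated voxel is filtered out by `min_voxels`, leaving one instance of three voxels with a score of about 0.8. Those per-instance scores are what instance-level precision and recall would then be computed on.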

Common mistakes

Feed the full 1000^3 volume into a network without memory planning.
Use only voxel-level IoU and miss instance-level false positives.
Ignore tooth-local class imbalance.

How to say it in the interview

Ask what labels exist and whether tooth masks are available.
Offer a baseline first, then weak-label and instance-refinement improvements.