Interview practice: Navio: Technical interview

Question 1 (7 min)

Question

Explain how dropout behaves during training and inference. Why does the implementation need scaling, and what is inverted dropout?

Short answer

During training dropout randomly zeroes activations; during inference it is disabled. Scaling keeps the expected activation magnitude consistent between train and inference.

Detailed walkthrough

Dropout is a training-time regularizer. For each forward pass it randomly zeroes a fraction p of activations, so the model cannot rely on one fixed set of neurons and behaves more like an ensemble of subnetworks.

At inference time dropout is turned off, because predictions should be deterministic and should use the full network. Without scaling, the expected activation magnitude would differ between the two modes: with drop probability p, a unit's expected training-time output is only 1 - p times its inference-time output. In inverted dropout, which is common in frameworks such as PyTorch, the kept activations are divided by 1 - p during training, so the expectation already matches and inference needs no extra multiplication.

The alternative convention is to leave training activations unscaled and multiply by 1 - p at inference. The key interview point is expectation matching, not the exact convention.
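The expectation-matching argument can be checked numerically. The sketch below is a minimal pure-Python illustration, not any framework's implementation; the function name `inverted_dropout` and the sample sizes are assumptions made for the example.

```python
import random

def inverted_dropout(activations, p, training, rng):
    """Apply inverted dropout to a list of floats.

    p is the DROP probability; each unit is kept with probability 1 - p.
    (Hypothetical helper written for this example.)
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Inference: identity, no masking and no rescaling.
        return list(activations)
    keep = 1.0 - p
    # Zero each unit with probability p; scale survivors by 1 / (1 - p).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
x = [1.0] * 100_000
p = 0.3

train_out = inverted_dropout(x, p, training=True, rng=rng)
eval_out = inverted_dropout(x, p, training=False, rng=rng)

train_mean = sum(train_out) / len(train_out)
eval_mean = sum(eval_out) / len(eval_out)
dropped_fraction = train_out.count(0.0) / len(train_out)

# The training mean should land close to the inference mean of 1.0,
# even though roughly 30% of the units were zeroed.
print(train_mean, eval_mean, dropped_fraction)
```

Dividing by 1 - p during training is what lets the inference path stay a plain identity; under the alternative convention the same check would need a multiplication by 1 - p on the inference side instead.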

Common mistakes

Saying that dropout still randomly zeroes neurons during normal inference.
Forgetting the scaling convention.
Confusing the drop probability p with the keep probability 1 - p.

How to say it in the interview

State train and inference behavior separately.
Mention inverted dropout if the interviewer asks about framework defaults.

Question 2 (10 min)

Metrics question

A binary image classifier is trained with BCE loss. On validation, accuracy rises but BCE loss also rises. Can this happen and what are plausible causes?

Short answer

Yes. Accuracy only checks the thresholded class, while BCE also penalizes confidence. A few confidently wrong examples can increase loss while more examples cross the threshold correctly.

Detailed walkthrough

This can happen because accuracy and BCE measure different things. Accuracy is threshold-based: after choosing a threshold such as 0.5, it only counts whether each prediction is on the correct side. BCE uses the predicted probability, so confidence matters.

Suppose more validation examples move from 0.49 to 0.51 for the correct class. Accuracy improves. At the same time, a small number of mislabeled, shifted or hard examples can receive very confident wrong probabilities, such as 0.999 for the wrong class. BCE on those examples can grow sharply and dominate the average loss.

Plausible causes include label noise, validation/train domain shift, overconfident miscalibration, or a distribution slice where the model becomes confidently wrong. A strong answer proposes inspecting the confusion matrix, per-slice loss, calibration curves and mislabeled examples.
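A tiny counterexample makes the effect concrete. The sketch below uses five made-up validation samples across two hypothetical checkpoints (`epoch_a`, `epoch_b`); the numbers are illustrative assumptions, not measurements.

```python
import math

def bce(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; p_pred is the predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of samples on the correct side of the threshold."""
    correct = sum(int(p >= threshold) == y for y, p in zip(y_true, p_pred))
    return correct / len(y_true)

labels = [1, 1, 1, 1, 0]

# Earlier checkpoint: three positives sit just below the threshold,
# and the negative is mildly wrong.
epoch_a = [0.49, 0.49, 0.49, 0.60, 0.55]

# Later checkpoint: the three positives cross the threshold,
# but the negative becomes confidently wrong.
epoch_b = [0.51, 0.51, 0.51, 0.60, 0.999]

acc_a, loss_a = accuracy(labels, epoch_a), bce(labels, epoch_a)
acc_b, loss_b = accuracy(labels, epoch_b), bce(labels, epoch_b)

print(f"epoch A: accuracy={acc_a:.2f}, BCE={loss_a:.3f}")
print(f"epoch B: accuracy={acc_b:.2f}, BCE={loss_b:.3f}")
```

Accuracy rises from 1 of 5 to 4 of 5 correct, while the single sample predicted at 0.999 for the wrong class contributes about -ln(0.001), roughly 6.9, and pulls the mean loss up.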

Common mistakes

Assuming validation loss and accuracy must always move together.
Ignoring confidence and calibration.
Looking only at the aggregate metric instead of per-slice failures.

How to say it in the interview

Build a tiny counterexample with one confidently wrong sample.
Name both label noise and domain shift as practical explanations.

Question 3 (8 min)

Production ML question

In PyTorch DDP training, which common layer can behave badly across processes and how do teams usually handle it?

Short answer

BatchNorm is the classic issue: each process sees only its local mini-batch, so statistics can diverge. Use SyncBatchNorm, larger per-device batches, a different normalization layer, or accept the approximation.

Detailed walkthrough

BatchNorm depends on batch statistics. In DDP, each process usually receives only a shard of the global batch, so ordinary BatchNorm computes mean and variance from the local shard. If the per-GPU batch is small or non-representative, the normalization can be noisy or inconsistent.

One common fix is synchronized batch normalization, such as PyTorch SyncBatchNorm, which synchronizes statistics across processes. Another path is to use normalization layers that do not depend on cross-sample batch statistics, such as LayerNorm or GroupNorm, depending on the architecture. Some teams also simply tolerate local BatchNorm when the per-device batch is large enough and metrics are stable.

The production answer should include the tradeoff: synchronization costs communication and can slow training, so it is not automatically the best choice.
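The gap between local and global statistics is easy to see on a toy batch. The sketch below is plain Python with made-up numbers; the two "processes" are just list slices, and the sum-based merge is one conceptual way to combine per-process statistics, not a description of how PyTorch implements SyncBatchNorm internally.

```python
from statistics import fmean, pvariance

# Hypothetical global batch of 8 activations for one channel,
# sharded across 2 processes in a deliberately non-representative way.
global_batch = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
shards = [global_batch[:4], global_batch[4:]]

def batch_stats(values):
    """Mean and biased (population) variance, as used for normalization."""
    return fmean(values), pvariance(values)

# What ordinary BatchNorm would see on each process.
local_stats = [batch_stats(shard) for shard in shards]

# What a single-process run over the whole batch would see.
global_mean, global_var = batch_stats(global_batch)

# Conceptual synchronization: every process contributes its count,
# sum and sum of squares; the merged values give the global statistics.
n = sum(len(shard) for shard in shards)
total = sum(sum(shard) for shard in shards)
total_sq = sum(sum(v * v for v in shard) for shard in shards)
sync_mean = total / n
sync_var = total_sq / n - sync_mean ** 2

print("local :", local_stats)
print("global:", (global_mean, global_var))
print("synced:", (sync_mean, sync_var))
```

Each shard reports a small variance around its own mean, while the true batch variance is far larger; exchanging those few numbers per channel on every step is the communication cost mentioned above. In PyTorch, existing BatchNorm layers are typically swapped with `torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)` before wrapping the model in DDP.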

Common mistakes

Assuming DDP automatically makes BatchNorm global.
Forgetting the communication overhead of SyncBatchNorm.
Ignoring per-device batch size.

How to say it in the interview

Say exactly what statistic is local.
Mention LayerNorm or GroupNorm as practical alternatives.

Question 4 (12 min)

Production ML question

Explain at a high level how TensorRT or similar inference optimizers speed up neural networks, and why INT8 quantization usually needs calibration.

Short answer

Optimizers compile a static graph for target hardware, fuse operators, choose kernels and reduce precision. INT8 needs calibration to map real activation ranges into 8-bit values without destroying accuracy.

Detailed walkthrough

ONNX is commonly used as an interchange format that describes a static computation graph and a set of operations. TensorRT can take such a graph and build an engine tuned for a specific NVIDIA GPU. The speedups come from operator fusion, kernel selection, memory-layout planning, constant folding, removing unnecessary work and using lower precision where safe.

Dropping to FP16 is often relatively straightforward because it keeps floating-point semantics with fewer bits. INT8 is more delicate: activations and weights must be mapped onto at most 256 integer levels. Calibration runs representative data through the network to estimate activation ranges or distributions and then chooses the scales (plus zero points in asymmetric schemes). Without this, outliers and poorly chosen ranges can destroy model quality.

A good production answer also mentions dynamic shapes and unsupported operators as common reasons conversions fail or become slower than expected.
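Why the choice of range matters can be shown with a toy symmetric quantizer. The sketch below is plain Python on made-up data; the helper names and the 99th-percentile rule are assumptions for illustration and do not reproduce TensorRT's calibrators.

```python
def quantize_int8(values, scale):
    """Symmetric INT8: round(x / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    return [q * scale for q in q_values]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical activations: the typical range is [-1, 1],
# but the calibration set also contains one large outlier.
typical = [i / 100.0 for i in range(-100, 101)]
calibration = typical + [50.0]

# Option A: scale from the absolute maximum, so the outlier sets the range.
scale_max = max(abs(v) for v in calibration) / 127.0

# Option B: scale from a high percentile of |x|, so the outlier is clipped.
sorted_abs = sorted(abs(v) for v in calibration)
clip = sorted_abs[int(0.99 * (len(sorted_abs) - 1))]
scale_pct = clip / 127.0

q_max = quantize_int8(typical, scale_max)
q_pct = quantize_int8(typical, scale_pct)

err_max = mean_abs_error(typical, dequantize(q_max, scale_max))
err_pct = mean_abs_error(typical, dequantize(q_pct, scale_pct))
levels_max = len(set(q_max))
levels_pct = len(set(q_pct))

print(f"max-based scale: error={err_max:.4f}, levels used={levels_max}")
print(f"percentile scale: error={err_pct:.4f}, levels used={levels_pct}")
```

With the max-based scale the typical activations collapse onto only a handful of integer levels, so reconstruction error is large; clipping the outlier spends the 8-bit range on values that actually occur. This is the trade-off calibration data is meant to resolve, which is why the data has to be representative.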

Common mistakes

Treating ONNX and TensorRT as the same thing.
Saying quantization only saves disk size.
Ignoring representative calibration data.

How to say it in the interview

Separate graph optimizations from numeric precision changes.
Call out validation after conversion as mandatory.

Question 5 (8 min)

Question

If a YOLO-style detector was trained at one image resolution, what can happen if you run inference at a different resolution? When is it technically possible?

Short answer

It is technically possible for fully convolutional detectors, but quality can change because feature scales, anchors and small-object visibility shift. Fully connected heads may require fixed input size.

Detailed walkthrough

Whether inference at a different resolution is technically possible depends on the architecture. A fully convolutional detector can usually accept different spatial dimensions, subject to stride divisibility and implementation constraints. If the model has fixed-size fully connected layers, those layers break unless they are replaced or adapted.

Even when the forward pass works, model quality can degrade. The detector learned feature scales, anchors, receptive-field behavior and preprocessing assumptions at the training resolution. Downscaling can hurt small objects; upscaling can shift apparent object sizes away from what the model saw in training, change score calibration and increase compute. In production, teams usually train and validate at the same resolution used for deployment or run explicit multi-scale training and evaluation.

For an automotive perception role, the stronger answer connects resolution choices to latency, hardware budget and safety-critical recall.
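The compatibility and cost points can be put into numbers. The sketch below is plain Python; the total stride of 32, the 256 channels, the 640x640 training resolution and all function names are assumptions chosen for illustration, not properties of any particular YOLO version.

```python
def grid_size(height, width, stride=32):
    """Output feature-map size of a fully convolutional model.

    Assumes the input must be divisible by the total stride.
    """
    if height % stride != 0 or width % stride != 0:
        raise ValueError(f"{height}x{width} is not divisible by stride {stride}")
    return height // stride, width // stride

def fc_head_accepts(height, width, channels, fc_in_features, stride=32):
    """A flattened fully connected head only fits one feature-map size."""
    grid_h, grid_w = grid_size(height, width, stride)
    return grid_h * grid_w * channels == fc_in_features

def relative_cost(height, width, train_height, train_width):
    """Convolutional compute scales roughly with the number of input pixels."""
    return (height * width) / (train_height * train_width)

train_grid = grid_size(640, 640)      # resolution used in training
big_grid = grid_size(1280, 1280)      # higher-resolution inference
cost_ratio = relative_cost(1280, 1280, 640, 640)

# Hypothetical FC head sized for the 20x20x256 training feature map.
fc_in_features = 20 * 20 * 256
fc_ok_train = fc_head_accepts(640, 640, 256, fc_in_features)
fc_ok_big = fc_head_accepts(1280, 1280, 256, fc_in_features)

print(train_grid, big_grid, cost_ratio, fc_ok_train, fc_ok_big)
```

Under these assumptions the convolutional part simply produces a larger grid at roughly four times the compute, while the fixed-size head no longer matches its input. An input such as 650x650 would be rejected by the divisibility check, which is why pipelines typically pad or letterbox to a multiple of the stride.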

Common mistakes

Assuming any CNN accepts any resolution with no metric change.
Forgetting stride and head constraints.
Discussing only speed and not detection quality.

How to say it in the interview

Start with architecture compatibility.
Then discuss quality and deployment tradeoffs.