High-precision car photo moderation with rare fraud
We need to automatically reject listings when features extracted from the car photos contradict the attributes entered by the user. Fraud is rare, and false rejections hurt users. How would you train the model, validate its quality and choose thresholds?
Short answer
Treat this as a high-precision production classifier: use reliable clean slices, proxy fraud evidence, manual review, business-owned FPR limits, online unblock/appeal monitoring and conservative thresholds by class.
Detailed walkthrough
Start from the decision cost. An auto-reject should happen only when the model is very confident, because false positives block legitimate sellers. Agree on an acceptable false positive rate with the business, then maximize recall under that constraint rather than optimizing generic accuracy.
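The "maximize recall under an FPR cap" rule can be sketched as a simple threshold scan. This is a minimal illustration, assuming a labeled validation set of reviewed listings with at least one positive and one negative; `pick_threshold` and its arguments are hypothetical names, not part of any existing system.

```python
def pick_threshold(scores, labels, max_fpr):
    """Return (threshold, recall, fpr) with the highest recall whose FPR <= max_fpr.

    scores: model confidence that a listing is fraudulent (higher = more suspicious).
    labels: 1 for confirmed fraud, 0 for a legitimate listing.
    A listing is auto-rejected when score >= threshold.
    Returns None when even the strictest threshold violates the cap.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    # Scan candidate thresholds from strictest to loosest.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fpr = fp / n_neg
        recall = tp / n_pos
        if fpr > max_fpr:
            break  # FPR never decreases as the threshold drops
        best = (t, recall, fpr)  # recall never decreases either, so keep the last feasible one
    return best
```

In practice the same scan would be run per class, and classes with too few reviewed examples would be left out of auto-reject.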
For training, use historical listings with user-entered attributes and photo-derived labels, but do not assume all history is clean. Build a high-confidence clean slice from ownership documents or other trusted signals, compare it with random traffic, and manually inspect model triggers. Rare classes should have separate thresholds or be excluded from auto-reject until there is enough evidence.
For validation and launch, track precision on reviewed triggers, unblock or appeal rate, trigger volume by brand/model/color, segment drift and business outcomes. Roll out gradually, keep manual-review fallback for low-confidence cases, and monitor whether the trigger distribution changes after launch.
Common mistakes
- Optimize ROC-AUC or accuracy while ignoring the cost of false rejects.
- Assume historical user-provided attributes are perfect labels.
- Use one global threshold for all classes, including rare ones.
- Launch without appeal or unblock feedback as a precision proxy.
How to say it in an interview
- State the business-owned FPR constraint before discussing the model.
- Name one offline proxy and one online monitoring signal.
Metric learning for comparing two cars from photos
Short answer
Normalize comparable views, match angles, learn embeddings with hard positives and hard negatives, aggregate per-view similarities, and calibrate for high precision on obvious visual mismatches.
Detailed walkthrough
Break the problem into view normalization and comparison. First classify or infer photo angles so front, side, rear and interior views are compared to similar views. Crop or normalize away background when possible, because the model should focus on vehicle identity and visible details.
Then train an embedding model with metric learning. Positives are different photos of the same vehicle, while hard negatives should be close in make, model, color or trim but differ in visible details. Random negatives are too easy and will overstate quality. Triplet, contrastive or supervised contrastive losses are reasonable choices.
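A small sketch of the triplet loss shows why random negatives are too easy. It assumes Euclidean distance and a margin of 0.2; both are illustrative choices, and the toy 2-D embeddings are made up for the example.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, l2_distance(anchor, positive) - l2_distance(anchor, negative) + margin)

anchor = [0.0, 0.0]
positive = [0.1, 0.0]        # another photo of the same car
easy_negative = [1.0, 0.0]   # unrelated car, far away in embedding space
hard_negative = [0.2, 0.0]   # same make and color, differs in small details

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
```

The easy triplet already satisfies the margin, so its loss is zero and it contributes no training signal; only the hard negative produces a positive loss.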
At serving time, compute cosine similarities for matched view pairs and aggregate them with a small model or calibrated rules. Evaluate on a deliberately hard dataset and report precision/recall at the threshold used for manual review or automatic action. Subtle differences may be impossible to detect from photos, so define the product use case as surfacing likely mismatches, not proving identity.
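The serving-time comparison could look roughly like the sketch below. It assumes each listing is a dict from a view name to an embedding, and it uses the minimum per-view similarity as the aggregation rule; the function names, the 0.5 threshold and the min rule are all hypothetical stand-ins for a calibrated aggregator.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def view_similarities(listing_a, listing_b):
    """Compare only views that both listings have (front with front, rear with rear)."""
    shared = listing_a.keys() & listing_b.keys()
    return {view: cosine(listing_a[view], listing_b[view]) for view in shared}

def flag_mismatch(listing_a, listing_b, threshold=0.5, min_views=1):
    """Return True/False for a likely mismatch, or None when there is nothing to compare."""
    sims = view_similarities(listing_a, listing_b)
    if len(sims) < min_views:
        return None  # no comparable views: leave the decision to manual review
    # Assumed rule: one clearly dissimilar matched view is enough to flag the pair.
    return min(sims.values()) < threshold
```

Returning None instead of a score keeps unmatched angles from being compared directly.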
Common mistakes
- Train only on random negative pairs.
- Compare unmatched angles directly.
- Ignore background, lighting and crop normalization.
- Report average quality without separating obvious and subtle mismatches.
How to say it in an interview
- Explicitly mention angle matching before metric learning.
- Explain how hard negatives are mined or labeled.
Classification metrics, ties in ROC-AUC, and F1
Short answer
ROC-AUC is the probability a random positive is scored above a random negative. Score ties contribute half credit. F1 is the harmonic mean, so a low precision or recall strongly limits the final value.
Detailed walkthrough
ROC-AUC evaluates ranking, not a fixed threshold. It can be interpreted as the share of positive-negative pairs where the positive object receives a higher score than the negative object, with ties usually contributing 0.5.
If scores are rounded, ordering can be lost. Pairs that were correctly ordered may become ties, which lowers their contribution from 1 to 0.5. Pairs that were incorrectly ordered may also become ties, improving from 0 to 0.5. In practice rounding often reduces resolution and can change AUC even if classification at one threshold looks unchanged.
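The pairwise definition and the effect of rounding can be checked with a brute-force sketch. The quadratic loop is only for illustration, and the toy scores are made up.

```python
def roc_auc(scores, labels):
    """Pairwise ROC-AUC: 1 for a correctly ordered pair, 0.5 for a tie, 0 otherwise."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    credit = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                credit += 1.0
            elif p == n:
                credit += 0.5
    return credit / (len(pos) * len(neg))

def rounded(scores):
    return [round(s, 1) for s in scores]

# A wrongly ordered pair (0.58 vs 0.62) becomes a tie after rounding: AUC goes up.
labels_a = [1, 0, 1, 0]
scores_a = [0.91, 0.62, 0.58, 0.33]
auc_a, auc_a_rounded = roc_auc(scores_a, labels_a), roc_auc(rounded(scores_a), labels_a)

# A correctly ordered pair (0.64 vs 0.61) becomes a tie after rounding: AUC goes down.
labels_b = [1, 0, 0]
scores_b = [0.64, 0.61, 0.20]
auc_b, auc_b_rounded = roc_auc(scores_b, labels_b), roc_auc(rounded(scores_b), labels_b)
```

On these toy inputs the first AUC moves from 0.75 to 0.875 and the second from 1.0 to 0.75, so rounding can shift AUC in either direction.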
F1 is 2 * precision * recall / (precision + recall). The harmonic mean is closer to the smaller value, so a model cannot compensate terrible recall with excellent precision, or the reverse. Use F-beta when one side is more important.
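A short sketch of F-beta makes the harmonic-mean behavior concrete. The precision and recall values are illustrative only.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.99, 0.10
f1 = f_beta(precision, recall)              # stays close to the weaker metric
arithmetic_mean = (precision + recall) / 2  # would hide the poor recall
f_half = f_beta(precision, recall, beta=0.5)
f_two = f_beta(precision, recall, beta=2.0)
```

With excellent precision and poor recall, F1 stays below 0.2 while the arithmetic mean is above 0.5; F0.5 rewards this model more than F2 does.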
Common mistakes
- Say ROC-AUC is accuracy over thresholds.
- Ignore the 0.5 contribution of tied scores.
- Treat F1 as an arithmetic average.
- Use F1 when the business has asymmetric error costs but no beta choice.
How to say it in an interview
- Give the pairwise interpretation of ROC-AUC.
- For F1, say explicitly that harmonic mean penalizes the smaller metric.
Feature importance in linear models under multicollinearity
Short answer
Coefficient magnitude is interpretable only after feature scaling and under stable feature relationships. Correlated or linearly dependent features can split or flip weights, so use regularization, diagnostics, feature removal or transformed features carefully.
Detailed walkthrough
A coefficient says how the prediction changes when that feature changes by one unit while other features are fixed. Because units matter, comparing raw coefficient magnitudes is usually invalid. Standardize numerical features before using coefficient magnitude as a rough importance signal.
Even after scaling, coefficients can mislead under multicollinearity. If features are strongly correlated or linearly dependent, many coefficient combinations can explain the same signal. Weights may become unstable, split across duplicate features, change sign, or depend heavily on regularization.
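The extreme case of an exactly duplicated feature shows why individual weights stop being meaningful. This is a toy sketch with made-up data; `linear_predict` is a hypothetical helper.

```python
def linear_predict(weights, bias, row):
    """Prediction of a linear model for one row of features."""
    return bias + sum(w * x for w, x in zip(weights, row))

# Feature 2 is an exact copy of feature 1 (perfect collinearity).
rows = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]

# Three very different weight vectors whose components all sum to 2.0.
candidates = [(2.0, 0.0), (1.0, 1.0), (5.0, -3.0)]

predictions = [[linear_predict(w, 0.5, row) for row in rows] for w in candidates]
```

All three weight vectors give identical predictions, so the data cannot distinguish "only feature 1 matters" from "feature 2 has a negative effect". Only regularization or feature removal picks one of them.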
Mitigations include checking correlation and VIF-like diagnostics, removing redundant features, using L1 or L2 regularization, grouping features, or transforming the space with PCA. PCA can reduce collinearity, but it also reduces direct feature interpretability because coefficients now apply to components rather than original business features.
Common mistakes
- Compare coefficients before scaling features.
- Assume a large coefficient always means high business importance.
- Forget that correlated features make individual coefficients unstable.
- Use PCA and still talk about original feature-level coefficients.
How to say it in an interview
- Mention scaling first.
- Use multicollinearity as the main counterexample.
Gradient boosting, residuals, and prediction range
Short answer
Boosting adds trees that approximate negative gradients of the loss, not raw targets. Because predictions are sums of many gradient steps, a boosted regressor can move outside the original target range.
Detailed walkthrough
Gradient boosting builds an additive model. Start with an initial prediction, then repeatedly fit a weak learner to the negative gradient of the loss with respect to current predictions. For MSE, that gradient is proportional to the residual y - y_hat, so the intuition of fitting residuals is correct.
The leaf values in later trees are not simply averages of original y values. They are fitted updates, often gradients or Newton-style leaf estimates depending on the implementation and objective. The final prediction is the initial value plus learning-rate-scaled contributions from all trees.
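The additive structure can be sketched with a toy booster for MSE on a single feature. It assumes depth-1 trees (stumps) and plain residual fitting; real libraries use deeper trees, regularization and often Newton-style leaf values, so treat `fit_stump`, `boost` and `predict` as hypothetical teaching helpers.

```python
def fit_stump(xs, residuals):
    """Best single split (threshold, left_value, right_value) by squared error.

    Leaf values are means of the current residuals, not means of the original targets.
    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_trees=20, lr=0.5):
    """Fit stumps to the negative MSE gradient, which is the residual y - y_hat."""
    f0 = sum(ys) / len(ys)  # initial prediction: the target mean
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + lr * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return f0, stumps

def predict(f0, stumps, x, lr=0.5):
    """Initial value plus learning-rate-scaled contributions from all stumps."""
    return f0 + lr * sum(lv if x <= t else rv for t, lv, rv in stumps)
```

For example, with `xs = [1, 2, 3, 4]` and `ys = [1.0, 2.0, 3.0, 10.0]` the first stump stores a negative left-leaf value even though every target is positive, which is the difference from random forest leaf averaging.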
A random forest regressor averages target values in leaves and then averages trees, so for standard settings it tends to stay within the range of training targets. Gradient boosting is a sum of updates; it can overshoot and predict outside the observed target range, especially with many trees, high learning rate or objectives that allow such updates.
Common mistakes
- Say each boosting leaf stores only averaged y values.
- Explain gradient boosting exactly like random forest.
- Forget the learning rate in the additive prediction.
- Assume tree-based regressors can never extrapolate outside target range.
How to say it in an interview
- For MSE, derive the residual as the negative gradient.
- Contrast boosting with random forest leaf averaging.
Transformer attention, tokenization, and cross-attention
Short answer
Tokens are embedded, positional information is added or injected, self-attention mixes contextual information, masked decoder attention prevents future leakage, and cross-attention lets decoder queries attend to encoder keys and values.
Detailed walkthrough
A Transformer starts with tokenization, often subword tokenization such as BPE or WordPiece. Token ids are mapped to embeddings. Because attention without positional information ignores token order (it is permutation-equivariant), the model needs positional information, either added as sinusoidal or learned embeddings or injected through rotary position embeddings.
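As one concrete option, the fixed sinusoidal encoding can be sketched in a few lines. The base of 10000 follows the common formulation; `sinusoidal_position` is a hypothetical helper name, and learned or rotary embeddings would look different.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dimensions, cos on odd ones."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        if i + 1 < d_model:
            vec.append(math.cos(angle))
    return vec

# The encoding is added to the token embedding, so the same token
# gets a different input vector at each position.
token_embedding = [0.5, -0.2, 0.1, 0.3]
inputs = [
    [e + p for e, p in zip(token_embedding, sinusoidal_position(pos, 4))]
    for pos in range(3)
]
```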
In self-attention, each token produces query, key and value vectors. Query-key dot products score relevance, scaling stabilizes logits, softmax turns scores into weights, and the output is a weighted sum of values. Multi-head attention repeats this in several learned subspaces, then the block applies residual connections, normalization and a feed-forward network.
In an encoder-decoder Transformer, the encoder builds contextual representations of the source sequence. The decoder uses masked self-attention so position t cannot see future tokens. Cross-attention then uses decoder states as queries and encoder states as keys and values, letting generation condition on the source sequence. GPT-style models are decoder-only; BERT-style models are encoder-only.
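Scaled dot-product attention, causal masking and cross-attention can share one small sketch. It is a simplification: a single head with no learned query, key or value projections, so the raw vectors are used directly; the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Return (outputs, weights) for single-head scaled dot-product attention.

    causal=True masks key positions j > i, which only makes sense for
    self-attention where queries and keys come from the same sequence.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[1.0, 0.0], [0.0, 1.0]]

# Masked self-attention: queries, keys and values all come from the decoder.
_, self_weights = attention(decoder_states, decoder_states, decoder_states, causal=True)

# Cross-attention: decoder queries attend to encoder keys and values.
_, cross_weights = attention(decoder_states, encoder_states, encoder_states)
```

In the masked case the first position puts all of its weight on itself, while the cross-attention weights form a 2 by 3 matrix: one row per decoder position, one column per encoder position.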
Common mistakes
- Forget positional information.
- Mix up query/key/value sources in cross-attention.
- Describe decoder self-attention without causal masking.
- Say BERT and GPT differ only by training data.
How to say it in an interview
- Use one sentence each for tokenization, attention and masking.
- For cross-attention, say decoder queries attend to encoder keys and values.
Gradient stability, activations, skip connections, and initialization
Short answer
Gradients vanish or explode through repeated multiplication by small or large derivatives. Use stable activations, normalization, residual paths, careful initialization, gradient clipping for explosions and architecture choices that preserve signal.
Detailed walkthrough
In backpropagation, gradients are multiplied through many layers. If typical derivatives or Jacobian norms are much smaller than one, gradients vanish. If they are much larger than one, gradients explode. Saturating activations such as sigmoid can create near-zero derivatives, while unnormalized dot products or unstable recurrent dynamics can increase norms.
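A back-of-the-envelope sketch shows how quickly sigmoid derivatives shrink a gradient. It assumes a chain of scalar sigmoid layers with all weights equal to 1, which isolates the activation effect and ignores the weight matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def chained_gradient(pre_activations):
    """Product of per-layer sigmoid derivatives along one path (weights assumed to be 1)."""
    g = 1.0
    for z in pre_activations:
        g *= sigmoid_grad(z)
    return g

best_case = chained_gradient([0.0] * 10)   # every layer at the steepest point
saturated = chained_gradient([5.0] * 10)   # every layer saturated
```

Even in the best case the gradient after 10 layers is 0.25 to the power 10, below one millionth, and saturation makes it far smaller.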
Activation choices help. ReLU avoids sigmoid saturation on the positive side, but dead ReLUs can receive zero gradient. Leaky ReLU, GELU and related activations keep smoother or nonzero gradients in more regions.
Other controls matter as much: residual or skip connections give gradients shorter paths, normalization stabilizes activation distributions, Xavier or He-style initialization preserves variance early in training, and gradient clipping limits explosions. In sequence models, LSTM-style gates historically helped preserve long-term signal; Transformers rely heavily on residual paths, normalization and scaled attention.
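Gradient clipping by global norm can be sketched as below, treating all gradients as one flat list; `clip_by_global_norm` is a hypothetical helper, not a specific framework's API.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their L2 norm is at most max_norm.

    Returns (clipped_gradients, original_norm). Gradients that are already
    small are returned unchanged: clipping never scales anything up.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

exploding, exploding_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
vanishing, vanishing_norm = clip_by_global_norm([1e-8, 0.0], max_norm=1.0)
```

The large gradient is scaled down to norm 1 with its direction preserved, while the tiny gradient passes through untouched, which is why clipping addresses explosions but does nothing for vanishing gradients.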
Common mistakes
- Call gradient clipping a fix for vanishing gradients.
- Say ReLU has no gradient problems.
- Ignore initialization and normalization.
- Explain exploding gradients only as numeric overflow, not as unstable optimization.
How to say it in an interview
- Separate vanishing and exploding remedies.
- Mention residual connections as the main deep-network answer.
Intuition behind Adam, momentum, and RMSProp
Short answer
SGD updates weights by minibatch gradients. Momentum smooths direction with a running average of gradients. RMSProp normalizes by a running average of squared gradients. Adam combines both with bias-corrected first and second moments.
Detailed walkthrough
Plain minibatch SGD computes a gradient on a batch and moves parameters opposite that gradient with a learning rate. Its noise can help escape shallow local minima, but it can zigzag or move slowly in ill-conditioned directions.
Momentum keeps an exponential moving average of gradients. This first-moment estimate behaves like velocity: directions that persist accumulate speed, while noisy sign changes are smoothed.
RMSProp keeps an exponential moving average of squared gradients. This second-moment estimate rescales updates coordinate-wise, reducing steps where gradients are consistently large or volatile. Adam combines momentum and RMSProp: the update roughly follows first_moment / (sqrt(second_moment) + epsilon), usually with bias correction in early steps.
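A single-parameter Adam step can be sketched as follows. The default hyperparameters are the commonly used ones, and epsilon is added outside the square root as in the original Adam formulation; some implementations place it differently. `adam_step` is a hypothetical helper.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step number."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: momentum-like average
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: RMSProp-like average
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# On the first step the update size is about lr regardless of the gradient scale.
after_large_grad, _, _ = adam_step(0.0, 10.0, m=0.0, v=0.0, t=1)
after_small_grad, _, _ = adam_step(0.0, 0.1, m=0.0, v=0.0, t=1)
```

Both parameters move by roughly the learning rate even though the gradients differ by a factor of 100, which is the scale normalization contributed by the second moment.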
Common mistakes
- Say Adam is just SGD with a different learning rate.
- Mix up first moment and second moment.
- Forget the epsilon and square root role in normalization.
- Assume Adam always generalizes better than SGD.
How to say it in an interview
- Describe first moment as velocity and second moment as scale normalization.
- Name AdamW if asked about decoupled weight decay.
Formatting an integer with thousands separators
Given an integer, return its string representation with commas between thousands groups, without using built-in number formatting.
Hints
- Go from right to left
The last group of digits is the remainder of division by 1000.
- Do not lose inner zeros
Groups to the right of the leftmost one must be exactly 3 characters long, so pad them with zeros.
Solution idea
First, record the sign separately and work with the absolute value. Zero should be handled explicitly, because the divide-by-1000 loop never runs for it.
Then repeatedly take the remainder of division by 1000. Each remainder is the next group of digits from the right. All groups except the leftmost must be left-padded with zeros to length 3. Finally, reverse the list of groups and join it with commas.
The main interview mistake shows up on numbers like 1000000: without zero-padding the inner groups you get 1,0,0 instead of 1,000,000.
Reference code
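def format_thousands(n: int) -> str:
    # Zero never enters the division loop, so handle it explicitly.
    if n == 0:
        return "0"
    # Remember the sign and work with the absolute value.
    negative = n < 0
    n = abs(n)
    groups: list[str] = []
    while n > 0:
        # Peel off the rightmost three-digit group.
        n, chunk = divmod(n, 1000)
        text = str(chunk)
        # Every group except the leftmost is zero-padded to width 3.
        if n > 0:
            text = text.zfill(3)
        groups.append(text)
    # Groups were collected right to left, so reverse before joining.
    result = ",".join(reversed(groups))
    if negative:
        return "-" + result
    return result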
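For cross-checking edge cases, the same result can be reached by slicing the decimal string instead of dividing. This is an alternative sketch, not the reference solution; `format_thousands_by_slicing` is a hypothetical name.

```python
def format_thousands_by_slicing(n: int) -> str:
    """Group digits by cutting three characters at a time from the right."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    groups = []
    while len(digits) > 3:
        groups.append(digits[-3:])  # slices keep inner zeros, so no padding is needed
        digits = digits[:-3]
    groups.append(digits)           # leftmost group, 1 to 3 digits
    return sign + ",".join(reversed(groups))

# Edge cases worth running against any implementation.
cases = [0, 999, 1000, 1000000, 1002003, -100, -1234567]
formatted = [format_thousands_by_slicing(n) for n in cases]
```

Zero, numbers with inner zero groups, and negative values are the inputs most likely to expose a bug.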