Walkthrough of a PyTorch training loop for multiclass classification
First, say your answer out loud or jot it down as bullet points.
Formulas, solution plan, risks and examples.
Open the walkthrough only after your own attempt.
Short answer
Check architecture, batching, labels, optimizer step/zero_grad, train/eval loops, loss inputs, device transfer, validation, batch size, epochs and metrics.
Detailed walkthrough
A code-review answer should separate correctness from design. Correctness issues include whether Dataset returns objects that the default collate function can batch, whether the model receives tensors rather than custom records, whether labels are passed to the loss, and whether the optimizer is created, zeroed, stepped and tied to model.parameters().
For multiclass classification with CrossEntropyLoss, the model should normally return raw logits, not softmax probabilities, because the loss applies log-softmax internally. Labels should be class indices: an integer (torch.long) tensor with one entry per sample and values from 0 to num_classes - 1. The training loop should run for multiple epochs, call model.train() at the start of each training pass, move each batch to the target device inside the loop, and have a separate validation loop that runs under model.eval() and torch.no_grad().
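The loop described above can be sketched as follows. This is a minimal illustration, not the code under review: the toy tensors, the layer sizes, the learning rate and the `history` list are all assumptions made up for the example.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy data: 64 samples, 20 features, 3 classes.
num_classes = 3
features = torch.randn(64, 20)
labels = torch.randint(0, num_classes, (64,))  # int64 class indices

train_loader = DataLoader(TensorDataset(features[:48], labels[:48]),
                          batch_size=16, shuffle=True)
val_loader = DataLoader(TensorDataset(features[48:], labels[48:]),
                        batch_size=16)

# The last layer outputs raw logits; there is no softmax in the model.
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(),
                      nn.Linear(32, num_classes)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

history = []  # validation accuracy per epoch
for epoch in range(3):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)  # device transfer inside the loop
        optimizer.zero_grad()
        loss = criterion(model(x), y)      # logits and class indices
        loss.backward()
        optimizer.step()

    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in val_loader:
            x, y = x.to(device), y.to(device)
            preds = model(x).argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.numel()
    history.append(correct / total)
```

On random labels the accuracy itself is meaningless; the point is the order of calls: `zero_grad`, forward, loss, `backward`, `step` in training, and a separate pass with `eval()` and `no_grad()` for validation.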
Design issues include a weak linear-only image architecture, missing batch_size, no metrics/logging, no validation split, no seed/reproducibility strategy and unclear input-shape assumptions.
Common mistakes
- Apply softmax before CrossEntropyLoss.
- Forget labels or pass images as targets.
- Move data to GPU inside Dataset workers.
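The first mistake in the list is easy to demonstrate numerically. The sketch below uses made-up logits and a made-up target; it compares the loss on raw logits with the loss computed after an extra softmax, and checks that CrossEntropyLoss on logits matches log-softmax followed by NLL loss.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Hypothetical single example with three classes; the true class is 0.
logits = torch.tensor([[4.0, 0.0, -2.0]])
target = torch.tensor([0])
criterion = nn.CrossEntropyLoss()

# Intended usage: raw logits go straight into the loss.
loss_on_logits = criterion(logits, target)

# Mistake: softmax first, so the loss applies log-softmax a second time.
loss_on_probs = criterion(F.softmax(logits, dim=1), target)

# CrossEntropyLoss on logits equals log_softmax followed by nll_loss.
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)
```

Because probabilities are squeezed into the range 0 to 1, the second softmax flattens them, so even a confident, correct prediction gets a noticeably larger loss and weaker gradients.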
How to say it in an interview
- Start with runtime-breaking bugs before architecture opinions.
- Mention the raw-logits contract for CrossEntropyLoss.
Production ML question
In PyTorch, what should Dataset do, what should collate_fn do, how does num_workers affect this, and where should .to(device) usually happen?
Short answer
Dataset maps indices to CPU examples; collate_fn batches the fetched examples; num_workers sets how many worker processes fetch and collate in parallel. Device transfer usually belongs in the training loop, after batching.
Detailed walkthrough
A Dataset should expose __len__ and __getitem__, and __getitem__ should return one example. It can read files and apply CPU transforms, but it should not usually know about training state or GPU device placement.
collate_fn receives a list of already-fetched examples and turns them into a batch. It is useful when examples are custom classes, variable-length sequences or nested structures that the default collate cannot stack. It should not fetch indices itself; that would mix responsibilities.
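As an illustration of the variable-length case, here is a minimal sketch of a custom collate function that pads sequences. `ToySequenceDataset`, `pad_collate` and the sample values are hypothetical names and data invented for this example; `pad_sequence` is the PyTorch utility from `torch.nn.utils.rnn`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ToySequenceDataset(Dataset):
    """Hypothetical dataset: each example is (variable-length tokens, label)."""

    def __init__(self):
        self.samples = [
            (torch.tensor([1, 2, 3]), 0),
            (torch.tensor([4, 5]), 1),
            (torch.tensor([6]), 2),
            (torch.tensor([7, 8, 9, 10]), 1),
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        # One CPU example per index; no batching logic here.
        return self.samples[index]


def pad_collate(batch):
    # `batch` is a list of already-fetched (tokens, label) pairs.
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


loader = DataLoader(ToySequenceDataset(), batch_size=2, collate_fn=pad_collate)
batches = list(loader)
```

The collate function only reshapes what it was handed. Returning the original lengths alongside the padded tensor lets the model mask out padding later.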
Setting num_workers above zero makes the DataLoader start that many worker processes for data loading. If __getitem__ moves tensors to GPU, multiple workers can compete for GPU memory, prefetch GPU batches and make device ownership messy. The common pattern is to keep Dataset and collate on CPU, then in the train loop do batch = batch.to(device), optionally with pin_memory=True on the DataLoader and non_blocking=True on the transfer.
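A minimal sketch of that pattern, assuming a hypothetical `ToyDataset` with made-up values. It keeps `num_workers=0` so the snippet runs as a plain script on any platform; in a real job a positive value is typical, with the loading code placed under an `if __name__ == "__main__":` guard.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset that only ever produces CPU tensors."""

    def __init__(self, n=8):
        self.x = torch.arange(n * 4, dtype=torch.float32).reshape(n, 4)
        self.y = torch.arange(n) % 2

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        # No .to(device) here: the dataset does not know about the GPU.
        return self.x[index], self.y[index]


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_cuda = device.type == "cuda"

# Pinned host memory only matters when copying to a CUDA device.
loader = DataLoader(ToyDataset(), batch_size=4, num_workers=0,
                    pin_memory=use_cuda)

seen_devices = []
for x, y in loader:
    # Device transfer happens once per batch, in the training loop.
    x = x.to(device, non_blocking=use_cuda)
    y = y.to(device, non_blocking=use_cuda)
    seen_devices.append(x.device.type)
```

With this split, worker processes never touch CUDA, and the only code that depends on the device is the loop that owns the model.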
Common mistakes
- Make collate_fn fetch from the dataset by index.
- Move samples to CUDA inside Dataset.__getitem__.
- Forget __len__ for map-style datasets.
How to say it in an interview
- Use separation of responsibilities as the organizing principle.
- Mention num_workers and prefetching when explaining device placement.
Question
Why does a custom nn.Module need super().__init__()? Separately, why is tags=[] as a default argument in Python dangerous?
Short answer
nn.Module.__init__ initializes internal registries for parameters, buffers and submodules. A mutable default list is shared across calls, so mutations leak between instances.
Detailed walkthrough
A custom PyTorch module should call super().__init__() before assigning submodules. nn.Module.__init__ creates the internal dictionaries and hooks used to register parameters, buffers and child modules. Without it, the class still inherits the methods, but the registries they rely on do not exist: assigning an nn.Linear layer as an attribute raises AttributeError, and parameters(), state_dict(), train(), eval() or to(device) have nothing correct to operate on.
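A small sketch of the failure and the fix. `BrokenNet` and `WorkingNet` are hypothetical classes written for this example, and the layer sizes are arbitrary.

```python
import torch
from torch import nn


class BrokenNet(nn.Module):
    def __init__(self):
        # super().__init__() is missing, so the module registries
        # do not exist when the layer is assigned below.
        self.fc = nn.Linear(4, 2)


class WorkingNet(nn.Module):
    def __init__(self):
        super().__init__()  # creates the parameter/buffer/submodule registries
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


try:
    BrokenNet()
    broken_error = None
except AttributeError as exc:
    # PyTorch refuses to register a submodule before Module.__init__ has run.
    broken_error = str(exc)

# With the registries in place, the layer's parameters are discoverable.
param_names = [name for name, _ in WorkingNet().named_parameters()]
```

The optimizer sees exactly what `parameters()` yields, so a module that fails to register its layers cannot be trained even if its forward pass could somehow run.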
The Python default argument issue is separate. Default values are evaluated once when the function is defined, not each time it is called. If tags=[] is used and one call mutates that list, later calls without tags see the same mutated object. The usual pattern is tags: list[str] | None = None, then inside the function create a new empty list when tags is None.
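A tiny example of the shared default, using two hypothetical classes. The annotation `Optional[list[str]]` is used here as an equivalent spelling of `list[str] | None` that also runs on slightly older Python versions.

```python
from typing import Optional


class SharedTagsPost:
    def __init__(self, tags=[]):  # the same list object is reused on every call
        self.tags = tags


class SafePost:
    def __init__(self, tags: Optional[list[str]] = None):
        # A fresh list is created per instance when no tags are given.
        # `is None` (rather than `tags or []`) keeps an explicitly passed
        # empty list as the caller's own object.
        self.tags = [] if tags is None else tags


a = SharedTagsPost()
b = SharedTagsPost()
a.tags.append("ml")  # mutates the single default list shared by a and b

c = SafePost()
d = SafePost()
c.tags.append("ml")  # only c is affected
```

After this runs, `b.tags` contains `"ml"` even though nothing was ever appended to `b`, while `d.tags` stays empty.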
The common theme is object lifecycle: initialization matters for framework objects, and defaults must not hide shared mutable state.
Common mistakes
- Think inheritance alone initializes nn.Module internals.
- Use [] or {} as default arguments.
- Use `tags or []` when an explicitly empty list has semantic meaning.
How to say it in an interview
- Mention parameter registration for super().__init__.
- Give a tiny example of two instances sharing the same default list.