Spark Broadcast Join and Python UDF Performance
Answer it yourself
First phrase your answer as you would in an interview, then open the breakdown and grade yourself.
Short answer
A broadcast join sends the small table to every executor, so each partition of the large table is joined locally and the expensive shuffle is avoided. Python UDFs are slow because Spark must cross the JVM-Python boundary, serialize the data, and give up many Catalyst and code-generation optimizations.
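Full breakdown
A broadcast join is fast when one side is small enough to fit in memory on the driver and on every executor. Spark collects that small side on the driver, ships a copy to all executors, and each partition of the large table then joins against its local copy. This avoids repartitioning both sides by the join key and therefore avoids the network-heavy shuffle path. Spark chooses this strategy automatically when its size estimate for one side is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), and a broadcast hint can request it explicitly.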
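The mechanism can be sketched in plain Python. This is a toy model of a broadcast hash join, not Spark's implementation: the function name broadcast_hash_join and the orders/countries data are hypothetical, and the "partitions" are just Python lists. It shows the small side turned into a hash table once and every large-side partition probing that table without moving its own rows.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each large-side partition against a full copy of the small side."""
    # Build the hash table once from the small side: join key -> matching values.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition probes the table locally. No large-side row moves to
        # another partition, which is the "no shuffle" property.
        joined = [
            (key, left_value, right_value)
            for key, left_value in partition
            for right_value in lookup.get(key, [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical data: orders is the large side (already split into partitions),
# countries is the small side that gets "broadcast".
orders = [
    [("US", 100), ("DE", 250)],
    [("FR", 75), ("US", 40), ("XX", 5)],
]
countries = [("US", "United States"), ("DE", "Germany"), ("FR", "France")]

result = broadcast_hash_join(orders, countries)
```

The output keeps the partitioning of the large side, and the unmatched key "XX" is dropped because this sketch models an inner join. In PySpark, the explicit hint for this strategy is pyspark.sql.functions.broadcast applied to the small DataFrame inside the join.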
Python UDFs can be slow because Spark's execution engine runs on the JVM, while the function runs in a separate Python worker process. Rows or batches must be serialized between the JVM and Python, Catalyst treats the UDF as a black box and cannot optimize inside it, and code generation and vectorized execution may not apply. Plain row-wise UDFs are especially expensive because the Python function is called once per row.
Prefer built-in Spark SQL functions, joins, expressions, and window functions. If custom Python logic is unavoidable, consider pandas (vectorized) UDFs backed by Arrow, batch processing, pushing the logic upstream, or implementing performance-critical logic in Scala or Java.
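The difference between paying the boundary cost per value and per batch can be illustrated with a small stand-alone model. It is a deliberately simplified sketch: pickle stands in for the JVM-to-Python transfer, the names normalize, cross_boundary, run_row_wise, and run_batched are hypothetical, and real Spark moves data differently in detail. It counts only how many serialize/deserialize round trips each approach needs.

```python
import pickle


def normalize(text):
    """Hypothetical custom logic standing in for a UDF body."""
    return text.strip().lower()


def cross_boundary(payload):
    """Simulate one trip across a process boundary: serialize, then deserialize."""
    return pickle.loads(pickle.dumps(payload))


def run_row_wise(rows):
    """Two boundary crossings per value (input and result)."""
    crossings = 0
    output = []
    for row in rows:
        value = cross_boundary(row)                        # engine -> worker
        output.append(cross_boundary(normalize(value)))    # worker -> engine
        crossings += 2
    return output, crossings


def run_batched(rows, batch_size):
    """Two boundary crossings per batch, in the spirit of a vectorized UDF."""
    crossings = 0
    output = []
    for start in range(0, len(rows), batch_size):
        batch = cross_boundary(rows[start:start + batch_size])          # engine -> worker
        output.extend(cross_boundary([normalize(v) for v in batch]))    # worker -> engine
        crossings += 2
    return output, crossings


rows = ["  Alice ", "BOB", " Carol", "dave  "] * 250   # 1,000 values
row_output, row_crossings = run_row_wise(rows)
batch_output, batch_crossings = run_batched(rows, batch_size=100)
```

Both functions produce the same values, but in this model the batched version crosses the boundary 20 times where the per-value version crosses it 2,000 times. A built-in Spark SQL expression avoids the boundary altogether, which is why it is the first choice.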
Theory
Spark is fast when the optimizer sees relational operations; Python UDFs hide semantics and add process-boundary overhead.
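A toy planner makes the "hidden semantics" point concrete. The sketch below is an illustration under stated assumptions, not Catalyst: Partition, GreaterThan, and scan are hypothetical names, and min/max statistics are the only optimization modeled. A predicate expressed as data can be inspected and used to skip partitions, while an arbitrary Python function forces every partition to be read.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    min_value: int      # statistics a planner could read without scanning rows
    max_value: int
    rows: List[int]


@dataclass
class GreaterThan:
    """A declarative predicate: the planner can inspect the threshold."""
    threshold: int

    def __call__(self, value: int) -> bool:
        return value > self.threshold


def scan(partitions, predicate):
    """Return matching rows and the number of partitions actually read."""
    matched, partitions_read = [], 0
    for part in partitions:
        # Pruning is only possible when the predicate's meaning is visible.
        if isinstance(predicate, GreaterThan) and part.max_value <= predicate.threshold:
            continue
        partitions_read += 1
        matched.extend(v for v in part.rows if predicate(v))
    return matched, partitions_read


partitions = [
    Partition(0, 9, list(range(0, 10))),
    Partition(10, 19, list(range(10, 20))),
    Partition(20, 29, list(range(20, 30))),
]

# Same filter twice: once as inspectable data, once as an opaque function.
declarative_rows, declarative_reads = scan(partitions, GreaterThan(19))
opaque_rows, opaque_reads = scan(partitions, lambda v: v > 19)
```

Both calls return the same rows, but the declarative predicate reads one partition and the opaque function reads all three. A Python UDF is opaque to the optimizer in the same way the lambda is opaque to this toy planner.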
Common mistakes
- Broadcasting a table that does not fit in driver or executor memory.
- Using Python UDFs for logic that can be expressed in Spark SQL.
- Forgetting the serialization and JVM-Python boundary costs.
How to answer in an interview
- Say “local join without shuffle” for broadcast.
- For UDFs, mention JVM-Python serialization and lost Catalyst optimization.