Implement computation of a TF-IDF (Term Frequency - Inverse Document Frequency) matrix from scratch.
- TF(t, d) = (number of occurrences of term t in document d) / (total number of words in document d)
- IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t
- TF-IDF(t, d) = TF(t, d) × IDF(t)
Use the natural logarithm (ln).
Return a tuple: (list of terms sorted lexicographically, TF-IDF matrix rounded to 4 decimal places).
def tfidf(documents: list[list[str]]) -> tuple[list[str], list[list[float]]]:
tfidf([["the","cat","sat"], ["the","dog","sat"]])
→ (["cat","dog","sat","the"], [[0.231,0.0,0.0,0.0],[0.0,0.231,0.0,0.0]])
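As a sanity check on the example above: "sat" and "the" occur in both documents, so their IDF = ln(2/2) = 0; the only non-zero entries come from the terms unique to each document ("cat" and "dog"):

```python
import math

# "cat" appears once in the 3-word document 1 and in 1 of the 2 documents:
tf = 1 / 3                 # TF("cat", d1)
idf = math.log(2 / 1)      # IDF("cat") = ln(2) ≈ 0.6931
print(round(tf * idf, 4))  # → 0.231
```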
- 1 ≤ N ≤ 100
- 1 ≤ words per document ≤ 1000
- Words consist of lowercase Latin letters
Test cases:
- documents = [["the","cat","sat"],["the","dog","sat"]]
  → [["cat","dog","sat","the"], [[0.231,0,0,0],[0,0.231,0,0]]]
- documents = [["hello","world"]]
  → [["hello","world"], [[0,0]]]
- documents = [["a","b","c"],["a","b"],["a"]]
  → [["a","b","c"], [[0,0.1352,0.3662],[0,0.2027,0],[0,0,0]]]
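One way to put the definitions above together (a sketch, not necessarily the intended solution; all names here are my own choices):

```python
import math

def tfidf(documents: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    # Vocabulary across all documents, sorted lexicographically
    terms = sorted({word for doc in documents for word in doc})
    n = len(documents)
    # df(t): number of documents containing term t
    df = {t: sum(1 for doc in documents if t in doc) for t in terms}
    # IDF(t) = ln(N / df(t)); natural log, no smoothing
    idf = {t: math.log(n / df[t]) for t in terms}
    matrix = []
    for doc in documents:
        total = len(doc)
        counts: dict[str, int] = {}
        for word in doc:
            counts[word] = counts.get(word, 0) + 1
        # TF(t, d) * IDF(t), rounded to 4 decimal places
        row = [round(counts.get(t, 0) / total * idf[t], 4) for t in terms]
        matrix.append(row)
    return terms, matrix
```

Note that with this definition a term present in every document gets IDF = ln(N/N) = 0, which is why shared terms like "sat" and "the" vanish from the example matrix.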