Natural Language Processing

Text Preprocessing & Feature Creation

= convert raw text into numeric features suitable for ML models

Basic Text Preprocessing

Tokenization

= split sequence of characters into “tokens”
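A minimal sketch of tokenization, using a simple lowercase-and-regex rule (libraries such as NLTK or spaCy provide more robust tokenizers; the function name here is illustrative):

```python
import re

def tokenize(text):
    # Lowercase the text, then extract runs of letters, digits, or apostrophes.
    # Punctuation and whitespace act as token boundaries.
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("The cat sat on the mat!")
# → ['the', 'cat', 'sat', 'on', 'the', 'mat']
```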

Creating Features from Text Data

= represent text based on the words it contains

Bag-of-Words (BoW)

= represent text as a vector of word counts, ignoring word order
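A Bag-of-Words representation can be built with the standard library alone; this sketch (toy documents, illustrative names) fixes a vocabulary and counts occurrences per document:

```python
from collections import Counter

docs = ["the cat sat", "the dog sat on the mat"]
tokenized = [d.split() for d in docs]

# Vocabulary: every unique token across the corpus, in a fixed order
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens, vocab):
    # One count per vocabulary word; absent words get 0
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vectors = [bow_vector(doc, vocab) for doc in tokenized]
# vocab   → ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# vectors → [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```

Note that every document maps to a vector of the same length (the vocabulary size), which is what makes these features usable by ML models.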

TF–IDF (Term Frequency–Inverse Document Frequency)

= a weighted Bag-of-Words that highlights words informative for a particular document and down-weights words common across the whole corpus

  1. Term Frequency (TF): measures how frequently a term appears in a specific document
  2. Inverse Document Frequency (IDF): measures how rare, and therefore informative, a term is across the entire corpus
  3. The TF–IDF score is the product of the two.
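The three steps above can be computed directly. This sketch uses one common variant (raw-count TF normalized by document length, unsmoothed logarithmic IDF); libraries such as scikit-learn use slightly different smoothing:

```python
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "on", "the", "mat"]]
N = len(docs)

# Document frequency: in how many documents each term occurs
df = Counter(tok for doc in docs for tok in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)   # term frequency within this document
    idf = math.log(N / df[term])      # rarity of the term across the corpus
    return tf * idf

# "the" occurs in every document, so its IDF is log(2/2) = 0:
tf_idf("the", docs[0])  # → 0.0
```

The example shows the intended behavior: a word appearing in every document scores zero, while a word unique to one document keeps its full weight.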

Word embeddings

What are embeddings

= dense, low-dimensional vectors that represent words so that semantically similar words map to nearby points

What to use, and when

= sparse BoW/TF–IDF features suit small datasets and linear models; embeddings suit tasks that need semantic similarity or neural models

NLP Component Tasks

= core building blocks that many NLP systems rely on

NLP Problem Types

![[natural-language-processing-1.png|600]]

Example: Bag-of-Words Workflow

A classical ML approach:

  1. Tokenize training text
  2. Convert to BoW or TF–IDF vectors
  3. Fit a model (e.g., logistic regression)
  4. Vectorize test text using the same vocabulary
  5. Predict labels (Yes/No, categories, sentiment, etc.)
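The five steps above can be sketched with scikit-learn (assuming it is installed; the toy texts and labels are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data for illustration
train_texts = ["great movie", "awful film", "loved it", "terrible plot"]
train_labels = ["pos", "neg", "pos", "neg"]

# Steps 1-2: tokenize and convert to TF-IDF vectors
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 3: fit a model
clf = LogisticRegression()
clf.fit(X_train, train_labels)

# Step 4: vectorize test text using the SAME fitted vocabulary (transform,
# not fit_transform, so no new vocabulary is learned from test data)
X_test = vectorizer.transform(["great plot"])

# Step 5: predict labels
pred = clf.predict(X_test)
```

The key detail is step 4: the test text must be mapped through the vocabulary learned from training data, otherwise the feature columns would not line up.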