JOVANA

Classic NLP Tasks

Before chatbots, NLP was a toolbox of focused jobs — tag the words, find the names, translate, judge the mood, shrink the text. Meet those five classic tasks, and the BLEU and ROUGE scores we use to grade two of them.

From words to jobs

So far in this rung you have turned text into pieces (tokenization) and into numbers (word vectors and embeddings), and you have seen a model learn to predict the next word (language modeling). Those are *capabilities*. This guide is about *jobs* — the concrete things people actually wanted computers to do with language, long before anyone typed a prompt into a chatbot. For decades, NLP was not one big model but a workshop of separate tools, each tuned for one task.

Why bother with the old jobs when one big model can now do most of them? Three reasons. First, the tasks are how the whole field defined progress — benchmarks, error types, and the very vocabulary you will hear ('entity', 'span', 'reference translation') come from them. Second, many of these jobs still ship in production exactly as classic components because they are cheap, fast, and auditable. Third, understanding a task sharply — what counts as a correct answer, how we score it — is the only way to tell whether a flashy demo actually works.

Labeling each word: POS tagging and NER

The simplest family of tasks attaches a label to each token. Part-of-speech tagging (POS tagging) decides whether each word is a noun, verb, adjective, and so on. It sounds trivial until you notice that 'book' is a noun in 'read a book' but a verb in 'book a flight' — the right tag depends on context, not the word alone. POS tags were once the backbone of grammar checkers, search engines, and the first step of almost every NLP pipeline.

Named-entity recognition (NER) is the same shape of problem, aimed at the words that matter most: which spans are people, places, organizations, dates, amounts? From 'Apple opened a store in Paris in 2015', NER pulls out *Apple* (organization), *Paris* (location), *2015* (date). This powers résumé parsing, financial-news scraping, and medical-record processing: anywhere you need structured facts out of free text. Both POS tagging and NER are examples of sequence labeling: one label per token, in order.

Tokens:  John   lives  in   New     York    .
POS:     PROPN  VERB   ADP  PROPN   PROPN   PUNCT
NER:     B-PER  O      O    B-LOC   I-LOC   O
One label per token. The B-/I-/O scheme marks the Beginning and Inside of an entity span, and O for tokens outside any entity — so 'New York' is tagged as a single two-word place.
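To make the B-/I-/O scheme concrete, here is a minimal sketch that collapses per-token tags back into entity spans. The function name `bio_to_spans` is hypothetical, and the handling of a stray I- tag (it is simply dropped) is one reasonable choice among several, not a standard.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse B-/I-/O tags into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the span that is currently open, if any.
        nonlocal current_tokens, current_type
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close whatever is open and ignore this token.
            flush()
    flush()
    return spans


tokens = ["John", "lives", "in", "New", "York", "."]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]
print(bio_to_spans(tokens, tags))  # [('John', 'PER'), ('New York', 'LOC')]
```

The I-LOC tag on 'York' is what glues it to 'New', so the two tokens come back as one location.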

How are they solved? The historically important answer is the hidden Markov model and its successors, which treat the sentence as a chain and pick the most likely *sequence* of tags rather than guessing each word in isolation. Modern systems do the same job with neural sequence models, but the framing — context-aware labels over a chain — is unchanged. Honest accuracy note: POS tagging on clean English news text passes 97%, which sounds finished, yet that last 3% hides exactly the hard, ambiguous cases, and accuracy drops sharply on tweets, speech, or other languages.
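The "most likely sequence" idea can be sketched with the Viterbi algorithm over a toy hidden Markov model. Every probability below is a hand-set, hypothetical number chosen only to illustrate the 'book' example; a real tagger would estimate them from a labeled corpus and work in log space to avoid underflow.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under a toy HMM."""
    # best[t][s]: probability of the best tag path that ends in state s at position t
    best = [{s: start_p[s] * emit_p[s].get(words[0], 0.0) for s in states}]
    back = [{}]  # back[t][s]: the previous state on that best path
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s].get(words[t], 0.0)
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical probabilities, for illustration only.
STATES = ["NOUN", "VERB", "DET"]
START = {"NOUN": 0.4, "VERB": 0.4, "DET": 0.2}
TRANS = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.6, "DET": 0.1},
    "VERB": {"NOUN": 0.2, "VERB": 0.1, "DET": 0.7},
    "DET": {"NOUN": 0.9, "VERB": 0.05, "DET": 0.05},
}
EMIT = {
    "NOUN": {"book": 0.5, "flight": 0.5},
    "VERB": {"book": 0.3, "read": 0.7},
    "DET": {"a": 1.0},
}

print(viterbi("book a flight".split(), STATES, START, TRANS, EMIT))  # ['VERB', 'DET', 'NOUN']
print(viterbi("read a book".split(), STATES, START, TRANS, EMIT))    # ['VERB', 'DET', 'NOUN']
```

Taken in isolation, 'book' scores higher as a noun here (0.4 × 0.5 = 0.20) than as a verb (0.4 × 0.3 = 0.12). The chain still tags it VERB in 'book a flight', because a determiner is far more likely to follow a verb than a noun in this toy model.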

Sentiment, translation, summarization

Not every task labels words. Sentiment analysis reads a whole review or tweet and decides whether the *feeling* is positive, negative, or neutral. Early versions were almost embarrassingly simple: count positive and negative words from a list and subtract. That crude word-counting baseline already worked surprisingly often. But it stumbles on the things humans grasp instantly: sarcasm ('great, another delay'), negation ('not bad at all'), and mixed feelings in one sentence. Sentiment is a useful reminder that 'meaning' is more than a tally of words.
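The count-and-subtract baseline fits in a few lines. This is a minimal sketch with a tiny, made-up word list (`POSITIVE` and `NEGATIVE` are hypothetical, not a published lexicon); it is meant to show both where the approach works and where it fails.

```python
# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "delay"}


def lexicon_sentiment(text: str) -> str:
    """Classify text by counting positive words minus negative words."""
    words = [w.strip(".,!?'\"").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("I love this phone, excellent battery"))  # positive
print(lexicon_sentiment("not bad at all"))                        # negative
print(lexicon_sentiment("great, another delay"))                  # neutral
```

The first answer is right. The second is wrong because the counter never sees that 'not' flips 'bad', and the third misses the sarcasm entirely: 'great' and 'delay' cancel out to a shrug.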

Machine translation is the heavyweight. Output is not a label or a class but a whole new sentence in another language. For decades it leaned first on hand-written rules and then on statistics over words and short phrases, which produced the famously stilted 'translatorese'. The leap came when researchers framed translation as turning one sequence into another, known as sequence-to-sequence learning with an encoder-decoder: one network reads the source sentence into a meaning representation, another writes it out in the target language.

Text summarization shrinks a long document into a short one that keeps the gist. There are two honest flavors. *Extractive* summarization copies the most important sentences verbatim: safe, but choppy. *Abstractive* summarization writes fresh sentences, like a human would: fluent, but prone to inventing details that were never in the source, the failure we now call hallucination. Knowing which flavor you have tells you exactly what can go wrong.
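A sketch of the extractive flavor, assuming the simplest possible scoring rule: a sentence is important if it uses words that are frequent in the document. The function name, the stopword list, and the scoring rule are all illustrative choices, not a reference method.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "will", "is", "are"}


def extractive_summary(text: str, k: int = 1) -> list:
    """Return the k highest-scoring sentences, copied verbatim, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def content_words(sentence: str) -> list:
        return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

    # Document-wide frequency of each content word.
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence: str) -> float:
        words = content_words(sentence)
        # Average frequency, so long sentences do not win on length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


text = (
    "The city council approved the new bridge. "
    "The bridge will cost ten million dollars. "
    "Many residents attended the meeting. "
    "Construction of the bridge starts in May."
)
for sentence in extractive_summary(text, k=2):
    print(sentence)
```

Whatever it selects, every output sentence is a literal copy of a source sentence, which is why this flavor cannot hallucinate and also why it reads as choppy.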

How do you grade a translation? BLEU and ROUGE

For POS tagging you can just count how many tags matched: accuracy. But for translation there is no single correct sentence, so simple matching fails. The clever workaround is the BLEU score, the long-standing standard metric for machine translation. BLEU compares the machine's output against one or more human *reference* translations and asks: what fraction of the candidate's word chunks also appear in a reference?

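Those 'word chunks' are n-grams: single words, pairs, triples, quadruples. Matching single words checks vocabulary; matching longer runs checks fluency and word order. BLEU counts each chunk at most as often as it occurs in a reference, combines the four precisions with a geometric mean, and multiplies the result by a 'brevity penalty' so a model cannot cheat by outputting one perfect word and stopping. The score runs from 0 to 1 and is usually reported on a 0 to 100 scale; higher is better. As a rough rule of thumb, a score above about 40 tends to indicate a good translation, but BLEU depends heavily on the language pair, the number of references, and the tokenization, so compare scores only on the same test set.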
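The recipe can be sketched as follows: clipped n-gram precision for n = 1 to 4, a geometric mean, and a brevity penalty. This is a simplified sentence-level version with whitespace tokenization and no smoothing, so it returns 0 whenever any n-gram order has no match; BLEU is normally computed over a whole test set with a standard tool. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, references: list, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU on a 0 to 100 scale."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        # Clip: an n-gram earns credit at most as often as it appears in any one reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty against the reference whose length is closest to the candidate's.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    penalty = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    # Geometric mean of the precisions, scaled by the penalty.
    return 100 * penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", [reference]))    # 100.0
print(bleu("the", [reference], max_n=1))              # perfect precision, heavy brevity penalty
print(bleu("the the the the", [reference], max_n=1))  # clipping: only two 'the' earn credit
```

With `max_n=1`, the one-word candidate 'the' has perfect precision but is crushed by the brevity penalty to a score below 1. The repeated 'the the the the' is held to a precision of one half by clipping, because the reference contains 'the' only twice.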

Summarization uses a sibling metric, the ROUGE score. The key difference is emphasis: BLEU leans toward *precision* (of the words I produced, how many were right?), which suits translation; ROUGE leans toward *recall* (of the words that should be there, how many did I cover?), which suits summarization, where missing a key point is the worse sin. Same n-gram-overlap idea, tuned for a different worry.
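The recall side can be sketched as ROUGE-N recall against a single reference. This is a simplified illustration with lowercased whitespace tokens; the function name is hypothetical, and full ROUGE implementations also report precision, F-scores, and variants such as ROUGE-L.

```python
from collections import Counter


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of the reference's n-grams that the candidate covers."""

    def ngram_counts(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngram_counts(candidate), ngram_counts(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is covered at most as often as the candidate contains it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the council approved the new bridge"
print(rouge_n_recall("council approved bridge", reference))  # 0.5
```

Every word the candidate produced is correct, so its unigram precision is perfect, yet it covers only three of the reference's six words. Recall is what notices the missing material.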

Why the toolbox mattered — and where it leads

Step back and a pattern appears. Each classic task needed its own dataset, its own model, its own metric, and its own team. A company doing translation, sentiment, and search ran three unrelated systems. This was the world right before the transformer era you are about to enter: capable but fragmented, with knowledge locked inside each narrow tool. There was also a quieter cost — every task needed thousands of hand-labeled examples, which is slow and expensive to create.

The big idea coming in the next guides is that one model, pre-trained on raw text, can be gently adapted to *all* of these jobs — tagging, NER, sentiment, even translation and question answering — often with far fewer labeled examples. The classic tasks did not disappear; they became things a single general model is asked to do. So the names you learned today are not history to forget. They are the test sheet the new models are still graded on, and the vocabulary you will use to say precisely what a system can and cannot do.