What Is Lexical Analysis?

Written by Coursera Staff

Lexical analysis is the first step of text processing used in many artificial intelligence algorithms. Learn why this process is a key step in natural language processing, allowing machines to understand human text more effectively.


Key takeaways

Lexical analysis is a fundamental step in natural language processing (NLP) preprocessing that converts text into tokens.

  • Natural language processing engineers, who earn a median annual total salary of $162,000, commonly use lexical analysis as a part of their role [1]. 

  • Lexical analysis generally follows three steps: identifying tokens, assigning strings to tokens, and returning the lexeme or value of each token. 

  • You can use lexical analysis in various fields, including artificial intelligence engineering, machine learning engineering, and data science. 

Explore key terms related to lexical analysis, the steps of lexical analysis, advantages and limitations, and what types of careers utilize this process. Then, if you’re ready to learn more natural language processing techniques, enroll in the Natural Language Processing Specialization from DeepLearning.AI. In as little as three months, you can learn techniques for sentiment analysis, autocorrect, autocomplete, part-of-speech tagging, analogy completion, and word translation.

What is meant by lexical analysis? Key terms 

When learning about lexical analysis, having a firm grasp of several key terms can help you understand the underlying process and how lexical analysis fits into the larger picture of natural language processing and artificial intelligence. Some keywords and phrases to become familiar with include the following.

Natural language processing

NLP is a branch of computer science and artificial intelligence that centers on designing ways for computers to communicate with humans. NLP aims to enable computers to listen to and converse with humans in natural language. NLP programs enable computers to read, understand, interpret, and mimic human languages in a valuable and meaningful way.

Token 

A token is a sequence of characters grouped into a single entity. Each token represents a set of character sequences conveying a specific meaning. In programming languages, tokens can be keywords, operators, identifiers, or other elements that have a syntactical role.

Tokenizer 

A tokenizer is a program that divides an input into separate tokens. These tokens have distinct meanings and represent individual entities. The tokenizer needs to identify the boundaries of tokens, which can vary depending on the context and the rules of the specific language. Tokenization is typically the first step in a natural language processing pipeline.
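
To make the idea concrete, here is a minimal sketch of a tokenizer in Python. The regular expression is a simplifying assumption for illustration; production tokenizers handle contractions, hyphenation, and many other cases:

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens.

    A deliberately simple rule: a token is either a run of
    word characters or a single punctuation character.
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Lexical analysis converts text into tokens."))
# ['Lexical', 'analysis', 'converts', 'text', 'into', 'tokens', '.']
```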

Lexer (lexical analyzer)

A lexer, short for lexical analyzer, is a more complex program that tokenizes the input text and classifies these tokens into predefined categories. For example, in programming languages, a lexer would categorize tokens as keywords, operators, literals, etc. The lexer plays a crucial role in the parsing stage, as it feeds the parser with tokens to facilitate syntactical analysis.
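
As a rough illustration of how a lexer classifies what a tokenizer produces, the sketch below tags tokens for a toy language. The keyword and operator sets are assumptions chosen for the example, not any real language's definition:

```python
KEYWORDS = {"if", "while", "return"}      # assumed toy keyword set
OPERATORS = {"+", "-", "*", "/", "="}     # assumed toy operator set

def classify(token):
    """Assign a predefined category to one token string."""
    if token in KEYWORDS:
        return "KEYWORD"
    if token in OPERATORS:
        return "OPERATOR"
    if token.isdigit():
        return "INT_LITERAL"
    if token.isidentifier():
        return "IDENTIFIER"
    return "UNKNOWN"

for tok in ["if", "x", "=", "42"]:
    print(tok, "->", classify(tok))
# Prints, one per line:
# if -> KEYWORD, x -> IDENTIFIER, = -> OPERATOR, 42 -> INT_LITERAL
```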

Lexeme

A lexeme has two closely related senses. In linguistics, a lexeme is a dictionary word or abstract entity that carries the base meaning of a family of word forms: “write,” “wrote,” “writing,” and “written” would generally all belong to the lexeme “write,” unless you wanted to classify each as a separate lexeme. In compiler terminology, a lexeme is the exact sequence of characters that matches the pattern for a token: in the expression “5 + 6,” the lexemes are “5,” “+,” and “6.”
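
Here is a tiny sketch of the linguistic sense, using a hand-built mapping (an assumption for illustration; real systems use lemmatizers backed by large dictionaries):

```python
# Hand-built mapping from inflected forms to their lexeme,
# for illustration only.
LEXEME_OF = {
    "write": "write",
    "wrote": "write",
    "writing": "write",
    "written": "write",
}

for form in ["wrote", "writing", "written"]:
    print(form, "-> lexeme:", LEXEME_OF[form])
# wrote -> lexeme: write, and so on for each form
```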

What is lexical analysis in NLP?

Lexical analysis, or scanning, is a fundamental step in NLP. In programming languages, this process involves the lexical analyzer (lexer or scanner) reading the source code character by character to group these characters into tokens, the smallest units in the code that convey meaning. These tokens typically fall into categories such as constants (like integers, doubles, characters, strings), operators (arithmetic, logical, relational), punctuation (commas, semicolons, braces), and keywords (reserved words with predefined meanings like if, while, return).

Once the lexical analyzer, or lexer, scans the text, it produces a stream of tokens. This tokenized format is essential for the next processing or program compilation stages. Lexical analysis is also an appropriate time to perform other data-cleaning chores, like stripping white space or filtering out comments. You can think of lexical analysis as a preprocessing step before more complex NLP analysis.

Read more: What Is Data Cleaning?

Steps of lexical analysis

Lexical analysis is a set of key steps that transform an input text into tokens or lexemes for further NLP analysis. While the process will vary depending on the method used and the type of input, most lexical analysis processes follow these general steps.

1. Identify tokens.

The first step is to determine a fixed set of input symbols. These include letters, digits, operators, brackets, and other special symbols. Each of these symbols or combinations has a specific token type.

2. Assign strings to tokens.

The lexer is programmed to recognize and categorize inputs. For example, it might be set up to recognize “cat” as a string token and “2023” as an integer token. Keywords, identifiers, whitespace, and other elements are similarly categorized.

3. Return the lexeme or value of the token.

The lexeme is the actual sequence of characters in the input that matches a token’s pattern; in other words, it is the smallest meaningful substring that forms the token. The lexer returns the token type along with this lexeme or its value, which subsequent processing stages can then use.
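
Putting the three steps together, here is a minimal sketch of a lexer for simple arithmetic expressions such as “5 + 6.” The token names are assumptions chosen for the example:

```python
def lex_arithmetic(text):
    """Yield (token_type, lexeme) pairs for a simple arithmetic expression."""
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():          # discard whitespace between tokens
            i += 1
        elif ch.isdigit():        # steps 1-2: runs of digits form NUMBER tokens
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            yield ("NUMBER", text[start:i])   # step 3: return the lexeme
        elif ch in "+-*/":
            yield ("OPERATOR", ch)
            i += 1
        else:
            raise ValueError(f"Unexpected character: {ch!r}")

print(list(lex_arithmetic("5 + 6")))
# [('NUMBER', '5'), ('OPERATOR', '+'), ('NUMBER', '6')]
```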

Lexical scanner types

When choosing a lexical analysis method to process your text or input, you will likely use one of two primary approaches: “loop and switch” or “regular expressions and finite automata.” Each approach uses a distinct algorithm to analyze the input and break it down into tokens that machines can process more easily.

Loop and switch algorithm 

Loop constructs are like the tools used to read through a book line by line. They do a similar job in lexical analysis, going through the code one character at a time. Think of them as being on a mission to read every letter and symbol in the code so that nothing is overlooked. They keep doing this until they reach the end of the code. This method helps the lexer capture every piece of the code and break it down into small, meaningful tokens.

Switch statements act like quick decision-makers. Once the lexer reads a character or a group of characters, the switch statement decides what type of token those characters belong to. This is like sorting items while packing up your garage and deciding which box each one goes into. For example, if the loop reads “dog,” the switch statement quickly decides whether it is a string or a keyword token. This step is crucial for organizing the code into categories like keywords, numbers, or operators, making it easier to understand and process.
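
In Python (3.10+), the `match` statement can stand in for a switch. The sketch below pairs a character-by-character loop with a `match` to categorize each character; to keep it short, it emits one token per character rather than grouping runs (compare the arithmetic example above), and the categories are assumptions for illustration:

```python
def loop_and_switch(code):
    """Loop over the input one character at a time and use a
    match statement (the "switch") to pick each token's category."""
    tokens = []
    for ch in code:               # the loop: read every character
        match ch:                 # the switch: decide the category
            case c if c.isspace():
                continue          # whitespace only separates tokens
            case c if c.isdigit():
                tokens.append(("DIGIT", ch))
            case "+" | "-" | "*" | "/":
                tokens.append(("OPERATOR", ch))
            case c if c.isalpha():
                tokens.append(("LETTER", ch))
            case _:
                tokens.append(("UNKNOWN", ch))
    return tokens

print(loop_and_switch("a+1"))
# [('LETTER', 'a'), ('OPERATOR', '+'), ('DIGIT', '1')]
```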

Regular expressions and finite automata 

Regular expressions describe patterns in text. In lexical analysis, they define the rules for how different tokens should look. For instance, a regular expression might describe what an email address or phone number should look like. The lexer uses these expressions to identify tokens by matching the text in the code with these patterns. It’s like having a checklist to see if a piece of text meets certain criteria to be considered a specific type of token.

Finite automata are like smart robots that follow a set of instructions to perform a task. In lexical analysis, they take the rules described by regular expressions and use them to analyze the code. They check each part of the code against these rules, and whenever a part matches, they identify a token.
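
Python's `re` module, which compiles patterns into an internal automaton, makes this approach compact. In the sketch below, each named group defines the rule for one token type; the patterns and token names are illustrative assumptions:

```python
import re

# Each named group is the rule for one token type.
TOKEN_PATTERN = re.compile(r"""
    (?P<NUMBER>\d+)                  # one or more digits
  | (?P<IDENTIFIER>[A-Za-z_]\w*)     # a letter or underscore, then word chars
  | (?P<OPERATOR>[+\-*/=])           # a single arithmetic operator
  | (?P<SKIP>\s+)                    # whitespace, to be discarded
""", re.VERBOSE)

def regex_lex(code):
    """Yield (token_type, lexeme) pairs by matching regex rules."""
    for match in TOKEN_PATTERN.finditer(code):
        kind = match.lastgroup       # name of the group that matched
        if kind != "SKIP":
            yield (kind, match.group())

print(list(regex_lex("total = count + 10")))
# [('IDENTIFIER', 'total'), ('OPERATOR', '='), ('IDENTIFIER', 'count'),
#  ('OPERATOR', '+'), ('NUMBER', '10')]
```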

What is the difference between lexical and semantic analysis?

Lexical and semantic analysis are both parts of natural language processing and stages of a language compiler. The main difference is that lexical analysis converts text into tokens, while semantic analysis determines whether those tokens are arranged in a semantically correct way. Every language has semantics, or a set of rules that determine how words are arranged to give them meaning. This is true for both computer and natural languages. Semantic analysis takes the output from lexical analysis and checks that it fits together in a coherent, meaningful way.

Should you choose lexical analysis for your text processing? 

When deciding whether to choose lexical analysis for your text processing, you should consider the advantages and disadvantages of lexical analysis to make an informed decision. While lexical analysis is a common method for text preprocessing within NLP, it is not a perfect algorithm. Some key advantages and limitations are as follows.

Advantages of lexical analysis

  • Data cleaning: It effectively removes extraneous elements like white spaces or comments, making the source program cleaner.

  • Simplifies input for further analysis: By organizing the input into relevant tokens and discarding irrelevant information, lexical analysis simplifies subsequent syntactical analysis tasks.

  • Compresses the input: Beyond simplification, the lexer plays an important role in compressing the input, since a stream of tokens is more compact than the raw stream of characters it replaces.

Limitations of lexical analysis

  • Ambiguity: Lexical analysis can sometimes be ambiguous in its categorization of tokens. For example, whether a minus sign marks a negative number or a subtraction operator can depend on context the lexer does not have.

  • Lookahead limitations: The lexer often requires a lookahead feature to decide on the categorization of tokens, which can be a complex process.

  • Localized view of the source program: Lexical analyzers may not detect issues like garbled sequences, undeclared identifiers, or misspelled words, as they only report separate tokens without understanding their interrelation.

Start a career using lexical analysis

A wide range of careers and fields leverage the power of lexical analysis. As NLP and artificial intelligence continue to grow, the applications across industries will likely increase. One of the most common careers that uses NLP and lexical analysis is an NLP engineer. NLP engineers’ job duties vary, but typically, they design natural language processing systems, work with speech systems within artificial intelligence applications, implement new NLP algorithms, and refine models, among other tasks. In the United States, NLP engineers take home a median annual total salary of $162,000, according to Glassdoor’s January 2026 data [1]. This number includes base salary and additional pay, such as profit-sharing, commissions, bonuses, or other compensation.


Explore our free natural language processing resources 

If you’re considering a career in natural language processing or just want to learn more, subscribe to our LinkedIn newsletter, Career Chat, to stay current on industry-related topics. You can also explore our free resources to keep learning about this exciting field.

Whether you want to develop a new skill, get comfortable with an in-demand technology, or advance your abilities, keep growing with a Coursera Plus subscription. You’ll get access to over 10,000 flexible courses. 

Article sources

  1. Glassdoor. “How much does a NLP Engineer make?,” https://www.glassdoor.com/Salaries/nlp-engineer-salary-SRCH_KO0,12.htm. Accessed January 14, 2026.

Written by:

Editorial Team

Coursera’s editorial team comprises highly experienced professional editors, writers, and fact-checkers.

This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.