How to Build a Math Expression Tokenizer in JavaScript

Parsing and evaluating mathematical expressions is a common task in many applications, from calculators to data analysis tools to math education software. Tokenization is the critical first step that converts a raw input string like "5 + sin(x^2)" into a more structured format that captures the components and syntax of the expression.
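
For instance, here is the kind of output we are aiming for: the tokenizer we build below turns that example into an array of type/value pairs (a preview; the exact token shape is defined later in the article):

tokenize("5 + sin(x^2)");
// [
//   { type: "NUMBER",            value: "5"   },
//   { type: "OPERATOR",          value: "+"   },
//   { type: "FUNCTION",          value: "sin" },
//   { type: "LEFT_PARENTHESIS",  value: "("   },
//   { type: "VARIABLE",          value: "x"   },
//   { type: "OPERATOR",          value: "^"   },
//   { type: "NUMBER",            value: "2"   },
//   { type: "RIGHT_PARENTHESIS", value: ")"   }
// ]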

In this article, we'll take an in-depth look at how to build a tokenizer for math expressions in JavaScript. While the concepts apply to any programming language, the syntax and code samples will be JS-specific. Let's dig in!

Anatomy of a Math Expression

Before we start coding, it's important to understand the building blocks that make up a valid mathematical expression. Most expressions will include some combination of the following:

Numbers – These can be integers like 42 or floating point values like 3.14. We may also want to allow common math constants like pi and e. Negative numbers should be handled as well.

Variables – Variables are stand-ins for unknown or changeable quantities, usually represented by a single letter like x or y. More advanced tokenizers could allow multi-character variable names.

Operators – These symbols define a mathematical operation to perform on the surrounding numbers/variables. The most common are the basic arithmetic operators:

  • Addition: +
  • Subtraction: -
  • Multiplication: *
  • Division: /
  • Exponentiation: ^

Functions – Math functions take one or more arguments and return a single value. Examples include trigonometric functions like sin() and cos(), as well as miscellaneous functions like sqrt() and log(). The arguments are enclosed in parentheses and separated by commas.

Parentheses – Used to group parts of an expression to enforce a desired order of operations. Parentheses can be nested.

To be considered valid, a math expression must follow a specific syntax that dictates how the above components can be arranged. Some key rules:

  • Numbers can be integer or floating point, positive or negative
  • Variables are case-sensitive and can be a single letter or multiple (if allowed)
  • Operators must have a number/variable/expression on either side
  • Function arguments must be enclosed in parentheses and separated by commas
  • Parentheses must be balanced – same number of opens and closes, properly nested

With these rules in mind, here are some examples of valid math expressions:

  • 2 * (3 + 4)
  • -5.3 * x^2 / sqrt(y)
  • sin(pi/4) + log(100,10)
  • (a + b) * (c - d)

And some invalid ones:

  • 2 + * 3
  • (4 + 5] * 6
  • 8x^^2
  • sin(,3)

As we build our tokenizer, we'll want to keep these rules in mind so that we can properly parse valid expressions while flagging syntax errors in invalid ones.

Tokenization Algorithm

Now that we understand what makes up a math expression, we can devise an algorithm for splitting an input expression string into an array of tokens. Each token will be an object storing the token's type (number, operator, etc.) and value.

Here's the basic algorithm:

  1. Remove any whitespace characters from the input string since we'll allow arbitrary spaces between tokens

  2. Convert the string to an array of single characters

  3. Iterate through the character array, and for each character:

  • If it's a digit, a decimal point, or a sign in a unary position (at the start of the expression, or after an operator, opening parenthesis, or comma), categorize it as a NUMBER token
  • If it's a letter, categorize it as a VARIABLE or FUNCTION token
  • If it's an arithmetic symbol, categorize it as an OPERATOR
  • If it's an opening/closing parenthesis, categorize accordingly
  • If it's a comma, categorize it as an ARGUMENT_SEPARATOR used for function arguments
  • Else, categorize it as an UNKNOWN token
  4. Handle multi-character tokens:
  • If a NUMBER is followed by another digit or decimal point, concatenate them into a single number token
  • If a letter is followed by more letters or digits, concatenate them into a single name (if allowing multi-character names like log10)
  • If a name is followed by an opening parenthesis, categorize it as a FUNCTION; otherwise it's a VARIABLE
  5. Insert an implicit multiplication operator between certain adjacent token pairs:
  • NUMBER followed by VARIABLE, FUNCTION, or LEFT_PARENTHESIS
  • RIGHT_PARENTHESIS followed by NUMBER, VARIABLE, FUNCTION, or LEFT_PARENTHESIS
  6. Return the final array of tokens

Let's see how we can implement this in JavaScript!

Coding the Tokenizer

First, let's define a Token class that will represent a single token:

class Token {
  constructor(type, value) {
    this.type = type;
    this.value = value;
  }
}

Each Token instance will have a type property indicating the token category, and a value property with the actual string contents of that token.
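
For example, a token for the number 3.14 is created and inspected like this:

const t = new Token("NUMBER", "3.14");
console.log(t.type);  // "NUMBER"
console.log(t.value); // "3.14"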

Next, we'll define some helper functions to identify different character types:

function isDigit(ch) {
  return /\d/.test(ch);
}

function isDecimal(ch) {
  return ch === ".";
}

function isSign(ch) {
  return ch === "+" || ch === "-";
}

function isLetter(ch) {
  return /[a-z]/i.test(ch);
}

function isOperator(ch) {
  return /[+\-*/^]/.test(ch);
}

function isLeftParen(ch) {
  return ch === "(";
}

function isRightParen(ch) {
  return ch === ")";
}

function isComma(ch) {
  return ch === ",";
}

These use short regular expressions or direct character comparisons to check whether a given character falls into a certain category.
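
A quick sanity check shows how they behave:

isDigit("7");    // true
isLetter("Q");   // true
isOperator("^"); // true
isDigit("x");    // false
isSign("*");     // false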

Now we can start implementing the core tokenize function:

function tokenize(str) {
  // Remove spaces and convert to array
  const chars = str.replace(/\s+/g, "").split("");

  const tokens = [];
  let i = 0;

  while (i < chars.length) {
    const ch = chars[i];

    // A sign only starts a number in a unary position (at the very
    // beginning, or after an operator, "(" or ","), and only when a
    // digit or decimal point follows
    const prev = tokens[tokens.length - 1];
    const unarySign = isSign(ch) &&
      (!prev || prev.type === "OPERATOR" ||
       prev.type === "LEFT_PARENTHESIS" ||
       prev.type === "ARGUMENT_SEPARATOR") &&
      (isDigit(chars[i+1]) || isDecimal(chars[i+1]));

    if (isDigit(ch) || isDecimal(ch) || unarySign) {
      let num = ch;

      // Handle multi-digit numbers and decimal points
      while (isDigit(chars[i+1]) || isDecimal(chars[i+1])) {
        num += chars[i+1];
        i++;
      }

      tokens.push(new Token("NUMBER", num));
    }

    else if (isLetter(ch)) {
      let text = ch;

      // Handle multi-character names; digits are allowed after the
      // first letter so that names like log10 stay in one token
      while (isLetter(chars[i+1]) || isDigit(chars[i+1])) {
        text += chars[i+1];
        i++;
      }

      // Function names are followed by an opening paren
      if (isLeftParen(chars[i+1])) {
        tokens.push(new Token("FUNCTION", text));
      } else {
        tokens.push(new Token("VARIABLE", text));
      }
    }

    else if (isOperator(ch)) {
      tokens.push(new Token("OPERATOR", ch));
    }

    else if (isLeftParen(ch)) {
      tokens.push(new Token("LEFT_PARENTHESIS", ch));
    }

    else if (isRightParen(ch)) {
      tokens.push(new Token("RIGHT_PARENTHESIS", ch));
    }

    else if (isComma(ch)) {
      tokens.push(new Token("ARGUMENT_SEPARATOR", ch));
    }

    else {
      tokens.push(new Token("UNKNOWN", ch));
    }

    i++;
  }

  // Check for implicit multiplication between adjacent tokens
  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    const next = tokens[i+1];

    if (token.type === "NUMBER" || token.type === "RIGHT_PARENTHESIS") {
      if (next && (next.type === "NUMBER" || next.type === "VARIABLE" ||
                   next.type === "FUNCTION" || next.type === "LEFT_PARENTHESIS")) {
        tokens.splice(i+1, 0, new Token("OPERATOR", "*"));
        i++; // skip over the operator we just inserted
      }
    }
  }

  return tokens;
}

Let's break this down step-by-step:

  1. First we remove any whitespace from the input string using a regex and the string replace() method. Then we split the string into an array of individual characters.

  2. We create an empty array to hold the tokens, and start iterating through the character array using an index variable i.

  3. For each character, we use if/else statements to determine the token type.

  4. If the character is a digit, a decimal point, or a sign in a unary position (nothing before it, or an operator, opening parenthesis, or comma), we know it's the start of a number token. Treating signs this way keeps expressions like 2 + 3 working: the + there follows a number, so it's tokenized as an operator rather than swallowed into the 3. We initialize a num string with the current character, then look ahead and keep concatenating subsequent characters to num as long as they are part of the number. Finally we add the completed number token to the result array.

  5. If the character is a letter, it could be a variable or function name. We initialize a text string and keep adding subsequent letters and digits to it (so a name like log10 stays in one token). If the next character after the text is an opening parenthesis, we know it's a function name; otherwise it's a variable name. The token is added accordingly.

  6. If the character is an operator, left/right parenthesis, or comma, we add a token with the corresponding type and the character as the value.

  7. If the character isn't any of the above, we consider it unknown and add an UNKNOWN token (this is helpful for debugging).

  8. Once the entire expression string has been processed, we do a final pass through the tokens to check for implicit multiplication between certain token combinations. If found, we splice a new * operator token in between them.

  9. Finally, we return the array of tokens.

To see the tokenizer in action, let's test it on some sample expressions:

const expressions = [
  "2 + 3 * 4",
  "-5.3x^2 / sqrt(y)",
  "sin(pi/4) + log10(100)",
  "(a + b)(c - d)"
];

expressions.forEach(expr => {
  const tokens = tokenize(expr);
  console.log(`Expression: ${expr}`);
  tokens.forEach(token => {
    console.log(`${token.type}: ${token.value}`);  
  });
  console.log("");
});

This gives the output:

Expression: 2 + 3 * 4
NUMBER: 2
OPERATOR: +
NUMBER: 3
OPERATOR: *
NUMBER: 4

Expression: -5.3x^2 / sqrt(y)
NUMBER: -5.3
OPERATOR: *
VARIABLE: x
OPERATOR: ^
NUMBER: 2
OPERATOR: /
FUNCTION: sqrt
LEFT_PARENTHESIS: (
VARIABLE: y
RIGHT_PARENTHESIS: )

Expression: sin(pi/4) + log10(100)  
FUNCTION: sin
LEFT_PARENTHESIS: (
VARIABLE: pi
OPERATOR: /
NUMBER: 4
RIGHT_PARENTHESIS: )
OPERATOR: +
FUNCTION: log10
LEFT_PARENTHESIS: (
NUMBER: 100
RIGHT_PARENTHESIS: )

Expression: (a + b)(c - d)
LEFT_PARENTHESIS: (
VARIABLE: a
OPERATOR: +
VARIABLE: b
RIGHT_PARENTHESIS: )
OPERATOR: *
LEFT_PARENTHESIS: (
VARIABLE: c
OPERATOR: -
VARIABLE: d
RIGHT_PARENTHESIS: )

As we can see, the tokenizer handles basic arithmetic, negative numbers, decimals, exponents, function calls, variables, implicit multiplication, and nested parentheses. This provides a solid foundation that we can build upon.

Further Enhancements

There are many ways we could extend and optimize our tokenizer. Some ideas:

  • Allow for multi-character variable names, not just single letters. The existing code already supports this, but we'd need to decide on acceptable naming rules.

  • Similarly, allow arbitrary function names, not just builtins like sin and log. The single-character lookahead for an opening parenthesis already classifies names of any length, but we might also want to validate names against a list of supported functions.

  • Add support for more advanced number types like scientific notation (e.g. 3.2e-5) or fractions (e.g. 3/4). The tokenizer would need to recognize the e and / symbols in the right context (see the first sketch after this list).

  • Implement a configurable option for implicit multiplication (second sketch below). Sometimes it's desirable, other times explicit operators are preferred for clarity.

  • Report more specific errors for invalid expressions, rather than just generating UNKNOWN tokens. We could throw custom error objects with details about the problem and its location (third sketch below).

  • Use a more efficient structure than an array for the token stream when dealing with very long expressions. Each splice() call is O(n), so many implicit multiplication insertions approach O(n^2); a linked list would make each insertion O(1).

  • Optimize the character type checks. Single-character regex tests are convenient, but direct comparisons or character code range checks are cheaper in hot loops.
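
To make a few of these concrete, here is a minimal sketch of the scientific-notation idea. It assumes the chars, i, and num variables from the NUMBER branch of tokenize() above, and it only commits to an exponent when the e is followed by an optional sign and at least one digit:

// Inside the NUMBER branch, after the digit/decimal loop
// and before pushing the token:
if (chars[i+1] === "e" || chars[i+1] === "E") {
  let j = i + 2;
  if (isSign(chars[j])) j++;       // optional exponent sign
  if (isDigit(chars[j])) {         // require at least one digit
    while (isDigit(chars[j])) j++; // consume the whole exponent
    num += chars.slice(i + 1, j).join("");
    i = j - 1;                     // jump past what we consumed
  }
}

Next, a sketch of a configurable implicit multiplication option. It assumes the splice pass has been factored out of tokenize() into a hypothetical insertImplicitMultiplication(tokens) helper:

function tokenizeWithOptions(str, options = {}) {
  const { implicitMultiplication = true } = options;

  // The character-scanning pass from above, without the splice pass
  const tokens = tokenize(str);

  if (implicitMultiplication) {
    insertImplicitMultiplication(tokens); // hypothetical helper
  }

  return tokens;
}

Finally, a sketch of richer error reporting. TokenizerError is a hypothetical name, and note that i indexes the whitespace-stripped string, so reported positions only line up with the raw input if whitespace is preserved during scanning:

class TokenizerError extends Error {
  constructor(message, position) {
    super(`${message} at position ${position}`);
    this.name = "TokenizerError";
    this.position = position;
  }
}

// Then, in place of pushing an UNKNOWN token:
// throw new TokenizerError(`Unexpected character "${ch}"`, i);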

Of course, the tokenizer is just the first phase of the math expression pipeline. To be truly useful, it needs to feed into a parser that converts the token stream into an abstract syntax tree. The tree can then be evaluated to produce a final result, or manipulated further before evaluation.
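
As a small illustration (node shapes vary between implementations), a parser might turn the tokens for 2 + 3 * 4 into a tree like this, with the higher-precedence * nested below the +:

const ast = {
  type: "OPERATOR",
  value: "+",
  left:  { type: "NUMBER", value: "2" },
  right: {
    type: "OPERATOR",
    value: "*",
    left:  { type: "NUMBER", value: "3" },
    right: { type: "NUMBER", value: "4" }
  }
};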

But the tokenizer is the crucial first step that makes the rest possible! Hopefully this deep dive has given you insight into how it works under the hood.

Conclusion

Tokenization is an essential technique for working with math expressions and other domain-specific languages. By breaking down a raw input string into meaningful chunks, we gain the structure and context needed for the remaining steps in expression parsing.

The JavaScript tokenizer we built handles all the core components of basic arithmetic and algebra. We processed numbers, variables, operators, functions, and parentheses. The algorithm does a linear scan through the input characters, paying special attention to multi-character tokens and implicit multiplication.

While there's room for enhancement, the core techniques of character type detection, lookahead, and token array manipulation will scale to handle more advanced use cases. I encourage you to experiment with extending the tokenizer to support your own math syntax.

Equipped with a deeper knowledge of tokenization principles, you're now ready to tackle more complex parsing problems. Let me know what you build!
