A state machine for a non-regular language


Martin McBride, 2018-05-18
Tags none
Categories none

Following on from the section on state machines, we will now look at an example of a language that isn't regular, and see what sort of problems you might have designing a state machine.

As it turns out, most programming languages are not regular languages, as we will see at the end of this section.

A non regular language

We will look at a simple example of a language that isn't regular. The language contains only the characters X and Y, and consists of all the strings that consist of n Xs followed by n Ys, where n is any integer. Examples of valid strings are:

XY or XXYY or XXXYYY etc

Invalid strings include:

XXY or XXYYY

because the number of Xs and Ys are not equal. There are lots of other invalid strings, of course.

The state machine

Here is our state machine. For clarity we have excluded all the error paths and just show the correct paths through the machine:

Here is how it works for the string XY:

  • Starting in state 1
  • X moves us to state 2
  • Y moves us to state 9, which is the valid end state

Here is how it works for the string XXYY:

  • Starting in state 1
  • X moves us to state 2
  • X moves us to state 3
  • Y moves us to state 8
  • Y moves us to state 9, which is the valid end state

It should be clear how this works for other strings. Each X takes us one step further to the right, along the top row. We then need exactly the same number of Y characters to get us back to the valid end state, 9.

The problem

Here is the problem. The state machine shown for only works for 4 or less incoming X characters.

Of course, we could easily add extra states to allow us to handle 5 Xs. Or 10. Or 100.

But however many states you add, someone could come along with a string that was too long and the state machine would fail. To be capable of handling a string of any length you would need a state machine with an infinite number of states.

But we are dealing with finite state machines, which by definition cannot have an infinite number of states (the clue's in the name). A finite state machine cannot handle this particular language.

What does this mean?

Just because this languages isn't regular doesn't mean a computer can't parse it, of course. It is a very simple language - it would clearly be possible to write a program to check a string to see whether it was valid or not. You just need to count the number of Xs and then count the number of Ys, and check if they are the same.

But you can't use a finite state machine to do that, so our simple language isn't regular.

We can't apply some other methods of parsing that only apply to regular languages.

Programming languages

Are programming languages like Python or Java regular? Well it turns out most of them are not.

A simple way to see that is to look at a basic maths expression:

x = a * (b + c * (d + e * (f + g)))

This expression uses brackets. Brackets can be nested, as shown. One of the things that has to be true of any valid expression is that each opening bracket must be matched by a closing bracket.

Can a state machine do this? well we already know that it can't. Even if we ignore all the other parts of the expression, the very least we need to do is check a string like this:

((()))

for any number of brackets. That is exactly the same problem as we looked at before for strings like XXXYYY, so we know a finite state machine can't parse a mathematical expression, so any language that allows nested brackets can't be regular.

If fact you get a similar problem with other types of nesting. Nested if statements, nested for loops, nested function definitions, nested classes - any language that supports those things (and most langauges do) cannot be a regular language.

Copyright (c) Axlesoft Ltd 2021