Lecture 8: Regular Expressions

Regular Expressions

Definition: A regular expression over an alphabet \(\Sigma\) is a formula with alphabet symbols \(x\in \Sigma\), the empty set \(\emptyset\), and the empty string \(\varepsilon\) as constants, and union, concatenation, and Kleene star as operations.

Order of precedence: Kleene star has the highest precedence, followed by concatenation, and then union. For example, \(0 \cup 10^\ast\) is short for \(0 \cup (1(0^\ast))\).

Example: The following expression describes the set of strings that contain \(11\) or \(101\) as substring. \[ (0 \cup 1)^\ast 1(0\cup \varepsilon)1(0\cup 1)^\ast \] The language of a regular expression \(R\) is defined recursively. In the base case, \(R\) is a constant. Then,

\(L(x)=\{x\}\) for all \(x\in \Sigma\),
\(L(\varepsilon)= \{\varepsilon\}\),
\(L(\emptyset)=\emptyset\).

Otherwise, a regular expression is obtained by applying an operation to one or two smaller expressions. In this case,

\(L(R_1\cup R_2)= L(R_1) \cup L(R_2)\) (union),
\(L(R_1R_2)= L(R_1) L(R_1)\) (concatenation),
\(L(R^\ast)=L(R)^\ast\) (Kleene star).

Regular Expressions vs. Finite Automata

The following theorem shows that the set of languages generated by regular expressions is the same as the set of languages accepted by finite automata.

Theorem: For every language \(A\), there exists a regular expression \(R\) with \(L(R)=A\) if and only if there exists a finite automaton \(M\) with \(L(M)=A\).

To show this theorem, we will first show how to construct an automaton from a given regular expression. Then, we will show how to construct a regular expression from a given finite automaton.

From regular expression to finite automaton: Let \(R\) be a regular expression. We can construct a finite automaton \(M\) with \(L(M)=L(R)\) recursively. The idea is to use the fact that the set of languages of finite automata is closed under union, concatenation, and Kleene star operations.

In the base case, \(R\) is a constant, either \(R=x\) for \(x\in \Sigma\), \(R=\varepsilon\) or \(R=\emptyset\). In all cases, \(L(R)\) is finite. Hence, there exists a finite automaton for \(L(R)\). (See Problem 2 in Homework 2.)

Otherwise, \(R\) is an operation applied to one or two smaller expressions. Either, \(R=R_1\cup R_2\), \(R=R_1 R_2\), or \(R=R_1^\ast\). Since \(R_1\) and \(R_2\) are smaller regular expressions, we can construct automata \(M_1\) and \(M_2\) with \(L(M_1)=L(R_1)\) and \(L(M_2)=L(R_2)\). Then, by closure properties of finite automata (see Lecture 6), there exist finite automata for the languages \(L(M_1)\cup L(M_2)\), \(L(M_1)L(M_2)\) and \(L(M_1)^\ast\). Therefore, there exists a finite automaton for \(L(R)\).

From finite automaton to regular expression: Let \(M\) be a finite automaton. We are to construct a regular expression with the same language as \(M\). The idea is to transform \(M\) step-by-step into a regular expression. We will refer to the intermediate objects as generalized finite automata (GFA). In contrast to the usual kind of finite automata, generalized finite automata have transitions labeled by regular expressions. We say that a generalized finite automaton \(M\) accepts a word \(w\) if there exists an accepting computation path of \(M\) with transitions \(R_1,\ldots,R_m\) such that \(w\in L(R_1)\cdots L(R_m)\).

We say that a GFA has dedicated start and accept states, if it has exactly one start state and exactly one accept state and if no transitions go into the start state and no transitions go out of the accept state.

Lemma: Suppose \(M\) is a GFA with dedicated start and accept states and with at least one additional state (not the start or accept state). Then, there exists a GFA \(M'\) with dedicated start and accept states that has fewer states than \(M\), but accepts the same language, \(L(M')=L(M)\).

Proof: Let \(q_0\) be an additional state of the automaton \(M\) (not the start of accept state). We show how to remove this state and update the transitions between the remaining states such that the language of the automaton doesn’t change. For simplicity, we will assume that \(M\) contains a transition labeled by \(R_0\) from \(q_0\) to itself. (We can take \(R_0=\emptyset\) if \(M\) doesn’t contain such a transition.)

Construction of \(M'\): Let \(q_0,q_1,\ldots,q_m\) be the states of \(M\). Let \(R_{i\to j}\) be the label of the transition from \(q_i\) to \(q_j\), or if there is no such transition, let \(R_{i\to j}=\emptyset\). For all \(i,j\ge 1\), we label the transition from \(q_i\) to \(q_j\) in \(M'\) by \[ R'_{i\to j}=R_{i\to j} \cup R_{i\to 0}R_{0\to 0}^\ast R_{0\to j}. \]

We claim that the GFA \(M'\) has the same language as \(M\). The reason is that every computation path in \(M\) that uses \(q_0\) can be shortcut to an equivalent computation path in \(M'\). Similarly, every computation path in \(M'\) can be extended to an equivalent computation path in \(M\). \(\square\)