# **What is grammar analysis?**

Anyone who has studied a natural language has learned some grammar, for example that a sentence can be broken down into a subject, a predicate, and an object. Many natural language processing applications need to take the grammar of a sentence into account, which makes grammar analysis (syntactic parsing) important.

Grammar analysis involves two main problems. The first is how to represent and store the syntax of a sentence in a computer, together with the corpus data sets that record it. The second is the parsing algorithm itself.

For the first problem, we can use a tree structure, as shown in the figure below. S stands for sentence; NP, VP, and PP are noun, verb, and prepositional phrases; N, V, and P are noun, verb, and preposition, respectively.

For storage, such a tree can be written as a bracketed string, for example (S (NP (N Boeing)) (VP … (PP (P in) (NP (N Seattle))))). Mature, hand-annotated corpora are already available online, such as The Penn Treebank Project (Penn Treebank II Constituent Tags).
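To make the bracketed storage format concrete, here is a minimal sketch of reading such a string back into a tree. The nested-tuple representation and the function names `parse_bracketed` and `leaves` are assumptions chosen for illustration, not a format prescribed by the Penn Treebank tools.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, child, ...) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            # A bare token is a leaf, i.e. a word of the sentence.
            word = tokens[pos]
            pos += 1
            return word
        pos += 1                      # skip "("
        label = tokens[pos]           # the node label, e.g. NP
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read_node())
        pos += 1                      # skip ")"
        return (label, *children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return tree


def leaves(tree):
    """Return the words at the leaves, from left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


tree = parse_bracketed("(PP (P in) (NP (N Seattle)))")
print(tree)
print(leaves(tree))
```

Nested tuples keep the sketch dependency-free; a real project would more likely rely on an existing treebank reader.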


For the second problem, we need a suitable algorithm, which is the focus of this chapter.

To generate the syntax tree of a sentence, we first define a context-free grammar (CFG) as follows.

1) N is the set of non-terminal symbols (labels of non-leaf nodes), such as {S, NP, VP, N, …}

2) Σ is the set of terminal symbols (labels of leaf nodes), such as {Boeing, is, …}

3) R is the set of rules. Each rule has the form X -> Y1 Y2 … Yn, where X ∈ N and Yi ∈ (N ∪ Σ)

4) S ∈ N is the start symbol, the label at the root of the grammar tree
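The four components above map naturally onto a small data structure. The following is a minimal sketch; the class names `Rule` and `CFG`, the `check` method, and the toy rules are all hypothetical and are not taken from the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    lhs: str      # X, a non-terminal
    rhs: tuple    # (Y1, ..., Yn), each symbol in N ∪ Σ


@dataclass
class CFG:
    nonterminals: set   # N
    terminals: set      # Σ
    rules: list         # R
    start: str          # S

    def check(self):
        """Verify that S ∈ N, and that X ∈ N and every Yi ∈ N ∪ Σ for each rule."""
        symbols = self.nonterminals | self.terminals
        if self.start not in self.nonterminals:
            return False
        return all(
            rule.lhs in self.nonterminals
            and all(sym in symbols for sym in rule.rhs)
            for rule in self.rules
        )


# A toy grammar with assumed rules, small enough to read at a glance.
toy = CFG(
    nonterminals={"S", "NP", "VP", "DT", "NN"},
    terminals={"the", "man", "sleeps"},
    rules=[
        Rule("S", ("NP", "VP")),
        Rule("NP", ("DT", "NN")),
        Rule("VP", ("sleeps",)),
        Rule("DT", ("the",)),
        Rule("NN", ("man",)),
    ],
    start="S",
)
print(toy.check())
```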

For example, a grammar can be written as shown in the figure below. Given a sentence, we can derive its structure by expanding rules from left to right (a left-most derivation). For example, the sentence "the man sleeps" can be represented as (S (NP (DT the) (NN man)) (VP sleeps)).
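A left-to-right derivation can be traced step by step. The sketch below assumes a hand-picked sequence of rules for "the man sleeps"; the function name `leftmost_derivation` and the rule list are illustrative assumptions, since the grammar in the figure is not reproduced here.

```python
def leftmost_derivation(start, rule_sequence, nonterminals):
    """Rewrite the left-most non-terminal with each rule in turn.

    Returns every intermediate symbol list, starting from [start].
    """
    current = [start]
    steps = [list(current)]
    for lhs, rhs in rule_sequence:
        # Find the left-most symbol that can still be expanded.
        idx = next(i for i, sym in enumerate(current) if sym in nonterminals)
        if current[idx] != lhs:
            raise ValueError(f"expected a rule for {current[idx]}, got {lhs}")
        current[idx:idx + 1] = list(rhs)
        steps.append(list(current))
    return steps


nonterminals = {"S", "NP", "VP", "DT", "NN"}
rule_sequence = [
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("DT", ("the",)),
    ("NN", ("man",)),
    ("VP", ("sleeps",)),
]
steps = leftmost_derivation("S", rule_sequence, nonterminals)
for step in steps:
    print(" ".join(step))
```

Each printed line is one stage of the derivation, from the start symbol down to the words of the sentence.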

A context-free grammar makes it easy to derive the grammatical structure of a sentence, but it has a drawback: the derived structure may be ambiguous. For example, the two syntax trees in the following pictures represent the same sentence. Common sources of ambiguity are: 1) part-of-speech ambiguity, for example "can" is usually a modal verb but sometimes a noun meaning a jar; 2) prepositional phrase attachment, where a prepositional phrase may modify the VP or the first PP; 3) sequences of nouns, such as NN NN NN.

Because of this ambiguity, we need a way to pick the most likely tree among the possible grammar trees. A common method is the PCFG (Probabilistic Context-Free Grammar). As shown in the figure below, in addition to the usual grammar rules, we assign a probability to each rule. The probability of a grammar tree is then the product of the probabilities of the rules used in it.

In summary, when a sentence has multiple grammar trees, we compute the probability p(t) of each tree t and take the one with the largest probability, that is, arg max p(t).
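The product rule and the arg max can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the nested-tuple trees, the abstract symbols A, B, C, and the probabilities in `q` are invented, not read from the figure.

```python
import math


def rules_of(tree):
    """Yield (lhs, rhs) for every internal node of a nested-tuple tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules_of(child)


def tree_probability(tree, q):
    """p(t) is the product of the probabilities of the rules used in t."""
    return math.prod(q[rule] for rule in rules_of(tree))


def most_likely(trees, q):
    """Return arg max over the candidate trees of p(t)."""
    return max(trees, key=lambda t: tree_probability(t, q))


# Hypothetical rule probabilities: the word "x" can be tagged A or C.
q = {
    ("S", ("A", "B")): 0.6,
    ("S", ("C", "B")): 0.4,
    ("A", ("x",)): 0.5,
    ("C", ("x",)): 1.0,
    ("B", ("y",)): 1.0,
}
t1 = ("S", ("A", "x"), ("B", "y"))   # uses S -> A B
t2 = ("S", ("C", "x"), ("B", "y"))   # uses S -> C B

print(tree_probability(t1, q), tree_probability(t2, q))
print(most_likely([t1, t2], q))
```

Note that the tree built from the more probable top-level rule (S -> A B) still loses here, because the product takes every rule in the tree into account.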

We have now defined the grammar model, which depends on N, Σ, R, and S in the CFG and on the rule probabilities p(x) in the PCFG. As mentioned above, the Penn Treebank provides a very large, manually annotated corpus. Our task is to estimate the parameters of the PCFG from that corpus.

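1) Collect all N and Σ that appear in the corpus;

2) Use all the rules that appear in the corpus as R;

3) For each rule A -> B, estimate its probability p(x) = count(A -> B) / count(A) from the corpus, where count(A -> B) is the number of times the rule occurs and count(A) is the number of times A is expanded.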
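The three steps above can be sketched as a single pass over a treebank. This is a minimal sketch under stated assumptions: trees are nested tuples, the function name `estimate_pcfg` is hypothetical, and the two-tree "treebank" (including the word "dog") is invented for illustration.

```python
from collections import Counter


def estimate_pcfg(treebank):
    """Collect N, Σ, R and relative-frequency rule probabilities."""
    rule_counts = Counter()   # count(A -> B)
    lhs_counts = Counter()    # count(A)
    nonterminals, terminals = set(), set()

    def visit(tree):
        label, *children = tree
        nonterminals.add(label)
        rhs = []
        for child in children:
            if isinstance(child, str):
                terminals.add(child)      # a word at a leaf
                rhs.append(child)
            else:
                rhs.append(child[0])      # label of the child node
                visit(child)
        rule_counts[(label, tuple(rhs))] += 1
        lhs_counts[label] += 1

    for tree in treebank:
        visit(tree)

    # p(A -> B) = count(A -> B) / count(A)
    q = {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}
    return nonterminals, terminals, q


treebank = [
    ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", "sleeps")),
    ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", "sleeps")),
]
N, sigma, q = estimate_pcfg(treebank)
for rule, p in sorted(q.items()):
    print(rule, p)
```

The keys of `q` double as the rule set R, so step 2 needs no separate bookkeeping in this sketch.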

Suppose we already have a PCFG model with parameters N, Σ, R, S, and p(x), and that the grammar is in Chomsky normal form (every rule is either X -> Y Z or X -> word). Given an input sentence x1, x2, …, xn, how do we compute the grammar tree for it?

The first method is brute-force enumeration. Each word may take any of m = len(N) labels, the sentence has length n, and each case involves at least n rules. We can enumerate every possible grammar tree, compute its probability, and keep the best one, but the number of candidate trees grows very quickly with the length of the sentence.
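To get a feel for the cost of enumeration, the sketch below counts only the binary tree shapes over n words, ignoring node labels entirely. That count is the Catalan number C(n-1); with m possible labels per node the real number of candidates is larger still. The function name `count_binary_shapes` is an assumption for illustration.

```python
from math import comb


def count_binary_shapes(n):
    """Number of binary bracketings of n words: the Catalan number C(n-1)."""
    k = n - 1
    return comb(2 * k, k) // (k + 1)


for n in (1, 2, 3, 4, 5, 10):
    print(n, count_binary_shapes(n))
```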

The second method is, of course, dynamic programming. We define w[i, j, X] as the highest probability with which the words from position i to position j are labeled X. This means that when we recurse to the upper layers, we only keep the best combination found for each span. The base case is w[i, i, X] = p(X -> xi). The dynamic programming recurrence can then be written as w[i, j, X] = max(p(X -> Y Z) * w[i, s, Y] * w[s+1, j, Z]), where the maximum is taken over all rules X -> Y Z and all split points s with i ≤ s < j. For more practice with dynamic programming, there are many example problems on LeetCode.
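The recurrence for w[i, j, X] can be sketched as a CKY-style table fill. This is a minimal sketch under stated assumptions: positions are 0-based rather than 1-based, rules are passed in as two dictionaries (`unary` and `binary`), and the toy grammar with its probabilities is invented for illustration.

```python
def cky(words, unary, binary, start="S"):
    """Find the most probable tree for a sentence under a CNF PCFG.

    unary:  {(X, word): prob} for rules X -> word
    binary: {(X, Y, Z): prob} for rules X -> Y Z
    Returns (best probability, best tree), or (0.0, None) if no parse exists.
    """
    n = len(words)
    w = {}      # w[i, j, X]: best probability of X spanning words i..j inclusive
    back = {}   # back-pointers used to rebuild the best tree

    # Base case: w[i, i, X] = p(X -> x_i)
    for i, word in enumerate(words):
        for (X, term), p in unary.items():
            if term == word:
                w[i, i, X] = p
                back[i, i, X] = word

    # Fill longer spans from shorter ones.
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            for (X, Y, Z), p in binary.items():
                for s in range(i, j):   # split point, i <= s < j
                    score = p * w.get((i, s, Y), 0.0) * w.get((s + 1, j, Z), 0.0)
                    if score > w.get((i, j, X), 0.0):
                        w[i, j, X] = score
                        back[i, j, X] = (s, Y, Z)

    def build(i, j, X):
        entry = back[i, j, X]
        if isinstance(entry, str):      # a single word under a unary rule
            return (X, entry)
        s, Y, Z = entry
        return (X, build(i, s, Y), build(s + 1, j, Z))

    best = w.get((0, n - 1, start), 0.0)
    return best, (build(0, n - 1, start) if best > 0.0 else None)


# Hypothetical CNF grammar and probabilities.
unary = {("DT", "the"): 1.0, ("NN", "man"): 0.5, ("VP", "sleeps"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("NP", "DT", "NN"): 1.0}

prob, tree = cky(["the", "man", "sleeps"], unary, binary)
print(prob)
print(tree)
```

Storing only the best score per (i, j, X) is what keeps the table small: three nested loops over spans and split points, times the number of binary rules.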

Following the algorithm above completes the grammar analysis. PCFG does have some disadvantages, such as 1) a lack of phrase information and 2) weak handling of consecutive phrases (such as sequences of nouns or prepositional phrases). In general, though, it provides a very effective way to implement grammar analysis.