Starting with this post, I will spend several articles on designing probabilistic models using graphical models. Compared with conventional statistical models, one thing that characterizes machine learning is the complexity of the phenomena it handles. Analyzing complex phenomena requires correspondingly complex models, and the graphical model was developed as a method for describing such models concisely.

“Model a phenomenon with a graphical model, and estimate unknown values with approximate inference methods as needed.”

Once you have mastered this flow, you will be able to tackle a wide variety of data-science problems simply and formally.

Let’s start by reviewing two extremely important rules of probability: the sum rule and the product rule.

· Addition theorem (sum rule) *1

p(x) = Σ_y p(x, y)

· Multiplication theorem (product rule)

p(x, y) = p(x | y) p(y)

Here p(x, y) is called the joint distribution, and p(x, y) = p(y, x) holds. p(x | y) is called the conditional distribution, and it is the distribution of x given y (not the distribution of y).

The machine learning approach called Bayesian learning really just applies these two rules over and over. Everything else — the various probability distributions, and the approximate inference methods we use as needed — is just machinery for actually carrying out these calculations.
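As a quick numerical illustration, the two rules can be checked on a small, made-up discrete joint distribution (all the table values below are arbitrary choices for the example):

```python
# Sum rule and product rule checked numerically on a made-up
# 2x2 joint distribution p(x, y).
import numpy as np

p_xy = np.array([[0.1, 0.3],    # rows index x, columns index y
                 [0.2, 0.4]])
assert np.isclose(p_xy.sum(), 1.0)   # a valid joint distribution sums to 1

# Sum rule: p(x) = sum_y p(x, y)
p_x = p_xy.sum(axis=1)          # marginal distribution of x
p_y = p_xy.sum(axis=0)          # marginal distribution of y

# Product rule: p(x, y) = p(x | y) p(y)
p_x_given_y = p_xy / p_y        # each column is the distribution of x for one y
assert np.allclose(p_x_given_y * p_y, p_xy)
```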

Also, the famous Bayes’ theorem can be derived from these two rules, so it is well worth remembering as an important result.

· Bayes’ theorem

p(x | y) = p(y | x) p(x) / Σ_x p(x, y)
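To make the formula concrete, here is a small numeric sketch of Bayes’ theorem for binary x and y (the prior and likelihood values are arbitrary choices for illustration):

```python
# Bayes' theorem computed numerically for binary x and y.
import numpy as np

p_x = np.array([0.6, 0.4])              # prior p(x)
p_y_given_x = np.array([[0.9, 0.1],     # likelihood p(y | x); rows index x
                        [0.3, 0.7]])

# Product rule gives the joint, then the sum rule gives the denominator:
# p(y) = sum_x p(y | x) p(x)
joint = p_y_given_x * p_x[:, None]      # p(x, y)
p_y = joint.sum(axis=0)

# Bayes' theorem: p(x | y) = p(y | x) p(x) / p(y)
p_x_given_y = joint / p_y
assert np.allclose(p_x_given_y.sum(axis=0), 1.0)   # each column is a distribution
```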

One more term I would like you to know is independence. We say that x and y are independent exactly when the following equation holds.

p(x, y) = p(x) p(y)

Incidentally, dividing both sides by p(y) lets you write independence as follows.

p(x | y) = p(x)

This form, too, is well worth remembering.
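A minimal numeric sketch of the two equivalent statements of independence, using a joint distribution that is independent by construction:

```python
# If the joint factorizes as p(x, y) = p(x) p(y), then conditioning on y
# does not change the distribution of x: p(x | y) = p(x).
import numpy as np

p_x = np.array([0.3, 0.7])
p_y = np.array([0.5, 0.5])
p_xy = np.outer(p_x, p_y)               # independent by construction

p_x_given_y = p_xy / p_xy.sum(axis=0)   # p(x | y), one column per value of y
# Every column equals the marginal p(x)
assert np.allclose(p_x_given_y, p_x[:, None])
```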

Now for the main subject. This time we deal with the most popular kind of graphical model, one without loop structure, called a DAG (Directed Acyclic Graph) *2. First, let me confirm how a probability model written in the notation above corresponds to a DAG drawn with nodes and arrows. Rather than starting with a detailed definition, I think it is faster to illustrate with examples.

Model 1) head to tail

[Figure: head-to-tail graph, A → B → C]

Three white-circle nodes A, B, C are connected by rightward arrows only *3. Since there are three white circles, the probability model can be written as the product of three distributions as follows.

p(A, B, C) = p(A) p(B | A) p(C | B)  (1)

For each node you write a probability distribution p(·), placing the node at the source of the arrow on the right of the conditioning bar.

Now, what happens if node B is “observed” in this model? “Observation” means that a specific value is given to the random variable (strictly speaking, it is conditioned on). In a graphical model, observed nodes are drawn as filled (black) circles.

[Figure: head-to-tail graph with B observed (filled circle)]

It looks like this. At this point, what is the probability distribution of A and C? Since B is now a black circle, it no longer has a distribution of its own, so we only need to consider the distribution of the white-circle nodes:

p(A, C | B)

Let’s calculate this using equation (1) together with the sum and product rules.

p(A, C | B) = p(A, B, C) / p(B) = p(A) p(B | A) p(C | B) / p(B) = p(A | B) p(C | B)

So we find that nodes A and C are independent once B is observed. This is called conditional independence. We also say that “B blocks A and C”.

This result in Model 1 is very important: once B is observed, the values of A and C are no longer correlated.
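The claim can also be checked numerically. The sketch below builds a discrete head-to-tail model A → B → C with randomly generated conditional probability tables (random purely for illustration; any valid tables give the same conclusion) and verifies that p(A, C | B) factorizes:

```python
# Conditional independence in the head-to-tail chain A -> B -> C.
import numpy as np

rng = np.random.default_rng(0)
K = 3                                     # number of states per variable

def random_dist(*shape):
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)   # normalize the last axis

p_A = random_dist(K)                # p(A)
p_B_given_A = random_dist(K, K)     # p(B | A), rows indexed by A
p_C_given_B = random_dist(K, K)     # p(C | B), rows indexed by B

# Joint p(A, B, C) = p(A) p(B | A) p(C | B), axes ordered (A, B, C)
joint = (p_A[:, None, None]
         * p_B_given_A[:, :, None]
         * p_C_given_B[None, :, :])

# For every observed value of B, p(A, C | B) = p(A | B) p(C | B)
for b in range(K):
    p_AC = joint[:, b, :] / joint[:, b, :].sum()
    assert np.allclose(p_AC, np.outer(p_AC.sum(axis=1), p_AC.sum(axis=0)))
```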

Model 2) tail to tail

[Figure: tail-to-tail graph, A ← B → C]

This time the direction of the arrow between A and B is reversed, so node B is now the parent of both A and C. Written as an equation, it looks like this.

p(A, B, C) = p(B) p(A | B) p(C | B)  (2)

Let’s try observing B as before.

[Figure: tail-to-tail graph with B observed (filled circle)]

Let’s calculate whether this posterior distribution becomes conditionally independent as in the previous example.

p(A, C | B) = p(A, B, C) / p(B) = p(B) p(A | B) p(C | B) / p(B) = p(A | B) p(C | B)

Yes, so again A and C are independent when B is given. Here too, node B blocks A and C.
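The same numerical check works for the tail-to-tail model A ← B → C (again with random tables, purely for illustration):

```python
# Conditional independence in the tail-to-tail model A <- B -> C.
import numpy as np

rng = np.random.default_rng(1)
K = 3

def random_dist(*shape):
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p_B = random_dist(K)                # p(B)
p_A_given_B = random_dist(K, K)     # p(A | B), rows indexed by B
p_C_given_B = random_dist(K, K)     # p(C | B), rows indexed by B

# Joint p(A, B, C) = p(B) p(A | B) p(C | B), axes ordered (A, B, C)
joint = (p_B[None, :, None]
         * p_A_given_B.T[:, :, None]    # transpose so axes are (A, B)
         * p_C_given_B[None, :, :])

# Given B, the posterior over (A, C) again factorizes
for b in range(K):
    p_AC = joint[:, b, :] / joint[:, b, :].sum()
    assert np.allclose(p_AC, np.outer(p_AC.sum(axis=1), p_AC.sum(axis=0)))
```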

Model 3) head to head

[Figure: head-to-head graph, A → B ← C]

Finally, the third model, head to head, is the most troublesome one. It is no exaggeration to say that the previous two models were introduced just so that this one could be explained.

The corresponding expression can be written as follows.

p(A, B, C) = p(A) p(C) p(B | A, C)  (3)

Unlike the previous models, there is a term p(B | A, C) in which two variables appear in the condition. Incidentally, whether or not it is intuitive from reading the graph, at this point A and C are independent. Just to be sure, let’s verify this with the sum rule.

p(A, C) = Σ_B p(A, B, C) = p(A) p(C) Σ_B p(B | A, C) = p(A) p(C)

Here we used Σ_B p(B | A, C) = 1, which always holds because we are summing a probability distribution over all values of B. So A and C are independent. We can also say that the unobserved node B blocks A and C.
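This marginal independence can also be confirmed numerically (the tables below are random, for illustration only):

```python
# Marginal independence in the head-to-head model A -> B <- C:
# summing the joint over B recovers p(A) p(C).
import numpy as np

rng = np.random.default_rng(2)
K = 3

def random_dist(*shape):
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p_A = random_dist(K)
p_C = random_dist(K)
p_B_given_AC = random_dist(K, K, K)   # p(B | A, C); axes (A, C, B)

# Joint p(A, C, B) = p(A) p(C) p(B | A, C), axes ordered (A, C, B)
joint = p_A[:, None, None] * p_C[None, :, None] * p_B_given_AC

# Sum rule over B: since sum_B p(B | A, C) = 1, p(A, C) = p(A) p(C)
p_AC = joint.sum(axis=2)
assert np.allclose(p_AC, np.outer(p_A, p_C))
```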

Well, let’s condition on B next.

[Figure: head-to-head graph with B observed (filled circle)]

Let’s see if the corresponding posterior distribution becomes conditionally independent as before.

p(A, C | B) = p(A, B, C) / p(B) = p(A) p(C) p(B | A, C) / p(B)

Unlike the other two examples, this cannot in general be decomposed into p(A | B) p(C | B). No matter how you manipulate the equations, it cannot be done. A and C, which were originally independent, have become dependent through the observation of B.
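This effect can be made very concrete with a deliberately extreme choice of p(B | A, C) (my own example, not from any particular model): let A and C be independent fair coins and let B = A XOR C. Marginally A and C are independent, but once B is observed they become perfectly correlated:

```python
# "Explaining away" in the head-to-head model: B = A xor C.
import numpy as np

p_A = np.array([0.5, 0.5])
p_C = np.array([0.5, 0.5])
p_B_given_AC = np.zeros((2, 2, 2))    # p(B | A, C); axes (A, C, B)
for a in range(2):
    for c in range(2):
        p_B_given_AC[a, c, a ^ c] = 1.0   # B is deterministically A xor C

joint = p_A[:, None, None] * p_C[None, :, None] * p_B_given_AC  # axes (A, C, B)

# Before observing B: p(A, C) = p(A) p(C), so A and C are independent
assert np.allclose(joint.sum(axis=2), np.outer(p_A, p_C))

# After observing B = 0: all probability mass sits on A == C
p_AC_b0 = joint[:, :, 0] / joint[:, :, 0].sum()
p_A_b0 = p_AC_b0.sum(axis=1)
p_C_b0 = p_AC_b0.sum(axis=0)
independent = np.allclose(p_AC_b0, np.outer(p_A_b0, p_C_b0))
print(independent)   # False: the posterior no longer factorizes
```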

In this way, observing B in Model 3 made the distribution of A and C more complicated. In fact, this phenomenon is the reason approximate inference methods (variational approximation *4 or MCMC *5) are needed in Bayesian learning.

This time we introduced graphical models and looked at inference of posterior distributions in simple models. Next time I will introduce d-separation (directed separation), an algorithm for judging node independence in more complex graphical models. After that, I would like to express concrete examples (unsupervised learning, regression/classification, semi-supervised learning, covariate shift, transfer learning, etc.) as graphical models and apply d-separation to them.